Commit graph

25 commits

Tessa Walsh
4143ebbd02
Store archive dir size in Redis (#291) 2023-04-19 18:10:02 -07:00
Tessa Walsh
c23cd66c66
Store done in redis as integer and only save full json in redis for failed pages (#284)
* Store done in redis as integer rather than full json

* Add numFailed to crawler stats

* Cast numDone to int before returning

* Increment done counter for failed URLs

* Fix movefailed to push failed URLs to the failed key, not the done key

* Don't add failed to total stats twice
2023-04-13 13:31:33 -07:00
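
To illustrate the done/failed split described above, here is a minimal sketch (not the crawler's actual code) using ioredis; the `:d` and `:f` key suffixes are illustrative:

```js
import Redis from "ioredis";

const redis = new Redis("redis://localhost:6379/0");

// successful pages only bump an integer counter...
async function markDone(crawlId) {
  await redis.incr(`${crawlId}:d`);
}

// ...while failed pages keep their full JSON, pushed to a separate
// "failed" key, and still increment the done counter
async function markFailed(crawlId, pageState) {
  await redis.rpush(`${crawlId}:f`, JSON.stringify(pageState));
  await redis.incr(`${crawlId}:d`);
}

// stats cast the counter back to int before returning it
async function numDone(crawlId) {
  return parseInt((await redis.get(`${crawlId}:d`)) || "0", 10);
}
```
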
Tessa Walsh
3864c76090
Add option to log errors to redis (#279) 2023-04-11 11:32:52 -04:00
Ilya Kreymer
4ba6e949d3
Reset locked pending URLs when crawler restarts. (#267)
* pending lock reset:
- quicker retry of pending URLs after a crawler crash by clearing pending page locks
- pending URLs are locked with <crawl>:p:<url> to indicate they are currently being rendered
- when a crawler restarts, check if <crawl>:p:<url> is set to its unique id and, if so, remove the pending lock, allowing the URL
to be retried since it's no longer actively being crawled.
2023-03-30 21:29:41 -07:00
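
A hedged sketch of the locking scheme this commit describes; the uid handling and TTL are assumptions, only the `<crawl>:p:<url>` key shape comes from the message above:

```js
// acquire a pending lock for a URL; NX ensures a single owner,
// EX lets the lock expire on its own if the crawler dies
async function lockPending(redis, crawlId, url, uid) {
  return await redis.set(`${crawlId}:p:${url}`, uid, "NX", "EX", 300);
}

// on restart, release only the locks this crawler instance held,
// so those URLs become eligible for retry right away
async function resetOwnPendingLocks(redis, crawlId, urls, uid) {
  for (const url of urls) {
    const key = `${crawlId}:p:${url}`;
    if ((await redis.get(key)) === uid) {
      await redis.del(key);
    }
  }
}
```
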
Tessa Walsh
b0e93cb06e
Add option for sleep interval after behaviors run + timing cleanup (#257)
* Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131)

* Store total page time in 'maxPageTime', include pageExtraDelay

* Rename timeout->pageLoadTimeout

* cleanup:
- store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions
- add secondsElapsed() utility function to help checking time elapsed
- cleanup comments

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-22 11:50:18 -07:00
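
A minimal sketch of the timing convention this commit settles on: durations kept in seconds, converted to ms only at API boundaries. The option names come from the message above; the exact composition of maxPageTime is an assumption.

```js
// sketch only — values and the maxPageTime composition are assumptions
function secondsElapsed(startTimeMs, nowMs = Date.now()) {
  return (nowMs - startTimeMs) / 1000;
}

const pageLoadTimeout = 90; // --pageLoadTimeout (renamed from --timeout), seconds
const behaviorTimeout = 90; // behavior time budget, seconds
const pageExtraDelay = 10;  // --pageExtraDelay, seconds

// total per-page budget, including the extra delay
const maxPageTime = pageLoadTimeout + behaviorTimeout + pageExtraDelay;

// convert to ms only at the API boundary, e.g.:
// await page.goto(url, { timeout: pageLoadTimeout * 1000 });
console.log(`maxPageTime: ${maxPageTime}s`);
```
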
Ilya Kreymer
02fb137b2c
Catch loading issues (#255)
* various loading improvements to avoid pages getting 'stuck' + load state tracking
- add PageState object, store loadState (0 to 4) as well as other per-page state properties on a defined object.
- set loadState to 0 (failed) by default
- set loadState to 1 (content-loaded) on 'domcontentloaded' event
- if page.goto() finishes, set loadState to 2 'full-page-load'.
- if page.goto() times out and domcontentloaded was never reached, fail immediately; if domcontentloaded was reached, extract links, but don't run behaviors
- page considered 'finished' if it got to at least loadState 2 'full-page-load', even if behaviors timed out
- pages: log 'loadState' as part of pages.jsonl
- improve frame detection: detect if a frame does not actually come from a frame tag (e.g. an OBJECT tag) and skip it as well
- screencaster: try screencasting every frame for now instead of every other frame, for smoother screencasting
- deps: behaviors: bump to browsertrix-behaviors 0.5.0-beta.0 release (includes autoscroll improvements)
- worker ids: just use 0, 1, ..., n-1 worker indexes, send numeric index as part of screencast messages
- worker: only keeps track of crash state to recreate page, decouple crash and page failed/succeeded state
- screencaster: allow reusing caster slots with fixed ids
- interrupt timedCrawlPage() wait if 'crash' event happens
- crawler: pageFinished() callback when page finishes
- worker: add workerIdle callback, call screencaster.stopById() and send 'close' message when worker is empty
2023-03-20 18:31:37 -07:00
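
The load-state machine above can be sketched roughly as follows (names mirror the commit message; states 3 and 4 are not spelled out there, so only 0-2 appear here):

```js
const LoadState = {
  FAILED: 0,           // default until something loads
  CONTENT_LOADED: 1,   // 'domcontentloaded' fired
  FULL_PAGE_LOADED: 2, // page.goto() resolved
};

class PageState {
  constructor(url) {
    this.url = url;
    this.loadState = LoadState.FAILED; // 0 by default, per the commit
  }
}

async function loadPage(page, state) {
  page.on("domcontentloaded", () => {
    state.loadState = LoadState.CONTENT_LOADED;
  });
  try {
    await page.goto(state.url, { waitUntil: "load" });
    state.loadState = LoadState.FULL_PAGE_LOADED;
  } catch (e) {
    // goto timed out: if domcontentloaded never fired, fail immediately;
    // otherwise extract links but skip behaviors (per the rules above)
  }
  // the page counts as finished once it reached 'full-page-load',
  // even if behaviors later time out
  return state.loadState >= LoadState.FULL_PAGE_LOADED;
}
```
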
Ilya Kreymer
07e503a8e6
Logger cleanup (#254)
* logging: convert logger to a singleton to simplify use

* add logger to create-login-profile.js
2023-03-17 14:24:44 -07:00
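
The singleton conversion can be sketched like this (a simplified stand-in, not the real logger.js): an ESM module is evaluated once per process, so exporting one instance gives every importer the same logger.

```js
// logger.js (sketch)
class Logger {
  info(msg, details = {}, context = "general") { this._log("info", msg, details, context); }
  error(msg, details = {}, context = "general") { this._log("error", msg, details, context); }
  _log(logLevel, message, details, context) {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(), logLevel, context, message, details,
    }));
  }
}

// the shared instance — every `import { logger }` sees this same object
export const logger = new Logger();
```
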
Ilya Kreymer
82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach a CDP session for each new page; the screencaster just uses the existing CDP session
- fix evaluateWithCLI to call the CDP command directly
- workers created directly during WorkerPool init - await not necessary

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers, each of which runs in its own loop to process pages until there are no pending or queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00
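
The sorted-set queue from the state refactor might look like the following sketch (the `:q` key name is illustrative; the score encoding is taken literally from "set depth + extraHops as score"):

```js
import Redis from "ioredis";

const redis = new Redis("redis://localhost:6379/0");

async function queueUrl(crawlId, url, depth, extraHops = 0) {
  const data = JSON.stringify({ url, depth, extraHops });
  // lower score = shallower page = dequeued first
  await redis.zadd(`${crawlId}:q`, depth + extraHops, data);
}

async function nextFromQueue(crawlId) {
  // ZPOPMIN returns [member, score], or [] when the queue is empty
  const res = await redis.zpopmin(`${crawlId}:q`);
  return res.length ? JSON.parse(res[0]) : null;
}
```
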
Ilya Kreymer
4b8a414410
Add total timeout + limit redis queue retries (#248)
* time limits: re-add total timeout to runTask() in worker, just in case
refactor worker runTask() to return true/false indicating whether the task timed out
if timed out, recreate the page
redis: add limit to retried URLs, currently set to 1
* retry: remove URL if not retrying, log removal of URL from queue
2023-03-13 14:48:04 -07:00
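
A sketch of the retry cap (the hash key is illustrative; "currently set to 1" comes from the message above):

```js
const MAX_RETRIES = 1;

async function requeueOrDrop(redis, crawlId, url, data) {
  const retries = await redis.hincrby(`${crawlId}:retries`, url, 1);
  if (retries <= MAX_RETRIES) {
    await redis.zadd(`${crawlId}:q`, data.depth, JSON.stringify(data));
    return true;
  }
  // per the commit: log removal of the URL from the queue
  console.log(`URL removed from queue after ${retries} failed attempts: ${url}`);
  return false;
}
```
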
Tessa Walsh
1bee46b321
Remove puppeteer-cluster + iframe filtering + health check refactor + logging improvements (0.9.0-beta.0) (#219)
* This commit removes puppeteer-cluster as a dependency in favor of
a simpler concurrency implementation, using p-queue to limit
concurrency to the number of available workers. As part of the
refactor, the custom window concurrency model in windowconcur.js
is removed and its logic implemented in the new Worker class's
initPage method.

* Remove concurrency models, always use new tab

* logging improvements: include worker-id in logs, use 'worker' context
- logging: log info string / version as first line
- logging: improve logging of error stack traces
- interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue
- interruption: don't repair if interrupting, wait for queue to be idle
- log text extraction
- init order: ensure wb-manager init called first, then logs created
- logging: adjust info->debug logging
- Log no jobs available as debug

* tests: bail on first failure

* iframe filtering:
- fix filtering for about:blank iframes, support non-async shouldProcessFrame()
- filter iframes both for behaviors and for link extraction
- add 5-second timeout to link extraction, to avoid link extraction holding up crawl!
- cache filtered frames

* healthcheck/worker reuse:
- refactor healthchecker into separate class
- increment healthchecker (if provided) if new page load fails
- remove experimental repair functionality for now
- add healthcheck

* deps: bump puppeteer-core to 17.1.2
- bump to 0.9.0-beta.0

--------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-08 18:31:19 -08:00
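
The iframe filtering and extraction timeout can be sketched as below; shouldProcessFrame() here is a simplified stand-in for the real checks (the about:blank filter is from the message, the rest is assumed):

```js
const frameCache = new Map(); // cache filter decisions per frame

function shouldProcessFrame(frame) {
  if (frameCache.has(frame)) return frameCache.get(frame);
  const ok = frame.url() !== "about:blank";
  frameCache.set(frame, ok);
  return ok;
}

// bound a promise so a hung frame can't hold up the crawl
function withTimeout(promise, seconds) {
  return Promise.race([
    promise,
    new Promise((resolve) => setTimeout(() => resolve(null), seconds * 1000)),
  ]);
}

// used for both behaviors and link extraction, with the 5-second cap
async function extractLinks(frame) {
  if (!shouldProcessFrame(frame)) return [];
  const links = await withTimeout(
    frame.evaluate(() =>
      [...document.querySelectorAll("a[href]")].map((a) => a.href)),
    5,
  );
  return links || [];
}
```
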
Ilya Kreymer
5da379cb5f
Logging and Behavior Tweaks (#229)
- Ensure page is included in all logging details
- Update logging messages to be a single string, with variables added in the details
- Always wait for all pending wait requests to finish (unless counter < 0)
- Don't set puppeteer-cluster timeout (prep for removing puppeteer-cluster)
- Add behaviorTimeout to running behaviors in crawler, in addition to in behaviors themselves.
- Add logging for behavior start, finish and timeout
- Move writeStats() logging to the beginning of each page as well as the end, to avoid confusion about pending pages.
- For events from frames, use frameUrl along with current page
- deps: bump browsertrix-behaviors to 0.4.2
- version: bump to 0.8.1
2023-02-23 18:50:22 -08:00
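
Running behaviors under an external behaviorTimeout might look like this sketch (the `__bx_behaviors.run()` entry point is assumed from browsertrix-behaviors; log contexts are illustrative):

```js
async function runBehaviorsWithTimeout(page, behaviorTimeoutSecs, logger) {
  logger.info("Behaviors started", {}, "behavior");
  const timedOut = Symbol("timeout");
  const result = await Promise.race([
    page.evaluate(() => self.__bx_behaviors.run()),
    new Promise((resolve) =>
      setTimeout(() => resolve(timedOut), behaviorTimeoutSecs * 1000)),
  ]);
  if (result === timedOut) {
    logger.warn("Behaviors timed out", { behaviorTimeoutSecs }, "behavior");
  } else {
    logger.info("Behaviors finished", {}, "behavior");
  }
}
```
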
Tessa Walsh
c0b0d5b87f
Serialize Redis pending pages as JSON objects (#212)
* Add redis:// prefix to test --redisStoreUrl

* Serialize pending pages as JSON objects
2023-01-23 16:44:03 -08:00
Ilya Kreymer
a767721f5e
crawl state: add getPendingList() to return pending state from either memory or redis crawl state (#205)
* crawl state: add getPendingList() to return pending state from either memory or redis crawl state, fix stats logging with redis state. Return pending list as json object
logging: check if data object is an error, log fields from error. Convert missing console.* to new logger
* evaluate failure: log as error, not fatal
2023-01-23 10:43:12 -08:00
Tessa Walsh
0192d05f4c Implement improved JSON Lines logging
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes
- Modify tests to use webrecorder.net to avoid timeouts
2023-01-19 14:17:27 -05:00
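
For a sense of the resulting output, a call like the one below emits one self-contained JSON line per entry (field names from the commit; the exact serialization is an assumption — see the singleton logger sketch further up):

```js
logger.info("Crawl statistics", { crawled: 10, total: 50 }, "crawlStatus");
// → {"timestamp":"2023-01-19T19:17:27.000Z","logLevel":"info",
//    "context":"crawlStatus","message":"Crawl statistics",
//    "details":{"crawled":10,"total":50}}
```
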
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
2022-11-15 18:30:27 -08:00
Ilya Kreymer
cf90304fa7
0.6.0 Wait State + Screencasting Fixes (#141)
* new options:
- to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished
- when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let the job controller know the crawl status
- set redis state to `failing` in case of exception, set to `failed` in case of 3 or more failed exits within 60 seconds (todo: make customizable)
- when receiving a SIGUSR1, assume final shutdown and finalize files (eg. save WACZ) before exiting.
- also write WACZ if exiting due to size limit exceeded, but not due to other interruptions
- change sleep() to be in seconds

* misc fixes:
- crawlstate.finished() -> isFinished() - returns true if >0 pages crawled and none left in queue
- don't fail crawl if isFinished() is true
- don't keep looping in pending wait for URLs to finish if an abort request was received

* screencast improvements (fix related to webrecorder/browsertrix-cloud#233)
- more optimized screencasting, don't close and restart after every page.
- don't assume targets change after every page, they don't in window mode!
- only send 'close' message when target is actually closed

* bump to 0.6.0
2022-06-17 11:58:44 -07:00
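
A sketch of the status handshake with the job controller (only the `<crawl id>:status` key and its values come from the message above):

```js
async function setStatus(redis, crawlId, status) {
  // one of "running", "failing", "failed", or "done"
  await redis.set(`${crawlId}:status`, status);
}

// e.g. on an unhandled exception:
//   await setStatus(redis, crawlId, "failing");
// and after 3 or more failed exits within 60 seconds:
//   await setStatus(redis, crawlId, "failed");
```
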
Ilya Kreymer
70ba9241ca limit interrupt fix: after self-interrupting, only look at local pending list (for redis state)
logging: don't log CF check errors, do log when errorCount is reset
2022-05-19 06:25:46 +00:00
Ilya Kreymer
93b6dad7b9
Health Check + Size Limits + Profile fixes (#138)
- Add optional health check via `--healthCheckPort`. If set, runs a server on the designated port that returns 200 if the healthcheck succeeds (number of consecutive failed page loads < 2 * number of workers), or 503 if it fails. Useful for k8s health checks

- Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded.

- Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded.

- Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted.

- S3 Storage refactor, simplify, don't add additional paths by default.

- Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value.

- wacz save: reenable wacz validation after save.

- Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs.

- bump to 0.6.0-beta.1
2022-05-18 22:51:55 -07:00
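
The health check could be sketched with node's built-in http module (the < 2 * workers threshold is from the message; everything else is illustrative):

```js
import http from "http";

function startHealthCheckServer(port, getConsecutiveFailures, numWorkers) {
  http.createServer((req, res) => {
    const healthy = getConsecutiveFailures() < 2 * numWorkers;
    res.writeHead(healthy ? 200 : 503);
    res.end();
  }).listen(port);
}
```
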
Ilya Kreymer
8727ca7f8c redis state error handling: catch and log potential errors when reading JSON state for the next URL
bump version to 0.5.0-beta.7
2022-03-18 10:34:17 -07:00
Ilya Kreymer
ab096cd5b0
Improve to URL direct check and fetch (#125)
- direct check fix: only do direct check if HEAD returns 200 status code
- if direct load results in non-200 status code, still load in browser
- error reporting: detect if net::ERR_ABORTED is actually caused by the loading of a PDF or other binary that is downloaded, and not an actual page load error
- state: tweak error logging message
2022-03-14 11:11:53 -07:00
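
The direct-check rule reduces to something like the sketch below (shown with the WHATWG fetch API for brevity; the crawler of this era used a different HTTP client):

```js
// only fetch a URL directly (outside the browser) if HEAD returns 200;
// any other status, or a failed HEAD, falls back to a browser load
async function shouldDirectFetch(url) {
  try {
    const resp = await fetch(url, { method: "HEAD" });
    return resp.status === 200;
  } catch (e) {
    return false;
  }
}
```
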
Ilya Kreymer
e160382f4d
Screencast + Redis state tweaks (#109)
* redis save state: load queued and done urls in chunks in case lists are large

* screencast: add 'init' message to include number of workers and dimensions
2022-03-02 13:26:11 -08:00
Ilya Kreymer
56be08e2e0 state improvements:
- local: use a Map for pending state
- redis: use a hash map for pending state
- redis: support requeuing if only pending URLs are left, add expiring keys per pending page for pageTimeout
2022-02-09 22:53:15 -08:00
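
A sketch of that pending-state layout (key names illustrative): a hash holds the pending pages, while a separate expiring key per page implements the pageTimeout.

```js
async function addPending(redis, crawlId, url, dataJson, pageTimeoutSecs) {
  await redis.hset(`${crawlId}:pending`, url, dataJson);
  await redis.set(`${crawlId}:pending:${url}`, "1", "EX", pageTimeoutSecs);
}

// if the per-page key expired but the hash entry remains, the page
// stalled and its URL is eligible for requeuing
async function isStalled(redis, crawlId, url) {
  return !(await redis.exists(`${crawlId}:pending:${url}`));
}
```
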
Ilya Kreymer
c2ce9fc001
various state + wacz fixes: (#101)
- wacz: update to py-wacz 0.4.1, avoid reading full file into memory to compute hashes

state: fix pending state, account for puppeteer-cluster popping/pushing jobs from queue:
* puppeteer-cluster: add custom 'start()' callback to indicate task actually starting
* new semantics: add pending URLs to pending state immediately, remove if re-added to queue, add 'started' when actually started

minio: use fPutObject to support parallel uploading, compute hash and size separately (for now)
dependencies: update to latest minio

error checking:
* print number of WARCs found, exit with error if 0
* ensure wacz creation succeeds, exit with error code if not
* validate wacz after creation, exit with error code if validation fails

bump to 0.5.0-beta.3
2022-02-08 15:31:55 -08:00
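
The fPutObject switch, roughly (endpoint and credentials are placeholders): letting the minio client stream from the file path enables parallel multipart uploads instead of buffering the whole WACZ in memory.

```js
import * as Minio from "minio";

const client = new Minio.Client({
  endPoint: "s3.example.com", // placeholder endpoint + credentials
  accessKey: "ACCESS_KEY",
  secretKey: "SECRET_KEY",
});

// upload straight from disk; hash and size are computed separately
await client.fPutObject(
  "my-bucket",
  "collections/crawl.wacz",
  "/crawls/collections/crawl/crawl.wacz",
);
```
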
Ilya Kreymer
201eab4ad1
Support Extra Hops beyond current scope with --extraHops option (#98)
* extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83

* update README with info on `extraHops`, add tests for extraHops

* dependency fix: use pywb 2.6.3, warcio 1.5.0

* bump to 0.5.0-beta.2
2022-01-15 09:03:09 -08:00
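
The extraHops semantics can be sketched as follows; inScope() is a hypothetical stand-in for the crawler's real scope rules:

```js
const inScope = (url) => url.startsWith("https://example.com/"); // stand-in

function shouldQueue(url, parentExtraHops, maxExtraHops) {
  if (inScope(url)) {
    return { queue: true, extraHops: 0 };
  }
  // out of scope, but may spend one of the N "extra hops"
  const extraHops = parentExtraHops + 1;
  return extraHops <= maxExtraHops ? { queue: true, extraHops } : { queue: false };
}

// with --extraHops 1, a page linked directly from an in-scope page is
// captured, but links found on that out-of-scope page are not followed
```
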
Ilya Kreymer
39ddecd35e
State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78)
* save state work:
- support interrupting and saving crawl
- support loading crawl state (frontier queue, pending, done) from YAML
- support scope check when loading to apply new scoping rules when restarting crawl
- failed urls added to done as failed, can be retried if crawl is stopped and restarted
- save state to crawls/crawl-<ts>-<id>.yaml when interrupted
- --saveState option controls when crawl state is saved; defaults to partial (save when interrupted), with always and never also available.
- support in-memory or redis based crawl state, using fork of puppeteer-cluster
- --redisStore used to enable redis-based state

* signals/crawl interruption:
- crawl state set to drain/not provide any more urls to crawl
- graceful stop of crawl in response to sigint/sigterm
- initial sigint/sigterm waits for graceful end of current pages, second terminates immediately
- initial sigabrt followed by sigterm terminates immediately
- puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT

* redis state support:
- use lua scripts for atomic move from queue -> pending, and pending -> done
- pending key expiry set to page timeout
- add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
- drainMax returns the numPending() + numSeen() to work with cluster stats

* arg improvements:
- add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
- support setting cmdline args via env var CRAWL_ARGS
- use 'choices' in args when possible

* build update:
- switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds
- use setuptools<58.0

* misc crawl/scoping rule fixes:
- scoping rules fix when external is used with scopeType
state:
- limit: ensure no urls, including initial seeds, are added past the limit
- signals: fix immediate shutdown on second signal
- tests: add scope test for default scope + excludes

* py-wacz update
- add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
- pywb: use latest pywb branch for improved twitter video capture

* update to latest browsertrix-behaviors

* fix setuptools dependency #88

* update README for 0.5.0 beta
2021-09-28 09:41:16 -07:00
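
In the spirit of the lua-scripted atomic moves above, a minimal ioredis sketch (the actual scripts and key layout differ):

```js
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379/0");

// atomically pop the next URL from the queue and mark it pending,
// with the pending key expiring after the page timeout
redis.defineCommand("movetopending", {
  numberOfKeys: 2,
  lua: `
    local json = redis.call('LPOP', KEYS[1])
    if not json then return nil end
    local url = cjson.decode(json).url
    redis.call('SET', KEYS[2] .. ':' .. url, json, 'EX', tonumber(ARGV[1]))
    return json
  `,
});

// const next = await redis.movetopending(`${id}:q`, `${id}:p`, pageTimeout);
```
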