Stowage/browsertrix-crawler - Remotebranch.eu

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	810856fcf9	add --browserdriver for testing different drivers; add testing with puppeteer again! comment out BlockRules / AdBlockRules for now	2023-04-24 23:40:11 -07:00
Ilya Kreymer	2c3a866b22	logging: add memory logging; update playwright reset to recycle after 5 pages as before	2023-04-24 20:25:33 -07:00
Ilya Kreymer	96a3aa837b	possible work for #298 - split browser into NewContextBrowser and PersistentContextBrowser (old behavior) - NewContextBrowser() closes and recreates the context when pages are closed - worker uses 'storageState' to get state, set again on new page for that worker	2023-04-24 16:32:03 -07:00
Ilya Kreymer	52822f9e42	worker: lower wait time for the case where no additional pages remain and other workers will finish quickly; otherwise, this results in a minimum 10-second wait for >1 workers if only one page is encountered (#289)	2023-04-17 18:11:56 -07:00
Ilya Kreymer	fcd55c690a	worker index: set worker index automatically to work with k8s naming (#266) - if the CRAWL_ID env var is set to 'crawl-id-name' while the hostname is 'crawl-id-name-N' (automatically set via k8s statefulsets), then set the starting worker index to N * numWorkers	2023-03-29 22:27:17 -07:00
Tessa Walsh	b0e93cb06e	Add option for sleep interval after behaviors run + timing cleanup (#257) * Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131) * Store total page time in 'maxPageTime', include pageExtraDelay * Rename timeout->pageLoadTimeout * cleanup: - store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions - add secondsElapsed() utility function to help check time elapsed - cleanup comments --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-03-22 11:50:18 -07:00
Ilya Kreymer	02fb137b2c	Catch loading issues (#255) * various loading improvements to avoid pages getting 'stuck' + load state tracking - add PageState object, store loadState (0 to 4) as well as other per-page-state properties on defined object. - set loadState to 0 (failed) by default - set loadState to 1 (content-loaded) on 'domcontentloaded' event - if page.goto() finishes, set loadState to 2 'full-page-load'. - if page.goto() times out, if no domcontentloaded either, fail immediately. if domcontentloaded reached, extract links, but don't run behaviors - page considered 'finished' if it got to at least loadState 2 'full-page-load', even if behaviors timed out - pages: log 'loadState' as part of pages.jsonl - improve frame detection: detect if frame is actually not from a frame tag (e.g. an OBJECT tag), and skip as well - screencaster: try screencasting every frame for now instead of every other frame, for smoother screencasting - deps: behaviors: bump to browsertrix-behaviors 0.5.0-beta.0 release (includes autoscroll improvements) - worker ids: just use 0, 1, ... n-1 worker indexes, send numeric index as part of screencast messages - worker: only keeps track of crash state to recreate page, decouple crash and page failed/succeeded state - screencaster: allow reusing caster slots with fixed ids - interrupt timedCrawlPage() wait if 'crash' event happens - crawler: pageFinished() callback when page finishes - worker: add workerIdle callback, call screencaster.stopById() and send 'close' message when worker is empty	2023-03-20 18:31:37 -07:00
Ilya Kreymer	07e503a8e6	Logger cleanup (#254) * logging: convert logger to a singleton to simplify use * add logger to create-login-profile.js	2023-03-17 14:24:44 -07:00
Ilya Kreymer	82808d8133	Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253) * Migrate from Puppeteer to Playwright! - use playwright persistent browser context to support profiles - move on-new-page setup actions to worker - fix screencaster, init only one per page object, associate with worker-id - fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage - port additional chromium setup options - create / detach cdp per page for each new page, screencaster just uses existing cdp - fix evaluateWithCLI to call CDP command directly - workers directly during WorkerPool - await not necessary * State / Worker Refactor (#252) * refactoring state: - use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState - remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster - switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150) - override console.error to avoid logging ioredis errors (fixes #244) - add MAX_DEPTH as const for extraHops - fix immediate exit on second interrupt * worker/state refactor: - remove job object from puppeteer-cluster - rename shift() -> nextFromQueue() - condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc... - screencaster: don't screencast about:blank pages * more worker queue refactor: - remove p-queue - initialize PageWorkers which each run in their own loop to process pages, until no pending pages, no queued pages - add setupPage(), teardownPage() to crawler, called from worker - await runWorkers() promise which runs all workers until completion - remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code) - bump to 0.9.0-beta.1 * use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition) * more fixes for playwright: - fix profile creation - browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout - crawler: various fixes, including for html check - logging: additional logging for screencaster, new window, etc... - remove unused packages --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-03-17 12:50:32 -07:00
Tessa Walsh	f19f1fcb8d	Minor crawler fixes after puppeteer-cluster removal refactoring (#250) * Remove screencaster from Worker/WorkerPool * Don't increment errors in crawlPageInWorker * Set pageTarget variable early	2023-03-13 15:07:59 -07:00
Ilya Kreymer	4b8a414410	Add total timeout + limit redis queue retries (#248) * time limits: re-add total timeout to runTask() in worker, just in case; refactor working runTask() to return true/false based on whether the task timed out; if timed out, recreate the page; redis: add limit to retried URLs, currently set to 1 * retry: remove URL if not retrying, log removal of URL from queue	2023-03-13 14:48:04 -07:00
Tessa Walsh	aadd9a0483	Add timedRun to prevent async operations from hanging (#243) * Add timedRun and apply to network requests * Remove debugging print statement * minor tweaks: - move seconds to 2nd param, make param required - use FETCH_TIMEOUT_SECS for fetch events and PAGE_OP_TIMEOUT_SECS for in-page events respectively - use timedRun() for check CF action - remove extra async * additional logging; ensure queue is cleared when interrupting! --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-03-10 20:11:24 -08:00
Tessa Walsh	1bee46b321	Remove puppeteer-cluster + iframe filtering + health check refactor + logging improvements (0.9.0-beta.0) (#219) * This commit removes puppeteer-cluster as a dependency in favor of a simpler concurrency implementation, using p-queue to limit concurrency to the number of available workers. As part of the refactor, the custom window concurrency model in windowconcur.js is removed and its logic implemented in the new Worker class's initPage method. * Remove concurrency models, always use new tab * logging improvements: include worker-id in logs, use 'worker' context - logging: log info string / version as first line - logging: improve logging of error stack traces - interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue - interruption: don't repair if interrupting, wait for queue to be idle - log text extraction - init order: ensure wb-manager init called first, then logs created - logging: adjust info->debug logging - Log no jobs available as debug * tests: bail on first failure * iframe filtering: - fix filtering for about:blank iframes, support non-async shouldProcessFrame() - filter iframes both for behaviors and for link extraction - add 5-second timeout to link extraction, to avoid link extraction holding up crawl! - cache filtered frames * healthcheck/worker reuse: - refactor healthchecker into separate class - increment healthchecker (if provided) if new page load fails - remove experimental repair functionality for now - add healthcheck * deps: bump puppeteer-core to 17.1.2 - bump to 0.9.0-beta.0 -------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-03-08 18:31:19 -08:00
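
Note on the worker index rule in commit fcd55c690a: the message says that when CRAWL_ID is 'crawl-id-name' and the hostname is 'crawl-id-name-N', the starting worker index becomes N * numWorkers. A minimal sketch of that arithmetic follows. The function name startingWorkerIndex, its parameters, and the fallback to 0 when the hostname does not match are assumptions made for illustration; they are not taken from the repository's code.

```javascript
// Hypothetical helper illustrating the rule from the commit message.
// Not the crawler's actual implementation.
function startingWorkerIndex(crawlId, hostname, numWorkers) {
  // Assumption: with no crawl id set, start numbering at 0.
  if (!crawlId) {
    return 0;
  }
  const prefix = crawlId + "-";
  // Assumption: a hostname that is not '<crawlId>-N' also falls back to 0.
  if (!hostname.startsWith(prefix)) {
    return 0;
  }
  const suffix = hostname.slice(prefix.length);
  // Accept only a purely numeric ordinal, e.g. "2" in "crawl-id-name-2".
  if (!/^\d+$/.test(suffix)) {
    return 0;
  }
  // Each instance owns a contiguous block of numWorkers indexes.
  return Number(suffix) * numWorkers;
}

// Instance 2 with 4 workers per instance starts at index 8.
console.log(startingWorkerIndex("crawl-id-name", "crawl-id-name-2", 4));
```

Under this scheme, instance 0 would use indexes 0 to numWorkers - 1, instance 1 the next block, and so on, so worker indexes stay unique across instances.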