Commit graph

505 commits

Author SHA1 Message Date
Tessa Walsh
74831373fd Update README options 2023-07-06 15:21:30 -04:00
wvengen
de2b4512b6
Allow configuration of deduplication policy (#331) (#332) 2023-07-06 14:54:35 -04:00
Tessa Walsh
22dc2e8426
deps: bump browsertrix-behaviors to ^0.5.1 (#341) 2023-07-06 10:15:18 -07:00
Ilya Kreymer
5ce410c275
profiles: use newly provided puppeteer page.setBypassServiceWorker() (#340)
* profiles: use newly provided puppeteer page.setBypassServiceWorker() instead of cdp command
bump puppeteer core to 20.7.4
2023-07-06 10:09:32 -04:00
Tessa Walsh
254da95a44
Fix disk utilization computation errors (#338)
* Check size of /crawls by default to fix disk utilization check

* Refactor calculating percentage used and add unit tests

* Add tests using df output with disk usage above and below threshold

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-07-05 21:58:28 -07:00
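
The commit above checks usage of /crawls and computes a percentage. A minimal sketch of such a calculation over `df` output; the helper names (parseDfOutput, calculatePercentageUsed) are illustrative, not the crawler's actual functions:

```js
// Hypothetical sketch: derive percentage used from `df -k <path>` output.
import { execSync } from "child_process";

function parseDfOutput(dfText) {
  // second line of df output holds the numbers for the queried path
  const cols = dfText.trim().split("\n")[1].split(/\s+/);
  return { totalKb: Number(cols[1]), usedKb: Number(cols[2]) };
}

function calculatePercentageUsed(usedKb, totalKb) {
  return Math.round((usedKb / totalKb) * 100);
}

const df = execSync("df -k /crawls", { encoding: "utf8" });
const { totalKb, usedKb } = parseDfOutput(df);
console.log(`${calculatePercentageUsed(usedKb, totalKb)}% used`);
```
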
Ilya Kreymer
3049b957bd version: bump to 0.10.2
deps: bump to py-wacz 0.4.9
2023-07-05 21:20:58 -07:00
Ilya Kreymer
c7dc504c75
deps: update puppeteer-core to 20.4.0, fixes #324 (#325) 2023-05-30 19:25:54 -07:00
Ilya Kreymer
7b906f921c
Origin Overrides: Ensure Host header also set (#326)
* origin overrides: ensure 'host' and 'origin' headers are also overridden, set to the *original* host and origin when sent to the destination origin
2023-05-30 19:25:37 -07:00
Ilya Kreymer
7c6c7d57a8 version: bump to 0.10.1 2023-05-30 19:12:28 -07:00
Tessa Walsh
d9b72bb9f5
Ignore spaces in double quotes when splitting process.env.CRAWL_ARGS (#323) 2023-05-30 19:06:44 -07:00
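
A sketch of the quote-aware splitting the fix above describes; the regex approach here is an assumption, not the commit's exact implementation:

```js
// Hypothetical sketch: split a CRAWL_ARGS-style string on spaces,
// but keep double-quoted values (which may contain spaces) together.
function splitCrawlArgs(argString) {
  // match either a double-quoted chunk or a run of non-space characters
  const matches = argString.match(/"[^"]*"|\S+/g) || [];
  // strip the surrounding quotes from quoted chunks
  return matches.map((m) => m.replace(/^"|"$/g, ""));
}

console.log(splitCrawlArgs('--url https://example.com/ --title "My Crawl"'));
// -> ['--url', 'https://example.com/', '--title', 'My Crawl']
```
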
Ilya Kreymer
db46cdf6d5 version: bump to 0.10.0 2023-05-23 12:45:29 -07:00
Ilya Kreymer
392c8bba0f
allow adding --include with pre-existing --scopeType values (besides custom) (fixes #318) (#319)
remove warning when --scopeType and --include used together
tests: update tests to reflect new semantics of --include + --scopeType
2023-05-23 09:43:11 -07:00
Ilya Kreymer
f51154facb
Chrome 112 + new headless mode + consistent viewport tweaks (#316)
* base: update to chrome 112
headless: switch to using new headless mode available in 112 which is more in sync with headful mode
viewport: use fixed viewport matching screen dimensions for headless and headful mode (if GEOMETRY is set)
profiles: fix catching new window message, reopening page in current window
versions: bump to pywb 2.7.4, update puppeteer-core to 20.2.1
bump to 0.10.0-beta.4

* profile: force reopen in current window only for headless mode (currently breaks otherwise), remove logging messages
2023-05-22 16:24:39 -07:00
Tessa Walsh
cc606deba9
Improve thumbnails with sharp (#304)
* Resize thumbnails to 640x360 with sharp
2023-05-19 11:30:24 -07:00
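
The sharp call behind such a resize is small; a sketch under the assumption that the screenshot is read from and written to local files:

```js
// Sketch: resize a full screenshot to a 640x360 thumbnail with sharp.
// File paths are placeholders; the crawler may pass buffers instead.
import sharp from "sharp";

await sharp("screenshot-full.png")
  .resize(640, 360)
  .toFile("thumbnail.png");
```
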
Ilya Kreymer
b5df5ad3c1 version: bump to 0.10.0-beta.3 2023-05-19 07:44:29 -07:00
Ilya Kreymer
77f0a935aa
stopping: if crawl is marked as stopping and no WARCs are found, also mark state as failed, to avoid a loop in cloud when the crawler is restarted (#314)
2023-05-19 07:38:16 -07:00
Marc-Andre Lemburg
f0d69ba399
Disable Chrome optimization logic (#312)
These optimizations can often lead to Chrome downloading large ML models in
the background, which then end up in the web crawling archives, even though
they don't have anything to do with the crawl.

Fixes #311.
2023-05-19 07:30:53 -07:00
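
One plausible way to switch these off is via Chrome's --disable-features launch flag; the feature names below are an assumption about what #312 disables, not a copy of its flag list:

```js
// Assumed example: launch flags that turn off Chrome's optimization-guide
// machinery; the precise feature names used in #312 may differ.
const launchArgs = [
  "--disable-features=OptimizationGuideModelDownloading," +
    "OptimizationHintsFetching,OptimizationTargetPrediction,OptimizationHints",
];
```
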
Ilya Kreymer
4b0dee56c2
state: adjust redis keys to be more consistent (#309)
- use <crawlid>:stopping for crawl stop request
- use <crawlid>:size for setting total crawl size
bump to 0.10.0-beta.2
2023-05-07 13:01:24 -07:00
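
A sketch of this key scheme with ioredis; the crawl id and connection URL are placeholders:

```js
// Sketch of the redis key naming described above.
import Redis from "ioredis";

const redis = new Redis("redis://localhost:6379/0");
const crawlId = "mycrawl"; // placeholder

// request a graceful stop
await redis.set(`${crawlId}:stopping`, "1");

// record total crawl size in bytes
await redis.set(`${crawlId}:size`, 1024 * 1024 * 512);
```
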
Tessa Walsh
f3c64b2b07
Consolidate wacz error loglines (#306)
* Print WACZ and reindexing errors/stacktraces on single line

* Log full stderr as single line if debug is enabled
2023-05-07 13:00:56 -07:00
Tessa Walsh
a0cf0ebde7
Log fatal messages to redis errors (#305) 2023-05-07 00:43:19 -07:00
Ilya Kreymer
ba6a3b6d6a version: bump to 0.10.0-beta.1 2023-05-06 00:12:09 -07:00
Ilya Kreymer
f4c4203381
crawl stopping / additional states: (#303)
* crawl stopping / additional states:
- adds check for 'isCrawlStopped()' which checks redis key to see if crawl has been stopped externally, and interrupts work
loop and prevents crawl from starting on load
- additional crawl states: 'generate-wacz', 'generate-cdx', 'generate-warc', 'uploading-wacz', and 'pending-wait' to indicate
when the crawl is no longer running but the crawler is still performing work
- addresses part of webrecorder/browsertrix-cloud#263, webrecorder/browsertrix-cloud#637
2023-05-03 16:25:59 -07:00
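
A sketch of the isCrawlStopped() check as described; the key name follows the `<crawlid>:stopping` convention from #309, while the loop placement is illustrative:

```js
// Sketch: the work loop polls a redis key and bails out if a stop
// was requested externally. `redis` is an ioredis client.
async function isCrawlStopped(redis, crawlId) {
  return (await redis.get(`${crawlId}:stopping`)) === "1";
}

// inside the worker loop (illustrative):
// while (queueNotEmpty) {
//   if (await isCrawlStopped(redis, crawlId)) break;
//   ...crawl next page...
// }
```
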
Tessa Walsh
d4bc9e80b9
Catch 4xx and 5xx page.goto() responses to mark invalid URLs as failed (#300)
* Catch 400 pywb errors on page load and mark page failed

* Add --failOnFailedSeed option to fail crawl with exit code 1 if seed doesn't load, resolves #207

* Handle 4xx or 5xx page.goto responses as page load errors
2023-04-26 16:49:32 -07:00
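
The status handling can be sketched in a few lines of puppeteer; the error handling shown is illustrative:

```js
// Sketch: treat a 4xx/5xx response from page.goto() as a failed page load.
// `page` is a puppeteer Page; `url` is the page being crawled.
const resp = await page.goto(url, { waitUntil: "load" });
if (resp && resp.status() >= 400) {
  throw new Error(`Page load failed: HTTP ${resp.status()} for ${url}`);
}
```
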
Ilya Kreymer
71b618fe94
Switch back to Puppeteer from Playwright (#301)
- reduced memory usage, avoids memory leak issues caused by using playwright (see #298) 
- browser: split Browser into Browser and BaseBrowser
- browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later
- browser: use defaultArgs from playwright
- browser: attempt to recover if initial target is gone
- logging: add debug logging from process.memoryUsage() after every page
- request interception: use priorities for cooperative request interception
- request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used
- request interception: fix originOverrides enabled check, fix to work with catch-all request interception
- default args: set --waitUntil back to 'load,networkidle2'
- Update README with changes for puppeteer
- tests: fix extra hops depth test to ensure more than one page crawled

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-26 15:41:35 -07:00
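
A sketch of cooperative request interception with priorities, as puppeteer supports it; the blocking rule itself is a made-up example:

```js
// Sketch: each handler passes a priority so multiple interceptors
// (block rules, ad blocking, origin overrides) can coexist.
await page.setRequestInterception(true);
page.on("request", (request) => {
  if (request.url().includes("/ads/")) {
    request.abort("blockedbyclient", 1); // cooperative abort, priority 1
  } else {
    request.continue(request.continueRequestOverrides(), 0);
  }
});
```
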
Ilya Kreymer
d4e222fab2
merge regression fixes from 0.9.1: full page screenshot + allow service workers if no profile used (#297)
* browser: just pass profileUrl and track if custom profile is used
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
fix for #288

* Fix full page screenshot (#296)
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-24 10:26:56 -07:00
Ilya Kreymer
3c7c7bfbc4
optimize shutdown: if the redis connection is gone after an interrupt signal was received, assume the crawler is being terminated and exit quickly, (#292)
don't attempt to reconnect to redis (assume crawler is also being shut down)
2023-04-24 09:50:49 -07:00
Ilya Kreymer
5c497f4fa4 version: bump version to 0.10.0-beta.0 2023-04-19 19:17:58 -07:00
Ilya Kreymer
3d8e21ea59
origin override: add --originOverride source=dest to allow routing where https://src-host:src-port/path/page.html -> http://dest-host:dest-port/path/page.html where source=https://src-host:src-port and dest=http://dest-host:dest-port (#281) 2023-04-19 19:17:15 -07:00
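
A sketch of that mapping using WHATWG URL objects; applyOriginOverride is a hypothetical helper name:

```js
// Sketch: if a request URL starts with the source origin, swap in the
// destination origin but keep path and query intact.
function applyOriginOverride(requestUrl, source, dest) {
  const url = new URL(requestUrl);
  if (url.origin === new URL(source).origin) {
    const destUrl = new URL(dest);
    url.protocol = destUrl.protocol;
    url.host = destUrl.host; // host includes the port
  }
  return url.href;
}

console.log(applyOriginOverride(
  "https://src-host:8443/path/page.html",
  "https://src-host:8443",
  "http://dest-host:8080"
));
// -> http://dest-host:8080/path/page.html
```
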
Tessa Walsh
4143ebbd02
Store archive dir size in Redis (#291) 2023-04-19 18:10:02 -07:00
Ilya Kreymer
52822f9e42
worker: lower wait time in the case where no additional pages remain and other workers will finish quickly; otherwise, this results in a minimum 10-second wait for >1 workers if only one page is encountered (#289) 2023-04-17 18:11:56 -07:00
Tessa Walsh
c23cd66c66
Store done in redis as integer and only save full json in redis for failed pages (#284)
* Store done in redis as integer rather than full json

* Add numFailed to crawler stats

* Cast numDone to int before returning

* Increment done counter for failed URLs

* Fix movefailed to push failed URL to the failed key, not the done key

* Don't add failed to total stats twice
2023-04-13 13:31:33 -07:00
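
A sketch of the resulting storage shape; the key names (`:d`, `:f`) are assumptions:

```js
// Sketch: successful pages bump an integer counter, while failed pages
// keep their full JSON for debugging. `redis` is an ioredis client.
async function markDone(redis, crawlId) {
  await redis.incr(`${crawlId}:d`);
}

async function markFailed(redis, crawlId, pageData) {
  await redis.incr(`${crawlId}:d`); // failed pages still count toward done
  await redis.rpush(`${crawlId}:f`, JSON.stringify(pageData));
}
```
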
Tessa Walsh
3864c76090
Add option to log errors to redis (#279) 2023-04-11 11:32:52 -04:00
Ilya Kreymer
4a27f8c4a0 version: bump to 0.9.1 2023-04-08 16:53:57 -07:00
Ilya Kreymer
ebdf0ac8f8 version: bump to 0.9.0! 2023-04-07 17:42:46 -07:00
Tessa Walsh
e2e80e98ef
Don't set viewport for full page screenshots (#221) 2023-04-07 17:42:06 -07:00
Tessa Walsh
b303af02ef
Add --title and --description CLI args to write metadata into datapackage.json (#276)
Multi-word values including spaces must be enclosed in double quotes.

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-04-04 10:46:03 -04:00
Ilya Kreymer
d4233582bb ci: bump yarn install timeout for ci, use latest gh action 2023-04-03 12:18:42 -07:00
Ilya Kreymer
24e9c43b29 version: bump to 0.9.0-beta.2 2023-04-03 11:52:24 -07:00
Ilya Kreymer
78faa965c5
Add --maxPageLimit override (#275)
* max page limit:
- rename --limit -> --pageLimit (keep alias for now)
- add new --maxPageLimit flag which overrides --pageLimit to ensure it is not greater than max
- readme: add new --pageLimit, --maxPageLimit to README
2023-04-03 11:10:47 -07:00
Ilya Kreymer
86e930d633
blockrules/logger: use global logger var (#274) 2023-04-03 10:58:13 -07:00
Tessa Walsh
d8c505a076
Update README for 0.9.0 (#272)
* Update README for Playwright/0.9.0

* Add ad blocking to README
2023-04-02 21:55:14 -07:00
Tessa Walsh
62fe4b4a99
Add options to filter logs by --logLevel and --context (#271)
* Add .DS_Store to gitignore

* Add --logLevel and --context filtering options

* Add log filtering test
2023-04-01 10:07:59 -07:00
Tessa Walsh
746d80adc7
Ensure crawler can't run out of space with --diskUtilization param (#264)
* Implement --diskUtilization

* Keep threshold fixed but project usage based on archive dir size
2023-03-31 09:35:18 -07:00
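
A sketch of the projection logic described in the second bullet; the growth model (the archive dir may roughly double, e.g. when a WACZ is written) is an assumption:

```js
// Sketch: keep the threshold fixed, but estimate final usage as current
// usage plus another archive's worth of growth.
function projectedUtilization(usedKb, totalKb, archiveDirKb) {
  // assume the crawl could roughly double the archive dir (e.g. WACZ copy)
  const projectedUsed = usedKb + archiveDirKb;
  return (projectedUsed / totalKb) * 100;
}

const threshold = 90; // --diskUtilization 90
if (projectedUtilization(800_000, 1_000_000, 150_000) >= threshold) {
  console.log("Not enough disk space, interrupting crawl");
}
```
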
Ilya Kreymer
4ba6e949d3
Reset locked pending URLs when crawler restarts. (#267)
* pending lock reset:
- quicker retry of pending URLs after crawler crash by clearing pending page locks
- pending urls are locked with <crawl>:p:<url> to indicate they are currently being rendered
- when a crawler restarts, check if <crawl>:p:<url> is set to its unique id and remove pending lock, to allow the URL
to be retried again, as it's no longer actively being crawled.
2023-03-30 21:29:41 -07:00
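
A sketch of this locking scheme with ioredis SET NX; function names are illustrative:

```js
// Sketch: lock a URL with the crawler's unique id via SET NX, and on
// restart clear locks that still hold our own id so the URL is retried.
async function lockPending(redis, crawlId, url, uid) {
  // returns "OK" only if no other crawler holds the lock
  return (await redis.set(`${crawlId}:p:${url}`, uid, "NX")) === "OK";
}

async function resetOwnLock(redis, crawlId, url, uid) {
  if ((await redis.get(`${crawlId}:p:${url}`)) === uid) {
    await redis.del(`${crawlId}:p:${url}`); // allow the URL to be retried
  }
}
```
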
Ilya Kreymer
fcd55c690a
worker index: set worker index automatically to work with k8s naming (#266)
- if CRAWL_ID env var set to 'crawl-id-name' while hostname is 'crawl-id-name-N' (automatically set via k8s statefulsets),
then set starting worker index to N * numWorkers
2023-03-29 22:27:17 -07:00
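
A sketch of deriving the starting index from the statefulset hostname; parsing details are assumptions:

```js
// Sketch: with CRAWL_ID "crawl-id-name" and statefulset hostname
// "crawl-id-name-2", pod 2 starts its workers at index 2 * numWorkers
// so worker ids never collide across pods.
import os from "os";

function startingWorkerIndex(numWorkers) {
  const crawlId = process.env.CRAWL_ID || "";
  const hostname = os.hostname();
  const suffix = hostname.startsWith(crawlId + "-")
    ? Number(hostname.slice(crawlId.length + 1))
    : NaN;
  return Number.isInteger(suffix) ? suffix * numWorkers : 0;
}
```
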
Tessa Walsh
b0e93cb06e
Add option for sleep interval after behaviors run + timing cleanup (#257)
* Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131)

* Store total page time in 'maxPageTime', include pageExtraDelay

* Rename timeout->pageLoadTimeout

* cleanup:
- store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions
- add secondsElapsed() utility function to help checking time elapsed
- cleanup comments

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-22 11:50:18 -07:00
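
The secondsElapsed() helper can be sketched in a couple of lines; the signature here is an assumption:

```js
// Sketch: intervals are kept in seconds, converting to ms only at the
// Date.now() boundary.
function secondsElapsed(startMs, nowMs = Date.now()) {
  return (nowMs - startMs) / 1000;
}

const start = Date.now();
// ... later ...
if (secondsElapsed(start) > 30) {
  console.log("page taking longer than 30s");
}
```
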
Ilya Kreymer
02fb137b2c
Catch loading issues (#255)
* various loading improvements to avoid pages getting 'stuck' + load state tracking
- add PageState object, store loadState (0 to 4) as well as other per-page-state properties on a defined object.
- set loadState to 0 (failed) by default
- set loadState to 1 (content-loaded) on 'domcontentloaded' event
- if page.goto() finishes, set loadState to 2 'full-page-load'.
- if page.goto() times out and domcontentloaded was never reached, fail immediately; if domcontentloaded was reached, extract links, but don't run behaviors
- page considered 'finished' if it got to at least loadState 2 'full-page-load', even if behaviors timed out
- pages: log 'loadState' as part of pages.jsonl
- improve frame detection: detect if a frame is actually not from a frame tag (e.g. an OBJECT tag), and skip it as well
- screencaster: try screencasting every frame for now instead of every other frame, for smoother screencasting
- deps: behaviors: bump to browsertrix-behaviors 0.5.0-beta.0 release (includes autoscroll improvements)
- worker ids: just use 0, 1, ... n-1 worker indexes, send numeric index as part of screencast messages
- worker: only keeps track of crash state to recreate page, decouple crash and page failed/succeeded state
- screencaster: allow reusing caster slots with fixed ids
- interrupt timedCrawlPage() wait if 'crash' event happens
- crawler: pageFinished() callback when page finishes
- worker: add workerIdle callback, call screencaster.stopById() and send 'close' message when worker is empty
2023-03-20 18:31:37 -07:00
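
A sketch of the load-state tracking; values 0-2 follow the commit message, while the names for 3 and 4 and the PageState fields are assumptions:

```js
// Sketch of per-page load-state tracking on a defined object.
const LoadState = {
  FAILED: 0,           // default until proven otherwise
  CONTENT_LOADED: 1,   // 'domcontentloaded' fired
  FULL_PAGE_LOADED: 2, // page.goto() finished
  EXTRACTION_DONE: 3,  // assumed name: links extracted
  BEHAVIORS_DONE: 4,   // assumed name: behaviors completed
};

class PageState {
  constructor(url, depth, extraHops) {
    this.url = url;
    this.depth = depth;
    this.extraHops = extraHops;
    this.loadState = LoadState.FAILED;
  }
}
```
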
Ilya Kreymer
07e503a8e6
Logger cleanup (#254)
* logging: convert logger to a singleton to simplify use

* add logger to create-login-profile.js
2023-03-17 14:24:44 -07:00
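
A singleton logger in ES modules can be as simple as exporting one shared instance; the JSON line format here mirrors the crawler's logging style but is illustrative:

```js
// Sketch: one shared Logger instance, so every importer logs through
// the same object.
class Logger {
  info(message, data = {}, context = "general") {
    console.log(JSON.stringify({
      timestamp: new Date().toISOString(),
      logLevel: "info", context, message, details: data,
    }));
  }
}

// module-level singleton: every `import { logger }` gets the same instance
export const logger = new Logger();
```
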
Ilya Kreymer
82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- start workers directly during WorkerPool init - await not necessary

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers, each of which runs in its own loop to process pages, until no pending pages, no queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00
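
A sketch of the sorted-set crawl queue described in the state refactor; the score encoding (extraHops pushed past MAX_DEPTH) and the non-atomic pop are illustrative simplifications:

```js
// Sketch: score encodes depth (primary) and extraHops, so shallower,
// in-scope URLs come off the queue first. `redis` is an ioredis client.
const MAX_DEPTH = 1_000_000;

async function queueUrl(redis, crawlId, url, depth, extraHops) {
  // extraHops pages sort after all regular pages at any depth (assumption)
  const score = depth + (extraHops ? MAX_DEPTH : 0);
  await redis.zadd(`${crawlId}:q`, score, JSON.stringify({ url, depth, extraHops }));
}

async function nextFromQueue(redis, crawlId) {
  // a real implementation would pop atomically (e.g. via a lua script)
  const [member] = await redis.zrange(`${crawlId}:q`, 0, 0);
  if (member) await redis.zrem(`${crawlId}:q`, member);
  return member ? JSON.parse(member) : null;
}
```
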
Tessa Walsh
f19f1fcb8d
Minor crawler fixes after puppeteer-cluster removal refactoring (#250)
* Remove screencaster from Worker/WorkerPool

* Don't increment errors in crawlPageInWorker

* Set pageTarget variable early
2023-03-13 15:07:59 -07:00