Commit graph

120 commits

Ilya Kreymer
8595bcebc1 add new logger.interrupt(), which interrupts and exits the crawl without marking it failed, unlike logger.fatal()
replace some logger.fatal() calls with logger.interrupt() to allow retries instead of immediate failure, especially
when external inputs (profile, behaviors) cannot be downloaded
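A minimal sketch of the distinction, assuming a hypothetical call site (downloadProfile and the option names are illustrative, not the crawler's actual code):

```ts
try {
  await downloadProfile(params.profile); // hypothetical helper for --profile
} catch (e) {
  // logger.fatal() would mark the crawl failed immediately;
  // logger.interrupt() exits the crawl so a restart can retry it
  logger.interrupt("Unable to download profile", { url: params.profile });
}
```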
2025-11-25 07:58:30 -08:00
Ilya Kreymer
565ba54454
better failure detection, allow update support for captcha detection via behaviors (#917)
- allow failing on content check from the main behavior
- update to behaviors 0.9.6 to support the 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if the profile cannot be extracted
2025-11-19 15:49:49 -08:00
Ilya Kreymer
87edef3362
netIdle cleanup + better default for pages where networkIdle times out (#916)
- set default networkIdle to 2
- add netIdleMaxRequests as an option, defaulting to 1 (in case of long-running
requests)
- further fix for #913
- avoid accidental logging

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-11-18 16:34:02 -08:00
aponb
b50ef1230f
feat: add extraChromeArgs support for passing custom Chrome flags (#877)
This change introduces a new CLI option --extraChromeArgs to Browsertrix
Crawler, allowing users to pass arbitrary Chrome flags without modifying
the codebase.

This approach is future-proof: any Chrome flag can be provided at
runtime, avoiding the need for hard-coded allowlists.
Maintains backward compatibility: if no extraChromeArgs are passed,
behavior remains unchanged.
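A minimal sketch of the pass-through under assumed names (defaultChromeArgs and params.extraChromeArgs are illustrative):

```ts
// user-supplied flags are appended after the crawler's built-in launch args
const launchArgs: string[] = [
  ...defaultChromeArgs(),            // built-in flags (name assumed)
  ...(params.extraChromeArgs ?? []), // e.g. "--disable-gpu", passed through as-is
];
```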

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-11-11 12:03:30 -08:00
Ilya Kreymer
3935526240
add --saveProfile option to save profile after successful crawl (#903)
- if --saveProfile is specified, attempt to save the profile to the same target
as --profile
- if --saveProfile <target> is given, save to that target
- save profile on finalExit if the browser has launched
- supports local file paths and storage-relative paths with '@' (same as
--profile)
- also clear cache in the first worker to match regular profile creation

fixes #898
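A sketch of the target resolution described above (param names assumed):

```ts
// bare --saveProfile reuses the --profile target; --saveProfile <target> overrides it
const saveTarget =
  typeof params.saveProfile === "string" && params.saveProfile.length
    ? params.saveProfile
    : params.profile;
```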
2025-10-29 19:57:25 -07:00
Ilya Kreymer
002feb287b
dismiss js dialog popups (#895)
Move the JS dialog handler so it is no longer autoclick-only; dismiss all JS
dialogs (alert(), prompt()) to avoid blocking the page.
fixes #891
2025-10-08 14:57:52 -07:00
Ilya Kreymer
2270964996
logging: remove duplicate seeds found error (#893)
Per discussion, the message is unnecessary / confusing (doesn't provide
enough info) and can also happen on crawler restart.
2025-10-07 08:18:22 -07:00
Ilya Kreymer
a2742df328
seed urls list: check for quoted URLs and remove quotes (#883)
- check for URLs that are wrapped in quotes, eg. 'https://example.com/'
or "https://example.com/", and trim and remove the quotes before adding the seed
- tests: add quoted URL to tests, fix old.webrecorder.net test
- deps: update wabac.js, RWP to latest
- logging: reduce error logging for seed lists; only log once that there are duplicates or the page limit is reached
- fix for #882
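A minimal sketch of the quote-stripping (function name illustrative):

```ts
// trim whitespace, then strip one matching pair of surrounding quotes
function cleanSeedUrl(url: string): string {
  url = url.trim();
  const m = url.match(/^(['"])(.*)\1$/);
  return m ? m[2] : url;
}
```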
2025-09-12 13:34:41 -07:00
Ilya Kreymer
705bc0cd9f
Async Fetch Refactor (#880)
- separate out reading a stream response while the browser is waiting (not
really async) from actual async loading; this is now handled via
fetchResponseBody()
- unify async fetch into first trying browser networking for regular
GET, with fallback to regular fetch()
- load headers and body separately in async fetch, allowing for
cancelling the request after headers
- refactor direct fetch of non-HTML pages: load headers and handle
loading the body, adding the page asynchronously, allowing the worker to continue
loading browser-based pages (should allow more parallelization in the future)
- unify WARC writing in preparation for dedup: unified serializeWARC()
called for all paths, WARC digest computed, additional checks for
payload added for streaming loading
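A hedged sketch of the unified fallback order (fetchViaBrowserNetwork is an assumed name for the browser-networking path):

```ts
// try the browser's network stack for regular GETs, fall back to node fetch()
async function asyncFetch(url: string): Promise<Response> {
  try {
    return await fetchViaBrowserNetwork(url); // assumed helper
  } catch {
    return await fetch(url);
  }
}
```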
2025-09-10 12:05:21 -07:00
Ilya Kreymer
a42c0b926e
Support host-specific proxies with proxy config YAML (#837)
- Adds support for YAML-based config for multiple proxies, containing
'matchHosts' section by regex and 'proxies' declaration, allowing
matching any number of hosts to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching (sketched below) and running the browser with the specified
proxy PAC script served by an internal http server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config

Fixes #836
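A minimal sketch of the PAC generation, assuming the config maps host regexes to named proxies (the shapes and the naive regex embedding are illustrative; the real implementation also handles SOCKS directives and auth):

```ts
interface ProxyConfig {
  matchHosts: Record<string, string>; // host regex -> proxy name
  proxies: Record<string, string>;    // proxy name -> host:port
}

// emit a FindProxyForURL() that tries each regex in order, else goes direct
function generatePAC({ matchHosts, proxies }: ProxyConfig): string {
  const rules = Object.entries(matchHosts).map(
    ([regex, name]) =>
      `  if (/${regex}/.test(host)) { return "PROXY ${proxies[name]}"; }`,
  );
  return [
    "function FindProxyForURL(url, host) {",
    ...rules,
    '  return "DIRECT";',
    "}",
  ].join("\n");
}
```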

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-08-20 16:07:29 -07:00
Ilya Kreymer
18fe5a9676
behavior logging: remove last line dupe check for behavior logs (#874)
Repeated log messages shouldn't be skipped, as deduplication is unexpected
behavior for user-defined behaviors.
2025-07-30 16:20:14 -07:00
Ilya Kreymer
0652a3fb1d
quickfix: WACZ upload retry support: (#871)
- if an upload fails and the crawler restarts on error,
exit with 'interrupt' to allow for automatic restart (eg. in the Browsertrix
app)
- otherwise, a failed upload would exit the crawl with no WACZ, resulting
in overall crawl failure
2025-07-29 15:41:22 -07:00
sua yoo
bc4d649307
Capitalization fix for log messages (#870)
Capitalizes "URL" in log messages.
2025-07-24 23:52:12 -07:00
Ilya Kreymer
1a4341bfbc
url queueing: log skipped URLs as errors if depth === 0 (#868)
- will ensure seeds from the URL list are reported as errors if skipped
- also set logging context to 'scope' instead of 'links'
- fixes #866

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-07-23 10:05:40 -07:00
Ilya Kreymer
549d655173
Support option to fail crawl on content check (#861)
- add --failOnContentCheck for quick fail if content check in behavior
fails
- expose __bx_contentCheckFailed to cause an immediate failure from a
behavior
- only allow failing crawl due to content check from within
awaitPageLoad() callback
- set a 'failReason' key to track that crawl failed due to a particular
content check reason
- deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
(2.23.6)
- fixes #860

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-07-08 13:08:52 -07:00
Tessa Walsh
2af94ffab5
Support downloading seed file from URL (#852)
Fixes #841 

Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step, in order to be able to asynchronously fetch the seed file from a
URL.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-03 10:49:37 -04:00
Ilya Kreymer
52235ab21e
tmpdir: use os.tmpdir() instead of hardcoded '/tmp' (#842)
allows customizing the tmp directory via the TMPDIR env var
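A one-line illustration of the change:

```ts
import os from "node:os";

// resolves to TMPDIR (or the platform equivalent) instead of a hardcoded "/tmp"
const tmpDir = os.tmpdir();
```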
2025-05-28 12:48:06 -07:00
Ilya Kreymer
71de8d6582
lang code fixes: (#834)
- validate --lang values, failing immediately on an invalid ISO 639-1
language code
- ignore the --lang value when using a profile; print a warning that the
profile language takes precedence
- fixes #833
2025-05-12 16:06:29 -07:00
Ilya Kreymer
e39d5a31eb
support pause interrupt: (#825)
- add new interrupt reason / exit code
- add isCrawlPaused() which checks redis <id>:paused key
- exit gracefully, upload WACZ file when paused

fixes #824
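A hedged sketch of the pause check (the key name is from the description above; the redis client and stored value are assumptions):

```ts
// the crawl is considered paused while the <id>:paused key is set
async function isCrawlPaused(): Promise<boolean> {
  return (await redis.get(`${crawlId}:paused`)) !== null;
}
```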
2025-05-05 10:10:08 -07:00
Ilya Kreymer
13e9648398
state: add trimqueue() redis command to trim queue / seen list (#821)
useful to support dynamically lowering pageLimit when restarting a crawl
fixes issue raised in webrecorder/browsertrix#2514
2025-04-29 18:18:04 -07:00
Ilya Kreymer
f2dac05577
regression fix: start redis if needed before attempting to init state! (#819)
bump to 1.6.0-beta.1
2025-04-09 21:37:46 +02:00
Ilya Kreymer
c796996664
Support for behaviors from 'recorder flow' JSON created in devtools (#818)
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
2025-04-09 12:24:29 +02:00
Tessa Walsh
f83d0e8f02
Add option to push behavior + behavior script logs to Redis (#805)
Fixes #804 

- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorScript context and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-03 15:46:10 -07:00
aponb
6898bcf7ae
useSHA1 Parameter for generating SHA1 record hashes (#532) (#812)
By using the useSHA1 flag, the payload digest in records will use SHA-1
with Base32 encoding instead of the default SHA-256.
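A hedged sketch of the digest switch (the base32-encode helper and the SHA-256 encoding shown are assumptions, not necessarily what the crawler uses):

```ts
import { createHash } from "node:crypto";
import base32Encode from "base32-encode"; // assumed helper package

// WARC payload digest: SHA-1/Base32 when useSHA1 is set, else SHA-256
function payloadDigest(payload: Uint8Array, useSHA1 = false): string {
  return useSHA1
    ? "sha1:" + base32Encode(createHash("sha1").update(payload).digest(), "RFC4648")
    : "sha256:" + createHash("sha256").update(payload).digest("hex");
}
```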

Co-authored-by: Andreas Predikaka <andreas.predikaka@onb.ac.at>
2025-04-02 17:10:50 -07:00
Ilya Kreymer
bf6fbe8776
Remove extra console.log statements (#811)
- remove one added in screencaster
- also remove others that are outside the logging system
- bump to 1.5.10
2025-04-02 09:25:11 -07:00
Ilya Kreymer
fd41b32100
saved state tweaks: (#809)
- if the saved state filename is somehow duplicated, don't re-add it to the
array, to avoid deletion (fixes edge case in #791)
- also avoid double interpolation of the filename
2025-04-01 18:59:04 -07:00
Emma Segal-Grossman
41b968baac
Dynamically adjust reported aspect ratio based on GEOMETRY (#794)
Closes #793 
Related to #733

Adjusts the reported aspect ratio based on GEOMETRY env var.
Also adjusts stylesheet in screencast HTML to match.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-01 18:26:12 -07:00
Ilya Kreymer
e585b6d194
Better default crawlId (#806)
- set the crawl id from the collection, not the other way around, to ensure a unique
redis keyspace for different collections
- by default, set the crawl id to a unique value based on host and collection,
eg. '@hostname-@id'
- don't include '@id' in collection interpolation; can only use
hostname or timestamp
- fixes issue mentioned / workaround provided in #784
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
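A sketch of the default described above (param names assumed):

```ts
import os from "node:os";

// default crawl id: unique per host + collection, i.e. '@hostname-@id'
const crawlId = params.crawlId || `${os.hostname()}-${params.collection}`;
```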
2025-04-01 13:40:03 -07:00
Ilya Kreymer
e751929a7a
Move extractLinks to behaviors + Update to browsertrix-behaviors 0.8.0 (#803)
- extractLinks() now handled via browsertrix-behaviors
- fixes #770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions to enum list
2025-03-31 12:02:25 -07:00
Ilya Kreymer
9a7ac9bef1
Fix using cached WACZ filename if already set ahead of time. (#783)
- if the <uid>:nextWacz filename already exists, actually get it and use
that!
- don't merge cdx if not generating the wacz yet; use the same condition for
both
- bump version to 1.5.8
- follow-up to #748, fixes #747
2025-02-28 17:58:56 -08:00
benoit74
fc56c2cf76
Add more exit codes to detect interruption reason (#764)
Fix #584

- Replace interrupted with interruptReason
- Distinct exit codes for different interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), DiskUtilization (16)
are used when an interrupt happens for these reasons, in addition to existing reasons BrowserCrashed (10),
SignalInterrupted (11) and SignalInterruptedForce (13)
- Doc fix to cli args
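Collected for reference (the names and codes are exactly those listed above; the crawler's own enum may be organized differently):

```ts
enum InterruptReason {
  BrowserCrashed = 10,
  SignalInterrupted = 11,
  FailedLimit = 12,
  SignalInterruptedForce = 13,
  SizeLimit = 14,
  TimeLimit = 15,
  DiskUtilization = 16,
}
```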

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-10 14:00:55 -08:00
Ilya Kreymer
846f0355f6
Improved handling of browser stuck / crashed (#763)
- only attempt to close the browser if it has not crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
2025-02-10 10:16:25 -08:00
Ilya Kreymer
5807c320bf
remove fatal() on new window error + stats fix (#762)
logging (#752): ensure failed pages are included in totals
fatal rework: remove fatal() when failing to open a new window; throw instead to ensure the crawl is properly interrupted.
bump to 1.5.2
2025-02-09 15:26:36 -08:00
Ilya Kreymer
a5050a25d7
Readd health check on retry (#759)
- health check failures should be incremented even if retrying, in case
restart is needed
- cleanup writePage()
- bump default --maxPageRetries to 2 for better default for Browsertrix
2025-02-06 20:13:20 -08:00
Ilya Kreymer
00835fc4f2
Retry same queue (#757)
- follow up to #743
- page retries are simply added back to the same queue with the `retry`
param incremented and a higher score, after extraHops, to ensure retries
are added at the end.
- score calculation is: `score = depth + (extraHops * MAX_DEPTH) +
(retry * MAX_DEPTH * 2)`; this ensures that retries have lower priority
than extraHops, and additional retries even lower priority (higher
score).
- a warning is logged when a retry happens, an error only when all retries
are exhausted.
- back to one failure list; urls are added there only when all retries are
exhausted.
- rename --numRetries -> --maxRetries / --retries for clarity
- state load: allow retrying previously failed URLs if --maxRetries is
higher than on the previous run.
- ensure this works with --failOnFailedStatus; if provided, invalid status
codes (>= 400) are retried along with page load failures
- fixes #132
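The score formula above, as a function (MAX_DEPTH is the crawler's depth sentinel; the value here is only illustrative):

```ts
const MAX_DEPTH = 1_000_000; // illustrative, not the crawler's actual constant

// retries sort after extraHops, and each additional retry sorts later still
function queueScore(depth: number, extraHops: number, retry: number): number {
  return depth + extraHops * MAX_DEPTH + retry * MAX_DEPTH * 2;
}
```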

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-06 18:48:40 -08:00
Ilya Kreymer
5c9d808651
exit code cleanup (#753)
- use consistent enums for exit codes
- add disk space check on startup and add OutOfSpace exit code (3)
- preparation for #584
2025-02-06 17:54:51 -08:00
Ilya Kreymer
2e46140c3f
Make numRetries configurable (#754)
Add --numRetries param, default to 1 instead of 5.
2025-02-05 23:34:55 -08:00
Ilya Kreymer
95a631188d
hang protection: wrap remaining evaluate() calls to avoid rare hangs (#750)
wrap remaining frame.evaluate() and page.evaluate() calls that are not
already within a timedRun() in their own timedRun(), to avoid rare cases
where they never return (eg. if the page crashes during the evaluate)
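A hedged sketch of the wrapping (timedRun is named above; its signature here is assumed):

```ts
// guard an evaluate() that may never return if the page crashes mid-call
const title = await timedRun(
  page.evaluate(() => document.title),
  PAGE_OP_TIMEOUT_SECS, // assumed timeout constant, in seconds
  "page.evaluate() timed out",
);
```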
2025-01-30 17:39:20 -08:00
Ilya Kreymer
fe6199eebd
pages redis: include 'depth', 'seed' and 'favIconUrl' in page data added to redis (#749)
follow-up to #747
2025-01-30 11:18:59 -08:00
Ilya Kreymer
457d07aea4
if uploading wacz files, compute wacz filename on load to be able to store filename along with page data (#748):

- set filename on crawler load, if not already set, otherwise use
existing
- store filename per crawler instance in <crawlid>:nextWacz
- add 'filename' field to page when writing pages to redis
- clear wacz filename when wacz is uploaded to set a new one
- fixes #747
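A hedged sketch of the get-or-set on the redis key named above (the interpolation helper is an assumption):

```ts
// reuse the WACZ filename if one was already computed for this crawl instance
const key = `${crawlId}:nextWacz`;
let waczFilename = await redis.get(key);
if (!waczFilename) {
  waczFilename = interpolateFilename(waczNameTemplate); // assumed helper
  await redis.set(key, waczFilename);
}
```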

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-01-29 18:15:28 -08:00
Ilya Kreymer
a00866bbab
Apply exclusions to redirects (#745)
- if redirected page is excluded, block loading of page
- mark page as excluded, don't retry, and don't write to page list
- support generic blocking of pages based on initial page response
- fixes #744
2025-01-28 11:28:23 -08:00
Ilya Kreymer
f7cbf9645b
Retry support and additional fixes (#743)
- retries: for failed pages, set retry to 5 in case multiple retries
are needed.
- redirect: if the page url is /path/ -> /path, don't add it as an extra seed
- proxy: don't use the global dispatcher; pass the dispatcher explicitly when
using a proxy, as the proxy may interfere with local network requests
- final exit flag: if the crawl is done and also interrupted, ensure the WACZ is
still written/uploaded by setting final exit to true
- hashtag-only change forces reload: if loading a page with the same URL but a
different hashtag, eg. `https://example.com/#B` after
`https://example.com/#A`, do a full reload
2025-01-25 22:55:49 -08:00
Ilya Kreymer
5d9c62e264
Retry Failed Pages + Ignore Hashtags in Redirect Check (#739)
- Retry pages that are marked as failed once, at the end of the crawl,
in case it was due to a timeout
- Also, don't treat differences in hashtag between seed page loaded and
actual URL as a redirect (eg. don't add as new seed)
2025-01-16 15:51:35 -08:00
Ilya Kreymer
b7150f1343
Autoclick Support (#729)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but does not
enable it by default
- Adds support for a new exposed function `__bx_addSet`, which allows the
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates; only used if the link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler (sketched below) to reject onbeforeunload page
navigations while a behavior is running (page not finished), but accept them
when the page is finished, to allow navigating away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as an alias for --selectLinks for consistency
- Unknown options for --behaviors are printed as warnings, instead of a hard
exit, for forward compatibility with new behavior types in the future

Fixes #728, also #216, #665, #31
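A hedged sketch of the dialog policy (puppeteer-style API; pageFinished is the worker-state flag described above):

```ts
page.on("dialog", async (dialog) => {
  if (dialog.type() === "beforeunload") {
    // block navigating away while behaviors are still running
    await (pageFinished ? dialog.accept() : dialog.dismiss());
  } else {
    // alert()/confirm()/prompt(): dismiss so they can't block the page
    await dialog.dismiss();
  }
});
```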
2025-01-16 09:38:11 -08:00
Ilya Kreymer
d923e11436
separate fetch api for autofetch behavior + additional improvements on partial responses: (#736)
Chromium now interrupts fetch() if abort() is called or the page is
navigated, so autofetch behavior using the native fetch() is less than
ideal. This PR adds support for the __bx_fetch() command for autofetch
behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately
from the browser's regular fetch().
- __bx_fetch() starts a fetch but does not return content to the browser,
doesn't need abort(), and is unaffected by page navigation, yet will still try
to use the browser network stack when possible, making it more efficient for
background fetching.
- if the network-stack fetch fails, fall back to regular node fetch() in the
crawler.
Additional improvements for interrupted fetches:
- don't store truncated media responses, even for 200s
- avoid duplicate async fetching if the response is already handled (eg.
fetch handled in multiple contexts)
- fixes #735, where an interrupted fetch resulted in an empty response
2024-12-31 13:52:12 -08:00
Francesco Servida
07e5ceb4c2
Implemented option for full-page screenshot after behaviors have run (#656)
- new `fullPageFinal` screenshot option, which takes a full-page screenshot after behaviors run, or before moving on to the next page if behaviors are skipped.

Related to #486

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-23 21:26:55 -08:00
Ilya Kreymer
d04509639a
Support custom css selectors for extracting links (#689)
Support an array of selectors via the --selectLinks property, in the
form [css selector]->[property] or [css selector]->@[attribute].
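A minimal sketch of parsing that spec format (the parser and the default property are illustrative):

```ts
// "a[href]->href" -> { selector: "a[href]", extract: "href", isAttribute: false }
// "img->@src"     -> { selector: "img", extract: "src", isAttribute: true }
function parseSelectLinks(spec: string) {
  const [selector, extract = "href"] = spec.split("->"); // default assumed
  return extract.startsWith("@")
    ? { selector, extract: extract.slice(1), isAttribute: true }
    : { selector, extract, isAttribute: false };
}
```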
2024-11-08 11:04:41 -05:00
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Ilya Kreymer
e5bab8e7c8
various edge-case loading optimizations: (#709)
- rework 'should stream' logic (sketched below):
* ensure 206 responses (or any response) greater than 25M are streamed
* responses between 5M and 25M are read into memory if text/css/js, as they may be rewritten
* responses <5M are read into memory
* responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error-code responses may lack a content-length but are otherwise small
- likely fix for issues in #706
- if too many range requests for the same URL are being made, try
skipping/failing right away to reduce load
- assume the main browser context is used not just for service workers;
always enable
- check for false-positive 'net-aborted' errors that may actually be ok for
media, as well as documents
- improve logging
- interrupt any pending requests (that may be loading via browser
context) after page timeout, log dropped requests
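A hedged sketch of the streaming decision (thresholds from the list above; the mime matching is simplified):

```ts
const MB = 1024 * 1024;

// decide whether to stream a response to disk or read it into memory
function shouldStream(size: number | undefined, mime: string, status: number): boolean {
  if (size === undefined) {
    return status >= 200 && status < 300; // unknown size: stream only on 2xx
  }
  if (size > 25 * MB) {
    return true; // large responses (incl. 206) are always streamed
  }
  if (size > 5 * MB) {
    // mid-size: buffer only if it may need rewriting (text/css/js)
    return !/html|css|javascript|text/.test(mime);
  }
  return false; // <5M: read into memory
}
```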
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-10-31 14:06:17 -07:00
Ilya Kreymer
652cf9cfa6
link extraction promise cleanup: (#701)
- catch frame.evaluate() directly and log errors there to avoid any
possibility of exception being propagated before wrapping in timedRun()
- also add clearTimeout() to timedRun()
- possibly fixes openzim/zimit#376
2024-10-11 00:11:24 -07:00