Introduces a new RedisReportsIndex class that tracks statistics
about crawl data in Redis, including counts and sizes by host, crawl,
mime type, and HTTP status code category
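A minimal sketch of the status-code bucketing idea, with illustrative names (the actual RedisReportsIndex key scheme is not shown in this commit):

```javascript
// Bucket an HTTP status into a category such as "2xx" or "4xx".
function statusCategory(status) {
  if (status >= 100 && status < 600) {
    return `${Math.floor(status / 100)}xx`;
  }
  return "unknown";
}

// Hypothetical per-(crawl, dimension, value) Redis key, incremented per response.
function statsKey(crawlId, dimension, value) {
  return `reports:${crawlId}:${dimension}:${value}`;
}

console.log(statusCategory(404)); // "4xx"
console.log(statsKey("crawl-1", "status", statusCategory(301)));
```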
Removes our custom evaluateWithCLI() call in favor of using standard
`frame.evaluate()`.
The custom method had become trickier to use, and doesn't work for all
iframes.
The main benefit of it was to inject `getEventListeners()`, which is now
only used in Autoscroll to potentially skip scrolling, and not needed
for any other behaviors.
We could potentially use this workaround:
https://stackoverflow.com/questions/75517220/geteventlistener-function-support-for-latest-puppeteer-versions/75581410#75581410
to bring back `getEventListeners`, or adding a custom callback like
`hasEventListener()` would suffice for that check.
This simplifies the codebase and ensures that running behaviors is more
reliable.
Also adds a callback for new frames that may get added to the page, such
as during scrolling, and ensures behaviors are called on those iframes
as well.
When available, async fetch should try to load via the browser network,
especially for in-page discovered URLs, to ensure proper credentials are
used (may fix #960):
- adds missing CDP param that resulted in browser network being skipped!
- try the browser network for direct fetch too, if a page is available,
but then fall back to node fetch
- default to node fetch when network loading failed, or if in browser
(non-page) context for request interception, eg. in a worker.
- updates to browsertrix-behaviors 0.9.8, which prefers in-browser fetch when possible.
Fixes #884
- Support for hash-based deduplication via a Redis provided with
--redisDedupeUrl (can be same as default redis)
- Support for writing WARC revisit records for duplicates
- Support for new indexer mode which imports CDXJ from one or more WACZs
(refactored from replay) to populate the dedup index
- Crawl and aggregate stats updated in dedupe index, including total
urls, deduped URLs, conserved size (difference between revisit and
response records), and estimated redundant size (aggregate) of
duplicates not deduped.
- Track removed crawls on index update, support for --remove operation
to purge removed crawls, otherwise removed crawl aggregate data is
maintained.
- Dependencies of each deduped crawl (WACZ files containing original data) are recorded in datapackage.json related.requires field.
- Initial docs (develop/dedupe.md) and tests (tests/dedupe-basic.test.js) added.
- WIP on page-level dedupe (preempt loading entire pages) if HTML is a
dupe/matches exactly.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #971
- Add decompression interceptor as default for getProxyDispatcher(), no
need to decompress sitemap explicitly
- Add option to not decompress, keep content-encoding header and don't
add `x-orig-` when using node async fetch
- Create three dispatcher variants per proxy: redirect + decompress,
redirect + no decompress, no redirect + no decompress
- always attempt to delete existing profile dir before moving new one in
its place, fixes #968
- treat 304 (eg. if recrawling with existing profile) as cached
resource, don't attempt to write/check size
- fix typo in shouldSkipSave() for incomplete 206 responses
- Only use the major version from the browser version, set the rest to
0.0.0 to match Brave/Chrome behavior
- Store major version in Browser
- Also set some `sec-ch-ua-*` headers to match Brave defaults
- Don't disable cache when creating profile to avoid sending different
Cache-Control headers when in profile creation (cache cleared before
profile created anyway)
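The major-version-only behavior can be illustrated with a small sketch (the function name is hypothetical, not the crawler's actual code):

```javascript
// Keep only the major version, zeroing the rest, to match how
// Brave/Chrome report versions in client hints.
function majorVersionOnly(fullVersion) {
  const major = fullVersion.split(".")[0];
  return `${major}.0.0.0`;
}

console.log(majorVersionOnly("124.0.6367.91")); // "124.0.0.0"
```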
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
- applies normalizeUrl() to seed URL and seed isIncluded() check
- add normalizeUrl() wrapper which applies standard opts and also catches and logs any errors from normalization
- test: add scope tests to ensure URL with differently sorted query args still in scope
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Refactor dispatcher apis:
- Use `getProxyDispatcher(withRedirect = true)` to follow redirects by
default, with option to disable (eg. in recorder). This dispatcher also
ignores TLS errors, to match current browser config. Used for fetching
archival content.
- Use `getFollowRedirectsDispatcher()`, which follows redirects but does
not ignore TLS errors and does not use proxies, for fetching non-archival
configs (profiles, behaviors, etc...)
Fixes #954, a regression from #946
- update warcio.js to 2.4.9 to fix issue with multiple repeated header
values (now allowed for HTTP headers)
- ensure links discovered from autoclick are also crawled: the links are
stored in a set to avoid dupe links, but no reason not to also queue
them for crawling, if they're in scope.
- bump to 1.11.2
- ensure two signals at least 1 sec apart are received before immediate
termination
- only exit immediately if crawl not already post-processing, otherwise
let post-processing run its course
- remove openAsBlob() as that doesn't work with the request() api
- but keep openAsBlob for when interfacing with wabac.js fetch()
- also remove commented out code
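The double-signal guard could look roughly like this (names and the exact threshold handling are illustrative; the real handler also checks whether post-processing is already running):

```javascript
// Only escalate to immediate termination if two signals arrive
// at least 1 second apart; otherwise stay on the graceful path.
let firstSignalTime = 0;

function onSignal(nowMs) {
  if (firstSignalTime && nowMs - firstSignalTime >= 1000) {
    return "terminate";
  }
  firstSignalTime = firstSignalTime || nowMs;
  return "graceful";
}
```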
Extends work in #547, adding upload via the @aws-sdk/lib-storage library:
- Replaces minio client with official aws s3 client
- Uses @aws-sdk/lib-storage for multi-part upload support
Testing:
- This should ideally address issues from #479 and
webrecorder/browsertrix#2925
- Tested with all the major S3 implementations: VersityGW, RustFS,
SeaweedFS, Garage as well as Minio
---------
Co-authored-by: Mattia <m@ttia.it>
Co-authored-by: Mattia <mattia@guella.it>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Per https://undici.nodejs.org/#/?id=benchmarks request() is supposed to
be much more performant compared to fetch() with almost the same
interface.
This PR replaces all of the fetch() calls (both using proxy dispatcher
and regular fetch) with undici request()
The migration is fairly simple, as shown in
https://undici.nodejs.org/#/?id=migration-guide
The migration eliminates various web-stream-to-node-stream
conversions. To support automatic redirects, the undici redirect interceptor is used.
Also updates to latest undici (v7)
Fixes #944
Sets the file ulimit to 8192 and then launches x11vnc. Should result in
faster profile loading when the default file limit is especially high, due to a bug in libvncserver (see #944 for more details).
Updates custom behavior sample class and examples to be accurate:
- Include missing required `init()` method
- Fix arguments in example uses of `Lib.getState()`
- Ensure the /sitemap.xml is parsed even if robots.txt exists, but no
sitemaps listed there.
- Resolve relative URLs listed in robots.txt, eg. 'Sitemap:
/my-sitemap.xml'
- Simplify sitemap detection logic, check robots first, then sitemap.xml
OR alternate url if provided via --useSitemap <url>
- Have two main methods, parseSitemap() and parseSitemapFromRobots()
that handle the parsing.
- follow-up to #930
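Resolving relative sitemap URLs listed in robots.txt can be sketched with the standard URL API (the helper name and regex are hypothetical):

```javascript
// Extract "Sitemap:" entries from a robots.txt body, resolving
// relative entries like "/my-sitemap.xml" against the robots.txt URL.
function sitemapsFromRobots(robotsUrl, robotsBody) {
  const sitemaps = [];
  for (const line of robotsBody.split("\n")) {
    const m = line.match(/^\s*sitemap:\s*(\S+)/i);
    if (m) {
      sitemaps.push(new URL(m[1], robotsUrl).href);
    }
  }
  return sitemaps;
}

console.log(sitemapsFromRobots(
  "https://example.com/robots.txt",
  "User-agent: *\nSitemap: /my-sitemap.xml\nSitemap: https://cdn.example.com/s.xml"
));
// → ["https://example.com/my-sitemap.xml", "https://cdn.example.com/s.xml"]
```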
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #937
- Don't remove URLs from seen list
- Add new excluded key, adding URLs to be excluded (out-of-scope on
redirect) to the excluded set. The size of this set gives the number of
URLs excluded in this way, used to compute the number of discovered
URLs.
- Don't write urn:pageinfo records for excluded pages, along with not
writing to pages/extraPages.jsonl
- use 'normalize-url' package to avoid differently sorted query args
that are the same url
- configure other options, such as keeping www. and trailing slashes,
only using this for query arg sorting
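The query-arg sorting goal can be illustrated with the standard URL API (a minimal stand-in for this one option, not the 'normalize-url' package itself):

```javascript
// Sort query args so otherwise-identical URLs compare equal,
// while leaving "www." and trailing slashes untouched.
function normalizeQueryArgs(url) {
  const u = new URL(url);
  u.searchParams.sort();
  return u.href;
}

console.log(
  normalizeQueryArgs("https://www.example.com/page/?b=2&a=1") ===
  normalizeQueryArgs("https://www.example.com/page/?a=1&b=2")
); // true
```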
- original implementation did not actually wait for sitemaps to complete
before queuing new ones, resulting in a concurrency resource leak
- refactor to await completion of sitemap parser, replacing pending list
with counter
- also, don't parse sitemap if single-page and no extra hops!
- fixes issues in #928
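The counter-based refactor could be sketched as follows (illustrative names, not the crawler's exact implementation):

```javascript
// Track in-flight sitemap parses with a counter; resolve only
// when the count drops back to zero.
class PendingCounter {
  constructor() {
    this.count = 0;
    this.done = new Promise((res) => { this.resolveDone = res; });
  }
  start() { this.count++; }
  finish() {
    if (--this.count === 0) this.resolveDone();
  }
}

async function parseAll(urls, parseOne) {
  const pending = new PendingCounter();
  pending.start(); // guard so an empty list still resolves
  for (const url of urls) {
    pending.start();
    parseOne(url).finally(() => pending.finish());
  }
  pending.finish();
  await pending.done; // completes only after every parse finishes
}
```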
- if a page is stuck in a window.alert / window.prompt loop, showing 10
or more consecutive dialogs (unrelated to unloading), call Page.crash()
to more quickly move on to the next page, as not much else can be done.
- add exception handling in dialog accept/dismiss to avoid crawler crash
- fixes #926
- in doCancel(), use abort controller and call abort(), instead of
body.cancel()
- ensure doCancel() is called when a WARC record is not written, eg. is
a dupe, as stream is likely not consumed
- also call IO.close() when using the browser network reader
- fixes #923
- also adds missing dupe check to async resources queued from behaviors
(were being deduped on write, but were still fetched unnecessarily)
Fixes#920
- Downloads profile, custom behavior, and seed list to `/downloads`
directory in the crawl
- Seed File: Downloaded into downloads. Never refetched if already
exists on subsequent crawl restarts.
- Custom Behaviors: Git: Downloaded into a dir, then moved to
/downloads/behaviors/<dir name>. If it already exists, failure to
download will reuse the existing directory
- Custom Behaviors: File: Downloaded into a temp file, then moved to
/downloads/behaviors/<name.js>. If it already exists, failure to
download will reuse the existing file.
- Profile: using `/profile` directory to contain the browser profile
- Profile: downloaded to temp file, then placed into
/downloads/profile.tar.gz. If failed to download, but already exists,
existing /profile directory is used
- Also fixes#897
Fixes#631
- Adds --robots flag which will enable checking robots.txt for each host for each page, before the page is queued for further crawling.
- Supports --robotsAgent flag which configures the agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'
- Robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU, retaining the last 100 robots entries, each up to 100K
- Non-200 responses are treated as empty robots, and empty robots are treated as 'allow all'
- Multiple requests to the same robots.txt are batched to perform only one fetch, waiting up to 10 seconds per fetch.
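The batching of concurrent robots.txt requests can be sketched with an in-flight promise map (names illustrative; the 10-second timeout and Redis LRU are omitted):

```javascript
// robots URL -> Promise<string>; concurrent callers share one fetch.
const inflight = new Map();

let fetchCount = 0; // illustrative counter showing batching in action

async function getRobots(url, fetchBody) {
  let p = inflight.get(url);
  if (!p) {
    fetchCount++;
    p = fetchBody(url).finally(() => inflight.delete(url));
    inflight.set(url, p);
  }
  return p;
}
```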
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- clear size to 0 immediately after wacz is uploaded
- if crawler is paused, ensure upload of any data on startup
- fetcher q: stop queuing async requests if recorder is marked for
stopping
- allow fail on content check from main behavior
- update to behaviors 0.9.6 to support 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if profile cannot be extracted
- set default networkIdle to 2
- add netIdleMaxRequests as an option, default to 1 (in case of long
running requests)
- further fix for #913
- avoid accidental logging
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>