Commit graph

602 commits

Author SHA1 Message Date
Ilya Kreymer
9db0872ecc rebase fix 2025-11-27 22:41:34 -08:00
Ilya Kreymer
7c37672ae9 add removing option to also remove unused crawls if doing a full sync, disable by default 2025-11-27 22:35:15 -08:00
Ilya Kreymer
0d414f72f1 indexer optimize: commit only if added 2025-11-27 22:35:15 -08:00
Ilya Kreymer
dd8d2e1ea7 rename 'dedup' -> 'dedupe' for consistency 2025-11-27 22:35:15 -08:00
Ilya Kreymer
c4f07c4e59 always return wacz; store wacz dependencies only for the current wacz
store crawlid dependencies for the entire crawl
2025-11-27 22:35:14 -08:00
Ilya Kreymer
9fba5da0ce cleanup, still keep compatibility with redis 6
set to 'post-crawl' state after uploading
2025-11-27 22:34:43 -08:00
Ilya Kreymer
6579b2dc95 update to new data model:
- hashes stored in separate crawl specific entries, h:<crawlid>
- wacz files stored in crawl specific list, c:<crawlid>:wacz
- hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set
- store filename, crawlId in related.requires list entries for each wacz
2025-11-27 22:32:52 -08:00
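
A minimal sketch of the key layout this commit describes, using ioredis. The key names `h:<crawlid>`, `c:<crawlid>:wacz`, `alldupes`, and `allcrawls` come from the commit message; the function names and entry values are illustrative, not the crawler's actual code:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_DEDUP_URL || "redis://localhost:6379/0");

// during the crawl: record a hash in the crawl-specific entry,
// and each finished WACZ in the crawl-specific list
async function addHash(crawlId: string, hash: string, entry: string) {
  await redis.hset(`h:${crawlId}`, hash, entry);
}

async function addWacz(crawlId: string, waczFilename: string) {
  await redis.rpush(`c:${crawlId}:wacz`, waczFilename);
}

// when the crawl completes: commit its hashes to the shared 'alldupes'
// hashset and register the crawl in the 'allcrawls' set
async function commitCrawl(crawlId: string) {
  const hashes = await redis.hgetall(`h:${crawlId}`);
  if (Object.keys(hashes).length) {
    await redis.hset("alldupes", hashes);
  }
  await redis.sadd("allcrawls", crawlId);
}
```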
Ilya Kreymer
298b901558 - track source index for each hash, so entry becomes '<source index> <date> <url>'
- entry for source index can contain the crawl id (or possibly wacz and crawl id)
- also store dependent sources in relation.requires in datapackage.json
- tests: update tests to check for relation.requires
2025-11-27 22:29:37 -08:00
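
Illustrative only: a tiny sketch of the '<source index> <date> <url>' entry format named above; the helper names are hypothetical:

```ts
function formatEntry(sourceIndex: number, date: string, url: string): string {
  // entry format described in the commit: '<source index> <date> <url>'
  return `${sourceIndex} ${date} ${url}`;
}

function parseEntry(entry: string): { sourceIndex: number; date: string; url: string } {
  // split on the first two spaces; URLs contain no literal spaces
  const [index, date, ...rest] = entry.split(" ");
  return { sourceIndex: Number(index), date, url: rest.join(" ") };
}
```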
Ilya Kreymer
8d53399455 dedup POST requests and non-404s as well!
update timestamp after import
2025-11-27 22:28:43 -08:00
Ilya Kreymer
78b8847323 use dedup redis to queue up wacz files that need to be updated
use pending queue to support retries in case of failure
store both id and actual URL in case URL changes in subsequent retries
2025-11-27 22:28:43 -08:00
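
A hedged sketch of the pending-queue retry pattern described above, using the classic Redis RPOPLPUSH reliable-queue idiom via ioredis; the queue names and the { id, url } entry shape are assumptions, not the crawler's actual keys:

```ts
import Redis from "ioredis";

const redis = new Redis("redis://localhost:6379/0");

// hypothetical queue names
const QUEUE = "wacz:toupdate";
const PENDING = "wacz:pending";

// enqueue a WACZ that needs updating; store both id and URL, since the
// URL may change between retries
async function enqueue(id: string, url: string) {
  await redis.lpush(QUEUE, JSON.stringify({ id, url }));
}

// atomically move one entry to the pending queue while it is processed
async function takeNext(): Promise<{ id: string; url: string } | null> {
  const raw = await redis.rpoplpush(QUEUE, PENDING);
  return raw ? JSON.parse(raw) : null;
}

// on success, remove the entry from pending; on failure, leave it there
// so a retry pass can pick it up again
async function markDone(entry: { id: string; url: string }) {
  await redis.lrem(PENDING, 1, JSON.stringify(entry));
}
```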
Ilya Kreymer
ca02f09b5d dedup indexing: strip hash prefix from digest, as cdx does not have it
tests: add index import + dedup crawl to ensure digests match fully
2025-11-27 22:28:43 -08:00
Ilya Kreymer
db4393c2a1 deps update 2025-11-27 22:28:42 -08:00
Ilya Kreymer
0cadf371d0 tests: add dedup-basic.test for simple dedup, ensure number of revisit records === number of response records 2025-11-27 22:28:13 -08:00
Ilya Kreymer
c447428450 bump to 2.4.7 2025-11-27 22:28:12 -08:00
Ilya Kreymer
2f81798f09 update to latest warcio (2.4.7) to fix issues when returning payload-only size 2025-11-27 22:27:56 -08:00
Ilya Kreymer
db9e78e823 rename --dedupStoreUrl -> --redisDedupUrl
bump version to 1.9.0
fix typo
2025-11-27 22:27:26 -08:00
Ilya Kreymer
bbe084daa0 warc writing:
- update to warcio 2.4.6, write WARC-Payload-Digest along with WARC-Block-Digest for revisits
- copy additional custom WARC headers to revisit from response
2025-11-27 22:27:09 -08:00
Ilya Kreymer
87c94876f6 keep skipping dupe URLs as before 2025-11-27 22:26:35 -08:00
Ilya Kreymer
2ecf290d38 add indexer entrypoint:
- populate dedup index from remote wacz/multi wacz/multiwacz json

refactor:
- move WACZLoader to wacz to be shared with indexer
- state: move hash-based dedup to RedisDedupIndex

cli args:
- add --minPageDedupDepth to indicate when pages are skipped for dedup

- skip same URLs by same hash within same crawl
2025-11-27 22:26:32 -08:00
Ilya Kreymer
eb6b87fbaf args: add separate --dedupIndexUrl to support separate redis for dedup
indexing prep:
- move WACZLoader to wacz for reuse
2025-11-27 22:25:59 -08:00
Ilya Kreymer
00eca5329d dedup work:
- resource dedup via page digest
- page dedup via page digest check, blocking of dupe page
2025-11-27 22:25:59 -08:00
Ilya Kreymer
8e44b31b45 version: bump to 1.10.0-beta.1 2025-11-27 22:25:11 -08:00
Ilya Kreymer
2ef8e00268
fix connection leaks in aborted fetch() requests (#924)
- in doCancel(), use abort controller and call abort(), instead of
body.cancel()
- ensure doCancel() is called when a WARC record is not written, e.g. it is
a dupe, as the stream is likely not consumed
- also call IO.close() when using the browser network reader
- fixes #923
- also adds missing dupe check to async resources queued from behaviors
(they were being deduped on write, but were still fetched unnecessarily)
2025-11-27 20:37:24 -08:00
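
A minimal sketch of the AbortController pattern this fix describes: aborting the whole request releases the underlying connection, whereas body.cancel() may leave it open when the stream is never consumed. The surrounding function is illustrative:

```ts
async function fetchUnlessDupe(
  url: string,
  isDupe: (u: string) => boolean,
): Promise<Response | null> {
  const ac = new AbortController();
  const resp = await fetch(url, { signal: ac.signal });

  if (isDupe(url)) {
    // cancel via the controller so the connection is torn down, since the
    // body stream of a dupe record will never be consumed
    ac.abort();
    return null;
  }
  return resp;
}
```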
Ilya Kreymer
8658df3999
deps: update to browsertrix-behaviors 0.9.7, puppeteer-core 24.31.0 (#922) 2025-11-26 20:12:16 -08:00
Ilya Kreymer
30646ca7ba
Add downloads dir to cache external dependency within the crawl (#921)
Fixes #920 
- Downloads profile, custom behaviors, and seed list to the `/downloads`
directory in the crawl
- Seed file: downloaded into /downloads; never refetched if it already
exists on subsequent crawl restarts.
- Custom behaviors (Git): downloaded into a dir, then moved to
/downloads/behaviors/<dir name>; if it already exists, a failure to
download will reuse the existing directory.
- Custom behaviors (File): downloaded into a temp file, then moved to
/downloads/behaviors/<name.js>; if it already exists, a failure to
download will reuse the existing file.
- Profile: uses the `/profile` directory to contain the browser profile
- Profile: downloaded to a temp file, then placed into
/downloads/profile.tar.gz; if the download fails but the file already
exists, the existing /profile directory is used
- Also fixes #897
2025-11-26 19:30:27 -08:00
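
A sketch of the download-or-reuse pattern described above, under the assumption of a hypothetical fetchToFile() helper; the /downloads path mirrors the commit message:

```ts
import fs from "fs";
import path from "path";

const DOWNLOADS_DIR = "/downloads";

async function downloadOnce(url: string, name: string): Promise<string> {
  const dest = path.join(DOWNLOADS_DIR, name);
  const tmp = dest + ".tmp";
  try {
    await fetchToFile(url, tmp);          // download to a temp file first
    await fs.promises.rename(tmp, dest);  // then move into place
  } catch (e) {
    // on a failed download, reuse the file left by a previous crawl run, if any
    if (!fs.existsSync(dest)) {
      throw e;
    }
  }
  return dest;
}

// hypothetical helper: stream a URL into a local file
async function fetchToFile(url: string, dest: string): Promise<void> {
  const resp = await fetch(url);
  if (!resp.ok) throw new Error(`fetch failed: ${resp.status}`);
  await fs.promises.writeFile(dest, Buffer.from(await resp.arrayBuffer()));
}
```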
Tessa Walsh
1d15a155f2
Add option to respect robots.txt disallows (#888)
Fixes #631 
- Adds a --robots flag which enables checking robots.txt for each host
before each page is queued for further crawling.
- Supports a --robotsAgent flag which configures the agent to check in
robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'.
- robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU,
retaining the last 100 robots entries, each up to 100K.
- Non-200 responses are treated as empty robots, and empty robots are
treated as 'allow all'.
- Multiple requests to the same robots.txt are batched to perform only one
fetch, waiting up to 10 seconds per fetch.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-11-26 19:00:06 -08:00
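
The allow/disallow check itself can be sketched with the robots-parser library the commit names; the Redis LRU cache and fetch batching are omitted here, and the surrounding function is illustrative:

```ts
import robotsParser from "robots-parser";

async function isPageAllowed(pageUrl: string, agent = "Browsertrix/1.x"): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", pageUrl).href;
  const resp = await fetch(robotsUrl);
  // non-200 responses are treated as an empty robots.txt, i.e. 'allow all'
  const body = resp.ok ? await resp.text() : "";
  const robots = robotsParser(robotsUrl, body);
  // isAllowed() returns undefined when no rule applies; treat that as allowed
  return robots.isAllowed(pageUrl, agent) !== false;
}
```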
Ilya Kreymer
75a0c9a305 version: bump to 1.10.0-beta.0 2025-11-26 15:15:45 -08:00
hexagonwin
9cd2d393bc
Fix typo 'runInIframes' (#918)
'runInIframes' appears to be a typo.
(https://github.com/webrecorder/custom-behaviors/blob/main/behaviors/timeline.js
example)
2025-11-25 19:19:01 -08:00
Ilya Kreymer
b9b804e660
improvements to support pausing: (#919)
- clear size to 0 immediately after wacz is uploaded
- if crawler is paused, ensure upload of any data on startup
- fetcher q: stop queuing async requests if recorder is marked for
stopping
2025-11-25 19:17:39 -08:00
Ilya Kreymer
565ba54454
better failure detection, allow update support for captcha detection via behaviors (#917)
- allow failing on content check from main behavior
- update to behaviors 0.9.6 to support 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if profile can not be extracted
2025-11-19 15:49:49 -08:00
Ilya Kreymer
87edef3362
netIdle cleanup + better default for pages where networkIdle times out (#916)
- set default networkIdle to 2
- add netIdleMaxRequests as an option, default to 1 (in case of
long-running requests)
- further fix for #913 
- avoid accidental logging

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-11-18 16:34:02 -08:00
Ilya Kreymer
8c8fd6be08
remove --disable-component-update flag, fixes shields not working (#915)
should fix the main cause of the slowdown in #913
deps: update to brave 1.84.139, puppeteer 24.30.0
bump to 1.9.1
2025-11-14 20:30:42 -08:00
Ilya Kreymer
bb11147234
brave: update policies to disable new brave services (#914) 2025-11-14 20:00:58 -08:00
Ilya Kreymer
59fe064c62 version: bump to 1.9.0 2025-11-11 18:28:21 -08:00
Ilya Kreymer
85c5632eb1
deps: bump dependencies for 1.9.0 (#912)
update to brave 1.84.135, wabac.js 2.24.5
2025-11-11 14:38:35 -08:00
Tessa Walsh
11f52db31e
Fix linting following external contribution (#911)
Quick follow-up to
7dd13a9ec4,
to fix a linting issue introduced in that PR.
2025-11-11 12:03:56 -08:00
aponb
b50ef1230f
feat: add extraChromeArgs support for passing custom Chrome flags (#877)
This change introduces a new CLI option --extraChromeArgs to Browsertrix
Crawler, allowing users to pass arbitrary Chrome flags without modifying
the codebase.

This approach is future-proof: any Chrome flag can be provided at
runtime, avoiding the need for hard-coded allowlists.
Maintains backward compatibility: if no extraChromeArgs are passed,
behavior remains unchanged.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-11-11 12:03:30 -08:00
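
A hedged sketch of how such an option might be wired up with yargs and merged into the browser launch args; the --extraChromeArgs name comes from the PR, everything else here is illustrative:

```ts
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

const argv = yargs(hideBin(process.argv))
  .option("extraChromeArgs", {
    describe: "additional Chrome flags to pass to the browser",
    type: "array",
    default: [],
  })
  .parseSync();

// merge user-supplied flags with the crawler's built-in ones at launch time
const chromeArgs: string[] = [
  "--no-sandbox",                          // example built-in flag
  ...(argv.extraChromeArgs as string[]),   // e.g. --disable-gpu from the CLI
];
```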
Percival
7dd13a9ec4
fix: Skip proxy for seed file and custom behavior downloads (#907) 2025-11-11 10:51:24 -05:00
Wannaphong Phatthiyaphaibun
37a6fa974b
Fix directory path in user guide for WACZ file (#910)
I found that the directory path in the user guide for the WACZ file is
wrong. It should be `crawls/collections/test/test.wacz`.
2025-11-07 12:39:01 -08:00
Ilya Kreymer
74b6ad0ae0 deps: bump behaviors to 0.9.5
beta 1.9.0-beta.1
2025-11-02 12:30:09 -08:00
Ilya Kreymer
390d036f9e
deps: update to browsertrix-behaviors 0.9.4 (#906)
Includes fixes for autoclick behavior:
- able to click on svgs
- don't navigate back if click did not result in history stack change
2025-11-02 09:12:15 -08:00
Ilya Kreymer
5685cb2cbe
profiles: add singleton lock removal on startup to avoid any issues (#904)
If the SingletonLock (and SingletonPort, SingletonSocket) files somehow
made it into the profile, the browser will refuse to start. This will
ensure that they are cleared.
(Could also do this before saving the profile, but this will catch it for
any existing profiles.)
2025-11-02 09:12:07 -08:00
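
A sketch of clearing the singleton files on startup; the three filenames are taken from the commit message, and profileDir is an assumption:

```ts
import fs from "fs";
import path from "path";

function clearSingletonFiles(profileDir: string) {
  for (const name of ["SingletonLock", "SingletonPort", "SingletonSocket"]) {
    // force: true ignores missing files, so this is a no-op on clean profiles
    fs.rmSync(path.join(profileDir, name), { force: true });
  }
}
```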
Ilya Kreymer
3935526240
add --saveProfile option to save profile after successful crawl (#903)
- if --saveProfile is specified, attempt to save profile to same target
as --profile
- if --saveProfile <target>, save to target
- save profile on finalExit if browser has launched
- supports local file paths and storage-relative path with '@' (same as
--profile)
- also clear cache in first worker to match regular profile creation

fixes #898
2025-10-29 19:57:25 -07:00
Ilya Kreymer
afdb6674e5
profile download improvements: (#899)
- log when profile download starts
- ensure there is a timeout on the profile download attempt (60 secs)
- retry 2 more times if the initial profile download times out
- fail the crawl after 3 retries if the profile cannot be downloaded
successfully

bump to 1.8.2
2025-10-25 16:49:40 -07:00
Ilya Kreymer
6f26148a9b bump version to 1.8.1 2025-10-08 17:11:04 -07:00
Ilya Kreymer
4f234040ce
Profile Saving Improvements (#894)
fix some observed errors that occur when saving profile:
- use browser.cookies instead of page.cookies to get all cookies, not
just from the page
- catch and ignore exceptions when clearing the cache
- logging: log when proxy init is happening on all paths, in case of an
error in the proxy connection
2025-10-08 17:09:20 -07:00
Ilya Kreymer
002feb287b
dismiss js dialog popups (#895)
move the JS dialog handler so it is not only for autoclick; dismiss all JS
dialogs (alert(), prompt()) to avoid blocking the page
fixes #891
2025-10-08 14:57:52 -07:00
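
The standard puppeteer-core pattern for this, as a sketch; the commit applies it beyond autoclick, and the wrapper function here is illustrative:

```ts
import { Page } from "puppeteer-core";

function dismissDialogs(page: Page) {
  page.on("dialog", async (dialog) => {
    // dismiss alert()/confirm()/prompt() so they cannot block the page
    await dialog.dismiss();
  });
}
```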
Ilya Kreymer
2270964996
logging: remove duplicate seeds found error (#893)
Per discussion, the message is unnecessary / confusing (doesn't provide
enough info) and can also happen on crawler restart.
2025-10-07 08:18:22 -07:00
Ilya Kreymer
fd49041f63
flow behaviors: add scrolling into view (#892)
Some page elements don't quite respond correctly if the element is not
in view, so we should add setEnsureElementIsInTheViewport() to the click,
doubleclick, hover, and change step locators.
2025-10-07 08:17:56 -07:00
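
A sketch of the Puppeteer Locator API call the commit names, which scrolls the element into view before acting on it; the wrapper and selector are illustrative:

```ts
import { Page } from "puppeteer-core";

async function clickInView(page: Page, selector: string) {
  await page
    .locator(selector)
    .setEnsureElementIsInTheViewport(true)
    .click();
}
```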
Ed Summers
cc2d890916
Add addLink doc (#890)
It's helpful to know this function is there!
2025-10-02 15:45:55 -04:00