- hashes stored in separate crawl-specific entries, h:<crawlid>
- wacz files stored in crawl-specific list, c:<crawlid>:wacz
- hashes committed to 'alldupes' hashset when a crawl is complete, crawls added to 'allcrawls' set
- store filename and crawlId in related.requires list entries for each wacz
Fixes #631
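A rough sketch of this Redis key layout, using ioredis; the value type chosen for each key (hash vs. list vs. set) and the helper names are assumptions, not taken from the implementation:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");

// record a payload hash seen during a crawl in its crawl-specific entry: h:<crawlid>
export async function addHash(crawlId: string, hash: string, waczFile: string) {
  await redis.hset(`h:${crawlId}`, hash, waczFile);
}

// record a finished WACZ file in the crawl-specific list: c:<crawlid>:wacz
export async function addWacz(crawlId: string, filename: string) {
  await redis.rpush(`c:${crawlId}:wacz`, filename);
}

// on crawl completion: commit the crawl's hashes to 'alldupes'
// and add the crawl id to the 'allcrawls' set
export async function commitCrawl(crawlId: string) {
  const hashes = await redis.hgetall(`h:${crawlId}`);
  if (Object.keys(hashes).length) {
    await redis.hset("alldupes", hashes);
  }
  await redis.sadd("allcrawls", crawlId);
}
```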
- Adds --robots flag which enables checking robots.txt for each host, before each page is queued for further crawling.
- Supports --robotsAgent flag which configures the agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'
- Robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU, retaining the last 100 robots.txt entries, each up to 100K
- Non-200 responses are treated as empty robots, and empty robots are treated as 'allow all'
- Multiple requests to the same robots.txt are batched to perform only one fetch, waiting up to 10 seconds per fetch.
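A minimal sketch of the allow check using robots-parser (the library named above); the in-process Map cache stands in for the crawler's Redis-backed LRU, and the helper names are illustrative:

```ts
import robotsParser from "robots-parser";

const cache = new Map<string, string>();
const MAX_BODY = 100_000; // robots.txt bodies capped at ~100K

async function fetchRobotsBody(robotsUrl: string): Promise<string> {
  const cached = cache.get(robotsUrl);
  if (cached !== undefined) {
    return cached;
  }
  let body = "";
  try {
    const resp = await fetch(robotsUrl, { signal: AbortSignal.timeout(10000) });
    // non-200 responses are treated as an empty robots.txt
    if (resp.ok) {
      body = (await resp.text()).slice(0, MAX_BODY);
    }
  } catch (e) {
    // fetch errors also fall back to an empty robots.txt
  }
  cache.set(robotsUrl, body);
  return body;
}

export async function isPageAllowed(pageUrl: string, agent = "Browsertrix/1.x") {
  const robotsUrl = new URL("/robots.txt", pageUrl).href;
  const body = await fetchRobotsBody(robotsUrl);
  if (!body) {
    return true; // empty robots.txt == allow all
  }
  const robots = robotsParser(robotsUrl, body);
  // robots-parser falls back to the '*' group if the agent has no explicit rules
  return robots.isAllowed(pageUrl, agent) !== false;
}
```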
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- allow failing on content check from main behavior
- update to behaviors 0.9.6 to support 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if profile can not be extracted
- if --saveProfile is specified, attempt to save profile to same target
as --profile
- if --saveProfile <target>, save to target
- save profile on finalExit if browser has launched
- supports local file paths and storage-relative paths with '@' (same as
--profile)
- also clear cache in first worker to match regular profile creation
fixes #898
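An illustrative sketch of how the --saveProfile target might be resolved under these rules; the option shapes and helper name are hypothetical, not the crawler's actual API:

```ts
type SaveTarget =
  | { kind: "local"; path: string }
  | { kind: "storage"; relPath: string };

export function resolveSaveProfileTarget(
  saveProfile: string | boolean | undefined,
  profile: string | undefined,
): SaveTarget | null {
  // --saveProfile with no value: reuse the --profile target, if any
  const target =
    typeof saveProfile === "string" && saveProfile
      ? saveProfile
      : saveProfile
        ? profile
        : undefined;

  if (!target) {
    return null;
  }
  // paths starting with '@' are storage-relative, same convention as --profile
  if (target.startsWith("@")) {
    return { kind: "storage", relPath: target.slice(1) };
  }
  return { kind: "local", path: target };
}
```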
- log when profile download starts
- ensure there is a timeout for the profile download attempt (60 secs)
- retry 2 more times if the initial profile download times out
- fail the crawl after 3 failed attempts, if the profile cannot be downloaded
successfully
bump to 1.8.2
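A sketch of the retry/timeout flow above; downloadProfile and fatal are simplified stand-ins for the crawler's own helpers, and the wiring is illustrative only:

```ts
import fs from "node:fs/promises";

const PROFILE_DOWNLOAD_TIMEOUT_SECS = 60;
const MAX_ATTEMPTS = 3;

// stand-in for the crawler's actual profile download
async function downloadProfile(url: string, destPath: string): Promise<void> {
  const resp = await fetch(url);
  if (!resp.ok) {
    throw new Error(`bad status: ${resp.status}`);
  }
  await fs.writeFile(destPath, Buffer.from(await resp.arrayBuffer()));
}

// stand-in for the crawler's fatal(): log and exit with a failure code
function fatal(msg: string): never {
  console.error(msg);
  process.exit(1);
}

function withTimeout<T>(p: Promise<T>, secs: number): Promise<T> {
  let timer: NodeJS.Timeout;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), secs * 1000);
  });
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer!));
}

export async function downloadProfileWithRetries(url: string, destPath: string) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    console.log(`profile download starting (attempt ${attempt}/${MAX_ATTEMPTS})`);
    try {
      await withTimeout(downloadProfile(url, destPath), PROFILE_DOWNLOAD_TIMEOUT_SECS);
      return;
    } catch (e) {
      console.warn(`profile download failed: ${e}`);
    }
  }
  // after all attempts have failed, fail the crawl
  fatal("profile could not be downloaded");
}
```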
- check for URLs that are wrapped in quotes, e.g. 'https://example.com/'
or "https://example.com/", and trim and remove the quotes before adding the seed
- tests: add quoted URL to tests, fix old.webrecorder.net test
- deps: update wabac.js, RWP to latest
- logging: reduce error logging for seed lists, only log once that there are duplicates or page limit is reached
- fix for #882
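A minimal sketch of the quote-trimming step; the helper name is illustrative:

```ts
export function trimSeedUrl(seed: string): string {
  let url = seed.trim();
  // handle seeds like 'https://example.com/' or "https://example.com/"
  if (
    (url.startsWith('"') && url.endsWith('"')) ||
    (url.startsWith("'") && url.endsWith("'"))
  ) {
    url = url.slice(1, -1).trim();
  }
  return url;
}

// trimSeedUrl(`'https://example.com/'`) === "https://example.com/"
```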
- Adds support for a YAML-based config for multiple proxies, containing a
'matchHosts' section (regexes) and a 'proxies' declaration, allowing
any number of hosts to be matched to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching and running browser with the specified proxy PAC script
served by an internal http server.
- Also supports matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config
Fixes #836
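A sketch of what such a config and the generated PAC script might look like; the field names mirror the 'matchHosts' and 'proxies' sections described above, but the exact schema and PAC details are assumptions:

```ts
import yaml from "js-yaml";

// example config in the matchHosts/proxies shape described above
const configYaml = `
proxies:
  eu-proxy: "socks5://user:pass@proxy1.example.com:9000"
  us-proxy: "http://proxy2.example.com:3128"

matchHosts:
  ".*[.]example[.]org$": eu-proxy
  ".*[.]example[.]net$": us-proxy
`;

type ProxyConfig = {
  proxies: Record<string, string>;
  matchHosts: Record<string, string>;
};

// build a PAC script that regex-matches the host against each matchHosts entry
function generatePAC(config: ProxyConfig): string {
  const rules = Object.entries(config.matchHosts)
    .map(([regex, name]) => {
      const proxyUrl = new URL(config.proxies[name]);
      const scheme = proxyUrl.protocol === "socks5:" ? "SOCKS5" : "PROXY";
      return `  if (/${regex}/.test(host)) { return "${scheme} ${proxyUrl.host}"; }`;
    })
    .join("\n");

  return `function FindProxyForURL(url, host) {\n${rules}\n  return "DIRECT";\n}`;
}

console.log(generatePAC(yaml.load(configYaml) as ProxyConfig));
```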
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- will ensure seeds from a URL list are reported as errors if skipped
- also set logging context to 'scope' instead of 'links'
- fixes #866
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- add --failOnContentCheck to quickly fail the crawl if a content check in a
behavior fails
- expose __bx_contentCheckFailed to cause an immediate failure from a
behavior
- only allow failing crawl due to content check from within
awaitPageLoad() callback
- set a 'failReason' key to track that the crawl failed due to a particular
content check reason
- deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
(2.23.6)
- fixes #860
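A rough sketch of the crawler-side wiring: exposing __bx_contentCheckFailed to the page and recording a fail reason (Page is puppeteer-core's type; the state tracking here is a stand-in for the crawler's Redis state):

```ts
import { Page } from "puppeteer-core";

// stand-ins for the crawler's persistent state (kept in Redis in practice)
export let failReason: string | null = null;
export let crawlFailed = false;

export async function exposeContentCheck(page: Page, failOnContentCheck: boolean) {
  await page.exposeFunction("__bx_contentCheckFailed", (reason: string) => {
    // behaviors call this from awaitPageLoad() when a check (eg. captcha_found) fails
    if (failOnContentCheck) {
      failReason = reason;
      crawlFailed = true;
    }
  });
}
```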
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- validate --lang values, fail immediately on an invalid ISO 639-1
language code
- ignore --lang value when using profile, print warning that profile
language takes precedence
- fixes #833
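A minimal sketch of the validation, here using the iso-639-1 npm package to check the code; the crawler may validate against its own list:

```ts
import ISO6391 from "iso-639-1";

export function validateLang(lang: string | undefined, usingProfile: boolean): string | undefined {
  if (!lang) {
    return undefined;
  }
  if (!ISO6391.validate(lang.toLowerCase())) {
    // fail immediately on an invalid code
    throw new Error(`Invalid ISO 639-1 language code: ${lang}`);
  }
  if (usingProfile) {
    console.warn("--lang ignored: the profile's language takes precedence");
    return undefined;
  }
  return lang.toLowerCase();
}
```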
- update wabac.js to 2.22.16, RWP to 2.3.7
- fidelity: fixes capture of Facebook and Instagram (via wabac.js 2.22.16)
- policy: disable tg popups
- bump version to 1.6.1!
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
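A simplified sketch of the .json detection for --customBehaviors entries; the dispatch shape and function name are illustrative, not the actual loader:

```ts
import fs from "node:fs/promises";

export async function loadCustomBehavior(path: string) {
  if (path.endsWith(".json")) {
    // flow behavior: a JSON specification of steps, run by the behaviors runtime
    return { type: "flow", spec: JSON.parse(await fs.readFile(path, "utf-8")) };
  }
  // regular custom behavior: JavaScript source injected into the page
  return { type: "script", source: await fs.readFile(path, "utf-8") };
}
```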
Follow-up to #368
This makes download locations consistent between custom behaviors
downloaded from URLs and those downloaded from Git repos, and resolves a
container security issue in Browsertrix.
Fixes #804
- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug logs from the behavior / behaviorScript contexts and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.
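A sketch of the filtering when --logBehaviorsToRedis is set (ioredis; the Redis list key and the function shape are assumptions):

```ts
import Redis from "ioredis";

const redis = new Redis();

type BehaviorLogContext = "behavior" | "behaviorScript" | "behaviorScriptCustom";

export async function logBehaviorMessage(
  context: BehaviorLogContext,
  level: "debug" | "info" | "warn" | "error",
  message: string,
  details: Record<string, unknown> = {},
) {
  const entry = JSON.stringify({ timestamp: new Date().toISOString(), context, level, message, details });
  console.log(entry);

  // all site-specific (behaviorScriptCustom) logs go to Redis;
  // other behavior contexts only when not debug-level
  if (context === "behaviorScriptCustom" || level !== "debug") {
    await redis.rpush("behaviorLogs", entry);
  }
}
```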
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- browsertrix-behaviors 0.8.1 for improved logging / new behavior
functions
- wabac.js 2.22.9
- RWP 2.3.4 for QA
- update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js
- extractLinks() now handled via browsertrix-behaviors
- fixes #770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions to enum list
- if <uid>:nextWacz filename already exists, actually get it and use
that!
- don't merge cdx if not generating wacz yet, use same condition for both
bump version to 1.5.8
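A sketch of the <uid>:nextWacz reuse described above (ioredis; the generated filename format is just an example):

```ts
import Redis from "ioredis";

const redis = new Redis();

export async function getNextWaczFilename(uid: string): Promise<string> {
  const key = `${uid}:nextWacz`;
  // if a filename was already reserved for this worker, reuse it
  const existing = await redis.get(key);
  if (existing) {
    return existing;
  }
  const filename = `crawl-${Date.now()}-${uid}.wacz`;
  await redis.set(key, filename);
  return filename;
}
```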
- fix follow-up to #748, fix #747
- undo accidentally setting window timeout to 20000 seconds instead of
20 for debugging!
- follow up to #781
- bump to 1.5.6.1
- should hopefully fix crawls stuck in this way.
- only attempt to close the browser if the browser has not crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
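A sketch of the guarded shutdown: skip close() entirely after a crash, and bound close() with a timeout so shutdown cannot hang (Browser is puppeteer-core's type; the timeout value here is an assumption):

```ts
import { Browser } from "puppeteer-core";

export async function closeBrowser(browser: Browser, browserCrashed: boolean) {
  if (browserCrashed) {
    // a crashed browser cannot be closed cleanly; the healthchecker reports failure instead
    return;
  }
  // don't let browser.close() hang shutdown indefinitely
  const timeout = new Promise<void>((resolve) => setTimeout(resolve, 5000));
  await Promise.race([browser.close(), timeout]);
}
```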
logging (#752): ensure failed count is included in totals
fatal rework: remove fatal() when failing to open a new window, throw instead to ensure the crawl is properly interrupted.
bump to 1.5.2