Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	549d655173	Support option to fail crawl on content check (#861 ) - add --failOnContentCheck for quick fail if content check in behavior fails - expose __bx_contentCheckFailed to cause an immediate failure from behavior - only allow failing crawl due to content check from within awaitPageLoad() callback - set a 'failReason' key to track that crawl failed due to a particular content check reason - deps: update to browsertrix-behaviors 0.9.0, update to wabac.js (2.23.6) - fixes #860 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-07-08 13:08:52 -07:00
Ilya Kreymer	6244515818	async fetch: allow retrying async fetch if interrupted (#863 ) - retry if 'truncated' set, or if size mismatch, or other exception occurs - retry only for network load and async fetch, not for response fetch - set max retries to 2 (same as default for pages currently) - fixes #831	2025-07-08 10:02:09 -07:00
Ilya Kreymer	c84f58f539	Use consistent profile directory name (merge 1.6.4 change) (#859 ) - Use `TMPDIR/btrixProfile` as consistent profile directory name - Avoid accumulation of temp profile dirs if crawler is restarted multiple times, eg. if tmp dir is mapped to /crawls (as is in Browsertrix now), this prevents a proliferation of /crawls/tmp/profile-* dirs for each crawler restart - change released in 1.6.4, merging into main	2025-07-03 19:49:05 -07:00
Tessa Walsh	2af94ffab5	Support downloading seed file from URL (#852 ) Fixes #841 Crawler work toward long URL lists in Browsertrix. This PR moves seed handling from the arg parser's validation step to the crawler's bootstrap step in order to be able to async fetch the seed file from a URL. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-07-03 10:49:37 -04:00
Ilya Kreymer	687f08b1d0	Add option to save local/sessionStorage (#856 ) If --saveStorage is set, localStorage and sessionStorage will be serialized with the WARC record for the page. If a page redirects, track what the current page URL is and save storage as part of the page's WARC record. Fixes #855	2025-06-30 19:58:19 -07:00
Ilya Kreymer	eb374fa835	base: bump to brave 1.80.113 (#857 ) version: bump to 1.7.0-beta.0 tests: update deprecated command to work with latest minio	2025-06-30 19:55:38 -07:00
Ilya Kreymer	d2a6aa9805	version: bump to 1.6.3 (#851 ) cli: regen cli docs to update from #850	2025-06-16 15:55:05 -04:00
Rijnder Wever	fa26f05f66	cleanup: remove dead pywb code from argparser and docs (#847 ) The value of `--dedupPolicy` was once passed to pywb (see https://pywb.readthedocs.io/en/latest/manual/configuring.html#dedup-options-for-recording). Now that pywb has been dropped, there is no need to keep this option around. In fact, I know multiple users that have been confused by the mention of this option in the docs (myself included). (for historical context, see https://github.com/webrecorder/browsertrix-crawler/pull/332)	2025-06-16 12:36:32 -04:00
Tessa Walsh	e09d10c582	Disable disk utilization check by default (#850 ) Related to https://github.com/webrecorder/browsertrix-crawler/issues/848 Several users have had issues with disk utilization checks, including the values reported by `df` inside the crawler container having unexpected results for mounted volumes. The commonly recommended solution to this is to use `docker system df`, but that is of course not available within the Docker container itself. This PR changes disk utilization checks to be an opt-in feature by setting the default value to `0` (disabled).	2025-06-16 12:36:15 -04:00
Ilya Kreymer	da953b670b	content-type compare for rewriting: use case-insensitive check (#849 ) update to wabac.js 2.23.3 for HLS rewriting fixes part of capture fix for webrecorder/replayweb.page#433	2025-06-16 11:09:44 -04:00
Ilya Kreymer	a5936b56aa	deps: bump brave 1.79.118 (#845 ) bump version to 1.6.2	2025-06-03 12:52:07 -07:00
Ilya Kreymer	178b10a37f	remove early serialization which may result in missing WARC-Protocol and security metadata (#844 ) - drop early serialization in handleFetchResponse(), can result in writing WARC record too early, before the WARC-Protocol and other data is available. (Added previously for requests loaded via browser context / service worker which did not get a 'loadingFinished' message, but now these will still be closed in awaitPageResources()) - don't log 'skipping URL from unknown frame' warning since it is often spurious, since frame can be added in subsequent message and response is not skipped.	2025-05-29 08:33:30 -07:00
Ilya Kreymer	7bf10f7f18	optimization: normalize dedup status: treat 0 (response code not yet known) or 206 as 200… (#835 ) Avoids fetching duplicate content when fetched through different code path (eg. autoplay behavior calling fetch, vs video playing automatically)	2025-05-28 15:46:40 -07:00
Tessa Walsh	46a02d12a3	Remove hardcoded /tmp prefix from path (#843 ) Fast-follow to #842 to fix a typo	2025-05-28 15:46:19 -07:00
Ilya Kreymer	52235ab21e	tmpdir: use os.tmpdir() instead of hardcoded '/tmp' (#842 ) allows for customizing tmp directory with TMPDIR env var	2025-05-28 12:48:06 -07:00
Ilya Kreymer	e72b34318d	Add WARC-Protocol header (#715 ) - add WARC-Protocol repeated header(s) for HTTP, TLS as per iipc/warc-specifications#42 - also set HTTP/1.0 on WARC record if actually http/1.0, otherwise keep HTTP/1.1 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-05-19 18:59:52 -07:00
Ilya Kreymer	71de8d6582	lang code fixes: (#834 ) - validate --lang values, fail immediately with invalid iso-639-1 language code - ignore --lang value when using profile, print warning that profile language takes precedence - fixes #833	2025-05-12 16:06:29 -07:00
Ilya Kreymer	e39d5a31eb	support pause interrupt: (#825 ) - add new interrupt reason / exit code - add isCrawlPaused() which checks redis <id>:paused key - exit gracefully, upload WACZ file when paused fixes #824	2025-05-05 10:10:08 -07:00
Ilya Kreymer	f9bd534e4c	more dependency updates: (#827 ) - update wabac.js to 2.22.16, RWP to 2.3.7 - fidelity: fixes capture of fb and insta (via wabac.js 2.22.16) - policy: disable tg popups - bump version to 1.6.1!	2025-05-05 10:08:59 -07:00
Ilya Kreymer	fc59d04231	Deps update 1.6.1 (#826 )	2025-05-02 00:43:37 -07:00
Ilya Kreymer	d47812d139	Config Policy Update (#822 ) Fixes webrecorder/replayweb.page#416 Update enterprise policy to: - Disable Spellcheck, which should include downloading spellcheck dictionary, possibly issue raised in #817 - Disable automatic http->https redirects, which insert an extra 307 response, as raised in: webrecorder/replayweb.page#416	2025-05-01 23:01:24 -07:00
Ilya Kreymer	13e9648398	state: add trimqueue() redis command to trim queue / seen list (#821 ) useful to support dynamically lowering pageLimit when restarting a crawl fixes issue raised in webrecorder/browsertrix#2514	2025-04-29 18:18:04 -07:00
Ilya Kreymer	1cb1b2edb9	Update Behaviors Docs (#820 ) Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-04-10 03:58:07 -04:00
Ilya Kreymer	f2dac05577	regression fix: start redis if needed before attempting to init state! (#819 ) bump to 1.6.0-beta.1	2025-04-09 21:37:46 +02:00
Ilya Kreymer	c796996664	Support for behaviors from 'recorder flow' JSON created in devtools (#818 ) New Feature: - support 'flow behavior' from JSON specification - detect .json files via --customBehaviors - log behavior progress while running - logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for custom behaviors - differentiate logging for iframes, move more behavior messages to debug - move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors - docs to be added in separate follow-up PR	2025-04-09 12:24:29 +02:00
Tessa Walsh	2961d3b9f2	Write behaviors downloaded from URL to tempdir (#816 ) Follow-up to #368 This makes download locations consistent between custom behaviors downloaded from URLs and those downloaded from Git repos, and resolves a container security issue in Browsertrix.	2025-04-04 11:23:29 -04:00
Ilya Kreymer	28241c824e	ci: fixes to deploy ci workflow	2025-04-03 23:36:49 -07:00
Ilya Kreymer	7421404aee	ci: add workflow to deploy to dev channels (requires actions secrets config) (#815 ) - uses DEPLOY_REGISTRY, DEPLOY_REGISTRY_PATH, DEPLOY_REGISTRY_API_TOKEN secrets	2025-04-03 23:21:48 -07:00
Ilya Kreymer	66c71d03c8	deps: bump base browser image to 1.77.95 (#814 )	2025-04-03 17:25:29 -07:00
Ilya Kreymer	ba4c432ce8	browser crash handling, follow-up to #808 : (#813 ) - if not restartOnError, attempt to kill browser and try again, 3 more times - if still unable to open window, mark browser as crashed and exit	2025-04-03 16:10:54 -07:00
Tessa Walsh	f83d0e8f02	Add option to push behavior + behavior script logs to Redis (#805 ) Fixes #804 - Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3) - Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs - Noisy logs from built-in behaviors like autoscroll are now logged to debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92 and so won't be pushed to Redis for newer versions of the crawler. - Updates browsertrix-behaviors to 0.8.3 and makes some changes to log format in tests accordingly. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-03 15:46:10 -07:00
aponb	6898bcf7ae	useSHA1 Parameter for generating SHA1 record hashes (#532 ) (#812 ) By using the useSHA1 flag, the payload digest in records will use SHA-1 with Base32 encoding instead of the default SHA-256 Co-authored-by: Andreas Predikaka <andreas.predikaka@onb.ac.at>	2025-04-02 17:10:50 -07:00
Ilya Kreymer	bf6fbe8776	Remove extra console.log statements (#811 ) - remove one added in screencaster - also remove others that are outside logging system - bump to 1.5.10	2025-04-02 09:25:11 -07:00
Ilya Kreymer	91f8fadc5f	deps update: update webrecorder dependencies (#810 ) - browsertrix-behaviors 0.8.1 for improved logging / new behavior functions - wabac.js 2.22.9 - RWP 2.3.4 for QA - update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js	2025-04-01 22:11:56 -07:00
Ilya Kreymer	fd41b32100	saved state tweaks: (#809 ) - if saved state filename is somehow duplicated, don't readd to array to avoid deletion (fixes edge case in #791) - also avoid double interpolation of filename	2025-04-01 18:59:04 -07:00
Emma Segal-Grossman	41b968baac	Dynamically adjust reported aspect ratio based on GEOMETRY (#794 ) Closes #793 Related to #733 Adjusts the reported aspect ratio based on GEOMETRY env var. Also adjusts stylesheet in screencast HTML to match. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-01 18:26:12 -07:00
Tessa Walsh	2b00c1f065	Tweaks for custom behavior loading (#807 ) Follow-up to #712 Fixes a few things I noticed while testing out https://github.com/webrecorder/browsertrix/pull/2520 - Ignore `.git` directory of git repositories when recursively walking cloned git repo to collect custom behaviors - Increase MAX_DEPTH for collecting behaviors to 5 (previous limit of 2 was overly restrictive for Git repositories) - Log name of custom behavior scripts (filename or URLs) added as info messages in `behavior` context --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-01 18:15:57 -07:00
Ilya Kreymer	2b56455e8b	stuck page handling: when attempting to restart browser, add more retries (#808 ) fixes issue mentioned in: https://github.com/webrecorder/browsertrix-crawler/issues/791#issuecomment-2734342186	2025-04-01 16:56:01 -07:00
Ilya Kreymer	e585b6d194	Better default crawlId (#806 ) - set crawl id from collection, not other way around, to ensure unique redis keyspace for different collections - by default, set crawl id to unique value based on host and collection, eg. '@hostname-@id' - don't include '@id' in collection interpolation, can only use hostname or timestamp - fixes issue mentioned / workaround provided in #784 - ci: add docker login + caching to work around rate limits - tests: fix sitemap tests	2025-04-01 13:40:03 -07:00
Tessa Walsh	5fedde6eee	Fail crawl with fatal message if custom behavior isn't loaded (#799 ) Fixes #797 The crawler will now exit with a fatal log message and exit code 17 if: - A Git repository specified with `--customBehaviors` cannot be cloned successfully (new) - A custom behavior file at a URL specified with `--customBehaviors` is not fetched successfully (new) - No custom behaviors are collected at a local filepath specified with `--customBehaviors`, or if an error is thrown while attempting to collect files from a nonexistent path (new) - Any custom behaviors collected fail `Browser.checkScript` validation (existing behavior) Tests have also been added accordingly.	2025-03-31 17:35:30 -07:00
Ilya Kreymer	e751929a7a	Move extractLinks to behaviors + Update to browsertrix-behaviors 0.8.0 (#803 ) - extractLinks() now handled via browsertrix-behaviors - fixes #770 via browsertrix-behaviors, checks for toJSON overrides - organize exposed functions to enum list	2025-03-31 12:02:25 -07:00
benoit74	02c4353b4a	Add clarification in usage about hostname used (#771 ) clarify that the crawlId defaults to the Docker container hostname	2025-03-30 21:16:58 -07:00
Tessa Walsh	8f581a587c	Validate Autoclick selector, fail crawl if invalid (#800 ) Fixes #798 Also modifies the existing test for link selector validation to check 17 status code on exit when link selectors fail validation. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-03-30 13:48:41 -07:00
Ilya Kreymer	47d61a6baf	version: bump to 1.5.9	2025-03-28 13:41:53 -07:00
Ilya Kreymer	8c96a10f67	deps: update to warcio.js 2.4.4, fixes #796 (#802 )	2025-03-28 13:38:15 -07:00
Ilya Kreymer	323b654c54	tests: update qa test to use awp site	2025-03-21 13:06:53 -07:00
Henry Wilkinson	34a1e3d6c0	docs: Update header font (#785 ) Updated alongside https://github.com/webrecorder/replayweb.page/pull/405 Long overdue match to Browsertrix docs styling	2025-03-05 14:21:00 -08:00
Ilya Kreymer	9a7ac9bef1	Fix using cached WACZ filename if already set ahead of time. (#783 ) - if <uid>:nextWacz filename already exists, actually get it and use that! - don't merge cdx if not generating wacz yet, use same condition for both bump version to 1.5.8 - fix follow-up to #748, fix #747	2025-02-28 17:58:56 -08:00
Ilya Kreymer	2aec2e1a33	reset back to latest image, 1.77.52 bump version to 1.5.7	2025-02-27 16:06:43 -08:00
Ilya Kreymer	0e7391b668	follow-up to #781 : (#782 ) - undo accidentally setting window timeout to 20000 seconds instead of 20 for debugging! - follow up to #781 - bump to 1.5.6.1 - should hopefully fix crawls stuck in this way..	2025-02-27 16:02:33 -08:00

536 commits