Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	1cb1b2edb9	Update Behaviors Docs (#820 ) Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-04-10 03:58:07 -04:00
Ilya Kreymer	f2dac05577	regression fix: start redis if needed before attempting to init state! (#819 ) bump to 1.6.0-beta.1	2025-04-09 21:37:46 +02:00
Ilya Kreymer	c796996664	Support for behaviors from 'recorder flow' JSON created in devtools (#818 ) New Feature: - support 'flow behavior' from JSON specification - detect .json files via --customBehaviors - log behavior progress while running - logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for custom behaviors - differentiate logging for iframes, move more behavior messages to debug - move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors - docs to be added in separate follow-up PR	2025-04-09 12:24:29 +02:00
Tessa Walsh	2961d3b9f2	Write behaviors downloaded from URL to tempdir (#816 ) Follow-up to #368 This makes download locations consistent between custom behaviors downloaded from URLs and those downloaded from Git repos, and resolves a container security issue in Browsertrix.	2025-04-04 11:23:29 -04:00
Ilya Kreymer	28241c824e	ci: fixes to deploy ci workflow	2025-04-03 23:36:49 -07:00
Ilya Kreymer	7421404aee	ci: add workflow to deploy to dev channels (requires actions secrets config) (#815 ) - uses DEPLOY_REGISTRY, DEPLOY_REGISTRY_PATH, DEPLOY_REGISTRY_API_TOKEN secrets	2025-04-03 23:21:48 -07:00
Ilya Kreymer	66c71d03c8	deps: bump base browser image to 1.77.95 (#814 )	2025-04-03 17:25:29 -07:00
Ilya Kreymer	ba4c432ce8	browser crash handling, follow-up to #808 : (#813 ) - if not restartOnError, attempt to kill browser and try again, 3 more times - if still unable to open window, mark browser as crashed and exit	2025-04-03 16:10:54 -07:00
Tessa Walsh	f83d0e8f02	Add option to push behavior + behavior script logs to Redis (#805 ) Fixes #804 - Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3) - Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs - Noisy logs from built-in behaviors like autoscroll are now logged to debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92 and so won't be pushed to Redis for newer versions of the crawler. - Updates browsertrix-behaviors to 0.8.3 and makes some changes to log format in tests accordingly. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-03 15:46:10 -07:00
aponb	6898bcf7ae	useSHA1 Parameter for generating SHA1 record hashes (#532 ) (#812 ) By using the useSHA1 flag, the payload digest in records will use SHA-1 with Base32 encoding instead of the default SHA-256 Co-authored-by: Andreas Predikaka <andreas.predikaka@onb.ac.at>	2025-04-02 17:10:50 -07:00
Ilya Kreymer	bf6fbe8776	Remove extra console.log statements (#811 ) - remove one added in screencaster - also remove others that are outside logging system - bump to 1.5.10	2025-04-02 09:25:11 -07:00
Ilya Kreymer	91f8fadc5f	deps update: update webrecorder dependencies (#810 ) - browsertrix-behaviors 0.8.1 for improved logging / new behavior functions - wabac.js 2.22.9 - RWP 2.3.4 for QA - update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js	2025-04-01 22:11:56 -07:00
Ilya Kreymer	fd41b32100	saved state tweaks: (#809 ) - if saved state filename is somehow duplicated, don't readd to array to avoid deletion (fixes edge case in #791) - also avoid double interpolation of filename	2025-04-01 18:59:04 -07:00
Emma Segal-Grossman	41b968baac	Dynamically adjust reported aspect ratio based on GEOMETRY (#794 ) Closes #793 Related to #733 Adjusts the reported aspect ratio based on GEOMETRY env var. Also adjusts stylesheet in screencast HTML to match. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-01 18:26:12 -07:00
Tessa Walsh	2b00c1f065	Tweaks for custom behavior loading (#807 ) Follow-up to #712 Fixes a few things I noticed while testing out https://github.com/webrecorder/browsertrix/pull/2520 - Ignore `.git` directory of git repositories when recursively walking cloned git repo to collect custom behaviors - Increase MAX_DEPTH for collecting behaviors to 5 (previous limit of 2 was overly restrictive for Git repositories) - Log name of custom behavior scripts (filename or URLs) added as info messages in `behavior` context --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-01 18:15:57 -07:00
Ilya Kreymer	2b56455e8b	stuck page handling: when attempting to restart browser, add more retries (#808 ) fixes issue mentioned in: https://github.com/webrecorder/browsertrix-crawler/issues/791#issuecomment-2734342186	2025-04-01 16:56:01 -07:00
Ilya Kreymer	e585b6d194	Better default crawlId (#806 ) - set crawl id from collection, not other way around, to ensure unique redis keyspace for different collections - by default, set crawl id to unique value based on host and collection, eg. '@hostname-@id' - don't include '@id' in collection interpolation, can only use hostname or timestamp - fixes issue mentioned / workaround provided in #784 - ci: add docker login + caching to work around rate limits - tests: fix sitemap tests	2025-04-01 13:40:03 -07:00
Tessa Walsh	5fedde6eee	Fail crawl with fatal message if custom behavior isn't loaded (#799 ) Fixes #797 The crawler will now exit with a fatal log message and exit code 17 if: - A Git repository specified with `--customBehavior` cannot be cloned successfully (new) - A custom behavior file at a URL specified with `--customBehavior` is not fetched successfully (new) - No custom behaviors are collected at a local filepath specified with `--customBehavior`, or if an error is thrown while attempting to collect files from a nonexistent path (new) - Any custom behaviors collected fail `Browser.checkScript` validation (existing behavior) Tests have also been added accordingly.	2025-03-31 17:35:30 -07:00
Ilya Kreymer	e751929a7a	Move extractLinks to behaviors + Update to browsertrix-behaviors 0.8.0 (#803 ) - extractLinks() now handled via browsertrix-behaviors - fixes #770 via browsertrix-behaviors, checks for toJSON overrides - organize exposed functions to enum list	2025-03-31 12:02:25 -07:00
benoit74	02c4353b4a	Add clarification in usage about hostname used (#771 ) clarify that the crawlId defaults to the Docker container hostname	2025-03-30 21:16:58 -07:00
Tessa Walsh	8f581a587c	Validate Autoclick selector, fail crawl if invalid (#800 ) Fixes #798 Also modifies the existing test for link selector validation to check 17 status code on exit when link selectors fail validation. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-03-30 13:48:41 -07:00
Ilya Kreymer	47d61a6baf	version: bump to 1.5.9	2025-03-28 13:41:53 -07:00
Ilya Kreymer	8c96a10f67	deps: update to warcio.js 2.4.4, fixes #796 (#802 )	2025-03-28 13:38:15 -07:00
Ilya Kreymer	323b654c54	tests: update qa test to use awp site	2025-03-21 13:06:53 -07:00
Henry Wilkinson	34a1e3d6c0	docs: Update header font (#785 ) Updated alongside https://github.com/webrecorder/replayweb.page/pull/405 Long overdue match to Browsertrix docs styling	2025-03-05 14:21:00 -08:00
Ilya Kreymer	9a7ac9bef1	Fix using cached WACZ filename if already set ahead of time. (#783 ) - if <uid>:nextWacz filename already exists, actually get it and use that! - don't merge cdx if not generating wacz yet, use same condition for both bump version to 1.5.8 - fix follow-up to #748, fix #747	2025-02-28 17:58:56 -08:00
Ilya Kreymer	2aec2e1a33	reset back to latest image, 1.77.52 bump version to 1.5.7	2025-02-27 16:06:43 -08:00
Ilya Kreymer	0e7391b668	follow-up to #781 : (#782 ) - undo accidentally setting window timeout to 20000 seconds instead of 20 for debugging! - follow up to #781 - bump to 1.5.6.1 - should hopefully fix crawls stuck in this way..	2025-02-27 16:02:33 -08:00
Ilya Kreymer	9b22df5c90	revert brave version: not ideal, but need to revert to chromium 132 u… (#781 ) …ntil we figure out various stalling issues that still persist in chromium >=133 bump to 1.5.6	2025-02-27 07:05:31 -08:00
Ilya Kreymer	6e42e056b1	version: bump to 1.5.5	2025-02-26 12:42:00 -08:00
Ilya Kreymer	24ca818356	further fix to stuck on getting new window: (#779 ) - set retries back to 3, was set high by mistake - if will restart, throw exception to restart crawler - otherwise, attempt to kill browser process that is stalled (appears to work in testing) - follow-up to #766	2025-02-26 12:32:05 -08:00
Tessa Walsh	e402ddc202	Strip credentials from proxy address in crawl logs (#778 ) Fixes https://github.com/webrecorder/security/issues/14	2025-02-26 15:23:38 -05:00
Ilya Kreymer	c25c6771a8	browser: update brave to 1.77.52 to get Chromium 134 (#773 ) should fix browser timing out on new window, fixes #766 bump to 1.5.4	2025-02-20 09:14:32 -08:00
Tessa Walsh	f16be32ba6	Make sure all exit calls use ExitCodes enum (#767 ) Quick follow-up to #584 to make sure enum is used everywhere in profile editing mode: - profile browser exits with ExitCodes.SignalInterrupted in response to signal - use ExitCodes.Success or GenericError for other exit codes	2025-02-11 12:04:38 -08:00
benoit74	4b72b7c7dc	Add documentation on exit codes (#765 ) Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-02-11 12:16:29 -05:00
benoit74	fc56c2cf76	Add more exit codes to detect interruption reason (#764 ) Fix #584 - Replace interrupted with interruptReason - Distinct exit codes for different interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), DiskUtilization (16) are used when an interrupt happens for these reasons, in addition to existing reasons BrowserCrashed (10), SignalInterrupted (11) and SignalInterruptedForce (13) - Doc fix to cli args --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-02-10 14:00:55 -08:00
Ilya Kreymer	846f0355f6	Improved handling of browser stuck / crashed (#763 ) - only attempt to close browser if not browser crashed - add timeout for browser.close() - ensure browser crash results in healthchecker failure - bump to 1.5.3	2025-02-10 10:16:25 -08:00
Ilya Kreymer	5807c320bf	remove fatal() on new window error + stats fix (#762 ) logging (#752): ensure failed included in totals fatal rework: remove fatal() when failing to open new window, throw instead to ensure crawl is properly interrupted. bump to 1.5.2	2025-02-09 15:26:36 -08:00
Ilya Kreymer	a5050a25d7	Readd health check on retry (#759 ) - health check failures should be incremented even if retrying, in case restart is needed - cleanup writePage() - bump default --maxPageRetries to 2 for better default for Browsertrix	2025-02-06 20:13:20 -08:00
Ilya Kreymer	00835fc4f2	Retry same queue (#757 ) - follow up to #743 - page retries are simply added back to the same queue with `retry` param incremented and a higher score, after extraHops, to ensure retries are added at the end. - score calculation is: `score = depth + (extraHops * MAX_DEPTH) + (retry * MAX_DEPTH * 2)`, this ensures that retries have lower priority than extraHops, and additional retries even lower priority (higher score). - warning is logged when a retry happens, error only when all retries are exhausted. - back to one failure list, urls added there only when all retries are exhausted. - rename --numRetries -> --maxRetries / --retries for clarity - state load: allow retrying previously failed URLs if --maxRetries is higher than on previous run. - ensure working with --failOnFailedStatus, if provided, invalid status codes (>= 400) are retried along with page load failures - fixes #132 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-02-06 18:48:40 -08:00
Ilya Kreymer	5c9d808651	exit code cleanup (#753 ) - use consistent enums for exit codes - add disk space check on startup and add OutOfSpace exit code (3) - preparation for #584	2025-02-06 17:54:51 -08:00
Ilya Kreymer	b435afeb4b	version: bump to 1.5.1	2025-02-06 11:40:31 -08:00
Ilya Kreymer	0ca27e4fa1	QA fix: ensure replay iframe has actually been updated after goto call! (#756 ) qa fix: check url of iframe, ensure it is not about:blank anymore test: add test to ensure expected diff deps: bump to brave 1.74.51, bump to puppeteer-core 24.2.0	2025-02-06 10:41:38 -08:00
Ilya Kreymer	2e46140c3f	Make numRetries configurable (#754 ) Add --numRetries param, default to 1 instead of 5.	2025-02-05 23:34:55 -08:00
Ilya Kreymer	f379da19be	version: bump to 1.5.0!	2025-01-31 21:57:18 -08:00
Ilya Kreymer	95a631188d	hang protection: wrap remaining evaluate() calls to avoid rare hangs (#750 ) wrap remaining frame.evaluate() and page.evaluate() calls that are not already within a timedRun() in their own timedRun() to avoid rare cases where they do not return (eg. if page crashes during the evaluate)	2025-01-30 17:39:20 -08:00
Ilya Kreymer	1da49258c4	version: bump to 1.5.0-beta.4	2025-01-30 14:32:30 -08:00
Ilya Kreymer	fe6199eebd	pages redis: include 'depth', 'seed' and 'favIconUrl' in page data added to redis (#749 ) follow-up to #747	2025-01-30 11:18:59 -08:00
Ilya Kreymer	457d07aea4	if uploading wacz files, compute waczfile name on load to be able to … (#748 ) …store filename along with page data: - set filename on crawler load, if not already set, otherwise use existing - store filename per crawler instance in <crawlid>:nextWacz - add 'filename' field to page when writing pages to redis - clear wacz filename when wacz is uploaded to set a new one - fixes #747 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-01-29 18:15:28 -08:00
Ilya Kreymer	a00866bbab	Apply exclusions to redirects (#745 ) - if redirected page is excluded, block loading of page - mark page as excluded, don't retry, and don't write to page list - support generic blocking of pages based on initial page response - fixes #744	2025-01-28 11:28:23 -08:00
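
The queue scoring formula quoted in commit 00835fc4f2 (#757) can be sketched as follows. This is only an illustration of the arithmetic in the commit message, not the crawler's actual code: the function name `queueScore` and the value chosen for `MAX_DEPTH` are assumptions, since the message names the constant but does not give its value.

```typescript
// MAX_DEPTH is a placeholder value chosen for illustration; the commit
// message does not state the constant the crawler actually uses.
const MAX_DEPTH = 1000000;

// Hypothetical helper: a lower score means the page is dequeued earlier.
// Formula taken from the commit message:
//   score = depth + (extraHops * MAX_DEPTH) + (retry * MAX_DEPTH * 2)
function queueScore(depth: number, extraHops = 0, retry = 0): number {
  return depth + extraHops * MAX_DEPTH + retry * MAX_DEPTH * 2;
}

const normalPage = queueScore(2); // regular in-scope page at depth 2
const extraHopPage = queueScore(2, 1); // same depth, reached via one extra hop
const retriedPage = queueScore(2, 0, 1); // same page on its first retry
const retriedTwice = queueScore(2, 0, 2); // same page on its second retry

// Each category sorts after the previous one.
console.log(
  normalPage < extraHopPage &&
    extraHopPage < retriedPage &&
    retriedPage < retriedTwice,
);
```

With depths well below `MAX_DEPTH`, a first retry scores higher than a page one extra hop away, and each further retry pushes the page further back, which matches the ordering the commit message describes.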


514 commits
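
The exit codes named in commits fc56c2cf76 (#764) and 5c9d808651 (#753) can be collected into a small lookup for scripts that wrap the crawler. This is a sketch built only from the names and numbers in those commit messages, not a copy of the crawler's own `ExitCodes` enum: `Success` and `GenericError` are omitted because their values are not stated in the log, as is code 17, for which no name is given. The helper `describeExit` is hypothetical.

```typescript
// Names and values as listed in the commit messages above; the crawler's
// real enum may contain further members.
enum ExitCodes {
  OutOfSpace = 3,
  BrowserCrashed = 10,
  SignalInterrupted = 11,
  FailedLimit = 12,
  SignalInterruptedForce = 13,
  SizeLimit = 14,
  TimeLimit = 15,
  DiskUtilization = 16,
}

// Hypothetical helper: map a process exit status to a readable reason.
// Numeric TypeScript enums provide a reverse mapping from value to name.
function describeExit(code: number): string {
  const name: string | undefined = ExitCodes[code];
  return name ?? `unlisted exit code ${code}`;
}

console.log(describeExit(14));
console.log(describeExit(42));
```

A wrapper script could use such a lookup to tell a crawl that stopped on a size or time limit apart from one that stopped because the browser crashed.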