- browsertrix-behaviors 0.8.1 for improved logging / new behavior
functions
- wabac.js 2.22.9
- RWP 2.3.4 for QA
- update ReplayServer to support suffix `range: -x` requests used in latest RWP/wabac.js (a parsing sketch follows below)
- if saved state filename is somehow duplicated, don't re-add it to the array, to avoid deletion (fixes edge case in #791)
- also avoid double interpolation of filename
Closes #793
Related to #733
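As a minimal sketch of the suffix range handling mentioned above (illustrative only, not the actual ReplayServer code): a header like `range: bytes=-500` asks for the last 500 bytes of a resource.

```ts
// Sketch: parse "bytes=start-end" and suffix "bytes=-x" ranges.
// Not the actual ReplayServer implementation.
function parseRange(header: string, totalLength: number): { start: number; end: number } | null {
  const m = header.match(/^bytes=(\d*)-(\d*)$/);
  if (!m) {
    return null;
  }
  const [, startStr, endStr] = m;
  if (startStr === "" && endStr !== "") {
    // suffix range: "-x" means the last x bytes
    const suffixLen = Math.min(parseInt(endStr, 10), totalLength);
    return { start: totalLength - suffixLen, end: totalLength - 1 };
  }
  const start = parseInt(startStr, 10);
  const end = endStr === "" ? totalLength - 1 : parseInt(endStr, 10);
  return Number.isFinite(start) && start <= end ? { start, end } : null;
}

// eg. parseRange("bytes=-500", 10000) -> { start: 9500, end: 9999 }
```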
Adjusts the reported aspect ratio based on GEOMETRY env var.
Also adjusts stylesheet in screencast HTML to match.
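A sketch of the idea, assuming GEOMETRY has the form "WIDTHxHEIGHT" (eg. "1280x720"); the fallback value here is an assumption, not the crawler's actual default.

```ts
// Sketch only: derive the reported aspect ratio from the GEOMETRY env var.
const geometry = process.env.GEOMETRY || "1360x1020"; // fallback is an assumption
const [width, height] = geometry.split("x").map((v) => parseInt(v, 10));
const aspectRatio = width / height;

// The screencast stylesheet can then use the same ratio, eg.
// `aspect-ratio: ${width} / ${height}` on the thumbnail container.
console.log(`screencast aspect ratio: ${aspectRatio.toFixed(3)}`);
```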
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Follow-up to #712
Fixes a few things I noticed while testing out
https://github.com/webrecorder/browsertrix/pull/2520
- Ignore `.git` directory of git repositories when recursively walking a cloned git repo to collect custom behaviors
- Increase MAX_DEPTH for collecting behaviors to 5 (the previous limit of 2 was overly restrictive for Git repositories); see the sketch after this list
- Log names of custom behavior scripts (filenames or URLs) as info messages in the `behavior` context
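A sketch of the collection walk under these constraints, assuming behaviors are plain `.js` files; the helper name is illustrative, not the crawler's actual code.

```ts
import fs from "node:fs";
import path from "node:path";

// Sketch: collect behavior files from a cloned repo, skipping ".git"
// and stopping past MAX_DEPTH (5).
const MAX_DEPTH = 5;

function collectBehaviorFiles(dir: string, depth = 0, found: string[] = []): string[] {
  if (depth > MAX_DEPTH) {
    return found;
  }
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    if (entry.isDirectory()) {
      if (entry.name === ".git") {
        continue; // ignore git internals when walking a cloned repository
      }
      collectBehaviorFiles(path.join(dir, entry.name), depth + 1, found);
    } else if (entry.name.endsWith(".js")) {
      found.push(path.join(dir, entry.name));
    }
  }
  return found;
}
```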
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- set crawl id from collection, not the other way around, to ensure a unique redis keyspace for different collections
- by default, set crawl id to a unique value based on host and collection, eg. '@hostname-@id' (see the sketch after this list)
- don't include '@id' in collection interpolation, which can only use hostname or timestamp
- fixes issue mentioned / workaround provided in #784
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
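A minimal sketch of the interpolation described above, assuming a timestamp placeholder spelled '@ts'; the helper name and exact placeholder handling are illustrative, not the crawler's implementation.

```ts
import os from "node:os";

// Sketch: collection interpolation may use hostname or timestamp, but not
// @id; the crawl id itself may additionally use @id (the collection).
function interpolate(template: string, allowId: boolean, collection = ""): string {
  const ts = new Date().toISOString().replace(/[^\d]/g, "").slice(0, 14);
  let result = template.replace("@hostname", os.hostname()).replace("@ts", ts);
  if (allowId) {
    result = result.replace("@id", collection);
  }
  return result;
}

const collection = interpolate("mycoll-@ts", false);

// crawl id defaults to a unique host + collection value, eg. '@hostname-@id'
const crawlId = interpolate("@hostname-@id", true, collection);
```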
Fixes #797
The crawler will now exit with a fatal log message and exit code 17 if:
- A Git repository specified with `--customBehavior` cannot be cloned
successfully (new)
- A custom behavior file at a URL specified with `--customBehavior` is
not fetched successfully (new)
- No custom behaviors are collected at a local filepath specified with
`--customBehavior`, or if an error is thrown while attempting to collect
files from a nonexistent path (new)
- Any custom behaviors collected fail `Browser.checkScript` validation
(existing behavior)
Tests have also been added accordingly.
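A Jest-style sketch of the kind of test added here, assuming the crawler can be invoked as a local node process (the real tests may run it differently): a nonexistent `--customBehavior` path should produce a fatal exit with code 17.

```ts
import { spawnSync } from "node:child_process";

// Sketch: the entrypoint and URL below are assumptions for illustration.
test("crawler exits with code 17 when custom behaviors cannot be collected", () => {
  const result = spawnSync("node", [
    "dist/main.js",
    "--url", "https://example.com/",
    "--customBehavior", "/path/does/not/exist",
  ]);
  // fatal exit with code 17, per the behavior described above
  expect(result.status).toBe(17);
});
```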
- extractLinks() now handled via browsertrix-behaviors
- fixes #770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions to enum list
Fixes #798
Also modifies the existing test for link selector validation to check for exit code 17 when link selectors fail validation.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- if <uid>:nextWacz filename already exists, actually get it and use
that!
- don't merge cdx if not generating wacz yet, use same condition for both
- bump version to 1.5.8
- fix follow-up to #748, fix #747
- undo accidentally setting window timeout to 20000 seconds instead of
20 for debugging!
- follow up to #781
- bump to 1.5.6.1
- should hopefully fix crawls stuck in this way
- set retries back to 3, was set high by mistake
- if will restart, throw exception to restart crawler
- otherwise, attempt to kill browser process that is stalled (appears to
work in testing)
- follow-up to #766
Quick follow-up to #584 to make sure the enum is used everywhere in profile editing mode:
- profile browser exits with ExitCodes.SignalInterrupted in response to signal
- use ExitCodes.Success or GenericError for other exit codes
Fix #584
- Replace interrupted with interruptReason
- Distinct exit codes for different interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), DiskUtilization (16)
are used when an interrupt happens for these reasons, in addition to existing reasons BrowserCrashed (10),
SignalInterrupted (11) and SignalInterruptedForce (13)
- Doc fix to cli args
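For reference, a sketch of an ExitCodes-style enum with the values listed above; the Success and GenericError values, and any members not named in this changelog, are assumptions.

```ts
// Sketch: exit codes as listed in this changelog; names/values not listed
// here are assumptions, not the crawler's actual enum.
enum ExitCodes {
  Success = 0, // assumed
  GenericError = 1, // assumed
  BrowserCrashed = 10,
  SignalInterrupted = 11,
  FailedLimit = 12,
  SignalInterruptedForce = 13,
  SizeLimit = 14,
  TimeLimit = 15,
  DiskUtilization = 16,
  // 17 is used elsewhere in this changelog for fatal errors, eg. failed
  // custom behavior collection or invalid link selectors
}
```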
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- only attempt to close browser if the browser has not crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
logging (#752): ensure failed pages are included in totals
fatal rework: remove fatal() when failing to open a new window, throw instead to ensure crawl is properly interrupted.
bump to 1.5.2
- health check failures should be incremented even if retrying, in case
restart is needed
- cleanup writePage()
- bump default --maxPageRetries to 2 for better default for Browsertrix
- follow up to #743
- page retries are simply added back to the same queue with `retry`
param incremented and a higher scope, after extraHops, to ensure retries
are added at the end.
- score calculation is: `score = depth + (extraHops * MAX_DEPTH) + (retry * MAX_DEPTH * 2)`, which ensures that retries have lower priority than extraHops, and additional retries even lower priority (higher score); see the sketch after this list.
- warning is logged when a retry happens, error only when all retries
are exhausted.
- back to one failure list, urls added there only when all retries are
exhausted.
- rename --numRetries -> --maxRetries / --retries for clarity
- state load: allow retrying previously failed URLs if --maxRetries is higher than on the previous run.
- ensure this works with --failOnFailedStatus: if provided, invalid status codes (>= 400) are retried along with page load failures
- fixes #132
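A small sketch of the score calculation above; the MAX_DEPTH constant here is a stand-in for the crawler's own value.

```ts
// Priority score for queued URLs as described above: retries sort after
// extraHops, and each additional retry sorts even later (higher score =
// lower priority). MAX_DEPTH value is an assumption for illustration.
const MAX_DEPTH = 1_000_000;

function queueScore(depth: number, extraHops: number, retry: number): number {
  return depth + extraHops * MAX_DEPTH + retry * MAX_DEPTH * 2;
}

// eg. a first retry of a depth-1 page scores higher (lower priority) than
// an extraHops page that has not been retried:
queueScore(1, 0, 1) > queueScore(1, 1, 0); // true
```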
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
qa fix: check url of iframe, ensure it is not about:blank anymore
test: add test to ensure expected diff
deps: bump to brave 1.74.51, bump to puppeteer-core 24.2.0
wrap remaining frame.evaluate() and page.evaluate() calls that are not
already within a timedRun() in their own timedRun() to avoid rare cases
where they do not return (eg. if page crashes during the evaluate)
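A minimal sketch of what a timedRun-style wrapper can look like (not the crawler's actual helper), racing the evaluate against a timeout so a crashed page cannot hang the worker:

```ts
// Sketch: resolve to undefined if the wrapped promise does not settle in time.
async function timedRun<T>(promise: Promise<T>, seconds: number, msg: string): Promise<T | undefined> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      console.warn(`timed out: ${msg}`);
      resolve(undefined);
    }, seconds * 1000);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// usage, assuming a puppeteer Page:
// const title = await timedRun(page.evaluate(() => document.title), 20, "evaluate title");
```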
…store filename along with page data:
- set filename on crawler load, if not already set, otherwise use
existing
- store filename per crawler instance in <crawlid>:nextWacz
- add 'filename' field to page when writing pages to redis
- clear wacz filename when wacz is uploaded to set a new one
- fixes #747
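A sketch of the `<crawlid>:nextWacz` handling described above, using an ioredis-style client; `generateWaczFilename()` is a hypothetical helper and anything beyond the key name shown above is assumed.

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379/0");

async function getNextWaczFilename(crawlId: string): Promise<string> {
  const key = `${crawlId}:nextWacz`;
  // set only if not already set, so every crawler instance reuses the
  // same filename; otherwise use the existing value
  await redis.setnx(key, generateWaczFilename());
  return (await redis.get(key))!;
}

async function clearWaczFilename(crawlId: string): Promise<void> {
  // after the WACZ is uploaded, clear the key so a new filename is set
  await redis.del(`${crawlId}:nextWacz`);
}

// hypothetical filename generator for the sketch
function generateWaczFilename(): string {
  return `crawl-${new Date().toISOString().replace(/[:.]/g, "-")}.wacz`;
}
```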
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- if redirected page is excluded, block loading of page
- mark page as excluded, don't retry, and don't write to page list
- support generic blocking of pages based on initial page response
- fixes #744
- retries: for failed pages, set retry to 5 in case multiple retries may be needed.
- redirect: if page url is /path/ -> /path, don't add as extra seed
- proxy: don't use global dispatcher, pass dispatcher explicitly when
using proxy, as proxy may interfere with local network requests
- final exit flag: if crawl is done and also interrupted, ensure WACZ is
still written/uploaded by setting final exit to true
- hashtag only change force reload: if loading page with same URL but
different hashtag, eg. `https://example.com/#B` after
`https://example.com/#A`, do a full reload
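A sketch of the hashtag-only reload check described in the last item above (not the crawler's exact logic), assuming a puppeteer Page:

```ts
import type { Page } from "puppeteer-core";

// Sketch: if the next URL differs from the current one only by its
// "#fragment", goto() may not re-run the page, so force a full reload.
function differsOnlyByHash(currentUrl: string, nextUrl: string): boolean {
  const a = new URL(currentUrl);
  const b = new URL(nextUrl);
  return a.href !== b.href && a.href.split("#")[0] === b.href.split("#")[0];
}

async function loadPage(page: Page, nextUrl: string) {
  const needsReload = differsOnlyByHash(page.url(), nextUrl);
  await page.goto(nextUrl);
  if (needsReload) {
    await page.reload(); // full reload, eg. #A -> #B on the same document
  }
}
```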
- Retry pages that are marked as failed once, at the end of the crawl,
in case it was due to a timeout
- Also, don't treat differences in hashtag between seed page loaded and
actual URL as a redirect (eg. don't add as new seed)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not enabled by default
- Adds support for new exposed function `__bx_addSet` which allows autoclick behavior to persist state about links that have already been clicked to avoid duplicates, only used if the link has an href (see the sketch after this list)
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler to reject onbeforeunload page navigations while behaviors are running (page not finished), but accept them when the page is finished, to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors are printed as warnings instead of causing a hard exit, for forward compatibility with new behavior types in the future
Fixes #728, also #216, #665, #31
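A minimal sketch of how an `__bx_addSet`-style function can be exposed with puppeteer's exposeFunction; the crawler persists this state per crawl so duplicates are avoided, while an in-memory Set keeps the sketch short.

```ts
import type { Page } from "puppeteer-core";

// Sketch: returns true only the first time a value is seen, so the
// autoclick behavior can skip link hrefs it has already clicked.
const clickedLinks = new Set<string>();

async function exposeAddSet(page: Page) {
  await page.exposeFunction("__bx_addSet", (value: string): boolean => {
    if (clickedLinks.has(value)) {
      return false;
    }
    clickedLinks.add(value);
    return true;
  });
}
```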
Chromium now interrupts fetch() if abort() is called or the page is navigated, so autofetch behavior using native fetch() is less than ideal. This PR adds support for the __bx_fetch() command for autofetch behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately from the browser's regular fetch().
- __bx_fetch() starts a fetch but does not return content to the browser; it doesn't need abort(), is unaffected by page navigation, and will still try to use the browser network stack when possible, making it more efficient for background fetching.
- if the network stack fetch fails, fall back to regular node fetch() in the crawler.
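A simplified sketch of the fallback shape described above; `fetchViaBrowserNetwork()` is a hypothetical stand-in for the browser-network-stack path, which is not shown here.

```ts
// Stand-in for the browser-side fetch path; in this sketch it just fails
// so the fallback runs.
async function fetchViaBrowserNetwork(url: string): Promise<void> {
  throw new Error(`browser-side fetch not available for ${url} in this sketch`);
}

async function backgroundFetch(url: string): Promise<void> {
  try {
    // prefer the browser's network stack when possible (more efficient,
    // and the response is never returned to the page)
    await fetchViaBrowserNetwork(url);
  } catch (e) {
    // if the browser-side fetch fails, fall back to node's own fetch()
    // in the crawler and consume the body so it is fully fetched
    const resp = await fetch(url);
    await resp.arrayBuffer();
  }
}
```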
Additional improvements for interrupted fetch:
- don't store truncated media responses, even for 200
- avoid doing duplicate async fetching if response already handled (eg.
fetch handled in multiple contexts)
- fixes #735, where an interrupted fetch resulted in an empty response