Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	f2dac05577	regression fix: start redis if needed before attempting to init state! (#819 ) bump to 1.6.0-beta.1	2025-04-09 21:37:46 +02:00
Ilya Kreymer	c796996664	Support for behaviors from 'recorder flow' JSON created in devtools (#818 ) New Feature: - support 'flow behavior' from JSON specification - detect .json files via --customBehaviors - log behavior progress while running - logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for custom behaviors - differentiate logging for iframes, move more behavior messages to debug - move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors - docs to be added in separate follow-up PR	2025-04-09 12:24:29 +02:00
Tessa Walsh	2961d3b9f2	Write behaviors downloaded from URL to tempdir (#816 ) Follow-up to #368 This makes download locations consistent between custom behaviors downloaded from URLs and those downloaded from Git repos, and resolves a container security issue in Browsertrix.	2025-04-04 11:23:29 -04:00
Tessa Walsh	f83d0e8f02	Add option to push behavior + behavior script logs to Redis (#805 ) Fixes #804 - Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3) - Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs - Noisy logs from built-in behaviors like autoscroll are now logged to debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92 and so won't be pushed to Redis for newer versions of the crawler. - Updates browsertrix-behaviors to 0.8.3 and makes some changes to log format in tests accordingly. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-03 15:46:10 -07:00
Ilya Kreymer	bf6fbe8776	Remove extra console.log statements (#811 ) - remove one added in screencaster - also remove others that are outside logging system - bump to 1.5.10	2025-04-02 09:25:11 -07:00
Ilya Kreymer	91f8fadc5f	deps update: update webrecorder dependencies (#810 ) - browsertrix-behaviors 0.8.1 for improved logging / new behavior functions - wabac.js 2.22.9 - RWP 2.3.4 for QA - update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js	2025-04-01 22:11:56 -07:00
Ilya Kreymer	e751929a7a	Move extractLinks to behaviors + Update to browsertrix-behaviors 0.8.0 (#803 ) - extractLinks() now handled via browsertrix-behaviors - fixes #770 via browsertrix-behaviors, checks for toJSON overrides - organize exposed functions to enum list	2025-03-31 12:02:25 -07:00
Ilya Kreymer	47d61a6baf	version: bump to 1.5.9	2025-03-28 13:41:53 -07:00
Ilya Kreymer	8c96a10f67	deps: update to warcio.js 2.4.4, fixes #796 (#802 )	2025-03-28 13:38:15 -07:00
Ilya Kreymer	9a7ac9bef1	Fix using cached WACZ filename if already set ahead of time. (#783 ) - if <uid>:nextWacz filename already exists, actually get it and use that! - don't merge cdx if not generating wacz yet, use same condition for both bump version to 1.5.8 - fix follow-up to #748, fix #747	2025-02-28 17:58:56 -08:00
Ilya Kreymer	2aec2e1a33	reset back to latest image, 1.77.52 bump version to 1.5.7	2025-02-27 16:06:43 -08:00
Ilya Kreymer	0e7391b668	follow-up to #781 : (#782 ) - undo accidentally setting window timeout to 20000 seconds instead of 20 for debugging! - follow up to #781 - bump to 1.5.6.1 - should hopefully fix crawls stuck in this way..	2025-02-27 16:02:33 -08:00
Ilya Kreymer	9b22df5c90	revert brave version: not ideal, but need to revert to chromium 132 until we figure out various stalling issues that still persist in chromium >=133 (#781 ) bump to 1.5.6	2025-02-27 07:05:31 -08:00
Ilya Kreymer	6e42e056b1	version: bump to 1.5.5	2025-02-26 12:42:00 -08:00
Ilya Kreymer	c25c6771a8	browser: update brave to 1.77.52 to get Chromium 134 (#773 ) should fix browser timing out on new window, fixes #766 bump to 1.5.4	2025-02-20 09:14:32 -08:00
Ilya Kreymer	846f0355f6	Improved handling of browser stuck / crashed (#763 ) - only attempt to close browser if browser has not crashed - add timeout for browser.close() - ensure browser crash results in healthchecker failure - bump to 1.5.3	2025-02-10 10:16:25 -08:00
Ilya Kreymer	5807c320bf	remove fatal() on new window error + stats fix (#762 ) logging (#752): ensure failed included in totals fatal rework: remove fatal() when failing to open new window, throw instead to ensure crawl is properly interrupted. bump to 1.5.2	2025-02-09 15:26:36 -08:00
Ilya Kreymer	b435afeb4b	version: bump to 1.5.1	2025-02-06 11:40:31 -08:00
Ilya Kreymer	0ca27e4fa1	QA fix: ensure replay iframe has actually been updated after goto call! (#756 ) qa fix: check url of iframe, ensure it is not about:blank anymore test: add test to ensure expected diff deps: bump to brave 1.74.51, bump to puppeteer-core 24.2.0	2025-02-06 10:41:38 -08:00
Ilya Kreymer	f379da19be	version: bump to 1.5.0!	2025-01-31 21:57:18 -08:00
Ilya Kreymer	1da49258c4	version: bump to 1.5.0-beta.4	2025-01-30 14:32:30 -08:00
Ilya Kreymer	f7cbf9645b	Retry support and additional fixes (#743 ) - retries: for failed pages, set retry to 5 in cases multiple retries may be needed. - redirect: if page url is /path/ -> /path, don't add as extra seed - proxy: don't use global dispatcher, pass dispatcher explicitly when using proxy, as proxy may interfere with local network requests - final exit flag: if crawl is done and also interrupted, ensure WACZ is still written/uploaded by setting final exit to true - hashtag only change force reload: if loading page with same URL but different hashtag, eg. `https://example.com/#B` after `https://example.com/#A`, do a full reload	2025-01-25 22:55:49 -08:00
Ilya Kreymer	b7150f1343	Autoclick Support (#729 ) Adds support for autoclick behavior: - Adds new `autoclick` behavior option to `--behaviors`, not enabled by default - Adds support for new exposed function `__bx_addSet` which allows autoclick behavior to persist state about links that have already been clicked to avoid duplicates, only used if link has an href - Adds a new pageFinished flag on the worker state. - Adds an on('dialog') handler to reject onbeforeunload page navigations, when in behavior (page not finished), but accept when page is finished - to allow navigation away only when behaviors are done - Update to browsertrix-behaviors 0.7.0, which supports autoclick - Add --clickSelector option to customize elements that will be clicked, defaulting to `a`. - Add --linkSelector as alias for --selectLinks for consistency - Unknown options for --behaviors printed as warnings, instead of hard exit, for forward compatibility for new behavior types in the future Fixes #728, also #216, #665, #31	2025-01-16 09:38:11 -08:00
Ilya Kreymer	871490758a	Dependency Update for 1.4.2 (#737 )	2025-01-06 12:06:40 -08:00
Ilya Kreymer	d923e11436	separate fetch api for autofetch behavior + additional improvements on partial responses: (#736 ) Chromium now interrupts fetch() if abort() is called or page is navigated, so autofetch behavior using native fetch() is less than ideal. This PR adds support for __bx_fetch() command for autofetch behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately from browser's regular fetch() - __bx_fetch() starts a fetch, but does not return content to browser, doesn't need abort(), unaffected by page navigation, but will still try to use browser network stack when possible, making it more efficient for background fetching. - if network stack fetch fails, fall back to regular node fetch() in the crawler. Additional improvements for interrupted fetch: - don't store truncated media responses, even for 200 - avoid doing duplicate async fetching if response already handled (eg. fetch handled in multiple contexts) - fixes #735, where fetch was interrupted, resulting in an empty response	2024-12-31 13:52:12 -08:00
Ilya Kreymer	fb8ed18f82	package: pin @novnc/novnc to 1.4.0 to prevent accidental upgrades (#727 ) - novnc 1.5.0 not compatible with current configuration) - fixes #726 - bump to 1.4.1	2024-11-25 18:42:56 -08:00
Ilya Kreymer	9af34f9a1d	version: bump to 1.4.0	2024-11-25 00:36:43 -08:00
Ilya Kreymer	6bfa7d5766	Dependency Update (#725 ) - update yarn packages - update RWP to 2.2.4 - update base image to brave 1.73.91 - fix typing issue - bump to 1.4.0-beta.1	2024-11-24 01:22:50 -08:00
Ilya Kreymer	214eb6ca8f	support removing range from query (via wabac.js 2.20.6): (#724 ) - fix for archiving facebook video, to match webrecorder/archiveweb.page#272 - permissions: auto enable permissions to avoid a possible modal (for both profiles and crawling) - deps: update to latest wabac.js + warcio.js	2024-11-22 10:31:12 -08:00
Ilya Kreymer	f56d6505c1	fix indexing of cookie header: (#714 ) - add fields option for adding req.http:cookie and referrer entries to the cdxj - update to warcio 2.4.0 to support this functionality	2024-11-13 23:13:40 -08:00
Ilya Kreymer	c8e2e43d4d	Dependency Update (#718 ) - bump browsertrix-behaviors to 0.6.5 - bump browsertrix-base-image to 1.71.123 - bump puppeteer-core to 23.7.1	2024-11-10 19:34:38 -08:00
Ilya Kreymer	d04509639a	Support custom css selectors for extracting links (#689 ) Support array of selectors via --selectLinks property in the form [css selector]->[property] or [css selector]->@[attribute].	2024-11-08 11:04:41 -05:00
Tessa Walsh	2a9b152531	Support loading custom behaviors from URLs and/or filepaths (#707 ) Fixes #368 The `--customBehaviors` flag is now an array, making it repeatable. This should be backwards compatible with the CLI flag, but may require changes to YAML configs when custom behaviors are used. Custom behaviors can be loaded from URLs, local filepaths, and paths to local directories, including any combination thereof. New tests are added to ensure loading behaviors from URLs as well as a mixed combination of URL and filepath works as expected. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-11-04 20:30:53 -08:00
Ilya Kreymer	e5bab8e7c8	various edge-case loading optimizations: (#709 ) - rework 'should stream' logic: * ensure 206 responses (or any response) greater than 25M are streamed * response between 5M and 25M are read into memory if text/css/js as they may be rewritten * responses <5M are read into memory * responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error code responses may lack status codes but otherwise are small - likely fix for issues in #706 - if too many range requests for same URL are being made, try skipping/failing right away to reduce load - assume main browser context is used not just for service workers, always enable - check false positive 'net-aborted' error that may actually be ok for media, as well as documents - improve logging - interrupt any pending requests (that may be loading via browser context) after page timeout, log dropped requests --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-10-31 14:06:17 -07:00
Ilya Kreymer	181d9b824c	deps: update to latest wabac (#708 ) bump version to 1.3.4	2024-10-26 11:02:32 -07:00
Ilya Kreymer	0d39ea3590	dep: update to wabac.js 2.20 (#704 ) Update imports for new TS-based wabac.js	2024-10-16 21:02:04 -07:00
Ilya Kreymer	a45b85dd74	version: bump to 1.3.3	2024-10-11 00:12:23 -07:00
Ilya Kreymer	282c47ad66	bump puppeteer core to 23.5.1 (#700 ) includes possible improvements for detecting crashes with wrong stack trace (see: puppeteer/puppeteer#13056)	2024-10-07 16:39:48 -07:00
Ilya Kreymer	356b3f8d10	bump to 1.3.2	2024-09-30 15:51:13 -07:00
Ilya Kreymer	9f310907f0	version: bump to 1.3.1	2024-09-27 14:30:56 -04:00
Ilya Kreymer	da442573b8	version: bump to 1.3.0	2024-09-12 09:22:22 -07:00
Ilya Kreymer	083a9d2090	version: bump to 1.3.0-beta.1	2024-09-05 18:11:52 -07:00
Ilya Kreymer	9d0e3423a3	WARC writer + incremental indexing fixes (#679 ) - ensure WARC rollover happens only after response/request + cdx or single record + cdx have been written - ensure request payload is buffered for POST request indexing - update to warcio 2.3.1 for POST request case-insensitive 'content-type' check - recorder: remove unused 'tempdir', no longer used as warcio chooses a temp file on its own	2024-09-05 11:10:31 -07:00
Ilya Kreymer	85a07aff18	Streaming in-place WACZ creation + CDXJ indexing (#673 ) Fixes #674 This PR supersedes #505, and instead of using js-wacz for optimized WACZ creation: - generates an 'in-place' or 'streaming' WACZ in the crawler, without having to copy the data again. - WACZ contents are streamed to remote upload (or to disk) from existing files on disk - CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX) - All data in the WARCs is written and read only once - Should result in significant speed / disk usage improvements: previously WARC was written once, then read again (for CDXJ indexing), read again (for adding to new WACZ ZIP), written to disk (into new WACZ ZIP), read again (if upload to remote endpoint). Now, WARCs are written once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all data is read once to either generate WACZ on disk or upload to remote. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-08-29 13:21:20 -07:00
Ilya Kreymer	23fbbcb6bf	version: bump to 1.3.0-beta.0	2024-08-14 20:12:48 -07:00
Ilya Kreymer	8d7fb1e084	1.2.8 updates: (#668 ) - rewriting: update wabac.js, use getCustomRewriter(), don't truncate POST request bodies for URLs that use a custom rewriter - browser: disable --enable-automation, setting webdriver = true, so no need for override - deps: update puppeteer-core, necessary changes for latest puppeteer	2024-08-13 23:38:55 -07:00
Ilya Kreymer	bb34c5ef47	version: bump to 1.2.7 deps: bump RWP in Dockerfile to 2.1.3	2024-08-09 13:23:16 -07:00
Ilya Kreymer	a1ba29d878	deps: update puppeteer-core to 22.14.0 (#661 )	2024-07-30 13:51:52 -07:00
Ilya Kreymer	ff81048d3a	deps: bump browsertrix-behaviors to 0.6.3 (#659 ) adds support for detecting videos in shadow dom with query-selector-shadow-dom library	2024-07-30 09:41:21 -07:00
Ilya Kreymer	9f2b9bf4e5	version: bump to 1.2.6	2024-07-29 16:41:40 -07:00
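
The 'should stream' rules listed in commit e5bab8e7c8 (#709) can be read as a small decision function. The following is a minimal Python sketch of those rules as worded in the commit message, not the crawler's own code (which is TypeScript). The names `should_stream` and `is_rewritable`, the reading of "M" as MiB, the exact set of rewritable content types, and the choice to stream non-rewritable responses in the 5M to 25M range are all assumptions made for illustration.

```python
from typing import Optional

MB = 1024 * 1024            # assumption: "M" in the commit message means MiB
MAX_IN_MEMORY = 5 * MB      # below this size, responses are read into memory
MAX_REWRITABLE = 25 * MB    # above this size, responses are always streamed


def is_rewritable(content_type: str) -> bool:
    """Hypothetical check for the 'text/css/js' types that may be rewritten."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime.startswith("text/") or mime in {
        "application/javascript",
        "application/x-javascript",
    }


def should_stream(status: int, size: Optional[int], content_type: str = "") -> bool:
    """Return True to stream the response, False to read it into memory."""
    if size is None:
        # Unknown size: stream 2xx responses, buffer everything else,
        # on the assumption that error responses are small.
        return 200 <= status < 300
    if size > MAX_REWRITABLE:
        # Any response over 25M is streamed, including 206 partial responses.
        return True
    if size < MAX_IN_MEMORY:
        # Small responses are always read into memory.
        return False
    # Between 5M and 25M: buffer only if the content may need rewriting.
    # Streaming the rest is an assumption; the commit does not state it.
    return not is_rewritable(content_type)
```

For example, `should_stream(206, 30 * MB, "video/mp4")` returns `True`, while `should_stream(200, 10 * MB, "text/css")` returns `False` because a stylesheet of that size may still be rewritten.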


200 commits
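
Commit d04509639a (#689) describes `--selectLinks` values of the form `[css selector]->[property]` or `[css selector]->@[attribute]`. A minimal Python sketch of splitting such a value is shown below. The `LinkSelector` type and `parse_link_selector` function are hypothetical names, and rejecting a value without `->` is an assumption of this sketch; the commit message does not say how the crawler treats that case.

```python
from typing import NamedTuple


class LinkSelector(NamedTuple):
    selector: str       # CSS selector used to find elements
    extract: str        # property or attribute name holding the link
    is_attribute: bool  # True when the '@' form was used


def parse_link_selector(value: str) -> LinkSelector:
    """Split 'selector->property' or 'selector->@attribute' into parts."""
    # Split at the last '->' so the selector itself is left untouched.
    selector, sep, extract = value.rpartition("->")
    selector = selector.strip()
    extract = extract.strip()
    is_attribute = extract.startswith("@")
    if is_attribute:
        extract = extract[1:]
    if not sep or not selector or not extract:
        # Assumption: this sketch is strict and rejects incomplete values.
        raise ValueError(f"expected 'selector->property', got {value!r}")
    return LinkSelector(selector, extract, is_attribute)
```

With this sketch, `parse_link_selector("a[href]->href")` yields a property lookup on `a[href]`, and `parse_link_selector("a.external->@data-url")` yields an attribute lookup for `data-url`.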