Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	5961a521c2	set failed URL retry to 5 by default	2025-01-17 18:19:31 -08:00
Ilya Kreymer	5d9c62e264	Retry Failed Pages + Ignore Hashtags in Redirect Check (#739 ) - Retry pages that are marked as failed once, at the end of the crawl, in case it was due to a timeout - Also, don't treat differences in hashtag between seed page loaded and actual URL as a redirect (eg. don't add as new seed)	2025-01-16 15:51:35 -08:00
Ilya Kreymer	bc4a95883d	clear out core dumps to avoid using up volume space: (#740 ) - add 'ulimit -c' to startup script - delete any './core' files that exist in working dir just in case - fixes #738	2025-01-16 15:50:59 -08:00
Ilya Kreymer	b7150f1343	Autoclick Support (#729 ) Adds support for autoclick behavior: - Adds new `autoclick` behavior option to `--behaviors`, but not enabling by default - Adds support for new exposed function `__bx_addSet` which allows autoclick behavior to persist state about links that have already been clicked to avoid duplicates, only used if link has an href - Adds a new pageFinished flag on the worker state. - Adds an on('dialog') handler to reject onbeforeunload page navigations, when in behavior (page not finished), but accept when page is finished - to allow navigation away only when behaviors are done - Update to browsertrix-behaviors 0.7.0, which supports autoclick - Add --clickSelector option to customize elements that will be clicked, defaulting to `a`. - Add --linkSelector as alias for --selectLinks for consistency - Unknown options for --behaviors printed as warnings, instead of hard exit, for forward compatibility for new behavior types in the future Fixes #728, also #216, #665, #31	2025-01-16 09:38:11 -08:00
Ilya Kreymer	871490758a	Dependency Update for 1.4.2 (#737 )	2025-01-06 12:06:40 -08:00
Ilya Kreymer	d923e11436	separate fetch api for autofetch behavior + additional improvements on partial responses: (#736 ) Chromium now interrupts fetch() if abort() is called or page is navigated, so autofetch behavior using native fetch() is less than ideal. This PR adds support for __bx_fetch() command for autofetch behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately from browser's regular fetch() - __bx_fetch() starts a fetch, but does not return content to browser, doesn't need abort(), unaffected by page navigation, but will still try to use browser network stack when possible, making it more efficient for background fetching. - if network stack fetch fails, fallback to regular node fetch() in the crawler. Additional improvements for interrupted fetch: - don't store truncated media responses, even for 200 - avoid doing duplicate async fetching if response already handled (eg. fetch handled in multiple contexts) - fixes #735, where fetch was interrupted, resulted in an empty response	2024-12-31 13:52:12 -08:00
Ilya Kreymer	fb8ed18f82	package: pin @novnc/novnc to 1.4.0 to prevent accidental upgrades (#727 ) - novnc 1.5.0 not compatible with current configuration) - fixes #726 - bump to 1.4.1	2024-11-25 18:42:56 -08:00
Ilya Kreymer	9af34f9a1d	version: bump to 1.4.0	2024-11-25 00:36:43 -08:00
Ilya Kreymer	6bfa7d5766	Dependency Update (#725 ) - update yarn packages - update RWP to 2.2.4 - update base image to brave 1.73.91 - fix typing issue - bump to 1.4.0-beta.1	2024-11-24 01:22:50 -08:00
Francesco Servida	07e5ceb4c2	Implemented option for FullPage screenshot after the behaviours have run (#656 ) - new `fullPageFinal` screenshot option, which will take a full page screenshot after behaviors are run, or before moving onto next page if behaviors are skipped. Related to #486 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-11-23 21:26:55 -08:00
Ilya Kreymer	214eb6ca8f	support removing range from query (via wabac.js 2.20.6): (#724 ) - fix for archiving facebook video, to match webrecorder/archiveweb.page#272 - permissions: auto enable permissions to avoid a possible modal (for both profiles and crawling) - deps: update to latest wabac.js + warcio.js	2024-11-22 10:31:12 -08:00
Ilya Kreymer	0b9cd71c5a	Ensure partial responses are not written (#721 ) various fixes for streaming, especially related to range requests - follow up to #709 - fix: prefer streaming current response via takeStream, not only when size is unknown - don't serialize async responses prematurely - don't serialize 206 responses if there is size mismatch	2024-11-13 23:28:37 -08:00
Ilya Kreymer	f56d6505c1	fix indexing of cookie header: (#714 ) - add fields option for adding req.http:cookie and referrer entries to the cdxj - update to warcio 2.4.0 to support this functionality	2024-11-13 23:13:40 -08:00
Tessa Walsh	60c84b342e	Support loading custom behaviors from git repo (#717 ) Fixes #712 - Also expands the existing documentation about behaviors and adds a test. - Uses query arg for 'branch' and 'path' to specify git branch and subpath in repo, respectively. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-11-13 22:50:33 -08:00
Ilya Kreymer	ea05307528	add disable-lazy-loading flag, should fix #699 (#720 )	2024-11-11 21:55:09 -08:00
Ilya Kreymer	c8e2e43d4d	Dependency Update (#718 ) - bump browsertrix-behaviors to 0.6.5 - bump browsertrix-base-image to 1.71.123 - bump puppeteer-core to 23.7.1	2024-11-10 19:34:38 -08:00
Ilya Kreymer	d04509639a	Support custom css selectors for extracting links (#689 ) Support array of selectors via --selectLinks property in the form [css selector]->[property] or [css selector]->@[attribute].	2024-11-08 11:04:41 -05:00
Tessa Walsh	2a9b152531	Support loading custom behaviors from URLs and/or filepaths (#707 ) Fixes #368 The `--customBehaviors` flag is now an array, making it repeatable. This should be backwards compatible with the CLI flag, but may require changes to YAML configs when custom behaviors are used. Custom behaviors can be loaded from URLs, local filepaths, and paths to local directories, including any combination thereof. New tests are added to ensure loading behaviors from URLs as well as a mixed combination of URL and filepath works as expected. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-11-04 20:30:53 -08:00
Ilya Kreymer	e5bab8e7c8	various edge-case loading optimizations: (#709 ) - rework 'should stream' logic: * ensure 206 responses (or any response) greater than 25M are streamed * response between 5M and 25M are read into memory if text/css/js as they may be rewritten * responses <5M are read into memory * responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error code responses may lack status codes but otherwise are small - likely fix for issues in #706 - if too many range requests for same URL are being made, try skipping/failing right away to reduce load - assume main browser context is used not just for service workers, always enable - check false positive 'net-aborted' error that may actually be ok for media, as well as documents - improve logging - interrupt any pending requests (that may be loading via browser context) after page timeout, log dropped requests --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-10-31 14:06:17 -07:00
Ilya Kreymer	5c00bca2b4	tests: use old.webrecorder.net for testing (#710 ) replace webrecorder.net -> old.webrecorder.net to fix tests relying on old website for now	2024-10-31 13:24:58 -04:00
Ilya Kreymer	181d9b824c	deps: update to latest wabac (#708 ) bump version to 1.3.4	2024-10-26 11:02:32 -07:00
Ilya Kreymer	0d39ea3590	dep: update to wabac.js 2.20 (#704 ) Update imports for new TS-based wabac.js	2024-10-16 21:02:04 -07:00
Ilya Kreymer	a45b85dd74	version: bump to 1.3.3	2024-10-11 00:12:23 -07:00
Ilya Kreymer	652cf9cfa6	link extraction promise cleanup: (#701 ) - catch frame.evaluate() directly and log errors there to avoid any possibility of exception being propagated before wrapping in timedRun() - also add clearTimeout() to timedRun() - possibly fixes openzim/zimit#376	2024-10-11 00:11:24 -07:00
Ilya Kreymer	157ac34d8c	fix typo in QA exclude check, which resulted in all URLs being excluded (#697 ) - ensure exclusions now work as expected in replay mode - add test for using --exclude with replay	2024-10-07 17:25:36 -07:00
Ilya Kreymer	282c47ad66	bump puppeteer core to 23.5.1 (#700 ) includes possible improvements for detecting crashes with wrong stack trace (see: puppeteer/puppeteer#13056)	2024-10-07 16:39:48 -07:00
Tessa Walsh	e05d50d637	Add documentation for crawl collections (#695 ) Fixes #675	2024-10-05 11:51:32 -07:00
Ilya Kreymer	d497a424fc	tests: disable blockrules youtube tests in CI (#698 ) due to youtube being blocked, disable test involving youtube embeds when running in CI for now	2024-10-04 17:37:13 -07:00
Ilya Kreymer	356b3f8d10	bump to 1.3.2	2024-09-30 15:51:13 -07:00
Ilya Kreymer	728f00219a	ensure extraHops also apply to maxDepth (#694 ) - if extraHops is set, crawler should visit pages beyond maxDepth - currently returning out of scope at depth limit even if extraHops is set - adjust isInScope and isAtMaxDepth to account for extraHops - tests: update extra hops test to test extraHops beyond depth - fixes #693	2024-09-30 15:46:34 -07:00
Ilya Kreymer	9f310907f0	version: bump to 1.3.1	2024-09-27 14:30:56 -04:00
Ilya Kreymer	a56e13d2ff	Additional exception safety (#692 ) - add additional catch() block - wrap page.title() in timedRun() to catch/log exception if this fails - log error in getting cookies - hopefully fixes hard-to-repro edge case crash in openzim/zimit#376	2024-09-27 14:30:25 -04:00
Tessa Walsh	607fc84c7d	Include depth in pages JSONL files (#691 ) Fixes #690	2024-09-27 10:01:20 -04:00
Ilya Kreymer	6b4ba5b430	direct fetch: when cancelling due to redirect, read full body (#688 ) to avoid possible exception due to encoding. (Probably a node bug, reported in nodejs/undici#3616) Replace abort with cancel, which is the recommended way to cancel the response. fixes #687	2024-09-17 10:29:23 -07:00
Ilya Kreymer	da442573b8	version: bump to 1.3.0	2024-09-12 09:22:22 -07:00
Ilya Kreymer	eb50fdffde	exit codes: exit with error code 10 if interrupt is caused by unexpected browser exit (#686 ) Differentiate from expected/predictable interrupts due to limits (exit code 11) and unexpected interrupt due to browser crash (now exit code 10) fixes #683	2024-09-12 09:10:23 -07:00
Ilya Kreymer	fdb76f2c88	update current crawl size in redis on each healthcheck call (#685 ) - allows Browsertrix app to adjust size, if needed, more frequently - run checkLimits() before starting crawl, in case out of space	2024-09-10 08:28:07 -07:00
Ilya Kreymer	b42548373d	eslint: add strict await checking: (#684 ) - require await / void / catch for promises - don't allow unnecessary await	2024-09-06 16:24:18 -07:00
Ilya Kreymer	9cacae6bb6	cleanup: remove old config files from pywb (#682 )	2024-09-05 20:23:34 -07:00
Ilya Kreymer	c38b69e74b	bump browser to 1.69.162 (#681 )	2024-09-05 20:21:43 -07:00
Ilya Kreymer	083a9d2090	version: bump to 1.3.0-beta.1	2024-09-05 18:11:52 -07:00
Ilya Kreymer	9c9643c24f	crawler args typing (#680 ) - Refactors args parsing so that `Crawler.params` is properly typed with CLI options + additions with `CrawlerArgs` type. - also adds typing to create-login-profile CLI options - validation still done w/o typing due to yargs limitations - tests: exclude slow page from tests for faster test runs	2024-09-05 18:10:27 -07:00
Ilya Kreymer	802a416c7e	Additional direct fetch improvements (#678 ) - use existing headersTimeout in undici to limit time to headers fetch to 30 seconds, reject direct fetch if timeout is reached - allow full page timeout for loading payload via direct fetch - support setting global fetch() settings - add markPageUsed() to only reuse pages when not doing direct fetch - apply auth headers to direct fetch - catch failed fetch and timeout errors - support failOnFailedSeeds for direct fetch, ensure timeout is working	2024-09-05 13:28:49 -07:00
Ilya Kreymer	9d0e3423a3	WARC writer + incremental indexing fixes (#679 ) - ensure WARC rollover happens only after response/request + cdx or single record + cdx have been written - ensure request payload is buffered for POST request indexing - update to warcio 2.3.1 for POST request case-insensitive 'content-type' check - recorder: remove unused 'tempdir', no longer used as warcio chooses a temp file on its own	2024-09-05 11:10:31 -07:00
Ilya Kreymer	0d6a0b0efa	fix for direct fetch timeouts (#677 ) - use '--timeout' value for direct fetch timeout, instead of fixed 30 seconds - don't consider 'document' as essential resource regardless of mime type, as any top-level URL is a document - don't count non-200 responses as non-essential even if missing content-type fixes #676	2024-09-05 10:32:31 -07:00
Ilya Kreymer	85a07aff18	Streaming in-place WACZ creation + CDXJ indexing (#673 ) Fixes #674 This PR supersedes #505, and instead of using js-wacz for optimized WACZ creation: - generates an 'in-place' or 'streaming' WACZ in the crawler, without having to copy the data again. - WACZ contents are streamed to remote upload (or to disk) from existing files on disk - CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX) - All data in the WARCs is written and read only once - Should result in significant speed / disk usage improvements: previously WARC was written once, then read again (for CDXJ indexing), read again (for adding to new WACZ ZIP), written to disk (into new WACZ ZIP), read again (if upload to remote endpoint). Now, WARCs are written once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all data is read once to either generate WACZ on disk or upload to remote. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-08-29 13:21:20 -07:00
Ilya Kreymer	8934feaf70	SOCKS5 over SSH Tunnel Support (#671 ) - Adds support for running a SOCKS5 proxy over an SSH connection. This can be configured by using `--proxyServer ssh://user@host[:port]` config and also passing an `--sshProxyPrivateKeyFile <private key file>` file param and an optional `--sshProxyKnownHostsFile <public host key file>` file param. The key files are expected to be mounted as volumes into the crawler. - Same arguments are also available for create-login-profile - The proxy config uses autossh to establish a more robust connection, and also waits until a connection can be established before proceeding. - Docs are updated to include a new 'Crawling with Proxies' page in the user guide - Tests are updated to include crawling through an SSH proxy running locally. --------- Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>	2024-08-28 18:47:24 -07:00
Tessa Walsh	39c8f48bb2	Disable behaviors entirely if --behaviors array is empty (#672 ) Fixes #651	2024-08-27 13:20:19 -07:00
Ilya Kreymer	c61a03de6e	ci: use docker compose instead of docker-compose	2024-08-14 21:21:35 -07:00
Henry Wilkinson	4c1da90d8f	Adds warning about crawling with basic auth (#669 ) Closes https://github.com/webrecorder/browsertrix/issues/1950 over here too ### Changes - Adds a warning about using basic auth - Adds a link to MDN because learning and cross referencing is fun!	2024-08-14 21:14:31 -07:00

464 commits