Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	a42c0b926e	Support host-specific proxies with proxy config YAML (#837 ) - Adds support for YAML-based config for multiple proxies, containing 'matchHosts' section by regex and 'proxies' declaration, allowing matching any number of hosts to any number of named proxies. - Specified via --proxyServerConfig option passed to both crawl and profile creation commands. - Implemented internally by generating a proxy PAC script which does regex matching and running browser with the specified proxy PAC script served by an internal http server. - Also support matching different undici Agents by regex, for using different proxies with direct fetching - Precedence: --proxyServerConfig takes precedence over --proxyServer / PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided - Updated proxies doc section with example - Updated tests with sample bad and good auth examples of proxy config Fixes #836 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-08-20 16:07:29 -07:00
Ilya Kreymer	549d655173	Support option to fail crawl on content check (#861) - add --failOnContentCheck for quick fail if content check in behavior fails - expose __bx_contentCheckFailed to cause an immediate failure from behavior - only allow failing crawl due to content check from within awaitPageLoad() callback - set a 'failReason' key to track that crawl failed due to a particular content check reason - deps: update to browsertrix-behaviors 0.9.0, update to wabac.js (2.23.6) - fixes #860 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-07-08 13:08:52 -07:00
Tessa Walsh	2af94ffab5	Support downloading seed file from URL (#852 ) Fixes #841 Crawler work toward long URL lists in Browsertrix. This PR moves seed handling from the arg parser's validation step to the crawler's bootstrap step in order to be able to async fetch the seed file from a URL. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-07-03 10:49:37 -04:00
Ilya Kreymer	687f08b1d0	Add option to save local/sessionStorage (#856 ) If --saveStorage is set, localStorage and sessionStorage will be serialized with the WARC record for the page. If a page redirects, track what the current page URL is and save storage as part of the page's WARC record. Fixes #855	2025-06-30 19:58:19 -07:00
Rijnder Wever	fa26f05f66	cleanup: remove dead pywb code from argparser and docs (#847 ) The value of `--dedupPolicy` was once passed to pywb (see https://pywb.readthedocs.io/en/latest/manual/configuring.html#dedup-options-for-recording). Now that pywb has been dropped, there is no need to keep this option around. In fact, I know multiple users that have been confused by the mention of this option in the docs (myself included). (for historical context, see https://github.com/webrecorder/browsertrix-crawler/pull/332)	2025-06-16 12:36:32 -04:00
Tessa Walsh	e09d10c582	Disable disk utilization check by default (#850) Related to https://github.com/webrecorder/browsertrix-crawler/issues/848 Several users have had issues with disk utilization checks, including the values reported by `df` inside the crawler container having unexpected results for mounted volumes. The commonly recommended solution to this is to use `docker system df`, but that is of course not available within the Docker container itself. This PR changes disk utilization checks to be an opt-in feature by setting the default value to `0` (disabled).	2025-06-16 12:36:15 -04:00
Ilya Kreymer	71de8d6582	lang code fixes: (#834) - validate --lang values, fail immediately with invalid iso-639-1 language code - ignore --lang value when using profile, print warning that profile language takes precedence - fixes #833	2025-05-12 16:06:29 -07:00
Tessa Walsh	f83d0e8f02	Add option to push behavior + behavior script logs to Redis (#805 ) Fixes #804 - Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3) - Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs - Noisy logs from built-in behaviors like autoscroll are now logged to debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92 and so won't be pushed to Redis for newer versions of the crawler. - Updates browsertrix-behaviors to 0.8.3 and makes some changes to log format in tests accordingly. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-04-03 15:46:10 -07:00
aponb	6898bcf7ae	useSHA1 Parameter for generating SHA1 record hashes (#532 ) (#812 ) By using the useSHA1 flag, the payload digest in records will use SHA-1 with Base32 encoding instead of the default SHA-256 Co-authored-by: Andreas Predikaka <andreas.predikaka@onb.ac.at>	2025-04-02 17:10:50 -07:00
Ilya Kreymer	e585b6d194	Better default crawlId (#806) - set crawl id from collection, not other way around, to ensure unique redis keyspace for different collections - by default, set crawl id to unique value based on host and collection, eg. '@hostname-@id' - don't include '@id' in collection interpolation, can only use hostname or timestamp - fixes issue mentioned / workaround provided in #784 - ci: add docker login + caching to work around rate limits - tests: fix sitemap tests	2025-04-01 13:40:03 -07:00
Ilya Kreymer	e751929a7a	Move extractLinks to behaviors + Update to browsertrix-behaviors 0.8.0 (#803) - extractLinks() now handled via browsertrix-behaviors - fixes #770 via browsertrix-behaviors, checks for toJSON overrides - organize exposed functions to enum list	2025-03-31 12:02:25 -07:00
benoit74	02c4353b4a	Add clarification in usage about hostname used (#771 ) clarify that the crawlId defaults to the Docker container hostname	2025-03-30 21:16:58 -07:00
Tessa Walsh	8f581a587c	Validate Autoclick selector, fail crawl if invalid (#800 ) Fixes #798 Also modifies the existing test for link selector validation to check 17 status code on exit when link selectors fail validation. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2025-03-30 13:48:41 -07:00
Ilya Kreymer	00835fc4f2	Retry same queue (#757) - follow up to #743 - page retries are simply added back to the same queue with `retry` param incremented and a higher score, after extraHops, to ensure retries are added at the end. - score calculation is: `score = depth + (extraHops * MAX_DEPTH) + (retry * MAX_DEPTH * 2)`, this ensures that retries have lower priority than extraHops, and additional retries even lower priority (higher score). - warning is logged when a retry happens, error only when all retries are exhausted. - back to one failure list, urls added there only when all retries are exhausted. - rename --numRetries -> --maxRetries / --retries for clarity - state load: allow retrying previously failed URLs if --maxRetries is higher than on previous run. - ensure working with --failOnFailedStatus, if provided, invalid status codes (>= 400) are retried along with page load failures - fixes #132 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2025-02-06 18:48:40 -08:00
Ilya Kreymer	2e46140c3f	Make numRetries configurable (#754 ) Add --numRetries param, default to 1 instead of 5.	2025-02-05 23:34:55 -08:00
Ilya Kreymer	b7150f1343	Autoclick Support (#729 ) Adds support for autoclick behavior: - Adds new `autoclick` behavior option to `--behaviors`, but not enabling by default - Adds support for new exposed function `__bx_addSet` which allows autoclick behavior to persist state about links that have already been clicked to avoid duplicates, only used if link has an href - Adds a new pageFinished flag on the worker state. - Adds a on('dialog') handler to reject onbeforeunload page navigations, when in behavior (page not finished), but accept when page is finished - to allow navigation away only when behaviors are done - Update to browsertrix-behaviors 0.7.0, which supports autoclick - Add --clickSelector option to customize elements that will be clicked, defaulting to `a`. - Add --linkSelector as alias for --selectLinks for consistency - Unknown options for --behaviors printed as warnings, instead of hard exit, for forward compatibility for new behavior types in the future Fixes #728, also #216, #665, #31	2025-01-16 09:38:11 -08:00
Ilya Kreymer	6bfa7d5766	Dependency Update (#725 ) - update yarn packages - update RWP to 2.2.4 - update base image to brave 1.73.91 - fix typing issue - bump to 1.4.0-beta.1	2024-11-24 01:22:50 -08:00
Francesco Servida	07e5ceb4c2	Implemented option for FullPage screenshot after the behaviours have run (#656 ) - new `fullPageFinal` screenshot option, which will take a full page screenshot after behaviors are run, or before moving onto next page if behaviors are skipped. Related to #486 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-11-23 21:26:55 -08:00
Tessa Walsh	60c84b342e	Support loading custom behaviors from git repo (#717 ) Fixes #712 - Also expands the existing documentation about behaviors and adds a test. - Uses query arg for 'branch' and 'path' to specify git branch and subpath in repo, respectively. --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2024-11-13 22:50:33 -08:00
Ilya Kreymer	d04509639a	Support custom css selectors for extracting links (#689 ) Support array of selectors via --selectLinks property in the form [css selector]->[property] or [css selector]->@[attribute].	2024-11-08 11:04:41 -05:00
Tessa Walsh	2a9b152531	Support loading custom behaviors from URLs and/or filepaths (#707 ) Fixes #368 The `--customBehaviors` flag is now an array, making it repeatable. This should be backwards compatible with the CLI flag, but may require changes to YAML configs when custom behaviors are used. Custom behaviors can be loaded from URLs, local filepaths, and paths to local directories, including any combination thereof. New tests are added to ensure loading behaviors from URLs as well as a mixed combination of URL and filepath works as expected. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-11-04 20:30:53 -08:00
Ilya Kreymer	9c9643c24f	crawler args typing (#680) - Refactors args parsing so that `Crawler.params` is properly typed with CLI options + additions with `CrawlerArgs` type. - also adds typing to create-login-profile CLI options - validation still done w/o typing due to yargs limitations - tests: exclude slow page from tests for faster test runs	2024-09-05 18:10:27 -07:00
Ilya Kreymer	85a07aff18	Streaming in-place WACZ creation + CDXJ indexing (#673 ) Fixes #674 This PR supersedes #505, and instead of using js-wacz for optimized WACZ creation: - generates an 'in-place' or 'streaming' WACZ in the crawler, without having to copy the data again. - WACZ contents are streamed to remote upload (or to disk) from existing files on disk - CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX) - All data in the WARCs is written and read only once - Should result in significant speed / disk usage improvements: previously WARC was written once, then read again (for CDXJ indexing), read again (for adding to new WACZ ZIP), written to disk (into new WACZ ZIP), read again (if upload to remote endpoint). Now, WARCs are written once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all data is read once to either generate WACZ on disk or upload to remote. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-08-29 13:21:20 -07:00
Ilya Kreymer	8934feaf70	SOCKS5 over SSH Tunnel Support (#671) - Adds support for running a SOCKS5 proxy over an SSH connection. This can be configured by using `--proxyServer ssh://user@host[:port]` config and also passing an `--sshProxyPrivateKeyFile <private key file>` file param and an optional `--sshProxyKnownHostsFile <public host key file>` file param. The key files are expected to be mounted as volumes into the crawler. - Same arguments are also available for create-login-profile - The proxy config uses autossh to establish a more robust connection, and also waits until a connection can be established before proceeding. - Docs are updated to include a new 'Crawling with Proxies' page in the user guide - Tests are updated to include crawling through an SSH proxy running locally. --------- Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>	2024-08-28 18:47:24 -07:00
Tessa Walsh	39c8f48bb2	Disable behaviors entirely if --behaviors array is empty (#672 ) Fixes #651	2024-08-27 13:20:19 -07:00
benoit74	1099f4f3c8	Make it clear that profile argument can be an HTTP(S) URL (#649 ) Small documentation enhancement to make it clear that browser profile can be passed as HTTP(S) URL as well.	2024-07-19 18:53:28 -07:00
Ilya Kreymer	9847af7765	disable socat by default (#622 ) - crawling: add '--debugAccessBrowser' flag to enable connecting via 9222, only run socat then - profiles: only run socat in headless mode	2024-06-20 20:10:25 -07:00
Ilya Kreymer	febf4b7532	logging: log error message when seed fails to be created (#619) for example, due to bad include/exclude regex, fixes #598	2024-06-20 18:41:57 -07:00
Ilya Kreymer	f504effa51	Merge branch 'main' into release/1.1.4 bump to 1.2.0-beta.1	2024-06-13 19:28:25 -07:00
Ilya Kreymer	8f8326eaf5	Fix synching extraSeeds state with multiple crawler instances (#605 ) Fixes #604 Ensures that extra seeds are propagated to all crawler instances. Adds a new redis hashmap key to store the extraSeed mappings url->extraSeeds index, to ensure the extra seeds are added in the same order on other instances, even if encountered in different order. Add a new redis lua primitive 'addnewseed' which combines several operations: check if extra seed already exists and returning existing index, add new seed to extraSeed list, also add to regular URL seed list.	2024-06-13 17:18:06 -07:00
Ilya Kreymer	e2b4cc1844	proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars (#589 ) fixes #587 The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they were hardcoded to obsolete values in the Dockerfile. Proxy settings can now be set, in order of precedence via: - --proxyServer cli flag - PROXY_SERVER env var - PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server only (for backwards compatibility with 0.12.x) The --proxyServer / PROXY_SERVER settings are passed to the browser via the --proxy-server flag. AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying. Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth (supported in Brave, but not Chrome!) --------- Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-06-10 13:11:00 -07:00
Ilya Kreymer	b83d1c58da	add --dryRun flag and mode (#594 ) - if set, runs the crawl but doesn't store any archive data (WARCS, WACZ, CDXJ) while logs and pages are still written, and saved state can be generated (per the --saveState options). - adds test to ensure only 'logs' and 'pages' dirs are generated with --dryRun - screenshot, text extraction are skipped altogether in dryRun mode, warning is printed that storage and archiving-related options may be ignored - fixes #593	2024-06-07 10:34:19 -07:00
Tessa Walsh	8318039ae3	Fix regressions with `failOnFailedSeed` option (#572) Fixes #563 This PR makes a few changes to fix a regression in behavior around `failOnFailedSeed` for the 1.x releases: - Fail with exit code 1, not 17, when pages are unreachable due to DNS not resolving or other network errors if the page is a seed and `failOnFailedSeed` is set - Extend tests, add test to ensure crawl succeeds on 404 seed status code if `failOnInvalidStatus` isn't set	2024-05-15 11:02:33 -07:00
Ilya Kreymer	c247189474	qa/replay crawl loading improvements (#526 ) - use frame.load() to load RWP frame directly instead of waiting for navigation messages - retry loading RWP if replay frame is missing - support --postLoadDelay in replay crawl - support --include / --exclude options in replay crawler, allow excluding and including pages to QA via regex - improve --qaDebugImageDiff debug image saving, save images to same dir, using ${counter}-${workerid}-${pageid}-{crawl,replay,vdiff}.png for better sorting - when running QA crawl, check and use QA_ARGS instead of CRAWL_ARGS if provided - ensure empty string text from page is treated different from error (undefined) - ensure info.warc.gz is closed in closeFiles() misc: - fix typo in --postLoadDelay check! - enable 'startEarly' mode for behaviors (autofetch, autoplay) --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-04 13:05:24 -07:00
Ilya Kreymer	2059f2b6ae	add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520 ) but before running link extraction, text extraction, screenshots and behaviors. Useful for sites that load quickly but perform async loading / init afterwards, fixes #519 A simple workaround for when it's tricky to detect when a page has actually fully loaded. Useful for sites such as Instagram.	2024-03-28 17:17:29 -07:00
Ilya Kreymer	bb9c82493b	QA Crawl Support (Beta) (#469 ) Initial (beta) support for QA/replay crawling! - Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page - Runs local http server with full-page, ui-less ReplayWeb.page embed - ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint. - Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd - Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified. - Supports `--qaDebugImageDiff` for outputting crawl / replay/ diff images. - If using --writePagesToRedis, a `comparison` key is added to existing page data where: ``` comparison: { screenshotMatch?: number; textMatch?: number; resourceCounts: { crawlGood?: number; crawlBad?: number; replayGood?: number; replayBad?: number; }; }; ``` - bump version to 1.1.0-beta.2	2024-03-22 17:32:42 -07:00
Ilya Kreymer	22a7351dc7	service worker capture fix: disable by default for now (#506 ) Due to issues with capturing top-level pages, make bypassing service workers the default for now. Previously, it was only disabled when using profiles. (This is also consistent with ArchiveWeb.page behavior). Includes: - add --serviceWorker option which can be `disabled`, disabled-if-profile (previous default) and `enabled` - ensure page timestamp is set for direct fetch - warn if page timestamp is missing on serialization, then set to now before serializing bump version to 1.0.2	2024-03-22 13:37:14 -07:00
Ilya Kreymer	56053534c5	SAX-based sitemap parser (#497) Adds a new SAX-based sitemap parser, inspired by: https://www.npmjs.com/package/sitemap-stream-parser Supports: - recursively parsing sitemap indexes, using p-queue to process N at a time (currently 5) - `fromDate` and `toDate` filter dates, to only include URLs between the given dates, filtering nested sitemap lists included - async parsing, continue parsing in the background after 100 URLs - timeout for initial fetch / first 100 URLs set to 30 seconds to avoid slowing down the crawl - save/load state integration: mark if sitemaps have already been parsed in redis, serialize to save state, to avoid reparsing again. (Will reparse if parsing did not fully finish) - Aware of `pageLimit`, don't add URLs past the page limit, interrupt further parsing when at limit. - robots.txt `sitemap:` parsing, check URL extension and mime type - automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt, then /sitemap.xml - tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL. Fixes #496 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-18 19:14:07 -07:00
Ilya Kreymer	9f18a49c0a	Better tracking of failed requests + logging context exclude (#485 ) - add --logExcludeContext for log contexts that should be excluded (while --logContext specifies which are to be included) - enable 'recorderNetwork' logging for debugging CDP network - create default log context exclude list (containing: screencast, recorderNetwork, jsErrors), customizable via --logExcludeContext recorder: Track failed requests and include in pageinfo records with status code 0 - cleanup cdp handler methods - intercept requestWillBeSent to track requests that started (but may not complete) - fix shouldSkip() still working if no url is provided (eg. check only headers) - set status to 0 for async fetch failures - remove responseServedFromCache interception, as response data generally not available then, and responseReceived is still called - pageinfo: include page requests that failed with status code 0, also include 'error' status if available. - ensure page is closed on failure - ensure pageinfo still written even if nothing else is crawled for a page - track cached responses, add to debug logging (can also add to pageinfo later if needed) tests: add pageinfo test for crawling invalid URL, which should still result in pageinfo record with status code 0 bump to 1.0.0-beta.7	2024-03-07 11:35:53 -05:00
Ilya Kreymer	4520e9e96f	Fail on status code option + requeue fix (#480 ) Add fail on status code option, --failOnInvalidStatus to treat non-200 responses as failures. Can be useful especially when combined with --failOnFailedSeed or --failOnFailedLimit requeue: ensure requeued urls are requeued with same depth/priority, not 0	2024-03-04 17:21:44 -08:00
Ilya Kreymer	dd48251b39	Include WARC prefix for screenshots and text WARCs (#473 ) Ensure the env var / cli <warc prefix>-<crawlId> is also applied to `screenshots.warc.gz` and `text.warc.gz`	2024-02-27 23:33:34 -08:00
Tessa Walsh	bdffa7922c	Add arg to write pages to Redis (#464) Fixes #462 Add --writePagesToRedis arg, for use in conjunction with QA features in Browsertrix Cloud, to add pages to the database for each crawl. Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and redis) Also include timestamp (as ISO date) in `pageinfo:` records --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-02-09 16:44:17 -08:00
Ilya Kreymer	3323262852	WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440) Support for rollover size and custom WARC prefix templates: - reenable --rolloverSize (default to 1GB) for when a new WARC is created - support custom WARC prefix via --warcPrefix, prepended to new WARC filename, test via basic_crawl.test.js - filename template for new files is: `${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}` with `$ts` replaced at new file creation time with current timestamp Improved support for long (non-terminating) responses, such as from live-streaming: - add a size to CDP takeStream to ensure data is streamed in fixed chunks, defaulting to 64k - change shutdown order: first close browser, then finish writing all WARCs to ensure any truncated responses can be captured. - ensure WARC is not rewritten after it is done, skip writing records if stream already flushed - add timeout to final fetch tasks to avoid hanging indefinitely on finish - fix adding `WARC-Truncated` header, need to set after stream is finished to determine if it's been truncated - move temp download `tmp-dl` dir to main temp folder, outside of collection (no need to be there).	2023-12-07 23:02:55 -08:00
Ilya Kreymer	19dac943cc	Add types + validation for log context options (#435 ) - add LogContext type and enumerate all log contexts - also add LOG_CONTEXT_TYPES array to validate --context arg - rename errJSON -> formatErr, convert unknown (likely Error) to dict - make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers	2023-11-14 21:54:40 -08:00
Emma Segal-Grossman	2a49406df7	Add Prettier to the repo, and format all the files! (#428 ) This adds prettier to the repo, and sets up the pre-commit hook to auto-format as well as lint. Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.	2023-11-09 16:11:11 -08:00
Ilya Kreymer	af1e0860e4	TypeScript Conversion (#425 ) Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: emma <hi@emma.cafe>	2023-11-09 11:27:11 -08:00
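
The `--selectLinks` format described in commit d04509639a (#689), `[css selector]->[property]` or `[css selector]->@[attribute]`, could be parsed roughly as follows. This is a minimal sketch, not the crawler's actual code: the `LinkRule` type and `parseLinkRule` function are hypothetical names introduced for illustration, and rejecting values without `->` is an assumption.

```typescript
// Hypothetical shape for one parsed rule; not the crawler's real type.
interface LinkRule {
  selector: string; // CSS selector used to find elements
  extract: string; // property or attribute name to read the URL from
  isAttribute: boolean; // true when the extractor was written as @name
}

// Parse "selector->property" or "selector->@attribute".
// Assumption: a value without "->" is rejected rather than given a default.
function parseLinkRule(value: string): LinkRule {
  const idx = value.lastIndexOf("->");
  if (idx === -1) {
    throw new Error(`Missing "->" in link selector: ${value}`);
  }
  const selector = value.slice(0, idx).trim();
  const target = value.slice(idx + 2).trim();
  const isAttribute = target.startsWith("@");
  const extract = isAttribute ? target.slice(1) : target;
  if (!selector || !extract) {
    throw new Error(`Incomplete link selector: ${value}`);
  }
  return { selector, extract, isAttribute };
}

// A property rule and an attribute rule
const byProperty = parseLinkRule("a[href]->href");
const byAttribute = parseLinkRule("button->@data-url");
```

The attribute form is distinguished only by the leading `@`, so the same selector can be reused with either a DOM property or an HTML attribute as the link source.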

46 commits
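
The queue score formula quoted in commit 00835fc4f2 (#757) can be illustrated with a short sketch. The formula itself is taken from the commit message; the function name `queueScore` and the value of `MAX_DEPTH` are assumptions made for this example, since the commit message does not state the constant.

```typescript
// Assumed placeholder value; the commit message does not give MAX_DEPTH.
const MAX_DEPTH = 1_000_000;

// Formula as quoted in the commit message:
// score = depth + (extraHops * MAX_DEPTH) + (retry * MAX_DEPTH * 2)
// A higher score means lower priority in the queue.
function queueScore(depth: number, extraHops = 0, retry = 0): number {
  return depth + extraHops * MAX_DEPTH + retry * MAX_DEPTH * 2;
}

// The same depth-2 page under different conditions
const regular = queueScore(2); // normal in-scope page
const extraHop = queueScore(2, 1); // page reached via one extra hop
const firstRetry = queueScore(2, 0, 1); // first retry of a failed page
const secondRetry = queueScore(2, 0, 2); // second retry

console.log({ regular, extraHop, firstRetry, secondRetry });
```

With these inputs, the first retry scores higher than the single extra-hop page, and the second retry higher still, which matches the ordering the commit describes for this case.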