Commit graph

40 commits

Ilya Kreymer
a42c0b926e
Support host-specific proxies with proxy config YAML (#837)
- Adds support for a YAML-based config for multiple proxies, containing a
'matchHosts' section (regex patterns) and a 'proxies' declaration, allowing
any number of hosts to be matched to any number of named proxies.
- Specified via the --proxyServerConfig option, passed to both the crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does the
regex matching, and running the browser with that PAC script served by an
internal HTTP server.
- Also supports matching different undici Agents by regex, for using
different proxies with direct fetching.
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided.
- Updated proxies doc section with an example (see the sketch below).
- Updated tests with sample bad and good auth examples of the proxy config.
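A minimal, hypothetical sketch of what such a proxy config YAML might look like (host patterns, proxy names, and proxy URLs are placeholders; see the proxies doc section for the authoritative schema):

```yaml
# Hypothetical config passed via --proxyServerConfig (illustrative only)
matchHosts:
  # regex pattern -> name of a proxy declared below
  "example\\.org": proxy-us
  "(.*\\.)?example\\.com": proxy-eu

proxies:
  # named proxies; the URLs here are placeholders
  proxy-us: "socks5://user:pass@us-proxy.internal:1080"
  proxy-eu: "http://eu-proxy.internal:3128"
```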

Fixes #836

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-08-20 16:07:29 -07:00
Tessa Walsh
66402c2e53
Add documentation for --failOnContentCheck and update CLI options in docs (#869)
Related to #860 

This will give us something we can link to from Browsertrix/the
Browsertrix User Guide for up-to-date information on this option.
2025-07-23 12:54:12 -07:00
Tessa Walsh
acae5155f5
Fix docs mistaking --waitUntil for --pageLoadTimeout (#864)
Fixes https://github.com/webrecorder/browsertrix-crawler/issues/853

Corrects a documentation inaccuracy pointed out by a user
2025-07-21 12:52:58 -07:00
Ilya Kreymer
d2a6aa9805
version: bump to 1.6.3 (#851)
cli: regen cli docs to update from #850
2025-06-16 15:55:05 -04:00
Rijnder Wever
fa26f05f66
cleanup: remove dead pywb code from argparser and docs (#847)
The value of `--dedupPolicy` was once passed to pywb (see
https://pywb.readthedocs.io/en/latest/manual/configuring.html#dedup-options-for-recording).
Now that pywb has been dropped, there is no need to keep this option
around.

In fact, I know multiple users who have been confused by the mention of
this option in the docs (myself included).

(for historical context, see
https://github.com/webrecorder/browsertrix-crawler/pull/332)
2025-06-16 12:36:32 -04:00
Ilya Kreymer
1cb1b2edb9
Update Behaviors Docs (#820)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-04-10 03:58:07 -04:00
Ilya Kreymer
e585b6d194
Better default crawlId (#806)
- set crawl id from the collection, not the other way around, to ensure a
unique redis keyspace for different collections
- by default, set crawl id to a unique value based on host and collection,
e.g. '@hostname-@id'
- don't include '@id' in collection interpolation; it can only use
hostname or timestamp
- fixes issue mentioned / workaround provided in #784
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
2025-04-01 13:40:03 -07:00
benoit74
02c4353b4a
Add clarification in usage about hostname used (#771)
clarify that the crawlId defaults to the Docker container hostname
2025-03-30 21:16:58 -07:00
Henry Wilkinson
34a1e3d6c0
docs: Update header font (#785)
Updated alongside https://github.com/webrecorder/replayweb.page/pull/405

Long overdue match to Browsertrix docs styling

### Screenshots

Screenshot of the updated docs header font (image omitted).
2025-03-05 14:21:00 -08:00
benoit74
4b72b7c7dc
Add documentation on exit codes (#765)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-11 12:16:29 -05:00
benoit74
fc56c2cf76
Add more exit codes to detect interruption reason (#764)
Fix #584

- Replace interrupted with interruptReason
- Distinct exit codes for different interrupt reasons: SizeLimit (14),
TimeLimit (15), FailedLimit (12), and DiskUtilization (16) are used when an
interrupt happens for these reasons, in addition to the existing reasons
BrowserCrashed (10), SignalInterrupted (11), and SignalInterruptedForce (13)
(see the example below)
- Doc fix to cli args
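As a rough illustration of how these codes might be consumed in a wrapper script (the docker invocation and --sizeLimit value are placeholders; only the exit codes above come from this change):

```sh
# Illustrative wrapper: inspect the crawler's exit code after a run
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --sizeLimit 100000000
rc=$?
case $rc in
  0)  echo "crawl completed normally" ;;
  12) echo "interrupted: failed-page limit reached" ;;
  14) echo "interrupted: size limit reached" ;;
  15) echo "interrupted: time limit reached" ;;
  16) echo "interrupted: disk utilization limit reached" ;;
  *)  echo "other exit code: $rc" ;;
esac
```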

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-10 14:00:55 -08:00
Ilya Kreymer
00835fc4f2
Retry same queue (#757)
- follow up to #743
- page retries are simply added back to the same queue with the `retry`
param incremented and a higher score, after extraHops, to ensure retries
are added at the end.
- score calculation is: `score = depth + (extraHops * MAX_DEPTH) +
(retry * MAX_DEPTH * 2)`; this ensures that retries have lower priority
than extraHops, and additional retries even lower priority (higher
score).
- a warning is logged when a retry happens, an error only when all retries
are exhausted.
- back to one failure list; urls are added there only when all retries are
exhausted.
- rename --numRetries -> --maxRetries / --retries for clarity (see the
example below)
- state load: allow retrying previously failed URLs if --maxRetries is
higher than on the previous run.
- ensure this works with --failOnFailedStatus; if provided, invalid status
codes (>= 400) are retried along with page load failures
- fixes #132
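An illustrative invocation using the renamed flag (the image name, URL, and retry count are placeholders):

```sh
# Retry each failed page up to 3 times before adding it to the failure list (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --maxRetries 3 --failOnFailedStatus
```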

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-06 18:48:40 -08:00
Ilya Kreymer
2e46140c3f
Make numRetries configurable (#754)
Add --numRetries param, default to 1 instead of 5.
2025-02-05 23:34:55 -08:00
Ilya Kreymer
b7150f1343
Autoclick Support (#729)
Adds support for autoclick behavior:
- Adds a new `autoclick` behavior option to `--behaviors`, not enabled by
default
- Adds support for a new exposed function `__bx_addSet` which allows the
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates; only used if a link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler to reject onbeforeunload page navigations
while in a behavior (page not finished), but accept them when the page is
finished, to allow navigating away only once behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize the elements that will be clicked,
defaulting to `a` (see the example below).
- Add --linkSelector as an alias for --selectLinks for consistency
- Unknown options for --behaviors are printed as warnings, instead of a hard
exit, for forward compatibility with new behavior types in the future
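An illustrative invocation enabling autoclick alongside the default behaviors (the behavior list and selector values are assumptions for illustration; only `autoclick` and `--clickSelector` come from this change):

```sh
# Enable autoclick in addition to the default behaviors, clicking buttons as well as links (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --behaviors autoclick,autoscroll,autoplay,autofetch,siteSpecific \
  --clickSelector "a,button"
```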

Fixes #728, also #216, #665, #31
2025-01-16 09:38:11 -08:00
Tessa Walsh
60c84b342e
Support loading custom behaviors from git repo (#717)
Fixes #712 
- Also expands the existing documentation about behaviors and adds a test.
- Uses the 'branch' and 'path' query args to specify the git branch and a
subpath within the repo, respectively (see the example below).
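An illustrative invocation (the repo URL, branch, and path are placeholders, and the exact URL form, including any `git+` prefix, is an assumption; see the behaviors docs for the authoritative syntax):

```sh
# Load custom behaviors from a subdirectory of a git repo branch (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --customBehaviors "git+https://github.com/example/custom-behaviors.git?branch=main&path=behaviors"
```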

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-11-13 22:50:33 -08:00
Ilya Kreymer
d04509639a
Support custom css selectors for extracting links (#689)
Support an array of selectors via the --selectLinks option, in the
form [css selector]->[property] or [css selector]->@[attribute].
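An illustrative invocation using that syntax (the second selector and attribute name are placeholders; `a[href]->href` is assumed here as the standard link extraction rule):

```sh
# Extract links from standard anchors and from a custom attribute on card divs (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --selectLinks "a[href]->href" --selectLinks "div.card->@data-link-url"
```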
2024-11-08 11:04:41 -05:00
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Tessa Walsh
e05d50d637
Add documentation for crawl collections (#695)
Fixes #675
2024-10-05 11:51:32 -07:00
Ilya Kreymer
9c9643c24f
crawler args typing (#680)
- Refactors args parsing so that `Crawler.params` is properly typed with
CLI options + additions via the `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
2024-09-05 18:10:27 -07:00
Ilya Kreymer
8934feaf70
SOCKS5 over SSH Tunnel Support (#671)
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using the `--proxyServer ssh://user@host[:port]` config,
passing an `--sshProxyPrivateKeyFile <private key file>` param, and
optionally an `--sshProxyKnownHostsFile <public host key file>` param. The
key files are expected to be mounted as volumes into the crawler.

- The same arguments are also available for create-login-profile

- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.

- Docs are updated to include a new 'Crawling with Proxies' page in the user
guide (an illustrative invocation follows below)

- Tests are updated to include crawling through an SSH proxy running locally.
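A rough sketch of such an invocation (the host, port, and key paths are placeholders):

```sh
# Crawl through a SOCKS5-over-SSH proxy, with key files mounted into the container (illustrative)
docker run -v $PWD/crawls:/crawls/ -v $PWD/keys:/keys/ \
  webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --proxyServer ssh://proxyuser@proxy.example.net:2222 \
  --sshProxyPrivateKeyFile /keys/id_ed25519 \
  --sshProxyKnownHostsFile /keys/known_hosts
```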
---------

Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
2024-08-28 18:47:24 -07:00
Tessa Walsh
39c8f48bb2
Disable behaviors entirely if --behaviors array is empty (#672)
Fixes #651
2024-08-27 13:20:19 -07:00
Henry Wilkinson
4c1da90d8f
Adds warning about crawling with basic auth (#669)
Closes https://github.com/webrecorder/browsertrix/issues/1950 over here
too

### Changes
- Adds a warning about using basic auth
- Adds a link to MDN because learning and cross referencing is fun!
2024-08-14 21:14:31 -07:00
Ilya Kreymer
48716c172d docs: regenerate cli options with ./docs/gen-cli.sh 2024-07-19 18:53:50 -07:00
benoit74
1099f4f3c8
Make it clear that profile argument can be an HTTP(S) URL (#649)
Small documentation enhancement to make it clear that a browser profile
can be passed as an HTTP(S) URL as well.
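An illustrative invocation (the profile URL is a placeholder; `--profile` also accepts a local path):

```sh
# Load a browser profile over HTTPS instead of from a mounted file (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --profile "https://storage.example.net/profiles/login-profile.tar.gz"
```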
2024-07-19 18:53:28 -07:00
Ilya Kreymer
3339374092
http auth support per seed (supersedes #566): (#616)
- parse URL username/password, store in the 'auth' field of the seed, or pass the 'auth' field directly (from the yaml config)
- add an 'Authorization' header with base64-encoded basic auth via setExtraHTTPHeaders()
- tests: add test for crawling with auth using http-server against a local docs build (docs are now built as part of CI)
- docs: add HTTP Auth to the YAML config section (see the sketch below)
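A minimal sketch of the two ways to supply credentials, with placeholder values (the exact 'auth' field format is an assumption; see the HTTP Auth docs section):

```yaml
# Option 1: credentials embedded directly in the seed URL (illustrative)
seeds:
  - https://user:password@example.com/members/
---
# Option 2: explicit 'auth' field on a seed (illustrative; field format assumed)
seeds:
  - url: https://example.com/members/
    auth: "user:password"
```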

---------
Co-authored-by: Ed Summers <ehs@pobox.com>
2024-06-20 16:35:30 -07:00
Ilya Kreymer
b83d1c58da
add --dryRun flag and mode (#594)
- if set, runs the crawl but doesn't store any archive data (WARCs,
WACZ, CDXJ); logs and pages are still written, and saved state can be
generated (per the --saveState options).
- adds a test to ensure only 'logs' and 'pages' dirs are generated with --dryRun
- screenshots and text extraction are skipped altogether in dryRun mode; a
warning is printed that storage and archiving-related options may be
ignored
- fixes #593
2024-06-07 10:34:19 -07:00
Ed Summers
2ef116d667
Mention command line options when restarting (#577)
It's probably worth reminding people that the command line options need
to be passed in again since the crawl state doesn't include them.

Refs #568
2024-05-21 10:57:50 -07:00
Ilya Kreymer
c71274d841
add STORE_REGION env var to be able to specify region (#565)
defaults to us-east-1 for minio compatibility
fixes #515
2024-05-12 12:42:04 -04:00
Ilya Kreymer
0201fef559 docs: fix typo 2024-04-18 17:19:13 -07:00
Tessa Walsh
75b617dc94
Add crawler QA docs (#551)
Fixes #550

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-04-18 16:18:22 -04:00
Tessa Walsh
1325cc3868
Gracefully handle non-absolute path for create-login-profile --filename (#521)
Fixes #513 

If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.

Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.
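An illustrative invocation (the login URL is a placeholder, and interactive/port-mapping details are omitted); with a relative --filename as below, the profile would be written under /crawls/profiles:

```sh
# Create a login profile; the relative filename is resolved within /crawls/profiles (illustrative)
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler create-login-profile \
  --url https://example.com/login \
  --filename my-profile.tar.gz
```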

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-29 13:46:54 -07:00
Ilya Kreymer
2059f2b6ae
add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)
but before running link extraction, text extraction, screenshots and
behaviors.

Useful for sites that load quickly but perform async loading / init
afterwards, fixes #519

A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
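An illustrative invocation (the URL and delay value are placeholders):

```sh
# Wait 20 extra seconds after page load before link/text extraction, screenshots, and behaviors (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://www.instagram.com/example/ --postLoadDelay 20
```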
2024-03-28 17:17:29 -07:00
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (duplicate Ctrl+C) to
still result in valid WARC records, even if a page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() the recorder, finishing active WARC records but not fetching
anything else
- flush() the existing open writer, mark it as done, and don't write anything else
- possible fix to additional issues raised in #487

Docs: Update docs on the different interrupt options, e.g. single SIGINT/SIGTERM,
multiple SIGINT/SIGTERM (as handled here), vs SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
Henry Wilkinson
5e2768ebcf
Docs homepage link fix
@tw4l Oops :\
2024-03-20 14:13:52 -04:00
Henry Wilkinson
3ec9d1b9e8
Update docs/docs/index.md
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 13:03:16 -04:00
Henry Wilkinson
0d26cf2619
Adds note about where to find Browsertrix — the cloud service 2024-03-20 12:41:29 -04:00
Henry Wilkinson
4b5ebb04f8
Fixes docs edit link 2024-03-20 12:34:29 -04:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the
given dates, also filtering nested sitemap lists (see the example below)
- async parsing, continuing to parse in the background after 100 URLs
- timeout for the initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark in redis if sitemaps have already been
parsed, serialize to the saved state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit.
- robots.txt `sitemap:` parsing, checking URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url is
provided - first check robots.txt, then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap
from a specific URL.
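An illustrative invocation, assuming the date filters are exposed as --sitemapFromDate / --sitemapToDate (the flag names and dates here are assumptions for illustration):

```sh
# Auto-detect the sitemap (robots.txt, then /sitemap.xml) and only queue URLs from 2024 (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --sitemap \
  --sitemapFromDate 2024-01-01 --sitemapToDate 2024-12-31
```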

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
Ilya Kreymer
8ea3bf8319 CNAME: keep CNAME in docs/docs for mkdocs 2024-03-16 15:24:54 -07:00
Tessa Walsh
e1fe028c7c
Add MkDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MkDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00