Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Author	SHA1	Message	Date
Tessa Walsh	e05d50d637	Add documentation for crawl collections (#695) Fixes #675	2024-10-05 11:51:32 -07:00
Ilya Kreymer	9c9643c24f	crawler args typing (#680) - Refactors args parsing so that `Crawler.params` is properly typed with CLI options + additions with `CrawlerArgs` type. - also adds typing to create-login-profile CLI options - validation still done w/o typing due to yargs limitations - tests: exclude slow page from tests for faster test runs	2024-09-05 18:10:27 -07:00
Ilya Kreymer	8934feaf70	SOCKS5 over SSH Tunnel Support (#671) - Adds support for running a SOCKS5 proxy over an SSH connection. This can be configured by using `--proxyServer ssh://user@host[:port]` config and also passing an `--sshProxyPrivateKeyFile <private key file>` file param and an optional `--sshProxyKnownHostsFile <public host key file>` file param. The key files are expected to be mounted as volumes into the crawler. - Same arguments are also available for create-login-profile - The proxy config uses autossh to establish a more robust connection, and also waits until a connection can be established before proceeding. - Docs are updated to include a new 'Crawling with Proxies' page in the user guide - Tests are updated to include crawling through an SSH proxy running locally. --------- Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>	2024-08-28 18:47:24 -07:00
Tessa Walsh	39c8f48bb2	Disable behaviors entirely if --behaviors array is empty (#672) Fixes #651	2024-08-27 13:20:19 -07:00
Henry Wilkinson	4c1da90d8f	Adds warning about crawling with basic auth (#669) Closes https://github.com/webrecorder/browsertrix/issues/1950 over here too. Changes: - Adds a warning about using basic auth - Adds a link to MDN because learning and cross referencing is fun!	2024-08-14 21:14:31 -07:00
Ilya Kreymer	48716c172d	docs: regenerate cli options with ./docs/gen-cli.sh	2024-07-19 18:53:50 -07:00
benoit74	1099f4f3c8	Make it clear that profile argument can be an HTTP(S) URL (#649) Small documentation enhancement to make it clear that browser profile can be passed as HTTP(S) URL as well.	2024-07-19 18:53:28 -07:00
Ilya Kreymer	3339374092	http auth support per seed (supersedes #566): (#616) - parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config) - add 'Authorization' header with base64 encoded basic auth via setExtraHTTPHeaders() - tests: add test for crawling with auth using http-server using local docs build (now build docs as part of CI) - docs: add HTTP Auth to YAML config section --------- Co-authored-by: Ed Summers <ehs@pobox.com>	2024-06-20 16:35:30 -07:00
Ilya Kreymer	b83d1c58da	add --dryRun flag and mode (#594) - if set, runs the crawl but doesn't store any archive data (WARCs, WACZ, CDXJ) while logs and pages are still written, and saved state can be generated (per the --saveState options). - adds test to ensure only 'logs' and 'pages' dirs are generated with --dryRun - screenshots and text extraction are skipped altogether in dryRun mode, warning is printed that storage and archiving-related options may be ignored - fixes #593	2024-06-07 10:34:19 -07:00
Ed Summers	2ef116d667	Mention command line options when restarting (#577) It's probably worth reminding people that the command line options need to be passed in again since the crawl state doesn't include them. Refs #568	2024-05-21 10:57:50 -07:00
Ilya Kreymer	c71274d841	add STORE_REGION env var to be able to specify region (#565) defaults to us-east-1 for minio compatibility fixes #515	2024-05-12 12:42:04 -04:00
Ilya Kreymer	0201fef559	docs: fix typo	2024-04-18 17:19:13 -07:00
Tessa Walsh	75b617dc94	Add crawler QA docs (#551) Fixes #550 --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com> Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>	2024-04-18 16:18:22 -04:00
Tessa Walsh	1325cc3868	Gracefully handle non-absolute path for create-login-profile --filename (#521) Fixes #513 If an absolute path isn't provided to the `create-login-profile` entrypoint's `--filename` option, resolve the value given within `/crawls/profiles`. Also updates the docs cli-options section to include the `create-login-profile` entrypoint and adjusts the script to automatically generate this page accordingly. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-03-29 13:46:54 -07:00
Ilya Kreymer	2059f2b6ae	add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520) but before running link extraction, text extraction, screenshots and behaviors. Useful for sites that load quickly but perform async loading / init afterwards, fixes #519 A simple workaround for when it's tricky to detect when a page has actually fully loaded. Useful for sites such as Instagram.	2024-03-28 17:17:29 -07:00
Ilya Kreymer	93c3894d6f	improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504) The intent is for even non-graceful interruption (duplicate Ctrl+C) to still result in valid WARC records, even if page is unfinished: - immediately exit the browser, and call closeWorkers() - finalize() recorder, finish active WARC records but don't fetch anything else - flush() existing open writer, mark as done, don't write anything else - possible fix to additional issues raised in #487 Docs: Update docs on different interrupt options, e.g. single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here) vs SIGKILL --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-21 13:56:05 -07:00
Henry Wilkinson	5e2768ebcf	Docs homepage link fix @tw4l Oops :\	2024-03-20 14:13:52 -04:00
Henry Wilkinson	3ec9d1b9e8	Update docs/docs/index.md Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-20 13:03:16 -04:00
Henry Wilkinson	0d26cf2619	Adds note about where to find Browsertrix — the cloud service	2024-03-20 12:41:29 -04:00
Henry Wilkinson	4b5ebb04f8	Fixes docs edit link	2024-03-20 12:34:29 -04:00
Ilya Kreymer	56053534c5	SAX-based sitemap parser (#497) Adds a new SAX-based sitemap parser, inspired by: https://www.npmjs.com/package/sitemap-stream-parser Supports: - recursively parsing sitemap indexes, using p-queue to process N at a time (currently 5) - `fromDate` and `toDate` filter dates, to only include URLs between the given dates, filtering nested sitemap lists included - async parsing, continue parsing in the background after 100 URLs - timeout for initial fetch / first 100 URLs set to 30 seconds to avoid slowing down the crawl - save/load state integration: mark if sitemaps have already been parsed in redis, serialize to save state, to avoid reparsing. (Will reparse if parsing did not fully finish) - Aware of `pageLimit`, don't add URLs past the page limit, interrupt further parsing when at limit. - robots.txt `sitemap:` parsing, check URL extension and mime type - automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt, then /sitemap.xml - tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL. Fixes #496 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-18 19:14:07 -07:00
Ilya Kreymer	8ea3bf8319	CNAME: keep CNAME in docs/docs for mkdocs	2024-03-16 15:24:54 -07:00
Tessa Walsh	e1fe028c7c	Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494) Fixes #493 This PR updates the documentation for Browsertrix Crawler 1.0.0 and moves it from the project README to an MKDocs site. Initial docs site set to https://crawler.docs.browsertrix.com/ Many thanks to @Shrinks99 for help setting this up! --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-03-16 14:59:32 -07:00
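
Commit 3339374092 describes parsing a username and password out of a seed URL, storing them in an 'auth' field, and sending a base64-encoded Basic `Authorization` header. The following is a minimal sketch of that idea, not the crawler's actual code: `SeedAuth`, `parseSeed`, and `basicAuthHeader` are hypothetical names, and the example URL is made up.

```typescript
// Hypothetical sketch: the names below are illustrative and are not taken
// from the browsertrix-crawler source.
interface SeedAuth {
  url: string; // seed URL with any credentials stripped
  auth?: string; // "username:password", present only if the URL carried credentials
}

// Split embedded credentials out of a seed URL.
function parseSeed(seedUrl: string): SeedAuth {
  const parsed = new URL(seedUrl);
  if (!parsed.username && !parsed.password) {
    return { url: parsed.href };
  }
  // URL components are percent-encoded, so decode before joining.
  const auth =
    decodeURIComponent(parsed.username) +
    ":" +
    decodeURIComponent(parsed.password);
  parsed.username = "";
  parsed.password = "";
  return { url: parsed.href, auth };
}

// Build the extra header for HTTP Basic auth.
// Assumption: credentials are Latin-1, which is all btoa() accepts.
function basicAuthHeader(auth: string): Record<string, string> {
  return { Authorization: "Basic " + btoa(auth) };
}

const seed = parseSeed("https://user:pass@example.com/docs/");
const headers: Record<string, string> = seed.auth
  ? basicAuthHeader(seed.auth)
  : {};
```

A header object of this shape is what a call such as Puppeteer's `page.setExtraHTTPHeaders()` (the method named in the commit message) would accept. As commit 4c1da90d8f warns, crawling with basic auth deserves care.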

23 commits
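
Commit 1325cc3868 describes resolving a non-absolute `--filename` value within `/crawls/profiles`. A minimal sketch of that rule follows, assuming POSIX-style paths (the crawler runs in a container). `resolveProfileFilename` and `PROFILES_DIR` are hypothetical names, not identifiers from the project.

```typescript
// Hypothetical sketch of the path rule described in the commit message.
// Assumption: POSIX paths only; "." and ".." segments are not normalized.
const PROFILES_DIR = "/crawls/profiles";

function resolveProfileFilename(
  filename: string,
  baseDir: string = PROFILES_DIR,
): string {
  // Absolute paths are used as given.
  if (filename.startsWith("/")) {
    return filename;
  }
  // Relative paths are placed under the profiles directory.
  const base = baseDir.endsWith("/") ? baseDir.slice(0, -1) : baseDir;
  return base + "/" + filename;
}

const relative = resolveProfileFilename("profile.tar.gz");
const absolute = resolveProfileFilename("/tmp/profile.tar.gz");
```

Here `relative` ends up under the profiles directory while `absolute` is passed through unchanged. A fuller implementation would likely use Node's `path` module for normalization.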