Commit graph

14 commits

Ed Summers
2ef116d667
Mention command line options when restarting (#577)
It's probably worth reminding people that the command line options need
to be passed in again since the crawl state doesn't include them.

Refs #568
2024-05-21 10:57:50 -07:00
Ilya Kreymer
c71274d841
add STORE_REGION env var to be able to specify region (#565)
defaults to us-east-1 for minio compatibility
fixes #515
2024-05-12 12:42:04 -04:00
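
As a rough illustration of how the new env var might be consumed (the endpoint/credential variable names and client setup below are assumptions for illustration, not the crawler's actual code), the region could be read with a us-east-1 fallback when constructing the storage client:

```ts
import { Client } from "minio";

// Sketch only: read STORE_REGION, defaulting to "us-east-1" for MinIO
// compatibility; the other env var names here are illustrative placeholders.
const region = process.env.STORE_REGION || "us-east-1";

const storageClient = new Client({
  endPoint: process.env.STORE_ENDPOINT_HOST || "localhost",
  port: 9000,
  useSSL: false,
  accessKey: process.env.STORE_ACCESS_KEY || "",
  secretKey: process.env.STORE_SECRET_KEY || "",
  region,
});
```
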
Ilya Kreymer
0201fef559 docs: fix typo 2024-04-18 17:19:13 -07:00
Tessa Walsh
75b617dc94
Add crawler QA docs (#551)
Fixes #550

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-04-18 16:18:22 -04:00
Tessa Walsh
1325cc3868
Gracefully handle non-absolute path for create-login-profile --filename (#521)
Fixes #513 

If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.

Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-29 13:46:54 -07:00
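
A minimal sketch of the path handling described above, using a hypothetical helper name (the actual entrypoint code may differ):

```ts
import path from "path";

// If --filename is not absolute, resolve it within /crawls/profiles;
// absolute paths are used as-is. Illustrative only.
function resolveProfileFilename(filename: string): string {
  return path.isAbsolute(filename)
    ? filename
    : path.resolve("/crawls/profiles", filename);
}

// e.g. resolveProfileFilename("profile.tar.gz") === "/crawls/profiles/profile.tar.gz"
```
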
Ilya Kreymer
2059f2b6ae
add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)
but before running link extraction, text extraction, screenshots, and
behaviors.

Useful for sites that load quickly but perform async loading / init
afterwards. Fixes #519

A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
2024-03-28 17:17:29 -07:00
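
A hedged sketch of how such a delay might be applied between page load and the post-load steps; the function name is an assumption, not the crawler's API:

```ts
// Wait the configured number of seconds after page load, before link/text
// extraction, screenshots, and behaviors run. Sketch only.
async function applyPostLoadDelay(postLoadDelaySecs: number): Promise<void> {
  if (postLoadDelaySecs > 0) {
    await new Promise((resolve) => setTimeout(resolve, postLoadDelaySecs * 1000));
  }
}
```
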
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (repeated Ctrl+C) to
still result in valid WARC records, even if a page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch
anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487 

Docs: Update docs on different interrupt options, e.g. a single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here), vs. SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
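
A rough sketch of the signal-handling shape described above, with every helper name an illustrative stand-in rather than the crawler's actual functions:

```ts
// First SIGINT/SIGTERM: stop gracefully. Repeated signals: exit the browser
// immediately but still finalize and flush WARC output so records stay valid.
// All names here are hypothetical placeholders.
type ShutdownHooks = {
  stopGracefully: () => void;
  closeBrowser: () => Promise<void>;
  closeWorkers: () => Promise<void>;
  finalizeRecorder: () => Promise<void>; // finish active WARC records, fetch nothing else
  flushWriter: () => Promise<void>; // flush open writer, mark done, write nothing else
};

function installInterruptHandlers(hooks: ShutdownHooks): void {
  let interrupts = 0;

  const onInterrupt = async () => {
    interrupts += 1;
    if (interrupts === 1) {
      // first interrupt: let in-progress pages finish, then exit
      hooks.stopGracefully();
      return;
    }
    // repeated interrupt: stop immediately, but close WARCs cleanly
    await hooks.closeBrowser();
    await hooks.closeWorkers();
    await hooks.finalizeRecorder();
    await hooks.flushWriter();
    process.exit(1);
  };

  process.on("SIGINT", onInterrupt);
  process.on("SIGTERM", onInterrupt);
}
```
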
Henry Wilkinson
5e2768ebcf
Docs homepage link fix
@tw4l Oops :\
2024-03-20 14:13:52 -04:00
Henry Wilkinson
3ec9d1b9e8
Update docs/docs/index.md
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 13:03:16 -04:00
Henry Wilkinson
0d26cf2619
Adds note about where to find Browsertrix — the cloud service 2024-03-20 12:41:29 -04:00
Henry Wilkinson
4b5ebb04f8
Fixes docs edit link 2024-03-20 12:34:29 -04:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the
given dates; nested sitemap lists are filtered as well
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark in redis if sitemaps have already been
parsed, serialize to saved state, to avoid reparsing. (Will reparse if
parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap URL is provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
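
A condensed sketch of the parsing approach (assumed names, simplified non-streaming fetch; the real parser streams and handles many more cases):

```ts
import sax from "sax";
import PQueue from "p-queue";

// Parse <loc> entries with a SAX stream: entries inside <url> become page
// URLs, entries inside <sitemap> are nested sitemaps queued for recursive
// parsing, at most 5 at a time. Illustrative only.
const queue = new PQueue({ concurrency: 5 });

async function parseSitemap(url: string, addPageUrl: (u: string) => void) {
  const xml = await (await fetch(url)).text();
  const stream = sax.createStream(false, { lowercase: true });

  let parentTag = "";
  let text = "";

  stream.on("opentag", (node) => {
    if (node.name === "url" || node.name === "sitemap") {
      parentTag = node.name;
    }
    text = "";
  });
  stream.on("text", (t) => {
    text += t;
  });
  stream.on("closetag", (name) => {
    if (name !== "loc") return;
    const loc = text.trim();
    if (parentTag === "sitemap") {
      // sitemap index entry: recurse, limited by the queue's concurrency
      queue.add(() => parseSitemap(loc, addPageUrl));
    } else if (parentTag === "url") {
      addPageUrl(loc);
    }
  });

  stream.write(xml);
  stream.end();
}
```
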
Ilya Kreymer
8ea3bf8319 CNAME: keep CNAME in docs/docs for mkdocs 2024-03-16 15:24:54 -07:00
Tessa Walsh
e1fe028c7c
Add MkDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MkDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00