The intent is for even a non-graceful interruption (repeated Ctrl+C) to
still result in valid WARC records, even if a page is unfinished (sketched below):
- immediately exit the browser, and call closeWorkers()
- finalize() the recorder, finishing active WARC records but not fetching
anything else
- flush() the existing open writer, mark it as done, and don't write anything else
- possibly fixes additional issues raised in #487
Docs: update docs on the different interrupt options, e.g. a single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here), vs. SIGKILL
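A minimal sketch of the escalation, assuming hypothetical gracefulShutdown()/forceFinalize() helpers rather than the crawler's actual method names:

```ts
// Sketch only: signal handling that escalates from graceful to immediate
// shutdown. gracefulShutdown() and forceFinalize() are placeholders for the
// crawler's own interrupt/finalize logic described above.
let interruptCount = 0;

async function gracefulShutdown() {
  // stop queuing new pages, let in-flight pages finish, then flush WARCs
}

async function forceFinalize() {
  // exit the browser and closeWorkers(), finalize() the recorder,
  // flush() the open WARC writer and mark it done
}

for (const sig of ["SIGINT", "SIGTERM"] as const) {
  process.on(sig, () => {
    interruptCount++;
    if (interruptCount === 1) {
      void gracefulShutdown();
    } else {
      void forceFinalize().finally(() => process.exit(1));
    }
  });
}
```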
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
This PR provides improved support for running the crawler as non-root,
matching the user to the uid/gid of the crawl volume.
This fixes #502, an initial regression from 0.12.4, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.
However, that was not enough to fully support signal handling / graceful
shutdown equivalent to running as the same user. To make the
different-user path work the same way:
- need to switch to `gosu` instead of `su` (added in the Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc.)
to avoid them being automatically killed via SIGINT/SIGTERM
- running detached is controlled via the `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)
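For illustration, a hedged sketch of the detached behavior (the helper name and exact spawn options are assumptions, not the crawler's code):

```ts
import { spawn, ChildProcess } from "node:child_process";

// Illustrative helper only: when DETACHED_CHILD_PROC=1, children such as
// redis-server or socat are spawned in their own process group, so a
// SIGINT/SIGTERM delivered to the crawler's group is not automatically
// forwarded to them and the crawler can shut them down in order itself.
function launchChild(cmd: string, args: string[] = []): ChildProcess {
  const detached = process.env.DETACHED_CHILD_PROC === "1";
  return spawn(cmd, args, { detached, stdio: "inherit" });
}

// e.g. launchChild("redis-server", ["--bind", "0.0.0.0", "--port", "6379"]);
```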
A test has been added which runs one of the existing tests
(saved-state.test.js) with a non-root `test-crawls` directory to exercise
the different-user path. That test includes sending interrupt signals and
graceful shutdown, allowing those features to be tested for a non-root
gosu execution.
Also bumping crawler version to 1.0.1
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser
Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5) - see the sketch below
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates, also applied when filtering nested sitemap lists
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark in Redis whether sitemaps have already
been parsed, and serialize that to the saved state to avoid reparsing.
(Will reparse if parsing did not fully finish)
- aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit
- robots.txt `sitemap:` parsing, checking URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap URL is provided - first check robots.txt,
then /sitemap.xml
- tests: tests for full sitemap autodetect, sitemap with limit, and sitemap from a specific URL
Fixes #496
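A rough sketch of the parsing approach using `sax` and `p-queue` (a simplified illustration, not the actual SitemapReader implementation):

```ts
import sax from "sax";
import PQueue from "p-queue";
import { Readable } from "node:stream";

// Simplified sketch: stream-parse a sitemap with sax, emitting page URLs and
// queuing nested sitemaps (from a sitemap index) through a p-queue limited to
// 5 concurrent fetches. Date filtering, limits, and saved state are omitted.
const queue = new PQueue({ concurrency: 5 });

async function parseSitemap(url: string, onUrl: (u: string) => void): Promise<void> {
  const resp = await fetch(url);
  if (!resp.ok || !resp.body) {
    return;
  }

  const parser = sax.createStream(true);
  let inSitemapEntry = false;
  let text = "";

  parser.on("opentag", (node) => {
    if (node.name === "sitemap") inSitemapEntry = true;
    text = "";
  });
  parser.on("text", (t) => { text += t; });
  parser.on("closetag", (name) => {
    if (name === "sitemap") inSitemapEntry = false;
    if (name === "loc") {
      const loc = text.trim();
      if (inSitemapEntry) {
        // nested sitemap from a sitemap index: parse it later, N at a time
        void queue.add(() => parseSitemap(loc, onUrl));
      } else {
        onUrl(loc);
      }
    }
  });

  await new Promise<void>((resolve, reject) => {
    parser.on("end", resolve);
    parser.on("error", reject);
    Readable.fromWeb(resp.body as any).pipe(parser);
  });
}
```

Awaiting `queue.onIdle()` afterwards would wait for all nested sitemaps to finish, which is what allows parsing to continue in the background while the crawl proceeds.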
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #493
This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MkDocs site.
The initial docs site is set up at https://crawler.docs.browsertrix.com/
Many thanks to @Shrinks99 for help setting this up!
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.
Fixes #491
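Schematically, the 'finished' computation looks like the following (Redis key names and client calls are assumptions about the layout, not the exact schema):

```ts
import Redis from "ioredis";

// Illustration only: derive the 'finished' list from the seen set minus URLs
// that are still queued or have failed, so saved state no longer loses done URLs.
async function computeFinished(redis: Redis, prefix: string): Promise<string[]> {
  const seen = await redis.smembers(`${prefix}:seen`);                   // all URLs ever added
  const queued = new Set(await redis.zrange(`${prefix}:queue`, 0, -1));  // still pending
  const failed = new Set(await redis.smembers(`${prefix}:failed`));      // gave up
  return seen.filter((url) => !queued.has(url) && !failed.has(url));
}
```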
- run crawl with 3 pages, text/screenshots enabled
- run qa crawl using resulting WACZ
- enable writing pages to redis
- verify comparison data is included in the page data added to the Redis ':pages' key
while the crawl is running
- add an overridable pageEntryForRedis() in replaycrawler to add 'comparison' data (sketched below)
- add a separate type for ComparisonData
- add comparison data in processPageInfo, if page state is available
- additional type fixes
- remove --qaWriteToRedis, now included with page data
- ensure the original pageid is used for QA'd pages
- use the standard ':qa' key to write QA comparison data with --qaWriteToRedis
- print crawl stats in qa
- include title + favicons in qa
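A hedged sketch of the shapes involved (field and class names below are assumptions based on this description, not the exact ComparisonData type or replaycrawler code):

```ts
// Illustration only: the replay (QA) crawler overrides pageEntryForRedis() so
// each entry pushed to the ':pages' key carries a 'comparison' block alongside
// the usual page fields. Field names here are assumed for the sketch.
type ComparisonData = {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts?: Record<string, number>;
};

type PageEntry = Record<string, unknown>;

class Crawler {
  protected pageEntryForRedis(entry: PageEntry): PageEntry {
    return entry;
  }
}

class ReplayCrawler extends Crawler {
  private comparisons = new Map<string, ComparisonData>();

  protected override pageEntryForRedis(entry: PageEntry): PageEntry {
    // attach QA comparison results, keyed by the original pageid
    const comparison = this.comparisons.get(entry.id as string);
    return comparison ? { ...entry, comparison } : entry;
  }
}
```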
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext
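A small sketch of how the include/exclude lists interact (the real Logger has its own option handling; this only illustrates the filtering rule):

```ts
// Illustration of the filtering rule: --logContext acts as an allow-list when
// given, --logExcludeContext removes contexts, and a default exclude list
// (screencast, recorderNetwork, jsErrors) applies unless overridden.
const DEFAULT_EXCLUDE_CONTEXTS = ["screencast", "recorderNetwork", "jsErrors"];

function shouldLogContext(
  context: string,
  include: string[] = [],
  exclude: string[] = DEFAULT_EXCLUDE_CONTEXTS,
): boolean {
  if (include.length && !include.includes(context)) {
    return false;
  }
  return !exclude.includes(context);
}
```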
recorder: track failed requests and include them in pageinfo records with
status code 0
- clean up CDP handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() to still work if no URL is provided (e.g. checking only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data is
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)
tests: add a pageinfo test for crawling an invalid URL, which should still
result in a pageinfo record with status code 0
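A simplified sketch of the request tracking (function and variable names are assumptions; the CDP events are the ones named above):

```ts
import type { CDPSession, Protocol } from "puppeteer-core";

// Simplified illustration: record every request that starts, fill in the
// status when a response arrives, and mark loading failures (plus anything
// still pending when the page closes) with status 0 and the error text.
type RequestInfo = { url: string; status: number; error?: string };

function trackPageRequests(cdp: CDPSession): Map<string, RequestInfo> {
  const requests = new Map<string, RequestInfo>();

  cdp.on("Network.requestWillBeSent", (params: Protocol.Network.RequestWillBeSentEvent) => {
    requests.set(params.requestId, { url: params.request.url, status: 0 });
  });

  cdp.on("Network.responseReceived", (params: Protocol.Network.ResponseReceivedEvent) => {
    const req = requests.get(params.requestId);
    if (req) {
      req.status = params.response.status;
    }
  });

  cdp.on("Network.loadingFailed", (params: Protocol.Network.LoadingFailedEvent) => {
    const req = requests.get(params.requestId);
    if (req) {
      req.status = 0;
      req.error = params.errorText;
    }
  });

  return requests;
}
```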
bump to 1.0.0-beta.7
Add a fail-on-status-code option, --failOnInvalidStatus, to treat non-200
responses as failures. Can be especially useful when combined with
--failOnFailedSeed or --failOnFailedLimit
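Schematically, assuming a helper along these lines (names are hypothetical):

```ts
// Illustration: with --failOnInvalidStatus, a non-200 page response is treated
// like a failed load, so it also counts toward --failOnFailedSeed /
// --failOnFailedLimit handling. Status 0 stands in for a request that never
// completed at all.
function isPageLoadFailure(status: number, failOnInvalidStatus: boolean): boolean {
  return status === 0 || (failOnInvalidStatus && status !== 200);
}
```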
requeue: ensure requeued URLs keep their original depth/priority instead of
being reset to 0
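A minimal before/after illustration (the real queue lives in Redis; the entry shape here is an assumption):

```ts
// Illustration only: when a failed page is put back on the queue for retry,
// keep its original depth and priority rather than resetting them to 0.
interface QueueEntry {
  url: string;
  depth: number;
  priority: number;
  retry: number;
}

function requeueEntry(entry: QueueEntry): QueueEntry {
  // before the fix, the equivalent of { ...entry, depth: 0, priority: 0 }
  // was pushed back, losing the original ordering
  return { ...entry, retry: entry.retry + 1 };
}
```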