Commit graph

331 commits

Author SHA1 Message Date
Ilya Kreymer
f2fa0f8de0 cleanup 2024-03-22 21:50:54 -07:00
Ilya Kreymer
50a771cc68 Merge branch 'unify-warc-writer' into use-js-wacz 2024-03-22 21:49:15 -07:00
Ilya Kreymer
750d51aede fix screenshots path, disable tempcdx still 2024-03-22 21:44:55 -07:00
Ilya Kreymer
adbcf76502 remove warcresourcewriter
unify warc-writing into a single WARCWriter class to support CDX indexing for all records
create dedicated writers for screenshots and text
2024-03-22 21:08:29 -07:00
Ilya Kreymer
3e76568113 Merge branch 'main' into use-js-wacz 2024-03-22 18:04:28 -07:00
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to the WACZ or multi-WACZ JSON that will be replayed/QA'd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay / diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
  ```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
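A hedged sketch of how the `comparison` object above might be assembled from per-resource HTTP statuses. Only the field names come from the commit message; `buildComparison`, its inputs, and the good/bad threshold are illustrative.

```typescript
interface ResourceCounts {
  crawlGood?: number;
  crawlBad?: number;
  replayGood?: number;
  replayBad?: number;
}

interface Comparison {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts: ResourceCounts;
}

// Illustrative helper: counts 2xx/3xx as good, everything else
// (including status 0 for failed fetches) as bad.
function buildComparison(
  crawlStatuses: number[],
  replayStatuses: number[],
  screenshotMatch?: number,
  textMatch?: number,
): Comparison {
  const good = (s: number) => s >= 200 && s < 400;
  return {
    screenshotMatch,
    textMatch,
    resourceCounts: {
      crawlGood: crawlStatuses.filter(good).length,
      crawlBad: crawlStatuses.filter((s) => !good(s)).length,
      replayGood: replayStatuses.filter(good).length,
      replayBad: replayStatuses.filter((s) => !good(s)).length,
    },
  };
}
```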
Tessa Walsh
d5e5976b6f Switch js-wacz dependency to ^0.1.0 2024-03-22 16:49:48 -04:00
Ilya Kreymer
22a7351dc7
service worker capture fix: disable by default for now (#506)
Due to issues with capturing top-level pages, make bypassing service
workers the default for now. Previously, it was only disabled when using
profiles. (This is also consistent with ArchiveWeb.page behavior).
Includes:
- add --serviceWorker option which can be `disabled`,
`disabled-if-profile` (previous default) and `enabled`
- ensure page timestamp is set for direct fetch
- warn if page timestamp is missing on serialization, then set to now
before serializing

bump version to 1.0.2
2024-03-22 13:37:14 -07:00
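The three `--serviceWorker` modes above can be sketched as a single bypass decision; `shouldBypassServiceWorkers` is a hypothetical helper for illustration, not the crawler's actual function.

```typescript
type ServiceWorkerOpt = "disabled" | "disabled-if-profile" | "enabled";

// Map each mode to a bypass decision given whether a profile is in use.
const bypassRules: Record<ServiceWorkerOpt, (usingProfile: boolean) => boolean> = {
  disabled: () => true, // new default: always bypass service workers
  "disabled-if-profile": (usingProfile) => usingProfile, // previous default
  enabled: () => false, // never bypass
};

function shouldBypassServiceWorkers(
  opt: ServiceWorkerOpt,
  usingProfile: boolean,
): boolean {
  return bypassRules[opt](usingProfile);
}
```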
Tessa Walsh
a5d36ce1ad Generate CDX with warcio CDXIndexer 2024-03-22 16:35:56 -04:00
Tessa Walsh
82169feffe Fix typo 2024-03-22 16:35:56 -04:00
Tessa Walsh
118ffb0327 Fix extra hops test 2024-03-22 16:35:56 -04:00
Tessa Walsh
84c1ef2098 Fix custom driver test to account for extraPages 2024-03-22 16:35:56 -04:00
Tessa Walsh
97b1069f30 Fix extra hops test to account for extraPages 2024-03-22 16:35:56 -04:00
Tessa Walsh
13b6385a14 Temporarily comment out validation tests using py-wacz 2024-03-22 16:35:56 -04:00
Tessa Walsh
c68d117692 Add WACZLogger class for use with js-wacz 2024-03-22 16:35:56 -04:00
Tessa Walsh
952cd75a66 Wait until after WACZ generation to delete tmp-cdx 2024-03-22 09:21:25 -04:00
Ilya Kreymer
1595b3595d fix tests? 2024-03-21 20:08:16 -07:00
Ilya Kreymer
c6723b007f replace generateCDX with just moving files from tmp-cdx 2024-03-21 19:55:58 -07:00
Ilya Kreymer
a457a5e079 switch to using js-wacz natively for wacz creation!
remove python dependencies
2024-03-21 19:43:54 -07:00
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (duplicate Ctrl+C) to
still result in valid WARC records, even if page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch
anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487 

Docs: Update docs on different interrupt options, e.g. single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here) vs SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
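The escalating interrupt behavior described above can be sketched as follows; `interruptMode` and the mode names are illustrative, not the crawler's actual API.

```typescript
type InterruptMode = "running" | "graceful" | "finalize-warcs";

function interruptMode(signalCount: number): InterruptMode {
  if (signalCount === 0) return "running"; // no interrupt received yet
  if (signalCount === 1) return "graceful"; // finish current pages, then stop
  // second Ctrl+C: exit browser, close workers, finalize and flush open
  // WARC writers, fetch nothing else - records stay valid even mid-page
  return "finalize-warcs";
}

let signalCount = 0;
for (const sig of ["SIGINT", "SIGTERM"] as const) {
  process.on(sig, () => {
    signalCount += 1;
    console.log(`received ${sig}, mode: ${interruptMode(signalCount)}`);
  });
}
```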
Ilya Kreymer
1fe810b1df
Improved support for running as non-root (#503)
This PR provides improved support for running crawler as non-root,
matching the user to the uid/gid of the crawl volume.

This fixes #502 initial regression from 0.12.4, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.

However, that was not enough to fully support equivalent signal handling
/ graceful shutdown as when running with the same user. To make the
running as different user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc..)
to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)

A test has been added which runs one of the tests with a non-root
`test-crawls` directory to test the different user path. The test
(saved-state.test.js) includes sending interrupt signals and graceful
shutdown and allows testing of those features for a non-root gosu
execution.

Also bumping crawler version to 1.0.1
2024-03-21 08:16:59 -07:00
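The detached-child pattern above, controlled by `DETACHED_CHILD_PROC=1`, might look roughly like this; `detachedFromEnv` and `launchChild` are made-up names for illustration.

```typescript
import { spawn } from "node:child_process";

// When DETACHED_CHILD_PROC=1 (the Dockerfile default), children such as
// redis-server and socat are spawned detached, so a SIGINT/SIGTERM sent
// to the foreground process group does not kill them automatically.
function detachedFromEnv(
  env: Record<string, string | undefined> = process.env,
): boolean {
  return env.DETACHED_CHILD_PROC === "1";
}

function launchChild(cmd: string, args: string[]) {
  return spawn(cmd, args, { detached: detachedFromEnv(), stdio: "inherit" });
}
```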
Henry Wilkinson
5e2768ebcf
Docs homepage link fix
@tw4l Oops :\
2024-03-20 14:13:52 -04:00
Henry Wilkinson
79e39ae2f0
Merge pull request #501 from webrecorder/docs-minor-fixes
Docs: Minor fixes to edit link & clarifications
2024-03-20 13:04:12 -04:00
Henry Wilkinson
3ec9d1b9e8
Update docs/docs/index.md
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 13:03:16 -04:00
Henry Wilkinson
0d26cf2619
Adds note about where to find Browsertrix — the cloud service 2024-03-20 12:41:29 -04:00
Henry Wilkinson
4b5ebb04f8
Fixes docs edit link 2024-03-20 12:34:29 -04:00
Ilya Kreymer
9a2ada3461 version: bump to 1.0.0 2024-03-18 19:15:35 -07:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates; the filtering also applies to nested sitemap lists
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
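A simplified sketch of the `fromDate`/`toDate` filtering described above. The real implementation is a streaming SAX parser; this operates on already-extracted `<url>` entries, and the names are illustrative.

```typescript
interface SitemapEntry {
  url: string;
  lastmod?: string; // ISO date from the sitemap's <lastmod> element
}

// Keep only URLs whose lastmod falls within [fromDate, toDate];
// entries without a lastmod are included by default.
function filterByDate(
  entries: SitemapEntry[],
  fromDate?: Date,
  toDate?: Date,
): string[] {
  return entries
    .filter(({ lastmod }) => {
      if (!lastmod) return true;
      const ts = new Date(lastmod).getTime();
      if (fromDate && ts < fromDate.getTime()) return false;
      if (toDate && ts > toDate.getTime()) return false;
      return true;
    })
    .map(({ url }) => url);
}
```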
Ilya Kreymer
5060e6b0b1
profiles: handle terminate signals directly (#500)
- add our own signal handling to create-login-profile to ensure fast
exit in k8s
- print crawler version info string on startup
2024-03-18 17:24:48 -04:00
Tessa Walsh
4d64eedcd3
Temporarily disable tmp-cdx creation (#499)
Fixes #498 

To revert after 1.0.0 when we make changes that allow for using the temp
CDX in WACZ creation.
2024-03-18 14:03:34 -07:00
Ilya Kreymer
f96c6a13dc version: bump to 1.0.0-beta.8 2024-03-16 15:32:19 -07:00
Ilya Kreymer
8ea3bf8319 CNAME: keep CNAME in docs/docs for mkdocs 2024-03-16 15:24:54 -07:00
Tessa Walsh
e1fe028c7c
Add MkDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MkDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00
Ilya Kreymer
6d04c9575f
Fix Save/Load State (#495)
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491
2024-03-15 20:54:43 -04:00
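The 'finished' list computation described above (seen minus failed and still-queued URLs) can be sketched as a small set operation; `computeFinished` is an illustrative name.

```typescript
// Derive the finished list from the seen set by removing URLs that
// failed or are still queued, per the serialization fix above.
function computeFinished(
  seen: Set<string>,
  failed: Set<string>,
  queued: Set<string>,
): string[] {
  return [...seen].filter((url) => !failed.has(url) && !queued.has(url));
}
```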
Ilya Kreymer
fa37f62c86
Additional type fixes, follow-up to #488 (#489)
More type safety (keep using WorkerOpts when needed)
follow-up to changes in #488
2024-03-08 12:52:30 -08:00
Ilya Kreymer
3b6c11d77b
page state type fixes: (#488)
- ensure pageid is always initialized for pagestate
- remove generic any from PageState
- use WorkerState instead of internal WorkerOpts
2024-03-08 11:05:26 -08:00
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() still working if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
Ilya Kreymer
65133c9d9d
resourceType lowercase fix: (#483)
follow-up to #481: check reqresp.resourceType against the lowercase value,
and set the message based on the resourceType value
2024-03-04 23:58:39 -08:00
Ilya Kreymer
63cedbc91a version: bump to 1.0.0-beta.6 2024-03-04 18:11:28 -08:00
Ilya Kreymer
5a47cc4b41
warc: add Network.resourceType (https://chromedevtools.github.io/devt… (#481)
Add the resourceType value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as a `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention
fixes #451
2024-03-04 18:10:45 -08:00
Ilya Kreymer
4520e9e96f
Fail on status code option + requeue fix (#480)
Add a fail-on-status-code option, `--failOnInvalidStatus`, to treat non-200
responses as failures. Can be especially useful when combined with
`--failOnFailedSeed` or `--failOnFailedLimit`

requeue: ensure requeued urls are requeued with same depth/priority, not
0
2024-03-04 17:21:44 -08:00
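The requeue fix above can be sketched as: a retried URL keeps its original depth and priority rather than being reset to 0. The `QueueEntry` shape and `requeued` helper are hypothetical.

```typescript
interface QueueEntry {
  url: string;
  depth: number;
  priority: number;
  retries: number;
}

// Re-create a queue entry for retry, preserving depth and priority
// (previously these were reset to 0 on requeue).
function requeued(entry: QueueEntry): QueueEntry {
  return { ...entry, retries: entry.retries + 1 };
}
```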
Ilya Kreymer
dd78457b2b version: bump to 1.0.0-beta.5 2024-02-28 22:57:05 -08:00
Ilya Kreymer
184f4a2395
Ensure links added via behaviors also get processed (#478)
Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors
0.5.3, which will add support for behaviors to add links.

Simplify adding links by adding them directly, instead of
batching into groups of 500. Errors are already logged if queueing a new
URL fails.
2024-02-28 22:56:32 -08:00
Ilya Kreymer
c348de270f
store page statusCode if not 200 (#477)
don't treat non-200 pages as errors; still extract text, take
screenshots, and run behaviors
only consider actual page load errors, e.g. a chrome-error:// page URL, as
errors
2024-02-28 22:56:12 -08:00
Ilya Kreymer
fba4730d88
new seed on redirect + error page check: (#476)
- if a seed page redirects (final page URL != seed URL), then add the
final URL as a new seed with the same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL,
store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure
page is marked as failed if page.url() starts with chrome-error://
- fixes #475
2024-02-28 11:31:59 -08:00
Ilya Kreymer
dd48251b39
Include WARC prefix for screenshots and text WARCs (#473)
Ensure the env var / cli <warc prefix>-<crawlId> is also applied to
`screenshots.warc.gz` and `text.warc.gz`
2024-02-27 23:33:34 -08:00
Ilya Kreymer
cdd047d15e
warcwriter: better filehandle init on first use (#474)
Ensure the warcwriter file is initialized on first use, instead of throwing an error
- it was initialized from writeRecordPair() but not writeSingleRecord()
2024-02-23 21:35:55 -08:00
Ilya Kreymer
d36564e0b0 typo: remove extra console.log 2024-02-22 16:13:50 -08:00
Ilya Kreymer
51660cdcc4
pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471)
Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
2024-02-21 16:02:25 -08:00
Ilya Kreymer
a5e939567c
Set warc prefix via WARC_PREFIX env var (#470)
In addition to `--warcPrefix` flag, also support WARC_PREFIX env var,
which takes precedence.
Bump to 1.0.0-beta.4
2024-02-21 11:30:28 -08:00
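The precedence rule above (WARC_PREFIX env var wins over the `--warcPrefix` flag when both are set) can be sketched in one helper; `resolveWarcPrefix` is a made-up name for illustration.

```typescript
// Env var takes precedence over the CLI flag, per the commit above.
function resolveWarcPrefix(
  cliPrefix: string | undefined,
  env: Record<string, string | undefined> = process.env,
): string | undefined {
  return env.WARC_PREFIX || cliPrefix;
}
```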