Commit graph

42 commits

Ilya Kreymer
089d901b9b
Always add warcinfo records to all WARCs (#556)
Fixes #553 

Includes `warcinfo` records at the beginning of new WARCs, as well as
the combined WARC.
Makes the warcinfo record also WARC/1.1 to match the rest of the WARC
records.
2024-05-22 15:47:05 -07:00
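For context on the change above: a `warcinfo` record at the head of a WARC/1.1 file looks roughly like this (a sketch based on the WARC 1.1 format; the exact fields and values browsertrix-crawler writes may differ, and lengths/UUIDs are elided):

```
WARC/1.1
WARC-Type: warcinfo
WARC-Record-ID: <urn:uuid:...>
WARC-Date: 2024-05-22T22:47:05Z
Content-Type: application/warc-fields
Content-Length: ...

software: Browsertrix-Crawler
format: WARC File Format 1.1
```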
Tessa Walsh
1fcd3b7d6b
Fix failOnFailedLimit and add tests (#580)
Fixes #575

- Adds a missing await to fetching the number of failed pages from Redis
- Fixes a typo in the fatal logging message
- Adds a test to ensure that the crawl fails with exit code 17 if
--failOnInvalidStatus and --failOnFailedLimit 1 are set with a URL that
will 404
2024-05-21 16:35:43 -07:00
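The missing-await bug fixed above is a common pitfall: comparing an un-awaited Promise to a number always yields false, so the limit never trips. A minimal sketch (function names here are illustrative, not the crawler's actual API):

```typescript
// stand-in for fetching the failed-page count from Redis
async function getFailedCount(): Promise<number> {
  return 5;
}

// buggy: missing await — `count` is a Promise, so `count >= limit`
// coerces to NaN >= limit, which is always false
function shouldFailBuggy(limit: number): boolean {
  const count = getFailedCount() as unknown as number; // cast mimics the untyped bug
  return count >= limit;
}

// fixed: await the value before comparing
async function shouldFailFixed(limit: number): Promise<boolean> {
  const count = await getFailedCount();
  return count >= limit;
}
```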
Ilya Kreymer
6b04a39f2f
save state: export pending list as array of json strings + fix importing save state to support pending (#576)
The save state export accidentally exported the pending data as an
object instead of a list of JSON strings (as it is stored in Redis),
while import expected a list of JSON strings. The getPendingList()
function parses the JSON, but then was re-encoding it for writeStats().
This was likely a mistake.
This PR fixes things:
- support loading pending state as both an array of objects and an array
of JSON strings, for backwards compatibility
- save state as an array of JSON strings
- remove JSON decoding and encoding in getPendingList() and writeStats()

Fixes #568
2024-05-21 10:58:35 -07:00
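The backwards-compatible loading described above can be sketched as follows — accept pending entries either as JSON strings (matching how they are stored in Redis) or as already-parsed objects from the old export format; `PendingEntry`, `loadPending`, and `savePending` are illustrative names:

```typescript
type PendingEntry = { url: string; added?: string };

// accept both the fixed format (JSON strings) and the old 1.x export
// format (already-parsed objects)
function loadPending(pending: (string | PendingEntry)[]): PendingEntry[] {
  return pending.map((entry) =>
    typeof entry === "string" ? (JSON.parse(entry) as PendingEntry) : entry,
  );
}

// always export back out as JSON strings, as stored in Redis
function savePending(pending: PendingEntry[]): string[] {
  return pending.map((entry) => JSON.stringify(entry));
}
```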
Tessa Walsh
8318039ae3
Fix regressions with failOnFailedSeed option (#572)
Fixes #563 

This PR makes a few changes to fix a regression in behavior around
`failOnFailedSeed` for the 1.x releases:

- Fail with exit code 1, not 17, when pages are unreachable due to DNS
not resolving or other network errors if the page is a seed and
`failOnFailedSeed` is set
- Extend tests, add test to ensure crawl succeeds on 404 seed status
code if `failOnInvalidStatus` isn't set
2024-05-15 11:02:33 -07:00
Ilya Kreymer
10f6414f2f
PDF loading status code fix (#571)
when loading a PDF as a page, the browser returns a 'false positive'
net::ERR_ABORTED even though the PDF is loaded.
- this is already handled, but the status code was still being cleared;
ensure the status code is not reset to 0 on response
- ensure page status and mime are also recorded if this failure is
ignored (in shouldIgnoreAbort)
- tests: add test for PDF capture

fixes #570
2024-05-14 15:26:06 -07:00
Ilya Kreymer
d2fbe7344f
Skip Checking Empty Frame + eval timeout (#564)
Don't run frame.evaluate() on an empty frame, also add a timeout just in
case to frame.evaluate().
2024-05-09 11:05:33 +02:00
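A guard timeout around `frame.evaluate()` can be sketched with `Promise.race`; `timedRun` below is an illustrative helper, not necessarily the crawler's own timing utility:

```typescript
// race a promise against a timeout, cleaning up the timer either way
function timedRun<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), timeoutMs);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

The real call would then be roughly `await timedRun(frame.evaluate(...), timeoutMs)`, after first checking that the frame is non-empty.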
Ilya Kreymer
15d2b09757
warcinfo: fix version to 1.1 to avoid confusion (part of #553) (#557)
Ensure warcinfo record is also WARC/1.1
2024-04-18 21:52:24 -07:00
Ilya Kreymer
51d82598e7
Support site-specific wait via browsertrix-behaviors (#555)
The 0.6.0 release of Browsertrix Behaviors /
webrecorder/browsertrix-behaviors#70 introduces support for site-specific
behaviors to implement an `awaitPageLoad()` function, which allows
waiting for specific resources on page load.
- This PR just adds a call to this function directly after page load.
- Factors out into an `awaitPageLoad()` method used in both crawler and replaycrawler to support the same wait in QA Mode
- This is to support custom loading wait time for Instagram (other sites in the future)
2024-04-18 17:16:57 -07:00
Tessa Walsh
efebc331ee
Set mime type for html pages (#545)
Fixes #544 

As long as the response has a content-type header, we should use it to
set MIME type for the page.
2024-04-15 14:04:30 -07:00
Ilya Kreymer
f6edec0b95
Fix for --rolloverSize for individual WARCs in 1.x (#542)
Fixes #533 

Fixes rollover in WARCWriter, separate from combined WARC rollover size:
- check rolloverSize and close previous WARCs when size exceeds
- add timestamp to resource WARC filenames to support rollover, eg.
screenshots-{ts}.warc.gz
- use append mode for all write streams, just in case
- tests: add test for rollover of individual WARCs with 500K size limit
- tests: update screenshot tests to account for WARCs now being named
screenshots-{ts}.warc.gz instead of just screenshots.warc.gz
2024-04-15 13:43:08 -07:00
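The rollover logic above amounts to a size check plus a timestamped filename; a minimal sketch with hypothetical helper names:

```typescript
// e.g. screenshots-{ts}.warc.gz, so a rolled-over file never clobbers
// the previous one
function timestampedName(prefix: string, ts: string, gzip = true): string {
  return `${prefix}-${ts}.warc${gzip ? ".gz" : ""}`;
}

// roll over once the bytes written to the current WARC reach the limit
// (rolloverSize <= 0 treated as "no rollover" in this sketch)
function shouldRollover(bytesWritten: number, rolloverSize: number): boolean {
  return rolloverSize > 0 && bytesWritten >= rolloverSize;
}
```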
Ilya Kreymer
16671cb610
qa: filter out non-html pages (#541)
Fixes #540 

Also ensure mime type is set on page for non-html pages when loaded
through browser, already being set for direct fetch path.
2024-04-12 16:21:50 -07:00
Ilya Kreymer
8d4e9ca2dc
Better logging of all queue WARCWriter operations (#536)
WARCWriter operations result in a write promise being put on a queue
and handled one at a time. This change wraps that promise in an async
function that awaits the actual write and logs any rejections.
- If additional log details are provided, successful writes are also
logged for now, including success logging for resource records (text,
screenshot, pageinfo)
- screenshot / text / pageinfo use the appropriate logcontext for the resource for better log filtering
2024-04-12 14:31:07 -07:00
Ilya Kreymer
b5f3238c29
Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz (#535)
Cherry-picked from the use-js-wacz branch, now implementing separate
writing of pages.jsonl / extraPages.jsonl to be used with py-wacz and
new `--copy-page-files` flag.

Dependent on py-wacz 0.5.0 (via webrecorder/py-wacz#43)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-11 13:55:52 -07:00
Ilya Kreymer
c247189474
qa/replay crawl loading improvements (#526)
- use frame.load() to load RWP frame directly instead of waiting for
navigation messages
- retry loading RWP if replay frame is missing
- support --postLoadDelay in replay crawl
- support --include / --exclude options in replay crawler, allow
excluding and including pages to QA via regex
- improve --qaDebugImageDiff debug image saving, save images to same
dir, using ${counter}-${workerid}-${pageid}-{crawl,replay,vdiff}.png for
better sorting
- when running QA crawl, check and use QA_ARGS instead of CRAWL_ARGS if
provided
- ensure empty string text from page is treated differently from an error (undefined)
- ensure info.warc.gz is closed in closeFiles()

misc:
- fix typo in --postLoadDelay check!
- enable 'startEarly' mode for behaviors (autofetch, autoplay)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-04-04 13:05:24 -07:00
Ilya Kreymer
2059f2b6ae
add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)
but before running link extraction, text extraction, screenshots and
behaviors.

Useful for sites that load quickly but perform async loading / init
afterwards, fixes #519

A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
2024-03-28 17:17:29 -07:00
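Conceptually, `--postLoadDelay` is a fixed sleep between page load and the post-load work (link/text extraction, screenshots, behaviors); a sketch under that assumption, with illustrative helper names:

```typescript
function sleep(seconds: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, seconds * 1000));
}

// wait the configured delay (if any) before running post-load steps
async function afterPageLoad(
  postLoadDelay: number,
  runPostLoadSteps: () => Promise<void>,
): Promise<void> {
  if (postLoadDelay > 0) {
    await sleep(postLoadDelay);
  }
  await runPostLoadSteps();
}
```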
Ilya Kreymer
ea098b6daf
avoid cloudflare detection of puppeteer when using browser profiles: (#518)
- filter out 'other' / no url targets from puppeteer attachment
- disable '--disable-site-isolation-trials' for profiles
- workaround for #446 with profiles
- also fixes `pageExtraDelay` not working for non-200 responses - may be
useful for detecting captcha blocked pages.
- connect VNC right away instead of waiting for page to fully finish
loading, hopefully resulting in faster profile start-up time.
2024-03-28 10:21:31 -07:00
Ilya Kreymer
0d973d67e3
upgrade puppeteer-core to 22.6.1 (#516)
Using latest puppeteer-core to keep up with latest browsers, mostly
minor syntax changes

Due to change in puppeteer hiding the executionContextId, need to create
a frameId->executionContextId mapping and track it ourselves to support
the custom evaluateWithCLI() function
2024-03-27 09:26:51 -07:00
Ilya Kreymer
0ad10a8dee
Unify WARC writing + CDXJ indexing into single class (#507)
Previously, there was the main WARCWriter as well as utility
WARCResourceWriter that was used for screenshots, text, pageinfo and
only generated resource records. This separate WARC writing path did not
generate CDX, but used appendFile() to append new WARC records to an
existing WARC.

This change removes WARCResourceWriter and ensures all WARC writing is done through a single WARCWriter, which uses a writable stream to append records, and can also generate CDX on the fly. This change is a
pre-requisite to the js-wacz conversion (#484) since all WARCs need to
have generated CDX.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-26 14:54:27 -07:00
Ilya Kreymer
01c4139aa7
Fixes from 1.0.3 release -> main (#517)
sitemap improvements: gz support + application/xml + extraHops fix #511
- follow up to
https://github.com/webrecorder/browsertrix-crawler/issues/496
- support parsing sitemap urls that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap
content-types (add test for both)
- ignore extraHops for sitemap found URLs by setting to past extraHops
limit (otherwise, all sitemap URLs would be treated as links from seed
page)

fixes redirected seed (from #476) being counted against page limit: #509
- subtract extraSeeds when computing limit
- don't include redirect seeds in seen list when serializing
- tests: adjust saved-state-test to also check total pages when crawl is
done

fixes #508
2024-03-26 14:50:36 -07:00
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay / diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
  ```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
Ilya Kreymer
22a7351dc7
service worker capture fix: disable by default for now (#506)
Due to issues with capturing top-level pages, make bypassing service
workers the default for now. Previously, it was only disabled when using
profiles. (This is also consistent with ArchiveWeb.page behavior).
Includes:
- add --serviceWorker option which can be `disabled`,
disabled-if-profile (previous default) and `enabled`
- ensure page timestamp is set for direct fetch
- warn if page timestamp is missing on serialization, then set to now
before serializing

bump version to 1.0.2
2024-03-22 13:37:14 -07:00
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (duplicate Ctrl+C) to
still result in valid WARC records, even if page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch
anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487 

Docs: Update docs on different interrupt options, eg. single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here) vs SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
Ilya Kreymer
1fe810b1df
Improved support for running as non-root (#503)
This PR provides improved support for running crawler as non-root,
matching the user to the uid/gid of the crawl volume.

This fixes #502 initial regression from 0.12.4, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.

However, that was not enough to fully support equivalent signal handling
/ graceful shutdown as when running with the same user. To make the
running as different user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc..)
to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)

A test has been added which runs one of the tests with a non-root
`test-crawls` directory to test the different user path. The test
(saved-state.test.js) includes sending interrupt signals and graceful
shutdown and allows testing of those features for a non-root gosu
execution.

Also bumping crawler version to 1.0.1
2024-03-21 08:16:59 -07:00
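The detached-children behavior above can be sketched as computing the `detached` spawn option (a real option of Node's `child_process.spawn`) from the `DETACHED_CHILD_PROC` env var; the helper name is illustrative:

```typescript
// detached children get their own process group, so the crawler's
// SIGINT/SIGTERM doesn't automatically kill redis-server, socat, etc.
function childSpawnOptions(
  env: Record<string, string | undefined>,
): { detached: boolean } {
  return { detached: env.DETACHED_CHILD_PROC === "1" };
}
```

Usage would be along the lines of `spawn("redis-server", args, childSpawnOptions(process.env))`.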
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates, also filtering nested sitemap lists
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`, don't add URLs past the page limit, and interrupt
further parsing when at the limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
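The `fromDate`/`toDate` filtering above can be sketched as a per-entry `lastmod` check; this sketch assumes entries without a `lastmod` are kept (since there is nothing to filter on), which may not match the crawler's exact rule:

```typescript
// keep a sitemap <url> entry only if its <lastmod> falls in the window
function inDateRange(
  lastmod: string | undefined,
  fromDate?: Date,
  toDate?: Date,
): boolean {
  if (!lastmod) {
    return true; // no lastmod: nothing to filter on
  }
  const t = new Date(lastmod).getTime();
  if (fromDate && t < fromDate.getTime()) return false;
  if (toDate && t > toDate.getTime()) return false;
  return true;
}
```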
Ilya Kreymer
5060e6b0b1
profiles: handle terminate signals directly (#500)
- add our own signal handling to create-login-profile to ensure fast
exit in k8s
- print crawler version info string on startup
2024-03-18 17:24:48 -04:00
Ilya Kreymer
6d04c9575f
Fix Save/Load State (#495)
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491
2024-03-15 20:54:43 -04:00
Ilya Kreymer
fa37f62c86
Additional type fixes, follow-up to #488 (#489)
More type safety (keep using WorkerOpts when needed)
follow-up to changes in #488
2024-03-08 12:52:30 -08:00
Ilya Kreymer
3b6c11d77b
page state type fixes: (#488)
- ensure pageid always inited for pagestate
- remove generic any from PageState
- use WorkerState instead of internal WorkerOpts
2024-03-08 11:05:26 -08:00
Ilya Kreymer
9f18a49c0a
Better tracking of failed requests + logging context exclude (#485)
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext

recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() still working if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)

tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0

bump to 1.0.0-beta.7
2024-03-07 11:35:53 -05:00
Ilya Kreymer
4520e9e96f
Fail on status code option + requeue fix (#480)
Add fail on status code option, --failOnInvalidStatus to treat non-200
responses as failures. Can be useful especially when combined with
--failOnFailedSeed or --failOnFailedLimit

requeue: ensure requeued urls are requeued with same depth/priority, not
0
2024-03-04 17:21:44 -08:00
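The requeue fix above boils down to preserving the original depth (and thus priority) when re-adding a URL, rather than re-adding it at depth 0; `QueueEntry` here is a pared-down illustrative shape, not the crawler's actual type:

```typescript
type QueueEntry = { url: string; depth: number; retry: number };

// requeue keeps depth as-is; only the retry counter is bumped
function requeue(entry: QueueEntry): QueueEntry {
  return { ...entry, retry: entry.retry + 1 };
}
```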
Ilya Kreymer
184f4a2395
Ensure links added via behaviors also get processed (#478)
Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors
0.5.3, which will add support for behaviors to add links.

Simplify adding links by adding them directly, instead of batching into
groups of 500. Errors are already logged if queueing a new
URL fails.
2024-02-28 22:56:32 -08:00
Ilya Kreymer
c348de270f
store page statusCode if not 200 (#477)
Don't treat non-200 pages as errors; still extract text, take
screenshots, and run behaviors.
Only consider actual page load errors, e.g. a chrome-error:// page URL,
as errors.
2024-02-28 22:56:12 -08:00
Ilya Kreymer
fba4730d88
new seed on redirect + error page check: (#476)
- if a seed page redirects (page response != seed url), then add the
final url as a new seed with same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL,
store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure
page is marked as failed if page.url() starts with chrome-error://
- fixes #475
2024-02-28 11:31:59 -08:00
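Duplicating a seed with a new URL while keeping its scope, as described above, can be sketched like this (`ScopedSeed` is simplified for illustration):

```typescript
type ScopedSeed = { url: string; include: RegExp[]; exclude: RegExp[] };

// duplicate a seed for the redirect target, keeping the original
// include/exclude scope rules
function newScopeSeed(seed: ScopedSeed, newUrl: string): ScopedSeed {
  return { url: newUrl, include: seed.include, exclude: seed.exclude };
}
```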
Ilya Kreymer
dd48251b39
Include WARC prefix for screenshots and text WARCs (#473)
Ensure the env var / cli <warc prefix>-<crawlId> is also applied to
`screenshots.warc.gz` and `text.warc.gz`
2024-02-27 23:33:34 -08:00
Tessa Walsh
bdffa7922c
Add arg to write pages to Redis (#464)
Fixes #462 

Add --writePagesToRedis arg, for use in conjunction with QA features in
Browsertrix Cloud, to add pages to the database for each crawl.
Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and redis)
Also include timestamp (as ISO date) in `pageinfo:` records

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-02-09 16:44:17 -08:00
Ilya Kreymer
703835a7dd
detect invalid custom behaviors on load: (#450)
- on first page, attempt to evaluate the behavior class to ensure it
compiles
- if fails to compile, log exception with fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with
fatal exit code (17)
2023-12-13 15:14:53 -05:00
Ilya Kreymer
3323262852
WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440)
Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (default to 1GB) for when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with current timestamp

Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
- add timeout to final fetch tasks to avoid hanging forever on finish
- fix adding `WARC-Truncated` header; it needs to be set after the stream
is finished to determine whether it was truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
2023-12-07 23:02:55 -08:00
Ilya Kreymer
19dac943cc
Add types + validation for log context options (#435)
- add LogContext type and enumerate all log contexts
- also add LOG_CONTEXT_TYPES array to validate --context arg
- rename errJSON -> formatErr, convert unknown (likely Error) to dict
- make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
2023-11-14 21:54:40 -08:00
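The `errJSON` → `formatErr` idea — accepting `unknown` from catch handlers and normalizing it to a loggable dict — can be sketched as follows (field names are illustrative):

```typescript
// normalize anything thrown into a plain dict suitable for logging,
// so catch handlers don't need explicit `any` typing
function formatErr(e: unknown): Record<string, unknown> {
  if (e instanceof Error) {
    return { type: "exception", message: e.message, stack: e.stack || "" };
  }
  if (typeof e === "object" && e !== null) {
    return { ...(e as Record<string, unknown>) };
  }
  return { message: String(e) };
}
```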
Ilya Kreymer
456155ecf6
more specific types additions (#434)
- add QueueEntry for type of json object stored in Redis
- and PageCallbacks for callback type
- use Crawler type
2023-11-13 09:31:52 -08:00
Ilya Kreymer
3972942f5f
logging: don't log filtered out direct fetch attempt as error (#432)
When calling directFetchCapture, and aborting the response via an
exception, throw `new Error("response-filtered-out");`
so that it can be ignored. This exception is only used for direct
capture, and should not be logged as an error - rethrow and
handle in calling function to indicate direct fetch is skipped
2023-11-13 09:16:57 -08:00
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Renamed from crawler.js