Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-26 17:54:11 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	9a2ada3461	version: bump to 1.0.0	2024-03-18 19:15:35 -07:00
Ilya Kreymer	56053534c5	SAX-based sitemap parser (#497 ) Adds a new SAX-based sitemap parser, inspired by: https://www.npmjs.com/package/sitemap-stream-parser Supports: - recursively parsing sitemap indexes, using p-queue to process N at a time (currently 5) - `fromDate` and `toDate` filter dates, to only include URLs between the given dates, filtering nested sitemap lists included - async parsing, continue parsing in the background after 100 URLs - timeout for initial fetch / first 100 URLs set to 30 seconds to avoid slowing down the crawl - save/load state integration: mark if sitemaps have already been parsed in redis, serialize to save state, to avoid reparsing. (Will reparse if parsing did not fully finish) - Aware of `pageLimit`, don't add URLs past the page limit, interrupt further parsing when at limit. - robots.txt `sitemap:` parsing, check URL extension and mime type - automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt, then /sitemap.xml - tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL. Fixes #496 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-18 19:14:07 -07:00
Ilya Kreymer	5060e6b0b1	profiles: handle terminate signals directly (#500 ) - add our own signal handling to create-login-profile to ensure fast exit in k8s - print crawler version info string on startup	2024-03-18 17:24:48 -04:00
Tessa Walsh	4d64eedcd3	Temporarily disable tmp-cdx creation (#499 ) Fixes #498 To revert after 1.0.0 when we make changes that allow for using the temp CDX in WACZ creation.	2024-03-18 14:03:34 -07:00
Ilya Kreymer	f96c6a13dc	version: bump to 1.0.0-beta.8	2024-03-16 15:32:19 -07:00
Ilya Kreymer	8ea3bf8319	CNAME: keep CNAME in docs/docs for mkdocs	2024-03-16 15:24:54 -07:00
Tessa Walsh	e1fe028c7c	Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494 ) Fixes #493 This PR updates the documentation for Browsertrix Crawler 1.0.0 and moves it from the project README to an MKDocs site. Initial docs site set to https://crawler.docs.browsertrix.com/ Many thanks to @Shrinks99 for help setting this up! --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-03-16 14:59:32 -07:00
Ilya Kreymer	6d04c9575f	Fix Save/Load State (#495 ) - Fixes state serialization, which was missing the done list. Instead, adds a 'finished' list computed from the seen list, minus failed and queued URLs. - Also adds serialization support for 'extraSeeds', seeds added dynamically from a redirect (via #475). Extra seeds are added to Redis and also included in the serialization. Fixes #491	2024-03-15 20:54:43 -04:00
Ilya Kreymer	fa37f62c86	Additional type fixes, follow-up to #488 (#489 ) More type safety (keep using WorkerOpts when needed) follow-up to changes in #488	2024-03-08 12:52:30 -08:00
Ilya Kreymer	3b6c11d77b	page state type fixes: (#488 ) - ensure pageid always inited for pagestate - remove generic any from PageState - use WorkerState instead of internal WorkerOpts	2024-03-08 11:05:26 -08:00
Ilya Kreymer	9f18a49c0a	Better tracking of failed requests + logging context exclude (#485 ) - add --logExcludeContext for log contexts that should be excluded (while --logContext specifies which are to be included) - enable 'recorderNetwork' logging for debugging CDP network - create default log context exclude list (containing: screencast, recorderNetwork, jsErrors), customizable via --logExcludeContext recorder: Track failed requests and include in pageinfo records with status code 0 - cleanup cdp handler methods - intercept requestWillBeSent to track requests that started (but may not complete) - fix shouldSkip() still working if no url is provided (eg. check only headers) - set status to 0 for async fetch failures - remove responseServedFromCache interception, as response data generally not available then, and responseReceived is still called - pageinfo: include page requests that failed with status code 0, also include 'error' status if available. - ensure page is closed on failure - ensure pageinfo still written even if nothing else is crawled for a page - track cached responses, add to debug logging (can also add to pageinfo later if needed) tests: add pageinfo test for crawling invalid URL, which should still result in pageinfo record with status code 0 bump to 1.0.0-beta.7	2024-03-07 11:35:53 -05:00
Ilya Kreymer	65133c9d9d	resourceType lowercase fix: (#483 ) follow up to #481, check reqresp.resourceType with lowercase value just set message based on resourceType value	2024-03-04 23:58:39 -08:00
Ilya Kreymer	63cedbc91a	version: bump to 1.0.0-beta.6	2024-03-04 18:11:28 -08:00
Ilya Kreymer	5a47cc4b41	warc: add Network.resourceType (https://chromedevtools.github.io/devt … (#481 ) Add resourceType value from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention fixes #451	2024-03-04 18:10:45 -08:00
Ilya Kreymer	4520e9e96f	Fail on status code option + requeue fix (#480 ) Add fail on status code option, --failOnInvalidStatus to treat non-200 responses as failures. Can be useful especially when combined with --failOnFailedSeed or --failOnFailedLimit requeue: ensure requeued urls are requeued with same depth/priority, not 0	2024-03-04 17:21:44 -08:00
Ilya Kreymer	dd78457b2b	version: bump to 1.0.0-beta.5	2024-02-28 22:57:05 -08:00
Ilya Kreymer	184f4a2395	Ensure links added via behaviors also get processed (#478 ) Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors 0.5.3, which will add support for behaviors to add links. Simplify adding links by adding the links directly, instead of batching to 500 links. Errors are already being logged if queueing a new URL fails.	2024-02-28 22:56:32 -08:00
Ilya Kreymer	c348de270f	store page statusCode if not 200 (#477 ) don't treat non-200 pages as errors, still extract text, take screenshots, and run behaviors only consider actual page load errors, eg. chrome-error:// page url, as errors	2024-02-28 22:56:12 -08:00
Ilya Kreymer	fba4730d88	new seed on redirect + error page check: (#476 ) - if a seed page redirects (page response != seed url), then add the final url as a new seed with same scope - add newScopeSeed() to ScopedSeed to duplicate seed with different URL, store original includes / excludes - also add check for 'chrome-error://' URLs for the page, and ensure page is marked as failed if page.url() starts with chrome-error:// - fixes #475	2024-02-28 11:31:59 -08:00
Ilya Kreymer	dd48251b39	Include WARC prefix for screenshots and text WARCs (#473 ) Ensure the env var / cli <warc prefix>-<crawlId> is also applied to `screenshots.warc.gz` and `text.warc.gz`	2024-02-27 23:33:34 -08:00
Ilya Kreymer	cdd047d15e	warcwriter: better filehandle init on first use (#474 ) Ensure warcwriter file is inited on first use, instead of throwing error - was initing from writeRecordPair() but not writeSingleRecord()	2024-02-23 21:35:55 -08:00
Ilya Kreymer	d36564e0b0	typo: remove extra console.log	2024-02-22 16:13:50 -08:00
Ilya Kreymer	51660cdcc4	pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471 ) Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.	2024-02-21 16:02:25 -08:00
Ilya Kreymer	a5e939567c	Set warc prefix via WARC_PREFIX env var (#470 ) In addition to `--warcPrefix` flag, also support WARC_PREFIX env var, which takes precedence. Bump to 1.0.0-beta.4	2024-02-21 11:30:28 -08:00
Ilya Kreymer	a512e92886	Include resource type + mime type in page resources list (#468 ) The `:pageinfo:<url>` record now includes the mime type + resource type (from Chrome) along with status code for each resource, for better filtering / comparison.	2024-02-19 19:11:48 -08:00
Ilya Kreymer	8d2d79a5df	Misc Page Resource/Recorder Fixes (#467 ) - recorder: don't attempt to record response with mime type `text/event-stream` (will not terminate). - resources: don't track non http/https resources. - resources: store page timestamp on first resources URL match, in case multiple responses for same page encountered.	2024-02-17 23:32:19 -08:00
Ilya Kreymer	e8f2073a7e	Update Browser Image (#466 ) - Update to Brave browser (1.62.165) - Update page resource test to reflect latest Brave behavior	2024-02-17 22:40:12 -08:00
Ilya Kreymer	46eb02dfcb	version: bump to 1.0.0-beta.3	2024-02-16 14:37:58 -08:00
Ilya Kreymer	96f3c407b1	Page Resources: Include Cached Resources (#465 ) Ensure cached resources (that are not written to WARC) are still included in the `url:pageinfo:...` records. This will make it easier to track which resources are actually loaded from a given page. Tests: add test to ensure pageinfo record for webrecorder.net and webrecorder.net/about include cached resources	2024-02-16 14:36:32 -08:00
Tessa Walsh	bdffa7922c	Add arg to write pages to Redis (#464 ) Fixes #462 Add --writePagesToRedis arg, for use in conjunction with QA features in Browsertrix Cloud, to add pages to the database for each crawl. Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and redis) Also include timestamp (as ISO date) in `pageinfo:` records --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-02-09 16:44:17 -08:00
Ilya Kreymer	298deac59d	add fix from 0.12.4 - puppeteer-core to 20.8.2 bump to 1.0.0-beta.2	2024-01-17 14:44:34 -08:00
Ilya Kreymer	f4ecaa8454	Merge branch 'main' into dev-1.0.0	2024-01-17 14:42:13 -08:00
Ilya Kreymer	18ffb3d971	skipping resources: ensure HEAD, OPTIONS, 206, and 304 response/request pairs are not written to WARC (#460 ) Allows for skipping network traffic that doesn't need to be stored, as it is not necessary/will result in incorrect replay (eg. 304 instead of a 200).	2024-01-17 14:27:51 -08:00
Ilya Kreymer	2fc0f67f04	Generate urn:pageinfo:<page url> records (#458 ) Generate records for each page, containing a list of resources and their status codes, to aid in future diffing/comparison. Generates a `urn:pageinfo:<page url>` record for each page - Adds POST / non-GET request canonicalization from warcio to handle non-GET requests - Adds `writeSingleRecord` to WARCWriter Fixes #457	2024-01-15 16:08:13 -05:00
Tessa Walsh	cd3a1b0c6c	Bump puppeteer-core to ^20.8.2 to patch vulnerability (#459 ) Fixes https://github.com/webrecorder/browsertrix-crawler/issues/456	2024-01-15 12:02:18 -08:00
Ilya Kreymer	db2dbe042f	bump to 1.0.0-beta.1 update yarn.lock	2024-01-03 00:21:03 -08:00
Ilya Kreymer	63c884fb1b	Merge branch 'main' (0.12.3) into 1.0.0	2024-01-03 00:20:23 -08:00
Ilya Kreymer	703835a7dd	detect invalid custom behaviors on load: (#450 ) - on first page, attempt to evaluate the behavior class to ensure it compiles - if fails to compile, log exception with fatal and exit - update behavior gathering code to keep track of behavior filename - tests: add test for invalid behavior which causes crawl to exit with fatal exit code (17)	2023-12-13 15:14:53 -05:00
Ilya Kreymer	3323262852	WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440 ) Support for rollover size and custom WARC prefix templates: - reenable --rolloverSize (default to 1GB) for when a new WARC is created - support custom WARC prefix via --warcPrefix, prepended to new WARC filename, test via basic_crawl.test.js - filename template for new files is: `${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}` with `$ts` replaced at new file creation time with current timestamp Improved support for long (non-terminating) responses, such as from live-streaming: - add a size to CDP takeStream to ensure data is streamed in fixed chunks, defaulting to 64k - change shutdown order: first close browser, then finish writing all WARCs to ensure any truncated responses can be captured. - ensure WARC is not rewritten after it is done, skip writing records if stream already flushed - add timeout to final fetch tasks to avoid hanging indefinitely on finish - fix adding `WARC-Truncated` header, need to set after stream is finished to determine if it's been truncated - move temp download `tmp-dl` dir to main temp folder, outside of collection (no need to be there).	2023-12-07 23:02:55 -08:00
Ilya Kreymer	c3b98e5047	Add timeout to final awaitPendingClear() (#442 ) Ensure the final pending wait also has a timeout, set to max page timeout x num workers. Could also set higher, but needs to have a timeout, eg. in case of downloading live stream that never terminates. Fixes #348 in the 0.12.x line. Also bumps version to 0.12.3	2023-11-16 16:20:09 -05:00
dependabot[bot]	540c355d25	Bump sharp from 0.32.1 to 0.32.6 (#443 ) Bumps [sharp](https://github.com/lovell/sharp) from 0.32.1 to 0.32.6 to fix vulnerability	2023-11-16 16:18:00 -05:00
Ilya Kreymer	e9ed7a45df	Merge 0.12.2 into dev-1.0.0	2023-11-15 23:00:13 -08:00
Ilya Kreymer	19dac943cc	Add types + validation for log context options (#435 ) - add LogContext type and enumerate all log contexts - also add LOG_CONTEXT_TYPES array to validate --context arg - rename errJSON -> formatErr, convert unknown (likely Error) to dict - make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers	2023-11-14 21:54:40 -08:00
Ilya Kreymer	9ba0b9edc1	Backport pending list never being reprocessed (#438 ) Backport of #433 to 0.12.x. Bump version to 0.12.2	2023-11-13 19:21:48 -08:00
Ilya Kreymer	456155ecf6	more specific types additions (#434 ) - add QueueEntry for type of json object stored in Redis - and PageCallbacks for callback type - use Crawler type	2023-11-13 09:31:52 -08:00
Ilya Kreymer	0d51e03825	Fix potential for pending list never being processed (#433 ) Due to an optimization, numPending() call assumed that queueSize() would be called to update cached queue size. However, in the current worker code, this is not the case. Remove caching the queue size and just check queue size in numPending(), to ensure pending list is always processed.	2023-11-13 09:31:21 -08:00
Ilya Kreymer	3972942f5f	logging: don't log filtered out direct fetch attempt as error (#432 ) When calling directFetchCapture, and aborting the response via an exception, throw `new Error("response-filtered-out");` so that it can be ignored. This exception is only used for direct capture, and should not be logged as an error - rethrow and handle in calling function to indicate direct fetch is skipped	2023-11-13 09:16:57 -08:00
Ilya Kreymer	ab0f66aa54	Raise size limit for large HTML pages (#430 ) Previously, responses >2MB are streamed to disk and an empty response returned to browser, to avoid holding large response in memory. This limit was too small, as some HTML pages may be >2MB, resulting in no content loaded. This PR sets different limits for: - HTML as well as other JS necessary for page to load to 25MB - All other content limit is set to 5MB Also includes some more type fixing	2023-11-09 18:33:44 -08:00
Ilya Kreymer	783d006d52	follow-up to #428 : update ignore files (#431 ) - actually update lint/prettier/git ignore files with scratch, crawls, test-crawls, behaviors, as needed	2023-11-09 17:13:53 -08:00
Emma Segal-Grossman	2a49406df7	Add Prettier to the repo, and format all the files! (#428 ) This adds prettier to the repo, and sets up the pre-commit hook to auto-format as well as lint. Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.	2023-11-09 16:11:11 -08:00

505 commits