Generate a `urn:pageinfo:<page url>` record for each page, containing a list
of its resources and their status codes, to aid in future diffing/comparison.
- Adds POST / non-GET request canonicalization from warcio
- Adds `writeSingleRecord` to WARCWriter
Fixes #457
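A minimal sketch of how such a pageinfo record can be built with warcio.js (the payload shape and the surrounding `writeSingleRecord` call are assumptions based on the description above, not the crawler's exact code):

```ts
import { WARCRecord, WARCSerializer } from "warcio";

type PageResource = { url: string; status: number; mime?: string };

async function createPageInfoRecord(pageUrl: string, resources: PageResource[]) {
  const body = new TextEncoder().encode(JSON.stringify({ url: pageUrl, urls: resources }));

  // payload is provided as an async iterable of chunks
  async function* content() {
    yield body;
  }

  const record = await WARCRecord.create(
    {
      url: `urn:pageinfo:${pageUrl}`,
      type: "resource",
      warcHeaders: { "Content-Type": "application/json" },
    },
    content(),
  );

  // in the crawler, this record would then be handed to the new
  // WARCWriter.writeSingleRecord() rather than serialized directly
  return await WARCSerializer.serialize(record, { gzip: true });
}
```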
- on first page, attempt to evaluate the behavior class to ensure it
compiles
- if it fails to compile, log the exception with fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with
fatal exit code (17)
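A rough sketch of the first-page compile check (how the behavior script text and page handle are obtained is assumed; only the evaluate-then-fatal-exit pattern comes from the description above):

```ts
import { Page } from "puppeteer-core";

async function checkBehaviorCompiles(page: Page, behaviorScript: string, filename: string) {
  try {
    // evaluating the behavior text once: a syntax error in the custom
    // behavior will throw here, before any page is actually crawled with it
    await page.evaluate(behaviorScript);
  } catch (e) {
    // the crawler logs this via logger.fatal(), which exits with
    // the fatal exit code (17)
    console.error("Custom behavior failed to compile", { filename, error: e });
    process.exit(17);
  }
}
```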
Support for rollover size and custom WARC prefix templates:
- re-enable --rolloverSize (default: 1GB) to control when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with the current timestamp
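A sketch of the template expansion (function shape illustrative; only the template string and the `$ts` substitution come from the description above):

```ts
function newWarcFilename(prefix: string, crawlId: string, workerid: number, gzip: boolean): string {
  const template = `${prefix}-${crawlId}-$ts-${workerid}.warc${gzip ? ".gz" : ""}`;
  // "$ts" stays literal in the template and is replaced when the file is opened
  const ts = new Date().toISOString().replace(/[^\d]/g, "");
  return template.replace("$ts", ts);
}

// e.g. newWarcFilename("mycrawl", "id1", 0, true)
//   -> "mycrawl-id1-20231104120000000-0.warc.gz"
```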
Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k (sketched after this list)
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
- add timeout to final fetch tasks to avoid hanging forever on finish
- fix adding the `WARC-Truncated` header: it must be set after the stream is
finished to determine whether it has been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
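The chunked reading mentioned above, sketched with the underlying CDP calls (Fetch.takeResponseBodyAsStream + IO.read); the 64k default is from this change, everything else is illustrative:

```ts
import { CDPSession } from "puppeteer-core";

const TAKE_STREAM_BUFF_SIZE = 64 * 1024;

async function* takeStreamIter(cdp: CDPSession, requestId: string) {
  const { stream } = await cdp.send("Fetch.takeResponseBodyAsStream", { requestId });

  while (true) {
    // passing an explicit size keeps chunks at a fixed, bounded size,
    // which matters for long / never-ending responses such as live streams
    const { data, base64Encoded, eof } = await cdp.send("IO.read", {
      handle: stream,
      size: TAKE_STREAM_BUFF_SIZE,
    });

    yield base64Encoded ? Buffer.from(data, "base64") : Buffer.from(data);

    if (eof) {
      break;
    }
  }
}
```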
Ensure the final pending wait also has a timeout, set to max page
timeout x num workers.
Could also be set higher, but there needs to be some timeout, e.g. in case of
downloading a live stream that never terminates.
Fixes #348 in the 0.12.x line.
Also bumps version to 0.12.3
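A minimal sketch of the bounded final wait (names and structure assumed; only the max page timeout x num workers cap is from the description):

```ts
async function awaitPendingWithTimeout(pending: Promise<void>[], maxPageTimeoutSecs: number, numWorkers: number) {
  const timeoutSecs = maxPageTimeoutSecs * numWorkers;

  await Promise.race([
    Promise.allSettled(pending),
    // give up after the cap so a never-terminating download can't hang shutdown
    new Promise<void>((resolve) => setTimeout(resolve, timeoutSecs * 1000)),
  ]);
}
```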
- add LogContext type and enumerate all log contexts
- also add LOG_CONTEXT_TYPES array to validate --context arg
- rename errJSON -> formatErr, convert unknown (likely Error) to dict
- make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
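A sketch of what this looks like in TypeScript (the actual list of contexts in the crawler is longer; abbreviated here):

```ts
export const LOG_CONTEXT_TYPES = ["general", "worker", "recorder", "state", "behavior"] as const;

export type LogContext = (typeof LOG_CONTEXT_TYPES)[number];

// formatErr: convert an unknown caught value (most likely an Error)
// into a plain dict suitable for structured logging
export function formatErr(e: unknown): Record<string, unknown> {
  if (e instanceof Error) {
    return { type: "exception", message: e.message, stack: e.stack || "" };
  }
  return { message: String(e) };
}
```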
Due to an optimization, the numPending() call assumed that queueSize() would
be called to update the cached queue size. However, in the current worker
code, this is not the case. Remove caching of the queue size and just check
the queue size in numPending(), to ensure the pending list is always processed.
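A rough sketch of the change (Redis key names and the surrounding logic are illustrative, not the crawler's actual state code):

```ts
import { Redis } from "ioredis";

class RedisCrawlState {
  constructor(private redis: Redis, private key: string) {}

  // previously this cached the result and numPending() read the cache,
  // which stayed stale if the worker loop never called queueSize()
  async queueSize(): Promise<number> {
    return await this.redis.zcard(`${this.key}:q`);
  }

  async numPending(): Promise<number> {
    const pending = await this.redis.hlen(`${this.key}:p`);
    // check the live queue size here so the pending list is always processed
    if (pending > 0 && (await this.queueSize()) === 0) {
      // ... handle / requeue stale pending entries (elided) ...
    }
    return pending;
  }
}
```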
When calling directFetchCapture and aborting the response via an
exception, throw `new Error("response-filtered-out")` so that it can be
ignored. This exception is only used for direct capture and should not be
logged as an error: it is rethrown and handled in the calling function to
indicate that direct fetch is skipped.
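A sketch of the pattern (function bodies and the HTML check are illustrative; the sentinel message is the one quoted above):

```ts
async function directFetchCapture(url: string): Promise<void> {
  const resp = await fetch(url);

  if (resp.headers.get("content-type")?.startsWith("text/html")) {
    // abort: this response should be loaded via the browser instead
    throw new Error("response-filtered-out");
  }

  // ... stream resp.body into the WARC writer (elided) ...
}

async function tryDirectFetch(url: string): Promise<boolean> {
  try {
    await directFetchCapture(url);
    return true;
  } catch (e) {
    if (e instanceof Error && e.message === "response-filtered-out") {
      // not an error: direct fetch is simply skipped for this URL
      return false;
    }
    throw e;
  }
}
```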
Previously, responses >2MB were streamed to disk and an empty response returned to the browser,
to avoid holding large responses in memory.
This limit was too small, as some HTML pages may be >2MB, resulting in no content being loaded.
This PR sets different limits:
- HTML, as well as JS necessary for the page to load: 25MB
- All other content: 5MB
Also includes some more type fixing
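A sketch of how the two limits might be expressed (constant names and the resource-type grouping are assumptions):

```ts
const MAX_BROWSER_DEFAULT_FETCH_SIZE = 5_000_000;  // 5MB for most content
const MAX_BROWSER_TEXT_FETCH_SIZE = 25_000_000;    // 25MB for HTML / JS needed to load the page

function fetchSizeLimit(resourceType: string): number {
  return ["document", "script"].includes(resourceType)
    ? MAX_BROWSER_TEXT_FETCH_SIZE
    : MAX_BROWSER_DEFAULT_FETCH_SIZE;
}
```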
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignore files to exclude crawls, test-crawls, scratch, and dist as needed.
Follows #424. Converts the upcoming 1.0.0 branch, based on native browser-based traffic capture and recording, to TypeScript. Fixes #426
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
via the Chrome DevTools Protocol (CDP). This allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343
Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.)
- WARC writing support via TS-based warcio.js library.
- Generates single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused
- Rule-based response rewriting support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest() (sketched after this list)
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream,
async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
via fetch()
- Direct async fetch() capture of non-HTML URLs
- Awaiting all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
- removed pywb, using cdxj-indexer for --generateCDX option.
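A simplified sketch of the Fetch.requestPaused flow referenced above (the real Recorder also streams large bodies, applies rewriting rules, matches service-worker frames, and writes the WARC records, all elided here):

```ts
import { CDPSession, Protocol } from "puppeteer-core";

async function startCapture(cdp: CDPSession) {
  await cdp.send("Fetch.enable", {
    patterns: [{ urlPattern: "*", requestStage: "Response" }],
  });

  cdp.on("Fetch.requestPaused", async (params: Protocol.Fetch.RequestPausedEvent) => {
    const { requestId, responseStatusCode, responseHeaders } = params;

    // pull the response body from the browser
    const { body, base64Encoded } = await cdp.send("Fetch.getResponseBody", { requestId });
    const payload = base64Encoded ? Buffer.from(body, "base64") : Buffer.from(body);

    // ... write request/response WARC records via warcio.js and optionally
    // apply wabac.js rewriting rules to the payload before fulfilling (elided) ...

    // resume the page with the (possibly rewritten) body
    await cdp.send("Fetch.fulfillRequest", {
      requestId,
      responseCode: responseStatusCode || 200,
      responseHeaders,
      body: payload.toString("base64"),
    });
  });
}
```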
Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remove excluded URLs
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing it for every excluded URL
- tests: update saved-state test to be more resilient to delays
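A sketch of the zscan-based filtering (the Redis key name and the glob heuristic shown here are illustrative):

```ts
import { Redis } from "ioredis";

async function filterQueue(redis: Redis, qkey: string, exclusion: string | RegExp) {
  // non-regex exclusions (plain strings with no special chars) can use a
  // server-side MATCH glob; regexes fall back to scanning everything
  const isPlain = typeof exclusion === "string" && !/[\\^$.|?*+()[\]{}]/.test(exclusion);
  const match = isPlain ? `*${exclusion}*` : "*";

  let cursor = "0";
  do {
    // ZSCAN with the default COUNT, no ordered scan needed
    const [next, members] = await redis.zscan(qkey, cursor, "MATCH", match);
    cursor = next;

    // zscan returns [member, score, member, score, ...]
    for (let i = 0; i < members.length; i += 2) {
      const data = members[i];
      const excluded =
        typeof exclusion === "string" ? data.includes(exclusion) : exclusion.test(data);
      if (excluded) {
        await redis.zrem(qkey, data);
      }
    }
  } while (cursor !== "0");
}
```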
args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334
bump to 0.12.1
Updated arg parsing thanks to example in
https://github.com/yargs/yargs/issues/846#issuecomment-517264899
to support multiple-value arguments specified as either one string or
multiple strings, using an array type + coerce function.
This also allows the `choices` option to be used to validate the values,
when needed.
With this setup, `--text to-pages,to-warc,final-to-warc`, `--text
to-pages,to-warc --text final-to-warc` and `--text to-pages --text
to-warc --text final-to-warc` all result in the same configuration!
Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system.
Also updated README with new text extraction options and bumped version
to 0.12.0
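A sketch of the array + coerce setup for `--text`, following the pattern from the linked yargs comment (option wiring approximate):

```ts
import yargs from "yargs";

const textChoices = ["to-pages", "to-warc", "final-to-warc"];

// split comma-separated values so one string or repeated flags both work
const coerceList = (values: string[]) =>
  values.flatMap((v) => v.split(",")).filter((v) => v.length);

const argv = yargs(process.argv.slice(2))
  .option("text", {
    describe: "Extract text: one or more of " + textChoices.join(", "),
    type: "array",
    coerce: coerceList,
    choices: textChoices,
  })
  .parseSync();

// argv.text is the same for all three invocations shown above
```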
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to
get the snapshot (consistent with ArchiveWeb.page) - should be slightly
more performant
- keep option to use DOM.getDocument
- refactor warc resource writing to separate class, used by text
extraction and screenshots
- write extracted text to WARC files as 'urn:text:<url>' after page
loads, similar to screenshots
- also store final text to WARC as 'urn:textFinal:<url>' if it is
different
- cli options: update `--text` to take one or more comma-separated
string options, e.g. `--text to-warc,to-pages,final-to-warc`. For backwards
compatibility, support `--text` and `--text true` as equivalent to
`--text to-pages`.
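A sketch of the text-to-WARC flow (the writer interface stands in for the refactored WARC resource writer; only the `urn:text:` / `urn:textFinal:` URLs and the "only if different" rule come from the description above):

```ts
interface ResourceWriter {
  writeResourceRecord(url: string, contentType: string, payload: Uint8Array): Promise<void>;
}

async function writeTextToWarc(writer: ResourceWriter, pageUrl: string, text: string, finalText?: string) {
  const enc = new TextEncoder();

  // text extracted right after the page loads
  await writer.writeResourceRecord(`urn:text:${pageUrl}`, "text/plain", enc.encode(text));

  // final text (e.g. after behaviors run), written only if it differs
  if (finalText && finalText !== text) {
    await writer.writeResourceRecord(`urn:textFinal:${pageUrl}`, "text/plain", enc.encode(finalText));
  }
}
```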
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- set done key correctly, just an int now
- also check if array for old-style save states (for backwards
compatibility)
- fixes #411
- tests: includes tests using redis: tests save state + dynamically
adding exclusions (follow up to #408)
- adds `--debugAccessRedis` flag to allow accessing local redis outside
container
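A small sketch of the backwards-compatible read (field names assumed):

```ts
interface SaveState {
  // new-style states store a count; old-style states stored an array of finished pages
  done: number | unknown[];
}

function numDone(state: SaveState): number {
  return Array.isArray(state.done) ? state.done.length : state.done;
}
```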
Part of work for webrecorder/browsertrix-cloud#1216:
- support adding/removing exclusions dynamically via a Redis message
list
- add processMessage() which checks <uid>:msg list for any messages
- handle addExclusion / removeExclusion messages to add / remove
exclusions for each seed
- also add filterQueue(), which filters the queue asynchronously, one URL
at a time, when a new exclusion is added
- don't set start / end time in redis
- rename setEndTimeAndExit to setStatusAndExit
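A sketch of processMessage() (the `<uid>:msg` list and the addExclusion / removeExclusion message types are from the description; the message shape and seed interface are assumed):

```ts
import { Redis } from "ioredis";

type ExclusionMessage =
  | { type: "addExclusion"; regex: string }
  | { type: "removeExclusion"; regex: string };

interface Seed {
  addExclusion(regex: string): void;
  removeExclusion(regex: string): void;
}

async function processMessage(redis: Redis, uid: string, seeds: Seed[]) {
  // drain any pending messages from the <uid>:msg list
  while (true) {
    const raw = await redis.lpop(`${uid}:msg`);
    if (!raw) {
      break;
    }

    const msg = JSON.parse(raw) as ExclusionMessage;

    for (const seed of seeds) {
      if (msg.type === "addExclusion") {
        seed.addExclusion(msg.regex);
        // a newly added exclusion also triggers an async filterQueue() pass
      } else {
        seed.removeExclusion(msg.regex);
      }
    }
  }
}
```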
add 'fast cancel' option:
- add isCrawlCanceled() to state, which checks redis canceled key
- on interrupt, if canceled, immediately exit with status 0
- on fatal, exit with code 0 if restartsOnError is set
- no longer keeping track of start/end time in crawler itself
- logger.fatal() also sets crawl status to 'failed' and adds endTime before exiting
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393 to now use logger.fatal() to end crawl.
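A sketch of the fast-cancel check (Redis key and exit flow assumed; only the "if canceled, immediately exit with status 0" behavior is from the description):

```ts
import { Redis } from "ioredis";

async function isCrawlCanceled(redis: Redis, crawlId: string): Promise<boolean> {
  return (await redis.get(`${crawlId}:canceled`)) === "1";
}

async function onInterrupt(redis: Redis, crawlId: string) {
  if (await isCrawlCanceled(redis, crawlId)) {
    // fast cancel: skip post-crawl work and exit cleanly right away
    process.exit(0);
  }
  // otherwise: normal graceful interrupt (finish WARCs, save state, ...)
}
```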
* Store crawler start and end times in Redis lists
* end time tweaks:
- set end time for logger.fatal()
- set missing start time in setEndTime()
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Allow for some seeds to be invalid unless failOnFailedSeed is set
Fail crawl if no valid seeds are provided
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
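A sketch of the intended seed handling (validation and error text are illustrative):

```ts
function filterValidSeeds(seedUrls: string[], failOnFailedSeed: boolean): string[] {
  const valid: string[] = [];

  for (const url of seedUrls) {
    try {
      new URL(url);
      valid.push(url);
    } catch (e) {
      if (failOnFailedSeed) {
        throw new Error(`Invalid seed: ${url}`);
      }
      // otherwise just skip the invalid seed and keep going
    }
  }

  if (!valid.length) {
    throw new Error("No valid seeds provided, failing crawl");
  }

  return valid;
}
```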
- run behaviors: check if behaviors object exists before trying to run behaviors to avoid failure message
- skip behaviors if frame no longer attached / has empty URL
* error handling fixes:
- listen to correct event for page crashes, 'error' instead of 'crash', may fix #371, #351
- more removal of duplicate logging for status-related errors, e.g. if page crashed, don't log worker exception
- detect browser 'disconnected' event, interrupt crawl (but allow post-crawl tasks, such as waiting for pending requests, to run), set browser to null to avoid trying to use it again.
worker:
- bump new page timeout to 20
- if loading page from new domain, always use new page
logger:
- log timestamp first for better sorting
* optimize link extraction (fixes #376):
- dedup urls in browser first
- don't return entire list of URLs, process one-at-a-time via callback
- add exposeFunction per page in setupPage, then register 'addLink' callback for each page's handler
- optimize addqueue: atomically check if already at max urls and if url already seen in one redis call
- add QueueState enum to indicate possible states: url added, limit hit, or dupe url (see the sketch after this list)
- better logging: log rejected promises for link extraction
- tests: add test for exact page limit being reached
* behavior logging tweaks, add netIdle
* fix shouldIncludeFrame() check: was actually erroring out and never accepting any iframes!
now used not only for link extraction but also to run() behaviors
* add logging if iframe check fails
* Dockerfile: add commented out line to use local behaviors.js
* bump behaviors to 0.5.2
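The QueueState / per-page addLink wiring referenced above, as a rough sketch (the atomic Redis call is stubbed out; in the crawler it is a single call that checks the max-URL limit and the seen set before adding):

```ts
import { Page } from "puppeteer-core";

enum QueueState {
  ADDED = 0,
  LIMIT_HIT = 1,
  DUPE_URL = 2,
}

async function addToQueueAtomic(url: string): Promise<QueueState> {
  // stub: one Redis call that atomically checks limit + seen set, then adds
  return QueueState.ADDED;
}

async function setupLinkExtraction(page: Page) {
  // one exposed function per page: the in-browser side dedups URLs and calls
  // addLink one URL at a time instead of returning the whole list at once
  await page.exposeFunction("addLink", async (url: string) => {
    switch (await addToQueueAtomic(url)) {
      case QueueState.ADDED:
        return;
      case QueueState.LIMIT_HIT:
        // page limit reached: stop queueing further URLs
        return;
      case QueueState.DUPE_URL:
        // already seen, ignore
        return;
    }
  });
}
```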
* Add option to output stats file live, i.e. after each page crawled
* Always output stat files after each page crawled (+ test)
* Fix inversion between expected and test value
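A sketch of the per-page stats output (field names assumed; the stats file path comes from the crawler's stats file option):

```ts
import fs from "node:fs/promises";

interface CrawlStats {
  crawled: number;
  total: number;
  pending: number;
  failed: number;
}

async function writeStats(statsFilename: string, stats: CrawlStats) {
  // rewritten after every crawled page so external tools can poll progress
  await fs.writeFile(statsFilename, JSON.stringify(stats, null, 2));
}
```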