Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	1e7f8361fe	tests: fix blockrules tests (#603) The blockrules tests assumed that YouTube serves videos with the `video/mp4` MIME type. However, YouTube now also serves them with the MIME type `application/vnd.yt-ump`. Both MIME types are now checked to verify that videos are present.	2024-06-13 12:12:46 -07:00
Ilya Kreymer	e2b4cc1844	proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars (#589 ) fixes #587 The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they were hardcoded to obsolete values in the Dockerfile. Proxy settings can now be set, in order of precedence via: - --proxyServer cli flag - PROXY_SERVER env var - PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server only (for backwards compatibility with 0.12.x) The --proxyServer / PROXY_SERVER settings are passed to the browser via the --proxy-server flag. AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying. Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth (supported in Brave, but not Chrome!) --------- Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-06-10 13:11:00 -07:00
Ilya Kreymer	b83d1c58da	add --dryRun flag and mode (#594 ) - if set, runs the crawl but doesn't store any archive data (WARCS, WACZ, CDXJ) while logs and pages are still written, and saved state can be generated (per the --saveState options). - adds test to ensure only 'logs' and 'pages' dirs are generated with --dryRun - screenshot, text extraction are skipped altogether in dryRun mode, warning is printed that storage and archiving-related options may be ignored - fixes #593	2024-06-07 10:34:19 -07:00
benoit74	32435bfac7	Consider disk usage of collDir instead of default /crawls (#586 ) Fix #585 Changes: - compute disk usage based on crawler `collDir` property instead of always computing it on `/crawls` directory	2024-06-07 10:13:15 -07:00
Ilya Kreymer	1bd94d93a1	cleanup dockerfile + fix test (#595 ) - remove obsolete line from Dockerfile - fix pdf test to webrecorder-hosted pdf	2024-06-06 12:14:44 -07:00
Ilya Kreymer	089d901b9b	Always add warcinfo records to all WARCs (#556 ) Fixes #553 Includes `warcinfo` records at the beginning of new WARCs, as well as the combined WARC. Makes the warcinfo record also WARC/1.1 to match the rest of the WARC records.	2024-05-22 15:47:05 -07:00
Ilya Kreymer	894681e5fc	Bump version to 1.2.0 Beta + make draft release for each commit (#582 ) Generate draft release from main and *-release branches to simplify release process --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-05-22 15:45:48 -07:00
Tessa Walsh	1fcd3b7d6b	Fix failOnFailedLimit and add tests (#580 ) Fixes #575 - Adds a missing await to fetching the number of failed pages from Redis - Fixes a typo in the fatal logging message - Adds a test to ensure that the crawl fails with exit code 17 if --failOnInvalidStatus and --failOnFailedLimit 1 are set with a url that will 404	2024-05-21 16:35:43 -07:00
Tessa Walsh	8318039ae3	Fix regressions with `failOnFailedSeed` option (#572) Fixes #563 This PR makes a few changes to fix a regression in behavior around `failOnFailedSeed` for the 1.x releases: - Fail with exit code 1, not 17, when pages are unreachable due to DNS not resolving or other network errors if the page is a seed and `failOnFailedSeed` is set - Extend tests, add test to ensure crawl succeeds on 404 seed status code if `failOnInvalidStatus` isn't set	2024-05-15 11:02:33 -07:00
Ilya Kreymer	10f6414f2f	PDF loading status code fix (#571 ) when loading a PDF as a page, the browser returns a 'false positive' net::ERR_ABORTED even though the PDF is loaded. - this is already handled, but status code was still being cleared, ensure status code is not reset to 0 on response - ensure page status and mime are also recorded if this failure is ignored (in shouldIgnoreAbort) - tests: add test for PDF capture fixes #570	2024-05-14 15:26:06 -07:00
Ilya Kreymer	15d2b09757	warcinfo: fix version to 1.1 to avoid confusion (part of #553 ) (#557 ) Ensure warcinfo record is also WARC/1.1	2024-04-18 21:52:24 -07:00
Ilya Kreymer	f6edec0b95	Fix for --rolloverSize for individual WARCs in 1.x (#542 ) Fixes #533 Fixes rollover in WARCWriter, separate from combined WARC rollover size: - check rolloverSize and close previous WARCs when size exceeds - add timestamp to resource WARC filenames to support rollover, eg. screenshots-{ts}.warc.gz - use append mode for all write streams, just in case - tests: add test for rollover of individual WARCs with 500K size limit - tests: update screenshot tests to account for WARCs now being named screenshots-{ts}.warc.gz instead of just screenshots.warc.gz	2024-04-15 13:43:08 -07:00
Ilya Kreymer	b5f3238c29	Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz (#535 ) Cherry-picked from the use-js-wacz branch, now implementing separate writing of pages.jsonl / extraPages.jsonl to be used with py-wacz and new `--copy-page-files` flag. Dependent on py-wacz 0.5.0 (via webrecorder/py-wacz#43) --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-11 13:55:52 -07:00
Ilya Kreymer	01c4139aa7	Fixes from 1.0.3 release -> main (#517 ) sitemap improvements: gz support + application/xml + extraHops fix #511 - follow up to https://github.com/webrecorder/browsertrix-crawler/issues/496 - support parsing sitemap urls that end in .gz with gzip decompression - support both `application/xml` and `text/xml` as valid sitemap content-types (add test for both) - ignore extraHops for sitemap found URLs by setting to past extraHops limit (otherwise, all sitemap URLs would be treated as links from seed page) fixes redirected seed (from #476) being counted against page limit: #509 - subtract extraSeeds when computing limit - don't include redirect seeds in seen list when serializing - tests: adjust saved-state-test to also check total pages when crawl is done fixes #508	2024-03-26 14:50:36 -07:00
Ilya Kreymer	bb9c82493b	QA Crawl Support (Beta) (#469 ) Initial (beta) support for QA/replay crawling! - Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page - Runs local http server with full-page, ui-less ReplayWeb.page embed - ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint. - Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd - Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified. - Supports `--qaDebugImageDiff` for outputting crawl / replay/ diff images. - If using --writePagesToRedis, a `comparison` key is added to existing page data where: ``` comparison: { screenshotMatch?: number; textMatch?: number; resourceCounts: { crawlGood?: number; crawlBad?: number; replayGood?: number; replayBad?: number; }; }; ``` - bump version to 1.1.0-beta.2	2024-03-22 17:32:42 -07:00
Ilya Kreymer	56053534c5	SAX-based sitemap parser (#497) Adds a new SAX-based sitemap parser, inspired by: https://www.npmjs.com/package/sitemap-stream-parser Supports: - recursively parsing sitemap indexes, using p-queue to process N at a time (currently 5) - `fromDate` and `toDate` filter dates, to only include URLs between the given dates, filtering nested sitemap lists included - async parsing, continue parsing in the background after 100 URLs - timeout for initial fetch / first 100 URLs set to 30 seconds to avoid slowing down the crawl - save/load state integration: mark if sitemaps have already been parsed in redis, serialize to save state, to avoid reparsing again. (Will reparse if parsing did not fully finish) - Aware of `pageLimit`, don't add URLs past the page limit, interrupt further parsing when at limit. - robots.txt `sitemap:` parsing, check URL extension and mime type - automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt, then /sitemap.xml - tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL. Fixes #496 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-03-18 19:14:07 -07:00
Ilya Kreymer	6d04c9575f	Fix Save/Load State (#495 ) - Fixes state serialization, which was missing the done list. Instead, adds a 'finished' list computed from the seen list, minus failed and queued URLs. - Also adds serialization support for 'extraSeeds', seeds added dynamically from a redirect (via #475). Extra seeds are added to Redis and also included in the serialization. Fixes #491	2024-03-15 20:54:43 -04:00
Ilya Kreymer	9f18a49c0a	Better tracking of failed requests + logging context exclude (#485 ) - add --logExcludeContext for log contexts that should be excluded (while --logContext specifies which are to be included) - enable 'recorderNetwork' logging for debugging CDP network - create default log context exclude list (containing: screencast, recorderNetwork, jsErrors), customizable via --logExcludeContext recorder: Track failed requests and include in pageinfo records with status code 0 - cleanup cdp handler methods - intercept requestWillBeSent to track requests that started (but may not complete) - fix shouldSkip() still working if no url is provided (eg. check only headers) - set status to 0 for async fetch failures - remove responseServedFromCache interception, as response data generally not available then, and responseReceived is still called - pageinfo: include page requests that failed with status code 0, also include 'error' status if available. - ensure page is closed on failure - ensure pageinfo still written even if nothing else is crawled for a page - track cached responses, add to debug logging (can also add to pageinfo later if needed) tests: add pageinfo test for crawling invalid URL, which should still result in pageinfo record with status code 0 bump to 1.0.0-beta.7	2024-03-07 11:35:53 -05:00
Ilya Kreymer	5a47cc4b41	warc: add Network.resourceType (https://chromedevtools.github.io/devt … (#481) Add resourceType value from https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType as `WARC-Resource-Type` header, lowercased to match puppeteer/playwright convention fixes #451	2024-03-04 18:10:45 -08:00
Ilya Kreymer	51660cdcc4	pageinfo: add console errors to pageinfo record, tracking in 'counts' field (#471 ) Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.	2024-02-21 16:02:25 -08:00
Ilya Kreymer	a512e92886	Include resource type + mime type in page resources list (#468 ) The `:pageinfo:<url>` record now includes the mime type + resource type (from Chrome) along with status code for each resource, for better filtering / comparison.	2024-02-19 19:11:48 -08:00
Ilya Kreymer	e8f2073a7e	Update Browser Image (#466 ) - Update to Brave browser (1.62.165) - Update page resource test to reflect latest Brave behavior	2024-02-17 22:40:12 -08:00
Ilya Kreymer	96f3c407b1	Page Resources: Include Cached Resources (#465 ) Ensure cached resources (that are not written to WARC) are still included in the `url:pageinfo:...` records. This will make it easier to track which resources are actually loaded from a given page. Tests: add test to ensure pageinfo record for webrecorder.net and webrecorder.net/about include cached resources	2024-02-16 14:36:32 -08:00
Ilya Kreymer	703835a7dd	detect invalid custom behaviors on load: (#450 ) - on first page, attempt to evaluate the behavior class to ensure it compiles - if fails to compile, log exception with fatal and exit - update behavior gathering code to keep track of behavior filename - tests: add test for invalid behavior which causes crawl to exit with fatal exit code (17)	2023-12-13 15:14:53 -05:00
Ilya Kreymer	3323262852	WARC filename prefix + rollover size + improved 'livestream' / truncated response support. (#440) Support for rollover size and custom WARC prefix templates: - reenable --rolloverSize (default to 1GB) for when a new WARC is created - support custom WARC prefix via --warcPrefix, prepended to new WARC filename, test via basic_crawl.test.js - filename template for new files is: `${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}` with `$ts` replaced at new file creation time with current timestamp Improved support for long (non-terminating) responses, such as from live-streaming: - add a size to CDP takeStream to ensure data is streamed in fixed chunks, defaulting to 64k - change shutdown order: first close browser, then finish writing all WARCs to ensure any truncated responses can be captured. - ensure WARC is not rewritten after it is done, skip writing records if stream already flushed - add timeout to final fetch tasks to avoid hanging indefinitely on finish - fix adding `WARC-Truncated` header, need to set after stream is finished to determine if it's been truncated - move temp download `tmp-dl` dir to main temp folder, outside of collection (no need to be there).	2023-12-07 23:02:55 -08:00
Emma Segal-Grossman	2a49406df7	Add Prettier to the repo, and format all the files! (#428 ) This adds prettier to the repo, and sets up the pre-commit hook to auto-format as well as lint. Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.	2023-11-09 16:11:11 -08:00
Ilya Kreymer	af1e0860e4	TypeScript Conversion (#425 ) Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: emma <hi@emma.cafe>	2023-11-09 11:27:11 -08:00
Ilya Kreymer	877d9f5b44	Use new browser-based archiving mechanism instead of pywb proxy (#424) Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files via the Chrome Debug Protocol (CDP). Allows for more flexibility and accuracy when dealing with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 Changes include: - Recorder class for capturing CDP network traffic for each page. - Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc..) - WARC writing support via TS-based warcio.js library. - Generates single WARC file per worker (still need to add size rollover). - Request interception via Fetch.requestPaused - Rule-based rewriting response support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest() - Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch via fetch() - Direct async fetch() capture of non-HTML URLs - Waiting for all requests to finish before moving on to next page, up to page timeout. - Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use). - removed pywb, using cdxj-indexer for --generateCDX option.	2023-11-07 21:38:50 -08:00
Ilya Kreymer	dd7b926d87	Exclusion Optimizations: follow-up to (#423) Follow-up to #408 - optimized exclusion filtering: - use zscan with default count instead of ordered scan to remove - use glob match when possible (non-regex as determined by string check) - move isInScope() check to worker to avoid creating a page and then closing for every excluded URL - tests: update saved-state test to be more resilient to delays args: also support '--text false' for backwards compatibility, fixes webrecorder/browsertrix-cloud#1334 bump to 0.12.1	2023-11-03 15:15:09 -07:00
Ilya Kreymer	2aeda56d40	improved text extraction: (addresses #403) (#404) - use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to get the snapshot (consistent with ArchiveWeb.page) - should be slightly more performant - keep option to use DOM.getDocument - refactor warc resource writing to separate class, used by text extraction and screenshots - write extracted text to WARC files as 'urn:text:<url>' after page loads, similar to screenshots - also store final text to WARC as 'urn:textFinal:<url>' if it is different - cli options: update `--text` to take one or more comma-separated string options `--text to-warc,to-pages,final-to-warc`. For backwards compatibility, support `--text` and `--text true` to be equivalent to `--text to-pages`. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-10-31 23:05:30 -07:00
Ilya Kreymer	8c92901889	load saved state fixes + redis tests (#415 ) - set done key correctly, just an int now - also check if array for old-style save states (for backwards compatibility) - fixes #411 - tests: includes tests using redis: tests save state + dynamically adding exclusions (follow up to #408) - adds `--debugAccessRedis` flag to allow accessing local redis outside container	2023-10-23 09:36:10 -07:00
Ilya Kreymer	14c8221d46	tests: disable ad-block tests: seeing inconsistent ci behavior, though tests pass on local brave (#407 )	2023-10-09 09:41:50 -07:00
Ilya Kreymer	f453dbfb56	Switch to Brave Base Image (#400 ) * switch to brave: - switch base browser to brave base image 1.58.135 - tests: add extra delay for blocking tests - bump to 0.12.0-beta.0	2023-10-02 14:30:44 -07:00
Tessa Walsh	7e03dc076f	Set new logic for invalid seeds (#395) Allow for some seeds to be invalid unless failOnFailedSeed is set Fail crawl if no valid seeds are provided Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-29 13:02:52 -04:00
Ilya Kreymer	e5b0c4ec1b	optimize link extraction: (fixes #376 ) (#380 ) * optimize link extraction: (fixes #376) - dedup urls in browser first - don't return entire list of URLs, process one-at-a-time via callback - add exposeFunction per page in setupPage, then register 'addLink' callback for each pages' handler - optimize addqueue: atomically check if already at max urls and if url already seen in one redis call - add QueueState enum to indicate possible states: url added, limit hit, or dupe url - better logging: log rejected promises for link extraction - tests: add test for exact page limit being reached	2023-09-15 10:12:08 -07:00
benoit74	947d15725b	Enhance file stats test to detect file modification (#382 ) Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-09-15 12:34:56 -04:00
benoit74	d72443ced3	Add option to output stats file live, i.e. after each page crawled (#374 ) * Add option to output stats file live, i.e. after each page crawled * Always output stat files after each page crawled (+ test) * Fix inversion between expected and test value	2023-09-14 15:16:19 -07:00
Anish Lakhwara	1c486ea1f3	Capture Favicon (#362 ) - get favicon from CDP debug page, if available, log warning if not - store in favIconUrl in pages.jsonl - test: add test for favIcon and additional multi-page crawls	2023-09-10 11:29:35 -07:00
Ilya Kreymer	5ba6c33bff	args parsing: fix parseRx() for inclusions/exclusions to deal with non-string types (fixes #352 ) (#353 ) treat non-regexes as strings and pass to RegExp constructor tests: add additional scope parsing tests for different types passed in as exclusions update yargs bump to 0.10.4	2023-08-13 15:08:36 -07:00
Amani	442f4486d3	feat: Add custom behavior injection (#285 ) * support loading custom behaviors from a specified directory via --customBehaviors * call load() for each behavior incrementally, then call selectMainBehavior() (available in browsertrix-behaviors 0.5.1) * tests: add tests for multiple custom behaviors --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-07-06 13:09:48 -07:00
Tessa Walsh	254da95a44	Fix disk utilization computation errors (#338 ) * Check size of /crawls by default to fix disk utilization check * Refactor calculating percentage used and add unit tests * add tests using df output for with disk usage above and below threshold --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-07-05 21:58:28 -07:00
Ilya Kreymer	392c8bba0f	allow adding --include with pre-existing --scopeType values (besides custom) (fixes #318 ) (#319 ) remove warning when --scopeType and --include used together tests: update tests to reflect new semantics of --include + --scopeType	2023-05-23 09:43:11 -07:00
Ilya Kreymer	71b618fe94	Switch back to Puppeteer from Playwright (#301 ) - reduced memory usage, avoids memory leak issues caused by using playwright (see #298) - browser: split Browser into Browser and BaseBrowser - browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later - browser: use defaultArgs from playwright - browser: attempt to recover if initial target is gone - logging: add debug logging from process.memoryUsage() after every page - request interception: use priorities for cooperative request interception - request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used - request interception: fix originOverrides enabled check, fix to work with catch-all request interception - default args: set --waitUntil back to 'load,networkidle2' - Update README with changes for puppeteer - tests: fix extra hops depth test to ensure more than one page crawled --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-26 15:41:35 -07:00
Tessa Walsh	b303af02ef	Add --title and --description CLI args to write metadata into datapackage.json (#276 ) Multi-word values including spaces must be enclosed in double quotes. Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-04-04 10:46:03 -04:00
Tessa Walsh	62fe4b4a99	Add options to filter logs by --logLevel and --context (#271 ) * Add .DS_Store to gitignore * Add --logLevel and --context filtering options * Add log filtering test	2023-04-01 10:07:59 -07:00
Ilya Kreymer	82808d8133	Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253) * Migrate from Puppeteer to Playwright! - use playwright persistent browser context to support profiles - move on-new-page setup actions to worker - fix screencaster, init only one per page object, associate with worker-id - fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage - port additional chromium setup options - create / detach cdp per page for each new page, screencaster just uses existing cdp - fix evaluateWithCLI to call CDP command directly - workers directly during WorkerPool - await not necessary * State / Worker Refactor (#252) * refactoring state: - use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState - remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster - switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150) - override console.error to avoid logging ioredis errors (fixes #244) - add MAX_DEPTH as const for extraHops - fix immediate exit on second interrupt * worker/state refactor: - remove job object from puppeteer-cluster - rename shift() -> nextFromQueue() - condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc... - screencaster: don't screencast about:blank pages * more worker queue refactor: - remove p-queue - initialize PageWorkers which run in its own loop to process pages, until no pending pages, no queued pages - add setupPage(), teardownPage() to crawler, called from worker - await runWorkers() promise which runs all workers until completion - remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code) - bump to 0.9.0-beta.1 * use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition) * more fixes for playwright: - fix profile creation - browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout - crawler: various fixes, including for html check - logging: additional logging for screencaster, new window, etc... - remove unused packages --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-03-17 12:50:32 -07:00
Tessa Walsh	0cf6219d80	Fix --overwrite CLI flag (#220 ) * Delete collection if --overwrite before wb-manager init * Add tests	2023-02-02 21:02:47 -08:00
Tessa Walsh	c0b0d5b87f	Serialize Redis pending pages as JSON objects (#212 ) * Add redis:// prefix to test --redisStoreUrl * Serialize pending pages as JSON objects	2023-01-23 16:44:03 -08:00
Tessa Walsh	1a066dbd7b	Add RedisCrawlState test (#208 )	2023-01-23 10:16:22 -08:00
Tessa Walsh	0192d05f4c	Implement improved json-l logging - Add Logger class with methods for info, error, warn, debug, fatal - Add context, timestamp, and details fields to log entries - Log messages as JSON Lines - Replace puppeteer-cluster stats with custom stats implementation - Log behaviors by default - Amend argParser to reflect logging changes - Capture and log stdout/stderr from awaited child_processes - Modify tests to use webrecorder.net to avoid timeouts	2023-01-19 14:17:27 -05:00
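
The proxy precedence described in commit e2b4cc1844 (--proxyServer flag, then PROXY_SERVER, then PROXY_HOST + PROXY_PORT) could be sketched roughly as below. This is only an illustration: the function name `resolveProxyServer`, the `ProxyEnv` type, and the exact `http://host:port` URL built from the legacy variables are assumptions, not the crawler's actual code.

```typescript
// Hypothetical sketch of the precedence order; not taken from the repository.
interface ProxyEnv {
  PROXY_SERVER?: string;
  PROXY_HOST?: string;
  PROXY_PORT?: string;
}

function resolveProxyServer(
  cliProxyServer: string | undefined,
  env: ProxyEnv,
): string | undefined {
  // 1. The --proxyServer CLI flag takes precedence over everything else.
  if (cliProxyServer) {
    return cliProxyServer;
  }
  // 2. Next, the PROXY_SERVER env var.
  if (env.PROXY_SERVER) {
    return env.PROXY_SERVER;
  }
  // 3. Finally, the legacy PROXY_HOST + PROXY_PORT pair, which per the commit
  //    message sets an HTTP proxy only (the URL format here is an assumption).
  if (env.PROXY_HOST && env.PROXY_PORT) {
    return `http://${env.PROXY_HOST}:${env.PROXY_PORT}`;
  }
  // No proxy configured.
  return undefined;
}
```

In this sketch, a caller would pass the parsed CLI value and an object with the three environment variables, and hand the result to the browser's --proxy-server flag if it is defined.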


127 commits