Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	9f18a49c0a	Better tracking of failed requests + logging context exclude (#485 ) - add --logExcludeContext for log contexts that should be excluded (while --logContext specifies which are to be included) - enable 'recorderNetwork' logging for debugging CDP network - create default log context exclude list (containing: screencast, recorderNetwork, jsErrors), customizable via --logExcludeContext recorder: Track failed requests and include in pageinfo records with status code 0 - cleanup cdp handler methods - intercept requestWillBeSent to track requests that started (but may not complete) - fix shouldSkip() still working if no url is provided (eg. check only headers) - set status to 0 for async fetch failures - remove responseServedFromCache interception, as response data generally not available then, and responseReceived is still called - pageinfo: include page requests that failed with status code 0, also include 'error' status if available. - ensure page is closed on failure - ensure pageinfo still written even if nothing else is crawled for a page - track cached responses, add to debug logging (can also add to pageinfo later if needed) tests: add pageinfo test for crawling invalid URL, which should still result in pageinfo record with status code 0 bump to 1.0.0-beta.7	2024-03-07 11:35:53 -05:00
Ilya Kreymer	63cedbc91a	version: bump to 1.0.0-beta.6	2024-03-04 18:11:28 -08:00
Ilya Kreymer	dd78457b2b	version: bump to 1.0.0-beta.5	2024-02-28 22:57:05 -08:00
Ilya Kreymer	184f4a2395	Ensure links added via behaviors also get processed (#478 ) Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors 0.5.3, which will add support for behaviors to add links. Simplify adding links by simply adding the links directly, instead of batching to 500 links. Errors are already being logged if queueing a new URL fails.	2024-02-28 22:56:32 -08:00
Ilya Kreymer	a5e939567c	Set warc prefix via WARC_PREFIX env var (#470 ) In addition to `--warcPrefix` flag, also support WARC_PREFIX env var, which takes precedence. Bump to 1.0.0-beta.4	2024-02-21 11:30:28 -08:00
Ilya Kreymer	46eb02dfcb	version: bump to 1.0.0-beta.3	2024-02-16 14:37:58 -08:00
Ilya Kreymer	298deac59d	add fix from 0.12.4 - puppeteer-core to 20.8.2 bump to 1.0.0-beta.2	2024-01-17 14:44:34 -08:00
Ilya Kreymer	db2dbe042f	bump to 1.0.0-beta.1 update yarn.lock	2024-01-03 00:21:03 -08:00
Ilya Kreymer	63c884fb1b	Merge branch 'main' (0.12.3) into 1.0.0	2024-01-03 00:20:23 -08:00
Ilya Kreymer	c3b98e5047	Add timeout to final awaitPendingClear() (#442 ) Ensure the final pending wait also has a timeout, set to max page timeout x num workers. Could also set higher, but needs to have a timeout, eg. in case of downloading live stream that never terminates. Fixes #348 in the 0.12.x line. Also bumps version to 0.12.3	2023-11-16 16:20:09 -05:00
dependabot[bot]	540c355d25	Bump sharp from 0.32.1 to 0.32.6 (#443 ) Bumps [sharp](https://github.com/lovell/sharp) from 0.32.1 to 0.32.6 to fix vulnerability	2023-11-16 16:18:00 -05:00
Ilya Kreymer	9ba0b9edc1	Backport pending list never being reprocessed (#438 ) Backport of #433 to 0.12.x. Bump version to 0.12.2	2023-11-13 19:21:48 -08:00
Emma Segal-Grossman	2a49406df7	Add Prettier to the repo, and format all the files! (#428 ) This adds prettier to the repo, and sets up the pre-commit hook to auto-format as well as lint. Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.	2023-11-09 16:11:11 -08:00
Ilya Kreymer	af1e0860e4	TypeScript Conversion (#425 ) Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: emma <hi@emma.cafe>	2023-11-09 11:27:11 -08:00
Ilya Kreymer	877d9f5b44	Use new browser-based archiving mechanism instead of pywb proxy (#424 ) Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files via the Chrome Debug Protocol (CDP). Allows for more flexibility and accuracy when dealing with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 Changes include: - Recorder class for capturing CDP network traffic for each page. - Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc..) - WARC writing support via TS-based warcio.js library. - Generates single WARC file per worker (still need to add size rollover). - Request interception via Fetch.requestPaused - Rule-based rewriting response support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest() - Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch via fetch() - Direct async fetch() capture of non-HTML URLs - Awaiting for all requests to finish before moving on to next page, up to page timeout. - Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use). - removed pywb, using cdxj-indexer for --generateCDX option.	2023-11-07 21:38:50 -08:00
Ilya Kreymer	dd7b926d87	Exclusion Optimizations: follow-up to (#423 ) Follow-up to #408 - optimized exclusion filtering: - use zscan with default count instead of ordered scan to remove - use glob match when possible (non-regex as determined by string check) - move isInScope() check to worker to avoid creating a page and then closing for every excluded URL - tests: update saved-state test to be more resilient to delays args: also support '--text false' for backwards compatibility, fixes webrecorder/browsertrix-cloud#1334 bump to 0.12.1	2023-11-03 15:15:09 -07:00
Ilya Kreymer	15661eb9c8	More flexible multi value arg parsing + README update for 0.12.0 (#422 ) Updated arg parsing thanks to example in https://github.com/yargs/yargs/issues/846#issuecomment-517264899 to support multiple value arguments specified as either one string or multiple string using array type + coerce function. This allows for `choice` option to also be used to validate the options, when needed. With this setup, `--text to-pages,to-warc,final-to-warc`, `--text to-pages,to-warc --text final-to-warc` and `--text to-pages --text to-warc --text final-to-warc` all result in the same configuration! Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system. Also updated README with new text extraction options and bumped version to 0.12.0 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 11:47:37 -07:00
Ilya Kreymer	064db52272	base image: bump brave to 1.59.120 version: bump to 0.12.0-beta.2	2023-10-26 19:48:49 -07:00
Ilya Kreymer	3a83695524	storage: also compute crc32 as part of storage webhook when uploading… (#414 ) … a WACZ file fixes #412	2023-10-20 16:29:07 -07:00
Ilya Kreymer	9ae297c000	version: bump to 0.12.0-beta.1	2023-10-09 14:03:31 -07:00
Ilya Kreymer	f453dbfb56	Switch to Brave Base Image (#400 ) * switch to brave: - switch base browser to brave base image 1.58.135 - tests: add extra delay for blocking tests - bump to 0.12.0-beta.0	2023-10-02 14:30:44 -07:00
Ilya Kreymer	4c7ebf18d4	version: bump to 0.11.2	2023-09-29 11:18:22 -07:00
Ilya Kreymer	0c88eb78af	favicon: use 127.0.0.1 instead of localhost (#384 ) catch exception in fetch bump to 0.11.1	2023-09-17 12:50:39 -07:00
Ilya Kreymer	3c9be514d3	behavior logging tweaks, add netIdle (#381 ) * behavior logging tweaks, add netIdle * fix shouldIncludeFrame() check: was actually erroring out and never accepting any iframes! now used not only for link extraction but also to run() behaviors * add logging if iframe check fails * Dockerfile: add commented out line to use local behaviors.js * bump behaviors to 0.5.2	2023-09-14 19:48:41 -07:00
Ilya Kreymer	6a73d292b4	bump to 0.11.0 for new features	2023-09-13 10:39:59 -07:00
Graham Hukill	1eeee2c215	Surface lastmod option for sitemap parser (#367 ) * Surface lastmod option for sitemap parser - Add --sitemapFromDate to use along with --useSitemap which will filter sitemap by on or after specified ISO date. The library used to parse sitemaps for URLs added an optional "lastmod" argument in v3.2.5 that allows filtering URLs returned by a "last_modified" element present in sitemap XMLs. This surfaces that argument to the browsertrix-crawler CLI runtime parameters. This can be useful for orienting a crawl around a list of seeds known to contain sitemaps, but are only interested in including URLs that have been modified on or after X date. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-13 10:20:41 -07:00
Anish Lakhwara	d42010a598	feat: precommit (#363 ) * add .husky/pre-commit * run lint on precommit	2023-09-07 13:03:22 -07:00
Ilya Kreymer	3c2f5f8934	link extraction optimization: for scopeType page, set depth == extraHops to avoid getting links (#364 ) if we know no additional links will be used	2023-08-31 13:42:14 -07:00
Ilya Kreymer	5ba6c33bff	args parsing: fix parseRx() for inclusions/exclusions to deal with non-string types (fixes #352 ) (#353 ) treat non-regexes as strings and pass to RegExp constructor tests: add additional scope parsing tests for different types passed in as exclusions update yargs bump to 0.10.4	2023-08-13 15:08:36 -07:00
Ilya Kreymer	16751de147	version: bump to 0.10.3	2023-08-08 08:43:27 -07:00
Tessa Walsh	22dc2e8426	deps: bump browsertrix-behaviors to ^0.5.1 (#341 )	2023-07-06 10:15:18 -07:00
Ilya Kreymer	5ce410c275	profiles: use newly provided puppeteer page.setBypassServiceWorker() (#340 ) * profiles: use newly provided puppeteer page.setBypassServiceWorker() instead of cdp command bump puppeteer core to 20.7.4	2023-07-06 10:09:32 -04:00
Ilya Kreymer	3049b957bd	version: bump to 0.10.2 deps: bump to py-wacz 0.4.9	2023-07-05 21:20:58 -07:00
Ilya Kreymer	c7dc504c75	deps: update puppeteer-core to 20.4.0, fixes #324 (#325 )	2023-05-30 19:25:54 -07:00
Ilya Kreymer	7c6c7d57a8	version: bump to 0.10.1	2023-05-30 19:12:28 -07:00
Ilya Kreymer	db46cdf6d5	version: bump to 0.10.0	2023-05-23 12:45:29 -07:00
Ilya Kreymer	f51154facb	Chrome 112 + new headless mode + consistent viewport tweaks (#316 ) * base: update to chrome 112 headless: switch to using new headless mode available in 112 which is more in sync with headful mode viewport: use fixed viewport matching screen dimensions for headless and headful mode (if GEOMETRY is set) profiles: fix catching new window message, reopening page in current window versions: bump to pywb 2.7.4, update puppeteer-core to (20.2.1) bump to 0.10.0-beta.4 * profile: force reopen in current window only for headless mode (currently breaks otherwise), remove logging messages	2023-05-22 16:24:39 -07:00
Tessa Walsh	cc606deba9	Improve thumbnails with sharp (#304 ) * Resize thumbnails to 640x360 with sharp	2023-05-19 11:30:24 -07:00
Ilya Kreymer	b5df5ad3c1	version: bump to 0.10.0-beta.3	2023-05-19 07:44:29 -07:00
Ilya Kreymer	4b0dee56c2	state: adjust redis keys to be more consistent (#309 ) - use <crawlid>:stopping for crawl stop request - use <crawlid>:size for total setting crawl total size bump to 0.10.0-beta.2	2023-05-07 13:01:24 -07:00
Ilya Kreymer	ba6a3b6d6a	version: bump to 0.10.0-beta.1	2023-05-06 00:12:09 -07:00
Ilya Kreymer	71b618fe94	Switch back to Puppeteer from Playwright (#301 ) - reduced memory usage, avoids memory leak issues caused by using playwright (see #298) - browser: split Browser into Browser and BaseBrowser - browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later - browser: use defaultArgs from playwright - browser: attempt to recover if initial target is gone - logging: add debug logging from process.memoryUsage() after every page - request interception: use priorities for cooperative request interception - request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used - request interception: fix originOverrides enabled check, fix to work with catch-all request interception - default args: set --waitUntil back to 'load,networkidle2' - Update README with changes for puppeteer - tests: fix extra hops depth test to ensure more than one page crawled --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-26 15:41:35 -07:00
Ilya Kreymer	5c497f4fa4	version: bump version to 0.10.0-beta.0	2023-04-19 19:17:58 -07:00
Ilya Kreymer	4a27f8c4a0	version: bump to 0.9.1	2023-04-08 16:53:57 -07:00
Ilya Kreymer	ebdf0ac8f8	version: bump to 0.9.0!	2023-04-07 17:42:46 -07:00
Ilya Kreymer	24e9c43b29	version: bump to 0.9.0-beta.2	2023-04-03 11:52:24 -07:00
Ilya Kreymer	02fb137b2c	Catch loading issues (#255 ) * various loading improvements to avoid pages getting 'stuck' + load state tracking - add PageState object, store loadstate (0 to 4) as well as other per-page-state properties on defined object. - set loadState to 0 (failed) by default - set loadState to 1 (content-loaded) on 'domcontentloaded' event - if page.goto() finishes, set to loadState to 2 'full-page-load'. - if page.goto() times out, if no domcontentloaded either, fail immediately. if domcontentloaded reached, extract links, but don't run behaviors - page considered 'finished' if it got to at least loadState 2 'full-pageload', even if behaviors timed out - pages: log 'loadState' as part of pages.jsonl - improve frame detection: detect if frame actually not from a frame tag (eg. OBJECT) tag, and skip as well - screencaster: try screencasting every frame for now instead of every other frame, for smoother screencasting - deps: behaviors: bump to browsertrix-behaviors 0.5.0-beta.0 release (includes autoscroll improvements) - workers ids: just use 0, 1, ... n-1 worker indexes, send numeric index as part of screencast messages - worker: only keeps track of crash state to recreate page, decouple crash and page failed/succeeded state - screencaster: allow reusing caster slots with fixed ids - interrupt timedCrawlPage() wait if 'crash' event happens - crawler: pageFinished() callback when page finishes - worker: add workerIdle callback, call screencaster.stopById() and send 'close' message when worker is empty	2023-03-20 18:31:37 -07:00
Ilya Kreymer	82808d8133	Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253 ) * Migrate from Puppeteer to Playwright! - use playwright persistent browser context to support profiles - move on-new-page setup actions to worker - fix screencaster, init only one per page object, associate with worker-id - fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage - port additional chromium setup options - create / detach cdp per page for each new page, screencaster just uses existing cdp - fix evaluateWithCLI to call CDP command directly - workers directly during WorkerPool - await not necessary * State / Worker Refactor (#252) * refactoring state: - use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState - remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster - switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150) - override console.error to avoid logging ioredis errors (fixes #244) - add MAX_DEPTH as const for extraHops - fix immediate exit on second interrupt * worker/state refactor: - remove job object from puppeteer-cluster - rename shift() -> nextFromQueue() - condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc... - screencaster: don't screencast about:blank pages * more worker queue refactor: - remove p-queue - initialize PageWorkers which run in its own loop to process pages, until no pending pages, no queued pages - add setupPage(), teardownPage() to crawler, called from worker - await runWorkers() promise which runs all workers until completion - remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code) - bump to 0.9.0-beta.1 * use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition) * more fixes for playwright: - fix profile creation - browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout - crawler: various fixes, including for html check - logging: addition logging for screencaster, new window, etc... - remove unused packages --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-03-17 12:50:32 -07:00
Tessa Walsh	1bee46b321	Remove puppeteer-cluster + iframe filtering + health check refactor + logging improvements (0.9.0-beta.0) (#219 ) * This commit removes puppeteer-cluster as a dependency in favor of a simpler concurrency implementation, using p-queue to limit concurrency to the number of available workers. As part of the refactor, the custom window concurrency model in windowconcur.js is removed and its logic implemented in the new Worker class's initPage method. * Remove concurrency models, always use new tab * logging improvements: include worker-id in logs, use 'worker' context - logging: log info string / version as first line - logging: improve logging of error stack traces - interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue - interruption: don't repair if interrupting, wait for queue to be idle - log text extraction - init order: ensure wb-manager init called first, then logs created - logging: adjust info->debug logging - Log no jobs available as debug * tests: bail on first failure * iframe filtering: - fix filtering for about:blank iframes, support non-async shouldProcessFrame() - filter iframes both for behaviors and for link extraction - add 5-second timeout to link extraction, to avoid link extraction holding up crawl! - cache filtered frames * healthcheck/worker reuse: - refactor healthchecker into separate class - increment healthchecker (if provided) if new page load fails - remove experimental repair functionality for now - add healthcheck * deps: bump puppeteer-core to 17.1.2 - bump to 0.9.0-beta.0 -------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-03-08 18:31:19 -08:00
Ilya Kreymer	5da379cb5f	Logging and Behavior Tweaks (#229 ) - Ensure page is included in all logging details - Update logging messages to be a single string, with variables added in the details - Always wait for all pending wait requests to finish (unless counter <0) - Don't set puppeteer-cluster timeout (prep for removing puppeteer-cluster) - Add behaviorTimeout to running behaviors in crawler, in addition to in behaviors themselves. - Add logging for behavior start, finish and timeout - Move writeStats() logging to beginning of each page as well as at the end, to avoid confusion about pending pages. - For events from frames, use frameUrl along with current page - deps: bump browsertrix-behaviors to 0.4.2 - version: bump to 0.8.1	2023-02-23 18:50:22 -08:00


218 commits