Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-12-08 06:09:48 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	b5f3238c29	Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz (#535 ) Cherry-picked from the use-js-wacz branch, now implementing separate writing of pages.jsonl / extraPages.jsonl to be used with py-wacz and new `--copy-page-files` flag. Dependent on py-wacz 0.5.0 (via webrecorder/py-wacz#43) --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-04-11 13:55:52 -07:00
Ilya Kreymer	877d9f5b44	Use new browser-based archiving mechanism instead of pywb proxy (#424 ) Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files via the Chrome DevTools Protocol (CDP). Allows for more flexibility and accuracy when dealing with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 Changes include: - Recorder class for capturing CDP network traffic for each page. - Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.) - WARC writing support via TS-based warcio.js library. - Generates single WARC file per worker (still need to add size rollover). - Request interception via Fetch.requestPaused - Rule-based response rewriting support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest() - Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch via fetch() - Direct async fetch() capture of non-HTML URLs - Waiting for all requests to finish before moving on to next page, up to page timeout. - Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use). - removed pywb, using cdxj-indexer for --generateCDX option.	2023-11-07 21:38:50 -08:00
Ilya Kreymer	3049b957bd	version: bump to 0.10.2 deps: bump to py-wacz 0.4.9	2023-07-05 21:20:58 -07:00
Ilya Kreymer	f51154facb	Chrome 112 + new headless mode + consistent viewport tweaks (#316 ) * base: update to chrome 112 headless: switch to using new headless mode available in 112 which is more in sync with headful mode viewport: use fixed viewport matching screen dimensions for headless and headful mode (if GEOMETRY is set) profiles: fix catching new window message, reopening page in current window versions: bump to pywb 2.7.4, update puppeteer-core to (20.2.1) bump to 0.10.0-beta.4 * profile: force reopen in current window only for headless mode (currently breaks otherwise), remove logging messages	2023-05-22 16:24:39 -07:00
Ilya Kreymer	63717c4b04	Crawl log (#231 ) * logging: - write most of the crawl log to '{coll}/logs/crawl-{iso-timestamp}.log', part of #230 - ensure log filename consists of numeric timestamp only - close log before wacz file is generated to allow storing log in wacz - close log after writing stats - add logs/ directory to wacz with new py-wacz - deps: bump to py-wacz 0.4.8 to support logs in wacz	2023-02-24 18:31:08 -08:00
Ilya Kreymer	b513246b03	deps: bump pywb to 2.7.3, update CHANGES to current version (#222 ) * deps: bump pywb to 2.7.3 bump to 0.8.0 for release * update CHANGES	2023-02-03 17:56:30 -08:00
kuechensofa	f9df7a94ce	Add requests[socks] python dependency (#201 ) Add requests[socks] python dependency to enable SOCKS proxy support for pywb inside the docker container	2023-01-19 21:55:07 -08:00
Ilya Kreymer	5ee05985b1	Use VNC for headful profile creation (#197 ) * profiles: use vnc for automatic profile creation (fixes #194): - add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode - use @novnc/novnc to serve vnc JS library - add novnc_lite.html to serve the content from an iframe - optimization: don't show initial blank page / don't wait for initial page in puppeteer * more vnc work: - set position of browser at 0,0, avoid needing offset to fit - add /vncpass endpoint to query vnc password (for use with browsertrix-cloud) - remove websockify, x11vnc now supports ws connections directly! - vnc_lite: support reconnecting ws if gracefully disconnected * x11vnc cleanup: just pass password via cmdline to simplify setup * make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified README updates: - mention new VNC-based streaming - mention new --automated flag, move automated info below interactive * README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently	2023-01-09 23:56:53 -08:00
Ilya Kreymer	a52ee5ed1f	dependencies: update to pywb>=2.6.8, browsertrix-behaviors>=0.3.3	2022-09-02 17:45:16 -07:00
Ilya Kreymer	5dfbfbeaf6	update dependencies: (#134 ) - update pywb to 2.6.7, fix possible CDX indexing error via --generateCDX - update wacz to 0.4.6, ensure wacz file is closed, plus better and more error-resilient text extraction - update browsertrix-behaviors to 0.3.0, support for telegram behavior - bump version to 0.5.1	2022-04-15 16:22:47 -07:00
Ilya Kreymer	9b938304ce	dependencies: update to pywb>=2.6.6, wacz>=0.4.5	2022-04-11 15:09:59 -07:00
Ilya Kreymer	09082e8abb	dependencies: set wacz>=0.4.4	2022-03-18 10:38:34 -07:00
Ilya Kreymer	affa45a7d4	dependency: update py-wacz dependency to 0.4.3 (to include webrecorder/py-wacz#16 fix) bump to 0.5.0-beta.4	2022-03-07 08:46:12 -08:00
Ilya Kreymer	761ce7067b	behaviors update (#105 ) * update to browsertrix-behaviors 0.2.5 to support improved autoscroll - add evaluateWithCLI() to support evaluate() with 'getEventListeners()' and other devtools command-line api functions, to allow autoscroll behavior to check if it should exit out early - inject behaviors into interactive loader to allow testing - fix signal handler if state not inited yet - dependencies: update puppeteer-cluster to latest, update pywb to 2.6.5	2022-02-20 22:22:19 -08:00
Ilya Kreymer	c2ce9fc001	various state + wacz fixes: (#101 ) - wacz: update to py-wacz 0.4.1, avoid reading full file into memory to compute hashes state: fix pending state, account for puppeteer-cluster popping/pushing jobs from queue: * puppeteer-cluster: add custom 'start()' callback to indicate task actually starting * new semantics: add pending urls in pending state immediately, remove if re-added to queue, add 'started' when actually started minio: use fPutObject to support parallel uploading, compute hash and size separately (for now) dependencies: update to latest minio error checking: * print number of WARCs found, exit with error if 0 * ensure wacz creation succeeds, exit with error code if not * validate wacz after creation, exit with error code if validation fails bump to 0.5.0-beta.3	2022-02-08 15:31:55 -08:00
Ilya Kreymer	66ce6688eb	Add WACZ Signing Support (#99 ) * initial support for wacz signing (using a custom version py-wacz) - signing url and token set via env vars WACZ_SIGN_TOKEN and WACZ_SIGN_URL - add CHANGELIST for 0.5.0 - bump pywb to 2.6.4	2022-01-26 16:06:10 -08:00
Ilya Kreymer	201eab4ad1	Support Extra Hops beyond current scope with --extraHops option (#98 ) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2	2022-01-15 09:03:09 -08:00
Ilya Kreymer	39ddecd35e	State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78 ) * save state work: - support interrupting and saving crawl - support loading crawl state (frontier queue, pending, done) from YAML - support scope check when loading to apply new scoping rules when restarting crawl - failed urls added to done as failed, can be retried if crawl is stopped and restarted - save state to crawls/crawl-<ts>-<id>.yaml when interrupted - --saveState option controls when crawl state is saved, default to partial/when interrupted, also always, never. - support in-memory or redis based crawl state, using fork of puppeteer-cluster - --redisStore used to enable redis-based state * signals/crawl interruption: - crawl state set to drain/not provide any more urls to crawl - graceful stop of crawl in response to sigint/sigterm - initial sigint/sigterm waits for graceful end of current pages, second terminates immediately - initial sigabrt followed by sigterm terminates immediately - puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT * redis state support: - use lua scripts for atomic move from queue -> pending, and pending -> done - pending key expiry set to page timeout - add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination - drainMax returns the numPending() + numSeen() to work with cluster stats * arg improvements: - add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file) - support setting cmdline args via env var CRAWL_ARGS - use 'choices' in args when possible * build update: - switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds - use setuptools<58.0 * misc crawl/scoping rule fixes: - scoping rules fix when external is used with scopeType state: - limit: ensure no urls, including initial seeds, are added past the limit - signals: fix immediate shutdown on second signal - tests: add scope test for default scope + excludes * py-wacz update - add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2) - pywb: use latest pywb branch for improved twitter video capture * update to latest browsertrix-behaviors * fix setuptools dependency #88 * update README for 0.5.0 beta	2021-09-28 09:41:16 -07:00
Ilya Kreymer	c5494be653	Page Resource Block Rules Avoid Duplicate Handlers + Ignore top-level pages + README update (0.4.4) (#81 ) * blockrules improvements: - add await to continue/abort to catch errors, each called only in one place. - avoid adding multiple interception handlers for same page to avoid 'request already handled' errors - disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning * setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set. * scopeType rename: - rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage - rename 'none' -> page to indicate default single-page-only crawl - messaging: adjust error message displaying valid scopeTypes * README: Add additional examples for scope rules, update scopeType param, explain difference between scope rules and block rules, to better address confusion as per #80 bump to 0.4.4	2021-08-17 20:54:18 -07:00
Ilya Kreymer	d40cf6cc2b	Interactive Profiles + bug fixes (#69 ) * support for interactive profile creation mode via --interactive file * screencasting error catching, ensure errors in screencasting do not interrupt crawl * better error reporting for invalid seed URLs, fixes #67 * README: update to mention interactive profile creation, additional * dependencies: update to pywb 2.6.0b4, py-wacz 0.3.1, browsertrix-behaviors 0.2.3	2021-07-20 15:45:51 -07:00
Ilya Kreymer	6dbdff9656	Support for per-URL conditional Block Rules (#68 ) - Support for block rules specified in YAML config to exclude URLs based on regex, and also negate a rule by specifying `allowOnly` to allow URLs based on certain regex. - Support for conditional blocking for iframes, based on content of iframe text, specified via frameTextMatch regex. - Support for restricting block rules based on containing frame URL, specified via inFrameURL param. - Testing for various blockRules configurations - Fixes Support URL-level WARC-writing inclusion/exclusion lists #15 - optional message to add when a URL is blocked, specified via 'blockMessage' - update README for blockRules - bump to pywb dependency 2.5.0b4	2021-07-19 15:50:32 -07:00
Ilya Kreymer	f57818f2f6	New Docker Image, Customizable Browser Source + Binary (#62 ) * switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!) * add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...) * github action ci: use system unzip * update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file. * Update README with info on customizing build image * bump version to 0.4.0-beta.2	2021-06-24 15:39:17 -07:00
Ilya Kreymer	3ebe511b32	Arg Parsing Refactor + Support for YAML Config Support (take 2!) (#59 ) * Create an argument parser class * move constants, arg parser to separate files in utils/* * ensure yaml config overridden by command-line args * yaml loading work: - simplify yaml config by using yargs.config option - move all option parsing to argParser, simply expose parseArgs - export constants directly - add lint to util/* files * support inline 'seeds' in cmdline and yaml config tests: - add test for crawl config, ensuring seeds crawled + wacz created - add test to ensure cmdline overrides yaml config * scope fix: empty scope implies only fixed list, use '.' for any scope lint fix * update readme with yaml config info * allow 'url' and 'seeds' if both provided Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local> Co-authored-by: emmadickson <emma.dickson@artsymail.com>	2021-06-23 19:45:40 -07:00
Ilya Kreymer	ae4ce979fb	Screencast Support for Debugging (fixes #43 ) (#52 ) * screencast support (fixes #43): - add NewWindowPage concurrency mode to support opening new window, and also reusing pages - add --screencastPort cli options to enable screencasting, uses websockets to stream frames to client - concurrency: add separate 'window' concurrency for opening new window per-page in same session, useful for screencasting with multiple workers but within same session * add warning if using screencasting + more than one worker + page context, recommend 'window' * cleanup: remove debug console, bump py-wacz dependency, improve close message * README: add screencasting info to README	2021-06-07 17:43:36 -07:00
Ilya Kreymer	b1e0654bdd	update to browsertrix-behaviors 0.2.0 update to latest pywb@main create-login-profile: also allow 'email' as alternative to user name bump to 0.3.1-beta.0	2021-04-28 11:00:43 -07:00
Ilya Kreymer	b59788ea04	Profiles: Support for running with existing profiles + saving profile after a login (#34 ) Support for profiles via a mounted .tar.gz and --profile option + improved docs #18 * support creating profiles via 'create-login-profile' command with options for where to save profile, username/pass and debug screenshot output. support entering username and password (hidden) on command-line if omitted. * use patched pywb for fix * bump browsertrix-behaviors to 0.1.0 * README: updates to include better getting started, behaviors and profile reference/examples * bump version to 0.3.0!	2021-04-10 13:08:22 -07:00
Ilya Kreymer	bc7f1badf3	factor out behaviors to browsertrix-behaviors: (#32 ) - inject built 'behaviors.js' from browsertrix-behaviors, init with options and run - remove bgbehaviors - move textextract to root for now - add requirements.txt for python dependencies - remove obsolete --scroll option, now part of the behaviors system logging: - configure logging options via --logging param, can include 'stats' (default), 'pywb', 'behaviors', and 'behaviors-debug' - inject custom logging function for behaviors to call if either behaviors or behaviors-debug is set - 'behaviors-debug' prints all debug messages from behaviors, while regular 'behaviors' prints main behavior messages (useful for verification) dockerfile: add 'rebuild' arg to facilitate rebuilding image from specific step bump to 0.3.0-beta.0	2021-03-13 19:48:31 -05:00

27 commits