Cherry-picked from the use-js-wacz branch, now implementing separate
writing of pages.jsonl / extraPages.jsonl to be used with py-wacz and
new `--copy-page-files` flag.
Dependent on py-wacz 0.5.0 (via webrecorder/py-wacz#43)
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
via the Chrome Debug Protocol (CDP). Allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343
Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.)
- WARC writing support via TS-based warcio.js library.
- Generates single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused
- Rule-based rewriting response support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest()
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream,
async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
via fetch()
- Direct async fetch() capture of non-HTML URLs
- Waiting for all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
- removed pywb, using cdxj-indexer for --generateCDX option.
* base: update to chrome 112
headless: switch to using new headless mode available in 112 which is more in sync with headful mode
viewport: use fixed viewport matching screen dimensions for headless and headful mode (if GEOMETRY is set)
profiles: fix catching new window message, reopening page in current window
versions: bump to pywb 2.7.4, update puppeteer-core to 20.2.1
bump to 0.10.0-beta.4
* profile: force reopen in current window only for headless mode (currently breaks otherwise), remove logging messages
* logging:
- write most of the crawl log to '{coll}/logs/crawl-{iso-timestamp}.log', part of #230
- ensure log filename consists of numeric timestamp only
- close log before wacz file is generated to allow storing log in wacz
- close log after writing stats
- add logs/ directory to wacz with new py-wacz
- deps: bump to py-wacz 0.4.8 to support logs in wacz
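
A minimal sketch of the log naming described above, assuming the "numeric timestamp only" is obtained by stripping every non-digit from an ISO-8601 string. The helper name `crawlLogPath` is hypothetical and not taken from the crawler source.

```typescript
// Hypothetical helper: builds a path like "{coll}/logs/crawl-{timestamp}.log".
// Assumption: the timestamp is the ISO-8601 string with all non-digits removed.
function crawlLogPath(coll: string, now: Date): string {
  const ts = now.toISOString().replace(/\D/g, "");
  return `${coll}/logs/crawl-${ts}.log`;
}
```

Keeping only digits avoids characters such as ':' in the filename, which can be awkward on some filesystems and inside archives.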
* profiles: use vnc for automatic profile creation (fixes #194):
- add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode
- use @novnc/novnc to serve vnc JS library
- add novnc_lite.html to serve the content from an iframe
- optimization: don't show initial blank page / don't wait for initial page in puppeteer
* more vnc work:
- set position of browser at 0,0, avoid needing offset to fit
- add /vncpass endpoint to query vnc password (for use with browsertrix-cloud)
- remove websockify, x11vnc now supports ws connections directly!
- vnc_lite: support reconnecting ws if gracefully disconnected
* x11vnc cleanup: just pass password via cmdline to simplify setup
* make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified
README updates:
- mention new VNC-based streaming
- mention new --automated flag, move automated info below interactive
* README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently
- update pywb to 2.6.7, fix possible error in cdx indexing via --generateCDX
- update wacz to 0.4.6, ensure wacz file is closed, better and more error-resilient text extraction
- update browsertrix-behaviors to 0.3.0, support for telegram behavior
- bump version to 0.5.1
* update to browsertrix-behaviors 0.2.5 to support improved autoscroll
- add evaluateWithCLI() to support evaluate() with 'getEventListeners()' and other devtools command-line api functions, to allow autoscroll behavior to check if it should exit out early
- inject behaviors into interactive loader to allow testing
- fix signal handler if state not inited yet
- dependencies: update puppeteer-cluster to latest, update pywb to 2.6.5
- wacz: update to py-wacz 0.4.1, avoid reading full file into memory to compute hashes
state: fix pending state, account for puppeteer-cluster popping/pushing jobs from queue:
* puppeteer-cluster: add custom 'start()' callback to indicate task actually starting
* new semantics: add pending urls in pending state immediately, remove if re-added to queue, add 'started' when actually started
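
The pending-state semantics above can be sketched roughly as follows. This is an illustrative in-memory model only; the class and method names (`PendingState`, `markPending`, `requeue`, `markStarted`) are hypothetical and do not come from puppeteer-cluster or the crawler.

```typescript
// Illustrative model of the pending-state bookkeeping (names are hypothetical).
class PendingState {
  private pending = new Map<string, { started: boolean }>();

  // Called as soon as a URL is popped from the queue.
  markPending(url: string): void {
    this.pending.set(url, { started: false });
  }

  // Called if the URL is pushed back onto the queue before it ran.
  requeue(url: string): void {
    this.pending.delete(url);
  }

  // Called from the custom start() callback, once the task really begins.
  markStarted(url: string): void {
    const entry = this.pending.get(url);
    if (entry) {
      entry.started = true;
    }
  }

  numPending(): number {
    return this.pending.size;
  }

  isStarted(url: string): boolean {
    return this.pending.get(url)?.started === true;
  }
}
```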
minio: use fPutObject to support parallel uploading, compute hash and size separately (for now)
dependencies: update to latest minio
error checking:
* print number of WARCs found, exit with error if 0
* ensure wacz creation succeeds, exit with error code if not
* validate wacz after creation, exit with error code if validation fails
bump to 0.5.0-beta.3
* initial support for wacz signing (using a custom version of py-wacz)
- signing url and token set via env vars WACZ_SIGN_TOKEN and WACZ_SIGN_URL
- add CHANGELIST for 0.5.0
- bump pywb to 2.6.4
* extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83
* update README with info on `extraHops`, add tests for extraHops
* dependency fix: use pywb 2.6.3, warcio 1.5.0
* bump to 0.5.0-beta.2
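
One way to picture the `--extraHops` behavior is the sketch below. It assumes that each queued URL carries a count of out-of-scope hops, that in-scope links inherit the parent's count, and that out-of-scope links increment it and are dropped once the count exceeds N. The names `QueueEntry` and `nextEntry` are hypothetical, and the real scoping logic may differ.

```typescript
interface QueueEntry {
  url: string;
  extraHops: number; // out-of-scope hops taken so far to reach this URL
}

// Decide whether a link found on a page should be queued.
// Returns null when the link falls outside scope + extra hops.
function nextEntry(
  parentExtraHops: number,
  url: string,
  inScope: (url: string) => boolean,
  maxExtraHops: number,
): QueueEntry | null {
  if (inScope(url)) {
    // Assumption: in-scope links keep the parent's hop count.
    return { url, extraHops: parentExtraHops };
  }
  const extraHops = parentExtraHops + 1;
  return extraHops <= maxExtraHops ? { url, extraHops } : null;
}
```

With N = 1, a link from an in-scope page to another site is queued, but links found on that other site are only followed if they lead back into scope.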
* save state work:
- support interrupting and saving crawl
- support loading crawl state (frontier queue, pending, done) from YAML
- support scope check when loading to apply new scoping rules when restarting crawl
- failed urls added to done as failed, can be retried if crawl is stopped and restarted
- save state to crawls/crawl-<ts>-<id>.yaml when interrupted
- --saveState option controls when crawl state is saved: 'partial' (default, only when interrupted), 'always', or 'never'.
- support in-memory or redis based crawl state, using fork of puppeteer-cluster
- --redisStore used to enable redis-based state
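
The `--saveState` decision and the state file naming can be sketched as below. The function names are hypothetical, and the exact format of `<ts>` is not specified here, so it is passed in as an opaque string.

```typescript
type SaveStateMode = "partial" | "always" | "never";

// Hypothetical helper: should the crawl state be written out now?
function shouldSaveState(mode: SaveStateMode, interrupted: boolean): boolean {
  if (mode === "always") {
    return true;
  }
  if (mode === "never") {
    return false;
  }
  // "partial": only save when the crawl was interrupted
  return interrupted;
}

// Hypothetical helper: mirrors the crawls/crawl-<ts>-<id>.yaml pattern.
function stateFilePath(ts: string, crawlId: string): string {
  return `crawls/crawl-${ts}-${crawlId}.yaml`;
}
```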
* signals/crawl interruption:
- crawl state set to drain/not provide any more urls to crawl
- graceful stop of crawl in response to sigint/sigterm
- initial sigint/sigterm waits for graceful end of current pages, second terminates immediately
- initial sigabrt followed by sigterm terminates immediately
- puppeteer: disable handleSIGTERM, handleSIGHUP, handleSIGINT
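
The signal handling above amounts to a small state machine, sketched here as a pure class so it can be reasoned about without a running process. The names are hypothetical, and treating a lone SIGABRT as "no action yet" is an assumption; the notes above only describe what happens when SIGTERM follows it.

```typescript
type CrawlSignal = "SIGINT" | "SIGTERM" | "SIGABRT";
type SignalAction = "graceful-stop" | "terminate" | "none";

class SignalPolicy {
  private interrupted = false;
  private abortSeen = false;

  handle(sig: CrawlSignal): SignalAction {
    if (sig === "SIGABRT") {
      // Assumption: SIGABRT alone only arms immediate termination.
      this.abortSeen = true;
      return "none";
    }
    if (this.interrupted || (this.abortSeen && sig === "SIGTERM")) {
      return "terminate";
    }
    // First SIGINT/SIGTERM: let current pages finish, provide no new URLs.
    this.interrupted = true;
    return "graceful-stop";
  }
}
```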
* redis state support:
- use lua scripts for atomic move from queue -> pending, and pending -> done
- pending key expiry set to page timeout
- add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
- drainMax returns the numPending() + numSeen() to work with cluster stats
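
As a rough illustration of the queue -> pending -> done transitions, here is an in-memory model. It is not the Redis implementation: the Lua scripts, key expiry and `drainMax` are not modeled, and treating `numSeen()` as the count of all URLs ever added is an assumption. The class name `MemoryCrawlState` is hypothetical.

```typescript
class MemoryCrawlState {
  private queue: string[] = [];
  private pending = new Set<string>();
  private done = new Set<string>();
  private seen = new Set<string>();

  // Add a URL to the queue unless it was already seen.
  add(url: string): boolean {
    if (this.seen.has(url)) {
      return false;
    }
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  // queue -> pending (a single step here; atomic via Lua in the Redis version)
  shift(): string | undefined {
    const url = this.queue.shift();
    if (url !== undefined) {
      this.pending.add(url);
    }
    return url;
  }

  // pending -> done; returns false if the URL was not pending
  finish(url: string): boolean {
    if (!this.pending.delete(url)) {
      return false;
    }
    this.done.add(url);
    return true;
  }

  numPending(): number {
    return this.pending.size;
  }

  numSeen(): number {
    return this.seen.size;
  }
}
```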
* arg improvements:
- add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
- support setting cmdline args via env var CRAWL_ARGS
- use 'choices' in args when possible
* build update:
- switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds
- use setuptools<58.0
* misc crawl/scoping rule fixes:
- scoping rules fix when external is used with scopeType
state:
- limit: ensure no urls, including initial seeds, are added past the limit
- signals: fix immediate shutdown on second signal
- tests: add scope test for default scope + excludes
* py-wacz update
- add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
- pywb: use latest pywb branch for improved twitter video capture
* update to latest browsertrix-behaviors
* fix setuptools dependency #88
* update README for 0.5.0 beta
* blockrules improvements:
- add await to continue/abort to catch errors, each called only in one place.
- avoid adding multiple interception handlers for same page to avoid 'request already handled' errors
- disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning
* setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set.
* scopeType rename:
- rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage
- rename 'none' -> page to indicate default single-page-only crawl
- messaging: adjust error message displaying valid scopeTypes
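
Because the old value 'page' now means something different, configs written for the previous names need translating. A hypothetical migration helper, not part of the crawler, could look like this:

```typescript
// Old scopeType name -> new scopeType name, per the rename above.
const SCOPE_TYPE_RENAMES = new Map<string, string>([
  ["page", "page-spa"],
  ["none", "page"],
]);

// Translate a scopeType from an old config; other values pass through unchanged.
function migrateScopeType(oldType: string): string {
  return SCOPE_TYPE_RENAMES.get(oldType) ?? oldType;
}
```

The helper must be applied exactly once to an old config, since its output for 'none' ('page') is itself an old name.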
* README: Add additional examples for scope rules, update scopeType param, explain the difference between scope rules and block rules, to better address confusion as per #80
bump to 0.4.4
- Support for block rules specified in YAML config to exclude URLs based on regex, and also negate a rule by specifying `allowOnly` to allow URLs based on certain regex.
- Support for conditional blocking for iframes, based on content of iframe text, specified via frameTextMatch regex.
- Support for restricting block rules based on containing frame URL, specified via inFrameURL param.
- Testing for various blockRules configurations
- Fixes Support URL-level WARC-writing inclusion/exclusion lists #15
- optional message to add when a URL is blocked, specified via 'blockMessage'
- update README for blockRules
- bump to pywb dependency 2.5.0b4
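
The block rule evaluation described above might be sketched as follows. This is an assumption-laden illustration, not the crawler's code: the `type` field used to select `allowOnly`, the `PausedRequest` shape, and the evaluation order are all hypothetical, and the iframe text is passed in directly instead of being fetched from the browser.

```typescript
interface BlockRule {
  url: string;                  // regex matched against the request URL
  type?: "block" | "allowOnly"; // assumed way of selecting allowOnly
  inFrameURL?: string;          // regex the containing frame URL must match
  frameTextMatch?: string;      // regex matched against the iframe text
}

interface PausedRequest {
  url: string;
  frameUrl: string;
  frameText?: string;
}

function shouldBlock(rules: BlockRule[], req: PausedRequest): boolean {
  for (const rule of rules) {
    // Rule only applies inside frames whose URL matches inFrameURL.
    if (rule.inFrameURL && !new RegExp(rule.inFrameURL).test(req.frameUrl)) {
      continue;
    }
    const urlMatches = new RegExp(rule.url).test(req.url);
    if (rule.type === "allowOnly") {
      // Negated rule: block everything that does NOT match.
      if (!urlMatches) {
        return true;
      }
      continue;
    }
    if (!urlMatches) {
      continue;
    }
    if (rule.frameTextMatch !== undefined) {
      // Conditional blocking: only block if the iframe text matches.
      if (new RegExp(rule.frameTextMatch).test(req.frameText ?? "")) {
        return true;
      }
      continue;
    }
    return true;
  }
  return false;
}
```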
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)
* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)
* github action ci: use system unzip
* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.
* Update README with info on customizing build image
* bump version to 0.4.0-beta.2
* Create an argument parser class
* move constants, arg parser to separate files in utils/*
* ensure yaml config overridden by command-line args
* yaml loading work:
- simplify yaml config by using yargs.config option
- move all option parsing to argParser, simply expose parseArgs
- export constants directly
- add lint to util/* files
* support inline 'seeds' in cmdline and yaml config
tests:
- add test for crawl config, ensuring seeds crawled + wacz created
- add test to ensure cmdline overrides yaml config
* scope fix: empty scope implies only fixed list, use '.*' for any scope
* lint fix
* update readme with yaml config info
* allow 'url' and 'seeds' if both provided
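
The override order (command-line args win over the YAML config) can be illustrated with a plain merge. The crawler itself delegates this to yargs via its config option; the `mergeConfig` function below is only a hypothetical stand-in to show the intended precedence.

```typescript
type CrawlConfig = Record<string, unknown>;

// Values given on the command line replace YAML values;
// options left undefined on the command line keep their YAML value.
function mergeConfig(yamlConfig: CrawlConfig, cliArgs: CrawlConfig): CrawlConfig {
  const merged: CrawlConfig = { ...yamlConfig };
  for (const key of Object.keys(cliArgs)) {
    const value = cliArgs[key];
    if (value !== undefined) {
      merged[key] = value;
    }
  }
  return merged;
}
```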
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: emmadickson <emma.dickson@artsymail.com>
* screencast support (fixes #43):
- add NewWindowPage concurrency mode to support opening new window, and also reusing pages
- add --screencastPort cli options to enable screencasting, uses websockets to stream frames to client
- concurrency: add separate 'window' concurrency for opening new window per-page in same session, useful for screencasting with multiple workers but within same session
* add warning if using screencasting + more than one worker + page context, recommend 'window'
* cleanup: remove debug console, bump py-wacz dependency, improve close message
* README: add screencasting info to README
Support for profiles via a mounted .tar.gz and --profile option + improved docs #18
* support creating profiles via 'create-login-profile' command with options for where to save profile, username/pass and debug screenshot output. support entering username and password (hidden) on command-line if omitted.
* use patched pywb for fix
* bump browsertrix-behaviors to 0.1.0
* README: updates to include better getting started, behaviors and profile reference/examples
* bump version to 0.3.0!
- inject built 'behaviors.js' from browsertrix-behaviors, init with options and run
- remove bgbehaviors
- move textextract to root for now
- add requirements.txt for python dependencies
- remove obsolete --scroll option, now part of the behaviors system
logging:
- configure logging options via --logging param, can include 'stats' (default), 'pywb', 'behaviors', and 'behaviors-debug'
- inject custom logging function for behaviors to call if either behaviors or behaviors-debug is set
- 'behaviors-debug' prints all debug messages from behaviors, while regular 'behaviors' prints main behavior messages (useful for verification)
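
A sketch of how the `--logging` values could drive behavior logging, assuming the param is a comma-separated list (the notes above do not state the separator). The function names are hypothetical.

```typescript
const LOG_OPTIONS = ["stats", "pywb", "behaviors", "behaviors-debug"] as const;
type LogOption = typeof LOG_OPTIONS[number];

// Parse a value such as "stats,behaviors"; unknown names are ignored.
function parseLogging(value: string): Set<LogOption> {
  const selected = new Set<LogOption>();
  for (const part of value.split(",")) {
    const name = part.trim();
    const match = LOG_OPTIONS.find((opt) => opt === name);
    if (match !== undefined) {
      selected.add(match);
    }
  }
  return selected;
}

// 'behaviors-debug' prints everything; 'behaviors' prints only main messages.
function shouldLogBehaviorMessage(
  selected: Set<LogOption>,
  level: "info" | "debug",
): boolean {
  if (selected.has("behaviors-debug")) {
    return true;
  }
  return selected.has("behaviors") && level === "info";
}
```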
dockerfile: add 'rebuild' arg to facilitate rebuilding image from specific step
bump to 0.3.0-beta.0