This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignore files to exclude crawls, test-crawls, scratch, and dist as needed.
Follows #424. Converts the upcoming 1.0.0 branch, based on native browser-based traffic capture and recording, to TypeScript. Fixes #426
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
treat non-regexes as strings and pass to RegExp constructor
tests: add additional scope parsing tests for different types passed in as exclusions
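
The exclusion handling described above could look roughly like the sketch below. The names `RxInput`, `toRegExp` and `parseRxList` are hypothetical and not taken from the crawler source; the only behavior taken from the commit message is that non-regex values are stringified and passed to the `RegExp` constructor.

```typescript
// Hypothetical helper names; only the "stringify, then RegExp" rule comes from the commit.
type RxInput = string | number | boolean | RegExp;

function toRegExp(value: RxInput): RegExp {
  // Real regexes pass through untouched; everything else is stringified first.
  return value instanceof RegExp ? value : new RegExp(String(value));
}

function parseRxList(value: RxInput | RxInput[] | null | undefined): RegExp[] {
  // Assumption: a missing value means "no exclusions".
  if (value === null || value === undefined) return [];
  const list = Array.isArray(value) ? value : [value];
  return list.map(toRegExp);
}
```

For example, `parseRxList(["foo", /bar/, 2022])` yields three regexes, the last one matching any URL that contains `2022`. A string that is not a valid pattern would still throw from the `RegExp` constructor.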
update yargs
bump to 0.10.4
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
- add 'scopeType: domain' to include all subdomains of a given seed url, e.g. given `https://example.com/path` as starting seed, will consider `https://*.example.com/` to be in scope.
- include both http/https in all the default scopes except single page (page-spa, prefix, host, domain), e.g. given https://example.com/, will also include http://example.com/
- fixes #116
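
A minimal sketch of how such scope regexes might be built, assuming a hypothetical `scopeRegex()` helper. The exact patterns used by the crawler are not shown in this commit, so the patterns below are illustrative only.

```typescript
// Sketch only: function names and exact patterns are assumptions, not the crawler's code.
type DefaultScope = "prefix" | "host" | "domain";

// Escape regex metacharacters so host names and paths match literally.
function escapeRx(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function scopeRegex(seed: string, scopeType: DefaultScope): RegExp {
  const url = new URL(seed);
  const scheme = "^https?://"; // accept both http and https
  if (scopeType === "domain") {
    // Optional subdomain labels in front of the seed host name.
    return new RegExp(scheme + "([^/]+\\.)?" + escapeRx(url.hostname) + "[/:]");
  }
  if (scopeType === "host") {
    return new RegExp(scheme + escapeRx(url.host) + "/");
  }
  // prefix: everything under the directory of the seed path.
  const dir = url.pathname.slice(0, url.pathname.lastIndexOf("/") + 1);
  return new RegExp(scheme + escapeRx(url.host + dir));
}
```

With the seed `https://example.com/path` and `domain` scope, both `http://example.com/` and `https://sub.example.com/` match, while `https://notexample.com/` does not.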
* save state work:
- support interrupting and saving crawl
- support loading crawl state (frontier queue, pending, done) from YAML
- support scope check when loading to apply new scoping rules when restarting crawl
- failed urls added to done as failed, can be retried if crawl is stopped and restarted
- save state to crawls/crawl-<ts>-<id>.yaml when interrupted
- --saveState option controls when crawl state is saved: partial (the default, save only when interrupted), always, or never.
- support in-memory or redis based crawl state, using fork of puppeteer-cluster
- --redisStore used to enable redis-based state
* signals/crawl interruption:
- crawl state set to drain/not provide any more urls to crawl
- graceful stop of crawl in response to sigint/sigterm
- initial sigint/sigterm waits for graceful end of current pages, second terminates immediately
- initial sigabrt followed by sigterm terminates immediately
- puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT
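
The two-stage interruption can be modeled as a small state machine. This is a sketch under assumptions: `InterruptTracker` and `StopAction` are hypothetical names, and treating a first SIGABRT the same as a first SIGINT/SIGTERM is a guess, since the commit only says that SIGABRT followed by SIGTERM terminates immediately.

```typescript
// Hypothetical names; models only the "first signal drains, second terminates" rule.
type StopSignal = "SIGINT" | "SIGTERM" | "SIGABRT";
type StopAction = "graceful" | "immediate";

class InterruptTracker {
  private interrupted = false;

  onSignal(_signal: StopSignal): StopAction {
    if (this.interrupted) {
      // A second signal of any kind: stop without waiting for pages.
      return "immediate";
    }
    // First signal: stop handing out new URLs and let current pages finish.
    this.interrupted = true;
    return "graceful";
  }
}
```

In a Node process this would typically be wired up with `process.on("SIGINT", ...)` and `process.on("SIGTERM", ...)`, which is why puppeteer's own signal handlers need to be disabled.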
* redis state support:
- use lua scripts for atomic move from queue -> pending, and pending -> done
- pending key expiry set to page timeout
- add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
- drainMax returns the numPending() + numSeen() to work with cluster stats
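
An in-memory analogue of the queue -> pending -> done flow may help illustrate the state model. The class below is a hypothetical sketch, not the crawler's implementation: the method names `add`, `next` and `finish` are invented, and the exact meaning of `numPending()` and `numSeen()` here (pages in flight, and all URLs ever added) is an assumption. Plain arrays keep the sketch short; a set would be the usual choice for the seen list.

```typescript
// Sketch of the state transitions only; names other than numPending/numSeen are invented.
type DoneStatus = "ok" | "failed";

class MemoryCrawlState {
  private seen: string[] = [];
  private queue: string[] = [];
  private pending: string[] = [];
  private done: { url: string; status: DoneStatus }[] = [];

  // Queue a URL unless it was already seen.
  add(url: string): boolean {
    if (this.seen.indexOf(url) !== -1) return false;
    this.seen.push(url);
    this.queue.push(url);
    return true;
  }

  // Move the next URL from queue to pending (done atomically via Lua in the Redis version).
  next(): string | undefined {
    const url = this.queue.shift();
    if (url !== undefined) this.pending.push(url);
    return url;
  }

  // Move a URL from pending to done; failed URLs are recorded as failed.
  finish(url: string, status: DoneStatus = "ok"): boolean {
    const index = this.pending.indexOf(url);
    if (index === -1) return false;
    this.pending.splice(index, 1);
    this.done.push({ url, status });
    return true;
  }

  numPending(): number {
    return this.pending.length;
  }

  numSeen(): number {
    return this.seen.length;
  }
}
```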
* arg improvements:
- add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
- support setting cmdline args via env var CRAWL_ARGS
- use 'choices' in args when possible
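
The environment fallbacks could be resolved along these lines. This is a sketch: `resolveCrawlId` and `envArgs` are hypothetical helpers, the precedence of the command line over `CRAWL_ID` is an assumption, and the whitespace splitting of `CRAWL_ARGS` ignores shell quoting. The host name is passed in as a parameter so the example has no dependencies; in the crawler it would come from `os.hostname()`.

```typescript
// Hypothetical helpers; precedence and splitting rules are assumptions.
type Env = Record<string, string | undefined>;

// Assumed precedence: --crawlId flag, then CRAWL_ID, then the host name.
function resolveCrawlId(cliValue: string | undefined, env: Env, hostname: string): string {
  return cliValue || env.CRAWL_ID || hostname;
}

// Split CRAWL_ARGS on whitespace into argv-style tokens (no quote handling).
function envArgs(env: Env): string[] {
  return (env.CRAWL_ARGS || "").split(/\s+/).filter((token) => token.length > 0);
}
```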
* build update:
- switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds
- use setuptools<58.0
* misc crawl/scoping rule fixes:
- scoping rules fix when external is used with scopeType
state:
- limit: ensure no urls, including initial seeds, are added past the limit
- signals: fix immediate shutdown on second signal
- tests: add scope test for default scope + excludes
* py-wacz update
- add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
- pywb: use latest pywb branch for improved twitter video capture
* update to latest browsertrix-behaviors
* fix setuptools dependency #88
* update README for 0.5.0 beta
* blockrules improvements:
- add await to continue/abort to catch errors, each called only in one place.
- avoid adding multiple interception handlers for same page to avoid 'request already handled' errors
- disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning
* setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set.
* scopeType rename:
- rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage
- rename 'none' -> page to indicate default single-page-only crawl
- messaging: adjust error message displaying valid scopeTypes
* README: Add additional examples for scope rules, update scopeType param, explain the difference between scope rules and block rules, to better address confusion as per #80
bump to 0.4.4
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure
* custom link extraction improvements (improvements for #25)
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector
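
To illustrate the select opts, here is a sketch of how `{selector, extract, isAttribute}` might drive extraction. It is not the crawler's `extractLinks()`: the `ElementLike` interface and the `query` callback are stand-ins for the browser DOM, introduced so the example runs without a browser.

```typescript
// Sketch only: ElementLike and the query callback are stand-ins for the page DOM.
interface SelectOpts {
  selector: string;
  extract: string;
  isAttribute: boolean;
}

const DEFAULT_SELECT: SelectOpts = { selector: "a[href]", extract: "href", isAttribute: false };

interface ElementLike {
  getAttribute(name: string): string | null;
  [prop: string]: unknown;
}

function extractLinks(
  query: (selector: string) => ElementLike[],
  opts: SelectOpts = DEFAULT_SELECT,
): string[] {
  const links: string[] = [];
  for (const elem of query(opts.selector)) {
    // isAttribute: read the raw attribute; otherwise read the element property.
    const value = opts.isAttribute ? elem.getAttribute(opts.extract) : elem[opts.extract];
    if (typeof value === "string" && value.length > 0) links.push(value);
  }
  return links;
}
```

The distinction matters because a property such as `href` is usually resolved to an absolute URL by the browser, while `getAttribute()` returns the raw attribute text.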
* tests
- tests: all tests use 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction
* add to CHANGES, bump to 0.4.2
* optimization: don't intercept requests if no blockRules set
* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages
* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)
* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile
* screencasting: switch newContext to window (rather than page) by default, instead of just warning about it
* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2
* seeds: add trim() to seed URLs
* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically
* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles
* extractLinks: support passing in a custom property to get the link, and also loading it as an attribute via getAttribute. Fixes #25
* update CHANGES and README with new features
* bump version to 0.4.1
* scope fixes:
- remove default prefix scopeType, ensure scope include and exclude take precedence
- add new 'custom' scopeType, when include or exclude are used
- use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude)
- ensure per-seed scope include/exclude used when present, and scopeType set to 'custom'
- ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified
- rename --type to --scopeType in seed to maintain consistency
- add sitemap param as alias for useSitemap
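
One possible reading of the resolution rules above, as a sketch. `resolveScope` and `ScopeOpts` are hypothetical names, and the choice that any include/exclude regex forces 'custom' even when a scopeType is given explicitly is an interpretation of "take precedence", not something the commit states outright.

```typescript
// Sketch of one interpretation of the rules; names and precedence are assumptions.
interface ScopeOpts {
  scopeType?: string;
  include?: string[];
  exclude?: string[];
}

interface ResolvedScope {
  scopeType: string;
  include: string[];
  exclude: string[];
}

function resolveScope(globalOpts: ScopeOpts, seedOpts: ScopeOpts = {}): ResolvedScope {
  // Per-seed include/exclude win over the global ones when present.
  const include = seedOpts.include || globalOpts.include || [];
  const exclude = seedOpts.exclude || globalOpts.exclude || [];
  const hasRules = include.length > 0 || exclude.length > 0;
  // Any include/exclude regex means 'custom'; otherwise fall back to 'prefix'.
  const scopeType = hasRules ? "custom" : seedOpts.scopeType || globalOpts.scopeType || "prefix";
  return { scopeType, include, exclude };
}
```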
tests:
- add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides
- fix screencaster to use relative paths to work with tests
- ci: use yarn instead of npm
* update README with new flags
* bump version to 0.4.0-beta.3