* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure
* custom link extraction improvements (improvements for #25)
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector
* tests
- tests: all tests uses 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction
* add to CHANGES, bump to 0.4.2
* optimization: don't intercept requests if no blockRules set
* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages
* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)
* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile
* screencasting: convert newContext to window instead of page by default, instead of just warning about it
* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2
* seeds: add trim() to seed URLs
* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically
* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles
* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes#25
* update CHANGES and README with new features
* bump version to 0.4.1
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)
* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)
* github action ci: use system unzip
* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.
* Update README with info on customizing build image
* bump version to 0.4.0-beta.2
* Create an argument parser class
* move constants, arg parser to separate files in utils/*
* ensure yaml config overriden by command-line args
* yaml loading work:
- simplify yaml config by using yargs.config option
- move all option parsing to argParser, simply expose parseArgs
- export constants directly
- add lint to util/* files
* support inline 'seeds' in cmdline and yaml config
tests:
- add test for crawl config, ensuring seeds crawled + wacz created
- add test to ensure cmdline overrides yaml config
* scope fix: empty scope implies only fixed list, use '.*' for any scope
* lint fix
* update readme with yaml config info
* allow 'url' and 'seeds' if both provided
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: emmadickson <emma.dickson@artsymail.com>