Cherry-picked from the use-js-wacz branch: implements separate writing of
pages.jsonl / extraPages.jsonl, to be used with py-wacz and the new
`--copy-page-files` flag (a sketch of the pages.jsonl format follows below).
Depends on py-wacz 0.5.0 (via webrecorder/py-wacz#43)
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
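For reference, pages.jsonl is a JSON-Lines file: per the WACZ pages spec, a header line followed by one record per page. A rough sketch (the header follows the spec; the page record values are purely illustrative):

```jsonl
{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
{"id": "0000-example-id", "url": "https://example.com/", "title": "Example Domain", "ts": "2023-01-01T00:00:00Z"}
```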
This adds prettier to the repo, and sets up a pre-commit hook that
auto-formats as well as lints.
Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.
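One common way to wire this up, as a sketch only (the hook runner and file globs here are assumptions, not necessarily this repo's exact configuration):

```json
{
  "scripts": {
    "format": "prettier --write .",
    "lint": "eslint ."
  },
  "lint-staged": {
    "*.js": ["prettier --write", "eslint --fix"]
  }
}
```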
* Add screenshot and thumbnail functionality
Introduces a --screenshot CLI option, which takes a comma-separated
list of screenshot types: view,fullPage,thumbnail.
In addition, this commit:
- Adds the '--experimental-global-webcrypto' Node flag to ensure WebCrypto
is available in Node
- Deprecates newContext: page context is now always used for 1 worker,
and window context for >1 worker
* Separate screenshotTypes into an exported const (sketched below)
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
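A sketch of the exported const, mapping each `--screenshot` type name (e.g. `--screenshot view,thumbnail`) to its capture options; the option values shown are illustrative, not necessarily the exact ones used:

```js
// map each screenshot type name to the options used when capturing it;
// option values here are illustrative
export const screenshotTypes = {
  view: { type: "png", fullPage: false },
  fullPage: { type: "png", fullPage: true },
  thumbnail: { type: "jpeg", fullPage: false },
};
```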
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove the node-fetch dependency, which is unneeded now that Node 18 provides a global fetch
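The conversion follows the usual CommonJS-to-ESM pattern; a minimal sketch (the function is illustrative, not actual crawler code):

```js
// before (CommonJS), node-fetch was required for fetch():
//   const fetch = require("node-fetch");
//   module.exports = { crawl };

// after (ESM): Node 18 provides a global fetch, so node-fetch is dropped
export async function crawl(url) {
  const resp = await fetch(url); // built-in global fetch, no import needed
  return resp.status;
}
```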
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
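Jest's ESM support relies on Node's vm-modules flag, typically set via NODE_OPTIONS in the test script (a sketch; the exact script in this repo may differ):

```json
{
  "scripts": {
    "test": "NODE_OPTIONS=--experimental-vm-modules jest"
  }
}
```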
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
- support uploading WACZ to s3-compatible storage (via minio client)
- config storage loaded from env vars, enabled when WACZ output is used.
- support pinging either an HTTP or a Redis key-based webhook
- webhook: include a 'completed' bool to indicate whether the crawl fully completed or was partial (e.g. interrupted via signal)
- consolidate redis init to redis.js
- support upload filenames with custom variables: can interpolate the current timestamp (@ts), hostname (@hostname) and user-provided id (@crawlId) (see the sketch after this list)
- README: add docs for s3 storage, remove unused args
- update to pywb 2.6.2, browsertrix-behaviors 0.2.4
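A sketch of how the filename variables could be interpolated (illustrative only; the function name and timestamp format are assumptions, not the crawler's actual code):

```js
import os from "os";

// expand @ts, @hostname and @crawlId in an upload filename template,
// e.g. "@ts-@crawlId.wacz" -> "20230101120000-mycrawl.wacz"
function interpolateFilename(template, crawlId) {
  const ts = new Date().toISOString().replace(/[^\d]/g, "").slice(0, 14);
  return template
    .replaceAll("@ts", ts)
    .replaceAll("@hostname", os.hostname())
    .replaceAll("@crawlId", crawlId);
}
```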
* fix the `limit` option: ensure the limit check uses shared state
* bump version to 0.5.0-beta.1
* fix typo in setting crawler.capturePrefix that caused directFetchCapture() to fail, which in turn caused non-HTML URLs to fail
- wrap directFetchCapture() so that on failure the crawler retries by loading the URL in the browser (see the sketch below)
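The fallback wrapping follows a simple try/catch pattern (a sketch; the names are illustrative, not the crawler's actual internals):

```js
// try the direct (non-browser) fetch capture first; if it throws,
// fall back to loading the URL in the browser page
async function loadWithFallback(url, page, directFetchCapture) {
  try {
    await directFetchCapture(url);
  } catch (e) {
    await page.goto(url); // browser fallback, e.g. a puppeteer Page
  }
}
```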
* custom link extraction improvements (for #25)
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate that in-scope filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute}, defaulting to {selector: "a[href]", extract: "href", isAttribute: false} (see the driver sketch after this list)
- tests: add test for custom driver which uses custom selector
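A minimal custom-driver sketch using the new select opts (the driver shape mirrors the bundled test fixture; the selector is illustrative):

```js
// custom driver: extract links from a custom selector instead of the
// default { selector: "a[href]", extract: "href", isAttribute: false }
module.exports = async ({ data, page, crawler }) => {
  await crawler.loadPage(page, data, [
    { selector: "nav a.internal", extract: "href", isAttribute: false },
  ]);
};
```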
* tests
- tests: all tests use 'test-crawls' instead of 'crawls'
- consolidation: combine the initial crawl + rollover tests, and the WARC and text tests, into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction
* add to CHANGES, bump to 0.4.2