Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (default to 1GB) for when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${his.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with current timestamp
Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
- add timeout to final fetch tasks to avoid never hanging on finish
- fix adding `WARC-Truncated` header, need to set after stream is
finished to determine if its been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure
* custom link extraction improvements (improvements for #25)
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector
* tests
- tests: all tests uses 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction
* add to CHANGES, bump to 0.4.2