* scoped seeds:
- support per-seed scoping (include + exclude), allowHash, depth, and sitemap options
- support maxDepth per seed #16
- combine --url, --seed and --urlFile/--seedFile urls into a unified seed list
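The unified seed list described above could be built roughly as in the sketch below. The function name `buildSeedList`, the input field names, and the default values are assumptions made for illustration; they are not taken from the crawler's code.

```javascript
// Hypothetical sketch: merge seeds from --url, --seed and a seed file into one list.
// Names and defaults are illustrative assumptions, not the crawler's real API.
function buildSeedList({ url, seeds = [], seedFileLines = [], depth = -1 }) {
  const raw = [];
  if (url) raw.push(url);
  raw.push(...seeds, ...seedFileLines);

  return raw
    // plain strings become seed objects; whitespace around URLs is trimmed
    .map((seed) => (typeof seed === "string" ? { url: seed.trim() } : seed))
    // blank seed file lines (and objects without a url) are dropped
    .filter((seed) => seed.url)
    .map((seed) => ({
      url: seed.url,
      include: seed.include || [],
      exclude: seed.exclude || [],
      allowHash: Boolean(seed.allowHash),
      // a per-seed depth wins over the global value
      depth: seed.depth !== undefined ? seed.depth : depth,
      sitemap: seed.sitemap || false,
    }));
}

const seedList = buildSeedList({
  url: "https://example.com/",
  seeds: [{ url: "https://example.org/docs/", depth: 1 }],
  seedFileLines: ["https://example.net/", ""],
  depth: 3,
});
console.log(seedList.map((s) => `${s.url} depth=${s.depth}`));
```

Here the second seed keeps its own depth of 1, while the other two inherit the global depth of 3.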
* arg parsing:
- simplify seed file options into --seedFile/--urlFile, move option in help display
- rename --maxDepth -> --depth, supported globally and per seed
- ensure custom parsed params from argParser are passed back correctly (behaviors, logging, device emulation)
- update to latest js-yaml
- rename --yamlConfig -> --config
- config: support reading config from stdin if --config set to 'stdin'
* scope: fix typo in 'prefix' scope
* update browsertrix-behaviors to 0.2.2
* tests: add test for passing config via stdin, also adding --excludes via cmdline
* update README:
- latest cli, add docs on config via stdin
- rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position
- info on scoped seeds
- list current scope types
* Create an argument parser class
* move constants, arg parser to separate files in utils/*
* ensure yaml config is overridden by command-line args
* yaml loading work:
- simplify yaml config by using yargs.config option
- move all option parsing to argParser, simply expose parseArgs
- export constants directly
- add lint to util/* files
* support inline 'seeds' in cmdline and yaml config
* tests:
- add test for crawl config, ensuring seeds crawled + wacz created
- add test to ensure cmdline overrides yaml config
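The precedence that the override test checks can be pictured as a layered merge. This is only a minimal sketch of the intended ordering (defaults, then YAML config, then command line); `resolveOptions` is a hypothetical helper, the `collection` key is an illustrative option name, and `cliArgs` is assumed to hold only the flags the user actually passed.

```javascript
// Hypothetical sketch of option precedence: defaults < YAML config < command line.
function resolveOptions(defaults, yamlConfig, cliArgs) {
  // Later spreads win, so explicitly passed command-line values override the config file.
  return { ...defaults, ...yamlConfig, ...cliArgs };
}

const options = resolveOptions(
  { depth: -1, scopeType: "prefix" },
  { depth: 2, collection: "from-yaml" },
  { collection: "from-cmdline" }
);
console.log(options);
```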
* scope fix: empty scope implies only fixed list, use '.*' for any scope
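A small sketch of the rule above, assuming scope is represented as a list of include regexes: with an empty list only the fixed seed URLs pass, and `.*` admits everything. The function `isInScope` and its parameters are hypothetical.

```javascript
// Hypothetical scope check: an empty include list admits only the fixed seed URLs.
function isInScope(url, seedUrls, includeRegexes) {
  if (seedUrls.includes(url)) return true;
  // Array.prototype.some() returns false for an empty list, so nothing else is admitted
  return includeRegexes.some((rx) => rx.test(url));
}

const fixedSeeds = ["https://example.com/a", "https://example.com/b"];
console.log(isInScope("https://example.com/a", fixedSeeds, []));
console.log(isInScope("https://example.com/c", fixedSeeds, []));
console.log(isInScope("https://example.com/c", fixedSeeds, [/.*/]));
```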
* lint fix
* update readme with yaml config info
* allow both 'url' and 'seeds' to be provided together
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: emmadickson <emma.dickson@artsymail.com>
* support hashtag for page-scoped crawls:
- allow hashtags for current page, automatically set scope to current w/ different hashtags
- also allow hashtags for URLs specified via urlFile
- driver: simplify driver, move default driver function to loadPage()
- bump version to 0.4.0-beta.0
* add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well
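The effect of `--allowHash` can be illustrated with a URL normalization step that drops the fragment unless the option is set. This is a sketch under that assumption; `normalizeUrl` is a hypothetical helper and relies only on the standard WHATWG `URL` class.

```javascript
// Hypothetical URL normalization: drop the fragment unless allowHash is set.
function normalizeUrl(rawUrl, allowHash = false) {
  const parsed = new URL(rawUrl);
  // assigning an empty string removes the fragment, including the '#'
  if (!allowHash) parsed.hash = "";
  return parsed.href;
}

console.log(normalizeUrl("https://example.com/app#/page/2"));
console.log(normalizeUrl("https://example.com/app#/page/2", true));
```

Without the option, both `#/page/1` and `#/page/2` collapse to the same URL and are queued once; with it, each hashtag is treated as a distinct page.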
* graceful shutdown: ensure redis and pywb processes shut down on exit (for use with singularity, outside of docker)
* replace spaMode with a more generic --scopeType, a shortcut for setting the scope via regex.
- scopeType options include:
    prefix - scope is prefix of current page (default)
    page - scope is current page + hashtags (spa mode)
    domain - scope is domain/origin of current page
    any - scope is any url (default for urlFile)
- bump version to 0.4.0-beta.1
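The scopeType shortcuts listed above might translate to regexes roughly as follows. This is a sketch, not the crawler's implementation: `scopeRegexFor` and `escapeRegex` are hypothetical helpers, and the exact patterns (for example, treating "prefix" as everything up to the last `/` of the path) are assumptions.

```javascript
// Escape regex metacharacters so a URL can be embedded literally in a pattern.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical mapping from a scopeType to an include regex for one seed URL.
function scopeRegexFor(scopeType, seedUrl) {
  const parsed = new URL(seedUrl);
  switch (scopeType) {
    case "prefix": {
      // everything under the seed's directory (path up to the last "/")
      const dir = parsed.pathname.slice(0, parsed.pathname.lastIndexOf("/") + 1);
      return new RegExp("^" + escapeRegex(parsed.origin + dir));
    }
    case "page": {
      // the page itself, with or without a hashtag
      parsed.hash = "";
      return new RegExp("^" + escapeRegex(parsed.href) + "(#.*)?$");
    }
    case "domain":
      // any URL on the same origin
      return new RegExp("^" + escapeRegex(parsed.origin) + "/");
    case "any":
      return /.*/;
    default:
      throw new Error(`unknown scopeType: ${scopeType}`);
  }
}

const prefixScope = scopeRegexFor("prefix", "https://example.com/docs/index.html");
console.log(prefixScope.test("https://example.com/docs/page2"));
console.log(prefixScope.test("https://example.com/blog/"));
```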
* Resolves #12
* Make --url param optional. Only one of --url or --urlFile should be specified.
* Add ignoreScope option to queueUrls() to support adding specific URLs
* add tests for urlFile
* bump version to 0.3.2
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
* pending request wait:
- instead of waiting for 5s, check redis key 'pywb:{coll}:pending' to see whether any requests are still pending
- keep checking key until pending requests are at 0
- requires latest pywb 2.6.0+
- should fix #44
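The pending-request wait described above amounts to a polling loop on the redis key. The sketch below assumes the key holds an integer counter and that a missing key means zero; `waitForPending`, its options, and the fake client are hypothetical and stand in for a real redis client exposing an async `get(key)`.

```javascript
// Hypothetical polling loop: wait until the pending-request counter reaches zero.
// `redis` is any client exposing an async get(key); `sleep` is injectable for testing.
async function waitForPending(redis, coll, { intervalMs = 500, sleep = defaultSleep } = {}) {
  const key = `pywb:${coll}:pending`;
  let polls = 0;
  for (;;) {
    polls += 1;
    const value = await redis.get(key);
    // a missing key is treated as zero pending requests (an assumption)
    const pending = Number.parseInt(value || "0", 10);
    if (pending <= 0) return polls;
    await sleep(intervalMs);
  }
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Fake client for illustration: the counter drops from 2 to 0 over three reads.
const fakeRedis = {
  values: ["2", "1", "0"],
  async get() {
    return this.values.shift();
  },
};
const pollsPromise = waitForPending(fakeRedis, "test", { sleep: async () => {} });
pollsPromise.then((polls) => console.log(`no pending requests after ${polls} polls`));
```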
* fix test to no longer look for the 'waiting for 5s' message
* lint settings and fixes: allow constant in loops, add lint command to script
* chrome: bump default image to chrome:90
* combine WARC/async fixes:
- use streams when combining WARCs to avoid any issues with sync apis
- use async apis for writing/reading pages as well
* use async stat()
* fix tests, also sets extension to .warc.gz, addresses #41
* generates combined WARCs in collection root directory with suffix `_0.warc`, `_1.warc`, etc.
* each combined WARC is limited to the size given in `--rolloverSize`; if the limit is exceeded, a new WARC is created, otherwise data is appended to the previous WARC.
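The rollover rule above can be sketched as a planner that assigns each source WARC to a combined file. `planCombinedWarcs` is a hypothetical helper, and checking the limit before appending (rather than after) is an assumption about where the comparison happens.

```javascript
// Hypothetical planner: assign source WARCs to combined files, rolling over by size.
function planCombinedWarcs(collection, sourceSizes, rolloverSize) {
  const plan = [];
  let index = 0;
  let currentSize = 0;
  for (const size of sourceSizes) {
    // start a new combined WARC if appending would exceed the limit
    if (currentSize > 0 && currentSize + size > rolloverSize) {
      index += 1;
      currentSize = 0;
    }
    currentSize += size;
    plan.push(`${collection}_${index}.warc`);
  }
  return plan;
}

const warcPlan = planCombinedWarcs("test", [400, 500, 300, 900], 1000);
console.log(warcPlan);
```

With a limit of 1000, the first two sources (400 + 500) share `test_0.warc`, the third rolls over to `test_1.warc`, and the fourth to `test_2.warc`.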
* add test for --combineWARC flag
* add improved lint rules
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>