Commit graph

109 commits

Author SHA1 Message Date
Ilya Kreymer
ef7d5e50d8
Per-Seed Scoping Rules + Crawl Depth (#63)
* scoped seeds:
- support per-seed scoping (include + exclude), allowHash, depth, and sitemap options
- support maxDepth per seed #16
- combine --url, --seed and --urlFile/--seedFile urls into a unified seed list

arg parsing:
- simplify seed file options into --seedFile/--urlFile, move option in help display
- rename --maxDepth -> --depth, supported globally and per seed
- ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation)
- update to latest js-yaml
- rename --yamlConfig -> --config
- config: support reading config from stdin if --config set to 'stdin'

* scope: fix typo in 'prefix' scope

* update browsertrix-behaviors to 0.2.2

* tests: add test for passing config via stdin, also adding --excludes via cmdline

* update README:
- latest cli, add docs on config via stdin
- rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position
- info on scoped seeds
- list current scope types
2021-06-26 13:11:29 -07:00
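The unified seed list described in ef7d5e50d8 can be pictured with a small Python sketch. It is illustrative only and is not the crawler's code: `build_seed_list`, its parameters, and the idea that a per-seed entry is a dict with a `url` key are assumptions made for this example. It shows seeds from several sources being merged, with per-seed options taking precedence over global ones.

```python
from typing import Iterable, Union

SeedEntry = Union[str, dict]


def build_seed_list(url, seeds: Iterable[SeedEntry],
                    seed_file_lines: Iterable[str],
                    global_opts: dict) -> list:
    """Merge seeds from all sources into one list of option dicts.

    Hypothetical helper: names and structure are assumptions, not the
    crawler's actual API.
    """
    raw = []
    if url:
        raw.append(url)
    raw.extend(seeds)
    # One URL per line in the seed file; blank lines are skipped.
    raw.extend(line.strip() for line in seed_file_lines if line.strip())

    result = []
    for entry in raw:
        if isinstance(entry, str):
            entry = {"url": entry}
        # Per-seed keys override the global options.
        result.append({**global_opts, **entry})
    return result


unified = build_seed_list(
    "https://a.example/",
    [{"url": "https://b.example/", "depth": 0}],
    ["https://c.example/\n", "\n"],
    {"depth": 2},
)
```

Here the second seed keeps its own `depth` of 0, while the other two inherit the global value.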
Ilya Kreymer
3ebe511b32
Arg Parsing Refactor + YAML Config Support (take 2!) (#59)
* Create an argument parser class

* move constants, arg parser to separate files in utils/*

* ensure yaml config overridden by command-line args

* yaml loading work:
- simplify yaml config by using yargs.config option
- move all option parsing to argParser, simply expose parseArgs
- export constants directly
- add lint to util/* files

* support inline 'seeds' in cmdline and yaml config

tests:
- add test for crawl config, ensuring seeds crawled + wacz created
- add test to ensure cmdline overrides yaml config

* scope fix: empty scope implies only fixed list, use '.*' for any scope

* lint fix

* update readme with yaml config info

* allow 'url' and 'seeds' if both provided

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: emmadickson <emma.dickson@artsymail.com>
2021-06-23 19:45:40 -07:00
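The scope fix in 3ebe511b32 ("empty scope implies only fixed list, use '.*' for any scope") can be sketched as follows. This is a minimal illustration, not the crawler's implementation: `is_in_scope` and its parameters are hypothetical, and scope rules are assumed to be a list of regular expressions.

```python
import re
from typing import Iterable


def is_in_scope(url: str, seeds: Iterable[str],
                scope_patterns: Iterable[str]) -> bool:
    """Return True if url is a seed or matches any scope regex.

    With no scope patterns, only the fixed seed list is accepted.
    """
    if url in set(seeds):
        return True
    return any(re.search(pattern, url) for pattern in scope_patterns)


seeds = ["https://example.com/"]
only_seeds = is_in_scope("https://example.com/about", seeds, [])
any_url = is_in_scope("https://example.com/about", seeds, [".*"])
```

With an empty pattern list the non-seed URL is rejected; adding `.*` accepts it.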
Ilya Kreymer
e7d3767efb
Add scopeType options + option to crawl hashtags + simplify defaultDriver.js (#51)
* support hashtag for page-scoped crawls:
- allow hashtags for current page, automatically set scope to current w/ different hashtags
- also allow hashtags for URLs specified via urlFile
- driver: simplify driver, move default driver function to loadPage()
- bump version to 0.4.0-beta.0

* add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well

* graceful shutdown: ensure redis and pywb processes shut down on exit (for use with singularity, outside of docker)

* replace spaMode with more generic --scopeType, a shortcut to setting the scope via regex.
scopeType options include:
prefix - scope is prefix of current page (default)
page - scope is current page + hashtags (spa mode)
domain - scope is domain/origin of current page
any - scope is any url (default for urlFile)

- bump version to 0.4.0-beta.1
2021-05-21 15:37:02 -07:00
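Commit e7d3767efb describes --scopeType as a shortcut for setting the scope via regex. The Python sketch below shows one way the four listed types could map to patterns. It is an assumption-laden illustration, not the crawler's code: `scope_regex` is a hypothetical name, "prefix" is taken to mean everything up to the last `/` of the seed path, and "domain" is taken to mean the seed's scheme and host.

```python
import re
from urllib.parse import urlsplit


def scope_regex(scope_type: str, seed_url: str) -> "re.Pattern[str]":
    """Build an illustrative scope regex for a seed URL (hypothetical helper)."""
    parts = urlsplit(seed_url)
    origin = f"{parts.scheme}://{parts.netloc}"
    path = parts.path or "/"

    if scope_type == "prefix":
        # Keep the path up to and including its last "/".
        prefix = origin + path[: path.rfind("/") + 1]
        return re.compile("^" + re.escape(prefix))
    if scope_type == "page":
        # The page itself, optionally followed by a hashtag.
        page = origin + path
        if parts.query:
            page += "?" + parts.query
        return re.compile("^" + re.escape(page) + "(#.*)?$")
    if scope_type == "domain":
        return re.compile("^" + re.escape(origin) + "/")
    if scope_type == "any":
        return re.compile(".*")
    raise ValueError(f"unknown scope type: {scope_type}")


prefix_scope = scope_regex("prefix", "https://example.com/docs/index.html")
page_scope = scope_regex("page", "https://example.com/app")
```

An unknown scope type raises `ValueError` rather than silently falling back to a default.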
Emma Dickson
63376ab6ac
Add --urlFile param to specify text file with a list of URLs to crawl (#38)
* Resolves #12

* Make --url param optional. Only one of --url or --urlFile should be specified.

* Add ignoreScope option to queueUrls() to support adding specific URLs

* add tests for urlFile

* bump version to 0.3.2

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-05-12 22:57:06 -07:00
Ilya Kreymer
183f8edf10
Wait for Pending Requests to Finish (#47)
* pending request wait:
- instead of waiting for 5s, check redis key 'pywb:{coll}:pending' to see whether any requests are still pending
- keep checking the key until the pending count reaches 0
- requires latest pywb 2.6.0+
- should fix #44

* fix test to no longer look for waiting for 5s message

* lint settings and fixes: allow constant in loops, add lint command to script

* chrome: bump default image to chrome:90 image
2021-04-30 15:31:14 -04:00
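The polling approach from 183f8edf10 can be sketched in Python. This is not the crawler's code: `wait_for_pending` is a hypothetical helper, the `get_pending` callable stands in for reading the 'pywb:{coll}:pending' key from Redis, and the timeout is an added safety assumption (the commit only says to keep checking until the count is 0).

```python
import time
from typing import Callable


def wait_for_pending(get_pending: Callable[[], int],
                     interval: float = 0.5,
                     timeout: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until no requests are pending.

    Returns True once the count reaches 0, or False if the (assumed)
    timeout elapses first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_pending() <= 0:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Simulated counter readings instead of a live Redis connection.
readings = iter([3, 1, 0])
finished = wait_for_pending(lambda: next(readings), sleep=lambda _s: None)
```

Passing the reader as a callable keeps the sketch independent of any particular Redis client.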
Ilya Kreymer
9293375790
combine WARC/async fixes: (#49)
* combine WARC/async fixes:
- use streams for combine WARCs to avoid any issues with sync apis
- use async apis for writing/reading pages as well

* use async stat()

* fix tests, also sets extension to .warc.gz, addresses #41
2021-04-29 14:34:56 -07:00
Emma Dickson
c9f8fe051c
add collection name validation (#37)
* add collection name validation

* linter fix

* add tests and optimize

* linter fix

* move to validateargs

* properly reference collection

* Update regex and error message

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-04-07 20:24:01 -04:00
Emma Dickson
24e2c4ddf8
Create --combineWARC flag that combines generated WARCs into a single WARC up to rollover size (#33)
* generates combined WARCs in collection root directory with suffix `_0.warc`, `_1.warc`, etc.
* each combined WARC is limited by the size in `--rolloverSize`; if that size is exceeded, a new WARC is created, otherwise data is appended to the previous WARC.
* add test for --combineWARC flag
* add improved lint rules

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-03-31 10:41:27 -07:00
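The rollover behavior described in 24e2c4ddf8 can be sketched as a planning step in Python. This is an illustration under stated assumptions, not the crawler's implementation: `plan_combined_warcs` is a hypothetical name, the collection name is assumed to be the filename prefix before the `_0.warc` suffix, and the size check is assumed to happen before each file is appended.

```python
from typing import Dict, Iterable, List, Tuple


def plan_combined_warcs(files: Iterable[Tuple[str, int]],
                        rollover_size: int,
                        collection: str) -> Dict[str, List[str]]:
    """Assign source WARCs to combined outputs, rolling over by size.

    files is a sequence of (name, size_in_bytes) pairs.
    """
    plan: Dict[str, List[str]] = {}
    index = 0
    current_size = 0
    for name, size in files:
        # Start a new combined WARC if this file would push the current
        # one past the rollover size (an empty output always accepts one).
        if current_size > 0 and current_size + size > rollover_size:
            index += 1
            current_size = 0
        plan.setdefault(f"{collection}_{index}.warc", []).append(name)
        current_size += size
    return plan


plan = plan_combined_warcs(
    [("a.warc", 400), ("b.warc", 500), ("c.warc", 300), ("d.warc", 1200)],
    rollover_size=1000,
    collection="example",
)
```

A single source file larger than the rollover size still gets its own combined output in this sketch, since it cannot be split.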
Emma Dickson
fb0f1d8db9
tests text extraction (#30)
* new tests

* add jest to eslint, lint fixes

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-03-01 16:00:23 -08:00