Commit graph

8 commits

Author SHA1 Message Date
Ilya Kreymer
ef7d5e50d8
Per-Seed Scoping Rules + Crawl Depth (#63)
* scoped seeds:
- support per-seed scoping (include + exclude), allowHash, depth, and sitemap options
- support maxDepth per seed #16
- combine --url, --seed and --urlFile/--seedFile urls into a unified seed list

arg parsing:
- simplify seed file options into --seedFile/--urlFile, move option in help display
- rename --maxDepth -> --depth, supported globally and per seed
- ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation)
- update to latest js-yaml
- rename --yamlConfig -> --config
- config: support reading config from stdin if --config set to 'stdin'

* scope: fix typo in 'prefix' scope

* update browsertrix-behaviors to 0.2.2

* tests: add test for passing config via stdin, also adding --excludes via cmdline

* update README:
- latest cli, add docs on config via stdin
- rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position
- info on scoped seeds
- list current scope types
2021-06-26 13:11:29 -07:00
Ilya Kreymer
e7d3767efb
Add scopeType options + option to crawl hashtags + simplify defaultDriver.js (#51)
* support hashtag for page-scoped crawls:
- allow hashtags for current page, automatically set scope to current w/ different hashtags
- also allow hashtags for URLs specified via urlFile
- driver: simplify driver, move default driver function to loadPage()
- bump version to 0.4.0-beta.0

* add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well

* graceful shutdown: ensure redis and pywb processes shutdown on exit (for use with singularity, outside of docker)

* replace spaMode with more generic --scopeType, a shortcut to setting the scope via regex.
scopeType options include:
prefix - scope is prefix of current page (default)
page - scope is current page + hashtags (spa mode)
domain - scope is domain/origin of current page
any - scope is any url (default for urlFile)

- bump version to 0.4.0-beta.1
2021-05-21 15:37:02 -07:00
Ilya Kreymer
bc7f1badf3
factor out behaviors to browsertrix-behaviors: (#32)
- inject built 'behaviors.js' from browsertrix-behaviors, init with options and run
- remove bgbehaviors
- move textextract to root for now
- add requirements.txt for python dependencies
- remove obsolete --scroll option, to part of the behaviors system

logging:
- configure logging options via --logging param, can include 'stats' (default), 'pywb', 'behaviors', and 'behaviors-debug'
- inject custom logging function for behaviors to call if either behaviors or behaviors-debug is set
- 'behaviors-debug' prints all debug messages from behaviors, while regular 'behaviors' prints main behavior messages (useful for verification)

dockerfile: add 'rebuild' arg to faciliate rebuilding image from specific step

bump to 0.3.0-beta.0
2021-03-13 19:48:31 -05:00
Ilya Kreymer
4d6dcbc3d6 bump version, remove extraneous console.log 2021-02-16 20:00:33 -08:00
Ilya Kreymer
8c85ca2749 background behaviors refactor: (fixes #23)
- move auto-play, auto-fetch and auto-scroll behaviors to behaviors/global/*
- bgbehaviors manages these background behaviors
- command line --bgbehaviors option specifies which background behaviors to run (defaults to auto-fetch and auto-play)
2021-02-08 22:21:34 -08:00
Ilya Kreymer
5386d897f7 add autofetcher script to be injected by defaultDriver to capture srcsets + URLs
in dynamically added stylesheets
should fix openzim/zimit#63
bump version to 0.1.4
2020-12-13 21:10:50 +00:00
Ilya Kreymer
fe406b5f74 browser config settings:
- add support for --userAgent to override user agent
- add support for --mobileDevice to use puppeteer device emulation presets
- add support for --userAgentSuffix to append to default user agent (including device userAgent)
bump to 0.1.2
2020-11-14 19:32:31 +00:00
Ilya Kreymer
91b8994a08 refactor crawler and default driver:
- add extensible defaultDriver, wrap crawling functionality in Crawler class
- support headless/non-headless, custom driver
- support custom collection name for pywb, generate-cdx option
- autoplay: add slightly delay for splash loading
2020-11-01 19:53:47 -08:00