{
"name": "browsertrix-crawler",
"version": "0.5.0-beta.3",
"main": "browsertrix-crawler",
"repository": "https://github.com/webrecorder/browsertrix-crawler",
"author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",
"license": "MIT",
"scripts": {
"lint": "eslint *.js util/*.js tests/*.test.js"
},
"dependencies": {
"abort-controller": "^3.0.0",
"browsertrix-behaviors": "^0.2.4",
"ioredis": "^4.27.1",
"js-yaml": "^4.1.0",
"minio": "^7.0.26",
"node-fetch": "^2.6.1",
"puppeteer-cluster": "github:ikreymer/puppeteer-cluster#async-job-queue",
"puppeteer-core": "^8.0.0",
"sitemapper": "^3.1.2",
"uuid": "8.3.2",
"ws": "^7.4.4",
"yargs": "^16.0.3"
},
"devDependencies": {
"eslint": "^7.20.0",
"eslint-plugin-react": "^7.22.0",
"jest": "^26.6.3",
"md5": "^2.3.0",
"warcio": "^1.5.0"
}
}