{
"name": "browsertrix-crawler",
"version": "0.7.0-beta.3",
"main": "browsertrix-crawler",
"repository": "https://github.com/webrecorder/browsertrix-crawler",
"author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",
"license": "MIT",
"scripts": {
"lint": "eslint *.js util/*.js tests/*.test.js"
},
"dependencies": {
"abort-controller": "^3.0.0",
"browsertrix-behaviors": "^0.3.3",
"get-folder-size": "2",
"ioredis": "^4.27.1",
"js-yaml": "^4.1.0",
"minio": "7.0.26",
"node-fetch": "^2.6.1",
"puppeteer-cluster": "github:ikreymer/puppeteer-cluster#async-job-queue",
"puppeteer-core": "^16.1.1",
"request": "^2.88.2",
"sitemapper": "^3.1.2",
"uuid": "8.3.2",
"warcio": "^1.5.0",
"ws": "^7.4.4",
"yargs": "^16.0.3"
},
"devDependencies": {
"eslint": "^7.20.0",
"eslint-plugin-react": "^7.22.0",
"jest": "^26.6.3",
"md5": "^2.3.0"
}
}