browsertrix-crawler/package.json

{
"name": "browsertrix-crawler",
"version": "0.7.1",
"main": "browsertrix-crawler",
"repository": "https://github.com/webrecorder/browsertrix-crawler",
"author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",
"license": "MIT",
"scripts": {
"lint": "eslint *.js util/*.js tests/*.test.js"
},
"dependencies": {
"abort-controller": "^3.0.0",
"browsertrix-behaviors": "^0.3.4",
"get-folder-size": "2",
"ioredis": "^4.27.1",
"js-yaml": "^4.1.0",
"minio": "7.0.26",
"node-fetch": "^2.6.1",
"puppeteer-cluster": "github:ikreymer/puppeteer-cluster#async-job-queue",
"puppeteer-core": "^17.1.2",
"request": "^2.88.2",
"sitemapper": "^3.1.2",
"uuid": "8.3.2",
"warcio": "1.5.1",
"ws": "^7.4.4",
"yargs": "^16.0.3"
},
"devDependencies": {
"eslint": "^7.20.0",
"eslint-plugin-react": "^7.22.0",
"jest": "^26.6.3",
"md5": "^2.3.0"
  }
}