{
  "name": "browsertrix-crawler",
  "version": "1.2.0-beta.0",
  "main": "browsertrix-crawler",
  "type": "module",
  "repository": "https://github.com/webrecorder/browsertrix-crawler",
  "author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",
  "license": "AGPL-3.0-or-later",
  "scripts": {
    "tsc": "tsc",
    "format": "prettier src/ --check",
    "format:fix": "prettier src/ --write",
    "lint": "eslint src/",
    "lint:fix": "yarn format:fix && eslint src/ --fix",
    "test": "yarn node --experimental-vm-modules $(yarn bin jest --bail 1)",
    "prepare": "husky install"
  },
  "dependencies": {
"@novnc/novnc": "^1.4.0",
"@types/sax": "^1.2.7",
|
2023-11-07 21:38:50 -08:00
|
|
|
"@webrecorder/wabac": "^2.16.12",
|
2024-04-18 17:16:57 -07:00
|
|
|
"browsertrix-behaviors": "^0.6.0",
|
2023-10-20 16:29:07 -07:00
|
|
|
"crc": "^4.3.2",
|
2022-10-24 15:30:10 +02:00
|
|
|
"get-folder-size": "^4.0.0",
|
2023-09-07 13:03:22 -07:00
|
|
|
"husky": "^8.0.3",
|
2023-11-09 11:27:11 -08:00
|
|
|
"ioredis": "^5.3.2",
|
2024-03-22 17:32:42 -07:00
|
|
|
"js-levenshtein": "^1.1.6",
"js-yaml": "^4.1.0",
|
2023-11-09 11:27:11 -08:00
|
|
|
"minio": "^7.1.3",
|
2023-11-07 21:38:50 -08:00
|
|
|
"p-queue": "^7.3.4",
|
2024-03-22 17:32:42 -07:00
|
|
|
"pixelmatch": "^5.3.0",
|
|
|
|
"pngjs": "^7.0.0",
|
2024-03-27 09:26:51 -07:00
|
|
|
"puppeteer-core": "^22.6.1",
"sax": "^1.3.0",
|
2023-11-16 16:18:00 -05:00
|
|
|
"sharp": "^0.32.6",
|
2023-11-09 11:27:11 -08:00
|
|
|
"tsc": "^2.0.4",
|
2021-02-17 12:37:07 -05:00
|
|
|
"uuid": "8.3.2",
|
2023-11-09 11:27:11 -08:00
|
|
|
"warcio": "^2.2.1",
|
2021-06-07 17:43:36 -07:00
|
|
|
"ws": "^7.4.4",
|
2023-08-13 15:08:36 -07:00
|
|
|
"yargs": "^17.7.2"
|
2021-02-17 12:37:07 -05:00
|
|
|
},
  "devDependencies": {
    "@types/js-levenshtein": "^1.1.3",
    "@types/js-yaml": "^4.0.8",
    "@types/node": "^20.8.7",
    "@types/pixelmatch": "^5.2.6",
    "@types/pngjs": "^6.0.4",
    "@types/uuid": "^9.0.6",
    "@types/ws": "^8.5.8",
    "@typescript-eslint/eslint-plugin": "^6.10.0",
    "@typescript-eslint/parser": "^6.10.0",
    "eslint": "^8.53.0",
    "eslint-config-prettier": "^9.0.0",
    "eslint-plugin-react": "^7.22.0",
    "jest": "^29.7.0",
    "md5": "^2.3.0",
    "prettier": "3.0.3",
    "typescript": "^5.2.2"
  },
  "jest": {
    "transform": {},
    "testTimeout": 90000
  }
}