browsertrix-crawler/package.json
Ilya Kreymer 82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- workers directly during WorkerPool - await not necessary
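
The device emulation fix above (load the descriptor once at startup, and accept '-' in place of spaces) can be sketched roughly as follows. This is a minimal illustration, not the crawler's code: `DEVICES`, `normalizeDeviceName()` and `loadDevice()` are hypothetical names, and the table entries are illustrative values standing in for the browser library's built-in device descriptors.

```javascript
// Hypothetical device table with illustrative values; a real crawler would
// look names up in the browser library's own device descriptors instead.
const DEVICES = {
  "iPhone 13": { viewport: { width: 390, height: 844 }, isMobile: true },
  "Pixel 5": { viewport: { width: 393, height: 851 }, isMobile: true },
};

// Turn a command-line friendly name such as "Pixel-5" into "Pixel 5",
// so the name does not need quoting in a shell.
function normalizeDeviceName(name) {
  return name.replace(/-/g, " ");
}

// Resolve the descriptor once at startup and fail early on unknown names,
// rather than discovering the problem on the first page load.
function loadDevice(name) {
  const key = normalizeDeviceName(name);
  const device = DEVICES[key];
  if (!device) {
    throw new Error(`Unknown device: ${key}`);
  }
  return device;
}
```

Under these assumptions, `loadDevice("iPhone-13")` and `loadDevice("iPhone 13")` resolve to the same entry.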

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt
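
To illustrate the sorted-set queue described above, here is a minimal in-memory sketch. It assumes one plausible reading of "depth + extraHops as score": extra hops are weighted by `MAX_DEPTH` so that any extra-hop page sorts after every regular page. The weighting, the value of `MAX_DEPTH`, and the `SortedQueue` class are assumptions for illustration; a real implementation would store the members in a Redis sorted set.

```javascript
// Assumed value: only needs to exceed any realistic link depth.
const MAX_DEPTH = 1000000;

// Assumed scoring rule: depth orders regular pages, and each extra hop
// pushes a page behind everything reachable without extra hops.
function getScore({ depth = 0, extraHops = 0 }) {
  return depth + extraHops * MAX_DEPTH;
}

// Minimal in-memory stand-in for a sorted set: lowest score is served first.
class SortedQueue {
  constructor() {
    this.items = [];
  }

  add(url, data) {
    this.items.push({ url, score: getScore(data) });
    // Array.prototype.sort is stable, so equal scores keep insertion order.
    this.items.sort((a, b) => a.score - b.score);
  }

  // Returns the lowest-scored URL, or null when the queue is empty.
  nextFromQueue() {
    const first = this.items.shift();
    return first ? first.url : null;
  }
}
```

With this scoring, shallow pages are crawled before deeper ones, which is the ordering a plain FIFO list cannot guarantee once several workers add URLs concurrently.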

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers, each of which runs in its own loop to process pages until there are no pending and no queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1
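
The worker loop described above might look roughly like the sketch below. It is not the crawler's implementation: `CrawlState` is a hypothetical in-memory stand-in for the Redis-backed state, and `runWorker()`, `sleep()` and the `crawlPage` callback are illustrative names. Only `nextFromQueue()` and `runWorkers()` are taken from the commit message.

```javascript
// Hypothetical in-memory crawl state, standing in for the Redis-backed one.
class CrawlState {
  constructor(urls) {
    this.queue = [...urls];
    this.pending = 0; // pages currently being crawled
    this.done = [];
    this.failed = [];
  }

  nextFromQueue() {
    const url = this.queue.shift();
    if (url !== undefined) this.pending++;
    return url;
  }

  markFinished(url) {
    this.pending--;
    this.done.push(url);
  }

  markFailed(url) {
    this.pending--;
    this.failed.push(url);
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// One worker: keep pulling pages until nothing is queued and nothing is pending.
async function runWorker(id, state, crawlPage) {
  while (true) {
    const url = state.nextFromQueue();
    if (url !== undefined) {
      try {
        await crawlPage(id, url, state);
        state.markFinished(url);
      } catch (e) {
        state.markFailed(url);
      }
    } else if (state.pending > 0) {
      // Another worker is mid-page and may still queue new URLs, so wait.
      await sleep(10);
    } else {
      break;
    }
  }
}

// Start all workers and resolve once every one of them has exited its loop.
async function runWorkers(numWorkers, state, crawlPage) {
  const workers = [];
  for (let id = 0; id < numWorkers; id++) {
    workers.push(runWorker(id, state, crawlPage));
  }
  await Promise.all(workers);
  return state.done;
}
```

The `pending` check is what lets an idle worker stay alive while another worker is still on a page that may add links to the queue. Without it, workers could exit early and leave a single worker to finish the crawl.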

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00

{
  "name": "browsertrix-crawler",
  "version": "0.9.0-beta.1",
  "main": "browsertrix-crawler",
  "type": "module",
  "repository": "https://github.com/webrecorder/browsertrix-crawler",
  "author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",
  "license": "AGPL-3.0-or-later",
  "scripts": {
    "lint": "eslint *.js util/*.js tests/*.test.js",
    "test": "yarn node --experimental-vm-modules $(yarn bin jest --bail 1)"
  },
  "dependencies": {
    "@novnc/novnc": "^1.4.0",
    "browsertrix-behaviors": "^0.4.2",
    "get-folder-size": "^4.0.0",
    "ioredis": "^4.27.1",
    "js-yaml": "^4.1.0",
    "minio": "7.0.26",
    "playwright-core": "^1.31.2",
    "sitemapper": "^3.1.2",
    "uuid": "8.3.2",
    "warcio": "^1.6.0",
    "ws": "^7.4.4",
    "yargs": "^16.0.3"
  },
  "devDependencies": {
    "eslint": "^7.20.0",
    "eslint-plugin-react": "^7.22.0",
    "jest": "^29.2.1",
    "md5": "^2.3.0"
  },
  "jest": {
    "transform": {},
    "testTimeout": 90000
  }
}