Mirror of https://github.com/webrecorder/browsertrix-crawler.git, synced 2025-10-19 06:23:16 +00:00

* Migrate from Puppeteer to Playwright!
  - use playwright persistent browser context to support profiles
  - move on-new-page setup actions to worker
  - fix screencaster, init only one per page object, associate with worker-id
  - fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
  - port additional chromium setup options
  - create / detach cdp per page for each new page, screencaster just uses existing cdp
  - fix evaluateWithCLI to call CDP command directly
  - run workers directly during WorkerPool, await not necessary

* State / Worker Refactor (#252)

* refactoring state:
  - use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
  - remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
  - switch to sorted set for crawl queue, set depth + extraHops as score (fixes #150); see the queue sketch below
  - override console.error to avoid logging ioredis errors (fixes #244)
  - add MAX_DEPTH as const for extraHops
  - fix immediate exit on second interrupt

* worker/state refactor:
  - remove job object from puppeteer-cluster
  - rename shift() -> nextFromQueue()
  - condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
  - screencaster: don't screencast about:blank pages

* more worker queue refactor:
  - remove p-queue
  - initialize PageWorkers, each running in its own loop to process pages until there are no pending and no queued pages
  - add setupPage(), teardownPage() to crawler, called from worker
  - await runWorkers() promise which runs all workers until completion
  - remove p-queue, node-fetch; update README (no longer using any puppeteer-cluster base code)
  - bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
  - fix profile creation
  - browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
  - crawler: various fixes, including for html check
  - logging: additional logging for screencaster, new window, etc...
  - remove unused packages

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
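The sorted-set crawl queue referenced in the state refactor can be pictured roughly as follows. This is a minimal sketch using ioredis (already a dependency), not the crawler's actual RedisCrawlState: the key name crawl:queue, the queueUrl / nextFromQueue helper names, and the exact way depth and extraHops are folded into the score are illustrative assumptions.

// sketch-queue.js - minimal sorted-set crawl queue sketch (ESM, ioredis 4.x)
import Redis from "ioredis";

const redis = new Redis("redis://localhost:6379/0"); // local redis, as in the default setup
const QUEUE_KEY = "crawl:queue";                     // hypothetical key name

// Queue a URL with a score derived from depth + extraHops, so shallower
// pages are popped before deeper ones (the real crawler's score encoding
// may differ).
async function queueUrl(url, depth = 0, extraHops = 0) {
  const score = depth + extraHops;
  await redis.zadd(QUEUE_KEY, score, JSON.stringify({ url, depth, extraHops }));
}

// Pop the lowest-score entry, roughly what nextFromQueue() does.
async function nextFromQueue() {
  const [member] = await redis.zpopmin(QUEUE_KEY);
  return member ? JSON.parse(member) : null;
}

// Example usage: the depth-0 page is returned before the depth-3 page.
await queueUrl("https://example.com/", 0);
await queueUrl("https://example.com/deep/page", 3);
console.log(await nextFromQueue());
await redis.quit();

Using a sorted set rather than a plain list lets workers always pull the shallowest pending page, which is what makes the depth / extraHops limits enforceable without draining and re-filling a queue.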
37 lines
990 B
JSON
{
  "name": "browsertrix-crawler",
  "version": "0.9.0-beta.1",
  "main": "browsertrix-crawler",
  "type": "module",
  "repository": "https://github.com/webrecorder/browsertrix-crawler",
  "author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",
  "license": "AGPL-3.0-or-later",
  "scripts": {
    "lint": "eslint *.js util/*.js tests/*.test.js",
    "test": "yarn node --experimental-vm-modules $(yarn bin jest --bail 1)"
  },
  "dependencies": {
    "@novnc/novnc": "^1.4.0",
    "browsertrix-behaviors": "^0.4.2",
    "get-folder-size": "^4.0.0",
    "ioredis": "^4.27.1",
    "js-yaml": "^4.1.0",
    "minio": "7.0.26",
    "playwright-core": "^1.31.2",
    "sitemapper": "^3.1.2",
    "uuid": "8.3.2",
    "warcio": "^1.6.0",
    "ws": "^7.4.4",
    "yargs": "^16.0.3"
  },
  "devDependencies": {
    "eslint": "^7.20.0",
    "eslint-plugin-react": "^7.22.0",
    "jest": "^29.2.1",
    "md5": "^2.3.0"
  },
  "jest": {
    "transform": {},
    "testTimeout": 90000
  }
}