Remove puppeteer-cluster + iframe filtering + health check refactor + logging improvements (0.9.0-beta.0) (#219)

* This commit removes puppeteer-cluster as a dependency in favor of
a simpler concurrency implementation, using p-queue to limit
concurrency to the number of available workers. As part of the
refactor, the custom window concurrency model in windowconcur.js
is removed and its logic implemented in the new Worker class's
initPage method.
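
A rough sketch of the concurrency idea described above, not the code from this commit. The commit uses p-queue; to keep the example dependency-free, a small hand-written limiter (the hypothetical `TinyQueue`, mirroring p-queue's `add`/`onIdle` shape) stands in for it. `crawlAll`, the URL list, and the 5 ms delay standing in for a page load are all illustrative assumptions.

```javascript
// Hypothetical stand-in for a p-queue-style limiter:
// runs at most `concurrency` async tasks at the same time.
class TinyQueue {
  constructor(concurrency) {
    this.concurrency = concurrency;
    this.running = 0;
    this.pending = [];
    this.idleResolvers = [];
  }

  // Schedule an async task; the returned promise settles with the task's result.
  add(task) {
    return new Promise((resolve, reject) => {
      this.pending.push({ task, resolve, reject });
      this._next();
    });
  }

  // Resolves once nothing is running and nothing is pending.
  onIdle() {
    if (this.running === 0 && this.pending.length === 0) {
      return Promise.resolve();
    }
    return new Promise((resolve) => this.idleResolvers.push(resolve));
  }

  _next() {
    while (this.running < this.concurrency && this.pending.length > 0) {
      const { task, resolve, reject } = this.pending.shift();
      this.running++;
      Promise.resolve()
        .then(task)
        .then(resolve, reject)
        .finally(() => {
          this.running--;
          this._next();
          if (this.running === 0 && this.pending.length === 0) {
            this.idleResolvers.splice(0).forEach((r) => r());
          }
        });
    }
  }
}

// Illustrative driver: one queue slot per worker, one task per URL.
async function crawlAll(urls, numWorkers) {
  const queue = new TinyQueue(numWorkers);
  const visited = [];
  let active = 0;
  let maxActive = 0;

  for (const url of urls) {
    queue.add(async () => {
      active++;
      maxActive = Math.max(maxActive, active);
      // Placeholder for loading the URL in a new tab.
      await new Promise((r) => setTimeout(r, 5));
      visited.push(url);
      active--;
    });
  }

  await queue.onIdle();
  return { visited, maxActive };
}
```

With `crawlAll(urls, 2)`, no more than two tasks are in flight at any moment, which is the property the worker limit is meant to guarantee.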

* Remove concurrency models, always use new tab

* logging improvements: include worker-id in logs, use 'worker' context
- logging: log info string / version as first line
- logging: improve logging of error stack traces
- interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue
- interruption: don't repair if interrupting, wait for queue to be idle
- log text extraction
- init order: ensure wb-manager init called first, then logs created
- logging: adjust info->debug logging
- Log no jobs available as debug

* tests: bail on first failure

* iframe filtering:
- fix filtering for about:blank iframes, support non-async shouldProcessFrame()
- filter iframes both for behaviors and for link extraction
- add 5-second timeout to link extraction, to avoid link extraction holding up crawl!
- cache filtered frames
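
The frame filtering and timeout items above could be sketched roughly as follows. This is an illustration under assumptions, not the crawler's actual code: frames are plain objects with a `url` string rather than Puppeteer frames, `withTimeout` and `filterFrames` are hypothetical helper names, and skipping `about:blank` is assumed to be the intended behavior.

```javascript
// Resolve with `fallback` if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Keep only frames accepted by `shouldProcessFrame`, which may be sync or async.
// Decisions are cached per URL so each frame URL is evaluated once.
async function filterFrames(frames, shouldProcessFrame, cache = new Map()) {
  const kept = [];
  for (const frame of frames) {
    if (!cache.has(frame.url)) {
      // Assumption: about:blank iframes are always skipped.
      const allowed =
        frame.url !== "about:blank" && Boolean(await shouldProcessFrame(frame));
      cache.set(frame.url, allowed);
    }
    if (cache.get(frame.url)) {
      kept.push(frame);
    }
  }
  return kept;
}

// Illustrative use: give link extraction at most 5 seconds, else continue with no links.
// `extractLinks` is a hypothetical async function supplied by the caller.
async function extractLinksSafely(frames, shouldProcessFrame, extractLinks) {
  const allowed = await filterFrames(frames, shouldProcessFrame);
  return withTimeout(extractLinks(allowed), 5000, []);
}
```

Because `await` accepts plain values as well as promises, the same filter works whether `shouldProcessFrame` returns a boolean or a promise of one.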

* healthcheck/worker reuse:
- refactor healthchecker into separate class
- increment healthchecker (if provided) if new page load fails
- remove experimental repair functionality for now
- add healthcheck
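
As a minimal sketch of the health check idea above, assuming an error counter compared against a threshold: the class shape, the method names (`incError`, `resetErrors`, `statusCode`), the 200/503 status mapping, and resetting the counter on a successful load are all assumptions for illustration, not the refactored class from this commit.

```javascript
// Hypothetical health checker: tracks consecutive page-load failures.
class HealthChecker {
  constructor(errorThreshold) {
    this.errorThreshold = errorThreshold;
    this.errorCount = 0;
  }

  // Record one failed page load.
  incError() {
    this.errorCount++;
  }

  // Clear the failure count (assumed to happen after a successful load).
  resetErrors() {
    this.errorCount = 0;
  }

  // Status a health endpoint might report: 200 while under the threshold, else 503.
  statusCode() {
    return this.errorCount < this.errorThreshold ? 200 : 503;
  }
}

// Illustrative page-load wrapper: the health checker is optional,
// and is only incremented when the load fails.
async function loadPage(url, load, healthChecker = null) {
  try {
    await load(url);
    if (healthChecker) {
      healthChecker.resetErrors();
    }
    return true;
  } catch (e) {
    if (healthChecker) {
      healthChecker.incError();
    }
    return false;
  }
}
```

Keeping the counter in its own class lets the page-loading code report failures without knowing how health is exposed.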

* deps: bump puppeteer-core to 19.7.2
- bump to 0.9.0-beta.0

--------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Tessa Walsh 2023-03-08 21:31:19 -05:00 committed by GitHub
parent ac5a720362
commit 1bee46b321
GPG key ID: 4AEE18F83AFDEB23
13 changed files with 622 additions and 334 deletions

@@ -1,6 +1,6 @@
 {
   "name": "browsertrix-crawler",
-  "version": "0.8.1",
+  "version": "0.9.0-beta.0",
   "main": "browsertrix-crawler",
   "type": "module",
   "repository": "https://github.com/webrecorder/browsertrix-crawler",
@@ -8,7 +8,7 @@
   "license": "AGPL-3.0-or-later",
   "scripts": {
     "lint": "eslint *.js util/*.js tests/*.test.js",
-    "test": "yarn node --experimental-vm-modules $(yarn bin jest)"
+    "test": "yarn node --experimental-vm-modules $(yarn bin jest --bail 1)"
   },
   "dependencies": {
     "@novnc/novnc": "1.4.0-beta",
@@ -18,8 +18,8 @@
     "ioredis": "^4.27.1",
     "js-yaml": "^4.1.0",
     "minio": "7.0.26",
-    "puppeteer-cluster": "github:ikreymer/puppeteer-cluster#async-job-queue",
-    "puppeteer-core": "^17.1.2",
+    "p-queue": "^7.3.0",
+    "puppeteer-core": "^19.7.2",
     "request": "^2.88.2",
     "sitemapper": "^3.1.2",
     "uuid": "8.3.2",