Commit graph

13 commits

Author SHA1 Message Date
Ilya Kreymer
810856fcf9 add --browserdriver for testing different drivers
add testing with puppeteer again!
comment out BlockRules / AdBlockRules for now
2023-04-24 23:40:11 -07:00
Ilya Kreymer
2c3a866b22 logging: add memory logging
update playwright
reset to recycle after 5 pages as before
2023-04-24 20:25:33 -07:00
Ilya Kreymer
96a3aa837b possible work for #298
- split browser into NewContextBrowser and PersistentContextBrowser (old behavior)
- NewContextBrowser() closes and recreates the context when pages are closed
- worker uses 'storageState' to get state, set again on new page for that worker
2023-04-24 16:32:03 -07:00
Ilya Kreymer
52822f9e42
worker: lower wait time for the case where no additional pages remain and other workers will finish quickly. otherwise, this results in a minimum 10-second wait with >1 workers if only one page is encountered (#289)
2023-04-17 18:11:56 -07:00
Ilya Kreymer
fcd55c690a
worker index: set worker index automatically to work with k8s naming (#266)
- if CRAWL_ID env var set to 'crawl-id-name' while hostname is 'crawl-id-name-N' (automatically set via k8s statefulsets),
then set starting worker index to N * numWorkers
2023-03-29 22:27:17 -07:00
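
The index rule described in fcd55c690a could be sketched roughly as below. This is an illustration only, not the crawler's actual code: the function name `startingWorkerIndex` and its parameters are hypothetical, and falling back to index 0 when the hostname does not match is an assumption.

```javascript
// Hypothetical helper: derive the starting worker index from a
// StatefulSet-style hostname of the form "<crawlId>-N".
function startingWorkerIndex(crawlId, hostname, numWorkers) {
  const prefix = crawlId + "-";
  if (!crawlId || !hostname.startsWith(prefix)) {
    return 0; // assumption: no match means start at 0
  }
  const suffix = hostname.slice(prefix.length);
  if (!/^\d+$/.test(suffix)) {
    return 0; // assumption: a non-numeric suffix is treated as no match
  }
  // Pod N gets worker indexes N * numWorkers .. N * numWorkers + numWorkers - 1
  return Number(suffix) * numWorkers;
}
```

With 4 workers per pod, a pod named `crawl-id-name-2` would start at index 8 under this sketch, so indexes do not collide across pods.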
Tessa Walsh
b0e93cb06e
Add option for sleep interval after behaviors run + timing cleanup (#257)
* Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131)

* Store total page time in 'maxPageTime', include pageExtraDelay

* Rename timeout->pageLoadTimeout

* cleanup:
- store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions
- add secondsElapsed() utility function to help checking time elapsed
- cleanup comments

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-22 11:50:18 -07:00
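
A minimal sketch of the "store seconds, convert to ms only for API calls" convention from b0e93cb06e. The signature of `secondsElapsed()` shown here is assumed, and `hasExceeded()` and `toMs()` are hypothetical helpers added for illustration.

```javascript
// Assumed shape: start time is a ms timestamp (as from Date.now()),
// the result is elapsed time in seconds.
function secondsElapsed(startTimeMs, nowMs = Date.now()) {
  return (nowMs - startTimeMs) / 1000;
}

// Hypothetical: interval checks compare seconds against seconds.
function hasExceeded(startTimeMs, limitSecs, nowMs = Date.now()) {
  return secondsElapsed(startTimeMs, nowMs) > limitSecs;
}

// Hypothetical: convert to ms only at the boundary of an API that expects ms.
function toMs(seconds) {
  return seconds * 1000;
}
```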
Ilya Kreymer
02fb137b2c
Catch loading issues (#255)
* various loading improvements to avoid pages getting 'stuck' + load state tracking
- add PageState object, store loadState (0 to 4) as well as other per-page state properties on a defined object.
- set loadState to 0 (failed) by default
- set loadState to 1 (content-loaded) on 'domcontentloaded' event
- if page.goto() finishes, set loadState to 2 'full-page-load'.
- if page.goto() times out and domcontentloaded was not reached either, fail immediately. if domcontentloaded was reached, extract links, but don't run behaviors
- page considered 'finished' if it got to at least loadState 2 'full-page-load', even if behaviors timed out
- pages: log 'loadState' as part of pages.jsonl
- improve frame detection: detect if a frame is actually not from a frame tag (e.g. an OBJECT tag), and skip it as well
- screencaster: try screencasting every frame for now instead of every other frame, for smoother screencasting
- deps: behaviors: bump to browsertrix-behaviors 0.5.0-beta.0 release (includes autoscroll improvements)
- workers ids: just use 0, 1, ... n-1 worker indexes, send numeric index as part of screencast messages
- worker: only keeps track of crash state to recreate page, decouple crash and page failed/succeeded state
- screencaster: allow reusing caster slots with fixed ids
- interrupt timedCrawlPage() wait if 'crash' event happens
- crawler: pageFinished() callback when page finishes
- worker: add workerIdle callback, call screencaster.stopById() and send 'close' message when worker is empty
2023-03-20 18:31:37 -07:00
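
The load-state rules listed in 02fb137b2c can be sketched as a small decision function. Only the three states named in the message (0, 1, 2) are modelled. The names `LoadState` and `afterGoto` are hypothetical, and running behaviors after a full page load is inferred from the message rather than stated in it.

```javascript
// Only the states spelled out in the commit message; states 3 and 4 are omitted.
const LoadState = {
  FAILED: 0,
  CONTENT_LOADED: 1,
  FULL_PAGE_LOADED: 2,
};

// Hypothetical helper: decide what to do once page.goto() has settled.
// gotoFinished: page.goto() completed without timing out
// sawDomContentLoaded: the 'domcontentloaded' event fired
function afterGoto(gotoFinished, sawDomContentLoaded) {
  if (gotoFinished) {
    return { loadState: LoadState.FULL_PAGE_LOADED, extractLinks: true, runBehaviors: true };
  }
  if (sawDomContentLoaded) {
    // Timed out after content loaded: still extract links, but skip behaviors
    return { loadState: LoadState.CONTENT_LOADED, extractLinks: true, runBehaviors: false };
  }
  // Timed out before any content loaded: fail immediately
  return { loadState: LoadState.FAILED, extractLinks: false, runBehaviors: false };
}

// A page counts as 'finished' once it reached at least full page load.
function isFinished(loadState) {
  return loadState >= LoadState.FULL_PAGE_LOADED;
}
```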
Ilya Kreymer
07e503a8e6
Logger cleanup (#254)
* logging: convert logger to a singleton to simplify use

* add logger to create-login-profile.js
2023-03-17 14:24:44 -07:00
Ilya Kreymer
82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- workers directly during WorkerPool - await not necessary

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers, which each run in their own loop to process pages until there are no pending pages and no queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00
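
The worker-loop model described in 82808d8133 (each worker loops until nothing is queued and nothing is pending, and `runWorkers()` resolves when all loops end) might look roughly like the sketch below. It uses an in-memory stand-in for the Redis-backed state. `MemoryQueue`, `workerLoop`, the 10 ms idle sleep and the `failed` list are assumptions made for illustration; only `nextFromQueue()` and `runWorkers()` are names taken from the message.

```javascript
// In-memory stand-in for crawl state; the commit itself uses Redis.
class MemoryQueue {
  constructor(urls) {
    this.queued = [...urls];
    this.pending = 0;
    this.failed = [];
  }
  nextFromQueue() {
    const url = this.queued.shift();
    if (url !== undefined) this.pending++;
    return url;
  }
  markFinished() {
    this.pending--;
  }
  isDone() {
    return this.queued.length === 0 && this.pending === 0;
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// One worker: keep pulling pages until none are queued and none are pending.
async function workerLoop(id, queue, crawlPage) {
  while (!queue.isDone()) {
    const url = queue.nextFromQueue();
    if (url === undefined) {
      // Nothing queued, but another worker's pending page may still add URLs
      await sleep(10);
      continue;
    }
    try {
      await crawlPage(url, id);
    } catch (e) {
      queue.failed.push(url);
    } finally {
      queue.markFinished();
    }
  }
}

// Resolves once every worker loop has run to completion.
function runWorkers(numWorkers, queue, crawlPage) {
  const loops = [];
  for (let i = 0; i < numWorkers; i++) {
    loops.push(workerLoop(i, queue, crawlPage));
  }
  return Promise.all(loops);
}
```

A caller would `await runWorkers(n, queue, crawlPage)`, where `crawlPage` may push newly discovered URLs onto `queue.queued` while its page is still pending.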
Tessa Walsh
f19f1fcb8d
Minor crawler fixes after puppeteer-cluster removal refactoring (#250)
* Remove screencaster from Worker/WorkerPool

* Don't increment errors in crawlPageInWorker

* Set pageTarget variable early
2023-03-13 15:07:59 -07:00
Ilya Kreymer
4b8a414410
Add total timeout + limit redis queue retries (#248)
* time limits: re-add total timeout to runTask() in worker, just in case
refactor worker runTask() to return true/false depending on whether the task timed out
if timed out, recreate the page
redis: add limit to retried URLs, currently set to 1
* retry: remove URL if not retrying, log removal of URL from queue
2023-03-13 14:48:04 -07:00
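
The retry limit in 4b8a414410 can be illustrated with an in-memory sketch. The message only states that the limit is currently 1 and that a URL is removed, with a log line, when it will not be retried. Everything else here (the `handleTimedOut` name, the `Map` of counts, the array queue, the log text) is hypothetical and stands in for the Redis-backed state.

```javascript
const MAX_RETRIES = 1; // per the commit message, the limit is currently 1

// Hypothetical: called when a page's task timed out.
// Returns true if the URL was requeued, false if it was dropped.
function handleTimedOut(url, retryCounts, queue, log = console.log) {
  const retries = retryCounts.get(url) ?? 0;
  if (retries < MAX_RETRIES) {
    retryCounts.set(url, retries + 1);
    queue.push(url); // requeue for one more attempt
    return true;
  }
  retryCounts.delete(url);
  log(`Retry limit reached, removing URL from queue: ${url}`);
  return false;
}
```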
Tessa Walsh
aadd9a0483
Add timedRun to prevent async operations from hanging (#243)
* Add timedRun and apply to network requests

* Remove debugging print statement

* minor tweaks:
- move seconds to 2nd param, make param required
- use FETCH_TIMEOUT_SECS for fetch events and PAGE_OP_TIMEOUT_SECS for in-page events respectively
- use timedRun() for check CF action
- remove extra async

* additional logging
ensure queue is cleared when interrupting!

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-10 20:11:24 -08:00
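
A helper in the spirit of the `timedRun()` added in aadd9a0483 could be built on `Promise.race`, as sketched below. This is not the crawler's implementation. The message only confirms that seconds is the required second parameter, so the `message` argument and the error text are assumptions.

```javascript
// Hypothetical sketch: reject if `promise` has not settled within `seconds`.
function timedRun(promise, seconds, message = "Promise timed out") {
  if (typeof seconds !== "number") {
    throw new TypeError("seconds is required");
  }
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${message} after ${seconds}s`)),
      seconds * 1000
    );
  });
  // Whichever settles first wins; always clear the timer afterwards
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

Note that `Promise.race` only stops waiting. It does not cancel the underlying operation, which keeps running unless the caller aborts it separately.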
Tessa Walsh
1bee46b321
Remove puppeteer-cluster + iframe filtering + health check refactor + logging improvements (0.9.0-beta.0) (#219)
* This commit removes puppeteer-cluster as a dependency in favor of
a simpler concurrency implementation, using p-queue to limit
concurrency to the number of available workers. As part of the
refactor, the custom window concurrency model in windowconcur.js
is removed and its logic implemented in the new Worker class's
initPage method.

* Remove concurrency models, always use new tab

* logging improvements: include worker-id in logs, use 'worker' context
- logging: log info string / version as first line
- logging: improve logging of error stack traces
- interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue
- interruption: don't repair if interrupting, wait for queue to be idle
- log text extraction
- init order: ensure wb-manager init called first, then logs created
- logging: adjust info->debug logging
- Log no jobs available as debug

* tests: bail on first failure

* iframe filtering:
- fix filtering for about:blank iframes, support non-async shouldProcessFrame()
- filter iframes both for behaviors and for link extraction
- add 5-second timeout to link extraction, to avoid link extraction holding up crawl!
- cache filtered frames

* healthcheck/worker reuse:
- refactor healthchecker into separate class
- increment healthchecker (if provided) if new page load fails
- remove experimental repair functionality for now
- add healthcheck

* deps: bump puppeteer-core to 17.1.2
- bump to 0.9.0-beta.0

--------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-08 18:31:19 -08:00