* Check size of /crawls by default to fix disk utilization check
* Refactor percentage-used calculation and add unit tests (sketched below)
* Add tests using df output with disk usage above and below the threshold
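A minimal sketch of the calculation, assuming hypothetical helper names and a `df`-style data line of the form `<filesystem> <total> <used> <available> <use%> <mount>`:

```ts
// Hypothetical helpers illustrating the disk utilization check; the real
// function names and df invocation may differ.
function parseDfLine(line: string) {
  const [, total, used, available] = line.trim().split(/\s+/);
  return { total: Number(total), used: Number(used), available: Number(available) };
}

function calculatePercentageUsed(used: number, total: number): number {
  return Math.round((used / total) * 100);
}
```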
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* base: update to chrome 112
headless: switch to using new headless mode available in 112 which is more in sync with headful mode
viewport: use fixed viewport matching screen dimensions for headless and headful mode (if GEOMETRY is set; see the sketch below)
profiles: fix catching new window message, reopening page in current window
versions: bump to pywb 2.7.4, update puppeteer-core to 20.2.1
bump to 0.10.0-beta.4
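A sketch of the launch-argument changes; the xvfb-style `WxHxDEPTH` GEOMETRY format is an assumption here:

```ts
// Build Chrome launch args: new headless mode (Chrome >= 112) plus a fixed
// window size so headless and headful runs share the same viewport.
function chromeArgs(headless: boolean, geometry?: string): string[] {
  const args: string[] = [];
  if (headless) {
    args.push("--headless=new");
  }
  if (geometry) {
    // e.g. GEOMETRY="1360x1020x16" -> --window-size=1360,1020
    const [width, height] = geometry.split("x");
    args.push(`--window-size=${width},${height}`);
  }
  return args;
}
```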
* profile: force reopen in current window only for headless mode (currently breaks otherwise), remove logging messages
Chrome's built-in optimization features can often lead to Chrome downloading large ML models in
the background, which then end up in the crawl's web archives, even though
they don't have anything to do with the crawl.
Fixes #311.
* crawl stopping / additional states:
- adds an 'isCrawlStopped()' check, which reads a redis key to see if the crawl has been stopped externally; interrupts the work
loop and prevents the crawl from starting on load (see the sketch after this list)
- additional crawl states: 'generate-wacz', 'generate-cdx', 'generate-warc', 'uploading-wacz', and 'pending-wait' to indicate
when the crawl is no longer running but the crawler is still performing work
- addresses part of webrecorder/browsertrix-cloud#263, webrecorder/browsertrix-cloud#637
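A minimal sketch of the external-stop check; the ":stopping" key name is an assumption:

```ts
import Redis from "ioredis";

// The work loop polls this before taking the next page; another process
// sets the key to request that the crawl stop.
async function isCrawlStopped(redis: Redis, crawlId: string): Promise<boolean> {
  return (await redis.get(`${crawlId}:stopping`)) === "1";
}
```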
* Catch 400 pywb errors on page load and mark page failed
* Add --failOnFailedSeed option to fail crawl with exit code 1 if a seed doesn't load, resolves #207
* Handle 4xx or 5xx page.goto responses as page load errors
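A sketch of surfacing such load errors via puppeteer's page.goto() return value; the error message is illustrative:

```ts
import { Page } from "puppeteer-core";

// Treat any 4xx/5xx response to the top-level navigation as a failed page load.
async function loadPage(page: Page, url: string): Promise<void> {
  const resp = await page.goto(url, { waitUntil: "load" });
  if (resp && resp.status() >= 400) {
    throw new Error(`page load failed: ${url} returned ${resp.status()}`);
  }
}
```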
- reduced memory usage, avoids memory leak issues caused by using playwright (see #298)
- browser: split Browser into Browser and BaseBrowser
- browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later
- browser: use defaultArgs from playwright
- browser: attempt to recover if initial target is gone
- logging: add debug logging from process.memoryUsage() after every page
- request interception: use priorities for cooperative request interception (see the sketch after this list)
- request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used
- request interception: fix originOverrides enabled check, fix to work with catch-all request interception
- default args: set --waitUntil back to 'load,networkidle2'
- Update README with changes for puppeteer
- tests: fix extra hops depth test to ensure more than one page crawled
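A sketch of cooperative request interception with priorities in puppeteer; `shouldBlock` is a hypothetical predicate standing in for the blockrules/adblock/originOverrides checks:

```ts
import { Page, HTTPRequest } from "puppeteer-core";

// With cooperative interception, several handlers may see the same request;
// each resolves it with a priority, and the highest-priority resolution wins,
// so catch-all and rule-specific handlers can coexist.
async function setupInterception(page: Page, shouldBlock: (url: string) => boolean) {
  await page.setRequestInterception(true);
  page.on("request", (request: HTTPRequest) => {
    if (shouldBlock(request.url())) {
      request.abort("blockedbyclient", 1); // outranks default-priority continues
    } else {
      request.continue({}, 0); // default priority
    }
  });
}
```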
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* browser: just pass profileUrl and track if custom profile is used
browser: don't always disable service workers (accidentally added as part of the playwright migration);
only disable when using a profile, same as the 0.8.x behavior (see the sketch below)
fix for #288
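A sketch of the conditional service-worker disable over CDP; the surrounding wiring and session type are assumptions:

```ts
// Minimal CDP-session shape shared by puppeteer and playwright sessions.
interface CDPLike {
  send(method: string, params?: object): Promise<unknown>;
}

// Only bypass service workers when a custom profile is loaded,
// matching the 0.8.x behavior described above.
async function maybeDisableServiceWorkers(cdp: CDPLike, customProfile: boolean) {
  if (customProfile) {
    await cdp.send("Network.setBypassServiceWorker", { bypass: true });
  }
}
```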
* Fix full page screenshot (#296)
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* Store done in redis as integer rather than full json
* Add numFailed to crawler stats
* Cast numDone to int before returning
* Increment done counter for failed URLs
* Fix movefailed to push failed URLs to the failed key, not the done key (see the sketch below)
* Don't add failed to total stats twice
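A sketch of the counter scheme; the ":d" / ":f" key names are assumptions:

```ts
import Redis from "ioredis";

// Done is a plain integer counter rather than JSON; failed URLs go to their
// own list but still increment done, so totals aren't counted twice.
async function markFailed(redis: Redis, crawlId: string, url: string) {
  await redis.rpush(`${crawlId}:f`, url);
  await redis.incr(`${crawlId}:d`);
}

async function numDone(redis: Redis, crawlId: string): Promise<number> {
  return parseInt((await redis.get(`${crawlId}:d`)) || "0"); // cast to int
}
```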
* max page limit:
- rename --limit -> --pageLimit (keep alias for now)
- add new --maxPageLimit flag which overrides --pageLimit to ensure it does not exceed the maximum (sketched below)
- readme: add new --pageLimit, --maxPageLimit to README
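A sketch of the override logic; the parameter object shape is an assumption:

```ts
// --maxPageLimit caps --pageLimit; if no pageLimit was given,
// the max becomes the effective limit.
function applyMaxPageLimit(params: { pageLimit: number; maxPageLimit: number }) {
  if (params.maxPageLimit) {
    params.pageLimit = params.pageLimit
      ? Math.min(params.pageLimit, params.maxPageLimit)
      : params.maxPageLimit;
  }
}
```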
* pending lock reset:
- quicker retry of pending URLs after crawler crash by clearing pending page locks
- pending URLs are locked with <crawl>:p:<url> to indicate they are currently being rendered
- when a crawler restarts, check if <crawl>:p:<url> is set to its unique id and remove the pending lock, allowing the URL
to be retried, as it's no longer actively being crawled (see the sketch after this list)
- if CRAWL_ID env var set to 'crawl-id-name' while hostname is 'crawl-id-name-N' (automatically set via k8s statefulsets),
then set starting worker index to N * numWorkers
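A sketch of the lock reset and worker-index derivation, assuming the key scheme described above:

```ts
import os from "os";
import Redis from "ioredis";

// On restart, drop any pending locks this crawler instance still holds so
// those URLs can be retried. (SCAN would be preferable to KEYS at scale.)
async function resetPendingLocks(redis: Redis, crawlId: string, uid: string) {
  for (const key of await redis.keys(`${crawlId}:p:*`)) {
    if ((await redis.get(key)) === uid) {
      await redis.del(key);
    }
  }
}

// Derive the starting worker index from a statefulset-style hostname
// "crawl-id-name-N" when CRAWL_ID is "crawl-id-name".
function startingWorkerIndex(numWorkers: number): number {
  const crawlId = process.env.CRAWL_ID || "";
  const match = os.hostname().match(/-(\d+)$/);
  return crawlId && match && os.hostname().startsWith(crawlId)
    ? Number(match[1]) * numWorkers
    : 0;
}
```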
* Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131)
* Store total page time in 'maxPageTime', include pageExtraDelay
* Rename timeout->pageLoadTimeout
* cleanup:
- store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions
- add secondsElapsed() utility function to help check elapsed time (sketched below)
- cleanup comments
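A sketch of the secondsElapsed() helper; the exact signature is an assumption:

```ts
// Keep time bookkeeping in seconds; convert to ms only at API boundaries.
function secondsNow(): number {
  return Date.now() / 1000;
}

function secondsElapsed(startSecs: number, nowSecs: number = secondsNow()): number {
  return nowSecs - startSecs;
}
```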
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* various loading improvements to avoid pages getting 'stuck' + load state tracking
- add PageState object: store loadState (0 to 4) as well as other per-page state properties on a defined object (see the sketch after this list).
- set loadState to 0 (failed) by default
- set loadState to 1 (content-loaded) on 'domcontentloaded' event
- if page.goto() finishes, set loadState to 2 'full-page-load'.
- if page.goto() times out: if 'domcontentloaded' wasn't reached either, fail immediately; if it was reached, extract links but don't run behaviors
- page considered 'finished' if it got to at least loadState 2 'full-page-load', even if behaviors timed out
- pages: log 'loadState' as part of pages.jsonl
- improve frame detection: detect if a frame is actually not from a frame tag (eg. an OBJECT tag) and skip it as well
- screencaster: try screencasting every frame for now instead of every other frame, for smoother screencasting
- deps: behaviors: bump to browsertrix-behaviors 0.5.0-beta.0 release (includes autoscroll improvements)
- worker ids: just use 0, 1, ..., n-1 worker indexes, send the numeric index as part of screencast messages
- worker: only keeps track of crash state to recreate page, decouple crash and page failed/succeeded state
- screencaster: allow reusing caster slots with fixed ids
- interrupt timedCrawlPage() wait if 'crash' event happens
- crawler: pageFinished() callback when page finishes
- worker: add workerIdle callback, call screencaster.stopById() and send 'close' message when worker is empty
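A sketch of the PageState object and its load states; the notes only name states 0-2, so states 3 and 4 are left undefined here:

```ts
// Load states per the notes: 0 = failed (default), 1 = content-loaded,
// 2 = full-page-load; 3 and 4 cover later stages not detailed here.
enum LoadState {
  FAILED = 0,
  CONTENT_LOADED = 1,
  FULL_PAGE_LOAD = 2,
}

class PageState {
  url: string;
  depth: number;
  loadState: LoadState = LoadState.FAILED; // pessimistic default

  constructor(url: string, depth = 0) {
    this.url = url;
    this.depth = depth;
  }

  // a page counts as finished if it reached at least a full page load,
  // even if behaviors later timed out
  isFinished(): boolean {
    return this.loadState >= LoadState.FULL_PAGE_LOAD;
  }
}
```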
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles (see the sketch after this list)
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- create workers directly in WorkerPool - await not necessary
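A minimal sketch of the persistent-context launch (playwright API); profileDir and headless are assumptions:

```ts
import { chromium } from "playwright";

// A persistent context keeps the user data dir (the profile) on disk,
// which is how profiles survive across launches.
async function launchWithProfile(profileDir: string, headless: boolean) {
  const context = await chromium.launchPersistentContext(profileDir, { headless });
  const page = await context.newPage();
  return { context, page };
}
```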
* State / Worker Refactor (#252)
* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score (fixes #150; see the sketch after this list)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt
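A sketch of the sorted-set queue; weighting extraHops by a large MAX_DEPTH constant is an assumption consistent with the notes:

```ts
import Redis from "ioredis";

const MAX_DEPTH = 1_000_000;

// Lower scores pop first, so regular-depth pages are crawled before
// extra-hops pages of any depth.
async function queueUrl(redis: Redis, key: string, url: string, depth: number, extraHops = 0) {
  const score = depth + extraHops * MAX_DEPTH;
  await redis.zadd(key, score, JSON.stringify({ url, depth, extraHops }));
}

async function nextFromQueue(redis: Redis, key: string) {
  const [member] = await redis.zpopmin(key); // returns [member, score] of lowest score
  return member ? JSON.parse(member) : null;
}
```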
* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages
* more worker queue refactor:
- remove p-queue
- initialize PageWorkers, each of which runs its own loop to process pages until no pages are pending or queued (see the sketch after this list)
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1
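A sketch of the worker loop shape; beyond the runWorkers/setupPage/teardownPage names from the notes, the interface is assumed:

```ts
// Hypothetical crawler surface as seen from a worker.
interface Crawler {
  setupPage(workerId: number): Promise<unknown>;
  teardownPage(page: unknown): Promise<void>;
  crawlPageInWorker(page: unknown): Promise<void>;
  hasWork(): Promise<boolean>; // false once nothing is pending or queued
}

class PageWorker {
  constructor(private id: number, private crawler: Crawler) {}

  async run() {
    while (await this.crawler.hasWork()) {
      const page = await this.crawler.setupPage(this.id);
      await this.crawler.crawlPageInWorker(page);
      await this.crawler.teardownPage(page);
    }
  }
}

// Resolves only when every worker has finished.
async function runWorkers(crawler: Crawler, numWorkers: number) {
  await Promise.all(
    Array.from({ length: numWorkers }, (_, i) => new PageWorker(i, crawler).run()),
  );
}
```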
* use existing data object for per-page context, instead of adding things to the page object (will be clearer with the typescript transition)
* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create a new page + cdp session in a new window, used with a timeout (sketched below)
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages
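A sketch of newWindowPageWithCDP(), with its shape inferred from the name; the timeout value is illustrative, and forcing a genuinely separate window may need extra CDP work not shown:

```ts
import { BrowserContext, Page, CDPSession } from "playwright";

// Open a page and attach a CDP session, failing fast if page creation hangs.
async function newWindowPageWithCDP(
  context: BrowserContext,
  timeoutMs = 20000,
): Promise<{ page: Page; cdp: CDPSession }> {
  const timeout = new Promise<never>((_, reject) =>
    setTimeout(() => reject(new Error("new page timed out")), timeoutMs),
  );
  const page = await Promise.race([context.newPage(), timeout]);
  const cdp = await context.newCDPSession(page);
  return { page, cdp };
}
```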
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>