Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remove
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing for every excluded URL
- tests: update saved-state test to be more resilient to delays
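
The glob-vs-regex choice above can be sketched roughly as follows. This is a minimal in-memory illustration, not the crawler's code: `buildMatcher`, `isExcluded` and `filterQueue` are hypothetical names, and the metacharacter check is only one plausible way to do the "string check" mentioned above.

```typescript
// Hypothetical helper names; only the glob-vs-regex idea comes from the notes above.
// Any of these characters gives the exclusion string regex meaning.
const REGEX_META = /[\\^$.*+?()[\]{}|]/;

type Matcher =
  | { kind: "glob"; match: string }
  | { kind: "regex"; regex: RegExp };

// String check: a pattern with no regex metacharacters is a plain substring,
// so a glob of the form `*pattern*` matches the same URLs.
function buildMatcher(pattern: string): Matcher {
  if (!REGEX_META.test(pattern)) {
    return { kind: "glob", match: "*" + pattern + "*" };
  }
  return { kind: "regex", regex: new RegExp(pattern) };
}

function isExcluded(url: string, pattern: string): boolean {
  const matcher = buildMatcher(pattern);
  if (matcher.kind === "glob") {
    // In-memory stand-in for a server-side glob match on a literal string.
    return url.indexOf(pattern) !== -1;
  }
  return matcher.regex.test(url);
}

// Returns the queue with every excluded URL removed.
function filterQueue(queue: string[], pattern: string): string[] {
  return queue.filter((url) => !isExcluded(url, pattern));
}
```

In this sketch an exclusion such as `login` takes the glob path, while anything containing `.`, `(`, `$` and similar falls back to a full regex test.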
args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334
bump to 0.12.1
- use DOMSnapshot.captureSnapshot instead of older DOM.getDocument to
get the snapshot (consistent with ArchiveWeb.page) - should be slightly
more performant
- keep option to use DOM.getDocument
- refactor warc resource writing to separate class, used by text
extraction and screenshots
- write extracted text to WARC files as 'urn:text:<url>' after page
loads, similar to screenshots
- also store final text to WARC as 'urn:textFinal:<url>' if it is
different
- cli options: update `--text` to take one or more comma-separated
string options `--text to-warc,to-pages,final-to-warc`. For backwards
compatibility, support `--text` and `--text true` to be equivalent to
`--text to-pages`.
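
A rough sketch of how the `--text` values could be normalized, including the `--text false` case noted earlier. `parseTextOption` is a hypothetical helper, and it assumes a bare `--text` arrives from the argument parser as boolean `true` or an empty string, and that an absent flag means no text extraction.

```typescript
// Option names taken from the notes above.
const TEXT_OPTIONS = ["to-pages", "to-warc", "final-to-warc"];

// Hypothetical normalizer: turns the raw CLI value into a list of option names.
function parseTextOption(raw: string | boolean | undefined): string[] {
  if (raw === undefined) return []; // assumption: flag absent means no extraction
  if (typeof raw === "boolean") return raw ? ["to-pages"] : [];
  if (raw === "false") return []; // backwards compatibility
  if (raw === "true" || raw === "") return ["to-pages"]; // backwards compatibility

  const selected: string[] = [];
  for (const part of raw.split(",")) {
    const name = part.trim();
    if (TEXT_OPTIONS.indexOf(name) === -1) {
      throw new Error("Unknown --text option: " + name);
    }
    // Ignore repeated names so each option appears once.
    if (selected.indexOf(name) === -1) selected.push(name);
  }
  return selected;
}
```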
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- set done key correctly, just an int now
- also check if array for old-style save states (for backwards
compatibility)
- fixes #411
- tests: includes tests using redis: tests save state + dynamically
adding exclusions (follow up to #408)
- adds `--debugAccessRedis` flag to allow accessing local redis outside
container
- if HEAD succeeds, do a direct fetch of non-HTML resource
- add filter to AsyncFetcher: reject if non-200 response or response sets cookies
- set loadState to 'full page loaded' (2) for direct-fetched pages
- also set mime type to better differentiate non-HTML pages, and lower loadState
- AsyncFetcher dupe handling: load() returns "fetched", "dupe" or "notfetched" to differentiate dupe vs failed loading
- response async loading: if 'dupe', don't attempt to load again
- direct fetch: add ignoreDupe to ignore dupe check: if loading as page, always load again, even if previously loaded as a non-page resource
- don't set start / end time in redis
- rename setEndTimeAndExit to setStatusAndExit
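
The direct-fetch filter and the three load() results described above might fit together roughly as below. This is a simplified, synchronous sketch: `DupeAwareLoader`, `SimpleResponse` and `passesFilter` are hypothetical names, the response is passed in rather than fetched, and the order of the dupe and filter checks is an assumption.

```typescript
type LoadResult = "fetched" | "dupe" | "notfetched";

// Hypothetical stand-in for a fetched response; header names assumed lower-cased.
interface SimpleResponse {
  status: number;
  headers: Record<string, string>;
}

// Filter from the list above: reject non-200 responses and responses that set cookies.
function passesFilter(resp: SimpleResponse): boolean {
  return resp.status === 200 && !("set-cookie" in resp.headers);
}

class DupeAwareLoader {
  // Prototype-free object so URL strings can never collide with built-in keys.
  private seen: Record<string, boolean> = Object.create(null);

  // ignoreDupe: when loading as a page, load again even if the URL was
  // already loaded as a non-page resource.
  load(url: string, resp: SimpleResponse, ignoreDupe = false): LoadResult {
    if (!ignoreDupe && this.seen[url]) return "dupe";
    if (!passesFilter(resp)) return "notfetched";
    this.seen[url] = true;
    return "fetched";
  }
}
```

A caller that gets "dupe" back can skip any further attempt to load the response, while "notfetched" signals an actual failure.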
add 'fast cancel' option:
- add isCrawlCanceled() to state, which checks redis canceled key
- on interrupt, if canceled, immediately exit with status 0
- on fatal, exit with code 0 if restartsOnError is set
- no longer keeping track of start/end time in crawler itself
- logger.fatal() also sets crawl status to 'failed' and adds endTime before exiting
- add 'failOnFailedLimit' to set crawl status to 'failed' if number of failed pages exceeds limit, refactored from #393 to now use logger.fatal() to end crawl.
* Store crawler start and end times in Redis lists
* end time tweaks:
- set end time for logger.fatal()
- set missing start time into setEndTime()
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- run behaviors: check if behaviors object exists before trying to run behaviors to avoid failure message
- skip behaviors if frame no longer attached / has empty URL
* error handling fixes:
- listen to correct event for page crashes, 'error' instead of 'crash', may fix #371, #351
- more removal of duplicate logging for status-related errors, eg. if page crashed, don't log worker exception
- detect browser 'disconnected' event, interrupt crawl (but allow post-crawl tasks, such as waiting for pending requests to run), set browser to null to avoid trying to use again.
worker:
- bump new page timeout to 20
- if loading page from new domain, always use new page
logger:
- log timestamp first for better sorting
* optimize link extraction: (fixes #376)
- dedup urls in browser first
- don't return entire list of URLs, process one-at-a-time via callback
- add exposeFunction per page in setupPage, then register 'addLink' callback for each page's handler
- optimize addqueue: atomically check if already at max urls and if url already seen in one redis call
- add QueueState enum to indicate possible states: url added, limit hit, or dupe url
- better logging: log rejected promises for link extraction
- tests: add test for exact page limit being reached
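
The combined limit and dupe check can be illustrated with an in-memory stand-in for the single Redis call. `InMemoryQueue` and the numeric values are hypothetical, and the order of the two checks (limit first, then dupe) is an assumption of this sketch.

```typescript
// Possible outcomes of adding a URL, mirroring the states listed above.
const QueueState = { ADDED: 0, LIMIT_HIT: 1, DUPE_URL: 2 } as const;
type QueueState = (typeof QueueState)[keyof typeof QueueState];

class InMemoryQueue {
  // Prototype-free object used as a set of seen URLs.
  private seen: Record<string, boolean> = Object.create(null);
  private seenCount = 0;
  private pending: string[] = [];
  private limit: number;

  // limit <= 0 means no limit in this sketch.
  constructor(limit: number) {
    this.limit = limit;
  }

  // One call performs both checks and the insert, so no caller can
  // slip in between the check and the add.
  add(url: string): QueueState {
    if (this.limit > 0 && this.seenCount >= this.limit) return QueueState.LIMIT_HIT;
    if (this.seen[url]) return QueueState.DUPE_URL;
    this.seen[url] = true;
    this.seenCount += 1;
    this.pending.push(url);
    return QueueState.ADDED;
  }

  size(): number {
    return this.pending.length;
  }
}
```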
* behavior logging tweaks, add netIdle
* fix shouldIncludeFrame() check: was actually erroring out and never accepting any iframes!
now used not only for link extraction but also to run() behaviors
* add logging if iframe check fails
* Dockerfile: add commented out line to use local behaviors.js
* bump behaviors to 0.5.2
* Add option to output stats file live, i.e. after each page crawled
* Always output stat files after each page crawled (+ test)
* Fix inversion between expected and test value
* additional fixes:
- use distinct exit code for subsequent interrupt (13) and fatal interrupt (17)
- if crawl has been stopped, mark for final exit for post crawl tasks
- stopped takes precedence over interrupted: if both, still exit with 0 (and marked for final exit)
- if no warcs found, crawl stopped, but previous pages found, don't consider failed!
- cleanup: remove unused code, rename to gracefulFinishOnInterrupt, separate from graceful finish via crawl stopped
* Surface lastmod option for sitemap parser
- Add --sitemapFromDate to use along with --useSitemap, which will filter sitemap URLs to those
modified on or after the specified ISO date.
The library used to parse sitemaps for URLs added an optional
"lastmod" argument in v3.2.5 that allows filtering URLs returned
by a "last_modified" element present in sitemap XMLs. This
surfaces that argument to the browsertrix-crawler CLI runtime
parameters.
This can be useful for orienting a crawl around a list of seeds
known to contain sitemaps, when only URLs that have been modified
on or after a given date should be included.
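
A rough sketch of the "on or after" filtering described above. `SitemapEntry` and `filterByLastmod` are hypothetical and are not the sitemap library's API; dropping entries that have no lastmod value is an assumption of this sketch.

```typescript
// Hypothetical shape of a parsed sitemap entry.
interface SitemapEntry {
  url: string;
  lastmod?: string; // ISO date string, if present in the sitemap XML
}

// Keeps only entries modified on or after fromDate (an ISO date string).
function filterByLastmod(entries: SitemapEntry[], fromDate: string): SitemapEntry[] {
  const cutoff = new Date(fromDate).getTime();
  if (isNaN(cutoff)) throw new Error("Invalid ISO date: " + fromDate);

  return entries.filter((entry) => {
    // Assumption: entries without a lastmod value are excluded.
    if (entry.lastmod === undefined) return false;
    const modified = new Date(entry.lastmod).getTime();
    return !isNaN(modified) && modified >= cutoff;
  });
}
```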
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- avoid duplicate logging for same error, if logging more specific message and rethrowing exception,
set e.detail to "logged" and worker exception handler will not log same error again
- add option to log timeouts as warnings instead of errors
- remove unneeded async method in browser, get headers directly
- fix logging in screenshots to include page
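
The "logged" marker for avoiding duplicate log lines could look roughly like this. `loadPage` and `runWorker` are hypothetical stand-ins and the failure is simulated; only the `e.detail = "logged"` convention comes from the notes above.

```typescript
// Error with the optional marker described above.
interface DetailedError extends Error {
  detail?: string;
}

// Logs a specific message, marks the error as logged, then rethrows it.
function loadPage(url: string, log: string[]): void {
  try {
    throw new Error("simulated navigation failure");
  } catch (e) {
    const err = e as DetailedError;
    log.push("Page load failed: " + url);
    err.detail = "logged";
    throw err;
  }
}

// Worker-level handler: only logs errors that were not already logged.
function runWorker(url: string, log: string[]): void {
  try {
    loadPage(url, log);
  } catch (e) {
    const err = e as DetailedError;
    if (err.detail !== "logged") {
      log.push("Worker exception: " + err.message);
    }
  }
}
```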
* logging: resolve confusion with 'crawl done' not being written to log, because the log is itself stored in the WACZ: (fixes #365)
- keep log file open until end, even if it's being written to WACZ, close before exit
- add logging of 'crawling done' when crawling is done (writing to WACZ or not)
- add debug logging of 'end of log file' to indicate that the log file is being added to the WACZ and nothing further will be added to the copy in the WACZ.
- get favicon from CDP debug page, if available, log warning if not
- store in favIconUrl in pages.jsonl
- test: add test for favIcon and additional multi-page crawls
- if interrupted (via signal or due to limits) and not finished, return error code 11 to indicate interruption
- allow stopping single instances with hset '<crawlid>:stopone' uid (similar to status)
- deliberate stop via redis not considered interruption (exit 0)
- handle browser crash -- if getting new page fails after 5 tries, assume browser crashed and exit
- check if timedRun() returns a non-null value before expanding
- update timedRun() to rethrow any non-timeout exception, instead of just logging 'unknown exception', as it should be handled downstream.
* support loading custom behaviors from a specified directory via --customBehaviors
* call load() for each behavior incrementally, then call selectMainBehavior() (available in browsertrix-behaviors 0.5.1)
* tests: add tests for multiple custom behaviors
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* Check size of /crawls by default to fix disk utilization check
* Refactor calculating percentage used and add unit tests
* add tests using df output with disk usage above and below
threshold
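
A minimal sketch of a percentage-used calculation from df output, in the spirit of the refactor and tests above. `percentUsedFromDf` and `isOverThreshold` are hypothetical names, and both the formula (used divided by used plus available) and the assumed column layout of the standard df table are assumptions rather than the crawler's actual logic.

```typescript
// Parses the first data row of `df` output and returns the rounded
// percentage of space used. Assumes columns:
// Filesystem, 1K-blocks, Used, Available, Use%, Mounted on.
function percentUsedFromDf(dfOutput: string): number {
  const lines = dfOutput.trim().split("\n");
  if (lines.length < 2) throw new Error("Unexpected df output");

  const fields = lines[1].trim().split(/\s+/);
  const used = Number(fields[2]);
  const available = Number(fields[3]);
  if (isNaN(used) || isNaN(available) || used + available === 0) {
    throw new Error("Could not parse df output");
  }
  return Math.round((used / (used + available)) * 100);
}

// True when usage is at or above the given threshold (in percent).
function isOverThreshold(dfOutput: string, thresholdPercent: number): boolean {
  return percentUsedFromDf(dfOutput) >= thresholdPercent;
}
```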
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* base: update to chrome 112
headless: switch to using new headless mode available in 112 which is more in sync with headful mode
viewport: use fixed viewport matching screen dimensions for headless and headful mode (if GEOMETRY is set)
profiles: fix catching new window message, reopening page in current window
versions: bump to pywb 2.7.4, update puppeteer-core to 20.2.1
bump to 0.10.0-beta.4
* profile: force reopen in current window only for headless mode (currently breaks otherwise), remove logging messages
* crawl stopping / additional states:
- adds check for 'isCrawlStopped()' which checks redis key to see if crawl has been stopped externally, and interrupts work
loop and prevents crawl from starting on load
- additional crawl states: 'generate-wacz', 'generate-cdx', 'generate-warc', 'uploading-wacz', and 'pending-wait' to indicate
when crawl is no longer running but crawler performing work
- addresses part of webrecorder/browsertrix-cloud#263, webrecorder/browsertrix-cloud#637
* Catch 400 pywb errors on page load and mark page failed
* Add --failOnFailedSeed option to fail crawl with exit code 1 if seed doesn't load, resolves #207
* Handle 4xx or 5xx page.goto responses as page load errors
- reduced memory usage, avoids memory leak issues caused by using playwright (see #298)
- browser: split Browser into Browser and BaseBrowser
- browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later
- browser: use defaultArgs from playwright
- browser: attempt to recover if initial target is gone
- logging: add debug logging from process.memoryUsage() after every page
- request interception: use priorities for cooperative request interception
- request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used
- request interception: fix originOverrides enabled check, fix to work with catch-all request interception
- default args: set --waitUntil back to 'load,networkidle2'
- Update README with changes for puppeteer
- tests: fix extra hops depth test to ensure more than one page crawled
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* browser: just pass profileUrl and track if custom profile is used
browser: don't disable service workers always (accidentally added as part of playwright migration)
only disable if using profile, same as 0.8.x behavior
fix for #288
* Fix full page screenshot (#296)
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>