wrap remaining frame.evaluate() and page.evaluate() calls that are not
already within a timedRun() in their own timedRun() to avoid rare cases
where they do not return (eg. if page crashes during the evaluate)
…store filename along with page data:
- set filename on crawler load, if not already set, otherwise use
existing
- store filename per crawler instance in <crawlid>:nextWacz
- add 'filename' field to page when writing pages to redis
- clear wacz filename when wacz is uploaded to set a new one
- fixes#747
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- if redirected page is excluded, block loading of page
- mark page as excluded, don't retry, and don't write to page list
- support generic blocking of pages based on initial page response
- fixes#744
- retries: for failed pages, set retry to 5 in cases multiple retries
may be needed.
- redirect: if page url is /path/ -> /path, don't add as extra seed
- proxy: don't use global dispatcher, pass dispatcher explicitly when
using proxy, as proxy may interfere with local network requests
- final exit flag: if crawl is done and also interrupted, ensure WACZ is
still written/uploaded by setting final exit to true
- hashtag only change force reload: if loading page with same URL but
different hashtag, eg. `https://example.com/#B` after
`https://example.com/#A`, do a full reload
- Retry pages that are marked as failed once, at the end of the crawl,
in case it was due to a timeout
- Also, don't treat differences in hashtag between seed page loaded and
actual URL as a redirect (eg. don't add as new seed)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not
enabling by default
- Adds support for new exposed function `__bx_addSet` which allows
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates, only used if link has an href
- Adds a new pageFinished flag on the worker state.
- Adds a on('dialog') handler to reject onbeforeunload page navigations,
when in behavior (page not finished), but accept when page is finished -
to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors printed as warnings, instead of hard
exit, for forward compatibility for new behavior types in the future
Fixes#728, also #216, #665, #31
Chromium now interrupts fetch() if abort() is called or page is
navigated, so autofetch behavior using native fetch() is less than
ideal. This PR adds support for __bx_fetch() command for autofetch
behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately
from browser's reguar fetch()
- __bx_fetch() starts a fetch, but does not return content to browser,
doesn't need abort(), unaffected by page navigation, but will still try
to use browser network stack when possible, making it more efficient for
background fetching.
- if network stack fetch fails, fallback to regular node fetch() in the
crawler.
Additional improvements for interrupted fetch:
- don't store truncated media responses, even for 200
- avoid doing duplicate async fetching if response already handled (eg.
fetch handled in multiple contexts)
- fixes#735, where fetch was interrupted, resulted in an empty response
- new `fullPageFinal` screenshot option, which will take a full page screenshot after behaviors are run, or before moving onto next page if behaviors are skipped.
Related to #486
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- fix for archiving facebook video, to match
webrecorder/archiveweb.page#272
- permissions: auto enable permissions to avoid possibly modal (for both
profiles and crawling)
- deps: update to latest wabac.js + warcio.js
various fixes for streaming, especially related to range requests
- follow up to #709
- fix: prefer streaming current response via takeStream, not only when
size is unknown
- don't serialize async responses prematurely
- don't serialize 206 responses if there is size mismatch
Fixes#712
- Also expands the existing documentation about behaviors and adds a test.
- Uses query arg for 'branch' and 'path' to specify git branch and subpath in repo, respectively.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Fixes#368
The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.
Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.
New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- rework 'should stream' logic:
* ensure 206 responses (or any response) greater than 25M are streamed
* response between 5M and 25M are read into memory if text/css/js as they may be rewritten
* responses <5M are read into memory
* responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error code responses may lack status codes but otherwise are small
- likely fix for issues in #706
- if too many range requests for same URL are being made, try
skipping/failing right away to reduce load
- assume main browser context is used not just for service workers,
always enable
- check false positive 'net-aborted' error that may actually be ok for
media, as well as documents
- improve logging
- interrupt any pending requests (that may be loading via browser
context) after page timeout, log dropped requests
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- catch frame.evaluate() directly and log errors there to avoid any
possibility of exception being propagated before wrapping in timedRun()
- also add clearTimeout() to timedRun()
- possibly fixesopenzim/zimit#376
- if extraHops is set, crawler should visit pages beyond maxDepth
- currently returning out of scope at depth limit even if extraHops is
set
- adjust isInScope and isAtMaxDepth to account for extraHops
- tests: update extra hops test to test extraHops beyond depth
- fixes#693
- add additional catch() block
- wrap page.title() in timedRun() to catch/log exception if this fails
- log error in getting cookies
- hopefully fixes hard-to-repro edge case crash in openzim/zimit#376
to avoid possible exception due to encoding. (Probably a node bug,
reported in nodejs/undici#3616)
Replace abort with cancel, which is the recommended way to cancel the
response.
fixes#687
Differentiate from expected/predictable interrupts due to limits (exit
code 11) and unexpected interrupt due to browser crash (now exit code
10)
fixes#683
- Refactors args parsing so that `Crawler.params` is properly timed with
CLI options + additions with `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
- use existing headersTimeout in undici to limit time to headers fetch
to 30 seconds, reject direct fetch if timeout is reached
- allow full page timeout for loading payload via direct fetch
- support setting global fetch() settings
- add markPageUsed() to only reuse pages when not doing direct fetch
- apply auth headers to direct fetch
- catch failed fetch and timeout errors
- support failOnFailedSeeds for direct fetch, ensure timeout is working
- ensure WARC rollover happens only after response/request + cdx or
single record + cdx have been written
- ensure request payload is buffered for POST request indexing
- update to warcio 2.3.1 for POST request case-insensitive
'content-type' check
- recorder: remove unused 'tempdir', no longer used as warcio chooses a
temp file on it's own