Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not
enabling by default
- Adds support for new exposed function `__bx_addSet` which allows
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates, only used if link has an href
- Adds a new pageFinished flag on the worker state.
- Adds a on('dialog') handler to reject onbeforeunload page navigations,
when in behavior (page not finished), but accept when page is finished -
to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors printed as warnings, instead of hard
exit, for forward compatibility for new behavior types in the future
Fixes#728, also #216, #665, #31
Chromium now interrupts fetch() if abort() is called or page is
navigated, so autofetch behavior using native fetch() is less than
ideal. This PR adds support for __bx_fetch() command for autofetch
behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately
from browser's reguar fetch()
- __bx_fetch() starts a fetch, but does not return content to browser,
doesn't need abort(), unaffected by page navigation, but will still try
to use browser network stack when possible, making it more efficient for
background fetching.
- if network stack fetch fails, fallback to regular node fetch() in the
crawler.
Additional improvements for interrupted fetch:
- don't store truncated media responses, even for 200
- avoid doing duplicate async fetching if response already handled (eg.
fetch handled in multiple contexts)
- fixes#735, where fetch was interrupted, resulted in an empty response
- fix for archiving facebook video, to match
webrecorder/archiveweb.page#272
- permissions: auto enable permissions to avoid possibly modal (for both
profiles and crawling)
- deps: update to latest wabac.js + warcio.js
Fixes#368
The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.
Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.
New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- rework 'should stream' logic:
* ensure 206 responses (or any response) greater than 25M are streamed
* response between 5M and 25M are read into memory if text/css/js as they may be rewritten
* responses <5M are read into memory
* responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error code responses may lack status codes but otherwise are small
- likely fix for issues in #706
- if too many range requests for same URL are being made, try
skipping/failing right away to reduce load
- assume main browser context is used not just for service workers,
always enable
- check false positive 'net-aborted' error that may actually be ok for
media, as well as documents
- improve logging
- interrupt any pending requests (that may be loading via browser
context) after page timeout, log dropped requests
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- ensure WARC rollover happens only after response/request + cdx or
single record + cdx have been written
- ensure request payload is buffered for POST request indexing
- update to warcio 2.3.1 for POST request case-insensitive
'content-type' check
- recorder: remove unused 'tempdir', no longer used as warcio chooses a
temp file on it's own
Fixes#674
This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously WARC was written once, then read again (for CDXJ indexing),
read again (for adding to new WACZ ZIP), written to disk (into new WACZ
ZIP), read again (if upload to remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all
data is read once to either generate WACZ on disk or upload to remote.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- rewriting: update wabac.js, use getCustomRewriter(), don't truncate
POST request bodies for URLs that use a custom rewriter
- browser: disable --enable-automation, setting webdriver = true, so no
need for override
- deps: update puppeteer-core, necessary changes for latest puppeteer
Refactors handling of 206 responses:
- If a 206 response is encountered, and its actually the full range,
convert to 200 and rewrite range and content-range headers to x-range
and x-orig-range. This is to support rewriting of 206 responses for DASH
manifests
- If a partial 206 response starting with `0-`, do a full async fetch
separately.
- If a partial 206 response not starting with 0-, just ignore (very
likely a duplicate picked up when handling the 0- response)
- Don't stream content-types that can be rewritten, since streaming
prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no
content-length and don't get properly rewritten.
- Overall, adds missing rewriting of DASH/HLS manifests that have no
content-length and are served as 206.
- Update to latest wabac.js which fixes rewriting of DASH manifest to
avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192
- Fixes#645
- instead, exclude sitemap-discovered page URLs from being counted to extra hops rules, eg. if a sitemap page is not in scope, don't include it.
-if extraHops is set with sitemaps, only consider extraHops for links for pages that are in scope.
- bump version to 1.2.4
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
- update to wabac.js 2.19.0 to use new html rewriting support in
wabac.js 2.19.0
- update to browsertrix-behaviors to 0.6.1 to fix instagram behavior
- bump to 1.2.0-beta.3
- parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config)
- add 'Authorization' header with base64 encoded basic auth via setExtraHTTPHeaders()
- tests: add test for crawling with auth using http-server using local docs build (now build docs as part of CI)
- docs: add HTTP Auth to YAML config section
---------
Co-authored-by: Ed Summers <ehs@pobox.com>
fixes#587
The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.
Proxy settings can now be set, in order of precedence via:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)
The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
In particular, an API call to /navigate starts, but doesn't wait for a
page load to finish, since user can choose to close the profile browser
at any time. This ensures that user operations don't cause the browser to crash if
page.goto() is interrupted/fails (browser closed, profile is saved, etc...) while a page is still loading.
bump to 1.1.1
The 0.6.0 release of Browsertrix Behaviors /
webrecorder/browsertrix-behaviors#70 introduces support for site-specific behaviors to implement an `awaitPageLoad()` function which allows for waiting for specific resources on the page load.
- This PR just adds a call to this function directly after page load.
- Factors out into an `awaitPageLoad()` method used in both crawler and replaycrawler to support the same wait in QA Mode
- This is to support custom loading wait time for Instagram (other sites in the future)
Now that RWP 2.0.0 with adblock support has been released
(webrecorder/replayweb.page#307), this enables adblock on the QA mode
RWP embed, to get more accurate screenshots.
Fetches the adblock.gz directly from RWP (though could also fetch it
separately from Easylist)
Updates to 1.1.0-beta.5