fix some observed errors that occur when saving profile:
- use browser.cookies instead of page.cookies to get all cookies, not
just from page
- catch exception when clearing cache and ignore
- logging: log when proxy init is happening on all paths, in case of an
error in the proxy connection
Some page elements don't respond correctly if the element is not in
view, so add setEnsureElementIsInTheViewport() to the click,
doubleclick, hover and change step locators.
- check for urls that are wrapped in quotes, eg. 'https://example.com/'
or "https://example.com/" and trim and remove the quotes before adding seed
- tests: add quoted URL to tests, fix old.webrecorder.net test
- deps: update wabac.js, RWP to latest
- logging: reduce error logging for seed lists, only log once that there are duplicates or page limit is reached
- fix for #882
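
A minimal sketch of the quote handling described above, assuming a quoted seed is only unwrapped when both ends carry the same quote character. The helper name `unquoteSeedUrl` is hypothetical and is not taken from the crawler source:

```typescript
// Hypothetical helper, for illustration only: trims a seed line and
// removes one pair of matching single or double quotes around it.
function unquoteSeedUrl(raw: string): string {
  const trimmed = raw.trim();
  if (trimmed.length >= 2) {
    const first = trimmed[0];
    const last = trimmed[trimmed.length - 1];
    // Assumption: only strip when the opening and closing quote match.
    if ((first === '"' || first === "'") && first === last) {
      return trimmed.slice(1, -1).trim();
    }
  }
  return trimmed;
}

console.log(unquoteSeedUrl(`'https://example.com/'`));
console.log(unquoteSeedUrl(` "https://example.com/" `));
```

Under these assumptions, unquoted URLs and URLs with a single stray quote pass through unchanged apart from trimming.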
- separate out reading stream response while browser is waiting (not
really async) from actual async loading, this is now handled via
fetchResponseBody()
- unify async fetch into first trying browser networking for regular
GET, fallback to regular fetch()
- load headers and body separately in async fetch, allowing for
cancelling request after headers
- refactor direct fetch of non-html pages: load headers and handle
loading body, adding page async, allowing worker to continue loading
browser-based pages (should allow more parallelization in the future)
- unify WARC writing in preparation for dedup: unified serializeWARC()
called for all paths, WARC digest computed, additional checks for
payload added for streaming loading
- Adds support for YAML-based config for multiple proxies, containing
'matchHosts' section by regex and 'proxies' declaration, allowing
matching any number of hosts to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching and running browser with the specified proxy PAC script
served by an internal http server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config
Fixes #836
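
A rough sketch of how a PAC script could be generated from a host-regex-to-proxy mapping, as described above. The `ProxyConfig` shape, the function names, and the `DIRECT` fallback are assumptions made for illustration; the actual config schema and generated script may differ:

```typescript
// Hypothetical config shape, loosely modeled on the 'matchHosts' and
// 'proxies' sections mentioned above (not the real schema).
interface ProxyConfig {
  matchHosts: Record<string, string>; // host regex -> proxy name
  proxies: Record<string, string>; // proxy name -> proxy URL
}

// Convert a proxy URL into a PAC return directive.
function toPacDirective(proxyUrl: string): string {
  const url = new URL(proxyUrl);
  let scheme = "PROXY";
  if (url.protocol.startsWith("socks")) scheme = "SOCKS5";
  else if (url.protocol === "https:") scheme = "HTTPS";
  return `${scheme} ${url.host}`;
}

// Build a PAC script that tests each regex against the request host.
function generatePac(config: ProxyConfig): string {
  const rules = Object.entries(config.matchHosts).map(([pattern, name]) => {
    const proxyUrl = config.proxies[name];
    if (!proxyUrl) throw new Error(`Unknown proxy name: ${name}`);
    const re = JSON.stringify(pattern);
    const directive = JSON.stringify(toPacDirective(proxyUrl));
    return `  if (new RegExp(${re}).test(host)) return ${directive};`;
  });
  return [
    "function FindProxyForURL(url, host) {",
    ...rules,
    // Assumption: hosts that match no rule connect directly.
    '  return "DIRECT";',
    "}",
  ].join("\n");
}

console.log(
  generatePac({
    matchHosts: { "example\\.com$": "us" },
    proxies: { us: "http://proxy.example:8080" },
  }),
);
```

Rules are evaluated in declaration order in this sketch, so the first matching regex wins.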
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- if an upload fails and the crawler is set to restart on error, exit
with 'interrupt' to allow for automatic restart (eg. in Browsertrix
app)
- otherwise, a failed upload will exit the crawl with no WACZ, resulting
in overall crawl failure
- will ensure seeds from URL list are reported as errors if skipped
- also set logging context to 'scope' instead of 'links'
- fixes #866
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- add --failOnContentCheck for quick fail if content check in behavior
fails
- expose __bx_contentCheckFailed to cause an immediate failure from
behavior
- only allow failing crawl due to content check from within
awaitPageLoad() callback
- set a 'failReason' key to track that crawl failed due to a particular
content check reason
- deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
(2.23.6)
- fixes #860
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- retry if 'truncated' set, or if size mismatch, or other exception
occurs
- retry only for network load and async fetch, not for response fetch
- set max retries to 2 (same as default for pages currently)
- fixes #831
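
The retry conditions above could be sketched roughly as follows. The `LoadOutcome` type and function names are hypothetical, and the sketch assumes "max retries 2" means up to two additional attempts after the first:

```typescript
const MAX_RETRIES = 2;

// Hypothetical result shape for a network load or async fetch.
interface LoadOutcome {
  truncated: boolean;
  expectedSize: number;
  actualSize: number;
}

// Retry when the response was truncated or the sizes disagree.
function needsRetry(outcome: LoadOutcome): boolean {
  return outcome.truncated || outcome.actualSize !== outcome.expectedSize;
}

// Run `load` up to maxRetries + 1 times, also retrying on exceptions.
async function loadWithRetry(
  load: () => Promise<LoadOutcome>,
  maxRetries: number = MAX_RETRIES,
): Promise<LoadOutcome> {
  let lastError: unknown = new Error("load failed");
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const outcome = await load();
      if (!needsRetry(outcome)) return outcome;
      lastError = new Error("truncated or size mismatch");
    } catch (e) {
      lastError = e;
    }
  }
  // All attempts failed: surface the most recent failure.
  throw lastError;
}
```

In this sketch the caller decides which load paths are wrapped, which matches the note that response fetches are not retried.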
- Use `TMPDIR/btrixProfile` as consistent profile directory name
- Avoid accumulation of temp profile dirs if crawler is restarted
multiple times, eg. if tmp dir is mapped to /crawls (as is in
Browsertrix now), this prevents a proliferation of
/crawls/tmp/profile-* dirs for each crawler restart
- change released in 1.6.4, merging into main
Fixes #841
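
As an illustration of the fixed directory name, a minimal sketch assuming the temp directory is resolved through Node's `os.tmpdir()` (which honors `TMPDIR` on POSIX systems). The function name `profileDir` is hypothetical:

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: always returns the same profile directory under
// the temp dir, so restarts reuse it instead of creating profile-* dirs.
function profileDir(tmpDir: string = os.tmpdir()): string {
  return path.join(tmpDir, "btrixProfile");
}

console.log(profileDir());
```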
Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step in order to be able to async fetch the seed file from a
URL.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
If --saveStorage is set, localStorage and sessionStorage will be
serialized with the WARC record for the page.
If a page redirects, track what the current page URL is and save storage
as part of the page's WARC record.
Fixes #855
Related to https://github.com/webrecorder/browsertrix-crawler/issues/848
Several users have had issues with disk utilization checks, including
the values reported by `df` inside the crawler container having
unexpected results for mounted volumes. The commonly recommended
solution to this is to use `docker system df`, but that is of course not
available within the Docker container itself.
This PR changes disk utilization checks to be an opt-in feature by
setting the default value to `0` (disabled).
- drop early serialization in handleFetchResponse(), can result in
writing WARC record too early, before the WARC-Protocol and other data
is available. (Added previously for requests loaded via browser context /
service worker which did not get a 'loadingFinished' message, but now
these will still be closed in awaitPageResources())
- don't log 'skipping URL from unknown frame' warning, as it is often
spurious: the frame can be added in a subsequent message and the response
is *not* skipped.
- add WARC-Protocol repeated header(s) for HTTP, TLS as per iipc/warc-specifications#42
- also set HTTP/1.0 on WARC record if actually http/1.0, otherwise keep HTTP/1.1
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- validate --lang values, fail immediately with invalid ISO 639-1
language code
- ignore --lang value when using profile, print warning that profile
language takes precedence
- fixes #833
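
A small sketch of the --lang handling described above. The function name, the order of the two checks, and the abbreviated code list are assumptions; a real check would use the full ISO 639-1 list:

```typescript
// Illustrative subset only, not the full ISO 639-1 list.
const KNOWN_LANGS = new Set(["de", "en", "es", "fr", "ja", "pt", "zh"]);

// Hypothetical helper: returns the language to apply, or undefined when
// a profile is in use and its language takes precedence.
function resolveLang(lang: string, usingProfile: boolean): string | undefined {
  // Assumption: validation happens first, so bad values always fail fast.
  if (!KNOWN_LANGS.has(lang)) {
    throw new Error(`Invalid ISO 639-1 language code: ${lang}`);
  }
  if (usingProfile) {
    console.warn("Ignoring --lang: profile language takes precedence");
    return undefined;
  }
  return lang;
}

console.log(resolveLang("en", false));
```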
- update wabac.js to 2.22.16, RWP to 2.3.7
- fidelity: fixes capture of fb and insta (via wabac.js 2.22.16)
- policy: disable tg popups
- bump version to 1.6.1!
Fixes webrecorder/replayweb.page#416
Update enterprise policy to:
- Disable Spellcheck, which should also prevent downloading of the
spellcheck dictionary, possibly the issue raised in #817
- Disable automatic http->https redirects, which insert an extra 307
response, as raised in: webrecorder/replayweb.page#416
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
Follow-up to #368
This makes download locations consistent between custom behaviors
downloaded from URLs and those downloaded from Git repos, and resolves a
container security issue in Browsertrix.