- add --failOnContentCheck to fail the crawl quickly if a content check in a
behavior fails
- expose __bx_contentCheckFailed to cause an immediate failure from a
behavior
- only allow failing crawl due to content check from within
awaitPageLoad() callback
- set a 'failReason' key to track that crawl failed due to a particular
content check reason
- deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
(2.23.6)
- fixes #860
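
A minimal sketch of the gating described above, assuming a hypothetical `PageState` class: a content-check failure is only recorded while the `awaitPageLoad()` callback is running and only when the flag is set. Everything except the `failReason` key and the exposed function's role is illustrative, not the crawler's actual code.

```typescript
// Hypothetical sketch: PageState and its methods are illustrative names.
class PageState {
  failReason: string | null = null;
  private failOnContentCheck: boolean;
  private inAwaitPageLoad = false;

  constructor(failOnContentCheck: boolean) {
    this.failOnContentCheck = failOnContentCheck;
  }

  // Runs the behavior's awaitPageLoad() callback. Content-check failures
  // are only honored while this callback is executing.
  runAwaitPageLoad(callback: (fail: (reason: string) => void) => void): void {
    this.inAwaitPageLoad = true;
    try {
      callback((reason) => this.contentCheckFailed(reason));
    } finally {
      this.inAwaitPageLoad = false;
    }
  }

  // Stand-in for the function exposed to behaviors as __bx_contentCheckFailed.
  contentCheckFailed(reason: string): void {
    if (!this.failOnContentCheck || !this.inAwaitPageLoad) {
      return; // ignored: flag not set, or called outside awaitPageLoad()
    }
    this.failReason = reason;
  }
}
```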
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- retry if 'truncated' set, or if size mismatch, or other exception
occurs
- retry only for network load and async fetch, not for response fetch
- set max retries to 2 (same as default for pages currently)
- fixes #831
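
The retry rules above can be sketched as a single decision function. The type names, field names, and the `shouldRetry` helper are assumptions made for illustration; only the conditions and the limit of 2 come from the notes above.

```typescript
// Hypothetical sketch: all names here are illustrative.
type LoadKind = "network" | "asyncFetch" | "responseFetch";

interface LoadOutcome {
  truncated: boolean;
  expectedSize: number;
  actualSize: number;
  error?: Error;
}

const MAX_RETRIES = 2;

function shouldRetry(kind: LoadKind, outcome: LoadOutcome, retriesSoFar: number): boolean {
  // Response fetches are never retried.
  if (kind === "responseFetch") {
    return false;
  }
  // Stop once the retry budget is used up.
  if (retriesSoFar >= MAX_RETRIES) {
    return false;
  }
  // Retry on truncation, size mismatch, or any other exception.
  return (
    outcome.truncated ||
    outcome.expectedSize !== outcome.actualSize ||
    outcome.error !== undefined
  );
}
```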
- Use `TMPDIR/btrixProfile` as consistent profile directory name
- Avoid accumulation of temp profile dirs if the crawler is restarted
multiple times, e.g. if the tmp dir is mapped to /crawls (as it is in
Browsertrix now), this prevents a proliferation of
/crawls/tmp/profile-* dirs for each crawler restart
- change released in 1.6.4, merging into main
Fixes #841
Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step in order to be able to async fetch the seed file from a
URL.
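
A rough sketch of why the move matters: fetching a seed file from a URL is asynchronous, so it fits in an async bootstrap step rather than in synchronous argument validation. The function names, the injected `fetchText`/`readFile` helpers, and the one-URL-per-line format are assumptions for illustration.

```typescript
// Hypothetical sketch: assumes a seed file with one URL per line.
function parseSeedList(text: string): string[] {
  const seeds: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed.length === 0) {
      continue; // skip blank lines
    }
    seeds.push(trimmed);
  }
  return seeds;
}

// Runs during bootstrap, where awaiting a remote fetch is possible.
// fetchText and readFile are injected stand-ins for real I/O.
async function loadSeeds(
  source: string,
  fetchText: (url: string) => Promise<string>,
  readFile: (path: string) => Promise<string>,
): Promise<string[]> {
  const isUrl = /^https?:\/\//i.test(source);
  const text = isUrl ? await fetchText(source) : await readFile(source);
  return parseSeedList(text);
}
```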
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
If --saveStorage is set, localStorage and sessionStorage will be
serialized with the WARC record for the page.
If a page redirects, track what the current page URL is and save storage
as part of the page's WARC record.
Fixes #855
Related to https://github.com/webrecorder/browsertrix-crawler/issues/848
Several users have had issues with disk utilization checks, including
`df` reporting unexpected values for mounted volumes when run inside the
crawler container. The commonly recommended solution to this is to use
`docker system df`, but that is of course not available within the
Docker container itself.
This PR changes disk utilization checks to be an opt-in feature by
setting the default value to `0` (disabled).
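
As a sketch of the opt-in semantics, a threshold of `0` turns the check off entirely, and any positive value enables it. The function and parameter names are hypothetical.

```typescript
// Hypothetical sketch: a threshold of 0 (the new default) disables the check.
function diskThresholdExceeded(usedPercent: number, thresholdPercent: number = 0): boolean {
  if (thresholdPercent <= 0) {
    return false; // check disabled
  }
  return usedPercent >= thresholdPercent;
}
```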
- drop early serialization in handleFetchResponse(), can result in
writing WARC record too early, before the WARC-Protocol and other data
is available. (Added previously for requests loaded via browser context /
service worker which did not get a 'loadingFinished' message, but now
these will still be closed in awaitPageResources())
- don't log 'skipping URL from unknown frame' warning, as it is often
spurious: the frame can be added in a subsequent message and the response is
*not* skipped.
- add WARC-Protocol repeated header(s) for HTTP, TLS as per iipc/warc-specifications#42
- also set HTTP/1.0 on WARC record if actually http/1.0, otherwise keep HTTP/1.1
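
A small sketch of the two header rules above. The input formats (`"h2"`, `"http/1.1"`, `"TLS 1.3"`) and the output value format (`"tls/1.3"`) are assumptions about how the browser reports protocols and how the proposal spells values; the helper names are hypothetical.

```typescript
// Hypothetical sketch: value formats are assumed, not taken from the crawler.
function warcProtocolHeaders(httpProtocol: string, tlsProtocol?: string): [string, string][] {
  const headers: [string, string][] = [];
  // One WARC-Protocol header for the HTTP version...
  headers.push(["WARC-Protocol", httpProtocol.toLowerCase()]);
  // ...and a repeated WARC-Protocol header for TLS, when present.
  if (tlsProtocol) {
    headers.push(["WARC-Protocol", tlsProtocol.toLowerCase().replace(" ", "/")]);
  }
  return headers;
}

// The HTTP version written on the record: HTTP/1.0 only if the response
// really was http/1.0, otherwise HTTP/1.1.
function statusLineVersion(httpProtocol: string): string {
  return httpProtocol.toLowerCase() === "http/1.0" ? "HTTP/1.0" : "HTTP/1.1";
}
```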
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- validate --lang values, fail immediately with invalid ISO 639-1
language code
- ignore --lang value when using profile, print warning that profile
language takes precedence
- fixes #833
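
A minimal sketch of the two rules, assuming a tiny sample of ISO 639-1 codes in place of the full list. The helper names are hypothetical, and validating before applying the profile rule is an assumption about ordering.

```typescript
// Hypothetical sketch: a small illustrative subset, not the full ISO 639-1 list.
const SAMPLE_ISO_639_1 = new Set(["de", "en", "es", "fr", "ja", "zh"]);

function validateLang(lang: string): string {
  const code = lang.trim().toLowerCase();
  if (!SAMPLE_ISO_639_1.has(code)) {
    throw new Error(`Invalid ISO 639-1 language code: ${lang}`);
  }
  return code;
}

// Returns the language to apply, or undefined if none should be applied.
function effectiveLang(
  lang: string | undefined,
  usingProfile: boolean,
  warn: (message: string) => void,
): string | undefined {
  if (lang === undefined) {
    return undefined;
  }
  const code = validateLang(lang); // assumed to run even when a profile is used
  if (usingProfile) {
    warn("Ignoring --lang: profile language takes precedence");
    return undefined;
  }
  return code;
}
```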
- update wabac.js to 2.22.16, RWP to 2.3.7
- fidelity: fixes capture of fb and insta (via wabac.js 2.22.16)
- policy: disable tg popups
- bump version to 1.6.1!
Fixes webrecorder/replayweb.page#416
Update enterprise policy to:
- Disable spellcheck, which should also prevent downloading the spellcheck
dictionary, possibly the issue raised in #817
- Disable automatic http->https redirects, which insert an extra 307
response, as raised in: webrecorder/replayweb.page#416
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
Follow-up to #368
This makes download locations consistent between custom behaviors
downloaded from URLs and those downloaded from Git repos, and resolves a
container security issue in Browsertrix.
Fixes #804
- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
With the useSHA1 flag set, the payload digest in records uses SHA-1
with Base32 encoding instead of the default SHA-256.
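
For illustration, the Base32 step can be sketched as below using the RFC 4648 alphabet. The hash itself is assumed to be computed elsewhere; the `sha1:` prefix in `formatSha1Digest` is the conventional WARC digest label and is an assumption about the exact output here.

```typescript
// Hypothetical sketch: RFC 4648 Base32 encoding of already-computed hash bytes.
const BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

function base32Encode(bytes: Uint8Array): string {
  let out = "";
  let buffer = 0; // bit accumulator
  let bits = 0; // number of unconsumed bits in the accumulator
  bytes.forEach((byte) => {
    buffer = (buffer << 8) | byte;
    bits += 8;
    while (bits >= 5) {
      out += BASE32_ALPHABET.charAt((buffer >>> (bits - 5)) & 31);
      bits -= 5;
    }
    buffer &= (1 << bits) - 1; // keep only the unconsumed bits
  });
  if (bits > 0) {
    out += BASE32_ALPHABET.charAt((buffer << (5 - bits)) & 31);
  }
  while (out.length % 8 !== 0) {
    out += "="; // pad to a multiple of 8 characters
  }
  return out;
}

// A 20-byte SHA-1 hash encodes to exactly 32 characters with no padding.
function formatSha1Digest(hashBytes: Uint8Array): string {
  return "sha1:" + base32Encode(hashBytes);
}
```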
Co-authored-by: Andreas Predikaka <andreas.predikaka@onb.ac.at>
- browsertrix-behaviors 0.8.1 for improved logging / new behavior
functions
- wabac.js 2.22.9
- RWP 2.3.4 for QA
- update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js
- if saved state filename is somehow duplicated, don't re-add it to the array to
avoid deletion (fixes edge case in #791)
- also avoid double interpolation of filename
Closes #793
Related to #733
Adjusts the reported aspect ratio based on GEOMETRY env var.
Also adjusts stylesheet in screencast HTML to match.
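
A short sketch of deriving the aspect ratio, assuming GEOMETRY has the form `WIDTHxHEIGHT` with an optional `xDEPTH` suffix (for example `1360x1020x16`). The format and the `parseGeometry` helper are assumptions for illustration.

```typescript
// Hypothetical sketch: assumes GEOMETRY looks like "1360x1020x16".
interface Geometry {
  width: number;
  height: number;
  aspectRatio: number;
}

function parseGeometry(geometry: string): Geometry {
  const match = /^(\d+)x(\d+)(?:x\d+)?$/.exec(geometry.trim());
  if (!match) {
    throw new Error(`Unrecognized GEOMETRY value: ${geometry}`);
  }
  const width = Number(match[1]);
  const height = Number(match[2]);
  if (width === 0 || height === 0) {
    throw new Error(`GEOMETRY dimensions must be non-zero: ${geometry}`);
  }
  return { width, height, aspectRatio: width / height };
}
```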
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Follow-up to #712
Fixes a few things I noticed while testing out
https://github.com/webrecorder/browsertrix/pull/2520
- Ignore `.git` directory of git repositories when recursively walking
cloned git repo to collect custom behaviors
- Increase MAX_DEPTH for collecting behaviors to 5 (previous limit of 2
was overly restrictive for Git repositories)
- Log name of custom behavior scripts (filename or URLs) added as info messages in
`behavior` context
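
The walk described above can be sketched over an in-memory tree. The `DirNode` type, the `.js`/`.json` extension filter, and the exact way depth is counted are assumptions; only the `.git` exclusion and the limit of 5 come from the notes above.

```typescript
// Hypothetical sketch: an in-memory tree stands in for the cloned repo.
interface DirNode {
  name: string;
  children?: DirNode[]; // absent for files
}

const MAX_DEPTH = 5;

function collectBehaviorFiles(node: DirNode, parentPath = "", depth = 0): string[] {
  const fullPath = parentPath ? `${parentPath}/${node.name}` : node.name;
  if (!node.children) {
    // Assumed filter: only .js and .json files count as behaviors.
    return /\.(js|json)$/.test(node.name) ? [fullPath] : [];
  }
  // Skip the .git directory and stop descending past MAX_DEPTH.
  if (node.name === ".git" || depth >= MAX_DEPTH) {
    return [];
  }
  const results: string[] = [];
  for (const child of node.children) {
    results.push(...collectBehaviorFiles(child, fullPath, depth + 1));
  }
  return results;
}
```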
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- set crawl id from collection, not other way around, to ensure unique
redis keyspace for different collections
- by default, set crawl id to unique value based on host and collection,
e.g. '@hostname-@id'
- don't include '@id' in collection interpolation, can only use
hostname or timestamp
- fixes issue mentioned / workaround provided in #784
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
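
A sketch of the resolution order described in the first three items above: the collection name is interpolated first, without `@id`, and the crawl id is then derived from it. The `@ts` placeholder for the timestamp, the example templates, and the helper names are assumptions for illustration.

```typescript
// Hypothetical sketch: "@ts" as the timestamp placeholder is an assumption.
interface InterpolationVars {
  hostname: string;
  timestamp: string;
  id?: string; // only supplied when interpolating the crawl id
}

function interpolate(template: string, vars: InterpolationVars): string {
  let result = template
    .replace(/@hostname/g, () => vars.hostname)
    .replace(/@ts/g, () => vars.timestamp);
  const id = vars.id;
  if (id !== undefined) {
    result = result.replace(/@id/g, () => id);
  }
  return result;
}

function resolveIds(
  collectionTemplate: string,
  hostname: string,
  timestamp: string,
  crawlIdTemplate: string = "@hostname-@id",
): { collection: string; crawlId: string } {
  // The collection may only use hostname or timestamp, never '@id'.
  const collection = interpolate(collectionTemplate, { hostname, timestamp });
  // The crawl id is then set from the collection, not the other way around.
  const crawlId = interpolate(crawlIdTemplate, { hostname, timestamp, id: collection });
  return { collection, crawlId };
}
```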
Fixes #797
The crawler will now exit with a fatal log message and exit code 17 if:
- A Git repository specified with `--customBehaviors` cannot be cloned
successfully (new)
- A custom behavior file at a URL specified with `--customBehaviors` is
not fetched successfully (new)
- No custom behaviors are collected at a local filepath specified with
`--customBehaviors`, or if an error is thrown while attempting to collect
files from a nonexistent path (new)
- Any custom behaviors collected fail `Browser.checkScript` validation
(existing behavior)
Tests have also been added accordingly.
- extractLinks() now handled via browsertrix-behaviors
- fixes #770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions into an enum list
Fixes #798
Also modifies the existing test for link selector validation to check for
exit status code 17 when link selectors fail validation.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- if <uid>:nextWacz filename already exists, actually get it and use
that!
- don't merge cdx if not generating wacz yet, use same condition for
both
- bump version to 1.5.8
- follow-up fix to #748, fixes #747
- undo accidentally setting window timeout to 20000 seconds instead of
20 for debugging!
- follow up to #781
- bump to 1.5.6.1
- should hopefully fix crawls stuck in this way..