546 commits

Author SHA1 Message Date
Ilya Kreymer
a6ad6a0e42 version: bump to 1.7.0 2025-07-31 15:23:42 -07:00
Ilya Kreymer
5c7ff3dfef
deps: bump base to brave 1.80.125 (#875) 2025-07-31 14:51:18 -07:00
Ilya Kreymer
18fe5a9676
behavior logging: remove last line dupe check for behavior logs (#874)
Shouldn't skip multiple log messages, as this is unexpected behavior for
user-defined behaviors.
2025-07-30 16:20:14 -07:00
Tessa Walsh
aba065c8fb
Don't trim to limit if limit is default of 0 (#873)
Fixes #872 

Fix for restarting crawl from saved state, where the default `--limit`
value of 0 was incorrectly preventing any URLs from being re-queued.
2025-07-29 15:48:08 -07:00
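A minimal sketch of the behavior described in #873, assuming only what the message states: a `--limit` of 0 means "no limit", so nothing should be trimmed. The function name `trimToLimit` and its parameters are hypothetical and are not taken from the crawler's code.

```typescript
// Hypothetical helper (not the crawler's real API): decides which queued
// URLs to keep when a crawl is restarted from saved state.
function trimToLimit(
  queued: string[],
  alreadySeen: number,
  limit: number,
): string[] {
  // A limit of 0 is the default and means "no limit": keep the whole queue.
  if (limit <= 0) {
    return queued;
  }
  // Otherwise keep only as many URLs as still fit under the limit.
  const remaining = Math.max(limit - alreadySeen, 0);
  return queued.slice(0, remaining);
}
```

Without the early return, a default limit of 0 would give `remaining = 0` and drop every URL, which matches the symptom described in the message.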
Ilya Kreymer
0652a3fb1d
quickfix: WACZ upload retry support: (#871)
- if an upload fails, and crawler restarts on error,
exit with 'interrupt' to allow for automatic restart (eg. in Browsertrix
app)
- otherwise, a failed upload will exit the crawl with no WACZ, resulting
in overall crawl failure
2025-07-29 15:41:22 -07:00
sua yoo
bc4d649307
Capitalization fix for log messages (#870)
Capitalizes "URL" in log messages.
2025-07-24 23:52:12 -07:00
Tessa Walsh
66402c2e53
Add documentation for --failOnContentCheck and update CLI options in docs (#869)
Related to #860 

This will give us something we can link to from Browsertrix/the
Browsertrix User Guide for up-to-date information on this option.
2025-07-23 12:54:12 -07:00
Ilya Kreymer
1a4341bfbc
url queueing: log skipped URLs as errors if depth === 0 (#868)
- will ensure seeds from URL list are reported as errors if skipped
- also set logging context to 'scope' instead of 'links'
- fixes #866

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-07-23 10:05:40 -07:00
Ilya Kreymer
96fd22971f
deps update: (#867)
- bump brave to 1.80.122
- bump wabac.js to 2.23.8
- bump RWP to 2.3.15
- bump browsertrix-behaviors to 0.9.1
2025-07-22 21:06:12 -07:00
Tessa Walsh
acae5155f5
Fix docs mistaking --waitUntil with --pageLoadTimeout (#864)
Fixes https://github.com/webrecorder/browsertrix-crawler/issues/853

Corrects a documentation inaccuracy pointed out by a user
2025-07-21 12:52:58 -07:00
Ilya Kreymer
549d655173
Support option to fail crawl on content check (#861)
- add --failOnContentCheck for quick fail if content check in behavior
fails
- expose __bx_contentCheckFailed to cause an immediate failure from
behavior
- only allow failing crawl due to content check from within
awaitPageLoad() callback
- set a 'failReason' key to track that crawl failed due to a particular
content check reason
- deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
(2.23.6)
- fixes #860

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-07-08 13:08:52 -07:00
Ilya Kreymer
6244515818
async fetch: allow retrying async fetch if interrupted (#863)
- retry if 'truncated' set, or if size mismatch, or other exception
occurs
- retry only for network load and async fetch, not for response fetch
- set max retries to 2 (same as default for pages currently)
- fixes #831
2025-07-08 10:02:09 -07:00
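The retry conditions listed in #863 can be sketched as a small decision function. This is an illustration under stated assumptions, not the crawler's implementation: the `FetchOutcome` shape, its field names, and `shouldRetry` are all hypothetical; only the three retry conditions and the maximum of 2 retries come from the message.

```typescript
// Hypothetical description of one async fetch attempt (field names assumed).
interface FetchOutcome {
  truncated: boolean; // the response was flagged as truncated
  expectedSize: number | null; // declared size, or null if unknown
  actualSize: number; // number of bytes actually received
  error: boolean; // some other exception occurred
}

// Maximum number of retries, as stated in the commit message.
const MAX_RETRIES = 2;

// Returns true if another attempt should be made after this outcome.
function shouldRetry(outcome: FetchOutcome, retriesSoFar: number): boolean {
  if (retriesSoFar >= MAX_RETRIES) {
    return false;
  }
  const sizeMismatch =
    outcome.expectedSize !== null &&
    outcome.expectedSize !== outcome.actualSize;
  return outcome.truncated || sizeMismatch || outcome.error;
}
```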
Ilya Kreymer
c84f58f539
Use consistent profile directory name (merge 1.6.4 change) (#859)
- Use `TMPDIR/btrixProfile` as consistent profile directory name
- Avoid accumulation of temp profile dirs if crawler is restarted
multiple times, eg. if tmp dir is mapped to /crawls (as is in
Browsertrix now), this prevents a proliferation of
/crawls/tmp/profile-* dirs for each crawler restart
- change released in 1.6.4, merging into main
2025-07-03 19:49:05 -07:00
Tessa Walsh
2af94ffab5
Support downloading seed file from URL (#852)
Fixes #841 

Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step in order to be able to async fetch the seed file from a
URL.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-03 10:49:37 -04:00
Ilya Kreymer
687f08b1d0
Add option to save local/sessionStorage (#856)
If --saveStorage is set, localStorage and sessionStorage will be
serialized with the WARC record for the page.
If a page redirects, track what the current page URL is and save storage
as part of the page's WARC record.

Fixes #855
2025-06-30 19:58:19 -07:00
Ilya Kreymer
eb374fa835
base: bump to brave 1.80.113 (#857)
version: bump to 1.7.0-beta.0
tests: update deprecated command to work with latest minio
2025-06-30 19:55:38 -07:00
Ilya Kreymer
d2a6aa9805
version: bump to 1.6.3 (#851)
cli: regen cli docs to update from #850
2025-06-16 15:55:05 -04:00
Rijnder Wever
fa26f05f66
cleanup: remove dead pywb code from argparser and docs (#847)
The value of `--dedupPolicy` was once passed to pywb (see
https://pywb.readthedocs.io/en/latest/manual/configuring.html#dedup-options-for-recording).
Now that pywb has been dropped, there is no need to keep this option
around.

In fact, I know multiple users that have been confused by the mention of
this option in the docs (myself included).

(for historical context, see
https://github.com/webrecorder/browsertrix-crawler/pull/332)
2025-06-16 12:36:32 -04:00
Tessa Walsh
e09d10c582
Disable disk utilization check by default (#850)
Related to https://github.com/webrecorder/browsertrix-crawler/issues/848

Several users have had issues with disk utilization checks, including
the values reported by `df` inside the crawler container having
unexpected results for mounted volumes. The commonly recommended
solution to this is to use `docker system df`, but that is of course not
available within the Docker container itself.

This PR changes disk utilization checks to be an opt-in feature by
setting the default value to `0` (disabled).
2025-06-16 12:36:15 -04:00
Ilya Kreymer
da953b670b
content-type compare for rewriting: use case-insensitive check (#849)
update to wabac.js 2.23.3 for HLS rewriting fixes
part of capture fix for webrecorder/replayweb.page#433
2025-06-16 11:09:44 -04:00
Ilya Kreymer
a5936b56aa
deps: bump brave 1.79.118 (#845)
bump version to 1.6.2
2025-06-03 12:52:07 -07:00
Ilya Kreymer
178b10a37f
remove early serialization which may result in missing WARC-Protocol and security metadata (#844)
- drop early serialization in handleFetchResponse(), can result in
writing WARC record too early, before the WARC-Protocol and other data
is available. (Added previously for requests loaded via browser context /
service worker which did not get a 'loadingFinished' message, but now
these will still be closed in awaitPageResources())
- don't log 'skipping URL from unknown frame' warning since it is often
spurious, since frame can be added in subsequent message and response is
*not* skipped.
2025-05-29 08:33:30 -07:00
Ilya Kreymer
7bf10f7f18
optimization: normalize dedup status: treat 0 (response code not yet known) or 206 as 200… (#835)
Avoids fetching duplicate content when fetched through different code
path (eg. autoplay behavior calling fetch, vs video playing automatically)
2025-05-28 15:46:40 -07:00
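A minimal sketch of the status normalization described in #835. The function names and the key format are hypothetical; the only detail taken from the message is that a status of 0 (not yet known) or 206 is treated as 200 for dedup purposes.

```typescript
// Treat 0 (response code not yet known) and 206 (partial content) as 200,
// so the same resource is not fetched twice through different code paths.
function normalizeDedupStatus(status: number): number {
  return status === 0 || status === 206 ? 200 : status;
}

// Hypothetical dedup key combining normalized status and URL (format assumed).
function dedupKey(url: string, status: number): string {
  return `${normalizeDedupStatus(status)}|${url}`;
}
```

With this normalization, a video requested via `fetch` (status 200) and the same video streamed by the player (status 206) would map to the same key.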
Tessa Walsh
46a02d12a3
Remove hardcoded /tmp prefix from path (#843)
Fast-follow to #842 to fix a typo
2025-05-28 15:46:19 -07:00
Ilya Kreymer
52235ab21e
tmpdir: use os.tmpdir() instead of hardcoded '/tmp' (#842)
allows for customizing tmp directory with TMPDIR env var
2025-05-28 12:48:06 -07:00
Ilya Kreymer
e72b34318d
Add WARC-Protocol header (#715)
- add WARC-Protocol repeated header(s) for HTTP, TLS as per iipc/warc-specifications#42
- also set HTTP/1.0 on WARC record if actually http/1.0, otherwise keep HTTP/1.1

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-05-19 18:59:52 -07:00
Ilya Kreymer
71de8d6582
lang code fixes: (#834)
- validate --lang values, fail immediately with invalid ISO 639-1
language code
- ignore --lang value when using profile, print warning that profile
language takes precedence
- fixes #833
2025-05-12 16:06:29 -07:00
Ilya Kreymer
e39d5a31eb
support pause interrupt: (#825)
- add new interrupt reason / exit code
- add isCrawlPaused() which checks redis <id>:paused key
- exit gracefully, upload WACZ file when paused

fixes #824
2025-05-05 10:10:08 -07:00
Ilya Kreymer
f9bd534e4c
more dependency updates: (#827)
- update wabac.js to 2.22.16, RWP to 2.3.7
- fidelity: fixes capture of fb and insta (via wabac.js 2.22.16)
- policy: disable tg popups
- bump version to 1.6.1!
2025-05-05 10:08:59 -07:00
Ilya Kreymer
fc59d04231
Deps update 1.6.1 (#826) 2025-05-02 00:43:37 -07:00
Ilya Kreymer
d47812d139
Config Policy Update (#822)
Fixes webrecorder/replayweb.page#416

Update enterprise policy to:
- Disable Spellcheck, which should include downloading spellcheck
dictionary, possibly issue raised in #817
- Disable automatic http->https redirects, which insert an extra 307
response, as raised in: webrecorder/replayweb.page#416
2025-05-01 23:01:24 -07:00
Ilya Kreymer
13e9648398
state: add trimqueue() redis command to trim queue / seen list (#821)
useful to support dynamically lowering pageLimit when restarting a crawl
fixes issue raised in webrecorder/browsertrix#2514
2025-04-29 18:18:04 -07:00
Ilya Kreymer
1cb1b2edb9
Update Behaviors Docs (#820)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-04-10 03:58:07 -04:00
Ilya Kreymer
f2dac05577
regression fix: start redis if needed before attempting to init state! (#819)
bump to 1.6.0-beta.1
2025-04-09 21:37:46 +02:00
Ilya Kreymer
c796996664
Support for behaviors from 'recorder flow' JSON created in devtools (#818)
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
2025-04-09 12:24:29 +02:00
Tessa Walsh
2961d3b9f2
Write behaviors downloaded from URL to tempdir (#816)
Follow-up to #368 

This makes download locations consistent between custom behaviors
downloaded from URLs and those downloaded from Git repos, and resolves a
container security issue in Browsertrix.
2025-04-04 11:23:29 -04:00
Ilya Kreymer
28241c824e ci: fixes to deploy ci workflow 2025-04-03 23:36:49 -07:00
Ilya Kreymer
7421404aee
ci: add workflow to deploy to dev channels (requires actions secrets config) (#815)
- uses DEPLOY_REGISTRY, DEPLOY_REGISTRY_PATH, DEPLOY_REGISTRY_API_TOKEN
secrets
2025-04-03 23:21:48 -07:00
Ilya Kreymer
66c71d03c8
deps: bump base browser image to 1.77.95 (#814) 2025-04-03 17:25:29 -07:00
Ilya Kreymer
ba4c432ce8
browser crash handling, follow-up to #808: (#813)
- if not restartOnError, attempt to kill browser and try again, 3 more
times
- if still unable to open window, mark browser as crashed and exit
2025-04-03 16:10:54 -07:00
Tessa Walsh
f83d0e8f02
Add option to push behavior + behavior script logs to Redis (#805)
Fixes #804 

- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-03 15:46:10 -07:00
aponb
6898bcf7ae
useSHA1 Parameter for generating SHA1 record hashes (#532) (#812)
By using the useSHA1 flag, the payload digest in records will use SHA-1
with Base32 encoding instead of the default SHA-256

Co-authored-by: Andreas Predikaka <andreas.predikaka@onb.ac.at>
2025-04-02 17:10:50 -07:00
Ilya Kreymer
bf6fbe8776
Remove extra console.log statements (#811)
- remove one added in screencaster
- also remove others that are outside logging system
- bump to 1.5.10
2025-04-02 09:25:11 -07:00
Ilya Kreymer
91f8fadc5f
deps update: update webrecorder dependencies (#810)
- browsertrix-behaviors 0.8.1 for improved logging / new behavior
functions
- wabac.js 2.22.9
- RWP 2.3.4 for QA
- update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js
2025-04-01 22:11:56 -07:00
Ilya Kreymer
fd41b32100
saved state tweaks: (#809)
- if saved state filename is somehow duplicated, don't re-add to array to
avoid deletion (fixes edge case in #791)
- also avoid double interpolation of filename
2025-04-01 18:59:04 -07:00
Emma Segal-Grossman
41b968baac
Dynamically adjust reported aspect ratio based on GEOMETRY (#794)
Closes #793 
Related to #733

Adjusts the reported aspect ratio based on GEOMETRY env var.
Also adjusts stylesheet in screencast HTML to match.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-01 18:26:12 -07:00
Tessa Walsh
2b00c1f065
Tweaks for custom behavior loading (#807)
Follow-up to #712 

Fixes a few things I noticed while testing out
https://github.com/webrecorder/browsertrix/pull/2520

- Ignore `.git` directory of git repositories when recursively walking
cloned git repo to collect custom behaviors
- Increase MAX_DEPTH for collecting behaviors to 5 (previous limit of 2
was overly restrictive for Git repositories)
- Log name of custom behavior scripts (filename or URLs) added as info messages in
`behavior` context

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-01 18:15:57 -07:00
Ilya Kreymer
2b56455e8b
stuck page handling: when attempting to restart browser, add more retries (#808)
fixes issue mentioned in:
https://github.com/webrecorder/browsertrix-crawler/issues/791#issuecomment-2734342186
2025-04-01 16:56:01 -07:00
Ilya Kreymer
e585b6d194
Better default crawlId (#806)
- set crawl id from collection, not other way around, to ensure unique
redis keyspace for different collections
- by default, set crawl id to unique value based on host and collection,
eg. '@hostname-@id'
- don't include '@id' in collection interpolation, can only use
hostname or timestamp
- fixes issue mentioned / workaround provided in #784
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
2025-04-01 13:40:03 -07:00
Tessa Walsh
5fedde6eee
Fail crawl with fatal message if custom behavior isn't loaded (#799)
Fixes #797 

The crawler will now exit with a fatal log message and exit code 17 if:

- A Git repository specified with `--customBehaviors` cannot be cloned
successfully (new)
- A custom behavior file at a URL specified with `--customBehaviors` is
not fetched successfully (new)
- No custom behaviors are collected at a local filepath specified with
`--customBehaviors`, or if an error is thrown while attempting to collect
files from a nonexistent path (new)
- Any custom behaviors collected fail `Browser.checkScript` validation
(existing behavior)

Tests have also been added accordingly.
2025-03-31 17:35:30 -07:00