Commit graph

525 commits

Author SHA1 Message Date
Ilya Kreymer
178b10a37f
remove early serialization which may result in missing WARC-Protocol and security metadata (#844)
- drop early serialization in handleFetchResponse(), can result in
writing WARC record too early, before the WARC-Protocol and other data
is available. (Added previously for requests loaded via browser context /
service worker which did not get a 'loadingFinished' message, but now
these will still be closed in awaitPageResources())
- don't log 'skipping URL from unknown frame' warning: it is often
spurious, because the frame can be added in a subsequent message and the
response is *not* skipped.
2025-05-29 08:33:30 -07:00
Ilya Kreymer
7bf10f7f18
optimization: normalize dedup status: treat 0 (response code not yet known) or 206 as 200… (#835)
Avoids fetching duplicate content when fetched through different code
path (eg. autoplay behavior calling fetch, vs video playing automatically)
2025-05-28 15:46:40 -07:00
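The status normalization described in 7bf10f7f18 can be sketched roughly as follows. This is an illustration only, not the crawler's actual code: `normalizeDedupStatus`, `dedupKey` and `shouldFetch` are hypothetical names, and building the key from method, URL and status is an assumption about how duplicates are detected.

```typescript
// Hypothetical helper: collapse statuses that should count as the same
// response for deduplication purposes.
function normalizeDedupStatus(status: number): number {
  // 0 means the response code is not yet known; 206 is a partial
  // (range) response. Both are treated as 200, per the commit message.
  return status === 0 || status === 206 ? 200 : status;
}

// Hypothetical dedup key; the real key format is an assumption.
function dedupKey(method: string, url: string, status: number): string {
  return method + " " + url + " " + normalizeDedupStatus(status);
}

const seenKeys = new Set<string>();

// Returns true the first time a key is seen, false for duplicates.
function shouldFetch(method: string, url: string, status: number): boolean {
  const key = dedupKey(method, url, status);
  if (seenKeys.has(key)) {
    return false;
  }
  seenKeys.add(key);
  return true;
}
```

With this shape, a video first seen as a 206 range response and later requested by a behavior with a not-yet-known status maps to the same key, so the second fetch is skipped.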
Tessa Walsh
46a02d12a3
Remove hardcoded /tmp prefix from path (#843)
Fast-follow to #842 to fix a typo
2025-05-28 15:46:19 -07:00
Ilya Kreymer
52235ab21e
tmpdir: use os.tmpdir() instead of hardcoded '/tmp' (#842)
allows for customizing tmp directory with TMPDIR env var
2025-05-28 12:48:06 -07:00
Ilya Kreymer
e72b34318d
Add WARC-Protocol header (#715)
- add WARC-Protocol repeated header(s) for HTTP, TLS as per iipc/warc-specifications#42
- also set HTTP/1.0 on WARC record if actually http/1.0, otherwise keep HTTP/1.1

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-05-19 18:59:52 -07:00
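The HTTP version rule from e72b34318d can be sketched as a small function. This is a sketch under assumptions: `httpVersionForRecord` is a hypothetical name, and the input is assumed to be a protocol string such as "http/1.0", "http/1.1" or "h2" as reported by the browser.

```typescript
// Hypothetical helper: choose the HTTP version written in the WARC
// record's HTTP status line from the protocol the browser reported.
function httpVersionForRecord(protocol: string | undefined): string {
  // Only a real HTTP/1.0 exchange is recorded as HTTP/1.0; everything
  // else (http/1.1, h2, h3, unknown) keeps HTTP/1.1.
  if (protocol !== undefined && protocol.toLowerCase() === "http/1.0") {
    return "HTTP/1.0";
  }
  return "HTTP/1.1";
}
```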
Ilya Kreymer
71de8d6582
lang code fixes: (#834)
- validate --lang values, fail immediately on an invalid ISO 639-1
language code
- ignore --lang value when using profile, print warning that profile
language takes precedence
- fixes #833
2025-05-12 16:06:29 -07:00
Ilya Kreymer
e39d5a31eb
support pause interrupt: (#825)
- add new interrupt reason / exit code
- add isCrawlPaused() which checks redis <id>:paused key
- exit gracefully, upload WACZ file when paused

fixes #824
2025-05-05 10:10:08 -07:00
Ilya Kreymer
f9bd534e4c
more dependency updates: (#827)
- update wabac.js to 2.22.16, RWP to 2.3.7
- fidelity: fixes capture of fb and insta (via wabac.js 2.22.16)
- policy: disable tg popups
- bump version to 1.6.1!
2025-05-05 10:08:59 -07:00
Ilya Kreymer
fc59d04231
Deps update 1.6.1 (#826) 2025-05-02 00:43:37 -07:00
Ilya Kreymer
d47812d139
Config Policy Update (#822)
Fixes webrecorder/replayweb.page#416

Update enterprise policy to:
- Disable Spellcheck, which should include downloading spellcheck
dictionary, possibly issue raised in #817
- Disable automatic http->https redirects, which insert an extra 307
response, as raised in: webrecorder/replayweb.page#416
2025-05-01 23:01:24 -07:00
Ilya Kreymer
13e9648398
state: add trimqueue() redis command to trim queue / seen list (#821)
useful to support dynamically lowering pageLimit when restarting a crawl
fixes issue raised in webrecorder/browsertrix#2514
2025-04-29 18:18:04 -07:00
Ilya Kreymer
1cb1b2edb9
Update Behaviors Docs (#820)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-04-10 03:58:07 -04:00
Ilya Kreymer
f2dac05577
regression fix: start redis if needed before attempting to init state! (#819)
bump to 1.6.0-beta.1
2025-04-09 21:37:46 +02:00
Ilya Kreymer
c796996664
Support for behaviors from 'recorder flow' JSON created in devtools (#818)
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
2025-04-09 12:24:29 +02:00
Tessa Walsh
2961d3b9f2
Write behaviors downloaded from URL to tempdir (#816)
Follow-up to #368 

This makes download locations consistent between custom behaviors
downloaded from URLs and those downloaded from Git repos, and resolves a
container security issue in Browsertrix.
2025-04-04 11:23:29 -04:00
Ilya Kreymer
28241c824e ci: fixes to deploy ci workflow 2025-04-03 23:36:49 -07:00
Ilya Kreymer
7421404aee
ci: add workflow to deploy to dev channels (requires actions secrets config) (#815)
- uses DEPLOY_REGISTRY, DEPLOY_REGISTRY_PATH, DEPLOY_REGISTRY_API_TOKEN
secrets
2025-04-03 23:21:48 -07:00
Ilya Kreymer
66c71d03c8
deps: bump base browser image to 1.77.95 (#814) 2025-04-03 17:25:29 -07:00
Ilya Kreymer
ba4c432ce8
browser crash handling, follow-up to #808: (#813)
- if not restartOnError, attempt to kill browser and try again, 3 more
times
- if still unable to open window, mark browser as crashed and exit
2025-04-03 16:10:54 -07:00
Tessa Walsh
f83d0e8f02
Add option to push behavior + behavior script logs to Redis (#805)
Fixes #804 

- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-03 15:46:10 -07:00
aponb
6898bcf7ae
useSHA1 Parameter for generating SHA1 record hashes (#532) (#812)
By using the useSHA1 flag, the payload digest in records will use SHA-1
with Base32 encoding instead of the default SHA-256

Co-authored-by: Andreas Predikaka <andreas.predikaka@onb.ac.at>
2025-04-02 17:10:50 -07:00
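The Base32 step mentioned in 6898bcf7ae can be illustrated with a small encoder. This is a sketch, not the crawler's or warcio.js's implementation: `base32Encode` is a hypothetical helper following the RFC 4648 alphabet, and computing the SHA-1 digest itself is left to a crypto library.

```typescript
const BASE32_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

// Encode bytes as RFC 4648 Base32, padded with "=" to a multiple of 8 chars.
function base32Encode(bytes: Uint8Array): string {
  let out = "";
  let buffer = 0; // bit accumulator
  let bits = 0; // number of valid bits currently in the accumulator

  for (let i = 0; i < bytes.length; i++) {
    buffer = (buffer << 8) | bytes[i]!;
    bits += 8;
    // Emit one character for every complete group of 5 bits.
    while (bits >= 5) {
      out += BASE32_ALPHABET.charAt((buffer >>> (bits - 5)) & 31);
      bits -= 5;
    }
    // Keep only the bits not yet emitted so the accumulator stays small.
    buffer &= (1 << bits) - 1;
  }

  // Flush any leftover bits, zero-padded on the right.
  if (bits > 0) {
    out += BASE32_ALPHABET.charAt((buffer << (5 - bits)) & 31);
  }
  while (out.length % 8 !== 0) {
    out += "=";
  }
  return out;
}
```

A SHA-1 digest is 20 bytes (160 bits), so its Base32 form is exactly 32 characters with no padding.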
Ilya Kreymer
bf6fbe8776
Remove extra console.log statements (#811)
- remove one added in screencaster
- also remove others that are outside logging system
- bump to 1.5.10
2025-04-02 09:25:11 -07:00
Ilya Kreymer
91f8fadc5f
deps update: update webrecorder dependencies (#810)
- browsertrix-behaviors 0.8.1 for improved logging / new behavior
functions
- wabac.js 2.22.9
- RWP 2.3.4 for QA
- update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js
2025-04-01 22:11:56 -07:00
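The 'range: -x' form mentioned in 91f8fadc5f is an HTTP suffix range: it asks for the last x bytes of a resource. A minimal sketch of resolving such a header is shown below. `resolveRange` and `ByteRange` are hypothetical names, only single ranges are handled, and this is not the ReplayServer's actual code.

```typescript
interface ByteRange {
  start: number;
  end: number; // inclusive
}

// Resolve a single-range "Range" header value against a resource of
// `size` bytes. Returns null for unsupported or unsatisfiable ranges.
function resolveRange(header: string, size: number): ByteRange | null {
  const m = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
  if (m === null || size <= 0) {
    return null;
  }
  const first = m[1] ?? "";
  const last = m[2] ?? "";

  if (first === "") {
    // Suffix range "bytes=-x": the last x bytes of the resource.
    if (last === "") {
      return null;
    }
    const suffixLength = Number(last);
    if (suffixLength === 0) {
      return null;
    }
    return { start: Math.max(0, size - suffixLength), end: size - 1 };
  }

  // "bytes=a-b" or open-ended "bytes=a-".
  const start = Number(first);
  if (start >= size) {
    return null;
  }
  const end = last === "" ? size - 1 : Math.min(Number(last), size - 1);
  return end < start ? null : { start, end };
}
```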
Ilya Kreymer
fd41b32100
saved state tweaks: (#809)
- if saved state filename is somehow duplicated, don't re-add to array to
avoid deletion (fixes edge case in #791)
- also avoid double interpolation of filename
2025-04-01 18:59:04 -07:00
Emma Segal-Grossman
41b968baac
Dynamically adjust reported aspect ratio based on GEOMETRY (#794)
Closes #793 
Related to #733

Adjusts the reported aspect ratio based on GEOMETRY env var.
Also adjusts stylesheet in screencast HTML to match.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-01 18:26:12 -07:00
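Deriving an aspect ratio from GEOMETRY, as described in 41b968baac, might look like the sketch below. It assumes the variable has the form WIDTHxHEIGHT with an optional xDEPTH suffix (for example "1360x1020x16"); that format, and the names `parseGeometry` and `cssAspectRatio`, are assumptions for illustration rather than the crawler's code.

```typescript
interface Geometry {
  width: number;
  height: number;
}

// Parse "WIDTHxHEIGHT" or "WIDTHxHEIGHTxDEPTH" (assumed format).
// Returns null if the value does not match.
function parseGeometry(value: string): Geometry | null {
  const m = /^(\d+)x(\d+)(?:x\d+)?$/.exec(value.trim());
  if (m === null) {
    return null;
  }
  const width = Number(m[1]);
  const height = Number(m[2]);
  if (width === 0 || height === 0) {
    return null;
  }
  return { width, height };
}

// Value usable for the CSS `aspect-ratio` property, e.g. "1360 / 1020".
function cssAspectRatio(geometry: Geometry): string {
  return geometry.width + " / " + geometry.height;
}
```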
Tessa Walsh
2b00c1f065
Tweaks for custom behavior loading (#807)
Follow-up to #712 

Fixes a few things I noticed while testing out
https://github.com/webrecorder/browsertrix/pull/2520

- Ignore `.git` directory of git repositories when recursively walking
cloned git repo to collect custom behaviors
- Increase MAX_DEPTH for collecting behaviors to 5 (previous limit of 2
was overly restrictive for Git repositories)
- Log name of custom behavior scripts (filename or URLs) added as info messages in
`behavior` context

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-01 18:15:57 -07:00
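The depth-limited walk described in 2b00c1f065 can be sketched over an in-memory tree, which stands in for the filesystem here. `collectBehaviorFiles` and the entry types are hypothetical, the `.js` filter is an assumption, and how the real crawler counts depth may differ.

```typescript
interface FileEntry {
  kind: "file";
  name: string;
}

interface DirEntry {
  kind: "dir";
  name: string;
  children: Array<FileEntry | DirEntry>;
}

const MAX_DEPTH = 5;

// Recursively collect .js files, skipping any `.git` directory and
// stopping once the directory depth exceeds MAX_DEPTH.
function collectBehaviorFiles(
  dir: DirEntry,
  path: string = dir.name,
  depth: number = 0,
): string[] {
  if (depth > MAX_DEPTH) {
    return [];
  }
  const found: string[] = [];
  for (const child of dir.children) {
    const childPath = path + "/" + child.name;
    if (child.kind === "file") {
      if (/\.js$/.test(child.name)) {
        found.push(childPath);
      }
    } else if (child.name !== ".git") {
      found.push(...collectBehaviorFiles(child, childPath, depth + 1));
    }
  }
  return found;
}
```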
Ilya Kreymer
2b56455e8b
stuck page handling: when attempting to restart browser, add more retries (#808)
fixes issue mentioned in:
https://github.com/webrecorder/browsertrix-crawler/issues/791#issuecomment-2734342186
2025-04-01 16:56:01 -07:00
Ilya Kreymer
e585b6d194
Better default crawlId (#806)
- set crawl id from collection, not other way around, to ensure unique
redis keyspace for different collections
- by default, set crawl id to unique value based on host and collection,
eg. '@hostname-@id'
- don't include '@id' in collection interpolation, can only use
hostname or timestamp
- fixes issue mentioned / workaround provided in #784 
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
2025-04-01 13:40:03 -07:00
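The ordering described in e585b6d194 (collection first, then crawl id derived from it) can be sketched as below. The `interpolate` helper is hypothetical, and the `@ts` token for the timestamp is an assumption; only `@hostname` and `@id` appear in the commit message.

```typescript
interface InterpolationValues {
  hostname: string;
  ts: string;
  id?: string; // only supplied when interpolating the crawl id
}

// Replace @hostname and @ts (assumed token name), plus @id when given.
function interpolate(template: string, values: InterpolationValues): string {
  let result = template
    .replace(/@hostname/g, () => values.hostname)
    .replace(/@ts/g, () => values.ts);
  const id = values.id;
  if (id !== undefined) {
    result = result.replace(/@id/g, () => id);
  }
  return result;
}

// The collection is resolved first, without '@id'; the crawl id is then
// derived from the hostname and the resolved collection.
function resolveIds(
  collectionTemplate: string,
  hostname: string,
  ts: string,
): { collection: string; crawlId: string } {
  const collection = interpolate(collectionTemplate, { hostname, ts });
  const crawlId = interpolate("@hostname-@id", { hostname, ts, id: collection });
  return { collection, crawlId };
}
```

Because the crawl id includes the collection name, two collections crawled from the same host end up with distinct ids in this sketch.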
Tessa Walsh
5fedde6eee
Fail crawl with fatal message if custom behavior isn't loaded (#799)
Fixes #797 

The crawler will now exit with a fatal log message and exit code 17 if:

- A Git repository specified with `--customBehavior` cannot be cloned
successfully (new)
- A custom behavior file at a URL specified with `--customBehavior` is
not fetched successfully (new)
- No custom behaviors are collected at a local filepath specified with
`--customBehavior`, or if an error is thrown while attempting to collect
files from a nonexistent path (new)
- Any custom behaviors collected fail `Browser.checkScript` validation
(existing behavior)

Tests have also been added accordingly.
2025-03-31 17:35:30 -07:00
Ilya Kreymer
e751929a7a
Move extractLinks to behaviors + Update to browsertrix-behaviors 0.8.0 (#803)
- extractLinks() now handled via browsertrix-behaviors
- fixes #770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions to enum list
2025-03-31 12:02:25 -07:00
benoit74
02c4353b4a
Add clarification in usage about hostname used (#771)
clarify that the crawlId defaults to the Docker container hostname
2025-03-30 21:16:58 -07:00
Tessa Walsh
8f581a587c
Validate Autoclick selector, fail crawl if invalid (#800)
Fixes #798 

Also modifies the existing test for link selector validation to check for
exit status code 17 when link selectors fail validation.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-03-30 13:48:41 -07:00
Ilya Kreymer
47d61a6baf version: bump to 1.5.9 2025-03-28 13:41:53 -07:00
Ilya Kreymer
8c96a10f67
deps: update to warcio.js 2.4.4, fixes #796 (#802) 2025-03-28 13:38:15 -07:00
Ilya Kreymer
323b654c54 tests: update qa test to use awp site 2025-03-21 13:06:53 -07:00
Henry Wilkinson
34a1e3d6c0
docs: Update header font (#785)
Updated alongside https://github.com/webrecorder/replayweb.page/pull/405

Long overdue match to Browsertrix docs styling
2025-03-05 14:21:00 -08:00
Ilya Kreymer
9a7ac9bef1
Fix using cached WACZ filename if already set ahead of time. (#783)
- if <uid>:nextWacz filename already exists, actually get it and use
that!
- don't merge cdx if not generating wacz yet, use same condition for
both
- bump version to 1.5.8
- fix follow-up to #748, fix #747
2025-02-28 17:58:56 -08:00
Ilya Kreymer
2aec2e1a33 reset back to latest image, 1.77.52
bump version to 1.5.7
2025-02-27 16:06:43 -08:00
Ilya Kreymer
0e7391b668
follow-up to #781: (#782)
- undo accidentally setting window timeout to 20000 seconds instead of
20 for debugging!
- follow up to #781
- bump to 1.5.6.1
- should hopefully fix crawls stuck in this way..
2025-02-27 16:02:33 -08:00
Ilya Kreymer
9b22df5c90
revert brave version: not ideal, but need to revert to chromium 132 u… (#781)
…ntil we figure out various stalling issues that still persist in
chromium >=133

bump to 1.5.6
2025-02-27 07:05:31 -08:00
Ilya Kreymer
6e42e056b1 version: bump to 1.5.5 2025-02-26 12:42:00 -08:00
Ilya Kreymer
24ca818356
further fix to stuck on getting new window: (#779)
- set retries back to 3, was set high by mistake
- if will restart, throw exception to restart crawler
- otherwise, attempt to kill browser process that is stalled (appears to
work in testing)
- follow-up to #766
2025-02-26 12:32:05 -08:00
Tessa Walsh
e402ddc202
Strip credentials from proxy address in crawl logs (#778)
Fixes https://github.com/webrecorder/security/issues/14
2025-02-26 15:23:38 -05:00
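Stripping credentials before logging, as in e402ddc202, can be sketched with a single regular expression. `stripProxyCredentials` is a hypothetical helper and the example addresses are made up; the crawler's actual approach may differ.

```typescript
// Remove any "user:password@" (or "user@") part that follows the scheme,
// leaving scheme, host and port intact for logging.
function stripProxyCredentials(proxyAddress: string): string {
  return proxyAddress.replace(/^([a-z][a-z0-9+.-]*:\/\/)[^@\/]*@/i, "$1");
}
```

For example, `stripProxyCredentials("http://user:secret@proxy.example:8080")` yields `http://proxy.example:8080`, while an address without credentials is returned unchanged.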
Ilya Kreymer
c25c6771a8
browser: update brave to 1.77.52 to get Chromium 134 (#773)
should fix browser timing out on new window, fixes #766
bump to 1.5.4
2025-02-20 09:14:32 -08:00
Tessa Walsh
f16be32ba6
Make sure all exit calls use ExitCodes enum (#767)
Quick follow-up to #584 to make sure enum is used everywhere in profile editing mode:
- profile browser exits with ExitCodes.SignalInterrupted in response to signal
- use ExitCodes.Success or GenericError for other exit codes
2025-02-11 12:04:38 -08:00
benoit74
4b72b7c7dc
Add documentation on exit codes (#765)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-11 12:16:29 -05:00
benoit74
fc56c2cf76
Add more exit codes to detect interruption reason (#764)
Fix #584

- Replace interrupted with interruptReason
- Distinct exit codes for different interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), DiskUtilization (16)
are used when an interrupt happens for these reasons, in addition to existing reasons BrowserCrashed (10),
SignalInterrupted (11) and SignalInterruptedForce (13)
- Doc fix to cli args

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-10 14:00:55 -08:00
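The exit codes listed in fc56c2cf76 can be summarized as an enum. The numeric values below are taken from the commit message; the exact shape of the crawler's `ExitCodes` enum, the `InterruptReason` type and the `exitCodeFor` helper are assumptions for illustration.

```typescript
// Values as listed in the commit message.
enum ExitCodes {
  BrowserCrashed = 10,
  SignalInterrupted = 11,
  FailedLimit = 12,
  SignalInterruptedForce = 13,
  SizeLimit = 14,
  TimeLimit = 15,
  DiskUtilization = 16,
}

// Hypothetical: interrupt reasons named after the enum members.
type InterruptReason = keyof typeof ExitCodes;

// Map an interrupt reason to the process exit code.
function exitCodeFor(reason: InterruptReason): number {
  return ExitCodes[reason];
}
```

A wrapper script can then distinguish, for instance, a crawl stopped by its size limit (14) from one stopped by a signal (11).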
Ilya Kreymer
846f0355f6
Improved handling of browser stuck / crashed (#763)
- only attempt to close browser if it has not crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
2025-02-10 10:16:25 -08:00
Ilya Kreymer
5807c320bf
remove fatal() on new window error + stats fix (#762)
logging (#752): ensure failed included in totals
fatal rework: remove fatal() when failing to open new window, throw instead to ensure crawl is properly interrupted.
bump to 1.5.2
2025-02-09 15:26:36 -08:00
Ilya Kreymer
a5050a25d7
Readd health check on retry (#759)
- health check failures should be incremented even if retrying, in case
restart is needed
- cleanup writePage()
- bump default --maxPageRetries to 2 for better default for Browsertrix
2025-02-06 20:13:20 -08:00