Commit graph

575 commits

Author SHA1 Message Date
Ilya Kreymer
5bb4527de2
(backport for 1.9.3 release) fix connection leaks in aborted fetch() requests (#924) (#925)
- in doCancel(), use abort controller and call abort(), instead of
body.cancel()
- ensure doCancel() is called when a WARC record is not written, eg. is
a dupe, as stream is likely not consumed
- also call IO.close() when using browser network reader
- fixes #923
- also adds missing dupe check to async resources queued from behaviors
(were being deduped on write, but were still fetched unnecessarily)
- backport of #924 for 1.9.3
2025-11-27 21:00:24 -08:00
Ilya Kreymer
6a163ddc47 version: 1.9.3 2025-11-27 20:41:27 -08:00
Ilya Kreymer
565ba54454
better failure detection, allow update support for captcha detection via behaviors (#917)
- allow fail on content check from main behavior
- update to behaviors 0.9.6 to support 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if profile can not be extracted
2025-11-19 15:49:49 -08:00
Ilya Kreymer
87edef3362
netIdle cleanup + better default for pages where networkIdle times out (#916)
- set default networkIdle to 2
- add netIdleMaxRequests as an option, default to 1 (in case of long
running requests)
- further fix for #913 
- avoid accidental logging

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-11-18 16:34:02 -08:00
Ilya Kreymer
8c8fd6be08
remove --disable-component-update flag, fixes shields not working (#915)
should fix main cause of slow down in #913 
deps: update to brave 1.84.139, puppeteer 24.30.0
bump to 1.9.1
2025-11-14 20:30:42 -08:00
Ilya Kreymer
bb11147234
brave: update policies to disable new brave services (#914) 2025-11-14 20:00:58 -08:00
Ilya Kreymer
59fe064c62 version: bump to 1.9.0 2025-11-11 18:28:21 -08:00
Ilya Kreymer
85c5632eb1
deps: bump dependencies for 1.9.0 (#912)
update to brave 1.84.135, wabac.js 2.24.5
2025-11-11 14:38:35 -08:00
Tessa Walsh
11f52db31e
Fix linting following external contribution (#911)
Quick-follow to
7dd13a9ec4,
to fix linting issue introduced in that PR.
2025-11-11 12:03:56 -08:00
aponb
b50ef1230f
feat: add extraChromeArgs support for passing custom Chrome flags (#877)
This change introduces a new CLI option --extraChromeArgs to Browsertrix
Crawler, allowing users to pass arbitrary Chrome flags without modifying
the codebase.

This approach is future-proof: any Chrome flag can be provided at
runtime, avoiding the need for hard-coded allowlists.
Maintains backward compatibility: if no extraChromeArgs are passed,
behavior remains unchanged.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-11-11 12:03:30 -08:00
Percival
7dd13a9ec4
fix: Skip proxy for seed file and custom behavior downloads (#907) 2025-11-11 10:51:24 -05:00
Wannaphong Phatthiyaphaibun
37a6fa974b
Fix directory path in user guide for WACZ file (#910)
I found that the directory path in the user guide for the WACZ file is
wrong. It should be `crawls/collections/test/test.wacz`.
2025-11-07 12:39:01 -08:00
Ilya Kreymer
74b6ad0ae0 deps: bump behaviors to 0.9.5
beta 1.9.0-beta.1
2025-11-02 12:30:09 -08:00
Ilya Kreymer
390d036f9e
deps: update to browsertrix-behaviors 0.9.4 (#906)
Includes fixes for autoclick behavior:
- able to click on svgs
- don't navigate back if click did not result in history stack change
2025-11-02 09:12:15 -08:00
Ilya Kreymer
5685cb2cbe
profiles: add singleton lock removal on startup to avoid any issues (#904)
If the SingletonLock (and SingletonPort, SingletonSocket) files somehow
made it into the profile, the browser will refuse to start. This will
ensure that it is cleared.
(Could also do it before saving it as well, but this will catch it for
any existing profiles).
2025-11-02 09:12:07 -08:00
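The cleanup described in #904 can be sketched roughly as follows. This is an illustrative Python sketch, not the crawler's own code: the helper name `clear_singleton_files` is hypothetical, and it assumes the three files named in the commit message sit at the top level of the profile directory.

```python
from pathlib import Path

# File names taken from the commit message above.
SINGLETON_FILES = ("SingletonLock", "SingletonPort", "SingletonSocket")


def clear_singleton_files(profile_dir: str) -> list[str]:
    """Remove leftover Singleton* entries and return the names removed."""
    removed = []
    for name in SINGLETON_FILES:
        path = Path(profile_dir) / name
        # The entries may be symlinks whose target no longer exists,
        # so check is_symlink() as well as exists().
        if path.is_symlink() or path.exists():
            path.unlink()
            removed.append(name)
    return removed
```

Calling the helper before every browser launch (rather than only when saving) matches the reasoning in the commit: it also repairs profiles that were saved before the fix.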
Ilya Kreymer
3935526240
add --saveProfile option to save profile after successful crawl (#903)
- if --saveProfile is specified, attempt to save profile to same target
as --profile
- if --saveProfile <target>, save to target
- save profile on finalExit if browser has launched
- supports local file paths and storage-relative path with '@' (same as
--profile)
- also clear cache in first worker to match regular profile creation

fixes #898
2025-10-29 19:57:25 -07:00
Ilya Kreymer
afdb6674e5
profile download improvements: (#899)
- log when profile download starts
- ensure there is a timeout to profile download attempt (60 secs)
- attempt retry 2 more times if initial profile download times out
- fail crawl after 3 retries, if profile can not be downloaded
successfully

bump to 1.8.2
2025-10-25 16:49:40 -07:00
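A minimal sketch of the timeout-and-retry policy from #899, written in Python for illustration only. The names `download_profile`, `fetch`, and `ProfileDownloadError` are hypothetical, and the sketch assumes the supplied `fetch` callable enforces the timeout itself by raising `TimeoutError`.

```python
from typing import Callable


class ProfileDownloadError(Exception):
    """Raised when every download attempt timed out."""


def download_profile(
    fetch: Callable[[float], bytes],
    timeout_secs: float = 60.0,  # 60 second limit, per the commit message
    max_attempts: int = 3,       # initial attempt plus 2 retries
) -> bytes:
    for attempt in range(1, max_attempts + 1):
        # Log the start of each attempt so a stalled download is visible.
        print(f"profile download attempt {attempt}/{max_attempts}")
        try:
            return fetch(timeout_secs)
        except TimeoutError:
            continue
    # All attempts timed out: surface a failure so the crawl can be failed.
    raise ProfileDownloadError(f"profile download failed after {max_attempts} attempts")
```

Only timeouts are retried in this sketch; any other exception propagates immediately, which is an assumption rather than something the commit message specifies.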
Ilya Kreymer
6f26148a9b bump version to 1.8.1 2025-10-08 17:11:04 -07:00
Ilya Kreymer
4f234040ce
Profile Saving Improvements (#894)
fix some observed errors that occur when saving profile:
- use browser.cookies instead of page.cookies to get all cookies, not
just from page
- catch exception when clearing cache and ignore
- logging: log when proxy init is happening on all paths, in case error
in proxy connection
2025-10-08 17:09:20 -07:00
Ilya Kreymer
002feb287b
dismiss js dialog popups (#895)
move the JS dialog handler to not be only for autoclick, dismiss all JS
dialogs (alert(), prompt()) to avoid blocking page
fixes #891
2025-10-08 14:57:52 -07:00
Ilya Kreymer
2270964996
logging: remove duplicate seeds found error (#893)
Per discussion, the message is unnecessary / confusing (doesn't provide
enough info) and can also happen on crawler restart.
2025-10-07 08:18:22 -07:00
Ilya Kreymer
fd49041f63
flow behaviors: add scrolling into view (#892)
Some page elements don't quite respond correctly if the element is not
in view, so should add the setEnsureElementIsInTheViewport() to click,
doubleclick, hover and change step locators.
2025-10-07 08:17:56 -07:00
Ed Summers
cc2d890916
Add addLink doc (#890)
It's helpful to know this function is there!
2025-10-02 15:45:55 -04:00
Ilya Kreymer
f7a080fe83 version: bump to 1.8.0 2025-09-25 10:42:02 -07:00
Ilya Kreymer
048b72ca87
deps update: bump browser to brave 1.82.170, wabac.js 2.24.1 (#886)
use latest puppeteer-core, puppeteer/replay

bump to 1.8.0-beta.1
2025-09-20 11:38:20 -07:00
Ilya Kreymer
8ca7756d1b
tests: remove example.com from tests (#885)
also use local http-server for behavior tests
2025-09-19 23:21:47 -07:00
Ilya Kreymer
a2742df328
seed urls list: check for quoted URLs and remove quotes (#883)
- check for urls that are wrapped in quotes, eg. 'https://example.com/'
or "https://example.com/" and trim and remove the quotes before adding seed
- tests: add quoted URL to tests, fix old.webrecorder.net test
- deps: update wabac.js, RWP to latest
- logging: reduce error logging for seed lists, only log once that there are duplicates or page limit is reached
- fix for #882
2025-09-12 13:34:41 -07:00
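The quote handling described in #883 might look something like the sketch below. It is a Python illustration under stated assumptions: `strip_quotes` is a hypothetical helper, and it only removes quotes when the same quote character wraps both ends of the line.

```python
def strip_quotes(line: str) -> str:
    """Trim whitespace, then drop one pair of matching wrapping quotes."""
    url = line.strip()
    if len(url) >= 2 and url[0] == url[-1] and url[0] in ("'", '"'):
        url = url[1:-1].strip()
    return url
```

For example, `strip_quotes("'https://example.com/'")` and `strip_quotes('"https://example.com/"')` both yield `https://example.com/`, while an unquoted URL passes through unchanged.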
Ilya Kreymer
705bc0cd9f
Async Fetch Refactor (#880)
- separate out reading stream response while browser is waiting (not
really async) from actual async loading, this is not handled via
fetchResponseBody()
- unify async fetch into first trying browser networking for regular
GET, fallback to regular fetch()
- load headers and body separately in async fetch, allowing for
cancelling request after headers
- refactor direct fetch of non-html pages: load headers and handle
loading body, adding page async, allowing worker to continue loading
browser-based pages (should allow more parallelization in the future)
- unify WARC writing in preparation for dedup: unified serializeWARC()
called for all paths, WARC digest computed, additional checks for
payload added for streaming loading
2025-09-10 12:05:21 -07:00
Ilya Kreymer
a42c0b926e
Support host-specific proxies with proxy config YAML (#837)
- Adds support for YAML-based config for multiple proxies, containing
'matchHosts' section by regex and 'proxies' declaration, allowing
matching any number of hosts to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching and running browser with the specified proxy PAC script
served by an internal http server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config

Fixes #836

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-08-20 16:07:29 -07:00
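To make the host-matching idea in #837 concrete, here is a small Python sketch. The key names `matchHosts` and `proxies` come from the commit message, but the exact shape of the config, the proxy names and URLs, the first-match-wins rule, and the `proxy_for_host` helper are all assumptions made for illustration. The crawler itself is described as generating a PAC script rather than doing the lookup this way.

```python
import re
from typing import Optional

# Hypothetical config, shaped as a mapping of host regex -> proxy name,
# plus a mapping of proxy name -> proxy URL.
CONFIG = {
    "matchHosts": {
        r"example\.com$": "proxy-a",
        r"\.org$": "proxy-b",
    },
    "proxies": {
        "proxy-a": "socks5://proxy-a.invalid:1080",
        "proxy-b": "http://proxy-b.invalid:8080",
    },
}


def proxy_for_host(host: str, config: dict) -> Optional[str]:
    """Return the proxy URL for the first matching host regex, else None."""
    for pattern, name in config["matchHosts"].items():
        if re.search(pattern, host):
            return config["proxies"][name]
    return None  # no match: connect directly
```

Keeping the regex table separate from the proxy declarations lets several host patterns share one named proxy, which is the "any number of hosts to any number of named proxies" property the commit describes.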
Ilya Kreymer
a6ad6a0e42 version: bump to 1.7.0 2025-07-31 15:23:42 -07:00
Ilya Kreymer
5c7ff3dfef
deps: bump base to brave 1.80.125 (#875) 2025-07-31 14:51:18 -07:00
Ilya Kreymer
18fe5a9676
behavior logging: remove last line dupe check for behavior logs (#874)
Shouldn't skip multiple log messages, as this is unexpected behavior for
user-defined behaviors.
2025-07-30 16:20:14 -07:00
Tessa Walsh
aba065c8fb
Don't trim to limit if limit is default of 0 (#873)
Fixes #872 

Fix for restarting crawl from saved state, where the default `--limit`
value of 0 was incorrectly preventing any URLs from being re-queued.
2025-07-29 15:48:08 -07:00
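The fix in #873 amounts to treating a limit of 0 as "no limit" before trimming. A hedged Python sketch of that guard, with a hypothetical `trim_to_limit` helper standing in for the crawler's real re-queueing logic:

```python
def trim_to_limit(urls: list[str], limit: int = 0) -> list[str]:
    """Keep at most `limit` URLs; a limit of 0 (the default) means unlimited."""
    if limit <= 0:
        # Without this guard, slicing to 0 would drop every URL.
        return list(urls)
    return urls[:limit]
```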
Ilya Kreymer
0652a3fb1d
quickfix: WACZ upload retry support: (#871)
- if a failure occurs on failed upload, and crawler restarts on error,
exit with 'interrupt' to allow for automatic restart (eg. in Browsertrix
app)
- otherwise, a failed upload will exit the crawl with no WACZ, resulting
in overall crawl failure
2025-07-29 15:41:22 -07:00
sua yoo
bc4d649307
Capitalization fix for log messages (#870)
Capitalizes "URL" in log messages.
2025-07-24 23:52:12 -07:00
Tessa Walsh
66402c2e53
Add documentation for --failOnContentCheck and update CLI options in docs (#869)
Related to #860 

This will give us something we can link to from Browsertrix/the
Browsertrix User Guide for up-to-date information on this option.
2025-07-23 12:54:12 -07:00
Ilya Kreymer
1a4341bfbc
url queueing: log skipped URLs as errors if depth === 0 (#868)
- will ensure seeds from URL list are reported as errors if skipped
- also set logging context to 'scope' instead of 'links'
- fixes #866

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-07-23 10:05:40 -07:00
Ilya Kreymer
96fd22971f
deps update: (#867)
- bump brave to 1.80.122
- bump wabac.js to 2.23.8
- bump RWP to 2.3.15
- bump browsertrix-behaviors to 0.9.1
2025-07-22 21:06:12 -07:00
Tessa Walsh
acae5155f5
Fix docs mistaking --waitUntil with --pageLoadTimeout (#864)
Fixes https://github.com/webrecorder/browsertrix-crawler/issues/853

Corrects a documentation inaccuracy pointed out by a user
2025-07-21 12:52:58 -07:00
Ilya Kreymer
549d655173
Support option to fail crawl on content check (#861)
- add --failOnContentCheck for quick fail if content check in behavior
fails
- expose __bx_contentCheckFailed to cause an immediate failure from
behavior
- only allow failing crawl due to content check from within
awaitPageLoad() callback
- set a 'failReason' key to track that crawl failed due to a particular
content check reason
- deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
(2.23.6)
- fixes #860

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-07-08 13:08:52 -07:00
Ilya Kreymer
6244515818
async fetch: allow retrying async fetch if interrupted (#863)
- retry if 'truncated' set, or if size mismatch, or other exception
occurs
- retry only for network load and async fetch, not for response fetch
- set max retries to 2 (same as default for pages currently)
- fixes #831
2025-07-08 10:02:09 -07:00
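The retry conditions listed in #863 can be expressed as a single predicate. The Python sketch below is illustrative only: `should_retry` and its parameters are hypothetical names, and `retries_done` is assumed to count retries already made.

```python
from typing import Optional

MAX_RETRIES = 2  # value stated in the commit message


def should_retry(
    retries_done: int,
    truncated: bool,
    expected_size: Optional[int],
    actual_size: int,
    error: Optional[Exception] = None,
) -> bool:
    """Decide whether an interrupted async fetch should be attempted again."""
    if retries_done >= MAX_RETRIES:
        return False
    if error is not None or truncated:
        return True
    # Only compare sizes when an expected size is known.
    return expected_size is not None and expected_size != actual_size
```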
Ilya Kreymer
c84f58f539
Use consistent profile directory name (merge 1.6.4 change) (#859)
- Use `TMPDIR/btrixProfile` as consistent profile directory name
- Avoid accumulation of temp profile dirs if crawler is restarted
multiple times, eg. if tmp dir is mapped to /crawls (as is in
Browsertrix now), this prevents a proliferation of
/crawls/tmp/profile-* dirs for each crawler restart
- change released in 1.6.4, merging into main
2025-07-03 19:49:05 -07:00
Tessa Walsh
2af94ffab5
Support downloading seed file from URL (#852)
Fixes #841 

Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step in order to be able to async fetch the seed file from a
URL.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-03 10:49:37 -04:00
Ilya Kreymer
687f08b1d0
Add option to save local/sessionStorage (#856)
If --saveStorage is set, localStorage and sessionStorage will be
serialized with the WARC record for the page.
If a page redirects, track what the current page URL is and save storage
as part of the page's WARC record.

Fixes #855
2025-06-30 19:58:19 -07:00
Ilya Kreymer
eb374fa835
base: bump to brave 1.80.113 (#857)
version: bump to 1.7.0-beta.0
tests: update deprecated command to work with latest minio
2025-06-30 19:55:38 -07:00
Ilya Kreymer
d2a6aa9805
version: bump to 1.6.3 (#851)
cli: regen cli docs to update from #850
2025-06-16 15:55:05 -04:00
Rijnder Wever
fa26f05f66
cleanup: remove dead pywb code from argparser and docs (#847)
The value of `--dedupPolicy` was once passed to pywb (see
https://pywb.readthedocs.io/en/latest/manual/configuring.html#dedup-options-for-recording).
Now that pywb has been dropped, there is no need to keep this option
around.

In fact, I know multiple users that have been confused by the mention of
this option in the docs (myself included).

(for historical context, see
https://github.com/webrecorder/browsertrix-crawler/pull/332)
2025-06-16 12:36:32 -04:00
Tessa Walsh
e09d10c582
Disable disk utilization check by default (#850)
Related to https://github.com/webrecorder/browsertrix-crawler/issues/848

Several users have had issues with disk utilization checks, including
the values reported by `df` inside the crawler container having
unexpected results for mounted volumes. The commonly recommended
solution to this is to use `docker system df`, but that is of course not
available within the Docker container itself.

This PR changes disk utilization checks to be an opt-in feature by
setting the default value to `0` (disabled).
2025-06-16 12:36:15 -04:00
Ilya Kreymer
da953b670b
content-type compare for rewriting: use case-insensitive check (#849)
update to wabac.js 2.23.3 for HLS rewriting fixes
part of capture fix for webrecorder/replayweb.page#433
2025-06-16 11:09:44 -04:00
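A case-insensitive content-type check of the kind mentioned in #849 could be sketched as below. This is a Python illustration with a hypothetical `content_type_matches` helper; stripping parameters such as `charset` before comparing is an added assumption, not something the commit message states.

```python
def content_type_matches(header_value: str, expected: str) -> bool:
    """Compare the MIME type part of a Content-Type header, ignoring case."""
    mime = header_value.split(";", 1)[0].strip().lower()
    return mime == expected.strip().lower()
```

With this check, a server that sends `Application/VND.Apple.MPEGURL` is treated the same as one that sends the lowercase form, so HLS playlists are not skipped by the rewriting step because of header casing.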
Ilya Kreymer
a5936b56aa
deps: bump brave 1.79.118 (#845)
bump version to 1.6.2
2025-06-03 12:52:07 -07:00