Commit graph

205 commits

Author SHA1 Message Date
Ilya Kreymer
a5936b56aa
deps: bump brave 1.79.118 (#845)
bump version to 1.6.2
2025-06-03 12:52:07 -07:00
Ilya Kreymer
71de8d6582
lang code fixes: (#834)
- validate --lang values, fail immediately with invalid ISO 639-1
language code
- ignore --lang value when using profile, print warning that profile
language takes precedence
- fixes #833
2025-05-12 16:06:29 -07:00
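The precedence rules described in #834 can be sketched roughly as follows. This is only an illustration, not the crawler's code: `resolveLang`, `LangDecision`, and the small `KNOWN_CODES` subset are hypothetical names, and checking the profile before validating the code is an assumption.

```typescript
// Hypothetical helper, not the crawler's actual implementation.
type LangDecision = { lang: string | null; warning: string | null };

// Illustrative subset only; a real check would use the full ISO 639-1 list.
const KNOWN_CODES = new Set(["en", "de", "fr", "es", "ja", "ru", "zh"]);

function resolveLang(lang: string | null, usingProfile: boolean): LangDecision {
  if (lang === null) {
    return { lang: null, warning: null };
  }
  if (usingProfile) {
    // Profile language takes precedence: ignore --lang and warn.
    return {
      lang: null,
      warning: `--lang ${lang} ignored, profile language takes precedence`,
    };
  }
  if (!KNOWN_CODES.has(lang)) {
    // Fail immediately on an invalid code.
    throw new Error(`invalid ISO 639-1 language code: ${lang}`);
  }
  return { lang, warning: null };
}
```

With these assumptions, `resolveLang("de", false)` keeps `de`, `resolveLang("de", true)` drops it with a warning, and `resolveLang("zz", false)` throws.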
Ilya Kreymer
f9bd534e4c
more dependency updates: (#827)
- update wabac.js to 2.22.16, RWP to 2.3.7
- fidelity: fixes capture of fb and insta (via wabac.js 2.22.16)
- policy: disable tg popups
- bump version to 1.6.1!
2025-05-05 10:08:59 -07:00
Ilya Kreymer
fc59d04231
Deps update 1.6.1 (#826) 2025-05-02 00:43:37 -07:00
Ilya Kreymer
1cb1b2edb9
Update Behaviors Docs (#820)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-04-10 03:58:07 -04:00
Ilya Kreymer
f2dac05577
regression fix: start redis if needed before attempting to init state! (#819)
bump to 1.6.0-beta.1
2025-04-09 21:37:46 +02:00
Ilya Kreymer
c796996664
Support for behaviors from 'recorder flow' JSON created in devtools (#818)
New Feature:
- support 'flow behavior' from JSON specification
- detect .json files via --customBehaviors
- log behavior progress while running
- logging tweaks (via browsertrix-behaviors 0.8.4) to limit logging for
custom behaviors
- differentiate logging for iframes, move more behavior messages to
debug
- move initCrawlState() to happen earlier to ensure Redis logging can happen in case of fatal errors
- docs to be added in separate follow-up PR
2025-04-09 12:24:29 +02:00
Tessa Walsh
2961d3b9f2
Write behaviors downloaded from URL to tempdir (#816)
Follow-up to #368 

This makes download locations consistent between custom behaviors
downloaded from URLs and those downloaded from Git repos, and resolves a
container security issue in Browsertrix.
2025-04-04 11:23:29 -04:00
Tessa Walsh
f83d0e8f02
Add option to push behavior + behavior script logs to Redis (#805)
Fixes #804 

- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-04-03 15:46:10 -07:00
Ilya Kreymer
bf6fbe8776
Remove extra console.log statements (#811)
- remove one added in screencaster
- also remove others that are outside logging system
- bump to 1.5.10
2025-04-02 09:25:11 -07:00
Ilya Kreymer
91f8fadc5f
deps update: update webrecorder dependencies (#810)
- browsertrix-behaviors 0.8.1 for improved logging / new behavior
functions
- wabac.js 2.22.9
- RWP 2.3.4 for QA
- update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js
2025-04-01 22:11:56 -07:00
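The 'range: -x' requests mentioned in #810 are read here as HTTP suffix byte ranges (`Range: bytes=-x`, meaning the last x bytes). A minimal sketch of resolving such a header against a known body length, assuming a single range only; `resolveRange` is a hypothetical name and not the ReplayServer's actual code:

```typescript
// Resolve a single-range "Range" header value against a known body length.
// Returns inclusive byte offsets, or null if the range is unsatisfiable
// or not in a form this sketch understands.
function resolveRange(
  header: string,
  totalLength: number,
): { start: number; end: number } | null {
  const m = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
  if (!m) return null;
  const first = m[1] ?? "";
  const last = m[2] ?? "";
  if (first === "" && last === "") return null;

  if (first === "") {
    // Suffix form "bytes=-x": the last x bytes of the body.
    const suffix = Number(last);
    if (suffix === 0 || totalLength === 0) return null;
    return { start: Math.max(0, totalLength - suffix), end: totalLength - 1 };
  }

  const start = Number(first);
  if (start >= totalLength) return null;
  // Open-ended form "bytes=n-" runs to the end of the body.
  const end = last === "" ? totalLength - 1 : Math.min(Number(last), totalLength - 1);
  if (end < start) return null;
  return { start, end };
}
```

For a 1000-byte body, `bytes=-100` resolves to offsets 900 through 999, and a suffix longer than the body resolves to the whole body.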
Ilya Kreymer
e751929a7a
Move extractLinks to behaviors + Update to browsertrix-behaviors 0.8.0 (#803)
- extractLinks() now handled via browsertrix-behaviors
- fixes #770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions to enum list
2025-03-31 12:02:25 -07:00
Ilya Kreymer
47d61a6baf version: bump to 1.5.9 2025-03-28 13:41:53 -07:00
Ilya Kreymer
8c96a10f67
deps: update to warcio.js 2.4.4, fixes #796 (#802) 2025-03-28 13:38:15 -07:00
Ilya Kreymer
9a7ac9bef1
Fix using cached WACZ filename if already set ahead of time. (#783)
- if <uid>:nextWacz filename already exists, actually get it and use
that!
- don't merge cdx if not generating wacz yet, use same condition for
both
- bump version to 1.5.8
- fix follow-up to #748, fix #747
2025-02-28 17:58:56 -08:00
Ilya Kreymer
2aec2e1a33 reset back to latest image, 1.77.52
bump version to 1.5.7
2025-02-27 16:06:43 -08:00
Ilya Kreymer
0e7391b668
follow-up to #781: (#782)
- undo accidentally setting window timeout to 20000 seconds instead of
20 for debugging!
- follow up to #781
- bump to 1.5.6.1
- should hopefully fix crawls stuck in this way..
2025-02-27 16:02:33 -08:00
Ilya Kreymer
9b22df5c90
revert brave version: not ideal, but need to revert to chromium 132 u… (#781)
…ntil we figure out various stalling issues that still persist in
chromium >=133

bump to 1.5.6
2025-02-27 07:05:31 -08:00
Ilya Kreymer
6e42e056b1 version: bump to 1.5.5 2025-02-26 12:42:00 -08:00
Ilya Kreymer
c25c6771a8
browser: update brave to 1.77.52 to get Chromium 134 (#773)
should fix browser timing out on new window, fixes #766
bump to 1.5.4
2025-02-20 09:14:32 -08:00
Ilya Kreymer
846f0355f6
Improved handling of browser stuck / crashed (#763)
- only attempt to close browser if browser has not crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
2025-02-10 10:16:25 -08:00
Ilya Kreymer
5807c320bf
remove fatal() on new window error + stats fix (#762)
logging (#752): ensure failed included in totals
fatal rework: remove fatal() when failing to open new window, throw instead to ensure crawl is properly interrupted.
bump to 1.5.2
2025-02-09 15:26:36 -08:00
Ilya Kreymer
b435afeb4b version: bump to 1.5.1 2025-02-06 11:40:31 -08:00
Ilya Kreymer
0ca27e4fa1
QA fix: ensure replay iframe has actually been updated after goto call! (#756)
qa fix: check url of iframe, ensure it is not about:blank anymore
test: add test to ensure expected diff
deps: bump to brave 1.74.51, bump to puppeteer-core 24.2.0
2025-02-06 10:41:38 -08:00
Ilya Kreymer
f379da19be version: bump to 1.5.0! 2025-01-31 21:57:18 -08:00
Ilya Kreymer
1da49258c4 version: bump to 1.5.0-beta.4 2025-01-30 14:32:30 -08:00
Ilya Kreymer
f7cbf9645b
Retry support and additional fixes (#743)
- retries: for failed pages, set retry to 5 in cases multiple retries
may be needed.
- redirect: if page url is /path/ -> /path, don't add as extra seed
- proxy: don't use global dispatcher, pass dispatcher explicitly when
using proxy, as proxy may interfere with local network requests
- final exit flag: if crawl is done and also interrupted, ensure WACZ is
still written/uploaded by setting final exit to true
- hashtag only change force reload: if loading page with same URL but
different hashtag, eg. `https://example.com/#B` after
`https://example.com/#A`, do a full reload
2025-01-25 22:55:49 -08:00
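Two of the URL checks from #743 (the hashtag-only change and the /path/ -> /path redirect) can be illustrated with plain string comparisons. This is a rough sketch; the helper names are hypothetical, and the crawler may normalize URLs differently:

```typescript
// Split a URL string into the part before "#" and the fragment (with "#").
function splitHash(url: string): { base: string; hash: string } {
  const i = url.indexOf("#");
  return i === -1
    ? { base: url, hash: "" }
    : { base: url.slice(0, i), hash: url.slice(i) };
}

// True if both URLs are identical except for the fragment,
// the case where a full reload would be forced.
function isHashOnlyChange(prevUrl: string, nextUrl: string): boolean {
  const prev = splitHash(prevUrl);
  const next = splitHash(nextUrl);
  return prev.base === next.base && prev.hash !== next.hash;
}

// True if the redirect only drops a trailing slash (/path/ -> /path),
// the case where the target would not be added as an extra seed.
function isTrailingSlashRedirect(fromUrl: string, toUrl: string): boolean {
  const from = splitHash(fromUrl).base;
  const to = splitHash(toUrl).base;
  return from === to + "/";
}
```

For example, `isHashOnlyChange("https://example.com/#A", "https://example.com/#B")` is true, while two URLs with different paths are not a hashtag-only change.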
Ilya Kreymer
b7150f1343
Autoclick Support (#729)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not
enabling by default
- Adds support for new exposed function `__bx_addSet` which allows
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates, only used if link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler to reject onbeforeunload page navigations,
when in behavior (page not finished), but accept when page is finished -
to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors printed as warnings, instead of hard
exit, for forward compatibility for new behavior types in the future

Fixes #728, also #216, #665, #31
2025-01-16 09:38:11 -08:00
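The state handling described in #729 can be pictured with a small model. This is a sketch under assumptions: `ClickState` is a hypothetical class, `addSet` only stands in for the exposed `__bx_addSet` function (whose real signature and return value are not given here), and the dialog decision is reduced to a pure function instead of a real browser handler.

```typescript
// Hypothetical model of the per-page state used by autoclick.
class ClickState {
  // Set to true once behaviors are done for the page.
  pageFinished = false;

  private seen = new Set<string>();

  // Stand-in for __bx_addSet: returns true the first time an href is
  // recorded and false for duplicates (the return value is an assumption).
  addSet(href: string): boolean {
    if (this.seen.has(href)) return false;
    this.seen.add(href);
    return true;
  }

  // Decision for a "beforeunload" dialog: block navigation while behaviors
  // are still running, allow it once the page is finished.
  onBeforeUnload(): "accept" | "dismiss" {
    return this.pageFinished ? "accept" : "dismiss";
  }
}
```

A link is then clicked only when `addSet(href)` returns true, and navigation away from the page is allowed only after `pageFinished` has been set.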
Ilya Kreymer
871490758a
Dependency Update for 1.4.2 (#737) 2025-01-06 12:06:40 -08:00
Ilya Kreymer
d923e11436
separate fetch api for autofetch behavior + additional improvements on partial responses: (#736)
Chromium now interrupts fetch() if abort() is called or page is
navigated, so autofetch behavior using native fetch() is less than
ideal. This PR adds support for __bx_fetch() command for autofetch
behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately
from browser's regular fetch()
- __bx_fetch() starts a fetch, but does not return content to browser,
doesn't need abort(), unaffected by page navigation, but will still try
to use browser network stack when possible, making it more efficient for
background fetching.
- if network stack fetch fails, fallback to regular node fetch() in the
crawler.
Additional improvements for interrupted fetch:
- don't store truncated media responses, even for 200
- avoid doing duplicate async fetching if response already handled (eg.
fetch handled in multiple contexts)
- fixes #735, where fetch was interrupted, resulted in an empty response
2024-12-31 13:52:12 -08:00
Ilya Kreymer
fb8ed18f82
package: pin @novnc/novnc to 1.4.0 to prevent accidental upgrades (#727)
- novnc 1.5.0 not compatible with current configuration
- fixes #726
- bump to 1.4.1
2024-11-25 18:42:56 -08:00
Ilya Kreymer
9af34f9a1d version: bump to 1.4.0 2024-11-25 00:36:43 -08:00
Ilya Kreymer
6bfa7d5766
Dependency Update (#725)
- update yarn packages
- update RWP to 2.2.4
- update base image to brave 1.73.91
- fix typing issue
- bump to 1.4.0-beta.1
2024-11-24 01:22:50 -08:00
Ilya Kreymer
214eb6ca8f
support removing range from query (via wabac.js 2.20.6): (#724)
- fix for archiving facebook video, to match
webrecorder/archiveweb.page#272
- permissions: auto enable permissions to avoid possible modal (for both
profiles and crawling)
- deps: update to latest wabac.js + warcio.js
2024-11-22 10:31:12 -08:00
Ilya Kreymer
f56d6505c1
fix indexing of cookie header: (#714)
- add fields option for adding req.http:cookie and referrer entries to
the cdxj
- update to warcio 2.4.0 to support this functionality
2024-11-13 23:13:40 -08:00
Ilya Kreymer
c8e2e43d4d
Dependency Update (#718)
- bump browsertrix-behaviors to 0.6.5
- bump browsertrix-base-image to 1.71.123
- bump puppeteer-core to 23.7.1
2024-11-10 19:34:38 -08:00
Ilya Kreymer
d04509639a
Support custom css selectors for extracting links (#689)
Support array of selectors via --selectLinks property in the
form [css selector]->[property] or [css selector]->@[attribute].
2024-11-08 11:04:41 -05:00
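The `--selectLinks` forms from #689, `[css selector]->[property]` and `[css selector]->@[attribute]`, could be parsed along these lines. `parseSelectLink` and `LinkSelector` are hypothetical names, and falling back to the `href` property when no `->` is present is an assumption, not something the commit message states.

```typescript
type LinkSelector = {
  selector: string; // CSS selector used to find elements
  extract: string; // property or attribute name to read
  isAttribute: boolean; // true for the "->@attribute" form
};

function parseSelectLink(value: string): LinkSelector {
  // Use the last "->" so a selector containing "->" in a quoted
  // attribute value is still split at the right place.
  const idx = value.lastIndexOf("->");
  if (idx === -1) {
    // Assumption: with no "->", read the href property.
    return { selector: value.trim(), extract: "href", isAttribute: false };
  }
  const selector = value.slice(0, idx).trim();
  const target = value.slice(idx + 2).trim();
  if (target.startsWith("@")) {
    return { selector, extract: target.slice(1), isAttribute: true };
  }
  return { selector, extract: target, isAttribute: false };
}
```

Under this sketch, `a[href]->href` reads the `href` property of matching elements, while `img->@src` reads the `src` attribute.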
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Ilya Kreymer
e5bab8e7c8
various edge-case loading optimizations: (#709)
- rework 'should stream' logic:
* ensure 206 responses (or any response) greater than 25M are streamed
* responses between 5M and 25M are read into memory if text/css/js as they may be rewritten
* responses <5M are read into memory
* responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error code responses may lack status codes but otherwise are small
- likely fix for issues in #706
- if too many range requests for same URL are being made, try
skipping/failing right away to reduce load
- assume main browser context is used not just for service workers,
always enable
- check false positive 'net-aborted' error that may actually be ok for
media, as well as documents
- improve logging
- interrupt any pending requests (that may be loading via browser
context) after page timeout, log dropped requests
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-10-31 14:06:17 -07:00
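The 'should stream' rules listed in #709 amount to a small decision function. The sketch below is an interpretation, not the recorder's code: `shouldStream` and its parameters are hypothetical, "M" is taken to mean MiB, and streaming non-rewritable responses between 5M and 25M is implied by the list but not stated outright.

```typescript
const MB = 1024 * 1024; // assumption: "M" in the commit message means MiB
const MAX_REWRITABLE_IN_MEMORY = 25 * MB;
const MAX_ANY_IN_MEMORY = 5 * MB;

// contentLength is null when the response size is unknown.
// isRewritable is true for text/css/js responses that may be rewritten.
function shouldStream(
  status: number,
  contentLength: number | null,
  isRewritable: boolean,
): boolean {
  if (contentLength === null) {
    // Unknown size: stream 2xx responses, read everything else into memory.
    return status >= 200 && status < 300;
  }
  if (contentLength > MAX_REWRITABLE_IN_MEMORY) {
    // Anything over 25M (including 206 responses) is streamed.
    return true;
  }
  if (contentLength >= MAX_ANY_IN_MEMORY) {
    // 5M to 25M: keep in memory only if it may need rewriting.
    return !isRewritable;
  }
  // Under 5M: always read into memory.
  return false;
}
```

For instance, a 10M JavaScript response is read into memory under these rules, while a 10M video response is streamed.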
Ilya Kreymer
181d9b824c
deps: update to latest wabac (#708)
bump version to 1.3.4
2024-10-26 11:02:32 -07:00
Ilya Kreymer
0d39ea3590
dep: update to wabac.js 2.20 (#704)
Update imports for new TS-based wabac.js
2024-10-16 21:02:04 -07:00
Ilya Kreymer
a45b85dd74 version: bump to 1.3.3 2024-10-11 00:12:23 -07:00
Ilya Kreymer
282c47ad66
bump puppeteer core to 23.5.1 (#700)
includes possible improvements for detecting crashes with wrong stack
trace (see: puppeteer/puppeteer#13056)
2024-10-07 16:39:48 -07:00
Ilya Kreymer
356b3f8d10 bump to 1.3.2 2024-09-30 15:51:13 -07:00
Ilya Kreymer
9f310907f0 version: bump to 1.3.1 2024-09-27 14:30:56 -04:00
Ilya Kreymer
da442573b8 version: bump to 1.3.0 2024-09-12 09:22:22 -07:00
Ilya Kreymer
083a9d2090 version: bump to 1.3.0-beta.1 2024-09-05 18:11:52 -07:00
Ilya Kreymer
9d0e3423a3
WARC writer + incremental indexing fixes (#679)
- ensure WARC rollover happens only after response/request + cdx or
single record + cdx have been written
- ensure request payload is buffered for POST request indexing
- update to warcio 2.3.1 for POST request case-insensitive
'content-type' check
- recorder: remove unused 'tempdir', no longer used as warcio chooses a
temp file on its own
2024-09-05 11:10:31 -07:00
Ilya Kreymer
85a07aff18
Streaming in-place WACZ creation + CDXJ indexing (#673)
Fixes #674 

This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously WARC was written once, then read again (for CDXJ indexing),
read again (for adding to new WACZ ZIP), written to disk (into new WACZ
ZIP), read again (if upload to remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all
data is read once to either generate WACZ on disk or upload to remote.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-08-29 13:21:20 -07:00
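The CDXJ merge step from #673 (per-WARC indices combined into one sorted index, then optionally compressed to ZipNum) can be pictured in memory as below. This is only a toy model: the crawler merges on disk with the linux 'sort' command, `mergeCdxj` and `shouldUseZipNum` are hypothetical names, and the unit of the ">50K" threshold is not stated in the commit message, so the `threshold` parameter and its default are assumptions.

```typescript
// Merge several per-WARC CDXJ line lists into one sorted list.
// Lines are compared by UTF-16 code unit, which matches byte-wise
// ordering for ASCII-only input.
function mergeCdxj(perWarcIndexes: string[][]): string[] {
  const all = ([] as string[]).concat(...perWarcIndexes);
  all.sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
  return all;
}

// Decide whether the merged index should be compressed to ZipNum.
// Assumption: "size" and "threshold" use the same (unspecified) unit.
function shouldUseZipNum(
  size: number,
  generateCDX: boolean,
  threshold: number = 50000,
): boolean {
  return generateCDX || size > threshold;
}
```

Given two single-line indices for `com,example)/b` and `com,example)/a`, the merged result lists the `/a` line first.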
Ilya Kreymer
23fbbcb6bf version: bump to 1.3.0-beta.0 2024-08-14 20:12:48 -07:00