Commit graph

81 commits

Author SHA1 Message Date
Ilya Kreymer
f153dd151e seed urls list:
- check for urls that are wrapped in quotes, eg. 'https://example.com/' or "https://example.com/" and remove the quotes
before adding seed
- tests: add quoted URL to tests
- deps: update wabac.js, RWP to latest
- bump to 1.7.1
2025-09-12 08:11:17 -07:00
Ilya Kreymer
5c7ff3dfef
deps: bump base to brave 1.80.125 (#875) 2025-07-31 14:51:18 -07:00
Ilya Kreymer
96fd22971f
deps update: (#867)
- bump brave to 1.80.122
- bump wabac.js to 2.23.8
- bump RWP to 2.3.15
- bump browsertrix-behaviors to 0.9.1
2025-07-22 21:06:12 -07:00
Ilya Kreymer
eb374fa835
base: bump to brave 1.80.113 (#857)
version: bump to 1.7.0-beta.0
tests: update deprecated command to work with latest minio
2025-06-30 19:55:38 -07:00
Ilya Kreymer
a5936b56aa
deps: bump brave 1.79.118 (#845)
bump version to 1.6.2
2025-06-03 12:52:07 -07:00
Ilya Kreymer
f9bd534e4c
more dependency updates: (#827)
- update wabac.js to 2.22.16, RWP to 2.3.7
- fidelity: fixes capture of fb and insta (via wabac.js 2.22.16)
- policy: disable tg popups
- bump version to 1.6.1!
2025-05-05 10:08:59 -07:00
Ilya Kreymer
fc59d04231
Deps update 1.6.1 (#826) 2025-05-02 00:43:37 -07:00
Ilya Kreymer
66c71d03c8
deps: bump base browser image to 1.77.95 (#814) 2025-04-03 17:25:29 -07:00
Ilya Kreymer
91f8fadc5f
deps update: update webrecorder dependencies (#810)
- browsertrix-behaviors 0.8.1 for improved logging / new behavior
functions
- wabac.js 2.22.9
- RWP 2.3.4 for QA
- update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js
2025-04-01 22:11:56 -07:00
Ilya Kreymer
2aec2e1a33 reset back to latest image, 1.77.52
bump version to 1.5.7
2025-02-27 16:06:43 -08:00
Ilya Kreymer
9b22df5c90
revert brave version: not ideal, but need to revert to chromium 132 u… (#781)
…ntil we figure out various stalling issues that still persist in
chromium >=133

bump to 1.5.6
2025-02-27 07:05:31 -08:00
Ilya Kreymer
c25c6771a8
browser: update brave to 1.77.52 to get Chromium 134 (#773)
should fix browser timing out on new window, fixes #766 bump to 1.5.4
2025-02-20 09:14:32 -08:00
Ilya Kreymer
846f0355f6
Improved handling of browser stuck / crashed (#763)
- only attempt to close browser if not browser crashed
- add timeout for browser.close()
- ensure browser crash results in healthchecker failure
- bump to 1.5.3
2025-02-10 10:16:25 -08:00
Ilya Kreymer
0ca27e4fa1
QA fix: ensure replay iframe actually been updated after goto call! (#756)
qa fix: check url of iframe, ensure it is not about:blank anymore
test: add test to ensure expected diff
deps: bump to brave 1.74.51, bump to puppeteer-core 24.2.0
2025-02-06 10:41:38 -08:00
Ilya Kreymer
b7150f1343
Autoclick Support (#729)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not
enabling by default
- Adds support for new exposed function `__bx_addSet` which allows
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates, only used if link has an href
- Adds a new pageFinished flag on the worker state.
- Adds a on('dialog') handler to reject onbeforeunload page navigations,
when in behavior (page not finished), but accept when page is finished -
to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors printed as warnings, instead of hard
exit, for forward compatibility for new behavior types in the future

Fixes #728, also #216, #665, #31
2025-01-16 09:38:11 -08:00
Ilya Kreymer
871490758a
Dependency Update for 1.4.2 (#737) 2025-01-06 12:06:40 -08:00
Ilya Kreymer
6bfa7d5766
Dependency Update (#725)
- update yarn packages
- update RWP to 2.2.4
- update base image to brave 1.73.91
- fix typing issue
- bump to 1.4.0-beta.1
2024-11-24 01:22:50 -08:00
Ilya Kreymer
c8e2e43d4d
Dependency Update (#718)
- bump browsertrix-behaviors to 0.6.5
- bump browsertrix-base-image to 1.71.123
- bump puppeteer-core to 23.7.1
2024-11-10 19:34:38 -08:00
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Ilya Kreymer
c38b69e74b
bump browser to 1.69.162 (#681) 2024-09-05 20:21:43 -07:00
Ilya Kreymer
85a07aff18
Streaming in-place WACZ creation + CDXJ indexing (#673)
Fixes #674 

This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously WARC was written once, then read again (for CDXJ indexing),
read again (for adding to new WACZ ZIP), written to disk (into new WACZ
ZIP), read again (if upload to remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all
data is read once to either generate WACZ on disk or upload to remote.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-08-29 13:21:20 -07:00
benoit74
4cc67a3267
Update Brave image + isolated Python venv for dependencies installation (#591)
- Debian distro now requires the use of virtual environments to not mess
with dependencies installed by official apt packages
- removes tldextract update now that pywb is not in use anymore
- bump brave version to 1.68.141, for use with base image added in
https://github.com/webrecorder/browsertrix-browser-base/pull/20

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-08-14 21:12:00 -07:00
Ilya Kreymer
8d7fb1e084
1.2.8 updates: (#668)
- rewriting: update wabac.js, use getCustomRewriter(), don't truncate
POST request bodies for URLs that use a custom rewriter
- browser: disable --enable-automation, setting webdriver = true, so no
need for override
- deps: update puppeteer-core, necessary changes for latest puppeteer
2024-08-13 23:38:55 -07:00
Ilya Kreymer
bb34c5ef47 version: bump to 1.2.7
deps: bump RWP in Dockerfile to 2.1.3
2024-08-09 13:23:16 -07:00
Ilya Kreymer
88a2fbd0a0
Fix 206 response + general video handling (#646)
Refactors handling of 206 responses:
- If a 206 response is encountered, and its actually the full range,
convert to 200 and rewrite range and content-range headers to x-range
and x-orig-range. This is to support rewriting of 206 responses for DASH
manifests
- If a partial 206 response starting with `0-`, do a full async fetch
separately.
- If a partial 206 response not starting with 0-, just ignore (very
likely a duplicate picked up when handling the 0- response)
- Don't stream content-types that can be rewritten, since streaming
prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no
content-length and don't get properly rewritten.
- Overall, adds missing rewriting of DASH/HLS manifests that have no
content-length and are served as 206.
- Update to latest wabac.js which fixes rewriting of DASH manifest to
avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192
- Fixes #645
2024-07-17 13:24:25 -07:00
Ilya Kreymer
1a48b37478
bump replayweb.page to 2.1.1 (#640) 2024-07-11 16:22:37 -07:00
Ilya Kreymer
302b119908
Dependency Update / 1.2.2 (#633)
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
2024-07-03 12:55:14 -07:00
Ilya Kreymer
2ab58c0ea3
Remove DISPLAY env var from image (#625)
To avoid a strange chromium bug:
https://issues.chromium.org/issues/40209037 which causes WebGL to fail
in headless mode if DISPLAY if set. Instead, just set DISPLAY directly
for Xvfb, x11vnc and pass in `--display=` to browser if running in
headful mode.
2024-06-25 13:53:43 -07:00
Ilya Kreymer
8af8b3c19a
1.2.0 release - deps: bump wabac.js to 2.19.1, RWP for QA to 2.1.0 (#624) 2024-06-21 16:34:06 -07:00
Ilya Kreymer
ea114c6083
bump brave to 1.67.119 (#620) 2024-06-20 20:10:46 -07:00
Ilya Kreymer
3c26996f93
add yarn.lock to Docker to ensure consistent builds! (#621) 2024-06-20 18:54:05 -07:00
Ilya Kreymer
ff481855d5
add EXPOSE for ports used inside container (#612)
documents fixed internal ports used in browsertrix, via EXPOSE cmd,
addresses #558
2024-06-14 15:19:35 -07:00
Ilya Kreymer
f504effa51 Merge branch 'main' into release/1.1.4
bump to 1.2.0-beta.1
2024-06-13 19:28:25 -07:00
Ilya Kreymer
53d437570e
dependency: update RWP to 2.0.1 (#610)
for QA, use ReplayWeb.page 2.0.1 by default
2024-06-13 18:43:58 -07:00
Ilya Kreymer
e2b4cc1844
proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars (#589)
fixes #587 

The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.

Proxy settings can now be set, in order of precedence via:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)

The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)

---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-06-10 13:11:00 -07:00
Ilya Kreymer
1bd94d93a1
cleanup dockerfile + fix test (#595)
- remove obsolete line from Dockerfile
- fix pdf test to webrecorder-hosted pdf
2024-06-06 12:14:44 -07:00
Vinzenz Sinapius
068ee79288
Add group policies, limit browser access to container filesystem (#579)
Add some default policy settings to disable unneeded Brave features.
Helps a bit with #463, but Brave unfortunately doesn't provide all
mentioned settings as policy options.

Most important changes are in
`config/policies/lockdown-profilebrowser.json` it limits access to the
container filesystem especially during interactive profile browser
creation.
2024-06-05 12:46:49 -07:00
Ilya Kreymer
757e838832
base image version bump to brave 1.66.115 (#592) 2024-06-04 13:35:13 -07:00
Ilya Kreymer
e15f0c95d9
Adblock support (#534)
Now that RWP 2.0.0 with adblock support has been released
(webrecorder/replayweb.page#307), this enables adblock on the QA mode
RWP embed, to get more accurate screenshots.
Fetches the adblock.gz directly from RWP (though could also fetch it
separately from Easylist)
Updates to 1.1.0-beta.5
2024-04-12 09:47:32 -07:00
Ilya Kreymer
db613aa4ff
Revert "Make /app world-readable to better support non-root usage" (#529)
Reverts webrecorder/browsertrix-crawler#523
The chmod operation is a bit slow, and in testing don't think the CI is related to chmod :/
2024-04-03 19:48:37 -07:00
Vinzenz Sinapius
23fda685d9
Make /app world-readable to better support non-root usage (#523)
Possible fix for failing tests with non-root deployment.
2024-04-03 15:22:12 -07:00
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay/ diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
  ```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
Ilya Kreymer
1fe810b1df
Improved support for running as non-root (#503)
This PR provides improved support for running crawler as non-root,
matching the user to the uid/gid of the crawl volume.

This fixes #502 initial regression from 0.12.4, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.

However, that was not enough to fully support equivalent signal handling
/ graceful shutdown as when running with the same user. To make the
running as different user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc..)
to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)

A test has been added which runs one of the tests with a non-root
`test-crawls` directory to test the different user path. The test
(saved-state.test.js) includes sending interrupt signals and graceful
shutdown and allows testing of those features for a non-root gosu
execution.

Also bumping crawler version to 1.0.1
2024-03-21 08:16:59 -07:00
Ilya Kreymer
e8f2073a7e
Update Browser Image (#466)
- Update to Brave browser (1.62.165)
- Update page resource test to reflect latest Brave behavior
2024-02-17 22:40:12 -08:00
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Ilya Kreymer
877d9f5b44
Use new browser-based archiving mechanism instead of pywb proxy (#424)
Major refactoring of Browsertrix Crawler to native capture network traffic to WARC files
via the Chrome Debug Protocol (CDP). Allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 

Changes include:
- Recorder class for capture CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc..)
- WARC writing support via TS-based warcio.js library.
- Generates single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused
- Rule-based rewriting response support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest()
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, 
async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch
via fetch()
- Direct async fetch() capture of non-HTML URLs
- Awaiting for all requests to finish before moving on to next page, upto page timeout.
- Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use).
- removed pywb, using cdxj-indexer for --generateCDX option.
2023-11-07 21:38:50 -08:00
Ilya Kreymer
064db52272 base image: bump brave to 1.59.120
version: bump to 0.12.0-beta.2
2023-10-26 19:48:49 -07:00
Ilya Kreymer
f453dbfb56
Switch to Brave Base Image (#400)
* switch to brave:
- switch base browser to brave base image 1.58.135
- tests: add extra delay for blocking tests
- bump to 0.12.0-beta.0
2023-10-02 14:30:44 -07:00
Vinzenz Sinapius
7b6bb681c7
Update tldextract cache for pywb in build process (#383) 2023-09-15 12:22:17 -04:00
Ilya Kreymer
3c9be514d3
behavior logging tweaks, add netIdle (#381)
* behavior logging tweaks, add netIdle
* fix shouldIncludeFrame() check: was actually erroring out and never accepting any iframes!
now used not only for link extraction but also to run() behaviors
* add logging if iframe check fails
* Dockerfile: add commented out line to use local behaviors.js
* bump behaviors to 0.5.2
2023-09-14 19:48:41 -07:00