Commit graph

434 commits

Author SHA1 Message Date
Ilya Kreymer
9f310907f0 version: bump to 1.3.1 2024-09-27 14:30:56 -04:00
Ilya Kreymer
a56e13d2ff
Additional exception safety (#692)
- add additional catch() block
- wrap page.title() in timedRun() to catch/log exception if this fails
- log error in getting cookies
- hopefully fixes hard-to-repro edge case crash in openzim/zimit#376
2024-09-27 14:30:25 -04:00
Tessa Walsh
607fc84c7d
Include depth in pages JSONL files (#691)
Fixes #690
2024-09-27 10:01:20 -04:00
Ilya Kreymer
6b4ba5b430
direct fetch: when cancelling due to redirect, read full body (#688)
to avoid possible exception due to encoding. (Probably a node bug,
reported in nodejs/undici#3616)
Replace abort with cancel, which is the recommended way to cancel the
response.

fixes #687
2024-09-17 10:29:23 -07:00
Ilya Kreymer
da442573b8 version: bump to 1.3.0 2024-09-12 09:22:22 -07:00
Ilya Kreymer
eb50fdffde
exit codes: exit with error code 10 if interrupt is caused by unexpected browser exit (#686)
Differentiate from expected/predictable interrupts due to limits (exit
code 11) and unexpected interrupt due to browser crash (now exit code
10)
fixes #683
2024-09-12 09:10:23 -07:00
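The exit-code split described in eb50fdffde can be sketched roughly as below. Only the numeric codes 10 and 11 come from the commit message; the `InterruptReason` type and `exitCodeForInterrupt` function are hypothetical names used for illustration, not the crawler's actual API.

```typescript
// Hypothetical names; only the codes 10 and 11 are taken from the commit message.
type InterruptReason = "browserCrashed" | "limitReached";

const INTERRUPT_EXIT_CODES: Record<InterruptReason, number> = {
  browserCrashed: 10, // unexpected interrupt: the browser exited or crashed
  limitReached: 11, // expected interrupt: a configured limit was hit
};

function exitCodeForInterrupt(reason: InterruptReason): number {
  return INTERRUPT_EXIT_CODES[reason];
}
```

A supervising process could then treat 11 as a normal stop and 10 as a failure worth retrying or alerting on.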
Ilya Kreymer
fdb76f2c88
update current crawl size in redis on each healthcheck call (#685)
- allows Browsertrix app to adjust size, if needed, more frequently
- run checkLimits() before starting crawl, in case out of space
2024-09-10 08:28:07 -07:00
Ilya Kreymer
b42548373d
eslint: add strict await checking: (#684)
- require await / void / catch for promises
- don't allow unnecessary await
2024-09-06 16:24:18 -07:00
Ilya Kreymer
9cacae6bb6
cleanup: remove old config files from pywb (#682) 2024-09-05 20:23:34 -07:00
Ilya Kreymer
c38b69e74b
bump browser to 1.69.162 (#681) 2024-09-05 20:21:43 -07:00
Ilya Kreymer
083a9d2090 version: bump to 1.3.0-beta.1 2024-09-05 18:11:52 -07:00
Ilya Kreymer
9c9643c24f
crawler args typing (#680)
- Refactors args parsing so that `Crawler.params` is properly typed with
CLI options + additions with `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
2024-09-05 18:10:27 -07:00
Ilya Kreymer
802a416c7e
Additional direct fetch improvements (#678)
- use existing headersTimeout in undici to limit time to headers fetch
to 30 seconds, reject direct fetch if timeout is reached
- allow full page timeout for loading payload via direct fetch
- support setting global fetch() settings
- add markPageUsed() to only reuse pages when not doing direct fetch
- apply auth headers to direct fetch
- catch failed fetch and timeout errors
- support failOnFailedSeeds for direct fetch, ensure timeout is working
2024-09-05 13:28:49 -07:00
Ilya Kreymer
9d0e3423a3
WARC writer + incremental indexing fixes (#679)
- ensure WARC rollover happens only after response/request + cdx or
single record + cdx have been written
- ensure request payload is buffered for POST request indexing
- update to warcio 2.3.1 for POST request case-insensitive
'content-type' check
- recorder: remove unused 'tempdir', no longer used as warcio chooses a
temp file on its own
2024-09-05 11:10:31 -07:00
Ilya Kreymer
0d6a0b0efa
fix for direct fetch timeouts (#677)
- use '--timeout' value for direct fetch timeout, instead of fixed 30
seconds
- don't consider 'document' as essential resource regardless of mime
type, as any top-level URL is a document
- don't count non-200 responses as non-essential even if missing
content-type
fixes #676
2024-09-05 10:32:31 -07:00
Ilya Kreymer
85a07aff18
Streaming in-place WACZ creation + CDXJ indexing (#673)
Fixes #674 

This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously WARC was written once, then read again (for CDXJ indexing),
read again (for adding to new WACZ ZIP), written to disk (into new WACZ
ZIP), read again (if upload to remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all
data is read once to either generate WACZ on disk or upload to remote.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-08-29 13:21:20 -07:00
Ilya Kreymer
8934feaf70
SOCKS5 over SSH Tunnel Support (#671)
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using `--proxyServer ssh://user@host[:port]` config and
also passing an `--sshProxyPrivateKeyFile <private key file>` file param
and an optional `--sshProxyKnownHostsFile <public host key file>` file
param. The key files are expected to be mounted as volumes into the
crawler.

- Same arguments are also available for create-login-profile

- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.

- Docs are updated to include a new 'Crawling with Proxies' page in the user guide

- Tests are updated to include crawling through an SSH proxy running locally.
---------

Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
2024-08-28 18:47:24 -07:00
Tessa Walsh
39c8f48bb2
Disable behaviors entirely if --behaviors array is empty (#672)
Fixes #651
2024-08-27 13:20:19 -07:00
Ilya Kreymer
c61a03de6e ci: use docker compose instead of docker-compose 2024-08-14 21:21:35 -07:00
Henry Wilkinson
4c1da90d8f
Adds warning about crawling with basic auth (#669)
Closes https://github.com/webrecorder/browsertrix/issues/1950 over here
too

### Changes
- Adds a warning about using basic auth
- Adds a link to MDN because learning and cross referencing is fun!
2024-08-14 21:14:31 -07:00
benoit74
4cc67a3267
Update Brave image + isolated Python venv for dependencies installation (#591)
- Debian distro now requires the use of virtual environments to not mess
with dependencies installed by official apt packages
- removes tldextract update now that pywb is not in use anymore
- bump brave version to 1.68.141, for use with base image added in
https://github.com/webrecorder/browsertrix-browser-base/pull/20

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-08-14 21:12:00 -07:00
Ilya Kreymer
23fbbcb6bf version: bump to 1.3.0-beta.0 2024-08-14 20:12:48 -07:00
Ilya Kreymer
8d7fb1e084
1.2.8 updates: (#668)
- rewriting: update wabac.js, use getCustomRewriter(), don't truncate
POST request bodies for URLs that use a custom rewriter
- browser: disable --enable-automation, setting webdriver = true, so no
need for override
- deps: update puppeteer-core, necessary changes for latest puppeteer
2024-08-13 23:38:55 -07:00
Ilya Kreymer
bb34c5ef47 version: bump to 1.2.7
deps: bump RWP in Dockerfile to 2.1.3
2024-08-09 13:23:16 -07:00
Tessa Walsh
84129b1888
QA: Ensure empty string text is propagated for QA comparison (#667)
Fixes #666 

Fixes two issues with QA replay text extraction:
- ensures empty string text from QA replay is treated as empty string, instead of undefined
- avoids a divide by zero when both the original and replay text
strings have length 0.
Ensures the match is 1.0 if both crawl and QA replay text are empty strings

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-08-09 13:20:56 -07:00
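A minimal sketch of the guard described in 84129b1888, assuming the match score is a normalized edit distance (the commit message does not say which metric is used). The names `levenshtein` and `textMatch` are hypothetical; what comes from the commit is that empty strings stay empty strings rather than becoming undefined, and that two empty strings score 1.0 instead of dividing by zero.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows in memory.
function levenshtein(a: string, b: string): number {
  let prev: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Assumed metric: 1 - distance / longer length. Only `undefined` means
// "no text available"; an empty string is a valid value to compare.
function textMatch(
  crawlText: string | undefined,
  replayText: string | undefined,
): number | undefined {
  if (crawlText === undefined || replayText === undefined) {
    return undefined;
  }
  const maxLen = Math.max(crawlText.length, replayText.length);
  if (maxLen === 0) {
    return 1.0; // both empty: identical, and avoids dividing by zero
  }
  return 1 - levenshtein(crawlText, replayText) / maxLen;
}
```

Checking against `undefined` explicitly, rather than using a truthiness test, is what keeps `""` from being treated as missing text.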
Ilya Kreymer
a1ba29d878
deps: update puppeteer-core to 22.14.0 (#661) 2024-07-30 13:51:52 -07:00
Ilya Kreymer
ff81048d3a
deps: bump browsertrix-behaviors to 0.6.3 (#659)
adds support for detecting videos in shadow dom with
query-selector-shadow-dom library
2024-07-30 09:41:21 -07:00
Ilya Kreymer
9f2b9bf4e5 version: bump to 1.2.6 2024-07-29 16:41:40 -07:00
Ilya Kreymer
539730d54e
remove crc32 computation, fixes #653 (#657)
Removes crc32 computation, which was incorrect, and no longer needed
2024-07-29 16:19:44 -07:00
Ilya Kreymer
717dd138ec
Ignore invalid URLs in redirects (#658)
ensure URLs from redirects are valid - it's possible that a redirect is a
'synthetic' redirect created by the browser for http->https enforcement,
which may include an invalid URL. eg: http://<invalid url> ->
https://<invalid url>
Prevent trying to record this invalid URL
fixes #654
2024-07-29 14:51:22 -07:00
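The validity check described in 717dd138ec could look something like the sketch below, which relies on the standard `URL` constructor throwing on unparseable input. The function names are hypothetical and the crawler's real check may differ.

```typescript
// Returns true only if the string parses as an absolute URL.
function isValidUrl(candidate: string): boolean {
  try {
    new URL(candidate);
    return true;
  } catch {
    return false;
  }
}

// Hypothetical helper: keep only redirect targets that can be recorded.
function recordableRedirectTargets(urls: string[]): string[] {
  return urls.filter(isValidUrl);
}
```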
Ilya Kreymer
d620eb8e31
misc tweaks: (#650)
- logging: log behavior options that are enabled on startup, after seeds
- redis: launch local redis only if --redisStoreUrl starts with
redis://localhost or redis://127.0.0.1
- interrupt: check that crawler is not 'done' before exiting with exit
code 13, if already done, exit with 0
2024-07-23 21:50:26 -04:00
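Two of the rules from d620eb8e31 are simple enough to sketch directly. The prefixes and the exit codes 13 and 0 come from the commit message; the function names are hypothetical.

```typescript
// Launch a local redis only when the store URL points at this machine.
function shouldLaunchLocalRedis(redisStoreUrl: string): boolean {
  return (
    redisStoreUrl.startsWith("redis://localhost") ||
    redisStoreUrl.startsWith("redis://127.0.0.1")
  );
}

// On interrupt: exit 0 if the crawl had already finished, otherwise 13.
function interruptExitCode(crawlDone: boolean): number {
  return crawlDone ? 0 : 13;
}
```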
Ilya Kreymer
48716c172d docs: regenerate cli options with ./docs/gen-cli.sh 2024-07-19 18:53:50 -07:00
benoit74
1099f4f3c8
Make it clear that profile argument can be an HTTP(S) URL (#649)
Small documentation enhancement to make it clear that browser profile
can be passed as HTTP(S) URL as well.
2024-07-19 18:53:28 -07:00
Ilya Kreymer
88a2fbd0a0
Fix 206 response + general video handling (#646)
Refactors handling of 206 responses:
- If a 206 response is encountered, and it's actually the full range,
convert to 200 and rewrite range and content-range headers to x-range
and x-orig-range. This is to support rewriting of 206 responses for DASH
manifests
- If a partial 206 response starting with `0-`, do a full async fetch
separately.
- If a partial 206 response not starting with 0-, just ignore (very
likely a duplicate picked up when handling the 0- response)
- Don't stream content-types that can be rewritten, since streaming
prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no
content-length and don't get properly rewritten.
- Overall, adds missing rewriting of DASH/HLS manifests that have no
content-length and are served as 206.
- Update to latest wabac.js which fixes rewriting of DASH manifest to
avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192
- Fixes #645
2024-07-17 13:24:25 -07:00
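The three-way decision for 206 responses in 88a2fbd0a0 can be approximated as below. This sketch assumes the decision is made from the `Content-Range` response header; the commit message does not say which header the crawler inspects, and `classify206` and `RangeAction` are hypothetical names. Treating an unparseable header as "ignore" is also an assumption.

```typescript
type RangeAction = "convertTo200" | "fetchFullAsync" | "ignore";

// Expects a value such as "bytes 0-99/100" or "bytes 0-99/*".
function classify206(contentRange: string): RangeAction {
  const m = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(contentRange.trim());
  if (!m) {
    return "ignore"; // assumption: skip anything we cannot parse
  }
  const start = Number(m[1]);
  const end = Number(m[2]);
  const total = m[3] === "*" ? undefined : Number(m[3]);

  if (start === 0 && total !== undefined && end === total - 1) {
    return "convertTo200"; // the "partial" response is really the whole body
  }
  if (start === 0) {
    return "fetchFullAsync"; // first chunk only: fetch the full resource separately
  }
  return "ignore"; // later chunk, likely a duplicate of the 0- response
}
```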
Ilya Kreymer
01666b4474 deps: bump browsertrix-behaviors to 0.6.2 2024-07-11 19:53:59 -07:00
Ilya Kreymer
4fb9577d4f
don't disable extraHops when using sitemaps: (#639)
- instead, exclude sitemap-discovered page URLs from being counted toward extra hops rules, eg. if a sitemap page is not in scope, don't include it.
- if extraHops is set with sitemaps, only consider extraHops for links for pages that are in scope.
- bump version to 1.2.4
2024-07-11 19:48:43 -07:00
Ilya Kreymer
1a48b37478
bump replayweb.page to 2.1.1 (#640) 2024-07-11 16:22:37 -07:00
Tessa Walsh
fd98033268
Loosen selectors for login fields in automated profile creation (#638)
Fixes #637 

- Username will match if name attribute is one of: user, username, email
- Password will match if type is password and name attribute is one of:
pass, password

This loosens the rules sufficiently to solve the issue with the URL in
the linked issue without requiring users to pass custom CSS selectors at
this point.

It looks like we were also using XPath methods like contains whereas
puppeteer expects CSS selectors, hence the syntax change.
2024-07-11 15:55:06 -07:00
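The matching rules listed in fd98033268 translate into CSS selectors along these lines. The attribute names and values come from the commit message; the exact selector strings used by the crawler are not shown there, so the ones built here are an assumption.

```typescript
const USERNAME_NAMES = ["user", "username", "email"];
const PASSWORD_NAMES = ["pass", "password"];

// Username: any input whose name attribute is one of the accepted values.
const usernameSelector = USERNAME_NAMES.map((n) => `input[name='${n}']`).join(", ");

// Password: must be type=password and have one of the accepted names.
const passwordSelector = PASSWORD_NAMES.map(
  (n) => `input[type='password'][name='${n}']`,
).join(", ");
```

Comma-joined selectors like these can be passed to anything that accepts a CSS selector list, which fits the commit's note that puppeteer expects CSS selectors rather than XPath.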
Ilya Kreymer
151115d46c Fix Pending Request causing timeout (#636)
Don't wait for requests that have not been intercepted (`intercepting` is not set) and are not loaded asynchronously (`asyncLoading` is not set) in awaitPageResources() when page is done. Occasionally, it seems some pending requests that only get added via `Network.requestWillBeSent` but never receive a finished/failed message may persist in the pending request list, and will now be discarded.
(Large requests that have a streaming response body will have either `intercepting` or `asyncLoading` set and will not be affected)
2024-07-09 11:02:41 -07:00
Ilya Kreymer
bfe42ad31e
Improved handling of pages that redirect back to the same page. (#635)
In the case of a page https://example.com/ which results in a redirect
chain: 307 https://example.com/ -> 307 https://auth.example.com/ -> 200
https://example.com/
- Includes status in dupe checks, ensures that `307
https://example.com/` and `200 https://example.com/` are both recorded
to WARC
- When setting page timestamp, update the timestamp to the lower status
code if above 300, eg. first setting to `307 https://example.com/` and
then to `200 https://example.com/`

Fixes #634
2024-07-08 10:51:37 -07:00
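Including the status in the dupe check, as bfe42ad31e describes, can be sketched with a set keyed on both status and URL. The `SeenResponses` class is a hypothetical illustration; the crawler's real dedup state may live elsewhere (for example in redis) and use a different key format.

```typescript
class SeenResponses {
  private seen = new Set<string>();

  // Returns true the first time a (status, url) pair is seen, false afterwards.
  shouldRecord(status: number, url: string): boolean {
    const key = `${status} ${url}`;
    if (this.seen.has(key)) {
      return false;
    }
    this.seen.add(key);
    return true;
  }
}
```

With a URL-only key, the final `200 https://example.com/` in the redirect chain would be dropped as a duplicate of the initial `307 https://example.com/`; keying on both keeps the two records distinct.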
Ilya Kreymer
320c041235 version: bump to 1.2.3 2024-07-08 10:50:51 -07:00
Ilya Kreymer
302b119908
Dependency Update / 1.2.2 (#633)
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
2024-07-03 12:55:14 -07:00
Ilya Kreymer
a3396adba2
tests: reduce logging (#596)
remove logging of crawl logs by default for clearer output from tests, only log in case of error.
2024-06-26 13:05:13 -07:00
Ilya Kreymer
4495532606
Always download PDF + non HTML page cleanup + enterprise policy cleanup (#629)
Adds enterprise policy to always download PDF and sets download dir to
/dev/null
Moves policies to chromium.json and brave.json for clarity
Further cleanup of non-HTML loading path:
- sets downloadResponse when page load is aborted but response is
actually download
- sets firstResponse when first response finishes, but page doesn't
fully load
- logs that non-HTML pages skip all post-crawl behaviors in one place
- move page extra delay to separate awaitPageExtraDelay() function, applied for all pages (while post-load delay only applied to HTML pages)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-06-26 09:16:24 -07:00
Ilya Kreymer
6a9ca3df54
Don't filter saving redirect if no response body. (#628)
It's possible for a redirect, especially a browser-generated one to have
headers and no body (eg. Brave removing tracking url query). Don't
filter these redirects out from being written to WARC, just set payload to empty
buffer.

fixes #627 where Brave-generated redirect response was not stored.
2024-06-25 15:48:22 -07:00
Ilya Kreymer
2ab58c0ea3
Remove DISPLAY env var from image (#625)
To avoid a strange chromium bug:
https://issues.chromium.org/issues/40209037 which causes WebGL to fail
in headless mode if DISPLAY is set. Instead, just set DISPLAY directly
for Xvfb, x11vnc and pass in `--display=` to browser if running in
headful mode.
2024-06-25 13:53:43 -07:00
Ilya Kreymer
92ad800fe4
browser policies: disable restoring any tabs on startup + set new tab URL to about:blank (#626)
addresses memory issues with profiles as they accumulate tabs! fixes
webrecorder/browsertrix#1880
2024-06-25 13:38:52 -07:00
Ilya Kreymer
e65bf21135 version: bump to 1.2.1 2024-06-25 13:28:59 -07:00
Ilya Kreymer
8af8b3c19a
1.2.0 release - deps: bump wabac.js to 2.19.1, RWP for QA to 2.1.0 (#624) 2024-06-21 16:34:06 -07:00
Ilya Kreymer
65a86352fd
Updated rewriting for YouTube + dependency update (#623)
- update to wabac.js 2.19.0 to use its new html rewriting support
- update browsertrix-behaviors to 0.6.1 to fix instagram behavior
- bump to 1.2.0-beta.3
2024-06-21 15:03:53 -07:00