Commit graph

546 commits

Author SHA1 Message Date
Ilya Kreymer
e5bab8e7c8
various edge-case loading optimizations: (#709)
- rework 'should stream' logic (see the sketch after this list):
  * ensure 206 responses (or any response) greater than 25M are streamed
  * responses between 5M and 25M are read into memory if text/css/js, as they may be rewritten
  * responses <5M are read into memory
  * responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error responses may lack a content-length but are otherwise small
- likely fix for issues in #706
- if too many range requests for the same URL are being made, try
skipping/failing right away to reduce load
- assume the main browser context is used, not just for service workers;
always enable it
- check for false-positive 'net-aborted' errors that may actually be ok for
media, as well as documents
- improve logging
- interrupt any pending requests (that may be loading via browser
context) after page timeout, log dropped requests
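A minimal TypeScript sketch of the 'should stream' thresholds listed above; the function name, mime check, and structure are assumptions for illustration, not the crawler's actual code.

```ts
// Illustrative only: the streaming decision described above.
const MAX_BUFFER_SIZE = 25_000_000; // ~25M: always stream above this
const REWRITE_BUFFER_SIZE = 5_000_000; // ~5M: always buffer below this

function shouldStream(
  status: number,
  contentLength: number | null,
  mime: string,
): boolean {
  if (contentLength === null) {
    // unknown size: stream 2xx responses, buffer non-2xx (assumed small)
    return status >= 200 && status < 300;
  }
  if (contentLength > MAX_BUFFER_SIZE) {
    // >25M (including 206 responses): always stream
    return true;
  }
  if (contentLength > REWRITE_BUFFER_SIZE) {
    // 5M-25M: buffer only if it may need rewriting (text/css/js)
    return !/text|css|javascript/i.test(mime);
  }
  // <5M: read into memory
  return false;
}
```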
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-10-31 14:06:17 -07:00
Ilya Kreymer
5c00bca2b4
tests: use old.webrecorder.net for testing (#710)
replace webrecorder.net -> old.webrecorder.net to fix tests relying on the
old website for now
2024-10-31 13:24:58 -04:00
Ilya Kreymer
181d9b824c
deps: update to latest wabac (#708)
bump version to 1.3.4
2024-10-26 11:02:32 -07:00
Ilya Kreymer
0d39ea3590
dep: update to wabac.js 2.20 (#704)
Update imports for new TS-based wabac.js
2024-10-16 21:02:04 -07:00
Ilya Kreymer
a45b85dd74 version: bump to 1.3.3 2024-10-11 00:12:23 -07:00
Ilya Kreymer
652cf9cfa6
link extraction promise cleanup: (#701)
- catch frame.evaluate() directly and log errors there to avoid any
possibility of an exception being propagated before wrapping in timedRun()
- also add clearTimeout() to timedRun()
- possibly fixes openzim/zimit#376
2024-10-11 00:11:24 -07:00
Ilya Kreymer
157ac34d8c
fix typo in QA exclude check, which resulted in all URLs being excluded (#697)
- ensure exclusions now work as expected in replay mode
- add test for using --exclude with replay
2024-10-07 17:25:36 -07:00
Ilya Kreymer
282c47ad66
bump puppeteer core to 23.5.1 (#700)
includes possible improvements for detecting crashes with wrong stack
trace (see: puppeteer/puppeteer#13056)
2024-10-07 16:39:48 -07:00
Tessa Walsh
e05d50d637
Add documentation for crawl collections (#695)
Fixes #675
2024-10-05 11:51:32 -07:00
Ilya Kreymer
d497a424fc
tests: disable blockrules youtube tests in CI (#698)
due to YouTube being blocked, disable the test involving YouTube embeds when
running in CI for now
2024-10-04 17:37:13 -07:00
Ilya Kreymer
356b3f8d10 bump to 1.3.2 2024-09-30 15:51:13 -07:00
Ilya Kreymer
728f00219a
ensure extraHops also apply to maxDepth (#694)
- if extraHops is set, the crawler should visit pages beyond maxDepth
- currently, pages are treated as out of scope at the depth limit even if extraHops is
set
- adjust isInScope and isAtMaxDepth to account for extraHops
- tests: update extra hops test to test extraHops beyond depth
- fixes #693
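A hedged sketch of the depth check after this change; the function and parameter names are assumptions, not the actual isInScope/isAtMaxDepth code.

```ts
// Hypothetical depth check: extraHops extends the effective depth limit,
// so a page found extraHops links beyond an in-scope page may still be
// crawled even if it is past maxDepth.
function isAtMaxDepth(depth: number, maxDepth: number, extraHops: number): boolean {
  return depth >= maxDepth + extraHops;
}
```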
2024-09-30 15:46:34 -07:00
Ilya Kreymer
9f310907f0 version: bump to 1.3.1 2024-09-27 14:30:56 -04:00
Ilya Kreymer
a56e13d2ff
Additional exception safety (#692)
- add additional catch() block
- wrap page.title() in timedRun() to catch/log exception if this fails
- log error in getting cookies
- hopefully fixes hard-to-repro edge case crash in openzim/zimit#376
2024-09-27 14:30:25 -04:00
Tessa Walsh
607fc84c7d
Include depth in pages JSONL files (#691)
Fixes #690
2024-09-27 10:01:20 -04:00
Ilya Kreymer
6b4ba5b430
direct fetch: when cancelling due to redirect, read full body (#688)
to avoid a possible exception due to encoding. (Probably a Node bug,
reported in nodejs/undici#3616)
Replace abort with cancel, which is the recommended way to cancel the
response.

fixes #687
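A rough sketch of the cancel-instead-of-abort pattern mentioned above, using the standard fetch()/ReadableStream API; this is illustrative, not the crawler's actual direct-fetch code.

```ts
// Illustrative: when a direct fetch turns out to be a redirect, discard
// the response via body.cancel() rather than aborting the request.
async function directFetch(url: string): Promise<Response | null> {
  const resp = await fetch(url, { redirect: "manual" });
  if (resp.status >= 300 && resp.status < 400) {
    // cancel() is the recommended way to cancel the response body;
    // aborting could surface an encoding-related exception (nodejs/undici#3616)
    await resp.body?.cancel();
    return null;
  }
  return resp;
}
```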
2024-09-17 10:29:23 -07:00
Ilya Kreymer
da442573b8 version: bump to 1.3.0 2024-09-12 09:22:22 -07:00
Ilya Kreymer
eb50fdffde
exit codes: exit with error code 10 if interrupt is caused by unexpected browser exit (#686)
Differentiate between expected/predictable interrupts due to limits (exit
code 11) and an unexpected interrupt due to a browser crash (now exit code
10)
fixes #683
2024-09-12 09:10:23 -07:00
Ilya Kreymer
fdb76f2c88
update current crawl size in redis on each healthcheck call (#685)
- allows Browsertrix app to adjust size, if needed, more frequently
- run checkLimits() before starting crawl, in case out of space
2024-09-10 08:28:07 -07:00
Ilya Kreymer
b42548373d
eslint: add strict await checking: (#684)
- require await / void / catch for promises
- don't allow unnecessary await
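One way to express these constraints with typescript-eslint, shown as a hedged sketch; the repository's actual ESLint config may use different rules or options.

```ts
// Sketch of an ESLint rules fragment (typescript-eslint); illustrative only.
export default {
  rules: {
    // require every promise to be awaited, voided, or given a .catch()
    "@typescript-eslint/no-floating-promises": "error",
    // disallow awaiting values that are not thenable (unnecessary await)
    "@typescript-eslint/await-thenable": "error",
  },
};
```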
2024-09-06 16:24:18 -07:00
Ilya Kreymer
9cacae6bb6
cleanup: remove old config files from pywb (#682) 2024-09-05 20:23:34 -07:00
Ilya Kreymer
c38b69e74b
bump browser to 1.69.162 (#681) 2024-09-05 20:21:43 -07:00
Ilya Kreymer
083a9d2090 version: bump to 1.3.0-beta.1 2024-09-05 18:11:52 -07:00
Ilya Kreymer
9c9643c24f
crawler args typing (#680)
- Refactors args parsing so that `Crawler.params` is properly typed with
CLI options + additions via the `CrawlerArgs` type (see the sketch below).
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
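A simplified sketch of typed yargs parsing along the lines described above; the `CrawlerArgs` fields and option names here are examples, not the full set of CLI options.

```ts
import yargs from "yargs";

// Hypothetical, trimmed-down CrawlerArgs for illustration.
interface CrawlerArgs {
  url: string[];
  workers: number;
}

function parseArgs(argv: string[]): CrawlerArgs {
  const parsed = yargs(argv)
    .option("url", { type: "string", array: true, default: [] as string[] })
    .option("workers", { type: "number", default: 1 })
    .parseSync();
  // validation is still done separately; cast once options are checked
  return parsed as unknown as CrawlerArgs;
}
```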
2024-09-05 18:10:27 -07:00
Ilya Kreymer
802a416c7e
Additional direct fetch improvements (#678)
- use the existing headersTimeout in undici to limit the time to fetch headers
to 30 seconds, and reject the direct fetch if the timeout is reached (see the sketch below)
- allow full page timeout for loading payload via direct fetch
- support setting global fetch() settings
- add markPageUsed() to only reuse pages when not doing direct fetch
- apply auth headers to direct fetch
- catch failed fetch and timeout errors
- support failOnFailedSeeds for direct fetch, ensure timeout is working
2024-09-05 13:28:49 -07:00
Ilya Kreymer
9d0e3423a3
WARC writer + incremental indexing fixes (#679)
- ensure WARC rollover happens only after response/request + cdx or
single record + cdx have been written
- ensure request payload is buffered for POST request indexing
- update to warcio 2.3.1 for POST request case-insensitive
'content-type' check
- recorder: remove unused 'tempdir', no longer used as warcio chooses a
temp file on its own
2024-09-05 11:10:31 -07:00
Ilya Kreymer
0d6a0b0efa
fix for direct fetch timeouts (#677)
- use the '--timeout' value for the direct fetch timeout, instead of a fixed 30
seconds
- don't consider 'document' an essential resource regardless of mime
type, as any top-level URL is a document
- don't count non-200 responses as non-essential even if missing
content-type

fixes #676
2024-09-05 10:32:31 -07:00
Ilya Kreymer
85a07aff18
Streaming in-place WACZ creation + CDXJ indexing (#673)
Fixes #674 

This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- per-WARC CDXJ indices are first written to the 'warc-cdx' directory, then merged using the Linux 'sort' command (see the sketch below), and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously each WARC was written once, then read again (for CDXJ indexing),
read again (for adding to the new WACZ ZIP), written to disk (into the new WACZ
ZIP), and read again (if uploading to a remote endpoint). Now, WARCs are written
once along with the per-WARC CDXJ; only the CDXJ is reread, sorted, and merged on disk, and all
data is read once to either generate the WACZ on disk or upload it to a remote endpoint.
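A hypothetical sketch of the per-WARC CDXJ merge step using the system `sort` command; the file layout and function name are assumptions, not the crawler's actual implementation.

```ts
import { spawn } from "node:child_process";
import { createWriteStream } from "node:fs";

// Merge per-WARC .cdxj files from the 'warc-cdx' directory into a single
// sorted index by piping them through the system `sort` command.
function mergeCDXJ(cdxjFiles: string[], outPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const out = createWriteStream(outPath);
    const proc = spawn("sort", cdxjFiles, {
      env: { ...process.env, LC_ALL: "C" }, // byte-order sort for CDXJ keys
    });
    proc.stdout.pipe(out);
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`sort exited with code ${code}`)),
    );
  });
}
```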

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-08-29 13:21:20 -07:00
Ilya Kreymer
8934feaf70
SOCKS5 over SSH Tunnel Support (#671)
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using the `--proxyServer ssh://user@host[:port]` option and
also passing an `--sshProxyPrivateKeyFile <private key file>` file param
and an optional `--sshProxyKnownHostsFile <public host key file>` file
param. The key files are expected to be mounted as volumes into the
crawler.

- Same arguments are also available for create-login-profile

- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding (see the
sketch after this list).

- Docs are updated to include a new 'Crawling with Proxies' page in the user guide

- Tests are updated to include crawling through an SSH proxy running locally.
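An illustrative sketch of how a local SOCKS5 proxy over SSH can be established with autossh, roughly matching the description above; the local port and exact flags are assumptions, not the crawler's actual invocation.

```ts
import { spawn, type ChildProcess } from "node:child_process";

// Open an SSH tunnel that exposes a local SOCKS5 proxy via dynamic
// port forwarding (-D); autossh keeps the connection alive.
function startSSHProxy(
  user: string,
  host: string,
  privateKeyFile: string,
  port = 22,
): ChildProcess {
  const localSocksPort = 9722; // hypothetical local SOCKS5 port
  return spawn("autossh", [
    "-M", "0",                           // disable autossh's monitoring port
    "-N",                                // no remote command, tunnel only
    "-D", `localhost:${localSocksPort}`, // SOCKS5 dynamic port forwarding
    "-i", privateKeyFile,
    "-p", String(port),
    `${user}@${host}`,
  ]);
}
```

The crawler would then point its proxy settings at the resulting local SOCKS5 endpoint.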
---------

Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
2024-08-28 18:47:24 -07:00
Tessa Walsh
39c8f48bb2
Disable behaviors entirely if --behaviors array is empty (#672)
Fixes #651
2024-08-27 13:20:19 -07:00
Ilya Kreymer
c61a03de6e ci: use docker compose instead of docker-compose 2024-08-14 21:21:35 -07:00
Henry Wilkinson
4c1da90d8f
Adds warning about crawling with basic auth (#669)
Closes https://github.com/webrecorder/browsertrix/issues/1950 over here
too

### Changes
- Adds a warning about using basic auth
- Adds a link to MDN because learning and cross referencing is fun!
2024-08-14 21:14:31 -07:00
benoit74
4cc67a3267
Update Brave image + isolated Python venv for dependencies installation (#591)
- the Debian distro now requires the use of virtual environments to avoid interfering
with dependencies installed by official apt packages
- removes tldextract update now that pywb is not in use anymore
- bump brave version to 1.68.141, for use with base image added in
https://github.com/webrecorder/browsertrix-browser-base/pull/20

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-08-14 21:12:00 -07:00
Ilya Kreymer
23fbbcb6bf version: bump to 1.3.0-beta.0 2024-08-14 20:12:48 -07:00
Ilya Kreymer
8d7fb1e084
1.2.8 updates: (#668)
- rewriting: update wabac.js, use getCustomRewriter(), don't truncate
POST request bodies for URLs that use a custom rewriter
- browser: disable --enable-automation, which was setting webdriver = true, so no
need for the override
- deps: update puppeteer-core, necessary changes for latest puppeteer
2024-08-13 23:38:55 -07:00
Ilya Kreymer
bb34c5ef47 version: bump to 1.2.7
deps: bump RWP in Dockerfile to 2.1.3
2024-08-09 13:23:16 -07:00
Tessa Walsh
84129b1888
QA: Ensure empty string text is propagated for QA comparison (#667)
Fixes #666 

Fixes two issues with QA replay text extraction:
- ensures empty string text from QA replay is treated as an empty string, instead of undefined
- avoids a divide by zero when both the original and replay text
strings are empty
Ensures the match is 1.0 if both the crawl and QA replay text are empty strings
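A minimal sketch of the guard described above; `textMatchScore` and the one-side-empty handling are hypothetical, only the both-empty = 1.0 behavior comes from this commit.

```ts
// Hypothetical comparison wrapper around some similarity measure.
function textMatchScore(
  crawlText: string,
  replayText: string,
  similarity: (a: string, b: string) => number,
): number {
  if (crawlText.length === 0 && replayText.length === 0) {
    return 1.0; // both empty: treat as a perfect match (avoids divide by zero)
  }
  if (crawlText.length === 0 || replayText.length === 0) {
    return 0.0; // one side empty (hypothetical handling)
  }
  return similarity(crawlText, replayText);
}
```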

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-08-09 13:20:56 -07:00
Ilya Kreymer
a1ba29d878
deps: update puppeteer-core to 22.14.0 (#661) 2024-07-30 13:51:52 -07:00
Ilya Kreymer
ff81048d3a
deps: bump browsertrix-behaviors to 0.6.3 (#659)
adds support for detecting videos in the shadow DOM with the
query-selector-shadow-dom library
2024-07-30 09:41:21 -07:00
Ilya Kreymer
9f2b9bf4e5 version: bump to 1.2.6 2024-07-29 16:41:40 -07:00
Ilya Kreymer
539730d54e
remove crc32 computation, fixes #653 (#657)
Removes crc32 computation, which was incorrect, and no longer needed
2024-07-29 16:19:44 -07:00
Ilya Kreymer
717dd138ec
Ignore invalid URLs in redirects (#658)
ensure URLs from redirects are valid - it's possible that a redirect is a
'synthetic' redirect created by the browser for http->https enforcement,
which may include an invalid URL, eg: http://<invalid url> ->
https://<invalid url>.
Prevent trying to record this invalid URL
fixes #654
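A minimal sketch of validating a redirect target before recording it, per the description above; not the recorder's actual code.

```ts
// Returns false for invalid URLs, e.g. a synthetic http->https redirect
// pointing at a malformed address.
function isValidRedirectTarget(url: string): boolean {
  try {
    new URL(url);
    return true;
  } catch {
    return false;
  }
}
```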
2024-07-29 14:51:22 -07:00
Ilya Kreymer
d620eb8e31
misc tweaks: (#650)
- logging: log behavior options that are enabled on startup, after seeds
- redis: launch local redis only if --redisStoreUrl starts with
redis://localhost or redis://127.0.0.1
- interrupt: check that the crawler is not 'done' before exiting with exit
code 13; if already done, exit with 0
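A small sketch of the local-redis condition described above; illustrative only.

```ts
// Launch the bundled local redis only when the store URL points at localhost.
function shouldLaunchLocalRedis(redisStoreUrl: string): boolean {
  return (
    redisStoreUrl.startsWith("redis://localhost") ||
    redisStoreUrl.startsWith("redis://127.0.0.1")
  );
}
```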
2024-07-23 21:50:26 -04:00
Ilya Kreymer
48716c172d docs: regenerate cli options with ./docs/gen-cli.sh 2024-07-19 18:53:50 -07:00
benoit74
1099f4f3c8
Make it clear that profile argument can be an HTTP(S) URL (#649)
Small documentation enhancement to make it clear that browser profile
can be passed as HTTP(S) URL as well.
2024-07-19 18:53:28 -07:00
Ilya Kreymer
88a2fbd0a0
Fix 206 response + general video handling (#646)
Refactors handling of 206 responses:
- If a 206 response is encountered and it's actually the full range,
convert it to 200 and rewrite the range and content-range headers to x-range
and x-orig-range. This is to support rewriting of 206 responses for DASH
manifests.
- If a partial 206 response starts with `0-`, do a full async fetch
separately.
- If a partial 206 response does not start with `0-`, just ignore it (very
likely a duplicate picked up when handling the `0-` response).
- Don't stream content-types that can be rewritten, since streaming
prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no
content-length and don't get properly rewritten.
- Overall, adds missing rewriting of DASH/HLS manifests that have no
content-length and are served as 206.
- Update to latest wabac.js which fixes rewriting of DASH manifest to
avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192
- Fixes #645
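A hypothetical decision outline for the 206 handling described above; content-range parsing is simplified and this is not the recorder's actual logic.

```ts
type RangeAction = "convert-to-200" | "full-async-fetch" | "ignore";

// Decide what to do with a 206 response based on its content-range header,
// e.g. "bytes 0-999/5000".
function handle206(contentRange: string): RangeAction {
  const m = contentRange.match(/bytes (\d+)-(\d+)\/(\d+)/);
  if (!m) {
    return "ignore";
  }
  const [start, end, total] = m.slice(1).map(Number);
  if (start === 0 && end + 1 === total) {
    // actually the full payload: convert to 200 and move range/content-range
    // to x-range / x-orig-range so the body can still be rewritten
    return "convert-to-200";
  }
  if (start === 0) {
    // partial range starting at 0: fetch the full resource asynchronously
    return "full-async-fetch";
  }
  // partial not starting at 0: likely a duplicate of the 0- request
  return "ignore";
}
```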
2024-07-17 13:24:25 -07:00
Ilya Kreymer
01666b4474 deps: bump browsertrix-behaviors to 0.6.2 2024-07-11 19:53:59 -07:00
Ilya Kreymer
4fb9577d4f
don't disable extraHops when using sitemaps: (#639)
- instead, exclude sitemap-discovered page URLs from being counted toward extra hops rules, eg. if a sitemap page is not in scope, don't include it.
- if extraHops is set with sitemaps, only consider extraHops for links from pages that are in scope.
- bump version to 1.2.4
2024-07-11 19:48:43 -07:00
Ilya Kreymer
1a48b37478
bump replayweb.page to 2.1.1 (#640) 2024-07-11 16:22:37 -07:00
Tessa Walsh
fd98033268
Loosen selectors for login fields in automated profile creation (#638)
Fixes #637 

- Username will match if the name attribute is one of: user, username, email
- Password will match if the type is password and the name attribute is one of:
pass, password (see the selector sketch at the end of this message)

This loosens the rules sufficiently to solve the issue with the URL in
the linked issue without requiring users to pass custom CSS selectors at
this point.

It looks like we were also using XPath methods like contains() whereas
Puppeteer expects CSS selectors, hence the syntax change.
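A hedged sketch of CSS selectors matching the loosened rules above; the exact selectors used in the profile-creation code may differ.

```ts
// Username field: name is one of user, username, email.
const usernameSelector = [
  "input[name='user']",
  "input[name='username']",
  "input[name='email']",
].join(",");

// Password field: type is password and name is one of pass, password.
const passwordSelector = [
  "input[type='password'][name='pass']",
  "input[type='password'][name='password']",
].join(",");
```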
2024-07-11 15:55:06 -07:00