- if extraHops is set, the crawler should visit pages beyond maxDepth
- currently, pages are marked out of scope at the depth limit even if
extraHops is set
- adjust isInScope and isAtMaxDepth to account for extraHops (see the
sketch after this list)
- tests: update extra hops test to test extraHops beyond depth
- fixes #693
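A minimal sketch of the adjusted depth check, assuming depth, maxDepth, and extraHops are tracked per URL as described above; names are illustrative, not the crawler's exact implementation:

```ts
// Sketch: a page is only at the max depth once it exceeds
// maxDepth *plus* any configured extraHops
function isAtMaxDepth(depth: number, maxDepth: number, extraHops: number): boolean {
  return depth >= maxDepth + extraHops;
}
```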
- add additional catch() block
- wrap page.title() in timedRun() to catch/log an exception if it fails
(see the sketch below)
- log error when getting cookies fails
- hopefully fixes hard-to-repro edge case crash in openzim/zimit#376
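A hedged sketch of wrapping page.title() so a hung or failing call is caught and logged instead of crashing the worker; timedRun() here is a stand-in for the crawler's own helper, not its actual implementation:

```ts
import { Page } from "puppeteer-core";

// stand-in helper: race a promise against a timeout
async function timedRun<T>(
  promise: Promise<T>,
  secs: number,
  msg: string,
): Promise<T | undefined> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<undefined>((resolve) => {
    timer = setTimeout(() => {
      console.warn(msg);
      resolve(undefined);
    }, secs * 1000);
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    clearTimeout(timer);
  }
}

async function safeTitle(page: Page): Promise<string> {
  try {
    return (await timedRun(page.title(), 30, "Timed out getting page title")) ?? "";
  } catch (e) {
    console.warn("Failed to get page title", e);
    return "";
  }
}
```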
Replace abort with cancel, which is the recommended way to cancel the
response, and avoids a possible exception due to encoding. (Probably a
Node bug, reported in nodejs/undici#3616.)
fixes #687
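A sketch of the abort -> cancel change, assuming a standard fetch() Response: cancelling the body stream discards the response without triggering the encoding-related exception that AbortController.abort() could raise:

```ts
// discard a response we don't need, per the fix above
async function discardResponse(resp: Response): Promise<void> {
  if (resp.body) {
    await resp.body.cancel();
  }
}
```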
Differentiate between expected/predictable interrupts due to limits (exit
code 11) and unexpected interrupts due to a browser crash (now exit code
10).
fixes #683
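A hedged sketch of the exit-code split; codes 10 and 11 come from the description above, while the enum and helper names are hypothetical:

```ts
enum InterruptReason {
  Limit,        // expected: a size/time/page limit was reached
  BrowserCrash, // unexpected: the browser itself crashed
}

function interruptExitCode(reason: InterruptReason): number {
  return reason === InterruptReason.Limit ? 11 : 10;
}
```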
- Refactors args parsing so that `Crawler.params` is properly typed with
CLI options + additions via the `CrawlerArgs` type (see the sketch after
this list).
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
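A hypothetical sketch of the typed CLI parsing with yargs; `CrawlerArgs` here is a tiny illustrative subset, and (as noted above) validation still happens without typing due to yargs limitations:

```ts
import yargs from "yargs";

// illustrative subset of the crawler's options
type CrawlerArgs = {
  seeds: string[];
  maxDepth: number;
  extraHops: number;
};

function parseArgs(argv: string[]): CrawlerArgs {
  return yargs(argv)
    .option("seeds", { type: "array", default: [] as string[] })
    .option("maxDepth", { type: "number", default: -1 })
    .option("extraHops", { type: "number", default: 0 })
    .parseSync() as unknown as CrawlerArgs;
}
```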
- use the existing headersTimeout option in undici to limit time to headers
fetch to 30 seconds, rejecting the direct fetch if the timeout is reached
(see the sketch after this list)
- allow full page timeout for loading payload via direct fetch
- support setting global fetch() settings
- add markPageUsed() to only reuse pages when not doing direct fetch
- apply auth headers to direct fetch
- catch failed fetch and timeout errors
- support failOnFailedSeeds for direct fetch, ensure timeout is working
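A sketch using undici's documented headersTimeout option: fail the direct fetch if headers don't arrive within 30 seconds, while the body download is governed by the (longer) page timeout. Names other than the undici API are illustrative:

```ts
import { Agent, fetch } from "undici";

// dispatcher limiting time-to-headers to 30s
const directFetchDispatcher = new Agent({ headersTimeout: 30_000 });

async function directFetch(url: string, authHeaders: Record<string, string>) {
  // throws on headers timeout or network failure; callers catch and log
  return fetch(url, { headers: authHeaders, dispatcher: directFetchDispatcher });
}
```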
- ensure WARC rollover happens only after response/request + cdx or
single record + cdx have been written
- ensure request payload is buffered for POST request indexing
- update to warcio 2.3.1 for POST request case-insensitive
'content-type' check
- recorder: remove unused 'tempdir', no longer used as warcio chooses a
temp file on its own
- use '--timeout' value for direct fetch timeout, instead of fixed 30
seconds
- don't consider 'document' an essential resource regardless of mime
type, as any top-level URL is a document
- don't count non-200 responses as non-essential even if missing
content-type
fixes #676
Fixes #674
This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to a 'warc-cdx' directory, then merged using the Linux 'sort' command (see the sketch below), and compressed to ZipNum if >50K entries (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously, the WARC was written once, then read again (for CDXJ
indexing), read again (for adding to the new WACZ ZIP), written to disk
(into the new WACZ ZIP), and read again (if uploading to a remote
endpoint). Now, WARCs are written once, along with the per-WARC CDXJ;
only the CDXJ is reread, sorted, and merged on disk, and all data is read
once to either generate the WACZ on disk or upload it to a remote endpoint.
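A sketch of the on-disk CDXJ merge described above, assuming one .cdxj file per WARC in a 'warc-cdx' directory; paths and shell invocation are illustrative:

```ts
import { execFileSync } from "node:child_process";

function mergeCDXJ(cdxDir: string, outPath: string): void {
  // CDXJ lines begin with "<surt> <timestamp>", so a plain lexicographic
  // sort over all per-WARC files yields the merged, sorted index
  execFileSync("sh", ["-c", `sort ${cdxDir}/*.cdxj > ${outPath}`]);
}
```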
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using `--proxyServer ssh://user@host[:port]` and also
passing an `--sshProxyPrivateKeyFile <private key file>` param and an
optional `--sshProxyKnownHostsFile <public host key file>` param (see the
sketch after this list). The key files are expected to be mounted as
volumes into the crawler.
- Same arguments are also available for create-login-profile
- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.
- Docs are updated to include a new 'Crawling with Proxies' page in the user guide
- Tests are updated to include crawling through an SSH proxy running locally.
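A hypothetical sketch of turning the ssh:// proxy config into an autossh invocation; the exact flags and the local SOCKS port are assumptions, not the crawler's actual command line:

```ts
function buildAutosshArgs(
  proxyUrl: URL,           // parsed from --proxyServer ssh://user@host[:port]
  privateKeyFile: string,  // from --sshProxyPrivateKeyFile
  localSocksPort = 9722,   // assumed local port
): string[] {
  return [
    "-M", "0",                           // disable autossh's monitor port
    "-N",                                // no remote command, tunnel only
    "-D", `localhost:${localSocksPort}`, // local SOCKS5 listener
    "-i", privateKeyFile,
    "-p", proxyUrl.port || "22",
    `${proxyUrl.username || "root"}@${proxyUrl.hostname}`,
  ];
}
```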
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
- Debian distro now requires the use of virtual environments to avoid
interfering with dependencies installed by official apt packages
- removes tldextract update now that pywb is no longer in use
- bump brave version to 1.68.141, for use with base image added in
https://github.com/webrecorder/browsertrix-browser-base/pull/20
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- rewriting: update wabac.js, use getCustomRewriter(), don't truncate
POST request bodies for URLs that use a custom rewriter
- browser: disable --enable-automation, setting webdriver = true, so no
need for override
- deps: update puppeteer-core, necessary changes for latest puppeteer
Fixes #666
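An illustrative sketch of dropping Chromium's --enable-automation default so navigator.webdriver stays false without a page-script override; the executablePath is an assumption:

```ts
import puppeteer from "puppeteer-core";

async function launchBrowser() {
  return puppeteer.launch({
    executablePath: "/usr/bin/brave-browser", // assumed Brave path
    ignoreDefaultArgs: ["--enable-automation"],
  });
}
```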
Fixes two issues with QA replay text extraction:
- ensures empty string text from QA replay is treated as an empty string,
instead of undefined
- avoids a divide by zero when both original and replay text strings are
empty
Ensures the match is 1.0 if both crawl and QA replay text are empty strings
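A minimal sketch of the guarded match computation; the Levenshtein-based score is an assumption, not necessarily the crawler's exact metric:

```ts
// classic dynamic-programming edit distance
function levDist(a: string, b: string): number {
  const d = Array.from({ length: a.length + 1 }, (_, i) => {
    const row = new Array<number>(b.length + 1).fill(0);
    row[0] = i;
    return row;
  });
  for (let j = 0; j <= b.length; j++) d[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      d[i][j] = Math.min(
        d[i - 1][j] + 1,
        d[i][j - 1] + 1,
        d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1),
      );
    }
  }
  return d[a.length][b.length];
}

function textMatch(crawlText: string, replayText: string): number {
  // both empty: perfect match, and avoids dividing by zero below
  if (!crawlText.length && !replayText.length) return 1.0;
  const maxLen = Math.max(crawlText.length, replayText.length);
  return 1.0 - levDist(crawlText, replayText) / maxLen;
}
```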
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Ensure URLs from redirects are valid: it's possible that a redirect is a
'synthetic' redirect created by the browser for http->https enforcement,
which may include an invalid URL, e.g. http://<invalid url> ->
https://<invalid url>.
Prevent trying to record this invalid URL.
fixes #654
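A sketch of validating a redirect target before recording it; a synthetic http->https redirect may carry an invalid URL that `new URL()` rejects:

```ts
function isRecordableUrl(url: string): boolean {
  try {
    new URL(url);
    return true;
  } catch {
    return false; // e.g. a browser-synthesized redirect to an invalid URL
  }
}
```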
- logging: log behavior options that are enabled on startup, after seeds
- redis: launch local redis only if --redisStoreUrl starts with
redis://localhost or redis://127.0.0.1 (see the sketch below)
- interrupt: check that the crawler is not 'done' before exiting with exit
code 13; if already done, exit with 0
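An illustrative check matching the redis rule above; the function name is an assumption:

```ts
// only launch the bundled local Redis when --redisStoreUrl is local
function shouldLaunchLocalRedis(redisStoreUrl: string): boolean {
  return (
    redisStoreUrl.startsWith("redis://localhost") ||
    redisStoreUrl.startsWith("redis://127.0.0.1")
  );
}
```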
Refactors handling of 206 responses:
- If a 206 response is encountered and it's actually the full range,
convert it to 200 and rewrite the range and content-range headers to
x-range and x-orig-range. This supports rewriting of 206 responses for
DASH manifests.
- If a partial 206 response starts with `0-`, do a full async fetch
separately.
- If a partial 206 response does not start with `0-`, just ignore it (very
likely a duplicate picked up when handling the `0-` response).
- Don't stream content-types that can be rewritten, since streaming
prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no
content-length and don't get properly rewritten.
- Overall, adds missing rewriting of DASH/HLS manifests that have no
content-length and are served as 206.
- Update to latest wabac.js which fixes rewriting of DASH manifest to
avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192
- Fixes #645
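A hedged sketch of the full-range 206 -> 200 conversion; which original header lands under which x- name is an assumption based on the list above:

```ts
function normalizeFullRange206(status: number, headers: Headers): number {
  const contentRange = headers.get("content-range"); // e.g. "bytes 0-1023/1024"
  const m = contentRange?.match(/^bytes (\d+)-(\d+)\/(\d+)$/);
  if (status !== 206 || !m) {
    return status;
  }
  const [, start, end, total] = m;
  if (start === "0" && Number(end) + 1 === Number(total)) {
    // the "partial" response is actually the full payload: store as 200,
    // preserving the originals under x- names
    const origRange = headers.get("range");
    if (origRange) headers.set("x-range", origRange);
    headers.set("x-orig-range", contentRange!);
    headers.delete("range");
    headers.delete("content-range");
    return 200;
  }
  return status;
}
```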
- instead, exclude sitemap-discovered page URLs from being counted toward
extra hops rules, e.g. if a sitemap page is not in scope, don't include it.
- if extraHops is set with sitemaps, only consider extraHops for links from
pages that are in scope.
- bump version to 1.2.4
Fixes #637
- Username will match if name attribute is one of: user, username, email
- Password will match if type is password and name attribute is one of:
pass, password
This loosens the rules sufficiently to solve the issue with the URL in
the linked issue without requiring users to pass custom CSS selectors at
this point.
It looks like we were also using XPath methods like contains() whereas
Puppeteer expects CSS selectors, hence the syntax change (see the sketch
below).
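A sketch of the loosened CSS selectors; the attribute lists follow the description above, while the constant names are illustrative:

```ts
// match username fields by name attribute: user, username, or email
const USERNAME_SELECTOR = ["user", "username", "email"]
  .map((name) => `input[name="${name}"]`)
  .join(",");

// match password fields by type plus name attribute: pass or password
const PASSWORD_SELECTOR = ["pass", "password"]
  .map((name) => `input[type="password"][name="${name}"]`)
  .join(",");

// usage (in a Puppeteer context): await page.$(USERNAME_SELECTOR)
```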
Don't wait for requests that have not been intercepted (`intercepting` is
not set) and are not loaded asynchronously (`asyncLoading` is not set) in
awaitPageResources() when the page is done. Occasionally, some pending
requests that only get added via `Network.requestWillBeSent` but never
receive a finished/failed message may persist in the pending request list;
these will now be discarded.
(Large requests that have a streaming response body will have either
`intercepting` or `asyncLoading` set and will not be affected.)
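A hedged sketch of the filtering described above; the PendingRequest shape and field names are illustrative:

```ts
type PendingRequest = {
  url: string;
  intercepting?: boolean;  // request body is being intercepted
  asyncLoading?: boolean;  // streaming response body loading asynchronously
};

function shouldAwait(req: PendingRequest, pageDone: boolean): boolean {
  if (!pageDone) return true;
  // once the page is done, discard requests that were never intercepted
  // and are not loading asynchronously (e.g. stale Network.requestWillBeSent
  // entries that never got a finished/failed message)
  return Boolean(req.intercepting || req.asyncLoading);
}
```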
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
Adds enterprise policy to always download PDF and sets download dir to
/dev/null
Moves policies to chromium.json and brave.json for clarity
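A sketch of the managed-policy change; AlwaysOpenPdfExternally and DownloadDirectory are standard Chromium policy names, while the target path is an assumption (the actual files are chromium.json / brave.json):

```ts
import { writeFileSync } from "node:fs";

const policy = {
  AlwaysOpenPdfExternally: true,  // download PDFs rather than opening the viewer
  DownloadDirectory: "/dev/null", // discard anything that does download
};

writeFileSync(
  "/etc/chromium/policies/managed/chromium.json",
  JSON.stringify(policy, null, 2),
);
```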
Further cleanup of non-HTML loading path:
- sets downloadResponse when page load is aborted but the response is
actually a download
- sets firstResponse when first response finishes, but page doesn't
fully load
- logs that non-HTML pages skip all post-crawl behaviors in one place
- move page extra delay to separate awaitPageExtraDelay() function, applied for all pages (while post-load delay only applied to HTML pages)
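A minimal sketch of the split: an extra delay applied to every page, separate from the HTML-only post-load delay; the parameter name is an assumption:

```ts
async function awaitPageExtraDelay(pageExtraDelaySecs: number): Promise<void> {
  if (pageExtraDelaySecs > 0) {
    await new Promise((resolve) => setTimeout(resolve, pageExtraDelaySecs * 1000));
  }
}
```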
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
It's possible for a redirect, especially a browser-generated one, to have
headers and no body (e.g. Brave removing a tracking URL query). Don't
filter these redirects out from being written to the WARC; just set the
payload to an empty buffer.
fixes #627, where a Brave-generated redirect response was not stored.
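A sketch of keeping header-only redirect records, substituting an empty payload instead of dropping the record; shapes and names are illustrative:

```ts
function payloadForRecord(body: Uint8Array | null, status: number): Uint8Array {
  if (!body && status >= 300 && status < 400) {
    return new Uint8Array(0); // browser-generated redirect with headers only
  }
  return body ?? new Uint8Array(0);
}
```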
To avoid a strange Chromium bug
(https://issues.chromium.org/issues/40209037) which causes WebGL to fail
in headless mode if DISPLAY is set: instead, just set DISPLAY directly
for Xvfb and x11vnc, and pass `--display=` to the browser if running in
headful mode.