Don't wait for requests that have been not intercepted (`intercepting` is not set) and are not loaded asynchronously (`asyncLoading` is not set) in awaitPageResources() when page is done. Occasionally, it seems some pending requests that only get added via `Network.requestWillBeSent` but never receive a finished/failed message may persist in the pending request list, and will now be discarded.
(Large requests that have a streaming response body will have either `intercepting` or `asyncLoading` set and will not be affected)
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
Adds enterprise policy to always download PDF and sets download dir to
/dev/null
Moves policies to chromium.json and brave.json for clarity
Further cleanup of non-HTML loading path:
- sets downloadResponse when page load is aborted but response is
actually download
- sets firstResponse when first response finishes, but page doesn't
fully load
- logs that non-HTML pages skip all post-crawl behaviors in one place
- move page extra delay to separate awaitPageExtraDelay() function, applied for all pages (while post-load delay only applied to HTML pages)
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
It's possible for a redirect, especially a browser-generated one to have
headers and no body (eg. Brave removing tracking url query). Don't
filter these redirects out from being written to WARC, just set payload to empty
buffer.
fixes#627 where Brave-generated redirect response was not stored.
To avoid a strange chromium bug:
https://issues.chromium.org/issues/40209037 which causes WebGL to fail
in headless mode if DISPLAY if set. Instead, just set DISPLAY directly
for Xvfb, x11vnc and pass in `--display=` to browser if running in
headful mode.
- update to wabac.js 2.19.0 to use new html rewriting support in
wabac.js 2.19.0
- update to browsertrix-behaviors to 0.6.1 to fix instagram behavior
- bump to 1.2.0-beta.3
- parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config)
- add 'Authorization' header with base64 encoded basic auth via setExtraHTTPHeaders()
- tests: add test for crawling with auth using http-server using local docs build (now build docs as part of CI)
- docs: add HTTP Auth to YAML config section
---------
Co-authored-by: Ed Summers <ehs@pobox.com>
split isInScope into a protected sync getScope() used for link
extraction (no need for async as we know seed is already set) and which
returns url / isOOS count.
and a simpler, public async isInScope() which just returns a bool, but
also ensures the seed exists.
- subtract the browser ui height from default viewport computed from
screen dimensions
- hard-code height to 81px for now
- fixes#613, bottom of page being cut-off as viewport height was too
big
- Ensure newline escaping happens consistently, even for 'excluded'
headers which get a `x-orig-` prefix but are still added
- Ensure excluded headers in list path are still added with `x-orig-`
prefix.
- fixes#607
Fixes#604
Ensures that extra seeds are propagated to all crawler instances.
Adds a new redis hashmap key to store the extraSeed mappings
url->extraSeeds index, to ensure the extra seeds are added in the same
order on other instances, even if encountered in different order.
Add a new redis lua primitive 'addnewseed' which combines several
operations: check if extra seed already exists and returning existing
index, add new seed to extraSeed list, also add to regular URL seed
list.
Fixes#601, fixes issue with extra wait on PDF pages, where browser
seems to be waiting for a chrome-extension:// URL.
These should have already be getting skipped, but missed here.
The blockrules tests assumed the youtube serves videos with `video/mp4`
mime. However, now youtube also serves them with mime
`application/vnd.yt-ump`. Both mime types are now checked to verify video are present.
fixes#587
The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.
Proxy settings can now be set, in order of precedence via:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)
The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- if set, runs the crawl but doesn't store any archive data (WARCS,
WACZ, CDXJ) while logs and pages are still written, and saved state can be
generated (per the --saveState options).
- adds test to ensure only 'logs' and 'pages' dirs are generated with --dryRun
- screenshot, text extraction are skipped altogether in dryRun mode,
warning is printed that storage and archiving-related options may be
ignored
- fixes#593
Add some default policy settings to disable unneeded Brave features.
Helps a bit with #463, but Brave unfortunately doesn't provide all
mentioned settings as policy options.
Most important changes are in
`config/policies/lockdown-profilebrowser.json` it limits access to the
container filesystem especially during interactive profile browser
creation.
Optimize the direct loading of non-HTML pages. Currently, the behavior
is:
- make a HEAD request first
- make a direct fetch request only if HEAD request is a non-HTML and 200
- only use fetch request if non-HTML and 200 and doesn't set any cookies
This changes the behavior to:
- get cookies from browser for page URL
- make a direct fetch request with cookies, if provided
- only use fetch request if non-HTML and 200
Also:
- ensures pageinfo is properly set with timestamp for direct fetch.
- remove obsolete Agent handling that is no longer used in default
(fetch)
If fetch request results in HTML, the response is aborted and browser
loading is used.
Fixes#553
Includes `warcinfo` records at the beginning of new WARCs, as well as
the combined WARC.
Makes the warcinfo record also WARC/1.1 to match the rest of the WARC
records.
Fixes#575
- Adds a missing await to fetching the number of failed pages from Redis
- Fixes a typo in the fatal logging message
- Adds a test to ensure that the crawl fails with exit code 17 if
--failOnInvalidStatus and --failOnFailedLimit 1 are set with a url that
will 404
Additional fixes for sitemaps:
- Fix parsing sitemaps that have data wrapped in CDATA fields, fixes
part of https://github.com/webrecorder/browsertrix/issues/1750
- Fix parsing where the .gz sitemap have content-encoding and are
actually not gzipped
- Ensure error in gzip parsing doesn't break crawl, just errors sitemap
parsing.
The save state export accidentally exported the pending data as an
object, instead of a list of JSON strings, as it is stored in Redis,
while import was expecting list of json strings. The getPendingList()
function parses the json, but then was re-encoding it for writeStats().
This was likely a mistake.
This PR fixes things:
- support loading pending state as both array of objects and array of
json strings for backwards compatibility
- save state as array of json strings
- remove json decoding and encoding in getPendingList() and writeStats()
Fixes#568
Ensure headers are processed via internal checks before attempting to
pass to `new Headers` to ensure validity:
- filter out http/2 style pseudoheaders (starting with ':')
- check if header values are non-ascii, and if so, encode with
`encodeURI`
fixes#569 + prep for latest version of base image which contain
pseudo-headers (replaces #546)
Fixes#563
This PR makes a few changes to fix a regression in behavior around
`failOnFailedSeed` for the 1.x releases:
- Fail with exit code 1, not 17, when pages are unreachable due to DNS
not resolving or other network errors if the page is a seed and
`failOnFailedSeed` is set
- Extend tests, add test to ensure crawl succeeds on 404 seed status
code if `failOnINvalidStatus` isn't set
when loading a PDF as a page, the browser returns a 'false positive'
net::ERR_ABORTED even though the PDF is loaded.
- this is already handled, but status code was still being cleared,
ensure status code is not reset to 0 on response
- ensure page status and mime are also recorded if this failure is
ignored (in shouldIgnoreAbort)
- tests: add test for PDF capture
fixes#570