505 commits

Author SHA1 Message Date
Ilya Kreymer
717dd138ec
Ignore invalid URLs in redirects (#658)
Ensure URLs from redirects are valid. A redirect may be a 'synthetic'
redirect created by the browser for http->https enforcement, which may
include an invalid URL, e.g. http://<invalid url> ->
https://<invalid url>
Prevent trying to record this invalid URL.
Fixes #654
2024-07-29 14:51:22 -07:00
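
A minimal sketch of the kind of validity check described in 717dd138ec, assuming the standard WHATWG `URL` constructor is used to reject unparseable redirect targets. The name `isValidRedirectUrl` is hypothetical and is not taken from the crawler's source.

```typescript
// Hypothetical helper: true only if `url` parses as an absolute http(s) URL.
function isValidRedirectUrl(url: string): boolean {
  try {
    const parsed = new URL(url);
    return parsed.protocol === "http:" || parsed.protocol === "https:";
  } catch {
    // new URL() throws a TypeError when the input cannot be parsed
    return false;
  }
}
```

A caller would skip recording any redirect whose target fails this check.
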
Ilya Kreymer
d620eb8e31
misc tweaks: (#650)
- logging: log behavior options that are enabled on startup, after seeds
- redis: launch local redis only if --redisStoreUrl starts with
redis://localhost or redis://127.0.0.1
- interrupt: check that crawler is not 'done' before exiting with exit
code 13, if already done, exit with 0
2024-07-23 21:50:26 -04:00
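
The redis and interrupt rules from d620eb8e31 can be sketched as two small functions. Both function names are hypothetical; only the prefixes and exit codes come from the commit message.

```typescript
// Hypothetical helper: launch a local redis only for localhost-style store URLs.
function shouldLaunchLocalRedis(redisStoreUrl: string): boolean {
  return (
    redisStoreUrl.startsWith("redis://localhost") ||
    redisStoreUrl.startsWith("redis://127.0.0.1")
  );
}

// Hypothetical helper: exit code on interrupt, 0 if the crawl already finished.
function interruptExitCode(crawlDone: boolean): number {
  return crawlDone ? 0 : 13;
}
```
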
Ilya Kreymer
48716c172d docs: regenerate cli options with ./docs/gen-cli.sh 2024-07-19 18:53:50 -07:00
benoit74
1099f4f3c8
Make it clear that profile argument can be an HTTP(S) URL (#649)
Small documentation enhancement to make it clear that browser profile
can be passed as HTTP(S) URL as well.
2024-07-19 18:53:28 -07:00
Ilya Kreymer
88a2fbd0a0
Fix 206 response + general video handling (#646)
Refactors handling of 206 responses:
- If a 206 response is encountered, and it's actually the full range,
convert to 200 and rewrite range and content-range headers to x-range
and x-orig-range. This is to support rewriting of 206 responses for DASH
manifests
- If a partial 206 response starts with `0-`, do a full async fetch
separately.
- If a partial 206 response does not start with `0-`, just ignore it (very
likely a duplicate picked up when handling the `0-` response)
- Don't stream content-types that can be rewritten, since streaming
prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no
content-length and don't get properly rewritten.
- Overall, adds missing rewriting of DASH/HLS manifests that have no
content-length and are served as 206.
- Update to latest wabac.js which fixes rewriting of DASH manifest to
avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192
- Fixes #645
2024-07-17 13:24:25 -07:00
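
The three 206 cases in 88a2fbd0a0 can be illustrated by classifying a `Content-Range` header of the standard form `bytes start-end/total`. This is a sketch only: `classify206` and the action names are hypothetical, and treating an unparseable header as "ignore" is an assumption, not something the commit states.

```typescript
type RangeAction = "convert-to-200" | "async-full-fetch" | "ignore";

// Hypothetical classifier for the Content-Range value of a 206 response.
function classify206(contentRange: string): RangeAction {
  const m = contentRange.match(/^bytes (\d+)-(\d+)\/(\d+|\*)$/);
  if (!m) return "ignore"; // assumption: unparseable ranges are skipped
  const start = Number(m[1]);
  const end = Number(m[2]);
  const total = m[3] === "*" ? null : Number(m[3]);
  // Full range served as 206: can be treated as a plain 200
  if (start === 0 && total !== null && end === total - 1) return "convert-to-200";
  // Partial range starting at 0: fetch the full resource separately
  if (start === 0) return "async-full-fetch";
  // Partial range not starting at 0: likely a duplicate of the 0- request
  return "ignore";
}
```
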
Ilya Kreymer
01666b4474 deps: bump browsertrix-behaviors to 0.6.2 2024-07-11 19:53:59 -07:00
Ilya Kreymer
4fb9577d4f
don't disable extraHops when using sitemaps: (#639)
- instead, exclude sitemap-discovered page URLs from being counted toward extraHops rules, e.g. if a sitemap page is not in scope, don't include it.
- if extraHops is set with sitemaps, only consider extraHops for links on pages that are in scope.
- bump version to 1.2.4
2024-07-11 19:48:43 -07:00
Ilya Kreymer
1a48b37478
bump replayweb.page to 2.1.1 (#640) 2024-07-11 16:22:37 -07:00
Tessa Walsh
fd98033268
Loosen selectors for login fields in automated profile creation (#638)
Fixes #637 

- Username will match if name attribute is one of: user, username, email
- Password will match if type is password and name attribute is one of:
pass, password

This loosens the rules sufficiently to solve the issue with the URL in
the linked issue without requiring users to pass custom CSS selectors at
this point.

It looks like we were also using XPath methods like contains whereas
puppeteer expects CSS selectors, hence the syntax change.
2024-07-11 15:55:06 -07:00
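
The loosened matching rules in fd98033268 could be expressed as CSS selector lists along these lines. The variable names are hypothetical, and restricting the match to `input` elements with exact `name` values is an assumption; the commit only lists the accepted attribute values.

```typescript
// Attribute values taken from the commit message
const usernameNames = ["user", "username", "email"];
const passwordNames = ["pass", "password"];

// Assumed selector shape: any input whose name is one of the accepted values
const usernameSelector = usernameNames
  .map((n) => `input[name='${n}']`)
  .join(", ");

// Assumed selector shape: password-type input with one of the accepted names
const passwordSelector = passwordNames
  .map((n) => `input[type='password'][name='${n}']`)
  .join(", ");
```

Selector strings of this form are plain CSS, so they can be passed to any API that expects CSS selectors rather than XPath.
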
Ilya Kreymer
151115d46c Fix Pending Request causing timeout (#636)
Don't wait for requests that have not been intercepted (`intercepting` is not set) and are not loaded asynchronously (`asyncLoading` is not set) in awaitPageResources() when page is done. Occasionally, it seems some pending requests that only get added via `Network.requestWillBeSent` but never receive a finished/failed message may persist in the pending request list, and will now be discarded.
(Large requests that have a streaming response body will have either `intercepting` or `asyncLoading` set and will not be affected)
2024-07-09 11:02:41 -07:00
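
The filtering rule from 151115d46c, sketched on a simplified request record. The `PendingRequest` shape and `requestsToAwait` are hypothetical; only the `intercepting` and `asyncLoading` flags come from the commit message.

```typescript
// Simplified, hypothetical shape of a pending request entry
interface PendingRequest {
  requestId: string;
  intercepting?: boolean;
  asyncLoading?: boolean;
}

// Keep only requests that are being intercepted or loaded asynchronously;
// anything else is discarded once the page is done.
function requestsToAwait(pending: PendingRequest[]): PendingRequest[] {
  return pending.filter((r) => r.intercepting || r.asyncLoading);
}
```
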
Ilya Kreymer
bfe42ad31e
Improved handling of pages that redirect back to the same page. (#635)
In the case of a page https://example.com/ which results in a redirect
chain: 307 https://example.com/ -> 307 https://auth.example.com/ -> 200
https://example.com/
- Includes status in dupe checks, ensures that `307
https://example.com/` and `200 https://example.com/` are both recorded
to WARC
- When setting page timestamp, update the timestamp to the lower status
code if above 300, eg. first setting to `307 https://example.com/` and
then to `200 https://example.com/`

Fixes #634
2024-07-08 10:51:37 -07:00
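
Including the status in the duplicate check, as described in bfe42ad31e, can be sketched with a set keyed on status plus URL. The key format and the `shouldRecord` name are assumptions for illustration.

```typescript
const seen = new Set<string>();

// Hypothetical dupe check: a response is new unless the same
// status + URL pair has already been recorded.
function shouldRecord(status: number, url: string): boolean {
  const key = `${status} ${url}`;
  if (seen.has(key)) return false;
  seen.add(key);
  return true;
}
```

With this key, `307 https://example.com/` and `200 https://example.com/` are treated as distinct records, while a repeated `307` for the same URL is skipped.
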
Ilya Kreymer
320c041235 version: bump to 1.2.3 2024-07-08 10:50:51 -07:00
Ilya Kreymer
302b119908
Dependency Update / 1.2.2 (#633)
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
2024-07-03 12:55:14 -07:00
Ilya Kreymer
a3396adba2
tests: reduce logging (#596)
remove logging of crawl logs by default for clearer output from tests, only log in case of error.
2024-06-26 13:05:13 -07:00
Ilya Kreymer
4495532606
Always download PDF + non HTML page cleanup + enterprise policy cleanup (#629)
Adds enterprise policy to always download PDF and sets download dir to
/dev/null
Moves policies to chromium.json and brave.json for clarity
Further cleanup of non-HTML loading path:
- sets downloadResponse when page load is aborted but response is
actually a download
- sets firstResponse when first response finishes, but page doesn't
fully load
- logs that non-HTML pages skip all post-crawl behaviors in one place
- moves page extra delay to separate awaitPageExtraDelay() function, applied for all pages (while post-load delay is only applied to HTML pages)

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-06-26 09:16:24 -07:00
Ilya Kreymer
6a9ca3df54
Don't filter saving redirect if no response body. (#628)
It's possible for a redirect, especially a browser-generated one, to have
headers and no body (e.g. Brave removing a tracking URL query). Don't
filter these redirects out from being written to WARC, just set the payload to
an empty buffer.

fixes #627 where Brave-generated redirect response was not stored.
2024-06-25 15:48:22 -07:00
Ilya Kreymer
2ab58c0ea3
Remove DISPLAY env var from image (#625)
To avoid a strange Chromium bug:
https://issues.chromium.org/issues/40209037 which causes WebGL to fail
in headless mode if DISPLAY is set. Instead, just set DISPLAY directly
for Xvfb, x11vnc and pass in `--display=` to browser if running in
headful mode.
2024-06-25 13:53:43 -07:00
Ilya Kreymer
92ad800fe4
browser policies: disable restoring any tabs on startup + set new tab URL to about:blank (#626)
addresses memory issues with profiles as they accumulate tabs! fixes
webrecorder/browsertrix#1880
2024-06-25 13:38:52 -07:00
Ilya Kreymer
e65bf21135 version: bump to 1.2.1 2024-06-25 13:28:59 -07:00
Ilya Kreymer
8af8b3c19a
1.2.0 release - deps: bump wabac.js to 2.19.1, RWP for QA to 2.1.0 (#624) 2024-06-21 16:34:06 -07:00
Ilya Kreymer
65a86352fd
Updated rewriting for YouTube + dependency update (#623)
- update to wabac.js 2.19.0 to use its new HTML rewriting support
- update browsertrix-behaviors to 0.6.1 to fix instagram behavior
- bump to 1.2.0-beta.3
2024-06-21 15:03:53 -07:00
Ilya Kreymer
de10ba9f15 version: bump to 1.2.0-beta.2 2024-06-20 20:11:35 -07:00
Ilya Kreymer
ea114c6083
bump brave to 1.67.119 (#620) 2024-06-20 20:10:46 -07:00
Ilya Kreymer
9847af7765
disable socat by default (#622)
- crawling: add '--debugAccessBrowser' flag to enable connecting via
9222, only run socat then
- profiles: only run socat in headless mode
2024-06-20 20:10:25 -07:00
Ilya Kreymer
3c26996f93
add yarn.lock to Docker to ensure consistent builds! (#621) 2024-06-20 18:54:05 -07:00
Ilya Kreymer
febf4b7532
logging: log error message when a seed fails to be created (#619)
for example, due to bad include/exclude regex, fixes #598
2024-06-20 18:41:57 -07:00
Ilya Kreymer
3339374092
http auth support per seed (supersedes #566): (#616)
- parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config)
- add 'Authorization' header with base64 encoded basic auth via setExtraHTTPHeaders()
- tests: add test for crawling with auth using http-server using local docs build (now build docs as part of CI)
- docs: add HTTP Auth to YAML config section

---------
Co-authored-by: Ed Summers <ehs@pobox.com>
2024-06-20 16:35:30 -07:00
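
A sketch of the per-seed auth flow from 3339374092: read the username and password from the seed URL and build a Basic `Authorization` header value. `authFromSeedUrl` is a hypothetical name, and stripping the credentials from the returned URL is an assumption. `btoa` only handles Latin-1 input, so non-Latin-1 credentials would need a different encoder.

```typescript
interface SeedAuth {
  url: string;            // seed URL without embedded credentials (assumed)
  authorization?: string; // value for the Authorization header, if any
}

// Hypothetical helper: extract basic-auth credentials from a seed URL.
function authFromSeedUrl(seedUrl: string): SeedAuth {
  const parsed = new URL(seedUrl);
  if (!parsed.username && !parsed.password) {
    return { url: parsed.href };
  }
  // URL components are percent-encoded, so decode before base64 encoding
  const user = decodeURIComponent(parsed.username);
  const pass = decodeURIComponent(parsed.password);
  parsed.username = "";
  parsed.password = "";
  return {
    url: parsed.href,
    authorization: `Basic ${btoa(`${user}:${pass}`)}`,
  };
}
```

The resulting value could then be supplied as an extra `Authorization` header via `setExtraHTTPHeaders()`, as the commit describes.
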
Ilya Kreymer
6329b19a20
clearer scope check (#615)
split isInScope into a protected sync getScope() used for link
extraction (no need for async as we know seed is already set), which
returns url / isOOS count,
and a simpler, public async isInScope() which just returns a bool, but
also ensures the seed exists.
2024-06-18 16:11:48 -07:00
Ilya Kreymer
ac722cc856
adjust browser viewport to avoid cutting off bottom of page (#614)
- subtract the browser ui height from default viewport computed from
screen dimensions
- hard-code height to 81px for now
- fixes #613, bottom of page being cut-off as viewport height was too
big
2024-06-14 15:25:59 -07:00
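
The viewport adjustment in ac722cc856 amounts to simple arithmetic, sketched here. The function and constant names are hypothetical; the 81px value is the one stated in the commit.

```typescript
// Height of the browser UI, hard-coded per the commit message
const BROWSER_UI_HEIGHT = 81;

// Hypothetical helper: derive the default viewport from screen dimensions.
function defaultViewport(screenWidth: number, screenHeight: number) {
  return { width: screenWidth, height: screenHeight - BROWSER_UI_HEIGHT };
}
```
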
Ilya Kreymer
ff481855d5
add EXPOSE for ports used inside container (#612)
documents fixed internal ports used in browsertrix, via EXPOSE cmd,
addresses #558
2024-06-14 15:19:35 -07:00
Ilya Kreymer
64080e7f67
merge 1.1.4 -> 1.2.0 beta.1 (#611) 2024-06-13 23:24:54 -07:00
Ilya Kreymer
f504effa51 Merge branch 'main' into release/1.1.4
bump to 1.2.0-beta.1
2024-06-13 19:28:25 -07:00
Ilya Kreymer
9094a8355f
Fix header newline escape (#609)
- Ensure newline escaping happens consistently, even for 'excluded'
headers which get a `x-orig-` prefix but are still added
- Ensure excluded headers in list path are still added with `x-orig-`
prefix.
- fixes #607
2024-06-13 19:13:12 -07:00
Ilya Kreymer
f85727954a
add undici for 1.1.4 release, to fix #606 (#608) 2024-06-13 18:46:05 -07:00
Ilya Kreymer
53d437570e
dependency: update RWP to 2.0.1 (#610)
for QA, use ReplayWeb.page 2.0.1 by default
2024-06-13 18:43:58 -07:00
Ilya Kreymer
8f8326eaf5
Fix synching extraSeeds state with multiple crawler instances (#605)
Fixes #604 

Ensures that extra seeds are propagated to all crawler instances.
Adds a new redis hashmap key to store the extraSeed mappings
url->extraSeeds index, to ensure the extra seeds are added in the same
order on other instances, even if encountered in different order.
Adds a new redis lua primitive 'addnewseed' which combines several
operations: check if the extra seed already exists and return the existing
index, add the new seed to the extraSeed list, and also add it to the regular
URL seed list.
2024-06-13 17:18:06 -07:00
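
The semantics described for 'addnewseed' in 8f8326eaf5 can be modeled in memory as follows. This is not the Redis Lua script itself: the class, its field names, and the zero-based index are assumptions made for illustration.

```typescript
// In-memory model only; the real primitive runs atomically inside Redis.
class ExtraSeedModel {
  private indexByUrl = new Map<string, number>(); // models the url -> index hash
  readonly extraSeeds: string[] = [];             // models the extra seed list
  readonly seedUrls = new Set<string>();          // models the regular URL seed list

  // Returns the index of `url` among extra seeds, adding it only if new.
  addNewSeed(url: string): number {
    const existing = this.indexByUrl.get(url);
    if (existing !== undefined) return existing;
    const index = this.extraSeeds.length;
    this.extraSeeds.push(url);
    this.indexByUrl.set(url, index);
    this.seedUrls.add(url);
    return index;
  }
}
```

Because the existing index is returned for a known URL, every crawler instance resolves the same URL to the same seed index regardless of the order in which it encounters the seeds.
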
Tessa Walsh
9b6efdf6ba
Change some logged errors to warns (#600)
Fixes #599 

This PR modifies the following logging messages from `error` to `warn`,
to avoid alarming users for fairly expected behavior:

- Behavior timeout
- Page worker timeout
- Streaming fetch error (will already trigger another page error
message)

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-06-13 15:42:27 -04:00
Ilya Kreymer
833551bd77
recorder: add missing shouldSkip() to responseReceived callback (#602)
Fixes #601, fixes issue with extra wait on PDF pages, where browser
seems to be waiting for a chrome-extension:// URL.
These should already have been skipped, but the check was missed here.
2024-06-13 12:13:14 -07:00
Ilya Kreymer
1e7f8361fe
tests: fix blockrules tests (#603)
The blockrules tests assumed that YouTube serves videos with the `video/mp4`
MIME type. However, YouTube now also serves them with the MIME type
`application/vnd.yt-ump`. Both MIME types are now checked to verify that videos are present.
2024-06-13 12:12:46 -07:00
Ilya Kreymer
f6c4bf9935 bump version to 1.1.4 2024-06-13 10:31:57 -07:00
Ilya Kreymer
e2b4cc1844
proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars (#589)
fixes #587 

The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.

Proxy settings can now be set via the following, in order of precedence:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)

The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)

---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-06-10 13:11:00 -07:00
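
The precedence order from e2b4cc1844 can be sketched as a single resolver. `resolveProxy` and the `ProxyEnv` type are hypothetical; the `http://` scheme for the PROXY_HOST / PROXY_PORT fallback follows the commit's note that these variables set an HTTP proxy only.

```typescript
// Only the env vars named in the commit message
interface ProxyEnv {
  PROXY_SERVER?: string;
  PROXY_HOST?: string;
  PROXY_PORT?: string;
}

// Hypothetical resolver: CLI flag first, then PROXY_SERVER, then host + port.
function resolveProxy(cliProxyServer: string | undefined, env: ProxyEnv): string | undefined {
  if (cliProxyServer) return cliProxyServer;
  if (env.PROXY_SERVER) return env.PROXY_SERVER;
  if (env.PROXY_HOST && env.PROXY_PORT) {
    return `http://${env.PROXY_HOST}:${env.PROXY_PORT}`;
  }
  return undefined;
}
```
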
Ilya Kreymer
b83d1c58da
add --dryRun flag and mode (#594)
- if set, runs the crawl but doesn't store any archive data (WARCs,
WACZ, CDXJ) while logs and pages are still written, and saved state can be
generated (per the --saveState options).
- adds test to ensure only 'logs' and 'pages' dirs are generated with --dryRun
- screenshots and text extraction are skipped altogether in dryRun mode;
a warning is printed that storage and archiving-related options may be
ignored
- fixes #593
2024-06-07 10:34:19 -07:00
benoit74
32435bfac7
Consider disk usage of collDir instead of default /crawls (#586)
Fix #585 

Changes:
- compute disk usage based on crawler `collDir` property instead of
always computing it on `/crawls` directory
2024-06-07 10:13:15 -07:00
Ilya Kreymer
1bd94d93a1
cleanup dockerfile + fix test (#595)
- remove obsolete line from Dockerfile
- fix pdf test to webrecorder-hosted pdf
2024-06-06 12:14:44 -07:00
Vinzenz Sinapius
068ee79288
Add group policies, limit browser access to container filesystem (#579)
Add some default policy settings to disable unneeded Brave features.
Helps a bit with #463, but Brave unfortunately doesn't provide all
mentioned settings as policy options.

The most important changes are in
`config/policies/lockdown-profilebrowser.json`, which limits access to the
container filesystem, especially during interactive profile browser
creation.
2024-06-05 12:46:49 -07:00
Ilya Kreymer
757e838832
base image version bump to brave 1.66.115 (#592) 2024-06-04 13:35:13 -07:00
Ilya Kreymer
a7d279cfbd
Load non-HTML resources directly whenever possible (#583)
Optimize the direct loading of non-HTML pages. Currently, the behavior
is:
- make a HEAD request first
- make a direct fetch request only if HEAD request is a non-HTML and 200
- only use fetch request if non-HTML and 200 and doesn't set any cookies

This changes the behavior to:
- get cookies from browser for page URL
- make a direct fetch request with cookies, if provided
- only use fetch request if non-HTML and 200
Also:
- ensures pageinfo is properly set with timestamp for direct fetch.
- remove obsolete Agent handling that is no longer used in default
(fetch)

If fetch request results in HTML, the response is aborted and browser
loading is used.
2024-05-24 14:51:51 -07:00
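
The new decision rule in a7d279cfbd (keep the direct fetch only for non-HTML 200 responses) can be sketched as below. The function names are hypothetical, and both the set of MIME types counted as HTML and the handling of a missing content type are assumptions.

```typescript
// Assumed definition of "HTML": text/html or application/xhtml+xml
function isHtmlContentType(contentType: string | null): boolean {
  if (!contentType) return false; // assumption: missing type is not HTML
  const mime = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return mime === "text/html" || mime === "application/xhtml+xml";
}

// Hypothetical helper: keep the direct fetch result, or fall back to the browser.
function keepDirectFetch(status: number, contentType: string | null): boolean {
  return status === 200 && !isHtmlContentType(contentType);
}
```

When this returns false for an HTML response, the fetch would be aborted and the page loaded in the browser instead, as the commit describes.
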
Ilya Kreymer
089d901b9b
Always add warcinfo records to all WARCs (#556)
Fixes #553 

Includes `warcinfo` records at the beginning of new WARCs, as well as
the combined WARC.
Makes the warcinfo record also WARC/1.1 to match the rest of the WARC
records.
2024-05-22 15:47:05 -07:00
Ilya Kreymer
894681e5fc
Bump version to 1.2.0 Beta + make draft release for each commit (#582)
Generate draft release from main and *-release branches to simplify
release process

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-05-22 15:45:48 -07:00
Ilya Kreymer
6c15bb3f00 version: bump to 1.1.3 2024-05-21 16:37:03 -07:00