218 commits

Author SHA1 Message Date
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.
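
As a rough illustration of how a repeatable flag with mixed sources might be normalized, the TypeScript sketch below sorts each value into a URL or a local path. `classifyBehaviorSource` and `normalizeSources` are hypothetical names, and treating only `http(s)://` values as URLs is an assumption, not the crawler's actual logic.

```typescript
// Hypothetical helpers for illustration only; not the crawler's implementation.
type BehaviorSource =
  | { kind: "url"; url: string }
  | { kind: "path"; path: string };

function classifyBehaviorSource(source: string): BehaviorSource {
  // Assumption: only http(s) values are remote; anything else is a local
  // file or directory path.
  if (/^https?:\/\//i.test(source)) {
    return { kind: "url", url: source };
  }
  return { kind: "path", path: source };
}

// A repeatable flag yields an array; accepting a lone string as well keeps
// older single-value configs working (an assumption about the desired behavior).
function normalizeSources(value: string | string[]): BehaviorSource[] {
  const list = Array.isArray(value) ? value : [value];
  return list.map(classifyBehaviorSource);
}
```

Under these assumptions, `normalizeSources(["./behaviors", "https://example.com/b.js"])` yields one `path` entry followed by one `url` entry.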

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Ilya Kreymer
e5bab8e7c8
various edge-case loading optimizations: (#709)
- rework 'should stream' logic:
* ensure 206 responses (or any response) greater than 25M are streamed
* responses between 5M and 25M are read into memory if text/css/js, as they may be rewritten
* responses <5M are read into memory
* responses with unknown size are streamed if a 2xx, otherwise read into memory, assuming error code responses may lack status codes but otherwise are small
- likely fix for issues in #706
- if too many range requests for same URL are being made, try
skipping/failing right away to reduce load
- assume main browser context is used not just for service workers,
always enable
- check false positive 'net-aborted' error that may actually be ok for
media, as well as documents
- improve logging
- interrupt any pending requests (that may be loading via browser
context) after page timeout, log dropped requests
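
The 'should stream' rules above can be sketched as a single decision function. This is a minimal TypeScript illustration, not the crawler's code: `shouldStream` and `isRewritable` are hypothetical names, the exact byte values behind "5M" and "25M" are assumptions, and so is the set of content types treated as "text/css/js".

```typescript
// Assumed byte values for the "5M" and "25M" thresholds in the commit message.
const SMALL_LIMIT = 5_000_000;
const LARGE_LIMIT = 25_000_000;

// Assumption: "text/css/js" means any text/* type plus common JavaScript types.
function isRewritable(contentType: string): boolean {
  const mime = (contentType.split(";")[0] ?? "").trim().toLowerCase();
  return (
    mime.startsWith("text/") ||
    mime === "application/javascript" ||
    mime === "application/x-javascript"
  );
}

// size is null when the response size is unknown.
function shouldStream(
  status: number,
  size: number | null,
  contentType: string,
): boolean {
  if (size === null) {
    // Unknown size: stream 2xx responses, buffer everything else.
    return status >= 200 && status < 300;
  }
  if (size > LARGE_LIMIT) {
    return true; // always stream very large responses, including 206s
  }
  if (size < SMALL_LIMIT) {
    return false; // small responses are read into memory
  }
  // Mid-sized responses: buffer only if they may need rewriting.
  return !isRewritable(contentType);
}
```

For example, under these assumptions a 10 MB `text/css` response is buffered so it can be rewritten, while a 10 MB `video/mp4` response is streamed.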
---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-10-31 14:06:17 -07:00
Ilya Kreymer
181d9b824c
deps: update to latest wabac (#708)
bump version to 1.3.4
2024-10-26 11:02:32 -07:00
Ilya Kreymer
0d39ea3590
dep: update to wabac.js 2.20 (#704)
Update imports for new TS-based wabac.js
2024-10-16 21:02:04 -07:00
Ilya Kreymer
a45b85dd74 version: bump to 1.3.3 2024-10-11 00:12:23 -07:00
Ilya Kreymer
282c47ad66
bump puppeteer core to 23.5.1 (#700)
includes possible improvements for detecting crashes with wrong stack
trace (see: puppeteer/puppeteer#13056)
2024-10-07 16:39:48 -07:00
Ilya Kreymer
356b3f8d10 bump to 1.3.2 2024-09-30 15:51:13 -07:00
Ilya Kreymer
9f310907f0 version: bump to 1.3.1 2024-09-27 14:30:56 -04:00
Ilya Kreymer
da442573b8 version: bump to 1.3.0 2024-09-12 09:22:22 -07:00
Ilya Kreymer
083a9d2090 version: bump to 1.3.0-beta.1 2024-09-05 18:11:52 -07:00
Ilya Kreymer
9d0e3423a3
WARC writer + incremental indexing fixes (#679)
- ensure WARC rollover happens only after response/request + cdx or
single record + cdx have been written
- ensure request payload is buffered for POST request indexing
- update to warcio 2.3.1 for POST request case-insensitive
'content-type' check
- recorder: remove unused 'tempdir', no longer used as warcio chooses a
temp file on its own
2024-09-05 11:10:31 -07:00
Ilya Kreymer
85a07aff18
Streaming in-place WACZ creation + CDXJ indexing (#673)
Fixes #674 

This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously WARC was written once, then read again (for CDXJ indexing),
read again (for adding to new WACZ ZIP), written to disk (into new WACZ
ZIP), read again (if upload to remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all
data is read once to either generate WACZ on disk or upload to remote.
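
To make the merge step concrete, here is a small in-memory TypeScript stand-in for combining per-WARC CDXJ indices into one sorted index. The crawler itself is described as using the `sort` command on disk; `mergeCdxj` and `shouldUseZipNum` are hypothetical names, and since the commit message does not state the unit of the 50K threshold, the `size` parameter is left unit-agnostic.

```typescript
// Hypothetical in-memory stand-in for merging per-WARC CDXJ files with `sort`.
function mergeCdxj(perWarcIndexes: string[][]): string[] {
  const all: string[] = [];
  for (const lines of perWarcIndexes) {
    for (const line of lines) {
      if (line.length > 0) {
        all.push(line); // skip empty lines
      }
    }
  }
  // The default string sort compares UTF-16 code units, which matches
  // byte order for ASCII-only CDXJ lines.
  return all.sort();
}

// ZipNum decision as described: always when --generateCDX is set, otherwise
// only above the threshold. The unit of `size` is an assumption left open.
function shouldUseZipNum(
  size: number,
  generateCDX: boolean,
  threshold = 50_000,
): boolean {
  return generateCDX || size > threshold;
}
```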

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-08-29 13:21:20 -07:00
Ilya Kreymer
23fbbcb6bf version: bump to 1.3.0-beta.0 2024-08-14 20:12:48 -07:00
Ilya Kreymer
8d7fb1e084
1.2.8 updates: (#668)
- rewriting: update wabac.js, use getCustomRewriter(), don't truncate
POST request bodies for URLs that use a custom rewriter
- browser: disable --enable-automation, setting webdriver = true, so no
need for override
- deps: update puppeteer-core, necessary changes for latest puppeteer
2024-08-13 23:38:55 -07:00
Ilya Kreymer
bb34c5ef47 version: bump to 1.2.7
deps: bump RWP in Dockerfile to 2.1.3
2024-08-09 13:23:16 -07:00
Ilya Kreymer
a1ba29d878
deps: update puppeteer-core to 22.14.0 (#661) 2024-07-30 13:51:52 -07:00
Ilya Kreymer
ff81048d3a
deps: bump browsertrix-behaviors to 0.6.3 (#659)
adds support for detecting videos in shadow dom with
query-selector-shadow-dom library
2024-07-30 09:41:21 -07:00
Ilya Kreymer
9f2b9bf4e5 version: bump to 1.2.6 2024-07-29 16:41:40 -07:00
Ilya Kreymer
539730d54e
remove crc32 computation, fixes #653 (#657)
Removes crc32 computation, which was incorrect, and no longer needed
2024-07-29 16:19:44 -07:00
Ilya Kreymer
88a2fbd0a0
Fix 206 response + general video handling (#646)
Refactors handling of 206 responses:
- If a 206 response is encountered, and it's actually the full range,
convert to 200 and rewrite range and content-range headers to x-range
and x-orig-range. This is to support rewriting of 206 responses for DASH
manifests
- If a partial 206 response starting with `0-`, do a full async fetch
separately.
- If a partial 206 response not starting with 0-, just ignore (very
likely a duplicate picked up when handling the 0- response)
- Don't stream content-types that can be rewritten, since streaming
prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no
content-length and don't get properly rewritten.
- Overall, adds missing rewriting of DASH/HLS manifests that have no
content-length and are served as 206.
- Update to latest wabac.js which fixes rewriting of DASH manifest to
avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192
- Fixes #645
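
The three 206 cases above can be sketched as a classifier over the `Content-Range` response header. This is an illustrative TypeScript sketch under stated assumptions: `classify206` and the action names are hypothetical, the decision is assumed to be made from `Content-Range` (the commit message does not say which header is inspected), and unparseable headers are treated as "ignore".

```typescript
type RangeAction = "convert-to-200" | "fetch-full-async" | "ignore";

// Classify a 206 response by its Content-Range header,
// e.g. "bytes 0-99/100" or "bytes 0-99/*".
function classify206(contentRange: string): RangeAction {
  const m = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(contentRange.trim());
  if (!m) {
    return "ignore"; // assumption: skip anything we cannot parse
  }
  const start = Number(m[1]);
  const end = Number(m[2]);
  const total = m[3] === "*" ? null : Number(m[3]);

  if (start !== 0) {
    // Likely a duplicate of a range already covered by the 0- response.
    return "ignore";
  }
  if (total !== null && end === total - 1) {
    // The "partial" response is actually the whole resource.
    return "convert-to-200";
  }
  // Starts at 0 but is incomplete: fetch the full resource separately.
  return "fetch-full-async";
}
```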
2024-07-17 13:24:25 -07:00
Ilya Kreymer
01666b4474 deps: bump browsertrix-behaviors to 0.6.2 2024-07-11 19:53:59 -07:00
Ilya Kreymer
4fb9577d4f
don't disable extraHops when using sitemaps: (#639)
- instead, exclude sitemap-discovered page URLs from being counted toward extra hops rules, e.g. if a sitemap page is not in scope, don't include it.
- if extraHops is set with sitemaps, only consider extraHops for links for pages that are in scope.
- bump version to 1.2.4
2024-07-11 19:48:43 -07:00
Ilya Kreymer
320c041235 version: bump to 1.2.3 2024-07-08 10:50:51 -07:00
Ilya Kreymer
302b119908
Dependency Update / 1.2.2 (#633)
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
2024-07-03 12:55:14 -07:00
Ilya Kreymer
e65bf21135 version: bump to 1.2.1 2024-06-25 13:28:59 -07:00
Ilya Kreymer
8af8b3c19a
1.2.0 release - deps: bump wabac.js to 2.19.1, RWP for QA to 2.1.0 (#624) 2024-06-21 16:34:06 -07:00
Ilya Kreymer
65a86352fd
Updated rewriting for YouTube + dependency update (#623)
- update to wabac.js 2.19.0 to use its new HTML rewriting support
- update browsertrix-behaviors to 0.6.1 to fix instagram behavior
- bump to 1.2.0-beta.3
2024-06-21 15:03:53 -07:00
Ilya Kreymer
de10ba9f15 version: bump to 1.2.0-beta.2 2024-06-20 20:11:35 -07:00
Ilya Kreymer
3339374092
http auth support per seed (supersedes #566): (#616)
- parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config)
- add 'Authorization' header with base64 encoded basic auth via setExtraHTTPHeaders()
- tests: add test for crawling with auth using http-server using local docs build (now build docs as part of CI)
- docs: add HTTP Auth to YAML config section
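
A minimal TypeScript sketch of the two steps described above: pulling credentials out of a seed URL and building a basic auth header. `extractAuth` and `basicAuthHeader` are hypothetical names, and the `user:password` string format of the auth value is an assumption. The resulting object is the kind of value that could be passed to Puppeteer's `page.setExtraHTTPHeaders()`.

```typescript
// Split a seed URL into a credential-free URL and an optional "user:password" string.
function extractAuth(seedUrl: string): { url: string; auth: string | null } {
  const parsed = new URL(seedUrl);
  if (!parsed.username && !parsed.password) {
    return { url: parsed.href, auth: null };
  }
  // URL stores credentials percent-encoded; decode them before use.
  const auth =
    decodeURIComponent(parsed.username) +
    ":" +
    decodeURIComponent(parsed.password);
  // Remove the credentials so they are not sent as part of the URL itself.
  parsed.username = "";
  parsed.password = "";
  return { url: parsed.href, auth };
}

// Build the header object for HTTP basic auth.
// Note: btoa() only handles Latin-1 input; other characters need UTF-8 encoding first.
function basicAuthHeader(auth: string): Record<string, string> {
  return { Authorization: "Basic " + btoa(auth) };
}
```

With these helpers, `extractAuth("https://user:pass@example.com/docs/")` returns the URL `https://example.com/docs/` and the auth string `user:pass`.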

---------
Co-authored-by: Ed Summers <ehs@pobox.com>
2024-06-20 16:35:30 -07:00
Ilya Kreymer
f504effa51 Merge branch 'main' into release/1.1.4
bump to 1.2.0-beta.1
2024-06-13 19:28:25 -07:00
Ilya Kreymer
f85727954a
add undici for 1.1.4 release, to fix #606 (#608) 2024-06-13 18:46:05 -07:00
Ilya Kreymer
f6c4bf9935 bump version to 1.1.4 2024-06-13 10:31:57 -07:00
Ilya Kreymer
e2b4cc1844
proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars (#589)
fixes #587 

The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.

Proxy settings can now be set, in order of precedence via:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)

The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)
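
The precedence order above can be sketched as follows. This is an illustrative TypeScript sketch only: `resolveProxy` and `ProxyEnv` are hypothetical names, and requiring both `PROXY_HOST` and `PROXY_PORT` before building an `http://` URL is an assumption.

```typescript
interface ProxyEnv {
  PROXY_SERVER?: string;
  PROXY_HOST?: string;
  PROXY_PORT?: string;
}

// Resolve the proxy URL in order of precedence:
// 1. --proxyServer CLI flag, 2. PROXY_SERVER, 3. PROXY_HOST + PROXY_PORT.
function resolveProxy(
  cliProxyServer: string | undefined,
  env: ProxyEnv,
): string | null {
  if (cliProxyServer) {
    return cliProxyServer;
  }
  if (env.PROXY_SERVER) {
    return env.PROXY_SERVER;
  }
  if (env.PROXY_HOST && env.PROXY_PORT) {
    // The legacy variables only describe an HTTP proxy.
    return `http://${env.PROXY_HOST}:${env.PROXY_PORT}`;
  }
  return null; // no proxy configured
}
```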

---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-06-10 13:11:00 -07:00
Ilya Kreymer
894681e5fc
Bump version to 1.2.0 Beta + make draft release for each commit (#582)
Generate draft release from main and *-release branches to simplify
release process

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-05-22 15:45:48 -07:00
Ilya Kreymer
6c15bb3f00 version: bump to 1.1.3 2024-05-21 16:37:03 -07:00
Ilya Kreymer
bd5368cbca version: bump to 1.1.2 2024-05-07 13:46:05 +02:00
Ilya Kreymer
a61206fd73
profiles: ensure all page.goto() promises have at least catch block or are awaited (#559)
In particular, an API call to /navigate starts a page load but doesn't wait
for it to finish, since the user can choose to close the profile browser
at any time. This ensures that user operations don't crash the browser if
page.goto() is interrupted or fails (browser closed, profile saved, etc.) while a page is still loading.

bump to 1.1.1
2024-04-25 09:34:57 +02:00
Ilya Kreymer
dece69c233 version: bump to 1.1.0! 2024-04-18 17:45:57 -07:00
Ilya Kreymer
51d82598e7
Support site-specific wait via browsertrix-behaviors (#555)
The 0.6.0 release of Browsertrix Behaviors /
webrecorder/browsertrix-behaviors#70 introduces support for site-specific behaviors to implement an `awaitPageLoad()` function which allows for waiting for specific resources on the page load.
- This PR just adds a call to this function directly after page load.
- Factors out into an `awaitPageLoad()` method used in both crawler and replaycrawler to support the same wait in QA Mode
- This is to support custom loading wait time for Instagram (other sites in the future)
2024-04-18 17:16:57 -07:00
Ilya Kreymer
e15f0c95d9
Adblock support (#534)
Now that RWP 2.0.0 with adblock support has been released
(webrecorder/replayweb.page#307), this enables adblock on the QA mode
RWP embed, to get more accurate screenshots.
Fetches the adblock.gz directly from RWP (though could also fetch it
separately from Easylist)
Updates to 1.1.0-beta.5
2024-04-12 09:47:32 -07:00
Tessa Walsh
1325cc3868
Gracefully handle non-absolute path for create-login-profile --filename (#521)
Fixes #513 

If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.

Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.
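
The path handling described above might look roughly like this. It is a TypeScript sketch with a hypothetical `resolveProfilePath` helper; it uses a plain string check because paths inside the container are POSIX-style, whereas real Node.js code would more likely use `path.resolve()`.

```typescript
// Directory named in the commit message as the default location for profiles.
const PROFILES_DIR = "/crawls/profiles";

// Hypothetical helper: keep absolute paths as-is, and place anything else
// inside the profiles directory.
function resolveProfilePath(filename: string): string {
  if (filename.startsWith("/")) {
    return filename;
  }
  return `${PROFILES_DIR}/${filename}`;
}
```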

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-29 13:46:54 -07:00
Ilya Kreymer
5152169916 bump version to 1.1.0-beta.3 2024-03-28 17:19:40 -07:00
Ilya Kreymer
0d973d67e3
upgrade puppeteer-core to 22.6.1 (#516)
Using latest puppeteer-core to keep up with latest browsers, mostly
minor syntax changes

Due to change in puppeteer hiding the executionContextId, need to create
a frameId->executionContextId mapping and track it ourselves to support
the custom evaluateWithCLI() function
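
One way such a mapping could be kept is by listening to the Chrome DevTools Protocol `Runtime.executionContextCreated`, `Runtime.executionContextDestroyed` and `Runtime.executionContextsCleared` events. The TypeScript sketch below only models the bookkeeping: `FrameContextTracker` is a hypothetical class, wiring it to a CDP session is left out, and tracking only default contexts is an assumption.

```typescript
// Subset of the CDP Runtime.ExecutionContextDescription payload used here.
interface ExecutionContextDescription {
  id: number;
  auxData?: { frameId?: string; isDefault?: boolean };
}

class FrameContextTracker {
  private frameToContext = new Map<string, number>();

  // Call with the `context` field of Runtime.executionContextCreated.
  onContextCreated(context: ExecutionContextDescription): void {
    const frameId = context.auxData?.frameId;
    // Assumption: only the frame's default (main world) context is tracked.
    if (frameId && context.auxData?.isDefault) {
      this.frameToContext.set(frameId, context.id);
    }
  }

  // Call with `executionContextId` from Runtime.executionContextDestroyed.
  onContextDestroyed(executionContextId: number): void {
    this.frameToContext.forEach((id, frameId) => {
      if (id === executionContextId) {
        this.frameToContext.delete(frameId);
      }
    });
  }

  // Call on Runtime.executionContextsCleared.
  onContextsCleared(): void {
    this.frameToContext.clear();
  }

  get(frameId: string): number | undefined {
    return this.frameToContext.get(frameId);
  }
}
```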
2024-03-27 09:26:51 -07:00
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay / diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
  ```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
Ilya Kreymer
22a7351dc7
service worker capture fix: disable by default for now (#506)
Due to issues with capturing top-level pages, make bypassing service
workers the default for now. Previously, it was only disabled when using
profiles. (This is also consistent with ArchiveWeb.page behavior).
Includes:
- add `--serviceWorker` option which can be `disabled`,
`disabled-if-profile` (previous default) and `enabled`
- ensure page timestamp is set for direct fetch
- warn if page timestamp is missing on serialization, then set to now
before serializing
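
The three `--serviceWorker` values can be read as a small decision table. The TypeScript sketch below uses a hypothetical `shouldBypassServiceWorker` helper and assumes that "disabled" means service workers are bypassed for the crawl.

```typescript
type ServiceWorkerOpt = "disabled" | "disabled-if-profile" | "enabled";

// Decide whether service workers should be bypassed for a crawl.
function shouldBypassServiceWorker(
  opt: ServiceWorkerOpt,
  usingProfile: boolean,
): boolean {
  switch (opt) {
    case "disabled":
      return true; // new default: always bypass
    case "disabled-if-profile":
      return usingProfile; // previous default: bypass only with a profile
    case "enabled":
      return false;
  }
}
```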

bump version to 1.0.2
2024-03-22 13:37:14 -07:00
Ilya Kreymer
1fe810b1df
Improved support for running as non-root (#503)
This PR provides improved support for running crawler as non-root,
matching the user to the uid/gid of the crawl volume.

This fixes the initial regression (#502) from 0.12.4, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.

However, that was not enough to fully support the same signal handling
/ graceful shutdown as when running with the same user. To make
running as a different user work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc..)
to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)

A test has been added which runs one of the tests with a non-root
`test-crawls` directory to test the different user path. The test
(saved-state.test.js) includes sending interrupt signals and graceful
shutdown and allows testing of those features for a non-root gosu
execution.

Also bumping crawler version to 1.0.1
2024-03-21 08:16:59 -07:00
Ilya Kreymer
9a2ada3461 version: bump to 1.0.0 2024-03-18 19:15:35 -07:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates, filtering nested sitemap lists included
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`, don't add URLs past the page limit, interrupt
further parsing when at limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.
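
Two of the pieces listed above, robots.txt `sitemap:` discovery and date filtering, can be sketched in a few lines of TypeScript. This is not the SAX-based parser itself: `extractSitemapUrls`, `inDateRange` and `SitemapEntry` are hypothetical names, and keeping entries that have no usable `lastmod` is an assumption.

```typescript
// Collect sitemap URLs from robots.txt; the directive name is case-insensitive.
function extractSitemapUrls(robotsTxt: string): string[] {
  const urls: string[] = [];
  for (const line of robotsTxt.split(/\r?\n/)) {
    const m = /^\s*sitemap:\s*(\S+)/i.exec(line);
    if (m) {
      urls.push(m[1]);
    }
  }
  return urls;
}

interface SitemapEntry {
  loc: string;
  lastmod?: string; // date string as found in the sitemap, if any
}

// Keep an entry only if its lastmod falls within [fromDate, toDate].
function inDateRange(
  entry: SitemapEntry,
  fromDate?: Date,
  toDate?: Date,
): boolean {
  if (!entry.lastmod) {
    return true; // assumption: undated entries are not filtered out
  }
  const time = new Date(entry.lastmod).getTime();
  if (isNaN(time)) {
    return true; // assumption: unparseable dates are treated as undated
  }
  if (fromDate && time < fromDate.getTime()) {
    return false;
  }
  if (toDate && time > toDate.getTime()) {
    return false;
  }
  return true;
}
```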

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
Ilya Kreymer
f96c6a13dc version: bump to 1.0.0-beta.8 2024-03-16 15:32:19 -07:00
Tessa Walsh
e1fe028c7c
Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MKDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00