Commit graph

618 commits

Author SHA1 Message Date
emma
9bfc190867
remove redundant digest analytics
they should already be covered by the dedupe indexer
2026-02-17 14:28:35 -05:00
emma
6d75d60a5e
first pass - add redis reports index for crawl statistics
introduces a new RedisReportsIndex class that tracks statistics
about crawl data in redis, including counts and sizes by host, crawl,
mime type, and HTTP status code category
2026-02-16 19:23:38 -05:00
Ilya Kreymer
f3b4446638
frame behaviors: use frame.evaluate() instead of custom evaluateWithCLI() (#964)
Removes our custom evaluateWithCLI() call in favor of using standard
`frame.evaluate()`.
The custom method became trickier to use, and doesn't work for all
iframes.
The main benefit of it was to inject `getEventListeners()`, which is now
only used in Autoscroll to potentially skip scrolling, and not needed
for any other behaviors.

We could potentially use this workaround:
https://stackoverflow.com/questions/75517220/geteventlistener-function-support-for-latest-puppeteer-versions/75581410#75581410
to bring back `getEventListeners`; alternatively, adding a custom callback
like `hasEventListener()` would suffice for that check.

This simplifies the codebase and ensures that running behaviors is more
reliable.
Also, adds a callback for new frames that may get added to the page, such
as during scrolling, and ensures behaviors are called on those iframes as
well.
2026-02-12 16:02:52 -08:00
Ilya Kreymer
f27ffd4319
Fix browser network loading (#963)
When available, async fetch should try to load via the browser network,
especially for in-page discovered URLs, to ensure proper credentials are
used (may fix #960):

- adds missing CDP param that resulted in the browser network being skipped!
- try browser network for direct fetch too, if a page is available, but
then fall back to node fetch
- default to node fetch when network loading failed, or if in a browser
(non-page) context for request interception, e.g. in a worker.
- updates to browsertrix-behaviors 0.9.8, which prefers in-browser fetch when possible.
2026-02-12 13:43:50 -08:00
Ilya Kreymer
06435f1743 version: bump to 1.12.0-beta.1 2026-02-12 13:41:32 -08:00
Ilya Kreymer
154151913a
Dedup Initial Implementation (#889)
Fixes #884 

- Support for hash-based deduplication via a Redis provided with
--redisDedupeUrl (can be same as default redis)
- Support for writing WARC revisit records for duplicates
- Support for new indexer mode which imports CDXJ from one or more WACZs
(refactored from replay) to populate the dedup index
- Crawl and aggregate stats updated in dedupe index, including total
urls, deduped URLs, conserved size (difference between revisit and
response records), and estimated redundant size (aggregate) of
duplicates not deduped.
- Track removed crawls on index update, support for --remove operation
to purge removed crawls, otherwise removed crawl aggregate data is
maintained.
- Dependencies of each deduped crawl (WACZ files containing original data) are recorded in the datapackage.json `related.requires` field.
- Initial docs (develop/dedupe.md) and tests (tests/dedupe-basic.test.js) added.
- WIP on page-level dedupe (preempting loading of entire pages) if the
HTML is an exact dupe/match.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2026-02-12 13:40:49 -08:00
Ilya Kreymer
325d7fe1ca
add decompress() interceptor, support undici.request() without decompression + keep content-encoding if no decompression (#970)
Fixes #971 
- Add decompression interceptor as default for getProxyDispatcher(), no
need to decompress sitemap explicitly
- Add option to not decompress, keep content-encoding header and don't
add `x-orig-` when using node async fetch
- Create three dispatcher variants per proxy: redirect + decompress,
redirect + no decompress, no redirect + no decompress
2026-02-12 09:53:05 -08:00
Ilya Kreymer
80901a12e1 version: bump to 1.11.3 2026-02-09 10:35:03 -08:00
Ilya Kreymer
e15368a057
fix issues related to profile directory placed in /profile: (#969)
- always attempt to delete the existing profile dir before moving the new one into
its place, fixes #968
- treat 304 (e.g. if recrawling with an existing profile) as a cached
resource, don't attempt to write/check size
- fix typo in shouldSkipSave() for incomplete 206 responses
2026-02-09 10:33:37 -08:00
Ilya Kreymer
c57481f9e1
Fix default user-agent to not include minor version + set sec-ua-ch-* headers (#962)
- Only use the major version; set the rest to 0.0.0 to match Brave/Chrome
behavior
- Store major version in Browser
- Also set some `sec-ua-ch-*` headers to match Brave defaults
- Don't disable cache when creating a profile, to avoid sending different
Cache-Control headers during profile creation (cache is cleared before
the profile is created anyway)

---------

Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
2026-02-04 16:06:28 -08:00
Ilya Kreymer
689d9f6c6b
Apply pageExtraDelay after successful direct fetch (#961)
fixes #957

also apply page extra delay if direct fetch succeeded, to enforce
consistent rate limiting
2026-01-30 13:18:48 -08:00
Ilya Kreymer
8b0bbe76c4 typo fix: ./logger -> ./logger.js 2026-01-30 10:19:48 -08:00
zakk
1fd2aeba81
bugfix(normalize): normalize urls for seeds, add normalizeUrl wrapper (#959)
- applies normalizeUrl() to seed URL and seed isIncluded() check
- add normalizeUrl() wrapper which applies standard opts and also catches and logs any errors from normalization
- test: add scope tests to ensure URL with differently sorted query args still in scope
---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2026-01-30 10:00:19 -08:00
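The wrapper described in #959 could be sketched like this, using only Node built-ins (`URL`, `URLSearchParams.sort()`) in place of the `normalize-url` package; the function shape and the `console.warn` logging are illustrative, not the crawler's actual code:

```javascript
// Sketch of a normalizeUrl() wrapper: sorts query args so that URLs
// differing only in query-arg order compare equal, and catches/logs
// any normalization errors, returning the input unchanged on failure.
function normalizeUrl(url) {
  try {
    const u = new URL(url);
    // sort query args so ?b=2&a=1 and ?a=1&b=2 normalize identically
    u.searchParams.sort();
    return u.href;
  } catch (e) {
    // illustrative logging; the real crawler uses its own logger
    console.warn(`normalization failed for ${url}: ${e.message}`);
    return url;
  }
}
```

For example, `normalizeUrl("https://example.com/?b=2&a=1")` and `normalizeUrl("https://example.com/?a=1&b=2")` produce the same string, which is what makes the seed `isIncluded()` scope check consistent.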
Ilya Kreymer
1c32f64566
deps: update browser 1.86.146 (#958)
- update brave and latest puppeteer-core
- tests: fix sitemapper tests hopefully!
2026-01-30 01:25:59 -08:00
Ilya Kreymer
efb5f6aaec
Fix sitemap done check (#956)
Sitemap checkIfDone() should be called in finally, otherwise 'end' event
may never be emitted in some cases!
2026-01-29 13:57:12 -08:00
Ilya Kreymer
1e327fc351
ensure redirects are followed for sitemap, robots, other requests converted from fetch() (#955)
Refactor dispatcher apis:
- Use `getProxyDispatcher(withRedirect = true)` to follow redirects by
default, with an option to disable (e.g. in the recorder). This dispatcher also
ignores TLS errors, to match the current browser config. Used for fetching
archival content.
- Use `getFollowRedirectsDispatcher()`, which follows redirects but does
not ignore TLS errors and does not use proxies, for fetching non-archival
configs (profiles, behaviors, etc...)

Fixes #954, regression from #946
2026-01-29 10:00:16 -08:00
Ilya Kreymer
f6ff8d5122
warcio update + add links discovered from autoclick (#952)
- update warcio.js to 2.4.9 to fix issue with multiple repeated headers
values (now allowed for HTTP headers)
- ensure links discovered from autoclick are also crawled: the links are
stored in a set to avoid dupes, but there's no reason not to also queue
them for crawling, if they're in scope.
- bump to 1.11.2
2026-01-29 09:59:33 -08:00
Ilya Kreymer
581a70340a
fix signal handling edge-cases: (#951)
- ensure two signals at least 1 sec apart are received before immediate
termination
- only exit immediately if crawl not already post-processing, otherwise
let post-processing run its course
2026-01-19 13:31:27 -08:00
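The signal edge-case logic in #951 can be sketched as follows; the class and method names (`SignalGuard`, `shouldTerminate`) are illustrative, not the crawler's actual identifiers:

```javascript
// Sketch of the described signal handling: only two signals received at
// least 1 second apart trigger immediate termination, and never while
// post-processing (e.g. final upload) is running.
class SignalGuard {
  constructor() {
    this.firstSignalTime = null;
    this.postProcessing = false;
  }
  // called on each SIGINT/SIGTERM; returns true if the crawler
  // should exit immediately
  shouldTerminate(now = Date.now()) {
    if (this.firstSignalTime === null) {
      this.firstSignalTime = now; // first signal: begin graceful shutdown
      return false;
    }
    // ignore rapid repeats (e.g. a double Ctrl+C) under 1 second apart
    if (now - this.firstSignalTime < 1000) {
      return false;
    }
    // let post-processing run its course instead of exiting
    return !this.postProcessing;
  }
}
```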
Ilya Kreymer
3ce09e6d3a
add getFileOrUrlAsJson for loading local/remote JSON, don't use blob for local files (#949)
- remove openAsBlob() as that doesn't work with the request() api
- but keep openAsBlob() for interfacing with wabac.js fetch()
- also remove commented-out code
2026-01-12 10:34:30 -08:00
Ilya Kreymer
5cb237d2bd
Replace minio client with aws client-s3 + lib-storage for multi-part upload (#943)
Extends work in #547, adding upload via the @aws-sdk/lib-storage library:
- Replaces minio client with official aws s3 client
- Uses @aws-sdk/lib-storage for multi-part upload support

Testing:
- Ideally, this should address issues from #479 and webrecorder/browsertrix#2925
- Tested with all the major S3 implementations: VersityGW, RustFS,
SeaweedFS, Garage, as well as Minio

---------

Co-authored-by: Mattia <m@ttia.it>
Co-authored-by: Mattia <mattia@guella.it>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2026-01-08 12:29:53 -08:00
Ilya Kreymer
3fbe66ad8b
deps: update brave + bump to 1.11.0 (#948) 2026-01-08 12:24:06 -08:00
Ilya Kreymer
88277ea5ab
Replace fetch() with optimized undici request() (#946)
Per https://undici.nodejs.org/#/?id=benchmarks request() is supposed to
be much more performant compared to fetch() with almost the same
interface.

This PR replaces all of the fetch() calls (both using the proxy dispatcher
and regular fetch) with undici request().
The migration is fairly simple, as shown in
https://undici.nodejs.org/#/?id=migration-guide

The migration eliminates various web-stream-to-node-stream
conversions. To support automatic redirects, the undici redirect interceptor is used.

Also updates to latest undici (v7)
2026-01-07 21:42:37 -08:00
Emma Segal-Grossman
ebd5a05865
Update Puppeteer mobile device descriptor URL (#947) 2026-01-05 17:35:59 -05:00
emma
bf8dc77053
update Puppeteer mobile device descriptor URL 2026-01-05 16:08:34 -05:00
Ilya Kreymer
d3932f9c74
set ulimit before launching x11vnc to work around libvncserver bug (#945)
Fixes #944 

Sets the file ulimit to 8192 and then launches x11vnc. Should result in
faster profile loading when the file limit is set especially high, due to a bug in libvncserver (see #944 for more details).
2025-12-31 12:23:04 -08:00
Ilya Kreymer
376cef0404
follow-up to #915, add --allow-brave-component-update flag (#942)
can still disable most other component updates, while enabling brave
components, including shields, based on:

https://github.com/webrecorder/browsertrix-crawler/pull/915#issuecomment-3689772878
2025-12-31 12:18:50 -08:00
Tessa Walsh
0ecaa38e68
Fix custom behavior class example in docs (#940)
Updates custom behavior sample class and examples to be accurate:

- Include the missing required `init()` method
- Fix arguments in example uses of `Lib.getState()`
2025-12-16 19:26:51 -05:00
Ilya Kreymer
e320908e6a
don't fail crawl if profile can not be saved (#939)
- log exception caught while saving profile as an error
- bump to 1.10.2
2025-12-15 12:18:55 -08:00
Ilya Kreymer
df26169975
Sitemaps: parse /sitemap.xml if no sitemap listed in robots.txt (#933)
- Ensure /sitemap.xml is parsed even if robots.txt exists but no
sitemaps are listed there.
- Resolve relative URLs listed in robots.txt, e.g. 'Sitemap:
/my-sitemap.xml'
- Simplify sitemap detection logic: check robots first, then sitemap.xml
OR an alternate URL if provided via --useSitemap <url>
- Have two main methods, parseSitemap() and parseSitemapFromRobots()
that handle the parsing.
- follow-up to #930

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-12-11 10:37:37 -08:00
Ilya Kreymer
850a6a6665
Don't remove excluded-on-redirect URLs from seen list (#936)
Fixes #937 
- Don't remove URLs from seen list
- Add a new excluded key; add URLs to be excluded (out-of-scope on
redirect) to the excluded set. The size of this set can be used to get the
URLs that have been excluded in this way, to compute the number of
discovered URLs.
- Don't write urn:pageinfo records for excluded pages, along with not
writing to pages/extraPages.jsonl
2025-12-08 22:41:52 -08:00
Ilya Kreymer
4a703cdc09
sort query args before queuing URLs (#935)
- use the 'normalize-url' package to avoid differently sorted query args
for what is otherwise the same URL
- configure other options, such as keeping www. and trailing slashes,
only using this for query arg sorting
2025-12-08 15:51:50 -08:00
Ilya Kreymer
993081d3ee
better handling of net::ERR_HTTP_RESPONSE_CODE_FAILURE: (#934)
- if HTTP headers are provided but no payload, record the response
- record the page as failed with the status code provided, don't attempt
to retry
2025-12-05 16:56:42 -08:00
Ilya Kreymer
822de93301 version: bump to 1.10.0 2025-12-03 14:56:02 -08:00
Ilya Kreymer
042acc9c39 version: bump to 1.10.0.beta-2 2025-12-02 17:00:41 -08:00
Tessa Walsh
ff5619e624
Rename robots flag to --useRobots, keep --robots as alias (#932)
Follow-up to
https://github.com/webrecorder/browsertrix-crawler/issues/631

Based on feedback from
https://github.com/webrecorder/browsertrix/pull/3029

Renaming `--robots` to `--useRobots` will allow us to keep the
Browsertrix backend API more consistent with similar flags like
`--useSitemap`. Keeping `--robots` as it's a nice shorthand alias.
2025-12-02 15:55:25 -08:00
Ilya Kreymer
2914e93152
sitemapper refactor to fix concurrency: (#930)
- the original implementation did not actually wait for a sitemap to complete
before queuing new ones, resulting in a concurrency resource leak
- refactor to await completion of the sitemap parser, replacing the pending list
with a counter
- also, don't parse the sitemap if single-page and no extra hops!
- fixes issues in #928
2025-12-02 15:52:33 -08:00
Ilya Kreymer
59df6bbd3f
crash page on prompt dialog loop to continue: (#929)
- if a page is stuck in a window.alert / window.prompt loop, showing 10
or more consecutive dialogs (unrelated to unloading), call Page.crash()
to move on to the next page more quickly, as not much else can be done.
- add exception handling in dialog accept/dismiss to avoid a crawler crash
- fixes #926
2025-12-01 16:57:00 -08:00
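The dialog-loop guard from #929 amounts to a consecutive-dialog counter; a minimal sketch, where `makeDialogHandler`, the threshold constant, and the injected `crashPage` callback are all illustrative names rather than the crawler's own:

```javascript
// Sketch of the dialog-loop guard: after 10 or more consecutive
// dialogs, give up and crash the page to move on; otherwise dismiss
// the dialog, swallowing any errors so the crawler itself never crashes.
const MAX_CONSECUTIVE_DIALOGS = 10;

function makeDialogHandler(crashPage) {
  let count = 0;
  return (dialog) => {
    count += 1;
    if (count >= MAX_CONSECUTIVE_DIALOGS) {
      // stuck in an alert/prompt loop: not much else can be done
      crashPage();
      return;
    }
    try {
      dialog.dismiss();
    } catch (e) {
      // exception handling in dismiss avoids a crawler crash
    }
  };
}
```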
Ilya Kreymer
8e44b31b45 version: bump to 1.10.0-beta.1 2025-11-27 22:25:11 -08:00
Ilya Kreymer
2ef8e00268
fix connection leaks in aborted fetch() requests (#924)
- in doCancel(), use an abort controller and call abort(), instead of
body.cancel()
- ensure doCancel() is called when a WARC record is not written, e.g. it is
a dupe, as the stream is likely not consumed
- also call IO.close() when using the browser network reader
- fixes #923
- also adds missing dupe check to async resources queued from behaviors
(they were being deduped on write, but were still fetched unnecessarily)
2025-11-27 20:37:24 -08:00
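The abort-controller change in #924 can be illustrated with a small sketch; `fetchWithCancel` and the injectable `fetchFn` parameter are assumptions for illustration, not the recorder's real API:

```javascript
// Sketch of cancelling an in-flight fetch via AbortController, as
// described above: abort() releases the underlying connection even if
// the body stream was never consumed (e.g. the record was a dupe),
// unlike body.cancel(), which only cancels the response stream.
function fetchWithCancel(url, fetchFn = fetch) {
  const abortController = new AbortController();
  return {
    // the signal ties the request lifetime to the controller
    response: fetchFn(url, { signal: abortController.signal }),
    doCancel() {
      abortController.abort();
    },
  };
}
```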
Ilya Kreymer
8658df3999
deps: update to browsertrix-behaviors 0.9.7, puppeteer-core 24.31.0 (#922) 2025-11-26 20:12:16 -08:00
Ilya Kreymer
30646ca7ba
Add downloads dir to cache external dependency within the crawl (#921)
Fixes #920 
- Downloads profile, custom behaviors, and seed list to the `/downloads`
directory in the crawl
- Seed File: downloaded into /downloads. Never refetched if it already
exists on subsequent crawl restarts.
- Custom Behaviors (Git): downloaded into a dir, then moved to
/downloads/behaviors/<dir name>. If it already exists, a failure to download
will reuse the existing directory.
- Custom Behaviors (File): downloaded into a temp file, then moved to
/downloads/behaviors/<name.js>. If it already exists, a failure to download
will reuse the existing file.
- Profile: uses the `/profile` directory to contain the browser profile
- Profile: downloaded to a temp file, then placed into
/downloads/profile.tar.gz. If the download fails but the file already exists,
the existing /profile directory is used.
- Also fixes #897
2025-11-26 19:30:27 -08:00
Tessa Walsh
1d15a155f2
Add option to respect robots.txt disallows (#888)
Fixes #631 
- Adds --robots flag which enables checking robots.txt for each host, for each page, before the page is queued for further crawling.
- Supports --robotsAgent flag which configures the agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'
- robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU, retaining the last 100 robots entries, each up to 100K
- Non-200 responses are treated as empty robots, and empty robots are treated as 'allow all'
- Multiple requests to the same robots.txt are batched to perform only one fetch, waiting up to 10 seconds per fetch.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-11-26 19:00:06 -08:00
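The robots.txt caching policy described in #888 (last 100 entries, each capped at 100K) can be sketched with a Map-based LRU standing in for Redis; the class name and internals are illustrative, only the limits mirror the description:

```javascript
// Sketch of an LRU cache for robots.txt bodies keyed by URL: keeps the
// 100 most recently used entries, truncating each body to 100K chars.
// A Map (which preserves insertion order) stands in for Redis here.
const MAX_ENTRIES = 100;
const MAX_BODY_SIZE = 100_000;

class RobotsCache {
  constructor() {
    this.cache = new Map();
  }
  set(url, body) {
    // re-insert so the entry becomes most-recently-used
    this.cache.delete(url);
    this.cache.set(url, body.slice(0, MAX_BODY_SIZE));
    if (this.cache.size > MAX_ENTRIES) {
      // evict the least-recently-used (first) entry
      this.cache.delete(this.cache.keys().next().value);
    }
  }
  get(url) {
    if (!this.cache.has(url)) return undefined;
    const body = this.cache.get(url);
    // refresh recency on access
    this.cache.delete(url);
    this.cache.set(url, body);
    return body;
  }
}
```

Per the description above, an empty cached body ('allow all') is still a valid cache hit, distinct from a missing entry that requires a fetch.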
Ilya Kreymer
75a0c9a305 version: bump to 1.10.0-beta.0 2025-11-26 15:15:45 -08:00
hexagonwin
9cd2d393bc
Fix typo 'runInIframes' (#918)
'runInIframes' appears to be a typo.
(https://github.com/webrecorder/custom-behaviors/blob/main/behaviors/timeline.js
example)
2025-11-25 19:19:01 -08:00
Ilya Kreymer
b9b804e660
improvements to support pausing: (#919)
- clear size to 0 immediately after the WACZ is uploaded
- if the crawler is paused, ensure upload of any data on startup
- fetcher queue: stop queuing async requests if the recorder is marked for
stopping
2025-11-25 19:17:39 -08:00
Ilya Kreymer
565ba54454
better failure detection, allow update support for captcha detection via behaviors (#917)
- allow fail on content check from main behavior
- update to behaviors 0.9.6 to support 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if profile can not be extracted
2025-11-19 15:49:49 -08:00
Ilya Kreymer
87edef3362
netIdle cleanup + better default for pages where networkIdle times out (#916)
- set default networkIdle to 2
- add netIdleMaxRequests as an option, default to 1 (in case of long
running requests)
- further fix for #913 
- avoid accidental logging

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-11-18 16:34:02 -08:00
Ilya Kreymer
8c8fd6be08
remove --disable-component-update flag, fixes shields not working (#915)
should fix the main cause of the slowdown in #913
deps: update to brave 1.84.139, puppeteer 24.30.0
bump to 1.9.1
2025-11-14 20:30:42 -08:00
Ilya Kreymer
bb11147234
brave: update policies to disable new brave services (#914) 2025-11-14 20:00:58 -08:00
Ilya Kreymer
59fe064c62 version: bump to 1.9.0 2025-11-11 18:28:21 -08:00