Commit graph

602 commits

Author SHA1 Message Date
Ilya Kreymer
9db0872ecc rebase fix 2025-11-27 22:41:34 -08:00
Ilya Kreymer
7c37672ae9 add removing option to also remove unused crawls if doing a full sync, disable by default 2025-11-27 22:35:15 -08:00
Ilya Kreymer
0d414f72f1 indexer optimize: commit only if added 2025-11-27 22:35:15 -08:00
Ilya Kreymer
dd8d2e1ea7 rename 'dedup' -> 'dedupe' for consistency 2025-11-27 22:35:15 -08:00
Ilya Kreymer
c4f07c4e59 always return wacz; store wacz dependencies only for the current wacz
store crawlid dependencies for the entire crawl
2025-11-27 22:35:14 -08:00
Ilya Kreymer
9fba5da0ce cleanup, still keep compatibility with redis 6
set to 'post-crawl' state after uploading
2025-11-27 22:34:43 -08:00
Ilya Kreymer
6579b2dc95 update to new data model:
- hashes stored in separate crawl specific entries, h:<crawlid>
- wacz files stored in crawl specific list, c:<crawlid>:wacz
- hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set
- store filename, crawlId in related.requires list entries for each wacz
2025-11-27 22:32:52 -08:00
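
A minimal sketch of the key layout this commit describes, using ioredis. The key names `h:<crawlid>`, `c:<crawlid>:wacz`, `alldupes`, and `allcrawls` come from the commit message; the function names and entry values are illustrative, not the crawler's actual code:

```ts
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_DEDUP_URL || "redis://localhost:6379/0");

// during the crawl: record a hash in the crawl-specific entry,
// and each finished WACZ in the crawl-specific list
async function addHash(crawlId: string, hash: string, entry: string) {
  await redis.hset(`h:${crawlId}`, hash, entry);
}

async function addWacz(crawlId: string, waczFilename: string) {
  await redis.rpush(`c:${crawlId}:wacz`, waczFilename);
}

// when the crawl completes: commit its hashes to the shared 'alldupes'
// hashset and register the crawl in the 'allcrawls' set
async function commitCrawl(crawlId: string) {
  const hashes = await redis.hgetall(`h:${crawlId}`);
  if (Object.keys(hashes).length) {
    await redis.hset("alldupes", hashes);
  }
  await redis.sadd("allcrawls", crawlId);
}
```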
Ilya Kreymer
298b901558 - track source index for each hash, so entry becomes '<source index> <date> <url>'
- entry for source index can contain the crawl id (or possibly wacz and crawl id)
- also store dependent sources in relation.requires in datapackage.json
- tests: update tests to check for relation.requires
2025-11-27 22:29:37 -08:00
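
Illustrative only: a tiny sketch of the '<source index> <date> <url>' entry format named above; the helper names are hypothetical:

```ts
function formatEntry(sourceIndex: number, date: string, url: string): string {
  // entry format described in the commit: '<source index> <date> <url>'
  return `${sourceIndex} ${date} ${url}`;
}

function parseEntry(entry: string): { sourceIndex: number; date: string; url: string } {
  // split on the first two spaces; URLs contain no literal spaces
  const [index, date, ...rest] = entry.split(" ");
  return { sourceIndex: Number(index), date, url: rest.join(" ") };
}
```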
Ilya Kreymer
8d53399455 dedup POST requests and non-404s as well!
update timestamp after import
2025-11-27 22:28:43 -08:00
Ilya Kreymer
78b8847323 use dedup redis to queue up wacz files that need to be updated
use pending queue to support retries in case of failure
store both id and actual URL in case URL changes in subsequent retries
2025-11-27 22:28:43 -08:00
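
A hedged sketch of the pending-queue retry pattern described above, using the classic Redis RPOPLPUSH reliable-queue idiom via ioredis; the queue names and the { id, url } entry shape are assumptions, not the crawler's actual keys:

```ts
import Redis from "ioredis";

const redis = new Redis("redis://localhost:6379/0");

// hypothetical queue names
const QUEUE = "wacz:toupdate";
const PENDING = "wacz:pending";

// enqueue a WACZ that needs updating; store both id and URL, since the
// URL may change between retries
async function enqueue(id: string, url: string) {
  await redis.lpush(QUEUE, JSON.stringify({ id, url }));
}

// atomically move one entry to the pending queue while it is processed
async function takeNext(): Promise<{ id: string; url: string } | null> {
  const raw = await redis.rpoplpush(QUEUE, PENDING);
  return raw ? JSON.parse(raw) : null;
}

// on success, remove the entry from pending; on failure, leave it there
// so a retry pass can pick it up again
async function markDone(entry: { id: string; url: string }) {
  await redis.lrem(PENDING, 1, JSON.stringify(entry));
}
```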
Ilya Kreymer
ca02f09b5d dedup indexing: strip hash prefix from digest, as cdx does not have it
tests: add index import + dedup crawl to ensure digests match fully
2025-11-27 22:28:43 -08:00
Ilya Kreymer
db4393c2a1 deps update 2025-11-27 22:28:42 -08:00
Ilya Kreymer
0cadf371d0 tests: add dedup-basic.test for simple dedup, ensure number of revisit records === number of response records 2025-11-27 22:28:13 -08:00
Ilya Kreymer
c447428450 bump to 2.4.7 2025-11-27 22:28:12 -08:00
Ilya Kreymer
2f81798f09 update to latest warcio (2.4.7) to fix issues when returning payload-only size 2025-11-27 22:27:56 -08:00
Ilya Kreymer
db9e78e823 rename --dedupStoreUrl -> --redisDedupUrl
bump version to 1.9.0
fix typo
2025-11-27 22:27:26 -08:00
Ilya Kreymer
bbe084daa0 warc writing:
- update to warcio 2.4.6, write WARC-Payload-Digest along with WARC-Block-Digest for revisits
- copy additional custom WARC headers to revisit from response
2025-11-27 22:27:09 -08:00
Ilya Kreymer
87c94876f6 keep skipping dupe URLs as before 2025-11-27 22:26:35 -08:00
Ilya Kreymer
2ecf290d38 add indexer entrypoint:
- populate dedup index from remote wacz/multi wacz/multiwacz json

refactor:
- move WACZLoader to wacz to be shared with indexer
- state: move hash-based dedup to RedisDedupIndex

cli args:
- add --minPageDedupDepth to indicate when pages are skipped for dedup

- skip same URLs by same hash within same crawl
2025-11-27 22:26:32 -08:00
Ilya Kreymer
eb6b87fbaf args: add separate --dedupIndexUrl to support separate redis for dedup
indexing prep:
- move WACZLoader to wacz for reuse
2025-11-27 22:25:59 -08:00
Ilya Kreymer
00eca5329d dedup work:
- resource dedup via page digest
- page dedup via page digest check, blocking of dupe page
2025-11-27 22:25:59 -08:00
Ilya Kreymer
8e44b31b45 version: bump to 1.10.0-beta.1 2025-11-27 22:25:11 -08:00
Ilya Kreymer
2ef8e00268
fix connection leaks in aborted fetch() requests (#924)
- in doCancel(), use abort controller and call abort(), instead of
body.cancel()
- ensure doCancel() is called when a WARC record is not written, e.g. it is
a dupe, as the stream is likely not consumed
- also call IO.close() when using the browser network reader
- fixes #923
- also adds missing dupe check to async resources queued from behaviors
(they were being deduped on write, but were still fetched unnecessarily)
2025-11-27 20:37:24 -08:00
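
A minimal sketch of the AbortController pattern this fix describes: aborting the whole request releases the underlying connection, whereas body.cancel() may leave it open when the stream is never consumed. The surrounding function is illustrative:

```ts
async function fetchUnlessDupe(
  url: string,
  isDupe: (u: string) => boolean,
): Promise<Response | null> {
  const ac = new AbortController();
  const resp = await fetch(url, { signal: ac.signal });

  if (isDupe(url)) {
    // cancel via the controller so the connection is torn down, since the
    // body stream of a dupe record will never be consumed
    ac.abort();
    return null;
  }
  return resp;
}
```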
Ilya Kreymer
8658df3999
deps: update to browsertrix-behaviors 0.9.7, puppeteer-core 24.31.0 (#922) 2025-11-26 20:12:16 -08:00
Ilya Kreymer
30646ca7ba
Add downloads dir to cache external dependency within the crawl (#921)
Fixes #920 
- Downloads profile, custom behaviors, and seed list to the `/downloads`
directory in the crawl
- Seed file: downloaded into /downloads; never refetched if it already
exists on subsequent crawl restarts.
- Custom behaviors (Git): downloaded into a dir, then moved to
/downloads/behaviors/<dir name>; if it already exists, a failure to
download will reuse the existing directory.
- Custom behaviors (File): downloaded into a temp file, then moved to
/downloads/behaviors/<name.js>; if it already exists, a failure to
download will reuse the existing file.
- Profile: uses the `/profile` directory to contain the browser profile
- Profile: downloaded to a temp file, then placed into
/downloads/profile.tar.gz; if the download fails but the file already
exists, the existing /profile directory is used
- Also fixes #897
2025-11-26 19:30:27 -08:00
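
A sketch of the download-or-reuse pattern described above, under the assumption of a hypothetical fetchToFile() helper; the /downloads path mirrors the commit message:

```ts
import fs from "fs";
import path from "path";

const DOWNLOADS_DIR = "/downloads";

async function downloadOnce(url: string, name: string): Promise<string> {
  const dest = path.join(DOWNLOADS_DIR, name);
  const tmp = dest + ".tmp";
  try {
    await fetchToFile(url, tmp);          // download to a temp file first
    await fs.promises.rename(tmp, dest);  // then move into place
  } catch (e) {
    // on a failed download, reuse the file left by a previous crawl run, if any
    if (!fs.existsSync(dest)) {
      throw e;
    }
  }
  return dest;
}

// hypothetical helper: stream a URL into a local file
async function fetchToFile(url: string, dest: string): Promise<void> {
  const resp = await fetch(url);
  if (!resp.ok) throw new Error(`fetch failed: ${resp.status}`);
  await fs.promises.writeFile(dest, Buffer.from(await resp.arrayBuffer()));
}
```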
Tessa Walsh
1d15a155f2
Add option to respect robots.txt disallows (#888)
Fixes #631 
- Adds a --robots flag which enables checking robots.txt for each host
before each page is queued for further crawling.
- Supports a --robotsAgent flag which configures the agent to check in
robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'.
- robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU,
retaining the last 100 robots entries, each up to 100K.
- Non-200 responses are treated as empty robots, and empty robots are
treated as 'allow all'.
- Multiple requests to the same robots.txt are batched to perform only one
fetch, waiting up to 10 seconds per fetch.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-11-26 19:00:06 -08:00
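
The allow/disallow check itself can be sketched with the robots-parser library the commit names; the Redis LRU cache and fetch batching are omitted here, and the surrounding function is illustrative:

```ts
import robotsParser from "robots-parser";

async function isPageAllowed(pageUrl: string, agent = "Browsertrix/1.x"): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", pageUrl).href;
  const resp = await fetch(robotsUrl);
  // non-200 responses are treated as an empty robots.txt, i.e. 'allow all'
  const body = resp.ok ? await resp.text() : "";
  const robots = robotsParser(robotsUrl, body);
  // isAllowed() returns undefined when no rule applies; treat that as allowed
  return robots.isAllowed(pageUrl, agent) !== false;
}
```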
Ilya Kreymer
75a0c9a305 version: bump to 1.10.0-beta.0 2025-11-26 15:15:45 -08:00
hexagonwin
9cd2d393bc
Fix typo 'runInIframes' (#918)
'runInIframes' appears to be a typo.
(https://github.com/webrecorder/custom-behaviors/blob/main/behaviors/timeline.js
example)
2025-11-25 19:19:01 -08:00
Ilya Kreymer
b9b804e660
improvements to support pausing: (#919)
- clear size to 0 immediately after wacz is uploaded
- if crawler is paused, ensure upload of any data on startup
- fetcher q: stop queuing async requests if recorder is marked for
stopping
2025-11-25 19:17:39 -08:00
Ilya Kreymer
565ba54454
better failure detection, allow update support for captcha detection via behaviors (#917)
- allow failing on content check from main behavior
- update to behaviors 0.9.6 to support 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if profile can not be extracted
2025-11-19 15:49:49 -08:00
Ilya Kreymer
87edef3362
netIdle cleanup + better default for pages where networkIdle times out (#916)
- set default networkIdle to 2
- add netIdleMaxRequests as an option, default to 1 (in case of
long-running requests)
- further fix for #913 
- avoid accidental logging

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-11-18 16:34:02 -08:00
Ilya Kreymer
8c8fd6be08
remove --disable-component-update flag, fixes shields not working (#915)
should fix the main cause of the slowdown in #913
deps: update to brave 1.84.139, puppeteer 24.30.0
bump to 1.9.1
2025-11-14 20:30:42 -08:00
Ilya Kreymer
bb11147234
brave: update policies to disable new brave services (#914) 2025-11-14 20:00:58 -08:00
Ilya Kreymer
59fe064c62 version: bump to 1.9.0 2025-11-11 18:28:21 -08:00
Ilya Kreymer
85c5632eb1
deps: bump dependencies for 1.9.0 (#912)
update to brave 1.84.135, wabac.js 2.24.5
2025-11-11 14:38:35 -08:00
Tessa Walsh
11f52db31e
Fix linting following external contribution (#911)
Quick follow-up to
7dd13a9ec4,
to fix a linting issue introduced in that PR.
2025-11-11 12:03:56 -08:00
aponb
b50ef1230f
feat: add extraChromeArgs support for passing custom Chrome flags (#877)
This change introduces a new CLI option --extraChromeArgs to Browsertrix
Crawler, allowing users to pass arbitrary Chrome flags without modifying
the codebase.

This approach is future-proof: any Chrome flag can be provided at
runtime, avoiding the need for hard-coded allowlists.
Maintains backward compatibility: if no extraChromeArgs are passed,
behavior remains unchanged.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-11-11 12:03:30 -08:00
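
A hedged sketch of how such an option might be wired up with yargs and merged into the browser launch args; the --extraChromeArgs name comes from the PR, everything else here is illustrative:

```ts
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

const argv = yargs(hideBin(process.argv))
  .option("extraChromeArgs", {
    describe: "additional Chrome flags to pass to the browser",
    type: "array",
    default: [],
  })
  .parseSync();

// merge user-supplied flags with the crawler's built-in ones at launch time
const chromeArgs: string[] = [
  "--no-sandbox",                          // example built-in flag
  ...(argv.extraChromeArgs as string[]),   // e.g. --disable-gpu from the CLI
];
```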
Percival
7dd13a9ec4
fix: Skip proxy for seed file and custom behavior downloads (#907) 2025-11-11 10:51:24 -05:00
Wannaphong Phatthiyaphaibun
37a6fa974b
Fix directory path in user guide for WACZ file (#910)
I found that the directory path in the user guide for the WACZ file is
wrong. It should be `crawls/collections/test/test.wacz`.
2025-11-07 12:39:01 -08:00
Ilya Kreymer
74b6ad0ae0 deps: bump behaviors to 0.9.5
beta 1.9.0-beta.1
2025-11-02 12:30:09 -08:00
Ilya Kreymer
390d036f9e
deps: update to browsertrix-behaviors 0.9.4 (#906)
Includes fixes for autoclick behavior:
- able to click on svgs
- don't navigate back if click did not result in history stack change
2025-11-02 09:12:15 -08:00
Ilya Kreymer
5685cb2cbe
profiles: add singleton lock removal on startup to avoid any issues (#904)
If the SingletonLock (and SingletonPort, SingletonSocket) files somehow
made it into the profile, the browser will refuse to start. This will
ensure that they are cleared.
(Could also do this before saving the profile, but this will catch it for
any existing profiles.)
2025-11-02 09:12:07 -08:00
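
A sketch of clearing the singleton files on startup; the three filenames are taken from the commit message, and profileDir is an assumption:

```ts
import fs from "fs";
import path from "path";

function clearSingletonFiles(profileDir: string) {
  for (const name of ["SingletonLock", "SingletonPort", "SingletonSocket"]) {
    // force: true ignores missing files, so this is a no-op on clean profiles
    fs.rmSync(path.join(profileDir, name), { force: true });
  }
}
```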
Ilya Kreymer
3935526240
add --saveProfile option to save profile after successful crawl (#903)
- if --saveProfile is specified, attempt to save profile to same target
as --profile
- if --saveProfile <target>, save to target
- save profile on finalExit if browser has launched
- supports local file paths and storage-relative path with '@' (same as
--profile)
- also clear cache in first worker to match regular profile creation

fixes #898
2025-10-29 19:57:25 -07:00
Ilya Kreymer
afdb6674e5
profile download improvements: (#899)
- log when profile download starts
- ensure there is a timeout on the profile download attempt (60 secs)
- retry 2 more times if the initial profile download times out
- fail the crawl after 3 retries if the profile cannot be downloaded
successfully

bump to 1.8.2
2025-10-25 16:49:40 -07:00
Ilya Kreymer
6f26148a9b bump version to 1.8.1 2025-10-08 17:11:04 -07:00
Ilya Kreymer
4f234040ce
Profile Saving Improvements (#894)
fix some observed errors that occur when saving profile:
- use browser.cookies instead of page.cookies to get all cookies, not
just from the page
- catch and ignore exceptions when clearing the cache
- logging: log when proxy init is happening on all paths, in case of an
error in the proxy connection
2025-10-08 17:09:20 -07:00
Ilya Kreymer
002feb287b
dismiss js dialog popups (#895)
move the JS dialog handler so it is not only for autoclick; dismiss all JS
dialogs (alert(), prompt()) to avoid blocking the page
fixes #891
2025-10-08 14:57:52 -07:00
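
The standard puppeteer-core pattern for this, as a sketch; the commit applies it beyond autoclick, and the wrapper function here is illustrative:

```ts
import { Page } from "puppeteer-core";

function dismissDialogs(page: Page) {
  page.on("dialog", async (dialog) => {
    // dismiss alert()/confirm()/prompt() so they cannot block the page
    await dialog.dismiss();
  });
}
```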
Ilya Kreymer
2270964996
logging: remove duplicate seeds found error (#893)
Per discussion, the message is unnecessary / confusing (doesn't provide
enough info) and can also happen on crawler restart.
2025-10-07 08:18:22 -07:00
Ilya Kreymer
fd49041f63
flow behaviors: add scrolling into view (#892)
Some page elements don't quite respond correctly if the element is not
in view, so we should add setEnsureElementIsInTheViewport() to the click,
doubleclick, hover, and change step locators.
2025-10-07 08:17:56 -07:00
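
A sketch of the Puppeteer Locator API call the commit names, which scrolls the element into view before acting on it; the wrapper and selector are illustrative:

```ts
import { Page } from "puppeteer-core";

async function clickInView(page: Page, selector: string) {
  await page
    .locator(selector)
    .setEnsureElementIsInTheViewport(true)
    .click();
}
```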
Ed Summers
cc2d890916
Add addLink doc (#890)
It's helpful to know this function is there!
2025-10-02 15:45:55 -04:00