- hashes stored in separate crawl-specific entries, h:<crawlid>
- wacz files stored in crawl-specific list, c:<crawlid>:wacz
- hashes committed to 'alldupes' hashset when crawl is complete, crawls added to 'allcrawls' set (see the key-layout sketch after this list)
- store filename, crawlId in relation.requires list entries for each wacz
- entry for source index can contain the crawl id (or possibly wacz and crawl id)
- also store dependent sources in relation.requires in datapackage.json
- tests: update tests to check for relation.requires
- update to warcio 2.4.6, write WARC-Payload-Digest along with WARC-Block-Digest for revisits
- copy additional custom WARC headers to revisit from response
- populate dedup index from remote wacz/multi wacz/multiwacz json
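
A minimal sketch of the crawl-specific Redis key layout described above, assuming an ioredis client; the class and method names here are illustrative, not the actual RedisDedupIndex:

```ts
import Redis from "ioredis";

// Illustrative sketch of the dedup key layout (h:<crawlid>, c:<crawlid>:wacz,
// 'alldupes', 'allcrawls'); not the crawler's actual RedisDedupIndex.
class DedupIndexSketch {
  constructor(private redis: Redis, private crawlId: string) {}

  // store a payload hash in the crawl-specific hash entry
  async addHash(hash: string, value: string) {
    await this.redis.hset(`h:${this.crawlId}`, hash, value);
  }

  // record a WACZ file in the crawl-specific list
  async addWACZ(filename: string) {
    await this.redis.rpush(`c:${this.crawlId}:wacz`, filename);
  }

  // a hash is a dupe if committed by a completed crawl or already seen in this crawl
  async isDupe(hash: string): Promise<boolean> {
    return (
      (await this.redis.hexists("alldupes", hash)) === 1 ||
      (await this.redis.hexists(`h:${this.crawlId}`, hash)) === 1
    );
  }

  // when the crawl completes, commit its hashes and register the crawl
  async commitOnFinish() {
    const hashes = await this.redis.hgetall(`h:${this.crawlId}`);
    if (Object.keys(hashes).length) {
      await this.redis.hset("alldupes", hashes);
    }
    await this.redis.sadd("allcrawls", this.crawlId);
  }
}
```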
refactor:
- move WACZLoader to wacz to be shared with indexer
- state: move hash-based dedup to RedisDedupIndex
cli args:
- add --minPageDedupDepth to indicate when pages are skipped for dedup
- skip same URLs by same hash within same crawl
- in doCancel(), use an abort controller and call abort() instead of body.cancel() (see the sketch below)
- ensure doCancel() is called when a WARC record is not written, e.g. it is a dupe, as the stream is likely not consumed
- also call IO.close() when using the browser network reader
- fixes #923
- also adds missing dupe check to async resources queued from behaviors
(were being deduped on write, but were still fetched unnecessarily)
Fixes #920
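
A minimal sketch of the abort-based cancel, assuming a fetch()-based fetcher; fetchWithCancel and isDupe are illustrative names, not the recorder's actual code:

```ts
// Illustrative sketch: abort the request instead of calling body.cancel(),
// so an unconsumed stream does not linger when the record is not written.
async function fetchWithCancel(
  url: string,
  isDupe: (resp: Response) => Promise<boolean>,
): Promise<Response | null> {
  const abortController = new AbortController();
  const resp = await fetch(url, { signal: abortController.signal });

  const doCancel = () => {
    // cancel the whole request, not just the body stream
    abortController.abort();
  };

  if (await isDupe(resp)) {
    // the WARC record won't be written, so clean up the stream explicitly
    doCancel();
    return null;
  }

  return resp;
}
```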
- Downloads profile, custom behavior, and seed list to `/downloads`
directory in the crawl
- Seed File: Downloaded into /downloads. Never refetched if it already exists on subsequent crawl restarts.
- Custom Behaviors: Git: Downloaded into a directory, then moved to /downloads/behaviors/<dir name>. If it already exists, a failure to download will reuse the existing directory.
- Custom Behaviors: File: Downloaded into a temp file, then moved to /downloads/behaviors/<name.js>. If it already exists, a failure to download will reuse the existing file.
- Profile: using `/profile` directory to contain the browser profile
- Profile: downloaded to a temp file, then placed into /downloads/profile.tar.gz. If the download fails but the file already exists, the existing /profile directory is used (see the download sketch after this list).
- Also fixes #897
Fixes #631
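
A minimal sketch of the download-to-temp-then-move pattern used above, reusing an existing copy when the download fails; the helper name and temp path are assumptions:

```ts
import fs from "node:fs/promises";
import os from "node:os";
import path from "node:path";

// Sketch: download to a temp file, move into place under /downloads, and fall
// back to an existing copy from a previous run if the download fails.
async function downloadOrReuse(url: string, destPath: string): Promise<string> {
  const tmpPath = path.join(os.tmpdir(), `dl-${Date.now()}`);
  try {
    const resp = await fetch(url);
    if (!resp.ok) {
      throw new Error(`download failed: ${resp.status}`);
    }
    await fs.writeFile(tmpPath, Buffer.from(await resp.arrayBuffer()));
    await fs.mkdir(path.dirname(destPath), { recursive: true });
    await fs.copyFile(tmpPath, destPath);
    return destPath;
  } catch (e) {
    // if the download fails but a previous copy exists, reuse it
    try {
      await fs.access(destPath);
      return destPath;
    } catch {
      throw e;
    }
  } finally {
    await fs.rm(tmpPath, { force: true });
  }
}
```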
- Adds --robots flag which enables checking robots.txt for each host for each page, before the page is queued for further crawling.
- Supports --robotsAgent flag which configures the agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'
- Robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU, retaining the last 100 robots entries, each up to 100K
- Non-200 responses are treated as empty robots.txt, and an empty robots.txt is treated as 'allow all'
- Multiple requests for the same robots.txt are batched to perform only one fetch, waiting up to 10 seconds per fetch (see the sketch after this list)
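
A minimal sketch of the robots.txt check using the robots-parser library; the in-memory Map stands in for the Redis-backed LRU, and request batching is omitted:

```ts
import robotsParser from "robots-parser";

const MAX_ROBOTS_BODY = 100_000; // ~100K cap per robots.txt body
const robotsCache = new Map<string, string>(); // stand-in for the Redis LRU

async function isPageAllowed(pageUrl: string, agent = "Browsertrix/1.x"): Promise<boolean> {
  const robotsUrl = new URL("/robots.txt", pageUrl).href;

  let body = robotsCache.get(robotsUrl);
  if (body === undefined) {
    try {
      const resp = await fetch(robotsUrl);
      // non-200 responses are treated as an empty robots.txt
      body = resp.ok ? (await resp.text()).slice(0, MAX_ROBOTS_BODY) : "";
    } catch {
      body = "";
    }
    robotsCache.set(robotsUrl, body);
  }

  // empty robots.txt means 'allow all'
  if (!body) {
    return true;
  }

  const robots = robotsParser(robotsUrl, body);
  // robots-parser falls back to the '*' group when the agent has no specific rules
  return robots.isAllowed(pageUrl, agent) !== false;
}
```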
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- clear size to 0 immediately after wacz is uploaded
- if crawler is paused, ensure upload of any data on startup
- fetcher q: stop queuing async requests if recorder is marked for
stopping
- allow fail on content check from main behavior
- update to behaviors 0.9.6 to support 'captcha_found' content check for
tiktok
- allow throwing from timedRun (see the sketch below)
- call fatal() if profile cannot be extracted
- set default networkIdle to 2
- add netIdleMaxRequests as an option, default to 1 (in case of long-running requests)
- further fix for #913
- avoid accidental logging
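
A minimal sketch of a timedRun-style helper with an opt-in rethrow, as referenced above; the real crawler utility may differ in its details:

```ts
// Race a promise against a timeout; by default log and return null on error,
// but allow callers (e.g. profile extraction) to rethrow and treat it as fatal.
async function timedRun<T>(
  promise: Promise<T>,
  seconds: number,
  message = "Timed out",
  allowThrow = false,
): Promise<T | null> {
  let timer: NodeJS.Timeout | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), seconds * 1000);
  });

  try {
    return await Promise.race([promise, timeout]);
  } catch (e) {
    if (allowThrow) {
      throw e;
    }
    console.warn((e as Error).message);
    return null;
  } finally {
    clearTimeout(timer);
  }
}
```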
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
This change introduces a new CLI option --extraChromeArgs to Browsertrix
Crawler, allowing users to pass arbitrary Chrome flags without modifying
the codebase.
This approach is future-proof: any Chrome flag can be provided at
runtime, avoiding the need for hard-coded allowlists.
Maintains backward compatibility: if no extraChromeArgs are passed,
behavior remains unchanged.
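
A minimal sketch of how --extraChromeArgs could be parsed and appended to the browser launch args; the option wiring, default args, and executable path are assumptions, not the crawler's actual code:

```ts
import puppeteer from "puppeteer-core";
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

// e.g. crawl --extraChromeArgs "--lang=de-DE" --extraChromeArgs "--force-dark-mode"
const argv = yargs(hideBin(process.argv))
  .option("extraChromeArgs", {
    type: "string",
    array: true,
    default: [] as string[],
    describe: "additional Chrome flags passed through to the browser",
  })
  .parseSync();

const defaultArgs = ["--no-sandbox", "--disable-dev-shm-usage"];

const browser = await puppeteer.launch({
  executablePath: "/usr/bin/chromium-browser", // assumed path
  // extra args are simply appended, so no hard-coded allowlist is needed
  args: [...defaultArgs, ...argv.extraChromeArgs],
});
```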
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
If the SingletonLock (and SingletonPort, SingletonSocket) files somehow made it into the profile, the browser will refuse to start. This will ensure that they are cleared.
(Could also do this before saving the profile as well, but this will catch it for any existing profiles.)
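
A minimal sketch of clearing the singleton files from an unpacked profile directory before launch; the helper name is hypothetical and the file names are those mentioned above:

```ts
import fs from "node:fs/promises";
import path from "node:path";

// remove Chrome's singleton lock files, if present, so a restored profile can start
async function clearSingletonLocks(profileDir: string): Promise<void> {
  for (const name of ["SingletonLock", "SingletonPort", "SingletonSocket"]) {
    // force: true ignores files that don't exist
    await fs.rm(path.join(profileDir, name), { force: true });
  }
}
```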
- if --saveProfile is specified, attempt to save profile to same target
as --profile
- if --saveProfile <target>, save to target
- save profile on finalExit if browser has launched
- supports local file paths and storage-relative path with '@' (same as
--profile)
- also clear cache in first worker to match regular profile creation
fixes #898
- log when profile download starts
- ensure there is a timeout on the profile download attempt (60 secs)
- retry 2 more times if the initial profile download times out
- fail the crawl if the profile cannot be downloaded successfully after 3 attempts (see the sketch below)
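
A minimal sketch of the retry-with-timeout logic for the profile download; the constants and function name are assumptions:

```ts
const PROFILE_DL_TIMEOUT_SECS = 60;
const MAX_ATTEMPTS = 3;

async function downloadProfileWithRetries(url: string): Promise<ArrayBuffer> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    console.log(`profile download starting (attempt ${attempt}/${MAX_ATTEMPTS})`);
    try {
      // abort the attempt if it takes longer than 60 seconds
      const resp = await fetch(url, {
        signal: AbortSignal.timeout(PROFILE_DL_TIMEOUT_SECS * 1000),
      });
      if (resp.ok) {
        return await resp.arrayBuffer();
      }
      console.warn(`profile download failed with status ${resp.status}`);
    } catch (e) {
      console.warn(`profile download timed out or errored: ${(e as Error).message}`);
    }
  }
  // all attempts failed: the crawl should be failed
  throw new Error("profile could not be downloaded, failing crawl");
}
```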
bump to 1.8.2
fix some observed errors that occur when saving profile:
- use browser.cookies instead of page.cookies to get all cookies, not just those from the page
- catch and ignore exceptions when clearing the cache
- logging: log when proxy init is happening on all paths, in case of an error in the proxy connection
Some page elements don't quite respond correctly if the element is not in view, so we should add setEnsureElementIsInTheViewport() to the click, doubleclick, hover, and change step locators.