Introduces a new RedisReportsIndex class that tracks statistics
about crawl data in Redis, including counts and sizes by host, crawl,
mime type, and HTTP status code category
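A minimal sketch of the status-code bucketing idea, with illustrative names (the actual RedisReportsIndex key scheme is not shown in this commit):

```javascript
// Bucket an HTTP status into a category such as "2xx" or "4xx".
function statusCategory(status) {
  if (status >= 100 && status < 600) {
    return `${Math.floor(status / 100)}xx`;
  }
  return "unknown";
}

// Hypothetical per-(crawl, dimension, value) Redis key, incremented per response.
function statsKey(crawlId, dimension, value) {
  return `reports:${crawlId}:${dimension}:${value}`;
}

console.log(statusCategory(404)); // "4xx"
console.log(statsKey("crawl-1", "status", statusCategory(301)));
```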
Removes our custom evaluateWithCLI() call in favor of using standard
`frame.evaluate()`.
The custom method had become trickier to use, and doesn't work for all
iframes.
The main benefit of it was to inject `getEventListeners()`, which is now
only used in Autoscroll to potentially skip scrolling, and not needed
for any other behaviors.
We could potentially use this workaround:
https://stackoverflow.com/questions/75517220/geteventlistener-function-support-for-latest-puppeteer-versions/75581410#75581410
to bring back `getEventListeners`, or adding a custom callback like
`hasEventListener()` would suffice for that check.
This simplifies the codebase and ensures that running behaviors is more
reliable.
Also adds a callback for new frames that may get added to the page, such
as during scrolling, and ensures behaviors are called on those iframes
as well.
When available, async fetch should try to load via the browser network,
especially for in-page discovered URLs, to ensure proper credentials are
used (may fix #960):
- adds missing CDP param that resulted in browser network being skipped!
- try the browser network for direct fetch too, if a page is available,
but then fall back to node fetch
- default to node fetch when network loading failed, or if in browser
(non-page) context for request interception, eg. in a worker.
- updates to browsertrix-behaviors 0.9.8, which prefers in-browser fetch when possible.
Fixes #884
- Support for hash-based deduplication via a Redis provided with
--redisDedupeUrl (can be same as default redis)
- Support for writing WARC revisit records for duplicates
- Support for new indexer mode which imports CDXJ from one or more WACZs
(refactored from replay) to populate the dedup index
- Crawl and aggregate stats updated in dedupe index, including total
urls, deduped URLs, conserved size (difference between revisit and
response records), and estimated redundant size (aggregate) of
duplicates not deduped.
- Track removed crawls on index update, support for --remove operation
to purge removed crawls, otherwise removed crawl aggregate data is
maintained.
- Dependencies of each deduped crawl (WACZ files containing original data) are recorded in datapackage.json related.requires field.
- Initial docs (develop/dedupe.md) and tests (tests/dedupe-basic.test.js) added.
- WIP on page-level dedupe (preempt loading entire pages) if HTML is a
dupe/matches exactly.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #971
- Add decompression interceptor as default for getProxyDispatcher(), no
need to decompress sitemap explicitly
- Add option to not decompress, keep content-encoding header and don't
add `x-orig-` when using node async fetch
- Create three dispatcher variants per proxy: redirect + decompress,
redirect + no decompress, no redirect + no decompress
- always attempt to delete existing profile dir before moving new one in
its place, fixes #968
- treat 304 (eg. if recrawling with existing profile) as cached
resource, don't attempt to write/check size
- fix typo in shouldSkipSave() for incomplete 206 responses
- Only use the major version from the browser version, set the rest to
0.0.0 to match Brave/Chrome behavior
- Store major version in Browser
- Also set some `sec-ch-ua-*` headers to match Brave defaults
- Don't disable cache when creating profile to avoid sending different
Cache-Control headers when in profile creation (cache cleared before
profile created anyway)
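The major-version-only behavior can be illustrated with a small sketch (the function name is hypothetical, not the crawler's actual code):

```javascript
// Keep only the major version, zeroing the rest, to match how
// Brave/Chrome report versions in client hints.
function majorVersionOnly(fullVersion) {
  const major = fullVersion.split(".")[0];
  return `${major}.0.0.0`;
}

console.log(majorVersionOnly("124.0.6367.91")); // "124.0.0.0"
```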
---------
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
- applies normalizeUrl() to seed URL and seed isIncluded() check
- add normalizeUrl() wrapper which applies standard opts and also catches and logs any errors from normalization
- test: add scope tests to ensure URL with differently sorted query args still in scope
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Refactor dispatcher apis:
- Use `getProxyDispatcher(withRedirect = true)` to follow redirects by
default, with option to disable (eg. in recorder). This dispatcher also
ignores TLS errors, to match current browser config. Used for fetching
archival content.
- Use `getFollowRedirectsDispatcher()`, which follows redirects but does
not ignore TLS errors and does not use proxies, for fetching non-archival
configs (profiles, behaviors, etc...)
Fixes #954, a regression from #946
- update warcio.js to 2.4.9 to fix issue with multiple repeated header
values (now allowed for HTTP headers)
- ensure links discovered from autoclick are also crawled: the links are
stored in a set to avoid dupe links, but no reason not to also queue
them for crawling, if they're in scope.
- bump to 1.11.2
- ensure two signals at least 1 sec apart are received before immediate
termination
- only exit immediately if crawl not already post-processing, otherwise
let post-processing run its course
- remove openAsBlob() as that doesn't work with the request() api
- but keep openAsBlob for when interfacing with wabac.js fetch()
- also remove commented out code
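The double-signal guard could look roughly like this (names and the exact threshold handling are illustrative; the real handler also checks whether post-processing is already running):

```javascript
// Only escalate to immediate termination if two signals arrive
// at least 1 second apart; otherwise stay on the graceful path.
let firstSignalTime = 0;

function onSignal(nowMs) {
  if (firstSignalTime && nowMs - firstSignalTime >= 1000) {
    return "terminate";
  }
  firstSignalTime = firstSignalTime || nowMs;
  return "graceful";
}
```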
Extends work in #547, adding upload via the @aws-sdk/lib-storage library:
- Replaces minio client with official aws s3 client
- Uses @aws-sdk/lib-storage for multi-part upload support
Testing:
- This should ideally address issues from #479 and
webrecorder/browsertrix#2925
- Tested with all the major S3 implementations: VersityGW, RustFS,
SeaweedFS, Garage as well as Minio
---------
Co-authored-by: Mattia <m@ttia.it>
Co-authored-by: Mattia <mattia@guella.it>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Per https://undici.nodejs.org/#/?id=benchmarks request() is supposed to
be much more performant compared to fetch() with almost the same
interface.
This PR replaces all of the fetch() calls (both using proxy dispatcher
and regular fetch) with undici request()
The migration is fairly simple, as shown in
https://undici.nodejs.org/#/?id=migration-guide
The migration eliminates various web-stream-to-node-stream
conversions. To support automatic redirects, the undici redirect interceptor is used.
Also updates to latest undici (v7)
Fixes #944
Sets the file ulimit to 8192 and then launches x11vnc. Should result in
faster profile loading when the default file limit is especially high, due to a bug in libvncserver (see #944 for more details).
Updates custom behavior sample class and examples to be accurate:
- Include missing required `init()` method
- Fix arguments in example uses of `Lib.getState()`
- Ensure the /sitemap.xml is parsed even if robots.txt exists, but no
sitemaps listed there.
- Resolve relative URLs listed in robots.txt, eg. 'Sitemap:
/my-sitemap.xml'
- Simplify sitemap detection logic, check robots first, then sitemap.xml
OR alternate url if provided via --useSitemap <url>
- Have two main methods, parseSitemap() and parseSitemapFromRobots()
that handle the parsing.
- follow-up to #930
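Resolving relative sitemap URLs listed in robots.txt can be sketched with the standard URL API (the helper name and regex are hypothetical):

```javascript
// Extract "Sitemap:" entries from a robots.txt body, resolving
// relative entries like "/my-sitemap.xml" against the robots.txt URL.
function sitemapsFromRobots(robotsUrl, robotsBody) {
  const sitemaps = [];
  for (const line of robotsBody.split("\n")) {
    const m = line.match(/^\s*sitemap:\s*(\S+)/i);
    if (m) {
      sitemaps.push(new URL(m[1], robotsUrl).href);
    }
  }
  return sitemaps;
}

console.log(sitemapsFromRobots(
  "https://example.com/robots.txt",
  "User-agent: *\nSitemap: /my-sitemap.xml\nSitemap: https://cdn.example.com/s.xml"
));
// → ["https://example.com/my-sitemap.xml", "https://cdn.example.com/s.xml"]
```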
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #937
- Don't remove URLs from seen list
- Add new excluded key, adding URLs to be excluded (out-of-scope on
redirect) to the excluded set. The size of this set gives the number of
URLs excluded in this way, used to compute the number of discovered
URLs.
- Don't write urn:pageinfo records for excluded pages, along with not
writing to pages/extraPages.jsonl
- use 'normalize-url' package to avoid differently sorted query args
that are the same url
- configure other options, such as keeping www. and trailing slashes,
only using this for query arg sorting
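The query-arg sorting goal can be illustrated with the standard URL API (a minimal stand-in for this one option, not the 'normalize-url' package itself):

```javascript
// Sort query args so otherwise-identical URLs compare equal,
// while leaving "www." and trailing slashes untouched.
function normalizeQueryArgs(url) {
  const u = new URL(url);
  u.searchParams.sort();
  return u.href;
}

console.log(
  normalizeQueryArgs("https://www.example.com/page/?b=2&a=1") ===
  normalizeQueryArgs("https://www.example.com/page/?a=1&b=2")
); // true
```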
- original implementation did not actually wait for sitemaps to complete
before queuing new ones, resulting in a concurrency resource leak
- refactor to await completion of sitemap parser, replacing pending list
with counter
- also, don't parse sitemap if single-page and no extra hops!
- fixes issues in #928
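The counter-based refactor could be sketched as follows (illustrative names, not the crawler's exact implementation):

```javascript
// Track in-flight sitemap parses with a counter; resolve only
// when the count drops back to zero.
class PendingCounter {
  constructor() {
    this.count = 0;
    this.done = new Promise((res) => { this.resolveDone = res; });
  }
  start() { this.count++; }
  finish() {
    if (--this.count === 0) this.resolveDone();
  }
}

async function parseAll(urls, parseOne) {
  const pending = new PendingCounter();
  pending.start(); // guard so an empty list still resolves
  for (const url of urls) {
    pending.start();
    parseOne(url).finally(() => pending.finish());
  }
  pending.finish();
  await pending.done; // completes only after every parse finishes
}
```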
- if a page is stuck in a window.alert / window.prompt loop, showing 10
or more consecutive dialogs (unrelated to unloading), call Page.crash()
to more quickly move on to the next page, as not much else can be done.
- add exception handling in dialog accept/dismiss to avoid crawler crash
- fixes #926
- in doCancel(), use abort controller and call abort(), instead of
body.cancel()
- ensure doCancel() is called when a WARC record is not written, eg. is
a dupe, as stream is likely not consumed
- also call IO.close() when using the browser network reader
- fixes #923
- also adds missing dupe check to async resources queued from behaviors
(were being deduped on write, but were still fetched unnecessarily)
Fixes#920
- Downloads profile, custom behavior, and seed list to `/downloads`
directory in the crawl
- Seed File: Downloaded into downloads. Never refetched if already
exists on subsequent crawl restarts.
- Custom Behaviors: Git: Downloaded into a dir, then moved to
/downloads/behaviors/<dir name>. If it already exists, failure to
download will reuse the existing directory
- Custom Behaviors: File: Downloaded into a temp file, then moved to
/downloads/behaviors/<name.js>. If it already exists, failure to
download will reuse the existing file.
- Profile: using `/profile` directory to contain the browser profile
- Profile: downloaded to temp file, then placed into
/downloads/profile.tar.gz. If failed to download, but already exists,
existing /profile directory is used
- Also fixes#897
Fixes#631
- Adds --robots flag which will enable checking robots.txt for each host for each page, before the page is queued for further crawling.
- Supports --robotsAgent flag which configures the agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'
- Robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU, retaining the last 100 robots entries, each up to 100K
- Non-200 responses are treated as empty robots, and empty robots are treated as 'allow all'
- Multiple requests to the same robots.txt are batched to perform only one fetch, waiting up to 10 seconds per fetch.
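The batching of concurrent robots.txt requests can be sketched with an in-flight promise map (names illustrative; the 10-second timeout and Redis LRU are omitted):

```javascript
// robots URL -> Promise<string>; concurrent callers share one fetch.
const inflight = new Map();

let fetchCount = 0; // illustrative counter showing batching in action

async function getRobots(url, fetchBody) {
  let p = inflight.get(url);
  if (!p) {
    fetchCount++;
    p = fetchBody(url).finally(() => inflight.delete(url));
    inflight.set(url, p);
  }
  return p;
}
```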
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- clear size to 0 immediately after wacz is uploaded
- if crawler is paused, ensure upload of any data on startup
- fetcher q: stop queuing async requests if recorder is marked for
stopping
- allow fail on content check from main behavior
- update to behaviors 0.9.6 to support 'captcha_found' content check for
tiktok
- allow throwing from timedRun
- call fatal() if profile cannot be extracted
- set default networkIdle to 2
- add netIdleMaxRequests as an option, default to 1 (in case of long
running requests)
- further fix for #913
- avoid accidental logging
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>