- Adds support for YAML-based config for multiple proxies, containing
'matchHosts' section by regex and 'proxies' declaration, allowing
matching any number of hosts to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching and running browser with the specified proxy PAC script
served by an internal http server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config
Fixes#836
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- add --failOnContentCheck for quick fail if content check in behavior
fails
- expose __bx_contentCheckFailed to cause an immediately failure from
behavior
- only allow failing crawl due to content check from within
awaitPageLoad() callback
- set a 'failReason' key to track that crawl failed due to a particular
content check reason
- deps: update to browsertrix-behaviors 0.9.0, update to wabac.js
(2.23.6)
- fixes#860
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes#841
Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step in order to be able to async fetch the seed file from a
URL.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
If --saveStorage is set, localStorage and sessionStorage will be
serialized with the WARC record for the page.
If a page redirects, track what the current page URL is and save storage
as part of the page's WARC record.
Fixes#855
Related to https://github.com/webrecorder/browsertrix-crawler/issues/848
Several users have had issues with disk utilization checks, including
the values reported by `df` inside the crawler container having
unexpected results for mounted volumes. The commonly recommended
solution to this is to use `docker system ps`, but that is of course not
available within the Docker container itself.
This PR changes disk utilization checks to be an opt-in feature by
setting the default value to `0` (disabled).
- validate --lang values, fail immediately with invalid iso-639-1
country code
- ignore --lang value when using profile, print warning that profile
language takes precedence
- fixes#833
Fixes#804
- Site-specific behaviors use behaviorScriptCustom log context (via browsertrix-behaviors 0.8.3)
- Add behavior logs to redis if --logBehaviorsToRedis is set, including non-debug behaviors / behaviorsScript context and all behaviorScriptCustom logs
- Noisy logs from built-in behaviors like autoscroll are now logged to
debug in https://github.com/webrecorder/browsertrix-behaviors/pull/92
and so won't be pushed to Redis for newer versions of the crawler.
- Updates browsertrix-behaviors to 0.8.3 and makes some changes to
log format in tests accordingly.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
By using the useSHA1 flag, the payload digest in records will use SHA-1
with Base32 encoding instead of the default SHA-256
Co-authored-by: Andreas Predikaka <andreas.predikaka@onb.ac.at>
- set crawl id from collection, not other way around, to ensure unique
redis keyspace for different collections
- by default, set crawl id to unique value based on host and collection,
eg. '@hostname-@id'
- don't include '@id' in collection interpolation, can only used
hostname or timestamp
- fixes issue mentioned / workaround provided in #784
- ci: add docker login + cacheing to work around rate limits
- tests: fix sitemap tests
- extractLinks() now handled via browsertix-behaviors
- fixes#770 via browsertrix-behaviors, checks for toJSON overrides
- organize exposed functions to enum list
Fixes#798
Also modifies the existing test for link selector validation to check 17
status code on exit when link selectors fail validation.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
- follow up to #743
- page retries are simply added back to the same queue with `retry`
param incremented and a higher scope, after extraHops, to ensure retries
are added at the end.
- score calculation is: `score = depth + (extraHops * MAX_DEPTH) +
(retry * MAX_DEPTH * 2)`, this ensures that retries have lower priority
than extraHops, and additional retries even lower priority (higher
score).
- warning is logged when a retry happens, error only when all retries
are exhausted.
- back to one failure list, urls added there only when all retries are
exhausted.
- rename --numRetries -> --maxRetries / --retries for clarity
- state load: allow retrying previously failed URLs if --maxRetries is
higher then on previous run.
- ensure working with --failOnFailedStatus, if provided, invalid status
codes (>= 400) are retried along with page load failures
- fixes#132
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not
enabling by default
- Adds support for new exposed function `__bx_addSet` which allows
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates, only used if link has an href
- Adds a new pageFinished flag on the worker state.
- Adds a on('dialog') handler to reject onbeforeunload page navigations,
when in behavior (page not finished), but accept when page is finished -
to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors printed as warnings, instead of hard
exit, for forward compatibility for new behavior types in the future
Fixes#728, also #216, #665, #31
- new `fullPageFinal` screenshot option, which will take a full page screenshot after behaviors are run, or before moving onto next page if behaviors are skipped.
Related to #486
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes#712
- Also expands the existing documentation about behaviors and adds a test.
- Uses query arg for 'branch' and 'path' to specify git branch and subpath in repo, respectively.
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Fixes#368
The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.
Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.
New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Refactors args parsing so that `Crawler.params` is properly timed with
CLI options + additions with `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
Fixes#674
This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously WARC was written once, then read again (for CDXJ indexing),
read again (for adding to new WACZ ZIP), written to disk (into new WACZ
ZIP), read again (if upload to remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all
data is read once to either generate WACZ on disk or upload to remote.
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using `--proxyServer ssh://user@host[:port]` config and
also passing an `--sshProxyPrivateKeyFile <private key file>` file param
and an optional `--sshProxyKnownHostsFile <public host key file>`file
param. The key files are expected to be mounted as volumes into the
crawler.
- Same arguments are also available for create-login-profile
- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.
- Docs are updated to include a new 'Crawling with Proxies' page in the user guide
- Tests are updated to include crawling through an SSH proxy running locally.
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Fixes#604
Ensures that extra seeds are propagated to all crawler instances.
Adds a new redis hashmap key to store the extraSeed mappings
url->extraSeeds index, to ensure the extra seeds are added in the same
order on other instances, even if encountered in different order.
Add a new redis lua primitive 'addnewseed' which combines several
operations: check if extra seed already exists and returning existing
index, add new seed to extraSeed list, also add to regular URL seed
list.
fixes#587
The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.
Proxy settings can now be set, in order of precedence via:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)
The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- if set, runs the crawl but doesn't store any archive data (WARCS,
WACZ, CDXJ) while logs and pages are still written, and saved state can be
generated (per the --saveState options).
- adds test to ensure only 'logs' and 'pages' dirs are generated with --dryRun
- screenshot, text extraction are skipped altogether in dryRun mode,
warning is printed that storage and archiving-related options may be
ignored
- fixes#593
Fixes#563
This PR makes a few changes to fix a regression in behavior around
`failOnFailedSeed` for the 1.x releases:
- Fail with exit code 1, not 17, when pages are unreachable due to DNS
not resolving or other network errors if the page is a seed and
`failOnFailedSeed` is set
- Extend tests, add test to ensure crawl succeeds on 404 seed status
code if `failOnINvalidStatus` isn't set
- use frame.load() to load RWP frame directly instead of waiting for
navigation messages
- retry loading RWP if replay frame is missing
- support --postLoadDelay in replay crawl
- support --include / --exclude options in replay crawler, allow
excluding and including pages to QA via regex
- improve --qaDebugImageDiff debug image saving, save images to same
dir, using ${counter}-${workerid}-${pageid}-{crawl,replay,vdiff}.png for
better sorting
- when running QA crawl, check and use QA_ARGS instead of CRAWL_ARGS if
provided
- ensure empty string text from page is treated different from error (undefined)
- ensure info.warc.gz is closed in closeFiles()
misc:
- fix typo in --postLoadDelay check!
- enable 'startEarly' mode for behaviors (autofetch, autoplay)
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
but before running link extraction, text extraction, screenshots and
behaviors.
Useful for sites that load quickly but perform async loading / init
afterwards, fixes#519
A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs
Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay/ diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
comparison: {
screenshotMatch?: number;
textMatch?: number;
resourceCounts: {
crawlGood?: number;
crawlBad?: number;
replayGood?: number;
replayBad?: number;
};
};
```
- bump version to 1.1.0-beta.2
Due to issues with capturing top-level pages, make bypassing service
workers the default for now. Previously, it was only disabled when using
profiles. (This is also consistent with ArchiveWeb.page behavior).
Includes:
- add --serviceWorker option which can be `disabled`,
disabled-if-profile (previous default) and `enabled`
- ensure page timestamp is set for direct fetch
- warn if page timestamp is missing on serialization, then set to now
before serializing
bump version to 1.0.2
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser
Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates, filtering nested sitemap lists included
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`, don't add URLs pass the page limit, interrupt
further parsing when at limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.
Fixes#496
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- add --logExcludeContext for log contexts that should be excluded
(while --logContext specifies which are to be included)
- enable 'recorderNetwork' logging for debugging CDP network
- create default log context exclude list (containing: screencast,
recorderNetwork, jsErrors), customizable via --logExcludeContext
recorder: Track failed requests and include in pageinfo records with
status code 0
- cleanup cdp handler methods
- intercept requestWillBeSent to track requests that started (but may
not complete)
- fix shouldSkip() still working if no url is provided (eg. check only
headers)
- set status to 0 for async fetch failures
- remove responseServedFromCache interception, as response data
generally not available then, and responseReceived is still called
- pageinfo: include page requests that failed with status code 0, also
include 'error' status if available.
- ensure page is closed on failure
- ensure pageinfo still written even if nothing else is crawled for a
page
- track cached responses, add to debug logging (can also add to pageinfo
later if needed)
tests: add pageinfo test for crawling invalid URL, which should still
result in pageinfo record with status code 0
bump to 1.0.0-beta.7
Add fail on status code option, --failOnInvalidStatus to treat non-200
responses as failures. Can be useful especially when combined with
--failOnFailedSeed or --failOnFailedLimit
requeue: ensure requeued urls are requeued with same depth/priority, not
0
Fixes#462
Add --writePagesToRedis arg, for use conjunction with QA features in Browsertrix Cloud, to add
pages to the database for each crawl.
Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and redis)
Also include timestamp (as ISO date) in `pageinfo:` records
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (default to 1GB) for when a new WARC is
created
- support custom WARC prefix via --warcPrefix, prepended to new WARC
filename, test via basic_crawl.test.js
- filename template for new files is:
`${prefix}-${crawlId}-$ts-${this.workerid}.warc${his.gzip ? ".gz" : ""}`
with `$ts` replaced at new file creation time with current timestamp
Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
- add timeout to final fetch tasks to avoid never hanging on finish
- fix adding `WARC-Truncated` header, need to set after stream is
finished to determine if its been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
- add LogContext type and enumerate all log contexts
- also add LOG_CONTEXT_TYPES array to validate --context arg
- rename errJSON -> formatErr, convert unknown (likely Error) to dict
- make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes#426
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>