Fixes #584
- Replace `interrupted` with `interruptReason`
- Add distinct exit codes for the new interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), and DiskUtilization (16), in addition to the existing BrowserCrashed (10), SignalInterrupted (11), and SignalInterruptedForce (13) (see the sketch below)
- Doc fix to CLI args
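A minimal sketch of how distinct exit codes per interrupt reason could be represented; the enum name and helper are assumptions, only the numeric values come from the list above:

```ts
// Sketch only: enum name and helper are illustrative, not the crawler's
// actual code; the numeric values match the reasons listed above.
enum InterruptExitCode {
  BrowserCrashed = 10,
  SignalInterrupted = 11,
  FailedLimit = 12,
  SignalInterruptedForce = 13,
  SizeLimit = 14,
  TimeLimit = 15,
  DiskUtilization = 16,
}

// Instead of a boolean `interrupted` flag, the crawler records why it was
// interrupted and exits with the matching code.
function exitWith(reason: keyof typeof InterruptExitCode): never {
  process.exit(InterruptExitCode[reason]);
}
```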
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Follow-up to #743
- Page retries are simply added back to the same queue with the `retry` param incremented and a higher score (after extraHops), to ensure retries are added at the end.
- The score calculation is `score = depth + (extraHops * MAX_DEPTH) + (retry * MAX_DEPTH * 2)`, which ensures that retries have lower priority than extraHops, and additional retries even lower priority (higher score); see the sketch after this list.
- A warning is logged when a retry happens; an error is logged only when all retries are exhausted.
- Back to a single failure list; URLs are added there only when all retries are exhausted.
- Rename --numRetries -> --maxRetries / --retries for clarity
- State load: allow retrying previously failed URLs if --maxRetries is higher than on the previous run.
- Ensure this works with --failOnFailedStatus: if provided, failed status codes (>= 400) are retried along with page load failures
- Fixes #132
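A minimal sketch of the priority score described above; MAX_DEPTH stands in for the crawler's depth ceiling constant, and the exact value here is illustrative:

```ts
// Illustrative depth ceiling; the crawler defines its own constant.
const MAX_DEPTH = 1_000_000;

// Lower score = dequeued earlier. Retries always sort after extraHops pages,
// and each additional retry sorts later still.
function queueScore(depth: number, extraHops: number, retry: number): number {
  return depth + extraHops * MAX_DEPTH + retry * MAX_DEPTH * 2;
}

// e.g. a page at depth 2, no extra hops, on its first retry:
// queueScore(2, 0, 1) === 2 + 0 + 2_000_000
```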
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Adds support for autoclick behavior:
- Adds a new `autoclick` behavior option to `--behaviors`, but does not enable it by default
- Adds support for the new exposed function `__bx_addSet`, which allows the autoclick behavior to persist state about links that have already been clicked to avoid duplicates; only used if the link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler that rejects onbeforeunload page navigations while behaviors are running (page not finished) but accepts them once the page is finished, to allow navigation away only when behaviors are done (see the sketch below)
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors are printed as warnings instead of causing a hard exit, for forward compatibility with new behavior types in the future
Fixes #728, also #216, #665, #31
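A minimal sketch of the dialog handling described above, assuming Puppeteer's page 'dialog' event; the `pageFinished` flag shown is a stand-in for the worker-state field, not the exact code:

```ts
import { Page, Dialog } from "puppeteer-core";

// Sketch only: workerState.pageFinished stands in for the flag described above.
function handleBeforeUnload(page: Page, workerState: { pageFinished: boolean }) {
  page.on("dialog", async (dialog: Dialog) => {
    if (dialog.type() === "beforeunload") {
      if (workerState.pageFinished) {
        // Behaviors are done: allow navigation away from the page.
        await dialog.accept();
      } else {
        // Autoclick (or another behavior) is still running: stay on the page.
        await dialog.dismiss();
      }
    } else {
      // Other dialogs (alert/confirm/prompt) are simply dismissed in this sketch.
      await dialog.dismiss();
    }
  });
}
```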
Fixes #712
- Also expands the existing documentation about behaviors and adds a test.
- Uses query args 'branch' and 'path' to specify the git branch and subpath within the repo, respectively.
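A minimal sketch of reading the 'branch' and 'path' query args off a custom-behaviors git URL; the URL shape and helper name are illustrative:

```ts
// Illustrative: given e.g. https://example.com/behaviors.git?branch=main&path=dist,
// split out the plain clone URL, branch, and subpath.
function parseGitBehaviorSource(rawUrl: string) {
  const url = new URL(rawUrl);
  const branch = url.searchParams.get("branch") || "";
  const path = url.searchParams.get("path") || "";
  url.search = ""; // strip query args to recover the plain clone URL
  return { cloneUrl: url.href, branch, path };
}
```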
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Fixes #368
The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.
Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.
New tests are added to ensure that loading behaviors from URLs, as well as a mixed combination of URLs and filepaths, works as expected.
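A minimal sketch, with assumed helper names, of how each --customBehaviors entry might be classified by source type before loading:

```ts
import { existsSync, statSync } from "fs";

// Sketch only: the type and helper are illustrative, not the crawler's
// actual loader API.
type BehaviorSource =
  | { kind: "url"; url: string }
  | { kind: "dir"; path: string }
  | { kind: "file"; path: string };

function classifyBehaviorSource(entry: string): BehaviorSource {
  if (entry.startsWith("http://") || entry.startsWith("https://")) {
    return { kind: "url", url: entry };
  }
  if (existsSync(entry) && statSync(entry).isDirectory()) {
    return { kind: "dir", path: entry };
  }
  return { kind: "file", path: entry };
}

// The flag is repeatable, so entries of any kind can be mixed, e.g.:
// ["https://example.com/behavior.js", "./behaviors/", "./one-off.js"].map(classifyBehaviorSource)
```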
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Refactors args parsing so that `Crawler.params` is properly typed to match the CLI options (plus additions) via the `CrawlerArgs` type (see the sketch after this list).
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
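A minimal sketch of the typed parse this enables, assuming an abbreviated `CrawlerArgs` shape; the real option set is much larger, and validation still happens separately at runtime:

```ts
import yargs from "yargs";
import { hideBin } from "yargs/helpers";

// Abbreviated stand-in for the crawler's full CLI option set.
type CrawlerArgs = {
  workers: number;
  behaviors: string[];
};

// yargs can't express all of the crawler's validation in types, so the
// parsed object is cast to the typed shape after parsing.
const params = yargs(hideBin(process.argv))
  .option("workers", { type: "number", default: 1 })
  .option("behaviors", { type: "string", array: true, default: [] as string[] })
  .parseSync() as unknown as CrawlerArgs;
```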
- Adds support for running a SOCKS5 proxy over an SSH connection. This can be configured with `--proxyServer ssh://user@host[:port]`, along with an `--sshProxyPrivateKeyFile <private key file>` param and an optional `--sshProxyKnownHostsFile <public host key file>` param. The key files are expected to be mounted as volumes into the crawler. (See the sketch after this list.)
- The same arguments are also available for create-login-profile
- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.
- Docs are updated to include a new 'Crawling with Proxies' page in the user guide
- Tests are updated to include crawling through an SSH proxy running locally.
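A minimal sketch, not the actual implementation, of starting an autossh-backed SOCKS5 tunnel from Node along these lines; `localPort` is where the browser's proxy setting would point, and the flag set is an assumption:

```ts
import { spawn } from "child_process";

// Open a local SOCKS5 proxy on localPort by tunnelling through an SSH host.
// autossh restarts the tunnel if it drops; -M 0 disables its monitor port
// in favor of ssh keepalives.
function startSshSocksProxy(
  user: string,
  host: string,
  sshPort: number,
  localPort: number,
  privateKeyFile: string,
  knownHostsFile?: string,
) {
  const args = [
    "-M", "0",
    "-N",                           // no remote command, tunnel only
    "-D", `localhost:${localPort}`, // dynamic (SOCKS) port forwarding
    "-i", privateKeyFile,
    "-p", String(sshPort),
    "-o", "ServerAliveInterval=30",
    "-o", knownHostsFile
      ? `UserKnownHostsFile=${knownHostsFile}`
      : "StrictHostKeyChecking=accept-new",
    `${user}@${host}`,
  ];
  return spawn("autossh", args, { stdio: "inherit" });
}
```

As described above, the real setup also waits until the local SOCKS port accepts connections before the crawl proceeds.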
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
- parse URL username/password and store in the 'auth' field of the seed, or pass the 'auth' field directly (from the YAML config)
- add an 'Authorization' header with base64-encoded basic auth via setExtraHTTPHeaders() (see the sketch below)
- tests: add a test for crawling with auth using http-server serving the local docs build (docs are now built as part of CI)
- docs: add HTTP Auth to the YAML config section
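A minimal sketch of the basic-auth header construction, using Puppeteer's setExtraHTTPHeaders; the seed shape shown is illustrative:

```ts
import { Page } from "puppeteer-core";

// Illustrative seed shape: 'auth' is "username:password", taken either from
// the URL's credentials or set directly in the YAML config.
interface Seed {
  url: string;
  auth?: string;
}

async function applySeedAuth(page: Page, seed: Seed) {
  const parsed = new URL(seed.url);
  const auth =
    seed.auth ||
    (parsed.username
      ? `${decodeURIComponent(parsed.username)}:${decodeURIComponent(parsed.password)}`
      : "");
  if (auth) {
    await page.setExtraHTTPHeaders({
      Authorization: "Basic " + Buffer.from(auth).toString("base64"),
    });
  }
}
```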
---------
Co-authored-by: Ed Summers <ehs@pobox.com>
- Adds a --dryRun option: if set, runs the crawl but doesn't store any archive data (WARCs, WACZ, CDXJ); logs and pages are still written, and saved state can be generated (per the --saveState options).
- Adds a test to ensure only the 'logs' and 'pages' dirs are generated with --dryRun
- Screenshots and text extraction are skipped altogether in dryRun mode; a warning is printed that storage and archiving-related options may be ignored
- Fixes #593
Fixes #513
If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.
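A minimal sketch of that resolution (the helper name is illustrative):

```ts
import path from "path";

// Relative --filename values resolve inside /crawls/profiles; absolute
// paths are used as-is.
function resolveProfileFilename(filename: string): string {
  return path.isAbsolute(filename)
    ? filename
    : path.resolve("/crawls/profiles", filename);
}

// resolveProfileFilename("profile.tar.gz") -> "/crawls/profiles/profile.tar.gz"
// resolveProfileFilename("/output/profile.tar.gz") -> "/output/profile.tar.gz"
```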
Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Adds an option to wait a set amount of time after a page has loaded, but before running link extraction, text extraction, screenshots and behaviors.
Useful for sites that load quickly but perform async loading / init afterwards. Fixes #519
A simple workaround for when it's tricky to detect when a page has actually fully loaded, useful for sites such as Instagram.
The intent is for even non-graceful interruption (repeated Ctrl+C) to still result in valid WARC records, even if a page is unfinished (see the sketch after this list):
- immediately exit the browser, and call closeWorkers()
- finalize() recorder, finish active WARC records but don't fetch
anything else
- flush() existing open writer, mark as done, don't write anything else
- possible fix to additional issues raised in #487
Docs: update docs on the different interrupt options, e.g. single SIGINT/SIGTERM, multiple SIGINT/SIGTERM (as handled here) vs SIGKILL
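A minimal sketch of graceful-vs-forced signal handling along these lines; the shutdown hooks are stand-ins for the steps listed above:

```ts
// First SIGINT/SIGTERM requests a graceful stop; a repeated signal forces an
// immediate browser exit while still finalizing WARC records so they stay valid.
let interruptCount = 0;

// Stand-ins for the crawler's own shutdown steps described above.
function requestGracefulStop(): void {}
async function closeWorkersAndBrowser(): Promise<void> {}
async function finalizeRecordersAndFlushWriters(): Promise<void> {}

async function handleSignal() {
  interruptCount++;
  if (interruptCount === 1) {
    // Graceful: let in-flight pages finish, then shut down normally.
    requestGracefulStop();
    return;
  }
  // Forced: exit the browser and close workers now, finish active WARC
  // records, flush open writers and mark them done, but fetch nothing else.
  await closeWorkersAndBrowser();
  await finalizeRecordersAndFlushWriters();
  process.exit(13); // SignalInterruptedForce, matching the exit codes above
}

process.on("SIGINT", () => void handleSignal());
process.on("SIGTERM", () => void handleSignal());
```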
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser
Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a time (currently 5); see the sketch below
- `fromDate` and `toDate` filter dates, to only include URLs between the given dates; nested sitemap lists are filtered as well
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, and interrupt further parsing when at the limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap URL is provided - first check robots.txt, then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.
Fixes #496
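A minimal sketch of the p-queue-driven recursion; the SAX parsing itself is elided, and the concurrency value is the one noted above:

```ts
import PQueue from "p-queue";

// Up to 5 sitemaps are fetched/parsed at a time; sitemap indexes enqueue
// their child sitemaps back onto the same queue.
const queue = new PQueue({ concurrency: 5 });

// addUrl returns false once pageLimit is reached.
async function parseSitemap(url: string, addUrl: (u: string) => boolean) {
  const resp = await fetch(url);
  const xml = await resp.text(); // the real parser streams this through SAX

  for (const child of extractChildSitemaps(xml)) {
    void queue.add(() => parseSitemap(child, addUrl));
  }
  for (const pageUrl of extractPageUrls(xml)) {
    if (!addUrl(pageUrl)) {
      queue.clear(); // at pageLimit: stop queueing further sitemaps
      return;
    }
  }
}

// Stand-ins: the real parser emits these incrementally via SAX events,
// applying the fromDate/toDate filters as it goes.
function extractChildSitemaps(_xml: string): string[] { return []; }
function extractPageUrls(_xml: string): string[] { return []; }
```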
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #493
This PR updates the documentation for Browsertrix Crawler 1.0.0 and moves it from the project README to an MkDocs site.
Initial docs site set to https://crawler.docs.browsertrix.com/
Many thanks to @Shrinks99 for help setting this up!
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>