Commit graph

40 commits

Ilya Kreymer
a42c0b926e
Support host-specific proxies with proxy config YAML (#837)
- Adds support for a YAML-based config for multiple proxies, containing a
'matchHosts' section (regex patterns) and a 'proxies' declaration, allowing
any number of hosts to be matched to any number of named proxies.
- Specified via the --proxyServerConfig option, passed to both the crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does the
regex matching, and running the browser with that PAC script served by an
internal HTTP server.
- Also supports matching different undici Agents by regex, for using
different proxies with direct fetching.
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided.
- Updated proxies doc section with an example (see the sketch below).
- Updated tests with sample bad and good auth examples of the proxy config.
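A minimal, hypothetical sketch of what such a proxy config YAML might look like (host patterns, proxy names, and proxy URLs are placeholders; see the proxies doc section for the authoritative schema):

```yaml
# Hypothetical config passed via --proxyServerConfig (illustrative only)
matchHosts:
  # regex pattern -> name of a proxy declared below
  "example\\.org": proxy-us
  "(.*\\.)?example\\.com": proxy-eu

proxies:
  # named proxies; the URLs here are placeholders
  proxy-us: "socks5://user:pass@us-proxy.internal:1080"
  proxy-eu: "http://eu-proxy.internal:3128"
```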

Fixes #836

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-08-20 16:07:29 -07:00
Tessa Walsh
66402c2e53
Add documentation for --failOnContentCheck and update CLI options in docs (#869)
Related to #860 

This will give us something we can link to from Browsertrix/the
Browsertrix User Guide for up-to-date information on this option.
2025-07-23 12:54:12 -07:00
Tessa Walsh
acae5155f5
Fix docs mistaking --waitUntil for --pageLoadTimeout (#864)
Fixes https://github.com/webrecorder/browsertrix-crawler/issues/853

Corrects a documentation inaccuracy pointed out by a user
2025-07-21 12:52:58 -07:00
Ilya Kreymer
d2a6aa9805
version: bump to 1.6.3 (#851)
cli: regen cli docs to update from #850
2025-06-16 15:55:05 -04:00
Rijnder Wever
fa26f05f66
cleanup: remove dead pywb code from argparser and docs (#847)
The value of `--dedupPolicy` was once passed to pywb (see
https://pywb.readthedocs.io/en/latest/manual/configuring.html#dedup-options-for-recording).
Now that pywb has been dropped, there is no need to keep this option
around.

In fact, I know multiple users who have been confused by the mention of
this option in the docs (myself included).

(for historical context, see
https://github.com/webrecorder/browsertrix-crawler/pull/332)
2025-06-16 12:36:32 -04:00
Ilya Kreymer
1cb1b2edb9
Update Behaviors Docs (#820)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-04-10 03:58:07 -04:00
Ilya Kreymer
e585b6d194
Better default crawlId (#806)
- set crawl id from the collection, not the other way around, to ensure a
unique redis keyspace for different collections
- by default, set crawl id to a unique value based on host and collection,
e.g. '@hostname-@id'
- don't include '@id' in collection interpolation; it can only use
hostname or timestamp
- fixes issue mentioned / workaround provided in #784
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
2025-04-01 13:40:03 -07:00
benoit74
02c4353b4a
Add clarification in usage about hostname used (#771)
clarify that the crawlId defaults to the Docker container hostname
2025-03-30 21:16:58 -07:00
Henry Wilkinson
34a1e3d6c0
docs: Update header font (#785)
Updated alongside https://github.com/webrecorder/replayweb.page/pull/405

Long overdue match to Browsertrix docs styling

### Screenshots

Screenshot of the updated docs header font (image omitted).
2025-03-05 14:21:00 -08:00
benoit74
4b72b7c7dc
Add documentation on exit codes (#765)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-11 12:16:29 -05:00
benoit74
fc56c2cf76
Add more exit codes to detect interruption reason (#764)
Fix #584

- Replace interrupted with interruptReason
- Distinct exit codes for different interrupt reasons: SizeLimit (14),
TimeLimit (15), FailedLimit (12), and DiskUtilization (16) are used when an
interrupt happens for these reasons, in addition to the existing reasons
BrowserCrashed (10), SignalInterrupted (11), and SignalInterruptedForce (13)
(see the example below)
- Doc fix to cli args
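As a rough illustration of how these codes might be consumed in a wrapper script (the docker invocation and --sizeLimit value are placeholders; only the exit codes above come from this change):

```sh
# Illustrative wrapper: inspect the crawler's exit code after a run
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --sizeLimit 100000000
rc=$?
case $rc in
  0)  echo "crawl completed normally" ;;
  12) echo "interrupted: failed-page limit reached" ;;
  14) echo "interrupted: size limit reached" ;;
  15) echo "interrupted: time limit reached" ;;
  16) echo "interrupted: disk utilization limit reached" ;;
  *)  echo "other exit code: $rc" ;;
esac
```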

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-10 14:00:55 -08:00
Ilya Kreymer
00835fc4f2
Retry same queue (#757)
- follow up to #743
- page retries are simply added back to the same queue with the `retry`
param incremented and a higher score, after extraHops, to ensure retries
are added at the end.
- score calculation is: `score = depth + (extraHops * MAX_DEPTH) +
(retry * MAX_DEPTH * 2)`; this ensures that retries have lower priority
than extraHops, and additional retries even lower priority (higher
score).
- a warning is logged when a retry happens, an error only when all retries
are exhausted.
- back to one failure list; urls are added there only when all retries are
exhausted.
- rename --numRetries -> --maxRetries / --retries for clarity (see the
example below)
- state load: allow retrying previously failed URLs if --maxRetries is
higher than on the previous run.
- ensure this works with --failOnFailedStatus; if provided, invalid status
codes (>= 400) are retried along with page load failures
- fixes #132
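An illustrative invocation using the renamed flag (the image name, URL, and retry count are placeholders):

```sh
# Retry each failed page up to 3 times before adding it to the failure list (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --maxRetries 3 --failOnFailedStatus
```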

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-06 18:48:40 -08:00
Ilya Kreymer
2e46140c3f
Make numRetries configurable (#754)
Add --numRetries param, default to 1 instead of 5.
2025-02-05 23:34:55 -08:00
Ilya Kreymer
b7150f1343
Autoclick Support (#729)
Adds support for autoclick behavior:
- Adds a new `autoclick` behavior option to `--behaviors`, not enabled by
default
- Adds support for a new exposed function `__bx_addSet` which allows the
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates; only used if a link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler to reject onbeforeunload page navigations
while in a behavior (page not finished), but accept them when the page is
finished, to allow navigating away only once behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize the elements that will be clicked,
defaulting to `a` (see the example below).
- Add --linkSelector as an alias for --selectLinks for consistency
- Unknown options for --behaviors are printed as warnings, instead of a hard
exit, for forward compatibility with new behavior types in the future
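An illustrative invocation enabling autoclick alongside the default behaviors (the behavior list and selector values are assumptions for illustration; only `autoclick` and `--clickSelector` come from this change):

```sh
# Enable autoclick in addition to the default behaviors, clicking buttons as well as links (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --behaviors autoclick,autoscroll,autoplay,autofetch,siteSpecific \
  --clickSelector "a,button"
```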

Fixes #728, also #216, #665, #31
2025-01-16 09:38:11 -08:00
Tessa Walsh
60c84b342e
Support loading custom behaviors from git repo (#717)
Fixes #712 
- Also expands the existing documentation about behaviors and adds a test.
- Uses the 'branch' and 'path' query args to specify the git branch and a
subpath within the repo, respectively (see the example below).
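An illustrative invocation (the repo URL, branch, and path are placeholders, and the exact URL form, including any `git+` prefix, is an assumption; see the behaviors docs for the authoritative syntax):

```sh
# Load custom behaviors from a subdirectory of a git repo branch (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --customBehaviors "git+https://github.com/example/custom-behaviors.git?branch=main&path=behaviors"
```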

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2024-11-13 22:50:33 -08:00
Ilya Kreymer
d04509639a
Support custom css selectors for extracting links (#689)
Support an array of selectors via the --selectLinks option, in the
form [css selector]->[property] or [css selector]->@[attribute].
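An illustrative invocation using that syntax (the second selector and attribute name are placeholders; `a[href]->href` is assumed here as the standard link extraction rule):

```sh
# Extract links from standard anchors and from a custom attribute on card divs (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --selectLinks "a[href]->href" --selectLinks "div.card->@data-link-url"
```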
2024-11-08 11:04:41 -05:00
Tessa Walsh
2a9b152531
Support loading custom behaviors from URLs and/or filepaths (#707)
Fixes #368 

The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.

Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.

New tests are added to ensure loading behaviors from URLs as well as a
mixed combination of URL and filepath works as expected.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-11-04 20:30:53 -08:00
Tessa Walsh
e05d50d637
Add documentation for crawl collections (#695)
Fixes #675
2024-10-05 11:51:32 -07:00
Ilya Kreymer
9c9643c24f
crawler args typing (#680)
- Refactors args parsing so that `Crawler.params` is properly typed with
CLI options + additions via the `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
2024-09-05 18:10:27 -07:00
Ilya Kreymer
8934feaf70
SOCKS5 over SSH Tunnel Support (#671)
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using the `--proxyServer ssh://user@host[:port]` config,
passing an `--sshProxyPrivateKeyFile <private key file>` param, and
optionally an `--sshProxyKnownHostsFile <public host key file>` param. The
key files are expected to be mounted as volumes into the crawler.

- The same arguments are also available for create-login-profile

- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.

- Docs are updated to include a new 'Crawling with Proxies' page in the user
guide (an illustrative invocation follows below)

- Tests are updated to include crawling through an SSH proxy running locally.
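A rough sketch of such an invocation (the host, port, and key paths are placeholders):

```sh
# Crawl through a SOCKS5-over-SSH proxy, with key files mounted into the container (illustrative)
docker run -v $PWD/crawls:/crawls/ -v $PWD/keys:/keys/ \
  webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --proxyServer ssh://proxyuser@proxy.example.net:2222 \
  --sshProxyPrivateKeyFile /keys/id_ed25519 \
  --sshProxyKnownHostsFile /keys/known_hosts
```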
---------

Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
2024-08-28 18:47:24 -07:00
Tessa Walsh
39c8f48bb2
Disable behaviors entirely if --behaviors array is empty (#672)
Fixes #651
2024-08-27 13:20:19 -07:00
Henry Wilkinson
4c1da90d8f
Adds warning about crawling with basic auth (#669)
Closes https://github.com/webrecorder/browsertrix/issues/1950 over here
too

### Changes
- Adds a warning about using basic auth
- Adds a link to MDN because learning and cross referencing is fun!
2024-08-14 21:14:31 -07:00
Ilya Kreymer
48716c172d docs: regenerate cli options with ./docs/gen-cli.sh 2024-07-19 18:53:50 -07:00
benoit74
1099f4f3c8
Make it clear that profile argument can be an HTTP(S) URL (#649)
Small documentation enhancement to make it clear that a browser profile
can be passed as an HTTP(S) URL as well.
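An illustrative invocation (the profile URL is a placeholder; `--profile` also accepts a local path):

```sh
# Load a browser profile over HTTPS instead of from a mounted file (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --profile "https://storage.example.net/profiles/login-profile.tar.gz"
```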
2024-07-19 18:53:28 -07:00
Ilya Kreymer
3339374092
http auth support per seed (supersedes #566): (#616)
- parse URL username/password, store in the 'auth' field of the seed, or pass the 'auth' field directly (from the yaml config)
- add an 'Authorization' header with base64-encoded basic auth via setExtraHTTPHeaders()
- tests: add test for crawling with auth using http-server against a local docs build (docs are now built as part of CI)
- docs: add HTTP Auth to the YAML config section (see the sketch below)
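A minimal sketch of the two ways to supply credentials, with placeholder values (the exact 'auth' field format is an assumption; see the HTTP Auth docs section):

```yaml
# Option 1: credentials embedded directly in the seed URL (illustrative)
seeds:
  - https://user:password@example.com/members/
---
# Option 2: explicit 'auth' field on a seed (illustrative; field format assumed)
seeds:
  - url: https://example.com/members/
    auth: "user:password"
```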

---------
Co-authored-by: Ed Summers <ehs@pobox.com>
2024-06-20 16:35:30 -07:00
Ilya Kreymer
b83d1c58da
add --dryRun flag and mode (#594)
- if set, runs the crawl but doesn't store any archive data (WARCs,
WACZ, CDXJ); logs and pages are still written, and saved state can be
generated (per the --saveState options).
- adds a test to ensure only 'logs' and 'pages' dirs are generated with --dryRun
- screenshots and text extraction are skipped altogether in dryRun mode; a
warning is printed that storage and archiving-related options may be
ignored
- fixes #593
2024-06-07 10:34:19 -07:00
Ed Summers
2ef116d667
Mention command line options when restarting (#577)
It's probably worth reminding people that the command line options need
to be passed in again since the crawl state doesn't include them.

Refs #568
2024-05-21 10:57:50 -07:00
Ilya Kreymer
c71274d841
add STORE_REGION env var to be able to specify region (#565)
defaults to us-east-1 for minio compatibility
fixes #515
2024-05-12 12:42:04 -04:00
Ilya Kreymer
0201fef559 docs: fix typo 2024-04-18 17:19:13 -07:00
Tessa Walsh
75b617dc94
Add crawler QA docs (#551)
Fixes #550

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
2024-04-18 16:18:22 -04:00
Tessa Walsh
1325cc3868
Gracefully handle non-absolute path for create-login-profile --filename (#521)
Fixes #513 

If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.

Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.
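An illustrative invocation (the login URL is a placeholder, and interactive/port-mapping details are omitted); with a relative --filename as below, the profile would be written under /crawls/profiles:

```sh
# Create a login profile; the relative filename is resolved within /crawls/profiles (illustrative)
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler create-login-profile \
  --url https://example.com/login \
  --filename my-profile.tar.gz
```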

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-29 13:46:54 -07:00
Ilya Kreymer
2059f2b6ae
add an extra --postLoadDelay param to specify how many seconds to wait after page-load (#520)
but before running link extraction, text extraction, screenshots and
behaviors.

Useful for sites that load quickly but perform async loading / init
afterwards, fixes #519

A simple workaround for when it's tricky to detect when a page has
actually fully loaded. Useful for sites such as Instagram.
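An illustrative invocation (the URL and delay value are placeholders):

```sh
# Wait 20 extra seconds after page load before link/text extraction, screenshots, and behaviors (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://www.instagram.com/example/ --postLoadDelay 20
```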
2024-03-28 17:17:29 -07:00
Ilya Kreymer
93c3894d6f
improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully (#504)
The intent is for even non-graceful interruption (duplicate Ctrl+C) to
still result in valid WARC records, even if a page is unfinished:
- immediately exit the browser, and call closeWorkers()
- finalize() the recorder, finishing active WARC records but not fetching
anything else
- flush() the existing open writer, mark it as done, and don't write anything else
- possible fix to additional issues raised in #487

Docs: Update docs on the different interrupt options, e.g. single SIGINT/SIGTERM,
multiple SIGINT/SIGTERM (as handled here), vs SIGKILL

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-21 13:56:05 -07:00
Henry Wilkinson
5e2768ebcf
Docs homepage link fix
@tw4l Oops :\
2024-03-20 14:13:52 -04:00
Henry Wilkinson
3ec9d1b9e8
Update docs/docs/index.md
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 13:03:16 -04:00
Henry Wilkinson
0d26cf2619
Adds note about where to find Browsertrix — the cloud service 2024-03-20 12:41:29 -04:00
Henry Wilkinson
4b5ebb04f8
Fixes docs edit link 2024-03-20 12:34:29 -04:00
Ilya Kreymer
56053534c5
SAX-based sitemap parser (#497)
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the
given dates, also filtering nested sitemap lists (see the example below)
- async parsing, continuing to parse in the background after 100 URLs
- timeout for the initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark in redis if sitemaps have already been
parsed, serialize to the saved state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit.
- robots.txt `sitemap:` parsing, checking URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap url is
provided - first check robots.txt, then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap
from a specific URL.
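An illustrative invocation, assuming the date filters are exposed as --sitemapFromDate / --sitemapToDate (the flag names and dates here are assumptions for illustration):

```sh
# Auto-detect the sitemap (robots.txt, then /sitemap.xml) and only queue URLs from 2024 (illustrative)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ --sitemap \
  --sitemapFromDate 2024-01-01 --sitemapToDate 2024-12-31
```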

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-18 19:14:07 -07:00
Ilya Kreymer
8ea3bf8319 CNAME: keep CNAME in docs/docs for mkdocs 2024-03-16 15:24:54 -07:00
Tessa Walsh
e1fe028c7c
Add MkDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MkDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00