21 commits

Author SHA1 Message Date
Ilya Kreymer
4f234040ce
Profile Saving Improvements (#894)
fix some observed errors that occur when saving profile:
- use browser.cookies instead of page.cookies to get all cookies, not
just from page
- catch exception when clearing cache and ignore
- logging: log when proxy init is happening on all paths, in case of an
error in the proxy connection
2025-10-08 17:09:20 -07:00
Ilya Kreymer
a42c0b926e
Support host-specific proxies with proxy config YAML (#837)
- Adds support for YAML-based config for multiple proxies, containing
'matchHosts' section by regex and 'proxies' declaration, allowing
matching any number of hosts to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching and running browser with the specified proxy PAC script
served by an internal http server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config

Fixes #836
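
The PAC-based approach described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual implementation: the `ProxyConfig` shape, `toPacDirective`, and `generatePacScript` are hypothetical names, and the real YAML schema may differ.

```typescript
// Hypothetical config shape, assumed for illustration only.
interface ProxyConfig {
  // regex source (tested against the request host) -> proxy name
  matchHosts: Record<string, string>;
  // proxy name -> proxy URL
  proxies: Record<string, string>;
}

// Convert a proxy URL into a PAC return value such as "PROXY host:port".
// Simplified: only http:// and socks5:// proxy URLs are handled here.
function toPacDirective(proxyUrl: string): string {
  const url = new URL(proxyUrl);
  if (url.protocol === "socks5:") return `SOCKS5 ${url.host}`;
  if (url.protocol === "http:") return `PROXY ${url.host}`;
  throw new Error(`unsupported proxy scheme: ${url.protocol}`);
}

// Build the text of a PAC script that tries each regex in order and
// falls back to a direct connection when nothing matches.
function generatePacScript(config: ProxyConfig): string {
  const rules = Object.entries(config.matchHosts).map(([pattern, name]) => {
    const proxyUrl = config.proxies[name];
    if (!proxyUrl) throw new Error(`unknown proxy name: ${name}`);
    const directive = JSON.stringify(toPacDirective(proxyUrl));
    return `  if (new RegExp(${JSON.stringify(pattern)}).test(host)) return ${directive};`;
  });
  return [
    "function FindProxyForURL(url, host) {",
    ...rules,
    '  return "DIRECT";',
    "}",
  ].join("\n");
}
```

The generated text would then be served over HTTP and handed to the browser as its proxy PAC script, as the commit message describes.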

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-08-20 16:07:29 -07:00
Tessa Walsh
f16be32ba6
Make sure all exit calls use ExitCodes enum (#767)
Quick follow-up to #584 to make sure enum is used everywhere in profile editing mode:
- profile browser exits with ExitCodes.SignalInterrupted in response to signal
- use ExitCodes.Success or GenericError for other exit codes
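
A rough sketch of what an exit-code enum along these lines might look like. Only the member names and `OutOfSpace = 3` come from this log; the other numeric values and the `exitCodeFor` helper are assumptions made for illustration.

```typescript
// Member names are taken from the commit messages; apart from
// OutOfSpace = 3, the numeric values are assumed, not confirmed.
enum ExitCodes {
  Success = 0, // assumed: conventional success code
  GenericError = 1, // assumed: conventional generic failure code
  OutOfSpace = 3, // stated in the log
  SignalInterrupted = 11, // assumed value
}

// Hypothetical helper: map why the profile browser is exiting to a code.
type ExitReason = "done" | "error" | "signal";

function exitCodeFor(reason: ExitReason): ExitCodes {
  switch (reason) {
    case "signal":
      return ExitCodes.SignalInterrupted;
    case "error":
      return ExitCodes.GenericError;
    case "done":
      return ExitCodes.Success;
  }
}
```

Routing every exit through one enum keeps the numeric values in a single place, so callers never pass bare integers.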
2025-02-11 12:04:38 -08:00
Ilya Kreymer
5c9d808651
exit code cleanup (#753)
- use consistent enums for exit codes
- add disk space check on startup and add OutOfSpace exit code (3)
- preparation for #584
2025-02-06 17:54:51 -08:00
Ilya Kreymer
b7150f1343
Autoclick Support (#729)
Adds support for autoclick behavior:
- Adds new `autoclick` behavior option to `--behaviors`, but not
enabling by default
- Adds support for new exposed function `__bx_addSet` which allows
autoclick behavior to persist state about links that have already been
clicked to avoid duplicates, only used if link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler to reject onbeforeunload page navigations,
when in behavior (page not finished), but accept when page is finished -
to allow navigation away only when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors printed as warnings, instead of hard
exit, for forward compatibility for new behavior types in the future

Fixes #728, also #216, #665, #31
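
The dialog-handling rule above could look something like the following sketch. `DialogLike` mirrors the `type()`, `accept()`, and `dismiss()` methods of Puppeteer's `Dialog` class so the example stays self-contained; `WorkerStateLike` and `handleDialog` are hypothetical names, and leaving other dialog types untouched is an assumption.

```typescript
// Minimal stand-in for the parts of Puppeteer's Dialog used here.
interface DialogLike {
  type(): string;
  accept(): Promise<void>;
  dismiss(): Promise<void>;
}

// Hypothetical slice of the worker state carrying the pageFinished flag.
interface WorkerStateLike {
  pageFinished: boolean;
}

type DialogOutcome = "accepted" | "dismissed" | "ignored";

async function handleDialog(
  dialog: DialogLike,
  state: WorkerStateLike,
): Promise<DialogOutcome> {
  // Only beforeunload prompts are handled in this sketch (assumption).
  if (dialog.type() !== "beforeunload") return "ignored";

  if (state.pageFinished) {
    // Behaviors are done: allow the navigation away from the page.
    await dialog.accept();
    return "accepted";
  }

  // Behaviors still running: dismissing the prompt keeps the page open.
  await dialog.dismiss();
  return "dismissed";
}
```

With Puppeteer this would be wired up as something like `page.on("dialog", (d) => void handleDialog(d, state))`.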
2025-01-16 09:38:11 -08:00
Ilya Kreymer
b42548373d
eslint: add strict await checking: (#684)
- require await / void / catch for promises
- don't allow unnecessary await
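
To illustrate the three accepted ways of handling a promise under such checks, here is a small sketch. The commit does not name the lint rules; typescript-eslint's `no-floating-promises` and `await-thenable` are the likely candidates, but that is an assumption. The functions `save`, `syncValue`, and `main` are hypothetical.

```typescript
const log: string[] = [];

// A trivial async operation standing in for real work.
async function save(name: string): Promise<void> {
  log.push(`saved ${name}`);
}

// A plain synchronous function: awaiting its result would be unnecessary.
function syncValue(): number {
  return 42;
}

async function main(): Promise<void> {
  // 1. await: wait for completion and propagate errors to the caller.
  await save("a");

  // 2. void: explicitly mark the promise as intentionally not awaited.
  void save("b");

  // 3. catch: attach a rejection handler instead of awaiting.
  save("c").catch((e) => log.push(`failed: ${String(e)}`));

  // No await here, since syncValue() does not return a promise.
  const n = syncValue();
  log.push(`value ${n}`);
}
```

A bare `save("d");` statement with none of these three forms is what a floating-promise check would flag.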
2024-09-06 16:24:18 -07:00
Ilya Kreymer
9c9643c24f
crawler args typing (#680)
- Refactors args parsing so that `Crawler.params` is properly typed with
CLI options + additions with `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
2024-09-05 18:10:27 -07:00
Ilya Kreymer
8934feaf70
SOCKS5 over SSH Tunnel Support (#671)
- Adds support for running a SOCKS5 proxy over an SSH connection. This can
be configured by using `--proxyServer ssh://user@host[:port]` config and
also passing an `--sshProxyPrivateKeyFile <private key file>` file param
and an optional `--sshProxyKnownHostsFile <public host key file>` file
param. The key files are expected to be mounted as volumes into the
crawler.

- Same arguments are also available for create-login-profile

- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.

- Docs are updated to include a new 'Crawling with Proxies' page in the user guide

- Tests are updated to include crawling through an SSH proxy running locally.
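
As a sketch of how a `ssh://user@host[:port]` value might be taken apart before a tunnel is set up, the snippet below uses the standard `URL` class. `SshProxyTarget` and `parseSshProxy` are hypothetical names, and falling back to port 22 (the standard SSH port) when none is given is an assumption about the crawler's behavior.

```typescript
interface SshProxyTarget {
  user: string;
  host: string;
  port: number;
}

// Returns null for non-ssh proxy URLs so callers can treat them as
// ordinary HTTP/SOCKS5 proxies.
function parseSshProxy(proxyServer: string): SshProxyTarget | null {
  const url = new URL(proxyServer);
  if (url.protocol !== "ssh:") return null;
  return {
    user: url.username,
    host: url.hostname,
    // Assumed default: the standard SSH port when none is specified.
    port: url.port ? Number(url.port) : 22,
  };
}
```

For example, `parseSshProxy("ssh://user@example.com:2222")` yields the user `user`, host `example.com`, and port `2222`.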
---------

Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
2024-08-28 18:47:24 -07:00
benoit74
1099f4f3c8
Make it clear that profile argument can be an HTTP(S) URL (#649)
Small documentation enhancement to make it clear that browser profile
can be passed as HTTP(S) URL as well.
2024-07-19 18:53:28 -07:00
Tessa Walsh
fd98033268
Loosen selectors for login fields in automated profile creation (#638)
Fixes #637 

- Username will match if name attribute is one of: user, username, email
- Password will match if type is password and name attribute is one of:
pass, password

This loosens the rules sufficiently to solve the issue with the URL in
the linked issue without requiring users to pass custom CSS selectors at
this point.

It looks like we were also using XPath methods like contains whereas
puppeteer expects CSS selectors, hence the syntax change.
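
The loosened matching rules can be expressed as CSS selectors roughly like this. It is a sketch built from the attribute names listed above; the exact selector strings used by the crawler may differ, and the constant and function names are hypothetical.

```typescript
// Attribute values taken from the commit message.
const USERNAME_NAMES = ["user", "username", "email"];
const PASSWORD_NAMES = ["pass", "password"];

// Matches an input whose name attribute is one of the username names.
function usernameSelector(): string {
  return USERNAME_NAMES.map((n) => `input[name='${n}']`).join(", ");
}

// Matches a password-type input whose name is one of the password names.
function passwordSelector(): string {
  return PASSWORD_NAMES.map(
    (n) => `input[type='password'][name='${n}']`,
  ).join(", ");
}
```

Because these are plain CSS selector lists, they can be passed directly to Puppeteer methods such as `page.waitForSelector()` or `page.type()`.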
2024-07-11 15:55:06 -07:00
Ilya Kreymer
2ab58c0ea3
Remove DISPLAY env var from image (#625)
To avoid a strange chromium bug:
https://issues.chromium.org/issues/40209037 which causes WebGL to fail
in headless mode if DISPLAY is set. Instead, just set DISPLAY directly
for Xvfb, x11vnc and pass in `--display=` to browser if running in
headful mode.
2024-06-25 13:53:43 -07:00
Ilya Kreymer
9847af7765
disable socat by default (#622)
- crawling: add '--debugAccessBrowser' flag to enable connecting via
9222, only run socat then
- profiles: only run socat in headless mode
2024-06-20 20:10:25 -07:00
Ilya Kreymer
e2b4cc1844
proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars (#589)
fixes #587 

The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.

Proxy settings can now be set, in order of precedence via:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)

The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)
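
The precedence order above can be sketched as a small resolver. `ProxyEnv` and `resolveProxyServer` are hypothetical names; the `http://` prefix for the `PROXY_HOST` / `PROXY_PORT` case follows the statement that those variables configure an HTTP proxy only.

```typescript
// Hypothetical view of the relevant environment variables.
interface ProxyEnv {
  PROXY_SERVER?: string;
  PROXY_HOST?: string;
  PROXY_PORT?: string;
}

function resolveProxyServer(
  cliProxyServer: string | undefined,
  env: ProxyEnv,
): string | undefined {
  // 1. The --proxyServer CLI flag wins.
  if (cliProxyServer) return cliProxyServer;

  // 2. Then the PROXY_SERVER env var.
  if (env.PROXY_SERVER) return env.PROXY_SERVER;

  // 3. Finally the legacy PROXY_HOST + PROXY_PORT pair, HTTP only.
  if (env.PROXY_HOST && env.PROXY_PORT) {
    return `http://${env.PROXY_HOST}:${env.PROXY_PORT}`;
  }

  // No proxy configured.
  return undefined;
}
```

In a Node process the second argument would typically be `process.env`.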

---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-06-10 13:11:00 -07:00
Ilya Kreymer
22b2136eb9
profiles: ensure initial page.load() is awaited (#561)
refactor to create a startLoad() method and await it, follow-up to #559
2024-05-02 17:55:22 +02:00
Ilya Kreymer
a61206fd73
profiles: ensure all page.goto() promises have at least catch block or are awaited (#559)
In particular, an API call to /navigate starts, but doesn't wait for a
page load to finish, since user can choose to close the profile browser
at any time. This ensures that user operations don't cause the browser to crash if
page.goto() is interrupted/fails (browser closed, profile is saved, etc...) while a page is still loading.

bump to 1.1.1
2024-04-25 09:34:57 +02:00
Tessa Walsh
1325cc3868
Gracefully handle non-absolute path for create-login-profile --filename (#521)
Fixes #513 

If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.
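
A simplified sketch of that resolution rule follows. It handles POSIX-style paths only and does no normalization; in Node, `path.isAbsolute()` and `path.resolve()` would be the more robust choice. `PROFILES_DIR` and `resolveProfileFilename` are hypothetical names.

```typescript
// Directory named in the commit message.
const PROFILES_DIR = "/crawls/profiles";

// Absolute paths are kept as-is; anything else is placed under PROFILES_DIR.
function resolveProfileFilename(filename: string): string {
  if (filename.startsWith("/")) return filename;
  return `${PROFILES_DIR}/${filename}`;
}
```

For example, `resolveProfileFilename("profile.tar.gz")` gives `/crawls/profiles/profile.tar.gz`, while `/tmp/profile.tar.gz` is returned unchanged.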

Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-29 13:46:54 -07:00
Ilya Kreymer
ea098b6daf
avoid cloudflare detection of puppeteer when using browser profiles: (#518)
- filter out 'other' / no url targets from puppeteer attachment
- disable '--disable-site-isolation-trials' for profiles
- workaround for #446 with profiles
- also fixes `pageExtraDelay` not working for non-200 responses - may be
useful for detecting captcha blocked pages.
- connect VNC right away instead of waiting for page to fully finish
loading, hopefully resulting in faster profile start-up time.
2024-03-28 10:21:31 -07:00
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to the WACZ or multi-WACZ JSON that will be replayed and QA'd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl, replay, and diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
  ```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
Ilya Kreymer
5060e6b0b1
profiles: handle terminate signals directly (#500)
- add our own signal handling to create-login-profile to ensure fast
exit in k8s
- print crawler version info string on startup
2024-03-18 17:24:48 -04:00
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Renamed from create-login-profile.js