Fix some observed errors that occur when saving a profile:
- use browser.cookies instead of page.cookies to get all cookies, not
just those from the current page
- catch and ignore exceptions when clearing the cache
- logging: log when proxy init is happening on all code paths, to aid
debugging proxy connection errors
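A minimal sketch of the first two fixes, assuming a Puppeteer-style
Browser and a CDP session (names are illustrative, not the exact
crawler code):
```
import { Browser, CDPSession } from "puppeteer-core";

// Minimal sketch; assumes a Puppeteer version that provides
// Browser.cookies(), per the commit above.
async function collectProfileState(browser: Browser, cdp: CDPSession) {
  // browser-wide cookies, not just those of the current page
  const cookies = await browser.cookies();

  // clearing the cache can fail on some pages; ignore any error
  try {
    await cdp.send("Network.clearBrowserCache");
  } catch (e) {
    // ignored
  }

  return cookies;
}
```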
- Adds support for a YAML-based config for multiple proxies, containing
a `matchHosts` section (regex patterns) and a `proxies` declaration,
allowing any number of hosts to be matched to any number of named
proxies (see the sample config below).
- Specified via the `--proxyServerConfig` option, accepted by both the
crawl and profile creation commands.
- Implemented internally by generating a proxy PAC script that does the
regex matching, and running the browser with that PAC script served by
an internal HTTP server.
- Also supports matching different undici Agents by regex, so that
direct fetching can use different proxies
- Precedence: `--proxyServerConfig` takes precedence over
`--proxyServer` / PROXY_SERVER, unless `--proxyServerPreferSingleProxy`
is also provided
- Updated the proxies docs section with an example
- Updated tests with sample proxy configs containing both good and bad
auth examples
Fixes #836
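A hypothetical sketch of such a config file (host patterns, proxy
names, and proxy URLs are all illustrative; the proxies docs have the
exact schema):
```
# Hypothetical sketch of a --proxyServerConfig YAML file;
# host patterns, proxy names, and proxy URLs are illustrative.
matchHosts:
  # regex pattern -> named proxy
  'example\.com': proxy-us
  '.*\.example\.org': proxy-eu

proxies:
  proxy-us: http://user:pass@us-proxy.example.net:8080
  proxy-eu: socks5://eu-proxy.example.net:1080
```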
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Quick follow-up to #584 to make sure the ExitCodes enum is used
everywhere in profile editing mode:
- the profile browser exits with ExitCodes.SignalInterrupted in
response to a signal
- ExitCodes.Success or ExitCodes.GenericError is used for other exit
paths
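A minimal sketch of the signal handling (member names follow the
commit above; numeric values are illustrative):
```
// Minimal sketch; numeric values are illustrative, not the real enum.
enum ExitCodes {
  Success = 0,
  GenericError = 1,
  SignalInterrupted = 130,
}

for (const signal of ["SIGINT", "SIGTERM"] as const) {
  process.on(signal, () => process.exit(ExitCodes.SignalInterrupted));
}
```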
Adds support for autoclick behavior:
- Adds a new `autoclick` behavior option to `--behaviors`, not enabled
by default
- Adds support for a new exposed function, `__bx_addSet`, which allows
the autoclick behavior to persist state about links that have already
been clicked to avoid duplicates; only used if the link has an href
- Adds a new `pageFinished` flag on the worker state.
- Adds an on('dialog') handler to reject onbeforeunload page
navigations while behaviors are running (page not finished), but
accept them once the page is finished, allowing navigation away only
when behaviors are done (see the sketch after this list)
- Updates to browsertrix-behaviors 0.7.0, which supports autoclick
- Adds a `--clickSelector` option to customize which elements will be
clicked, defaulting to `a`.
- Adds `--linkSelector` as an alias for `--selectLinks` for consistency
- Unknown options for `--behaviors` are printed as warnings instead of
causing a hard exit, for forward compatibility with new behavior types
in the future
Fixes #728, also #216, #665, #31
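A minimal sketch of the dialog handling, assuming a Puppeteer Page and
a callback exposing the pageFinished flag (names are illustrative):
```
import { Dialog, Page } from "puppeteer-core";

// Minimal sketch, not the exact crawler code.
function addDialogHandler(page: Page, isPageFinished: () => boolean) {
  page.on("dialog", async (dialog: Dialog) => {
    if (dialog.type() === "beforeunload") {
      if (isPageFinished()) {
        await dialog.accept(); // behaviors done: allow navigation away
      } else {
        await dialog.dismiss(); // behaviors running: block navigation
      }
    }
  });
}
```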
- Refactors args parsing so that `Crawler.params` is properly typed
with the CLI options + additions, via the `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
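A minimal sketch of the typed parsing (option definitions are
illustrative; the cast reflects that validation still happens without
typing):
```
import yargs from "yargs";

// Minimal sketch; option names and types are illustrative.
type CrawlerArgs = {
  url: string[];
  behaviors: string[];
  // ...remaining CLI options plus post-parse additions
};

const params = yargs(process.argv.slice(2))
  .option("url", { type: "string", array: true, default: [] })
  .option("behaviors", { type: "string", array: true, default: [] })
  .parseSync() as unknown as CrawlerArgs;
```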
- Adds support for running a SOCKS5 proxy over an SSH connection. This
can be configured by using `--proxyServer ssh://user@host[:port]` and
also passing an `--sshProxyPrivateKeyFile <private key file>` param
and an optional `--sshProxyKnownHostsFile <public host key file>`
param. The key files are expected to be mounted as volumes into the
crawler.
- Same arguments are also available for create-login-profile
- The proxy config uses autossh to establish a more robust connection, and
also waits until a connection can be established before proceeding.
- Docs are updated to include a new 'Crawling with Proxies' page in the user guide
- Tests are updated to include crawling through an SSH proxy running locally.
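For example, an invocation might look like this (host, port, and key
paths are illustrative):
```
docker run -v $PWD/keys:/keys webrecorder/browsertrix-crawler crawl \
  --url https://example.com/ \
  --proxyServer ssh://proxy-user@ssh-host.example.com:2222 \
  --sshProxyPrivateKeyFile /keys/id_ed25519 \
  --sshProxyKnownHostsFile /keys/known_hosts
```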
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Fixes #637
- The username field will match if its name attribute is one of: user,
username, email
- The password field will match if its type is password and its name
attribute is one of: pass, password
This loosens the rules sufficiently to solve the issue with the URL in
the linked issue without requiring users to pass custom CSS selectors at
this point.
It looks like we were also using XPath methods like contains() whereas
Puppeteer expects CSS selectors, hence the syntax change.
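A minimal sketch of the loosened matching as CSS selectors (the exact
selectors in the code may differ):
```
import { Page } from "puppeteer-core";

// Minimal sketch; selector details may differ from the actual code.
async function findLoginFields(page: Page) {
  // username: name attribute is one of user, username, email
  const username = await page.waitForSelector(
    'input[name="user"], input[name="username"], input[name="email"]',
  );
  // password: type is password, name is one of pass, password
  const password = await page.waitForSelector(
    'input[type="password"][name="pass"], input[type="password"][name="password"]',
  );
  return { username, password };
}
```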
Avoids a strange Chromium bug
(https://issues.chromium.org/issues/40209037) which causes WebGL to
fail in headless mode if DISPLAY is set. Instead, just set DISPLAY
directly for Xvfb and x11vnc, and pass `--display=` to the browser
when running in headful mode.
Fixes #587
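A minimal sketch of the idea (the display number is illustrative):
```
// Minimal sketch: only point the browser at the X display in headful
// mode, so DISPLAY never leaks into headless runs (crbug 40209037).
function displayArgs(headless: boolean): string[] {
  return headless ? [] : [`--display=${process.env.DISPLAY ?? ":99"}`];
}
```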
The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.
Proxy settings can now be set, in order of precedence, via:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)
The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)
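A minimal sketch of that precedence (function and variable names are
illustrative):
```
// Minimal sketch of the precedence order described above.
function resolveProxyServer(cliProxyServer?: string): string | undefined {
  const { PROXY_SERVER, PROXY_HOST, PROXY_PORT } = process.env;
  if (cliProxyServer) {
    return cliProxyServer; // 1. --proxyServer cli flag
  }
  if (PROXY_SERVER) {
    return PROXY_SERVER; // 2. PROXY_SERVER env var
  }
  if (PROXY_HOST && PROXY_PORT) {
    return `http://${PROXY_HOST}:${PROXY_PORT}`; // 3. legacy 0.12.x vars
  }
  return undefined;
}
```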
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
In particular, an API call to /navigate starts a navigation but
doesn't wait for the page load to finish, since the user can choose to
close the profile browser at any time. This ensures that user
operations don't cause the browser to crash if page.goto() is
interrupted/fails (browser closed, profile saved, etc.) while a page
is still loading.
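A minimal sketch of the fire-and-forget navigation (names are
illustrative):
```
import { Page } from "puppeteer-core";

// Minimal sketch: start the load but don't await it, and swallow
// failures caused by the browser closing or the profile being saved.
function startNavigation(page: Page, url: string) {
  page.goto(url).catch((e) => {
    console.warn("navigation interrupted", e);
  });
}
```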
bump to 1.1.1
Fixes #513
If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.
Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.
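A minimal sketch of the resolution logic:
```
import path from "node:path";

// Minimal sketch: non-absolute --filename values resolve under
// /crawls/profiles.
function resolveProfileFilename(filename: string): string {
  return path.isAbsolute(filename)
    ? filename
    : path.resolve("/crawls/profiles", filename);
}
```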
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- filter out 'other' / no-URL targets from Puppeteer attachment
- disable use of '--disable-site-isolation-trials' for profiles
- workaround for #446 with profiles
- also fixes `pageExtraDelay` not working for non-200 responses - may
be useful for detecting captcha-blocked pages.
- connect VNC right away instead of waiting for the page to fully
finish loading, hopefully resulting in faster profile start-up time.
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZs
(multi-WACZ) input, hosted in ReplayWeb.page
- Runs a local HTTP server with a full-page, UI-less ReplayWeb.page
embed
- The ReplayWeb.page release version is configured in the Dockerfile;
pinned ui.js and sw.js are fetched directly from cdnjs
Can be deployed with the `webrecorder/browsertrix-crawler qa`
entrypoint.
- Requires `--qaSource`, pointing to the WACZ or multi-WACZ JSON that
will be replayed/QA'd
- Also supports `--qaRedisKey`, where QA comparison data will be
pushed if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay / diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
comparison: {
screenshotMatch?: number;
textMatch?: number;
resourceCounts: {
crawlGood?: number;
crawlBad?: number;
replayGood?: number;
replayBad?: number;
};
};
```
- bump version to 1.1.0-beta.2
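A hypothetical invocation (the WACZ URL is illustrative):
```
docker run webrecorder/browsertrix-crawler qa \
  --qaSource https://example.com/crawl.wacz \
  --qaDebugImageDiff
```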
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates the ignore files to exclude crawls, test-crawls, scratch,
and dist as needed.
Follows #424. Converts the upcoming 1.0.0 branch, based on native
browser-based traffic capture and recording, to TypeScript. Fixes #426
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>