Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	eb374fa835	base: bump to brave 1.80.113 (#857 ) version: bump to 1.7.0-beta.0 tests: update deprecated command to work with latest minio	2025-06-30 19:55:38 -07:00
Ilya Kreymer	a5936b56aa	deps: bump brave 1.79.118 (#845 ) bump version to 1.6.2	2025-06-03 12:52:07 -07:00
Ilya Kreymer	f9bd534e4c	more dependency updates: (#827 ) - update wabac.js to 2.22.16, RWP to 2.3.7 - fidelity: fixes capture of fb and insta (via wabac.js 2.22.16) - policy: disable tg popups - bump version to 1.6.1!	2025-05-05 10:08:59 -07:00
Ilya Kreymer	fc59d04231	Deps update 1.6.1 (#826 )	2025-05-02 00:43:37 -07:00
Ilya Kreymer	66c71d03c8	deps: bump base browser image to 1.77.95 (#814 )	2025-04-03 17:25:29 -07:00
Ilya Kreymer	91f8fadc5f	deps update: update webrecorder dependencies (#810 ) - browsertrix-behaviors 0.8.1 for improved logging / new behavior functions - wabac.js 2.22.9 - RWP 2.3.4 for QA - update ReplayServer to support 'range: -x' requests used in latest RWP/wabac.js	2025-04-01 22:11:56 -07:00
Ilya Kreymer	2aec2e1a33	reset back to latest image, 1.77.52 bump version to 1.5.7	2025-02-27 16:06:43 -08:00
Ilya Kreymer	9b22df5c90	revert brave version: not ideal, but need to revert to chromium 132 until we figure out various stalling issues that still persist in chromium >=133 (#781 ) bump to 1.5.6	2025-02-27 07:05:31 -08:00
Ilya Kreymer	c25c6771a8	browser: update brave to 1.77.52 to get Chromium 134 (#773 ) should fix browser timing out on new window, fixes #766 bump to 1.5.4	2025-02-20 09:14:32 -08:00
Ilya Kreymer	846f0355f6	Improved handling of browser stuck / crashed (#763 ) - only attempt to close browser if not browser crashed - add timeout for browser.close() - ensure browser crash results in healthchecker failure - bump to 1.5.3	2025-02-10 10:16:25 -08:00
Ilya Kreymer	0ca27e4fa1	QA fix: ensure replay iframe has actually been updated after goto call! (#756 ) qa fix: check url of iframe, ensure it is not about:blank anymore test: add test to ensure expected diff deps: bump to brave 1.74.51, bump to puppeteer-core 24.2.0	2025-02-06 10:41:38 -08:00
Ilya Kreymer	b7150f1343	Autoclick Support (#729 ) Adds support for autoclick behavior: - Adds new `autoclick` behavior option to `--behaviors`, but not enabled by default - Adds support for new exposed function `__bx_addSet` which allows autoclick behavior to persist state about links that have already been clicked to avoid duplicates, only used if link has an href - Adds a new pageFinished flag on the worker state. - Adds an on('dialog') handler to reject onbeforeunload page navigations while in behavior (page not finished), but accept them when page is finished, to allow navigation away only when behaviors are done - Update to browsertrix-behaviors 0.7.0, which supports autoclick - Add --clickSelector option to customize elements that will be clicked, defaulting to `a`. - Add --linkSelector as alias for --selectLinks for consistency - Unknown options for --behaviors printed as warnings, instead of hard exit, for forward compatibility with new behavior types in the future Fixes #728, also #216, #665, #31	2025-01-16 09:38:11 -08:00
Ilya Kreymer	871490758a	Dependency Update for 1.4.2 (#737 )	2025-01-06 12:06:40 -08:00
Ilya Kreymer	6bfa7d5766	Dependency Update (#725 ) - update yarn packages - update RWP to 2.2.4 - update base image to brave 1.73.91 - fix typing issue - bump to 1.4.0-beta.1	2024-11-24 01:22:50 -08:00
Ilya Kreymer	c8e2e43d4d	Dependency Update (#718 ) - bump browsertrix-behaviors to 0.6.5 - bump browsertrix-base-image to 1.71.123 - bump puppeteer-core to 23.7.1	2024-11-10 19:34:38 -08:00
Tessa Walsh	2a9b152531	Support loading custom behaviors from URLs and/or filepaths (#707 ) Fixes #368 The `--customBehaviors` flag is now an array, making it repeatable. This should be backwards compatible with the CLI flag, but may require changes to YAML configs when custom behaviors are used. Custom behaviors can be loaded from URLs, local filepaths, and paths to local directories, including any combination thereof. New tests are added to ensure loading behaviors from URLs as well as a mixed combination of URL and filepath works as expected. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-11-04 20:30:53 -08:00
Ilya Kreymer	c38b69e74b	bump browser to 1.69.162 (#681 )	2024-09-05 20:21:43 -07:00
Ilya Kreymer	85a07aff18	Streaming in-place WACZ creation + CDXJ indexing (#673 ) Fixes #674 This PR supersedes #505, and instead of using js-wacz for optimized WACZ creation: - generates an 'in-place' or 'streaming' WACZ in the crawler, without having to copy the data again. - WACZ contents are streamed to remote upload (or to disk) from existing files on disk - CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX) - All data in the WARCs is written and read only once - Should result in significant speed / disk usage improvements: previously WARC was written once, then read again (for CDXJ indexing), read again (for adding to new WACZ ZIP), written to disk (into new WACZ ZIP), read again (if upload to remote endpoint). Now, WARCs are written once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all data is read once to either generate WACZ on disk or upload to remote. --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-08-29 13:21:20 -07:00
benoit74	4cc67a3267	Update Brave image + isolated Python venv for dependencies installation (#591 ) - Debian distro now requires the use of virtual environments to not mess with dependencies installed by official apt packages - removes tldextract update now that pywb is not in use anymore - bump brave version to 1.68.141, for use with base image added in https://github.com/webrecorder/browsertrix-browser-base/pull/20 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-08-14 21:12:00 -07:00
Ilya Kreymer	8d7fb1e084	1.2.8 updates: (#668 ) - rewriting: update wabac.js, use getCustomRewriter(), don't truncate POST request bodies for URLs that use a custom rewriter - browser: disable --enable-automation, setting webdriver = true, so no need for override - deps: update puppeteer-core, necessary changes for latest puppeteer	2024-08-13 23:38:55 -07:00
Ilya Kreymer	bb34c5ef47	version: bump to 1.2.7 deps: bump RWP in Dockerfile to 2.1.3	2024-08-09 13:23:16 -07:00
Ilya Kreymer	88a2fbd0a0	Fix 206 response + general video handling (#646 ) Refactors handling of 206 responses: - If a 206 response is encountered, and it's actually the full range, convert to 200 and rewrite range and content-range headers to x-range and x-orig-range. This is to support rewriting of 206 responses for DASH manifests - If a partial 206 response starts with `0-`, do a full async fetch separately. - If a partial 206 response does not start with 0-, just ignore it (very likely a duplicate picked up when handling the 0- response) - Don't stream content-types that can be rewritten, since streaming prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no content-length and don't get properly rewritten. - Overall, adds missing rewriting of DASH/HLS manifests that have no content-length and are served as 206. - Update to latest wabac.js which fixes rewriting of DASH manifest to avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192 - Fixes #645	2024-07-17 13:24:25 -07:00
Ilya Kreymer	1a48b37478	bump replayweb.page to 2.1.1 (#640 )	2024-07-11 16:22:37 -07:00
Ilya Kreymer	302b119908	Dependency Update / 1.2.2 (#633 ) Dependency Updates: - Bump Brave to 1.67.123 - Update puppeteer-core to latest, fixes possible crash when loading current browser with old profiles - Tests: simplifies extra hops test to avoid complex pages that could lead to timeout	2024-07-03 12:55:14 -07:00
Ilya Kreymer	2ab58c0ea3	Remove DISPLAY env var from image (#625 ) To avoid a strange chromium bug: https://issues.chromium.org/issues/40209037 which causes WebGL to fail in headless mode if DISPLAY is set. Instead, just set DISPLAY directly for Xvfb, x11vnc and pass in `--display=` to browser if running in headful mode.	2024-06-25 13:53:43 -07:00
Ilya Kreymer	8af8b3c19a	1.2.0 release - deps: bump wabac.js to 2.19.1, RWP for QA to 2.1.0 (#624 )	2024-06-21 16:34:06 -07:00
Ilya Kreymer	ea114c6083	bump brave to 1.67.119 (#620 )	2024-06-20 20:10:46 -07:00
Ilya Kreymer	3c26996f93	add yarn.lock to Docker to ensure consistent builds! (#621 )	2024-06-20 18:54:05 -07:00
Ilya Kreymer	ff481855d5	add EXPOSE for ports used inside container (#612 ) documents fixed internal ports used in browsertrix, via EXPOSE cmd, addresses #558	2024-06-14 15:19:35 -07:00
Ilya Kreymer	f504effa51	Merge branch 'main' into release/1.1.4 bump to 1.2.0-beta.1	2024-06-13 19:28:25 -07:00
Ilya Kreymer	53d437570e	dependency: update RWP to 2.0.1 (#610 ) for QA, use ReplayWeb.page 2.0.1 by default	2024-06-13 18:43:58 -07:00
Ilya Kreymer	e2b4cc1844	proxy: support setting proxy via --proxyServer, PROXY_SERVER env var or PROXY_HOST + PROXY_PORT env vars (#589 ) fixes #587 The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they were hardcoded to obsolete values in the Dockerfile. Proxy settings can now be set, in order of precedence via: - --proxyServer cli flag - PROXY_SERVER env var - PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server only (for backwards compatibility with 0.12.x) The --proxyServer / PROXY_SERVER settings are passed to the browser via the --proxy-server flag. AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying. Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth (supported in Brave, but not Chrome!) --------- Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2024-06-10 13:11:00 -07:00
Ilya Kreymer	1bd94d93a1	cleanup dockerfile + fix test (#595 ) - remove obsolete line from Dockerfile - fix pdf test to webrecorder-hosted pdf	2024-06-06 12:14:44 -07:00
Vinzenz Sinapius	068ee79288	Add group policies, limit browser access to container filesystem (#579 ) Add some default policy settings to disable unneeded Brave features. Helps a bit with #463, but Brave unfortunately doesn't provide all mentioned settings as policy options. Most important changes are in `config/policies/lockdown-profilebrowser.json` it limits access to the container filesystem especially during interactive profile browser creation.	2024-06-05 12:46:49 -07:00
Ilya Kreymer	757e838832	base image version bump to brave 1.66.115 (#592 )	2024-06-04 13:35:13 -07:00
Ilya Kreymer	e15f0c95d9	Adblock support (#534 ) Now that RWP 2.0.0 with adblock support has been released (webrecorder/replayweb.page#307), this enables adblock on the QA mode RWP embed, to get more accurate screenshots. Fetches the adblock.gz directly from RWP (though could also fetch it separately from Easylist) Updates to 1.1.0-beta.5	2024-04-12 09:47:32 -07:00
Ilya Kreymer	db613aa4ff	Revert "Make /app world-readable to better support non-root usage" (#529 ) Reverts webrecorder/browsertrix-crawler#523 The chmod operation is a bit slow, and in testing don't think the CI is related to chmod :/	2024-04-03 19:48:37 -07:00
Vinzenz Sinapius	23fda685d9	Make /app world-readable to better support non-root usage (#523 ) Possible fix for failing tests with non-root deployment.	2024-04-03 15:22:12 -07:00
Ilya Kreymer	bb9c82493b	QA Crawl Support (Beta) (#469 ) Initial (beta) support for QA/replay crawling! - Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page - Runs local http server with full-page, ui-less ReplayWeb.page embed - ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint. - Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd - Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified. - Supports `--qaDebugImageDiff` for outputting crawl / replay/ diff images. - If using --writePagesToRedis, a `comparison` key is added to existing page data where: ``` comparison: { screenshotMatch?: number; textMatch?: number; resourceCounts: { crawlGood?: number; crawlBad?: number; replayGood?: number; replayBad?: number; }; }; ``` - bump version to 1.1.0-beta.2	2024-03-22 17:32:42 -07:00
Ilya Kreymer	1fe810b1df	Improved support for running as non-root (#503 ) This PR provides improved support for running crawler as non-root, matching the user to the uid/gid of the crawl volume. This fixes #502 initial regression from 0.12.4, where `chmod u+x` was used instead of `chmod a+x` on the node binary files. However, that was not enough to fully support equivalent signal handling / graceful shutdown as when running with the same user. To make the running as different user path work the same way: - need to switch to `gosu` instead of `su` (added in Brave 1.64.109 image) - run all child processes as detached (redis-server, socat, wacz, etc..) to avoid them automatically being killed via SIGINT/SIGTERM - running detached is controlled via `DETACHED_CHILD_PROC=1` env variable, set to 1 by default in the Dockerfile (to allow for overrides just in case) A test has been added which runs one of the tests with a non-root `test-crawls` directory to test the different user path. The test (saved-state.test.js) includes sending interrupt signals and graceful shutdown and allows testing of those features for a non-root gosu execution. Also bumping crawler version to 1.0.1	2024-03-21 08:16:59 -07:00
Ilya Kreymer	e8f2073a7e	Update Browser Image (#466 ) - Update to Brave browser (1.62.165) - Update page resource test to reflect latest Brave behavior	2024-02-17 22:40:12 -08:00
Ilya Kreymer	af1e0860e4	TypeScript Conversion (#425 ) Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net> Co-authored-by: emma <hi@emma.cafe>	2023-11-09 11:27:11 -08:00
Ilya Kreymer	877d9f5b44	Use new browser-based archiving mechanism instead of pywb proxy (#424 ) Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files via the Chrome Debug Protocol (CDP). Allows for more flexibility and accuracy when dealing with HTTP/2.x sites and avoids a MITM proxy. Addresses #343 Changes include: - Recorder class for capturing CDP network traffic for each page. - Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc..) - WARC writing support via TS-based warcio.js library. - Generates single WARC file per worker (still need to add size rollover). - Request interception via Fetch.requestPaused - Rule-based rewriting response support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest() - Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, async loading via browser network stack with Network.loadNetworkResource() and node-based async fetch via fetch() - Direct async fetch() capture of non-HTML URLs - Waiting for all requests to finish before moving on to next page, up to page timeout. - Experimental: generate CDXJ on-the-fly as WARC is being written (not yet in use). - removed pywb, using cdxj-indexer for --generateCDX option.	2023-11-07 21:38:50 -08:00
Ilya Kreymer	064db52272	base image: bump brave to 1.59.120 version: bump to 0.12.0-beta.2	2023-10-26 19:48:49 -07:00
Ilya Kreymer	f453dbfb56	Switch to Brave Base Image (#400 ) * switch to brave: - switch base browser to brave base image 1.58.135 - tests: add extra delay for blocking tests - bump to 0.12.0-beta.0	2023-10-02 14:30:44 -07:00
Vinzenz Sinapius	7b6bb681c7	Update tldextract cache for pywb in build process (#383 )	2023-09-15 12:22:17 -04:00
Ilya Kreymer	3c9be514d3	behavior logging tweaks, add netIdle (#381 ) * behavior logging tweaks, add netIdle * fix shouldIncludeFrame() check: was actually erroring out and never accepting any iframes! now used not only for link extraction but also to run() behaviors * add logging if iframe check fails * Dockerfile: add commented out line to use local behaviors.js * bump behaviors to 0.5.2	2023-09-14 19:48:41 -07:00
Ilya Kreymer	f51154facb	Chrome 112 + new headless mode + consistent viewport tweaks (#316 ) * base: update to chrome 112 headless: switch to using new headless mode available in 112 which is more in sync with headful mode viewport: use fixed viewport matching screen dimensions for headless and headful mode (if GEOMETRY is set) profiles: fix catching new window message, reopening page in current window versions: bump to pywb 2.7.4, update puppeteer-core to (20.2.1) bump to 0.10.0-beta.4 * profile: force reopen in current window only for headless mode (currently breaks otherwise), remove logging messages	2023-05-22 16:24:39 -07:00
Ilya Kreymer	d4233582bb	ci: bump yarn install timeout for ci, use latest gh action	2023-04-03 12:18:42 -07:00
Ilya Kreymer	10e61d4c85	Bump to Chrome 109, Beta 0.8.0-beta.1 Release (#215 ) - bump to chrome-109 image - bump uwsgi to fix intermittent build errors -remove installs moved to base image bump to 0.8.0-beta.1	2023-01-30 19:00:33 -08:00
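
The proxy precedence described in commit e2b4cc1844 (#589) can be sketched roughly as below. This is a minimal illustration, not the crawler's actual code: the function name `resolveProxyServer`, the `Env` type, and the choice to return `undefined` when nothing is configured are all assumptions made for the example. Only the order (`--proxyServer` flag, then `PROXY_SERVER`, then `PROXY_HOST` + `PROXY_PORT` as an HTTP proxy) comes from the commit message.

```typescript
// Hypothetical helper illustrating the precedence from commit e2b4cc1844.
// Names and return conventions are assumptions, not the crawler's real API.
type Env = Record<string, string | undefined>;

function resolveProxyServer(
  cliProxyServer: string | undefined,
  env: Env,
): string | undefined {
  // 1. The --proxyServer CLI flag takes precedence over everything else.
  if (cliProxyServer) {
    return cliProxyServer;
  }
  // 2. Otherwise fall back to the PROXY_SERVER env var.
  if (env.PROXY_SERVER) {
    return env.PROXY_SERVER;
  }
  // 3. Finally, PROXY_HOST + PROXY_PORT describe an HTTP proxy only
  //    (kept for backwards compatibility with 0.12.x).
  if (env.PROXY_HOST && env.PROXY_PORT) {
    return `http://${env.PROXY_HOST}:${env.PROXY_PORT}`;
  }
  // Assumption: no proxy configured is represented as undefined.
  return undefined;
}
```

For example, `resolveProxyServer(undefined, { PROXY_HOST: "localhost", PROXY_PORT: "8080" })` yields `http://localhost:8080`, while passing a CLI value returns that value regardless of the environment.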


78 commits