Fixes #368
The `--customBehaviors` flag is now an array, making it repeatable. This
should be backwards compatible with the CLI flag, but may require
changes to YAML configs when custom behaviors are used.
Custom behaviors can be loaded from URLs, local filepaths, and paths to
local directories, including any combination thereof.
New tests are added to ensure that loading behaviors from URLs, as well as from a
mixed combination of URLs and filepaths, works as expected.
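A minimal sketch of how such mixed sources could be distinguished; the helper name and return shape are illustrative, not the crawler's actual API (in a YAML config, `customBehaviors` now takes a list of such entries rather than a single string):

```
import { promises as fs } from "fs";
import path from "path";

// Illustrative only: classify each --customBehaviors entry as a URL, a single
// behavior file, or a directory of behavior files.
type BehaviorSource = { kind: "url" | "file" | "dir"; value: string };

async function classifyBehaviorSources(entries: string[]): Promise<BehaviorSource[]> {
  const sources: BehaviorSource[] = [];
  for (const entry of entries) {
    if (entry.startsWith("http://") || entry.startsWith("https://")) {
      sources.push({ kind: "url", value: entry });
      continue;
    }
    const stat = await fs.stat(path.resolve(entry));
    sources.push({ kind: stat.isDirectory() ? "dir" : "file", value: entry });
  }
  return sources;
}
```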
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- rework 'should stream' logic (see the sketch after this list):
* ensure 206 responses (or any response) larger than 25MB are streamed
* responses between 5MB and 25MB are read into memory only if text/css/js, as they may be rewritten
* responses <5MB are read into memory
* responses of unknown size are streamed if a 2xx, otherwise read into memory, on the assumption that error responses may lack a content-length but are otherwise small
- likely fix for issues in #706
- if too many range requests for the same URL are being made, try
skipping/failing right away to reduce load
- assume the main browser context is used not just for service workers;
always enable it
- check for false positive 'net-aborted' errors that may actually be ok for
media, as well as documents
- improve logging
- interrupt any pending requests (that may be loading via browser
context) after page timeout, log dropped requests
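A minimal sketch of the size-based streaming decision in the first item above; the constant names, the exact rewritable-type check, and the treatment of missing content-length are illustrative:

```
// Illustrative thresholds matching the description above
const MIN_BUFFER_SIZE = 5_000_000;   // 5MB
const MAX_BUFFER_SIZE = 25_000_000;  // 25MB

function shouldStream(status: number, contentLength: number | null, contentType: string): boolean {
  // assumed set of content types that may need rewriting (html/css/js)
  const mayRewrite = /text\/(html|css)|javascript/.test(contentType);
  if (contentLength === null) {
    // unknown size: stream successful responses, buffer error responses (assumed small)
    return status >= 200 && status < 300;
  }
  if (contentLength > MAX_BUFFER_SIZE) {
    return true;                      // >25MB: always stream
  }
  if (contentLength >= MIN_BUFFER_SIZE) {
    return !mayRewrite;               // 5-25MB: buffer only if it may be rewritten
  }
  return false;                       // <5MB: buffer
}
```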
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- ensure WARC rollover happens only after response/request + cdx or
single record + cdx have been written (see the sketch after this list)
- ensure request payload is buffered for POST request indexing
- update to warcio 2.3.1 for POST request case-insensitive
'content-type' check
- recorder: remove unused 'tempdir', no longer used as warcio chooses a
temp file on its own
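A minimal sketch of the rollover ordering; the `WarcWriter` interface here is illustrative, not the crawler's actual recorder class:

```
// Illustrative interface: the real writer lives in the crawler's recorder
interface WarcWriter {
  writeRecord(rec: Uint8Array): Promise<void>;
  writeCDX(line: string): Promise<void>;
  currentSize(): number;
  rollover(): Promise<void>;
}

// Write a full record group (response/request pair or single record) plus its
// CDX entries before checking whether to roll over to a new WARC file.
async function writeRecordGroup(
  writer: WarcWriter,
  records: Uint8Array[],
  cdxLines: string[],
  rolloverSize: number,
): Promise<void> {
  for (const rec of records) {
    await writer.writeRecord(rec);
  }
  for (const line of cdxLines) {
    await writer.writeCDX(line);
  }
  // only now, with the group complete on disk, is rollover safe
  if (writer.currentSize() >= rolloverSize) {
    await writer.rollover();
  }
}
```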
Fixes #674
This PR supersedes #505; instead of using js-wacz for optimized WACZ
creation, it:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to the 'warc-cdx' directory, then merged using the Linux 'sort' command (see the merge sketch after this list), and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously, the WARC was written once, then read again (for CDXJ indexing),
read again (for adding to the new WACZ ZIP), written to disk (into the new WACZ
ZIP), and read again (if uploading to a remote endpoint). Now, WARCs are written
once along with the per-WARC CDXJ; only the CDXJ is reread, sorted, and merged on disk, and all
data is read once to either generate the WACZ on disk or upload it to a remote endpoint.
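A minimal sketch of the CDXJ merge step using the system `sort` command; the file paths and the function itself are illustrative:

```
import { spawn } from "child_process";
import { createWriteStream } from "fs";

// Merge per-WARC CDXJ files from the warc-cdx directory into one sorted index.
function mergeCDXJ(cdxFiles: string[], outPath: string): Promise<void> {
  return new Promise((resolve, reject) => {
    const out = createWriteStream(outPath);
    // byte-order sort for stable CDXJ ordering
    const proc = spawn("sort", cdxFiles, { env: { ...process.env, LC_ALL: "C" } });
    proc.stdout.pipe(out);
    proc.on("error", reject);
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`sort exited with code ${code}`)),
    );
  });
}
```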
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- rewriting: update wabac.js, use getCustomRewriter(), don't truncate
POST request bodies for URLs that use a custom rewriter
- browser: disable --enable-automation (which sets webdriver = true), so no
override is needed
- deps: update puppeteer-core, necessary changes for latest puppeteer
Refactors handling of 206 responses:
- If a 206 response is encountered, and it's actually the full range,
convert it to a 200 and rewrite the range and content-range headers to x-range
and x-orig-range (see the sketch below). This is to support rewriting of 206 responses for DASH
manifests.
- If a partial 206 response starts with `0-`, do a full async fetch
separately.
- If a partial 206 response does not start with `0-`, just ignore it (very
likely a duplicate picked up when handling the `0-` response)
- Don't stream content-types that can be rewritten, since streaming
prevents rewriting. Fixes rewriting on DASH/HLS manifests which have no
content-length and don't get properly rewritten.
- Overall, adds missing rewriting of DASH/HLS manifests that have no
content-length and are served as 206.
- Update to latest wabac.js which fixes rewriting of DASH manifest to
avoid duplicate '<?xml' prefix, webrecorder/wabac.js#192
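A minimal sketch of the full-range case; the exact header plumbing in the crawler differs, so treat the function and its arguments as illustrative:

```
// If a 206 actually covers the entire resource, convert it to a 200 and stash
// the original range / content-range headers under x-range / x-orig-range.
function normalizeFullRange206(
  status: number,
  requestHeaders: Record<string, string>,
  responseHeaders: Record<string, string>,
): number {
  const contentRange = responseHeaders["content-range"];
  if (status !== 206 || !contentRange) {
    return status;
  }
  // e.g. "bytes 0-1023/1024" -> the whole resource
  const m = contentRange.match(/^bytes (\d+)-(\d+)\/(\d+)$/);
  if (!m || m[1] !== "0" || Number(m[2]) + 1 !== Number(m[3])) {
    return status;
  }
  responseHeaders["x-orig-range"] = contentRange;
  delete responseHeaders["content-range"];
  if (requestHeaders["range"]) {
    requestHeaders["x-range"] = requestHeaders["range"];
    delete requestHeaders["range"];
  }
  return 200;
}
```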
- Fixes #645
- instead, exclude sitemap-discovered page URLs from being counted toward extra hops rules, e.g. if a sitemap page is not in scope, don't include it.
- if extraHops is set with sitemaps, only consider extraHops for links from pages that are in scope.
- bump version to 1.2.4
Dependency Updates:
- Bump Brave to 1.67.123
- Update puppeteer-core to latest, fixes possible crash when loading
current browser with old profiles
- Tests: simplifies extra hops test to avoid complex pages that could
lead to timeout
- update to wabac.js 2.19.0 to use its new html rewriting support
- update to browsertrix-behaviors to 0.6.1 to fix instagram behavior
- bump to 1.2.0-beta.3
- parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config)
- add 'Authorization' header with base64-encoded basic auth via setExtraHTTPHeaders() (see the sketch after this list)
- tests: add test for crawling with auth using http-server serving the local docs build (docs are now built as part of CI)
- docs: add HTTP Auth to YAML config section
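A minimal sketch of the auth flow; the helper names are illustrative, while `setExtraHTTPHeaders()` is the puppeteer API mentioned above:

```
import { Page } from "puppeteer-core";

// Derive a basic-auth value from a seed URL's username/password, if present.
function basicAuthFromURL(urlStr: string): string | null {
  const url = new URL(urlStr);
  if (!url.username) {
    return null;
  }
  const user = decodeURIComponent(url.username);
  const pass = decodeURIComponent(url.password);
  return Buffer.from(`${user}:${pass}`).toString("base64");
}

async function applyAuth(page: Page, seedUrl: string): Promise<void> {
  const auth = basicAuthFromURL(seedUrl);
  if (auth) {
    // sent with every request made by this page
    await page.setExtraHTTPHeaders({ Authorization: `Basic ${auth}` });
  }
}
```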
---------
Co-authored-by: Ed Summers <ehs@pobox.com>
Fixes #587
The proxy env vars PROXY_HOST and PROXY_PORT were being ignored, as they
were hardcoded to obsolete values in the Dockerfile.
Proxy settings can now be set, in order of precedence (sketched below), via:
- --proxyServer cli flag
- PROXY_SERVER env var
- PROXY_HOST and PROXY_PORT env vars, which set an HTTP proxy server
only (for backwards compatibility with 0.12.x)
The --proxyServer / PROXY_SERVER settings are passed to the browser via
the --proxy-server flag.
AsyncFetcher / direct fetch also supports HTTP and SOCKS5 proxying.
Supported proxies are: HTTP no auth, SOCKS5 no auth, SOCKS5 with auth
(supported in Brave, but not Chrome!)
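A minimal sketch of the precedence order; the flag and env var names come from the description above, the function itself is illustrative:

```
function resolveProxyServer(cliProxyServer?: string): string | undefined {
  if (cliProxyServer) {
    return cliProxyServer;                     // 1. --proxyServer cli flag
  }
  if (process.env.PROXY_SERVER) {
    return process.env.PROXY_SERVER;           // 2. PROXY_SERVER env var
  }
  if (process.env.PROXY_HOST && process.env.PROXY_PORT) {
    // 3. legacy PROXY_HOST / PROXY_PORT (0.12.x compatibility), HTTP proxy only
    return `http://${process.env.PROXY_HOST}:${process.env.PROXY_PORT}`;
  }
  return undefined;
}
```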
---------
Co-authored-by: Vinzenz Sinapius <Vinzenz.Sinapius@gmail.com>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
In particular, an API call to /navigate starts a navigation but doesn't wait for the
page load to finish, since the user can choose to close the profile browser
at any time. This ensures that user operations don't cause the browser to crash if
page.goto() is interrupted or fails (browser closed, profile saved, etc.) while a page is still loading.
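A minimal, illustrative sketch of the idea; the real handling lives in the profile-creation entrypoint:

```
import { Page } from "puppeteer-core";

// Navigation may be interrupted (browser closed, profile saved) at any time,
// so failures are logged rather than allowed to crash the browser/process.
async function navigateSafely(page: Page, url: string): Promise<void> {
  try {
    await page.goto(url, { waitUntil: "load" });
  } catch (e) {
    console.warn(`navigation to ${url} did not complete: ${e}`);
  }
}
```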
bump to 1.1.1
The 0.6.0 release of Browsertrix Behaviors /
webrecorder/browsertrix-behaviors#70 introduces support for site-specific behaviors to implement an `awaitPageLoad()` function, which allows waiting for specific resources after page load.
- This PR just adds a call to this function directly after page load (see the sketch after this list).
- Factors the call out into an `awaitPageLoad()` method used in both crawler and replaycrawler to support the same wait in QA Mode
- This is to support a custom loading wait time for Instagram (and other sites in the future)
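A minimal sketch of the call; the `__bx_behaviors` page global is assumed here and the wrapper itself is illustrative:

```
import { Page } from "puppeteer-core";

// Invoke browsertrix-behaviors' awaitPageLoad() in the page, if the behaviors
// bundle (assumed to be injected already) provides it.
async function awaitPageLoad(page: Page): Promise<void> {
  await page.evaluate(async () => {
    const behaviors = (globalThis as any).__bx_behaviors;
    if (behaviors && behaviors.awaitPageLoad) {
      await behaviors.awaitPageLoad();
    }
  });
}
```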
Now that RWP 2.0.0 with adblock support has been released
(webrecorder/replayweb.page#307), this enables adblock on the QA mode
RWP embed, to get more accurate screenshots.
Fetches the adblock.gz directly from RWP (though could also fetch it
separately from Easylist)
Updates to 1.1.0-beta.5
Fixes #513
If an absolute path isn't provided to the `create-login-profile`
entrypoint's `--filename` option, resolve the value given within
`/crawls/profiles`.
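A minimal sketch of that resolution; the helper name is illustrative, the `/crawls/profiles` default comes from the description:

```
import path from "path";

function resolveProfileFilename(filename: string): string {
  // absolute paths are kept as-is, relative ones resolve under /crawls/profiles
  return path.isAbsolute(filename) ? filename : path.resolve("/crawls/profiles", filename);
}
```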
Also updates the docs cli-options section to include the
`create-login-profile` entrypoint and adjusts the script to
automatically generate this page accordingly.
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Using latest puppeteer-core to keep up with latest browsers, mostly
minor syntax changes
Due to change in puppeteer hiding the executionContextId, need to create
a frameId->executionContextId mapping and track it ourselves to support
the custom evaluateWithCLI() function
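A minimal sketch of such a mapping; the CDP event names are real, but the surrounding wiring is illustrative and assumes `Runtime.enable` has been sent on the session:

```
import { CDPSession, Protocol } from "puppeteer-core";

// frameId -> current executionContextId, maintained from CDP events
const contextIdForFrame = new Map<string, number>();

function trackExecutionContexts(cdp: CDPSession): void {
  cdp.on("Runtime.executionContextCreated",
    (event: Protocol.Runtime.ExecutionContextCreatedEvent) => {
      const frameId = (event.context.auxData as { frameId?: string })?.frameId;
      if (frameId) {
        contextIdForFrame.set(frameId, event.context.id);
      }
    });

  cdp.on("Runtime.executionContextDestroyed",
    (event: Protocol.Runtime.ExecutionContextDestroyedEvent) => {
      for (const [frameId, id] of contextIdForFrame) {
        if (id === event.executionContextId) {
          contextIdForFrame.delete(frameId);
        }
      }
    });
}
```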
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs
Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to a WACZ or multi-WACZ json that will be replayed/QA'd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay / diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
comparison: {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts: {
    crawlGood?: number;
    crawlBad?: number;
    replayGood?: number;
    replayBad?: number;
  };
};
```
- bump version to 1.1.0-beta.2
Due to issues with capturing top-level pages, make bypassing service
workers the default for now. Previously, it was only disabled when using
profiles. (This is also consistent with ArchiveWeb.page behavior).
Includes:
- add --serviceWorker option which can be `disabled`,
`disabled-if-profile` (the previous default), or `enabled` (see the sketch after this list)
- ensure page timestamp is set for direct fetch
- warn if page timestamp is missing on serialization, then set to now
before serializing
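A minimal sketch of what the option values mean; the function name is illustrative:

```
type ServiceWorkerOpt = "disabled" | "disabled-if-profile" | "enabled";

function bypassServiceWorkers(opt: ServiceWorkerOpt, usingProfile: boolean): boolean {
  if (opt === "enabled") {
    return false;            // never bypass service workers
  }
  if (opt === "disabled-if-profile") {
    return usingProfile;     // previous default: bypass only when a profile is used
  }
  return true;               // "disabled": always bypass (new default)
}
```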
bump version to 1.0.2
This PR provides improved support for running crawler as non-root,
matching the user to the uid/gid of the crawl volume.
This fixes #502, an initial regression from 0.12.4, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.
However, that was not enough to fully support equivalent signal handling
/ graceful shutdown as when running as the same user. To make the
running-as-a-different-user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc..)
to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)
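A minimal sketch of the detached launch; the helper is illustrative, `DETACHED_CHILD_PROC` is the env var described above:

```
import { spawn, ChildProcess } from "child_process";

// Launch helpers (redis-server, socat, wacz, ...) detached so they aren't
// killed by SIGINT/SIGTERM delivered to the main process group.
function launchChild(cmd: string, args: string[]): ChildProcess {
  const detached = process.env.DETACHED_CHILD_PROC === "1";
  return spawn(cmd, args, { detached, stdio: "inherit" });
}
```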
A test has been added which runs one of the existing tests
(saved-state.test.js) with a non-root `test-crawls` directory to exercise the
different-user path. That test includes sending interrupt signals and graceful
shutdown, and so allows testing of those features under non-root gosu
execution.
Also bumping crawler version to 1.0.1
Adds a new SAX-based sitemap parser (a rough sketch follows the feature list below), inspired by:
https://www.npmjs.com/package/sitemap-stream-parser
Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5)
- `fromDate` and `toDate` filter dates, to only include URLs between the given
dates, with the filtering also applied to nested sitemap lists
- async parsing, continue parsing in the background after 100 URLs
- timeout for initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark if sitemaps have already been parsed
in redis, serialize to save state, to avoid reparsing again. (Will
reparse if parsing did not fully finish)
- Aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit.
- robots.txt `sitemap:` parsing, check URL extension and mime type
- automatic detection of sitemaps for a seed URL if no sitemap URL is provided - first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from specific URL.
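A rough, illustrative sketch of the streaming approach using `sax` and `p-queue` (library choices mirror the description; sitemap autodetection, date filtering, and page limits are omitted):

```
import sax from "sax";
import PQueue from "p-queue";

// process up to 5 nested sitemaps concurrently, as described above
const queue = new PQueue({ concurrency: 5 });

async function parseSitemap(url: string, onUrl: (pageUrl: string) => void): Promise<void> {
  const parser = sax.parser(true);
  let inLoc = false;
  let inSitemapEntry = false;
  let text = "";

  parser.onopentag = (tag) => {
    if (tag.name === "sitemap") inSitemapEntry = true;
    inLoc = tag.name === "loc";
    if (inLoc) text = "";
  };
  parser.ontext = (t) => { if (inLoc) text += t; };
  parser.onclosetag = (name) => {
    if (name === "loc") {
      const loc = text.trim();
      if (inSitemapEntry) {
        // nested sitemap index entry: parse it in the background
        queue.add(() => parseSitemap(loc, onUrl));
      } else {
        onUrl(loc);
      }
      inLoc = false;
    }
    if (name === "sitemap") inSitemapEntry = false;
  };

  const resp = await fetch(url);
  // simplified: the real parser streams the body instead of buffering it
  parser.write(await resp.text()).close();
}
```

Waiting on `queue.onIdle()` after the initial call would wait for all nested sitemaps to finish parsing in the background.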
Fixes #496
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Fixes #493
This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MkDocs site.
Initial docs site set to https://crawler.docs.browsertrix.com/
Many thanks to @Shrinks99 for help setting this up!
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>