- set crawl id from the collection, not the other way around, to ensure a unique Redis keyspace for different collections
- by default, set the crawl id to a unique value based on host and collection, e.g. '@hostname-@id'
- don't include '@id' in collection interpolation; only the hostname or timestamp can be used (see the sketch after this list)
- fixes the issue mentioned (and workaround provided) in #784
- ci: add docker login + caching to work around rate limits
- tests: fix sitemap tests
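Below is a minimal sketch of how such placeholder interpolation might work; the `interpolateId` helper and its timestamp format are illustrative assumptions, not the crawler's actual implementation:
```ts
import os from "os";

// Hypothetical helper: expand '@hostname', '@ts', and (for crawl ids only)
// '@id' placeholders. '@id' is disallowed for collection names to avoid
// a circular definition, per the change above.
function interpolateId(template: string, collection: string, allowId = true): string {
  let result = template
    .replace("@hostname", os.hostname())
    .replace("@ts", new Date().toISOString().replace(/[^\d]/g, "").slice(0, 14));
  if (allowId) {
    result = result.replace("@id", collection);
  }
  return result;
}

// default crawl id '@hostname-@id' -> e.g. 'crawler-1-my-collection'
console.log(interpolateId("@hostname-@id", "my-collection"));
```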
Chromium now interrupts fetch() if abort() is called or the page is navigated, so autofetch behavior using the native fetch() is less than ideal. This PR adds support for a __bx_fetch() command for autofetch behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately from the browser's regular fetch().
- __bx_fetch() starts a fetch but does not return content to the browser; it doesn't need abort() and is unaffected by page navigation, yet will still try to use the browser network stack when possible, making it more efficient for background fetching.
- if the network-stack fetch fails, fall back to a regular node fetch() in the crawler.
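A sketch of how an in-page behavior might prefer the new command, with a fallback to native fetch(); the exact __bx_fetch() signature here is an assumption:
```ts
// Runs in the page context; __bx_fetch is exposed by the crawler when
// browsertrix-behaviors 0.6.6+ is in use (signature assumed for illustration).
type BxFetch = (url: string) => Promise<void>;

async function backgroundFetch(url: string): Promise<void> {
  const bxFetch = (window as { __bx_fetch?: BxFetch }).__bx_fetch;
  if (bxFetch) {
    // fetched by the crawler via the browser network stack (or node fallback);
    // no body is returned to the page and navigation won't interrupt it
    await bxFetch(url);
  } else {
    // best-effort native fetch; may be interrupted by abort() or navigation
    await fetch(url).catch(() => {});
  }
}
```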
Additional improvements for interrupted fetch:
- don't store truncated media responses, even for a 200 status
- avoid doing duplicate async fetching if the response has already been handled (e.g. a fetch handled in multiple contexts)
- fixes #735, where an interrupted fetch resulted in an empty response
Fixes #674
This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements: previously, a WARC was written once, then read again (for CDXJ indexing), read again (for adding to the new WACZ ZIP), written to disk (into the new WACZ ZIP), and read again (if uploading to a remote endpoint). Now, WARCs are written once along with the per-WARC CDXJ; only the CDXJ is reread, sorted, and merged on disk, and all data is read once to either generate the WACZ on disk or upload it to the remote endpoint.
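A rough sketch of the merge step, shelling out to the Linux `sort` command over the per-WARC CDXJ files (the helper and paths are illustrative; ZipNum compression is omitted):
```ts
import { spawn } from "child_process";
import { createWriteStream } from "fs";
import { readdir } from "fs/promises";
import path from "path";

// Merge per-WARC CDXJ files from the 'warc-cdx' directory into one
// sorted index, using byte-order (LC_ALL=C) sorting.
async function mergeCDXJ(cdxDir: string, outFile: string): Promise<void> {
  const files = (await readdir(cdxDir))
    .filter((name) => name.endsWith(".cdxj"))
    .map((name) => path.join(cdxDir, name));

  if (!files.length) {
    return;
  }

  await new Promise<void>((resolve, reject) => {
    const proc = spawn("sort", files, { env: { ...process.env, LC_ALL: "C" } });
    proc.stdout.pipe(createWriteStream(outFile));
    proc.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`sort exited with ${code}`)),
    );
  });
}
```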
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
- The Debian distro now requires the use of virtual environments to avoid interfering with dependencies installed by official apt packages
- removes the tldextract update, now that pywb is no longer in use
- bump brave version to 1.68.141, for use with base image added in
https://github.com/webrecorder/browsertrix-browser-base/pull/20
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Fixes #666
Fixes two issues with QA replay text extraction:
- ensures empty-string text from QA replay is treated as an empty string, instead of undefined
- avoids a divide-by-zero when both the original and replay text strings have length 0.
Ensures the match is 1.0 if both the crawl and QA replay text are empty strings.
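A sketch of the corrected comparison logic; the function name and the similarity scoring are illustrative, only the empty-string/divide-by-zero handling reflects the fix:
```ts
// Compare extracted text from the crawl vs. QA replay.
function textMatch(crawlText: string, replayText: string): number {
  // empty replay text must stay "", not become undefined
  crawlText = crawlText ?? "";
  replayText = replayText ?? "";

  // both empty: perfect match, and avoids dividing by zero below
  if (!crawlText.length && !replayText.length) {
    return 1.0;
  }

  // naive character overlap, purely for illustration
  const maxLen = Math.max(crawlText.length, replayText.length);
  let same = 0;
  for (let i = 0; i < Math.min(crawlText.length, replayText.length); i++) {
    if (crawlText[i] === replayText[i]) {
      same++;
    }
  }
  return same / maxLen;
}
```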
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- parse the URL username/password and store them in the 'auth' field on the seed, or pass the 'auth' field in directly (from the yaml config)
- add an 'Authorization' header with base64-encoded basic auth via setExtraHTTPHeaders() (see the sketch after this list)
- tests: add test for crawling with auth using http-server using local docs build (now build docs as part of CI)
- docs: add HTTP Auth to YAML config section
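A sketch of the auth handling described above; setExtraHTTPHeaders() is the Puppeteer API named in the change, while the surrounding helpers are illustrative:
```ts
import { Page } from "puppeteer-core";

// Pull username/password out of a seed URL into an 'auth' string.
function parseAuth(seedUrl: string): { url: string; auth: string | null } {
  const parsed = new URL(seedUrl);
  if (!parsed.username) {
    return { url: seedUrl, auth: null };
  }
  const auth = `${decodeURIComponent(parsed.username)}:${decodeURIComponent(parsed.password)}`;
  parsed.username = "";
  parsed.password = "";
  return { url: parsed.href, auth };
}

// Send the credentials as a base64-encoded Basic Authorization header.
async function applyAuth(page: Page, auth: string): Promise<void> {
  await page.setExtraHTTPHeaders({
    Authorization: "Basic " + Buffer.from(auth).toString("base64"),
  });
}
```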
---------
Co-authored-by: Ed Summers <ehs@pobox.com>
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZs (multi-WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs
Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to the WACZ or multi-WACZ JSON that will be replayed/QA'd
- Also supports `--qaRedisKey`, where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl, replay, and diff images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
comparison: {
screenshotMatch?: number;
textMatch?: number;
resourceCounts: {
crawlGood?: number;
crawlBad?: number;
replayGood?: number;
replayBad?: number;
};
};
```
- bump version to 1.1.0-beta.2
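For example, a hedged sketch of consuming that page data from Redis; the key name and list semantics are assumptions that depend on crawl configuration:
```ts
import Redis from "ioredis";

interface PageComparison {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts: {
    crawlGood?: number;
    crawlBad?: number;
    replayGood?: number;
    replayBad?: number;
  };
}

// Pop page entries written via --writePagesToRedis and print QA scores.
// 'qa:pages' is a hypothetical key name for this sketch.
async function readComparisons(redisUrl: string, key = "qa:pages"): Promise<void> {
  const redis = new Redis(redisUrl);
  let entry: string | null;
  while ((entry = await redis.lpop(key))) {
    const page = JSON.parse(entry) as { url: string; comparison?: PageComparison };
    console.log(page.url, page.comparison?.textMatch, page.comparison?.screenshotMatch);
  }
  redis.disconnect();
}
```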
This PR provides improved support for running the crawler as non-root, matching the user to the uid/gid of the crawl volume.
This fixes #502, an initial regression from 0.12.4, where `chmod u+x` was used instead of `chmod a+x` on the node binary files.
However, that was not enough to fully support signal handling / graceful shutdown equivalent to running as the same user. To make the running-as-a-different-user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc.) to avoid them being automatically killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)
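A sketch of the detached-child pattern (the env var name comes from above; the wrapper itself is illustrative):
```ts
import { spawn, ChildProcess } from "child_process";

// Launch helper processes (redis-server, socat, wacz, ...) detached when
// DETACHED_CHILD_PROC=1, putting them in their own process group so a
// terminal-delivered SIGINT/SIGTERM doesn't kill them before the crawler
// can shut them down gracefully itself.
function launchChild(cmd: string, args: string[]): ChildProcess {
  const detached = process.env.DETACHED_CHILD_PROC === "1";
  return spawn(cmd, args, { detached, stdio: "inherit" });
}

const redis = launchChild("redis-server", ["--port", "16379"]);
// on graceful shutdown, the crawler signals children explicitly
process.on("SIGTERM", () => redis.kill("SIGTERM"));
```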
A test has been added which runs one of the existing tests (saved-state.test.js) with a non-root `test-crawls` directory to exercise the different-user path. That test includes sending interrupt signals and graceful shutdown, so it allows testing of those features under non-root gosu execution.
Also bumping crawler version to 1.0.1
Fixes #493
This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MKDocs site.
Initial docs site set to https://crawler.docs.browsertrix.com/
Many thanks to @Shrinks99 for help setting this up!
---------
Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
Follows #424. Converts the upcoming 1.0.0 branch, based on native browser-based traffic capture and recording, to TypeScript. Fixes #426
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
Updated arg parsing thanks to example in
https://github.com/yargs/yargs/issues/846#issuecomment-517264899
to support multiple value arguments specified as either one string or
multiple string using array type + coerce function.
This allows the `choices` option to also be used to validate the options, when needed.
With this setup, `--text to-pages,to-warc,final-to-warc`, `--text
to-pages,to-warc --text final-to-warc` and `--text to-pages --text
to-warc --text final-to-warc` all result in the same configuration!
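A sketch of the pattern, modeled on the linked yargs issue comment (the option metadata here is trimmed for illustration):
```ts
import yargs from "yargs";

// Accept repeated flags and/or comma-separated values for --text,
// splitting into a flat list that `choices` can then validate.
const argv = yargs(process.argv.slice(2))
  .option("text", {
    describe: "text extraction modes",
    type: "array",
    coerce: (vals: string[]) =>
      vals.flatMap((v) => String(v).split(",")).filter((v) => v.length > 0),
    choices: ["to-pages", "to-warc", "final-to-warc"],
  })
  .parseSync();

console.log(argv.text);
```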
Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system.
Also updated README with new text extraction options and bumped version
to 0.12.0
---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML URLs to fail.
- wrap directFetchCapture() to retry browser loading in case of failure
* custom link extraction improvements (improvements for #25)
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector
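A sketch of such a custom driver using the new select opts; the driver shape follows the crawler's custom-driver convention, and the second selector is a hypothetical example:
```ts
// Custom driver: extract links from the default anchors plus a custom
// 'data-href' attribute, via {selector, extract, isAttribute} select opts.
export default async ({ data, page, crawler }: any) => {
  await crawler.loadPage(page, data, [
    { selector: "a[href]", extract: "href", isAttribute: false },
    { selector: "div[data-href]", extract: "data-href", isAttribute: true },
  ]);
};
```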
* tests
- tests: all tests use 'test-crawls' instead of 'crawls'
- consolidation: combine the initial crawl + rollover, combined-WARC, and text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction
* add to CHANGES, bump to 0.4.2
* docker-compose: just use ':latest' tag for local builds, allow users working with local docker-compose.yml to just build latest image
- ci: add 'latest' tag to release ci build to automatically update latest as well
- README: remove '[VERSION]', just refer to latest version of image in all examples
- README: mention using specific released tag version for production
* optimization: don't intercept requests if no blockRules set
* page load: set waitUntil to networkidle2 instead of networkidle0, a reasonable default for most pages
* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)
* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile
* screencasting: convert newContext to window instead of page by default, instead of just warning about it
* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2
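A sketch of how picking the browser binary might look; the candidate paths are illustrative assumptions:
```ts
import { existsSync } from "fs";

// Return the first available browser executable: google-chrome on amd64
// images, chromium-browser on arm64.
function getBrowserExe(): string {
  for (const name of ["google-chrome", "chromium-browser"]) {
    const exe = `/usr/bin/${name}`;
    if (existsSync(exe)) {
      return exe;
    }
  }
  throw new Error("no supported browser binary found");
}
```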
* seeds: add trim() to seed URLs
* logging: reduce initial debug logging, enabling it only if '--logging debug' is set; always log whether profile and text-extraction are enabled, and log post-processing stages automatically
* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles
* extractLinks: support passing in a custom property to get the link, and also loading it as an attribute via getAttribute. Fixes #25
* update CHANGES and README with new features
* bump version to 0.4.1
* scope fixes:
- remove default prefix scopeType, ensure scope include and exclude take precedence
- add new 'custom' scopeType, when include or exclude are used
- use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude)
- ensure per-seed scope include/exclude used when present, and scopeType set to 'custom'
- ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified
- rename --type to --scopeType in seed to maintain consistency
- add sitemap param as alias for useSitemap
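A sketch of the resolution order described above (types and helper are illustrative):
```ts
interface SeedScope {
  scopeType?: string;
  include?: string[];
  exclude?: string[];
}

// include/exclude take precedence and force 'custom'; otherwise use the
// seed's scopeType, defaulting to 'prefix' when nothing is specified.
function resolveScopeType(seed: SeedScope): string {
  if (seed.include?.length || seed.exclude?.length) {
    return "custom";
  }
  return seed.scopeType ?? "prefix";
}
```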
tests:
- add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides
- fix screencaster to use relative paths to work with tests
- ci: use yarn instead of npm
* update README with new flags
* bump version to 0.4.0-beta.3
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)
* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)
* github action ci: use system unzip
* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.
* Update README with info on customizing build image
* bump version to 0.4.0-beta.2
* generates combined WARCs in the collection root directory with suffixes `_0.warc`, `_1.warc`, etc.
* each combined WARC is limited by the size in `--rolloverSize`; if exceeded, a new WARC is created, otherwise data is appended to the previous WARC.
* add test for --combineWARC flag
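A sketch of the rollover decision (the helper name and file layout are illustrative):
```ts
import { existsSync, statSync } from "fs";
import path from "path";

// Pick the combined WARC to append to: reuse the current `_N.warc` until
// it exceeds rolloverSize, then move on to `_N+1.warc`.
function nextCombinedWarc(dir: string, prefix: string, rolloverSize: number): string {
  for (let i = 0; ; i++) {
    const name = path.join(dir, `${prefix}_${i}.warc`);
    if (!existsSync(name) || statSync(name).size < rolloverSize) {
      return name;
    }
  }
}
```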
* add improved lint rules
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
* linting with eslint
* ci: validate linting and check basic single-page crawl with wacz creation
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>