Commit graph

31 commits

Author SHA1 Message Date
Ilya Kreymer
28241c824e ci: fixes to deploy ci workflow 2025-04-03 23:36:49 -07:00
Ilya Kreymer
7421404aee
ci: add workflow to deploy to dev channels (requires actions secrets config) (#815)
- uses DEPLOY_REGISTRY, DEPLOY_REGISTRY_PATH, DEPLOY_REGISTRY_API_TOKEN
secrets
2025-04-03 23:21:48 -07:00
Ilya Kreymer
e585b6d194
Better default crawlId (#806)
- set crawl id from collection, not other way around, to ensure unique
redis keyspace for different collections
- by default, set crawl id to unique value based on host and collection,
eg. '@hostname-@id'
- don't include '@id' in collection interpolation, can only used
hostname or timestamp
- fixes issue mentioned / workaround provided in #784 
- ci: add docker login + cacheing to work around rate limits
- tests: fix sitemap tests
2025-04-01 13:40:03 -07:00
Ilya Kreymer
d923e11436
separate fetch api for autofetch bbehavior + additional improvements on partial responses: (#736)
Chromium now interrupts fetch() if abort() is called or page is
navigated, so autofetch behavior using native fetch() is less than
ideal. This PR adds support for __bx_fetch() command for autofetch
behavior (supported in browsertrix-behaviors 0.6.6) to fetch separately
from browser's reguar fetch()
- __bx_fetch() starts a fetch, but does not return content to browser,
doesn't need abort(), unaffected by page navigation, but will still try
to use browser network stack when possible, making it more efficient for
background fetching.
- if network stack fetch fails, fallback to regular node fetch() in the
crawler.
Additional improvements for interrupted fetch:
- don't store truncated media responses, even for 200
- avoid doing duplicate async fetching if response already handled (eg.
fetch handled in multiple contexts)
- fixes #735, where fetch was interrupted, resulted in an empty response
2024-12-31 13:52:12 -08:00
Ilya Kreymer
d497a424fc
tests: disable blockrules youtube tests in CI (#698)
due to youtube being blocked, disable test involving youtube embeds when
running in CI for now
2024-10-04 17:37:13 -07:00
Ilya Kreymer
85a07aff18
Streaming in-place WACZ creation + CDXJ indexing (#673)
Fixes #674 

This PR supersedes #505, and instead of using js-wacz for optimized WACZ
creation:
- generates an 'in-place' or 'streaming' WACZ in the crawler, without
having to copy the data again.
- WACZ contents are streamed to remote upload (or to disk) from existing
files on disk
- CDXJ indices per-WARC are first written to 'warc-cdx' directory, then merged using the linux 'sort' command, and compressed to ZipNum if >50K (or always if using --generateCDX)
- All data in the WARCs is written and read only once
- Should result in significant speed / disk usage improvements:
previously WARC was written once, then read again (for CDXJ indexing),
read again (for adding to new WACZ ZIP), written to disk (into new WACZ
ZIP), read again (if upload to remote endpoint). Now, WARCs are written
once, along with the per-WARC CDXJ, the CDXJ only is reread, sorted and merged on-disk, and all
data is read once to either generate WACZ on disk or upload to remote.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-08-29 13:21:20 -07:00
Ilya Kreymer
c61a03de6e ci: use docker compose instead of docker-compose 2024-08-14 21:21:35 -07:00
benoit74
4cc67a3267
Update Brave image + isolated Python venv for dependencies installation (#591)
- Debian distro now requires the use of virtual environments to not mess
with dependencies installed by official apt packages
- removes tldextract update now that pywb is not in use anymore
- bump brave version to 1.68.141, for use with base image added in
https://github.com/webrecorder/browsertrix-browser-base/pull/20

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-08-14 21:12:00 -07:00
Tessa Walsh
84129b1888
QA: Ensure empty string text is propagated for QA comparison (#667)
Fixes #666 

Fixes two issues with QA replay text extraction:
- ensures empty string text from QA replay is treated as empty string, instead of undefined
- avoids a divide by zero when both original and replay text
strings was 0.
Ensures the match is 1.0 if both crawl and QA replay text is an empty string

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-08-09 13:20:56 -07:00
Ilya Kreymer
3339374092
http auth support per seed (supersedes #566): (#616)
- parse URL username/password, store in 'auth' field in seed, or pass in 'auth' field directly (from yaml config)
- add 'Authorization' header with base64 encoded basic auth via setExtraHTTPHeaders()
- tests: add test for crawling with auth using http-server using local docs build (now build docs as part of CI)
- docs: add HTTP Auth to YAML config section

---------
Co-authored-by: Ed Summers <ehs@pobox.com>
2024-06-20 16:35:30 -07:00
Ilya Kreymer
894681e5fc
Bump version to 1.2.0 Beta + make draft release for each commit (#582)
Generate draft release from main and *-release branches to simplify
release process

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-05-22 15:45:48 -07:00
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay/ diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
  ```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
Ilya Kreymer
1fe810b1df
Improved support for running as non-root (#503)
This PR provides improved support for running crawler as non-root,
matching the user to the uid/gid of the crawl volume.

This fixes #502 initial regression from 0.12.4, where `chmod u+x` was
used instead of `chmod a+x` on the node binary files.

However, that was not enough to fully support equivalent signal handling
/ graceful shutdown as when running with the same user. To make the
running as different user path work the same way:
- need to switch to `gosu` instead of `su` (added in Brave 1.64.109
image)
- run all child processes as detached (redis-server, socat, wacz, etc..)
to avoid them automatically being killed via SIGINT/SIGTERM
- running detached is controlled via `DETACHED_CHILD_PROC=1` env
variable, set to 1 by default in the Dockerfile (to allow for overrides
just in case)

A test has been added which runs one of the tests with a non-root
`test-crawls` directory to test the different user path. The test
(saved-state.test.js) includes sending interrupt signals and graceful
shutdown and allows testing of those features for a non-root gosu
execution.

Also bumping crawler version to 1.0.1
2024-03-21 08:16:59 -07:00
Tessa Walsh
e1fe028c7c
Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MKDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
af1e0860e4
TypeScript Conversion (#425)
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
2023-11-09 11:27:11 -08:00
Ilya Kreymer
15661eb9c8
More flexible multi value arg parsing + README update for 0.12.0 (#422)
Updated arg parsing thanks to example in
https://github.com/yargs/yargs/issues/846#issuecomment-517264899
to support multiple value arguments specified as either one string or
multiple string using array type + coerce function.

This allows for `choice` option to also be used to validate the options,
when needed.

With this setup, `--text to-pages,to-warc,final-to-warc`, `--text
to-pages,to-warc --text final-to-warc` and `--text to-pages --text
to-warc --text final-to-warc` all result in the same configuration!

Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system.

Also updated README with new text extraction options and bumped version
to 0.12.0

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-11-02 11:47:37 -07:00
Ilya Kreymer
c6cbbc1a17
Update CI Release Action (#386)
* update to latest actions, use docker meta action with semver tags
2023-09-18 22:43:47 -05:00
Ilya Kreymer
d4233582bb ci: bump yarn install timeout for ci, use latest gh action 2023-04-03 12:18:42 -07:00
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
2022-11-15 18:30:27 -08:00
Ilya Kreymer
82c771f7cd ci: possibly fix for ci release build (issues building uwsgi) 2022-07-03 11:09:06 -07:00
Ilya Kreymer
f0c5ca1035 ci release: fix typo in release yaml config for latest tag 2021-07-23 20:00:40 -07:00
Ilya Kreymer
0e0b85d7c3
Customizable extract selectors + typo fix (0.4.2) (#72)
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure

* custom link extraction improvements (improvements for #25) 
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector

* tests
- tests: all tests uses 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction

* add to CHANGES, bump to 0.4.2
2021-07-23 18:31:43 -07:00
Ilya Kreymer
bd44190ab2
Build simplification: Use :latest Version By default + README update (#71)
* docker-compose: just use ':latest' tag for local builds, allow users working with local docker-compose.yml to just build latest image
- ci: add 'latest' tag to release ci build to automatically update latest as well
- README: remove '[VERSION]', just refer to latest version of image in all examples
- README: mention using specific released tag version for production
2021-07-22 17:46:10 -07:00
Ilya Kreymer
f4c6b6a99f
0.4.1 Release! (#70)
* optimization: don't intercept requests if no blockRules set

* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages

* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)

* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile

* screencasting: convert newContext to window instead of page by default, instead of just warning about it

* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2

* seeds: add trim() to seed URLs

* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically

* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles

* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25

* update CHANGES and README with new features

* bump version to 0.4.1
2021-07-22 14:24:51 -07:00
Ilya Kreymer
473de8c49f
Scope Handling Improvements + Tests (#66)
* scope fixes:
- remove default prefix scopeType, ensure scope include and exclude take precedence
- add new 'custom' scopeType, when include or exclude are used
- use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude)
- ensure per-seed scope include/exclude used when present, and scopeType set to 'custom'
- ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified
- rename --type to --scopeType in seed to maintain consistency
- add sitemap param as alias for useSitemap

tests: 
- add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides
- fix screencaster to use relative paths to work with tests
- ci: use yarn instead of npm

* update README with new flags

* bump version to 0.4.0-beta.3
2021-07-06 20:22:27 -07:00
Ilya Kreymer
f57818f2f6
New Docker Image, Customizable Browser Source + Binary (#62)
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)

* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)

* github action ci: use system unzip

* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.

* Update README with info on customizing build image

* bump version to 0.4.0-beta.2
2021-06-24 15:39:17 -07:00
Ilya Kreymer
dba4524246 ci: add push to registry on release action 2021-04-14 15:45:20 -07:00
Emma Dickson
24e2c4ddf8
Create --combineWARC flag that combines generated warcs into a single warc upto rollover size (#33)
* generates combined WARCs in collection root directory with suffix `_0.warc`, `_1.warc`, etc..
* each combined WARC limited by the size in `--rolloverSize`, if exceeds a new WARC is created, otherwise appended to previous WARC.
* add test for --combineWARC flag
* add improved lint rules

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-03-31 10:41:27 -07:00
Emma Dickson
fb0f1d8db9
tests text extraction (#30)
* new tests

* add jest to eslint, lint fixes

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-03-01 16:00:23 -08:00
Emma Dickson
7cfeefd19b
add ci and linting (#21)
* linting with eslint
* ci: validate linting and check basic single-page crawl with wacz creation

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-02-08 09:45:46 -08:00