Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs
Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to WACZ or multi-WACZ json that will be replay/QAd
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay/ diff
images.
- If using --writePagesToRedis, a `comparison` key is added to existing page data where:
```
comparison: {
screenshotMatch?: number;
textMatch?: number;
resourceCounts: {
crawlGood?: number;
crawlBad?: number;
replayGood?: number;
replayBad?: number;
};
};
```
- bump version to 1.1.0-beta.2
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.
Fixes#491
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remvoe
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing for every excluded URL
- tests: update saved-state test to be more resilient to delays
args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334
bump to 0.12.1
- set done key correctly, just an int now
- also check if array for old-style save states (for backwards
compatibility)
- fixes#411
- tests: includes tests using redis: tests save state + dynamically
adding exclusions (follow up to #408)
- adds `--debugAccessRedis` flag to allow accessing local redis outside
container