Commit graph

5 commits

Author SHA1 Message Date
Ilya Kreymer
bb9c82493b
QA Crawl Support (Beta) (#469)
Initial (beta) support for QA/replay crawling!
- Supports running a crawl over a given WACZ / list of WACZ (multi WACZ) input, hosted in ReplayWeb.page
- Runs local http server with full-page, ui-less ReplayWeb.page embed
- ReplayWeb.page release version configured in the Dockerfile, pinned ui.js and sw.js fetched directly from cdnjs

Can be deployed with `webrecorder/browsertrix-crawler qa` entrypoint.
- Requires `--qaSource`, pointing to the WACZ or multi-WACZ JSON that will be replayed and QA'd.
- Also supports `--qaRedisKey` where QA comparison data will be pushed, if specified.
- Supports `--qaDebugImageDiff` for outputting crawl / replay / diff images.
- If using `--writePagesToRedis`, a `comparison` key is added to existing page data, where:
```
  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
```
- bump version to 1.1.0-beta.2
2024-03-22 17:32:42 -07:00
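A minimal sketch of how a consumer might read the `comparison` data described in bb9c82493b. The field names come from the commit message. The type name `PageComparison`, the helper `needsReview`, the 0.9 threshold, and the assumption that match scores are fractions between 0 and 1 are all hypothetical and not part of the crawler.

```typescript
// Shape copied from the commit message; the type name is hypothetical.
type PageComparison = {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts: {
    crawlGood?: number;
    crawlBad?: number;
    replayGood?: number;
    replayBad?: number;
  };
};

// Hypothetical helper: flag a page when a present score falls below the
// threshold, or when replay reports more bad resources than the crawl did.
// Assumes scores are fractions in [0, 1]; the commit does not state the range.
function needsReview(c: PageComparison, threshold = 0.9): boolean {
  const lowScreenshot =
    c.screenshotMatch !== undefined && c.screenshotMatch < threshold;
  const lowText = c.textMatch !== undefined && c.textMatch < threshold;
  // Missing counts are treated as zero.
  const { crawlBad = 0, replayBad = 0 } = c.resourceCounts;
  return lowScreenshot || lowText || replayBad > crawlBad;
}
```

Every field except `resourceCounts` is optional in the commit's definition, so the sketch treats a missing score as "not measured" rather than as a failure.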
Ilya Kreymer
6d04c9575f
Fix Save/Load State (#495)
- Fixes state serialization, which was missing the done list. Instead,
adds a 'finished' list computed from the seen list, minus failed and
queued URLs.
- Also adds serialization support for 'extraSeeds', seeds added
dynamically from a redirect (via #475). Extra seeds are added to Redis
and also included in the serialization.

Fixes #491
2024-03-15 20:54:43 -04:00
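The 'finished' list described in 6d04c9575f is a set difference: seen URLs minus failed and queued URLs. The sketch below illustrates only that computation on in-memory arrays. The function name `computeFinished` and its signature are hypothetical, and the actual crawler keeps this state in Redis.

```typescript
// Hypothetical helper: derive the finished URLs as
// seen minus (failed union queued), preserving the order of `seen`.
function computeFinished(
  seen: string[],
  failed: string[],
  queued: string[],
): string[] {
  // Collect everything that must not count as finished.
  const excluded = new Set<string>(failed.concat(queued));
  return seen.filter((url) => !excluded.has(url));
}
```

For example, with seen `["a", "b", "c", "d"]`, failed `["b"]`, and queued `["d"]`, the result is `["a", "c"]`.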
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
dd7b926d87
Exclusion Optimizations: follow-up to (#423)
Follow-up to #408 - optimized exclusion filtering:
- use zscan with default count instead of ordered scan to remove
- use glob match when possible (non-regex as determined by string check)
- move isInScope() check to worker to avoid creating a page and then
closing for every excluded URL
- tests: update saved-state test to be more resilient to delays

args: also support '--text false' for backwards compatibility, fixes
webrecorder/browsertrix-cloud#1334

bump to 0.12.1
2023-11-03 15:15:09 -07:00
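One way the "glob match when possible" idea from dd7b926d87 could look, as a sketch only. The commit says a string check decides whether an exclusion is a regex, but does not say which check. The metacharacter test, the function names `isPlainString` and `toScanMatch`, and the `*...*` wrapping are assumptions made for illustration.

```typescript
// Characters that give a pattern regex meaning. This particular check is
// an assumption; the commit only says "as determined by string check".
const REGEX_META = /[\\^$.*+?()[\]{}|]/;

// True when the exclusion contains no regex metacharacters, so it can be
// treated as a literal substring.
function isPlainString(pattern: string): boolean {
  return !REGEX_META.test(pattern);
}

// Hypothetical helper: build a glob usable as a scan MATCH pattern for a
// plain-string exclusion, or return null when the caller should fall back
// to testing each entry with a real RegExp.
function toScanMatch(pattern: string): string | null {
  return isPlainString(pattern) ? `*${pattern}*` : null;
}
```

Because every glob special character (`*`, `?`, `[`, `]`, `\`) is also in the metacharacter set, a string that passes the check is safe to embed in the glob without escaping. The check is deliberately conservative: a pattern containing `.` is treated as a regex.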
Ilya Kreymer
8c92901889
load saved state fixes + redis tests (#415)
- set done key correctly, just an int now
- also check if array for old-style save states (for backwards
compatibility)
- fixes #411
- tests: includes tests using redis: tests save state + dynamically
adding exclusions (follow up to #408)
- adds `--debugAccessRedis` flag to allow accessing local redis outside
container
2023-10-23 09:36:10 -07:00
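The backwards-compatibility check in 8c92901889 (done is "just an int now", with old-style save states still accepted as arrays) could be sketched as below. The function name `normalizeDone` is hypothetical, and treating an old-style array's length as the done count is an assumption about how the two formats relate.

```typescript
// Hypothetical helper: accept either the new integer form of `done` or an
// old-style array, and return a count in both cases.
function normalizeDone(done: unknown): number {
  // Old-style save state: assume the array length is the done count.
  if (Array.isArray(done)) {
    return done.length;
  }
  // New-style save state: a non-negative integer.
  if (typeof done === "number" && Number.isInteger(done) && done >= 0) {
    return done;
  }
  // Anything else (missing or malformed) is treated as nothing done yet.
  return 0;
}
```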