browsertrix-crawler/docs
Ilya Kreymer 154151913a
Dedup Initial Implementation (#889)
Fixes #884 

- Support for hash-based deduplication via a Redis instance provided with `--redisDedupeUrl` (can be the same as the default Redis)
- Support for writing WARC revisit records for duplicates
- Support for a new indexer mode which imports CDXJ from one or more WACZs (refactored from replay) to populate the dedupe index
- Crawl and aggregate stats updated in the dedupe index, including total URLs, deduped URLs, conserved size (the difference in size between revisit and response records), and estimated redundant size (aggregate size of duplicates that were not deduped)
- Removed crawls are tracked on index update; the `--remove` operation purges removed crawls, otherwise aggregate data for removed crawls is maintained
- Dependencies of each deduped crawl (the WACZ files containing the original data) are recorded in the datapackage.json `related.requires` field
- Initial docs (docs/develop/dedupe.md) and tests (tests/dedupe-basic.test.js) added
- WIP on page-level dedupe (skipping the load of an entire page when its HTML is an exact duplicate)
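For the dependency tracking bullet, a deduped crawl's datapackage.json might carry a fragment along these lines (illustrative shape and filenames only; the actual schema of the `related.requires` entries may differ):

```json
{
  "related": {
    "requires": [
      { "path": "original-crawl.wacz" },
      { "path": "earlier-crawl.wacz" }
    ]
  }
}
```

Recording these dependencies makes it possible to know which source WACZs must remain available for the revisit records in a deduped crawl to resolve.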

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2026-02-12 13:40:49 -08:00
| File | Last commit | Date |
| --- | --- | --- |
| docs | Dedup Initial Implementation (#889) | 2026-02-12 13:40:49 -08:00 |
| gen-cli.sh | Gracefully handle non-absolute path for create-login-profile --filename (#521) | 2024-03-29 13:46:54 -07:00 |
| mkdocs.yml | Dedup Initial Implementation (#889) | 2026-02-12 13:40:49 -08:00 |