mirror of
https://github.com/webrecorder/browsertrix-crawler.git
synced 2026-04-18 07:00:22 +00:00
Fixes #884

- Support for hash-based deduplication via a Redis instance provided with `--redisDedupeUrl` (can be the same as the default Redis)
- Support for writing WARC revisit records for duplicates
- Support for a new indexer mode which imports CDXJ from one or more WACZs (refactored from replay) to populate the dedupe index
- Crawl and aggregate stats updated in the dedupe index, including total URLs, deduped URLs, conserved size (the difference between revisit and response records), and estimated redundant size (aggregate) of duplicates not deduped
- Track removed crawls on index update; support for a `--remove` operation to purge removed crawls, otherwise removed-crawl aggregate data is maintained
- Dependencies of each deduped crawl (the WACZ files containing the original data) are recorded in the datapackage.json `related.requires` field
- Initial docs (develop/dedupe.md) and tests (tests/dedupe-basic.test.js) added
- WIP on page-level dedupe (preempting loading of entire pages) when the HTML is an exact duplicate

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Repository contents:

- docs
- gen-cli.sh
- mkdocs.yml