Commit graph

10 commits

Author SHA1 Message Date
Ilya Kreymer
fd5308cf4a fix stats computation by always using done / total stats from redis 2026-01-30 13:32:36 -08:00
Ilya Kreymer
fdd9958e12 - add 'estimatedRedundantSize' calculation for estimated wasted space
- rename to 'conservedSize' calculation for estimated conserved space
- add 'dupeUrls' to track duplicate URLs added on each crawl explicitly
- incrStat and type checking for stat types
- add removedCrawls and removedCrawlSize to track removed crawls
- clean up stats, add incrStat() to better keep track of stats in one place
2026-01-30 13:32:36 -08:00
Ilya Kreymer
bebaf38e0b error handling:
- skip invalid wacz files provided for import
- skip invalid multi-wacz json files provided for import
- tests: add invalid multi-wacz file for testing
2026-01-30 13:32:36 -08:00
Ilya Kreymer
09388ff9dc tests: add test for import from json 2026-01-30 13:32:36 -08:00
Ilya Kreymer
b96c409729 include size in hash key data
add hash dupe when WARC record actually written
store savedSize as diff between original and revisit WARC records
indexer: compute savedSize by tracking and subtracting revisit records to be added, if revisit added before original
2026-01-30 13:32:36 -08:00
Ilya Kreymer
27a19bb64f fix size count typo, unique == not dupe! 2026-01-30 13:32:36 -08:00
Ilya Kreymer
8311b61fa1 uniq -> unique
add 'removable' count for number of crawls that can be removed from the index
2026-01-30 13:32:36 -08:00
Ilya Kreymer
55fbe43b22 stats:
- compute totalUrls, totalSize, uniqSize (uniqUrls = number of hashes) in per crawl key
- add stats on crawl commit, remove on crawl remove
- tests: update tests to check stats
2026-01-30 13:32:36 -08:00
Ilya Kreymer
d31530a753 cleanup pass:
- support dedupe without requiring wacz, no crawl dependency tracking stored
- add dedupe test w/o wacz
- cleanup dedupe related naming
2026-01-30 13:32:36 -08:00
Ilya Kreymer
4104ba8361 rename 'dedup' -> 'dedupe' for consistency 2026-01-30 13:32:36 -08:00
Renamed from tests/dedup-basic.test.js