Commit graph

3 commits

Author SHA1 Message Date
Ilya Kreymer
298b901558 - track source index for each hash, so entry becomes '<source index> <date> <url>'
- entry for source index can contain the crawl id (or possibly wacz and crawl id)
- also store dependent sources in relation.requires in datapackage.json
- tests: update tests to check for relation.requires
2025-11-27 22:29:37 -08:00
Ilya Kreymer
ca02f09b5d dedup indexing: strip hash prefix from digest, as cdx does not have it
tests: add index import + dedup crawl to ensure digests match fully
2025-11-27 22:28:43 -08:00
Ilya Kreymer
0cadf371d0 tests: add dedup-basic.test for simple dedup, ensure number of revisit records === number of response records
2025-11-27 22:28:13 -08:00
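
The digest normalization (ca02f09b5d) and the '<source index> <date> <url>' entry format (298b901558) described in the commits above could be sketched roughly as follows. This is an illustrative sketch only, not the project's actual code: the function names, the 'sha256:' prefix, and the timestamp format are assumptions made for the example.

```python
# Hypothetical helpers illustrating the commit messages above.
# None of these names come from the repository itself.

def strip_hash_prefix(digest: str) -> str:
    """Drop a leading '<algorithm>:' prefix (e.g. 'sha256:') if present,
    so the value can be compared with a CDX digest that has no prefix."""
    _, sep, rest = digest.partition(":")
    return rest if sep else digest


def format_entry(source_index: int, date: str, url: str) -> str:
    """Build the per-hash entry as '<source index> <date> <url>'."""
    return f"{source_index} {date} {url}"


def parse_entry(entry: str) -> tuple[int, str, str]:
    """Split an entry back into its parts. Splitting at most twice keeps
    the URL intact even if it contains spaces (assumes the date has none)."""
    source_index, date, url = entry.split(" ", 2)
    return int(source_index), date, url


if __name__ == "__main__":
    # Assumed example values, not taken from a real crawl.
    entry = format_entry(0, "20251127222937", "https://example.com/")
    print(strip_hash_prefix("sha256:abc123"))
    print(parse_entry(entry))
```

Storing a small integer index per hash, rather than repeating the crawl id in every entry, keeps entries short; the index can then be resolved to a crawl id in a separate lookup.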