Commit graph

3 commits

Author SHA1 Message Date
Ilya Kreymer
c3dc62dae5 - track source index for each hash, so entry becomes '<source index> <date> <url>'
- entry for source index can contain the crawl id (or possibly wacz and crawl id)
- also store dependent sources in relation.requires in datapackage.json
- tests: update tests to check for relation.requires
2025-12-03 15:00:08 -08:00
Ilya Kreymer
0d3d774fe8 dedup indexing: strip hash prefix from digest, as cdx does not have it
tests: add index import + dedup crawl to ensure digests match fully
2025-12-03 15:00:08 -08:00
Ilya Kreymer
2cd3fc0157 tests: add dedup-basic.test for simple dedup, ensure number of revisit records === number of response records
2025-12-03 15:00:08 -08:00
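
The commit messages above describe two small data-shaping steps: stripping the hash prefix from a digest so it matches the CDX value, and storing each hash entry as '<source index> <date> <url>'. A minimal sketch of how these might look follows. The function names (`strip_hash_prefix`, `format_entry`, `parse_entry`), the `sha256:`-style prefix, and the compact timestamp in the usage example are assumptions for illustration and are not taken from the commits themselves.

```python
def strip_hash_prefix(digest: str) -> str:
    """Drop an algorithm prefix such as 'sha256:' if one is present.

    Assumption: the prefix is separated from the hash by a single colon.
    """
    _, sep, rest = digest.partition(":")
    return rest if sep else digest


def format_entry(source_index: int, date: str, url: str) -> str:
    """Build an entry in the '<source index> <date> <url>' layout."""
    return f"{source_index} {date} {url}"


def parse_entry(entry: str) -> tuple[int, str, str]:
    """Split an entry back into its three fields.

    Assumption: the date contains no spaces; splitting at most twice
    keeps a URL that contains spaces in one piece.
    """
    source_index, date, url = entry.split(" ", 2)
    return int(source_index), date, url


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the round trip.
    entry = format_entry(0, "20251203150008", "https://example.com/")
    print(strip_hash_prefix("sha256:abc123"))
    print(parse_entry(entry))
```

Keeping the source index as a small integer in each entry, rather than repeating a full crawl id, means the crawl id (or WACZ and crawl id) only needs to be stored once per source, which is consistent with the first commit's description.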