Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2026-04-18 15:10:21 +00:00

Commit graph

Author

SHA1

Message

Date

Ilya Kreymer

cee501a20a

add reference to external WACZ per revisit record (#1009)

- store in `WARC-Refers-To-Container` with file://<WACZ filename> as per
discussions in iipc/warc-specifications#111
- wabac.js 2.26.0 will use this header for prioritizing the specified
WACZ for looking up the original.
- also clears the per-WACZ dependency key `...:duperef` after current
WACZ is finished, so future WACZ files don't use stale dependencies
- fixes #1008
- version: bump to 1.12.4

2026-03-31 17:39:06 -07:00

Ilya Kreymer

4aa883ec1a

track crawlIds included in each --collection directory (#1005)

- track crawlIds that are included in each crawl via crawls/ids.txt
list, one crawlId per line
- when generating wacz, don't reference crawls that are part of current
wacz in relation.requires list
- tests: add test to ensure included crawls are not referenced as external
- fixes issue in #1004

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

2026-03-30 10:21:15 -07:00

Ilya Kreymer

154151913a

Dedup Initial Implementation (#889)

Fixes #884

- Support for hash-based deduplication via a Redis provided with
--redisDedupeUrl (can be same as default redis)
- Support for writing WARC revisit records for duplicates
- Support for new indexer mode which imports CDXJ from one or more WACZs
(refactored from replay) to populate the dedup index
- Crawl and aggregate stats updated in dedupe index, including total
urls, deduped URLs, conserved size (difference between revisit and
response records), and estimated redundant size (aggregate) of
duplicates not deduped.
- Track removed crawls on index update, support for --remove operation
to purge removed crawls, otherwise removed crawl aggregate data is
maintained.
- Dependencies of each deduped crawl (WACZ files containing original data) are recorded in datapackage.json related.requires field.
- Initial docs (develop/dedupe.md) and tests (tests/dedupe-basic.test.js) added.
- WIP on page-level dedupe (preempt loading entire pages) if HTML is a
dupe/matches exactly.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

2026-02-12 13:40:49 -08:00

3 commits