browsertrix-crawler/docs
Ilya Kreymer 3433a4a440
Page Level Dedupe support: (#1018)
- add --dedupePagesMinDepth to enable page-level dedupe at certain depth
or greater
- add 'duplicate' as another skip reason, log skip reason when page is
skipped due to dedupe
- when pageDedupe is enabled, set pageLimit to 0 and allow queueing
pages beyond expected limit, in case pages are skipped
- add queuePageLimit and check the limit on each new page at queue pop
time, which allows skipping already deduped pages and incrementally
crawling new pages
- when limit reached, queued pages are drained and marked as excluded /
logged to skippedPages list
- tests: test page dedupe / incremental crawling: new pages are archived
on subsequent crawls, previous pages skipped with 'duplicate' reason
- docs: add Page Deduplication on dedupe page
- docs: add Reports page (reports.md), document skipped pages /
--reportSkipped report
- fixes #1017
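
The pop-time limit check and queue draining described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: every name here (QueuedPage, crawlWithPageDedupe, alreadyArchived, dedupeMinDepth) is hypothetical, duplicate detection is reduced to a simple set lookup, and a limit of 0 is assumed to mean "no limit".

```typescript
// Hypothetical sketch: none of these names come from browsertrix-crawler.
type SkipReason = "duplicate" | "excluded";

interface QueuedPage {
  url: string;
  depth: number;
}

interface CrawlResult {
  archived: string[];
  skipped: { url: string; reason: SkipReason }[];
}

function crawlWithPageDedupe(
  queue: QueuedPage[],
  alreadyArchived: Set<string>, // assumption: pages captured by an earlier crawl
  queuePageLimit: number, // assumption: 0 means "no limit"
  dedupeMinDepth: number, // dedupe applies only at this depth or greater
): CrawlResult {
  const result: CrawlResult = { archived: [], skipped: [] };
  const pending = [...queue];

  while (pending.length > 0) {
    // The limit is checked when a page is popped, not when it is queued,
    // so pages skipped as duplicates never count against it.
    if (queuePageLimit > 0 && result.archived.length >= queuePageLimit) {
      // Limit reached: drain whatever is still queued and record it as skipped.
      for (const page of pending) {
        result.skipped.push({ url: page.url, reason: "excluded" });
      }
      break;
    }

    const page = pending.shift()!;

    if (page.depth >= dedupeMinDepth && alreadyArchived.has(page.url)) {
      result.skipped.push({ url: page.url, reason: "duplicate" });
      continue;
    }

    result.archived.push(page.url);
  }

  return result;
}
```

Because the limit is compared against archived pages only, a repeat crawl with the same limit spends its budget on new pages instead of on pages that were already captured.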

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2026-04-30 20:14:42 +02:00
docs Page Level Dedupe support: (#1018) 2026-04-30 20:14:42 +02:00
gen-cli.sh Dedupe docs (#989) 2026-03-10 12:49:30 -07:00
mkdocs.yml Page Level Dedupe support: (#1018) 2026-04-30 20:14:42 +02:00