mirror of https://github.com/webrecorder/browsertrix-crawler.git (synced 2026-06-18 11:51:42 +00:00)
- add --dedupePagesMinDepth to enable page-level dedupe at certain depth or greater
- add 'duplicate' as another skip reason, log skip reason when page is skipped due to dedupe
- when pageDedupe is enabled, set pageLimit to 0 and allow queueing pages beyond expected limit, in case pages are skipped
- add queuePageLimit and check limit on each new page at queue pop time, allows skipping already deduped pages and incrementally crawling new pages
- when limit reached, queued pages are drained and marked as excluded / logged to skippedPages list
- tests: test page dedupe / incremental crawling: new pages are archived on subsequent crawls, previous pages skipped with 'duplicate' reason
- docs: add Page Deduplication on dedupe page
- docs: add Reports page (reports.md), document skipped pages / --reportSkipped report
- fixes #1017

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

| | | |
|---|---|---|
| .. | ||
| docs | ||
| gen-cli.sh | ||
| mkdocs.yml | ||