browsertrix-crawler/tests
Ilya Kreymer 3433a4a440
Page Level Dedupe support: (#1018)
- add --dedupePagesMinDepth to enable page-level dedupe at a given depth
or greater
- add 'duplicate' as another skip reason, log skip reason when page is
skipped due to dedupe
- when pageDedupe is enabled, set pageLimit to 0 and allow queueing
pages beyond expected limit, in case pages are skipped
- add queuePageLimit and check the limit on each new page at queue pop
time; this allows skipping already deduped pages and incrementally
crawling new pages
- when limit reached, queued pages are drained and marked as excluded /
logged to skippedPages list
- tests: test page dedupe / incremental crawling: new pages are archived
on subsequent crawls, previous pages skipped with 'duplicate' reason
- docs: add Page Deduplication on dedupe page
- docs: add Reports page (reports.md), document skipped pages /
--reportSkipped report
- fixes #1017

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2026-04-30 20:14:42 +02:00
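
The queue behavior described in the commit message above can be illustrated with a minimal sketch. It is not the crawler's actual implementation: the function `crawlWithPageDedupe`, the `QueuedPage` shape, and the use of a `Set` of dedupe keys to recognize previously archived pages are all assumptions made for illustration. Only the names `queuePageLimit`, `dedupePagesMinDepth`, and the `duplicate` / `excluded` skip reasons come from the commit message.

```typescript
// Hypothetical sketch; none of these types or functions are browsertrix-crawler APIs.
type SkipReason = "duplicate" | "excluded";

interface QueuedPage {
  url: string;
  depth: number;
  key: string; // assumed dedupe key (how the real crawler identifies a page is not stated)
}

interface SkippedPage {
  url: string;
  reason: SkipReason;
}

interface CrawlResult {
  archived: string[];
  skipped: SkippedPage[];
}

function crawlWithPageDedupe(
  queue: QueuedPage[],
  seenKeys: Set<string>, // keys of pages archived by earlier crawls (assumption)
  queuePageLimit: number, // 0 is treated here as "no limit" (assumption)
  dedupePagesMinDepth: number,
): CrawlResult {
  const archived: string[] = [];
  const skipped: SkippedPage[] = [];
  const pending = [...queue];

  while (pending.length > 0) {
    // The limit is checked at pop time, so skipped duplicates never count against it.
    if (queuePageLimit > 0 && archived.length >= queuePageLimit) {
      // Limit reached: drain whatever is still queued and record it as excluded.
      for (const page of pending) {
        skipped.push({ url: page.url, reason: "excluded" });
      }
      pending.length = 0;
      break;
    }

    const page = pending.shift()!;

    // Page-level dedupe only applies at the configured depth or greater.
    if (page.depth >= dedupePagesMinDepth && seenKeys.has(page.key)) {
      skipped.push({ url: page.url, reason: "duplicate" });
      continue;
    }

    seenKeys.add(page.key);
    archived.push(page.url);
  }

  return { archived, skipped };
}

// Usage: two crawls sharing one set of seen keys, each limited to 2 new pages.
const seen = new Set<string>();
const mk = (url: string): QueuedPage => ({ url, depth: 1, key: url });

const first = crawlWithPageDedupe([mk("/a"), mk("/b"), mk("/c")], seen, 2, 1);
const second = crawlWithPageDedupe(
  [mk("/a"), mk("/b"), mk("/c"), mk("/d")],
  seen,
  2,
  1,
);
console.log(first, second);
```

In this sketch the second crawl skips `/a` and `/b` as duplicates and still archives two new pages, which is the incremental behavior the commit message describes; checking the limit when a page is queued instead would have let the duplicates use up the budget.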
custom-behaviors tests: remove example.com from tests (#885) 2025-09-19 23:21:47 -07:00
fixtures Add downloads dir to cache external dependency within the crawl (#921) 2025-11-26 19:30:27 -08:00
invalid-behaviors tests: remove example.com from tests (#885) 2025-09-19 23:21:47 -07:00
adblockrules.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
add-exclusion.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
basic_crawl.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
blockrules.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
brave-query-redir.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
collection_name.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
config_file.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
config_stdin.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
crawl_overwrite.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
custom-behavior-flow.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
custom-behavior.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
custom_driver.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
custom_selector.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
dedupe-basic.test.ts Page Level Dedupe support: (#1018) 2026-04-30 20:14:42 +02:00
dedupe-page.test.ts Page Level Dedupe support: (#1018) 2026-04-30 20:14:42 +02:00
dryrun.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
exclude-redirected.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
extra_hops_depth.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
file_stats.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
http-auth.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
lang-code.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
limit_reached.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
log_filtering.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
mult_url_crawl_with_favicon.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
multi-instance-crawl.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
non-html-crawl.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
norm-test.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
pageinfo-records.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
profiles.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
proxy.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
qa_compare.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
retry-failed.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
robots_txt.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
rollover-writer.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
saved-state.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
scopes.test.ts Fix allowHashUrls option and scope checking for hash URLs (#1025) 2026-04-28 22:32:12 +02:00
screenshot.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
seeds.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
sitemap-parse.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
skipped_pages.test.ts Add option to write JSONL file with data on skipped pages (#966) 2026-04-09 12:51:41 -07:00
storage.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
text-extract.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
upload-wacz.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
url-normalize.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00
url_file_list.test.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
utils.ts Convert tests from JS to TS (#1003) 2026-04-02 17:05:41 -07:00
warcinfo.test.ts tests: include tests in format and lint operations, reformat existing tests to match style 2026-04-09 12:52:33 -07:00