browsertrix-crawler/docs/mkdocs.yml
Ilya Kreymer 154151913a
Dedup Initial Implementation (#889)
Fixes #884 

- Support for hash-based deduplication via a Redis provided with
--redisDedupeUrl (can be same as default redis)
- Support for writing WARC revisit records for duplicates
- Support for new indexer mode which imports CDXJ from one or more WACZs
(refactored from replay) to populate the dedup index
- Crawl and aggregate stats updated in dedupe index, including total
urls, deduped URLs, conserved size (difference between revisit and
response records), and estimated redundant size (aggregate) of
duplicates not deduped.
- Track removed crawls on index update, support for --remove operation
to purge removed crawls, otherwise removed crawl aggregate data is
maintained.
- Dependencies of each deduped crawl (WACZ files containing original data) are recorded in datapackage.json related.requires field.
- Initial docs (develop/dedupe.md) and tests (tests/dedupe-basic.test.js) added.
- WIP on page-level dedupe (preempt loading entire pages) if HTML is a
dupe/matches exactly.

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2026-02-12 13:40:49 -08:00

site_name: Browsertrix Crawler Docs
repo_url: https://github.com/webrecorder/browsertrix-crawler/
repo_name: Browsertrix Crawler
edit_uri: edit/main/docs/docs/
extra_css:
  - stylesheets/extra.css
theme:
  name: material
  custom_dir: docs/overrides
  features:
    - navigation.sections
    - navigation.tabs
    - navigation.tabs.sticky
    - navigation.instant
    - navigation.tracking
    - navigation.indexes
    - navigation.footer
    - content.code.copy
    - content.action.edit
    - content.tooltips
    - search.suggest
  palette:
    scheme: webrecorder
  logo: assets/brand/browsertrix-crawler-white.svg
  favicon: assets/brand/browsertrix-crawler-icon-color-dynamic.svg
  icon:
    admonition:
      note: bootstrap/pencil-fill
      abstract: bootstrap/file-earmark-text-fill
      info: bootstrap/info-circle-fill
      tip: bootstrap/exclamation-circle-fill
      success: bootstrap/check-circle-fill
      question: bootstrap/question-circle-fill
      warning: bootstrap/exclamation-triangle-fill
      failure: bootstrap/x-octagon-fill
      danger: bootstrap/exclamation-diamond-fill
      bug: bootstrap/bug-fill
      example: bootstrap/mortarboard-fill
      quote: bootstrap/quote
    repo: bootstrap/github
    edit: bootstrap/pencil
    view: bootstrap/eye
nav:
  - index.md
  - Develop:
      - develop/index.md
      - develop/docs.md
      - develop/dedupe.md
  - User Guide:
      - user-guide/index.md
      - user-guide/outputs.md
      - user-guide/exit-codes.md
      - user-guide/common-options.md
      - user-guide/crawl-scope.md
      - user-guide/yaml-config.md
      - user-guide/browser-profiles.md
      - user-guide/proxies.md
      - user-guide/behaviors.md
      - user-guide/qa.md
      - user-guide/cli-options.md
markdown_extensions:
  - toc:
      toc_depth: 4
      permalink: true
  - pymdownx.highlight:
      anchor_linenums: true
  - pymdownx.emoji:
      emoji_index: !!python/name:material.extensions.emoji.twemoji
      emoji_generator: !!python/name:material.extensions.emoji.to_svg
      options:
        custom_icons:
          - docs/overrides/.icons
  - admonition
  - pymdownx.inlinehilite
  - pymdownx.details
  - pymdownx.superfences
  - pymdownx.keys
  - def_list
  - attr_list
extra:
  generator: false
  social:
    - icon: bootstrap/globe
      link: https://webrecorder.net
    - icon: bootstrap/chat-left-text-fill
      link: https://forum.webrecorder.net/
    - icon: bootstrap/mastodon
      link: https://digipres.club/@webrecorder
    - icon: bootstrap/youtube
      link: https://www.youtube.com/@webrecorder
copyright: "Creative Commons Attribution 4.0 International (CC BY 4.0)"
plugins:
  - search