Page Level Dedupe support: (#1018)

- add --dedupePagesMinDepth to enable page-level dedupe at certain depth
or greater
- add 'duplicate' as another skip reason, log skip reason when page is
skipped due to dedupe
- when pageDedupe is enabled, set queuePageLimit to 0 and allow queueing
pages beyond expected limit, in case pages are skipped
- add queuePageLimit and check limit on each new page at queue pop time,
allows skipping already deduped pages and incrementally crawling new
pages
- when limit reached, queued pages are drained and marked as excluded /
logged to skippedPages list
- tests: test page dedupe / incremental crawling: new pages are archived
on subsequent crawls, previous pages skipped with 'duplicate' reason
- docs: add Page Deduplication on dedupe page
- docs: add Reports page (reports.md), document skipped pages /
--reportSkipped report
- fixes #1017

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Ilya Kreymer 2026-04-30 20:14:42 +02:00 committed by GitHub
parent 4ae629e8a0
commit 3433a4a440
13 changed files with 309 additions and 52 deletions

@ -194,6 +194,10 @@ Options:
--redisDedupeUrl If set, url for remote redis server
to store state. Otherwise, using loc
al redis instance [string]
--dedupePagesMinDepth If set >= 0, minimum depth at which
duplicate pages can be skipped. -1 m
eans never skip duplicate pages
[number] [default: -1]
--saveState If the crawl state should be seriali
zed to the crawls/ directory. Defaul
ts to 'partial', only saved when cra

@ -115,6 +115,28 @@ This operational is not required - the data from the interrupted crawl will not
simply free up the data on the Redis.
## Page Deduplication
The crawler also has the option to skip loading of entire pages if the page HTML is a duplicate.
In such cases, after the `revisit` record for the HTML page is written, the crawler aborts loading the page in the browser and moves on to the next page.
This saves not only storage but also crawling time, as duplicate pages are quickly skipped by the browser.
Note that there is a trade-off involved, as any page resources (images, stylesheets) or links are also skipped and are not crawled, even if they may have changed.
To enable this feature, set `--dedupePagesMinDepth` to a value of 0 or greater.
Setting to a value of 0 means that even the seed page will be skipped if it has not changed, and no additional pages will be crawled.
It is generally recommended to set `--dedupePagesMinDepth` to a value of at least 1 when using this feature. Setting to a value of 1 will ensure that the seed page is always fully crawled and that its links will be added to the crawl queue. Pages one link away from the seed and at greater depths will be loaded and then skipped if they are unchanged.
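The depth rule described above can be sketched as a small predicate. This is an illustration only, not the crawler's implementation: `shouldSkipDuplicate`, its parameters, and the in-memory `seenHashes` set are hypothetical names, with the set standing in for the crawler's lookup of previously archived page hashes.

```typescript
// Hypothetical sketch of the page-dedupe depth rule; not the crawler's actual code.
// `seenHashes` stands in for the index of HTML hashes archived by earlier crawls.
function shouldSkipDuplicate(
  depth: number,
  dedupePagesMinDepth: number,
  pageHash: string,
  seenHashes: Set<string>,
): boolean {
  // a negative value (the default is -1) disables page-level dedupe
  if (dedupePagesMinDepth < 0) {
    return false;
  }
  // pages shallower than the minimum depth are always fully loaded
  if (depth < dedupePagesMinDepth) {
    return false;
  }
  // at or beyond the minimum depth, skip only if the HTML hash was seen before
  return seenHashes.has(pageHash);
}

const seen = new Set<string>(["sha256:aaa"]);
// With a minimum depth of 1, the seed (depth 0) is never skipped.
console.log(shouldSkipDuplicate(0, 1, "sha256:aaa", seen)); // false
// An unchanged page one link away from the seed is skipped.
console.log(shouldSkipDuplicate(1, 1, "sha256:aaa", seen)); // true
```

With a minimum depth of 0 the same predicate would also return `true` for an unchanged seed, which is why a value of at least 1 is recommended.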
### Incremental Crawling
When duplicate pages are skipped, they do not count towards the page limit, so additional pages can be crawled up to the page limit. This makes it possible to crawl an unchanged site incrementally, a few pages at a time.
For example, if a site has 100 pages (a home page and 99 other static pages that don't change), it can be fully crawled in 11 crawls with the settings `--dedupePagesMinDepth 1 --pageLimit 10`, which will crawl the home page and up to 9 additional new pages each time the crawler is run.
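The arithmetic behind this example can be checked with a simplified model. The sketch below is hypothetical (`crawlsNeeded` is not part of the crawler) and assumes that the home page is fully recrawled on every run, that no page content changes between runs, and that all other pages are linked directly from the home page.

```typescript
// Hypothetical model of incremental crawling; not the crawler's actual code.
// Assumes the home page is recrawled on every run and uses one slot of the limit,
// and that every other page is linked from the home page and never changes.
function crawlsNeeded(totalPages: number, pageLimit: number): number {
  if (pageLimit < 2) {
    throw new Error("pageLimit must leave room for at least one new page per run");
  }
  let remaining = totalPages - 1; // pages other than the home page
  let crawls = 0;
  do {
    // each run archives the home page plus up to (pageLimit - 1) new pages;
    // pages archived in earlier runs are skipped as duplicates and not counted
    remaining -= Math.min(pageLimit - 1, remaining);
    crawls++;
  } while (remaining > 0);
  return crawls;
}

console.log(crawlsNeeded(100, 10)); // 11
```

Each run covers 9 new pages, so the 99 non-seed pages take 99 / 9 = 11 runs.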
## Deduplication outputs
When deduplication is enabled, the crawler output is changed in the following ways (in addition to using
@ -135,5 +157,9 @@ file(s) contain the WARCs with `response` records for each `revisit` record that
This allows for tracking which WACZ files are required by other WACZ files to get the full archived content.
See the [WACZ dependency section of the developer documentation](../develop/dedupe.md#crawl-dependency-tracking-in-wacz-datapackagejson) for more details on the architecture of this dependency system.
### Reporting
If both page deduplication and [skipped page reporting](./reports.md) are enabled, pages that are skipped due to page deduplication are logged with the `duplicate` reason.
## Deduplication system architecture
See the [developer docs for dedupe](../develop/dedupe.md) for more advanced information on the architecture of the dedupe system.

@ -9,9 +9,14 @@ Browsertrix Crawler crawl outputs are organized into collections, which can be f
Each collection is a directory which contains at minimum:
- `archive/`: A directory containing gzipped [WARC](https://www.iso.org/standard/68004.html) files containing the web traffic recorded during crawling.
- `downloads/`: A directory containing any resources the crawler downloaded to run the crawl, such as browser profiles or behaviors.
- `logs/`: A directory containing one or more crawler log files in [JSON-Lines](https://jsonlines.org/) format.
- `pages/`: A directory containing one or more "Page" files in [JSON-Lines](https://jsonlines.org/) format. At minimum, this directory will contain a `pages.jsonl` file with information about the seed URLs provided to the crawler. If additional pages were discovered and in scope during crawling, information about those non-seed pages is written to `extraPages.jsonl`. For more information about the contents of Page files, see the [WACZ specification](https://specs.webrecorder.net/wacz/1.1.1/#pages-jsonl).
- `warc-cdx/`: A directory containing one or more [CDXJ](https://specs.webrecorder.net/cdxj/0.1.0/) index files created while recording traffic to WARC files. These index files are
- `profile/`: A directory containing the current browser profile, either new or loaded from a saved profile.
- `crawlIds/`: A directory containing a single `crawlIds.txt` file which lists the identifiers of all crawls included in this collection.
- `warc-cdx/`: A directory containing one or more [CDXJ](https://specs.webrecorder.net/cdxj/0.1.0/) index files created while recording traffic to WARC files. These index files contain entries for every WARC record written and are updated
at the same time as the WARCs.
Additionally, the collection may include:
@ -19,6 +24,8 @@ Additionally, the collection may include:
- An `indexes/` directory containing merged [CDXJ](https://specs.webrecorder.net/cdxj/0.1.0/) index files for the crawl, if the `--generateCDX` or `--generateWACZ` arguments are provided. If the combined size of the CDXJ files in the `warc-cdx/` directory is over 50 KB, the resulting final CDXJ file will be gzipped.
- A single combined gzipped [WARC](https://www.iso.org/standard/68004.html) file for the crawl, if the `--combineWARC` argument is provided.
- A `crawls/` directory including YAML files describing the crawl state, if the `--saveState` argument is provided with a value of "always", or if the crawl is interrupted and `--saveState` is not set to "never". These files can be used to restart a crawl from its saved state.
- A `reports/` directory containing various reports generated during the crawl. See [Reports](./reports.md) for more info.
## Profile Outputs

@ -0,0 +1,27 @@
# Reports
Browsertrix has the option to generate optional reports with each crawl. All reports are in the JSONL format, with one JSON entry per line. The following reports are currently available.
## Skipped Pages Report
Written to `reports/skippedPages.jsonl` and enabled with `--reportSkipped`, this report is in the same format as the `pages/pages.jsonl` file, but lists pages that were either never loaded or where page loading was immediately aborted and no content was archived from that page.
Each line in the report contains the following:
- `url`: Page URL
- `ts`: The ISO date and time at which the page was encountered
- `seedUrl`: The seed URL that this page was discovered from
- `depth`: The depth of the page if it were to be crawled
- `seed`: `true` if the page is a seed, `false` otherwise
- `reason`: Reason for skipping this page
### Skip Reasons
The `reason` may be one of the following:
- `outOfScope`: Page URL out of scope according to scoping rules
- `pageLimit`: The limit `--pageLimit` was reached before the page could be crawled
- `robotsTxt`: The page URL has been excluded via robots.txt rules
- `redirectToExcluded`: A special case of `outOfScope` where the page URL itself is in scope but loading it resulted in an HTTP redirect to a page that was not in scope, so page loading was aborted
- `duplicate`: The page content is a duplicate and loading was aborted (see [Page Deduplication](./dedupe.md#page-deduplication) for more information)
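As a rough illustration of working with this report, the sketch below tallies entries by `reason`. The helper name `countSkipReasons`, the `SkippedPageEntry` interface, and the sample entries are hypothetical and based only on the fields listed above. Lines without a `reason` are ignored, on the assumption that the file may begin with a header line as `pages.jsonl` does.

```typescript
// Hypothetical helper for summarizing a skippedPages.jsonl report by skip reason.
// The interface mirrors the fields documented above; it is not exported by the crawler.
interface SkippedPageEntry {
  url: string;
  ts: string;
  seedUrl: string;
  depth: number;
  seed: boolean;
  reason: string;
}

function countSkipReasons(jsonl: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) {
      continue; // ignore blank lines
    }
    const entry = JSON.parse(line) as Partial<SkippedPageEntry>;
    if (!entry.reason) {
      continue; // ignore any line that is not a skipped-page entry
    }
    counts.set(entry.reason, (counts.get(entry.reason) ?? 0) + 1);
  }
  return counts;
}

// Made-up sample data for illustration only.
const sample = [
  JSON.stringify({ url: "https://example.com/a", depth: 1, seed: false, reason: "duplicate" }),
  JSON.stringify({ url: "https://example.com/b", depth: 1, seed: false, reason: "duplicate" }),
  JSON.stringify({ url: "https://example.com/c", depth: 2, seed: false, reason: "pageLimit" }),
].join("\n");

console.log(countSkipReasons(sample).get("duplicate")); // 2
```

In practice the input string would be read from `reports/skippedPages.jsonl` in the collection directory.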

@ -57,6 +57,7 @@ nav:
- user-guide/proxies.md
- user-guide/behaviors.md
- user-guide/qa.md
- user-guide/reports.md
- user-guide/cli-options.md
- Develop:
- develop/index.md

@ -128,6 +128,7 @@ export class Crawler {
limitHit = false;
pageLimit: number;
queuePageLimit: number;
saveStateFiles: string[] = [];
lastSaveTime: number;
@ -260,6 +261,13 @@ export class Crawler {
: this.params.maxPageLimit;
}
// if using page dedupe, set queuePageLimit to 0
// so queue can be unlimited, as unknown how many pages may be skipped
// and additional pages needed from the queue
// otherwise, queuePageLimit == pageLimit
this.queuePageLimit =
this.params.dedupePagesMinDepth >= 0 ? 0 : this.pageLimit;
this.saveStateFiles = [];
this.lastSaveTime = 0;
@ -613,7 +621,7 @@ export class Crawler {
await this.loadCrawlState();
await this.crawlState.trimToLimit(this.pageLimit);
await this.crawlState.trimToLimit(this.queuePageLimit);
}
extraChromeArgs() {
@ -1385,12 +1393,9 @@ self.__bx_behaviors.selectMainBehavior();
if (pageSkipped) {
await this.crawlState.markExcluded(url);
this.writeSkippedPage(
url,
data.seedId,
depth,
SkippedReason.RedirectToExcluded,
);
if (data.pageSkipReason) {
this.writeSkippedPage(url, data.seedId, depth, data.pageSkipReason);
}
this.limitHit = false;
} else {
const retry = await this.crawlState.markFailed(url, noRetries);
@ -2746,7 +2751,7 @@ self.__bx_behaviors.selectMainBehavior();
const result = await this.crawlState.addToQueue(
{ url, seedId, depth, extraHops, ts, pageid },
this.pageLimit,
this.queuePageLimit,
);
switch (result) {
@ -2978,7 +2983,7 @@ self.__bx_behaviors.selectMainBehavior();
headers,
fromDate,
toDate,
limit: this.pageLimit,
limit: this.queuePageLimit,
});
let power = 1;

@ -457,12 +457,12 @@ class ArgParser {
type: "string",
},
// minPageDedupeDepth: {
// describe:
// "If set >= 0, minimum depth at which duplicate pages can be skipped. -1 means never skip duplicate pages",
// type: "number",
// default: -1,
// },
dedupePagesMinDepth: {
describe:
"If set >= 0, minimum depth at which duplicate pages can be skipped. -1 means never skip duplicate pages",
type: "number",
default: -1,
},
saveState: {
describe:

@ -123,4 +123,5 @@ export enum SkippedReason {
PageLimit = "pageLimit",
RobotsTxt = "robotsTxt",
RedirectToExcluded = "redirectToExcluded",
Duplicate = "duplicate",
}

@ -27,8 +27,13 @@ import { Crawler } from "../crawler.js";
import { getProxyDispatcher } from "./proxy.js";
import { ScopedSeed } from "./seeds.js";
import EventEmitter from "events";
import { DEFAULT_MAX_RETRIES, WARC_REFERS_TO_CONTAINER } from "./constants.js";
import {
DEFAULT_MAX_RETRIES,
SkippedReason,
WARC_REFERS_TO_CONTAINER,
} from "./constants.js";
import { Readable } from "stream";
import { createHash } from "crypto";
const MAX_BROWSER_DEFAULT_FETCH_SIZE = 5_000_000;
const MAX_TEXT_REWRITE_SIZE = 25_000_000;
@ -137,6 +142,7 @@ export class Recorder extends EventEmitter {
mainFrameId: string | null = null;
skipRangeUrls!: Map<string, number>;
skipPageInfo = false;
state: PageState | null = null;
swTargetId?: string | null;
swFrameIds = new Set<string>();
@ -160,7 +166,7 @@ export class Recorder extends EventEmitter {
pageSeed?: ScopedSeed;
pageSeedDepth = 0;
//minPageDedupeDepth = -1;
dedupePagesMinDepth = -1;
frameIdToExecId: Map<string, number> | null;
@ -184,7 +190,7 @@ export class Recorder extends EventEmitter {
this.shouldSaveStorage = !!crawler.params.saveStorage;
//this.minPageDedupeDepth = crawler.params.minPageDedupeDepth;
this.dedupePagesMinDepth = crawler.params.dedupePagesMinDepth;
this.writer = writer;
@ -863,32 +869,33 @@ export class Recorder extends EventEmitter {
const rewritten = await this.rewriteResponse(reqresp, mimeType);
// ** WIP: Experimental page-level dedupe **
// Will abort page loading in case of duplicate
// TODO: Write revisit record, track page as a duplicate in page list
// if (
// url === this.pageUrl &&
// reqresp.payload &&
// this.minPageDedupeDepth >= 0 &&
// this.pageSeedDepth >= this.minPageDedupeDepth
// ) {
// const hash =
// "sha256:" + createHash("sha256").update(reqresp.payload).digest("hex");
// const res = await this.crawlState.getHashDupe(hash);
// if (res) {
// const { index, crawlId } = res;
// const errorReason = "BlockedByResponse";
// await cdp.send("Fetch.failRequest", {
// requestId,
// errorReason,
// });
// await this.crawlState.addDupeCrawlDependency(crawlId, index);
// // await this.crawlState.addConservedSizeStat(
// // size - reqresp.payload.length,
// // );
// return true;
// }
// }
// If page is at dedupePagesMinDepth or higher and HTML is a duplicate,
// write a revisit record, track the page as a duplicate, and abort the page load
if (
url === this.pageUrl &&
reqresp.payload &&
this.dedupePagesMinDepth >= 0 &&
this.pageSeedDepth >= this.dedupePagesMinDepth
) {
const hash =
"sha256:" + createHash("sha256").update(reqresp.payload).digest("hex");
const res = await this.crawlState.getHashDupe(hash);
if (res) {
await cdp.send("Fetch.failRequest", {
requestId,
errorReason: "BlockedByResponse",
});
await this.serializeToWARC(reqresp, undefined, false, true, hash);
this.skipPageInfo = true;
this.state!.pageSkipReason = SkippedReason.Duplicate;
logger.debug(
"Skipped loading duplicate page",
{ pageUrl: url, depth: this.pageSeedDepth, ...this.logDetails },
"pageStatus",
);
return true;
}
}
// not rewritten, and not streaming, return false to continue
if (!rewritten && !streamingConsume) {
@ -978,6 +985,7 @@ export class Recorder extends EventEmitter {
loc = new URL(loc, url).href;
if (this.pageSeed && this.pageSeed.isExcluded(loc)) {
this.state!.pageSkipReason = SkippedReason.RedirectToExcluded;
logger.warn(
"Skipping page that redirects to excluded URL",
{ newUrl: loc, origUrl: this.pageUrl },
@ -994,7 +1002,15 @@ export class Recorder extends EventEmitter {
}
}
startPage({ pageid, url }: { pageid: string; url: string }) {
startPage({
pageid,
url,
state,
}: {
pageid: string;
url: string;
state: PageState;
}) {
this.pageid = pageid;
this.pageUrl = url;
this.finalPageUrl = this.pageUrl;
@ -1012,6 +1028,7 @@ export class Recorder extends EventEmitter {
this.skipRangeUrls = new Map<string, number>();
this.skipPageInfo = false;
this.pageFinished = false;
this.state = state;
this.pageInfo = {
pageid,
urls: {},
@ -1653,10 +1670,14 @@ export class Recorder extends EventEmitter {
reqresp: RequestResponseInfo,
iter?: AsyncIterable<Uint8Array>,
canRetry = false,
skipPageInfo = false,
matchHash?: string,
): Promise<SerializeRes> {
// always include in pageinfo record if going to serialize to WARC
// even if serialization does not happen, indicates this URL was on the page
this.addPageRecord(reqresp);
if (!skipPageInfo) {
this.addPageRecord(reqresp);
}
const { pageid, gzip } = this;
const { url, status, requestId, method, payload } = reqresp;
@ -1758,6 +1779,13 @@ export class Recorder extends EventEmitter {
if (!isEmpty && url) {
const res = await this.crawlState.getHashDupe(hash);
// if only writing revisit, ensure it's a revisit and hash
// matches expected value, otherwise just return
// should always match, but just in case!
if (matchHash && (matchHash !== hash || !res)) {
return SerializeRes.Skipped;
}
if (res) {
const { origUrl, origDate, crawlId, index, size } = res;
origRecSize = size;
@ -1828,8 +1856,6 @@ export class Recorder extends EventEmitter {
addStatsCallback,
);
this.addPageRecord(reqresp);
return SerializeRes.Success;
}
}

@ -12,6 +12,7 @@ import {
DUPE_ALL_COUNTS,
DUPE_UNCOMMITTED,
CrawlStatus,
SkippedReason,
} from "./constants.js";
import { ScopedSeed } from "./seeds.js";
import { Frame } from "puppeteer-core";
@ -98,6 +99,7 @@ export class PageState {
skipBehaviors = false;
pageSkipped = false;
pageSkipReason: SkippedReason | null = null;
noRetries = false;
isDirectFetched = false;

@ -7,7 +7,7 @@ import { rxEscape } from "./seeds.js";
import { CDPSession, Page } from "puppeteer-core";
import { PageState, WorkerId } from "./state.js";
import { Crawler } from "../crawler.js";
import { PAGE_OP_TIMEOUT_SECS } from "./constants.js";
import { PAGE_OP_TIMEOUT_SECS, SkippedReason } from "./constants.js";
const MAX_REUSE = 5;
@ -275,7 +275,7 @@ export class PageWorker {
this.logDetails = { page: url, workerid };
if (this.recorder) {
this.recorder.startPage({ pageid, url });
this.recorder.startPage({ pageid, url, state: data });
}
try {
@ -360,8 +360,28 @@ export class PageWorker {
const data = await crawlState.nextFromQueue();
const limit = this.crawler.pageLimit;
const isDedupePages = this.crawler.params.dedupePagesMinDepth >= 0;
// see if any work data in the queue
if (data) {
if (
isDedupePages &&
limit > 0 &&
(await crawlState.numDone()) >= limit
) {
const { url, seedId, depth } = data;
logger.info("Skipping queued page, at limit", { url, limit });
await crawlState.markExcluded(url);
this.crawler.writeSkippedPage(
url,
seedId,
depth,
SkippedReason.PageLimit,
);
continue;
}
// filter out any out-of-scope pages right away
if (!(await this.crawler.isInScope(data, this.logDetails))) {
logger.info("Page no longer in scope", data);

@ -5,7 +5,7 @@ import Redis from "ioredis";
import { WARCParser } from "warcio";
import { sleep } from "./utils";
let redisId: NonSharedBuffer;
let redisId: Uint8Array;
let numResponses = 0;
let sizeSaved = 0;

tests/dedupe-page.test.ts

@ -0,0 +1,138 @@
import { exec, execSync } from "child_process";
import fs from "fs";
import { sleep } from "./utils";
let redisId: Uint8Array;
beforeAll(() => {
execSync("docker network create dedupe-pages");
redisId = execSync(
"docker run --rm --network=dedupe-pages -p 37379:6379 --name dedupe-redis -d redis",
);
});
afterAll(async () => {
execSync(`docker kill ${redisId}`);
await sleep(3000);
execSync("docker network rm dedupe-pages");
});
function runCrawl(name: string, { db = 0, limit = 3, wacz = true } = {}) {
fs.rmSync(`./test-crawls/collections/${name}`, {
recursive: true,
force: true,
});
const crawler = exec(
`docker run -v $PWD/test-crawls:/crawls --network=dedupe-pages -e CRAWL_ID=${name} webrecorder/browsertrix-crawler crawl --url https://old.webrecorder.net/ --limit ${limit} --exclude community --collection ${name} --reportSkipped --dedupePagesMinDepth 1 --redisDedupeUrl redis://dedupe-redis:6379/${db} ${
wacz ? "--generateWACZ" : ""
}`,
);
return new Promise((resolve) => {
crawler.on("exit", (code) => {
resolve(code);
});
});
}
function loadPages(
collName: string,
path = "pages/extraPages.jsonl",
filter?: (x: Record<string, string>) => boolean,
) {
const extraPages = fs.readFileSync(
`test-crawls/collections/${collName}/${path}`,
"utf8",
);
const pageUrls = [];
for (const page of extraPages.trim().split("\n")) {
const parsed = JSON.parse(page);
if (parsed.url && (!filter || filter(parsed))) {
pageUrls.push(parsed.url);
}
}
return pageUrls;
}
let firstPageUrls: string[] = [];
let secondPageUrls: string[] = [];
test("first crawl, initial 3 pages", async () => {
const collName = "dedupe-pages-1";
expect(await runCrawl(collName)).toBe(0);
firstPageUrls = loadPages(collName);
// initial seed page is in pages.jsonl + 2 more pages
expect(firstPageUrls).toHaveLength(2);
});
test("second crawl, next 3 pages", async () => {
const collName = "dedupe-pages-2";
expect(await runCrawl(collName)).toBe(0);
secondPageUrls = loadPages(collName);
// initial seed page is in pages.jsonl + 2 more pages
expect(secondPageUrls).toHaveLength(2);
// ensure pages are not in first set, totally new pages
for (const pageUrl of firstPageUrls) {
expect(secondPageUrls).not.toContain(pageUrl);
}
const skippedPages = loadPages(
collName,
"reports/skippedPages.jsonl",
(entry) => entry.reason === "duplicate",
);
// ensure first pages are in the skipped pages list
for (const pageUrl of firstPageUrls) {
expect(skippedPages).toContain(pageUrl);
}
});
test("third crawl, next 3 pages", async () => {
const collName = "dedupe-pages-3";
expect(await runCrawl(collName)).toBe(0);
const thirdPageUrls = loadPages(collName);
// initial seed page is in pages.jsonl + 2 more pages
expect(thirdPageUrls).toHaveLength(2);
// ensure pages are not in first or second set, totally new pages
for (const pageUrl of firstPageUrls) {
expect(thirdPageUrls).not.toContain(pageUrl);
}
for (const pageUrl of secondPageUrls) {
expect(thirdPageUrls).not.toContain(pageUrl);
}
const skippedPages = loadPages(
collName,
"reports/skippedPages.jsonl",
(entry) => entry.reason === "duplicate",
);
// ensure first and second set of pages are in the skipped pages list
for (const pageUrl of firstPageUrls) {
expect(skippedPages).toContain(pageUrl);
}
for (const pageUrl of secondPageUrls) {
expect(skippedPages).toContain(pageUrl);
}
});