Autoclick Support (#729)

Adds support for autoclick behavior:
- Adds a new `autoclick` behavior option to `--behaviors`; it is not
enabled by default
- Adds a new exposed function, `__bx_addSet`, which lets the autoclick
behavior persist which links have already been clicked, to avoid
duplicate clicks; it is only used if the link has an href
- Adds a new pageFinished flag on the worker state.
- Adds an on('dialog') handler that dismisses beforeunload dialogs while
behaviors are running (page not finished) and accepts them once the page
is finished, so navigation away is only allowed when behaviors are done
- Update to browsertrix-behaviors 0.7.0, which supports autoclick
- Add --clickSelector option to customize elements that will be clicked,
defaulting to `a`.
- Add --linkSelector as alias for --selectLinks for consistency
- Unknown options for --behaviors are now logged as warnings instead of
causing a hard exit, for forward compatibility with future behavior types

Fixes #728, also #216, #665, #31
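
The dialog policy described in the list above can be sketched as a small pure function. This is only an illustration of the decision, not the handler from the commit: `decideDialog`, `DialogType` and `DialogDecision` are hypothetical names, and `blockUnload` stands in for the worker-state flag that is set while behaviors are running.

```typescript
// Hypothetical, dependency-free model of the dialog policy.
type DialogType = "alert" | "confirm" | "prompt" | "beforeunload";

interface DialogDecision {
  action: "accept" | "dismiss";
}

// blockUnload is assumed to be true while behaviors are still running.
function decideDialog(type: DialogType, blockUnload: boolean): DialogDecision {
  if (type === "beforeunload" && blockUnload) {
    // Dismissing a beforeunload dialog keeps the browser on the current page.
    return { action: "dismiss" };
  }
  // Every other dialog, and beforeunload once behaviors are done, is accepted.
  return { action: "accept" };
}
```

Keeping the decision separate from the page event handler makes it easy to check each case without launching a browser.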
Ilya Kreymer 2025-01-16 09:38:11 -08:00 committed by GitHub
parent 871490758a
commit b7150f1343
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
14 changed files with 259 additions and 108 deletions

@ -1,4 +1,4 @@
ARG BROWSER_VERSION=1.73.104 ARG BROWSER_VERSION=1.74.48
ARG BROWSER_IMAGE_BASE=webrecorder/browsertrix-browser-base:brave-${BROWSER_VERSION} ARG BROWSER_IMAGE_BASE=webrecorder/browsertrix-browser-base:brave-${BROWSER_VERSION}
FROM ${BROWSER_IMAGE_BASE} FROM ${BROWSER_IMAGE_BASE}
@ -39,7 +39,7 @@ ADD config/ /app/
ADD html/ /app/html/ ADD html/ /app/html/
ARG RWP_VERSION=2.2.4 ARG RWP_VERSION=2.2.5
ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/ui.js /app/html/rwp/ ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/ui.js /app/html/rwp/
ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/sw.js /app/html/rwp/ ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/sw.js /app/html/rwp/
ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/adblock/adblock.gz /app/html/rwp/adblock.gz ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/adblock/adblock.gz /app/html/rwp/adblock.gz

1
behaviors.js Normal file

File diff suppressed because one or more lines are too long

@ -50,11 +50,14 @@ Options:
     e-page-application crawling or when
     different hashtags load dynamic cont
     ent
-  --selectLinks  one or more selectors for extracting
+  --selectLinks, --linkSelector  One or more selectors for extracting
     links, in the format [css selector]
     ->[property to use],[css selector]->
     @[attribute to use]
     [array] [default: ["a[href]->href"]]
+  --clickSelector  Selector for elements to click when
+    using the autoclick behavior
+    [string] [default: "a"]
   --blockRules  Additional rules for blocking certai
     n URLs from being loaded, by URL reg
     ex and optionally via text match in
@ -75,7 +78,8 @@ Options:
     [string] [default: "crawl-@ts"]
   --headless  Run in headless mode, otherwise star
     t xvfb [boolean] [default: false]
-  --driver  JS driver for the crawler [string]
+  --driver  Custom driver for the crawler, if an
+    y [string]
   --generateCDX, --generatecdx, --gene  If set, generate index (CDXJ) for us
   rateCdx  e with pywb after crawl is done
     [boolean] [default: false]
@ -142,8 +146,7 @@ Options:
     o crawl working directory) [string]
   --behaviors  Which background behaviors to enable
     on each page
-  [array] [choices: "autoplay", "autofetch", "autoscroll", "siteSpecific"] [defa
-  ult: ["autoplay","autofetch","autoscroll","siteSpecific"]]
+  [array] [default: ["autoplay","autofetch","autoscroll","siteSpecific"]]
   --behaviorTimeout  If >0, timeout (in seconds) for in-p
     age behavior will run on each page.
     If 0, a behavior can run until finis
@ -163,8 +166,10 @@ Options:
     hich contains the browser profile di
     rectory [string]
   --screenshot  Screenshot options for crawler, can
-    include: view, thumbnail, fullPage
-  [array] [choices: "view", "thumbnail", "fullPage"] [default: []]
+    include: view, thumbnail, fullPage,
+    fullPageFinal
+  [array] [choices: "view", "thumbnail", "fullPage", "fullPageFinal"] [default:
+  []]
   --screencastPort  If set to a non-zero value, starts a
     n HTTP server with screencast access
     ible on this port
@ -251,9 +256,15 @@ Options:
     failing due to non-200 responses
     [boolean] [default: false]
   --customBehaviors  Custom behavior files to inject. Val
-    ues can be URLs, paths to individual
-    behavior files, or paths to a direc
-    tory of behavior files
+    id values: URL to file, path to file
+    , path to directory of behaviors, UR
+    L to Git repo of behaviors (prefixed
+    with git+, optionally specify branc
+    h and relative path to a directory w
+    ithin repo as branch and path query
+    parameters, e.g. --customBehaviors "
+    git+https://git.example.com/repo.git
+    ?branch=dev&path=some/dir"
     [array] [default: []]
   --debugAccessRedis  if set, runs internal redis without
     protected mode to allow external acc

@ -1,6 +1,6 @@
{ {
"name": "browsertrix-crawler", "name": "browsertrix-crawler",
"version": "1.4.2", "version": "1.5.0-beta.2",
"main": "browsertrix-crawler", "main": "browsertrix-crawler",
"type": "module", "type": "module",
"repository": "https://github.com/webrecorder/browsertrix-crawler", "repository": "https://github.com/webrecorder/browsertrix-crawler",
@ -18,7 +18,7 @@
"dependencies": { "dependencies": {
"@novnc/novnc": "1.4.0", "@novnc/novnc": "1.4.0",
"@webrecorder/wabac": "^2.20.8", "@webrecorder/wabac": "^2.20.8",
"browsertrix-behaviors": "^0.6.6", "browsertrix-behaviors": "^0.7.0",
"client-zip": "^2.4.5", "client-zip": "^2.4.5",
"css-selector-parser": "^3.0.5", "css-selector-parser": "^3.0.5",
"fetch-socks": "^1.3.0", "fetch-socks": "^1.3.0",
@ -31,7 +31,7 @@
"p-queue": "^7.3.4", "p-queue": "^7.3.4",
"pixelmatch": "^5.3.0", "pixelmatch": "^5.3.0",
"pngjs": "^7.0.0", "pngjs": "^7.0.0",
"puppeteer-core": "^23.7.1", "puppeteer-core": "^24.1.0",
"sax": "^1.3.0", "sax": "^1.3.0",
"sharp": "^0.32.6", "sharp": "^0.32.6",
"tsc": "^2.0.4", "tsc": "^2.0.4",

@ -32,12 +32,7 @@ import { ScreenCaster, WSTransport } from "./util/screencaster.js";
 import { Screenshots } from "./util/screenshots.js";
 import { initRedis } from "./util/redis.js";
 import { logger, formatErr, LogDetails } from "./util/logger.js";
-import {
-  WorkerOpts,
-  WorkerState,
-  closeWorkers,
-  runWorkers,
-} from "./util/worker.js";
+import { WorkerState, closeWorkers, runWorkers } from "./util/worker.js";
 import { sleep, timedRun, secondsElapsed } from "./util/timing.js";
 import { collectCustomBehaviors, getInfoString } from "./util/file_reader.js";
@ -689,14 +684,9 @@ export class Crawler {
     return !!seed.isIncluded(url, depth, extraHops, logDetails);
   }
-  async setupPage({
-    page,
-    cdp,
-    workerid,
-    callbacks,
-    recorder,
-    frameIdToExecId,
-  }: WorkerOpts) {
+  async setupPage(opts: WorkerState) {
+    const { page, cdp, workerid, callbacks, frameIdToExecId, recorder } = opts;
     await this.browser.setupPage({ page, cdp });
     await this.setupExecContextEvents(cdp, frameIdToExecId);
@ -775,6 +765,87 @@ self.__bx_behaviors.selectMainBehavior();
await this.browser.addInitScript(page, initScript); await this.browser.addInitScript(page, initScript);
} }
// only add if running with autoclick behavior
if (this.params.behaviors.includes("autoclick")) {
// Ensure off-page navigation is canceled while behavior is running
page.on("dialog", async (dialog) => {
let accepted = true;
if (dialog.type() === "beforeunload") {
if (opts.pageBlockUnload) {
accepted = false;
await dialog.dismiss();
} else {
await dialog.accept();
}
} else {
await dialog.accept();
}
logger.debug("JS Dialog", {
accepted,
blockingUnload: opts.pageBlockUnload,
message: dialog.message(),
type: dialog.type(),
page: page.url(),
workerid,
});
});
// Close any windows opened during navigation from autoclick
await cdp.send("Target.setDiscoverTargets", { discover: true });
cdp.on("Target.targetCreated", async (params) => {
const { targetInfo } = params;
const { type, openerFrameId, targetId } = targetInfo;
try {
if (
type === "page" &&
openerFrameId &&
opts.frameIdToExecId.has(openerFrameId)
) {
await cdp.send("Target.closeTarget", { targetId });
} else {
logger.warn("Extra target not closed", { targetInfo });
}
await cdp.send("Runtime.runIfWaitingForDebugger");
} catch (e) {
// target likely already closed
}
});
void cdp.send("Target.setAutoAttach", {
autoAttach: true,
waitForDebuggerOnStart: true,
flatten: false,
});
if (this.recording) {
await cdp.send("Page.enable");
cdp.on("Page.windowOpen", async (params) => {
const { seedId, depth, extraHops = 0, url } = opts.data;
const logDetails = { page: url, workerid };
await this.queueInScopeUrls(
seedId,
[params.url],
depth,
extraHops,
false,
logDetails,
);
});
}
}
await page.exposeFunction("__bx_addSet", (data: string) =>
this.crawlState.addToUserSet(data),
);
// await page.exposeFunction("__bx_hasSet", (data: string) => this.crawlState.hasUserSet(data));
} }
async setupExecContextEvents( async setupExecContextEvents(
@ -932,6 +1003,7 @@ self.__bx_behaviors.selectMainBehavior();
} }
opts.markPageUsed(); opts.markPageUsed();
opts.pageBlockUnload = false;
if (auth) { if (auth) {
await page.setExtraHTTPHeaders({ Authorization: auth }); await page.setExtraHTTPHeaders({ Authorization: auth });
@ -955,8 +1027,12 @@ self.__bx_behaviors.selectMainBehavior();
); );
data.favicon = await this.getFavicon(page, logDetails); data.favicon = await this.getFavicon(page, logDetails);
opts.pageBlockUnload = true;
await this.doPostLoadActions(opts); await this.doPostLoadActions(opts);
opts.pageBlockUnload = false;
await this.awaitPageExtraDelay(opts); await this.awaitPageExtraDelay(opts);
} }
@ -1111,7 +1187,7 @@ self.__bx_behaviors.selectMainBehavior();
} }
} }
async teardownPage({ workerid }: WorkerOpts) { async teardownPage({ workerid }: WorkerState) {
if (this.screencaster) { if (this.screencaster) {
await this.screencaster.stopById(workerid); await this.screencaster.stopById(workerid);
} }
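
The `Target.targetCreated` handler in the hunk above closes a new target only when it is a page opened from a frame the worker already tracks. That check can be sketched on its own as follows. `TargetInfoLike`, `FrameLookup` and `shouldCloseTarget` are hypothetical names for illustration, and only the fields the handler reads are modelled.

```typescript
// Hypothetical subset of the CDP TargetInfo fields read by the handler.
interface TargetInfoLike {
  type: string;
  targetId: string;
  openerFrameId?: string;
}

// Anything with a has() method works here, e.g. a Map keyed by frame id.
interface FrameLookup {
  has(frameId: string): boolean;
}

// True only for page targets opened by a frame that the worker is tracking.
function shouldCloseTarget(info: TargetInfoLike, frames: FrameLookup): boolean {
  const opener = info.openerFrameId;
  return (
    info.type === "page" &&
    typeof opener === "string" &&
    opener.length > 0 &&
    frames.has(opener)
  );
}
```

Targets that fail the check are left open, which matches the "Extra target not closed" warning branch in the diff.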

@ -17,6 +17,7 @@ import { CDPSession, Page, PuppeteerLifeCycleEvent } from "puppeteer-core";
import { getInfoString } from "./util/file_reader.js"; import { getInfoString } from "./util/file_reader.js";
import { DISPLAY } from "./util/constants.js"; import { DISPLAY } from "./util/constants.js";
import { initProxy } from "./util/proxy.js"; import { initProxy } from "./util/proxy.js";
//import { sleep } from "./util/timing.js";
const profileHTML = fs.readFileSync( const profileHTML = fs.readFileSync(
new URL("../html/createProfile.html", import.meta.url), new URL("../html/createProfile.html", import.meta.url),
@ -437,6 +438,27 @@ class InteractiveBrowser {
// attempt to keep everything to initial tab if headless // attempt to keep everything to initial tab if headless
if (this.params.headless) { if (this.params.headless) {
void cdp.send("Target.setDiscoverTargets", { discover: true });
cdp.on("Target.targetCreated", async (params) => {
const { targetInfo } = params;
const { type, openerFrameId } = targetInfo;
if (type === "page" && openerFrameId) {
await cdp.send("Target.closeTarget", {
targetId: params.targetInfo.targetId,
});
}
await cdp.send("Runtime.runIfWaitingForDebugger");
});
void cdp.send("Target.setAutoAttach", {
autoAttach: true,
waitForDebuggerOnStart: true,
flatten: false,
});
cdp.send("Page.enable").catch((e) => logger.warn("Page.enable error", e)); cdp.send("Page.enable").catch((e) => logger.warn("Page.enable error", e));
cdp.on("Page.windowOpen", async (resp) => { cdp.on("Page.windowOpen", async (resp) => {

@ -3,7 +3,7 @@ import { Crawler } from "./crawler.js";
import { ReplayServer } from "./util/replayserver.js"; import { ReplayServer } from "./util/replayserver.js";
import { sleep } from "./util/timing.js"; import { sleep } from "./util/timing.js";
import { logger, formatErr } from "./util/logger.js"; import { logger, formatErr } from "./util/logger.js";
import { WorkerOpts, WorkerState } from "./util/worker.js"; import { WorkerState } from "./util/worker.js";
import { PageState } from "./util/state.js"; import { PageState } from "./util/state.js";
import { PageInfoRecord, PageInfoValue, Recorder } from "./util/recorder.js"; import { PageInfoRecord, PageInfoValue, Recorder } from "./util/recorder.js";
@ -718,7 +718,7 @@ export class ReplayCrawler extends Crawler {
return text; return text;
} }
async teardownPage(opts: WorkerOpts) { async teardownPage(opts: WorkerState) {
const { page } = opts; const { page } = opts;
await this.processPageInfo(page); await this.processPageInfo(page);
await super.teardownPage(opts); await super.teardownPage(opts);

@ -15,6 +15,7 @@ import {
EXTRACT_TEXT_TYPES, EXTRACT_TEXT_TYPES,
SERVICE_WORKER_OPTS, SERVICE_WORKER_OPTS,
DEFAULT_SELECTORS, DEFAULT_SELECTORS,
BEHAVIOR_TYPES,
ExtractSelector, ExtractSelector,
} from "./constants.js"; } from "./constants.js";
import { ScopedSeed } from "./seeds.js"; import { ScopedSeed } from "./seeds.js";
@ -165,6 +166,7 @@ class ArgParser {
}, },
selectLinks: { selectLinks: {
alias: "linkSelector",
describe: describe:
"One or more selectors for extracting links, in the format [css selector]->[property to use],[css selector]->@[attribute to use]", "One or more selectors for extracting links, in the format [css selector]->[property to use],[css selector]->@[attribute to use]",
type: "array", type: "array",
@ -172,6 +174,13 @@ class ArgParser {
coerce, coerce,
}, },
clickSelector: {
describe:
"Selector for elements to click when using the autoclick behavior",
type: "string",
default: "a",
},
blockRules: { blockRules: {
describe: describe:
"Additional rules for blocking certain URLs from being loaded, by URL regex and optionally via text match in an iframe", "Additional rules for blocking certain URLs from being loaded, by URL regex and optionally via text match in an iframe",
@ -351,7 +360,6 @@ class ArgParser {
describe: "Which background behaviors to enable on each page", describe: "Which background behaviors to enable on each page",
type: "array", type: "array",
default: ["autoplay", "autofetch", "autoscroll", "siteSpecific"], default: ["autoplay", "autofetch", "autoscroll", "siteSpecific"],
choices: ["autoplay", "autofetch", "autoscroll", "siteSpecific"],
coerce, coerce,
}, },
@ -693,9 +701,20 @@ class ArgParser {
// background behaviors to apply // background behaviors to apply
const behaviorOpts: { [key: string]: string | boolean } = {}; const behaviorOpts: { [key: string]: string | boolean } = {};
if (argv.behaviors.length > 0) { if (argv.behaviors.length > 0) {
argv.behaviors.forEach((x: string) => (behaviorOpts[x] = true)); argv.behaviors.forEach((x: string) => {
if (BEHAVIOR_TYPES.includes(x)) {
behaviorOpts[x] = true;
} else {
logger.warn(
"Unknown behavior specified, ignoring",
{ behavior: x },
"behavior",
);
}
});
behaviorOpts.log = BEHAVIOR_LOG_FUNC; behaviorOpts.log = BEHAVIOR_LOG_FUNC;
behaviorOpts.startEarly = true; behaviorOpts.startEarly = true;
behaviorOpts.clickSelector = argv.clickSelector;
argv.behaviorOpts = JSON.stringify(behaviorOpts); argv.behaviorOpts = JSON.stringify(behaviorOpts);
} else { } else {
argv.behaviorOpts = ""; argv.behaviorOpts = "";
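
The change above replaces the yargs `choices` check with a warn-and-skip filter. A minimal standalone sketch of that filtering, assuming a plain `warn` callback in place of the crawler's logger and omitting the `log` and `startEarly` options, might look like this (`buildBehaviorOpts` and `KNOWN_BEHAVIORS` are hypothetical names):

```typescript
// Mirrors the BEHAVIOR_TYPES list added in this commit.
const KNOWN_BEHAVIORS = [
  "autoplay",
  "autofetch",
  "autoscroll",
  "autoclick",
  "siteSpecific",
];

// Hypothetical helper: enable known behaviors, warn about the rest.
function buildBehaviorOpts(
  requested: string[],
  clickSelector: string,
  warn: (message: string) => void,
): Record<string, string | boolean> {
  const opts: Record<string, string | boolean> = {};
  for (const name of requested) {
    if (KNOWN_BEHAVIORS.indexOf(name) !== -1) {
      opts[name] = true;
    } else {
      // Unknown names are skipped instead of aborting the crawl.
      warn("Unknown behavior specified, ignoring: " + name);
    }
  }
  opts.clickSelector = clickSelector;
  return opts;
}
```

With this shape, a crawler built before a new behavior type exists logs a warning for that name and keeps running with the behaviors it does recognize.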

@ -15,7 +15,7 @@ import puppeteer, {
Frame, Frame,
HTTPRequest, HTTPRequest,
Page, Page,
PuppeteerLaunchOptions, LaunchOptions,
Viewport, Viewport,
} from "puppeteer-core"; } from "puppeteer-core";
import { CDPSession, Target, Browser as PptrBrowser } from "puppeteer-core"; import { CDPSession, Target, Browser as PptrBrowser } from "puppeteer-core";
@ -108,7 +108,7 @@ export class Browser {
}; };
} }
const launchOpts: PuppeteerLaunchOptions = { const launchOpts: LaunchOptions = {
args, args,
headless, headless,
executablePath: this.getBrowserExe(), executablePath: this.getBrowserExe(),
@ -126,7 +126,7 @@ export class Browser {
? undefined ? undefined
: (target) => this.targetFilter(target), : (target) => this.targetFilter(target),
}; };
await this._init(launchOpts, ondisconnect); await this._init(launchOpts, recording, ondisconnect);
} }
targetFilter(target: Target) { targetFilter(target: Target) {
@ -388,8 +388,9 @@ export class Browser {
} }
} }
async _init( private async _init(
launchOpts: PuppeteerLaunchOptions, launchOpts: LaunchOptions,
recording: boolean,
// eslint-disable-next-line @typescript-eslint/ban-types // eslint-disable-next-line @typescript-eslint/ban-types
ondisconnect: Function | null = null, ondisconnect: Function | null = null,
) { ) {
@ -399,7 +400,9 @@ export class Browser {
this.firstCDP = await target.createCDPSession(); this.firstCDP = await target.createCDPSession();
if (recording) {
await this.browserContextFetch(); await this.browserContextFetch();
}
if (ondisconnect) { if (ondisconnect) {
this.browser.on("disconnected", (err) => ondisconnect(err)); this.browser.on("disconnected", (err) => ondisconnect(err));

@ -46,4 +46,12 @@ export const DEFAULT_SELECTORS: ExtractSelector[] = [
}, },
]; ];
export const BEHAVIOR_TYPES = [
"autoplay",
"autofetch",
"autoscroll",
"autoclick",
"siteSpecific",
];
export const DISPLAY = ":99"; export const DISPLAY = ":99";

@ -837,6 +837,10 @@ return inx;
return await this.redis.srem(key, status + "|" + url); return await this.redis.srem(key, status + "|" + url);
} }
async addToUserSet(value: string) {
return (await this.redis.sadd(this.key + ":user", value)) === 1;
}
async logError(error: string) { async logError(error: string) {
return await this.redis.lpush(this.ekey, error); return await this.redis.lpush(this.ekey, error);
} }
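
`addToUserSet` returns true only when Redis `SADD` reports that one new member was added, which is what lets the autoclick behavior skip links it has already clicked. The same contract can be sketched with an in-memory stand-in. `InMemoryUserSet` is a hypothetical class for illustration and does not talk to Redis.

```typescript
// Hypothetical in-memory stand-in for the Redis-backed user set.
class InMemoryUserSet {
  // A null-prototype object avoids collisions with inherited keys
  // such as "constructor".
  private seen: Record<string, boolean> = Object.create(null);

  // Mirrors the SADD-based check: true only when the value is new.
  add(value: string): boolean {
    if (this.seen[value]) {
      return false;
    }
    this.seen[value] = true;
    return true;
  }
}

const clicked = new InMemoryUserSet();
const firstVisit = clicked.add("https://example.com/page");
const secondVisit = clicked.add("https://example.com/page");
```

Here `firstVisit` is true and `secondVisit` is false, so a caller would click the link only on the first call.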

@ -18,7 +18,7 @@ const NEW_WINDOW_TIMEOUT = 20;
 const TEARDOWN_TIMEOUT = 10;
 const FINISHED_TIMEOUT = 60;
-export type WorkerOpts = {
+export type WorkerState = {
   page: Page;
   cdp: CDPSession;
   workerid: WorkerId;
@ -31,10 +31,7 @@ export type WorkerOpts = {
   markPageUsed: () => void;
   frameIdToExecId: Map<string, number>;
   isAuthSet?: boolean;
-};
-// ===========================================================================
-export type WorkerState = WorkerOpts & {
+  pageBlockUnload?: boolean;
   data: PageState;
 };
@ -53,7 +50,7 @@ export class PageWorker {
// eslint-disable-next-line @typescript-eslint/ban-types // eslint-disable-next-line @typescript-eslint/ban-types
callbacks?: Record<string, Function>; callbacks?: Record<string, Function>;
opts?: WorkerOpts; opts?: WorkerState;
// TODO: Fix this the next time the file is edited. // TODO: Fix this the next time the file is edited.
// eslint-disable-next-line @typescript-eslint/no-explicit-any // eslint-disable-next-line @typescript-eslint/no-explicit-any
@ -135,7 +132,8 @@ export class PageWorker {
} }
} }
async initPage(url: string): Promise<WorkerOpts> { async initPage(pagestate: PageState): Promise<WorkerState> {
const { url } = pagestate;
let reuse = !this.crashed && !!this.opts && !!this.page; let reuse = !this.crashed && !!this.opts && !!this.page;
if (!this.alwaysReuse) { if (!this.alwaysReuse) {
reuse = this.reuseCount <= MAX_REUSE && this.isSameOrigin(url); reuse = this.reuseCount <= MAX_REUSE && this.isSameOrigin(url);
@ -146,6 +144,7 @@ export class PageWorker {
{ reuseCount: this.reuseCount, ...this.logDetails }, { reuseCount: this.reuseCount, ...this.logDetails },
"worker", "worker",
); );
this.opts!.data = pagestate;
return this.opts!; return this.opts!;
} else if (this.page) { } else if (this.page) {
await this.closePage(); await this.closePage();
@ -192,6 +191,8 @@ export class PageWorker {
this.reuseCount++; this.reuseCount++;
} }
}, },
pageBlockUnload: false,
data: pagestate,
}; };
if (this.recorder) { if (this.recorder) {
@ -370,10 +371,10 @@ export class PageWorker {
} }
// init page (new or reuse) // init page (new or reuse)
const opts = await this.initPage(data.url); const opts = await this.initPage(data);
// run timed crawl of page // run timed crawl of page
await this.timedCrawlPage({ ...opts, data }); await this.timedCrawlPage(opts);
loggedWaiting = false; loggedWaiting = false;
} else { } else {

@ -3,7 +3,7 @@ import fs from "fs";
test("set rollover to 500K and ensure individual WARCs rollover, including screenshots", async () => { test("set rollover to 500K and ensure individual WARCs rollover, including screenshots", async () => {
child_process.execSync( child_process.execSync(
"docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://old.webrecorder.net/ --limit 5 --exclude community --collection rollover-500K --rolloverSize 500000 --screenshot view" "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://old.webrecorder.net/ --limit 5 --exclude community --collection rollover-500K --rolloverSize 500000 --screenshot view --logging debug"
); );
const warcLists = fs.readdirSync("test-crawls/collections/rollover-500K/archive"); const warcLists = fs.readdirSync("test-crawls/collections/rollover-500K/archive");

124
yarn.lock
@ -719,15 +719,15 @@
tslib "^2.7.0" tslib "^2.7.0"
tsyringe "^4.8.0" tsyringe "^4.8.0"
"@puppeteer/browsers@2.4.1": "@puppeteer/browsers@2.7.0":
version "2.4.1" version "2.7.0"
resolved "https://registry.yarnpkg.com/@puppeteer/browsers/-/browsers-2.4.1.tgz#7afd271199cc920ece2ff25109278be0a3e8a225" resolved "https://registry.yarnpkg.com/@puppeteer/browsers/-/browsers-2.7.0.tgz#dad70b30458f4e0855b2f402055f408823cc67c5"
integrity sha512-0kdAbmic3J09I6dT8e9vE2JOCSt13wHCW5x/ly8TSt2bDtuIWe2TgLZZDHdcziw9AVCzflMAXCrVyRIhIs44Ng== integrity sha512-bO61XnTuopsz9kvtfqhVbH6LTM1koxK0IlBR+yuVrM2LB7mk8+5o1w18l5zqd5cs8xlf+ntgambqRqGifMDjog==
dependencies: dependencies:
debug "^4.3.7" debug "^4.4.0"
extract-zip "^2.0.1" extract-zip "^2.0.1"
progress "^2.0.3" progress "^2.0.3"
proxy-agent "^6.4.0" proxy-agent "^6.5.0"
semver "^7.6.3" semver "^7.6.3"
tar-fs "^3.0.6" tar-fs "^3.0.6"
unbzip2-stream "^1.4.3" unbzip2-stream "^1.4.3"
@ -1062,13 +1062,18 @@ acorn@^8.10.0, acorn@^8.9.0:
resolved "https://registry.yarnpkg.com/acorn/-/acorn-8.14.0.tgz#063e2c70cac5fb4f6467f0b11152e04c682795b0" resolved "https://registry.yarnpkg.com/acorn/-/acorn-8.14.0.tgz#063e2c70cac5fb4f6467f0b11152e04c682795b0"
integrity sha512-cl669nCJTZBsL97OF4kUQm5g5hC2uihk0NxY3WENAC0TYdILVkAyHymAntgxGkl7K+t0cXIrH5siy5S4XkFycA== integrity sha512-cl669nCJTZBsL97OF4kUQm5g5hC2uihk0NxY3WENAC0TYdILVkAyHymAntgxGkl7K+t0cXIrH5siy5S4XkFycA==
agent-base@^7.0.2, agent-base@^7.1.0, agent-base@^7.1.1: agent-base@^7.1.0:
version "7.1.1" version "7.1.1"
resolved "https://registry.yarnpkg.com/agent-base/-/agent-base-7.1.1.tgz#bdbded7dfb096b751a2a087eeeb9664725b2e317" resolved "https://registry.yarnpkg.com/agent-base/-/agent-base-7.1.1.tgz#bdbded7dfb096b751a2a087eeeb9664725b2e317"
integrity sha512-H0TSyFNDMomMNJQBn8wFV5YC/2eJ+VXECwOadZJT554xP6cODZHPX3H9QMQECxvrgiSOP1pHjy1sMWQVYJOUOA== integrity sha512-H0TSyFNDMomMNJQBn8wFV5YC/2eJ+VXECwOadZJT554xP6cODZHPX3H9QMQECxvrgiSOP1pHjy1sMWQVYJOUOA==
dependencies: dependencies:
debug "^4.3.4" debug "^4.3.4"
agent-base@^7.1.2:
version "7.1.3"
resolved "https://registry.yarnpkg.com/agent-base/-/agent-base-7.1.3.tgz#29435eb821bc4194633a5b89e5bc4703bafc25a1"
integrity sha512-jRR5wdylq8CkOe6hei19GGZnxM6rBGwFl3Bg0YItGDimvjGtAvdZk4Pu6Cl4u4Igsws4a1fd1Vq3ezrhn4KmFw==
ajv@^6.12.4: ajv@^6.12.4:
version "6.12.6" version "6.12.6"
resolved "https://registry.yarnpkg.com/ajv/-/ajv-6.12.6.tgz#baf5a62e802b07d977034586f8c3baf5adf26df4" resolved "https://registry.yarnpkg.com/ajv/-/ajv-6.12.6.tgz#baf5a62e802b07d977034586f8c3baf5adf26df4"
@ -1435,10 +1440,10 @@ browserslist@^4.24.0:
node-releases "^2.0.18" node-releases "^2.0.18"
update-browserslist-db "^1.1.1" update-browserslist-db "^1.1.1"
browsertrix-behaviors@^0.6.6: browsertrix-behaviors@^0.7.0:
version "0.6.6" version "0.7.0"
resolved "https://registry.yarnpkg.com/browsertrix-behaviors/-/browsertrix-behaviors-0.6.6.tgz#10bcccfb091c051f5c886d5f69487e6d184078de" resolved "https://registry.yarnpkg.com/browsertrix-behaviors/-/browsertrix-behaviors-0.7.0.tgz#a08b7d3e9cd449d0d76b14a438e28472124fd1a4"
integrity sha512-UPNcU9dV0nAvUwJHKKYCkuqdYdlMjK7AWYDyr4xBpSq55xmEh2wQlwQyDyJuUUUrhJNII4NqXK24hVXPdvf5VA== integrity sha512-t0X74puXJsH8sVkkVZwEdo8L5E1PYtzX/RkVXM4fwwBIL804bOB8WIV+5Dfwov/odaukhB67KZhM00hN60SiBA==
dependencies: dependencies:
query-selector-shadow-dom "^1.0.1" query-selector-shadow-dom "^1.0.1"
@ -1534,13 +1539,12 @@ chownr@^1.1.1:
resolved "https://registry.yarnpkg.com/chownr/-/chownr-1.1.4.tgz#6fc9d7b42d32a583596337666e7d08084da2cc6b" resolved "https://registry.yarnpkg.com/chownr/-/chownr-1.1.4.tgz#6fc9d7b42d32a583596337666e7d08084da2cc6b"
integrity sha512-jJ0bqzaylmJtVnNgzTeSOs8DPavpbYgEr/b0YL8/2GO3xJEhInFmhKMUnEJQjZumK7KXGFhUy89PrsJWlakBVg== integrity sha512-jJ0bqzaylmJtVnNgzTeSOs8DPavpbYgEr/b0YL8/2GO3xJEhInFmhKMUnEJQjZumK7KXGFhUy89PrsJWlakBVg==
chromium-bidi@0.8.0: chromium-bidi@0.11.0:
version "0.8.0" version "0.11.0"
resolved "https://registry.yarnpkg.com/chromium-bidi/-/chromium-bidi-0.8.0.tgz#ffd79dad7db1fcc874f1c55fcf46ded05a884269" resolved "https://registry.yarnpkg.com/chromium-bidi/-/chromium-bidi-0.11.0.tgz#9c3c42ee7b42d8448e9fce8d649dc8bfbcc31153"
integrity sha512-uJydbGdTw0DEUjhoogGveneJVWX/9YuqkWePzMmkBYwtdAqo5d3J/ovNKFr+/2hWXYmYCr6it8mSSTIj6SS6Ug== integrity sha512-6CJWHkNRoyZyjV9Rwv2lYONZf1Xm0IuDyNq97nwSsxxP3wf5Bwy15K5rOvVKMtJ127jJBmxFUanSAOjgFRxgrA==
dependencies: dependencies:
mitt "3.0.1" mitt "3.0.1"
urlpattern-polyfill "10.0.0"
zod "3.23.8" zod "3.23.8"
ci-info@^3.2.0: ci-info@^3.2.0:
@ -1696,7 +1700,7 @@ data-view-byte-offset@^1.0.0:
es-errors "^1.3.0" es-errors "^1.3.0"
is-data-view "^1.0.1" is-data-view "^1.0.1"
debug@4, debug@^4.1.0, debug@^4.1.1, debug@^4.3.1, debug@^4.3.2, debug@^4.3.4, debug@^4.3.7: debug@4, debug@^4.1.0, debug@^4.1.1, debug@^4.3.1, debug@^4.3.2, debug@^4.3.4:
version "4.3.7" version "4.3.7"
resolved "https://registry.yarnpkg.com/debug/-/debug-4.3.7.tgz#87945b4151a011d76d95a198d7111c865c360a52" resolved "https://registry.yarnpkg.com/debug/-/debug-4.3.7.tgz#87945b4151a011d76d95a198d7111c865c360a52"
integrity sha512-Er2nc/H7RrMXZBFCEim6TCmMk02Z8vLC2Rbi1KEBggpo0fS6l0S1nnapwmIi3yW/+GOJap1Krg4w0Hg80oCqgQ== integrity sha512-Er2nc/H7RrMXZBFCEim6TCmMk02Z8vLC2Rbi1KEBggpo0fS6l0S1nnapwmIi3yW/+GOJap1Krg4w0Hg80oCqgQ==
@ -1710,6 +1714,13 @@ debug@^3.2.7:
dependencies: dependencies:
ms "^2.1.1" ms "^2.1.1"
debug@^4.4.0:
version "4.4.0"
resolved "https://registry.yarnpkg.com/debug/-/debug-4.4.0.tgz#2b3f2aea2ffeb776477460267377dc8710faba8a"
integrity sha512-6WTZ/IxCY/T6BALoZHaE4ctp9xm+Z5kY/pzYaCHRFeyVhojxlrm+46y68HA6hr0TcwEssoxNiDEUJQjfPZ/RYA==
dependencies:
ms "^2.1.3"
decode-uri-component@^0.2.2: decode-uri-component@^0.2.2:
version "0.2.2" version "0.2.2"
resolved "https://registry.yarnpkg.com/decode-uri-component/-/decode-uri-component-0.2.2.tgz#e69dbe25d37941171dd540e024c444cd5188e1e9" resolved "https://registry.yarnpkg.com/decode-uri-component/-/decode-uri-component-0.2.2.tgz#e69dbe25d37941171dd540e024c444cd5188e1e9"
@ -1784,10 +1795,10 @@ detect-newline@^3.0.0:
resolved "https://registry.yarnpkg.com/detect-newline/-/detect-newline-3.1.0.tgz#576f5dfc63ae1a192ff192d8ad3af6308991b651" resolved "https://registry.yarnpkg.com/detect-newline/-/detect-newline-3.1.0.tgz#576f5dfc63ae1a192ff192d8ad3af6308991b651"
integrity sha512-TLz+x/vEXm/Y7P7wn1EJFNLxYpUD4TgMosxY6fAVJUnJMbupHBOncxyWUG9OpTaH9EBD7uFI5LfEgmMOc54DsA== integrity sha512-TLz+x/vEXm/Y7P7wn1EJFNLxYpUD4TgMosxY6fAVJUnJMbupHBOncxyWUG9OpTaH9EBD7uFI5LfEgmMOc54DsA==
devtools-protocol@0.0.1367902: devtools-protocol@0.0.1380148:
version "0.0.1367902" version "0.0.1380148"
resolved "https://registry.yarnpkg.com/devtools-protocol/-/devtools-protocol-0.0.1367902.tgz#7333bfc4466c5a54a4c6de48a9dfbcb4b811660c" resolved "https://registry.yarnpkg.com/devtools-protocol/-/devtools-protocol-0.0.1380148.tgz#7dcdad06515135b244ff05878ca8019e041c1c55"
integrity sha512-XxtPuC3PGakY6PD7dG66/o8KwJ/LkH2/EKe19Dcw58w53dv4/vSQEkn/SzuyhHE2q4zPgCkxQBxus3VV4ql+Pg== integrity sha512-1CJABgqLxbYxVI+uJY/UDUHJtJ0KZTSjNYJYKqd9FRoXT33WDakDHNxRapMEgzeJ/C3rcs01+avshMnPmKQbvA==
diff-sequences@^29.6.3: diff-sequences@^29.6.3:
version "29.6.3" version "29.6.3"
@ -2603,12 +2614,12 @@ http-status-codes@^2.1.4:
resolved "https://registry.yarnpkg.com/http-status-codes/-/http-status-codes-2.3.0.tgz#987fefb28c69f92a43aecc77feec2866349a8bfc" resolved "https://registry.yarnpkg.com/http-status-codes/-/http-status-codes-2.3.0.tgz#987fefb28c69f92a43aecc77feec2866349a8bfc"
integrity sha512-RJ8XvFvpPM/Dmc5SV+dC4y5PCeOhT3x1Hq0NU3rjGeg5a/CqlhZ7uudknPwZFz4aeAXDcbAyaeP7GAo9lvngtA== integrity sha512-RJ8XvFvpPM/Dmc5SV+dC4y5PCeOhT3x1Hq0NU3rjGeg5a/CqlhZ7uudknPwZFz4aeAXDcbAyaeP7GAo9lvngtA==
https-proxy-agent@^7.0.3, https-proxy-agent@^7.0.5: https-proxy-agent@^7.0.6:
version "7.0.5" version "7.0.6"
resolved "https://registry.yarnpkg.com/https-proxy-agent/-/https-proxy-agent-7.0.5.tgz#9e8b5013873299e11fab6fd548405da2d6c602b2" resolved "https://registry.yarnpkg.com/https-proxy-agent/-/https-proxy-agent-7.0.6.tgz#da8dfeac7da130b05c2ba4b59c9b6cd66611a6b9"
integrity sha512-1e4Wqeblerz+tMKPIq2EMGiiWW1dIjZOksyHWSUm1rmuvw/how9hBHZ38lAGj5ID4Ik6EdkOw7NmWPy6LAwalw== integrity sha512-vK9P5/iUfdl95AI+JVyUuIcVtd4ofvtrOr3HNtM2yxC9bnMbEdp3x01OhQNnjb8IJYi38VlTE3mBXwcfvywuSw==
dependencies: dependencies:
agent-base "^7.0.2" agent-base "^7.1.2"
debug "4" debug "4"
human-signals@^2.1.0: human-signals@^2.1.0:
@ -3834,19 +3845,19 @@ p-try@^2.0.0:
resolved "https://registry.yarnpkg.com/p-try/-/p-try-2.2.0.tgz#cb2868540e313d61de58fafbe35ce9004d5540e6" resolved "https://registry.yarnpkg.com/p-try/-/p-try-2.2.0.tgz#cb2868540e313d61de58fafbe35ce9004d5540e6"
integrity sha512-R4nPAVTAU0B9D35/Gk3uJf/7XYbQcyohSKdvAxIRSNghFl4e71hVoGnBNQz9cWaXxO2I10KTC+3jMdvvoKw6dQ== integrity sha512-R4nPAVTAU0B9D35/Gk3uJf/7XYbQcyohSKdvAxIRSNghFl4e71hVoGnBNQz9cWaXxO2I10KTC+3jMdvvoKw6dQ==
pac-proxy-agent@^7.0.1: pac-proxy-agent@^7.1.0:
version "7.0.2" version "7.1.0"
resolved "https://registry.yarnpkg.com/pac-proxy-agent/-/pac-proxy-agent-7.0.2.tgz#0fb02496bd9fb8ae7eb11cfd98386daaac442f58" resolved "https://registry.yarnpkg.com/pac-proxy-agent/-/pac-proxy-agent-7.1.0.tgz#da7c3b5c4cccc6655aaafb701ae140fb23f15df2"
integrity sha512-BFi3vZnO9X5Qt6NRz7ZOaPja3ic0PhlsmCRYLOpN11+mWBCR6XJDqW5RF3j8jm4WGGQZtBA+bTfxYzeKW73eHg== integrity sha512-Z5FnLVVZSnX7WjBg0mhDtydeRZ1xMcATZThjySQUHqr+0ksP8kqaw23fNKkaaN/Z8gwLUs/W7xdl0I75eP2Xyw==
dependencies: dependencies:
"@tootallnate/quickjs-emscripten" "^0.23.0" "@tootallnate/quickjs-emscripten" "^0.23.0"
agent-base "^7.0.2" agent-base "^7.1.2"
debug "^4.3.4" debug "^4.3.4"
get-uri "^6.0.1" get-uri "^6.0.1"
http-proxy-agent "^7.0.0" http-proxy-agent "^7.0.0"
https-proxy-agent "^7.0.5" https-proxy-agent "^7.0.6"
pac-resolver "^7.0.1" pac-resolver "^7.0.1"
socks-proxy-agent "^8.0.4" socks-proxy-agent "^8.0.5"
pac-resolver@^7.0.1: pac-resolver@^7.0.1:
version "7.0.1" version "7.0.1"
@ -4056,19 +4067,19 @@ prop-types@^15.8.1:
object-assign "^4.1.1" object-assign "^4.1.1"
react-is "^16.13.1" react-is "^16.13.1"
proxy-agent@^6.4.0: proxy-agent@^6.5.0:
version "6.4.0" version "6.5.0"
resolved "https://registry.yarnpkg.com/proxy-agent/-/proxy-agent-6.4.0.tgz#b4e2dd51dee2b377748aef8d45604c2d7608652d" resolved "https://registry.yarnpkg.com/proxy-agent/-/proxy-agent-6.5.0.tgz#9e49acba8e4ee234aacb539f89ed9c23d02f232d"
integrity sha512-u0piLU+nCOHMgGjRbimiXmA9kM/L9EHh3zL81xCdp7m+Y2pHIsnmbdDoEDoAz5geaonNR6q6+yOPQs6n4T6sBQ== integrity sha512-TmatMXdr2KlRiA2CyDu8GqR8EjahTG3aY3nXjdzFyoZbmB8hrBsTyMezhULIXKnC0jpfjlmiZ3+EaCzoInSu/A==
dependencies: dependencies:
agent-base "^7.0.2" agent-base "^7.1.2"
debug "^4.3.4" debug "^4.3.4"
http-proxy-agent "^7.0.1" http-proxy-agent "^7.0.1"
https-proxy-agent "^7.0.3" https-proxy-agent "^7.0.6"
lru-cache "^7.14.1" lru-cache "^7.14.1"
pac-proxy-agent "^7.0.1" pac-proxy-agent "^7.1.0"
proxy-from-env "^1.1.0" proxy-from-env "^1.1.0"
socks-proxy-agent "^8.0.2" socks-proxy-agent "^8.0.5"
proxy-from-env@^1.1.0: proxy-from-env@^1.1.0:
version "1.1.0" version "1.1.0"
@ -4088,15 +4099,15 @@ punycode@^2.1.0:
resolved "https://registry.yarnpkg.com/punycode/-/punycode-2.3.1.tgz#027422e2faec0b25e1549c3e1bd8309b9133b6e5" resolved "https://registry.yarnpkg.com/punycode/-/punycode-2.3.1.tgz#027422e2faec0b25e1549c3e1bd8309b9133b6e5"
integrity sha512-vYt7UD1U9Wg6138shLtLOvdAu+8DsC/ilFtEVHcH+wydcSpNE20AfSOduf6MkRFahL5FY7X1oU7nKVZFtfq8Fg== integrity sha512-vYt7UD1U9Wg6138shLtLOvdAu+8DsC/ilFtEVHcH+wydcSpNE20AfSOduf6MkRFahL5FY7X1oU7nKVZFtfq8Fg==
puppeteer-core@^23.7.1: puppeteer-core@^24.1.0:
version "23.9.0" version "24.1.0"
resolved "https://registry.yarnpkg.com/puppeteer-core/-/puppeteer-core-23.9.0.tgz#24add69fb58dde4ac49d165872b44a30d2bf5b32" resolved "https://registry.yarnpkg.com/puppeteer-core/-/puppeteer-core-24.1.0.tgz#4ea006ab26077dfbf6c72e2cf74797a7ff6db468"
integrity sha512-hLVrav2HYMVdK0YILtfJwtnkBAwNOztUdR4aJ5YKDvgsbtagNr6urUJk9HyjRA9e+PaLI3jzJ0wM7A4jSZ7Qxw== integrity sha512-ReefWoQgqdyl67uWEBy/TMZ4mAB7hP0JB5HIxSE8B1ot/4ningX1gmzHCOSNfMbTiS/VJHCvaZAe3oJTXph7yw==
dependencies: dependencies:
"@puppeteer/browsers" "2.4.1" "@puppeteer/browsers" "2.7.0"
chromium-bidi "0.8.0" chromium-bidi "0.11.0"
debug "^4.3.7" debug "^4.4.0"
devtools-protocol "0.0.1367902" devtools-protocol "0.0.1380148"
typed-query-selector "^2.12.0" typed-query-selector "^2.12.0"
ws "^8.18.0" ws "^8.18.0"
@ -4445,12 +4456,12 @@ smart-buffer@^4.2.0:
resolved "https://registry.yarnpkg.com/smart-buffer/-/smart-buffer-4.2.0.tgz#6e1d71fa4f18c05f7d0ff216dd16a481d0e8d9ae" resolved "https://registry.yarnpkg.com/smart-buffer/-/smart-buffer-4.2.0.tgz#6e1d71fa4f18c05f7d0ff216dd16a481d0e8d9ae"
integrity sha512-94hK0Hh8rPqQl2xXc3HsaBoOXKV20MToPkcXvwbISWLEs+64sBq5kFgn2kJDHb1Pry9yrP0dxrCI9RRci7RXKg== integrity sha512-94hK0Hh8rPqQl2xXc3HsaBoOXKV20MToPkcXvwbISWLEs+64sBq5kFgn2kJDHb1Pry9yrP0dxrCI9RRci7RXKg==
socks-proxy-agent@^8.0.2, socks-proxy-agent@^8.0.4: socks-proxy-agent@^8.0.5:
version "8.0.4" version "8.0.5"
resolved "https://registry.yarnpkg.com/socks-proxy-agent/-/socks-proxy-agent-8.0.4.tgz#9071dca17af95f483300316f4b063578fa0db08c" resolved "https://registry.yarnpkg.com/socks-proxy-agent/-/socks-proxy-agent-8.0.5.tgz#b9cdb4e7e998509d7659d689ce7697ac21645bee"
integrity sha512-GNAq/eg8Udq2x0eNiFkr9gRg5bA7PXEWagQdeRX4cPSG+X/8V38v637gim9bjFptMk1QWsCTr0ttrJEiXbNnRw== integrity sha512-HehCEsotFqbPW9sJ8WVYB6UbmIMv7kUUORIF2Nncq4VQvBfNBLibW9YZR5dlYCSUhwcD628pRllm7n+E+YTzJw==
dependencies: dependencies:
agent-base "^7.1.1" agent-base "^7.1.2"
debug "^4.3.4" debug "^4.3.4"
socks "^2.8.3" socks "^2.8.3"
@ -4959,11 +4970,6 @@ url-join@^4.0.1:
resolved "https://registry.yarnpkg.com/url-join/-/url-join-4.0.1.tgz#b642e21a2646808ffa178c4c5fda39844e12cde7" resolved "https://registry.yarnpkg.com/url-join/-/url-join-4.0.1.tgz#b642e21a2646808ffa178c4c5fda39844e12cde7"
integrity sha512-jk1+QP6ZJqyOiuEI9AEWQfju/nB2Pw466kbA0LEZljHwKeMgd9WrAEgEGxjPDD2+TNbbb37rTyhEfrCXfuKXnA== integrity sha512-jk1+QP6ZJqyOiuEI9AEWQfju/nB2Pw466kbA0LEZljHwKeMgd9WrAEgEGxjPDD2+TNbbb37rTyhEfrCXfuKXnA==
urlpattern-polyfill@10.0.0:
version "10.0.0"
resolved "https://registry.yarnpkg.com/urlpattern-polyfill/-/urlpattern-polyfill-10.0.0.tgz#f0a03a97bfb03cdf33553e5e79a2aadd22cac8ec"
integrity sha512-H/A06tKD7sS1O1X2SshBVeA5FLycRpjqiBeqGKmBwBDBy28EnRjORxTNe269KSSr5un5qyWi1iL61wLxpd+ZOg==
util-deprecate@^1.0.1: util-deprecate@^1.0.1:
version "1.0.2" version "1.0.2"
resolved "https://registry.yarnpkg.com/util-deprecate/-/util-deprecate-1.0.2.tgz#450d4dc9fa70de732762fbd2d4a28981419a0ccf" resolved "https://registry.yarnpkg.com/util-deprecate/-/util-deprecate-1.0.2.tgz#450d4dc9fa70de732762fbd2d4a28981419a0ccf"