Compare commits


18 commits

Author SHA1 Message Date
Ilya Kreymer
6f26148a9b bump version to 1.8.1 2025-10-08 17:11:04 -07:00
Ilya Kreymer
4f234040ce
Profile Saving Improvements (#894)
fix some observed errors that occur when saving a profile:
- use browser.cookies instead of page.cookies to get all cookies, not just those from the page
- catch and ignore the exception when clearing the cache
- logging: log when proxy init is happening on all paths, in case of an error in the proxy connection
2025-10-08 17:09:20 -07:00
Ilya Kreymer
002feb287b
dismiss js dialog popups (#895)
move the JS dialog handler so it is no longer only enabled for autoclick; dismiss all JS dialogs (alert(), prompt()) to avoid blocking the page
fixes #891
2025-10-08 14:57:52 -07:00
Ilya Kreymer
2270964996
logging: remove duplicate seeds found error (#893)
Per discussion, the message is unnecessary / confusing (doesn't provide
enough info) and can also happen on crawler restart.
2025-10-07 08:18:22 -07:00
Ilya Kreymer
fd49041f63
flow behaviors: add scrolling into view (#892)
Some page elements don't respond correctly if they are not in view, so add setEnsureElementIsInTheViewport() to the click, doubleclick, hover, and change step locators.
2025-10-07 08:17:56 -07:00
Ed Summers
cc2d890916
Add addLink doc (#890)
It's helpful to know this function is there!
2025-10-02 15:45:55 -04:00
Ilya Kreymer
f7a080fe83 version: bump to 1.8.0 2025-09-25 10:42:02 -07:00
Ilya Kreymer
048b72ca87
deps update: bump browser to brave 1.82.170, wabac.js 2.24.1 (#886)
use latest puppeteer-core, puppeteer/replay

bump to 1.8.0-beta.1
2025-09-20 11:38:20 -07:00
Ilya Kreymer
8ca7756d1b
tests: remove example.com from tests (#885)
also use local http-server for behavior tests
2025-09-19 23:21:47 -07:00
Ilya Kreymer
a2742df328
seed urls list: check for quoted URLs and remove quotes (#883)
- check for URLs that are wrapped in quotes, e.g. 'https://example.com/' or "https://example.com/", and trim and remove the quotes before adding the seed
- tests: add quoted URL to tests, fix old.webrecorder.net test
- deps: update wabac.js, RWP to latest
- logging: reduce error logging for seed lists, only log once that there are duplicates or page limit is reached
- fix for #882
2025-09-12 13:34:41 -07:00
Ilya Kreymer
705bc0cd9f
Async Fetch Refactor (#880)
- separate out reading a stream response while the browser is waiting (not really async) from actual async loading; this is not handled via fetchResponseBody()
- unify async fetch into first trying browser networking for a regular GET, with fallback to regular fetch()
- load headers and body separately in async fetch, allowing for cancelling the request after the headers
- refactor direct fetch of non-HTML pages: load headers and handle loading the body, adding the page asynchronously, allowing the worker to continue loading browser-based pages (should allow more parallelization in the future)
- unify WARC writing in preparation for dedup: a unified serializeWARC() is called for all paths, the WARC digest is computed, and additional payload checks are added for streaming loading
2025-09-10 12:05:21 -07:00
Ilya Kreymer
a42c0b926e
Support host-specific proxies with proxy config YAML (#837)
- Adds support for a YAML-based config for multiple proxies, with a 'matchHosts' section (matching hosts by regex) and a 'proxies' declaration, allowing any number of hosts to be matched to any number of named proxies.
- Specified via --proxyServerConfig option passed to both crawl and
profile creation commands.
- Implemented internally by generating a proxy PAC script which does
regex matching and running browser with the specified proxy PAC script
served by an internal http server.
- Also support matching different undici Agents by regex, for using
different proxies with direct fetching
- Precedence: --proxyServerConfig takes precedence over --proxyServer /
PROXY_SERVER, unless --proxyServerPreferSingleProxy is also provided
- Updated proxies doc section with example
- Updated tests with sample bad and good auth examples of proxy config

Fixes #836

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-08-20 16:07:29 -07:00
Ilya Kreymer
a6ad6a0e42 version: bump to 1.7.0 2025-07-31 15:23:42 -07:00
Ilya Kreymer
5c7ff3dfef
deps: bump base to brave 1.80.125 (#875) 2025-07-31 14:51:18 -07:00
Ilya Kreymer
18fe5a9676
behavior logging: remove last line dupe check for behavior logs (#874)
Shouldn't skip multiple log messages, as this is unexpected behavior for
user-defined behaviors.
2025-07-30 16:20:14 -07:00
Tessa Walsh
aba065c8fb
Don't trim to limit if limit is default of 0 (#873)
Fixes #872 

Fix for restarting crawl from saved state, where the default `--limit`
value of 0 was incorrectly preventing any URLs from being re-queued.
2025-07-29 15:48:08 -07:00
Ilya Kreymer
0652a3fb1d
quickfix: WACZ upload retry support: (#871)
- if a failure occurs on upload, and the crawler restarts on error, exit with 'interrupt' to allow for automatic restart (e.g. in the Browsertrix app)
- otherwise, a failed upload will exit the crawl with no WACZ, resulting in overall crawl failure
2025-07-29 15:41:22 -07:00
sua yoo
bc4d649307
Capitalization fix for log messages (#870)
Capitalizes "URL" in log messages.
2025-07-24 23:52:12 -07:00
37 changed files with 1178 additions and 763 deletions

@@ -1,4 +1,4 @@
-ARG BROWSER_VERSION=1.80.122
+ARG BROWSER_VERSION=1.82.170
 ARG BROWSER_IMAGE_BASE=webrecorder/browsertrix-browser-base:brave-${BROWSER_VERSION}

 FROM ${BROWSER_IMAGE_BASE}

@@ -39,7 +39,7 @@ ADD config/ /app/
 ADD html/ /app/html/

-ARG RWP_VERSION=2.3.15
+ARG RWP_VERSION=2.3.19
 ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/ui.js /app/html/rwp/
 ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/sw.js /app/html/rwp/
 ADD https://cdn.jsdelivr.net/npm/replaywebpage@${RWP_VERSION}/adblock/adblock.gz /app/html/rwp/adblock.gz

@@ -266,6 +266,7 @@ Some of these functions which may be of use to behaviors authors are:
 - `scrollToOffset`: scroll to particular offset
 - `scrollIntoView`: smoothly scroll particular element into view
 - `getState`: increment a state counter and return all state counters + string message
+* `addLink`: add a given URL to the crawl queue

 More detailed references will be added in the future.
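For context, here is a minimal, hypothetical sketch of a custom behavior that uses `addLink`, assuming the helper functions listed above are exposed on `ctx.Lib`, as in the test behaviors elsewhere in this diff (the class name and selector are illustrative):

```js
// Hypothetical sketch of a custom behavior using addLink.
// Assumes the documented helpers are available on ctx.Lib.
export class QueueExtraLinks {
  static id = "QueueExtraLinks";

  static isMatch() {
    return window.location.origin === "https://old.webrecorder.net";
  }

  async *run(ctx) {
    const { addLink, getState } = ctx.Lib;
    for (const a of document.querySelectorAll("a[data-extra-link]")) {
      // add the URL to the crawl queue
      await addLink(a.href);
      yield getState(ctx, "links-added");
    }
  }
}
```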

@@ -103,16 +103,16 @@ Options:
 , "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
 ", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
 orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
-atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
-[]]
+atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy", "scope"]
+[default: []]
  --logExcludeContext                  Comma-separated list of contexts to
                                       NOT include in logs
 [array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
 , "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
 ", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
 orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
-atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
-["recorderNetwork","jsError","screencast"]]
+atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy", "scope"]
+[default: ["recorderNetwork","jsError","screencast"]]
  --text                               Extract initial (default) or final t
                                       ext to pages.jsonl or WARC resource
                                       record(s)

@@ -294,6 +294,13 @@ Options:
  --proxyServer                        if set, will use specified proxy ser
                                       ver. Takes precedence over any env v
                                       ar proxy settings                [string]
+ --proxyServerPreferSingleProxy       if set, and both proxyServer and pro
+                                      xyServerConfig are provided, the pro
+                                      xyServer value will be preferred
+                                                      [boolean] [default: false]
+ --proxyServerConfig                  if set, path to yaml/json file that
+                                      configures multiple path servers per
+                                      URL regex                        [string]
  --dryRun                             If true, no archive data is written
                                       to disk, only pages and logs (and op
                                       tionally saved state).          [boolean]

@@ -343,6 +350,8 @@ Options:
                                                           [number] [default: 7]
  --proxyServer             if set, will use specified proxy server. Takes prece
                            dence over any env var proxy settings       [string]
+ --proxyServerConfig       if set, path to yaml/json file that configures multi
+                           ple path servers per URL regex              [string]
  --sshProxyPrivateKeyFile  path to SSH private key for SOCKS5 over SSH proxy co
                            nnection                                    [string]
  --sshProxyKnownHostsFile  path to SSH known hosts file for SOCKS5 over SSH pro

@@ -80,7 +80,55 @@ The above proxy settings also apply to [Browser Profile Creation](browser-profiles.md)
 docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts
 ```

+## Host-Specific Proxies
+
+With the 1.7.0 release, the crawler also supports running with multiple proxies, defined in a separate proxy YAML config file. The file contains a match hosts section, matching hosts by regex to named proxies.
+
+For example, the following YAML file can be passed to the `--proxyConfigFile` option:
+
+```yaml
+matchHosts:
+  # load all URLs from example.com through 'example-1-proxy'
+  example.com/.*: example-1-proxy
+
+  # load all URLS from https://my-social.example.com/.*/posts/ through
+  # a different proxy
+  https://my-social.example.com/.*/posts/: social-proxy
+
+  # optional default proxy
+  "": default-proxy
+
+proxies:
+  # SOCKS5 proxy just needs a URL
+  example-1-proxy: socks5://username:password@my-socks-5-proxy.example.com
+
+  # SSH proxy also should have at least a 'privateKeyFile'
+  social-proxy:
+    url: ssh://user@my-social-proxy.example.com
+    privateKeyFile: /proxies/social-proxy-private-key
+    # optional
+    publicHostsFile: /proxies/social-proxy-public-hosts
+
+  default-proxy:
+    url: ssh://user@my-social-proxy.example.com
+    privateKeyFile: /proxies/default-proxy-private-key
+```
+
+If the above config is stored in `./proxies/proxyConfig.yaml` along with the SSH private keys and known public hosts files, the crawler can be started with:
+
+```sh
+docker run -v $PWD/crawls:/crawls -v $PWD/proxies:/proxies -it webrecorder/browsertrix-crawler --url https://example.com/ --proxyServerConfig /proxies/proxyConfig.yaml
+```
+
+Note that if SSH proxies are provided, an SSH tunnel must be opened for each one before the crawl starts. The crawl will not start if any of the SSH proxy connections fail, even if a host-specific proxy is not actually used. SOCKS5 and HTTP proxy connections are attempted only on first use.
+
+The same `--proxyServerConfig` option can also be used in browser profile creation with the `create-login-profile` command in the same way.
+
+### Proxy Precedence
+
+If both `--proxyServerConfig` and `--proxyServer`/`PROXY_SERVER` env var are specified, the `--proxyServerConfig` option takes precedence on matching hosts. To have the single `--proxyServer` option always take precedence instead, pass the `--proxyServerPreferSingleProxy` option.
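For reference, the crawler implements host matching by generating a proxy auto-config (PAC) script and serving it from an internal HTTP server (see the #837 commit message above and the ProxyPacServer class in the proxy utility diff below). The following is an illustrative sketch of the kind of script generated for a config like the one above, assuming SSH proxies have already been tunneled to local SOCKS5 ports; all hosts and ports here are placeholders:

```js
// Illustrative only: the actual script is generated by the crawler's ProxyPacServer.
function FindProxyForURL(url, host) {
  if (url.match(/example.com\/.*/)) { return "SOCKS5 my-socks-5-proxy.example.com:1080"; }
  if (url.match(/https:\/\/my-social.example.com\/.*\/posts\//)) { return "SOCKS5 localhost:9722"; }

  // default proxy if one is configured, otherwise "DIRECT"
  return "SOCKS5 localhost:9723";
}
```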

@@ -1,6 +1,6 @@
 {
   "name": "browsertrix-crawler",
-  "version": "1.7.0-beta.1",
+  "version": "1.8.1",
   "main": "browsertrix-crawler",
   "type": "module",
   "repository": "https://github.com/webrecorder/browsertrix-crawler",

@@ -17,9 +17,9 @@
   },
   "dependencies": {
     "@novnc/novnc": "1.4.0",
-    "@puppeteer/replay": "^3.1.1",
-    "@webrecorder/wabac": "^2.23.8",
-    "browsertrix-behaviors": "^0.9.1",
+    "@puppeteer/replay": "^3.1.3",
+    "@webrecorder/wabac": "^2.24.1",
+    "browsertrix-behaviors": "^0.9.2",
     "client-zip": "^2.4.5",
     "css-selector-parser": "^3.0.5",
     "fetch-socks": "^1.3.0",

@@ -33,13 +33,13 @@
     "p-queue": "^7.3.4",
     "pixelmatch": "^5.3.0",
     "pngjs": "^7.0.0",
-    "puppeteer-core": "^24.7.2",
+    "puppeteer-core": "^24.22.0",
     "sax": "^1.3.0",
     "sharp": "^0.32.6",
     "tsc": "^2.0.4",
     "undici": "^6.18.2",
     "uuid": "8.3.2",
-    "warcio": "^2.4.4",
+    "warcio": "^2.4.7",
     "ws": "^7.4.4",
     "yargs": "^17.7.2"
   },

@@ -71,7 +71,7 @@
   },
   "resolutions": {
     "wrap-ansi": "7.0.0",
-    "warcio": "^2.4.4",
+    "warcio": "^2.4.7",
     "@novnc/novnc": "1.4.0"
   }
 }

@@ -178,7 +178,6 @@ export class Crawler {
   customBehaviors = "";
   behaviorsChecked = false;
-  behaviorLastLine?: string;

   browser: Browser;
   storage: S3StorageSync | null = null;

@@ -187,6 +186,7 @@ export class Crawler {
   maxHeapTotal = 0;

   proxyServer?: string;
+  proxyPacUrl?: string;

   driver:
     | ((opts: {

@@ -509,7 +509,9 @@ export class Crawler {
     setWARCInfo(this.infoString, this.params.warcInfo);
     logger.info(this.infoString);

-    this.proxyServer = await initProxy(this.params, RUN_DETACHED);
+    const res = await initProxy(this.params, RUN_DETACHED);
+    this.proxyServer = res.proxyServer;
+    this.proxyPacUrl = res.proxyPacUrl;

     this.seeds = await parseSeeds(this.params);
     this.numOriginalSeeds = this.seeds.length;

@@ -667,7 +669,6 @@
     pageUrl: string,
     workerid: WorkerId,
   ) {
-    let behaviorLine;
     let message;
     let details;

@@ -711,11 +712,7 @@
     switch (type) {
       case "info":
-        behaviorLine = JSON.stringify(data);
-        if (behaviorLine !== this.behaviorLastLine) {
-          logger.info(message, details, context);
-          this.behaviorLastLine = behaviorLine;
-        }
+        logger.info(message, details, context);
         break;

       case "error":

@@ -855,9 +852,9 @@ self.__bx_behaviors.selectMainBehavior();
       await this.browser.addInitScript(page, initScript);
     }

-    // only add if running with autoclick behavior
-    if (this.params.behaviors.includes("autoclick")) {
-      // Ensure off-page navigation is canceled while behavior is running
+    // Handle JS dialogs:
+    // - Ensure off-page navigation is canceled while behavior is running
+    // - dismiss close all other dialogs if not blocking unload
     page.on("dialog", async (dialog) => {
       let accepted = true;
       if (dialog.type() === "beforeunload") {

@@ -868,7 +865,8 @@ self.__bx_behaviors.selectMainBehavior();
           await dialog.accept();
         }
       } else {
-        await dialog.accept();
+        // other JS dialog, just dismiss
+        await dialog.dismiss();
       }
       logger.debug("JS Dialog", {
         accepted,

@@ -880,6 +878,8 @@ self.__bx_behaviors.selectMainBehavior();
       });
     });

+    // only add if running with autoclick behavior
+    if (this.params.behaviors.includes("autoclick")) {
       // Close any windows opened during navigation from autoclick
       await cdp.send("Target.setDiscoverTargets", { discover: true });

@@ -1062,60 +1062,45 @@ self.__bx_behaviors.selectMainBehavior();
     data.logDetails = logDetails;
     data.workerid = workerid;

+    let result = false;
+
     if (recorder) {
       try {
         const headers = auth
           ? { Authorization: auth, ...this.headers }
           : this.headers;

-        const result = await timedRun(
-          recorder.directFetchCapture({ url, headers, cdp }),
+        result = await timedRun(
+          recorder.directFetchCapture({
+            url,
+            headers,
+            cdp,
+            state: data,
+            crawler: this,
+          }),
           this.params.pageLoadTimeout,
           "Direct fetch of page URL timed out",
           logDetails,
           "fetch",
         );
-
-        // fetched timed out, already logged, don't retry in browser
-        if (!result) {
-          return;
-        }
-
-        const { fetched, mime, ts } = result;
-
-        if (mime) {
-          data.mime = mime;
-          data.isHTMLPage = isHTMLMime(mime);
-        }
-
-        if (fetched) {
-          data.loadState = LoadState.FULL_PAGE_LOADED;
-          data.status = 200;
-          data.ts = ts || new Date();
-          logger.info(
-            "Direct fetch successful",
-            { url, mime, ...logDetails },
-            "fetch",
-          );
-          return;
-        }
       } catch (e) {
-        if (e instanceof Error && e.message === "response-filtered-out") {
-          // filtered out direct fetch
-          logger.debug(
-            "Direct fetch response not accepted, continuing with browser fetch",
-            logDetails,
-            "fetch",
-          );
-        } else {
-          logger.error(
-            "Direct fetch of page URL failed",
-            { e, ...logDetails },
-            "fetch",
-          );
-          return;
-        }
+        logger.error(
+          "Direct fetch of page URL failed",
+          { e, ...logDetails },
+          "fetch",
+        );
+      }
+
+      if (!result) {
+        logger.debug(
+          "Direct fetch response not accepted, continuing with browser fetch",
+          logDetails,
+          "fetch",
+        );
+      } else {
+        return;
       }
     }

     opts.markPageUsed();
     opts.pageBlockUnload = false;

@@ -1282,7 +1267,11 @@ self.__bx_behaviors.selectMainBehavior();
     }
   }

-  async pageFinished(data: PageState) {
+  async pageFinished(data: PageState, lastErrorText = "") {
+    // not yet finished
+    if (data.asyncLoading) {
+      return;
+    }
     // if page loaded, considered page finished successfully
     // (even if behaviors timed out)
     const { loadState, logDetails, depth, url, pageSkipped } = data;

@@ -1317,11 +1306,28 @@ self.__bx_behaviors.selectMainBehavior();
       await this.serializeConfig();

       if (depth === 0 && this.params.failOnFailedSeed) {
+        let errorCode = ExitCodes.GenericError;
+
+        switch (lastErrorText) {
+          case "net::ERR_SOCKS_CONNECTION_FAILED":
+          case "net::SOCKS_CONNECTION_HOST_UNREACHABLE":
+          case "net::ERR_PROXY_CONNECTION_FAILED":
+          case "net::ERR_TUNNEL_CONNECTION_FAILED":
+            errorCode = ExitCodes.ProxyError;
+            break;
+
+          case "net::ERR_TIMED_OUT":
+          case "net::ERR_INVALID_AUTH_CREDENTIALS":
+            if (this.proxyServer || this.proxyPacUrl) {
+              errorCode = ExitCodes.ProxyError;
+            }
+            break;
+        }
         logger.fatal(
           "Seed Page Load Failed, failing crawl",
           {},
           "general",
-          ExitCodes.GenericError,
+          errorCode,
         );
       }
     }

@@ -1709,7 +1715,8 @@ self.__bx_behaviors.selectMainBehavior();
       emulateDevice: this.emulateDevice,
       swOpt: this.params.serviceWorker,
       chromeOptions: {
-        proxy: this.proxyServer,
+        proxyServer: this.proxyServer,
+        proxyPacUrl: this.proxyPacUrl,
         userAgent: this.emulateDevice.userAgent,
         extraArgs: this.extraChromeArgs(),
       },

@@ -1962,6 +1969,8 @@ self.__bx_behaviors.selectMainBehavior();
       logger.error("Error creating WACZ", e);
       if (!streaming) {
         logger.fatal("Unable to write WACZ successfully");
+      } else if (this.params.restartsOnError) {
+        await this.setStatusAndExit(ExitCodes.UploadFailed, "interrupted");
       }
     }
   }

@@ -2459,30 +2468,30 @@ self.__bx_behaviors.selectMainBehavior();
       this.pageLimit,
     );

-    const logContext = depth === 0 ? "scope" : "links";
-    const logLevel = depth === 0 ? "error" : "debug";
-
     switch (result) {
       case QueueState.ADDED:
-        logger.debug("Queued new page url", { url, ...logDetails }, logContext);
+        logger.debug("Queued new page URL", { url, ...logDetails }, "links");
         return true;

       case QueueState.LIMIT_HIT:
-        logger.logAsJSON(
-          "Page url not queued, at page limit",
+        logger.debug(
+          "Page URL not queued, at page limit",
           { url, ...logDetails },
-          logContext,
-          logLevel,
+          "links",
         );
+        if (!this.limitHit && depth === 0) {
+          logger.error(
+            "Page limit reached when adding URL list, some URLs not crawled.",
+          );
+        }
         this.limitHit = true;
         return false;

       case QueueState.DUPE_URL:
-        logger.logAsJSON(
-          "Page url not queued, already seen",
+        logger.debug(
+          "Page URL not queued, already seen",
           { url, ...logDetails },
-          logContext,
-          logLevel,
+          "links",
         );
         return false;
     }

@@ -16,7 +16,7 @@ import { initStorage } from "./util/storage.js";
 import { CDPSession, Page, PuppeteerLifeCycleEvent } from "puppeteer-core";
 import { getInfoString } from "./util/file_reader.js";
 import { DISPLAY, ExitCodes } from "./util/constants.js";
-import { initProxy } from "./util/proxy.js";
+import { initProxy, loadProxyConfig } from "./util/proxy.js";
 //import { sleep } from "./util/timing.js";

 const profileHTML = fs.readFileSync(

@@ -123,6 +123,12 @@ function initArgs() {
       type: "string",
     },

+    proxyServerConfig: {
+      describe:
+        "if set, path to yaml/json file that configures multiple path servers per URL regex",
+      type: "string",
+    },
+
     sshProxyPrivateKeyFile: {
       describe:
         "path to SSH private key for SOCKS5 over SSH proxy connection",

@@ -161,7 +167,9 @@ async function main() {
   process.on("SIGTERM", () => handleTerminate("SIGTERM"));

-  const proxyServer = await initProxy(params, false);
+  loadProxyConfig(params);
+
+  const { proxyServer, proxyPacUrl } = await initProxy(params, false);

   if (!params.headless) {
     logger.debug("Launching XVFB");

@@ -203,7 +211,8 @@ async function main() {
     headless: params.headless,
     signals: false,
     chromeOptions: {
-      proxy: proxyServer,
+      proxyServer,
+      proxyPacUrl,
       extraArgs: [
         "--window-position=0,0",
         `--window-size=${params.windowSize}`,

@@ -330,7 +339,11 @@ async function createProfile(
   cdp: CDPSession,
   targetFilename = "",
 ) {
-  await cdp.send("Network.clearBrowserCache");
+  try {
+    await cdp.send("Network.clearBrowserCache");
+  } catch (e) {
+    logger.warn("Error clearing cache", e, "browser");
+  }

   await browser.close();

@@ -537,7 +550,8 @@ class InteractiveBrowser {
       return;
     }

-    const cookies = await this.browser.getCookies(this.page);
+    const cookies = await this.browser.getCookies();
+
     for (const cookieOrig of cookies) {
       // eslint-disable-next-line @typescript-eslint/no-explicit-any
       const cookie = cookieOrig as any;

@@ -557,7 +571,7 @@
           cookie.url = url;
         }
       }
-      await this.browser.setCookies(this.page, cookies);
+      await this.browser.setCookies(cookies);

       // eslint-disable-next-line @typescript-eslint/no-explicit-any
     } catch (e: any) {
       logger.error("Save Cookie Error: ", e);

@@ -29,6 +29,7 @@ import {
   logger,
 } from "./logger.js";
 import { SaveState } from "./state.js";
+import { loadProxyConfig } from "./proxy.js";

 // ============================================================================
 export type CrawlerArgs = ReturnType<typeof parseArgs> & {

@@ -641,6 +642,19 @@ class ArgParser {
         type: "string",
       },

+      proxyServerPreferSingleProxy: {
+        describe:
+          "if set, and both proxyServer and proxyServerConfig are provided, the proxyServer value will be preferred",
+        type: "boolean",
+        default: false,
+      },
+
+      proxyServerConfig: {
+        describe:
+          "if set, path to yaml/json file that configures multiple path servers per URL regex",
+        type: "string",
+      },
+
       dryRun: {
         describe:
           "If true, no archive data is written to disk, only pages and logs (and optionally saved state).",

@@ -778,6 +792,8 @@ class ArgParser {
       argv.emulateDevice = { viewport: null };
     }

+    loadProxyConfig(argv);
+
     if (argv.lang) {
       if (!ISO6391.validate(argv.lang)) {
         logger.fatal("Invalid ISO-639-1 country code for --lang: " + argv.lang);

@@ -272,7 +272,9 @@ export class BlockRules {
     logDetails: Record<string, any>,
   ) {
     try {
-      const res = await fetch(reqUrl, { dispatcher: getProxyDispatcher() });
+      const res = await fetch(reqUrl, {
+        dispatcher: getProxyDispatcher(reqUrl),
+      });
       const text = await res.text();

       return !!text.match(frameTextMatch);

@@ -303,7 +305,7 @@
       method: "PUT",
       headers: { "Content-Type": "text/html" },
       body,
-      dispatcher: getProxyDispatcher(),
+      dispatcher: getProxyDispatcher(putUrl.href),
     });
   }
 }

@@ -22,6 +22,7 @@ import puppeteer, {
   Page,
   LaunchOptions,
   Viewport,
+  CookieData,
 } from "puppeteer-core";
 import { CDPSession, Target, Browser as PptrBrowser } from "puppeteer-core";
 import { Recorder } from "./recorder.js";

@@ -29,7 +30,8 @@ import { timedRun } from "./timing.js";
 import assert from "node:assert";

 type BtrixChromeOpts = {
-  proxy?: string;
+  proxyServer?: string;
+  proxyPacUrl?: string;
   userAgent?: string | null;
   extraArgs?: string[];
 };

@@ -243,7 +245,8 @@ export class Browser {
   }

   chromeArgs({
-    proxy = "",
+    proxyServer = "",
+    proxyPacUrl = "",
     userAgent = null,
     extraArgs = [],
   }: BtrixChromeOpts) {

@@ -262,14 +265,14 @@
       ...extraArgs,
     ];

-    if (proxy) {
-      const proxyString = getSafeProxyString(proxy);
+    if (proxyServer) {
+      const proxyString = getSafeProxyString(proxyServer);
       logger.info("Using proxy", { proxy: proxyString }, "browser");
-    }
-
-    if (proxy) {
       args.push("--ignore-certificate-errors");
-      args.push(`--proxy-server=${proxy}`);
+      args.push(`--proxy-server=${proxyServer}`);
+    } else if (proxyPacUrl) {
+      args.push("--proxy-pac-url=" + proxyPacUrl);
     }

     return args;

@@ -614,14 +617,12 @@ export class Browser {
     await page.setViewport(params);
   }

-  async getCookies(page: Page) {
-    return await page.cookies();
+  async getCookies() {
+    return (await this.browser?.cookies()) || [];
   }

-  // TODO: Fix this the next time the file is edited.
-  // eslint-disable-next-line @typescript-eslint/no-explicit-any
-  async setCookies(page: Page, cookies: any) {
-    return await page.setCookie(...cookies);
+  async setCookies(cookies: CookieData[]) {
+    return await this.browser?.setCookie(...cookies);
   }
 }

@@ -81,6 +81,7 @@ export enum ExitCodes {
   DiskUtilization = 16,
   Fatal = 17,
   ProxyError = 21,
+  UploadFailed = 22,
 }

 export enum InterruptReason {

@@ -41,7 +41,7 @@ async function writeUrlContentsToFile(
   pathPrefix: string,
   pathDefaultExt: string,
 ) {
-  const res = await fetch(url, { dispatcher: getProxyDispatcher() });
+  const res = await fetch(url, { dispatcher: getProxyDispatcher(url) });
   const fileContents = await res.text();

   const filename =

@@ -368,7 +368,7 @@ class Flow {
       case StepType.DoubleClick:
         await locator(step)
           .setTimeout(timeout * 1000)
-          //.on('action', () => startWaitingForEvents())
+          .setEnsureElementIsInTheViewport(true)
           .click({
             count: 2,
             button: step.button && mouseButtonMap.get(step.button),

@@ -392,7 +392,7 @@ class Flow {
         await locator(step)
           .setTimeout(timeout * 1000)
-          //.on('action', () => startWaitingForEvents())
+          .setEnsureElementIsInTheViewport(true)
           .click({
             delay: step.duration,
             button: step.button && mouseButtonMap.get(step.button),

@@ -410,7 +410,7 @@ class Flow {
       case StepType.Hover:
         await locator(step)
           .setTimeout(timeout * 1000)
-          //.on('action', () => startWaitingForEvents())
+          .setEnsureElementIsInTheViewport(true)
           .hover();
         break;

@@ -426,15 +426,14 @@ class Flow {
       case StepType.Change:
         await locator(step)
-          //.on('action', () => startWaitingForEvents())
           .setTimeout(timeout * 1000)
+          .setEnsureElementIsInTheViewport(true)
           .fill(step.value);
         break;

       case StepType.Scroll: {
         if ("selectors" in step) {
           await locator(step)
-            //.on('action', () => startWaitingForEvents())
             .setTimeout(timeout * 1000)
             .scroll({
               scrollLeft: step.x || 0,

@@ -48,7 +48,7 @@ export class OriginOverride {

       const resp = await fetch(newUrl, {
         headers,
-        dispatcher: getProxyDispatcher(),
+        dispatcher: getProxyDispatcher(newUrl),
       });

       const body = Buffer.from(await resp.arrayBuffer());

@@ -1,7 +1,9 @@
 import net from "net";
-import { Agent, Dispatcher, ProxyAgent } from "undici";
 import child_process from "child_process";
+import fs from "fs";
+
+import { Agent, Dispatcher, ProxyAgent } from "undici";
+import yaml from "js-yaml";

 import { logger } from "./logger.js";

@@ -9,11 +11,40 @@ import { socksDispatcher } from "fetch-socks";
 import type { SocksProxyType } from "socks/typings/common/constants.js";
 import { ExitCodes, FETCH_HEADERS_TIMEOUT_SECS } from "./constants.js";
+import http, { IncomingMessage, ServerResponse } from "http";

 const SSH_PROXY_LOCAL_PORT = 9722;

 const SSH_WAIT_TIMEOUT = 30000;

-let proxyDispatcher: Dispatcher | undefined = undefined;
+//let proxyDispatcher: Dispatcher | undefined = undefined;
+
+type ProxyEntry = {
+  proxyUrl: string;
+  dispatcher: Dispatcher;
+};
+
+export type ProxyServerConfig = {
+  matchHosts?: Record<string, string>;
+  proxies?: Record<
+    string,
+    string | { url: string; privateKeyFile?: string; publicHostsFile?: string }
+  >;
+};
+
+export type ProxyCLIArgs = {
+  sshProxyPrivateKeyFile?: string;
+  sshProxyKnownHostsFile?: string;
+  sshProxyLocalPort?: number;
+
+  proxyServer?: string;
+  proxyServerPreferSingleProxy?: boolean;
+
+  proxyMap?: ProxyServerConfig;
+};
+
+const proxyMap = new Map<RegExp, ProxyEntry>();
+
+let defaultProxyEntry: ProxyEntry | null = null;

 export function getEnvProxyUrl() {
   if (process.env.PROXY_SERVER) {

@@ -28,6 +59,27 @@ export function getEnvProxyUrl() {
   return "";
 }

+export function loadProxyConfig(params: {
+  proxyServerConfig?: string;
+  proxyMap?: ProxyServerConfig;
+}) {
+  if (params.proxyServerConfig) {
+    const proxyServerConfig = params.proxyServerConfig;
+    try {
+      const proxies = yaml.load(
+        fs.readFileSync(proxyServerConfig, "utf8"),
+        // eslint-disable-next-line @typescript-eslint/no-explicit-any
+      ) as any;
+      params.proxyMap = proxies;
+      logger.debug("Proxy host match config loaded", { proxyServerConfig });
+    } catch (e) {
+      logger.warn("Proxy host match config file not found, ignoring", {
+        proxyServerConfig,
+      });
+    }
+  }
+}
+
 export function getSafeProxyString(proxyString: string): string {
   if (!proxyString) {
     return "";

@@ -54,31 +106,127 @@
 }

 export async function initProxy(
-  // eslint-disable-next-line @typescript-eslint/no-explicit-any
-  params: Record<string, any>,
+  params: ProxyCLIArgs,
   detached: boolean,
-): Promise<string | undefined> {
-  let proxy = params.proxyServer;
+): Promise<{ proxyServer?: string; proxyPacUrl?: string }> {
+  const { sshProxyPrivateKeyFile, sshProxyKnownHostsFile, sshProxyLocalPort } =
+    params;
+
+  let localPort = sshProxyLocalPort || SSH_PROXY_LOCAL_PORT;

-  if (!proxy) {
-    proxy = getEnvProxyUrl();
-  }
+  const singleProxy = params.proxyServer || getEnvProxyUrl();

-  if (proxy && proxy.startsWith("ssh://")) {
-    proxy = await runSSHD(params, detached);
+  if (singleProxy) {
+    defaultProxyEntry = await initSingleProxy(
+      singleProxy,
+      localPort++,
+      detached,
+      sshProxyPrivateKeyFile,
+      sshProxyKnownHostsFile,
+    );
+    if (params.proxyServerPreferSingleProxy && defaultProxyEntry.proxyUrl) {
+      return { proxyServer: defaultProxyEntry.proxyUrl };
+    }
+  }
+
+  if (!params.proxyMap?.matchHosts || !params.proxyMap?.proxies) {
+    if (defaultProxyEntry) {
+      logger.debug("Using Single Proxy", {}, "proxy");
+    }
+    return { proxyServer: defaultProxyEntry?.proxyUrl };
+  }
+
+  const nameToProxy = new Map<string, ProxyEntry>();
+
+  for (const [name, value] of Object.entries(params.proxyMap.proxies)) {
+    let proxyUrl = "";
+    let privateKeyFile: string | undefined = "";
+    let publicHostsFile: string | undefined = "";
+
+    if (typeof value === "string") {
+      proxyUrl = value;
+    } else {
+      proxyUrl = value.url;
+      privateKeyFile = value.privateKeyFile;
+      publicHostsFile = value.publicHostsFile;
+    }
+
+    privateKeyFile = privateKeyFile || sshProxyPrivateKeyFile;
+    publicHostsFile = publicHostsFile || sshProxyKnownHostsFile;
+
+    const entry = await initSingleProxy(
+      proxyUrl,
+      localPort++,
+      detached,
+      privateKeyFile,
+      publicHostsFile,
+    );
+
+    nameToProxy.set(name, entry);
+  }
+
+  for (const [rx, name] of Object.entries(params.proxyMap.matchHosts)) {
+    const entry = nameToProxy.get(name);
+
+    if (!entry) {
+      logger.fatal("Proxy specified but not found in proxies list: " + name);
+      return {};
+    }
+
+    if (rx) {
+      proxyMap.set(new RegExp(rx), entry);
+    } else {
+      defaultProxyEntry = entry;
+    }
+  }
+
+  const p = new ProxyPacServer();
+
+  logger.debug("Using Proxy PAC script", {}, "proxy");
+
+  return { proxyPacUrl: `http://localhost:${p.port}/proxy.pac` };
+}
+
+export async function initSingleProxy(
+  proxyUrl: string,
+  localPort: number,
+  detached: boolean,
+  sshProxyPrivateKeyFile?: string,
+  sshProxyKnownHostsFile?: string,
+): Promise<{ proxyUrl: string; dispatcher: Dispatcher }> {
+  logger.debug("Initing proxy", {
+    url: getSafeProxyString(proxyUrl),
+    localPort,
+    sshProxyPrivateKeyFile,
+    sshProxyKnownHostsFile,
+  });
+
+  if (proxyUrl && proxyUrl.startsWith("ssh://")) {
+    proxyUrl = await runSSHD(
+      proxyUrl,
+      localPort,
+      detached,
+      sshProxyPrivateKeyFile,
+      sshProxyKnownHostsFile,
+    );
   }

   const agentOpts: Agent.Options = {
     headersTimeout: FETCH_HEADERS_TIMEOUT_SECS * 1000,
   };

-  // set global fetch() dispatcher (with proxy, if any)
-  const dispatcher = createDispatcher(proxy, agentOpts);
-  proxyDispatcher = dispatcher;
-  return proxy;
+  const dispatcher = createDispatcher(proxyUrl, agentOpts);
+  return { proxyUrl, dispatcher };
 }

-export function getProxyDispatcher() {
-  return proxyDispatcher;
+export function getProxyDispatcher(url: string) {
+  // find url match by regex first
+  for (const [rx, { dispatcher }] of proxyMap.entries()) {
+    if (rx && url.match(rx)) {
+      return dispatcher;
+    }
+  }
+  // if default proxy set, return default dispatcher, otherwise no dispatcher
+  return defaultProxyEntry ? defaultProxyEntry.dispatcher : undefined;
 }

 export function createDispatcher(

@@ -113,9 +261,13 @@ export function createDispatcher(
   }
 }

-// eslint-disable-next-line @typescript-eslint/no-explicit-any
-export async function runSSHD(params: Record<string, any>, detached: boolean) {
-  const { proxyServer } = params;
+export async function runSSHD(
+  proxyServer: string,
+  localPort: number,
+  detached: boolean,
+  privateKey?: string,
+  publicKnownHost?: string,
+) {
   if (!proxyServer || !proxyServer.startsWith("ssh://")) {
     return "";
   }

@@ -126,17 +278,14 @@ export async function runSSHD(params: Record<string, any>, detached: boolean) {
   const host = proxyServerUrl.hostname.replace("[", "").replace("]", "");
   const port = proxyServerUrl.port || 22;
   const user = proxyServerUrl.username || "root";
-  const localPort = params.sshProxyLocalPort || SSH_PROXY_LOCAL_PORT;
   const proxyString = `socks5://localhost:${localPort}`;

   const args: string[] = [
     user + "@" + host,
     "-p",
-    port,
+    port + "",
     "-D",
-    localPort,
-    "-i",
-    params.sshProxyPrivateKeyFile,
+    localPort + "",
     "-o",
     "IdentitiesOnly=yes",
     "-o",

@@ -146,12 +295,17 @@
     "-o",
   ];

-  if (params.sshProxyKnownHostsFile) {
-    args.push(`UserKnownHostsFile=${params.sshProxyKnownHostsFile}`);
+  if (publicKnownHost) {
+    args.push(`UserKnownHostsFile=${publicKnownHost}`);
   } else {
     args.push("StrictHostKeyChecking=no");
   }

+  if (privateKey) {
+    args.push("-i");
+    args.push(privateKey);
+  }
+
   args.push("-M", "0", "-N", "-T");

   logger.info("Checking SSH connection for proxy...", {}, "proxy");

@@ -221,7 +375,7 @@
       "proxy",
       ExitCodes.ProxyError,
     );
-    return;
+    return "";
   }

   logger.info(

@@ -241,10 +395,61 @@
     },
     "proxy",
   );
-    runSSHD(params, detached).catch((e) =>
-      logger.error("proxy retry error", e, "proxy"),
-    );
+    runSSHD(
+      proxyServer,
+      localPort,
+      detached,
+      privateKey,
+      publicKnownHost,
+    ).catch((e) => logger.error("proxy retry error", e, "proxy"));
   });

   return proxyString;
 }
+
+class ProxyPacServer {
+  port = 20278;
+
+  proxyPacText = "";
+
+  constructor() {
+    const httpServer = http.createServer((req, res) =>
+      this.handleRequest(req, res),
+    );
+    httpServer.listen(this.port);
+    this.generateProxyPac();
+  }
+
+  async handleRequest(request: IncomingMessage, response: ServerResponse) {
+    response.writeHead(200, {
+      "Content-Type": "application/x-ns-proxy-autoconfig",
+    });
+    response.end(this.proxyPacText);
+  }
+
+  generateProxyPac() {
+    const urlToProxy = (proxyUrl: string) => {
+      const url = new URL(proxyUrl);
+      const hostport = url.href.slice(url.protocol.length + 2);
+      const type = url.protocol.slice(0, -1).toUpperCase();
+      return `"${type} ${hostport}"`;
+    };
+
+    this.proxyPacText = `
+function FindProxyForURL(url, host) {
+`;
+    proxyMap.forEach(({ proxyUrl }, k) => {
+      this.proxyPacText += `  if (url.match(/${
+        k.source
+      }/)) { return ${urlToProxy(proxyUrl)}; }\n`;
+    });
+
+    this.proxyPacText += `\n  return ${
+      defaultProxyEntry ? urlToProxy(defaultProxyEntry.proxyUrl) : `"DIRECT"`
+    };
+}
+`;
+  }
+}

File diff suppressed because it is too large

@@ -342,6 +342,7 @@ export async function parseSeeds(params: CrawlerArgs): Promise<ScopedSeed[]> {

   for (const seed of seeds) {
     const newSeed = typeof seed === "string" ? { url: seed } : seed;
+    newSeed.url = removeQuotes(newSeed.url);

     try {
       scopedSeeds.push(new ScopedSeed({ ...scopeOpts, ...newSeed }));

@@ -389,3 +390,14 @@ export function parseRx(
     return value.map((e) => (e instanceof RegExp ? e : new RegExp(e)));
   }
 }
+
+export function removeQuotes(url: string) {
+  url = url.trim();
+  if (
+    (url.startsWith(`"`) && url.endsWith(`"`)) ||
+    (url.startsWith(`'`) && url.endsWith(`'`))
+  ) {
+    url = url.slice(1, -1);
+  }
+  return url;
+}

@@ -68,7 +68,7 @@ export class SitemapReader extends EventEmitter {
     while (true) {
       const resp = await fetch(url, {
         headers: this.headers,
-        dispatcher: getProxyDispatcher(),
+        dispatcher: getProxyDispatcher(url),
       });

       if (resp.ok) {

@@ -85,6 +85,7 @@ export class PageState {
   skipBehaviors = false;
   pageSkipped = false;
+  asyncLoading = false;
   filteredFrames: Frame[] = [];

   loadState: LoadState = LoadState.FAILED;
   contentCheckAllowed = false;

@@ -458,6 +459,10 @@ return inx;
   }

   async trimToLimit(limit: number) {
+    if (limit === 0) {
+      return;
+    }
+
     const totalComplete =
       (await this.numPending()) +
       (await this.numDone()) +

@@ -311,7 +311,7 @@ export class PageWorker {
       }

       await timedRun(
-        this.crawler.pageFinished(data),
+        this.crawler.pageFinished(data, this.recorder?.lastErrorText),
         FINISHED_TIMEOUT,
         "Page Finished Timed Out",
         this.logDetails,

@@ -8,7 +8,7 @@ const testIf = (condition, ...args) => condition ? test(...args) : test.skip(...args);

 test("ensure basic crawl run with docker run passes", async () => {
   child_process.execSync(
-    'docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com/ --generateWACZ --text --collection wr-net --combineWARC --rolloverSize 10000 --workers 2 --title "test title" --description "test description" --warcPrefix custom-prefix',
+    'docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example-com.webrecorder.net/ --generateWACZ --text --collection wr-net --combineWARC --rolloverSize 10000 --workers 2 --title "test title" --description "test description" --warcPrefix custom-prefix',
   );

   child_process.execSync(

@@ -1,6 +1,21 @@
 import child_process from "child_process";
 import Redis from "ioredis";

+let proc = null;
+
+const DOCKER_HOST_NAME = process.env.DOCKER_HOST_NAME || "host.docker.internal";
+const TEST_HOST = `http://${DOCKER_HOST_NAME}:31503`;
+
+beforeAll(() => {
+  proc = child_process.spawn("../../node_modules/.bin/http-server", ["-p", "31503"], {cwd: "tests/custom-behaviors/"});
+});
+
+afterAll(() => {
+  if (proc) {
+    proc.kill();
+  }
+});
+
 async function sleep(time) {
   await new Promise((resolve) => setTimeout(resolve, time));

@@ -9,7 +24,7 @@ async function sleep(time) {

 test("test custom behaviors from local filepath", async () => {
   const res = child_process.execSync(
-    "docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/custom-behaviors/:/custom-behaviors/ webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net/ --url https://example.org/ --url https://old.webrecorder.net/ --customBehaviors /custom-behaviors/ --scopeType page",
+    "docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/custom-behaviors/:/custom-behaviors/ webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net/ --url https://example-com.webrecorder.net/page --url https://old.webrecorder.net/ --customBehaviors /custom-behaviors/ --scopeType page",
   );

   const log = res.toString();

@@ -21,10 +36,10 @@ test("test custom behaviors from local filepath", async () => {
     ) > 0,
   ).toBe(true);

-  // but not for example.org
+  // but not for example.com
   expect(
     log.indexOf(
-      '"logLevel":"info","context":"behaviorScriptCustom","message":"test-stat","details":{"state":{},"behavior":"TestBehavior","page":"https://example.org","workerid":0}}',
+      '"logLevel":"info","context":"behaviorScriptCustom","message":"test-stat","details":{"state":{},"behavior":"TestBehavior","page":"https://example-com.webrecorder.net/page","workerid":0}}',
     ) > 0,
   ).toBe(false);

@@ -37,7 +52,7 @@ test("test custom behaviors from local filepath", async () => {
 });

 test("test custom behavior from URL", async () => {
-  const res = child_process.execSync("docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://old.webrecorder.net/ --customBehaviors https://raw.githubusercontent.com/webrecorder/browsertrix-crawler/refs/heads/main/tests/custom-behaviors/custom-2.js --scopeType page");
+  const res = child_process.execSync(`docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://old.webrecorder.net/ --customBehaviors ${TEST_HOST}/custom-2.js --scopeType page`);

   const log = res.toString();

@@ -51,7 +66,7 @@ test("test custom behavior from URL", async () => {
 });

 test("test mixed custom behavior sources", async () => {
-  const res = child_process.execSync("docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/custom-behaviors/:/custom-behaviors/ webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net/ --url https://old.webrecorder.net/ --customBehaviors https://raw.githubusercontent.com/webrecorder/browsertrix-crawler/refs/heads/main/tests/custom-behaviors/custom-2.js --customBehaviors /custom-behaviors/custom.js --scopeType page");
+  const res = child_process.execSync(`docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/custom-behaviors/:/custom-behaviors/ webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net/ --url https://old.webrecorder.net/ --customBehaviors ${TEST_HOST}/custom-2.js --customBehaviors /custom-behaviors/custom.js --scopeType page`);

   const log = res.toString();

@@ -74,7 +89,7 @@ test("test mixed custom behavior sources", async () => {
 test("test custom behaviors from git repo", async () => {
   const res = child_process.execSync(
-    "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net/ --url https://example.org/ --url https://old.webrecorder.net/ --customBehaviors \"git+https://github.com/webrecorder/browsertrix-crawler.git?branch=main&path=tests/custom-behaviors\" --scopeType page",
+    "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net/ --url https://example-com.webrecorder.net/ --url https://old.webrecorder.net/ --customBehaviors \"git+https://github.com/webrecorder/browsertrix-crawler.git?branch=main&path=tests/custom-behaviors\" --scopeType page",
   );

   const log = res.toString();

@@ -86,10 +101,10 @@ test("test custom behaviors from git repo", async () => {
     ) > 0,
   ).toBe(true);

-  // but not for example.org
+  // but not for example.com
   expect(
     log.indexOf(
-      '"logLevel":"info","context":"behaviorScriptCustom","message":"test-stat","details":{"state":{},"behavior":"TestBehavior","page":"https://example.org/","workerid":0}}',
+      '"logLevel":"info","context":"behaviorScriptCustom","message":"test-stat","details":{"state":{},"behavior":"TestBehavior","page":"https://example-com.webrecorder.net/","workerid":0}}',
     ) > 0,
   ).toBe(false);

@@ -106,7 +121,7 @@ test("test invalid behavior exit", async () => {
   try {
     child_process.execSync(
-      "docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/invalid-behaviors/:/custom-behaviors/ webrecorder/browsertrix-crawler crawl --url https://example.com/ --url https://example.org/ --url https://old.webrecorder.net/ --customBehaviors /custom-behaviors/invalid-export.js --scopeType page",
+      "docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/invalid-behaviors/:/custom-behaviors/ webrecorder/browsertrix-crawler crawl --url https://example-com.webrecorder.net.webrecorder.net/ --url https://example-com.webrecorder.net/ --url https://old.webrecorder.net/ --customBehaviors /custom-behaviors/invalid-export.js --scopeType page",
     );
   } catch (e) {
     status = e.status;

@@ -121,7 +136,7 @@ test("test crawl exits if behavior not fetched from url", async () => {
   try {
     child_process.execSync(
-      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com --customBehaviors https://webrecorder.net/doesntexist/custombehavior.js --scopeType page",
+      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example-com.webrecorder.net --customBehaviors https://webrecorder.net/doesntexist/custombehavior.js --scopeType page",
     );
   } catch (e) {
     status = e.status;

@@ -136,7 +151,7 @@ test("test crawl exits if behavior not fetched from git repo", async () => {
   try {
     child_process.execSync(
-      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com --customBehaviors git+https://github.com/webrecorder/doesntexist --scopeType page",
+      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example-com.webrecorder.net --customBehaviors git+https://github.com/webrecorder/doesntexist --scopeType page",
     );
   } catch (e) {
     status = e.status;

@@ -151,7 +166,7 @@ test("test crawl exits if not custom behaviors collected from local path", async () => {
   try {
     child_process.execSync(
-      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com --customBehaviors /custom-behaviors/doesntexist --scopeType page",
+      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example-com.webrecorder.net --customBehaviors /custom-behaviors/doesntexist --scopeType page",
     );
   } catch (e) {
     status = e.status;

@@ -166,7 +181,7 @@ test("test pushing behavior logs to redis", async () => {
   const redisId = child_process.execSync("docker run --rm --network=crawl -p 36399:6379 --name redis -d redis");

-  const child = child_process.exec("docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/custom-behaviors/:/custom-behaviors/ -e CRAWL_ID=behavior-logs-redis-test --network=crawl --rm webrecorder/browsertrix-crawler crawl --debugAccessRedis --redisStoreUrl redis://redis:6379 --url https://specs.webrecorder.net/ --url https://old.webrecorder.net/ --customBehaviors https://raw.githubusercontent.com/webrecorder/browsertrix-crawler/refs/heads/main/tests/custom-behaviors/custom-2.js --customBehaviors /custom-behaviors/custom.js --scopeType page --logBehaviorsToRedis");
+  const child = child_process.exec(`docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/custom-behaviors/:/custom-behaviors/ -e CRAWL_ID=behavior-logs-redis-test --network=crawl --rm webrecorder/browsertrix-crawler crawl --debugAccessRedis --redisStoreUrl redis://redis:6379 --url https://specs.webrecorder.net/ --url https://old.webrecorder.net/ --customBehaviors ${TEST_HOST}/custom-2.js --customBehaviors /custom-behaviors/custom.js --scopeType page --logBehaviorsToRedis`);

   let resolve = null;
   const crawlFinished = new Promise(r => resolve = r);

@@ -28,7 +28,7 @@
     },
     {
       "type": "change",
-      "value": "https://example.com/",
+      "value": "https://example-com.webrecorder.net/",
       "selectors": [
         [
           "aria/[role=\"main\"]",

@@ -43,8 +43,8 @@ test("test custom selector crawls JS files as pages", async () => {
   ]);

   const expectedExtraPages = new Set([
-    "https://www.iana.org/_js/jquery.js",
-    "https://www.iana.org/_js/iana.js",
+    "https://www.iana.org/static/_js/jquery.js",
+    "https://www.iana.org/static/_js/iana.js",
   ]);

   expect(pages).toEqual(expectedPages);

@@ -71,7 +71,7 @@ test("test valid autoclick selector passes validation", async () => {
   try {
     child_process.execSync(
-      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com/ --clickSelector button --scopeType page",
+      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example-com.webrecorder.net/ --clickSelector button --scopeType page",
     );
   } catch (e) {
     failed = true;

@@ -87,7 +87,7 @@ test("test invalid autoclick selector fails validation, crawl fails", async () => {
   try {
     child_process.execSync(
-      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com/ --clickSelector \",\" --scopeType page",
+      "docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example-com.webrecorder.net/ --clickSelector \",\" --scopeType page",
     );
   } catch (e) {
     status = e.status;


@@ -6,7 +6,7 @@ import { execSync } from "child_process";
 test("ensure exclusion is applied on redirected URL, which contains 'help', so it is not crawled", () => {
   execSync(
-    "docker run -p 9037:9037 -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com/ --exclude help --collection redir-exclude-test --extraHops 1");
+    "docker run -p 9037:9037 -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example-com.webrecorder.net/ --exclude help --collection redir-exclude-test --extraHops 1");
   // no entries besides header
   expect(


@@ -0,0 +1,6 @@
+matchHosts:
+  old.webrecorder.net: socks-proxy
+
+proxies:
+  socks-proxy: socks5://user:passw1rd@proxy-with-auth:1080


@@ -0,0 +1,5 @@
+matchHosts:
+  old.webrecorder.net: socks-proxy
+
+proxies:
+  socks-proxy: socks5://user:passw0rd@proxy-with-auth:1080
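Note: the two fixtures above differ only in the proxy password (passw1rd vs passw0rd), giving the tests a bad-auth and a good-auth variant of the same host-to-proxy mapping. A config like this can be expressed as a proxy PAC script; a minimal illustrative sketch, assuming matchHosts keys are treated as regexes against the request host (not the crawler's actual generated script):

    // illustrative PAC function for the fixture above
    function FindProxyForURL(url, host) {
      // matchHosts key treated as a regex against the request host
      if (/^old\.webrecorder\.net$/.test(host)) {
        // "socks-proxy" resolves to the proxies entry; PAC return strings
        // carry no credentials, so auth is handled separately
        return "SOCKS5 proxy-with-auth:1080";
      }
      return "DIRECT"; // unmatched hosts (e.g. the PDF URL in the tests) go direct
    }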


@@ -1,2 +1,3 @@
-https://webrecorder.net/about/
+https://old.webrecorder.net/about/
 https://specs.webrecorder.net/wacz/1.1.1/
+"https://old.webrecorder.net/faq"


@@ -10,7 +10,7 @@ export class TestBehavior {
   }
   static isMatch() {
-    return window.location.origin === "https://example.com";
+    return window.location.origin === "https://example-com.webrecorder.net";
   }
   async *run(ctx) {


@@ -76,7 +76,7 @@ test("PDF: check that the pages.jsonl file entry contains status code and mime t
   expect(pageH.loadState).toBe(2);
 });
-test("PDF: check that CDX contains one pdf 200, one 301 and one 200, two pageinfo entries", () => {
+test("PDF: check that CDX contains data from two crawls: one pdf 200, one 301 and one 200, two pageinfo entries", () => {
   const filedata = fs.readFileSync(
     "test-crawls/collections/crawl-pdf/indexes/index.cdxj",
     { encoding: "utf-8" },
@@ -90,6 +90,7 @@ test("PDF: check that CDX contains one pdf 200, one 301 and one 200, two pageinf
   expect(cdxj[0].url).toBe(PDF_HTTP);
   expect(cdxj[0].status).toBe("301");
+  // this is duplicated as this is data from two crawls
   expect(cdxj[1].url).toBe(PDF);
   expect(cdxj[1].status).toBe("200");
   expect(cdxj[1].mime).toBe("application/pdf");
@@ -149,7 +150,7 @@ test("XML: check that CDX contains one xml 200, one 301 and one 200, two pageinf
   const lines = filedata.trim().split("\n");
   const cdxj = lines.map(line => JSON.parse(line.split(" ").slice(2).join(" "))).sort((a, b) => a.url < b.url ? -1 : 1);
-  expect(cdxj.length).toBe(6);
+  expect(cdxj.length).toBe(5);
   expect(cdxj[0].url).toBe("https://webrecorder.net/favicon.ico");
@@ -157,18 +158,14 @@ test("XML: check that CDX contains one xml 200, one 301 and one 200, two pageinf
   expect(cdxj[1].status).toBe("200");
   expect(cdxj[1].mime).toBe("application/xml");
-  expect(cdxj[2].url).toBe(XML);
-  expect(cdxj[2].status).toBe("200");
-  expect(cdxj[2].mime).toBe("application/xml");
-  expect(cdxj[3].url).toBe(XML_REDIR);
-  expect(cdxj[3].status).toBe("301");
-  expect(cdxj[4].url).toBe("urn:pageinfo:" + XML);
-  expect(cdxj[4].mime).toBe("application/json");
-  expect(cdxj[5].url).toBe("urn:pageinfo:" + XML_REDIR);
-  expect(cdxj[5].mime).toBe("application/json");
+  expect(cdxj[2].url).toBe(XML_REDIR);
+  expect(cdxj[2].status).toBe("301");
+  expect(cdxj[3].url).toBe("urn:pageinfo:" + XML);
+  expect(cdxj[3].mime).toBe("application/json");
+  expect(cdxj[4].url).toBe("urn:pageinfo:" + XML_REDIR);
+  expect(cdxj[4].mime).toBe("application/json");
 });
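Note: each CDXJ line is a SURT-sorted key and timestamp followed by a JSON payload, which is why the tests split on spaces and rejoin everything from the third field onward before JSON.parse. A small illustration:

    // a CDXJ line: "<surt key> <timestamp> <json>"
    const line =
      'net,webrecorder,specs)/wacz/1.1.1/wacz-2021.pdf 20250101000000 {"url": "https://specs.webrecorder.net/wacz/1.1.1/wacz-2021.pdf", "status": "200", "mime": "application/pdf"}';

    // same parsing as the tests: drop the first two fields, parse the rest as JSON
    const record = JSON.parse(line.split(" ").slice(2).join(" "));
    console.log(record.status, record.mime); // "200" "application/pdf"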


@@ -118,9 +118,9 @@ function validateResourcesIndex(json) {
     { status: 200, mime: "text/css", type: "stylesheet" },
     "https://fonts.googleapis.com/css?family=Source+Code+Pro|Source+Sans+Pro&display=swap":
       { status: 200, mime: "text/css", type: "stylesheet" },
-    "https://fonts.gstatic.com/s/sourcesanspro/v22/6xK3dSBYKcSV-LCoeQqfX1RYOo3qOK7l.woff2":
+    "https://fonts.gstatic.com/s/sourcesanspro/v23/6xK3dSBYKcSV-LCoeQqfX1RYOo3qOK7l.woff2":
       { status: 200, mime: "font/woff2", type: "font" },
-    "https://fonts.gstatic.com/s/sourcesanspro/v22/6xKydSBYKcSV-LCoeQqfX1RYOo3ig4vwlxdu.woff2":
+    "https://fonts.gstatic.com/s/sourcesanspro/v23/6xKydSBYKcSV-LCoeQqfX1RYOo3ig4vwlxdu.woff2":
       { status: 200, mime: "font/woff2", type: "font" },
     "https://old.webrecorder.net/assets/favicon.ico": {
       status: 200,
@@ -161,9 +161,9 @@ function validateResourcesAbout(json) {
       mime: "image/svg+xml",
       type: "image",
     },
-    "https://fonts.gstatic.com/s/sourcesanspro/v22/6xK3dSBYKcSV-LCoeQqfX1RYOo3qOK7l.woff2":
+    "https://fonts.gstatic.com/s/sourcesanspro/v23/6xK3dSBYKcSV-LCoeQqfX1RYOo3qOK7l.woff2":
       { status: 200, mime: "font/woff2", type: "font" },
-    "https://fonts.gstatic.com/s/sourcesanspro/v22/6xKydSBYKcSV-LCoeQqfX1RYOo3ig4vwlxdu.woff2":
+    "https://fonts.gstatic.com/s/sourcesanspro/v23/6xKydSBYKcSV-LCoeQqfX1RYOo3ig4vwlxdu.woff2":
       { status: 200, mime: "font/woff2", type: "font" },
   });
 }


@@ -9,6 +9,8 @@ const SOCKS_PORT = "1080";
 const HTTP_PORT = "3128";
 const WRONG_PORT = "33130";

+const PROXY_EXIT_CODE = 21;
+
 const SSH_PROXY_IMAGE = "linuxserver/openssh-server"

 const PDF = "https://specs.webrecorder.net/wacz/1.1.1/wacz-2021.pdf";
@@ -27,7 +29,7 @@ beforeAll(() => {
   proxyNoAuthId = execSync(`docker run -d --rm --network=proxy-test-net --name proxy-no-auth ${PROXY_IMAGE}`, {encoding: "utf-8"});
-  proxySSHId = execSync(`docker run -d --rm -e DOCKER_MODS=linuxserver/mods:openssh-server-ssh-tunnel -e USER_NAME=user -e PUBLIC_KEY_FILE=/keys/proxy-key.pub -v $PWD/tests/fixtures/proxy-key.pub:/keys/proxy-key.pub --network=proxy-test-net --name ssh-proxy ${SSH_PROXY_IMAGE}`);
+  proxySSHId = execSync(`docker run -d --rm -e DOCKER_MODS=linuxserver/mods:openssh-server-ssh-tunnel -e USER_NAME=user -e PUBLIC_KEY_FILE=/keys/proxy-key.pub -v $PWD/tests/fixtures/proxies/proxy-key.pub:/keys/proxy-key.pub --network=proxy-test-net --name ssh-proxy ${SSH_PROXY_IMAGE}`);
 });

 afterAll(async () => {
@@ -66,7 +68,7 @@ describe("socks5 + https proxy tests", () => {
         status = e.status;
       }
       // auth supported only for SOCKS5
-      expect(status).toBe(scheme === "socks5" ? 0 : 1);
+      expect(status).toBe(scheme === "socks5" ? 0 : PROXY_EXIT_CODE);
     });

     test(`${scheme} proxy, ${type}, wrong auth`, () => {
@@ -77,7 +79,7 @@ describe("socks5 + https proxy tests", () => {
       } catch (e) {
         status = e.status;
       }
-      expect(status).toBe(1);
+      expect(status).toBe(PROXY_EXIT_CODE);
     });

     test(`${scheme} proxy, ${type}, wrong protocol`, () => {
@@ -88,7 +90,8 @@ describe("socks5 + https proxy tests", () => {
       } catch (e) {
         status = e.status;
       }
-      expect(status).toBe(1);
+      // wrong protocol (socks5 for http) causes connection to hang, causes a timeout, so just errors with 1
+      expect(status === PROXY_EXIT_CODE || status === 1).toBe(true);
     });
   }
@@ -100,7 +103,7 @@ describe("socks5 + https proxy tests", () => {
     } catch (e) {
       status = e.status;
     }
-    expect(status).toBe(1);
+    expect(status).toBe(PROXY_EXIT_CODE);
   });
  }
 });
@@ -118,7 +121,7 @@ test("http proxy set, but not running, separate env vars", () => {
   } catch (e) {
     status = e.status;
   }
-  expect(status).toBe(1);
+  expect(status).toBe(PROXY_EXIT_CODE);
 });

 test("http proxy set, but not running, cli arg", () => {
@@ -129,12 +132,12 @@ test("http proxy set, but not running, cli arg", () => {
   } catch (e) {
     status = e.status;
   }
-  expect(status).toBe(1);
+  expect(status).toBe(PROXY_EXIT_CODE);
 });

 test("ssh socks proxy with custom user", () => {
-  execSync(`docker run --rm --network=proxy-test-net -v $PWD/tests/fixtures/proxy-key:/keys/proxy-key webrecorder/browsertrix-crawler crawl --proxyServer ssh://user@ssh-proxy:2222 --sshProxyPrivateKeyFile /keys/proxy-key --url ${HTML} ${extraArgs}`, {encoding: "utf-8"});
+  execSync(`docker run --rm --network=proxy-test-net -v $PWD/tests/fixtures/proxies/proxy-key:/keys/proxy-key webrecorder/browsertrix-crawler crawl --proxyServer ssh://user@ssh-proxy:2222 --sshProxyPrivateKeyFile /keys/proxy-key --url ${HTML} ${extraArgs}`, {encoding: "utf-8"});
 });
@@ -146,7 +149,7 @@ test("ssh socks proxy, wrong user", () => {
   } catch (e) {
     status = e.status;
   }
-  expect(status).toBe(21);
+  expect(status).toBe(PROXY_EXIT_CODE);
 });
@@ -164,4 +167,30 @@ test("ensure logged proxy string does not include any credentials", () => {
 });
test("proxy with config file, wrong auth or no match", () => {
let status = 0;
try {
execSync(`docker run --rm --network=proxy-test-net -v $PWD/tests/fixtures/proxies/:/proxies/ webrecorder/browsertrix-crawler crawl --proxyServerConfig /proxies/proxy-test-bad-auth.pac --url ${HTML} ${extraArgs}`, {encoding: "utf-8"});
} catch (e) {
status = e.status;
}
expect(status).toBe(PROXY_EXIT_CODE);
// success, no match for PDF
execSync(`docker run --rm --network=proxy-test-net -v $PWD/tests/fixtures/proxies/:/proxies/ webrecorder/browsertrix-crawler crawl --proxyServerConfig /proxies/proxy-test-bad-auth.pac --url ${PDF} ${extraArgs}`, {encoding: "utf-8"});
});
test("proxy with config file, correct auth or no match", () => {
let status = 0;
try {
execSync(`docker run --rm --network=proxy-test-net -v $PWD/tests/fixtures/proxies/:/proxies/ webrecorder/browsertrix-crawler crawl --proxyServerConfig /proxies/proxy-test-good-auth.pac --url ${HTML} ${extraArgs}`, {encoding: "utf-8"});
} catch (e) {
status = e.status;
}
expect(status).toBe(0);
// success, no match for PDF
execSync(`docker run --rm --network=proxy-test-net -v $PWD/tests/fixtures/proxies/:/proxies/ webrecorder/browsertrix-crawler crawl --proxyServerConfig /proxies/proxy-test-good-auth.pac --url ${PDF} ${extraArgs}`, {encoding: "utf-8"});
});


@@ -38,7 +38,7 @@ afterAll(() => {
 test("run crawl with retries for no response", async () => {
-  execSync(`docker run -d -v $PWD/test-crawls:/crawls -e CRAWL_ID=test -p 36387:6379 --rm webrecorder/browsertrix-crawler crawl --url http://invalid-host-x:31501 --url https://example.com/ --limit 2 --pageExtraDelay 10 --debugAccessRedis --collection retry-fail --retries 5`);
+  execSync(`docker run -d -v $PWD/test-crawls:/crawls -e CRAWL_ID=test -p 36387:6379 --rm webrecorder/browsertrix-crawler crawl --url http://invalid-host-x:31501 --url https://example-com.webrecorder.net/ --limit 2 --pageExtraDelay 10 --debugAccessRedis --collection retry-fail --retries 5`);

   const redis = new Redis("redis://127.0.0.1:36387/0", { lazyConnect: true, retryStrategy: () => null });
@@ -90,7 +90,7 @@ test("run crawl with retries for 503, enough retries to succeed", async () => {
   requests = 0;
   success = false;

-  const child = exec(`docker run -v $PWD/test-crawls:/crawls --rm webrecorder/browsertrix-crawler crawl --url http://${DOCKER_HOST_NAME}:31501 --url https://example.com/ --limit 2 --collection retry-fail-2 --retries 2 --failOnInvalidStatus --failOnFailedSeed --logging stats,debug`);
+  const child = exec(`docker run -v $PWD/test-crawls:/crawls --rm webrecorder/browsertrix-crawler crawl --url http://${DOCKER_HOST_NAME}:31501 --url https://example-com.webrecorder.net/ --limit 2 --collection retry-fail-2 --retries 2 --failOnInvalidStatus --failOnFailedSeed --logging stats,debug`);

   let status = 0;
@@ -117,7 +117,7 @@ test("run crawl with retries for 503, not enough retries, fail", async () => {
   requests = 0;
   success = false;

-  const child = exec(`docker run -v $PWD/test-crawls:/crawls --rm webrecorder/browsertrix-crawler crawl --url http://${DOCKER_HOST_NAME}:31501 --url https://example.com/ --limit 2 --collection retry-fail-3 --retries 1 --failOnInvalidStatus --failOnFailedSeed --logging stats,debug`);
+  const child = exec(`docker run -v $PWD/test-crawls:/crawls --rm webrecorder/browsertrix-crawler crawl --url http://${DOCKER_HOST_NAME}:31501 --url https://example-com.webrecorder.net/ --limit 2 --collection retry-fail-3 --retries 1 --failOnInvalidStatus --failOnFailedSeed --logging stats,debug`);

   let status = 0;
@@ -143,7 +143,7 @@ test("run crawl with retries for 503, no retries, fail", async () => {
   requests = 0;
   success = false;

-  const child = exec(`docker run -v $PWD/test-crawls:/crawls --rm webrecorder/browsertrix-crawler crawl --url http://${DOCKER_HOST_NAME}:31501 --url https://example.com/ --limit 2 --collection retry-fail-4 --retries 0 --failOnInvalidStatus --failOnFailedSeed --logging stats,debug`);
+  const child = exec(`docker run -v $PWD/test-crawls:/crawls --rm webrecorder/browsertrix-crawler crawl --url http://${DOCKER_HOST_NAME}:31501 --url https://example-com.webrecorder.net/ --limit 2 --collection retry-fail-4 --retries 0 --failOnInvalidStatus --failOnFailedSeed --logging stats,debug`);

   let status = 0;
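
Note: the 503 tests above crawl http://${DOCKER_HOST_NAME}:31501 and reset requests/success counters, but the fixture server behind that port is not shown in this hunk. A minimal sketch of a server that fails twice and then succeeds, so --retries 2 passes while --retries 1 and --retries 0 fail (illustrative; the actual test harness may differ):

    import http from "http";

    let requests = 0;

    // return 503 for the first two hits, then 200
    http.createServer((req, res) => {
      requests += 1;
      if (requests <= 2) {
        res.writeHead(503, { "Content-Type": "text/html" });
        res.end("<html><body>try again</body></html>");
      } else {
        res.writeHead(200, { "Content-Type": "text/html" });
        res.end("<html><body>ok</body></html>");
      }
    }).listen(31501);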


@@ -1,13 +1,30 @@
 import util from "util";
-import { exec as execCallback } from "child_process";
+import { spawn, exec as execCallback } from "child_process";
 import fs from "fs";

 const exec = util.promisify(execCallback);

+let proc = null;
+
+const DOCKER_HOST_NAME = process.env.DOCKER_HOST_NAME || "host.docker.internal";
+const TEST_HOST = `http://${DOCKER_HOST_NAME}:31502`;
+
+beforeAll(() => {
+  proc = spawn("../../node_modules/.bin/http-server", ["-p", "31502"], {cwd: "tests/fixtures/"});
+});
+
+afterAll(() => {
+  if (proc) {
+    proc.kill();
+  }
+});
+
 test("check that URLs in seed-list are crawled", async () => {
   try {
     await exec(
-      "docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/fixtures:/tests/fixtures webrecorder/browsertrix-crawler crawl --collection filelisttest --urlFile /tests/fixtures/urlSeedFile.txt --timeout 90000",
+      "docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/fixtures:/tests/fixtures webrecorder/browsertrix-crawler crawl --collection filelisttest --urlFile /tests/fixtures/urlSeedFile.txt --timeout 90000 --scopeType page",
     );
   } catch (error) {
     console.log(error);
@@ -43,7 +60,7 @@ test("check that URLs in seed-list are crawled", async () => {
 test("check that URLs in seed-list hosted at URL are crawled", async () => {
   try {
     await exec(
-      'docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/fixtures:/tests/fixtures webrecorder/browsertrix-crawler crawl --collection onlinefilelisttest --urlFile "https://raw.githubusercontent.com/webrecorder/browsertrix-crawler/refs/heads/main/tests/fixtures/urlSeedFile.txt" --timeout 90000',
+      `docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/fixtures:/tests/fixtures webrecorder/browsertrix-crawler crawl --collection onlinefilelisttest --urlFile "${TEST_HOST}/urlSeedFile.txt" --timeout 90000 --scopeType page`,
     );
   } catch (error) {
     console.log(error);

yarn.lock

@@ -772,17 +772,17 @@
     tslib "^2.7.0"
     tsyringe "^4.8.0"

-"@puppeteer/browsers@2.10.2":
-  version "2.10.2"
-  resolved "https://registry.yarnpkg.com/@puppeteer/browsers/-/browsers-2.10.2.tgz#c2a63cee699c6b5b971b9fcba9095098970f1648"
-  integrity sha512-i4Ez+s9oRWQbNjtI/3+jxr7OH508mjAKvza0ekPJem0ZtmsYHP3B5dq62+IaBHKaGCOuqJxXzvFLUhJvQ6jtsQ==
+"@puppeteer/browsers@2.10.10":
+  version "2.10.10"
+  resolved "https://registry.yarnpkg.com/@puppeteer/browsers/-/browsers-2.10.10.tgz#f806f92d966918c931fb9c48052eba2db848beaa"
+  integrity sha512-3ZG500+ZeLql8rE0hjfhkycJjDj0pI/btEh3L9IkWUYcOrgP0xCNRq3HbtbqOPbvDhFaAWD88pDFtlLv8ns8gA==
   dependencies:
-    debug "^4.4.0"
+    debug "^4.4.3"
     extract-zip "^2.0.1"
     progress "^2.0.3"
     proxy-agent "^6.5.0"
-    semver "^7.7.1"
-    tar-fs "^3.0.8"
+    semver "^7.7.2"
+    tar-fs "^3.1.0"
     yargs "^17.7.2"

 "@puppeteer/browsers@2.8.0":
@@ -798,10 +798,10 @@
     tar-fs "^3.0.8"
     yargs "^17.7.2"

-"@puppeteer/replay@^3.1.1":
-  version "3.1.1"
-  resolved "https://registry.yarnpkg.com/@puppeteer/replay/-/replay-3.1.1.tgz#ada5412c5330ba22e3186ed4b622d26ac89bf564"
-  integrity sha512-8tW1APEoqkpPVH19wRPqePb+/wbGuSVxE2OeRySKeb2SX1VpL2TuADodETRVGYYe07gBbs8FucaUu09A0QI7+w==
+"@puppeteer/replay@^3.1.3":
+  version "3.1.3"
+  resolved "https://registry.yarnpkg.com/@puppeteer/replay/-/replay-3.1.3.tgz#24178c5aa28af1c1b47d39043d62dd722680b55e"
+  integrity sha512-chqKAKoVDtqXAFib93So2W+KHdd1RZ/yfOgXW+u0+BQaElTLVe+OpaLzEn+MIWfIkakhBHE5/tP0/CFQMVydQQ==
   dependencies:
     cli-table3 "0.6.5"
     colorette "2.0.20"
@@ -1134,16 +1134,16 @@
   resolved "https://registry.yarnpkg.com/@ungap/structured-clone/-/structured-clone-1.2.0.tgz#756641adb587851b5ccb3e095daf27ae581c8406"
   integrity sha512-zuVdFrMJiuCDQUMCzQaD6KL28MjnqqN8XnAqiEq9PNm/hCPTSGfrXCOfwj1ow4LFb/tNymJPwsNbVePc1xFqrQ==

-"@webrecorder/wabac@^2.23.8":
-  version "2.23.8"
-  resolved "https://registry.yarnpkg.com/@webrecorder/wabac/-/wabac-2.23.8.tgz#a3eb1e605acb706b6f043ec9e7fae9ff412ccc8a"
-  integrity sha512-+ShHsaBHwFC0SPFTpMWrwJHd47MzT6o1Rg12FSfGfpycrcmrBV447+JR28NitLJIsfcIif8xAth9Vh5Z7tHWlQ==
+"@webrecorder/wabac@^2.24.1":
+  version "2.24.1"
+  resolved "https://registry.yarnpkg.com/@webrecorder/wabac/-/wabac-2.24.1.tgz#4cf2423a8a593410eabc7cb84041331d39081a96"
+  integrity sha512-n3MwHpPNbU1LrwZjlax9UJVvYwfYAiYQDjzAQbeE6SrAU/YFGgD3BthLCaHP5YyIvFjIKtUpfxbsxHYRqNAyxg==
   dependencies:
     "@peculiar/asn1-ecc" "^2.3.4"
     "@peculiar/asn1-schema" "^2.3.3"
     "@peculiar/x509" "^1.9.2"
     "@types/js-levenshtein" "^1.1.3"
-    "@webrecorder/wombat" "^3.8.14"
+    "@webrecorder/wombat" "^3.9.1"
     acorn "^8.10.0"
     auto-js-ipfs "^2.1.1"
     base64-js "^1.5.1"
@@ -1151,7 +1151,6 @@
     buffer "^6.0.3"
     fast-xml-parser "^4.4.1"
     hash-wasm "^4.9.0"
-    http-link-header "^1.1.3"
     http-status-codes "^2.1.4"
     idb "^7.1.1"
     js-levenshtein "^1.1.6"
@@ -1162,14 +1161,14 @@
     path-parser "^6.1.0"
     process "^0.11.10"
     stream-browserify "^3.0.0"
-    warcio "^2.4.3"
+    warcio "^2.4.7"

-"@webrecorder/wombat@^3.8.14":
-  version "3.8.14"
-  resolved "https://registry.yarnpkg.com/@webrecorder/wombat/-/wombat-3.8.14.tgz#fde951519ed9ab8271107a013fc1abd6a9997424"
-  integrity sha512-1CaL8Oel02V321SS+wOomV+cSDo279eVEAuiamO9jl9YoijRsGL9z/xZKE6sz6npLltE3YYziEBYO81xnaeTcA==
+"@webrecorder/wombat@^3.9.1":
+  version "3.9.1"
+  resolved "https://registry.yarnpkg.com/@webrecorder/wombat/-/wombat-3.9.1.tgz#266135612e8063fa6b453f45d37d2c94e7be93d6"
+  integrity sha512-NX7vYQxulVRPgZk4ok9JbrUsf0dct2f34D/B1ZUCcB4M9aTKDhDAxwoIJbMha4DLhQlPcPp2wjH5/uJtPvtsXQ==
   dependencies:
-    warcio "^2.4.0"
+    warcio "^2.4.7"

 "@zxing/text-encoding@0.9.0":
   version "0.9.0"
@@ -1595,10 +1594,10 @@ browserslist@^4.24.0:
     node-releases "^2.0.18"
     update-browserslist-db "^1.1.1"

-browsertrix-behaviors@^0.9.1:
-  version "0.9.1"
-  resolved "https://registry.yarnpkg.com/browsertrix-behaviors/-/browsertrix-behaviors-0.9.1.tgz#55bf51e43ddd88b3261e5ca570019415542fa0cc"
-  integrity sha512-NcEcg0sQmKlIy4PesZa9Vr7HRmjCLGF7I278SamQE54WiHyNzKiRQnYVA2A/MloGoSM1mQQq/oPXQ0znx1gUXQ==
+browsertrix-behaviors@^0.9.2:
+  version "0.9.2"
+  resolved "https://registry.yarnpkg.com/browsertrix-behaviors/-/browsertrix-behaviors-0.9.2.tgz#b5bee47d15014a05a873d8cc6ea8917bfa61d5c8"
+  integrity sha512-d7rLNKXaiD83S4uXKBUf2x9UzmMjbrqKoO820KVqzWtlpzqnXFUsqN/wKvMSiNbDzmL1+G9Um7Gwb1AjD0djCw==
   dependencies:
     query-selector-shadow-dom "^1.0.1"
@@ -1712,10 +1711,10 @@ chromium-bidi@2.1.2:
     mitt "^3.0.1"
     zod "^3.24.1"

-chromium-bidi@4.1.1:
-  version "4.1.1"
-  resolved "https://registry.yarnpkg.com/chromium-bidi/-/chromium-bidi-4.1.1.tgz#e1c34154ddd94473f180fd15158a24d36049e3d5"
-  integrity sha512-biR7t4vF3YluE6RlMSk9IWk+b9U+WWyzHp+N2pL9vRTk+UXHYRTVp7jTK58ZNzMLBgoLMHY4QyJMbeuw3eKxqg==
+chromium-bidi@8.0.0:
+  version "8.0.0"
+  resolved "https://registry.yarnpkg.com/chromium-bidi/-/chromium-bidi-8.0.0.tgz#d73c9beed40317adf2bcfeb9a47087003cd467ec"
+  integrity sha512-d1VmE0FD7lxZQHzcDUCKZSNRtRwISXDsdg4HjdTR5+Ll5nQ/vzU12JeNmupD6VWffrPSlrnGhEWlLESKH3VO+g==
   dependencies:
     mitt "^3.0.1"
     zod "^3.24.1"
@@ -1947,6 +1946,13 @@ debug@^4.4.0:
   dependencies:
     ms "^2.1.3"

+debug@^4.4.3:
+  version "4.4.3"
+  resolved "https://registry.yarnpkg.com/debug/-/debug-4.4.3.tgz#c6ae432d9bd9662582fce08709b038c58e9e3d6a"
+  integrity sha512-RGwwWnwQvkVfavKVt22FGLw+xYSdzARwm0ru6DhTVA3umU5hZc28V3kO4stgYryrTlLpuvgI9GiijltAjNbcqA==
+  dependencies:
+    ms "^2.1.3"
+
 decimal.js@^10.4.3:
   version "10.5.0"
   resolved "https://registry.yarnpkg.com/decimal.js/-/decimal.js-10.5.0.tgz#0f371c7cf6c4898ce0afb09836db73cd82010f22"
@@ -2036,16 +2042,16 @@ devtools-protocol@0.0.1413902:
   resolved "https://registry.yarnpkg.com/devtools-protocol/-/devtools-protocol-0.0.1413902.tgz#a0f00fe9eb25ab337a8f9656a29e0a1a69f42401"
   integrity sha512-yRtvFD8Oyk7C9Os3GmnFZLu53yAfsnyw1s+mLmHHUK0GQEc9zthHWvS1r67Zqzm5t7v56PILHIVZ7kmFMaL2yQ==

-devtools-protocol@0.0.1425554:
-  version "0.0.1425554"
-  resolved "https://registry.yarnpkg.com/devtools-protocol/-/devtools-protocol-0.0.1425554.tgz#51ed2fed1405f56783d24a393f7c75b6bbb58029"
-  integrity sha512-uRfxR6Nlzdzt0ihVIkV+sLztKgs7rgquY/Mhcv1YNCWDh5IZgl5mnn2aeEnW5stYTE0wwiF4RYVz8eMEpV1SEw==
-
 devtools-protocol@0.0.1436416:
   version "0.0.1436416"
   resolved "https://registry.yarnpkg.com/devtools-protocol/-/devtools-protocol-0.0.1436416.tgz#ce8af8a210b8bcac83c5c8f095b9f977a9570df0"
   integrity sha512-iGLhz2WOrlBLcTcoVsFy5dPPUqILG6cc8MITYd5lV6i38gWG14bMXRH/d8G5KITrWHBnbsOnWHfc9Qs4/jej9Q==

+devtools-protocol@0.0.1495869:
+  version "0.0.1495869"
+  resolved "https://registry.yarnpkg.com/devtools-protocol/-/devtools-protocol-0.0.1495869.tgz#f68daef77a48d5dcbcdd55dbfa3265a51989c91b"
+  integrity sha512-i+bkd9UYFis40RcnkW7XrOprCujXRAHg62IVh/Ah3G8MmNXpCGt1m0dTFhSdx/AVs8XEMbdOGRwdkR1Bcta8AA==
+
 diff-sequences@^29.6.3:
   version "29.6.3"
   resolved "https://registry.yarnpkg.com/diff-sequences/-/diff-sequences-29.6.3.tgz#4deaf894d11407c51efc8418012f9e70b84ea921"
@@ -2834,7 +2840,7 @@ html-escaper@^2.0.0:
   resolved "https://registry.yarnpkg.com/html-escaper/-/html-escaper-2.0.2.tgz#dfd60027da36a36dfcbe236262c00a5822681453"
   integrity sha512-H2iMtd0I4Mt5eYiapRdIDjp+XzelXQ0tFE4JS7YFwFevXXMmOp9myNrUvCg0D6ws8iqkRPBfKHgbwig1SmlLfg==

-http-link-header@^1.1.1, http-link-header@^1.1.3:
+http-link-header@^1.1.1:
   version "1.1.3"
   resolved "https://registry.yarnpkg.com/http-link-header/-/http-link-header-1.1.3.tgz#b367b7a0ad1cf14027953f31aa1df40bb433da2a"
   integrity sha512-3cZ0SRL8fb9MUlU3mKM61FcQvPfXx2dBrZW3Vbg5CXa8jFlK8OaEpePenLe1oEXQduhz8b0QjsqfS59QP4AJDQ==
@@ -4549,17 +4555,18 @@ puppeteer-core@24.4.0, puppeteer-core@^24.4.0:
     typed-query-selector "^2.12.0"
     ws "^8.18.1"

-puppeteer-core@^24.7.2:
-  version "24.7.2"
-  resolved "https://registry.yarnpkg.com/puppeteer-core/-/puppeteer-core-24.7.2.tgz#734e377a5634ce1e419fa3ce20ad297a7e1a99ff"
-  integrity sha512-P9pZyTmJqKODFCnkZgemCpoFA4LbAa8+NumHVQKyP5X9IgdNS1ZnAnIh1sMAwhF8/xEUGf7jt+qmNLlKieFw1Q==
+puppeteer-core@^24.22.0:
+  version "24.22.0"
+  resolved "https://registry.yarnpkg.com/puppeteer-core/-/puppeteer-core-24.22.0.tgz#4d576b1a2b7699c088d3f0e843c32d81df82c3a6"
+  integrity sha512-oUeWlIg0pMz8YM5pu0uqakM+cCyYyXkHBxx9di9OUELu9X9+AYrNGGRLK9tNME3WfN3JGGqQIH3b4/E9LGek/w==
   dependencies:
-    "@puppeteer/browsers" "2.10.2"
-    chromium-bidi "4.1.1"
-    debug "^4.4.0"
-    devtools-protocol "0.0.1425554"
+    "@puppeteer/browsers" "2.10.10"
+    chromium-bidi "8.0.0"
+    debug "^4.4.3"
+    devtools-protocol "0.0.1495869"
     typed-query-selector "^2.12.0"
-    ws "^8.18.1"
+    webdriver-bidi-protocol "0.2.11"
+    ws "^8.18.3"

 puppeteer@^24.4.0:
   version "24.4.0"
@@ -4834,6 +4841,11 @@ semver@^7.7.1:
   resolved "https://registry.yarnpkg.com/semver/-/semver-7.7.1.tgz#abd5098d82b18c6c81f6074ff2647fd3e7220c9f"
   integrity sha512-hlq8tAfn0m/61p4BVRcPzIGr6LKiMwo4VM6dGi6pt4qcRkmNzTcWq6eCEjEh+qXjkMDvPlOFFSGwQjoEa6gyMA==

+semver@^7.7.2:
+  version "7.7.2"
+  resolved "https://registry.yarnpkg.com/semver/-/semver-7.7.2.tgz#67d99fdcd35cec21e6f8b87a7fd515a33f982b58"
+  integrity sha512-RF0Fw+rO5AMf9MAyaRXI4AV0Ulj5lMHqVxxdSgiVbixSCXoEmmX/jk0CuJw4+3SqroYO9VoUh+HcuJivvtJemA==
+
 set-function-length@^1.2.1:
   version "1.2.2"
   resolved "https://registry.yarnpkg.com/set-function-length/-/set-function-length-1.2.2.tgz#aac72314198eaed975cf77b2c3b6b880695e5449"
@@ -5196,6 +5208,17 @@ tar-fs@^3.0.8:
     bare-fs "^4.0.1"
     bare-path "^3.0.0"

+tar-fs@^3.1.0:
+  version "3.1.1"
+  resolved "https://registry.yarnpkg.com/tar-fs/-/tar-fs-3.1.1.tgz#4f164e59fb60f103d472360731e8c6bb4a7fe9ef"
+  integrity sha512-LZA0oaPOc2fVo82Txf3gw+AkEd38szODlptMYejQUhndHMLQ9M059uXR+AfS7DNo0NpINvSqDsvyaCrBVkptWg==
+  dependencies:
+    pump "^3.0.0"
+    tar-stream "^3.1.5"
+  optionalDependencies:
+    bare-fs "^4.0.1"
+    bare-path "^3.0.0"
+
 tar-stream@^2.1.4:
   version "2.2.0"
   resolved "https://registry.yarnpkg.com/tar-stream/-/tar-stream-2.2.0.tgz#acad84c284136b060dc3faa64474aa9aebd77287"
@@ -5527,10 +5550,10 @@ walker@^1.0.8:
   dependencies:
     makeerror "1.0.12"

-warcio@^2.4.0, warcio@^2.4.3, warcio@^2.4.4:
-  version "2.4.4"
-  resolved "https://registry.yarnpkg.com/warcio/-/warcio-2.4.4.tgz#6c0c030bb55c0f0b824f854fa9e6718ca25d333d"
-  integrity sha512-FrWOhv1qLNhPBPGEMm24Yo+DtkipK5DxK3ckVGbOf0OJ/UqaxAhiiby74q+GW70dsJV0wF+RA1ToK6CKseTshA==
+warcio@^2.4.7:
+  version "2.4.7"
+  resolved "https://registry.yarnpkg.com/warcio/-/warcio-2.4.7.tgz#7c3918463e550f62fe63df5f76a871424e74097a"
+  integrity sha512-WGRqvoUqSalAkx+uJ8xnrxiiSPZ7Ru/h7iKC2XmuMMSOUSnS917l4V+qpaN9thAsZkZ+8qJRtee3uyOjlq4Dgg==
   dependencies:
     "@types/pako" "^1.0.7"
     "@types/stream-buffers" "^3.0.7"
@@ -5550,6 +5573,11 @@ web-encoding@^1.1.5:
   optionalDependencies:
     "@zxing/text-encoding" "0.9.0"

+webdriver-bidi-protocol@0.2.11:
+  version "0.2.11"
+  resolved "https://registry.yarnpkg.com/webdriver-bidi-protocol/-/webdriver-bidi-protocol-0.2.11.tgz#dba18d9b0a33aed33fab272dbd6e42411ac753cc"
+  integrity sha512-Y9E1/oi4XMxcR8AT0ZC4OvYntl34SPgwjmELH+owjBr0korAX4jKgZULBWILGCVGdVCQ0dodTToIETozhG8zvA==
+
 whatwg-encoding@^2.0.0:
   version "2.0.0"
   resolved "https://registry.yarnpkg.com/whatwg-encoding/-/whatwg-encoding-2.0.0.tgz#e7635f597fd87020858626805a2729fa7698ac53"
@@ -5662,6 +5690,11 @@ ws@^8.18.1:
   resolved "https://registry.yarnpkg.com/ws/-/ws-8.18.1.tgz#ea131d3784e1dfdff91adb0a4a116b127515e3cb"
   integrity sha512-RKW2aJZMXeMxVpnZ6bck+RswznaxmzdULiBr6KY7XkTnW8uvt0iT9H5DkHUChXrc+uurzwa0rVI16n/Xzjdz1w==

+ws@^8.18.3:
+  version "8.18.3"
+  resolved "https://registry.yarnpkg.com/ws/-/ws-8.18.3.tgz#b56b88abffde62791c639170400c93dcb0c95472"
+  integrity sha512-PEIGCY5tSlUt50cqyMXfCzX+oOPqN0vuGqWzbcJ2xvnkzkq46oOpz7dQaTDBdfICb4N14+GARUDw2XV2N4tvzg==
+
 xdg-basedir@^4.0.0:
   version "4.0.0"
   resolved "https://registry.yarnpkg.com/xdg-basedir/-/xdg-basedir-4.0.0.tgz#4bc8d9984403696225ef83a1573cbbcb4e79db13"