mirror of
https://github.com/webrecorder/browsertrix-crawler.git
synced 2025-10-19 14:33:17 +00:00
Profiles: Support for running with existing profiles + saving profile after a login (#34)
Support for profiles via a mounted .tar.gz and --profile option + improved docs #18

* support creating profiles via 'create-login-profile' command with options for where to save profile, username/pass and debug screenshot output. support entering username and password (hidden) on command-line if omitted.
* use patched pywb for fix
* bump browsertrix-behaviors to 0.1.0
* README: updates to include better getting started, behaviors and profile reference/examples
* bump version to 0.3.0!
parent c9f8fe051c
commit b59788ea04
8 changed files with 483 additions and 88 deletions
Dockerfile

@@ -43,6 +43,7 @@ ADD uwsgi.ini /app/
ADD *.js /app/

RUN ln -s /app/main.js /usr/bin/crawl
RUN ln -s /app/create-login-profile.js /usr/bin/create-login-profile

WORKDIR /crawls
README.md (220)
@@ -1,20 +1,170 @@
# Browsertrix Crawler

Browsertrix Crawler is a simplified browser-based high-fidelity crawling system, designed to run a single crawl in a single Docker container. It is designed as part of a more streamlined replacement of the original [Browsertrix](https://github.com/webrecorder/browsertrix).

The original Browsertrix may be too complex for situations where a single crawl is needed, and requires managing multiple containers.

This is an attempt to refactor Browsertrix into a core crawling system, driven by [puppeteer-cluster](https://github.com/thomasdondorf/puppeteer-cluster)
and [puppeteer](https://github.com/puppeteer/puppeteer)

Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses [puppeteer-cluster](https://github.com/thomasdondorf/puppeteer-cluster)
and [puppeteer](https://github.com/puppeteer/puppeteer) to control one or more browsers in parallel.

## Features

Thus far, Browsertrix Crawler supports:

- Single-container, browser based crawling with multiple headless/headful browsers
- Support for some behaviors: autoplay to capture video/audio, scrolling
- Support for direct capture for non-HTML resources
- Extensible driver script for customizing behavior per crawl or page via Puppeteer
- Single-container, browser based crawling with multiple headless/headful browsers.
- Support for custom browser behaviors, using [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors), including autoscroll, video autoplay and site-specific behaviors.
- Optimized (non-browser) capture of non-HTML resources.
- Extensible Puppeteer driver script for customizing behavior per crawl or page.
- Ability to create and reuse browser profiles with user/password login.

## Getting Started

Browsertrix Crawler requires [Docker](https://docs.docker.com/get-docker/) to be installed on the machine running the crawl.

Assuming Docker is installed, you can run a crawl and test your archive with the following steps.

You don't even need to clone this repo; just choose a directory where you'd like the crawl data to be placed, and then run
the following commands. Replace `[URL]` with the web site you'd like to crawl.

1. Run `docker pull webrecorder/browsertrix-crawler`
2. `docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection test`
3. The crawl will now run and progress of the crawl will be output to the console. Depending on the size of the site, this may take a bit!
4. Once the crawl is finished, a WACZ file will be created at `crawls/collections/test/test.wacz`, relative to the directory where you ran the crawl!
5. You can go to [ReplayWeb.page](https://replayweb.page) and open the generated WACZ file and browse your newly crawled archive!
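As a rough sketch of what to expect (exact contents may vary by version), the output collection directory from the run above might look something like this:

```bash
ls crawls/collections/test/
# archive/    - WARC files captured during the crawl
# indexes/    - CDXJ indexes used by pywb for replay
# pages/      - pages.jsonl listing crawled pages (with extracted text when --text is set)
# test.wacz   - the packaged archive produced by --generateWACZ
```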
Here's how you can use some of the command-line options to configure the crawl:

- To include automated text extraction for full text search, add the `--text` flag.

- To limit the crawl to a maximum number of pages, add `--limit P` where P is the number of pages that will be crawled.

- To run more than one browser worker and crawl in parallel, add `--workers N` where N is the number of browsers to run in parallel. More browsers will require more CPU and network bandwidth, and do not guarantee faster crawling.

- To crawl into a new directory, specify a different name for the `--collection` param; if omitted, a new collection directory based on the current time will be created. An example combining several of these options is shown below.
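For example, a crawl that uses two workers, stops after 50 pages, extracts text, and writes a WACZ into a collection named `example-test` could be run like this (the URL and collection name are placeholders):

```bash
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --url https://www.example.com/ \
  --workers 2 \
  --limit 50 \
  --text \
  --generateWACZ \
  --collection example-test
```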
Browsertrix Crawler includes a number of additional command-line options, explained below.

## Crawling Configuration Options

The Browsertrix Crawler docker image currently accepts the following parameters:

```
browsertrix-crawler [options]

Options:
      --help             Show help                                     [boolean]
      --version          Show version number                           [boolean]
  -u, --url              The URL to start crawling from      [string] [required]
  -w, --workers          The number of workers to run in parallel
                                                           [number] [default: 1]
      --newContext       The context for each new capture, can be a new:
                         page, session or browser.   [string] [default: "page"]
      --waitUntil        Puppeteer page.goto() condition to wait for before
                         continuing, can be multiple separate by ','
                                               [default: "load,networkidle0"]
      --limit            Limit crawl to this number of pages
                                                           [number] [default: 0]
      --timeout          Timeout for each page to load (in seconds)
                                                          [number] [default: 90]
      --scope            Regex of page URLs that should be included in the
                         crawl (defaults to the immediate directory of URL)
      --exclude          Regex of page URLs that should be excluded from the
                         crawl.
  -c, --collection       Collection name to crawl to (replay will be
                         accessible under this name in pywb preview)
                              [string] [default: "capture-2021-04-10T04-49-4"]
      --headless         Run in headless mode, otherwise start xvfb
                                                      [boolean] [default: false]
      --driver           JS driver for the crawler
                                     [string] [default: "/app/defaultDriver.js"]
      --generateCDX, --generatecdx, --generateCdx
                         If set, generate index (CDXJ) for use with pywb after
                         crawl is done                [boolean] [default: false]
      --generateWACZ, --generatewacz, --generateWacz
                         If set, generate wacz        [boolean] [default: false]
      --logging          Logging options for crawler, can include: stats,
                         pywb, behaviors            [string] [default: "stats"]
      --text             If set, extract text to the pages.jsonl file
                                                      [boolean] [default: false]
      --cwd              Crawl working directory for captures (pywb root). If
                         not set, defaults to process.cwd()
                                                    [string] [default: "/crawls"]
      --mobileDevice     Emulate mobile device by name from:
                         https://github.com/puppeteer/puppeteer/blob/main/src/common/DeviceDescriptors.ts
                                                                        [string]
      --userAgent        Override user-agent with specified string     [string]
      --userAgentSuffix  Append suffix to existing browser user-agent
                         (ex: +MyCrawler, info@example.com)            [string]
      --useSitemap       If enabled, check for sitemaps at /sitemap.xml, or
                         custom URL if URL is specified
      --statsFilename    If set, output stats as JSON to this file. (Relative
                         filename resolves to crawl working directory)
      --behaviors        Which background behaviors to enable on each page
                          [string] [default: "autoplay,autofetch,siteSpecific"]
      --profile          Path to tar.gz file which will be extracted and used
                         as the browser profile                        [string]
```

For the `--waitUntil` flag, see [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options).

The default is `load,networkidle0`, but for static sites, `--waitUntil domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load, for example),
while `--waitUntil networkidle0` may make sense for dynamic sites.
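For example, a sketch of a crawl of a mostly static site that skips the network-idle wait (the URL and collection name are placeholders):

```bash
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --url https://www.example.com/ \
  --waitUntil domcontentloaded \
  --collection static-site-test
```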
### Behaviors

Browsertrix Crawler also supports automatically running customized in-browser behaviors. The behaviors auto-play videos (when possible),
auto-fetch content that is not loaded by default, and also run custom behaviors on certain sites.

Behaviors to run can be specified via a comma-separated list passed to the `--behaviors` option. The auto-scroll behavior is not enabled by default, as it may slow down crawling. To enable it, you can add
`--behaviors autoscroll`, or to enable all behaviors, add `--behaviors autoscroll,autoplay,autofetch,siteSpecific`.
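For example, to run a crawl with all behaviors enabled and behavior log output included (the URL and collection name are placeholders):

```bash
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
  --url https://www.example.com/ \
  --behaviors autoscroll,autoplay,autofetch,siteSpecific \
  --logging stats,behaviors \
  --collection behaviors-test
```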
See [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors) for more info on all of the currently available behaviors.
## Creating and Using Browser Profiles

Browsertrix Crawler also includes a way to use existing browser profiles when running a crawl. This allows pre-configuring the browser, such as by logging in
to certain sites or changing other settings, and running a crawl exactly with those settings. By creating a logged-in profile, the actual login credentials are not included in the crawl, only (temporary) session cookies.

Browsertrix Crawler currently includes a script to log in to a single website with supplied credentials and then save the profile.
It can also take a screenshot so you can check if the login succeeded. The `--url` parameter should specify the URL of a login page.

For example, to create a profile logged in to Twitter, you can run:

```bash
docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/login"
```

The script will then prompt you for login credentials, attempt to log in, and create a tar.gz file in `./crawls/profiles/profile.tar.gz`.

- To specify a custom filename, pass the `--filename` parameter.

- To specify the username and password on the command line (for automated profile creation), pass the `--user` and `--password` flags.

- To specify headless mode, add the `--headless` flag. Note that for crawls run with the `--headless` flag, it is recommended to also create the profile with `--headless` to ensure the profile is compatible. An automated example combining these flags is shown below.
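For instance, a fully automated (non-interactive) profile creation run might look like the following sketch; the login URL, credentials and filenames are placeholders, and note that a password passed on the command line may be recorded in your shell history:

```bash
docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile \
  --url "https://example.com/login" \
  --user myuser \
  --password mypassword \
  --filename /output/example-profile.tar.gz \
  --debugScreenshot /output/login-check.png
```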
The `--profile` flag can then be used to specify a Chrome profile stored as a tarball when running the regular `crawl` command. With this option, it is possible to crawl with the browser already pre-configured. To ensure compatibility, the profile should be created using the mechanism described above.

After running the above command, you can now run a crawl with the profile, as follows:

```bash
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --profile /crawls/profiles/profile.tar.gz --url https://twitter.com/ --generateWACZ --collection test-with-profile
```

The current profile creation script is still experimental: it attempts to detect the username and password fields on a site as generically as possible, but may not work for all sites. Additional profile functionality, such as support for custom profile creation scripts, may be added in the future.


## Architecture
@@ -31,56 +181,6 @@ The crawl produces a single pywb collection, at `/crawls/collections/<collection name>`
To access the contents of the crawl, the `/crawls` directory in the container should be mounted to a volume (default in the Docker Compose setup).


## Crawling Parameters

The image currently accepts the following parameters:

```
browsertrix-crawler [options]

Options:
      --help          Show help                                        [boolean]
      --version       Show version number                              [boolean]
  -u, --url           The URL to start crawling from         [string] [required]
  -w, --workers       The number of workers to run in parallel
                                                           [number] [default: 1]
      --newContext    The context for each new capture, can be a new: page,
                      session or browser.            [string] [default: "page"]
      --waitUntil     Puppeteer page.goto() condition to wait for before
                      continuing                              [default: "load"]
      --limit         Limit crawl to this number of pages  [number] [default: 0]
      --timeout       Timeout for each page to load (in seconds)
                                                          [number] [default: 90]
      --scope         Regex of page URLs that should be included in the crawl
                      (defaults to the immediate directory of URL)
      --exclude       Regex of page URLs that should be excluded from the crawl.
      --scroll        If set, will autoscroll to bottom of the page
                                                      [boolean] [default: false]
  -c, --collection    Collection name to crawl to (replay will be accessible
                      under this name in pywb preview)
                                                   [string] [default: "capture"]
      --headless      Run in headless mode, otherwise start xvfb
                                                      [boolean] [default: false]
      --driver        JS driver for the crawler
                                     [string] [default: "/app/defaultDriver.js"]
      --generateCDX   If set, generate index (CDXJ) for use with pywb after crawl
                      is done                         [boolean] [default: false]
      --generateWACZ  If set, generate wacz for use with pywb after crawl
                      is done                         [boolean] [default: false]
      --combineWARC   If set, combine the individual warcs generated into a
                      single warc after crawl is done [boolean] [default: false]
      --rolloverSize  If set, dictates the maximum size that a generated warc
                      and combined warc can be   [number] [default: 1000000000]
      --text          If set, extract the pages full text to be added to the
                      pages.jsonl file                [boolean] [default: false]
      --cwd           Crawl working directory for captures (pywb root). If not
                      set, defaults to process.cwd  [string] [default: "/crawls"]
```

For the `--waitUntil` flag, see [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options).

The default is `load`, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example),
while `--waitUntil networkidle0` may make sense for dynamic sites.

### Example Usage
crawler.js (39)

@@ -5,6 +5,7 @@ const fetch = require("node-fetch");
const AbortController = require("abort-controller");
const path = require("path");
const fs = require("fs");
const os = require("os");
const Sitemapper = require("sitemapper");
const { v4: uuidv4 } = require("uuid");
const warcio = require("warcio");

@@ -44,6 +45,7 @@ class Crawler {
    this.userAgent = "";
    this.behaviorsLogDebug = false;
    this.profileDir = fs.mkdtempSync(path.join(os.tmpdir(), "profile-"));

    const params = require("yargs")
      .usage("browsertrix-crawler [options]")

@@ -279,6 +281,11 @@ class Crawler {
        default: "autoplay,autofetch,siteSpecific",
        type: "string",
      },

      "profile": {
        describe: "Path to tar.gz file which will be extracted and used as the browser profile",
        type: "string",
      },
    };
  }

@@ -399,6 +406,10 @@ class Crawler {
      argv.statsFilename = path.resolve(argv.cwd, argv.statsFilename);
    }

    if (argv.profile) {
      child_process.execSync("tar xvfz " + argv.profile, {cwd: this.profileDir});
    }

    return true;
  }

@@ -411,6 +422,7 @@ class Crawler {
      "--disable-background-media-suspend",
      "--autoplay-policy=no-user-gesture-required",
      "--disable-features=IsolateOrigins,site-per-process",
      "--disable-popup-blocking"
    ];
  }

@@ -420,7 +432,9 @@ class Crawler {
      headless: this.params.headless,
      executablePath: CHROME_PATH,
      ignoreHTTPSErrors: true,
      args: this.chromeArgs
      args: this.chromeArgs,
      userDataDir: this.profileDir,
      defaultViewport: null,
    };
  }

@@ -437,14 +451,7 @@ class Crawler {
    }
  }

  async crawlPage({page, data}) {
    try {
      if (this.emulateDevice) {
        await page.emulate(this.emulateDevice);
      }

      if (this.behaviorOpts) {
        await page.exposeFunction(BEHAVIOR_LOG_FUNC, ({data, type}) => {
  _behaviorLog({data, type}) {
    switch (type) {
    case "info":
      console.log(JSON.stringify(data));

@@ -456,11 +463,17 @@ class Crawler {
      console.log("behavior debug: " + JSON.stringify(data));
    }
  }
        });
      }

      await page.evaluateOnNewDocument(behaviors + `
        self.__bx_behaviors.init(${this.behaviorOpts});
      `);
  async crawlPage({page, data}) {
    try {
      if (this.emulateDevice) {
        await page.emulate(this.emulateDevice);
      }

      if (this.behaviorOpts) {
        await page.exposeFunction(BEHAVIOR_LOG_FUNC, (logdata) => this._behaviorLog(logdata));
        await page.evaluateOnNewDocument(behaviors + `;\nself.__bx_behaviors.init(${this.behaviorOpts});`);
      }

      // run custom driver here
create-login-profile.js (178, new executable file)

@@ -0,0 +1,178 @@
#!/usr/bin/env node

const readline = require("readline");
const child_process = require("child_process");

const puppeteer = require("puppeteer-core");
const yargs = require("yargs");

function cliOpts() {
  return {
    "url": {
      describe: "The URL of the login page",
      type: "string",
      demandOption: true,
    },

    "user": {
      describe: "The username for the login. If not specified, will be prompted",
    },

    "password": {
      describe: "The password for the login. If not specified, will be prompted (recommended)",
    },

    "filename": {
      describe: "The filename for the profile tarball",
      default: "/output/profile.tar.gz",
    },

    "debugScreenshot": {
      describe: "If specified, take a screenshot after login and save as this filename"
    },

    "headless": {
      describe: "Run in headless mode, otherwise start xvfb",
      type: "boolean",
      default: false,
    },
  };
}


async function main() {
  const params = yargs
    .usage("browsertrix-crawler profile [options]")
    .option(cliOpts())
    .argv;

  if (!params.headless) {
    console.log("Launching XVFB");
    child_process.spawn("Xvfb", [
      process.env.DISPLAY,
      "-listen",
      "tcp",
      "-screen",
      "0",
      process.env.GEOMETRY,
      "-ac",
      "+extension",
      "RANDR"
    ]);
  }

  //await new Promise(resolve => setTimeout(resolve, 2000));

  const args = {
    headless: !!params.headless,
    executablePath: "google-chrome",
    ignoreHTTPSErrors: true,
    args: [
      "--no-xshm",
      "--no-sandbox",
      "--disable-background-media-suspend",
      "--autoplay-policy=no-user-gesture-required",
      "--disable-features=IsolateOrigins,site-per-process",
      "--user-data-dir=/tmp/profile"
    ]
  };

  if (!params.user) {
    params.user = await promptInput("Enter username: ");
  }

  if (!params.password) {
    params.password = await promptInput("Enter password: ", true);
  }

  const browser = await puppeteer.launch(args);

  const page = await browser.newPage();

  const waitUntil = ["load", "networkidle2"];

  await page.setCacheEnabled(false);

  console.log("loading");

  await page.goto(params.url, {waitUntil});

  console.log("loaded");

  let u, p;

  try {
    u = await page.waitForXPath("//input[contains(@name, 'user')]");

    p = await page.waitForXPath("//input[contains(@name, 'pass') and @type='password']");

  } catch (e) {
    if (params.debugScreenshot) {
      await page.screenshot({path: params.debugScreenshot});
    }
    console.log("Login form could not be found");
    await page.close();
    process.exit(1);
    return;
  }

  await u.type(params.user);

  await p.type(params.password);

  await Promise.allSettled([
    p.press("Enter"),
    page.waitForNavigation({waitUntil})
  ]);

  await page._client.send("Network.clearBrowserCache");

  if (params.debugScreenshot) {
    await page.screenshot({path: params.debugScreenshot});
  }

  await browser.close();

  console.log("creating profile");

  const profileFilename = params.filename || "/output/profile.tar.gz";

  child_process.execFileSync("tar", ["cvfz", profileFilename, "./"], {cwd: "/tmp/profile"});
  console.log("done");

  process.exit(0);
}

function promptInput(msg, hidden = false) {
  const rl = readline.createInterface({
    input: process.stdin,
    output: process.stdout
  });

  if (hidden) {
    // from https://stackoverflow.com/a/59727173
    rl.input.on("keypress", function () {
      // get the number of characters entered so far:
      const len = rl.line.length;
      // move cursor back to the beginning of the input:
      readline.moveCursor(rl.output, -len, 0);
      // clear everything to the right of the cursor:
      readline.clearLine(rl.output, 1);
      // replace the original input with asterisks:
      for (let i = 0; i < len; i++) {
        rl.output.write("*");
      }
    });
  }

  return new Promise((resolve) => {
    rl.question(msg, function (res) {
      rl.close();
      resolve(res);
    });
  });
}

main();
docker-compose.yml

@@ -2,7 +2,7 @@ version: '3.5'

services:
  crawler:
    image: webrecorder/browsertrix-crawler:0.3.0-beta.0
    image: webrecorder/browsertrix-crawler:0.3.0
    build:
      context: ./
package.json

@@ -1,13 +1,13 @@
{
  "name": "browsertrix-crawler",
  "version": "0.3.0-beta.0",
  "version": "0.3.0",
  "main": "browsertrix-crawler",
  "repository": "https://github.com/webrecorder/browsertrix-crawler",
  "author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",
  "license": "MIT",
  "dependencies": {
    "abort-controller": "^3.0.0",
    "browsertrix-behaviors": "github:webrecorder/browsertrix-behaviors",
    "browsertrix-behaviors": "^0.1.0",
    "node-fetch": "^2.6.1",
    "puppeteer-cluster": "^0.22.0",
    "puppeteer-core": "^5.3.1",

@@ -20,6 +20,6 @@
    "eslint-plugin-react": "^7.22.0",
    "jest": "^26.6.3",
    "md5": "^2.3.0",
    "warcio": "^1.4.2"
    "warcio": "^1.4.3"
  }
}
requirements.txt

@@ -1,3 +1,4 @@
pywb>=2.5.0
#pywb>=2.5.0
git+https://github.com/webrecorder/pywb@yt-rules-improve
uwsgi
wacz>=0.2.1
yarn.lock (110)

@@ -509,6 +509,34 @@
    "@types/yargs" "^15.0.0"
    chalk "^4.0.0"

"@peculiar/asn1-schema@^2.0.27":
  version "2.0.27"
  resolved "https://registry.yarnpkg.com/@peculiar/asn1-schema/-/asn1-schema-2.0.27.tgz#1ee3b2b869ff3200bcc8ec60e6c87bd5a6f03fe0"
  integrity sha512-1tIx7iL3Ma3HtnNS93nB7nhyI0soUJypElj9owd4tpMrRDmeJ8eZubsdq1sb0KSaCs5RqZNoABCP6m5WtnlVhQ==
  dependencies:
    "@types/asn1js" "^2.0.0"
    asn1js "^2.0.26"
    pvtsutils "^1.1.1"
    tslib "^2.0.3"

"@peculiar/json-schema@^1.1.12":
  version "1.1.12"
  resolved "https://registry.yarnpkg.com/@peculiar/json-schema/-/json-schema-1.1.12.tgz#fe61e85259e3b5ba5ad566cb62ca75b3d3cd5339"
  integrity sha512-coUfuoMeIB7B8/NMekxaDzLhaYmp0HZNPEjYRm9goRou8UZIC3z21s0sL9AWoCw4EG876QyO3kYrc61WNF9B/w==
  dependencies:
    tslib "^2.0.0"

"@peculiar/webcrypto@^1.1.1":
  version "1.1.6"
  resolved "https://registry.yarnpkg.com/@peculiar/webcrypto/-/webcrypto-1.1.6.tgz#484bb58be07149e19e873861b585b0d5e4f83b7b"
  integrity sha512-xcTjouis4Y117mcsJslWAGypwhxtXslkVdRp7e3tHwtuw0/xCp1te8RuMMv/ia5TsvxomcyX/T+qTbRZGLLvyA==
  dependencies:
    "@peculiar/asn1-schema" "^2.0.27"
    "@peculiar/json-schema" "^1.1.12"
    pvtsutils "^1.1.2"
    tslib "^2.1.0"
    webcrypto-core "^1.2.0"

"@sindresorhus/is@^4.0.0":
  version "4.0.0"
  resolved "https://registry.npmjs.org/@sindresorhus/is/-/is-4.0.0.tgz"

@@ -535,6 +563,11 @@
  dependencies:
    defer-to-connect "^2.0.0"

"@types/asn1js@^2.0.0":
  version "2.0.0"
  resolved "https://registry.yarnpkg.com/@types/asn1js/-/asn1js-2.0.0.tgz#10ca75692575744d0117098148a8dc84cbee6682"
  integrity sha512-Jjzp5EqU0hNpADctc/UqhiFbY1y2MqIxBVa2S4dBlbnZHTLPMuggoL5q43X63LpsOIINRDirBjP56DUUKIUWIA==

"@types/babel__core@^7.0.0", "@types/babel__core@^7.1.7":
  version "7.1.12"
  resolved "https://registry.npmjs.org/@types/babel__core/-/babel__core-7.1.12.tgz"

@@ -824,6 +857,13 @@ asn1@~0.2.3:
  dependencies:
    safer-buffer "~2.1.0"

asn1js@^2.0.26:
  version "2.1.1"
  resolved "https://registry.yarnpkg.com/asn1js/-/asn1js-2.1.1.tgz#bb3896191ebb5fb1caeda73436a6c6e20a2eedff"
  integrity sha512-t9u0dU0rJN4ML+uxgN6VM2Z4H5jWIYm0w8LsZLzMJaQsgL3IJNbxHgmbWDvJAwspyHpDFuzUaUFh4c05UB4+6g==
  dependencies:
    pvutils latest

assert-plus@1.0.0, assert-plus@^1.0.0:
  version "1.0.0"
  resolved "https://registry.npmjs.org/assert-plus/-/assert-plus-1.0.0.tgz"

@@ -1006,9 +1046,10 @@ browserslist@^4.14.5:
    escalade "^3.1.1"
    node-releases "^1.1.70"

"browsertrix-behaviors@github:webrecorder/browsertrix-behaviors":
browsertrix-behaviors@^0.1.0:
  version "0.1.0"
  resolved "https://codeload.github.com/webrecorder/browsertrix-behaviors/tar.gz/d0a37297b8446fc43b54e6a3b363656a17cb0912"
  resolved "https://registry.yarnpkg.com/browsertrix-behaviors/-/browsertrix-behaviors-0.1.0.tgz#202aabac6dcc2b15fe4777c3cc99d3d0cc042191"
  integrity sha512-AfED59t8b7couu5Vzcy76BoWqCyHtYfmaR5t8ic1MoSfzz40d5WS4HfZqUWvOcoqsUfpJhjlc9R7nCptpQ6tNQ==

bser@2.1.1:
  version "2.1.1"

@@ -1630,6 +1671,11 @@ eslint@^7.20.0:
    text-table "^0.2.0"
    v8-compile-cache "^2.0.3"

esm@^3.2.25:
  version "3.2.25"
  resolved "https://registry.yarnpkg.com/esm/-/esm-3.2.25.tgz#342c18c29d56157688ba5ce31f8431fbb795cc10"
  integrity sha512-U1suiZ2oDVWv4zPO56S0NcR5QriEahGtdN2OR6FiOG4WJvcjBVFB0qI4+eKoWFH483PKGuLuu6V8Z4T5g63UVA==

espree@^7.3.0, espree@^7.3.1:
  version "7.3.1"
  resolved "https://registry.npmjs.org/espree/-/espree-7.3.1.tgz"

@@ -2090,6 +2136,11 @@ has@^1.0.3:
  dependencies:
    function-bind "^1.1.1"

hi-base32@^0.5.0:
  version "0.5.1"
  resolved "https://registry.yarnpkg.com/hi-base32/-/hi-base32-0.5.1.tgz#1279f2ddae2673219ea5870c2121d2a33132857e"
  integrity sha512-EmBBpvdYh/4XxsnUybsPag6VikPYnN30td+vQk+GI3qpahVEG9+gTkG0aXVxTjBqQ5T6ijbWIu77O+C5WFWsnA==

hosted-git-info@^2.1.4:
  version "2.8.8"
  resolved "https://registry.npmjs.org/hosted-git-info/-/hosted-git-info-2.8.8.tgz"

@@ -3205,7 +3256,7 @@ nice-try@^1.0.4:
  resolved "https://registry.npmjs.org/nice-try/-/nice-try-1.0.5.tgz"
  integrity sha512-1nh45deeb5olNY7eX82BkPO7SSxR5SSYJiPTrTdFUVYwAl8CKMA5N9PjTYkHiRjisVcxcQ1HXdLhx2qxxJzLNQ==

node-fetch@^2.6.1:
node-fetch@^2.6.0, node-fetch@^2.6.1:
  version "2.6.1"
  resolved "https://registry.npmjs.org/node-fetch/-/node-fetch-2.6.1.tgz"
  integrity sha512-V4aYg89jEoVRxRb2fJdAg8FHvI7cEyYdVAh94HH0UIK8oJxUfkjlDQN9RbMx+bEjP7+ggMiFRprSti032Oipxw==

@@ -3438,6 +3489,11 @@ p-try@^2.0.0:
  resolved "https://registry.npmjs.org/p-try/-/p-try-2.2.0.tgz"
  integrity sha512-R4nPAVTAU0B9D35/Gk3uJf/7XYbQcyohSKdvAxIRSNghFl4e71hVoGnBNQz9cWaXxO2I10KTC+3jMdvvoKw6dQ==

pako@^1.0.11:
  version "1.0.11"
  resolved "https://registry.yarnpkg.com/pako/-/pako-1.0.11.tgz#6c9599d340d54dfd3946380252a35705a6b992bf"
  integrity sha512-4hLB8Py4zZce5s4yd9XzopqwVv/yGNhV1Bl8NTmCq1763HeK2+EwVTv+leGeL13Dnh2wfbqowVPXCIO0z4taYw==

parent-module@^1.0.0:
  version "1.0.1"
  resolved "https://registry.npmjs.org/parent-module/-/parent-module-1.0.1.tgz"

@@ -3613,6 +3669,18 @@ puppeteer-core@^5.3.1:
    unbzip2-stream "^1.3.3"
    ws "^7.2.3"

pvtsutils@^1.1.1, pvtsutils@^1.1.2:
  version "1.1.2"
  resolved "https://registry.yarnpkg.com/pvtsutils/-/pvtsutils-1.1.2.tgz#483d72f4baa5e354466e68ff783ce8a9e2810030"
  integrity sha512-Yfm9Dsk1zfEpOWCaJaHfqtNXAFWNNHMFSCLN6jTnhuCCBCC2nqge4sAgo7UrkRBoAAYIL8TN/6LlLoNfZD/b5A==
  dependencies:
    tslib "^2.1.0"

pvutils@latest:
  version "1.0.17"
  resolved "https://registry.yarnpkg.com/pvutils/-/pvutils-1.0.17.tgz#ade3c74dfe7178944fe44806626bd2e249d996bf"
  integrity sha512-wLHYUQxWaXVQvKnwIDWFVKDJku9XDCvyhhxoq8dc5MFdIlRenyPI9eSfEtcvgHgD7FlvCyGAlWgOzRnZD99GZQ==

qs@~6.5.2:
  version "6.5.2"
  resolved "https://registry.npmjs.org/qs/-/qs-6.5.2.tgz"

@@ -4345,6 +4413,11 @@ tr46@^2.0.2:
  dependencies:
    punycode "^2.1.1"

tslib@^2.0.0, tslib@^2.0.3, tslib@^2.1.0:
  version "2.2.0"
  resolved "https://registry.yarnpkg.com/tslib/-/tslib-2.2.0.tgz#fb2c475977e35e241311ede2693cee1ec6698f5c"
  integrity sha512-gS9GVHRU+RGn5KQM2rllAlR3dU6m7AcpJKdtH8gFvQiC4Otgk98XnmMU+nZenHt/+VhnBPWwgrJsyrdcw6i23w==

tunnel-agent@^0.6.0:
  version "0.6.0"
  resolved "https://registry.npmjs.org/tunnel-agent/-/tunnel-agent-0.6.0.tgz"

@@ -4446,6 +4519,11 @@ util-deprecate@^1.0.1:
  resolved "https://registry.npmjs.org/util-deprecate/-/util-deprecate-1.0.2.tgz"
  integrity sha1-RQ1Nyfpw3nMnYvvS1KKJgUGaDM8=

uuid-random@^1.3.0:
  version "1.3.2"
  resolved "https://registry.yarnpkg.com/uuid-random/-/uuid-random-1.3.2.tgz#96715edbaef4e84b1dcf5024b00d16f30220e2d0"
  integrity sha512-UOzej0Le/UgkbWEO8flm+0y+G+ljUon1QWTEZOq1rnMAsxo2+SckbiZdKzAHHlVh6gJqI1TjC/xwgR50MuCrBQ==

uuid@8.3.2, uuid@^8.3.0:
  version "8.3.2"
  resolved "https://registry.npmjs.org/uuid/-/uuid-8.3.2.tgz"

@@ -4508,6 +4586,30 @@ walker@^1.0.7, walker@~1.0.5:
  dependencies:
    makeerror "1.0.x"

warcio@^1.4.3:
  version "1.4.3"
  resolved "https://registry.yarnpkg.com/warcio/-/warcio-1.4.3.tgz#9cfa3264c4c2b80a0a5955fa0fd7f85f7418d799"
  integrity sha512-zyhFDt/OmzkNGCaUJaFNrAia9NAPIJmmxtGypqDjH1tVloWSp4tvrj1mQ3CgMiZ0nbrFYwrLVGzMfBTul6gnDg==
  dependencies:
    "@peculiar/webcrypto" "^1.1.1"
    esm "^3.2.25"
    hi-base32 "^0.5.0"
    node-fetch "^2.6.0"
    pako "^1.0.11"
    uuid-random "^1.3.0"
    yargs "^15.3.1"

webcrypto-core@^1.2.0:
  version "1.2.0"
  resolved "https://registry.yarnpkg.com/webcrypto-core/-/webcrypto-core-1.2.0.tgz#44fda3f9315ed6effe9a1e47466e0935327733b5"
  integrity sha512-p76Z/YLuE4CHCRdc49FB/ETaM4bzM3roqWNJeGs+QNY1fOTzKTOVnhmudW1fuO+5EZg6/4LG9NJ6gaAyxTk9XQ==
  dependencies:
    "@peculiar/asn1-schema" "^2.0.27"
    "@peculiar/json-schema" "^1.1.12"
    asn1js "^2.0.26"
    pvtsutils "^1.1.2"
    tslib "^2.1.0"

webidl-conversions@^5.0.0:
  version "5.0.0"
  resolved "https://registry.npmjs.org/webidl-conversions/-/webidl-conversions-5.0.0.tgz"

@@ -4654,7 +4756,7 @@ yargs-parser@^20.0.0:
  resolved "https://registry.npmjs.org/yargs-parser/-/yargs-parser-20.0.0.tgz"
  integrity sha512-8eblPHTL7ZWRkyjIZJjnGf+TijiKJSwA24svzLRVvtgoi/RZiKa9fFQTrlx0OKLnyHSdt/enrdadji6WFfESVA==

yargs@^15.4.1:
yargs@^15.3.1, yargs@^15.4.1:
  version "15.4.1"
  resolved "https://registry.npmjs.org/yargs/-/yargs-15.4.1.tgz"
  integrity sha512-aePbxDmcYW++PaqBsJ+HYUFwCdv4LVvdnhBy78E57PIor8/OVvhMrADFFEDh8DHDFRv/O9i3lPhsENjO7QX0+A==