Mirror of https://github.com/webrecorder/browsertrix-crawler.git, synced 2025-10-19 14:33:17 +00:00
State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78)
* save state work:
  - support interrupting and saving crawl
  - support loading crawl state (frontier queue, pending, done) from YAML
  - support scope check when loading to apply new scoping rules when restarting crawl
  - failed urls added to done as failed, can be retried if crawl is stopped and restarted
  - save state to crawls/crawl-<ts>-<id>.yaml when interrupted
  - --saveState option controls when crawl state is saved: defaults to partial (when interrupted), can also be always or never
  - support in-memory or redis-based crawl state, using fork of puppeteer-cluster
  - --redisStore used to enable redis-based state

* signals/crawl interruption:
  - crawl state set to drain / not provide any more urls to crawl
  - graceful stop of crawl in response to sigint/sigterm
  - initial sigint/sigterm waits for graceful end of current pages, second terminates immediately
  - initial sigabrt followed by sigterm terminates immediately
  - puppeteer: disable handleSIGTERM, handleSIGHUP, handleSIGINT

* redis state support:
  - use lua scripts for atomic move from queue -> pending, and pending -> done
  - pending key expiry set to page timeout
  - add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
  - drainMax returns the numPending() + numSeen() to work with cluster stats

* arg improvements:
  - add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
  - support setting cmdline args via env var CRAWL_ARGS
  - use 'choices' in args when possible

* build update:
  - switch base browser image to new webrecorder/browsertrix-browser-base, a simple image with .deb files only for amd64 and arm64 builds
  - use setuptools<58.0

* misc crawl/scoping rule fixes:
  - scoping rules fix when external is used with scopeType
  - state/limit: ensure no urls, including initial seeds, are added past the limit
  - signals: fix immediate shutdown on second signal
  - tests: add scope test for default scope + excludes

* py-wacz update:
  - add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
  - pywb: use latest pywb branch for improved twitter video capture

* update to latest browsertrix-behaviors

* fix setuptools dependency #88

* update README for 0.5.0 beta
This commit is contained in: parent 2956be2026, commit 39ddecd35e
13 changed files with 544 additions and 6407 deletions
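Before the file-by-file diff, a rough sketch of how the graceful-interrupt ("drain") pieces described above fit together. This is illustrative glue, not code from the commit, and assumes a crawler instance whose crawl state has already been initialized.

```
// Illustrative only: what the first SIGINT/SIGTERM triggers (see main.js and util/state.js below).
async function gracefulInterrupt(crawler) {
  await crawler.crawlState.setDrain();   // drainMax = numPending() + numDone()
  // size() now reports 0 and numSeen() reports drainMax, so the cluster dequeues no new URLs;
  // once in-flight pages settle, crawler.serializeConfig() writes crawls/crawl-<ts>-<id>.yaml
}
```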
Dockerfile

@@ -1,6 +1,6 @@
ARG BROWSER_VERSION=91

ARG BROWSER_IMAGE_BASE=oldwebtoday/chrome
ARG BROWSER_IMAGE_BASE=webrecorder/browsertrix-browser-base

ARG BROWSER_BIN=google-chrome

@@ -21,7 +21,7 @@ RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - \
    && curl -sL https://deb.nodesource.com/setup_16.x -o /tmp/nodesource_setup.sh && bash /tmp/nodesource_setup.sh \
    && apt-get update -y && apt-get install -qqy nodejs yarn \
    && curl https://bootstrap.pypa.io/get-pip.py | python3.8 \
    && pip install -U setuptools
    && pip install 'setuptools<58.0'

# needed to add args to main build stage
ARG BROWSER_VERSION

@@ -36,9 +36,8 @@ ENV PROXY_HOST=localhost \
    BROWSER_VERSION=${BROWSER_VERSION} \
    BROWSER_BIN=${BROWSER_BIN}

COPY --from=browser /tmp/*.deb /deb/
COPY --from=browser /app/libpepflashplayer.so /app/libpepflashplayer.so
RUN dpkg -i /deb/*.deb; apt-get update; apt-get install -fqqy && \
COPY --from=browser /deb/*.deb /deb/
RUN dpkg -i /deb/*.deb; apt-get update; apt-mark hold chromium-browser; apt --fix-broken install -qqy; \
    rm -rf /var/lib/opts/lists/*

WORKDIR /app
README.md (37 lines changed)

@@ -61,6 +61,10 @@ Browsertrix Crawler includes a number of additional command-line options, explained below.
                                        [string]
  -w, --workers                         The number of workers to run in
                                        parallel  [number] [default: 1]
  --crawlId, --id                       A user provided ID for this crawl or
                                        crawl configuration (can also be set
                                        via CRAWL_ID env var)
                                        [string] [default: "4dd1535f7800"]
  --newContext                          The context for each new capture,
                                        can be a new: page, window, session
                                        or browser.

@@ -75,11 +79,10 @@ Browsertrix Crawler includes a number of additional command-line options, explained below.
                                        [number] [default: 0]
  --timeout                             Timeout for each page to load (in
                                        seconds)  [number] [default: 90]
  --scopeType                           Predefined for which URLs to crawl,
                                        can be: prefix, page, host, any, or
                                        custom, to use the
                                        scopeIncludeRx/scopeExcludeRx
                                        [string]
  --scopeType                           A predfined scope of the crawl. For
                                        more customization, use 'custom' and
                                        set scopeIncludeRx regexes
    [string] [choices: "page", "page-spa", "prefix", "host", "any", "custom"]
  --scopeIncludeRx, --include           Regex of page URLs that should be
                                        included in the crawl (defaults to
                                        the immediate directory of URL)

@@ -156,6 +159,14 @@ Browsertrix Crawler includes a number of additional command-line options, explained below.
                                        [number] [default: 0]
  --warcInfo, --warcinfo                Optional fields added to the
                                        warcinfo record in combined WARCs
  --redisStoreUrl                       If set, url for remote redis server
                                        to store state. Otherwise, using
                                        in-memory store  [string]
  --saveState                           If the crawl state should be
                                        serialized to the crawls/ directory.
                                        Defaults to 'partial', only saved
                                        when crawl is interrupted
    [string] [choices: "never", "partial", "always"] [default: "partial"]
  --config                              Path to YAML config file
```
</details>
@@ -394,6 +405,20 @@ docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler

will start a crawl with 3 workers, and show the screen of each of the workers from `http://localhost:9037/`.

## Interrupting and Restarting the Crawl

With version 0.5.0, a crawl can be gracefully interrupted with Ctrl-C (SIGINT) or a SIGTERM.
When a crawl is interrupted, the current crawl state is written to the `crawls` subdirectory inside the collection directory.
The crawl state includes the current YAML config, if any, plus the current state of the crawl.

The idea is that this crawl state YAML file can then be used as the `--config` option to restart the crawl from where it was left off previously.

By default, the crawl interruption waits for current pages to finish. A subsequent SIGINT will cause the crawl to stop immediately. Any unfinished pages
are recorded in the `pending` section of the crawl state (if gracefully finished, the section will be empty).

By default, the crawl state is only written when a crawl is partially done - when it is interrupted. The `--saveState` cli option can be set to `always`
or `never`, respectively, to control when the crawl state file should be written.
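For orientation, a sketch of the `state` block that ends up in the saved crawl YAML, shown here as the object handed to `yaml.dump()` in `crawler.js` below; URLs and timestamps are placeholders.

```
// Sketch of a saved crawls/crawl-<ts>-<id>.yaml, expressed as a JS object (values are made up):
const savedCrawlConfig = {
  // ...the options from the original YAML config, if one was used...
  state: {
    queued:  ["{\"url\":\"https://example.com/page2\",\"seedId\":0,\"depth\":1}"],
    pending: ["{\"url\":\"https://example.com/page1\",\"seedId\":0,\"depth\":1,\"started\":\"2021-08-01T00:00:00.000Z\"}"],
    done:    ["{\"url\":\"https://example.com/\",\"seedId\":0,\"depth\":0,\"finished\":\"2021-08-01T00:00:05.000Z\"}"]
  }
};
```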

## Creating and Using Browser Profiles

@@ -517,7 +542,7 @@ Browsertrix Crawler uses a browser image which supports amd64 and arm64 (currently

This means Browsertrix Crawler can be built natively on Apple M1 systems using the default settings. Simply running `docker-compose build` on an Apple M1 should build a native version that should work for development.

On M1 system, the browser used will be Chromium instead of Chrome since there is no Linux build of Chrome for ARM, and this now is handled automatically as part of the build.
On an M1 system, the browser used will be Chromium instead of Chrome since there is no Linux build of Chrome for ARM, and this is now handled automatically as part of the build. Note that Chromium is different from Chrome; for example, some video codecs may not be supported in the ARM / Chromium-based version that would be supported in the amd64 / Chrome version. For production crawling, it is recommended to run on an amd64 Linux environment.


### Custom Browser Image
crawler.js (140 lines changed)

@@ -13,14 +13,16 @@ const HTTP_AGENT = require("http").Agent();
const fetch = require("node-fetch");
const puppeteer = require("puppeteer-core");
const { Cluster } = require("puppeteer-cluster");
const { RedisCrawlState, MemoryCrawlState } = require("./util/state");
const AbortController = require("abort-controller");
const Sitemapper = require("sitemapper");
const { v4: uuidv4 } = require("uuid");
const Redis = require("ioredis");
const yaml = require("js-yaml");

const warcio = require("warcio");

const behaviors = fs.readFileSync("/app/node_modules/browsertrix-behaviors/dist/behaviors.js", "utf-8");
const behaviors = fs.readFileSync(path.join(__dirname, "node_modules", "browsertrix-behaviors", "dist", "behaviors.js"), {encoding: "utf8"});

const TextExtract = require("./util/textextract");
const { ScreenCaster } = require("./util/screencaster");

@@ -37,7 +39,7 @@ const { BlockRules } = require("./util/blockrules");
class Crawler {
  constructor() {
    this.headers = {};
    this.seenList = new Set();
    this.crawlState = null;

    this.emulateDevice = null;

@@ -52,7 +54,9 @@ class Crawler {

    this.userAgent = "";

    this.params = parseArgs();
    const res = parseArgs();
    this.params = res.parsed;
    this.origConfig = res.origConfig;

    this.debugLogging = this.params.logging.includes("debug");

@@ -117,7 +121,8 @@ class Crawler {
    let version = process.env.BROWSER_VERSION;

    try {
      version = child_process.execFileSync(this.browserExe, ["--product-version"], {encoding: "utf8"}).trim();
      version = child_process.execFileSync(this.browserExe, ["--version"], {encoding: "utf8"});
      version = version.match(/[\d.]+/)[0];
    } catch(e) {
      console.error(e);
    }

@@ -135,6 +140,34 @@ class Crawler {
    }
  }

  async initCrawlState() {
    const redisUrl = this.params.redisStoreUrl;

    if (redisUrl) {
      if (!redisUrl.startsWith("redis://")) {
        throw new Error("stateStoreUrl must start with redis:// -- Only redis-based store currently supported");
      }

      const redis = new Redis(redisUrl, {lazyConnect: true});

      try {
        await redis.connect();
      } catch (e) {
        throw new Error("Unable to connect to state store Redis: " + redisUrl);
      }

      this.statusLog(`Storing state via Redis ${redisUrl} @ key prefix "${this.params.crawlId}"`);

      this.crawlState = new RedisCrawlState(redis, this.params.crawlId, this.params.timeout);
    } else {
      this.statusLog("Storing state in memory");

      this.crawlState = new MemoryCrawlState();
    }

    return this.crawlState;
  }

  bootstrap() {
    let opts = {};
    if (this.params.logging.includes("pywb")) {

@@ -164,7 +197,7 @@ class Crawler {
      }
    });

    if (!this.params.headless) {
    if (!this.params.headless && !process.env.NO_XVFB) {
      child_process.spawn("Xvfb", [
        process.env.DISPLAY,
        "-listen",

@@ -198,6 +231,9 @@ class Crawler {
    return {
      headless: this.params.headless,
      executablePath: this.browserExe,
      handleSIGINT: false,
      handleSIGTERM: false,
      handleSIGHUP: false,
      ignoreHTTPSErrors: true,
      args: this.chromeArgs,
      userDataDir: this.profileDir,

@@ -257,7 +293,6 @@ class Crawler {
    // run custom driver here
    await this.driver({page, data, crawler: this});

    const title = await page.title();
    let text = "";
    if (this.params.text) {

@@ -266,7 +301,7 @@ class Crawler {
      text = await new TextExtract(result).parseTextFromDom();
    }

    await this.writePage(data.url, title, this.params.text, text);
    await this.writePage(data, title, this.params.text, text);

    if (this.params.behaviorOpts) {
      await Promise.allSettled(page.frames().map(frame => frame.evaluate("self.__bx_behaviors.run();")));

@@ -326,6 +361,13 @@ class Crawler {
      monitor: this.params.logging.includes("stats")
    });


    this.cluster.jobQueue = await this.initCrawlState();

    if (this.params.state) {
      await this.crawlState.load(this.params.state, this.params.scopedSeeds, true);
    }

    this.cluster.task((opts) => this.crawlPage(opts));

    await this.initPages();
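As an aside, the crawl state object is plugged in as the cluster's job queue here. A rough sketch of the surface involved, with method names taken from `util/state.js` further down; that the forked puppeteer-cluster awaits each of them is an assumption based on the `async-job-queue` fork name in package.json and the `await this.cluster.jobQueue.size()` call below.

```
// Sketch only - not an interface defined anywhere in the repo.
class CrawlStateShape {
  async push(job) {}       // enqueue job.data
  async shift() {}         // dequeue the next Job, moving its data into the pending set
  async size() {}          // queued count; 0 once setDrain() has been called
  async numSeen() {}       // total seen URLs; reported as drainMax while draining
  async numPending() {}    // pages currently in flight
  async finished() {}      // used by serializeConfig() to detect a completed crawl
}
```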
@@ -341,7 +383,11 @@ class Crawler {

    for (let i = 0; i < this.params.scopedSeeds.length; i++) {
      const seed = this.params.scopedSeeds[i];
      this.queueUrl(i, seed.url, 0);
      if (!await this.queueUrl(i, seed.url, 0)) {
        if (this.limitHit) {
          break;
        }
      }

      if (seed.sitemap) {
        await this.parseSitemap(seed.sitemap, i);

@@ -351,6 +397,8 @@ class Crawler {
    await this.cluster.idle();
    await this.cluster.close();

    await this.serializeConfig();

    this.writeStats();

    if (this.pagesFH) {

@@ -382,7 +430,7 @@ class Crawler {
    // Build the argument list to pass to the wacz create command
    const waczFilename = this.params.collection.concat(".wacz");
    const waczPath = path.join(this.collDir, waczFilename);
    const argument_list = ["create", "-o", waczPath, "--pages", this.pagesFile, "-f"];
    const argument_list = ["create", "--split-seeds", "-o", waczPath, "--pages", this.pagesFile, "-f"];
    warcFileList.forEach((val, index) => argument_list.push(path.join(archiveDir, val))); // eslint-disable-line no-unused-vars

    // Run the wacz create command

@@ -395,7 +443,7 @@ class Crawler {
    if (this.params.statsFilename) {
      const total = this.cluster.allTargetCount;
      const workersRunning = this.cluster.workersBusy.length;
      const numCrawled = total - this.cluster.jobQueue.size() - workersRunning;
      const numCrawled = total - (await this.cluster.jobQueue.size()) - workersRunning;
      const limit = {max: this.params.limit || 0, hit: this.limitHit};
      const stats = {numCrawled, workersRunning, total, limit};

@@ -438,7 +486,7 @@ class Crawler {

    for (const opts of selectorOptsList) {
      const links = await this.extractLinks(page, opts);
      this.queueInScopeUrls(seedId, links, depth);
      await this.queueInScopeUrls(seedId, links, depth);
    }
  }

@@ -473,7 +521,7 @@ class Crawler {
    return results;
  }

  queueInScopeUrls(seedId, urls, depth) {
  async queueInScopeUrls(seedId, urls, depth) {
    try {
      depth += 1;
      const seed = this.params.scopedSeeds[seedId];

@@ -482,7 +530,7 @@ class Crawler {
        const captureUrl = seed.isIncluded(url, depth);

        if (captureUrl) {
          this.queueUrl(seedId, captureUrl, depth);
          await this.queueUrl(seedId, captureUrl, depth);
        }
      }
    } catch (e) {

@@ -490,16 +538,21 @@ class Crawler {
    }
  }

  queueUrl(seedId, url, depth) {
    if (this.seenList.has(url)) {
  async queueUrl(seedId, url, depth) {
    if (this.limitHit) {
      return false;
    }

    this.seenList.add(url);
    if (this.numLinks >= this.params.limit && this.params.limit > 0) {
      this.limitHit = true;
      return false;
    }

    if (await this.crawlState.has(url)) {
      return false;
    }

    await this.crawlState.add(url);
    this.numLinks++;
    this.cluster.queue({url, seedId, depth});
    return true;

@@ -535,12 +588,16 @@ class Crawler {
    }
  }

  async writePage(url, title, text, text_content){
  async writePage({url, depth}, title, text, text_content) {
    const id = uuidv4();
    const row = {"id": id, "url": url, "title": title};

    if (text == true){
      row["text"] = text_content;
    if (depth === 0) {
      row.seed = true;
    }

    if (text) {
      row.text = text_content;
    }

    const processedRow = JSON.stringify(row).concat("\n");
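For reference, a sketch of the page entry the new `writePage()` produces for a seed page; all values are placeholders.

```
// One line of the crawl's pages JSONL file for a depth-0 (seed) page with --text enabled:
const row = {
  "id": "7c9e6679-7425-40de-944b-e07fc1f90ae7",   // uuidv4()
  "url": "https://example.com/",
  "title": "Example Domain",
  "seed": true,                                   // only set when depth === 0
  "text": "Example Domain ..."                    // only set when text extraction is enabled
};
// written as JSON.stringify(row) + "\n"
```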
@@ -626,7 +683,7 @@ class Crawler {

    try {
      const { sites } = await sitemapper.fetch();
      this.queueInScopeUrls(seedId, sites, 0);
      await this.queueInScopeUrls(seedId, sites, 0);
    } catch(e) {
      console.warn(e);
    }

@@ -644,12 +701,10 @@ class Crawler {

    // Go through a list of the created works and create an array sorted by their filesize with the largest file first.
    for (let i = 0; i < warcLists.length; i++) {
      let fileName = path.join(this.collDir, "archive", warcLists[i]);
      let fileSize = await this.getFileSize(fileName);
      const fileName = path.join(this.collDir, "archive", warcLists[i]);
      const fileSize = await this.getFileSize(fileName);
      fileSizeObjects.push({"fileSize": fileSize, "fileName": fileName});
      fileSizeObjects.sort(function(a, b){
        return b.fileSize - a.fileSize;
      });
      fileSizeObjects.sort((a, b) => b.fileSize - a.fileSize);
    }

    const generatedCombinedWarcs = [];

@@ -726,6 +781,41 @@ class Crawler {

    this.debugLog(`Combined WARCs saved as: ${generatedCombinedWarcs}`);
  }

  async serializeConfig() {
    switch (this.params.saveState) {
    case "never":
      return;

    case "partial":
      if (await this.crawlState.finished()) {
        return;
      }
      break;

    case "always":
    default:
      break;
    }

    const ts = new Date().toISOString().slice(0,19).replace(/[T:-]/g, "");

    const crawlDir = path.join(this.collDir, "crawls");

    await fsp.mkdir(crawlDir, {recursive: true});

    const filename = path.join(crawlDir, `crawl-${ts}-${this.params.crawlId}.yaml`);

    this.statusLog("Saving crawl state to: " + filename);

    const state = await this.crawlState.serialize();

    if (this.origConfig) {
      this.origConfig.state = state;
    }
    const res = yaml.dump(this.origConfig, {lineWidth: -1});
    fs.writeFileSync(filename, res);
  }
}

module.exports.Crawler = Crawler;
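A quick worked example of the state-file naming in `serializeConfig()` above; the crawl id is hypothetical.

```
// new Date().toISOString()   -> "2021-08-01T12:34:56.789Z"
// .slice(0, 19)              -> "2021-08-01T12:34:56"
// .replace(/[T:-]/g, "")     -> "20210801123456"
const filename = "crawl-20210801123456-my-crawl.yaml";   // written under <collection>/crawls/
```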
main.js (54 lines changed)

@@ -1,17 +1,57 @@
#!/usr/bin/env node

process.once("SIGINT", () => {
  console.log("SIGINT received, exiting");
  process.exit(1);
var crawler = null;

var lastSigInt = 0;
let forceTerm = false;


async function handleTerminate() {
  if (!crawler) {
    process.exit(0);
  }

  try {
    if (!crawler.crawlState.drainMax) {
      console.log("SIGNAL: gracefully finishing current pages...");
      crawler.crawlState.setDrain();

    } else if ((Date.now() - lastSigInt) > 200) {
      console.log("SIGNAL: stopping crawl now...");
      await crawler.serializeConfig();
      process.exit(0);
    }
    lastSigInt = Date.now();
  } catch (e) {
    console.log(e);
  }
}

process.on("SIGINT", async () => {
  console.log("SIGINT received...");
  await handleTerminate();
});

process.once("SIGTERM", () => {
  console.log("SIGTERM received, exiting");
  process.exit(1);
process.on("SIGTERM", async () => {
  if (forceTerm) {
    console.log("SIGTERM received, exit immediately");
    process.exit(1);
  }

  console.log("SIGTERM received...");
  await handleTerminate();
});

process.on("SIGABRT", async () => {
  console.log("SIGABRT received, will force immediate exit on SIGTERM");
  forceTerm = true;
});



const { Crawler } = require("./crawler");

new Crawler().run();
crawler = new Crawler();
crawler.run();
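To summarize the resulting operator-facing behavior, a sketch (not code from the repo) of how the handlers above respond to repeated signals:

```
// first SIGINT/SIGTERM   -> setDrain(): finish pages already in flight, queue nothing new
// second SIGINT/SIGTERM  -> if more than 200ms after the previous one: serializeConfig() and exit(0)
//                           (repeats within 200ms are treated as duplicates of the same signal)
// SIGABRT then SIGTERM   -> exit(1) immediately via the forceTerm flag above
process.kill(process.pid, "SIGINT");                          // begin graceful drain
setTimeout(() => process.kill(process.pid, "SIGINT"), 500);   // then stop and save state
```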
package-lock.json (generated): 6335 lines changed; file diff suppressed because it is too large.
package.json

@@ -1,6 +1,6 @@
{
  "name": "browsertrix-crawler",
  "version": "0.4.4",
  "version": "0.5.0-beta.0",
  "main": "browsertrix-crawler",
  "repository": "https://github.com/webrecorder/browsertrix-crawler",
  "author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",

@@ -10,11 +10,11 @@
  },
  "dependencies": {
    "abort-controller": "^3.0.0",
    "browsertrix-behaviors": "^0.2.3",
    "browsertrix-behaviors": "github:webrecorder/browsertrix-behaviors#skip-mp4-video",
    "ioredis": "^4.27.1",
    "js-yaml": "^4.1.0",
    "node-fetch": "^2.6.1",
    "puppeteer-cluster": "^0.22.0",
    "puppeteer-cluster": "github:ikreymer/puppeteer-cluster#async-job-queue",
    "puppeteer-core": "^8.0.0",
    "sitemapper": "^3.1.2",
    "uuid": "8.3.2",
requirements.txt

@@ -1,3 +1,4 @@
pywb>=2.6.0
#pywb>=2.6.0
git+https://github.com/webrecorder/pywb@twitter-rw
uwsgi
wacz>=0.3.1
wacz>=0.3.2
@@ -133,7 +133,7 @@ test("test block url in frame url", () => {
test("test block rules complex example, block external urls on main frame, but not on youtube", () => {
  const config = {
    "seeds": [
      "https://archiveweb.page/guide/troubleshooting/errors.html",
      "https://archiveweb.page/en/troubleshooting/errors/",
    ],
    "depth": "0",
    "blockRules": [{
@@ -12,7 +12,8 @@ function getSeeds(config) {
    return orig(name, ...args);
  };

  return parseArgs(["node", "crawler", "--config", "stdinconfig"]).scopedSeeds;
  const res = parseArgs(["node", "crawler", "--config", "stdinconfig"]);
  return res.parsed.scopedSeeds;
}

test("default scope", async () => {

@@ -30,6 +31,24 @@ seeds:

});

test("default scope + exclude", async () => {
  const seeds = getSeeds(`
seeds:
  - https://example.com/

exclude: https://example.com/pathexclude

`);

  expect(seeds.length).toEqual(1);
  expect(seeds[0].scopeType).toEqual("prefix");
  expect(seeds[0].include).toEqual([/^https:\/\/example\.com\//]);
  expect(seeds[0].exclude).toEqual([/https:\/\/example.com\/pathexclude/]);

});


test("custom scope", async () => {
  const seeds = getSeeds(`
seeds:
@@ -1,5 +1,6 @@
const path = require("path");
const fs = require("fs");
const os = require("os");

const yaml = require("js-yaml");
const puppeteer = require("puppeteer-core");

@@ -12,7 +13,6 @@ const { BEHAVIOR_LOG_FUNC, WAIT_UNTIL_OPTS } = require("./constants");
const { ScopedSeed } = require("./seeds");


// ============================================================================
class ArgParser {
  get cliOpts() {

@@ -37,6 +37,13 @@ class ArgParser {
      type: "number",
    },

    "crawlId": {
      alias: "id",
      describe: "A user provided ID for this crawl or crawl configuration (can also be set via CRAWL_ID env var)",
      type: "string",
      default: process.env.CRAWL_ID || os.hostname(),
    },

    "newContext": {
      describe: "The context for each new capture, can be a new: page, window, session or browser.",
      default: "page",

@@ -67,8 +74,9 @@ class ArgParser {
    },

    "scopeType": {
      describe: "Predefined for which URLs to crawl, can be: prefix, page, host, any, or custom, to use the scopeIncludeRx/scopeExcludeRx",
      describe: "A predfined scope of the crawl. For more customization, use 'custom' and set scopeIncludeRx regexes",
      type: "string",
      choices: ["page", "page-spa", "prefix", "host", "any", "custom"]
    },

    "scopeIncludeRx": {

@@ -211,6 +219,18 @@ class ArgParser {
      alias: ["warcinfo"],
      describe: "Optional fields added to the warcinfo record in combined WARCs",
      type: "object"
    },

    "redisStoreUrl": {
      describe: "If set, url for remote redis server to store state. Otherwise, using in-memory store",
      type: "string"
    },

    "saveState": {
      describe: "If the crawl state should be serialized to the crawls/ directory. Defaults to 'partial', only saved when crawl is interrupted",
      type: "string",
      default: "partial",
      choices: ["never", "partial", "always"]
    }
  };
}

@@ -218,17 +238,26 @@ class ArgParser {
  parseArgs(argv) {
    argv = argv || process.argv;

    return yargs(hideBin(argv))
    if (process.env.CRAWL_ARGS) {
      argv = argv.concat(process.env.CRAWL_ARGS.split(" "));
    }

    let origConfig = {};

    const parsed = yargs(hideBin(argv))
      .usage("crawler [options]")
      .option(this.cliOpts)
      .config("config", "Path to YAML config file", (configPath) => {
        if (configPath === "/crawls/stdin") {
          configPath = process.stdin.fd;
        }
        return yaml.load(fs.readFileSync(configPath, "utf8"));
        origConfig = yaml.load(fs.readFileSync(configPath, "utf8"));
        return origConfig;
      })
      .check((argv) => this.validateArgs(argv))
      .argv;

    return {parsed, origConfig};
  }
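A small usage sketch of the env-var support added above; the values are made up. `CRAWL_ARGS` is split on spaces and appended to argv before parsing, and `CRAWL_ID` supplies the default for `--crawlId`.

```
// Hypothetical example only
process.env.CRAWL_ID = "my-crawl";
process.env.CRAWL_ARGS = "--workers 4 --saveState always --redisStoreUrl redis://redis:6379/0";

// parseArgs() now behaves as if these flags had been appended to the command line;
// note the simple split(" ") means values containing spaces cannot be passed this way.
```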
@@ -5,16 +5,16 @@ class ScopedSeed
    this.url = parsedUrl.href;
    this.include = this.parseRx(include);
    this.exclude = this.parseRx(exclude);

    if (!scopeType) {
      scopeType = (this.include.length || this.exclude.length) ? "custom" : "prefix";
    }

    this.scopeType = scopeType;

    if (!this.scopeType) {
      this.scopeType = this.include.length ? "custom" : "prefix";
    }

    if (this.scopeType !== "custom") {
      [this.include, allowHash] = this.scopeFromType(this.scopeType, parsedUrl);
    }

    this.sitemap = this.resolveSiteMap(sitemap);
    this.allowHash = allowHash;
    this.maxDepth = depth < 0 ? 99999 : depth;
util/state.js (new file, 276 lines)

@@ -0,0 +1,276 @@
const Job = require("puppeteer-cluster/dist/Job").default;


// ============================================================================
class BaseState
{
  constructor() {
    this.drainMax = 0;
  }

  async setDrain() {
    this.drainMax = (await this.numPending()) + (await this.numDone());
  }

  async size() {
    return this.drainMax ? 0 : await this.realSize();
  }

  async finished() {
    return await this.realSize() == 0;
  }

  async numSeen() {
    return this.drainMax ? this.drainMax : await this.numRealSeen();
  }

  recheckScope(data, seeds) {
    const seed = seeds[data.seedId];

    return seed.isIncluded(data.url, data.depth);
  }
}


// ============================================================================
class MemoryCrawlState extends BaseState
{
  constructor() {
    super();
    this.seenList = new Set();
    this.queue = [];
    this.pending = new Set();
    this.done = [];
  }

  push(job) {
    this.queue.unshift(job.data);
  }

  realSize() {
    return this.queue.length;
  }

  shift() {
    const data = this.queue.pop();
    data.started = new Date().toISOString();
    const str = JSON.stringify(data);
    this.pending.add(str);

    const callback = {
      resolve: () => {
        this.pending.delete(str);
        data.finished = new Date().toISOString();
        this.done.unshift(data);
      },

      reject: (e) => {
        this.pending.delete(str);
        console.warn(`URL Load Failed: ${data.url}, Reason: ${e}`);
        data.failed = true;
        this.done.unshift(data);
      }
    };

    return new Job(data, undefined, callback);
  }

  has(url) {
    return this.seenList.has(url);
  }

  add(url) {
    return this.seenList.add(url);
  }

  async serialize() {
    const queued = this.queue.map(x => JSON.stringify(x));
    const pending = Array.from(this.pending.values());
    const done = this.done.map(x => JSON.stringify(x));

    return {queued, pending, done};
  }

  async load(state, seeds, checkScope=false) {
    for (const json of state.queued) {
      const data = JSON.parse(json);
      if (checkScope && !this.recheckScope(data, seeds)) {
        continue;
      }
      this.queue.push(data);
      this.seenList.add(data.url);
    }

    for (const json of state.pending) {
      const data = JSON.parse(json);
      if (checkScope && !this.recheckScope(data, seeds)) {
        continue;
      }
      this.queue.push(data);
      this.seenList.add(data.url);
    }

    for (const json of state.done) {
      const data = JSON.parse(json);
      if (data.failed) {
        this.queue.push(data);
      } else {
        this.done.push(data);
      }
      this.seenList.add(data.url);
    }

    return this.seenList.size;
  }

  async numDone() {
    return this.done.length;
  }

  async numRealSeen() {
    return this.seenList.size;
  }

  async numPending() {
    return this.pending.size;
  }
}
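A minimal usage sketch of the in-memory state above (the URL is a placeholder), showing how an entry moves from the queue into the pending set:

```
const { MemoryCrawlState } = require("./util/state");   // path as used from crawler.js

(async () => {
  const state = new MemoryCrawlState();

  await state.add("https://example.com/");                                     // mark as seen
  state.push({ data: { url: "https://example.com/", seedId: 0, depth: 0 } });

  const job = state.shift();              // data gains a `started` timestamp and enters `pending`
  console.log(job.data.url);              // "https://example.com/"
  console.log(await state.numPending());  // 1
  console.log(await state.serialize());   // { queued: [], pending: ["{...}"], done: [] }
})();
```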

// ============================================================================
class RedisCrawlState extends BaseState
{
  constructor(redis, key, pageTimeout) {
    super();
    this.redis = redis;

    this.key = key;
    this.pageTimeout = pageTimeout / 1000;

    this.qkey = this.key + ":q";
    this.pkey = this.key + ":p";
    this.skey = this.key + ":s";
    this.dkey = this.key + ":d";


    redis.defineCommand("movestarted", {
      numberOfKeys: 2,
      lua: "local val = redis.call('rpop', KEYS[1]); if (val) then local json = cjson.decode(val); json['started'] = ARGV[1]; val = cjson.encode(json); redis.call('sadd', KEYS[2], val); redis.call('expire', KEYS[2], ARGV[2]); end; return val"
    });

    redis.defineCommand("movefinished", {
      numberOfKeys: 2,
      lua: "local val = ARGV[1]; if (redis.call('srem', KEYS[1], val)) then local json = cjson.decode(val); json[ARGV[3]] = ARGV[2]; val = cjson.encode(json); redis.call('lpush', KEYS[2], val); end; return val"
    });

  }

  async push(job) {
    await this.redis.lpush(this.qkey, JSON.stringify(job.data));
  }

  async realSize() {
    return await this.redis.llen(this.qkey);
  }

  async shift() {
    const started = new Date().toISOString();
    // atomically move from queue list -> pending set while adding started timestamp
    // set pending set expire to page timeout
    const json = await this.redis.movestarted(this.qkey, this.pkey, started, this.pageTimeout);
    const data = JSON.parse(json);

    const callback = {
      resolve: async () => {
        const finished = new Date().toISOString();
        // atomically move from pending set -> done list while adding finished timestamp
        await this.redis.movefinished(this.pkey, this.dkey, json, finished, "finished");
      },

      reject: async (e) => {
        console.warn(`URL Load Failed: ${data.url}, Reason: ${e}`);
        await this.redis.movefinished(this.pkey, this.dkey, json, true, "failed");
      }
    };

    return new Job(data, undefined, callback);
  }

  async has(url) {
    return !!await this.redis.sismember(this.skey, url);
  }

  async add(url) {
    return await this.redis.sadd(this.skey, url);
  }

  async serialize() {
    const queued = await this.redis.lrange(this.qkey, 0, -1);
    const pending = await this.redis.smembers(this.pkey);
    const done = await this.redis.lrange(this.dkey, 0, -1);

    return {queued, pending, done};
  }

  async load(state, seeds, checkScope) {
    const seen = [];

    // need to delete existing keys, if exist to fully reset state
    await this.redis.del(this.qkey);
    await this.redis.del(this.pkey);
    await this.redis.del(this.dkey);
    await this.redis.del(this.skey);

    for (const json of state.queued) {
      const data = JSON.parse(json);
      if (checkScope) {
        if (!this.recheckScope(data, seeds)) {
          continue;
        }
      }

      await this.redis.rpush(this.qkey, json);
      seen.push(data.url);
    }

    for (const json of state.pending) {
      const data = JSON.parse(json);
      if (checkScope) {
        if (!this.recheckScope(data, seeds)) {
          continue;
        }
      }

      await this.redis.rpush(this.qkey, json);
      seen.push(data.url);
    }

    for (const json of state.done) {
      const data = JSON.parse(json);
      if (data.failed) {
        await this.redis.rpush(this.qkey, json);
      } else {
        await this.redis.rpush(this.dkey, json);
      }
      seen.push(data.url);
    }

    await this.redis.sadd(this.skey, seen);
    return seen.length;
  }

  async numDone() {
    return await this.redis.llen(this.dkey);
  }

  async numRealSeen() {
    return await this.redis.scard(this.skey);
  }

  async numPending() {
    return await this.redis.scard(this.pkey);
  }
}

module.exports.RedisCrawlState = RedisCrawlState;
module.exports.MemoryCrawlState = MemoryCrawlState;
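For reference, a sketch of the Redis keys this produces for a crawl run with `--crawlId mycrawl` (the id and the local Redis URL are hypothetical), and how they could be inspected with ioredis:

```
// Keys created by RedisCrawlState for crawlId "mycrawl":
//   mycrawl:q  - URL queue               (list; LPUSH on queue, RPOP via movestarted)
//   mycrawl:p  - pages currently pending (set;  expiry set to the page timeout)
//   mycrawl:s  - URLs seen so far        (set)
//   mycrawl:d  - finished/failed pages   (list)
const Redis = require("ioredis");
const redis = new Redis("redis://localhost:6379/0");   // assumed local instance

async function inspect() {
  console.log(await redis.llen("mycrawl:q"), await redis.scard("mycrawl:p"),
              await redis.scard("mycrawl:s"), await redis.llen("mycrawl:d"));
}
```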
yarn.lock (19 lines changed)

@@ -1053,10 +1053,9 @@ browserslist@^4.14.5:
    escalade "^3.1.1"
    node-releases "^1.1.71"

browsertrix-behaviors@^0.2.3:
  version "0.2.3"
  resolved "https://registry.yarnpkg.com/browsertrix-behaviors/-/browsertrix-behaviors-0.2.3.tgz#207ddbd88c54f388ad9723406982cabe772bceaf"
  integrity sha512-4OaELMXU9bljvEsBMJrwWNLnlHsy9OIOL2ZXa0t0rcSQzAUm3HbhIWWNSh+LY0h/76mljbpsqW0kptIqgNJqvg==
"browsertrix-behaviors@github:webrecorder/browsertrix-behaviors#skip-mp4-video":
  version "0.2.4"
  resolved "https://codeload.github.com/webrecorder/browsertrix-behaviors/tar.gz/50a0538f0a19fba786a7af62ef6c0946e21038b4"

bser@2.1.1:
  version "2.1.1"

@@ -3744,10 +3743,9 @@ punycode@^2.1.0, punycode@^2.1.1:
  resolved "https://registry.yarnpkg.com/punycode/-/punycode-2.1.1.tgz#b58b010ac40c22c5657616c8d2c2c02c7bf479ec"
  integrity sha512-XRsRjdf+j5ml+y/6GKHPZbrF/8p2Yga0JPtdqTIY2Xe5ohJPD9saDJJLPvp9+NSBprVvevdXZybnj2cv8OEd0A==

puppeteer-cluster@^0.22.0:
"puppeteer-cluster@github:ikreymer/puppeteer-cluster#async-job-queue":
  version "0.22.0"
  resolved "https://registry.yarnpkg.com/puppeteer-cluster/-/puppeteer-cluster-0.22.0.tgz#4ab214671f414f15ad6a94a4b61ed0b4172e86e6"
  integrity sha512-hmydtMwfVM+idFIDzS8OXetnujHGre7RY3BGL+3njy9+r8Dcu3VALkZHfuBEPf6byKssTCgzxU1BvLczifXd5w==
  resolved "https://codeload.github.com/ikreymer/puppeteer-cluster/tar.gz/6901d46224bd69cc0c96ad5f015269c3ca885ede"
  dependencies:
    debug "^4.1.1"

@@ -4858,12 +4856,7 @@ write-file-atomic@^3.0.0:
    signal-exit "^3.0.2"
    typedarray-to-buffer "^3.1.5"

ws@^7.2.3:
  version "7.5.3"
  resolved "https://registry.yarnpkg.com/ws/-/ws-7.5.3.tgz#160835b63c7d97bfab418fc1b8a9fced2ac01a74"
  integrity sha512-kQ/dHIzuLrS6Je9+uv81ueZomEwH0qVYstcAQ4/Z93K8zeko9gtAbttJWzoC5ukqXY1PpoouV3+VSOqEAFt5wg==

ws@^7.4.4:
ws@^7.2.3, ws@^7.4.4:
  version "7.4.5"
  resolved "https://registry.yarnpkg.com/ws/-/ws-7.4.5.tgz#a484dd851e9beb6fdb420027e3885e8ce48986c1"
  integrity sha512-xzyu3hFvomRfXKH8vOFMU3OguG6oOvhXMo3xsGy3xWExqaM2dxBbVxuD99O7m3ZUFMvvscsZDqxfgMaRr/Nr1g==