browsertrix-crawler/main.js
Ilya Kreymer 39ddecd35e
State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78)
* save state work:
- support interrupting and saving crawl
- support loading crawl state (frontier queue, pending, done) from YAML
- support scope check when loading to apply new scoping rules when restarting crawl
- failed urls added to done as failed, can be retried if crawl is stopped and restarted
- save state to crawls/crawl-<ts>-<id>.yaml when interrupted
- --saveState option controls when crawl state is saved; defaults to 'partial' (save only when interrupted), with 'always' and 'never' as the other choices.
- support in-memory or redis based crawl state, using fork of puppeteer-cluster
- --redisStore used to enable redis-based state
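For illustration, a saved-state file along the lines described above might look roughly like this. This is a hypothetical sketch: the commit only specifies that the state holds the frontier queue, pending, and done lists and that failed urls land in done; the exact field names and layout here are assumptions, not the actual file format.

```yaml
# crawls/crawl-<ts>-<id>.yaml  (path template from the notes above)
state:
  done:
    - url: https://example.com/
      seed: true
    - url: https://example.com/broken-link
      failed: true        # failed urls go into done, retried on restart
  pending:
    - https://example.com/contact
  queued:
    - https://example.com/about
```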



* signals/crawl interruption:
- crawl state set to drain, so no further urls are handed out for crawling
- graceful stop of crawl in response to sigint/sigterm
- first sigint/sigterm waits for current pages to finish gracefully; a second one terminates immediately
- an initial sigabrt followed by sigterm terminates immediately
- puppeteer handleSIGTERM, handleSIGHUP, handleSIGINT disabled so the crawler manages its own signal handling

* redis state support:
- use lua scripts for atomic move from queue -> pending, and pending -> done
- pending key expiry set to page timeout
- add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
- drainMax returns the numPending() + numSeen() to work with cluster stats
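The queue/pending/done semantics above can be sketched with an in-memory analog. This is illustrative only: the class and method bodies below are assumptions standing in for the real redis-backed state (where the queue→pending and pending→done moves are made atomic with lua scripts), but the `numPending()`, `numSeen()`, and `drainMax()` semantics follow the notes above.

```javascript
// Hypothetical in-memory analog of the crawl state (not the actual
// browsertrix-crawler API). Under redis, shift() and markDone() would
// be single lua scripts so each move is atomic.
class MemoryCrawlState {
  constructor() {
    this.queue = [];          // frontier: urls waiting to be crawled
    this.pending = new Set(); // urls currently being crawled
    this.done = [];           // finished urls
    this.seen = new Set();    // every url ever queued (for dedup)
    this.draining = false;
  }

  add(url) {
    if (this.seen.has(url)) return false; // already seen, skip
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  // queue -> pending (atomic via lua under redis)
  shift() {
    const url = this.queue.shift();
    if (url) this.pending.add(url);
    return url;
  }

  // pending -> done (atomic via lua under redis)
  markDone(url) {
    this.pending.delete(url);
    this.done.push(url);
  }

  setDrain() { this.draining = true; }

  numPending() { return this.pending.size; }
  numSeen() { return this.seen.size; }

  // while draining, report numPending() + numSeen() so the cluster's
  // stats see the queue as exhausted and terminate early
  drainMax() {
    return this.draining ? this.numPending() + this.numSeen() : 0;
  }
}
```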

* arg improvements:
- add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
- support setting cmdline args via env var CRAWL_ARGS
- use 'choices' in args when possible

* build update:
- switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds
- use setuptools<58.0

* misc crawl/scoping rule fixes:
- scoping rules fix when external is used with scopeType
state:
- limit: ensure no urls, including initial seeds, are added past the limit
- signals: fix immediate shutdown on second signal
- tests: add scope test for default scope + excludes
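The limit fix above amounts to checking the seen count before queueing anything, seeds included. A minimal sketch, with hypothetical helper names standing in for the real state API:

```javascript
// Hypothetical illustration of the limit fix: no url, including an
// initial seed, is queued once the page limit has been reached.
function makeState() {
  const seen = new Set();
  return {
    add: (url) => seen.add(url),
    numSeen: () => seen.size,
  };
}

function addIfUnderLimit(state, url, limit) {
  // a limit of 0 means unlimited; otherwise refuse once the count
  // of everything ever queued reaches the limit
  if (limit > 0 && state.numSeen() >= limit) {
    return false;
  }
  state.add(url);
  return true;
}
```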

* py-wacz update
- add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
- pywb: use latest pywb branch for improved twitter video capture

* update to latest browsertrix-behaviors

* fix setuptools dependency #88

* update README for 0.5.0 beta
2021-09-28 09:41:16 -07:00


#!/usr/bin/env node

var crawler = null;

var lastSigInt = 0;
let forceTerm = false;


async function handleTerminate() {
  if (!crawler) {
    process.exit(0);
  }

  try {
    if (!crawler.crawlState.drainMax) {
      // first signal: drain the queue and let in-progress pages finish
      console.log("SIGNAL: gracefully finishing current pages...");
      crawler.crawlState.setDrain();
    } else if ((Date.now() - lastSigInt) > 200) {
      // repeated signal: save crawl state and exit immediately
      console.log("SIGNAL: stopping crawl now...");
      await crawler.serializeConfig();
      process.exit(0);
    }
    lastSigInt = Date.now();
  } catch (e) {
    console.log(e);
  }
}

process.on("SIGINT", async () => {
  console.log("SIGINT received...");
  await handleTerminate();
});

process.on("SIGTERM", async () => {
  if (forceTerm) {
    console.log("SIGTERM received, exit immediately");
    process.exit(1);
  }

  console.log("SIGTERM received...");
  await handleTerminate();
});

process.on("SIGABRT", async () => {
  console.log("SIGABRT received, will force immediate exit on SIGTERM");
  forceTerm = true;
});

const { Crawler } = require("./crawler");

crawler = new Crawler();
crawler.run();