Stowage/browsertrix-crawler - Remotebranch.eu

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	93b6dad7b9	Health Check + Size Limits + Profile fixes (#138) - Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check - Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded. - Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded. - Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted. - S3 Storage refactor, simplify, don't add additional paths by default. - Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value. - wacz save: reenable wacz validation after save. - Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs. - bump to 0.6.0-beta.1	2022-05-18 22:51:55 -07:00
Ilya Kreymer	81e8fa6da7	Incremental save state (#124) * save state: if --saveState set to always, incrementally save state every --saveStateInterval seconds, and keep last --saveStateHistory number of save states in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states - display save state status on startup page - write fixes: add missing await fix for #113 * update README	2022-03-14 10:41:56 -07:00
Ilya Kreymer	0c32d0f223	add 'scopeType: domain' to include all subdomains + http/https include (#117) - add 'scopeType: domain' to include all subdomains of a given seed url, eg. given `https://example.com/path` as starting seed, will consider `https://*.example.com/` to be in scope. - include both http/https in all the default scopes except single page (page-spa, prefix, host, domain), eg. given https://example.com/, will also include http://example.com/ - fixes #116	2022-03-06 14:46:14 -08:00
Ilya Kreymer	ef53b1acea	Screencast Refactor (#108) - Move connection data to separate transport class, in addition to current, direct connection via WS, also support sending screencast data via redis pubsub - Implement WSTransport and RedisPubSubTransport for screencasting - Redis screencasting enabled when --redisStoreUrl is set and --screencastRedis is set. - Redis screencasting uses pubsub channels: * a ctrl channel is used to start/stop screencasting * a data channel is used to send screencast messages Simplify screencasting messages: {"msg": "screencast", "id": "<page id>", "url": "<page url>", "data": "<png base64 data>"} - for new and incremental screencast frames for page id {"msg": "close", "id": "<page id>"} - to indicate page id has closed. Rename html dir from screencast -> html	2022-02-23 12:09:48 -08:00
Ilya Kreymer	a54ca6e51d	scopes: - fix scopeType prefix set + exclude not reverting to custom - only mark include + scopeType as overlapping	2022-02-13 14:34:25 -08:00
Ilya Kreymer	201eab4ad1	Support Extra Hops beyond current scope with --extraHops option (#98) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2	2022-01-15 09:03:09 -08:00
Ilya Kreymer	39ddecd35e	State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78) * save state work: - support interrupting and saving crawl - support loading crawl state (frontier queue, pending, done) from YAML - support scope check when loading to apply new scoping rules when restarting crawl - failed urls added to done as failed, can be retried if crawl is stopped and restarted - save state to crawls/crawl-<ts>-<id>.yaml when interrupted - --saveState option controls when crawl state is saved, default to partial/when interrupted, also always, never. - support in-memory or redis based crawl state, using fork of puppeteer-cluster - --redisStore used to enable redis-based state * signals/crawl interruption: - crawl state set to drain/not provide any more urls to crawl - graceful stop of crawl in response to sigint/sigterm - initial sigint/sigterm waits for graceful end of current pages, second terminates immediately - initial sigabrt followed by sigterm terminates immediately - puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT * redis state support: - use lua scripts for atomic move from queue -> pending, and pending -> done - pending key expiry set to page timeout - add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination - drainMax returns the numPending() + numSeen() to work with cluster stats * arg improvements: - add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file) - support setting cmdline args via env var CRAWL_ARGS - use 'choices' in args when possible * build update: - switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds - use setuptools<58.0 * misc crawl/scoping rule fixes: - scoping rules fix when external is used with scopeType state: - limit: ensure no urls, including initial seeds, are added past the limit - signals: fix immediate shutdown on second signal - tests: add scope test for default scope + excludes * py-wacz update - add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2) - pywb: use latest pywb branch for improved twitter video capture * update to latest browsertrix-behaviors * fix setuptools dependency #88 * update README for 0.5.0 beta	2021-09-28 09:41:16 -07:00
Ilya Kreymer	f4c6b6a99f	0.4.1 Release! (#70) * optimization: don't intercept requests if no blockRules set * page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages * add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds) * refactor profile loadProfile/saveProfile to util/browser.js - support augmenting existing profile when creating a new profile * screencasting: convert newContext to window instead of page by default, instead of just warning about it * shared multiplatform image support: - determine browser exe from list of options, getBrowserExe() returns current exe - supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64 - update to multiplatform oldwebtoday/chrome:91 as browser image - enable multiplatform build with latest build-push-action@v2 * seeds: add trim() to seed URLs * logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically * profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles * extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25 * update CHANGES and README with new features * bump version to 0.4.1	2021-07-22 14:24:51 -07:00
Ilya Kreymer	6dbdff9656	Support for per-URL conditional Block Rules (#68) - Support for block rules specified in YAML config to exclude URLs based on regex, and also negate a rule by specifying `allowOnly` to allow URLs based on certain regex. - Support for conditional blocking for iframes, based on content of iframe text, specified via frameTextMatch regex. - Support for restricting block rules based on containing frame URL, specified via inFrameURL param. - Testing for various blockRules configurations - Fixes Support URL-level WARC-writing inclusion/exclusion lists #15 - optional message to add when a URL is blocked, specified via 'blockMessage' - update README for blockRules - bump to pywb dependency 2.5.0b4	2021-07-19 15:50:32 -07:00
Emma Dickson	c02855627c	Add fields to warcinfo in combinedwarc (#60) * add support for adding custom warcinfo fields via the 'warcinfo' block in yaml config or via --warcinfo.<field> command-line options * tests: add tests for warcinfo custom and standard fields ('software' and 'format') being added to warcinfo * fix warcio.js version being added incorrectly * switch to warc/1.0 for warcinfo field to match generated warcs from pywb, which use warc/1.0 (for now) Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2021-07-07 15:56:52 -07:00
Ilya Kreymer	473de8c49f	Scope Handling Improvements + Tests (#66) * scope fixes: - remove default prefix scopeType, ensure scope include and exclude take precedence - add new 'custom' scopeType, when include or exclude are used - use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude) - ensure per-seed scope include/exclude used when present, and scopeType set to 'custom' - ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified - rename --type to --scopeType in seed to maintain consistency - add sitemap param as alias for useSitemap tests: - add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides - fix screencaster to use relative paths to work with tests - ci: use yarn instead of npm * update README with new flags * bump version to 0.4.0-beta.3	2021-07-06 20:22:27 -07:00
Ilya Kreymer	ef7d5e50d8	Per-Seed Scoping Rules + Crawl Depth (#63) * scoped seeds: - support per-seed scoping (include + exclude), allowHash, depth, and sitemap options - support maxDepth per seed #16 - combine --url, --seed and --urlFile/--seedFile urls into a unified seed list arg parsing: - simplify seed file options into --seedFile/--urlFile, move option in help display - rename --maxDepth -> --depth, supported globally and per seed - ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation) - update to latest js-yaml - rename --yamlConfig -> --config - config: support reading config from stdin if --config set to 'stdin' * scope: fix typo in 'prefix' scope * update browsertrix-behaviors to 0.2.2 * tests: add test for passing config via stdin, also adding --excludes via cmdline * update README: - latest cli, add docs on config via stdin - rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position - info on scoped seeds - list current scope types	2021-06-26 13:11:29 -07:00
Ilya Kreymer	3ebe511b32	Arg Parsing Refactor + Support for YAML Config Support (take 2!) (#59) * Create an argument parser class * move constants, arg parser to separate files in utils/* * ensure yaml config overridden by command-line args * yaml loading work: - simplify yaml config by using yargs.config option - move all option parsing to argParser, simply expose parseArgs - export constants directly - add lint to util/* files * support inline 'seeds' in cmdline and yaml config tests: - add test for crawl config, ensuring seeds crawled + wacz created - add test to ensure cmdline overrides yaml config * scope fix: empty scope implies only fixed list, use '.' for any scope lint fix * update readme with yaml config info * allow 'url' and 'seeds' if both provided Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local> Co-authored-by: emmadickson <emma.dickson@artsymail.com>	2021-06-23 19:45:40 -07:00
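
The health check rule stated in commit 93b6dad7b9 (200 while the number of consecutive failed page loads stays below twice the number of workers, 503 otherwise) can be sketched as below. This is only an illustration in Python, not the crawler's own code; the function name `health_status` and its parameter names are hypothetical.

```python
def health_status(consecutive_failures: int, num_workers: int) -> int:
    """Return the HTTP status a health check endpoint might report.

    Hypothetical helper: follows the rule from the commit message, where the
    check succeeds while consecutive failed page loads < 2 * number of workers.
    """
    if consecutive_failures < 2 * num_workers:
        return 200  # healthy
    return 503  # too many consecutive failures


if __name__ == "__main__":
    # With 2 workers the threshold is 4 consecutive failures.
    for failures in range(6):
        print(failures, health_status(failures, num_workers=2))
```

Under this rule a single worker is reported unhealthy after two consecutive failed page loads, while a larger worker pool tolerates proportionally more.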