Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Author	SHA1	Message	Date
Tessa Walsh	e1fe028c7c	Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494 ) Fixes #493 This PR updates the documentation for Browsertrix Crawler 1.0.0 and moves it from the project README to an MKDocs site. Initial docs site set to https://crawler.docs.browsertrix.com/ Many thanks to @Shrinks99 for help setting this up! --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2024-03-16 14:59:32 -07:00
Emma Segal-Grossman	2a49406df7	Add Prettier to the repo, and format all the files! (#428 ) This adds prettier to the repo, and sets up the pre-commit hook to auto-format as well as lint. Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.	2023-11-09 16:11:11 -08:00
Ilya Kreymer	15661eb9c8	More flexible multi value arg parsing + README update for 0.12.0 (#422 ) Updated arg parsing thanks to example in https://github.com/yargs/yargs/issues/846#issuecomment-517264899 to support multiple value arguments specified as either one string or multiple string using array type + coerce function. This allows for `choice` option to also be used to validate the options, when needed. With this setup, `--text to-pages,to-warc,final-to-warc`, `--text to-pages,to-warc --text final-to-warc` and `--text to-pages --text to-warc --text final-to-warc` all result in the same configuration! Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system. Also updated README with new text extraction options and bumped version to 0.12.0 --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-11-02 11:47:37 -07:00
gitreich	18dce9534e	Update README.md (#390 ) added missing quotes in command to extend an existing profile	2023-09-29 09:23:05 -07:00
Ilya Kreymer	debfe8945f	README: add --restartOnError cli opt	2023-09-15 11:22:52 -07:00
Anish Lakhwara	5bd4fedff9	Add example of mounting custom behaviours (#369 ) * feat: add docker mount custom behavior to README * Add link to behaviors tutorial --------- Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-09-13 10:47:05 -07:00
Graham Hukill	1eeee2c215	Surface lastmod option for sitemap parser (#367 ) * Surface lastmod option for sitemap parser - Add --sitemapFromDate to use along with --useSitemap which will filter sitemap by on or after specified ISO date. The library used to parse sitemaps for URLs added an optional "lastmod" argument in v3.2.5 that allows filtering URLs returned by a "last_modified" element present in sitemap XMLs. This surfaces that argument to the browsertrix-crawler CLI runtime parameters. This can be useful for orienting a crawl around a list of seeds known to contain sitemaps, but are only interested in including URLs that have been modified on or after X date. --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-09-13 10:20:41 -07:00
Tessa Walsh	74831373fd	Update README options	2023-07-06 15:21:30 -04:00
wvengen	de2b4512b6	Allow configuration of deduplication policy (#331 ) (#332 )	2023-07-06 14:54:35 -04:00
Ilya Kreymer	71b618fe94	Switch back to Puppeteer from Playwright (#301 ) - reduced memory usage, avoids memory leak issues caused by using playwright (see #298) - browser: split Browser into Browser and BaseBrowser - browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later - browser: use defaultArgs from playwright - browser: attempt to recover if initial target is gone - logging: add debug logging from process.memoryUsage() after every page - request interception: use priorities for cooperative request interception - request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used - request interception: fix originOverrides enabled check, fix to work with catch-all request interception - default args: set --waitUntil back to 'load,networkidle2' - Update README with changes for puppeteer - tests: fix extra hops depth test to ensure more than one page crawled --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-04-26 15:41:35 -07:00
Tessa Walsh	b303af02ef	Add --title and --description CLI args to write metadata into datapackage.json (#276 ) Multi-word values including spaces must be enclosed in double quotes. Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2023-04-04 10:46:03 -04:00
Ilya Kreymer	78faa965c5	Add --maxPageLimit override (#275 ) * max page limit: - rename --limit -> --pageLimit (keep alias for now) - add new --maxPageLimit flag which overrides --pageLimit to ensure it is not greater than max - readme: add new --pageLimit, --maxPageLimit to README	2023-04-03 11:10:47 -07:00
Tessa Walsh	d8c505a076	Update README for 0.9.0 (#272 ) * Update README for Playwright/0.9.0 * Add ad blocking to README	2023-04-02 21:55:14 -07:00
Tessa Walsh	b0e93cb06e	Add option for sleep interval after behaviors run + timing cleanup (#257 ) * Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131) * Store total page time in 'maxPageTime', include pageExtraDelay * Rename timeout->pageLoadTimeout * cleanup: - store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions - add secondsElapsed() utility function to help checking time elapsed - cleanup comments --------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-03-22 11:50:18 -07:00
Ilya Kreymer	82808d8133	Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253 ) * Migrate from Puppeteer to Playwright! - use playwright persistent browser context to support profiles - move on-new-page setup actions to worker - fix screencaster, init only one per page object, associate with worker-id - fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage - port additional chromium setup options - create / detach cdp per page for each new page, screencaster just uses existing cdp - fix evaluateWithCLI to call CDP command directly - workers directly during WorkerPool - await not necessary * State / Worker Refactor (#252) * refactoring state: - use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState - remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster - switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150) - override console.error to avoid logging ioredis errors (fixes #244) - add MAX_DEPTH as const for extraHops - fix immediate exit on second interrupt * worker/state refactor: - remove job object from puppeteer-cluster - rename shift() -> nextFromQueue() - condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc... - screencaster: don't screencast about:blank pages * more worker queue refactor: - remove p-queue - initialize PageWorkers which run in its own loop to process pages, until no pending pages, no queued pages - add setupPage(), teardownPage() to crawler, called from worker - await runWorkers() promise which runs all workers until completion - remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code) - bump to 0.9.0-beta.1 * use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition) * more fixes for playwright: - fix profile creation - browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout - crawler: various fixes, including for html check - logging: additional logging for screencaster, new window, etc... - remove unused packages --------- Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>	2023-03-17 12:50:32 -07:00
Tessa Walsh	1bee46b321	Remove puppeteer-cluster + iframe filtering + health check refactor + logging improvements (0.9.0-beta.0) (#219 ) * This commit removes puppeteer-cluster as a dependency in favor of a simpler concurrency implementation, using p-queue to limit concurrency to the number of available workers. As part of the refactor, the custom window concurrency model in windowconcur.js is removed and its logic implemented in the new Worker class's initPage method. * Remove concurrency models, always use new tab * logging improvements: include worker-id in logs, use 'worker' context - logging: log info string / version as first line - logging: improve logging of error stack traces - interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue - interruption: don't repair if interrupting, wait for queue to be idle - log text extraction - init order: ensure wb-manager init called first, then logs created - logging: adjust info->debug logging - Log no jobs available as debug * tests: bail on first failure * iframe filtering: - fix filtering for about:blank iframes, support non-async shouldProcessFrame() - filter iframes both for behaviors and for link extraction - add 5-second timeout to link extraction, to avoid link extraction holding up crawl! - cache filtered frames * healthcheck/worker reuse: - refactor healthchecker into separate class - increment healthchecker (if provided) if new page load fails - remove experimental repair functionality for now - add healthcheck * deps: bump puppeteer-core to 17.1.2 - bump to 0.9.0-beta.0 -------- Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2023-03-08 18:31:19 -08:00
Sara Tavares	5b1f224dcb	fix typos (#232 )	2023-02-24 11:09:40 -08:00
Ilya Kreymer	5ee05985b1	Use VNC for headful profile creation (#197 ) * profiles: use vnc for automatic profile creation (fixes #194): - add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode - use @novnc/novnc to serve vnc JS library - add novnc_lite.html to serve the content from an iframe - optimization: don't show initial blank page / don't wait for initial page in puppeteer * more vnc work: - set position of browser at 0,0, avoid needing offset to fit - add /vncpass endpoint to query vnc password (for use with browsertrix-cloud) - remove websockify, x11vnc now supports ws connections directly! - vnc_lite: support reconnecting ws if gracefully disconnected * x11vnc cleanup: just pass password via cmdline to simplify setup * make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified README updates: - mention new VNC-based streaming - mention new --automated flag, move automated info below interactive * README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently	2023-01-09 23:56:53 -08:00
Tessa Walsh	f35d495103	Add screenshot functionality (#188 ) * Add screenshot and thumbnail functionality Introduces a --screenshot CLI option, which takes a comma-separated list of screenshot types: view,fullPage,thumbnail. In addition, this commit: - Adds '--experimental-global-webcrypto' to ensure webcrypto is available in node - Deprecates newContext, instead always using page context for 1 worker and window context for >1 worker * Separate screenshotTypes into exported const Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>	2022-12-21 09:06:13 -08:00
Tim	5b738bd24e	Fix incorrect `combineWARCs` property in README.md (#180 ) This stumped me for a little while. The actual property isn't plural.	2022-11-14 22:17:44 -08:00
Ilya Kreymer	be3b6b85fa	README: update default behaviors in README, fixes #169	2022-10-11 15:33:32 -07:00
Ed Summers	3ba64535a5	Run in Docker as User (#171 ) * Run in Docker as User This follows a similar pattern to pywb to run as the user that owns the crawls directory. bump version to 0.7.0-beta.6 Closes #170	2022-09-28 12:49:52 -07:00
raffaele messuti	a527cc9b36	Update README.md (#147 ) fix link to puppeteer waitUntil	2022-08-11 18:28:54 -07:00
Ilya Kreymer	93b6dad7b9	Health Check + Size Limits + Profile fixes (#138 ) - Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check - Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded. - Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded. - Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted. - S3 Storage refactor, simplify, don't add additional paths by default. - Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value. - wacz save: reenable wacz validation after save. - Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs. - bump to 0.6.0-beta.1	2022-05-18 22:51:55 -07:00
Ilya Kreymer	12d96f22c6	Profile download support (#126 ) * profiles: support loading profiles via a URL. * add 'request' dependency * README: mention profile URLs	2022-03-14 14:44:24 -07:00
Ilya Kreymer	81e8fa6da7	Incremental save state (#124 ) * save state: if --saveState set to always, incrementally save state every --saveStateInterval seconds, and keep last --saveStateHistory number of save states in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states display save state status on startup page write fixes: add missing await fix for #113 * update README	2022-03-14 10:41:56 -07:00
phiresky	fb297574c7	add documentation of env variables for socks proxy + browser extensions (#120 )	2022-03-13 15:00:46 -07:00
Chris Millson	7f1ea89456	Fix typo in regex yaml example (#121 ) crawl-this\|crawl-that didn't have () around it in the yaml example	2022-03-11 13:54:13 -08:00
Ilya Kreymer	7588f8d572	README: update README for #116 , mention 'scopeType: domain' and http/https scope inclusion	2022-03-06 14:51:16 -08:00
Ilya Kreymer	201eab4ad1	Support Extra Hops beyond current scope with --extraHops option (#98 ) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2	2022-01-15 09:03:09 -08:00
Ilya Kreymer	9f541ab011	Support for uploading to S3 (#95 ) - support uploading WACZ to s3-compatible storage (via minio client) - config storage loaded from env vars, enabled when WACZ output is used. - support pinging either an http or a redis key-based webhook, - webhook: include 'completed' bool to indicate if fully completed crawl or partial (eg. interrupted via signal) - consolidate redis init to redis.js - support upload filename with custom variables: can interpolate current timestamp (@ts), hostname (@hostname) and user provided id (@crawlId) - README: add docs for s3 storage, remove unused args - update to pywb 2.6.2, browsertrix-behaviors 0.2.4 * fix to `limit` option, ensure limit check uses shared state * bump version to 0.5.0-beta.1	2021-11-23 12:53:30 -08:00
Ilya Kreymer	39ddecd35e	State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78 ) * save state work: - support interrupting and saving crawl - support loading crawl state (frontier queue, pending, done) from YAML - support scope check when loading to apply new scoping rules when restarting crawl - failed urls added to done as failed, can be retried if crawl is stopped and restarted - save state to crawls/crawl-<ts>-<id>.yaml when interrupted - --saveState option controls when crawl state is saved, default to partial/when interrupted, also always, never. - support in-memory or redis based crawl state, using fork of puppeteer-cluster - --redisStore used to enable redis-based state * signals/crawl interruption: - crawl state set to drain/not provide any more urls to crawl - graceful stop of crawl in response to sigint/sigterm - initial sigint/sigterm waits for graceful end of current pages, second terminates immediately - initial sigabrt followed by sigterm terminates immediately - puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT * redis state support: - use lua scripts for atomic move from queue -> pending, and pending -> done - pending key expiry set to page timeout - add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination - drainMax returns the numPending() + numSeen() to work with cluster stats * arg improvements: - add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file) - support setting cmdline args via env var CRAWL_ARGS - use 'choices' in args when possible * build update: - switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds - use setuptools<58.0 * misc crawl/scoping rule fixes: - scoping rules fix when external is used with scopeType state: - limit: ensure no urls, including initial seeds, are added past the limit - signals: fix immediate shutdown on second signal - tests: add scope test for default scope + excludes * py-wacz update - add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2) - pywb: use latest pywb branch for improved twitter video capture * update to latest browsertrix-behaviors * fix setuptools dependency #88 * update README for 0.5.0 beta	2021-09-28 09:41:16 -07:00
Ilya Kreymer	2956be2026	README: make profile paths in README consistent, fixes #84	2021-08-29 14:20:36 -07:00
Ilya Kreymer	c5494be653	Page Resource Block Rules Avoid Duplicate Handlers + Ignore top-level pages + README update (0.4.4) (#81 ) * blockrules improvements: - add await to continue/abort to catch errors, each called only in one place. - avoid adding multiple interception handlers for same page to avoid 'request already handled' errors - disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning * setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set. * scopeType rename: - rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage - rename 'none' -> page to indicate default single-page-only crawl - messaging: adjust error message displaying valid scopeTypes * README: Add additional examples for scope rules, update scopeType param, explain difference between scope rules vs block rules, to better address confusion as per #80 bump to 0.4.4	2021-08-17 20:54:18 -07:00
Rebecca Sutton Koeser	4033c52693	Revise docker syntax for screencast examples (#77 ) Specify port binding option as a parameter of `docker run` instead of within the `crawl` command	2021-08-05 13:06:14 -07:00
Ilya Kreymer	d27e67e92e	README: fix invalid dashes, addresses #76	2021-07-28 15:43:36 -07:00
Ilya Kreymer	be1ee53c3e	BlockRules Fixes (0.4.3) (#75 ) - blockrules fix: when checking an iframe nav request, match inFrameUrl against the parent iframe, not current one - blockrules: cleanup, always allow 'pywb.proxy' static files - logging: when 'debug' logging enabled, log urls blocked and conditional iframe checks from blockrules - tests: add more complex test for blockrules - update CHANGES and support info in README - bump to 0.4.3	2021-07-27 09:41:21 -07:00
Ilya Kreymer	36ac3cb905	Update README.md with new features from 0.4.1 release!	2021-07-22 17:55:42 -07:00
Ilya Kreymer	bd44190ab2	Build simplification: Use :latest Version By default + README update (#71 ) * docker-compose: just use ':latest' tag for local builds, allow users working with local docker-compose.yml to just build latest image - ci: add 'latest' tag to release ci build to automatically update latest as well - README: remove '[VERSION]', just refer to latest version of image in all examples - README: mention using specific released tag version for production	2021-07-22 17:46:10 -07:00
Ilya Kreymer	f4c6b6a99f	0.4.1 Release! (#70 ) * optimization: don't intercept requests if no blockRules set * page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages * add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds) * refactor profile loadProfile/saveProfile to util/browser.js - support augmenting existing profile when creating a new profile * screencasting: convert newContext to window instead of page by default, instead of just warning about it * shared multiplatform image support: - determine browser exe from list of options, getBrowserExe() returns current exe - supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64 - update to multiplatform oldwebtoday/chrome:91 as browser image - enable multiplatform build with latest build-push-action@v2 * seeds: add trim() to seed URLs * logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically * profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles * extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25 * update CHANGES and README with new features * bump version to 0.4.1	2021-07-22 14:24:51 -07:00
Ilya Kreymer	d40cf6cc2b	Interactive Profiles + bug fixes (#69 ) * support for interactive profile creation mode via --interactive file * screencasting error catching, ensure errors in screencasting do not interrupt crawl * better error reporting for invalid seed URLs, fixes #67 * README: update to mention interactive profile creation, additional * dependencies: update to pywb 2.6.0b4, py-wacz 0.3.1, browsertrix-behaviors 0.2.3	2021-07-20 15:45:51 -07:00
Ilya Kreymer	6dbdff9656	Support for per-URL conditional Block Rules (#68 ) - Support for block rules specified in YAML config to exclude URLs based on regex, and also negate a rule by specifying `allowOnly` to allow URLs based on certain regex. - Support for conditional blocking for iframes, based on content of iframe text, specified via frameTextMatch regex. - Support for restricting block rules based on containing frame URL, specified via inFrameURL param. - Testing for various blockRules configurations - Fixes Support URL-level WARC-writing inclusion/exclusion lists #15 - optional message to add when a URL is blocked, specified via 'blockMessage' - update README for blockRules - bump to pywb dependency 2.5.0b4	2021-07-19 15:50:32 -07:00
Emma Dickson	838e1fa1bd	Documentation Update (#58 ) * README: update documentation to be more clear about how to use the seed file option Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2021-07-08 16:06:10 -07:00
Emma Dickson	c02855627c	Add fields to warcinfo in combinedwarc (#60 ) * add support for adding custom warcinfo fields via the 'warcinfo' block in yaml config or via --warcinfo.<field> command-line options * tests: add tests for warcinfo custom and standard fields ('software' and 'format') being added to warcinfo * fix warcio.js version being added incorrectly * switch to warc/1.0 for warcinfo field to match generated warcs from pywb, which use warc/1.0 (for now) Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2021-07-07 15:56:52 -07:00
Ilya Kreymer	473de8c49f	Scope Handling Improvements + Tests (#66 ) * scope fixes: - remove default prefix scopeType, ensure scope include and exclude take precedence - add new 'custom' scopeType, when include or exclude are used - use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude) - ensure per-seed scope include/exclude used when present, and scopeType set to 'custom' - ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified - rename --type to --scopeType in seed to maintain consistency - add sitemap param as alias for useSitemap tests: - add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides - fix screencaster to use relative paths to work with tests - ci: use yarn instead of npm * update README with new flags * bump version to 0.4.0-beta.3	2021-07-06 20:22:27 -07:00
Ilya Kreymer	ef7d5e50d8	Per-Seed Scoping Rules + Crawl Depth (#63 ) * scoped seeds: - support per-seed scoping (include + exclude), allowHash, depth, and sitemap options - support maxDepth per seed #16 - combine --url, --seed and --urlFile/--seedFile urls into a unified seed list arg parsing: - simplify seed file options into --seedFile/--urlFile, move option in help display - rename --maxDepth -> --depth, supported globally and per seed - ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation) - update to latest js-yaml - rename --yamlConfig -> --config - config: support reading config from stdin if --config set to 'stdin' * scope: fix typo in 'prefix' scope * update browsertrix-behaviors to 0.2.2 * tests: add test for passing config via stdin, also adding --excludes via cmdline * update README: - latest cli, add docs on config via stdin - rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position - info on scoped seeds - list current scope types	2021-06-26 13:11:29 -07:00
Ilya Kreymer	f57818f2f6	New Docker Image, Customizable Browser Source + Binary (#62 ) * switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!) * add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...) * github action ci: use system unzip * update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file. * Update README with info on customizing build image * bump version to 0.4.0-beta.2	2021-06-24 15:39:17 -07:00
Ilya Kreymer	3ebe511b32	Arg Parsing Refactor + Support for YAML Config Support (take 2!) (#59 ) * Create an argument parser class * move constants, arg parser to separate files in utils/* * ensure yaml config overridden by command-line args * yaml loading work: - simplify yaml config by using yargs.config option - move all option parsing to argParser, simply expose parseArgs - export constants directly - add lint to util/* files * support inline 'seeds' in cmdline and yaml config tests: - add test for crawl config, ensuring seeds crawled + wacz created - add test to ensure cmdline overrides yaml config * scope fix: empty scope implies only fixed list, use '.' for any scope lint fix * update readme with yaml config info * allow 'url' and 'seeds' if both provided Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local> Co-authored-by: emmadickson <emma.dickson@artsymail.com>	2021-06-23 19:45:40 -07:00
Ilya Kreymer	ae4ce979fb	Screencast Support for Debugging (fixes #43 ) (#52 ) * screencast support (fixes #43): - add NewWindowPage concurrency mode to support opening new window, and also reusing pages - add --screencastPort cli options to enable screencasting, uses websockets to stream frames to client - concurrency: add separate 'window' concurrency for opening new window per-page in same session, useful for screencasting with multiple workers but within same session * add warning if using screencasting + more than one worker + page context, recommend 'window' * cleanup: remove debug console, bump py-wacz dependency, improve close message * README: add screencasting info to README	2021-06-07 17:43:36 -07:00
Emma Dickson	63376ab6ac	Add --urlFile param to specify text file with a list of URLs to crawl (#38 ) * Resolves #12 * Make --url param optional. Only one of --url or --urlFile should be specified. * Add ignoreScope option queueUrls() to support adding specific URLs * add tests for urlFile * bump version to 0.3.2 Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>	2021-05-12 22:57:06 -07:00

60 commits