60 commits

Author SHA1 Message Date
Tessa Walsh
e1fe028c7c
Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494)
Fixes #493 

This PR updates the documentation for Browsertrix Crawler 1.0.0 and
moves it from the project README to an MKDocs site.

Initial docs site set to https://crawler.docs.browsertrix.com/

Many thanks to @Shrinks99 for help setting this up!

---------

Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-03-16 14:59:32 -07:00
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Ilya Kreymer
15661eb9c8
More flexible multi value arg parsing + README update for 0.12.0 (#422)
Updated arg parsing thanks to example in
https://github.com/yargs/yargs/issues/846#issuecomment-517264899
to support multiple value arguments specified as either one string or
multiple strings using array type + coerce function.

This allows for `choice` option to also be used to validate the options,
when needed.

With this setup, `--text to-pages,to-warc,final-to-warc`, `--text
to-pages,to-warc --text final-to-warc` and `--text to-pages --text
to-warc --text final-to-warc` all result in the same configuration!

Updated other multiple choice args (waitUntil, logging, logLevel, context, behaviors, screenshot) to use the same system.

Also updated README with new text extraction options and bumped version
to 0.12.0
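
The equivalence of the three `--text` forms above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the yargs-based code from this commit; `coerce_multi` and `TEXT_CHOICES` are hypothetical names, and the choice list is taken from the values shown in the message.

```python
def coerce_multi(values, choices=None):
    """Flatten repeated and comma-separated values into one flat list."""
    # A single string is treated like a one-element list.
    if isinstance(values, str):
        values = [values]
    result = []
    for value in values:
        for item in value.split(","):
            item = item.strip()
            if not item:
                continue
            # Optional validation, playing the role of a `choices` check.
            if choices is not None and item not in choices:
                raise ValueError(f"invalid value: {item}")
            result.append(item)
    return result


# Hypothetical choice list, copied from the values in the commit message.
TEXT_CHOICES = ["to-pages", "to-warc", "final-to-warc"]

# Each list holds the raw values collected from repeated --text flags.
one = coerce_multi(["to-pages,to-warc,final-to-warc"], TEXT_CHOICES)
mixed = coerce_multi(["to-pages,to-warc", "final-to-warc"], TEXT_CHOICES)
separate = coerce_multi(["to-pages", "to-warc", "final-to-warc"], TEXT_CHOICES)
```

All three calls produce the same list, which is the behavior the message describes for the array type + coerce setup.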

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-11-02 11:47:37 -07:00
gitreich
18dce9534e
Update README.md (#390)
added missing quotes in command to extend an existing profile
2023-09-29 09:23:05 -07:00
Ilya Kreymer
debfe8945f README: add --restartOnError cli opt 2023-09-15 11:22:52 -07:00
Anish Lakhwara
5bd4fedff9
Add example of mounting custom behaviours (#369)
* feat: add docker mount custom behavior to README

* Add link to behaviors tutorial

---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-09-13 10:47:05 -07:00
Graham Hukill
1eeee2c215
Surface lastmod option for sitemap parser (#367)
* Surface lastmod option for sitemap parser
- Add --sitemapFromDate to use along with --useSitemap, which will filter sitemap URLs to those modified on or after the
specified ISO date.

The library used to parse sitemaps for URLs added an optional
"lastmod" argument in v3.2.5 that allows filtering URLs returned
by a "last_modified" element present in sitemap XMLs.  This
surfaces that argument to the browsertrix-crawler CLI runtime
parameters.

This can be useful for orienting a crawl around a list of seeds
known to contain sitemaps, when you are only interested in including
URLs that have been modified on or after X date.
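
A minimal Python sketch of the "on or after" filtering described above, for illustration only. It does not use the sitemap library mentioned in the message; `filter_by_lastmod` and `parse_iso` are hypothetical names, and skipping entries that have no last-modified value is an assumption of this sketch.

```python
from datetime import datetime, timezone


def parse_iso(value):
    """Parse an ISO date or datetime, assuming UTC when no offset is given."""
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def filter_by_lastmod(entries, from_date):
    """Keep URLs whose last-modified value is on or after from_date."""
    cutoff = parse_iso(from_date)
    kept = []
    for url, lastmod in entries:
        # Assumption: entries without a last-modified value are skipped.
        if lastmod is None:
            continue
        if parse_iso(lastmod) >= cutoff:
            kept.append(url)
    return kept


entries = [
    ("https://example.com/old", "2023-01-15"),
    ("https://example.com/edge", "2023-06-01"),
    ("https://example.com/new", "2023-08-20T10:30:00+00:00"),
    ("https://example.com/undated", None),
]
recent = filter_by_lastmod(entries, "2023-06-01")
```

Here `recent` holds the `edge` and `new` URLs: the cutoff date itself is included, and anything older is dropped.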

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-09-13 10:20:41 -07:00
Tessa Walsh
74831373fd Update README options 2023-07-06 15:21:30 -04:00
wvengen
de2b4512b6
Allow configuration of deduplication policy (#331) (#332) 2023-07-06 14:54:35 -04:00
Ilya Kreymer
71b618fe94
Switch back to Puppeteer from Playwright (#301)
- reduced memory usage, avoids memory leak issues caused by using playwright (see #298) 
- browser: split Browser into Browser and BaseBrowser
- browser: puppeteer-specific functions added to Browser for additional flexibility if need to change again later
- browser: use defaultArgs from playwright
- browser: attempt to recover if initial target is gone
- logging: add debug logging from process.memoryUsage() after every page
- request interception: use priorities for cooperative request interception
- request interception: move to setupPage() to run once per page, enable if any of blockrules, adblockrules or originOverrides are used
- request interception: fix originOverrides enabled check, fix to work with catch-all request interception
- default args: set --waitUntil back to 'load,networkidle2'
- Update README with changes for puppeteer
- tests: fix extra hops depth test to ensure more than one page crawled

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-04-26 15:41:35 -07:00
Tessa Walsh
b303af02ef
Add --title and --description CLI args to write metadata into datapackage.json (#276)
Multi-word values including spaces must be enclosed in double quotes.

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2023-04-04 10:46:03 -04:00
Ilya Kreymer
78faa965c5
Add --maxPageLimit override (#275)
* max page limit:
- rename --limit -> --pageLimit (keep alias for now)
- add new --maxPageLimit flag which overrides --pageLimit to ensure it is not greater than max
- readme: add new --pageLimit, --maxPageLimit to README
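
A minimal Python sketch of how a maximum can cap the requested page limit, for illustration only. `effective_page_limit` is a hypothetical name, and treating 0 as "no limit" for either value is an assumption of this sketch, not something stated in the commit.

```python
def effective_page_limit(page_limit, max_page_limit):
    """Return the page limit to enforce, never greater than the maximum."""
    # Assumption: 0 (or less) means "no limit" for either value.
    if max_page_limit <= 0:
        return page_limit
    if page_limit <= 0:
        return max_page_limit
    return min(page_limit, max_page_limit)
```

With a maximum of 100, a requested limit of 500 is reduced to 100, while a requested limit of 50 is left unchanged.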
2023-04-03 11:10:47 -07:00
Tessa Walsh
d8c505a076
Update README for 0.9.0 (#272)
* Update README for Playwright/0.9.0

* Add ad blocking to README
2023-04-02 21:55:14 -07:00
Tessa Walsh
b0e93cb06e
Add option for sleep interval after behaviors run + timing cleanup (#257)
* Add --pageExtraDelay option to add extra delay/wait time after every page (fixes #131)

* Store total page time in 'maxPageTime', include pageExtraDelay

* Rename timeout->pageLoadTimeout

* cleanup:
- store seconds for most interval checks, convert to ms only for api calls, remove most sec<->ms conversions
- add secondsElapsed() utility function to help checking time elapsed
- cleanup comments

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-22 11:50:18 -07:00
Ilya Kreymer
82808d8133
Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State (#253)
* Migrate from Puppeteer to Playwright!
- use playwright persistent browser context to support profiles
- move on-new-page setup actions to worker
- fix screencaster, init only one per page object, associate with worker-id
- fix device emulation: load on startup, also replace '-' with space for more friendly command-line usage
- port additional chromium setup options
- create / detach cdp per page for each new page, screencaster just uses existing cdp
- fix evaluateWithCLI to call CDP command directly
- workers directly during WorkerPool - await not necessary

* State / Worker Refactor (#252)

* refactoring state:
- use RedisCrawlState, defaulting to local redis, remove MemoryCrawlState and BaseState
- remove 'real' accessors / draining queue - no longer needed without puppeteer-cluster
- switch to sorted set for crawl queue, set depth + extraHops as score, (fixes #150)
- override console.error to avoid logging ioredis errors (fixes #244)
- add MAX_DEPTH as const for extraHops
- fix immediate exit on second interrupt
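
The sorted-set queue mentioned above can be sketched in Python as follows. This is an in-memory stand-in for illustration, not the Redis-backed code from this commit: the class name `CrawlQueue` is hypothetical, the score is assumed to be the sum of depth and extra hops, and ties are broken by URL to mimic how a Redis sorted set orders members with equal scores.

```python
import heapq


class CrawlQueue:
    """In-memory stand-in for a sorted-set crawl queue (lowest score first)."""

    def __init__(self):
        self._heap = []

    def add(self, url, depth, extra_hops=0):
        # Assumption: the score is depth plus extra hops.
        score = depth + extra_hops
        # Tuples compare by score first, then by URL for equal scores.
        heapq.heappush(self._heap, (score, url))

    def next_from_queue(self):
        """Pop the URL with the lowest score, or None if the queue is empty."""
        if not self._heap:
            return None
        _score, url = heapq.heappop(self._heap)
        return url


queue = CrawlQueue()
queue.add("https://example.com/a/b", depth=2)
queue.add("https://example.com/", depth=0)
queue.add("https://other.example/", depth=1, extra_hops=1)
queue.add("https://example.com/a", depth=1)

order = []
while True:
    url = queue.next_from_queue()
    if url is None:
        break
    order.append(url)
```

Scoring by depth means shallower pages are crawled before deeper ones regardless of the order in which they were queued.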

* worker/state refactor:
- remove job object from puppeteer-cluster
- rename shift() -> nextFromQueue()
- condense crawl mgmt logic to crawlPageInWorker: init page, mark pages as finished/failed, close page on failure, etc...
- screencaster: don't screencast about:blank pages

* more worker queue refactor:
- remove p-queue
- initialize PageWorkers which each run in their own loop to process pages, until no pending pages, no queued pages
- add setupPage(), teardownPage() to crawler, called from worker
- await runWorkers() promise which runs all workers until completion
- remove: p-queue, node-fetch, update README (no longer using any puppeteer-cluster base code)
- bump to 0.9.0-beta.1

* use existing data object for per-page context, instead of adding things to page (will be more clear with typescript transition)

* more fixes for playwright:
- fix profile creation
- browser: add newWindowPageWithCDP() to create new page + cdp in new window, use with timeout
- crawler: various fixes, including for html check
- logging: additional logging for screencaster, new window, etc...
- remove unused packages

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2023-03-17 12:50:32 -07:00
Tessa Walsh
1bee46b321
Remove puppeteer-cluster + iframe filtering + health check refactor + logging improvements (0.9.0-beta.0) (#219)
* This commit removes puppeteer-cluster as a dependency in favor of
a simpler concurrency implementation, using p-queue to limit
concurrency to the number of available workers. As part of the
refactor, the custom window concurrency model in windowconcur.js
is removed and its logic implemented in the new Worker class's
initPage method.

* Remove concurrency models, always use new tab

* logging improvements: include worker-id in logs, use 'worker' context
- logging: log info string / version as first line
- logging: improve logging of error stack traces
- interruption: support interrupting crawl directly with 'interrupt' check which stops the job queue
- interruption: don't repair if interrupting, wait for queue to be idle
- log text extraction
- init order: ensure wb-manager init called first, then logs created
- logging: adjust info->debug logging
- Log no jobs available as debug

* tests: bail on first failure

* iframe filtering:
- fix filtering for about:blank iframes, support non-async shouldProcessFrame()
- filter iframes both for behaviors and for link extraction
- add 5-second timeout to link extraction, to avoid link extraction holding up crawl!
- cache filtered frames

* healthcheck/worker reuse:
- refactor healthchecker into separate class
- increment healthchecker (if provided) if new page load fails
- remove experimental repair functionality for now
- add healthcheck

* deps: bump puppeteer-core to 17.1.2
- bump to 0.9.0-beta.0

--------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2023-03-08 18:31:19 -08:00
Sara Tavares
5b1f224dcb
fix typos (#232) 2023-02-24 11:09:40 -08:00
Ilya Kreymer
5ee05985b1
Use VNC for headful profile creation (#197)
* profiles: use vnc for automatic profile creation (fixes #194):
- add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode
- use @novnc/novnc to serve vnc JS library
- add novnc_lite.html to serve the content from an iframe
- optimization: don't show initial blank page / don't wait for initial page in puppeteer

* more vnc work:
- set position of browser at 0,0, avoid needing offset to fit
- add /vncpass endpoint to query vnc password (for use with browsertrix-cloud)
- remove websockify, x11vnc now supports ws connections directly!
- vnc_lite: support reconnecting ws if gracefully disconnected

* x11vnc cleanup: just pass password via cmdline to simplify setup

* make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified
README updates:
- mention new VNC-based streaming
- mention new --automated flag, move automated info below interactive

* README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently
2023-01-09 23:56:53 -08:00
Tessa Walsh
f35d495103
Add screenshot functionality (#188)
* Add screenshot and thumbnail functionality

Introduces a --screenshot CLI option, which takes a comma-separated
list of screenshot types: view,fullPage,thumbnail.

In addition, this commit:

- Adds '--experimental-global-webcrypto' to ensure webcrypto is
available in node
- Deprecates newContext, instead always using page context for 1 worker
and window context for >1 worker

* Separate screenshotTypes into exported const

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
2022-12-21 09:06:13 -08:00
Tim
5b738bd24e
Fix incorrect combineWARCs property in README.md (#180)
This stumped me for a little while. The actual property isn't plural.
2022-11-14 22:17:44 -08:00
Ilya Kreymer
be3b6b85fa README: update default behaviors in README, fixes #169 2022-10-11 15:33:32 -07:00
Ed Summers
3ba64535a5
Run in Docker as User (#171)
* Run in Docker as User

This follows a similar pattern to pywb to run as the user that owns the
crawls directory.

bump version to 0.7.0-beta.6

Closes #170
2022-09-28 12:49:52 -07:00
raffaele messuti
a527cc9b36
Update README.md (#147)
fix link to puppeteer waitUntil
2022-08-11 18:28:54 -07:00
Ilya Kreymer
93b6dad7b9
Health Check + Size Limits + Profile fixes (#138)
- Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check

- Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded.

- Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded.

- Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted.

- S3 Storage refactor, simplify, don't add additional paths by default.

- Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value.

- wacz save: reenable wacz validation after save.

- Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs.

- bump to 0.6.0-beta.1
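
The health check rule from the first item above (200 while consecutive failed page loads stay below 2 * number of workers, 503 otherwise) can be written as a one-line check. This is a minimal Python sketch of the rule only, not the server code; `health_status` is a hypothetical name.

```python
def health_status(consecutive_failures, num_workers):
    """Return the HTTP status the health check endpoint would answer with."""
    # Healthy while consecutive failed page loads stay below 2 * workers.
    return 200 if consecutive_failures < 2 * num_workers else 503
```

With 4 workers, 7 consecutive failures still report 200 and the 8th flips the response to 503, which is the signal a k8s probe would act on.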
2022-05-18 22:51:55 -07:00
Ilya Kreymer
12d96f22c6
Profile download support (#126)
* profiles: support loading profiles via a URL.

* add 'request' dependency

* README: mention profile URLs
2022-03-14 14:44:24 -07:00
Ilya Kreymer
81e8fa6da7
Incremental save state (#124)
* save state: if --saveState set to always, incrementally save state every --saveStateInterval seconds, and keep last --saveStateHistory number of save states
in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states
display save state status on startup
page write fixes: add missing await
fix for #113

* update README
2022-03-14 10:41:56 -07:00
phiresky
fb297574c7
add documentation of env variables for socks proxy + browser extensions (#120) 2022-03-13 15:00:46 -07:00
Chris Millson
7f1ea89456
Fix typo in regex yaml example (#121)
crawl-this|crawl-that didn't have () around it in the yaml example
2022-03-11 13:54:13 -08:00
Ilya Kreymer
7588f8d572 README: update README for #116, mention 'scopeType: domain' and http/https scope inclusion 2022-03-06 14:51:16 -08:00
Ilya Kreymer
201eab4ad1
Support Extra Hops beyond current scope with --extraHops option (#98)
* extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83

* update README with info on `extraHops`, add tests for extraHops

* dependency fix: use pywb 2.6.3, warcio 1.5.0

* bump to 0.5.0-beta.2
2022-01-15 09:03:09 -08:00
Ilya Kreymer
9f541ab011
Support for uploading to S3 (#95)
- support uploading WACZ to s3-compatible storage (via minio client)
- config storage loaded from env vars, enabled when WACZ output is used.
- support pinging either an http or a redis key-based webhook
- webhook: include 'completed' bool to indicate if fully completed crawl or partial (eg. interrupted via signal)
- consolidate redis init to redis.js
- support upload filename with custom variables: can interpolate current timestamp (@ts), hostname (@hostname) and user provided id (@crawlId)
- README: add docs for s3 storage, remove unused args
- update to pywb 2.6.2, browsertrix-behaviors 0.2.4

* fix to `limit` option, ensure limit check uses shared state

* bump version to 0.5.0-beta.1
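
The upload filename variables listed above (@ts, @hostname, @crawlId) can be illustrated with a minimal Python sketch. It is not the code from this commit: `interpolate_filename` is a hypothetical name, and the timestamp format and example values are assumptions.

```python
import re


def interpolate_filename(template, values):
    """Replace @ts, @hostname and @crawlId in a single pass over the template."""
    # A single regex pass avoids re-substituting inside inserted values.
    return re.sub(
        r"@(ts|hostname|crawlId)",
        lambda match: values[match.group(1)],
        template,
    )


# Hypothetical values; the timestamp format is an assumption.
values = {
    "ts": "20211123205330",
    "hostname": "crawler-0",
    "crawlId": "my-crawl",
}
filename = interpolate_filename("@crawlId-@hostname-@ts.wacz", values)
```

Text without a recognized variable is left untouched, so a fixed filename passes through unchanged.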
2021-11-23 12:53:30 -08:00
Ilya Kreymer
39ddecd35e
State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78)
* save state work:
- support interrupting and saving crawl
- support loading crawl state (frontier queue, pending, done) from YAML
- support scope check when loading to apply new scoping rules when restarting crawl
- failed urls added to done as failed, can be retried if crawl is stopped and restarted
- save state to crawls/crawl-<ts>-<id>.yaml when interrupted
- --saveState option controls when crawl state is saved, default to partial/when interrupted, also always, never.
- support in-memory or redis based crawl state, using fork of puppeteer-cluster
- --redisStore used to enable redis-based state



* signals/crawl interruption:
- crawl state set to drain/not provide any more urls to crawl
- graceful stop of crawl in response to sigint/sigterm
- initial sigint/sigterm waits for graceful end of current pages, second terminates immediately
- initial sigabrt followed by sigterm terminates immediately
- puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT

* redis state support:
- use lua scripts for atomic move from queue -> pending, and pending -> done
- pending key expiry set to page timeout
- add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
- drainMax returns the numPending() + numSeen() to work with cluster stats

* arg improvements:
- add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
- support setting cmdline args via env var CRAWL_ARGS
- use 'choices' in args when possible

* build update:
- switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds
- use setuptools<58.0

* misc crawl/scoping rule fixes:
- scoping rules fix when external is used with scopeType
state:
- limit: ensure no urls, including initial seeds, are added past the limit
- signals: fix immediate shutdown on second signal
- tests: add scope test for default scope + excludes

*  py-wacz update
- add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
- pywb: use latest pywb branch for improved twitter video capture

* update to latest browsertrix-behaviors

* fix setuptools dependency #88

* update README for 0.5.0 beta
2021-09-28 09:41:16 -07:00
Ilya Kreymer
2956be2026
README: make profile paths in README consistent, fixes #84 2021-08-29 14:20:36 -07:00
Ilya Kreymer
c5494be653
Page Resource Block Rules Avoid Duplicate Handlers + Ignore top-level pages + README update (0.4.4) (#81)
* blockrules improvements:
- add await to continue/abort to catch errors, each called only in one place.
- avoid adding multiple interception handlers for same page to avoid 'request already handled' errors
- disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning

* setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set.

* scopeType rename:
- rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage
- rename 'none' -> page to indicate default single-page-only crawl
- messaging: adjust error message displaying valid scopeTypes

* README: Add additional examples for scope rules, update scopeType param, explain difference between scope rules vs block rules, to better address confusion as per #80

bump to 0.4.4
2021-08-17 20:54:18 -07:00
Rebecca Sutton Koeser
4033c52693
Revise docker syntax for screencast examples (#77)
Specify port binding option as a parameter of `docker run` instead of within the `crawl` command
2021-08-05 13:06:14 -07:00
Ilya Kreymer
d27e67e92e README: fix invalid dashes, addresses #76 2021-07-28 15:43:36 -07:00
Ilya Kreymer
be1ee53c3e
BlockRules Fixes (0.4.3) (#75)
- blockrules fix: when checking an iframe nav request, match inFrameUrl against the parent iframe, not current one
- blockrules: cleanup, always allow 'pywb.proxy' static files
- logging: when 'debug' logging enabled, log urls blocked and conditional iframe checks from blockrules
- tests: add more complex test for blockrules
- update CHANGES and support info in README
- bump to 0.4.3
2021-07-27 09:41:21 -07:00
Ilya Kreymer
36ac3cb905
Update README.md with new features from 0.4.1 release! 2021-07-22 17:55:42 -07:00
Ilya Kreymer
bd44190ab2
Build simplification: Use :latest Version By default + README update (#71)
* docker-compose: just use ':latest' tag for local builds, allow users working with local docker-compose.yml to just build latest image
- ci: add 'latest' tag to release ci build to automatically update latest as well
- README: remove '[VERSION]', just refer to latest version of image in all examples
- README: mention using specific released tag version for production
2021-07-22 17:46:10 -07:00
Ilya Kreymer
f4c6b6a99f
0.4.1 Release! (#70)
* optimization: don't intercept requests if no blockRules set

* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages

* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)

* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile

* screencasting: convert newContext to window instead of page by default, instead of just warning about it

* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2

* seeds: add trim() to seed URLs

* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically

* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles

* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25

* update CHANGES and README with new features

* bump version to 0.4.1
2021-07-22 14:24:51 -07:00
Ilya Kreymer
d40cf6cc2b
Interactive Profiles + bug fixes (#69)
* support for interactive profile creation mode via --interactive flag
* screencasting error catching, ensure errors in screencasting do not interrupt crawl
* better error reporting for invalid seed URLs, fixes #67
* README: update to mention interactive profile creation, additional 
* dependencies: update to pywb 2.6.0b4, py-wacz 0.3.1, browsertrix-behaviors 0.2.3
2021-07-20 15:45:51 -07:00
Ilya Kreymer
6dbdff9656 Support for per-URL conditional Block Rules (#68)
- Support for block rules specified in YAML config to exclude URLs based on regex, and also negate a rule by specifying `allowOnly` to allow URLs based on certain regex.
- Support for conditional blocking for iframes, based on content of iframe text, specified via frameTextMatch regex.
- Support for restricting block rules based on containing frame URL, specified via inFrameURL param.
- Testing for various blockRules configurations
- Fixes Support URL-level WARC-writing inclusion/exclusion lists #15
- optional message to add when a URL is blocked, specified via 'blockMessage'
- update README for blockRules
- bump to pywb dependency 2.5.0b4
2021-07-19 15:50:32 -07:00
Emma Dickson
838e1fa1bd
Documentation Update (#58)
* README: update documentation to be more clear about how to use the seed file option

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2021-07-08 16:06:10 -07:00
Emma Dickson
c02855627c
Add fields to warcinfo in combinedwarc (#60)
* add support for adding custom warcinfo fields via the 'warcinfo' block in yaml config or via --warcinfo.<field> command-line options
* tests: add tests for warcinfo custom and standard fields ('software' and 'format') being added to warcinfo
* fix warcio.js version being added incorrectly
* switch to warc/1.0 for warcinfo field to match generated warcs from pywb, which use warc/1.0 (for now)

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2021-07-07 15:56:52 -07:00
Ilya Kreymer
473de8c49f
Scope Handling Improvements + Tests (#66)
* scope fixes:
- remove default prefix scopeType, ensure scope include and exclude take precedence
- add new 'custom' scopeType, when include or exclude are used
- use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude)
- ensure per-seed scope include/exclude used when present, and scopeType set to 'custom'
- ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified
- rename --type to --scopeType in seed to maintain consistency
- add sitemap param as alias for useSitemap

tests: 
- add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides
- fix screencaster to use relative paths to work with tests
- ci: use yarn instead of npm

* update README with new flags

* bump version to 0.4.0-beta.3
2021-07-06 20:22:27 -07:00
Ilya Kreymer
ef7d5e50d8
Per-Seed Scoping Rules + Crawl Depth (#63)
* scoped seeds:
- support per-seed scoping (include + exclude), allowHash, depth, and sitemap options
- support maxDepth per seed #16
- combine --url, --seed and --urlFile/--seedFile urls into a unified seed list

arg parsing:
- simplify seed file options into --seedFile/--urlFile, move option in help display
- rename --maxDepth -> --depth, supported globally and per seed
- ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation)
- update to latest js-yaml
- rename --yamlConfig -> --config
- config: support reading config from stdin if --config set to 'stdin'

* scope: fix typo in 'prefix' scope

* update browsertrix-behaviors to 0.2.2

* tests: add test for passing config via stdin, also adding --excludes via cmdline

* update README:
- latest cli, add docs on config via stdin
- rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position
- info on scoped seeds
- list current scope types
2021-06-26 13:11:29 -07:00
Ilya Kreymer
f57818f2f6
New Docker Image, Customizable Browser Source + Binary (#62)
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)

* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)

* github action ci: use system unzip

* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.

* Update README with info on customizing build image

* bump version to 0.4.0-beta.2
2021-06-24 15:39:17 -07:00
Ilya Kreymer
3ebe511b32 Arg Parsing Refactor + Support for YAML Config Support (take 2!) (#59)
* Create an argument parser class

* move constants, arg parser to separate files in utils/*

* ensure yaml config overridden by command-line args

* yaml loading work:
- simplify yaml config by using yargs.config option
- move all option parsing to argParser, simply expose parseArgs
- export constants directly
- add lint to util/* files

* support inline 'seeds' in cmdline and yaml config

tests:
- add test for crawl config, ensuring seeds crawled + wacz created
- add test to ensure cmdline overrides yaml config

* scope fix: empty scope implies only fixed list, use '.*' for any scope

* lint fix

* update readme with yaml config info

* allow 'url' and 'seeds' if both provided
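
The precedence rule above (command-line args override the YAML config) can be sketched in Python as a simple merge. This is an illustration only, not the yargs-based code from this commit; `merge_config` is a hypothetical name, the option names in the example are made up, and treating `None` as "not given on the command line" is an assumption.

```python
def merge_config(yaml_config, cli_args):
    """Merge two option dicts, letting command-line values win."""
    merged = dict(yaml_config)
    for key, value in cli_args.items():
        # Assumption: None means the option was not given on the command line.
        if value is not None:
            merged[key] = value
    return merged


# Hypothetical option names, used only for this example.
yaml_config = {"workers": 2, "collection": "from-yaml"}
cli_args = {"workers": 4, "collection": None}
merged = merge_config(yaml_config, cli_args)
```

Here `workers` comes from the command line while `collection` falls back to the YAML value.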

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: emmadickson <emma.dickson@artsymail.com>
2021-06-23 19:45:40 -07:00
Ilya Kreymer
ae4ce979fb
Screencast Support for Debugging (fixes #43) (#52)
* screencast support (fixes #43):

- add NewWindowPage concurrency mode to support opening new window, and also reusing pages

- add --screencastPort cli options to enable screencasting, uses websockets to stream frames to client

- concurrency: add separate 'window' concurrency for opening new window per-page in same session, useful for screencasting with multiple workers but within same session

* add warning if using screencasting + more than one worker + page context, recommend 'window'

* cleanup: remove debug console, bump py-wacz dependency, improve close message

* README: add screencasting info to README
2021-06-07 17:43:36 -07:00
Emma Dickson
63376ab6ac
Add --urlFile param to specify text file with a list of URLs to crawl (#38)
* Resolves #12

* Make --url param optional. Only one of --url or --urlFile should be specified.

* Add ignoreScope option to queueUrls() to support adding specific URLs

* add tests for urlFile

* bump version to 0.3.2

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-05-12 22:57:06 -07:00