Commit graph

218 commits

Author SHA1 Message Date
Ilya Kreymer
a59ec05a85
update behaviors to 0.4.1, rename 'Behavior line' -> 'Behavior log' (#223) 2023-02-04 16:02:43 -08:00
Ilya Kreymer
b513246b03
deps: bump pywb to 2.7.3, update CHANGES to current version (#222)
* deps: bump pywb to 2.7.3
bump to 0.8.0 for release

* update CHANGES
2023-02-03 17:56:30 -08:00
Ilya Kreymer
10e61d4c85
Bump to Chrome 109, Beta 0.8.0-beta.1 Release (#215)
- bump to chrome-109 image
- bump uwsgi to fix intermittent build errors
-remove installs moved to base image
bump to 0.8.0-beta.1
2023-01-30 19:00:33 -08:00
Ilya Kreymer
5ee05985b1
Use VNC for headful profile creation (#197)
* profiles: use vnc for automatic profile creation (fixes #194):
- add x11vnc and serve via vnc when not headless, keep existing screencast for headless mode
- use @novnc/novnc to serve vnc JS library
- add novnc_lite.html to serve the content from an iframe
- optimization: don't show initial blank page / don't wait for initial page in puppeteer

* more vnc work:
- set position of browser at 0,0, avoid needing offset to fit
- add /vncpass endpoint to query vnc password (for use with browsertrix-cloud)
- remove websockify, x11vnc now supports ws connections directly!
- vnc_lite: support reconnecting ws if gracefully disconnected

* x11vnc cleanup: just pass password via cmdline to simplify setup

* make interactive profile creation default, automated enabled only if --automated or --username / --password flags are specified
README updates:
- mention new VNC-based streaming
- mention new --automated flag, move automated info below interactive

* README: adjust auto-login example to use mastodon example instead of twitter, which works more consistently
2023-01-09 23:56:53 -08:00
Ilya Kreymer
b268c02823 package: fix license string in package.json 2022-11-21 09:20:15 -08:00
Ilya Kreymer
2a1e0edf3c version: set version correctly to 0.8.0-beta.0 2022-11-15 18:30:27 -08:00
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
2022-11-15 18:30:27 -08:00
Ilya Kreymer
ffa3174578
Fix for warcio.js (#178)
* dependency fix: set warcio to 1.5.1 until we update to esm support
bump test timeout
fixes #175
bump to 0.7.1
2022-10-24 08:20:01 +02:00
Ilya Kreymer
1213694dde bump to 0.7.0 for release! 2022-10-11 16:14:53 -07:00
Ed Summers
3ba64535a5
Run in Docker as User (#171)
* Run in Docker as User

This follows a similar pattern to pywb to run as the user that owns the
crawls directory.

bump version to 0.7.0-beta.6

Closes #170
2022-09-28 12:49:52 -07:00
Ilya Kreymer
fd1737962b dependencies: update to browsertrix-behaviors 0.3.4, fixes autofetch loading of lazy load images (fixes #165)
bump to 0.7.0-beta.5
2022-09-15 23:13:31 -07:00
Ilya Kreymer
314ee3f730
Default Wait-Time Improvements (#162)
- netIdleWait better defaults: if not set, set to 15 seconds for page/page-spa scope, otherwise to 2 seconds
- default behaviors: include autoscroll in default behavior as well
- restart: if crawl already done, don't attempt to crawl further. if 'waitOnDone' set, wait for signal before exiting.
- bump to puppeteer-core 17.1.2
- bump to 0.7.0-beta.4
2022-09-08 23:39:26 -07:00
Ilya Kreymer
a52ee5ed1f dependencies: update to pywb>=2.6.8, browsertrix-behaviors>=0.3.3 2022-09-02 17:45:16 -07:00
Ilya Kreymer
6cc38bf511
Page-reuse concurrency + Browser Repair + Screencaster Cleanup Improvements (#157)
* new window: use cdp instead of window.open

* new window tweaks: add reuseCount, use browser.target() instead of opening a new blank page

* rename NewWindowPage -> ReuseWindowConcurrency, move to windowconcur.js
potential fix for #156

* browser repair:
- when using window-concurrency, attempt to repair / relaunch browser if cdp errors occur
- mark pages as failed and don't reuse if page error or cdp errors occur
- screencaster: clear previous targets if screencasting when repairing browser

* bump version to 0.7.0-beta.3
2022-08-19 09:23:40 -07:00
Ilya Kreymer
c5d208024a
Wait Default + Logging Improvements (#153)
improved logging of pywb + redis:
- if 'logging' includes 'pywb', log pywb and redis output, to pywb.log and redis.log
- otherwise, just ignore (don't print to stdout as that's too confusing)
- print if wb-manager fails, likely due to existing collection

waitUntil: default to just 'load' to avoid potential infinite loop, separate --netIdle can configure idle wait
dependency: update to latest puppeteer-core (16.1.0)
2022-08-11 18:44:39 -07:00
Ilya Kreymer
e3b8b5ba21
Add --netIdleWait, bump dependencies (0.7.0-beta.2) (#145)
- add --netIdleWait option, default to 10 seconds - necessary for some sites that start fetching immediately after page load
- add openssl.conf to allow pywb to avoid 'unsafe legacy renegotiation disabled' from openssl
- update to browsertrix-behaviors 0.3.2
- update current url for screencasting of page before page load starts
bump to 0.7.0-beta.2
2022-07-08 17:17:46 -07:00
Ilya Kreymer
bd10f1ad8c bump to 0.7.0-beta.1 2022-07-03 11:11:11 -07:00
Ilya Kreymer
0a309af740
Update to Chrome/Chromium 101 - (0.7.0 Beta 0) (#144)
* update base image 
- switch to browsertrix-base-image:101 with chrome/chromium 101,
- includes additional fonts and ubuntu 22.04 as base.
- add --disable-site-isolation-trials as default flag to support behaviors accessing iframes

* debugging support for shared redis state:
- support pausing crawler indefinitely if crawl state is set to 'debug'
- must be set/unset manually via external redis
- designed for browsertrix-cloud for now

bump to 0.7.0-beta.0
2022-06-30 19:24:26 -07:00
Ilya Kreymer
cf90304fa7
0.6.0 Wait State + Screencasting Fixes (#141)
* new options:
- to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished 
- when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let job controller know crawl is finished.
- set redis state to `failing` in case of exception, set to `failed` in case of >3 or more failed exits within 60 seconds (todo: make customizable)
- when receiving a SIGUSR1, assume final shutdown and finalize files (eg. save WACZ) before exiting.
- also write WACZ if exiting due to size limit exceed, but not do to other interruptions
- change sleep() to be in seconds

* misc fixes:
- crawlstate.finished() -> isFinished() - return if >0 pages and none left in queue
- don't fail crawl if isFinished() is true
- don't keep looping in pending wait for urls to finish if received abort request

* screencast improvements (fix related to webrecorder/browsertrix-cloud#233)
- more optimized screencasting, don't close and restart after every page.
- don't assume targets change after every page, they don't in window mode!
- only send 'close' message when target is actually closed

* bump to 0.6.0
2022-06-17 11:58:44 -07:00
Ilya Kreymer
93b6dad7b9
Health Check + Size Limits + Profile fixes (#138)
- Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check

- Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded.

- Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exists (and state optionally saved) when total running time is exceeded.

- Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted.

- S3 Storage refactor, simplify, don't add additional paths by default.

- Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value.

- wacz save: reenable wacz validation after save.

- Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs.

- bump to 0.6.0-beta.1
2022-05-18 22:51:55 -07:00
Ilya Kreymer
500ed1f9a1
Profile Creation Improvements (#136)
* interactive profile api improvements:
- refactor profile creation into separate class
- if profile starts with '@', load as relative path using current s3 storage
- support uploading profiles to s3
- profile api: support filename passed to /createProfieJS as part of json POST
- profile api: support /ping to keep profile browser running, --shutdownWait to add autoshutdown timeout (extendable via ping)
- profile api: add /target to retrieve target and /navigate to navigate by url.

* bump to 0.6.0-beta.0
2022-05-05 14:27:17 -05:00
Ilya Kreymer
5dfbfbeaf6
update dependencies: (#134)
- update pywb to 2.6.7, fix possible error cdx indexing ever via --generateCDX
- update wacz to 0.4.6, ensure wacz file is closed and better and more error-resilient text extraction
- update browsertrix-behaviors to 0.3.0, support for telegram behavior
- bump version to 0.5.1
2022-04-15 16:22:47 -07:00
Ilya Kreymer
cc391146c4 package: set minio version to fixed (7.0.26) 2022-04-09 22:07:17 -07:00
Ilya Kreymer
bfd72835d1 update CHANGES for 0.5.0 release 2022-04-09 21:59:44 -07:00
Ilya Kreymer
5afd19f43d
Non-HTML Page Load Optimization (#130)
* non-html page load improvements: fix for #129
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
- don't do text extraction for non-HTML pages (will need to handle pdf separately)
bump to 0.5.0-beta.8
2022-03-22 17:41:51 -07:00
Ilya Kreymer
5e5efda437
Profile Creation Fix + Cloudflare Wait Support + UserAgent Fix (#128)
* cloudlfare wait improvements (#110 fix)
- set navigator.webdriver to false to help with cloudflare wait
- add checkCF() that will detect cloudflare ddos page and wait 5 seconds until original page is loaded

* chrome args refactor:
- move to utils/browser
- add LazyFrameLoading disable to fix occasional issues with page.goto() never finishing
- add userAgent option

* profile creation improvements:
- fix loadProfile() missing await
- fix url to support running remotely
- load shared chromeArgs()
- add --proxy to support profile creation through pywb proxy

* fix setting custom userAgent (#90)
- fix typo that resulted in error
- ensure userAgent is applied separate from emulatedDevice
- add getDefaultUA() browser util
2022-03-18 10:32:59 -07:00
Ilya Kreymer
12d96f22c6
Profile download support (#126)
* profiles: support loading profiles via a URL.

* add 'request' dependency

* README: mention profile URLs
2022-03-14 14:44:24 -07:00
Ilya Kreymer
1fae21b0cf
Better check to see if ERR_ABORTED should be ignored. (#127)
* error abort check: Fix possible regression with req.failure() returning null, also move to separate function., wrap in exception handler
* bump version to 0.5.0-beta.6
2022-03-14 14:41:39 -07:00
Ilya Kreymer
81e8fa6da7
Incremental save state (#124)
* save state: if --saveState set to always, incrementally save state every --saveStateInterval seconds, and keep last --saveStateHistory number of save states
in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states
display save state status on startup
page write fixes: add missing await
fix for #113

* update README
2022-03-14 10:41:56 -07:00
Ilya Kreymer
affa45a7d4 dependency: update py-wacz dependency to 0.4.3 (to include webrecorder/py-wacz#16 fix)
bump to 0.5.0-beta.4
2022-03-07 08:46:12 -08:00
Ilya Kreymer
761ce7067b
behaviors update (#105)
* update to browsertrix-behaviors 0.2.5 to support improved autoscroll
- add evaluateWithCLI() to support evaluate() with 'getEventListeners()' and other devtools command-line api functions, to allow autoscroll behavior to check if it should exit out early
- inject behaviors into interactive loader to allow testing
- fix signal handler if state not inited yet
- dependencies: update puppeteer-cluster to latest, update pywb to 2.6.5
2022-02-20 22:22:19 -08:00
Ilya Kreymer
c2ce9fc001
various state + wacz fixes: (#101)
- wacz: update to py-wacz 0.4.1, avoid reading full file into memory to compute hashes

state: fix pending state, account for puppeteer-cluster popping/pushing jobs from queue:
* puppeteer-cluster: add custom 'start()' callback to indicate task actually starting
* new semantics: add pending urls in pending state immediately, remove if readded to queue, add 'started'  when actaully started

minio: use fPutObject to support parallel uploading, compute hash and size separately (for now)
dependencies: update to latest minio

error checking:
* print number of WARCs found, exit with error if 0
* ensure wacz creation succeeds, exit with error code if not
* validate wacz after creation, exit with error code if validation fails

bump to 0.5.0-beta.3
2022-02-08 15:31:55 -08:00
Ilya Kreymer
201eab4ad1
Support Extra Hops beyond current scope with --extraHops option (#98)
* extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83

* update README with info on `extraHops`, add tests for extraHops

* dependency fix: use pywb 2.6.3, warcio 1.5.0

* bump to 0.5.0-beta.2
2022-01-15 09:03:09 -08:00
Ilya Kreymer
9f541ab011
Support for uploading to S3 (#95)
- support uploading WACZ to s3-compatible storage (via minio client)
- config storage loaded from env vars, enabled when WACZ output is used.
- support pinging either or an http or a redis key-based webhook,
- webhook: include 'completed' bool to indicate if fully completed crawl or partial (eg. interrupted via signal)
- consolidate redis init to redis.js
- support upload filename with custom variables: can interpolate current timestamp (@ts), hostname (@hostname) and user provided id (@crawlId)
- README: add docs for s3 storage, remove unused args
- update to pywb 2.6.2, browsertrix-behaviors 0.2.4

* fix to `limit` option, ensure limit check uses shared state

* bump version to 0.5.0-beta.1
2021-11-23 12:53:30 -08:00
Ilya Kreymer
39ddecd35e
State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78)
* save state work:
- support interrupting and saving crawl
- support loading crawl state (frontier queue, pending, done) from YAML
- support scope check when loading to apply new scoping rules when restarting crawl
- failed urls added to done as failed, can be retried if crawl is stopped and restarted
- save state to crawls/crawl-<ts>-<id>.yaml when interrupted
- --saveState option controls when crawl state is saved, default to partial/when interrupted, also always, never.
- support in-memory or redis based crawl state, using fork of puppeteer-cluster
- --redisStore used to enable redis-based state



* signals/crawl interruption:
- crawl state set to drain/not provide any more urls to crawl
- graceful stop of crawl in response to sigint/sigterm
- initial sigint/sigterm waits for graceful end of current pages, second terminates immediately
- initial sigabrt followed by sigterm terminates immediately
- puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT

* redis state support:
- use lua scripts for atomic move from queue -> pending, and pending -> done
- pending key expiry set to page timeout
- add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
- drainMax returns the numPending() + numSeen() to work with cluster stats

* arg improvements:
- add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
- support setting cmdline args via env var CRAWL_ARGS
- use 'choices' in args when possible

* build update:
- switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds
- use setuptools<58.0

* misc crawl/scoping rule fixes:
- scoping rules fix when external is used with scopeType
state:
- limit: ensure no urls, including initial seeds, are added past the limit
- signals: fix immediate shutdown on second signal
- tests: add scope test for default scope + excludes

*  py-wacz update
- add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
- pywb: use latest pywb branch for improved twitter video capture

* update to latest browsertrix-behaviors

* fix setuptools dependency #88

* update README for 0.5.0 beta
2021-09-28 09:41:16 -07:00
Ilya Kreymer
c5494be653
Page Resource Block Rules Avoid Duplicate Handlers + Ignore top-level pages + README update (0.4.4) (#81)
* blockrules improvements:
- add await to continue/abort to catch errors, each called only in one place.
- avoid adding multiple interception handlers for same page to avoid 'request already handled' errors
- disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning

* setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set.

* scopeType rename:
- rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage
- rename 'none' -> page to indicate default single-page-only crawl
- messaging: adjust error message displaying valid scopeTypes

* README: Add additional examples for scope rules, update scopeType param, explain different between scope rules vs block rules, to better address confusion as per #80

bump to 0.4.4
2021-08-17 20:54:18 -07:00
Ilya Kreymer
be1ee53c3e
BlockRules Fixes (0.4.3) (#75)
- blockrules fix: when checking an iframe nav request, match inFrameUrl against the parent iframe, not current one
- blockrules: cleanup, always allow 'pywb.proxy' static files
- logging: when 'debug' logging enabled, log urls blocked and conditional iframe checks from blockrules
- tests: add more complex test for blockrules
- update CHANGES and support info in README
- bump to 0.4.3
2021-07-27 09:41:21 -07:00
Ilya Kreymer
0e0b85d7c3
Customizable extract selectors + typo fix (0.4.2) (#72)
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure

* custom link extraction improvements (improvements for #25) 
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector

* tests
- tests: all tests uses 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction

* add to CHANGES, bump to 0.4.2
2021-07-23 18:31:43 -07:00
Ilya Kreymer
f4c6b6a99f
0.4.1 Release! (#70)
* optimization: don't intercept requests if no blockRules set

* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages

* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)

* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile

* screencasting: convert newContext to window instead of page by default, instead of just warning about it

* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2

* seeds: add trim() to seed URLs

* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically

* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles

* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25

* update CHANGES and README with new features

* bump version to 0.4.1
2021-07-22 14:24:51 -07:00
Ilya Kreymer
6a65ea7a58 update CHANGES.md for 0.4.0
bump version to 0.4.0
remove extraneous logging
2021-07-20 23:06:15 -07:00
Ilya Kreymer
d40cf6cc2b
Interactive Profiles + bug fixes (#69)
* support for interactive profile creation mode via --interactive file
* screencasting error catching, ensure errors in screencasting do not interrupt crawl
* better error reporting for invalid seed URLs, fixes #67
* README: update to mention interactive profile creation, additional 
* dependencies: update to pywb 2.6.0b4, py-wacz 0.3.1, browsertrix-behaviors 0.2.3
2021-07-20 15:45:51 -07:00
Ilya Kreymer
473de8c49f
Scope Handling Improvements + Tests (#66)
* scope fixes:
- remove default prefix scopeType, ensure scope include and exclude take precedence
- add new 'custom' scopeType, when include or exclude are used
- use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude)
- ensure per-seed scope include/exclude used when present, and scopeType set to 'custom'
- ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified
- rename --type to --scopeType in seed to maintain consistency
- add sitemap param as alias for useSitemap

tests: 
- add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides
- fix screencaster to use relative paths to work with tests
- ci: use yarn instead of npm

* update README with new flags

* bump version to 0.4.0-beta.3
2021-07-06 20:22:27 -07:00
Ilya Kreymer
ef7d5e50d8
Per-Seed Scoping Rules + Crawl Depth (#63)
* scoped seeds:
- support per-seed scoping (include + exclude), allowHash, depth, and sitemap options
- support maxDepth per seed #16
- combine --url, --seed and --urlFile/--seedFile urls into a unified seed list

arg parsing:
- simplify seed file options into --seedFile/--urlFile, move option in help display
- rename --maxDepth -> --depth, supported globally and per seed
- ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation)
- update to latest js-yaml
- rename --yamlConfig -> --config
- config: support reading config from stdin if --config set to 'stdin'

* scope: fix typo in 'prefix' scope

* update browsertrix-behaviors to 0.2.2

* tests: add test for passing config via stdin, also adding --excludes via cmdline

* update README:
- latest cli, add docs on config via stdin
- rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position
- info on scoped seeds
- list current scope types
2021-06-26 13:11:29 -07:00
Ilya Kreymer
f57818f2f6
New Docker Image, Customizable Browser Source + Binary (#62)
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)

* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)

* github action ci: use system unzip

* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.

* Update README with info on customizing build image

* bump version to 0.4.0-beta.2
2021-06-24 15:39:17 -07:00
Ilya Kreymer
3ebe511b32 Arg Parsing Refactor + Support for YAML Config Support (take 2!) (#59)
* Create an argument parser class

* move constants, arg parser to separate files in utils/*

* ensure yaml config overriden by command-line args

* yaml loading work:
- simplify yaml config by using yargs.config option
- move all option parsing to argParser, simply expose parseArgs
- export constants directly
- add lint to util/* files

* support inline 'seeds' in cmdline and yaml config

tests:
- add test for crawl config, ensuring seeds crawled + wacz created
- add test to ensure cmdline overrides yaml config

* scope fix: empty scope implies only fixed list, use '.*' for any scope

* lint fix

* update readme with yaml config info

* allow 'url' and 'seeds' if both provided

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: emmadickson <emma.dickson@artsymail.com>
2021-06-23 19:45:40 -07:00
Ilya Kreymer
ae4ce979fb
Screencast Support for Debugging (fixes #43) (#52)
* screencast support (fixes #43):

- add NewWindowPage concurrency mode to support opening new window, and also reusing pages

- add --screencastPort cli options to enable screencasting, uses websockets to stream frames to client

- concurrency: add separate 'window' concurrency for opening new window per-page in same session, useful for screencasting with multiple workers but within same session

* add warning if using screencasting + more than one worker + page context, recommend 'window'

* cleanup: remove debug console, bump py-wacz dependency, improve close message

* README: add screencasting info to README
2021-06-07 17:43:36 -07:00
Ilya Kreymer
e7d3767efb
Add scopeType options + option to crawl hashtags + simplify defaultDriver.js (#51)
* support hashtag for page-scoped crawls:
- allow hashtags for current page, automatically set scope to current w/ different hashtags
- also allow hashtags for URLs specified via urlFile
- driver: simplify driver, move default driver function to loadPage()
- bump version to 0.4.0-beta.0

* add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well

* graceful shutdown: ensure redis and pywb processes shutdown on exit (for use with singularity, outside of docker)

* replace spaMode with more generic --scopeType, a shortcut to setting the scope via regex.
scopeType options include:
prefix - scope is prefix of current page (default)
page - scope is current page + hashtags (spa mode)
domain - scope is domain/origin of current page
any - scope is any url (default for urlFile)

- bump version to 0.4.0-beta.1
2021-05-21 15:37:02 -07:00
Emma Dickson
63376ab6ac
Add --urlFile param to specify text file with a list of URLs to crawl (#38)
* Resolves #12

* Make --url param optional. Only one of --url of --urlFile should be specified.

* Add ignoreScope option queueUrls() to support adding specific URLs

* add tests for urlFile

* bump version to 0.3.2

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-05-12 22:57:06 -07:00
Ilya Kreymer
2db7bc98b1 bump version to 0.3.1 for release 2021-05-04 13:38:56 -07:00
Ilya Kreymer
7bc8efff3d add CHANGES.md, list changes for 0.3.1
update to browsertrix-behaviors 0.2.1
2021-05-04 12:10:12 -07:00