This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
* new options:
- to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished
- when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let job controller know crawl is finished.
- set redis state to `failing` in case of exception, set to `failed` in case of >3 or more failed exits within 60 seconds (todo: make customizable)
- when receiving a SIGUSR1, assume final shutdown and finalize files (eg. save WACZ) before exiting.
- also write WACZ if exiting due to size limit exceed, but not do to other interruptions
- change sleep() to be in seconds
* misc fixes:
- crawlstate.finished() -> isFinished() - return if >0 pages and none left in queue
- don't fail crawl if isFinished() is true
- don't keep looping in pending wait for urls to finish if received abort request
* screencast improvements (fix related to webrecorder/browsertrix-cloud#233)
- more optimized screencasting, don't close and restart after every page.
- don't assume targets change after every page, they don't in window mode!
- only send 'close' message when target is actually closed
* bump to 0.6.0
* docker-compose: just use ':latest' tag for local builds, allow users working with local docker-compose.yml to just build latest image
- ci: add 'latest' tag to release ci build to automatically update latest as well
- README: remove '[VERSION]', just refer to latest version of image in all examples
- README: mention using specific released tag version for production
* optimization: don't intercept requests if no blockRules set
* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages
* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)
* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile
* screencasting: convert newContext to window instead of page by default, instead of just warning about it
* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2
* seeds: add trim() to seed URLs
* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically
* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles
* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes#25
* update CHANGES and README with new features
* bump version to 0.4.1
* scope fixes:
- remove default prefix scopeType, ensure scope include and exclude take precedence
- add new 'custom' scopeType, when include or exclude are used
- use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude)
- ensure per-seed scope include/exclude used when present, and scopeType set to 'custom'
- ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified
- rename --type to --scopeType in seed to maintain consistency
- add sitemap param as alias for useSitemap
tests:
- add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides
- fix screencaster to use relative paths to work with tests
- ci: use yarn instead of npm
* update README with new flags
* bump version to 0.4.0-beta.3
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)
* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)
* github action ci: use system unzip
* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.
* Update README with info on customizing build image
* bump version to 0.4.0-beta.2
* support hashtag for page-scoped crawls:
- allow hashtags for current page, automatically set scope to current w/ different hashtags
- also allow hashtags for URLs specified via urlFile
- driver: simplify driver, move default driver function to loadPage()
- bump version to 0.4.0-beta.0
* add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well
* graceful shutdown: ensure redis and pywb processes shutdown on exit (for use with singularity, outside of docker)
* replace spaMode with more generic --scopeType, a shortcut to setting the scope via regex.
scopeType options include:
prefix - scope is prefix of current page (default)
page - scope is current page + hashtags (spa mode)
domain - scope is domain/origin of current page
any - scope is any url (default for urlFile)
- bump version to 0.4.0-beta.1
* Resolves#12
* Make --url param optional. Only one of --url of --urlFile should be specified.
* Add ignoreScope option queueUrls() to support adding specific URLs
* add tests for urlFile
* bump version to 0.3.2
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Support for profiles via a mounted .tar.gz and --profile option + improved docs #18
* support creating profiles via 'create-login-profile' command with options for where to save profile, username/pass and debug screenshot output. support entering username and password (hidden) on command-line if omitted.
* use patched pywb for fix
* bump browsertrix-behaviors to 0.1.0
* README: updates to include better getting started, behaviors and profile reference/examples
* bump version to 0.3.0!
- inject built 'behaviors.js' from browsertrix-behaviors, init with options and run
- remove bgbehaviors
- move textextract to root for now
- add requirements.txt for python dependencies
- remove obsolete --scroll option, to part of the behaviors system
logging:
- configure logging options via --logging param, can include 'stats' (default), 'pywb', 'behaviors', and 'behaviors-debug'
- inject custom logging function for behaviors to call if either behaviors or behaviors-debug is set
- 'behaviors-debug' prints all debug messages from behaviors, while regular 'behaviors' prints main behavior messages (useful for verification)
dockerfile: add 'rebuild' arg to faciliate rebuilding image from specific step
bump to 0.3.0-beta.0
- add support for --userAgent to override user agent
- add support for --mobileDevice to use puppeteer device emulation presets
- add support for --userAgentSuffix to append to default user agent (including device userAgent)
bump to 0.1.2
- add extensible defaultDriver, wrap crawling functionality in Crawler class
- support headless/non-headless, custom driver
- support custom collection name for pywb, generate-cdx option
- autoplay: add slightly delay for splash loading