Commit graph

505 commits

Author SHA1 Message Date
Ilya Kreymer
f4c6b6a99f
0.4.1 Release! (#70)
* optimization: don't intercept requests if no blockRules set

* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages

* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)

* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile

* screencasting: convert newContext to window instead of page by default, instead of just warning about it

* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2

* seeds: add trim() to seed URLs

* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically

* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles

* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25

* update CHANGES and README with new features

* bump version to 0.4.1
2021-07-22 14:24:51 -07:00
Ilya Kreymer
6a65ea7a58 update CHANGES.md for 0.4.0
bump version to 0.4.0
remove extraneous logging
2021-07-20 23:06:15 -07:00
Ilya Kreymer
d40cf6cc2b
Interactive Profiles + bug fixes (#69)
* support for interactive profile creation mode via --interactive flag
* screencasting error catching, ensure errors in screencasting do not interrupt crawl
* better error reporting for invalid seed URLs, fixes #67
* README: update to mention interactive profile creation, additional 
* dependencies: update to pywb 2.6.0b4, py-wacz 0.3.1, browsertrix-behaviors 0.2.3
2021-07-20 15:45:51 -07:00
Ilya Kreymer
6dbdff9656 Support for per-URL conditional Block Rules (#68)
- Support for block rules specified in YAML config to exclude URLs based on regex, and also negate a rule by specifying `allowOnly` to allow URLs based on certain regex.
- Support for conditional blocking for iframes, based on content of iframe text, specified via frameTextMatch regex.
- Support for restricting block rules based on containing frame URL, specified via inFrameURL param.
- Testing for various blockRules configurations
- Fixes #15 (Support URL-level WARC-writing inclusion/exclusion lists)
- optional message to add when a URL is blocked, specified via 'blockMessage'
- update README for blockRules
- bump to pywb dependency 2.5.0b4
2021-07-19 15:50:32 -07:00
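
The block/allowOnly behavior described in #68 can be illustrated with a minimal sketch. The rule shape used here (`{ url, type }`) and the function name `shouldBlock` are assumptions made for illustration, not the crawler's actual config schema or code; frame-based conditions (frameTextMatch, inFrameURL) are left out.

```javascript
// Hypothetical rule shape: { url: "<regex source>", type: "block" | "allowOnly" }.
// Illustrative sketch only, not the crawler's implementation.
function shouldBlock(url, rules) {
  for (const rule of rules) {
    const matches = new RegExp(rule.url).test(url);
    if (rule.type === "allowOnly") {
      // allowOnly negates the rule: any URL that does NOT match is blocked.
      if (!matches) return true;
    } else if (matches) {
      // A plain block rule blocks any URL that matches the regex.
      return true;
    }
  }
  // No rule applied, so the request goes through.
  return false;
}

const rules = [{ url: "ads\\.example\\.com", type: "block" }];
console.log(shouldBlock("https://ads.example.com/banner.js", rules)); // true
console.log(shouldBlock("https://example.com/index.html", rules)); // false
```

With an empty rule list the function always returns false, which matches the idea (from the 0.4.1 notes) that nothing needs to be intercepted when no blockRules are set.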
Emma Dickson
838e1fa1bd
Documentation Update (#58)
* README: update documentation to be more clear about how to use the seed file option

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2021-07-08 16:06:10 -07:00
Emma Dickson
c02855627c
Add fields to warcinfo in combinedwarc (#60)
* add support for adding custom warcinfo fields via the 'warcinfo' block in yaml config or via --warcinfo.<field> command-line options
* tests: add tests for warcinfo custom and standard fields ('software' and 'format') being added to warcinfo
* fix warcio.js version being added incorrectly
* switch to warc/1.0 for warcinfo field to match generated warcs from pywb, which use warc/1.0 (for now)

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2021-07-07 15:56:52 -07:00
Ilya Kreymer
473de8c49f
Scope Handling Improvements + Tests (#66)
* scope fixes:
- remove default prefix scopeType, ensure scope include and exclude take precedence
- add new 'custom' scopeType, when include or exclude are used
- use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude)
- ensure per-seed scope include/exclude used when present, and scopeType set to 'custom'
- ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified
- rename --type to --scopeType in seed to maintain consistency
- add sitemap param as alias for useSitemap

tests: 
- add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides
- fix screencaster to use relative paths to work with tests
- ci: use yarn instead of npm

* update README with new flags

* bump version to 0.4.0-beta.3
2021-07-06 20:22:27 -07:00
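
The scopeType resolution rules listed in #66 can be summarized in a short sketch. The function name `resolveScopeType` and the shape of its argument are hypothetical; only the precedence (include/exclude first, then an explicit scopeType, then 'prefix') comes from the commit message.

```javascript
// Hypothetical helper: decides the effective scopeType for one seed.
function resolveScopeType({ scopeType, include = [], exclude = [] } = {}) {
  // Include/exclude regexes take precedence and imply the 'custom' type.
  if (include.length > 0 || exclude.length > 0) {
    return "custom";
  }
  // Otherwise use the explicit scopeType, defaulting to 'prefix'.
  return scopeType || "prefix";
}

console.log(resolveScopeType({})); // "prefix"
console.log(resolveScopeType({ scopeType: "domain" })); // "domain"
console.log(resolveScopeType({ scopeType: "domain", exclude: ["/private/"] })); // "custom"
```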
Ilya Kreymer
ef7d5e50d8
Per-Seed Scoping Rules + Crawl Depth (#63)
* scoped seeds:
- support per-seed scoping (include + exclude), allowHash, depth, and sitemap options
- support maxDepth per seed #16
- combine --url, --seed and --urlFile/--seedFile urls into a unified seed list

arg parsing:
- simplify seed file options into --seedFile/--urlFile, move option in help display
- rename --maxDepth -> --depth, supported globally and per seed
- ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation)
- update to latest js-yaml
- rename --yamlConfig -> --config
- config: support reading config from stdin if --config set to 'stdin'

* scope: fix typo in 'prefix' scope

* update browsertrix-behaviors to 0.2.2

* tests: add test for passing config via stdin, also adding --excludes via cmdline

* update README:
- latest cli, add docs on config via stdin
- rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position
- info on scoped seeds
- list current scope types
2021-06-26 13:11:29 -07:00
Ilya Kreymer
f57818f2f6
New Docker Image, Customizable Browser Source + Binary (#62)
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)

* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)

* github action ci: use system unzip

* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.

* Update README with info on customizing build image

* bump version to 0.4.0-beta.2
2021-06-24 15:39:17 -07:00
Ilya Kreymer
3ebe511b32 Arg Parsing Refactor + Support for YAML Config Support (take 2!) (#59)
* Create an argument parser class

* move constants, arg parser to separate files in utils/*

* ensure yaml config overridden by command-line args

* yaml loading work:
- simplify yaml config by using yargs.config option
- move all option parsing to argParser, simply expose parseArgs
- export constants directly
- add lint to util/* files

* support inline 'seeds' in cmdline and yaml config

tests:
- add test for crawl config, ensuring seeds crawled + wacz created
- add test to ensure cmdline overrides yaml config

* scope fix: empty scope implies only fixed list, use '.*' for any scope

* lint fix

* update readme with yaml config info

* allow 'url' and 'seeds' if both provided

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: emmadickson <emma.dickson@artsymail.com>
2021-06-23 19:45:40 -07:00
Ilya Kreymer
ae4ce979fb
Screencast Support for Debugging (fixes #43) (#52)
* screencast support (fixes #43):

- add NewWindowPage concurrency mode to support opening new window, and also reusing pages

- add --screencastPort cli options to enable screencasting, uses websockets to stream frames to client

- concurrency: add separate 'window' concurrency for opening new window per-page in same session, useful for screencasting with multiple workers but within same session

* add warning if using screencasting + more than one worker + page context, recommend 'window'

* cleanup: remove debug console, bump py-wacz dependency, improve close message

* README: add screencasting info to README
2021-06-07 17:43:36 -07:00
Ilya Kreymer
e7d3767efb
Add scopeType options + option to crawl hashtags + simplify defaultDriver.js (#51)
* support hashtag for page-scoped crawls:
- allow hashtags for current page, automatically set scope to current w/ different hashtags
- also allow hashtags for URLs specified via urlFile
- driver: simplify driver, move default driver function to loadPage()
- bump version to 0.4.0-beta.0

* add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well

* graceful shutdown: ensure redis and pywb processes shutdown on exit (for use with singularity, outside of docker)

* replace spaMode with more generic --scopeType, a shortcut to setting the scope via regex.
scopeType options include:
prefix - scope is prefix of current page (default)
page - scope is current page + hashtags (spa mode)
domain - scope is domain/origin of current page
any - scope is any url (default for urlFile)

- bump version to 0.4.0-beta.1
2021-05-21 15:37:02 -07:00
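
Since #51 describes --scopeType as a shortcut for setting the scope via regex, one possible mapping is sketched below. The function names and the exact regexes are assumptions for illustration (in particular, 'prefix' is taken here to mean everything under the directory of the current page); the crawler's real patterns may differ.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical mapping from a scopeType name to a scope regex.
function scopeRegexFor(scopeType, pageUrl) {
  const url = new URL(pageUrl);
  url.hash = ""; // the fragment is handled separately for 'page'
  switch (scopeType) {
    case "prefix": {
      // Assumption: prefix = origin + path up to and including the last "/".
      const dir = url.origin + url.pathname.slice(0, url.pathname.lastIndexOf("/") + 1);
      return new RegExp("^" + escapeRegex(dir));
    }
    case "page":
      // Current page, optionally followed by any hashtag.
      return new RegExp("^" + escapeRegex(url.href) + "(#.*)?$");
    case "domain":
      // Anything on the same origin as the current page.
      return new RegExp("^" + escapeRegex(url.origin) + "/");
    case "any":
      return /.*/;
    default:
      throw new Error("unknown scopeType: " + scopeType);
  }
}

const seed = "https://example.com/docs/index.html";
console.log(scopeRegexFor("prefix", seed).test("https://example.com/docs/page2.html")); // true
console.log(scopeRegexFor("page", seed).test("https://example.com/docs/index.html#intro")); // true
console.log(scopeRegexFor("domain", seed).test("https://other.org/")); // false
```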
Emma Dickson
63376ab6ac
Add --urlFile param to specify text file with a list of URLs to crawl (#38)
* Resolves #12

* Make --url param optional. Only one of --url or --urlFile should be specified.

* Add ignoreScope option to queueUrls() to support adding specific URLs

* add tests for urlFile

* bump version to 0.3.2

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-05-12 22:57:06 -07:00
Ilya Kreymer
2db7bc98b1 bump version to 0.3.1 for release 2021-05-04 13:38:56 -07:00
Ilya Kreymer
51bb54e869 add CHANGES.md for 0.3.1 release! 2021-05-04 13:13:33 -07:00
Ilya Kreymer
7bc8efff3d add CHANGES.md, list changes for 0.3.1
update to browsertrix-behaviors 0.2.1
2021-05-04 12:10:12 -07:00
Emma Dickson
6211315999
update pages detection method (#50)
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-04-30 19:05:04 -07:00
Ilya Kreymer
183f8edf10
Wait for Pending Requests to Finish (#47)
* pending request wait:
- instead of waiting for 5s, check redis key 'pywb:{coll}:pending' to see if any requests are still pending
- keep checking key until pending requests are at 0
- requires latest pywb 2.6.0+
- should fix #44

* fix test to no longer look for waiting for 5s message

* lint settings and fixes: allow constant in loops, add lint command to script

* chrome: bump default image to chrome:90 image
2021-04-30 15:31:14 -04:00
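
The polling approach from #47 (keep checking a counter until it reaches 0 instead of sleeping a fixed 5s) might look roughly like the sketch below. `getPendingCount` is a hypothetical async callback standing in for the redis lookup, and the `intervalMs` and `maxChecks` options are assumptions added so the loop cannot run forever; the commit message itself mentions no upper bound.

```javascript
// Poll a pending-request counter until it drops to zero.
// getPendingCount: hypothetical async function returning the current count
// (in the crawler this would be read from the 'pywb:{coll}:pending' key).
async function waitForPending(getPendingCount, { intervalMs = 500, maxChecks = 120 } = {}) {
  for (let i = 0; i < maxChecks; i++) {
    // Missing or non-numeric values are treated as 0 pending requests.
    const pending = Number(await getPendingCount()) || 0;
    if (pending <= 0) {
      return true; // nothing pending, safe to continue
    }
    // Sleep before checking the counter again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // gave up after maxChecks attempts
}

// Usage with a fake counter that drains over time.
let remaining = 3;
waitForPending(async () => remaining--, { intervalMs: 10 }).then((done) => {
  console.log("all pending requests finished:", done);
});
```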
Sebastian Nagel
9d577dac57
Extract links from all frames attached to a page, fixes #45 (#48) 2021-04-30 08:41:00 -07:00
Ilya Kreymer
9293375790
combine WARC/async fixes: (#49)
* combine WARC/async fixes:
- use streams for combine WARCs to avoid any issues with sync apis
- use async apis for writing/reading pages as well

* use async stat()

* fix tests, also sets extension to .warc.gz, addresses #41
2021-04-29 14:34:56 -07:00
Ilya Kreymer
b1e0654bdd update to browsertrix-behaviors 0.2.0
update to latest pywb@main
create-login-profile: also allow 'email' as alternative to user name
bump to 0.3.1-beta.0
2021-04-28 11:00:43 -07:00
Ilya Kreymer
dba4524246 ci: add push to registry on release action 2021-04-14 15:45:20 -07:00
Ilya Kreymer
eff4c61270 misc typos/fixes for 0.3.0:
- update README with latest params
- ensure capture dir includes seconds
- bump behaviors to 0.1.1
2021-04-13 18:17:44 -07:00
Ilya Kreymer
b59788ea04
Profiles: Support for running with existing profiles + saving profile after a login (#34)
Support for profiles via a mounted .tar.gz and --profile option + improved docs #18

* support creating profiles via 'create-login-profile' command with options for where to save profile, username/pass and debug screenshot output. support entering username and password (hidden) on command-line if omitted.

* use patched pywb for fix

* bump browsertrix-behaviors to 0.1.0

* README: updates to include better getting started, behaviors and profile reference/examples

* bump version to 0.3.0!
2021-04-10 13:08:22 -07:00
Emma Dickson
c9f8fe051c
add collection name validation (#37)
* add collection name validation

* linter fix

* add tests and optimize

* linter fix

* move to validateargs

* properly reference collection

* Update regex and error message

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-04-07 20:24:01 -04:00
Emma Dickson
24e2c4ddf8
Create --combineWARC flag that combines generated warcs into a single warc up to rollover size (#33)
* generates combined WARCs in collection root directory with suffix `_0.warc`, `_1.warc`, etc..
* each combined WARC is limited by the size in `--rolloverSize`; if that size is exceeded, a new WARC is created, otherwise the data is appended to the previous WARC.
* add test for --combineWARC flag
* add improved lint rules

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-03-31 10:41:27 -07:00
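
The rollover behavior described in #33 can be sketched as a planning step that assigns source WARCs to combined outputs. The function name `planCombinedWarcs`, the input shape, and the choice to check the size before appending are assumptions for illustration; the actual crawler works on files and streams rather than an in-memory plan.

```javascript
// Group source WARCs into combined outputs named <collection>_0.warc,
// <collection>_1.warc, ... so that each output stays within rolloverSize.
// warcs: array of { name, size } where size is in bytes (hypothetical shape).
function planCombinedWarcs(collection, warcs, rolloverSize) {
  const plan = [];
  let current = null;
  let currentSize = 0;
  for (const { name, size } of warcs) {
    // Start a new combined WARC if there is none yet or adding this one
    // would exceed the rollover size.
    if (current === null || currentSize + size > rolloverSize) {
      current = { output: `${collection}_${plan.length}.warc`, inputs: [] };
      plan.push(current);
      currentSize = 0;
    }
    // Otherwise (and for the first entry of a new output) append to it.
    current.inputs.push(name);
    currentSize += size;
  }
  return plan;
}

const plan = planCombinedWarcs(
  "mycoll",
  [
    { name: "a.warc.gz", size: 400 },
    { name: "b.warc.gz", size: 500 },
    { name: "c.warc.gz", size: 300 },
  ],
  1000
);
console.log(plan.map((p) => `${p.output}: ${p.inputs.join(", ")}`));
```

Note that a single source WARC larger than the rollover size still gets its own output in this sketch rather than being split.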
Ilya Kreymer
bc7f1badf3
factor out behaviors to browsertrix-behaviors: (#32)
- inject built 'behaviors.js' from browsertrix-behaviors, init with options and run
- remove bgbehaviors
- move textextract to root for now
- add requirements.txt for python dependencies
- remove obsolete --scroll option, now part of the behaviors system

logging:
- configure logging options via --logging param, can include 'stats' (default), 'pywb', 'behaviors', and 'behaviors-debug'
- inject custom logging function for behaviors to call if either behaviors or behaviors-debug is set
- 'behaviors-debug' prints all debug messages from behaviors, while regular 'behaviors' prints main behavior messages (useful for verification)

dockerfile: add 'rebuild' arg to facilitate rebuilding image from specific step

bump to 0.3.0-beta.0
2021-03-13 19:48:31 -05:00
Emma Dickson
9ef3f25416
add logging option (#29)
* add --pywb-log cmdline flag, which enables pywb logging to stdout/stderr

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2021-03-04 12:36:58 -08:00
Emma Dickson
fb0f1d8db9
tests text extraction (#30)
* new tests

* add jest to eslint, lint fixes

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-03-01 16:00:23 -08:00
Emma Dickson
748b0399e9
add text extraction (#28)
* add text extraction via --text flag

* update readme with --text and --generateWACZ flags

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-02-23 13:52:54 -08:00
Emma Dickson
0688674f6f
case insensitive params (#27)
* make --generateWacz, --generateCdx case insensitive with alias option
* fix eslint config and eslint issues

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2021-02-17 09:37:07 -08:00
Ilya Kreymer
4d6dcbc3d6 bump version, remove extraneous console.log 2021-02-16 20:00:33 -08:00
Emma Dickson
9ef83e4ab4
update default collection name (#26)
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-02-15 20:06:18 -08:00
Emma Dickson
73b1dd77d4
Merge pull request #24 from webrecorder/behavior-refactor
background behaviors refactor: (fixes #23)
2021-02-11 11:24:34 -05:00
Ilya Kreymer
8c85ca2749 background behaviors refactor: (fixes #23)
- move auto-play, auto-fetch and auto-scroll behaviors to behaviors/global/*
- bgbehaviors manages these background behaviors
- command line --bgbehaviors option specifies which background behaviors to run (defaults to auto-fetch and auto-play)
2021-02-08 22:21:34 -08:00
Emma Dickson
7cfeefd19b
add ci and linting (#21)
* linting with eslint
* ci: validate linting and check basic single-page crawl with wacz creation

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-02-08 09:45:46 -08:00
Ilya Kreymer
8af5e1487d
waitUntil improvements: (#22)
- puppeteer 'waitUntil' supports an array of options, support via comma-separated list
- default to 'waitUntil,load'
- should fix #3
2021-02-04 22:42:03 -08:00
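
Turning the comma-separated waitUntil value from #22 into the array form that puppeteer accepts could be done along these lines. The helper name `parseWaitUntil` is hypothetical and the validation step is an added assumption; the four event names are the values puppeteer documents for `waitUntil`.

```javascript
// Lifecycle events that puppeteer accepts for the waitUntil option.
const VALID_WAIT_UNTIL = ["load", "domcontentloaded", "networkidle0", "networkidle2"];

// Hypothetical helper: convert "load,networkidle0" into ["load", "networkidle0"].
function parseWaitUntil(value) {
  const events = value
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
  for (const event of events) {
    // Reject anything puppeteer would not understand.
    if (!VALID_WAIT_UNTIL.includes(event)) {
      throw new Error(`unknown waitUntil value: ${event}`);
    }
  }
  return events;
}

console.log(parseWaitUntil("load, networkidle0")); // [ 'load', 'networkidle0' ]
```

The resulting array could then be passed as the `waitUntil` option of a puppeteer navigation call such as `page.goto(url, { waitUntil: events })`.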
Ilya Kreymer
0a4f716a9c version update:
- parametrize chrome version, set to 88 in Dockerfile and as BROWSER_VERSION env var
- bump docker image to 0.2.0
2021-02-03 22:24:38 -08:00
Emma Dickson
9c139eba2b
Add wacz support to browsertrix (#6)
* Add WACZ creation support, fixes #2
* --generateWACZ flag adds WACZ file (currently named <collection>/<collection>.wacz)
* page list generated in <collection>/pages/pages.jsonl, entry for each page is appended to end of file, includes url and title of page

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2021-02-03 21:28:32 -08:00
rgaudin
789279021b
Added limit info to statsFilename (#5)
- added new `limit` dict to statsFilename
- `limit` dict composed of:
  - `max`: the limit requested (or `0`)
  - `hit`: boolean whether limit was reached or not
2021-01-29 10:26:55 -08:00
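
The `limit` dict added to the stats output in #5 has a simple shape that can be sketched as follows. The function name `buildLimitInfo` and the way `hit` is computed from a page count are assumptions; only the `max` and `hit` fields and their meanings come from the commit message.

```javascript
// Build the `limit` entry for the JSON stats file.
// requestedLimit: page limit requested by the user (0 or undefined = no limit).
// pagesCrawled: number of pages crawled so far (hypothetical input).
function buildLimitInfo(requestedLimit, pagesCrawled) {
  const max = requestedLimit > 0 ? requestedLimit : 0;
  return {
    max, // the limit requested (or 0)
    hit: max > 0 && pagesCrawled >= max, // whether the limit was reached
  };
}

console.log(JSON.stringify({ limit: buildLimitInfo(10, 10) })); // {"limit":{"max":10,"hit":true}}
console.log(JSON.stringify({ limit: buildLimitInfo(0, 25) })); // {"limit":{"max":0,"hit":false}}
```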
Ilya Kreymer
b7fe292021
work on latest update: (#7)
- fixes for iframes, as described in #4
- bump chrome to 88
- bump pywb to 2.5.0
- bump version to 1.0.5
2021-01-29 00:33:01 -08:00
Ilya Kreymer
db382707ec autoplay: add youtube-nocookie domain to autoplay, part of work on #4 2021-01-26 10:24:32 -08:00
Ilya Kreymer
5386d897f7 add autofetcher script to be injected by defaultDriver to capture srcsets + URLs
in dynamically added stylesheets
should fix openzim/zimit#63
bump version to 0.1.4
2020-12-13 21:10:50 +00:00
Ilya Kreymer
62834735d1 stats: support json stats output to specified filename with --statsFilename flag (fixes openzim/zimit#39)
bump version to 0.1.3
2020-12-02 16:27:17 +00:00
Ilya Kreymer
082667099d add support for sitemaps with --useSitemap flag, defaults to /sitemap.xml if no string provided 2020-11-14 21:56:30 +00:00
Ilya Kreymer
92b251f0cb fix typo in setting userAgent 2020-11-14 20:51:07 +00:00
Ilya Kreymer
fe406b5f74 browser config settings:
- add support for --userAgent to override user agent
- add support for --mobileDevice to use puppeteer device emulation presets
- add support for --userAgentSuffix to append to default user agent (including device userAgent)
bump to 0.1.2
2020-11-14 19:32:31 +00:00
Ilya Kreymer
bfa1fc1618 Dockerfile: build with chrome deb directly instead of copying binaries from chrome image
bump to 0.1.1
2020-11-05 22:34:33 +00:00
Ilya Kreymer
7a13535d78 dockerfile: add symlink to 'google-chrome'
crawler: get version for user-agent via 'google-chrome --product-version'
compose: build versioned image, version 0.1.0
2020-11-05 22:34:10 +00:00
raffaele messuti
5bf64be018
minor fixes (#1)
* Update README.md - fix incomplete docker run pywb

* Update crawler.js - fix generateCDX
2020-11-03 13:33:19 -08:00