Commit graph

43 commits

Author SHA1 Message Date
Ilya Kreymer
be1ee53c3e
BlockRules Fixes (0.4.3) (#75)
- blockrules fix: when checking an iframe nav request, match inFrameUrl against the parent iframe, not current one
- blockrules: cleanup, always allow 'pywb.proxy' static files
- logging: when 'debug' logging enabled, log urls blocked and conditional iframe checks from blockrules
- tests: add more complex test for blockrules
- update CHANGES and support info in README
- bump to 0.4.3
2021-07-27 09:41:21 -07:00
Ilya Kreymer
0e0b85d7c3
Customizable extract selectors + typo fix (0.4.2) (#72)
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure

* custom link extraction improvements (improvements for #25) 
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector

* tests
- tests: all tests uses 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction

* add to CHANGES, bump to 0.4.2
2021-07-23 18:31:43 -07:00
Ilya Kreymer
f4c6b6a99f
0.4.1 Release! (#70)
* optimization: don't intercept requests if no blockRules set

* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages

* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)

* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile

* screencasting: convert newContext to window instead of page by default, instead of just warning about it

* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2

* seeds: add trim() to seed URLs

* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically

* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles

* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25

* update CHANGES and README with new features

* bump version to 0.4.1
2021-07-22 14:24:51 -07:00
Ilya Kreymer
d40cf6cc2b
Interactive Profiles + bug fixes (#69)
* support for interactive profile creation mode via --interactive file
* screencasting error catching, ensure errors in screencasting do not interrupt crawl
* better error reporting for invalid seed URLs, fixes #67
* README: update to mention interactive profile creation, additional 
* dependencies: update to pywb 2.6.0b4, py-wacz 0.3.1, browsertrix-behaviors 0.2.3
2021-07-20 15:45:51 -07:00
Ilya Kreymer
6dbdff9656 Support for per-URL conditional Block Rules (#68)
- Support for block rules specified in YAML config to exclude URLs based on regex, and also negate a rule by specifying `allowOnly` to allow URLs based on certain regex.
- Support for conditional blocking for iframes, based on content of iframe text, specified via frameTextMatch regex.
- Support for restricting block rules based on containing frame URL, specified via inFrameURL param.
- Testing for various blockRules configurations
- Fixes Support URL-level WARC-writing inclusion/exclusion lists #15
- optional message to add when a URL is blocked, specified via 'blockMessage'
- update README for blockRules
- bump to pywb dependency 2.5.0b4
2021-07-19 15:50:32 -07:00
Emma Dickson
c02855627c
Add fields to warcinfo in combinedwarc (#60)
* add support for adding custom warcinfo fields via the 'warcinfo' block in yaml config or via --warcinfo.<field> command-line options
* tests: add tests for warcinfo custom and standard fields ('software' and 'format') being added to warcinfo
* fix warcio.js version being added incorrectly
* switch to warc/1.0 for warcinfo field to match generated warcs from pywb, which use warc/1.0 (for now)

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2021-07-07 15:56:52 -07:00
Ilya Kreymer
ef7d5e50d8
Per-Seed Scoping Rules + Crawl Depth (#63)
* scoped seeds:
- support per-seed scoping (include + exclude), allowHash, depth, and sitemap options
- support maxDepth per seed #16
- combine --url, --seed and --urlFile/--seedFile urls into a unified seed list

arg parsing:
- simplify seed file options into --seedFile/--urlFile, move option in help display
- rename --maxDepth -> --depth, supported globally and per seed
- ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation)
- update to latest js-yaml
- rename --yamlConfig -> --config
- config: support reading config from stdin if --config set to 'stdin'

* scope: fix typo in 'prefix' scope

* update browsertrix-behaviors to 0.2.2

* tests: add test for passing config via stdin, also adding --excludes via cmdline

* update README:
- latest cli, add docs on config via stdin
- rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position
- info on scoped seeds
- list current scope types
2021-06-26 13:11:29 -07:00
Ilya Kreymer
f57818f2f6
New Docker Image, Customizable Browser Source + Binary (#62)
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)

* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)

* github action ci: use system unzip

* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.

* Update README with info on customizing build image

* bump version to 0.4.0-beta.2
2021-06-24 15:39:17 -07:00
Ilya Kreymer
3ebe511b32 Arg Parsing Refactor + Support for YAML Config Support (take 2!) (#59)
* Create an argument parser class

* move constants, arg parser to separate files in utils/*

* ensure yaml config overriden by command-line args

* yaml loading work:
- simplify yaml config by using yargs.config option
- move all option parsing to argParser, simply expose parseArgs
- export constants directly
- add lint to util/* files

* support inline 'seeds' in cmdline and yaml config

tests:
- add test for crawl config, ensuring seeds crawled + wacz created
- add test to ensure cmdline overrides yaml config

* scope fix: empty scope implies only fixed list, use '.*' for any scope

* lint fix

* update readme with yaml config info

* allow 'url' and 'seeds' if both provided

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: emmadickson <emma.dickson@artsymail.com>
2021-06-23 19:45:40 -07:00
Ilya Kreymer
ae4ce979fb
Screencast Support for Debugging (fixes #43) (#52)
* screencast support (fixes #43):

- add NewWindowPage concurrency mode to support opening new window, and also reusing pages

- add --screencastPort cli options to enable screencasting, uses websockets to stream frames to client

- concurrency: add separate 'window' concurrency for opening new window per-page in same session, useful for screencasting with multiple workers but within same session

* add warning if using screencasting + more than one worker + page context, recommend 'window'

* cleanup: remove debug console, bump py-wacz dependency, improve close message

* README: add screencasting info to README
2021-06-07 17:43:36 -07:00
Ilya Kreymer
e7d3767efb
Add scopeType options + option to crawl hashtags + simplify defaultDriver.js (#51)
* support hashtag for page-scoped crawls:
- allow hashtags for current page, automatically set scope to current w/ different hashtags
- also allow hashtags for URLs specified via urlFile
- driver: simplify driver, move default driver function to loadPage()
- bump version to 0.4.0-beta.0

* add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well

* graceful shutdown: ensure redis and pywb processes shutdown on exit (for use with singularity, outside of docker)

* replace spaMode with more generic --scopeType, a shortcut to setting the scope via regex.
scopeType options include:
prefix - scope is prefix of current page (default)
page - scope is current page + hashtags (spa mode)
domain - scope is domain/origin of current page
any - scope is any url (default for urlFile)

- bump version to 0.4.0-beta.1
2021-05-21 15:37:02 -07:00
Emma Dickson
63376ab6ac
Add --urlFile param to specify text file with a list of URLs to crawl (#38)
* Resolves #12

* Make --url param optional. Only one of --url of --urlFile should be specified.

* Add ignoreScope option queueUrls() to support adding specific URLs

* add tests for urlFile

* bump version to 0.3.2

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-05-12 22:57:06 -07:00
Emma Dickson
6211315999
update pages detection method (#50)
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-04-30 19:05:04 -07:00
Ilya Kreymer
183f8edf10
Wait for Pending Requests to Finish (#47)
* pending request wait:
- instead of waiting for 5s, check redis key 'pywb:{coll}:pending' to see if any pending requests are still pending
- keep checking key until pending requests are at 0
- requires latest pywb 2.6.0+
- should fix #44

* fix test to no longer look for waiting for 5s message

* lint settings and fixes: allow constant in loops, add lint command to script

* chrome: bump default image to chrome:90 image
2021-04-30 15:31:14 -04:00
Sebastian Nagel
9d577dac57
Extract links from all frames attached to a page, fixes #45 (#48) 2021-04-30 08:41:00 -07:00
Ilya Kreymer
9293375790
combine WARC/async fixes: (#49)
* combine WARC/async fixes:
- use streams for combine WARCs to avoid any issues with sync apis
- use async apis for writing/reading pages as well

* use async stat()

* fix tests, also sets extension to .warc.gz, addresses #41
2021-04-29 14:34:56 -07:00
Ilya Kreymer
eff4c61270 misc typos/fixes for 0.3.0:
- update README with latest params
- ensure capture dir includes seconds
- bump behaviors to 0.1.1
2021-04-13 18:17:44 -07:00
Ilya Kreymer
b59788ea04
Profiles: Support for running with existing profiles + saving profile after a login (#34)
Support for profiles via a mounted .tar.gz and --profile option + improved docs #18

* support creating profiles via 'create-login-profile' command with options for where to save profile, username/pass and debug screenshot output. support entering username and password (hidden) on command-line if omitted.

* use patched pywb for fix

* bump browsertrix-behaviors to 0.1.0

* README: updates to include better getting started, behaviors and profile reference/examples

* bump version to 0.3.0!
2021-04-10 13:08:22 -07:00
Emma Dickson
c9f8fe051c
add collection name validation (#37)
* add collection name validation

* linter fix

* add tests and optimize

* linter fix

* move to validateargs

* properly reference collection

* Update regex and error message

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-04-07 20:24:01 -04:00
Emma Dickson
24e2c4ddf8
Create --combineWARC flag that combines generated warcs into a single warc upto rollover size (#33)
* generates combined WARCs in collection root directory with suffix `_0.warc`, `_1.warc`, etc..
* each combined WARC limited by the size in `--rolloverSize`, if exceeds a new WARC is created, otherwise appended to previous WARC.
* add test for --combineWARC flag
* add improved lint rules

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-03-31 10:41:27 -07:00
Ilya Kreymer
bc7f1badf3
factor out behaviors to browsertrix-behaviors: (#32)
- inject built 'behaviors.js' from browsertrix-behaviors, init with options and run
- remove bgbehaviors
- move textextract to root for now
- add requirements.txt for python dependencies
- remove obsolete --scroll option, to part of the behaviors system

logging:
- configure logging options via --logging param, can include 'stats' (default), 'pywb', 'behaviors', and 'behaviors-debug'
- inject custom logging function for behaviors to call if either behaviors or behaviors-debug is set
- 'behaviors-debug' prints all debug messages from behaviors, while regular 'behaviors' prints main behavior messages (useful for verification)

dockerfile: add 'rebuild' arg to faciliate rebuilding image from specific step

bump to 0.3.0-beta.0
2021-03-13 19:48:31 -05:00
Emma Dickson
9ef3f25416
add logging option (#29)
* add --pywb-log flag cmdline option which enables the pywb logging to stdout/stderr

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2021-03-04 12:36:58 -08:00
Emma Dickson
fb0f1d8db9
tests text extraction (#30)
* new tests

* add jest to eslint, lint fixes

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-03-01 16:00:23 -08:00
Emma Dickson
748b0399e9
add text extraction (#28)
* add text extraction via --text flag

* update readme with --text and --generateWACZ flags

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-02-23 13:52:54 -08:00
Emma Dickson
0688674f6f
case insensitive params (#27)
* make --generateWacz, --generateCdx case insensitive with alias option
* fix eslint config and eslint issues

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2021-02-17 09:37:07 -08:00
Emma Dickson
9ef83e4ab4
update default collection name (#26)
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-02-15 20:06:18 -08:00
Ilya Kreymer
8c85ca2749 background behaviors refactor: (fixes #23)
- move auto-play, auto-fetch and auto-scroll behaviors to behaviors/global/*
- bgbehaviors manages these background behaviors
- command line --bgbehaviors option specifies which background behaviors to run (defaults to auto-fetch and auto-play)
2021-02-08 22:21:34 -08:00
Emma Dickson
7cfeefd19b
add ci and linting (#21)
* linting with eslint
* ci: validate linting and check basic single-page crawl with wacz creation

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-02-08 09:45:46 -08:00
Ilya Kreymer
8af5e1487d
waitUntil improvements: (#22)
- puppeteer 'waitUntil supports an array of options, support via comma separated list
- default to 'waitUntil,load'
- should fix #3
2021-02-04 22:42:03 -08:00
Ilya Kreymer
0a4f716a9c version update:
- parametrize chrome version, set to 88 in Dockerfile and as BROWSER_VERSION env var
- bump to docker image to 0.2.0
2021-02-03 22:24:38 -08:00
Emma Dickson
9c139eba2b
Add wacz support to browsertrix (#6)
* Add WACZ creation support, fixes #2
* --generateWACZ flag adds WACZ file (currently named <collection>/<collection>.wacz)
* page list generated in <collection>/pages/pages.jsonl, entry for each page is appended to end of file, includes url and title of page

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2021-02-03 21:28:32 -08:00
rgaudin
789279021b
Added limit info to statsFilename (#5)
- added new `limit` dict to statsFilename
- `limit` dict composed of:
  - `max`: the limit requested (or `0`)
  - `hit`: boolean whether limit was reached or not
2021-01-29 10:26:55 -08:00
Ilya Kreymer
b7fe292021
work on latest update: (#7)
- fixes for iframes, as described in #4
- bump chrome to 88
- bump pywb to 2.5.0
- bump version to 1.0.5
2021-01-29 00:33:01 -08:00
Ilya Kreymer
62834735d1 stats: support json stats output to specified filename with --statsFilename flag (fixes openzim/zimit#39)
bump version to 0.1.3
2020-12-02 16:27:17 +00:00
Ilya Kreymer
082667099d add support for sitemaps with --useSitemap flag, defaults to /sitemap.xml if no string provided 2020-11-14 21:56:30 +00:00
Ilya Kreymer
92b251f0cb fix typo in setting userAgent 2020-11-14 20:51:07 +00:00
Ilya Kreymer
fe406b5f74 browser config settings:
- add support for --userAgent to override user agent
- add support for --mobileDevice to use puppeteer device emulation presets
- add support for --userAgentSuffix to append to default user agent (including device userAgent)
bump to 0.1.2
2020-11-14 19:32:31 +00:00
Ilya Kreymer
7a13535d78 dockerfile: add symlink to 'google-chrome'
crawler: get version for user-agent via 'google-chrome --product-version'
compose: build versionned image, version 0.1.0
2020-11-05 22:34:10 +00:00
raffaele messuti
5bf64be018
minor fixes (#1)
* Update README.md - fix incomplete docker run pywb

* Update crawler.js - fix generateCDX
2020-11-03 13:33:19 -08:00
Ilya Kreymer
8f740d4e24 support custom crawl directory with --cwd flag, default to /crawls
update README
2020-11-02 15:28:19 +00:00
Ilya Kreymer
a875aa90d3 Dockerfile: switch to cmd 'crawl', instead of entrypoint to support running 'pywb' also
update README with docker-compose and docker run examples, update commandline example
default output to './crawls' subdirectory
2020-11-01 21:35:00 -08:00
Ilya Kreymer
91b8994a08 refactor crawler and default driver:
- add extensible defaultDriver, wrap crawling functionality in Crawler class
- support headless/non-headless, custom driver
- support custom collection name for pywb, generate-cdx option
- autoplay: add slightly delay for splash loading
2020-11-01 19:53:47 -08:00
Ilya Kreymer
ded83b52b3 initial commit after split from zimit 2020-10-31 13:16:37 -07:00