Commit graph

96 commits

Author SHA1 Message Date
Ilya Kreymer
d723f95cb9 add missing lock.js
update chrome launch flag
2022-03-17 08:58:13 +00:00
Ilya Kreymer
c0508d44a7 add locking mechanism for deleteOnExit and running post-crawl steps
only run postcrawl steps on last instance, when all locks released
2022-03-16 04:16:33 +00:00
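
The "last instance" rule in c0508d44a7 can be sketched as below. This is an illustrative in-memory model only, not the contents of lock.js; `releaseLock` and the Set of held locks are hypothetical names, and a real multi-instance setup would need shared storage.

```javascript
// Hypothetical model: each crawler instance holds one lock while running.
// Post-crawl steps should run only once every lock has been released.
function releaseLock(heldLocks, instanceId) {
  heldLocks.delete(instanceId);
  // true means this was the last instance, so it runs the post-crawl steps
  return heldLocks.size === 0;
}

const locks = new Set(["crawler-1", "crawler-2"]);
const firstIsLast = releaseLock(locks, "crawler-1");
const secondIsLast = releaseLock(locks, "crawler-2");
```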
Ilya Kreymer
59c1bd5626 interpolate: add support for @hostsuffix 2022-03-16 02:24:18 +00:00
Ilya Kreymer
952f93293f collection name: allow interpolation of ts, crawl id, hostname 2022-03-15 08:49:22 -07:00
Ilya Kreymer
dd381899e1 option to delete crawl dir on exit via --deleteOnExit, if exiting successfully or via forced-quit 2022-03-14 23:51:09 -07:00
Ilya Kreymer
d1e53e6e26 add checkCF() that will detect cloudflare ddos page and wait 5 seconds until original page is loaded 2022-03-14 19:00:51 -07:00
Ilya Kreymer
0fcc89fdb4 set navigator.webdriver to false to help with cloudflare wait
profile: fix loadProfile() missing await
profile: fix url to support running remotely
2022-03-15 01:57:51 +00:00
Ilya Kreymer
12d96f22c6
Profile download support (#126)
* profiles: support loading profiles via a URL.

* add 'request' dependency

* README: mention profile URLs
2022-03-14 14:44:24 -07:00
Ilya Kreymer
1fae21b0cf
Better check to see if ERR_ABORTED should be ignored. (#127)
* error abort check: Fix possible regression with req.failure() returning null, also move to separate function, wrap in exception handler
* bump version to 0.5.0-beta.6
2022-03-14 14:41:39 -07:00
Ilya Kreymer
ab096cd5b0
Improve URL direct check and fetch (#125)
- direct check fix: only do direct check if HEAD returns 200 status code
- if direct load results in non-200 status code, still load in browser
- error reporting: detect if net::ERR_ABORTED is actually caused by loading of PDF / other binary that is downloaded, and not an actual page load error
- state: tweak error logging message
2022-03-14 11:11:53 -07:00
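
The decision flow described in ab096cd5b0 might look roughly like this. It is a sketch under assumptions: `planLoad` and `afterDirectFetch` are hypothetical names, and treating only non-HTML responses as candidates for direct fetch is an assumption, not something the commit states.

```javascript
// Decide how to load a URL based on the HEAD response (hypothetical helper).
function planLoad(headStatus, contentType) {
  const isHtml = (contentType || "").toLowerCase().startsWith("text/html");
  // Only attempt a direct fetch when HEAD returned 200 (and, by assumption,
  // the resource is not an HTML page).
  return headStatus === 200 && !isHtml ? "direct" : "browser";
}

// After a direct fetch, a non-200 status still falls back to the browser.
function afterDirectFetch(directStatus) {
  return directStatus === 200 ? "done" : "browser";
}
```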
Ilya Kreymer
81e8fa6da7
Incremental save state (#124)
* save state: if --saveState set to always, incrementally save state every --saveStateInterval seconds, and keep last --saveStateHistory number of save states
in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states
display save state status on startup
page write fixes: add missing await
fix for #113

* update README
2022-03-14 10:41:56 -07:00
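
Keeping only the last --saveStateHistory save states (81e8fa6da7) amounts to pruning the oldest entries. A minimal sketch, assuming the saved file names are already ordered oldest first; `pruneSaveStates` and the file names are hypothetical:

```javascript
// Split a list of save-state files into those to delete and those to keep.
// Assumes savedFiles is ordered oldest first.
function pruneSaveStates(savedFiles, historySize) {
  const excess = Math.max(0, savedFiles.length - historySize);
  return {
    removed: savedFiles.slice(0, excess),
    kept: savedFiles.slice(excess),
  };
}

const files = ["s1.yaml", "s2.yaml", "s3.yaml", "s4.yaml", "s5.yaml", "s6.yaml", "s7.yaml"];
const pruned = pruneSaveStates(files, 5);
```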
phiresky
fb297574c7
add documentation of env variables for socks proxy + browser extensions (#120) 2022-03-13 15:00:46 -07:00
Simon Wiles
d7c24c44f6
Set a UTF-8 locale in Dockerfile (#122) 2022-03-13 12:47:37 -07:00
Chris Millson
7f1ea89456
Fix typo in regex yaml example (#121)
crawl-this|crawl-that didn't have () around it in the yaml example
2022-03-11 13:54:13 -08:00
Ilya Kreymer
affa45a7d4 dependency: update py-wacz dependency to 0.4.3 (to include webrecorder/py-wacz#16 fix)
bump to 0.5.0-beta.4
2022-03-07 08:46:12 -08:00
Ilya Kreymer
7588f8d572 README: update README for #116, mention 'scopeType: domain' and http/https scope inclusion 2022-03-06 14:51:16 -08:00
Ilya Kreymer
0c32d0f223
add 'scopeType: domain' to include all subdomains + http/https include (#117)
- add 'scopeType: domain' to include all subdomains of a given seed url, eg. given `https://example.com/path` as starting seed, will consider `https://*.example.com/` to be in scope.
- include both http/https in all the default scopes except single page (page-spa, prefix, host, domain), eg. given https://example.com/, will also include http://example.com/
- fixes #116
2022-03-06 14:46:14 -08:00
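
One way to express the 'scopeType: domain' rule from 0c32d0f223 as a regex is sketched below. `domainScopeRegex` and `escapeRegExp` are hypothetical helpers, and the exact pattern the crawler builds may differ.

```javascript
// Escape regex metacharacters in a literal string.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a scope regex that accepts the seed host and any subdomain,
// over either http or https.
function domainScopeRegex(seedUrl) {
  const { hostname } = new URL(seedUrl);
  return new RegExp("^https?://([^/]+\\.)?" + escapeRegExp(hostname) + "[/:]");
}

const scope = domainScopeRegex("https://example.com/path");
```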
Ilya Kreymer
e160382f4d
Screencast + Redis state tweaks (#109)
* redis save state: load queued and done urls in chunks in case lists are large

* screencast: add 'init' message to include number of workers and dimensions
2022-03-02 13:26:11 -08:00
Ilya Kreymer
805b6466bc screencast tweaks:
- set default dimension to 640x480
- don't send frames for about:blank
- ensure url updated in cache
- rename screencast html to screencast.html
2022-02-23 14:39:33 -08:00
Ilya Kreymer
ef53b1acea
Screencast Refactor (#108)
- Move connection data to separate transport class, in addition to current, direct connection via WS, also support sending screencast data via redis pubsub
- Implement WSTransport and RedisPubSubTransport for screencasting
- Redis screencasting enabled when --redisStoreUrl is set and --screencastRedis is set.
- Redis screencasting uses pubsub channels:
* a ctrl channel is used to start/stop screencasting
* a data channel is used to send screencast messages

Simplify screencasting messages:
{"msg": "screencast", "id": "<page id>", "url": "<page url>", "data": "<png base64 data>"} - for new and incremental screencast frames for page id
{"msg": "close", "id": "<page id>"} - to indicate page id has closed.
Rename html dir from screencast -> html
2022-02-23 12:09:48 -08:00
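
The two simplified message shapes from ef53b1acea can be produced and consumed as in this sketch. The message fields come from the commit; the helper names and the Map of current frames are illustrative assumptions.

```javascript
// Build the two message types described in the commit.
function screencastMessage(id, url, pngBase64) {
  return JSON.stringify({ msg: "screencast", id, url, data: pngBase64 });
}

function closeMessage(id) {
  return JSON.stringify({ msg: "close", id });
}

// Hypothetical receiver: track the latest frame per page id.
function handleMessage(raw, frames) {
  const m = JSON.parse(raw);
  if (m.msg === "screencast") {
    frames.set(m.id, { url: m.url, data: m.data });
  } else if (m.msg === "close") {
    frames.delete(m.id);
  }
  return frames;
}

const frames = new Map();
handleMessage(screencastMessage("page-1", "https://example.com/", "AAAA"), frames);
const sizeAfterFrame = frames.size;
handleMessage(closeMessage("page-1"), frames);
const sizeAfterClose = frames.size;
```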
Ilya Kreymer
761ce7067b
behaviors update (#105)
* update to browsertrix-behaviors 0.2.5 to support improved autoscroll
- add evaluateWithCLI() to support evaluate() with 'getEventListeners()' and other devtools command-line api functions, to allow autoscroll behavior to check if it should exit out early
- inject behaviors into interactive loader to allow testing
- fix signal handler if state not inited yet
- dependencies: update puppeteer-cluster to latest, update pywb to 2.6.5
2022-02-20 22:22:19 -08:00
Ilya Kreymer
a54ca6e51d scopes:
- fix scopeType prefix set + exclude not reverting to custom
- only mark include + scopeType as overlapping
2022-02-13 14:34:25 -08:00
Ilya Kreymer
56be08e2e0 state improvements:
- local: use map for pending state
- redis: use hmap for pending state
- redis: support requeuing if only pending urls are left, add expiring keys per pending page for pageTimeout
2022-02-09 22:53:15 -08:00
Ilya Kreymer
c2ce9fc001
various state + wacz fixes: (#101)
- wacz: update to py-wacz 0.4.1, avoid reading full file into memory to compute hashes

state: fix pending state, account for puppeteer-cluster popping/pushing jobs from queue:
* puppeteer-cluster: add custom 'start()' callback to indicate task actually starting
* new semantics: add pending urls in pending state immediately, remove if readded to queue, add 'started' when actually started

minio: use fPutObject to support parallel uploading, compute hash and size separately (for now)
dependencies: update to latest minio

error checking:
* print number of WARCs found, exit with error if 0
* ensure wacz creation succeeds, exit with error code if not
* validate wacz after creation, exit with error code if validation fails

bump to 0.5.0-beta.3
2022-02-08 15:31:55 -08:00
Ilya Kreymer
66ce6688eb
Add WACZ Signing Support (#99)
* initial support for wacz signing (using a custom version of py-wacz)
- signing url and token set via env vars WACZ_SIGN_TOKEN and WACZ_SIGN_URL
- add CHANGELIST for 0.5.0
- bump pywb to 2.6.4
2022-01-26 16:06:10 -08:00
Ilya Kreymer
e12463446a lint style fix 2022-01-26 12:56:35 -08:00
CreativeCactus
eb1dd8e8cf
browser option: custom flags via CHROME_FLAGS env option (#96) 2022-01-26 12:22:52 -08:00
Ilya Kreymer
201eab4ad1
Support Extra Hops beyond current scope with --extraHops option (#98)
* extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83

* update README with info on `extraHops`, add tests for extraHops

* dependency fix: use pywb 2.6.3, warcio 1.5.0

* bump to 0.5.0-beta.2
2022-01-15 09:03:09 -08:00
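
The 'extra hops' idea from 201eab4ad1 can be modelled as a per-URL hop counter. This is a sketch under assumptions: `nextHopCount` is a hypothetical helper, and the rule that in-scope links inherit the parent's count is an assumption about the semantics.

```javascript
// Return the extra-hop count for a discovered link, or null if it should
// not be queued. Out-of-scope links cost one extra hop each.
function nextHopCount(linkInScope, parentExtraHops, maxExtraHops) {
  if (linkInScope) {
    return parentExtraHops;
  }
  const hops = parentExtraHops + 1;
  return hops <= maxExtraHops ? hops : null;
}
```

With maxExtraHops set to 1, an out-of-scope link found on an in-scope page is queued, but links found on that out-of-scope page are only followed if they are back in scope.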
Ilya Kreymer
9f541ab011
Support for uploading to S3 (#95)
- support uploading WACZ to s3-compatible storage (via minio client)
- config storage loaded from env vars, enabled when WACZ output is used.
- support pinging either an http or a redis key-based webhook
- webhook: include 'completed' bool to indicate if fully completed crawl or partial (eg. interrupted via signal)
- consolidate redis init to redis.js
- support upload filename with custom variables: can interpolate current timestamp (@ts), hostname (@hostname) and user provided id (@crawlId)
- README: add docs for s3 storage, remove unused args
- update to pywb 2.6.2, browsertrix-behaviors 0.2.4

* fix to `limit` option, ensure limit check uses shared state

* bump version to 0.5.0-beta.1
2021-11-23 12:53:30 -08:00
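
Filename interpolation with @ts, @hostname and @crawlId (9f541ab011) can be sketched as plain string substitution. `interpolateFilename` is a hypothetical name, and the timestamp format shown is an assumption.

```javascript
// Replace the @ts, @hostname and @crawlId placeholders in a template.
// Function replacers avoid special handling of "$" in the values.
function interpolateFilename(template, { ts, hostname, crawlId }) {
  return template
    .replace(/@ts/g, () => ts)
    .replace(/@hostname/g, () => hostname)
    .replace(/@crawlId/g, () => crawlId);
}

const uploadName = interpolateFilename("crawl-@ts-@hostname-@crawlId.wacz", {
  ts: "20211123125330",
  hostname: "crawler-0",
  crawlId: "my-crawl",
});
```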
Ilya Kreymer
f5d0328ac0 don't set skipDuplicateUrls at puppeteer-cluster level, as already handling via crawl state. potential fix for issue in #91 where
crawl appears to not finish
2021-10-27 20:49:37 -07:00
Ilya Kreymer
39ddecd35e
State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78)
* save state work:
- support interrupting and saving crawl
- support loading crawl state (frontier queue, pending, done) from YAML
- support scope check when loading to apply new scoping rules when restarting crawl
- failed urls added to done as failed, can be retried if crawl is stopped and restarted
- save state to crawls/crawl-<ts>-<id>.yaml when interrupted
- --saveState option controls when crawl state is saved, default to partial/when interrupted, also always, never.
- support in-memory or redis based crawl state, using fork of puppeteer-cluster
- --redisStore used to enable redis-based state



* signals/crawl interruption:
- crawl state set to drain/not provide any more urls to crawl
- graceful stop of crawl in response to sigint/sigterm
- initial sigint/sigterm waits for graceful end of current pages, second terminates immediately
- initial sigabrt followed by sigterm terminates immediately
- puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT

* redis state support:
- use lua scripts for atomic move from queue -> pending, and pending -> done
- pending key expiry set to page timeout
- add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
- drainMax returns the numPending() + numSeen() to work with cluster stats

* arg improvements:
- add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
- support setting cmdline args via env var CRAWL_ARGS
- use 'choices' in args when possible

* build update:
- switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds
- use setuptools<58.0

* misc crawl/scoping rule fixes:
- scoping rules fix when external is used with scopeType
state:
- limit: ensure no urls, including initial seeds, are added past the limit
- signals: fix immediate shutdown on second signal
- tests: add scope test for default scope + excludes

*  py-wacz update
- add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
- pywb: use latest pywb branch for improved twitter video capture

* update to latest browsertrix-behaviors

* fix setuptools dependency #88

* update README for 0.5.0 beta
2021-09-28 09:41:16 -07:00
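
The two-stage interruption behaviour in 39ddecd35e (first signal drains gracefully, second terminates immediately) reduces to a small state machine. A sketch only: `createInterruptHandler` is a hypothetical name, and wiring it to real signals is left as a comment.

```javascript
// Returns a handler that reports what to do on each successive signal.
function createInterruptHandler() {
  let interrupted = false;
  return function onSignal() {
    if (!interrupted) {
      interrupted = true;
      return "drain"; // finish current pages, hand out no more urls
    }
    return "exit-now"; // second signal: terminate immediately
  };
}

// In a real process this could be attached with process.on("SIGINT", ...)
// and process.on("SIGTERM", ...).
const onSignal = createInterruptHandler();
const firstAction = onSignal();
const secondAction = onSignal();
```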
Ilya Kreymer
2956be2026
README: make profile paths in README consistent, fixes #84 2021-08-29 14:20:36 -07:00
Ilya Kreymer
8c8cf232de update CHANGES for 0.4.4! 2021-08-17 21:24:56 -07:00
Ilya Kreymer
c5494be653
Page Resource Block Rules Avoid Duplicate Handlers + Ignore top-level pages + README update (0.4.4) (#81)
* blockrules improvements:
- add await to continue/abort to catch errors, each called only in one place.
- avoid adding multiple interception handlers for same page to avoid 'request already handled' errors
- disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning

* setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set.

* scopeType rename:
- rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage
- rename 'none' -> page to indicate default single-page-only crawl
- messaging: adjust error message displaying valid scopeTypes

* README: Add additional examples for scope rules, update scopeType param, explain difference between scope rules vs block rules, to better address confusion as per #80

bump to 0.4.4
2021-08-17 20:54:18 -07:00
Rebecca Sutton Koeser
4033c52693
Revise docker syntax for screencast examples (#77)
Specify port binding option as a parameter of `docker run` instead of within the `crawl` command
2021-08-05 13:06:14 -07:00
Ilya Kreymer
d27e67e92e README: fix invalid dashes, addresses #76 2021-07-28 15:43:36 -07:00
Ilya Kreymer
be1ee53c3e
BlockRules Fixes (0.4.3) (#75)
- blockrules fix: when checking an iframe nav request, match inFrameUrl against the parent iframe, not current one
- blockrules: cleanup, always allow 'pywb.proxy' static files
- logging: when 'debug' logging enabled, log urls blocked and conditional iframe checks from blockrules
- tests: add more complex test for blockrules
- update CHANGES and support info in README
- bump to 0.4.3
2021-07-27 09:41:21 -07:00
Ilya Kreymer
f0c5ca1035 ci release: fix typo in release yaml config for latest tag 2021-07-23 20:00:40 -07:00
Ilya Kreymer
0e0b85d7c3
Customizable extract selectors + typo fix (0.4.2) (#72)
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure

* custom link extraction improvements (improvements for #25) 
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector

* tests
- tests: all tests use 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction

* add to CHANGES, bump to 0.4.2
2021-07-23 18:31:43 -07:00
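
The {selector, extract, isAttribute} options from 0e0b85d7c3 suggest extraction logic along these lines. This sketch works on an already-selected list of elements (in a browser, something like document.querySelectorAll(selector)); the function body is an assumption, not the crawler's actual extractLinks().

```javascript
// Read a link value from each element, either as a property or via
// getAttribute(), and drop empty results.
function extractLinks(elements, { extract = "href", isAttribute = false } = {}) {
  return elements
    .map((el) => (isAttribute ? el.getAttribute(extract) : el[extract]))
    .filter(Boolean);
}

// Mock elements standing in for DOM nodes.
const mockElements = [
  { href: "https://example.com/a", getAttribute: () => "/a" },
  { href: "", getAttribute: () => null },
];
const asProperty = extractLinks(mockElements);
const asAttribute = extractLinks(mockElements, { extract: "href", isAttribute: true });
```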
Ilya Kreymer
36ac3cb905
Update README.md with new features from 0.4.1 release! 2021-07-22 17:55:42 -07:00
Ilya Kreymer
bd44190ab2
Build simplification: Use :latest Version By default + README update (#71)
* docker-compose: just use ':latest' tag for local builds, allow users working with local docker-compose.yml to just build latest image
- ci: add 'latest' tag to release ci build to automatically update latest as well
- README: remove '[VERSION]', just refer to latest version of image in all examples
- README: mention using specific released tag version for production
2021-07-22 17:46:10 -07:00
Ilya Kreymer
f4c6b6a99f
0.4.1 Release! (#70)
* optimization: don't intercept requests if no blockRules set

* page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages

* add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds)

* refactor profile loadProfile/saveProfile to util/browser.js
- support augmenting existing profile when creating a new profile

* screencasting: convert newContext to window instead of page by default, instead of just warning about it

* shared multiplatform image support:
- determine browser exe from list of options, getBrowserExe() returns current exe
- supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64
- update to multiplatform oldwebtoday/chrome:91 as browser image
- enable multiplatform build with latest build-push-action@v2

* seeds: add trim() to seed URLs

* logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically

* profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles

* extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25

* update CHANGES and README with new features

* bump version to 0.4.1
2021-07-22 14:24:51 -07:00
Ilya Kreymer
6a65ea7a58 update CHANGES.md for 0.4.0
bump version to 0.4.0
remove extraneous logging
2021-07-20 23:06:15 -07:00
Ilya Kreymer
d40cf6cc2b
Interactive Profiles + bug fixes (#69)
* support for interactive profile creation mode via --interactive file
* screencasting error catching, ensure errors in screencasting do not interrupt crawl
* better error reporting for invalid seed URLs, fixes #67
* README: update to mention interactive profile creation, additional 
* dependencies: update to pywb 2.6.0b4, py-wacz 0.3.1, browsertrix-behaviors 0.2.3
2021-07-20 15:45:51 -07:00
Ilya Kreymer
6dbdff9656 Support for per-URL conditional Block Rules (#68)
- Support for block rules specified in YAML config to exclude URLs based on regex, and also negate a rule by specifying `allowOnly` to allow URLs based on certain regex.
- Support for conditional blocking for iframes, based on content of iframe text, specified via frameTextMatch regex.
- Support for restricting block rules based on containing frame URL, specified via inFrameUrl param.
- Testing for various blockRules configurations
- Fixes Support URL-level WARC-writing inclusion/exclusion lists #15
- optional message to add when a URL is blocked, specified via 'blockMessage'
- update README for blockRules
- bump to pywb dependency 2.5.0b4
2021-07-19 15:50:32 -07:00
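
A simplified reading of the URL-level rules in 6dbdff9656 is sketched below. The `url` and `type` field names and the `isBlocked` helper are assumptions made for illustration; the frame-based conditions (frameTextMatch, containing frame URL) are left out.

```javascript
// A rule blocks a URL if its regex matches, or, for 'allowOnly' rules,
// if its regex does NOT match.
function isBlocked(url, rules) {
  for (const rule of rules) {
    const matches = new RegExp(rule.url).test(url);
    const blocked = rule.type === "allowOnly" ? !matches : matches;
    if (blocked) {
      return true;
    }
  }
  return false;
}

const blockAds = [{ url: "ads\\.example\\.com", type: "block" }];
const onlyExample = [{ url: "^https://example\\.com/", type: "allowOnly" }];
```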
Emma Dickson
838e1fa1bd
Documentation Update (#58)
* README: update documentation to be more clear about how to use the seed file option

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2021-07-08 16:06:10 -07:00
Emma Dickson
c02855627c
Add fields to warcinfo in combinedwarc (#60)
* add support for adding custom warcinfo fields via the 'warcinfo' block in yaml config or via --warcinfo.<field> command-line options
* tests: add tests for warcinfo custom and standard fields ('software' and 'format') being added to warcinfo
* fix warcio.js version being added incorrectly
* switch to warc/1.0 for warcinfo field to match generated warcs from pywb, which use warc/1.0 (for now)

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2021-07-07 15:56:52 -07:00
Ilya Kreymer
473de8c49f
Scope Handling Improvements + Tests (#66)
* scope fixes:
- remove default prefix scopeType, ensure scope include and exclude take precedence
- add new 'custom' scopeType, when include or exclude are used
- use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude)
- ensure per-seed scope include/exclude used when present, and scopeType set to 'custom'
- ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified
- rename --type to --scopeType in seed to maintain consistency
- add sitemap param as alias for useSitemap

tests: 
- add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides
- fix screencaster to use relative paths to work with tests
- ci: use yarn instead of npm

* update README with new flags

* bump version to 0.4.0-beta.3
2021-07-06 20:22:27 -07:00
Ilya Kreymer
ef7d5e50d8
Per-Seed Scoping Rules + Crawl Depth (#63)
* scoped seeds:
- support per-seed scoping (include + exclude), allowHash, depth, and sitemap options
- support maxDepth per seed #16
- combine --url, --seed and --urlFile/--seedFile urls into a unified seed list

arg parsing:
- simplify seed file options into --seedFile/--urlFile, move option in help display
- rename --maxDepth -> --depth, supported globally and per seed
- ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation)
- update to latest js-yaml
- rename --yamlConfig -> --config
- config: support reading config from stdin if --config set to 'stdin'

* scope: fix typo in 'prefix' scope

* update browsertrix-behaviors to 0.2.2

* tests: add test for passing config via stdin, also adding --excludes via cmdline

* update README:
- latest cli, add docs on config via stdin
- rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position
- info on scoped seeds
- list current scope types
2021-06-26 13:11:29 -07:00
Ilya Kreymer
f57818f2f6
New Docker Image, Customizable Browser Source + Binary (#62)
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)

* add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...)

* github action ci: use system unzip

* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.

* Update README with info on customizing build image

* bump version to 0.4.0-beta.2
2021-06-24 15:39:17 -07:00