505 commits

Author SHA1 Message Date
Ilya Kreymer
cf90304fa7
0.6.0 Wait State + Screencasting Fixes (#141)
* new options:
- to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished 
- when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let job controller know crawl is finished.
- set redis state to `failing` in case of exception, set to `failed` in case of >3 or more failed exits within 60 seconds (todo: make customizable)
- when receiving a SIGUSR1, assume final shutdown and finalize files (eg. save WACZ) before exiting.
- also write WACZ if exiting due to size limit exceeded, but not due to other interruptions
- change sleep() to be in seconds

* misc fixes:
- crawlstate.finished() -> isFinished() - return if >0 pages and none left in queue
- don't fail crawl if isFinished() is true
- don't keep looping in pending wait for urls to finish if received abort request

* screencast improvements (fix related to webrecorder/browsertrix-cloud#233)
- more optimized screencasting, don't close and restart after every page.
- don't assume targets change after every page, they don't in window mode!
- only send 'close' message when target is actually closed

* bump to 0.6.0
2022-06-17 11:58:44 -07:00
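The `failing` / `failed` distinction written to `<crawl id>:status` can be read as a sliding-window count of failed exits. A minimal sketch, assuming failure times are tracked in milliseconds and that three failed exits inside 60 seconds trip the `failed` state (the message says ">3 or more", so the exact boundary is an assumption). `statusAfterFailure` is a hypothetical helper, not the crawler's own code:

```javascript
// Hypothetical helper: pick the status to report after a failed exit.
// Assumption: "failed" once `maxFailures` exits fall inside the window,
// otherwise the crawl is only "failing".
function statusAfterFailure(failureTimesMs, nowMs, windowMs = 60 * 1000, maxFailures = 3) {
  // Keep only the failures that happened inside the sliding window.
  const recent = failureTimesMs.filter((t) => nowMs - t <= windowMs);
  return recent.length >= maxFailures ? "failed" : "failing";
}

// Three failures within the last minute versus one recent failure.
console.log(statusAfterFailure([1000, 20000, 50000], 55000));
console.log(statusAfterFailure([1000, 90000], 95000));
```

Keeping the window and threshold as parameters matches the "todo: make customizable" note in the message.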
Ilya Kreymer
e7eb6a6620 create profile: fix typo in cookie settings, multiply by seconds in day
uwsgi: set number of workers to be 2x cpus by default
2022-06-01 09:11:11 -07:00
Ilya Kreymer
70ba9241ca limit interrupt fix: after self-interrupting, only look at local pending list (for redis state)
logging: don't log CF check errors, do log when errorCount is reset
2022-05-19 06:25:46 +00:00
Ilya Kreymer
6ec47cdd14
profile creation: when creating a profile, force all cookies to have a duration to avoid expiring session cookies (#139)
- save cookies on page load and also before profile creation
- default cookie duration is 7 days, configurable via --cookieDays option
2022-05-18 23:23:32 -07:00
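Forcing a duration onto session cookies might look roughly like the sketch below. It assumes cookies follow the Chrome DevTools shape, where `expires` is a Unix time in seconds and session cookies carry `session: true` with `expires: -1`. `forceCookieDuration` is a hypothetical name; only the 7-day default comes from the commit message:

```javascript
const SECONDS_IN_DAY = 24 * 60 * 60;

// Hypothetical helper: give every session cookie an explicit expiry so it
// survives being saved into a profile.
function forceCookieDuration(cookies, cookieDays = 7, nowSeconds = Math.floor(Date.now() / 1000)) {
  return cookies.map((cookie) => {
    if (!cookie.session && cookie.expires > 0) {
      return cookie; // already persistent, leave untouched
    }
    // Days must be converted to seconds before adding to the current time.
    return { ...cookie, session: false, expires: nowSeconds + cookieDays * SECONDS_IN_DAY };
  });
}

console.log(forceCookieDuration([{ name: "sid", session: true, expires: -1 }], 7, 1000));
```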
Ilya Kreymer
93b6dad7b9
Health Check + Size Limits + Profile fixes (#138)
- Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check

- Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded.

- Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded.

- Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted.

- S3 Storage refactor, simplify, don't add additional paths by default.

- Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value.

- wacz save: reenable wacz validation after save.

- Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs.

- bump to 0.6.0-beta.1
2022-05-18 22:51:55 -07:00
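The health check rule from this commit (200 while consecutive failed page loads stay below twice the worker count, 503 otherwise) can be sketched as a small request handler. The function names, the `getErrorCount` callback and the port in the usage comment are assumptions for illustration only:

```javascript
// Healthy while consecutive failed page loads stay below 2 * workers.
function healthStatusCode(consecutiveFailedLoads, numWorkers) {
  return consecutiveFailedLoads < 2 * numWorkers ? 200 : 503;
}

// Returns a handler usable with Node's http.createServer().
// `getErrorCount` is an assumed callback supplying the current failure count.
function makeHealthHandler(getErrorCount, numWorkers) {
  return (req, res) => {
    res.writeHead(healthStatusCode(getErrorCount(), numWorkers));
    res.end();
  };
}

// Usage (not run here; the port number is arbitrary):
// require("http").createServer(makeHealthHandler(() => errorCount, 4)).listen(6065);
```

A Kubernetes liveness probe pointed at that port would then restart the pod once the handler starts returning 503.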
Ilya Kreymer
500ed1f9a1
Profile Creation Improvements (#136)
* interactive profile api improvements:
- refactor profile creation into separate class
- if profile starts with '@', load as relative path using current s3 storage
- support uploading profiles to s3
- profile api: support filename passed to /createProfileJS as part of json POST
- profile api: support /ping to keep profile browser running, --shutdownWait to add autoshutdown timeout (extendable via ping)
- profile api: add /target to retrieve target and /navigate to navigate by url.

* bump to 0.6.0-beta.0
2022-05-05 14:27:17 -05:00
Ilya Kreymer
5dfbfbeaf6
update dependencies: (#134)
- update pywb to 2.6.7, fix possible error with cdx indexing via --generateCDX
- update wacz to 0.4.6, ensure wacz file is closed and better and more error-resilient text extraction
- update browsertrix-behaviors to 0.3.0, support for telegram behavior
- bump version to 0.5.1
2022-04-15 16:22:47 -07:00
Ilya Kreymer
9b938304ce dependencies: update to pywb>=2.6.6, wacz>=0.4.5 2022-04-11 15:09:59 -07:00
Ilya Kreymer
cc391146c4 package: set minio version to fixed (7.0.26) 2022-04-09 22:07:17 -07:00
Ilya Kreymer
bfd72835d1 update CHANGES for 0.5.0 release 2022-04-09 21:59:44 -07:00
Ilya Kreymer
7ed5586bdb scopeType improvement: when setting scopeType domain on a URL with "www.", automatically drop the www. for simplicity 2022-03-22 17:43:13 -07:00
Ilya Kreymer
5afd19f43d
Non-HTML Page Load Optimization (#130)
* non-html page load improvements: fix for #129
- don't include cookie check in eliminating direct fetch, may be too speculative
- as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors
- don't do text extraction for non-HTML pages (will need to handle pdf separately)
bump to 0.5.0-beta.8
2022-03-22 17:41:51 -07:00
Ilya Kreymer
09082e8abb dependencies: set wacz>=0.4.4 2022-03-18 10:38:34 -07:00
Ilya Kreymer
8727ca7f8c redis state error handling: catch and log potential errors with reading json state for next url
bump version to 0.5.0-beta.7
2022-03-18 10:34:17 -07:00
Ilya Kreymer
5e5efda437
Profile Creation Fix + Cloudflare Wait Support + UserAgent Fix (#128)
* cloudflare wait improvements (#110 fix)
- set navigator.webdriver to false to help with cloudflare wait
- add checkCF() that will detect cloudflare ddos page and wait 5 seconds until original page is loaded

* chrome args refactor:
- move to utils/browser
- add LazyFrameLoading disable to fix occasional issues with page.goto() never finishing
- add userAgent option

* profile creation improvements:
- fix loadProfile() missing await
- fix url to support running remotely
- load shared chromeArgs()
- add --proxy to support profile creation through pywb proxy

* fix setting custom userAgent (#90)
- fix typo that resulted in error
- ensure userAgent is applied separate from emulatedDevice
- add getDefaultUA() browser util
2022-03-18 10:32:59 -07:00
Ilya Kreymer
dedf1cc0ad typo fix: add await to loadProfile in create-login-profile.js 2022-03-15 02:40:06 +00:00
Ilya Kreymer
12d96f22c6
Profile download support (#126)
* profiles: support loading profiles via a URL.

* add 'request' dependency

* README: mention profile URLs
2022-03-14 14:44:24 -07:00
Ilya Kreymer
1fae21b0cf
Better check to see if ERR_ABORTED should be ignored. (#127)
* error abort check: fix possible regression with req.failure() returning null, move the check to a separate function, and wrap it in an exception handler
* bump version to 0.5.0-beta.6
2022-03-14 14:41:39 -07:00
Ilya Kreymer
ab096cd5b0
Improvements to URL direct check and fetch (#125)
- direct check fix: only do direct check if HEAD returns 200 status code
- if direct load results in non-200 status code, still load in browser
- error reporting: detect if net::ERR_ABORTED is actually caused by loading of PDF / other binary that is downloaded, and not an actual page load error
- state: tweak error logging message
2022-03-14 11:11:53 -07:00
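The direct-check rule above reduces to a small decision: fetch directly only when the HEAD request returned 200, and otherwise fall back to loading in the browser. A minimal sketch, where `shouldDirectFetch` is a hypothetical helper and the content-type test (treating only non-HTML responses as direct-fetch candidates) is an assumption rather than something this commit states:

```javascript
// Hypothetical decision helper for the direct fetch path.
function shouldDirectFetch(headStatus, contentType) {
  if (headStatus !== 200) {
    return false; // non-200: still load the page in the browser
  }
  // Assumption: strip parameters such as "; charset=utf-8" before comparing.
  const type = (contentType || "").split(";")[0].trim().toLowerCase();
  // Assumption: unknown or HTML content goes through the browser.
  return type !== "" && type !== "text/html";
}

console.log(shouldDirectFetch(200, "application/pdf"));
console.log(shouldDirectFetch(404, "application/pdf"));
```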
Ilya Kreymer
81e8fa6da7
Incremental save state (#124)
* save state: if --saveState set to always, incrementally save state every --saveStateInterval seconds, and keep last --saveStateHistory number of save states
in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states
display save state status on startup
page write fixes: add missing await
fix for #113

* update README
2022-03-14 10:41:56 -07:00
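Keeping only the last `--saveStateHistory` save states amounts to a simple rotation after each write. A sketch under the assumption that file names are tracked oldest-first. `rotateSaveStates` and the file names in the example are hypothetical:

```javascript
// Hypothetical helper: after writing a new save state, work out which
// files to keep and which older ones to delete.
function rotateSaveStates(existingFiles, newFile, historySize = 5) {
  const all = [...existingFiles, newFile]; // oldest first, newest last
  const cut = Math.max(0, all.length - historySize);
  return { remove: all.slice(0, cut), keep: all.slice(cut) };
}

console.log(rotateSaveStates(["s1.yaml", "s2.yaml", "s3.yaml", "s4.yaml", "s5.yaml"], "s6.yaml"));
```

In a running crawler this would be called from a timer firing every `--saveStateInterval` seconds, with the files in `remove` then deleted from disk.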
phiresky
fb297574c7
add documentation of env variables for socks proxy + browser extensions (#120) 2022-03-13 15:00:46 -07:00
Simon Wiles
d7c24c44f6
Set a UTF-8 locale in Dockerfile (#122) 2022-03-13 12:47:37 -07:00
Chris Millson
7f1ea89456
Fix typo in regex yaml example (#121)
crawl-this|crawl-that didn't have () around it in the yaml example
2022-03-11 13:54:13 -08:00
Ilya Kreymer
affa45a7d4 dependency: update py-wacz dependency to 0.4.3 (to include webrecorder/py-wacz#16 fix)
bump to 0.5.0-beta.4
2022-03-07 08:46:12 -08:00
Ilya Kreymer
7588f8d572 README: update README for #116, mention 'scopeType: domain' and http/https scope inclusion 2022-03-06 14:51:16 -08:00
Ilya Kreymer
0c32d0f223
add 'scopeType: domain' to include all subdomains + http/https include (#117)
- add 'scopeType: domain' to include all subdomains of a given seed url, eg. given `https://example.com/path` as starting seed, will consider `https://*.example.com/` to be in scope.
- include both http/https in all the default scopes except single page (page-spa, prefix, host, domain), eg. given https://example.com/, will also include http://example.com/
- fixes #116
2022-03-06 14:46:14 -08:00
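One way to picture the `domain` scope is as a regular expression built from the seed URL that accepts both schemes and any subdomain. The sketch below is an illustration only: `domainScopeRegex` is a hypothetical name, the exact pattern is an assumption, and the `www.` stripping reflects the later "scopeType improvement" commit in this log:

```javascript
// Hypothetical helper: build a scope regex for 'scopeType: domain'.
function domainScopeRegex(seedUrl) {
  const { hostname } = new URL(seedUrl);
  // Drop a leading "www." so the bare domain anchors the scope.
  const domain = hostname.replace(/^www\./, "");
  // Escape regex metacharacters in the domain (notably the dots).
  const escaped = domain.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // http or https, an optional chain of subdomains, then the domain itself.
  return new RegExp("^https?://([^/]+\\.)?" + escaped + "(?:[/:]|$)");
}

const scope = domainScopeRegex("https://www.example.com/path");
console.log(scope.test("http://sub.example.com/page"));
console.log(scope.test("https://notexample.com/"));
```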
Ilya Kreymer
e160382f4d
Screencast + Redis state tweaks (#109)
* redis save state: load queued and done urls in chunks in case lists are large

* screencast: add 'init' message to include number of workers and dimensions
2022-03-02 13:26:11 -08:00
Ilya Kreymer
805b6466bc screencast tweaks:
- set default dimension to 640x480
- don't send frames for about:blank
- ensure url updated in cache
- rename screencast html to screencast.html
2022-02-23 14:39:33 -08:00
Ilya Kreymer
ef53b1acea
Screencast Refactor (#108)
- Move connection data to separate transport class, in addition to current, direct connection via WS, also support sending screencast data via redis pubsub
- Implement WSTransport and RedisPubSubTransport for screencasting
- Redis screencasting enabled when --redisStoreUrl is set and --screencastRedis is set.
- Redis screencasting uses pubsub channels:
* a ctrl channel is used to start/stop screencasting
* a data channel is used to send screencast messages

Simplify screencasting messages:
{"msg": "screencast", "id": "<page id>", "url": "<page url>", "data": "<png base64 data>"} - for new and incremental screencast frames for page id
{"msg": "close", "id": "<page id>"} - to indicate page id has closed.
Rename html dir from screencast -> html
2022-02-23 12:09:48 -08:00
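The two message shapes listed in this commit are simple enough to sketch as builders plus a receiver that tracks the latest frame per page. The function names and the `Map`-based frame store are assumptions for illustration. Only the JSON field names come from the commit message:

```javascript
// Builders for the two message shapes described above.
function screencastMessage(id, url, pngBase64) {
  return JSON.stringify({ msg: "screencast", id, url, data: pngBase64 });
}

function closeMessage(id) {
  return JSON.stringify({ msg: "close", id });
}

// Hypothetical receiver: keep the latest frame for each open page id.
function handleMessage(raw, frames) {
  const message = JSON.parse(raw);
  if (message.msg === "screencast") {
    frames.set(message.id, { url: message.url, data: message.data });
  } else if (message.msg === "close") {
    frames.delete(message.id);
  }
  return frames;
}

const frames = new Map();
handleMessage(screencastMessage("page-1", "https://example.com/", "iVBORw0KGgo="), frames);
console.log(frames.size);
```

Because both transports carry the same JSON strings, a receiver like this would not need to know whether the message arrived over a WebSocket or a redis pubsub channel.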
Ilya Kreymer
761ce7067b
behaviors update (#105)
* update to browsertrix-behaviors 0.2.5 to support improved autoscroll
- add evaluateWithCLI() to support evaluate() with 'getEventListeners()' and other devtools command-line api functions, to allow autoscroll behavior to check if it should exit out early
- inject behaviors into interactive loader to allow testing
- fix signal handler if state not inited yet
- dependencies: update puppeteer-cluster to latest, update pywb to 2.6.5
2022-02-20 22:22:19 -08:00
Ilya Kreymer
a54ca6e51d scopes:
- fix scopeType prefix set + exclude not reverting to custom
- only mark include + scopeType as overlapping
2022-02-13 14:34:25 -08:00
Ilya Kreymer
56be08e2e0 state improvements:
- local: use map for pending state
- redis: use hmap for pending state
- redis: support requeuing if only pending urls are left, add expiring keys per pending page for pageTimeout
2022-02-09 22:53:15 -08:00
Ilya Kreymer
c2ce9fc001
various state + wacz fixes: (#101)
- wacz: update to py-wacz 0.4.1, avoid reading full file into memory to compute hashes

state: fix pending state, account for puppeteer-cluster popping/pushing jobs from queue:
* puppeteer-cluster: add custom 'start()' callback to indicate task actually starting
* new semantics: add pending urls in pending state immediately, remove if readded to queue, add 'started' when actually started

minio: use fPutObject to support parallel uploading, compute hash and size separately (for now)
dependencies: update to latest minio

error checking:
* print number of WARCs found, exit with error if 0
* ensure wacz creation succeeds, exit with error code if not
* validate wacz after creation, exit with error code if validation fails

bump to 0.5.0-beta.3
2022-02-08 15:31:55 -08:00
Ilya Kreymer
66ce6688eb
Add WACZ Signing Support (#99)
* initial support for wacz signing (using a custom version of py-wacz)
- signing url and token set via env vars WACZ_SIGN_TOKEN and WACZ_SIGN_URL
- add CHANGELIST for 0.5.0
- bump pywb to 2.6.4
2022-01-26 16:06:10 -08:00
Ilya Kreymer
e12463446a lint style fix 2022-01-26 12:56:35 -08:00
CreativeCactus
eb1dd8e8cf
browser option: custom flags via CHROME_FLAGS env option (#96) 2022-01-26 12:22:52 -08:00
Ilya Kreymer
201eab4ad1
Support Extra Hops beyond current scope with --extraHops option (#98)
* extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83

* update README with info on `extraHops`, add tests for extraHops

* dependency fix: use pywb 2.6.3, warcio 1.5.0

* bump to 0.5.0-beta.2
2022-01-15 09:03:09 -08:00
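The "N extra hops beyond the existing scope" idea can be sketched as a counter carried along with each queued URL. The bookkeeping below is an assumption about how such a counter might work, not the crawler's actual logic, and `nextExtraHops` is a hypothetical name:

```javascript
// Hypothetical helper: decide whether a discovered link should be queued.
// Returns the extra-hop count to store with the URL, or null to skip it.
function nextExtraHops(linkInScope, parentExtraHops, maxExtraHops) {
  if (linkInScope) {
    // Assumption: in-scope links inherit the count of the page they were found on.
    return parentExtraHops;
  }
  // Out-of-scope links cost one hop; drop them once the budget is used up.
  const hops = parentExtraHops + 1;
  return hops <= maxExtraHops ? hops : null;
}

// With --extraHops 1: one step out of scope is allowed, a second is not.
console.log(nextExtraHops(false, 0, 1));
console.log(nextExtraHops(false, 1, 1));
```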
Ilya Kreymer
9f541ab011
Support for uploading to S3 (#95)
- support uploading WACZ to s3-compatible storage (via minio client)
- config storage loaded from env vars, enabled when WACZ output is used.
- support pinging either an http or a redis key-based webhook
- webhook: include 'completed' bool to indicate if fully completed crawl or partial (eg. interrupted via signal)
- consolidate redis init to redis.js
- support upload filename with custom variables: can interpolate current timestamp (@ts), hostname (@hostname) and user provided id (@crawlId)
- README: add docs for s3 storage, remove unused args
- update to pywb 2.6.2, browsertrix-behaviors 0.2.4

* fix to `limit` option, ensure limit check uses shared state

* bump version to 0.5.0-beta.1
2021-11-23 12:53:30 -08:00
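Interpolating `@ts`, `@hostname` and `@crawlId` into an upload filename might look like the following. This is a sketch, not the crawler's implementation. In particular the 14-digit timestamp format is an assumption, as are the parameter names:

```javascript
// Hypothetical sketch of filename interpolation with the three variables
// named in the commit message.
function interpolateFilename(template, { crawlId, hostname, now = new Date() }) {
  // Assumed timestamp format: the first 14 digits of the ISO time (YYYYMMDDhhmmss, UTC).
  const ts = now.toISOString().replace(/[^\d]/g, "").slice(0, 14);
  // Function replacers avoid "$" in values being treated as a pattern.
  return template
    .replace(/@ts/g, () => ts)
    .replace(/@hostname/g, () => hostname)
    .replace(/@crawlId/g, () => crawlId);
}

console.log(
  interpolateFilename("crawl-@ts-@crawlId.wacz", {
    crawlId: "my-crawl",
    hostname: "crawler-0",
    now: new Date("2021-11-23T20:53:30Z"),
  })
);
```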
Ilya Kreymer
f5d0328ac0 don't set skipDuplicateUrls at puppeteer-cluster level, as already handling via crawl state. potential fix for issue in #91 where
crawl appears to not finish
2021-10-27 20:49:37 -07:00
Ilya Kreymer
39ddecd35e
State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78)
* save state work:
- support interrupting and saving crawl
- support loading crawl state (frontier queue, pending, done) from YAML
- support scope check when loading to apply new scoping rules when restarting crawl
- failed urls added to done as failed, can be retried if crawl is stopped and restarted
- save state to crawls/crawl-<ts>-<id>.yaml when interrupted
- --saveState option controls when crawl state is saved, default to partial/when interrupted, also always, never.
- support in-memory or redis based crawl state, using fork of puppeteer-cluster
- --redisStore used to enable redis-based state



* signals/crawl interruption:
- crawl state set to drain/not provide any more urls to crawl
- graceful stop of crawl in response to sigint/sigterm
- initial sigint/sigterm waits for graceful end of current pages, second terminates immediately
- initial sigabrt followed by sigterm terminates immediately
- puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT

* redis state support:
- use lua scripts for atomic move from queue -> pending, and pending -> done
- pending key expiry set to page timeout
- add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination
- drainMax returns the numPending() + numSeen() to work with cluster stats

* arg improvements:
- add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file)
- support setting cmdline args via env var CRAWL_ARGS
- use 'choices' in args when possible

* build update:
- switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds
- use setuptools<58.0

* misc crawl/scoping rule fixes:
- scoping rules fix when external is used with scopeType
state:
- limit: ensure no urls, including initial seeds, are added past the limit
- signals: fix immediate shutdown on second signal
- tests: add scope test for default scope + excludes

* py-wacz update
- add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2)
- pywb: use latest pywb branch for improved twitter video capture

* update to latest browsertrix-behaviors

* fix setuptools dependency #88

* update README for 0.5.0 beta
2021-09-28 09:41:16 -07:00
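The queue, pending and done lists described in this commit form a small state machine. The in-memory model below is a hypothetical sketch of those transitions (class and method names are assumptions). In the redis-backed variant each move would need to be atomic, which is what the lua scripts mentioned above are for:

```javascript
// Hypothetical in-memory model of the queue -> pending -> done transitions.
class MemoryCrawlState {
  constructor() {
    this.queue = [];          // frontier: urls waiting to be crawled
    this.pending = new Map(); // url -> start time (ms) of in-progress pages
    this.done = [];           // finished pages, including failed ones
    this.seen = new Set();    // every url ever added, to skip duplicates
    this.draining = false;    // set when the crawl is being interrupted
  }

  add(url) {
    if (this.seen.has(url)) return false;
    this.seen.add(url);
    this.queue.push(url);
    return true;
  }

  // Move the next url from queue to pending; hand out nothing while draining.
  next() {
    if (this.draining || this.queue.length === 0) return null;
    const url = this.queue.shift();
    this.pending.set(url, Date.now());
    return url;
  }

  // Move a url from pending to done; failed urls are recorded as failed.
  finish(url, failed = false) {
    if (!this.pending.delete(url)) return false;
    this.done.push({ url, failed });
    return true;
  }

  drain() {
    this.draining = true;
  }

  // Plain object suitable for writing out as a save state.
  serialize() {
    return { queued: [...this.queue], pending: [...this.pending.keys()], done: [...this.done] };
  }
}

const state = new MemoryCrawlState();
state.add("https://example.com/");
state.finish(state.next());
console.log(state.serialize());
```

On a first SIGINT or SIGTERM, a crawler built this way would call `drain()`, let pending pages finish, and then write `serialize()` to the save state file.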
Ilya Kreymer
2956be2026
README: make profile paths in README consistent, fixes #84 2021-08-29 14:20:36 -07:00
Ilya Kreymer
8c8cf232de update CHANGES for 0.4.4! 2021-08-17 21:24:56 -07:00
Ilya Kreymer
c5494be653
Page Resource Block Rules Avoid Duplicate Handlers + Ignore top-level pages + README update (0.4.4) (#81)
* blockrules improvements:
- add await to continue/abort to catch errors, each called only in one place.
- avoid adding multiple interception handlers for same page to avoid 'request already handled' errors
- disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning

* setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set.

* scopeType rename:
- rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage
- rename 'none' -> page to indicate default single-page-only crawl
- messaging: adjust error message displaying valid scopeTypes

* README: Add additional examples for scope rules, update scopeType param, explain difference between scope rules vs block rules, to better address confusion as per #80

bump to 0.4.4
2021-08-17 20:54:18 -07:00
Rebecca Sutton Koeser
4033c52693
Revise docker syntax for screencast examples (#77)
Specify port binding option as a parameter of `docker run` instead of within the `crawl` command
2021-08-05 13:06:14 -07:00
Ilya Kreymer
d27e67e92e README: fix invalid dashes, addresses #76 2021-07-28 15:43:36 -07:00
Ilya Kreymer
be1ee53c3e
BlockRules Fixes (0.4.3) (#75)
- blockrules fix: when checking an iframe nav request, match inFrameUrl against the parent iframe, not current one
- blockrules: cleanup, always allow 'pywb.proxy' static files
- logging: when 'debug' logging enabled, log urls blocked and conditional iframe checks from blockrules
- tests: add more complex test for blockrules
- update CHANGES and support info in README
- bump to 0.4.3
2021-07-27 09:41:21 -07:00
Ilya Kreymer
f0c5ca1035 ci release: fix typo in release yaml config for latest tag 2021-07-23 20:00:40 -07:00
Ilya Kreymer
0e0b85d7c3
Customizable extract selectors + typo fix (0.4.2) (#72)
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure

* custom link extraction improvements (improvements for #25) 
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector

* tests
- tests: all tests use 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction

* add to CHANGES, bump to 0.4.2
2021-07-23 18:31:43 -07:00
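The select opts described in this commit ({selector, extract, isAttribute}, defaulting to the `href` property of `a[href]`) suggest an extractor along these lines. This is a sketch under assumptions: `root` is anything exposing `querySelectorAll()` (a DOM document in practice), and the function body is illustrative rather than the crawler's own code:

```javascript
// Default taken from the commit message: follow the href of every <a>.
const DEFAULT_SELECT_OPTS = [{ selector: "a[href]", extract: "href", isAttribute: false }];

// Hypothetical extractor: returns a flat list of link URLs.
function extractLinks(root, selectOpts = DEFAULT_SELECT_OPTS) {
  const urls = [];
  for (const { selector, extract, isAttribute } of selectOpts) {
    for (const el of root.querySelectorAll(selector)) {
      // Read either an HTML attribute or a DOM property, depending on the flag.
      const value = isAttribute ? el.getAttribute(extract) : el[extract];
      if (value) urls.push(value);
    }
  }
  return urls;
}

// Stand-in for a document, so the sketch can run outside a browser.
const fakeDocument = {
  querySelectorAll: (selector) =>
    selector === "a[href]" ? [{ href: "https://example.com/a" }, { href: "https://example.com/b" }] : [],
};
console.log(extractLinks(fakeDocument));
```

Returning plain URLs and leaving the scope filtering to a separate step mirrors the split between extractLinks() and queueInScopeUrls() named in the message.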
Ilya Kreymer
36ac3cb905
Update README.md with new features from 0.4.1 release! 2021-07-22 17:55:42 -07:00
Ilya Kreymer
bd44190ab2
Build simplification: Use :latest Version By default + README update (#71)
* docker-compose: just use ':latest' tag for local builds, allow users working with local docker-compose.yml to just build latest image
- ci: add 'latest' tag to release ci build to automatically update latest as well
- README: remove '[VERSION]', just refer to latest version of image in all examples
- README: mention using specific released tag version for production
2021-07-22 17:46:10 -07:00