Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-27 02:04:10 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	cf90304fa7	0.6.0 Wait State + Screencasting Fixes (#141 ) * new options: - to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished - when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let job controller know crawl is finished. - set redis state to `failing` in case of exception, set to `failed` in case of >3 or more failed exits within 60 seconds (todo: make customizable) - when receiving a SIGUSR1, assume final shutdown and finalize files (eg. save WACZ) before exiting. - also write WACZ if exiting due to size limit exceeded, but not due to other interruptions - change sleep() to be in seconds * misc fixes: - crawlstate.finished() -> isFinished() - return if >0 pages and none left in queue - don't fail crawl if isFinished() is true - don't keep looping in pending wait for urls to finish if received abort request * screencast improvements (fix related to webrecorder/browsertrix-cloud#233) - more optimized screencasting, don't close and restart after every page. - don't assume targets change after every page, they don't in window mode! - only send 'close' message when target is actually closed * bump to 0.6.0	2022-06-17 11:58:44 -07:00
Ilya Kreymer	e7eb6a6620	create profile: fix typo in cookie settings, multiply by seconds in day uwsgi: set number of workers to be 2x cpus by default	2022-06-01 09:11:11 -07:00
Ilya Kreymer	70ba9241ca	limit interrupt fix: after self-interrupting, only look at local pending list (for redis state) logging: don't log CF check errors, do log when errorCount is reset	2022-05-19 06:25:46 +00:00
Ilya Kreymer	6ec47cdd14	profile creation: when creating a profile, force all cookies to have a duration to avoid expiring session cookies (#139 ) - save cookies on page load and also before profile creation - default cookie duration is 7 days, configurable via --cookieDays option	2022-05-18 23:23:32 -07:00
Ilya Kreymer	93b6dad7b9	Health Check + Size Limits + Profile fixes (#138 ) - Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check - Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded. - Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded. - Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted. - S3 Storage refactor, simplify, don't add additional paths by default. - Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value. - wacz save: reenable wacz validation after save. - Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs. - bump to 0.6.0-beta.1	2022-05-18 22:51:55 -07:00
Ilya Kreymer	500ed1f9a1	Profile Creation Improvements (#136 ) * interactive profile api improvements: - refactor profile creation into separate class - if profile starts with '@', load as relative path using current s3 storage - support uploading profiles to s3 - profile api: support filename passed to /createProfieJS as part of json POST - profile api: support /ping to keep profile browser running, --shutdownWait to add autoshutdown timeout (extendable via ping) - profile api: add /target to retrieve target and /navigate to navigate by url. * bump to 0.6.0-beta.0	2022-05-05 14:27:17 -05:00
Ilya Kreymer	5dfbfbeaf6	update dependencies: (#134 ) - update pywb to 2.6.7, fix possible cdx indexing error via --generateCDX - update wacz to 0.4.6, ensure wacz file is closed, with better and more error-resilient text extraction - update browsertrix-behaviors to 0.3.0, support for telegram behavior - bump version to 0.5.1	2022-04-15 16:22:47 -07:00
Ilya Kreymer	9b938304ce	dependencies: update to pywb>=2.6.6, wacz>=0.4.5	2022-04-11 15:09:59 -07:00
Ilya Kreymer	cc391146c4	package: set minio version to fixed (7.0.26)	2022-04-09 22:07:17 -07:00
Ilya Kreymer	bfd72835d1	update CHANGES for 0.5.0 release	2022-04-09 21:59:44 -07:00
Ilya Kreymer	7ed5586bdb	scopeType improvement: when setting scopeType domain on a URL with "www.", automatically drop the www. for simplicity	2022-03-22 17:43:13 -07:00
Ilya Kreymer	5afd19f43d	Non-HTML Page Load Optimization (#130 ) * non-html page load improvements: fix for #129 - don't include cookie check in eliminating direct fetch, may be too speculative - as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors - don't do text extraction for non-HTML pages (will need to handle pdf separately) bump to 0.5.0-beta.8	2022-03-22 17:41:51 -07:00
Ilya Kreymer	09082e8abb	dependencies: set wacz>=0.4.4	2022-03-18 10:38:34 -07:00
Ilya Kreymer	8727ca7f8c	redis state error handling: catch and log potential errors with reading json state for next url bump version to 0.5.0-beta.7	2022-03-18 10:34:17 -07:00
Ilya Kreymer	5e5efda437	Profile Creation Fix + Cloudflare Wait Support + UserAgent Fix (#128 ) * cloudflare wait improvements (#110 fix) - set navigator.webdriver to false to help with cloudflare wait - add checkCF() that will detect cloudflare ddos page and wait 5 seconds until original page is loaded * chrome args refactor: - move to utils/browser - add LazyFrameLoading disable to fix occasional issues with page.goto() never finishing - add userAgent option * profile creation improvements: - fix loadProfile() missing await - fix url to support running remotely - load shared chromeArgs() - add --proxy to support profile creation through pywb proxy * fix setting custom userAgent (#90) - fix typo that resulted in error - ensure userAgent is applied separate from emulatedDevice - add getDefaultUA() browser util	2022-03-18 10:32:59 -07:00
Ilya Kreymer	dedf1cc0ad	typo fix: add await to loadProfile in create-login-profile.js	2022-03-15 02:40:06 +00:00
Ilya Kreymer	12d96f22c6	Profile download support (#126 ) * profiles: support loading profiles via a URL. * add 'request' dependency * README: mention profile URLs	2022-03-14 14:44:24 -07:00
Ilya Kreymer	1fae21b0cf	Better check to see if ERR_ABORTED should be ignored. (#127 ) * error abort check: Fix possible regression with req.failure() returning null, also move to separate function., wrap in exception handler * bump version to 0.5.0-beta.6	2022-03-14 14:41:39 -07:00
Ilya Kreymer	ab096cd5b0	Improve to URL direct check and fetch (#125 ) - direct check fix: only do direct check if HEAD returns 200 status code - if direct load results in non-200 status code, still load in browser - error reporting: detect if net:ERR_ABORTED is actually caused by loading of PDF / other binary that is downloaded, and not an actual page load error - state: tweak error logging message	2022-03-14 11:11:53 -07:00
Ilya Kreymer	81e8fa6da7	Incremental save state (#124 ) * save state: if --saveState set to always, incrementally save state every --saveStateInterval seconds, and keep last --saveStateHistory number of save states in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states display save state status on startup page write fixes: add missing await fix for #113 * update README	2022-03-14 10:41:56 -07:00
phiresky	fb297574c7	add documentation of env variables for socks proxy + browser extensions (#120 )	2022-03-13 15:00:46 -07:00
Simon Wiles	d7c24c44f6	Set a `UTF-8` locale in `Dockerfile` (#122 )	2022-03-13 12:47:37 -07:00
Chris Millson	7f1ea89456	Fix typo in regex yaml example (#121 ) crawl-this\|crawl-that didn't have () around it in the yaml example	2022-03-11 13:54:13 -08:00
Ilya Kreymer	affa45a7d4	dependency: update py-wacz dependency to 0.4.3 (to include webrecorder/py-wacz#16 fix) bump to 0.5.0-beta.4	2022-03-07 08:46:12 -08:00
Ilya Kreymer	7588f8d572	README: update README for #116 , mention 'scopeType: domain' and http/https scope inclusion	2022-03-06 14:51:16 -08:00
Ilya Kreymer	0c32d0f223	add 'scopeType: domain' to include all subdomains + http/https include (#117 ) - add 'scopeType: domain' to include all subdomains of a given seed url, eg. given `https://example.com/path' as starting seed, will consider `https://*.example.com/` to be in scope. - include both http/https in all the default scopes except single page (page-spa, prefix, host, domain), eg. given https://example.com/, will also include http://example.com/ - fixes #116	2022-03-06 14:46:14 -08:00
Ilya Kreymer	e160382f4d	Screencast + Redis state tweaks (#109 ) * redis save state: load queued and done urls in chunks in case lists are large * screencast: add 'init' message to include number of workers and dimensions	2022-03-02 13:26:11 -08:00
Ilya Kreymer	805b6466bc	screencast tweaks: - set default dimension to 640x480 - don't send frames for about:blank - ensure url updated in cache - rename screencast html to screencast.html	2022-02-23 14:39:33 -08:00
Ilya Kreymer	ef53b1acea	Screencast Refactor (#108 ) - Move connection data to separate transport class, in addition to current, direct connection via WS, also support sending screencast data via redis pubsub - Implement WSTransport and RedisPubSubTransport for screencasting - Redis screencasting enabled when --redisStoreUrl is set and --screencastRedis is set. - Redis screencasting uses pubsub channels: * a ctrl channel is used to start/stop screencasting * a data channel is used to send screencast messages Simplify screencasting messages: {"msg": "screencast", "id": "<page id>", "url": "<page url>", "data": "<png base64 data>"} - for new and incremental screencast frames for page id {"msg": "close", "id": "<page id>"} - to indicate page id has closed. Rename html dir from screencast -> html	2022-02-23 12:09:48 -08:00
Ilya Kreymer	761ce7067b	behaviors update (#105 ) * update to browsertrix-behaviors 0.2.5 to support improved autoscroll - add evaluateWithCLI() to support evaluate() with 'getEventListeners()' and other devtools command-line api functions, to allow autoscroll behavior to check if it should exit out early - inject behaviors into interactive loader to allow testing - fix signal handler if state not inited yet - dependencies: update puppeteer-cluster to latest, update pywb to 2.6.5	2022-02-20 22:22:19 -08:00
Ilya Kreymer	a54ca6e51d	scopes: - fix scopeType prefix set + exclude not reverting to custom - only mark include + scopeType as overlapping	2022-02-13 14:34:25 -08:00
Ilya Kreymer	56be08e2e0	state improvements: - local: use map for pending state - redis: use hmap for pending state - redis: support requeuing if only pending urls are left, add expiring keys per pending page for pageTimeout	2022-02-09 22:53:15 -08:00
Ilya Kreymer	c2ce9fc001	various state + wacz fixes: (#101 ) - wacz: update to py-wacz 0.4.1, avoid reading full file into memory to compute hashes state: fix pending state, account for puppeteer-cluster popping/pushing jobs from queue: * puppeteer-cluster: add custom 'start()' callback to indicate task actually starting * new semantics: add pending urls in pending state immediately, remove if readded to queue, add 'started' when actually started minio: use fPutObject to support parallel uploading, compute hash and size separately (for now) dependencies: update to latest minio error checking: * print number of WARCs found, exit with error if 0 * ensure wacz creation succeeds, exit with error code if not * validate wacz after creation, exit with error code if validation fails bump to 0.5.0-beta.3	2022-02-08 15:31:55 -08:00
Ilya Kreymer	66ce6688eb	Add WACZ Signing Support (#99 ) * initial support for wacz signing (using a custom version of py-wacz) - signing url and token set via env vars WACZ_SIGN_TOKEN and WACZ_SIGN_URL - add CHANGELIST for 0.5.0 - bump pywb to 2.6.4	2022-01-26 16:06:10 -08:00
Ilya Kreymer	e12463446a	lint style fix	2022-01-26 12:56:35 -08:00
CreativeCactus	eb1dd8e8cf	browser option: custom flags via CHROME_FLAGS env option (#96 )	2022-01-26 12:22:52 -08:00
Ilya Kreymer	201eab4ad1	Support Extra Hops beyond current scope with --extraHops option (#98 ) * extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83 * update README with info on `extraHops`, add tests for extraHops * dependency fix: use pywb 2.6.3, warcio 1.5.0 * bump to 0.5.0-beta.2	2022-01-15 09:03:09 -08:00
Ilya Kreymer	9f541ab011	Support for uploading to S3 (#95 ) - support uploading WACZ to s3-compatible storage (via minio client) - config storage loaded from env vars, enabled when WACZ output is used. - support pinging either an http or a redis key-based webhook - webhook: include 'completed' bool to indicate if fully completed crawl or partial (eg. interrupted via signal) - consolidate redis init to redis.js - support upload filename with custom variables: can interpolate current timestamp (@ts), hostname (@hostname) and user provided id (@crawlId) - README: add docs for s3 storage, remove unused args - update to pywb 2.6.2, browsertrix-behaviors 0.2.4 * fix to `limit` option, ensure limit check uses shared state * bump version to 0.5.0-beta.1	2021-11-23 12:53:30 -08:00
Ilya Kreymer	f5d0328ac0	don't set skipDuplicateUrls at puppeteer-cluster level, as already handling via crawl state. potential fix for issue in #91 where crawl appears to not finish	2021-10-27 20:49:37 -07:00
Ilya Kreymer	39ddecd35e	State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78 ) * save state work: - support interrupting and saving crawl - support loading crawl state (frontier queue, pending, done) from YAML - support scope check when loading to apply new scoping rules when restarting crawl - failed urls added to done as failed, can be retried if crawl is stopped and restarted - save state to crawls/crawl-<ts>-<id>.yaml when interrupted - --saveState option controls when crawl state is saved, default to partial/when interrupted, also always, never. - support in-memory or redis based crawl state, using fork of puppeteer-cluster - --redisStore used to enable redis-based state * signals/crawl interruption: - crawl state set to drain/not provide any more urls to crawl - graceful stop of crawl in response to sigint/sigterm - initial sigint/sigterm waits for graceful end of current pages, second terminates immediately - initial sigabrt followed by sigterm terminates immediately - puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT * redis state support: - use lua scripts for atomic move from queue -> pending, and pending -> done - pending key expiry set to page timeout - add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination - drainMax returns the numPending() + numSeen() to work with cluster stats * arg improvements: - add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file) - support setting cmdline args via env var CRAWL_ARGS - use 'choices' in args when possible * build update: - switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds - use setuptools<58.0 * misc crawl/scoping rule fixes: - scoping rules fix when external is used with scopeType state: - limit: ensure no urls, including initial seeds, are added past the limit - signals: fix immediate shutdown on second signal - tests: add scope test for default scope + excludes * py-wacz update - add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2) - pywb: use latest pywb branch for improved twitter video capture * update to latest browsertrix-behaviors * fix setuptools dependency #88 * update README for 0.5.0 beta	2021-09-28 09:41:16 -07:00
Ilya Kreymer	2956be2026	README: make profile paths in README consistent, fixes #84	2021-08-29 14:20:36 -07:00
Ilya Kreymer	8c8cf232de	update CHANGES for 0.4.4!	2021-08-17 21:24:56 -07:00
Ilya Kreymer	c5494be653	Page Resource Block Rules Avoid Duplicate Handlers + Ignore top-level pages + README update (0.4.4) (#81 ) * blockrules improvements: - add await to continue/abort to catch errors, each called only in one place. - avoid adding multiple interception handlers for same page to avoid 'request already handled' errors - disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning * setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set. * scopeType rename: - rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage - rename 'none' -> page to indicate default single-page-only crawl - messaging: adjust error message displaying valid scopeTypes * README: Add additional examples for scope rules, update scopeType param, explain difference between scope rules vs block rules, to better address confusion as per #80 bump to 0.4.4	2021-08-17 20:54:18 -07:00
Rebecca Sutton Koeser	4033c52693	Revise docker syntax for screencast examples (#77 ) Specify port binding option as a parameter of `docker run` instead of within the `crawl` command	2021-08-05 13:06:14 -07:00
Ilya Kreymer	d27e67e92e	README: fix invalid dashes, addresses #76	2021-07-28 15:43:36 -07:00
Ilya Kreymer	be1ee53c3e	BlockRules Fixes (0.4.3) (#75 ) - blockrules fix: when checking an iframe nav request, match inFrameUrl against the parent iframe, not current one - blockrules: cleanup, always allow 'pywb.proxy' static files - logging: when 'debug' logging enabled, log urls blocked and conditional iframe checks from blockrules - tests: add more complex test for blockrules - update CHANGES and support info in README - bump to 0.4.3	2021-07-27 09:41:21 -07:00
Ilya Kreymer	f0c5ca1035	ci release: fix typo in release yaml config for latest tag	2021-07-23 20:00:40 -07:00
Ilya Kreymer	0e0b85d7c3	Customizable extract selectors + typo fix (0.4.2) (#72 ) * fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail. - wrap directFetchCapture() to retry browser loading in case of failure * custom link extraction improvements (improvements for #25) - extractLinks() returns a list of link URLs to allow for more flexibility in custom driver - rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed - loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false} - tests: add test for custom driver which uses custom selector * tests - tests: all tests uses 'test-crawls' instead of crawls - consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js - add custom driver test and fixture to test custom link extraction * add to CHANGES, bump to 0.4.2	2021-07-23 18:31:43 -07:00
Ilya Kreymer	36ac3cb905	Update README.md with new features from 0.4.1 release!	2021-07-22 17:55:42 -07:00
Ilya Kreymer	bd44190ab2	Build simplification: Use :latest Version By default + README update (#71 ) * docker-compose: just use ':latest' tag for local builds, allow users working with local docker-compose.yml to just build latest image - ci: add 'latest' tag to release ci build to automatically update latest as well - README: remove '[VERSION]', just refer to latest version of image in all examples - README: mention using specific released tag version for production	2021-07-22 17:46:10 -07:00
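
Note on commit 93b6dad7b9 (health check): the message states that the health check server returns 200 while the number of consecutive failed page loads is below 2 * the number of workers, and 503 otherwise. The following is a minimal Python sketch of that rule only. The crawler itself is a Node.js project, and the function and parameter names here are hypothetical and not taken from its source.

```python
from http import HTTPStatus


def health_status(consecutive_failed_loads: int, num_workers: int) -> int:
    """Return the HTTP status a health check would report.

    Hypothetical helper illustrating the rule from the commit message:
    healthy while consecutive failed page loads < 2 * num_workers.
    """
    threshold = 2 * num_workers  # failure budget scales with worker count
    if consecutive_failed_loads < threshold:
        return int(HTTPStatus.OK)  # 200: health check succeeds
    return int(HTTPStatus.SERVICE_UNAVAILABLE)  # 503: health check fails


if __name__ == "__main__":
    # With 4 workers the threshold is 8 consecutive failures.
    for failures in (0, 7, 8):
        print(failures, health_status(failures, num_workers=4))
```

Under these assumptions, a Kubernetes liveness probe pointed at the port given to `--healthCheckPort` would see the 503 once the failure count reaches the threshold.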


505 commits