Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Author	SHA1	Message	Date
Ilya Kreymer	9f541ab011	Support for uploading to S3 (#95 ) - support uploading WACZ to s3-compatible storage (via minio client) - config storage loaded from env vars, enabled when WACZ output is used. - support pinging either an http or a redis key-based webhook, - webhook: include 'completed' bool to indicate if fully completed crawl or partial (eg. interrupted via signal) - consolidate redis init to redis.js - support upload filename with custom variables: can interpolate current timestamp (@ts), hostname (@hostname) and user provided id (@crawlId) - README: add docs for s3 storage, remove unused args - update to pywb 2.6.2, browsertrix-behaviors 0.2.4 * fix to `limit` option, ensure limit check uses shared state * bump version to 0.5.0-beta.1	2021-11-23 12:53:30 -08:00
Ilya Kreymer	f5d0328ac0	don't set skipDuplicateUrls at puppeteer-cluster level, as already handling via crawl state. potential fix for issue in #91 where crawl appears to not finish	2021-10-27 20:49:37 -07:00
Ilya Kreymer	39ddecd35e	State Save + Restore State from Config + Redis State + Scope Fix 0.5.0 (#78 ) * save state work: - support interrupting and saving crawl - support loading crawl state (frontier queue, pending, done) from YAML - support scope check when loading to apply new scoping rules when restarting crawl - failed urls added to done as failed, can be retried if crawl is stopped and restarted - save state to crawls/crawl-<ts>-<id>.yaml when interrupted - --saveState option controls when crawl state is saved, default to partial/when interrupted, also always, never. - support in-memory or redis based crawl state, using fork of puppeteer-cluster - --redisStore used to enable redis-based state * signals/crawl interruption: - crawl state set to drain/not provide any more urls to crawl - graceful stop of crawl in response to sigint/sigterm - initial sigint/sigterm waits for graceful end of current pages, second terminates immediately - initial sigabrt followed by sigterm terminates immediately - puppeteer disable handleSIGTERM, handleSIGHUP, handleSIGINT * redis state support: - use lua scripts for atomic move from queue -> pending, and pending -> done - pending key expiry set to page timeout - add numPending() and numSeen() to support better puppeteer-cluster semantics for early termination - drainMax returns the numPending() + numSeen() to work with cluster stats * arg improvements: - add --crawlId param, also settable via CRAWL_ID env var, defaulting to os.hostname() (used for redis key and crawl state file) - support setting cmdline args via env var CRAWL_ARGS - use 'choices' in args when possible * build update: - switch base browser image to new webrecorder/browsertrix-browser-base, simple image with .deb files only for amd64 and arm64 builds - use setuptools<58.0 * misc crawl/scoping rule fixes: - scoping rules fix when external is used with scopeType state: - limit: ensure no urls, including initial seeds, are added past the limit - signals: fix immediate shutdown on second signal - tests: add scope test for default scope + excludes * py-wacz update - add 'seed': true to pages that are seeds for optimized wacz creation, keeping non-seeds separate (supported via wacz 0.3.2) - pywb: use latest pywb branch for improved twitter video capture * update to latest browsertrix-behaviors * fix setuptools dependency #88 * update README for 0.5.0 beta	2021-09-28 09:41:16 -07:00
Ilya Kreymer	2956be2026	README: make profile paths in README consistent, fixes #84	2021-08-29 14:20:36 -07:00
Ilya Kreymer	8c8cf232de	update CHANGES for 0.4.4!	2021-08-17 21:24:56 -07:00
Ilya Kreymer	c5494be653	Page Resource Block Rules Avoid Duplicate Handlers + Ignore top-level pages + README update (0.4.4) (#81 ) * blockrules improvements: - add await to continue/abort to catch errors, each called only in one place. - avoid adding multiple interception handlers for same page to avoid 'request already handled' errors - disallow blocking full pages via blockRules (should be handled via scope exclusion) and print warning * setup: ensure the 'cwd' for the crawl output exists on startup, in case a custom cwd was set. * scopeType rename: - rename 'page' -> page-spa to indicate support for hashtag / single-page-app intended usage - rename 'none' -> page to indicate default single-page-only crawl - messaging: adjust error message displaying valid scopeTypes * README: Add additional examples for scope rules, update scopeType param, explain difference between scope rules vs block rules, to better address confusion as per #80 bump to 0.4.4	2021-08-17 20:54:18 -07:00
Rebecca Sutton Koeser	4033c52693	Revise docker syntax for screencast examples (#77 ) Specify port binding option as a parameter of `docker run` instead of within the `crawl` command	2021-08-05 13:06:14 -07:00
Ilya Kreymer	d27e67e92e	README: fix invalid dashes, addresses #76	2021-07-28 15:43:36 -07:00
Ilya Kreymer	be1ee53c3e	BlockRules Fixes (0.4.3) (#75 ) - blockrules fix: when checking an iframe nav request, match inFrameUrl against the parent iframe, not current one - blockrules: cleanup, always allow 'pywb.proxy' static files - logging: when 'debug' logging enabled, log urls blocked and conditional iframe checks from blockrules - tests: add more complex test for blockrules - update CHANGES and support info in README - bump to 0.4.3	2021-07-27 09:41:21 -07:00
Ilya Kreymer	f0c5ca1035	ci release: fix typo in release yaml config for latest tag	2021-07-23 20:00:40 -07:00
Ilya Kreymer	0e0b85d7c3	Customizable extract selectors + typo fix (0.4.2) (#72 ) * fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail. - wrap directFetchCapture() to retry browser loading in case of failure * custom link extraction improvements (improvements for #25) - extractLinks() returns a list of link URLs to allow for more flexibility in custom driver - rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed - loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false} - tests: add test for custom driver which uses custom selector * tests - tests: all tests uses 'test-crawls' instead of crawls - consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js - add custom driver test and fixture to test custom link extraction * add to CHANGES, bump to 0.4.2	2021-07-23 18:31:43 -07:00
Ilya Kreymer	36ac3cb905	Update README.md with new features from 0.4.1 release!	2021-07-22 17:55:42 -07:00
Ilya Kreymer	bd44190ab2	Build simplification: Use :latest Version By default + README update (#71 ) * docker-compose: just use ':latest' tag for local builds, allow users working with local docker-compose.yml to just build latest image - ci: add 'latest' tag to release ci build to automatically update latest as well - README: remove '[VERSION]', just refer to latest version of image in all examples - README: mention using specific released tag version for production	2021-07-22 17:46:10 -07:00
Ilya Kreymer	f4c6b6a99f	0.4.1 Release! (#70 ) * optimization: don't intercept requests if no blockRules set * page load: set waitUntil to use networkidle2 instead of networkidle0 as reasonable default for most pages * add --behaviorTimeout to set max running time for behaviors (defaults to 90 seconds) * refactor profile loadProfile/saveProfile to util/browser.js - support augmenting existing profile when creating a new profile * screencasting: convert newContext to window instead of page by default, instead of just warning about it * shared multiplatform image support: - determine browser exe from list of options, getBrowserExe() returns current exe - supports running with 'google-chrome' under amd64, and 'chromium-browser' under arm64 - update to multiplatform oldwebtoday/chrome:91 as browser image - enable multiplatform build with latest build-push-action@v2 * seeds: add trim() to seed URLs * logging: reduce initial debug logging, enable only if '--logging debug' is set. log if profile, text-extraction enabled, and post-processing stages automatically * profile creation: add --windowSize flag, set default to 1600x900, default to loading Application tab, tweak UI styles * extractLinks: support passing in custom property to get link, and also loading as an attribute via getAttribute. Fixes #25 * update CHANGES and README with new features * bump version to 0.4.1	2021-07-22 14:24:51 -07:00
Ilya Kreymer	6a65ea7a58	update CHANGES.md for 0.4.0 bump version to 0.4.0 remove extraneous logging	2021-07-20 23:06:15 -07:00
Ilya Kreymer	d40cf6cc2b	Interactive Profiles + bug fixes (#69 ) * support for interactive profile creation mode via --interactive file * screencasting error catching, ensure errors in screencasting do not interrupt crawl * better error reporting for invalid seed URLs, fixes #67 * README: update to mention interactive profile creation, additional * dependencies: update to pywb 2.6.0b4, py-wacz 0.3.1, browsertrix-behaviors 0.2.3	2021-07-20 15:45:51 -07:00
Ilya Kreymer	6dbdff9656	Support for per-URL conditional Block Rules (#68 ) - Support for block rules specified in YAML config to exclude URLs based on regex, and also negate a rule by specifying `allowOnly` to allow URLs based on certain regex. - Support for conditional blocking for iframes, based on content of iframe text, specified via frameTextMatch regex. - Support for restricting block rules based on containing frame URL, specified via inFrameURL param. - Testing for various blockRules configurations - Fixes Support URL-level WARC-writing inclusion/exclusion lists #15 - optional message to add when a URL is blocked, specified via 'blockMessage' - update README for blockRules - bump to pywb dependency 2.5.0b4	2021-07-19 15:50:32 -07:00
Emma Dickson	838e1fa1bd	Documentation Update (#58 ) * README: update documentation to be more clear about how to use the seed file option Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2021-07-08 16:06:10 -07:00
Emma Dickson	c02855627c	Add fields to warcinfo in combinedwarc (#60 ) * add support for adding custom warcinfo fields via the 'warcinfo' block in yaml config or via --warcinfo.<field> command-line options * tests: add tests for warcinfo custom and standard fields ('software' and 'format') being added to warcinfo * fix warcio.js version being added incorrectly * switch to warc/1.0 for warcinfo field to match generated warcs from pywb, which use warc/1.0 (for now) Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2021-07-07 15:56:52 -07:00
Ilya Kreymer	473de8c49f	Scope Handling Improvements + Tests (#66 ) * scope fixes: - remove default prefix scopeType, ensure scope include and exclude take precedence - add new 'custom' scopeType, when include or exclude are used - use --scopeIncludeRx and --scopeExcludeRx for better consistency for scope include and exclude (also allow --include/--exclude) - ensure per-seed scope include/exclude used when present, and scopeType set to 'custom' - ensure default scope is set to 'prefix' if no scopeType and no include/exclude regexes specified - rename --type to --scopeType in seed to maintain consistency - add sitemap param as alias for useSitemap tests: - add seed scope resolution tests for argParse, testing per-scope seed resolution, inheritance and overrides - fix screencaster to use relative paths to work with tests - ci: use yarn instead of npm * update README with new flags * bump version to 0.4.0-beta.3	2021-07-06 20:22:27 -07:00
Ilya Kreymer	ef7d5e50d8	Per-Seed Scoping Rules + Crawl Depth (#63 ) * scoped seeds: - support per-seed scoping (include + exclude), allowHash, depth, and sitemap options - support maxDepth per seed #16 - combine --url, --seed and --urlFile/--seedFile urls into a unified seed list arg parsing: - simplify seed file options into --seedFile/--urlFile, move option in help display - rename --maxDepth -> --depth, supported globally and per seed - ensure custom parsed params from argParser passed back correctly (behaviors, logging, device emulation) - update to latest js-yaml - rename --yamlConfig -> --config - config: support reading config from stdin if --config set to 'stdin' * scope: fix typo in 'prefix' scope * update browsertrix-behaviors to 0.2.2 * tests: add test for passing config via stdin, also adding --excludes via cmdline * update README: - latest cli, add docs on config via stdin - rename --yamlConfig -> --config, consolidate --seedFile/--urlFile, move arg position - info on scoped seeds - list current scope types	2021-06-26 13:11:29 -07:00
Ilya Kreymer	f57818f2f6	New Docker Image, Customizable Browser Source + Binary (#62 ) * switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!) * add BROWSER_BIN build arg and env var to support building and running with different browser (defaults to google-chrome, but can be chromium, etc...) * github action ci: use system unzip * update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file. * Update README with info on customizing build image * bump version to 0.4.0-beta.2	2021-06-24 15:39:17 -07:00
Ilya Kreymer	3ebe511b32	Arg Parsing Refactor + Support for YAML Config Support (take 2!) (#59 ) * Create an argument parser class * move constants, arg parser to separate files in utils/* * ensure yaml config overridden by command-line args * yaml loading work: - simplify yaml config by using yargs.config option - move all option parsing to argParser, simply expose parseArgs - export constants directly - add lint to util/* files * support inline 'seeds' in cmdline and yaml config tests: - add test for crawl config, ensuring seeds crawled + wacz created - add test to ensure cmdline overrides yaml config * scope fix: empty scope implies only fixed list, use '.' for any scope lint fix * update readme with yaml config info * allow 'url' and 'seeds' if both provided Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local> Co-authored-by: emmadickson <emma.dickson@artsymail.com>	2021-06-23 19:45:40 -07:00
Ilya Kreymer	ae4ce979fb	Screencast Support for Debugging (fixes #43 ) (#52 ) * screencast support (fixes #43): - add NewWindowPage concurrency mode to support opening new window, and also reusing pages - add --screencastPort cli options to enable screencasting, uses websockets to stream frames to client - concurrency: add separate 'window' concurrency for opening new window per-page in same session, useful for screencasting with multiple workers but within same session * add warning if using screencasting + more than one worker + page context, recommend 'window' * cleanup: remove debug console, bump py-wacz dependency, improve close message * README: add screencasting info to README	2021-06-07 17:43:36 -07:00
Ilya Kreymer	e7d3767efb	Add scopeType options + option to crawl hashtags + simplify defaultDriver.js (#51 ) * support hashtag for page-scoped crawls: - allow hashtags for current page, automatically set scope to current w/ different hashtags - also allow hashtags for URLs specified via urlFile - driver: simplify driver, move default driver function to loadPage() - bump version to 0.4.0-beta.0 * add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well * graceful shutdown: ensure redis and pywb processes shutdown on exit (for use with singularity, outside of docker) * replace spaMode with more generic --scopeType, a shortcut to setting the scope via regex. scopeType options include: prefix - scope is prefix of current page (default) page - scope is current page + hashtags (spa mode) domain - scope is domain/origin of current page any - scope is any url (default for urlFile) - bump version to 0.4.0-beta.1	2021-05-21 15:37:02 -07:00
Emma Dickson	63376ab6ac	Add --urlFile param to specify text file with a list of URLs to crawl (#38 ) * Resolves #12 * Make --url param optional. Only one of --url or --urlFile should be specified. * Add ignoreScope option to queueUrls() to support adding specific URLs * add tests for urlFile * bump version to 0.3.2 Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>	2021-05-12 22:57:06 -07:00
Ilya Kreymer	2db7bc98b1	bump version to 0.3.1 for release	2021-05-04 13:38:56 -07:00
Ilya Kreymer	51bb54e869	add CHANGES.md for 0.3.1 release!	2021-05-04 13:13:33 -07:00
Ilya Kreymer	7bc8efff3d	add CHANGES.md, list changes for 0.3.1 update to browsertrix-behaviors 0.2.1	2021-05-04 12:10:12 -07:00
Emma Dickson	6211315999	update pages detection method (#50 ) Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>	2021-04-30 19:05:04 -07:00
Ilya Kreymer	183f8edf10	Wait for Pending Requests to Finish (#47 ) * pending request wait: - instead of waiting for 5s, check redis key 'pywb:{coll}:pending' to see if any pending requests are still pending - keep checking key until pending requests are at 0 - requires latest pywb 2.6.0+ - should fix #44 * fix test to no longer look for waiting for 5s message * lint settings and fixes: allow constant in loops, add lint command to script * chrome: bump default image to chrome:90 image	2021-04-30 15:31:14 -04:00
Sebastian Nagel	9d577dac57	Extract links from all frames attached to a page, fixes #45 (#48 )	2021-04-30 08:41:00 -07:00
Ilya Kreymer	9293375790	combine WARC/async fixes: (#49 ) * combine WARC/async fixes: - use streams for combine WARCs to avoid any issues with sync apis - use async apis for writing/reading pages as well * use async stat() * fix tests, also sets extension to .warc.gz, addresses #41	2021-04-29 14:34:56 -07:00
Ilya Kreymer	b1e0654bdd	update to browsertrix-behaviors 0.2.0 update to latest pywb@main create-login-profile: also allow 'email' as alternative to user name bump to 0.3.1-beta.0	2021-04-28 11:00:43 -07:00
Ilya Kreymer	dba4524246	ci: add push to registry on release action	2021-04-14 15:45:20 -07:00
Ilya Kreymer	eff4c61270	misc typos/fixes for 0.3.0: - update README with latest params - ensure capture dir includes seconds - bump behaviors to 0.1.1	2021-04-13 18:17:44 -07:00
Ilya Kreymer	b59788ea04	Profiles: Support for running with existing profiles + saving profile after a login (#34 ) Support for profiles via a mounted .tar.gz and --profile option + improved docs #18 * support creating profiles via 'create-login-profile' command with options for where to save profile, username/pass and debug screenshot output. support entering username and password (hidden) on command-line if omitted. * use patched pywb for fix * bump browsertrix-behaviors to 0.1.0 * README: updates to include better getting started, behaviors and profile reference/examples * bump version to 0.3.0!	2021-04-10 13:08:22 -07:00
Emma Dickson	c9f8fe051c	add collection name validation (#37 ) * add collection name validation * linter fix * add tests and optimize * linter fix * move to validateargs * properly reference collection * Update regex and error message Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>	2021-04-07 20:24:01 -04:00
Emma Dickson	24e2c4ddf8	Create --combineWARC flag that combines generated warcs into a single warc up to rollover size (#33 ) * generates combined WARCs in collection root directory with suffix `_0.warc`, `_1.warc`, etc.. * each combined WARC limited by the size in `--rolloverSize`, if exceeds a new WARC is created, otherwise appended to previous WARC. * add test for --combineWARC flag * add improved lint rules Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>	2021-03-31 10:41:27 -07:00
Ilya Kreymer	bc7f1badf3	factor out behaviors to browsertrix-behaviors: (#32 ) - inject built 'behaviors.js' from browsertrix-behaviors, init with options and run - remove bgbehaviors - move textextract to root for now - add requirements.txt for python dependencies - remove obsolete --scroll option, to part of the behaviors system logging: - configure logging options via --logging param, can include 'stats' (default), 'pywb', 'behaviors', and 'behaviors-debug' - inject custom logging function for behaviors to call if either behaviors or behaviors-debug is set - 'behaviors-debug' prints all debug messages from behaviors, while regular 'behaviors' prints main behavior messages (useful for verification) dockerfile: add 'rebuild' arg to facilitate rebuilding image from specific step bump to 0.3.0-beta.0	2021-03-13 19:48:31 -05:00
Emma Dickson	9ef3f25416	add logging option (#29 ) * add --pywb-log flag cmdline option which enables the pywb logging to stdout/stderr Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local> Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>	2021-03-04 12:36:58 -08:00
Emma Dickson	fb0f1d8db9	tests text extraction (#30 ) * new tests * add jest to eslint, lint fixes Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>	2021-03-01 16:00:23 -08:00
Emma Dickson	748b0399e9	add text extraction (#28 ) * add text extraction via --text flag * update readme with --text and --generateWACZ flags Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>	2021-02-23 13:52:54 -08:00
Emma Dickson	0688674f6f	case insensitive params (#27 ) * make --generateWacz, --generateCdx case insensitive with alias option * fix eslint config and eslint issues Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2021-02-17 09:37:07 -08:00
Ilya Kreymer	4d6dcbc3d6	bump version, remove extraneous console.log	2021-02-16 20:00:33 -08:00
Emma Dickson	9ef83e4ab4	update default collection name (#26 ) Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>	2021-02-15 20:06:18 -08:00
Emma Dickson	73b1dd77d4	Merge pull request #24 from webrecorder/behavior-refactor background behaviors refactor: (fixes #23)	2021-02-11 11:24:34 -05:00
Ilya Kreymer	8c85ca2749	background behaviors refactor: (fixes #23 ) - move auto-play, auto-fetch and auto-scroll behaviors to behaviors/global/* - bgbehaviors manages these background behaviors - command line --bgbehaviors option specifies which background behaviors to run (defaults to auto-fetch and auto-play)	2021-02-08 22:21:34 -08:00
Emma Dickson	7cfeefd19b	add ci and linting (#21 ) * linting with eslint * ci: validate linting and check basic single-page crawl with wacz creation Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>	2021-02-08 09:45:46 -08:00
Ilya Kreymer	8af5e1487d	waitUntil improvements: (#22 ) - puppeteer 'waitUntil' supports an array of options, support via comma separated list - default to 'waitUntil,load' - should fix #3	2021-02-04 22:42:03 -08:00

68 commits