Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Author	SHA1	Message	Date
Tessa Walsh	034657dbb6	Use brave-1.46.144 base image	2023-01-09 15:02:46 -05:00
Tessa Walsh	b8bed40e14	Add working ad block disabled Brave profile	2023-01-09 15:02:40 -05:00
Tessa Walsh	c078ce7fb9	Modify BROWSER_BIN	2022-12-13 11:43:22 -05:00
Tessa Walsh	59e41b04c2	Set Brave default profile in argparser	2022-12-12 17:22:50 -05:00
Tessa Walsh	9d3af6f80f	WIP: Add default Brave profile. Currently requires locally built Brave base image named: webrecorder/browsertrix-browser-base:brave-test-latest. brave-ad-blocking-disabled-profile.tar.gz may not be working quite correctly and may need to be replaced, as it wasn't possible to modify the selects in brave://settings via create-login-profile's interactive mode quite yet	2022-12-12 17:21:58 -05:00
Ilya Kreymer	2a1e0edf3c	version: set version correctly to 0.8.0-beta.0	2022-11-15 18:30:27 -08:00
Ilya Kreymer	cacf5da5a1	esm conversion: finish esm conversion for create-login-profile.js	2022-11-15 18:30:27 -08:00
Tessa Walsh	e02058f001	Add ad blocking via request interception (#173 ) * ad blocking via request interception, extending block rules system, adding new AdBlockRules * Load list of hosts to block from https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts added as json on image build * Enabled via --blockAds and setting a custom message via --adBlockMessage * new test to check for ad blocking * Add test-crawls dir to .gitignore and .dockerignore	2022-11-15 18:30:27 -08:00
Ilya Kreymer	277314f2de	Convert to ESM (#179 ) * switch base image to chrome/chromium 105 with node 18.x * convert all source to esm for node 18.x, remove unneeded node-fetch dependency * ci: use node 18.x, update to latest actions * tests: convert to esm, run with --experimental-vm-modules * tests: set higher default timeout (90s) for all tests * tests: rename driver test fixture to .mjs for loading in jest * bump to 0.8.0	2022-11-15 18:30:27 -08:00
Tim	5b738bd24e	Fix incorrect `combineWARCs` property in README.md (#180) This stumped me for a little while. The actual property isn't plural.	2022-11-14 22:17:44 -08:00
Ed Summers	cd17764b77	Check if group/user exists (#176) Ensure that group and user do not already exist before creating them. Fixes #174	2022-11-03 17:28:13 -07:00
Ilya Kreymer	ffa3174578	Fix for warcio.js (#178 ) * dependency fix: set warcio to 1.5.1 until we update to esm support bump test timeout fixes #175 bump to 0.7.1	2022-10-24 08:20:01 +02:00
Ilya Kreymer	1213694dde	bump to 0.7.0 for release!	2022-10-11 16:14:53 -07:00
Ilya Kreymer	be3b6b85fa	README: update default behaviors in README, fixes #169	2022-10-11 15:33:32 -07:00
Ed Summers	3ba64535a5	Run in Docker as User (#171 ) * Run in Docker as User This follows a similar pattern to pywb to run as the user that owns the crawls directory. bump version to 0.7.0-beta.6 Closes #170	2022-09-28 12:49:52 -07:00
Ilya Kreymer	65933c6b12	Interrupt Handling Fixes (#167) * interrupts: simplify interrupt behavior: - SIGTERM/SIGINT behave same way, trigger a graceful shutdown after page load improvements of remote state / parallel crawlers (for browsertrix-cloud): - SIGUSR1 before SIGINT/SIGTERM ensures data is saved, mark crawler as done - for use with graceful stopping crawl - SIGUSR2 before SIGINT/SIGTERM ensures data is saved, does not mark crawler as done - for use with scaling down a single crawler * scope check: check scope of URL retrieved from queue (in case scoping rules changed), urls matching seed automatically in scope!	2022-09-20 17:09:52 -07:00
Ilya Kreymer	fd1737962b	dependencies: update to browsertrix-behaviors 0.3.4, fixes autofetch loading of lazy load images (fixes #165 ) bump to 0.7.0-beta.5	2022-09-15 23:13:31 -07:00
Ilya Kreymer	314ee3f730	Default Wait-Time Improvements (#162 ) - netIdleWait better defaults: if not set, set to 15 seconds for page/page-spa scope, otherwise to 2 seconds - default behaviors: include autoscroll in default behavior as well - restart: if crawl already done, don't attempt to crawl further. if 'waitOnDone' set, wait for signal before exiting. - bump to puppeteer-core 17.1.2 - bump to 0.7.0-beta.4	2022-09-08 23:39:26 -07:00
Ilya Kreymer	5c931275ed	pending wait: set max pending request wait to 120 seconds	2022-09-02 17:53:04 -07:00
Ilya Kreymer	a52ee5ed1f	dependencies: update to pywb>=2.6.8, browsertrix-behaviors>=0.3.3	2022-09-02 17:45:16 -07:00
Ilya Kreymer	e22d95e2f0	Logging and browser improvements: (#158 ) * logging: add 'jserrors' option to --logging to print JS errors * browser config: use flags from playwright * browser: use socat to allow connecting via devtools via crawling on port 9222	2022-08-21 00:30:25 -07:00
Ilya Kreymer	6cc38bf511	Page-reuse concurrency + Browser Repair + Screencaster Cleanup Improvements (#157 ) * new window: use cdp instead of window.open * new window tweaks: add reuseCount, use browser.target() instead of opening a new blank page * rename NewWindowPage -> ReuseWindowConcurrency, move to windowconcur.js potential fix for #156 * browser repair: - when using window-concurrency, attempt to repair / relaunch browser if cdp errors occur - mark pages as failed and don't reuse if page error or cdp errors occur - screencaster: clear previous targets if screencasting when repairing browser * bump version to 0.7.0-beta.3	2022-08-19 09:23:40 -07:00
Ilya Kreymer	827c153679	fix for latest puppeteer: page._client -> page._client()	2022-08-17 21:40:10 -07:00
Ilya Kreymer	c5d208024a	Wait Default + Logging Improvements (#153 ) improved logging of pywb + redis: - if 'logging' includes 'pywb', log pywb and redis output, to pywb.log and redis.log - otherwise, just ignore (don't print to stdout as that's too confusing) - print if wb-manager fails, likely due to existing collection waitUntil: default to just 'load' to avoid potential infinite loop, separate --netIdle can configure idle wait dependency: update to latest puppeteer-core (16.1.0)	2022-08-11 18:44:39 -07:00
raffaele messuti	a527cc9b36	Update README.md (#147 ) fix link to puppeteer waitUntil	2022-08-11 18:28:54 -07:00
Ilya Kreymer	e3b8b5ba21	Add --netIdleWait, bump dependencies (0.7.0-beta.2) (#145 ) - add --netIdleWait option, default to 10 seconds - necessary for some sites that start fetching immediately after page load - add openssl.conf to allow pywb to avoid 'unsafe legacy renegotiation disabled' from openssl - update to browsertrix-behaviors 0.3.2 - update current url for screencasting of page before page load starts bump to 0.7.0-beta.2	2022-07-08 17:17:46 -07:00
Ilya Kreymer	bd10f1ad8c	bump to 0.7.0-beta.1	2022-07-03 11:11:11 -07:00
Ilya Kreymer	82c771f7cd	ci: possible fix for ci release build (issues building uwsgi)	2022-07-03 11:09:06 -07:00
Ilya Kreymer	0a309af740	Update to Chrome/Chromium 101 - (0.7.0 Beta 0) (#144 ) * update base image - switch to browsertrix-base-image:101 with chrome/chromium 101, - includes additional fonts and ubuntu 22.04 as base. - add --disable-site-isolation-trials as default flag to support behaviors accessing iframes * debugging support for shared redis state: - support pausing crawler indefinitely if crawl state is set to 'debug' - must be set/unset manually via external redis - designed for browsertrix-cloud for now bump to 0.7.0-beta.0	2022-06-30 19:24:26 -07:00
Ilya Kreymer	cf90304fa7	0.6.0 Wait State + Screencasting Fixes (#141) * new options: - to support browsertrix-cloud, add a --waitOnDone option, which has browsertrix crawler wait when finished - when running with redis shared state, set the `<crawl id>:status` field to `running`, `failing`, `failed` or `done` to let job controller know crawl is finished. - set redis state to `failing` in case of exception, set to `failed` in case of >3 or more failed exits within 60 seconds (todo: make customizable) - when receiving a SIGUSR1, assume final shutdown and finalize files (eg. save WACZ) before exiting. - also write WACZ if exiting due to size limit exceed, but not due to other interruptions - change sleep() to be in seconds * misc fixes: - crawlstate.finished() -> isFinished() - return if >0 pages and none left in queue - don't fail crawl if isFinished() is true - don't keep looping in pending wait for urls to finish if received abort request * screencast improvements (fix related to webrecorder/browsertrix-cloud#233) - more optimized screencasting, don't close and restart after every page. - don't assume targets change after every page, they don't in window mode! - only send 'close' message when target is actually closed * bump to 0.6.0	2022-06-17 11:58:44 -07:00
Ilya Kreymer	e7eb6a6620	create profile: fix typo in cookie settings, multiply by seconds in day uwsgi: set number of workers to be 2x cpus by default	2022-06-01 09:11:11 -07:00
Ilya Kreymer	70ba9241ca	limit interrupt fix: after self-interrupting, only look at local pending list (for redis state) logging: don't log CF check errors, do log when errorCount is reset	2022-05-19 06:25:46 +00:00
Ilya Kreymer	6ec47cdd14	profile creation: when creating a profile, force all cookies to have a duration to avoid expiring session cookies (#139 ) - save cookies on page load and also before profile creation - default cookie duration is 7 days, configurable via --cookieDays option	2022-05-18 23:23:32 -07:00
Ilya Kreymer	93b6dad7b9	Health Check + Size Limits + Profile fixes (#138) - Add optional health check via `--healthCheckPort`. If set, runs a server on designated port that returns 200 if healthcheck succeeds (num of consecutive failed page loads < 2*num workers), or 503 if fails. Useful for k8s health check - Add crawl size limit (in bytes), via `--sizeLimit`. Crawl exits (and state optionally saved) when size limit is exceeded. - Add crawl total time limit (in seconds), via `--timeLimit`. Crawl exits (and state optionally saved) when total running time is exceeded. - Add option to overwrite existing collection. If `--overwrite` is included, any existing data for specified collection is deleted. - S3 Storage refactor, simplify, don't add additional paths by default. - Add interpolateFilename as generic utility, supported in filename and STORE_PATH env value. - wacz save: reenable wacz validation after save. - Profiles: support /navigate endpoint, return origins from /ping, prevent opening new tabs. - bump to 0.6.0-beta.1	2022-05-18 22:51:55 -07:00
Ilya Kreymer	500ed1f9a1	Profile Creation Improvements (#136 ) * interactive profile api improvements: - refactor profile creation into separate class - if profile starts with '@', load as relative path using current s3 storage - support uploading profiles to s3 - profile api: support filename passed to /createProfieJS as part of json POST - profile api: support /ping to keep profile browser running, --shutdownWait to add autoshutdown timeout (extendable via ping) - profile api: add /target to retrieve target and /navigate to navigate by url. * bump to 0.6.0-beta.0	2022-05-05 14:27:17 -05:00
Ilya Kreymer	5dfbfbeaf6	update dependencies: (#134) - update pywb to 2.6.7, fix possible cdx indexing error via --generateCDX - update wacz to 0.4.6, ensure wacz file is closed and better and more error-resilient text extraction - update browsertrix-behaviors to 0.3.0, support for telegram behavior - bump version to 0.5.1	2022-04-15 16:22:47 -07:00
Ilya Kreymer	9b938304ce	dependencies: update to pywb>=2.6.6, wacz>=0.4.5	2022-04-11 15:09:59 -07:00
Ilya Kreymer	cc391146c4	package: set minio version to fixed (7.0.26)	2022-04-09 22:07:17 -07:00
Ilya Kreymer	bfd72835d1	update CHANGES for 0.5.0 release	2022-04-09 21:59:44 -07:00
Ilya Kreymer	7ed5586bdb	scopeType improvement: when setting scopeType domain on a URL with "www.", automatically drop the www. for simplicity	2022-03-22 17:43:13 -07:00
Ilya Kreymer	5afd19f43d	Non-HTML Page Load Optimization (#130 ) * non-html page load improvements: fix for #129 - don't include cookie check in eliminating direct fetch, may be too speculative - as suggested in #129, when loading non-html, only wait for dom load and don't run behaviors - don't do text extraction for non-HTML pages (will need to handle pdf separately) bump to 0.5.0-beta.8	2022-03-22 17:41:51 -07:00
Ilya Kreymer	09082e8abb	dependencies: set wacz>=0.4.4	2022-03-18 10:38:34 -07:00
Ilya Kreymer	8727ca7f8c	redis state error handling: catch and log potential errors with reading json state for next url bump version to 0.5.0-beta.7	2022-03-18 10:34:17 -07:00
Ilya Kreymer	5e5efda437	Profile Creation Fix + Cloudflare Wait Support + UserAgent Fix (#128) * cloudflare wait improvements (#110 fix) - set navigator.webdriver to false to help with cloudflare wait - add checkCF() that will detect cloudflare ddos page and wait 5 seconds until original page is loaded * chrome args refactor: - move to utils/browser - add LazyFrameLoading disable to fix occasional issues with page.goto() never finishing - add userAgent option * profile creation improvements: - fix loadProfile() missing await - fix url to support running remotely - load shared chromeArgs() - add --proxy to support profile creation through pywb proxy * fix setting custom userAgent (#90) - fix typo that resulted in error - ensure userAgent is applied separate from emulatedDevice - add getDefaultUA() browser util	2022-03-18 10:32:59 -07:00
Ilya Kreymer	dedf1cc0ad	typo fix: add await to loadProfile in create-login-profile.js	2022-03-15 02:40:06 +00:00
Ilya Kreymer	12d96f22c6	Profile download support (#126 ) * profiles: support loading profiles via a URL. * add 'request' dependency * README: mention profile URLs	2022-03-14 14:44:24 -07:00
Ilya Kreymer	1fae21b0cf	Better check to see if ERR_ABORTED should be ignored. (#127) * error abort check: Fix possible regression with req.failure() returning null, also move to separate function, wrap in exception handler * bump version to 0.5.0-beta.6	2022-03-14 14:41:39 -07:00
Ilya Kreymer	ab096cd5b0	Improve to URL direct check and fetch (#125 ) - direct check fix: only do direct check if HEAD returns 200 status code - if direct load results in non-200 status code, still load in browser - error reporting: detect if net:ERR_ABORTED is actually caused by loading of PDF / other binary that is downloaded, and not an actual page load error - state: tweak error logging message	2022-03-14 11:11:53 -07:00
Ilya Kreymer	81e8fa6da7	Incremental save state (#124 ) * save state: if --saveState set to always, incrementally save state every --saveStateInterval seconds, and keep last --saveStateHistory number of save states in the /crawls directory - defaults to saving every 5 mins and keeping the last 5 save states display save state status on startup page write fixes: add missing await fix for #113 * update README	2022-03-14 10:41:56 -07:00
phiresky	fb297574c7	add documentation of env variables for socks proxy + browser extensions (#120 )	2022-03-13 15:00:46 -07:00


134 commits
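
The health check rule described in commit 93b6dad7b9 (return 200 while the number of consecutive failed page loads stays below twice the number of workers, 503 otherwise) can be sketched as a small function. This is a minimal illustration under stated assumptions, not the crawler's actual code: the names `healthStatus`, `consecutiveFailedLoads`, and `numWorkers` are hypothetical and do not come from the repository.

```javascript
// Hypothetical helper: maps crawl health to an HTTP status code.
// Only the threshold rule (failures < 2 * workers) comes from the commit message;
// the function and parameter names are assumptions for illustration.
function healthStatus(consecutiveFailedLoads, numWorkers) {
  // Healthy while consecutive failures stay strictly below 2 * workers.
  const healthy = consecutiveFailedLoads < 2 * numWorkers;
  return healthy ? 200 : 503;
}

// Example: with 4 workers, the threshold is 8 consecutive failures.
console.log(healthStatus(7, 4));
console.log(healthStatus(8, 4));
```

A server listening on the port given by `--healthCheckPort` could answer each request with the status this function returns, which is the shape a Kubernetes liveness probe expects.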