- logger.fatal() also sets crawl status to 'failed' and adds endTime before exiting (see sketch below)
- add 'failOnFailedLimit' to set crawl status to 'failed' if the number of failed pages exceeds the limit; refactored from #393 to use logger.fatal() to end the crawl.
* Store crawler start and end times in Redis lists
* end time tweaks:
- set end time for logger.fatal()
- set missing start time, if not already recorded, in setEndTime()
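A minimal sketch of the fatal-exit path described above, assuming an ioredis-style client; the key names, exit-code wiring, and helpers are illustrative, not the crawler's exact internals:

```js
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL || "redis://localhost:6379");
const crawlId = process.env.CRAWL_ID || "crawl-1";
const logError = (msg, data = {}) => console.error(msg, JSON.stringify(data));

async function fatal(message, data = {}) {
  logError(message, data);
  // mark the crawl failed and record an end time before exiting
  await redis.hset(`${crawlId}:status`, "status", "failed");
  await setEndTime();
  process.exit(17); // fatal interrupt (see the exit-code list further down)
}

async function setEndTime() {
  const now = new Date().toISOString();
  // start/end times live in Redis lists; if no start time was ever recorded
  // (e.g. a fatal error during startup), backfill one here as well
  if ((await redis.llen(`${crawlId}:start`)) === 0) {
    await redis.rpush(`${crawlId}:start`, now);
  }
  await redis.rpush(`${crawlId}:end`, now);
}
```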
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Allow some seeds to be invalid unless failOnFailedSeed is set
Fail crawl if no valid seeds are provided
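A sketch of that seed-validation rule (names illustrative):

```js
// Drop invalid seeds unless failOnFailedSeed is set; an empty
// valid-seed list always fails the crawl.
function validateSeeds(seedUrls, { failOnFailedSeed = false } = {}) {
  const valid = [];
  for (const url of seedUrls) {
    try {
      valid.push(new URL(url).href); // throws on malformed URLs
    } catch (e) {
      if (failOnFailedSeed) {
        throw new Error(`Invalid seed "${url}" and failOnFailedSeed is set`);
      }
      console.warn(`Skipping invalid seed: ${url}`);
    }
  }
  if (valid.length === 0) {
    throw new Error("No valid seeds provided, failing crawl");
  }
  return valid;
}
```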
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- run behaviors: check that the behaviors object exists before trying to run behaviors, to avoid a spurious failure message
- skip behaviors if the frame is no longer attached / has an empty URL
* error handling fixes:
- listen to correct event for page crashes, 'error' instead of 'crash', may fix #371, #351
- more removal of duplicate logging for status-related errors, e.g. if page crashed, don't log worker exception
- detect browser 'disconnected' event, interrupt crawl (but allow post-crawl tasks, such as waiting for pending requests, to run), set browser to null to avoid trying to use it again (see sketch below)
worker:
- bump new page timeout to 20
- if loading page from new domain, always use new page
logger:
- log timestamp first for better sorting
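A sketch of the two event hooks above, using puppeteer's documented 'error' (page crash) and 'disconnected' (browser gone) events; the surrounding crawler state is assumed:

```js
// 'error' fires when a page crashes ('crash' is not a puppeteer event)
page.on("error", (err) => {
  console.error("Page crashed", { url: page.url(), msg: err.message });
  // page-crash status is recorded here, so the worker exception
  // handler can skip logging the same failure a second time
});

// 'disconnected' fires when Chrome exits or the connection is lost
browser.on("disconnected", () => {
  console.warn("Browser disconnected, interrupting crawl");
  crawler.interrupted = true; // post-crawl tasks (pending requests) still run
  crawler.browser = null;     // never reuse a dead browser handle
});
```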
* optimize link extraction: (fixes #376)
- dedup urls in browser first
- don't return entire list of URLs, process one at a time via callback
- add exposeFunction per page in setupPage, then register 'addLink' callback for each page's handler (see sketch after this list)
- optimize addqueue: atomically check if already at max urls and if url already seen in a single redis call
- add QueueState enum to indicate possible states: url added, limit hit, or dupe url
- better logging: log rejected promises for link extraction
- tests: add test for exact page limit being reached
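A sketch of that extraction flow. `page.exposeFunction` and `page.evaluate` are real puppeteer APIs; the QueueState values, Lua script, and Redis key names are assumptions about how the atomic check might look:

```js
// possible queue outcomes, mirrored by the Lua script's return value
const QueueState = { ADDED: 0, LIMIT_HIT: 1, DUPE_URL: 2 };

// one Redis round trip: at max urls? already seen? otherwise queue it
const addToQueueLua = `
if redis.call('scard', KEYS[1]) >= tonumber(ARGV[2]) then return 1 end
if redis.call('sadd', KEYS[1], ARGV[1]) == 0 then return 2 end
redis.call('rpush', KEYS[2], ARGV[1])
return 0
`;

async function queueUrl(url) {
  return await redis.eval(
    addToQueueLua, 2, `${crawlId}:seen`, `${crawlId}:queue`, url, pageLimit
  );
}

async function setupPage(page) {
  // one exposed callback per page; URLs stream out one at a time
  await page.exposeFunction("addLink", (url) => queueUrl(url));
}

async function extractLinks(page) {
  // dedupe inside the browser first, then call back per URL;
  // rejections are logged rather than silently dropped
  await page.evaluate(() => {
    const seen = new Set();
    for (const el of document.querySelectorAll("a[href]")) {
      if (!seen.has(el.href)) {
        seen.add(el.href);
        window.addLink(el.href);
      }
    }
  }).catch((e) => console.warn("Link extraction failed", { msg: e.message }));
}
```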
* behavior logging tweaks, add netIdle
* fix shouldIncludeFrame() check: was actually erroring out and never accepting any iframes!
now used not only for link extraction but also for running behaviors via run()
* add logging if iframe check fails
* Dockerfile: add commented out line to use local behaviors.js
* bump behaviors to 0.5.2
* Add option to output stats file live, i.e. after each page crawled
* Always output stat files after each page crawled (+ test)
* Fix inversion between expected and test value
* additional fixes:
- use distinct exit codes for subsequent interrupt (13) and fatal interrupt (17), collected in the sketch below
- if crawl has been stopped, mark for final exit for post crawl tasks
- stopped takes precedence over interrupted: if both, still exit with 0 (and marked for final exit)
- if no WARCs were found but the crawl was stopped and previous pages were found, don't consider the crawl failed!
- cleanup: remove unused code, rename to gracefulFinishOnInterrupt, separate from graceful finish via crawl stopped
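Collected in one place, the exit-code scheme these entries describe (values from the entries themselves; the enum shape is illustrative):

```js
// exit codes as described above; names are illustrative
const ExitCodes = {
  Success: 0,             // done, or deliberately stopped (stop wins over interrupt)
  Interrupted: 11,        // interrupted by signal or limit before finishing
  SubsequentInterrupt: 13,
  FatalInterrupt: 17,     // logger.fatal()
};
```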
* feat: add docker mount custom behavior example to README
* Add link to behaviors tutorial
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
* Surface lastmod option for sitemap parser
- Add --sitemapFromDate, used along with --useSitemap, to filter the sitemap to URLs modified on or after the
specified ISO date.
The library used to parse sitemaps for URLs added an optional
"lastmod" argument in v3.2.5 that allows filtering URLs returned
by a "last_modified" element present in sitemap XMLs. This
surfaces that argument to the browsertrix-crawler CLI runtime
parameters.
This can be useful for orienting a crawl around a list of seeds
known to contain sitemaps, where only URLs that have been modified
on or after a given date should be included.
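For example (the flag names come from this entry; the rest of the invocation is illustrative): `crawl --url https://example.com/ --useSitemap --sitemapFromDate 2023-06-01` would queue only sitemap URLs whose lastmod is on or after June 1, 2023.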
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- avoid duplicate logging for the same error: if logging a more specific message and rethrowing the exception,
set e.detail to "logged" so the worker exception handler will not log the same error again (see sketch below)
- add option to log timeouts as warnings instead of errors
- remove unneeded async method in browser, get headers directly
- fix logging in screenshots to include page
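A sketch of the de-duplication marker; the `e.detail = "logged"` convention is from the entry above, everything else (helpers, call sites) is illustrative:

```js
const logError = (msg, data = {}) => console.error(msg, JSON.stringify(data));

async function loadPageLogged(page, url) {
  try {
    await page.goto(url); // stand-in for the real page-load call
  } catch (e) {
    // log the specific failure here, then mark the error so the
    // worker exception handler won't log it a second time
    logError("Page load failed", { url, msg: e.message });
    e.detail = "logged";
    throw e;
  }
}

async function runWorkerTask(task) {
  try {
    await task();
  } catch (e) {
    if (e.detail !== "logged") {
      logError("Worker exception", { msg: e.message });
    }
  }
}
```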
* logging: resolve confusion with 'crawl done' not being written to the log, because the log is itself stored in the WACZ: (fixes #365)
- keep log file open until the end, even if it's being written to the WACZ; close before exit
- add logging of 'crawling done' when crawling is done (writing to WACZ or not)
- add debug logging of 'end of log file' to indicate the log file is being added to the WACZ and nothing else will be written to it.
- get favicon from CDP debug page, if available, log warning if not (see sketch below)
- store as favIconUrl in pages.jsonl
- test: add test for favIcon and additional multi-page crawls
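A sketch of the favicon lookup via the browser's DevTools HTTP endpoint, whose target list includes a faviconUrl field; the port, matching logic, and helper name are assumptions (global fetch assumes Node 18+):

```js
async function getFavicon(pageUrl) {
  try {
    const resp = await fetch("http://127.0.0.1:9221/json"); // debug port assumed
    const targets = await resp.json();
    const target = targets.find((t) => t.type === "page" && t.url === pageUrl);
    if (target && target.faviconUrl) {
      return target.faviconUrl; // written to favIconUrl in pages.jsonl
    }
  } catch (e) {
    // fall through to the warning below
  }
  console.warn("Unable to fetch favicon from browser debug endpoint", { pageUrl });
  return null;
}
```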
- if interrupted (via signal or due to limits) and not finished, return error code 11 to indicate interruption
- allow stopping single instances with hset '<crawlid>:stopone' uid (similar to status)
- deliberate stop via redis not considered interruption (exit 0)
- handle browser crash -- if getting new page fails after 5 tries, assume browser crashed and exit
- check that timedRun() returns a non-null value before destructuring the result
- update timedRun() to rethrow any non-timeout exception, instead of just logging 'unknown exception', as it should be handled downstream (see sketch below).
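A sketch of the updated timedRun() contract: timeouts resolve to null (hence the null check before destructuring), everything else is rethrown for downstream handling; the asWarning flag reflects the 'log timeouts as warnings' option above:

```js
const TIMEOUT = Symbol("timeout");

async function timedRun(promise, seconds, message = "Timed out", asWarning = false) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(TIMEOUT), seconds * 1000);
  });
  try {
    // a non-timeout rejection propagates to the caller instead of
    // being swallowed as a generic 'unknown exception' log
    const result = await Promise.race([promise, timeout]);
    if (result === TIMEOUT) {
      (asWarning ? console.warn : console.error)(message, { seconds });
      return null; // callers must null-check before destructuring
    }
    return result;
  } finally {
    clearTimeout(timer);
  }
}
```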
treat non-regexes as strings and pass them to the RegExp constructor
tests: add additional scope parsing tests for different types passed in as exclusions
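The coercion itself can be as small as (sketch; the helper name is illustrative):

```js
// strings go through the RegExp constructor; regexes pass through unchanged
function parseRx(value) {
  return value instanceof RegExp ? value : new RegExp(value);
}

parseRx("example\\.com/skip"); // string from YAML/CLI -> /example\.com\/skip/
parseRx(/\/login/);            // already a RegExp, returned as-is
```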
update yargs
bump to 0.10.4
* support loading custom behaviors from a specified directory via --customBehaviors
* call load() for each behavior incrementally, then call selectMainBehavior() (available in browsertrix-behaviors 0.5.1)
* tests: add tests for multiple custom behaviors
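For example (paths and the rest of the invocation are illustrative): mount a directory of behavior scripts into the container and point the flag at it, e.g. `docker run -v $PWD/behaviors:/custom-behaviors webrecorder/browsertrix-crawler crawl ... --customBehaviors /custom-behaviors/`.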
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* Check size of /crawls by default to fix disk utilization check
* Refactor calculating percentage used and add unit tests
* add tests using df output with disk usage above and below the
threshold
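A sketch of the refactored calculation and the df-based check it is tested against; the function name follows the entry above, while the parsing details are assumptions:

```js
// percentage of disk used, rounded, as exercised by the df-based tests
function calculatePercentageUsed(used, total) {
  return Math.round((used / total) * 100);
}

// df -k /crawls => "Filesystem 1K-blocks Used Available Use% Mounted on"
function diskUsageExceeded(dfOutput, thresholdPct = 90) {
  const [, total, used] = dfOutput.trim().split("\n").pop().split(/\s+/);
  return calculatePercentageUsed(Number(used), Number(total)) >= thresholdPct;
}

diskUsageExceeded(
  "Filesystem 1K-blocks Used Available Use% Mounted on\n" +
  "overlay 1000000 950000 50000 95% /crawls"
); // -> true: above the threshold, crawl should not proceed
```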
---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
* base: update to chrome 112
headless: switch to using the new headless mode available in 112, which is more in sync with headful mode (see sketch below)
viewport: use fixed viewport matching screen dimensions for headless and headful mode (if GEOMETRY is set)
profiles: fix catching new window message, reopening page in current window
versions: bump to pywb 2.7.4, update puppeteer-core to 20.2.1
bump to 0.10.0-beta.4
* profile: force reopen in current window only for headless mode (currently breaks otherwise), remove logging messages
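A sketch of the launch changes in these entries: puppeteer-core 20's 'new' headless mode (Chrome 112+) plus a fixed viewport derived from GEOMETRY; the executable path and defaults are assumptions:

```js
import puppeteer from "puppeteer-core";

const [width, height] = (process.env.GEOMETRY || "1360x1020")
  .split("x").map(Number);

const browser = await puppeteer.launch({
  executablePath: "/usr/bin/google-chrome",       // path assumed
  headless: process.env.HEADLESS ? "new" : false, // Chrome 112's new headless mode
  args: [`--window-size=${width},${height}`],
  // fixed viewport matching screen dimensions, headless and headful alike
  defaultViewport: { width, height },
});
```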