Stowage/browsertrix-crawler - Remotebranch.eu

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 14:33:17 +00:00

Author	SHA1	Message	Date
benoit74	fc56c2cf76	Add more exit codes to detect interruption reason (#764). Fixes #584. - Replace interrupted with interruptReason - Distinct exit codes for different interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), DiskUtilization (16) are used when an interrupt happens for these reasons, in addition to existing reasons BrowserCrashed (10), SignalInterrupted (11) and SignalInterruptedForce (13) - Doc fix to cli args. Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com> Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>	2025-02-10 14:00:55 -08:00
Ilya Kreymer	5c00bca2b4	tests: use old.webrecorder.net for testing (#710) - replace webrecorder.net -> old.webrecorder.net to fix tests relying on the old website for now	2024-10-31 13:24:58 -04:00
Ilya Kreymer	9c9643c24f	crawler args typing (#680) - Refactors args parsing so that `Crawler.params` is properly typed with CLI options + additions with `CrawlerArgs` type. - also adds typing to create-login-profile CLI options - validation still done w/o typing due to yargs limitations - tests: exclude slow page from tests for faster test runs	2024-09-05 18:10:27 -07:00
Tessa Walsh	1fcd3b7d6b	Fix failOnFailedLimit and add tests (#580). Fixes #575. - Adds a missing await to fetching the number of failed pages from Redis - Fixes a typo in the fatal logging message - Adds a test to ensure that the crawl fails with exit code 17 if --failOnInvalidStatus and --failOnFailedLimit 1 are set with a url that will 404	2024-05-21 16:35:43 -07:00
Emma Segal-Grossman	2a49406df7	Add Prettier to the repo, and format all the files! (#428) This adds prettier to the repo, and sets up the pre-commit hook to auto-format as well as lint. Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.	2023-11-09 16:11:11 -08:00
Ilya Kreymer	e5b0c4ec1b	optimize link extraction (fixes #376) (#380) - dedup urls in browser first - don't return entire list of URLs, process one-at-a-time via callback - add exposeFunction per page in setupPage, then register 'addLink' callback for each page's handler - optimize addqueue: atomically check if already at max urls and if url already seen in one redis call - add QueueState enum to indicate possible states: url added, limit hit, or dupe url - better logging: log rejected promises for link extraction - tests: add test for exact page limit being reached	2023-09-15 10:12:08 -07:00
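
The interrupt exit codes listed in commit fc56c2cf76 can be summarized as a lookup table. The following is a minimal sketch, not the crawler's actual source: the numeric values and reason names come from that commit message, while the names `INTERRUPT_EXIT_CODES`, `InterruptReason` and `describeExitCode` are hypothetical and chosen only for illustration.

```typescript
// Exit codes as listed in the message of commit fc56c2cf76.
// The constant, type, and function names here are hypothetical.
const INTERRUPT_EXIT_CODES = {
  BrowserCrashed: 10,
  SignalInterrupted: 11,
  FailedLimit: 12,
  SignalInterruptedForce: 13,
  SizeLimit: 14,
  TimeLimit: 15,
  DiskUtilization: 16,
} as const;

type InterruptReason = keyof typeof INTERRUPT_EXIT_CODES;

// Map a numeric process exit code back to the interrupt reason,
// or return undefined when the code is not one of the listed values.
function describeExitCode(code: number): InterruptReason | undefined {
  const reasons = Object.keys(INTERRUPT_EXIT_CODES) as InterruptReason[];
  for (const reason of reasons) {
    if (INTERRUPT_EXIT_CODES[reason] === code) {
      return reason;
    }
  }
  return undefined;
}
```

A wrapper script that launches the crawler could pass the child process's exit code to a helper like this to decide, for example, whether a crawl stopped on a size limit or on a signal. Codes outside this list, such as the exit code 17 mentioned in commit 1fcd3b7d6b, are deliberately left unmapped here.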