Fix#584
- Replace interrupted with interruptReason
- Distinct exit codes for different interrupt reasons: SizeLimit (14), TimeLimit (15), FailedLimit (12), DiskUtilization (16)
are used when an interrupt happens for these reasons, in addition to existing reasons BrowserCrashed (10),
SignalInterrupted (11) and SignalInterruptedForce (13)
- Doc fix to cli args
---------
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
- Refactors args parsing so that `Crawler.params` is properly timed with
CLI options + additions with `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
Fixes#575
- Adds a missing await to fetching the number of failed pages from Redis
- Fixes a typo in the fatal logging message
- Adds a test to ensure that the crawl fails with exit code 17 if
--failOnInvalidStatus and --failOnFailedLimit 1 are set with a url that
will 404
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
* optimize link extraction: (fixes#376)
- dedup urls in browser first
- don't return entire list of URLs, process one-at-a-time via callback
- add exposeFunction per page in setupPage, then register 'addLink' callback for each pages' handler
- optimize addqueue: atomically check if already at max urls and if url already seen in one redis call
- add QueueState enum to indicate possible states: url added, limit hit, or dupe url
- better logging: log rejected promises for link extraction
- tests: add test for exact page limit being reached