Commit graph

4 commits

Author SHA1 Message Date
Ilya Kreymer
30646ca7ba
Add downloads dir to cache external dependency within the crawl (#921)
Fixes #920 
- Downloads profile, custom behavior, and seed list to `/downloads`
directory in the crawl
- Seed File: Downloaded into downloads. Never refetched if already
exists on subsequent crawl restarts.
- Custom Behaviors: Git: Downloaded into dir, then moved to
/downloads/behaviors/<dir name>. if already exist, failure to downloaded
will reuse existing directory
- Custom Behaviors: File: Downloaded into temp file, then moved to
/downloads/behaviors/<name.js>. if already exists, failure to download
will reuse existing file.
- Profile: using `/profile` directory to contain the browser profile
- Profile: downloaded to temp file, then placed into
/downloads/profile.tar.gz. If failed to download, but already exists,
existing /profile directory is used
- Also fixes #897
2025-11-26 19:30:27 -08:00
Ilya Kreymer
5c00bca2b4
tests: use old.webrecorder.net for testing (#710)
replace webrecorder.net -> old.webrecorder.net to fix tests relying on
old website for now
2024-10-31 13:24:58 -04:00
Ilya Kreymer
9c9643c24f
crawler args typing (#680)
- Refactors args parsing so that `Crawler.params` is properly timed with
CLI options + additions with `CrawlerArgs` type.
- also adds typing to create-login-profile CLI options
- validation still done w/o typing due to yargs limitations
- tests: exclude slow page from tests for faster test runs
2024-09-05 18:10:27 -07:00
Ilya Kreymer
b83d1c58da
add --dryRun flag and mode (#594)
- if set, runs the crawl but doesn't store any archive data (WARCS,
WACZ, CDXJ) while logs and pages are still written, and saved state can be
generated (per the --saveState options).
- adds test to ensure only 'logs' and 'pages' dirs are generated with --dryRun
- screenshot, text extraction are skipped altogether in dryRun mode,
warning is printed that storage and archiving-related options may be
ignored
- fixes #593
2024-06-07 10:34:19 -07:00