Commit graph

9 commits

Author SHA1 Message Date
Tessa Walsh
2af94ffab5
Support downloading seed file from URL (#852)
Fixes #841 

Crawler work toward long URL lists in Browsertrix. This PR moves seed
handling from the arg parser's validation step to the crawler's
bootstrap step in order to be able to async fetch the seed file from a
URL.

---------

Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-07-03 10:49:37 -04:00
Emma Segal-Grossman
2a49406df7
Add Prettier to the repo, and format all the files! (#428)
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignores files to exclude crawls, test-crawls, scratch, dist as needed.
2023-11-09 16:11:11 -08:00
Tessa Walsh
0192d05f4c Implement improved json-l logging
- Add Logger class with methods for info, error, warn, debug, fatal
- Add context, timestamp, and details fields to log entries
- Log messages as JSON Lines
- Replace puppeteer-cluster stats with custom stats implementation
- Log behaviors by default
- Amend argParser to reflect logging changes
- Capture and log stdout/stderr from awaited child_processes
- Modify tests to use webrecorder.net to avoid timeouts
2023-01-19 14:17:27 -05:00
Tessa Walsh
f35d495103
Add screenshot functionality (#188)
* Add screenshot and thumbnail functionality

Introduces a --screenshot CLI option, which takes a comma-separated
list of screenshot types: view,fullPage,thumbnail.

In addition, this commit:

- Adds '--experimental-global-webcrypto' to ensure webcrypto is
available in node
- Deprecates newContext, instead always using page context for 1 worker
and window context for >1 worker

* Separate screenshotTypes into exported const

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Air.local>
2022-12-21 09:06:13 -08:00
Ilya Kreymer
277314f2de Convert to ESM (#179)
* switch base image to chrome/chromium 105 with node 18.x
* convert all source to esm for node 18.x, remove unneeded node-fetch dependency
* ci: use node 18.x, update to latest actions
* tests: convert to esm, run with --experimental-vm-modules
* tests: set higher default timeout (90s) for all tests
* tests: rename driver test fixture to .mjs for loading in jest
* bump to 0.8.0
2022-11-15 18:30:27 -08:00
Ilya Kreymer
201eab4ad1
Support Extra Hops beyond current scope with --extraHops option (#98)
* extra hops depth: add support for --extraHops option, which expands the inclusion scope to go N 'extra hops' beyond the existing scope. fixes most common use case in #83

* update README with info on `extraHops`, add tests for extraHops

* dependency fix: use pywb 2.6.3, warcio 1.5.0

* bump to 0.5.0-beta.2
2022-01-15 09:03:09 -08:00
Ilya Kreymer
0e0b85d7c3
Customizable extract selectors + typo fix (0.4.2) (#72)
* fix typo in setting crawler.capturePrefix which caused directFetchCapture() to fail, causing non-HTML urls to fail.
- wrap directFetchCapture() to retry browser loading in case of failure

* custom link extraction improvements (improvements for #25) 
- extractLinks() returns a list of link URLs to allow for more flexibility in custom driver
- rename queueUrls() to queueInScopeUrls() to indicate the filtering is performed
- loadPage accepts a list of select opts {selector, extract, isAttribute} and defaults to {"a[href]", "href", false}
- tests: add test for custom driver which uses custom selector

* tests
- tests: all tests uses 'test-crawls' instead of crawls
- consolidation: combine initial crawl + rollover, combine warc, text tests into basic_crawl.test.js
- add custom driver test and fixture to test custom link extraction

* add to CHANGES, bump to 0.4.2
2021-07-23 18:31:43 -07:00
Ilya Kreymer
e7d3767efb
Add scopeType options + option to crawl hashtags + simplify defaultDriver.js (#51)
* support hashtag for page-scoped crawls:
- allow hashtags for current page, automatically set scope to current w/ different hashtags
- also allow hashtags for URLs specified via urlFile
- driver: simplify driver, move default driver function to loadPage()
- bump version to 0.4.0-beta.0

* add --allowHash option to allow hashtags in URLs, enabled for --spaMode but can be set for crawling as well

* graceful shutdown: ensure redis and pywb processes shutdown on exit (for use with singularity, outside of docker)

* replace spaMode with more generic --scopeType, a shortcut to setting the scope via regex.
scopeType options include:
prefix - scope is prefix of current page (default)
page - scope is current page + hashtags (spa mode)
domain - scope is domain/origin of current page
any - scope is any url (default for urlFile)

- bump version to 0.4.0-beta.1
2021-05-21 15:37:02 -07:00
Emma Dickson
63376ab6ac
Add --urlFile param to specify text file with a list of URLs to crawl (#38)
* Resolves #12

* Make --url param optional. Only one of --url of --urlFile should be specified.

* Add ignoreScope option queueUrls() to support adding specific URLs

* add tests for urlFile

* bump version to 0.3.2

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>
2021-05-12 22:57:06 -07:00