browsertrix-crawler/CHANGES.md at 63376ab6accee68dc270ab0bdce1c4dcfb4957d4

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-10-19 06:23:16 +00:00

Add --urlFile param to specify text file with a list of URLs to crawl (#38)

* Resolves #12

* Make --url param optional. Only one of --url or --urlFile should be specified.

* Add ignoreScope option to queueUrls() to support adding specific URLs

* add tests for urlFile

* bump version to 0.3.2

Co-authored-by: Emma Dickson <emmadickson@Emmas-MacBook-Pro.local>

2021-05-12 22:57:06 -07:00

CHANGES

v0.3.2

Added a --urlFile option: Allows users to specify a .txt file listing the exact URLs to crawl (one URL per line).
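
As an illustration of the one-URL-per-line format, here is a minimal Python sketch of how such a list could be read. It is not the crawler's own implementation: the function name `parse_url_list` is hypothetical, and skipping blank lines and non-http(s) entries is an assumption made for this example.

```python
from urllib.parse import urlparse


def parse_url_list(text: str) -> list[str]:
    """Return the http(s) URLs found in `text`, one per line, in file order.

    Hypothetical helper for illustration only; the filtering rules are assumptions.
    """
    urls = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            # Assumption: blank lines are ignored.
            continue
        parsed = urlparse(candidate)
        # Assumption: keep only absolute http/https URLs that have a host.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


sample = "https://example.com/\n\nhttp://example.org/page\nnot a url\n"
seed_urls = parse_url_list(sample)
```

Under these assumptions, `seed_urls` holds the two valid URLs from `sample` in their original order, which is the kind of exact seed list the option describes.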

v0.3.1

Improved shutdown wait: Instead of waiting for 5 seconds, wait until all pending requests are written to WARCs
Bug fix: Use async APIs for combining WARCs to avoid spurious issues with multiple crawls
Behaviors: Update to Behaviors 0.2.1, with support for Facebook pages

v0.3.0

WARC Combining: --combineWARC and --rolloverSize flags for generating combined WARCs at the end of the crawl, each WARC up to the specified rolloverSize
Profiles: Support for creating reusable browser profiles, stored as tarballs, and running crawl with a login profile (see README for more info)
Behaviors: Switch to Browsertrix Behaviors v0.1.1 for in-page behaviors
Logging: Customizable logging options via --logging, including behavior log, behavior debug log, pywb log and crawl stats (default)
