## Changelog
All notable changes to this project are documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.2.0).
## [1.6.0] - 2023-11-02
### Changed
- Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
- User-Agent now has a default value (#228)
- Manipulation of spaces with UA suffix and adminEmail has been modified
- Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
- Using browsertrix-crawler 0.12.0
## [1.5.3] - 2023-10-02
### Changed
- Using browsertrix-crawler 0.11.2
## [1.5.2] - 2023-09-19
### Changed
- Using browsertrix-crawler 0.11.1
## [1.5.1] - 2023-09-18
### Changed
- Using browsertrix-crawler 0.11.0
- Scraper stat file is not created empty (#211)
- Crawler statistics are not available anymore (#213)
- Using warc2zim 1.5.4
## [1.5.0] - 2023-08-23
### Added
- `--long-description` param
## [1.4.1] - 2023-08-23
### Changed
- Using browsertrix-crawler 0.10.4
- Using warc2zim 1.5.3
## [1.4.0] - 2023-08-02
### Added
- `--title` to set ZIM title
- `--description` to set ZIM description
- New crawler options: `--maxPageLimit`, `--delay`, `--diskUtilization`
- `--zim-lang` param to set warc2zim's `--lang` (ISO-639-3)
### Changed
- Using browsertrix-crawler 0.10.2
- Default and accepted values for `--waitUntil` from crawler's update
- Using warc2zim 1.5.2
- Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
- `--failOnFailedSeed` used unconditionally
- `--lang` now passed to crawler (ISO-639-1)
### Removed
- `--newContext` from crawler's update
## [1.3.1] - 2023-02-06
### Changed
- Using browsertrix-crawler 0.8.0
- Using warc2zim version 1.5.1 with wabac.js 2.15.2
## [1.3.0] - 2023-02-02
### Added
- Initial URL check normalizes homepage redirects to standard ports 80/443 (#137)
### Changed
- Using warc2zim version 1.5.0 with scope conflict fix and videos fix
- Using browsertrix-crawler 0.8.0-beta.1
- Fixed `--allowHashUrls` being a boolean param
- Increased `check_url` timeout from 10s to 12s to connect and 27s to read
## [1.2.0] - 2022-06-21
### Added
- `--urlFile` browsertrix crawler parameter
- `--depth` browsertrix crawler parameter
- `--extraHops` parameter
- `--collection` browsertrix crawler parameter
- `--allowHashUrls` browsertrix crawler parameter
- `--userAgentSuffix` browsertrix crawler parameter
- `--behaviors` parameter
- `--behaviorTimeout` browsertrix crawler parameter
- `--profile` browsertrix crawler parameter
- `--sizeLimit` browsertrix crawler parameter
- `--timeLimit` browsertrix crawler parameter
- `--healthCheckPort` parameter
- `--overwrite` parameter
### Changed
- using browsertrix-crawler `0.6.0` and warc2zim `1.4.2`
- default WARC location after crawl changed from `collections/capture-*/archive/` to `collections/crawl-*/archive/`
### Removed
- `--scroll` browsertrix crawler parameter (see `--behaviors`)
- `--scope` browsertrix crawler parameter (see `--scopeType`, `--include` and `--exclude`)
## [1.1.5]
- using crawler 0.3.2 and warc2zim 1.3.6
## [1.1.4]
- Defaults to `load,networkidle0` for waitUntil param (same as crawler)
- Allows setting combinations of values for waitUntil param
- Updated warc2zim to 1.3.5
- Updated browsertrix-crawler to 0.3.1
- WARC files to convert to ZIM are now written to `{temp_root_dir}/collections/capture-*/archive/`, where `capture-*` is dynamic and includes the datetime (from browsertrix-crawler)
## [1.1.3]
- allows same first-level-domain redirects
- fixed redirects to URL in scope
- updated crawler to 0.2.0
- `statsFilename` now informs whether limit was hit or not
## [1.1.2]
- added support for `--custom-css`
- added domains block list (default)
## [1.1.1]
- updated browsertrix-crawler to 0.1.4
- autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets
## [1.0]
- initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3