mirror of
https://github.com/openzim/zimit.git
synced 2025-12-31 04:23:15 +00:00
## Changelog

All notable changes to this project are documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.2.0).

## [Unreleased]

### Added

- Added `--overwrite` flag to overwrite existing ZIM file if it exists (#399)

### Changed

- Fix issues preventing interrupted crawls from being resumed. (#499)
- Ensure build directory is used explicitly instead of a randomized subdirectory when passed, and pre-create it if it does not exist.
- Use all warc_dirs found instead of just the latest so interrupted crawls use all collected pages across runs when an explicit collections directory is not passed.
- Don't clean up an explicitly passed build directory.

## [3.0.5] - 2025-04-11

### Changed

- Upgrade to browsertrix crawler 1.6.0 (#493)

## [3.0.4] - 2025-04-04

### Changed

- Upgrade to browsertrix crawler 1.5.10 (#491)

## [3.0.3] - 2025-02-28

### Changed

- Upgrade to browsertrix crawler 1.5.7 (#483)

## [3.0.2] - 2025-02-27

### Changed

- Upgrade to browsertrix crawler 1.5.6 (#482)

## [3.0.1] - 2025-02-24

### Changed

- Upgrade to browsertrix crawler 1.5.4 (#476)

## [3.0.0] - 2025-02-17

### Changed

- Change solution to report partial ZIM to the Zimfarm and other clients (#304)
- Keep temporary folder when crawler or warc2zim fails, even if not asked for (#468)
- Add many missing Browsertrix Crawler arguments; drop default overrides by zimit; drop `--noMobileDevice` setting (not needed anymore) (#433)
- Document all Browsertrix Crawler default argument values (#416)
- Use preferred Browsertrix Crawler argument names: (part of #471)
  - `--seeds` instead of `--url`
  - `--seedFile` instead of `--urlFile`
  - `--pageLimit` instead of `--limit`
  - `--pageLoadTimeout` instead of `--timeout`
  - `--scopeIncludeRx` instead of `--include`
  - `--scopeExcludeRx` instead of `--exclude`
  - `--pageExtraDelay` instead of `--delay`
- Remove confusion between zimit, warc2zim and crawler stats filenames (part of #471)
  - `--statsFilename` is now the crawler stats file (it keeps the crawler's own name, like the other arguments)
  - `--zimit-progress-file` is now the zimit stats location
  - `--warc2zim-progress-file` is the warc2zim stats location
  - all are optional values; if not set and needed, temporary files are used

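For scripts that still use the pre-3.0.0 flag names, the renames listed above can be applied mechanically. The sketch below is only an illustration: `RENAMED_FLAGS` and `migrate_args` are hypothetical names, not part of zimit, and the code assumes that each value carries over unchanged under the new flag name, which this changelog does not confirm.

```python
# Old-to-new flag names, copied from the 3.0.0 rename list in this changelog.
RENAMED_FLAGS = {
    "--url": "--seeds",
    "--urlFile": "--seedFile",
    "--limit": "--pageLimit",
    "--timeout": "--pageLoadTimeout",
    "--include": "--scopeIncludeRx",
    "--exclude": "--scopeExcludeRx",
    "--delay": "--pageExtraDelay",
}


def migrate_args(argv: list[str]) -> list[str]:
    """Return argv with pre-3.0.0 flag names replaced by their 3.0.0 names.

    Hypothetical helper: it only renames flags and never touches values.
    """
    migrated = []
    for arg in argv:
        # Split off an inline value so both "--flag value" and "--flag=value" work.
        name, sep, value = arg.partition("=")
        migrated.append(RENAMED_FLAGS.get(name, name) + sep + value)
    return migrated
```

For example, `migrate_args(["--url", "https://example.com", "--limit=10"])` yields `["--seeds", "https://example.com", "--pageLimit=10"]`. Arguments that are not in the table, including URL values that contain `=`, pass through untouched.
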
### Fixed

- Do not create the ZIM when crawl is incomplete (#444)

## [2.1.8] - 2025-02-07

### Changed

- Upgrade to browsertrix crawler 1.5.1, Python 3.13 and others (#462 + #464)

## [2.1.7] - 2025-01-10

### Changed

- Upgrade to browsertrix crawler 1.4.2 (#450)
- Upgrade to warc2zim 2.2.0

## [2.1.6] - 2024-11-07

### Changed

- Upgrade to browsertrix crawler 1.3.5 (#426)

## [2.1.5] - 2024-11-01

### Changed

- Upgrade to browsertrix crawler 1.3.4 and warc2zim 2.1.3 (#424)

## [2.1.4] - 2024-10-11

### Changed

- Upgrade to browsertrix crawler 1.3.3 (#411)

## [2.1.3] - 2024-10-08

### Changed

- Upgrade to browsertrix crawler 1.3.2, warc2zim 2.1.2 and other dependencies (#406)

### Fixed

- Fix help (#393)

## [2.1.2] - 2024-09-09

### Changed

- Upgrade to browsertrix crawler 1.3.0-beta.1 (#387) (fixes "Ziming a website with huge assets (e.g. PDFs) is failing to proceed" - #380)

## [2.1.1] - 2024-09-05

### Added

- Add support for uncompressed tar archive in `--warcs` (#369)

### Changed

- Upgrade to browsertrix crawler 1.3.0-beta.0 (#379), including upgrade to Ubuntu Noble (#307)

### Fixed

- Stream file downloads to avoid exhausting memory (#373)
- Fix documentation on `--diskUtilization` setting (#375)

## [2.1.0] - 2024-08-09

### Added

- Add `--custom-behaviors` argument to support path/HTTP(S) URL custom behaviors to pass to the crawler (#313)
- Add daily automated end-to-end tests of a page with YouTube player (#330)
- Add `--warcs` option to directly process WARC files (#301)

### Changed

- Make it clear that `--profile` argument can be an HTTP(S) URL (and not only a path) (#288)
- Fix README imprecisions + add back warc2zim availability in docker image (#314)
- Enhance integration test to assert final content of the ZIM (#287)
- Stop fetching and passing browsertrix crawler version as scraperSuffix to warc2zim (#354)
- Do not log number of WARC files found (#357)
- Upgrade dependencies (warc2zim 2.1.0)

### Fixed

- Sort WARC directories found by modification time (#366)

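As an illustration of the #366 fix, WARC directories can be collected and ordered by modification time as sketched below. This is not zimit's actual code: `warc_dirs_oldest_first` is a hypothetical helper, the `collections/crawl-*/archive` pattern is borrowed from the 1.2.0 entry further down, and the oldest-first direction is an assumption, since the changelog only says the directories are sorted.

```python
from pathlib import Path


def warc_dirs_oldest_first(
    root: Path, pattern: str = "collections/crawl-*/archive"
) -> list[Path]:
    """List WARC directories under root, oldest modification time first.

    Hypothetical helper; the default pattern and the sort direction are
    assumptions made for this example.
    """
    # Keep only directories; a stray file matching the pattern is ignored.
    dirs = [path for path in root.glob(pattern) if path.is_dir()]
    # st_mtime is the last-modification timestamp reported by the filesystem.
    return sorted(dirs, key=lambda path: path.stat().st_mtime)
```

Sorting on `st_mtime` rather than on the directory name avoids depending on how the `crawl-*` names are generated.
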
## [2.0.6] - 2024-08-02

### Changed

- Upgraded Browsertrix Crawler to 1.2.6

## [2.0.5] - 2024-07-24

### Changed

- Upgraded Browsertrix Crawler to 1.2.5
- Upgraded warc2zim to 2.0.3

## [2.0.4] - 2024-07-15

### Changed

- Upgraded Browsertrix Crawler to 1.2.4 (fixes "retrieve automatically the assets present in a data-xxx tag" - #316)

## [2.0.3] - 2024-06-24

### Changed

- Upgraded Browsertrix Crawler to 1.2.0 (fixes YouTube videos issue #323)

## [2.0.2] - 2024-06-18

### Changed

- Upgrade dependencies (mainly warc2zim 2.0.2)

## [2.0.1] - 2024-06-13

### Changed

- Upgrade dependencies (especially warc2zim 2.0.1 and browsertrix crawler 1.2.0-beta.0) (#318)

### Fixed

- Crawler is not correctly checking disk size / usage (#305)

## [2.0.0] - 2024-06-04

### Added

- New `--version` flag to display Zimit version (#234)
- New `--logging` flag to adjust Browsertrix Crawler logging (#273)
- Use new `--scraper-suffix` flag of warc2zim to enhance ZIM "Scraper" metadata (#275)
- New `--noMobileDevice` CLI argument
- Publish Docker image for `linux/arm64` (in addition to `linux/amd64`) (#178)

### Changed

- **Use `warc2zim` version 2**, which no longer requires a Service Worker (#193)
- Upgraded Browsertrix Crawler to 1.1.3
- Adopt Python bootstrap conventions
- Upgrade to Python 3.12 + upgrade dependencies
- Removed handling of redirects by zimit; they are handled by browsertrix crawler and detected properly by warc2zim (#284)
- Drop initial check of URL in Python (#256)
- `--userAgent` CLI argument once again overrides the `--userAgentSuffix` and `--adminEmail` values
- `--userAgent` CLI argument is not mandatory anymore

### Fixed

- Fix support for YouTube videos (#291)
- Fix crawler `--waitUntil` values (#289)

## [1.6.3] - 2024-01-18

### Changed

- Adapt to new `warc2zim` code structure
- Using browsertrix-crawler 0.12.4
- Using warc2zim 1.5.5

### Added

- New `--build` parameter (optional) to specify the directory holding Browsertrix files; if not set, the `--output`
  directory is used; zimit creates one subdir of this folder per invocation to isolate datasets; the subdir is kept only
  if `--keep` is set.

### Fixed

- `--collection` parameter was not working (#252)

## [1.6.2] - 2023-11-17

### Changed

- Using browsertrix-crawler 0.12.3

### Fixed

- Fix logic passing args to crawler to support value '0' (#245)
- Fix documentation about Chrome and headless (#248)

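The entry for #245 does not say what the faulty logic was. One common cause of this class of bug, shown here purely as an assumption, is a truthiness check (`if value:`) that silently drops a numeric `0`. The sketch below uses a hypothetical `build_crawler_args` helper, not zimit's actual code, and tests against `None` instead so that `0` is still forwarded.

```python
def build_crawler_args(options: dict[str, object]) -> list[str]:
    """Turn parsed options into a crawler argument list.

    Hypothetical helper illustrating the pitfall; not zimit's implementation.
    """
    args: list[str] = []
    for name, value in options.items():
        # `if not value: continue` would also skip a numeric 0, so unset
        # options (None) and disabled switches (False) are tested explicitly.
        if value is None or value is False:
            continue
        if value is True:
            # Boolean switches are passed without a value.
            args.append(f"--{name}")
        else:
            args.extend([f"--{name}", str(value)])
    return args
```

With this check, `build_crawler_args({"depth": 0, "profile": None, "keep": True})` returns `["--depth", "0", "--keep"]`, whereas a plain truthiness test would have lost `--depth 0`.
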
## [1.6.1] - 2023-11-06

### Changed

- Using browsertrix-crawler 0.12.1

## [1.6.0] - 2023-11-02

### Changed

- Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
- User-Agent now has a default value (#228)
- Manipulation of spaces with UA suffix and adminEmail has been modified
- Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
- Using browsertrix-crawler 0.12.0

## [1.5.3] - 2023-10-02

### Changed

- Using browsertrix-crawler 0.11.2

## [1.5.2] - 2023-09-19

### Changed

- Using browsertrix-crawler 0.11.1

## [1.5.1] - 2023-09-18

### Changed

- Using browsertrix-crawler 0.11.0
- Scraper stat file is not created empty (#211)
- Crawler statistics are not available anymore (#213)
- Using warc2zim 1.5.4

## [1.5.0] - 2023-08-23

### Added

- `--long-description` param

## [1.4.1] - 2023-08-23

### Changed

- Using browsertrix-crawler 0.10.4
- Using warc2zim 1.5.3

## [1.4.0] - 2023-08-02

### Added

- `--title` to set ZIM title
- `--description` to set ZIM description
- New crawler options: `--maxPageLimit`, `--delay`, `--diskUtilization`
- `--zim-lang` param to set warc2zim's `--lang` (ISO-639-3)

### Changed

- Using browsertrix-crawler 0.10.2
- Default and accepted values for `--waitUntil` from crawler's update
- Using warc2zim 1.5.2
- Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
- `--failOnFailedSeed` used unconditionally
- `--lang` now passed to crawler (ISO-639-1)

### Removed

- `--newContext` from crawler's update

## [1.3.1] - 2023-02-06

### Changed

- Using browsertrix-crawler 0.8.0
- Using warc2zim version 1.5.1 with wabac.js 2.15.2

## [1.3.0] - 2023-02-02

### Added

- Initial URL check normalizes homepage redirects to standard ports (80/443) (#137)

### Changed

- Using warc2zim version 1.5.0 with scope conflict fix and videos fix
- Using browsertrix-crawler 0.8.0-beta.1
- Fixed `--allowHashUrls` being a boolean param
- Increased `check_url` timeout (12s to connect, 27s to read) instead of 10s

## [1.2.0] - 2022-06-21

### Added

- `--urlFile` browsertrix crawler parameter
- `--depth` browsertrix crawler parameter
- `--extraHops` parameter
- `--collection` browsertrix crawler parameter
- `--allowHashUrls` browsertrix crawler parameter
- `--userAgentSuffix` browsertrix crawler parameter
- `--behaviors` parameter
- `--behaviorTimeout` browsertrix crawler parameter
- `--profile` browsertrix crawler parameter
- `--sizeLimit` browsertrix crawler parameter
- `--timeLimit` browsertrix crawler parameter
- `--healthCheckPort` parameter
- `--overwrite` parameter

### Changed

- using browsertrix-crawler `0.6.0` and warc2zim `1.4.2`
- default WARC location after crawl changed
  from `collections/capture-*/archive/` to `collections/crawl-*/archive/`

### Removed

- `--scroll` browsertrix crawler parameter (see `--behaviors`)
- `--scope` browsertrix crawler parameter (see `--scopeType`, `--include` and `--exclude`)

## [1.1.5]

- using crawler 0.3.2 and warc2zim 1.3.6

## [1.1.4]

- Defaults to `load,networkidle0` for waitUntil param (same as crawler)
- Allows setting combinations of values for waitUntil param
- Updated warc2zim to 1.3.5
- Updated browsertrix-crawler to 0.3.1
- WARC files for ZIM conversion are now written to `{temp_root_dir}/collections/capture-*/archive/`, where
  `capture-*` is dynamic and includes the datetime (from browsertrix-crawler)

## [1.1.3]

- allows same first-level-domain redirects
- fixed redirects to URL in scope
- updated crawler to 0.2.0
- `statsFilename` now informs whether limit was hit or not

## [1.1.2]

- added support for `--custom-css`
- added domains block list (default)

## [1.1.1]

- updated browsertrix-crawler to 0.1.4
- autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets

## [1.0]

- initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3