## Changelog
All notable changes to this project are documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.2.0).
## [Unreleased]
### Added
- Add `--overwrite` flag to overwrite an existing ZIM file (#399)
### Changed
- Fix issues preventing interrupted crawls from being resumed (#499)
  - Ensure the build directory is used explicitly instead of a randomized subdirectory when passed, and pre-create it if it does not exist.
  - Use all warc_dirs found instead of just the latest, so interrupted crawls use all collected pages across runs when an explicit collections directory is not passed.
  - Don't clean up an explicitly passed build directory.
## [3.0.5] - 2025-04-11
### Changed
- Upgrade to browsertrix crawler 1.6.0 (#493)
## [3.0.4] - 2025-04-04
### Changed
- Upgrade to browsertrix crawler 1.5.10 (#491)
## [3.0.3] - 2025-02-28
### Changed
- Upgrade to browsertrix crawler 1.5.7 (#483)
## [3.0.2] - 2025-02-27
### Changed
- Upgrade to browsertrix crawler 1.5.6 (#482)
## [3.0.1] - 2025-02-24
### Changed
- Upgrade to browsertrix crawler 1.5.4 (#476)
## [3.0.0] - 2025-02-17
### Changed
- Change the way a partial ZIM is reported to the Zimfarm and other clients (#304)
- Keep temporary folder when crawler or warc2zim fails, even if not asked for (#468)
- Add many missing Browsertrix Crawler arguments; drop default overrides by zimit; drop `--noMobileDevice` setting (no longer needed) (#433)
- Document all Browsertrix Crawler default argument values (#416)
- Use preferred Browsertrix Crawler argument names (part of #471):
  - `--seeds` instead of `--url`
  - `--seedFile` instead of `--urlFile`
  - `--pageLimit` instead of `--limit`
  - `--pageLoadTimeout` instead of `--timeout`
  - `--scopeIncludeRx` instead of `--include`
  - `--scopeExcludeRx` instead of `--exclude`
  - `--pageExtraDelay` instead of `--delay`
- Remove confusion between zimit, warc2zim and crawler stats filenames (part of #471):
  - `--statsFilename` is now the crawler stats file (same name as the crawler argument, like the other arguments)
  - `--zimit-progress-file` is now the zimit stats location
  - `--warc2zim-progress-file` is the warc2zim stats location
  - all are optional; if a value is not set but is needed, a temporary file is used
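
For scripts that still use the pre-3.0.0 argument names, the renames listed above can be applied mechanically. The following is a minimal sketch, not part of zimit: `RENAMED_FLAGS` and `migrate_args` are hypothetical names, and the helper only rewrites flag names (it does not check whether accepted values changed).

```python
# Old name -> preferred name, copied from the renames listed in this release.
RENAMED_FLAGS = {
    "--url": "--seeds",
    "--urlFile": "--seedFile",
    "--limit": "--pageLimit",
    "--timeout": "--pageLoadTimeout",
    "--include": "--scopeIncludeRx",
    "--exclude": "--scopeExcludeRx",
    "--delay": "--pageExtraDelay",
}


def migrate_args(argv):
    """Return a copy of argv with old flag names replaced (hypothetical helper)."""
    migrated = []
    for arg in argv:
        # Support both "--flag value" and "--flag=value" forms.
        name, sep, value = arg.partition("=")
        # Tokens that are not a known old flag are passed through unchanged.
        migrated.append(RENAMED_FLAGS.get(name, name) + sep + value)
    return migrated
```

Calling `migrate_args(["--url", "https://example.com", "--limit=10"])` yields a list using `--seeds` and `--pageLimit`, with the values left untouched.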
### Fixed
- Do not create the ZIM when crawl is incomplete (#444)
## [2.1.8] - 2025-02-07
### Changed
- Upgrade to browsertrix crawler 1.5.1, Python 3.13 and others (#462 + #464)
## [2.1.7] - 2025-01-10
### Changed
- Upgrade to browsertrix crawler 1.4.2 (#450)
- Upgrade to warc2zim 2.2.0
## [2.1.6] - 2024-11-07
### Changed
- Upgrade to browsertrix crawler 1.3.5 (#426)
## [2.1.5] - 2024-11-01
### Changed
- Upgrade to browsertrix crawler 1.3.4 and warc2zim 2.1.3 (#424)
## [2.1.4] - 2024-10-11
### Changed
- Upgrade to browsertrix crawler 1.3.3 (#411)
## [2.1.3] - 2024-10-08
### Changed
- Upgrade to browsertrix crawler 1.3.2, warc2zim 2.1.2 and other dependencies (#406)
### Fixed
- Fix help (#393)
## [2.1.2] - 2024-09-09
### Changed
- Upgrade to browsertrix crawler 1.3.0-beta.1 (#387) (fixes "Ziming a website with huge assets (e.g. PDFs) is failing to proceed" - #380)
## [2.1.1] - 2024-09-05
### Added
- Add support for uncompressed tar archives in `--warcs` (#369)
### Changed
- Upgrade to browsertrix crawler 1.3.0-beta.0 (#379), including upgrade to Ubuntu Noble (#307)
### Fixed
- Stream file downloads to avoid exhausting memory (#373)
- Fix documentation on `--diskUtilization` setting (#375)
## [2.1.0] - 2024-08-09
### Added
- Add `--custom-behaviors` argument to support path/HTTP(S) URL custom behaviors to pass to the crawler (#313)
- Add daily automated end-to-end tests of a page with a YouTube player (#330)
- Add `--warcs` option to directly process WARC files (#301)
### Changed
- Make it clear that `--profile` argument can be an HTTP(S) URL (and not only a path) (#288)
- Fix README imprecisions + add back warc2zim availability in docker image (#314)
- Enhance integration test to assert final content of the ZIM (#287)
- Stop fetching and passing browsertrix crawler version as scraperSuffix to warc2zim (#354)
- Do not log number of WARC files found (#357)
- Upgrade dependencies (warc2zim 2.1.0)
### Fixed
- Sort WARC directories found by modification time (#366)
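
To illustrate the ordering fix above, here is a minimal sketch of collecting WARC directories and sorting them by modification time. It is not zimit's actual code: `find_warc_dirs` is a hypothetical name, the `collections/crawl-*/archive` layout is the one mentioned under 1.2.0, and returning the oldest directory first is an assumption.

```python
import glob
import os


def find_warc_dirs(build_dir):
    """Return crawl archive directories under build_dir, oldest first (hypothetical helper)."""
    pattern = os.path.join(build_dir, "collections", "crawl-*", "archive")
    dirs = [path for path in glob.glob(pattern) if os.path.isdir(path)]
    # glob returns matches in arbitrary order, so sort explicitly by mtime.
    return sorted(dirs, key=os.path.getmtime)
```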
## [2.0.6] - 2024-08-02
### Changed
- Upgraded Browsertrix Crawler to 1.2.6
## [2.0.5] - 2024-07-24
### Changed
- Upgraded Browsertrix Crawler to 1.2.5
- Upgraded warc2zim to 2.0.3
## [2.0.4] - 2024-07-15
### Changed
- Upgraded Browsertrix Crawler to 1.2.4 (fixes "retrieve automatically the assets present in a data-xxx tag" - #316)
## [2.0.3] - 2024-06-24
### Changed
- Upgraded Browsertrix Crawler to 1.2.0 (fixes YouTube videos issue #323)
## [2.0.2] - 2024-06-18
### Changed
- Upgrade dependencies (mainly warc2zim 2.0.2)
## [2.0.1] - 2024-06-13
### Changed
- Upgrade dependencies (especially warc2zim 2.0.1 and browsertrix crawler 1.2.0-beta.0) (#318)
### Fixed
- Crawler is not correctly checking disk size / usage (#305)
## [2.0.0] - 2024-06-04
### Added
- New `--version` flag to display Zimit version (#234)
- New `--logging` flag to adjust Browsertrix Crawler logging (#273)
- Use new `--scraper-suffix` flag of warc2zim to enhance ZIM "Scraper" metadata (#275)
- New `--noMobileDevice` CLI argument
- Publish Docker image for `linux/arm64` (in addition to `linux/amd64`) (#178)
### Changed
- **Use `warc2zim` version 2**, which no longer requires a Service Worker (#193)
- Upgraded Browsertrix Crawler to 1.1.3
- Adopt Python bootstrap conventions
- Upgrade to Python 3.12 + upgrade dependencies
- Removed handling of redirects by zimit, they are handled by browsertrix crawler and detected properly by warc2zim (#284)
- Drop initial check of URL in Python (#256)
- `--userAgent` CLI argument once again overrides the `--userAgentSuffix` and `--adminEmail` values
- `--userAgent` CLI argument is no longer mandatory
### Fixed
- Fix support for YouTube videos (#291)
- Fix crawler `--waitUntil` values (#289)
## [1.6.3] - 2024-01-18
### Changed
- Adapt to new `warc2zim` code structure
- Using browsertrix-crawler 0.12.4
- Using warc2zim 1.5.5
### Added
- New `--build` parameter (optional) to specify the directory holding Browsertrix files; if not set, the `--output`
  directory is used; zimit creates one subdir of this folder per invocation to isolate datasets; the subdir is kept only
  if `--keep` is set.
### Fixed
- `--collection` parameter was not working (#252)
## [1.6.2] - 2023-11-17
### Changed
- Using browsertrix-crawler 0.12.3
### Fixed
- Fix logic passing args to crawler to support value '0' (#245)
- Fix documentation about Chrome and headless (#248)
## [1.6.1] - 2023-11-06
### Changed
- Using browsertrix-crawler 0.12.1
## [1.6.0] - 2023-11-02
### Changed
- Scraper fails for all HTTP error codes returned when checking URL at startup (#223)
- User-Agent now has a default value (#228)
- Manipulation of spaces with UA suffix and adminEmail has been modified
- Same User-Agent is used for check_url (Python) and Browsertrix crawler (#227)
- Using browsertrix-crawler 0.12.0
## [1.5.3] - 2023-10-02
### Changed
- Using browsertrix-crawler 0.11.2
## [1.5.2] - 2023-09-19
### Changed
- Using browsertrix-crawler 0.11.1
## [1.5.1] - 2023-09-18
### Changed
- Using browsertrix-crawler 0.11.0
- Scraper stat file is not created empty (#211)
- Crawler statistics are not available anymore (#213)
- Using warc2zim 1.5.4
## [1.5.0] - 2023-08-23
### Added
- `--long-description` param
## [1.4.1] - 2023-08-23
### Changed
- Using browsertrix-crawler 0.10.4
- Using warc2zim 1.5.3
## [1.4.0] - 2023-08-02
### Added
- `--title` to set ZIM title
- `--description` to set ZIM description
- New crawler options: `--maxPageLimit`, `--delay`, `--diskUtilization`
- `--zim-lang` param to set warc2zim's `--lang` (ISO-639-3)
### Changed
- Using browsertrix-crawler 0.10.2
- Default and accepted values for `--waitUntil` from crawler's update
- Using warc2zim 1.5.2
- Disabled Chrome updates to prevent incidental inclusion of update data in WARC/ZIM (#172)
- `--failOnFailedSeed` used unconditionally
- `--lang` now passed to crawler (ISO-639-1)
### Removed
- `--newContext` from crawler's update
## [1.3.1] - 2023-02-06
### Changed
- Using browsertrix-crawler 0.8.0
- Using warc2zim version 1.5.1 with wabac.js 2.15.2
## [1.3.0] - 2023-02-02
### Added
- Initial URL check normalizes homepage redirects to standard ports 80/443 (#137)
### Changed
- Using warc2zim version 1.5.0 with scope conflict fix and videos fix
- Using browsertrix-crawler 0.8.0-beta.1
- Fixed `--allowHashUrls` being a boolean param
- Increased `check_url` timeout (12s to connect, 27s to read) instead of 10s
## [1.2.0] - 2022-06-21
### Added
- `--urlFile` browsertrix crawler parameter
- `--depth` browsertrix crawler parameter
- `--extraHops` parameter
- `--collection` browsertrix crawler parameter
- `--allowHashUrls` browsertrix crawler parameter
- `--userAgentSuffix` browsertrix crawler parameter
- `--behaviors` parameter
- `--behaviorTimeout` browsertrix crawler parameter
- `--profile` browsertrix crawler parameter
- `--sizeLimit` browsertrix crawler parameter
- `--timeLimit` browsertrix crawler parameter
- `--healthCheckPort` parameter
- `--overwrite` parameter
### Changed
- using browsertrix-crawler `0.6.0` and warc2zim `1.4.2`
- default WARC location after crawl changed
  from `collections/capture-*/archive/` to `collections/crawl-*/archive/`
### Removed
- `--scroll` browsertrix crawler parameter (see `--behaviors`)
- `--scope` browsertrix crawler parameter (see `--scopeType`, `--include` and `--exclude`)
## [1.1.5]
- using crawler 0.3.2 and warc2zim 1.3.6
## [1.1.4]
- Defaults to `load,networkidle0` for waitUntil param (same as crawler)
- Allows setting combinations of values for waitUntil param
- Updated warc2zim to 1.3.5
- Updated browsertrix-crawler to 0.3.1
- WARCs to convert to ZIM are now written to `{temp_root_dir}/collections/capture-*/archive/` where
  `capture-*` is dynamic and includes the datetime (from browsertrix-crawler)
## [1.1.3]
- allows same first-level-domain redirects
- fixed redirects to URL in scope
- updated crawler to 0.2.0
- `statsFilename` now informs whether limit was hit or not
## [1.1.2]
- added support for `--custom-css`
- added domains block list (default)
## [1.1.1]
- updated browsertrix-crawler to 0.1.4
- autofetcher script to be injected by defaultDriver to capture srcsets + URLs in dynamically added stylesheets
## [1.0]
- initial version using browsertrix-crawler:0.1.3 and warc2zim:1.3.3