## Changelog
All notable changes to this project are documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.4.0).
## [Unreleased]
### Added
- Provide default encoding aliases (#416)
### Changed
- Convert aliases given in `--encoding-aliases` to lower case (#412)
## [2.2.2] - 2025-02-17
### Changed
- Upgrade dependencies, especially zimscraperlib 5.1.1 (#439)
## [2.2.1] - 2025-02-07
### Changed
- Upgrade dependencies: Python 3.13, zimscraperlib 5.1.0 and others (#434)
- Fork cdxj_indexer codebase (#428)
## [2.2.0] - 2025-01-10
### Changed
- Upgrade dependencies: zimscraperlib 5.0.0, warcio 1.7.5, cdxj_indexer 1.4.6 and others
- Use all rewriting logic from zimscraperlib
- Remove most HTML / CSS / JS rewriting logic which is now part of zimscraperlib 5
- Fix wombat setup settings (especially `isSW`) (#293)
### Fixed
- Stop checking main entry processability when it is already found (#424)
## [2.1.3] - 2024-11-01
### Changed
- Upgrade to wombat 3.8.3 (#414)
## [2.1.2] - 2024-10-08
### Added
- Enrich test website with img srcset situations (in preparation for #403)
### Changed
- Upgrade dependencies, including wombat 3.8.2 (#407)
### Fixed
- HTML document can be retrieved as `fetch` resource type (#405)
## [2.1.1] - 2024-09-05
### Changed
- Upgrade dependencies, including wombat 3.8.0 (#386)
## [2.1.0] - 2024-08-09
### Added
- New fuzzy-rule for cheatography.com (#342), der-postillon.com (#330), iranwire.com (#363)
- Properly rewrite redirect target URL when present in `<meta>` HTML tag (#237)
- New `--encoding-aliases` argument to pass encoding/charset aliases (#331)
- Add support for SVG favicon (#148)
- Automatically index PDF content and use PDF title (#289 and #290)
### Changed
- Upgrade to python-scraperlib 4.0.0
- Generate fuzzy rules tests in Python and JavaScript (#284)
- Refactor HTML rewriter class to make it more open to change and expressive (#305)
- Detect charset in document header only for HTML documents (#331)
- Use `software` property from `warcinfo` record to set ZIM `Scraper` metadata (#357)
- Store `ContentDate` as metadata, based on `WARC-Date` (#358)
- Remove domain specific rules (#328)
- Revisit retrieve_illustration logic to prefer best favicons (#352 and #369)
- Upgrade dependencies (zimscraperlib 4.0.0, wombat.js 3.7.12 and others) (#376)
### Fixed
- Handle case where the redirect target is bad / unsupported (#332 and #356)
- Fixed WARC files handling order to follow creation order (#366)
- Remove subsequent slashes in URLs, both in Python and JS (#365)
- Ignore non HTTP(S) WARC records (#351)
- Fix `vimeo_cdn_fix` fuzzy rule for proper operation in JavaScript (#348)
- Performance issue linked to new "extensible" HTML rewriting rules (#370)
## [2.0.3] - 2024-07-24
### Changed
- Moved rules definition from JSON to YAML and documented update process (#216)
- Upgrade to wombat.js 3.7.11
### Added
- Exit with cleaner message when no entries are expected in the ZIM (#336) and when main entry is not processable (#337)
- Add debug log for items whose content is empty (#344)
### Fixed
- The rewrite mode of some resources was still not correctly identified (#326)
## [2.0.2] - 2024-06-18
### Added
- Add `--ignore-content-header-charsets` option to disable automatic retrieval of content charsets from content first bytes (#318)
- Add `--content-header-bytes-length` option to specify how many first bytes to consider when searching for content charsets in header (#320)
- Add `--ignore-http-header-charsets` option to disable automatic retrieval of content charsets from content HTTP `Content-Type` headers (#318)
### Changed
- Simplify logic deciding content charset, stop guessing with chardet (#312)
### Fixed
- Rewrite only content with mimetype `text/html` when `WARC-Resource-Type` is `html` (#313)
## [2.0.1] - 2024-06-13
### Added
- Add support for multiple languages in `--lang` CLI argument (#300)
### Changed
- Use the new `WARC-Resource-Type` header to decide rewrite mode (when present in WARC) (#296)
- Upgrade Python dependencies + wombat.js 3.7.5
### Fixed
- Drop `integrity` attribute in HTML `<script>` and `<link>` tags (#298)
- Use automatic detection of content encoding also for JS, JSON and CSS files (#301)
- Set correct charset in HTML documents (#253)
## [2.0.0] - 2024-06-04
### Added
- Allow specifying a scraper suffix for the ZIM scraper metadata at the CLI (#168)
- New test website to test many known situations supposed to be handled (#166)
### Changed
- Replace **Service Worker** approach by **scraper-side rewriting** of static content (https://github.com/kiwix/overview/issues/95)
- Adopted Python bootstrap conventions (#152)
- Upgrade dependencies, especially move to **Python 3.12** (only) and zimscraperlib 3.3.2
- Change wording in logs about the return code 100 (which is not an error code)
- Added checks in `converter.py` to verify that the output directory exists, log appropriate error messages and exit cleanly if checks fail (#106)
- Added check for invalid ZIM file names (#232)
- Changed default publisher metadata from 'Kiwix' to 'openZIM' (#150)
## [1.5.5] - 2024-01-18
### Changed
- Code restructuring in preparation for 2.x
## [1.5.4] - 2023-09-18
### Changed
- Using wabac.js 2.16.11
- Using `cover` resize method for favicon to prevent issues with too-small ones
- Fixed direct link hack when inside an outer frame (kiwix-serve 3.5+) (#119)
## [1.5.3] - 2023-08-23
### Changed
- Using wabac.js 2.16.9
## [1.5.2] - 2023-08-02
### Changed
- Using scraperlib 3.1.1; openZIM metadata is now always set, using defaults if missing
- Using wabac.js 2.16.6
## [1.5.1] - 2023-02-06
### Changed
- Using wabac.js 2.15.2
## [1.5.0] - 2023-02-02
### Added
- Don't crash on failure to convert illustration (skip the illustration instead)
### Changed
- Fixed 404 page (#96)
- Don't crash on missing `Location` headers on potential redirect
- Fixed incorrect ISO-639-3 `--lang` not replaced with `eng`
- Don't fall back to `eng` if the host doesn't have the matching locale
- Using wabac.js 2.15.0 with fix for scope conflict in SW/DB
- Payload entries now use the original ~`text/html` mimetype instead of `text/html;raw=true`
- Don't crash on icon link with no `href`
## [1.4.3] - 2022-06-21
### Changed
* Using wabac.js 2.12.0
* Prevent duplicate entries from failing (including illustrations)
* Fixed crash on HTTP 300 records (#94)
## [1.4.0] - 2022-06-14
### Added
* Additional fuzzy matching rules for youtube and vimeo, and additional test cases
* Support for youtube videos, which require POST request handling to work.
* Support for canonicalizing POST request data into URL for fuzzy matching (using cdxj-indexer)
* Support loading custom sw.js from a local file path
### Changed
* Updated zimscraperlib to 1.6 using libzim 7.2
* Updated warcio to 1.7.4
* Added support for `{period}` replacement in `--zim-file`
* Using fixed MarkupSafe version (Jinja2 dependency)
## [1.3.6]
* updated zimscraperlib (for libzim fix)
## [1.3.5]
* don't crash on records without WARC-Target-URI
* fixed failure if url contains a fragment
* updated wabac.js to 2.7.3
## [1.3.4]
* Added `--custom-css` option
## [1.3.3]
* Added `--progress-file` option
## [1.3.2]
* Update to wabac.js 2.1.6
## [1.3.1]
* Favicon loading fixes: In topFrame.html, load favicon URL directly from ZIM A/ record, bypassing service worker H/ lookup.
## [1.3.0]
* Supports 'fuzzy matching' with additional redirects added from normalized URL to exact URL
* Add fuzzy matching rules for youtube and '?timestamp' URLs
* Fix canonicalization where URLs that contain http/https were being incorrectly stripped (https://github.com/openzim/zimit/issues/37)
## [1.2.0]
* Accepts directory inputs as well as individual files. If a directory is given, all .warc and .warc.gz files in it are processed recursively.
* If the trailing slash is missing on the main URL, as in `--url https://example.com?test=value`, a slash is added and the URL is treated as `--url https://example.com/?test=value`
## [1.1.0]
* Now defaults to including all URLs unless `--include-domains` is specified (removed `-a`)
* Arguments are now checked before starting. Also returns `100` on valid arguments but no WARC provided.
## [1.0.1]
* Now skipping WARC records that redirect to self (http -> https mostly)
## [1.0.0]
* Initial release