## Changelog

All notable changes to this project are documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.4.0).

## [Unreleased]

### Added

- Provide default encoding aliases (#416)

### Changed

- Convert aliases given in `--encoding-aliases` to lower case (#412)

## [2.2.2] - 2025-02-17

### Changed

- Upgrade dependencies, especially zimscraperlib 5.1.1 (#439)

## [2.2.1] - 2025-02-07

### Changed

- Upgrade dependencies: Python 3.13, zimscraperlib 5.1.0 and others (#434)
- Fork cdxj_indexer codebase (#428)

## [2.2.0] - 2025-01-10

### Changed

- Upgrade dependencies: zimscraperlib 5.0.0, warcio 1.7.5, cdxj_index 1.4.6 and others
- Use all rewriting logic from zimscraperlib
- Remove most HTML / CSS / JS rewriting logic, which is now part of zimscraperlib 5
- Fix wombat setup settings (especially `isSW`) (#293)

### Fixed

- Stop checking main entry processability when it is already found (#424)

## [2.1.3] - 2024-11-01

### Changed

- Upgrade to wombat 3.8.3 (#414)

## [2.1.2] - 2024-10-08

### Added

- Enrich test website with img srcset situations (in preparation for #403)

### Changed

- Upgrade dependencies, including wombat 3.8.2 (#407)

### Fixed

- HTML document can be retrieved as `fetch` resource type (#405)

## [2.1.1] - 2024-09-05

### Changed

- Upgrade dependencies, including wombat 3.8.0 (#386)

## [2.1.0] - 2024-08-09

### Added

- New fuzzy rules for cheatography.com (#342), der-postillon.com (#330), iranwire.com (#363)
- Properly rewrite redirect target URL when present in `<meta>` HTML tag (#237)
- New `--encoding-aliases` argument to pass encoding/charset aliases (#331)
- Add support for SVG favicon (#148)
- Automatically index PDF content and use PDF title (#289 and #290)

### Changed

- Upgrade to python-scraperlib 4.0.0
- Generate fuzzy rules tests in Python and JavaScript (#284)
- Refactor HTML rewriter class to make it more open to change and expressive (#305)
- Detect charset in document header only for HTML documents (#331)
- Use `software` property from `warcinfo` record to set ZIM `Scraper` metadata (#357)
- Store `ContentDate` as metadata, based on `WARC-Date` (#358)
- Remove domain specific rules (#328)
- Revisit `retrieve_illustration` logic to prefer best favicons (#352 and #369)
- Upgrade dependencies (zimscraperlib 4.0.0, wombat.js 3.7.12 and others) (#376)

### Fixed

- Handle case where the redirect target is bad / unsupported (#332 and #356)
- Fixed WARC files handling order to follow creation order (#366)
- Remove subsequent slashes in URLs, both in Python and JS (#365)
- Ignore non HTTP(S) WARC records (#351)
- Fix `vimeo_cdn_fix` fuzzy rule for proper operation in JavaScript (#348)
- Performance issue linked to new "extensible" HTML rewriting rules (#370)

## [2.0.3] - 2024-07-24

### Changed

- Moved rules definition from JSON to YAML and documented update process (#216)
- Upgrade to wombat.js 3.7.11

### Added

- Exit with cleaner message when no entries are expected in the ZIM (#336) and when main entry is not processable (#337)
- Add debug log for items whose content is empty (#344)

### Fixed

- The rewrite mode of some resources was still not correctly identified (#326)

## [2.0.2] - 2024-06-18

### Added

- Add `--ignore-content-header-charsets` option to disable automatic retrieval of content charsets from content first bytes (#318)
- Add `--content-header-bytes-length` option to specify how many first bytes to consider when searching for content charsets in header (#320)
- Add `--ignore-http-header-charsets` option to disable automatic retrieval of content charsets from content HTTP `Content-Type` headers (#318)

### Changed

- Simplify logic deciding content charset, stop guessing with chardet (#312)

### Fixed

- Rewrite only content with mimetype `text/html` when `WARC-Resource-Type` is `html` (#313)

## [2.0.1] - 2024-06-13

### Added

- Add support for multiple languages in `--lang` CLI argument (#300)

### Changed

- Use the new `WARC-Resource-Type` header to decide rewrite mode (when present in WARC) (#296)
- Upgrade Python dependencies + wombat.js 3.7.5

### Fixed

- Drop `integrity` attribute in HTML `<script>` and `<link>` tags (#298)
- Use automatic detection of content encoding also for JS, JSON and CSS files (#301)
- Set correct charset in HTML documents (#253)

## [2.0.0] - 2024-06-04

### Added

- Allow specifying a scraper suffix for the ZIM scraper metadata at the CLI (#168)
- New test website to test many known situations supposed to be handled (#166)

### Changed

- Replace **Service Worker** approach with **scraper-side rewriting** of static content (https://github.com/kiwix/overview/issues/95)
- Adopted Python bootstrap conventions (#152)
- Upgrade dependencies, especially move to **Python 3.12** (only) and zimscraperlib 3.3.2
- Change wording in logs about the return code 100 (which is not an error code)
- Added checks in `converter.py` to verify that the output directory exists, log an appropriate error message, and exit cleanly if the checks fail (#106)
- Added check for invalid ZIM file names (#232)
- Changed default publisher metadata from 'Kiwix' to 'openZIM' (#150)

## [1.5.5] - 2024-01-18

### Changed

- Code restructuring in preparation for 2.x

## [1.5.4] - 2023-09-18

### Changed

- Using wabac.js 2.16.11
- Using `cover` resize method for favicon to prevent issues with too-small ones
- Fixed direct link hack when inside an outer frame (kiwix-serve 3.5+) (#119)

## [1.5.3] - 2023-08-23

### Changed

- Using wabac.js 2.16.9

## [1.5.2] - 2023-08-02

### Changed

- Using scraperlib 3.1.1; openZIM metadata is now always set, using the default if missing
- Using wabac.js 2.16.6

## [1.5.1] - 2023-02-06

### Changed

- Using wabac.js 2.15.2

## [1.5.0] - 2023-02-02

### Added

- Don't crash on failure to convert illustration (skip the illustration instead)

### Changed

- Fixed 404 page (#96)
- Don't crash on missing Location headers on potential redirect
- Fixed incorrect ISO-639-3 `--lang` not replaced with `eng`
- Don't fallback to `eng` if the host doesn't have the matching locale
- Using wabac.js 2.15.0 with fix for scope conflict in SW/DB
- Payload entries now use original `text/html` mimetype instead of `text/html;raw=true`
- Don't crash on icon link with no href

## [1.4.3] - 2022-06-21

### Changed

* Using wabac.js 2.12.0
* Prevent duplicate entries from failing (including illustrations)
* Fixed crash on HTTP 300 records (#94)

## [1.4.0] - 2022-06-14

### Added

* Additional fuzzy matching rules for YouTube and Vimeo, and additional test cases
* Support for YouTube videos, which require POST request handling to work.
* Support for canonicalizing POST request data into URL for fuzzy matching (using cdxj-indexer)
* Support loading custom sw.js from a local file path

### Changed

* Updated zimscraperlib to 1.6 using libzim7.2
* Updated warcio to 1.7.4
* Added support for `{period}` replacement in `--zim-file`
* Using fixed MarkupSafe version (Jinja2 dependency)

## [1.3.6]

* Updated zimscraperlib (for libzim fix)

## [1.3.5]

* Don't crash on records without WARC-Target-URI
* Fixed failure if URL contains a fragment
* Updated wabac.js to 2.7.3

## [1.3.4]

* Added `--custom-css` option

## [1.3.3]

* Added `--progress-file` option

## [1.3.2]

* Update to wabac.js 2.1.6

## [1.3.1]

* Favicon loading fixes: in `topFrame.html`, load favicon URL directly from ZIM A/ record, bypassing service worker H/ lookup.

## [1.3.0]

* Supports 'fuzzy matching' with additional redirects added from normalized URL to exact URL
* Add fuzzy matching rules for YouTube and '?timestamp' URLs
* Fix canonicalization where URLs that contain http/https were being incorrectly stripped (https://github.com/openzim/zimit/issues/37)

## [1.2.0]

* Accepts directory inputs as well as individual files. If a directory is given, all `.warc` and `.warc.gz` files in it are processed recursively.
* If the trailing slash is missing on the main URL, e.g. `--url https://example.com?test=value`, a slash is added and the URL is treated as `--url https://example.com/?test=value`

## [1.1.0]

* Now defaults to including all URLs unless `--include-domains` is specified (removed `-a`)
* Arguments are now checked before starting. Also returns `100` on valid arguments but no WARC provided.

## [1.0.1]

* Now skipping WARC records that redirect to self (http -> https mostly)

## [1.0.0]

* Initial release