using scraperlib 1.6 (libzim 7.2)

renaud gaudin 2022-06-13 14:54:12 +00:00
parent 16d4bfafc1
commit c19c0eb1ef
2 changed files with 27 additions and 15 deletions

@@ -1,59 +1,71 @@
warc2zim
===
## Changelog
# 1.4.0
All notable changes to this project are documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html) (as of version 1.4.0).
## [Unreleased]
### Added
* Additional fuzzy matching rules for youtube and vimeo, and additional test cases
* Support for youtube videos, which require POST request handling to work.
* Support for canonicalizing POST request data into URL for fuzzy matching (using cdxj-indexer)
* Support loading custom sw.js from a local file path
# 1.3.6
### Changed
* Updated zimscraperlib to 1.6 using libzim7.2
* Added support for {period} replacement in --zim-file
* Using fixed MarkupSafe version (Jinja2 dependency)
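
The `{period}` replacement in `--zim-file` can be pictured with a minimal sketch like the one below. It is not taken from the warc2zim source: the helper name `resolve_zim_filename` is hypothetical, and the `YYYY-MM` period format is an assumption.

```python
import datetime


def resolve_zim_filename(template, today=None):
    """Replace a literal "{period}" placeholder in a ZIM file name.

    Assumption: the period is rendered as YYYY-MM. The changelog entry
    itself does not state the format.
    """
    today = today or datetime.date.today()
    return template.replace("{period}", today.strftime("%Y-%m"))
```

Under these assumptions, a template such as `example_{period}.zim` resolves to a dated file name, and a template without the placeholder is returned unchanged.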
# [1.3.6]
* updated zimscraperlib (for libzim fix)
# 1.3.5
# [1.3.5]
* don't crash on records without WARC-Target-URI
* fixed failure if url contains a fragment
* updated wabac.js to 2.7.3
# 1.3.4
# [1.3.4]
* Added `--custom-css` option
# 1.3.3
# [1.3.3]
* Added `--progress-file` option
# 1.3.2
# [1.3.2]
* Update to wabac.js 2.1.6
# 1.3.1
# [1.3.1]
* Favicon loading fixes: In topFrame.html, load favicon URL directly from ZIM A/ record, bypassing service worker H/ lookup.
# 1.3.0
# [1.3.0]
* Supports 'fuzzy matching' with additional redirects added from normalized URL to exact URL
* Add fuzzy matching rules for youtube and '?timestamp' URLs
* Fix canonicalization where URLs that contain http/https were being incorrectly stripped (https://github.com/openzim/zimit/issues/37)
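
To illustrate the idea of redirects from a normalized URL to the exact URL, here is a rough sketch. The single rule shown (dropping an all-digit `?timestamp` query) and the names `normalize` and `build_fuzzy_redirects` are hypothetical. The actual rules in warc2zim may differ.

```python
import re

# Hypothetical rule: a trailing all-digit query such as "?1655132052"
# is treated as a cache-busting timestamp and removed.
TIMESTAMP_QUERY = re.compile(r"\?\d+$")


def normalize(url):
    """Return the URL with a trailing numeric-only query removed."""
    return TIMESTAMP_QUERY.sub("", url)


def build_fuzzy_redirects(exact_urls):
    """Map each normalized URL to the exact URL that was archived."""
    known = set(exact_urls)
    redirects = {}
    for url in exact_urls:
        normalized = normalize(url)
        # Only add a redirect when the normalized form is not itself archived.
        if normalized != url and normalized not in known:
            redirects.setdefault(normalized, url)
    return redirects
```

A request for the normalized URL can then be answered by redirecting to the exact, archived URL.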
# 1.2.0
# [1.2.0]
* Accepts directory inputs as well as individual files. If a directory is given, all .warc and .warc.gz files in it are processed recursively.
* If the trailing slash is missing on the main URL, e.g. `--url https://example.com?test=value`, a slash is added and the URL is treated as `--url https://example.com/?test=value`
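
The two 1.2.0 behaviours can be sketched with the standard library alone. This is an illustration of the described behaviour, not the project's implementation. `iter_warc_files` and `add_missing_root_slash` are hypothetical names.

```python
import os
from urllib.parse import urlsplit, urlunsplit

WARC_SUFFIXES = (".warc", ".warc.gz")


def iter_warc_files(inputs):
    """Yield WARC paths, expanding any directory input recursively."""
    for path in inputs:
        if os.path.isdir(path):
            for root, _dirs, files in os.walk(path):
                for name in sorted(files):
                    if name.endswith(WARC_SUFFIXES):
                        yield os.path.join(root, name)
        else:
            # Individual files are passed through unchanged.
            yield path


def add_missing_root_slash(url):
    """Insert "/" as the path when the URL has a host but no path."""
    parts = urlsplit(url)
    if parts.netloc and not parts.path:
        parts = parts._replace(path="/")
    return urlunsplit(parts)
```

With this sketch, `add_missing_root_slash("https://example.com?test=value")` yields `https://example.com/?test=value`, while a URL that already has a path is left untouched.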
# 1.1.0
# [1.1.0]
* Now defaults to including all URLs unless --include-domains is specified (removed `-a`)
* Arguments are now checked before starting. Also returns `100` on valid arguments but no WARC provided.
# 1.0.1
# [1.0.1]
* Now skipping WARC records that redirect to self (http -> https mostly)
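
A self-redirect check along these lines might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, the inputs are plain strings rather than WARC record objects, and "same URL" is assumed to mean equal once the scheme is ignored.

```python
from urllib.parse import urljoin, urlsplit


def strip_scheme(url):
    """Reduce a URL to host, path and query so http and https compare equal."""
    parts = urlsplit(url)
    query = "?" + parts.query if parts.query else ""
    return parts.netloc.lower() + (parts.path or "/") + query


def is_self_redirect(target_uri, status_code, location):
    """Return True for a 3xx response whose Location is the same URL.

    Assumption: only the scheme may differ (http -> https).
    """
    if not status_code.startswith("3") or not location:
        return False
    # Location may be relative, so resolve it against the target URI first.
    absolute = urljoin(target_uri, location)
    return strip_scheme(target_uri) == strip_scheme(absolute)
```

Records for which such a check returns true would be skipped rather than stored as redirects that point back to themselves.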
# 1.0.0
# [1.0.0]
* Initial release

@@ -1,7 +1,7 @@
warcio>=1.7.3,<1.8
requests>=2.25.1,<3.0
beautifulsoup4>=4.9.3,<4.10
zimscraperlib>=1.4.1,<1.5
zimscraperlib>=1.6.0,<1.7
Babel>=2.9,<3.0
jinja2>=2.11,<3.0
# to support possible brotli content in warcs