Decode content bytes only with supplied charset or static list of charsets to try
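
The decoding strategy named in the commit title could look roughly like the sketch below. It is not the code from this commit: the function name `decode_content`, the `CHARSETS_TO_TRY` tuple and its contents, and the choice to raise `ValueError` on failure are all assumptions made for illustration.

```python
from typing import Optional, Sequence

# Hypothetical static list; the real list used by the scraper is not shown here.
# iso-8859-1 accepts any byte sequence, so it only makes sense as the last entry.
CHARSETS_TO_TRY: Sequence[str] = ("utf-8", "iso-8859-1")


def decode_content(
    content: bytes,
    supplied_charset: Optional[str] = None,
    charsets_to_try: Sequence[str] = CHARSETS_TO_TRY,
) -> str:
    """Decode bytes with the supplied charset, or else with a static list.

    No charset detection is attempted: if a charset is supplied, only that
    one is used; otherwise each charset of the static list is tried in order.
    """
    candidates = [supplied_charset] if supplied_charset else list(charsets_to_try)
    for charset in candidates:
        try:
            return content.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown charset: move on to the next candidate.
            continue
    raise ValueError(f"Could not decode content with any of: {candidates}")
```

With these assumptions, `decode_content(data, "utf-16")` uses only the supplied charset, while `decode_content(data)` walks the static list and fails loudly instead of guessing.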

benoit74 2024-06-14 13:21:18 +00:00
parent 4c12681b1c
commit b1c8a35212
GPG key ID: B89606434FC7B530
18 changed files with 1343 additions and 271 deletions

@@ -35,10 +35,6 @@ It provide two main features:
 Except that, scraper directly uses WarcRecord (returned by cdxj_indexer, implemented in warcio) to access metadata and such.
-## chardet
-[chardet Python library](https://pypi.org/project/chardet/) is used to detect character encoding of files when it is absent (only HTML file typically specify its encoding) or incoherent.
 ## zimscraperlib
 [zimscraperlib Python library](https://pypi.org/project/zimscraperlib) is used for ZIM operations.