mirror of https://github.com/openzim/warc2zim.git
synced 2025-10-19 14:33:17 +00:00
Decode content bytes only with supplied charset or static list of charsets to try
parent 4c12681b1c
commit b1c8a35212
18 changed files with 1343 additions and 271 deletions
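
The commit title describes a decoding strategy: use the charset supplied with the content, otherwise try a fixed list of charsets. The changed code is not visible in this hunk, so the following is only a minimal sketch of that idea. The function name `to_string`, the `DEFAULT_CHARSETS` list, and the choice to fall back to the list when the supplied charset fails are all assumptions, not warc2zim's actual implementation.

```python
from typing import Optional, Sequence

# Hypothetical static list; the list actually used by warc2zim is not shown here.
DEFAULT_CHARSETS = ("utf-8", "iso-8859-1")


def to_string(
    content: bytes,
    charset: Optional[str] = None,
    charsets_to_try: Sequence[str] = DEFAULT_CHARSETS,
) -> str:
    """Decode bytes with the supplied charset, else with a static list.

    Assumption: if the supplied charset is unknown or fails to decode,
    we fall back to the static list instead of raising immediately.
    """
    if charset:
        try:
            return content.decode(charset)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown charset name or invalid bytes: try the list

    for candidate in charsets_to_try:
        try:
            return content.decode(candidate)
        except (LookupError, UnicodeDecodeError):
            continue  # try the next charset in the list

    raise ValueError("No charset could decode the content")
```

No statistical detection is involved in this sketch: the result depends only on the supplied charset and on the order of the list, which makes the behaviour deterministic. Note that a single-byte charset such as `iso-8859-1` accepts any byte sequence, so placing it last turns it into a catch-all.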
@@ -35,10 +35,6 @@ It provide two main features:

Except that, scraper directly uses WarcRecord (returned by cdxj_indexer, implemented in warcio) to access metadata and such.

## chardet

[chardet Python library](https://pypi.org/project/chardet/) is used to detect character encoding of files when it is absent (only HTML file typically specify its encoding) or incoherent.

## zimscraperlib

[zimscraperlib Python library](https://pypi.org/project/zimscraperlib) is used for ZIM operations.