benoit74
a90e1f40e6
Upgrade dependencies especially zimscraperlib 5.1.1
2025-02-17 08:49:55 +00:00
benoit74
c590b28ef5
Remove unused code/tests + 'modernize'
2025-02-06 13:42:32 +00:00
benoit74
4c584cab75
Upgrade dependencies - Python 3.13
2025-02-03 14:41:48 +00:00
benoit74
606c4e5cbb
Use scraperlib 5.0.0rc3
2025-01-07 16:16:45 +00:00
benoit74
1218df0560
Adapt to zimscraperlib 5.0.0 - including all rewriting logic moved there - and upgrade other dependencies
2025-01-07 15:53:33 +00:00
benoit74
29307e6b69
Upgrade dependencies, including wombat 3.8.2
2024-10-08 11:27:22 +00:00
dependabot[bot]
72bb166491
Bump pyright in the production-dependencies group
...
Bumps the production-dependencies group with 1 update: [pyright](https://github.com/RobertCraigie/pyright-python).
Updates `pyright` from 1.1.378 to 1.1.379
- [Release notes](https://github.com/RobertCraigie/pyright-python/releases)
- [Commits](https://github.com/RobertCraigie/pyright-python/compare/v1.1.378...v1.1.379)
---
updated-dependencies:
- dependency-name: pyright
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: production-dependencies
...
Signed-off-by: dependabot[bot] <support@github.com>
2024-09-04 17:33:34 +00:00
dependabot[bot]
f0b3a23e26
Bump the production-dependencies group with 3 updates
...
Bumps the production-dependencies group with 3 updates: [lxml](https://github.com/lxml/lxml), [ruff](https://github.com/astral-sh/ruff) and [pyright](https://github.com/RobertCraigie/pyright-python).
Updates `lxml` from 5.2.2 to 5.3.0
- [Release notes](https://github.com/lxml/lxml/releases)
- [Changelog](https://github.com/lxml/lxml/blob/master/CHANGES.txt)
- [Commits](https://github.com/lxml/lxml/compare/lxml-5.2.2...lxml-5.3.0)
Updates `ruff` from 0.5.7 to 0.6.3
- [Release notes](https://github.com/astral-sh/ruff/releases)
- [Changelog](https://github.com/astral-sh/ruff/blob/main/CHANGELOG.md)
- [Commits](https://github.com/astral-sh/ruff/compare/0.5.7...0.6.3)
Updates `pyright` from 1.1.375 to 1.1.378
- [Release notes](https://github.com/RobertCraigie/pyright-python/releases)
- [Commits](https://github.com/RobertCraigie/pyright-python/compare/v1.1.375...v1.1.378)
---
updated-dependencies:
- dependency-name: lxml
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: production-dependencies
- dependency-name: ruff
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: production-dependencies
- dependency-name: pyright
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: production-dependencies
...
Signed-off-by: dependabot[bot] <support@github.com>
2024-09-03 13:30:15 +00:00
benoit74
ab68df2178
Upgrade dependencies before release
2024-08-09 07:07:51 +00:00
benoit74
aacfa17ca5
Add support for SVG favicon
2024-08-05 10:17:44 +00:00
benoit74
9cc6c6867a
Retrieve content date and WARC software from WARC files
2024-08-02 14:03:36 +00:00
benoit74
a38fdc7b9f
Move rules to YAML instead of JSON to add support for inline comments
2024-06-25 12:59:59 +00:00
benoit74
e17a0d0db9
Upgrade dependencies before release
2024-06-18 13:10:30 +00:00
benoit74
b1c8a35212
Decode content bytes only with supplied charset or static list of charsets to try
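A minimal sketch of the approach this commit names, not the actual warc2zim code. The helper name `decode_content` and the fallback list are assumptions for illustration; the list really used by the scraper is not shown in this log.

```python
from __future__ import annotations


def decode_content(
    content: bytes,
    charset: str | None = None,
    fallback_charsets: tuple[str, ...] = ("utf-8", "iso-8859-1"),  # illustrative list
) -> str:
    """Decode with the supplied charset, else try a fixed list in order."""
    if charset:
        # A supplied charset is used as-is; no detection or guessing happens.
        return content.decode(charset, errors="replace")
    for candidate in fallback_charsets:
        try:
            return content.decode(candidate)
        except UnicodeDecodeError:
            continue  # try the next charset of the static list
    raise ValueError("no charset of the static list could decode the content")
```

With the illustrative defaults, `decode_content("é".encode("iso-8859-1"))` fails on UTF-8 and then succeeds on the second entry of the list.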
2024-06-17 07:25:11 +00:00
benoit74
4ad6e64749
Upgrade dependencies before release
2024-06-13 07:32:28 +00:00
benoit74
dd88211818
Use hatch-openzim 0.2.1
2024-05-06 10:07:47 +00:00
benoit74
a680a75224
Upgrade dependencies and add lxml explicitly
2024-05-04 18:54:11 +00:00
benoit74
68fc781ef9
Add support for base-href in HTML head
...
- detect base href in HTML pages head
- use it to properly rewrite URLs found in the HTML page
- rewrite the base to remove the href (for simplicity) but keep the
target
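The URL resolution described in the first two bullets can be sketched with the standard library. This is an illustration under assumptions, not the scraper's implementation: `resolve_url` is a hypothetical helper, and the rewriting of the `<base>` tag itself is not covered.

```python
from __future__ import annotations

from urllib.parse import urljoin


def resolve_url(page_url: str, base_href: str | None, link: str) -> str:
    """Resolve a link found in an HTML page, honoring an optional <base href>."""
    # The base href may itself be relative, so resolve it against the page URL first.
    base = urljoin(page_url, base_href) if base_href else page_url
    return urljoin(base, link)


# Without a base href, the link is relative to the page; with one, to the base.
print(resolve_url("https://example.com/a/b.html", None, "img/logo.png"))
print(resolve_url("https://example.com/a/b.html", "/static/", "img/logo.png"))
```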
2024-05-04 18:54:10 +00:00
benoit74
8a483dd693
Adapt CI and README for javascript instructions
2024-04-12 08:29:07 +00:00
benoit74
beca01cd19
Move JS code to dedicated JS module and generate all fuzzy rules
...
- JS code used to setup wombat.js now lives in a dedicated JS subproject
- JS code is compiled by rollup
- Fuzzy rules are defined in a data-driven JSON file
- This JSON file is used to generate both the Python and JS code that will
use them
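A rough sketch of the data-driven idea: one JSON file, two generated sources. The rule schema (`pattern`, `replace`), the sample rule, and the function names are all assumptions; the real file layout and generator are not shown in this log.

```python
import json

# Hypothetical rules file content; the real schema may differ.
RULES_JSON = r"""
[
  {"pattern": "^https?://example\\.com/watch\\?v=([^&]+).*$",
   "replace": "example.com/watch?v=\\1"}
]
"""


def render_python(rules: list) -> str:
    """Emit a Python module defining the rules as a list of dicts."""
    lines = ["FUZZY_RULES = ["]
    lines += [f"    {rule!r}," for rule in rules]
    lines.append("]")
    return "\n".join(lines) + "\n"


def render_js(rules: list) -> str:
    """Emit a JS module; JSON text is valid JS literal syntax."""
    return "export const fuzzyRules = " + json.dumps(rules, indent=2) + ";\n"


rules = json.loads(RULES_JSON)
print(render_python(rules))
print(render_js(rules))
```

A real generator would also have to translate replacement syntax between the two languages (for instance `\1` in Python versus `$1` in JS), which this sketch leaves out.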
2024-04-12 08:24:50 +00:00
benoit74
3de8ab8108
Add extensive documentation about warc2zim architecture
2024-04-12 06:23:15 +00:00
benoit74
f20d331958
Remove useless hatchling dependency now that we are using hatch-openzim plugin
2024-03-18 09:49:32 +00:00
benoit74
0e7ed6186b
Add all static files in wheel, even if ignored in .gitignore
...
While some files are in .gitignore because we do not want to commit
them, we still need them in the wheel. This is typically the case for
wombat.js: it is downloaded at the build stage and must be present inside
the wheel because it is essential to the scraper, but it must never be
committed because we want to retrieve it at build time.
2024-03-07 07:54:31 +00:00
benoit74
61cbc49a87
Adopt hatch-openzim plugin
2024-03-01 13:21:32 +00:00
benoit74
40f22b2398
Upgrade dependencies and Python version
2024-03-01 13:21:32 +00:00
Matthieu Gautier
5d8782dcdd
Properly decode content.
...
- First try to use the declared encoding in headers (if available).
- Then search for encoding declared at the beginning of content.
- Finally use chardet to detect the content encoding.
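The three steps above can be sketched as follows. This is a sketch under assumptions, not the committed code: `decode`, `try_decode`, and the regular expression are hypothetical, and the detector is passed in as a callable (for example `lambda b: chardet.detect(b)["encoding"]`) so the block runs without third-party packages.

```python
from __future__ import annotations

import re
from typing import Callable

# Matches charset=... or encoding="..." declarations (HTML meta, XML prolog).
DECLARED_ENCODING_RE = re.compile(
    rb"""(?:charset|encoding)\s*=\s*["']?([A-Za-z0-9_.:-]+)""", re.IGNORECASE
)


def try_decode(content: bytes, encoding: str | None) -> str | None:
    """Return decoded text, or None if the encoding is missing, unknown or wrong."""
    if not encoding:
        return None
    try:
        return content.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        return None


def decode(
    content: bytes,
    header_encoding: str | None = None,
    detect: Callable[[bytes], str | None] | None = None,
) -> str:
    # 1. Encoding declared in the headers, if available.
    text = try_decode(content, header_encoding)
    if text is not None:
        return text
    # 2. Encoding declared at the beginning of the content.
    match = DECLARED_ENCODING_RE.search(content[:1024])
    if match:
        text = try_decode(content, match.group(1).decode("ascii"))
        if text is not None:
            return text
    # 3. Detection (chardet in the commit; any callable in this sketch).
    if detect is not None:
        text = try_decode(content, detect(content))
        if text is not None:
            return text
    # Last resort added for this sketch only: never fail, replace bad bytes.
    return content.decode("utf-8", errors="replace")
```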
2024-02-15 14:49:22 +01:00
Matthieu Gautier
baafc75093
Disable "bytes type promotions" in pyright.
...
pyright automatically "promotes" the `bytes` type to `bytes|memoryview`.
So `str|bytes` is promoted to `str|bytes|memoryview`, and the following
check does not narrow the type of the `bytes_or_text` variable to `str`,
because the `memoryview` case is not covered.
```python
if isinstance(bytes_or_text, bytes):
    bytes_or_text = bytes_or_text.decode()
```
With bytes type promotion disabled, all cases are covered and
`bytes_or_text` is narrowed to `str`.
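For illustration, here is the same pattern as a complete function. `to_text` is a hypothetical name, and the sketch assumes the promotion is turned off through pyright's `disableBytesTypePromotions` setting. The runtime behavior is identical either way; only the type checker's view of the return value changes.

```python
from __future__ import annotations


def to_text(bytes_or_text: str | bytes) -> str:
    """Return text, decoding bytes with the default UTF-8 codec."""
    if isinstance(bytes_or_text, bytes):
        bytes_or_text = bytes_or_text.decode()
    # With promotion enabled, the checker still sees a possible memoryview here
    # and rejects the return; with promotion disabled, only str remains.
    return bytes_or_text


print(to_text(b"abc"))
print(to_text("abc"))
```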
2024-02-07 13:58:38 +01:00
benoit74
ac0461853e
Fix Python package keywords
2024-01-26 13:56:23 +01:00
benoit74
40d9595cae
Adopt Python bootstrap and fix all code quality issues
2024-01-26 13:56:20 +01:00