Commit graph

112 commits

Author SHA1 Message Date
benoit74
1218df0560
Adapt to zimscraperlib 5.0.0 - including all rewriting logic moved there - and upgrade other dependencies 2025-01-07 15:53:33 +00:00
benoit74
3c7363f050
Fix type hints and add CHANGELOG
Note: the two on content and mimetype are just linked to
https://github.com/openzim/python-scraperlib/issues/196
and will have to be reverted once this issue is fixed
2024-10-08 10:02:43 +00:00
benoit74
93c866d6bd
Revisit retrieve_illustration logic to prefer best favicons unless user
provided a favicon to use.

Instead of preferring to use WARC items (or preferring to download as it
was before #202), we prefer to use the most suitable favicon.

Potential favicons are sourced from main HTML page.

All favicons are retrieved either from the WARC or downloaded to inspect
their sizes.

We use the most suitable one (i.e. 48x48 or bigger if possible, otherwise
the biggest one).

We still fall back to the default ZIM illustration if no favicon is found,
to avoid losing all the time spent crawling the website.
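
The selection rule described in this message could be sketched roughly as below. This is only an illustration, not the code from the commit: the `Favicon` type and the `pick_favicon` function are hypothetical names, and the tie-break among icons that are already 48x48 or bigger (here, the smallest such icon) is an assumption, since the message does not specify it.

```python
from typing import NamedTuple, Optional, Sequence

MIN_SIZE = 48  # target size mentioned in the commit message


class Favicon(NamedTuple):
    """Hypothetical record for a candidate favicon whose size was inspected."""

    url: str
    width: int
    height: int


def pick_favicon(candidates: Sequence[Favicon]) -> Optional[Favicon]:
    """Return the most suitable favicon, or None so the caller can fall back."""
    if not candidates:
        # Caller is expected to use the default ZIM illustration in this case.
        return None
    big_enough = [
        c for c in candidates if c.width >= MIN_SIZE and c.height >= MIN_SIZE
    ]
    if big_enough:
        # Assumption: among icons that are large enough, take the smallest one.
        return min(big_enough, key=lambda c: c.width * c.height)
    # No icon reaches 48x48: take the biggest one available.
    return max(candidates, key=lambda c: c.width * c.height)
```

Returning `None` rather than raising keeps the fallback to the default illustration in the caller, which matches the intent of not wasting a completed crawl.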
2024-08-07 12:14:59 +00:00
benoit74
217769e9f3
Adapt to scraperlib 4.0.0 changes 2024-08-05 10:28:35 +00:00
benoit74
9cc6c6867a
Retrieve content date and WARC software from WARC files 2024-08-02 14:03:36 +00:00
benoit74
ffcdfe38df
Replace os.path with pathlib 2024-08-02 12:27:09 +00:00
benoit74
79dfb059f7
Add --encoding-aliases setting to allow alias to Python charsets 2024-08-02 12:27:09 +00:00
benoit74
861a15db5f
Remove double slash in URLs, both in Python and JS 2024-07-30 13:53:47 +00:00
benoit74
219d252e25
Handle case where the redirect target is bad 2024-07-26 05:08:25 +00:00
benoit74
ecb5a1f0e2
Exit with cleaner message when main entry is not processable 2024-06-27 06:31:53 +00:00
benoit74
3997113c2d
Exit with cleaner message when no entries are expected in the ZIM 2024-06-27 06:31:52 +00:00
benoit74
e23415fa8b
Fix the return value of get_rewrite_mode (including new test) 2024-06-13 07:40:05 +00:00
benoit74
2298379301
Properly detect nested redirection loops 2024-05-28 09:10:44 +00:00
benoit74
d6968750d6
Merge branch 'warc2zim2' 2024-05-24 13:53:12 +00:00
benoit74
97e2321b8b
Tags passed as single string with values separated by semicolons 2024-05-24 08:30:36 +00:00
benoit74
1c38fe65d3
Fix and simplify custom CSS injection
- pass the CSS as stuff to be rendered (we need to compute the relative
  ZIM path based on current page location)
- directly create a ZIM record instead of faking a WARC record
  (simplification and less error-prone)
- store custom CSS at `_zim_static/custom.css` instead of magic URL `warc2zim.kiwix.app/custom.css`
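
As an illustration of the first point, a relative link from the current page to `_zim_static/custom.css` might be computed as in the sketch below. The helper name `relative_zim_path` is hypothetical and this is not the implementation from the commit; it only assumes that ZIM paths behave like POSIX-style relative paths.

```python
import posixpath


def relative_zim_path(from_page: str, to_entry: str) -> str:
    """Relative path leading from the page at `from_page` to `to_entry`.

    Both arguments are ZIM paths without a leading slash.
    """
    # A page at the ZIM root has an empty dirname, so start from "." instead.
    page_dir = posixpath.dirname(from_page) or "."
    return posixpath.relpath(to_entry, start=page_dir)


# A page two levels deep needs two "../" segments to reach the static folder.
print(relative_zim_path("example.com/a/page.html", "_zim_static/custom.css"))
# A page at the root links to the stylesheet directly.
print(relative_zim_path("index.html", "_zim_static/custom.css"))
```

Because the result depends on the depth of the current page, the link has to be computed at render time for each page rather than stored as a fixed string.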
2024-05-21 11:11:17 +00:00
benoit74
e944c325d5
Remove special handling of hostname failure in url rewriting
Now that URL rewriting errors are handled at a higher level, we do not
need special handling for hostname rewriting issues; all URL rewriting
issues have to be handled the same way
2024-05-14 12:34:34 +00:00
benoit74
27baaa23d2
Handle HTTP return codes properly 2024-05-04 10:16:55 +00:00
Dan Niles
02b9591f8a
Add check for invalid ZIM file names 2024-04-18 10:06:27 +00:00
Jairaj Mahadev
917d2e7bc2
Enhance checks regarding availability of output folder 2024-04-18 09:51:27 +00:00
benoit74
395ae95368
Fix detection and rewriting of JS modules 2024-04-12 08:28:59 +00:00
benoit74
63bf7a9fd9
Rework transformation of WARC record url to ZIM path and URL normalization to support encoded URL and query strings 2024-03-18 14:03:14 +00:00
benoit74
3e4a0a6f90
Add kiwix.org test case and fix favicon assertions 2024-03-07 07:30:43 +00:00
benoit74
8120a1f559 Only process response and revisits, ignore resources 2024-03-04 15:33:34 +01:00
Matthieu Gautier
7dcb552422 Make URL normalize not take bytes as input. 2024-02-15 14:49:22 +01:00
benoit74
39efc42405
Fix scraper metadata when no suffix is passed 2024-02-10 17:46:09 +01:00
benoit74
2dd2375463
Enhance wording about return code 2024-01-31 15:18:54 +01:00
benoit74
43e98433f0
Allow to add a ZIM scraper suffix via CLI argument 2024-01-31 14:00:51 +01:00
benoit74
458c6fd622
Fix type hint 2024-01-26 13:56:26 +01:00
benoit74
0a1cf355d7
Factor out getting ArcWarcRecord content 2024-01-26 13:56:25 +01:00
benoit74
6da8be7046
Make exception check more precise 2024-01-26 13:56:24 +01:00
benoit74
01df4c7b8c
Really assert on exception + use value to get exception code 2024-01-26 13:56:24 +01:00
benoit74
853d7b5410
Make test more precise 2024-01-26 13:56:24 +01:00
benoit74
40d9595cae
Adopt Python bootstrap and fix all code quality issues 2024-01-26 13:56:20 +01:00
Matthieu Gautier
53b3463d23 Remove tag _sw:yes 2024-01-25 14:17:31 +01:00
Matthieu Gautier
1cdc8634ba Move all "constant" code from the template into a static JS file. 2023-12-22 16:11:32 +01:00
Matthieu Gautier
f88b4b1be7 Add wombat.js
`wombat.js` is the dist version built from source at:
openzim/wombat/commit/9979720ae739cc0f7ec39a3cf3677a5d9b4280f5
(patch from upstream wombat)
2023-12-22 15:58:42 +01:00
Matthieu Gautier
9a605c9c61 Introduce JSRewriter.
Algorithm (and tests) are heavily inspired by (not to say copied from)
webrecorder/wabac.js/blob/8cc4755/src/rewrite/jsrewriter.js
(and webrecorder/wabac.js/blob/8cc4755/test/rewriteJS.js)
2023-12-22 15:58:42 +01:00
Matthieu Gautier
d353cf95ae Re-add head and CSS insert. 2023-12-19 11:43:02 +01:00
Matthieu Gautier
4e8aa7da6f Rewrite URLs in HTML content.
Head insert has been (temporarily) removed (to be re-added in next commit).
2023-12-19 11:43:01 +01:00
Matthieu Gautier
e86f2b75a3 Deactivate searching for files to add in the templates directory.
We don't have files to add (and so, no directory).
2023-12-18 16:18:27 +01:00
Matthieu Gautier
7355b80093 Store static content in a _zim_static/ subdir instead of A/.
We don't have anything now in `A/` or `H/` subdirs.
Remove the leftover `A/` in test URLs (it was working thanks to libzim's
compatibility layer)
2023-12-18 16:18:27 +01:00
Matthieu Gautier
1113c1a693 Properly set the main page. 2023-12-18 16:18:27 +01:00
Matthieu Gautier
09f55eba86 Do not add service worker stuff in zim file.
Remove other unnecessary files.
2023-12-18 16:18:27 +01:00
Matthieu Gautier
1cacf88d5b Store revisits as alias instead of WARCHeadersItem.
We don't need `WARCHeadersItem` anymore.
2023-12-18 16:18:27 +01:00
Matthieu Gautier
54f8cd621e Don't store entries' headers.
We don't use them and we agreed not to store them (at least for now).
If we need them, we will see how to re-add them.

Converted `test/data/video-vimeo.warc.gz` goes from:

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
```

to:

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
```
2023-12-18 16:18:27 +01:00
Matthieu Gautier
e8122deb73 Directly store entries using their potentially reduced path.
Before, we were storing an entry using its full path and potentially
creating a redirect entry (using the reduced path) pointing to the
full-path entry.

Now, path reduction is part of normalization and so we directly store
entries using their (potentially) reduced path.

Converted `test/data/video-vimeo.warc.gz` goes from:

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
```

to:

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
```

Notice that `vod-progressive.akamaized.net` is not present.
It is "replaced" by
`vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4`
which is now a plain entry instead of a redirect to
`vod-progressive.akamaized.net[...]`.
2023-12-18 16:18:27 +01:00
Matthieu Gautier
eaa4fa2ce5 Introduce normalize and normalization schema.
Properly define how we store entries in zim file.

We introduce a `normalize` helper in place of `canonicalize`.

We handle normalization at the converter level, so we pass the path to the
items instead of letting them call `normalize`.

The ZIM converted from `test/data/video-vimeo.warc.gz` contained:

```
A/404.html
A/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
A/f.vimeocdn.com/p/3.45.3/css/player.css
A/f.vimeocdn.com/p/3.45.3/js/player.js
A/i.vimeocdn.com/player/354746.png?mw=200&mh=200
A/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
A/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
A/index.html
A/load.js
A/oembed.link/favicon.ico
A/oembed.link/https://vimeo.com/347119375
A/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
A/sw.js
A/topFrame.html
A/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
H/vimeo.fuzzy.replayweb.page/video/347119375
H/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
```

With this change it contains:

```
A/404.html
A/index.html
A/load.js
A/sw.js
A/topFrame.html
H/f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
H/f.vimeocdn.com/p/3.45.3/css/player.css
H/f.vimeocdn.com/p/3.45.3/js/player.js
H/i.vimeocdn.com/player/354746.png?mw=200&mh=200
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
H/i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
H/oembed.link/favicon.ico
H/oembed.link/https://vimeo.com/347119375
H/player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
H/vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
f.vimeocdn.com/js_opt/modules/utils/vuid.min.js
f.vimeocdn.com/p/3.45.3/css/player.css
f.vimeocdn.com/p/3.45.3/js/player.js
i.vimeocdn.com/player/354746.png?mw=200&mh=200
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d.jpg?mw=80&q=85
i.vimeocdn.com/video/797382244-0106ae13e902e09d0f02d8f404fa80581f38d1b8b7846b3f8e87ef391ffb8c99-d?mw=1280&mh=720&q=70
oembed.link/favicon.ico
oembed.link/https://vimeo.com/347119375
player.vimeo.com/video/347119375?h=1699409fe2&app_id=122963
vimeo-cdn.fuzzy.replayweb.page/01/4423/13/347119375/1398505169.mp4
vimeo.fuzzy.replayweb.page/video/347119375
vod-progressive.akamaized.net/exp=1635528595~acl=%2Fvimeo-prod-skyfire-std-us%2F01%2F4423%2F13%2F347119375%2F1398505169.mp4~hmac=27c31f1990aab5e5429f7f7db5b2dcbcf8d2f5c92184d53102da36920d33d53e/vimeo-prod-skyfire-std-us/01/4423/13/347119375/1398505169.mp4
```
2023-12-18 16:18:27 +01:00
Matthieu Gautier
15567179ed Rename warc2zim to main. 2023-12-08 11:00:49 +01:00
Matthieu Gautier
57da2b27fb Introduce utils.py module to store small helpers. 2023-12-08 11:00:49 +01:00