Add much documentation about warc2zim architecture

benoit74 2024-04-05 15:03:15 +00:00
parent 1f93873b1f
commit 3de8ab8108
5 changed files with 218 additions and 28 deletions

@ -38,7 +38,9 @@ warc2zim --help
deactivate # unloads virtualenv from shell
```
## URL Filtering
## Usage
### URL Filtering
By default, all URLs found in the WARC files are included unless the `--include-domains`/ `-i` flag is set.
@ -66,37 +68,17 @@ If main page is on a subdomain, `https://subdomain1.example.com/` and only URLs
warc2zim myarchive.warc --name myarchive -i subdomain1.example.com -i subdomain2.example.com -u https://subdomain1.example.com/starting/page.html
```
## Custom CSS
### Custom CSS
`--custom-css` allows passing a URL or a path to a CSS file that gets added to the ZIM and gets included on **every HTML article** at the very end of `</head>` (if it exists).
### Other options
See `warc2zim -h` for other options.
## Documentation
## ZIM Entry Layout
The WARC to ZIM conversion is performed by splitting the WARC (and HTTP) headers from the payload.
For `response` records, the WARC + HTTP headers are stored under `H/<url>` while the payload is stored under `A/<url>`
For `resource` records, the WARC headers are stored under `H/<url>` while the payload is stored under `A/<url>`. (There are no HTTP headers for resource records).
For `revisit` records, the WARC + optional HTTP headers are stored under `H/<url>`, while no payload record is created.
If the payload `A/<url>` is zero-length, the record is omitted to conform to ZIM specifications of not storing empty records.
## Duplicate URIs
WARCs allow multiple records for the same URL, while ZIM does not. As a result, only the first encountered response or resource record is stored in the ZIM,
and subsequent records are ignored.
For revisit records, they are only added if pointing to a different URL, and are processed after response/revisit records. A revisit record to the same URL
will always be ignored.
All other WARC records are skipped.
We have documentation about the [functional architecture](docs/functional_architecture.md), the [technical architecture](docs/technical_architecture.md) and the [software architecture](docs/software_architecture.md).
## Contributing

@ -0,0 +1,98 @@
# Functional architecture
## Foreword
At a high level, warc2zim is a piece of software capable of transforming a set of WARC files into one ZIM file. From a functional point of view, it is hence a "format converter".
While warc2zim is typically used as a sub-component of zimit, where WARC files are produced by Browsertrix crawler, it does not depend on this and can process any WARC file adhering to the standard.
This documentation describes the main functions performed by the warc2zim codebase. It is important to note that these functions are not separated by clear boundaries inside the codebase.
## ZIM storage
While storing the web resources in the ZIM is mostly straightforward (we just transfer the raw bytes, after some modification for URL rewriting if needed), the choice of the path where each resource will be stored is very important.
This choice is purely a convention, although the ZIM specification has to be respected for proper operation in readers.
This function is responsible for computing the ZIM path where a given web resource is going to be stored.
While the URL is the only driver of this computation for now, warc2zim might have to consider other contextual data in the future. E.g. the resource to serve might be dynamic, depending not only on URL query parameters but also on header values.
## Fuzzy rules
Unfortunately, it is not always possible or desirable to store the resource with a simple transformation.
A typical situation is that some query parameters are dynamically computed by JavaScript code to include a user tracking identifier, the current date and time, etc.
When the same JavaScript code runs again inside the ZIM, the URL will hence be slightly different because the context has changed, but the same content needs to be retrieved.
warc2zim hence relies on fuzzy rules to transform/simplify some URLs when computing the ZIM path.
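To make the idea concrete, here is a minimal sketch of what a fuzzy rule could look like, assuming a rule is a regular expression paired with a replacement. The rule itself (an `example.com/video` URL where only the `id` parameter matters) and the names `FUZZY_RULES` and `apply_fuzzy_rules` are hypothetical; they are not taken from the warc2zim codebase.

```python
import re

# Hypothetical rule, for illustration only: for this imaginary site, only the
# "id" query parameter identifies the content; timestamps and tracking
# identifiers are dropped.
FUZZY_RULES = [
    (re.compile(r"^(example\.com/video)\?(?:.*&)?(id=[^&]+).*$"), r"\1?\2"),
]


def apply_fuzzy_rules(path: str) -> str:
    """Return the simplified path, or the path unchanged if no rule matches."""
    for pattern, replacement in FUZZY_RULES:
        if pattern.match(path):
            return pattern.sub(replacement, path)
    return path


# Two requests differing only by volatile parameters map to the same path.
first = apply_fuzzy_rules("example.com/video?ts=1712329395&id=42&uid=abc")
second = apply_fuzzy_rules("example.com/video?id=42&ts=1712330000")
```

Both calls are expected to yield the same simplified path, which is what allows the content stored during the crawl to be found again when the URL is recomputed later.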
## URL Rewriting
warc2zim transforms (rewrites) URLs found in documents (HTML, CSS, JS, ...) so that they are usable inside the ZIM.
### General case
One simple example is that we might have the following code in an HTML document to load an image with an absolute URL:
```
<img src="https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg"></img>
```
The URL `https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg` has to be transformed into a URL that is usable inside the ZIM.
For proper reader operation, openZIM prohibits using absolute URLs, so this has to be a relative URL. This relative URL hence depends on the location of the resource currently being rewritten.
The table below gives some examples of what the rewritten URL is going to be, depending on the URL of the rewritten document.
| HTML document URL | image URL rewritten for usage inside the ZIM |
|--|--|
| `https://en.wikipedia.org/wiki/Kiwix` | `./File:Kiwix_logo_v3.svg` |
| `https://en.wikipedia.org/wiki` | `./wiki/File:Kiwix_logo_v3.svg` |
| `https://en.wikipedia.org/waka/Kiwix` | `../wiki/File:Kiwix_logo_v3.svg` |
| `https://fr.wikipedia.org/wiki/Kiwix` | `../../en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg` |
As can be seen on the last line (but this is true for all URLs), this rewriting has to take into account the convention saying at which ZIM path a given web resource will be stored.
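The rows of the table can be reproduced with a short Python sketch. It assumes a simplified convention where the ZIM path is just the hostname followed by the URL path; the helper names `simple_zim_path` and `relative_url` are made up for this illustration and the real computation handles more cases (query parameters, encoding, fuzzy rules).

```python
import posixpath
from urllib.parse import urlsplit


def simple_zim_path(url: str) -> str:
    """Simplified convention (assumption): hostname followed by the URL path."""
    parts = urlsplit(url)
    return f"{parts.hostname}{parts.path}"


def relative_url(document_url: str, target_url: str) -> str:
    """Relative URL from the rewritten document to the target resource."""
    # A leading "/" makes both paths absolute so the current working
    # directory plays no role in the computation.
    document_dir = posixpath.dirname("/" + simple_zim_path(document_url))
    target = "/" + simple_zim_path(target_url)
    relative = posixpath.relpath(target, start=document_dir)
    # Prefix with "./" so that a first segment containing ":" (as in
    # "File:...") cannot be mistaken for a URL scheme.
    return relative if relative.startswith("../") else "./" + relative


image = "https://en.wikipedia.org/wiki/File:Kiwix_logo_v3.svg"
for document in (
    "https://en.wikipedia.org/wiki/Kiwix",
    "https://en.wikipedia.org/wiki",
    "https://en.wikipedia.org/waka/Kiwix",
    "https://fr.wikipedia.org/wiki/Kiwix",
):
    print(document, "->", relative_url(document, image))
```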
### Dynamic case
The explanation above more or less assumed that the transformations can be done statically, i.e. warc2zim can open every known document, find existing URLs and replace them with their counterpart inside the ZIM.
While this is typically possible for HTML and CSS documents, it is not possible when the URL is dynamically computed. This is typically the case for JS documents, where in the general case the URL is not statically stored inside the JS code but computed on-the-fly by aggregating various strings and values.
Rewriting these computations is not deemed feasible due to the huge variety of situations which might be encountered.
A specific function is hence needed to rewrite URLs **live in the client browser**: it intercepts any function triggering a web request and transforms the URL according to the conventions (where we expect the resource to be located in the general case) and the fuzzy rules.
_Spoiler: this is where we will rely on wombat.js from webrecorder team, since this dynamic interception is quite complex and already done quite neatly by them_
### Fuzzy rules
The same fuzzy rules that have been used to compute the ZIM path from a resource URL have to be applied again when rewriting URLs.
While this is expected to serve mostly the dynamic case, we still apply them on both sides (statically and dynamically) for consistency.
## Content rewriting
### DS rules
DS (Domain Specific) rules patch JavaScript code using regular expression matching. These rules are not related to URLs or paths.
They exist to patch the JavaScript of specific sites (domains) to make it work in our context.
What they do (and how they were created) is still unclear and undocumented: they have been transferred as-is from the wabac codebase.
For instance, one identified use case is removing a test on video resolution in the YouTube player.
Something like transforming `Oq&&(a.Uo=SC(a.Uo,Oq))}"0"==b.dash&&(a.FB=!0);var sm=b.dashmpd;` to `Oq&&(a.Uo=SC(a.Uo,Oq))}1&&(a.FB=!0);var sm=b.dashmpd;` (in the middle of fully minified JS code).
These DS rules are applied to JS and JSON files. They are not supposed to manipulate any URL.
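As an illustration, the transformation quoted above could be obtained with a regular expression substitution such as the one below. The pattern `DASH_CHECK` and the function `patch_player_js` were written for this example only; they are not the rule actually shipped in warc2zim or wabac.

```python
import re

# Hypothetical pattern (assumption): replace the `"0"==<var>.dash&&` test
# with a condition which is always true.
DASH_CHECK = re.compile(r'"0"==\w+\.dash&&')


def patch_player_js(js_code: str) -> str:
    """Apply the illustrative DS rule to a chunk of minified JS."""
    return DASH_CHECK.sub("1&&", js_code)


original = 'Oq&&(a.Uo=SC(a.Uo,Oq))}"0"==b.dash&&(a.FB=!0);var sm=b.dashmpd;'
patched = patch_player_js(original)
```

Note that the substitution is purely textual: the rest of the code, including the nearby `b.dashmpd` access, is left untouched.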
### JSONP
JSONP callbacks are rewritten, as is done in wabac. This is not fully tested for now.
## Documents rewritten
For now warc2zim rewrites HTML, CSS, JSON, JSONP and JS documents. Other types of documents are considered either not feasible / not worth it (e.g. URLs inside PDF documents), meaningless (e.g. images, fonts) or planned for later due to limited usage in the wild (e.g. XML).

@ -0,0 +1,50 @@
# Software architecture
## HTML rewriting
HTML rewriting is purely static (i.e. before resources are written to the ZIM). HTML code is parsed with the [HTML parser from Python standard library](https://docs.python.org/3/library/html.parser.html).
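The sketch below gives a rough idea of static rewriting on top of `html.parser.HTMLParser`. It is a simplified illustration and not the warc2zim implementation: the class name `UrlRewritingParser`, the `rewrite_url` callback and the choice to rewrite only `src` and `href` are assumptions, and comments, doctypes and self-closing tags are not handled.

```python
from html import escape
from html.parser import HTMLParser


class UrlRewritingParser(HTMLParser):
    """Re-emit HTML while passing `src` / `href` values through a callback."""

    URL_ATTRS = {"src", "href"}  # assumption: real code covers more attributes

    def __init__(self, rewrite_url):
        # Keep character references as-is so they can be re-emitted verbatim.
        super().__init__(convert_charrefs=False)
        self.rewrite_url = rewrite_url
        self.output = []

    def handle_starttag(self, tag, attrs):
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without value, e.g. `disabled`
                parts.append(name)
                continue
            if name in self.URL_ATTRS:
                value = self.rewrite_url(value)
            # The parser unescapes attribute values, so escape them again.
            parts.append(f'{name}="{escape(value, quote=True)}"')
        self.output.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.output.append(f"</{tag}>")

    def handle_data(self, data):
        self.output.append(data)

    def handle_entityref(self, name):
        self.output.append(f"&{name};")

    def handle_charref(self, name):
        self.output.append(f"&#{name};")


parser = UrlRewritingParser(lambda url: url.replace("https://example.com/", "./"))
parser.feed('<p>Logo: <img src="https://example.com/logo.png" alt="logo"></p>')
parser.close()
rewritten_html = "".join(parser.output)
```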
A small header script is inserted into the HTML code to initialize wombat.js, which wraps all JS APIs to dynamically rewrite URLs coming from JS.
This header script is generated from a [Jinja2](https://pypi.org/project/Jinja2/) template since it needs to populate some JS context variables needed by wombat.js operations (original scheme, original url, ...).
## CSS rewriting
CSS rewriting is purely static (i.e. before resources are written to the ZIM). CSS code is parsed with the [tinycss2 Python library](https://pypi.org/project/tinycss2/).
## JS rewriting
### Static
Static JS rewriting is simply a matter of pure textual manipulation with regular expressions. No parsing is done at all.
### Dynamic
Dynamic JS rewriting is done with the [wombat JS library](https://github.com/webrecorder/wombat). The same fuzzy rules that are used for static rewriting are injected into the wombat configuration. Code to rewrite URLs is an adapted version of the code used to compute ZIM paths.
## cdxj_indexer and warcio
[cdxj_indexer Python library](https://pypi.org/project/cdxj-indexer/) is a thin wrapper over [warcio Python library](https://pypi.org/project/warcio/). It is used to iterate over all records in WARCs.
It provides two main features:
- Loop over several WARCs in a directory (a visit of a website may be stored in several WARCs in the same directory).
- Provide buffered access to WARC content (rather than a "stream" (fileio) only API), by monkey patching the returned WarcRecord.
Apart from that, the scraper directly uses WarcRecord (returned by cdxj_indexer, implemented in warcio) to access metadata and such.
## chardet
[chardet Python library](https://pypi.org/project/chardet/) is used to detect the character encoding of files when it is absent (typically only HTML files specify their encoding) or inconsistent.
## zimscraperlib
[zimscraperlib Python library](https://pypi.org/project/zimscraperlib) is used for ZIM operations.
## requests
[requests Python library](https://pypi.org/project/requests/) is used to retrieve the custom CSS file when a URL is passed.
## brotlipy
[brotlipy Python library](https://pypi.org/project/brotlipy/) is used to access brotli content in WARC records (not part of warcio because it is an optional dependency).

@ -0,0 +1,62 @@
# Technical architecture
## High level overview
The scraper operates in two phases.
First, the WARC records are iterated to compute the ZIM metadata (find main path, favicon, ...) and detect which ZIM paths are expected to be populated. This is required so that, when documents are rewritten, we know whether each URL encountered leads to something internal (inside the ZIM), which should be rewritten, or external, which should be kept as-is.
Second, the WARC records are iterated again to be transformed and added to the ZIM. ZIM records are appended to the ZIM on the fly.
In both phases, WARC records are iterated in natural order, i.e. as they have been retrieved online during the crawl.
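The two-phase logic can be sketched as follows, using plain tuples in place of WARC records. Everything here is a simplification made for illustration: `to_path` only strips the scheme, and `plan_rewrites` and the `(url, links)` record shape are hypothetical names, not part of warc2zim.

```python
def to_path(url: str) -> str:
    """Very rough stand-in for the ZIM path computation (assumption)."""
    return url.split("://", 1)[-1]


def plan_rewrites(records):
    """records: list of (url, links) tuples, in crawl order."""
    # Phase 1: collect every ZIM path which is expected to be populated.
    expected_paths = {to_path(url) for url, _links in records}

    # Phase 2: iterate again, in the same order, and decide for each link
    # whether it is internal (to be rewritten) or external (kept as-is).
    plan = []
    for url, links in records:
        decisions = [(link, to_path(link) in expected_paths) for link in links]
        plan.append((to_path(url), decisions))
    return plan


records = [
    ("https://example.com/", ["https://example.com/about", "https://other.org/"]),
    ("https://example.com/about", []),
]
plan = plan_rewrites(records)
```

The first phase is what makes it possible to classify the link to `https://example.com/about` as internal even though that record only comes later in the crawl.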
## Transformation of URL into ZIM path
Transforming a URL into a ZIM path has to respect the ZIM specification: the path must not be url-encoded (i.e. it must be decoded) and it must be stored as UTF-8.
A WARC record stores the item's URL inside a header named "WARC-Target-URI". The value of this header is encoded, or more exactly it is "exactly what the browser sent at the HTTP level" (see https://github.com/webrecorder/browsertrix-crawler/issues/492 for more details).
It has been decided (by convention) that we drop the scheme, the port, the username and the password from the URL. Headers are also not considered in this computation.
Computation of the ZIM path is hence mostly straightforward:
- decode the hostname, which is puny-encoded
- decode the path and query parameters, which might be url-encoded
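A minimal sketch of this computation with the Python standard library could look like the following. The function name `url_to_zim_path` is an assumption, and edge cases (missing hostname, fuzzy rules, unusual encodings) are deliberately ignored.

```python
from urllib.parse import unquote, urlsplit


def url_to_zim_path(url: str) -> str:
    """Simplified ZIM path computation (illustration only)."""
    parts = urlsplit(url)
    # `hostname` already excludes the port, the username and the password;
    # the scheme is simply not used.
    hostname = parts.hostname.encode("ascii").decode("idna")
    path = unquote(parts.path)
    if parts.query:
        path += "?" + unquote(parts.query)
    return hostname + path


zim_path = url_to_zim_path(
    "https://user:secret@xn--exmple-cva.com:8443/a/resource/image.png?foo=bar"
)
```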
## URL rewriting
In addition to the computation of the relative path from the current document URL to the URL to rewrite, URL rewriting also consists of computing the proper ZIM path (with the same operation as above) and properly encoding it so that the resulting URL respects [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986). Two important points have to be noted about this encoding:
- since the original hostname is now part of the path, it is now url-encoded
- since the `?` and the following query parameters are also part of the path (we do not want readers to drop them, as kiwix-serve would), they are also url-encoded
Below is an example case of the rewrite operation on an image URL found in an HTML document.
- Document original URL: `https://kiwix.org/a/article/document.html`
- Document ZIM path: `kiwix.org/a/article/document.html`
- Image original URL: `//xn--exmple-cva.com/a/resource/image.png?foo=bar`
- Image rewritten URL: `../../../ex%C3%A9mple.com/a/resource/image.png%3Ffoo%3Dbar`
- Image ZIM Path: `exémple.com/a/resource/image.png?foo=bar`
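The example above can be checked with a few lines of Python, starting from the two ZIM paths. This is only a sketch under simplifying assumptions: the function name `rewritten_url` is made up, and paths containing `.`/`..` segments or repeated slashes would need more care.

```python
import posixpath
from urllib.parse import quote


def rewritten_url(document_zim_path: str, target_zim_path: str) -> str:
    """Relative, url-encoded URL from a document to a target ZIM path."""
    # A leading "/" makes both paths absolute so the current working
    # directory plays no role in the computation.
    document_dir = posixpath.dirname("/" + document_zim_path)
    relative = posixpath.relpath("/" + target_zim_path, start=document_dir)
    # Encode everything except "/" so that "?" and "=" stay part of the path.
    return quote(relative, safe="/")


url = rewritten_url(
    "kiwix.org/a/article/document.html",
    "exémple.com/a/resource/image.png?foo=bar",
)
```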
## Different kinds of WARC records
The WARC to ZIM conversion is performed by transforming WARC records into ZIM records.
For `response` records, the rewritten payload (only, without HTTP headers) is stored inside the ZIM.
If the payload is zero-length, the record is omitted to conform to ZIM specifications of not storing empty records.
`request` and `resource` records are simply ignored. These records do not convey important information for now.
**TODO** better explain what `request` and `resource` records are and why they might point to a different URL.
For `revisit` records, a ZIM alias is created if the revisit points to a different URL.
**TODO** better explain what `revisit` records are and why they might point to a different URL.
## Duplicate URIs
WARCs allow multiple records for the same URL, while ZIM does not. As a result, only the first encountered response or resource record is stored in the ZIM, and subsequent records are ignored.
Revisit records are only added as a ZIM alias if they point to a different URL, and are processed after response records. A revisit record pointing to the same URL is always ignored.
All other WARC records are skipped.
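These rules can be summarized with the sketch below, which works on plain dictionaries rather than real WARC records. The keys `type`, `path`, `payload` and `target_path`, as well as the function name `plan_zim_entries`, are hypothetical. Skipping an alias whose path is already taken is an assumption made here to keep ZIM paths unique.

```python
def plan_zim_entries(records):
    """Return (entries, aliases) planned for the ZIM from simplified records."""
    entries = {}   # ZIM path -> payload
    revisits = []  # kept aside, processed after all response records

    for record in records:
        if record["type"] == "response":
            if not record["payload"]:
                continue  # zero-length payloads are not stored
            # Only the first record for a given path is kept.
            entries.setdefault(record["path"], record["payload"])
        elif record["type"] == "revisit":
            revisits.append(record)
        # All other record types are skipped.

    aliases = {}   # alias ZIM path -> target ZIM path
    for record in revisits:
        path, target = record["path"], record["target_path"]
        if path == target:
            continue  # a revisit to the same URL is always ignored
        if path in entries or path in aliases:
            continue  # assumption: never reuse an existing path
        aliases[path] = target

    return entries, aliases


entries, aliases = plan_zim_entries([
    {"type": "response", "path": "example.com/a", "payload": b"first"},
    {"type": "response", "path": "example.com/a", "payload": b"second"},
    {"type": "revisit", "path": "example.com/b", "target_path": "example.com/a"},
    {"type": "revisit", "path": "example.com/a", "target_path": "example.com/a"},
    {"type": "request", "path": "example.com/c"},
])
```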

@ -10,12 +10,10 @@ readme = "README.md"
dependencies = [
"warcio==1.7.4",
"requests==2.31.0",
"beautifulsoup4==4.9.3",
"zimscraperlib==3.3.1",
"Babel==2.14.0",
"jinja2==3.1.3",
"chardet==5.2.0",
# to support possible brotli content in warcs
# to support possible brotli content in warcs, must be added separately
"brotlipy==0.7.0",
"cdxj_indexer==1.4.5",
"tinycss2==1.2.1",