Zimit
=====
Zimit is a scraper that creates a ZIM file from any website.

[](https://www.codefactor.io/repository/github/openzim/zimit)
[](https://www.gnu.org/licenses/gpl-3.0)
[](https://ghcr.io/openzim/zimit)

Zimit adheres to openZIM's [Contribution Guidelines](https://github.com/openzim/overview/wiki/Contributing).

Zimit has implemented openZIM's [Python bootstrap, conventions and policies](https://github.com/openzim/_python-bootstrap/docs/Policy.md) **v1.0.1**.

Technical background
--------------------
Zimit runs a fully automated, browser-based crawl of a website and produces a ZIM of the crawled content. Zimit runs in a Docker container.

The system:

- runs a website crawl with [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler), which produces WARC files
- converts the crawled WARC files to a single ZIM using [warc2zim](https://github.com/openzim/warc2zim)

`zimit.py` is the entrypoint for the system.

After the crawl is done, warc2zim writes a ZIM to the `/output` directory, which can be mounted as a volume.

With the `--keep` flag, the crawled WARCs are also kept in a temp directory inside `/output`.

Usage
-----
`zimit` is intended to be run in Docker.

To build the image locally, run:
```bash
docker build -t ghcr.io/openzim/zimit .
```
The image accepts the following parameters, **as well as any of the [warc2zim](https://github.com/openzim/warc2zim) ones**, which are useful for setting metadata, for instance:

- `--url URL` - the URL to be crawled (required)
- `--workers N` - number of crawl workers to be run in parallel
- `--waitUntil` - Puppeteer setting for how long to wait for page load. See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options). The default is `load`, but for static sites, `--waitUntil domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load, for example).
- `--name` - name of the ZIM file (defaults to the hostname of the URL)
- `--output` - output directory (defaults to `/output`)
- `--limit U` - limit capture to at most U URLs
- `--exclude <regex>` - skip URLs that match the regex from crawling. Can be specified multiple times. An example is `--exclude="(\?q=|signup-landing\?|\?cid=)"`, where URLs that contain either `?q=` or `signup-landing?` or `?cid=` will be excluded.
- `--scroll [N]` - if set, will activate a simple auto-scroll behavior on each page to scroll for up to N seconds
- `--keep` - if set, keep the WARC files in a temp directory inside the output directory
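
To see what an `--exclude` pattern would skip before starting a crawl, the regex can be tried against a few sample URLs. The sketch below reuses the pattern from the `--exclude` example and checks it with `grep -E`. The sample URLs and the `classify` helper are made up for illustration, and the sketch assumes that `grep -E` treats this simple pattern the same way the crawler does, which may not hold for more complex expressions. With these inputs, only the last URL should be reported as kept.

```bash
#!/usr/bin/env bash
# Pattern copied from the --exclude example in the list above.
exclude='(\?q=|signup-landing\?|\?cid=)'

# Hypothetical helper: prints "excluded" if the URL matches the pattern,
# "kept" otherwise.
# Assumption: grep -E handles this simple pattern like the crawler does.
classify() {
  if printf '%s\n' "$1" | grep -Eq "$exclude"; then
    echo "excluded"
  else
    echo "kept"
  fi
}

# Hypothetical sample URLs, for illustration only.
for url in \
  "https://example.com/search?q=zim" \
  "https://example.com/signup-landing?ref=home" \
  "https://example.com/article?cid=42" \
  "https://example.com/about"
do
  printf '%-8s %s\n' "$(classify "$url")" "$url"
done
```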

The `--shm-size` flag is [needed to run Chrome in Docker](https://github.com/puppeteer/puppeteer/blob/v1.0.0/docs/troubleshooting.md#tips).

Example command:
```bash
docker run ghcr.io/openzim/zimit zimit --help
docker run ghcr.io/openzim/zimit warc2zim --help
docker run -v /output:/output \
    --shm-size=1gb ghcr.io/openzim/zimit zimit --url URL --name myzimfile --workers 2 --waitUntil domcontentloaded
```
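As a further sketch, the script below combines several of the options listed above with a host directory mounted at `/output`, so that the ZIM (and, with `--keep`, the WARCs) end up on the host. The target URL, the ZIM name, the `./output` directory and the `RUN` variable are placeholders chosen for this example and are not part of Zimit. The script only prints the command unless `RUN=1` is set.

```bash
#!/usr/bin/env bash
# Placeholder values for illustration; adjust them to your own site.
url="https://example.com/"
name="example"
out_dir="$PWD/output"

# Store the docker invocation in the positional parameters so that
# quoting is preserved.
# -v maps the host directory to /output, where the ZIM is written;
# --limit caps the crawl at 50 URLs; --keep also keeps the WARC files.
set -- docker run -v "$out_dir:/output" --shm-size=1gb \
  ghcr.io/openzim/zimit zimit \
  --url "$url" --name "$name" --limit 50 --keep

if [ "${RUN:-0}" = "1" ]; then
  mkdir -p "$out_dir"
  "$@"              # start the crawl
  ls -l "$out_dir"  # the ZIM and the kept WARC directory should appear here
else
  echo "$*"         # dry run: only show the command
fi
```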
The puppeteer-cluster provides monitoring output, which is enabled by default and prints the crawl status to the Docker log.

**Note**: The image automatically filters out a large number of ads by using the 3 blocklists from [anudeepND](https://github.com/anudeepND/blacklist). If you don't want this filtering, disable the image's entrypoint in your container (`docker run --entrypoint="" ghcr.io/openzim/zimit ...`).

Nota bene
---------
While Zimit 1.x relied on a Service Worker to display the ZIM content, this is no longer the case since Zimit 2.x, which has no special requirements.

It should also be noted that a first version of a generic HTTP scraper was created in 2016 during the [Wikimania Esino Lario Hackathon](https://wikimania2016.wikimedia.org/wiki/Programme/Kiwix-dedicated_Hackathon).

That version is now considered outdated and is [archived in the `2016` branch](https://github.com/openzim/zimit/tree/2016).

License
-------
[GPLv3](https://www.gnu.org/licenses/gpl-3.0) or later, see
[LICENSE](LICENSE) for more details.