mirror of https://github.com/openzim/zimit.git synced 2025-12-31 04:23:15 +00:00

Make a ZIM file from any Web site and surf offline!

Zimit

Zimit is a scraper that creates a ZIM file from any website.

Zimit adheres to openZIM's Contribution Guidelines.

Zimit has implemented openZIM's Python bootstrap, conventions and policies v1.0.1.

Technical background

Zimit runs a fully automated browser-based crawl of a website property and produces a ZIM of the crawled content. Zimit runs in a Docker container.

The system:

runs a website crawl with Browsertrix Crawler, which produces WARC files
converts the crawled WARC files to a single ZIM using warc2zim

zimit.py is the entrypoint for the system.

After the crawl is done, warc2zim is used to write a ZIM to the /output directory, which should be mounted as a volume so that the ZIM is not lost when the container stops.

With the --keep flag, the crawled WARCs and a few other artifacts are also kept in a temp directory inside /output.

Usage

zimit is intended to be run in Docker. The Docker image is published at https://github.com/orgs/openzim/packages/container/package/zimit.

The image accepts the following parameters, as well as any of the warc2zim ones (useful for setting metadata, for instance):

Required: --url URL - the URL to be crawled
Required: --name - name of the ZIM file
--output - output directory (defaults to /output)
--limit U - Limit capture to at most U URLs
--behaviors - Control which Browsertrix behaviors are run (defaults to autoplay,autofetch,siteSpecific; add autoscroll to the list to automatically scroll pages and fetch lazy-loaded resources)
--exclude <regex> - skip URLs that match the regex from crawling. Can be specified multiple times. An example is --exclude="(\?q=|signup-landing\?|\?cid=)", where URLs that contain either ?q= or signup-landing? or ?cid= will be excluded.
--workers N - number of crawl workers to be run in parallel
--wait-until - Puppeteer setting for how long to wait for page load. See page.goto waitUntil options. The default is load, but for static sites, --wait-until domcontentloaded may be used to speed up the crawl (to avoid waiting for ads to load for example).
--keep - if set, keep the WARC files in a temp directory inside the output directory
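
To illustrate how an --exclude pattern selects URLs, here is a minimal Python sketch that applies the example regex above to a few made-up URLs. It assumes the pattern is matched anywhere in the URL, as the description suggests. The crawler evaluates the regex in its own engine, so treat this only as a way to sanity-check a pattern before a crawl. The example.com URLs and the is_excluded helper are hypothetical.

```python
import re

# Example pattern taken from the --exclude description above.
EXCLUDE_PATTERN = re.compile(r"(\?q=|signup-landing\?|\?cid=)")


def is_excluded(url: str) -> bool:
    """Return True if the URL contains any of the excluded fragments."""
    return EXCLUDE_PATTERN.search(url) is not None


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    candidates = [
        "https://example.com/search?q=zim",
        "https://example.com/signup-landing?ref=home",
        "https://example.com/article?cid=42",
        "https://example.com/about",
    ]
    for url in candidates:
        print(url, "excluded" if is_excluded(url) else "kept")
```

Note that the question marks are escaped: an unescaped ? is a regex quantifier, not a literal character.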

Example commands:

docker run ghcr.io/openzim/zimit zimit --help
docker run ghcr.io/openzim/zimit warc2zim --help
docker run -v /output:/output ghcr.io/openzim/zimit zimit --url URL --name myzimfile
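
When the crawl is launched from a script, the same command can be assembled as an argument list. The following is a sketch only, not part of Zimit: the build_zimit_command helper and its parameter names are hypothetical, and it uses only the flags documented above. It builds the command without running it.

```python
import shlex
from typing import List, Optional, Sequence


def build_zimit_command(
    url: str,
    name: str,
    host_output_dir: str = "/output",
    limit: Optional[int] = None,
    workers: Optional[int] = None,
    excludes: Sequence[str] = (),
    wait_until: Optional[str] = None,
    keep: bool = False,
) -> List[str]:
    """Build a `docker run` argument list for a zimit crawl (hypothetical helper)."""
    cmd = [
        "docker", "run",
        # Mount a host directory on /output so the ZIM outlives the container.
        "-v", f"{host_output_dir}:/output",
        "ghcr.io/openzim/zimit", "zimit",
        "--url", url,
        "--name", name,
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    if workers is not None:
        cmd += ["--workers", str(workers)]
    for pattern in excludes:
        # --exclude can be specified multiple times.
        cmd.append(f"--exclude={pattern}")
    if wait_until is not None:
        cmd += ["--wait-until", wait_until]
    if keep:
        cmd.append("--keep")
    return cmd


if __name__ == "__main__":
    command = build_zimit_command(
        "https://example.com/",  # placeholder URL
        "myzimfile",
        limit=100,
        excludes=[r"(\?q=|\?cid=)"],
        keep=True,
    )
    # Print a shell-quoted form for inspection.
    print(shlex.join(command))
    # To run it for real: subprocess.run(command, check=True)
```

Passing the list directly to subprocess.run avoids shell quoting problems with regex characters in --exclude patterns.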

Note: The image automatically filters out a large number of ads by using the 3 blocklists from anudeepND. If you don't want this filtering, disable the image's entrypoint in your container (docker run --entrypoint="" ghcr.io/openzim/zimit ...).

To re-build the Docker image locally run:

docker build -t ghcr.io/openzim/zimit .

FAQ

The Zimit contributors' team maintains a page with the most frequently asked questions.

Nota bene

While Zimit 1.x relied on a Service Worker to display the ZIM content, this is no longer the case since Zimit 2.x, which has no special requirements.

It should also be noted that a first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.

That version is now considered outdated and is archived in the 2016 branch.

License

GPLv3 or later, see LICENSE for more details.