
zimit

This version of Zimit runs a single-site headless-Chrome based crawl in a Docker container and produces a ZIM of the crawled content.

The system uses:

  • oldwebtoday/chrome - to install a recent version of Chrome (Chrome 84)
  • puppeteer-cluster - for running Chrome browsers in parallel
  • pywb - in recording mode for capturing the content
  • warc2zim - to convert the crawled WARC files into a ZIM

The driver in index.js crawls a given URL using puppeteer-cluster.

After the crawl is done, warc2zim is used to write a ZIM file to the /output directory, which can be mounted as a volume.
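The crawl-queueing behavior described above (crawl a given site, capped by --limit) can be sketched as follows. This is a minimal, hypothetical illustration, not the actual index.js code; the helper name makeQueueFilter and its exact rules (same-origin only, deduplication, URL cap) are assumptions for the sake of the example:

```javascript
// Hypothetical sketch of crawl-queueing logic: enqueue only valid,
// not-yet-seen, same-origin URLs, and stop once the --limit cap is hit.
function makeQueueFilter(seedUrl, limit) {
  const seedOrigin = new URL(seedUrl).origin;
  const seen = new Set();
  return function shouldQueue(url) {
    if (limit && seen.size >= limit) return false; // --limit reached
    let origin;
    try {
      origin = new URL(url).origin;
    } catch (e) {
      return false; // not a valid absolute URL
    }
    if (origin !== seedOrigin) return false; // stay on the seed site
    if (seen.has(url)) return false;         // already queued/captured
    seen.add(url);
    return true;
  };
}
```

In the real driver, a filter like this would decide which discovered links puppeteer-cluster workers visit next.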

Usage

zimit is intended to be run in Docker.

To build locally run:

docker build -t openzim/zimit .

The image accepts the following parameters:

  • URL - the URL to be crawled (required)
  • --workers N - number of crawl workers to be run in parallel
  • --wait-until - Puppeteer setting for how long to wait for page load. See page.goto waitUntil options. The default is load, but for static sites, --wait-until domcontentloaded may be used to speed up the crawl (to avoid waiting for ads to load for example).
  • --name - Name of ZIM file (defaults to the hostname of the URL)
  • --output - output directory (defaults to /output)
  • --limit U - Limit capture to at most U URLs

The following is an example invocation. The --cap-add and --shm-size flags are needed to run Chrome in Docker.

Example command:

docker run  -v /output:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1gb openzim/zimit URL --name myzimfile --workers 2 --wait-until domcontentloaded

puppeteer-cluster provides monitoring output, which is enabled by default and prints the crawl status to the Docker log.


Previous version

A first version of a generic HTTP scraper was created in 2016 during the Wikimania Esino Lario Hackathon.

That version is now considered outdated and is archived in the 2016 branch.