Zimit
=====
Zimit is a scraper that creates a ZIM file from any website.
[![Docker](https://ghcr-badge.deta.dev/openzim/zimit/latest_tag?label=docker)](https://ghcr.io/openzim/zimit)
[![Build](https://github.com/openzim/zimit/workflows/CI/badge.svg?query=branch%3Amain)](https://github.com/openzim/zimit/actions?query=branch%3Amain)
[![CodeFactor](https://www.codefactor.io/repository/github/openzim/zimit/badge)](https://www.codefactor.io/repository/github/openzim/zimit)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)

⚠️ **Important**: this tool uses [warc2zim](https://github.com/openzim/warc2zim) to create ZIM files and thus requires the ZIM reader to support *Service Workers*. At the time of `zimit:1.0`, that's mostly kiwix-android and kiwix-serve. Note that service workers have protocol restrictions as well, so you'll need to run the reader either from `localhost` or over HTTPS.
Technical background
--------------------
Zimit runs a fully automated browser-based crawl of a website property and produces a ZIM of the crawled content. Zimit runs in a Docker container.
The system:
- runs a website crawl with [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler), which produces WARC files
- converts the crawled WARC files to a single ZIM using [warc2zim](https://github.com/openzim/warc2zim)

The `zimit.py` script is the entrypoint for the system.
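
Conceptually, the two steps above could also be run by hand. The following sketch is illustrative only: the collection name, WARC paths, and exact flags are assumptions and may differ between versions of Browsertrix Crawler and warc2zim:

```bash
# Step 1: crawl the site with Browsertrix Crawler, producing WARC files
# (paths and flags are illustrative; check the browsertrix-crawler docs)
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler \
    crawl --url https://example.com --collection example

# Step 2: convert the crawled WARC files into a single ZIM with warc2zim
# (warc2zim is bundled in the zimit image; flags shown are illustrative)
warc2zim --name example --output /output crawls/collections/example/archive/*.warc.gz
```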
After the crawl is done, warc2zim is used to write a ZIM file to the
`/output` directory, which can be mounted as a volume.

With the `--keep` flag, the crawled WARC files are also kept in a temp directory inside `/output`.
Usage
-----
`zimit` is intended to be run in Docker.
To build the image locally, run:
```bash
docker build -t ghcr.io/openzim/zimit .
```
The image accepts the following parameters, **as well as any of the [warc2zim](https://github.com/openzim/warc2zim) ones** (useful for setting ZIM metadata, for instance):
- `--url URL` - the URL to be crawled (required)
- `--workers N` - number of crawl workers to be run in parallel
- `--wait-until` - Puppeteer setting for how long to wait for page load. See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options). The default is `load`, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example).
- `--name` - Name of ZIM file (defaults to the hostname of the URL)
- `--output` - output directory (defaults to `/output`)
- `--limit U` - Limit capture to at most U URLs
- `--exclude <regex>` - skip URLs that match the regex from crawling. Can be specified multiple times. An example is `--exclude="(\?q=|signup-landing\?|\?cid=)"`, where URLs that contain either `?q=` or `signup-landing?` or `?cid=` will be excluded.
- `--scroll [N]` - if set, activates a simple auto-scroll behavior on each page, scrolling for up to N seconds
- `--keep` - if set, keep the WARC files in a temp directory inside the output directory

The following is an example usage. The `--shm-size` flag is [needed to run Chrome in Docker](https://github.com/puppeteer/puppeteer/blob/v1.0.0/docs/troubleshooting.md#tips).
Example command:
```bash
docker run ghcr.io/openzim/zimit zimit --help
docker run ghcr.io/openzim/zimit warc2zim --help
docker run -v /output:/output \
    --shm-size=1gb ghcr.io/openzim/zimit zimit --url URL --name myzimfile --workers 2 --wait-until domcontentloaded
```
puppeteer-cluster provides monitoring output, enabled by default, which prints the crawl status to the Docker log.
**Note**: The image automatically filters out a large number of ads by using the 3 blocklists from [anudeepND](https://github.com/anudeepND/blacklist). If you don't want this filtering, disable the image's entrypoint in your container (`docker run --entrypoint="" ghcr.io/openzim/zimit ...`).
Nota bene
---------
A first version of a generic HTTP scraper was created in 2016 during
the [Wikimania Esino Lario
Hackathon](https://wikimania2016.wikimedia.org/wiki/Programme/Kiwix-dedicated_Hackathon).
That version is now considered outdated and [archived in `2016`
branch](https://github.com/openzim/zimit/tree/2016).
License
-------
[GPLv3](https://www.gnu.org/licenses/gpl-3.0) or later, see
[LICENSE](LICENSE) for more details.