Zimit
=====

Zimit is a scraper that creates a ZIM file from any website.

[![CodeFactor](https://www.codefactor.io/repository/github/openzim/zimit/badge)](https://www.codefactor.io/repository/github/openzim/zimit)
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![Docker](https://ghcr-badge.deta.dev/openzim/zimit/latest_tag?label=docker)](https://ghcr.io/openzim/zimit)

Zimit adheres to openZIM's [Contribution Guidelines](https://github.com/openzim/overview/wiki/Contributing).

Zimit has implemented openZIM's [Python bootstrap, conventions and policies](https://github.com/openzim/_python-bootstrap/docs/Policy.md) **v1.0.1**.

Capabilities and known limitations
----------------------------------

While we would like to support as many websites as possible, making an offline archive of any website with a versatile tool obviously has some limitations.

See for instance capabilities and known limitations of warc2zim in its [README](https://github.com/openzim/warc2zim/blob/main/README.md). There are also some limitations in Browsertrix Crawler (used to fetch the website) and wombat (used to properly replay dynamic web requests), but these are not (yet?) clearly documented.

Technical background
--------------------

Zimit runs a fully automated browser-based crawl of a website property and produces a ZIM of the crawled content. Zimit runs in a Docker container.

The system:
- runs a website crawl with [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler), which produces WARC files
- converts the crawled WARC files to a single ZIM using [warc2zim](https://github.com/openzim/warc2zim)

`zimit.py` is the entrypoint for the system.

After the crawl is done, warc2zim writes a ZIM to the `/output` directory, which should be mounted as a volume so that the ZIM is not lost when the container stops.

With the `--keep` flag, the crawled WARCs and a few other artifacts are also kept in a temp directory inside `/output`.
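
As a minimal sketch of the volume mount described above, the helper below prints a `docker run` command that maps a host directory onto `/output` and sets `--keep`. The function name `zimit_command`, the `zim-output` directory, the `https://example.com/` URL and the `example` name are placeholders chosen for illustration; only the image name and the flags come from this README.

```bash
# Prints a zimit invocation instead of running it, so it can be reviewed first.
# $1: host directory for the results, $2: URL to crawl, $3: ZIM name.
zimit_command() {
  # The host directory is mounted on /output, so the ZIM outlives the container.
  # --keep also preserves the WARCs in a temp directory inside that directory.
  echo docker run -v "$1:/output" ghcr.io/openzim/zimit \
    zimit --url "$2" --name "$3" --keep
}

# Placeholder values: adjust them to your own site and paths.
zimit_command "$PWD/zim-output" "https://example.com/" "example"
```

Copy the printed line into a shell to start the crawl. Printing first makes it easy to check that the host path on the left of `:/output` is the one you expect.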

Usage
-----

`zimit` is intended to be run in Docker. The Docker image is published at https://github.com/orgs/openzim/packages/container/package/zimit.

The image accepts the following parameters, **as well as any of the [warc2zim](https://github.com/openzim/warc2zim) ones**; useful for setting metadata, for instance:

- Required: `--url URL` - the URL to be crawled
- Required: `--name` - Name of ZIM file
- `--output` - output directory (defaults to `/output`)
- `--limit U` - Limit capture to at most U URLs
- `--behaviors` - Control which Browsertrix behaviors are run (defaults to `autoplay,autofetch,siteSpecific`; `autoscroll` can be added to the list to automatically scroll pages and fetch lazy-loaded resources)
- `--exclude <regex>` - skip URLs that match the regex from crawling. Can be specified multiple times. An example is `--exclude="(\?q=|signup-landing\?|\?cid=)"`, where URLs that contain either `?q=` or `signup-landing?` or `?cid=` will be excluded.
- `--workers N` - number of crawl workers to be run in parallel
- `--wait-until` - Puppeteer setting for how long to wait for page load. See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options). The default is `load`, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example).
- `--keep` - if set, keep the WARC files in a temp directory inside the output directory
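
Before launching a long crawl, it can help to check an `--exclude` pattern against a few sample URLs. The sketch below does this locally with `grep -E`, using the pattern from the list above. It assumes that extended regular expressions behave like the crawler's own regex engine for a pattern this simple, which should hold here but is not guaranteed for more advanced syntax. The sample URLs are made up for illustration.

```bash
# Pattern from the --exclude example above.
exclude_regex='(\?q=|signup-landing\?|\?cid=)'

# Made-up sample URLs, one per line.
urls='https://example.com/about
https://example.com/search?q=zim
https://example.com/signup-landing?ref=home
https://example.com/page?cid=42'

# -v inverts the match: only URLs that do NOT match the pattern remain,
# i.e. the ones the crawl would be expected to keep.
kept=$(printf '%s\n' "$urls" | grep -Ev "$exclude_regex")
echo "$kept"
```

With these samples, only `https://example.com/about` is printed; the other three URLs each contain one of the excluded fragments.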

Example commands:

```bash
docker run ghcr.io/openzim/zimit zimit --help
docker run ghcr.io/openzim/zimit warc2zim --help
docker run -v /output:/output ghcr.io/openzim/zimit zimit --url URL --name myzimfile
```

**Note**: The image automatically filters out a large number of ads by using the 3 blocklists from [anudeepND](https://github.com/anudeepND/blacklist). If you don't want this filtering, disable the image's entrypoint in your container (`docker run --entrypoint="" ghcr.io/openzim/zimit ...`).

To rebuild the Docker image locally, run:

```bash
docker build -t ghcr.io/openzim/zimit .
```

FAQ
---

The Zimit contributors' team maintains [a page with the most Frequently Asked Questions](https://github.com/openzim/zimit/wiki/Frequently-Asked-Questions).

Nota bene
---------

While Zimit 1.x relied on a Service Worker to display the ZIM content, this is no longer the case
as of Zimit 2.x, which has no special requirements.

It should also be noted that a first version of a generic HTTP scraper was created in 2016 during
the [Wikimania Esino Lario
Hackathon](https://wikimania2016.wikimedia.org/wiki/Programme/Kiwix-dedicated_Hackathon).

That version is now considered outdated and is [archived in the `2016`
branch](https://github.com/openzim/zimit/tree/2016).

License
-------

[GPLv3](https://www.gnu.org/licenses/gpl-3.0) or later, see
[LICENSE](LICENSE) for more details.