# Browsertrix Core
Browsertrix Core is a simplified, browser-based, high-fidelity crawling system, designed to run a single crawl in a single Docker container.

It is designed as part of a more streamlined replacement for the original Browsertrix, which requires managing multiple containers and may be too complex for situations where only a single crawl is needed. This is an attempt to refactor Browsertrix into a core crawling system, driven by puppeteer-cluster and puppeteer.
The Docker container provided here packages up several components used in Browsertrix.

The system uses:

- `oldwebtoday/chrome` - to install a recent version of Chrome (currently chrome:84)
- `puppeteer-cluster` - for running Chrome browsers in parallel
- `pywb` - in recording mode for capturing the content
The crawl produces a single pywb collection, at `/output/collections/capture`.
The collection can be mounted as a Docker volume and then accessed in pywb.
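As a rough sketch, the output follows the usual pywb collection layout; the `archive/` and `indexes/` directory names below are the standard pywb defaults and are assumptions here, not guarantees:

```
/output
└── collections
    └── capture
        ├── archive    # WARC files written during the crawl
        └── indexes    # CDXJ index files used for replay
```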
## Crawling Parameters
The image currently accepts the following parameters:
- `--url URL` - the URL to be crawled (required)
- `--workers N` - number of crawl workers to run in parallel
- `--wait-until` - Puppeteer setting for how long to wait for page load. See the `page.goto` waitUntil options. The default is `load`, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (for example, to avoid waiting for ads to load).
- `--name` - name of the ZIM file (defaults to the hostname of the URL)
- `--output` - output directory (defaults to `/output`)
- `--limit U` - limit the crawl to at most U URLs
- `--exclude <regex>` - skip URLs that match the regex; can be specified multiple times
- `--scroll [N]` - if set, activates a simple auto-scroll behavior on each page, scrolling for up to N seconds
The following is an example usage. The `--cap-add` and `--shm-size`
flags are needed to run Chrome in Docker.

Example command:

```bash
docker run -v $(pwd)/collections/my-crawl:/output/collections/capture \
  --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g \
  -it webrecorder/browsertrix-crawler --url https://www.iana.org/ --workers 2
```
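The optional flags above compose within the same invocation. As an illustrative sketch only (the limit, regex, and scroll values here are arbitrary examples, not recommendations), a size-limited crawl with exclusions might look like:

```bash
docker run -v $(pwd)/collections/my-crawl:/output/collections/capture \
  --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g \
  -it webrecorder/browsertrix-crawler --url https://www.iana.org/ --workers 2 \
  --limit 100 --exclude "\.pdf$" --scroll 20 --wait-until domcontentloaded
```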
puppeteer-cluster provides monitoring output, which is enabled by default and prints the crawl status to the Docker log.

With the above example, when the crawl is finished, you can run pywb and browse the collection from: `http://localhost:8080/my-crawl/https://www.iana.org/`
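One way to do this is with a locally installed pywb; this is a minimal sketch, assuming pywb is installed via pip and that the crawl output was mounted to `./collections/my-crawl` as in the example above:

```bash
pip install pywb
# Run from the directory that contains ./collections/my-crawl,
# so pywb can find the collection by name
wayback
# Then browse http://localhost:8080/my-crawl/https://www.iana.org/
```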
## Support
Initial support for the development of Browsertrix Core was provided by Kiwix. Its initial functionality was developed to support the zimit project in a collaboration between Webrecorder and Kiwix, and this project has since been split off from Zimit into a core component of Webrecorder.