Dockerfile: switch to CMD 'crawl' instead of ENTRYPOINT, to also support running 'pywb'

update README with docker-compose and docker run examples, update commandline example
default output to './crawls' subdirectory
Ilya Kreymer 2020-11-01 21:35:00 -08:00
parent 91b8994a08
commit a875aa90d3
6 changed files with 92 additions and 32 deletions

.gitignore

@@ -3,4 +3,4 @@ __pycache__
*.egg-info/
collections/
node_modules/
output/
crawls/

Dockerfile

@@ -32,7 +32,9 @@ ADD config.yaml /app/
ADD uwsgi.ini /app/
ADD *.js /app/
RUN ln -s /app/main.js /usr/bin/crawl
WORKDIR /output
ENTRYPOINT ["node", "/app/main.js"]
CMD ["crawl"]

README.md

@@ -1,7 +1,6 @@
Browsertrix Core
================
# Browsertrix Crawler
Browsertrix Core is a simplified browser-based high-fidliety crawling system, designed to run a single crawl in a single Docker container.
Browsertrix Crawler is a simplified browser-based high-fidelity crawling system, designed to run a single crawl in a single Docker container.
It is designed as part of a more streamlined replacement of the original [Browsertrix](https://github.com/webrecorder/browsertrix).
@@ -18,47 +17,101 @@ The system uses:
- `pywb` - in recording mode for capturing the content
The crawl produces a single pywb collection, at `/output/collections/capture`.
The crawl produces a single pywb collection, at `/output/collections/<collection name>`.
The collection can be mounted as a Docker volume and then accessed in pywb.
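Since the output follows pywb's standard collection layout, a finished crawl can also be browsed with a locally installed pywb; a rough sketch, assuming pywb is installed (`pip install pywb`) and the output was mounted to the default `./crawls` directory described below:

```bash
# serve the recorded collection(s) with a local pywb install
cd crawls          # contains collections/<collection name>/
wayback -p 8080    # then browse http://localhost:8080/<collection name>/<url>
```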
Crawling Parameters
-------------------
## Crawling Parameters
The image currently accepts the following parameters:
- `--url URL` - the url to be crawled (required)
- `--workers N` - number of crawl workers to be run in parallel
- `--wait-until` - Puppeteer setting for how long to wait for page load. See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options). The default is `load`, but for static sites, `--wait-until domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load for example).
- `--name` - Name of ZIM file (defaults to the hostname of the URL)
- `--output` - output directory (defaults to `/output`)
- `--limit U` - Limit capture to at most U URLs
- `--exclude <regex>` - skip URLs that match the regex from crawling. Can be specified multiple times.
- `--scroll [N]` - if set, will activate a simple auto-scroll behavior on each page to scroll for upto N seconds
The following is an example usage. The `--cap-add` and `--shm-size`
flags are [needed to run Chrome in Docker](https://github.com/puppeteer/puppeteer/blob/v1.0.0/docs/troubleshooting.md#tips).
Example command:
```bash
docker run -v ./collections/my-crawl:/output/collections/capture --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it webrecorder/browsertrix-crawler --url https://www.iana.org/ --workers 2
```
browsertrix-crawler [options]
Options:
--help Show help [boolean]
--version Show version number [boolean]
-u, --url The URL to start crawling from [string] [required]
-w, --workers The number of workers to run in parallel
[number] [default: 1]
--newContext The context for each new capture, can be a new: page,
session or browser. [string] [default: "page"]
--waitUntil Puppeteer page.goto() condition to wait for before
continuing [default: "load"]
--limit Limit crawl to this number of pages [number] [default: 0]
--timeout Timeout for each page to load (in seconds)
[number] [default: 90]
--scope Regex of page URLs that should be included in the crawl
(defaults to the immediate directory of URL)
--exclude Regex of page URLs that should be excluded from the crawl.
--scroll If set, will autoscroll to bottom of the page
[boolean] [default: false]
-c, --collection Collection name to crawl to (replay will be accessible
under this name in pywb preview)
[string] [default: "capture"]
--headless Run in headless mode, otherwise start xvfb
[boolean] [default: false]
--driver JS driver for the crawler
[string] [default: "/Users/ilya/work/browsertrix-crawler/defaultDriver.js"]
--generateCDX If set, generate index (CDXJ) for use with pywb after crawl
is done [boolean] [default: false]
```
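As an illustration, several of these options can be combined in a single invocation; a hypothetical example (URL, scope, and exclusion values are placeholders), run via the docker-compose setup described below:

```bash
docker-compose run crawler crawl \
  --url https://example.com/ \
  --workers 2 \
  --limit 100 \
  --timeout 60 \
  --scope "https://example.com/docs/" \
  --exclude "\.pdf$" \
  --generateCDX
```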
The puppeteer-cluster provides monitoring output which is enabled by default and prints the crawl status to the Docker log.
For the `--waitUntil` flag, see [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options).
With the above example, when the crawl is finished, you can run pywb and browse the collection from: `http://localhost:8080/my-crawl/https://www.iana.org/`
The default is `load`, but for static sites, `--waitUntil domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load, for example),
while `--waitUntil networkidle0` may make sense for dynamic sites.
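For example (hypothetical URLs, using the docker-compose invocation shown below):

```bash
# mostly static site: don't wait for every subresource to finish loading
docker-compose run crawler crawl --url https://example.com/ --waitUntil domcontentloaded

# heavily dynamic site: wait until the network goes idle
docker-compose run crawler crawl --url https://app.example.com/ --waitUntil networkidle0
```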
### Example Usage
#### With Docker-Compose
The Docker Compose file simplifies building and running a crawl, and already includes the settings that `docker run` would otherwise require, such as mounting a volume.
For example, the following commands build the image and run a simple crawl with 2 workers:
```
docker-compose build
docker-compose run crawler crawl --url https://webrecorder.net/ --generateCDX --collection wr-net --workers 2
```
While the crawl is running, puppeteer-cluster provides monitoring output which is enabled by default and prints the crawl status to the Docker log.
The output is written to `./crawls/collections/wr-net` by default.
When done, you can also use the browsertrix-crawler image to start a local [pywb](https://github.com/webrecorder/pywb) instance to preview the crawl:
```
cd crawls
docker run -it -v $PWD:/output -p 8080:8080 webrecorder/browsertrix-crawler pywb
```
Then, browsing to `http://localhost:8080/wr-net/https://webrecorder.net/` should show the recent crawl of the `https://webrecorder.net/` site.
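The collection on disk uses pywb's usual directory structure, so the WARC data and (when `--generateCDX` is set) the index can also be inspected directly; roughly, for the `wr-net` collection from the example above:

```bash
ls crawls/collections/wr-net/archive/   # WARC files written by the crawl
ls crawls/collections/wr-net/indexes/   # CDXJ index, if --generateCDX was used
```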
#### With `docker run`
Browsertrix Crawler can also be run directly with `docker run`, but this requires a few more options.
In particular, the `--cap-add` and `--shm-size`
flags are [needed to run Chrome in Docker](https://github.com/puppeteer/puppeteer/blob/v1.0.0/docs/troubleshooting.md#tips).
```bash
docker run -v $PWD/crawls:/output --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it webrecorder/browsertrix-crawler crawl --url https://webrecorder.net/ --workers 2
```
Support
-------
Initial support for development of Browsertrix Core, was provided by [Kiwix](https://kiwix.org/)
Initial support for development of Browsertrix Crawler was provided by [Kiwix](https://kiwix.org/).
Initial functionality for Browsertrix Core was developed to support the [zimit](https://github.com/openzim/zimit) project in a collaboration between
Initial functionality for Browsertrix Crawler was developed to support the [zimit](https://github.com/openzim/zimit) project in a collaboration between
Webrecorder and Kiwix, and this project has been split off from Zimit into a core component of Webrecorder.

crawler.js

@@ -83,6 +83,7 @@ class Crawler {
alias: "u",
describe: "The URL to start crawling from",
type: "string",
demandOption: true,
},
"workers": {
@@ -131,7 +132,7 @@ class Crawler {
"collection": {
alias: "c",
describe: "Collection Name",
describe: "Collection name to crawl to (replay will be accessible under this name in pywb preview)",
type: "string",
default: "capture"
},
@@ -148,7 +149,7 @@ class Crawler {
default: path.join(__dirname, "defaultDriver.js"),
},
"generate-cdx": {
"generateCDX": {
describe: "If set, generate index (CDXJ) for use with pywb after crawl is done",
type: "boolean",
default: false,
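The option key was renamed to match the camelCase flag used in the README examples, e.g. (collection name as in the example above):

```bash
docker-compose run crawler crawl --url https://webrecorder.net/ --collection wr-net --generateCDX
```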

docker-compose.yml

@@ -1,11 +1,14 @@
version: '3.5'
services:
browsertrix-crawler:
crawler:
image: webrecorder/browsertrix-crawler
build:
context: ./
volumes:
- ./crawls:/output
cap_add:
- NET_ADMIN
- SYS_ADMIN

main.js

@@ -1,3 +1,4 @@
#!/usr/bin/env node
const { Crawler } = require("./crawler");