New Docker Image, Customizable Browser Source + Binary (#62)

* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)

* add BROWSER_BIN build arg and env var to support building and running with a different browser (defaults to google-chrome, but can be chromium, etc.)

* github action ci: use system unzip

* update to latest pywb beta, get pywb version from `pywb -V` command instead of parsing .py file.

* Update README with info on customizing build image

* bump version to 0.4.0-beta.2
Ilya Kreymer 2021-06-24 15:39:17 -07:00 committed by GitHub
parent 3ebe511b32
commit f57818f2f6
GPG key ID: 4AEE18F83AFDEB23
8 changed files with 49 additions and 19 deletions

@ -45,7 +45,7 @@ jobs:
- name: validate existing wacz
run: docker-compose run crawler wacz validate --file collections/wr-net/wr-net.wacz
- name: unzip wacz
run: docker-compose run crawler unzip collections/wr-net/wr-net.wacz -d collections/wr-net/wacz
run: sudo unzip crawls/collections/wr-net/wr-net.wacz -d crawls/collections/wr-net/wacz
- name: run jest
run: sudo yarn jest

@ -1,17 +1,31 @@
ARG BROWSER_VERSION=90
FROM oldwebtoday/chrome:${BROWSER_VERSION} as chrome
ARG BROWSER_IMAGE_BASE=oldwebtoday/chrome
FROM nikolaik/python-nodejs:python3.8-nodejs14
ARG BROWSER_BIN=google-chrome
RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add -
FROM ${BROWSER_IMAGE_BASE}:${BROWSER_VERSION} as chrome
RUN apt-get update -y \
&& apt-get install --no-install-recommends -qqy fonts-stix locales-all redis-server xvfb \
FROM ubuntu:bionic
RUN apt-get update -y && apt-get install --no-install-recommends -qqy software-properties-common \
&& add-apt-repository -y ppa:deadsnakes \
&& apt-get update -y \
&& apt-get install --no-install-recommends -qqy build-essential fonts-stix locales-all redis-server xvfb gpg-agent curl git \
python3.8 python3.8-distutils python3.8-dev gpg ca-certificates \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - \
&& echo "deb https://dl.yarnpkg.com/debian/ stable main" | tee /etc/apt/sources.list.d/yarn.list \
&& curl -sL https://deb.nodesource.com/setup_16.x -o /tmp/nodesource_setup.sh && bash /tmp/nodesource_setup.sh \
&& apt-get update -y && apt-get install -qqy nodejs yarn \
&& curl https://bootstrap.pypa.io/get-pip.py | python3.8 \
&& pip install -U setuptools
# needed to add args to main build stage
ARG BROWSER_VERSION
ARG BROWSER_BIN
ENV PROXY_HOST=localhost \
PROXY_PORT=8080 \
@ -19,7 +33,8 @@ ENV PROXY_HOST=localhost \
PROXY_CA_FILE=/tmp/proxy-ca.pem \
DISPLAY=:99 \
GEOMETRY=1360x1020x16 \
BROWSER_VERSION=${BROWSER_VERSION}
BROWSER_VERSION=${BROWSER_VERSION} \
BROWSER_BIN=${BROWSER_BIN}
COPY --from=chrome /tmp/*.deb /deb/
COPY --from=chrome /app/libpepflashplayer.so /app/libpepflashplayer.so

@ -37,13 +37,14 @@ Here's how you can use some of the command-line options to configure the crawl:
- To run more than one browser worker and crawl in parallel, add `--workers N`, where N is the number of browsers to run in parallel. More browsers require more CPU and network bandwidth and do not guarantee faster crawling.
- To crawl into a new directory, specify a different name for the `--collection` param, or, if omitted, a new collection directory based on current time will be created.
-
Browsertrix Crawler includes a number of additional command-line options, explained below.
## Crawling Configuration Options
The Browsertrix Crawler docker image currently accepts the following parameters:
<details>
<summary><b>The Browsertrix Crawler docker image currently accepts the following parameters:</b></summary>
```
crawler [options]
@ -136,6 +137,8 @@ Options:
command line will take precedence.
[string]
```
</details>
For the `--waitUntil` flag, see [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options).
@ -238,7 +241,7 @@ The current profile creation script is still experimental and the script attempt
The Docker container provided here packages up several components used in Browsertrix.
The system uses:
- `oldwebtoday/chrome` - to install a recent version of Chrome (currently chrome:84)
- `oldwebtoday/chrome` or `oldwebtoday/chromium` - to install a recent version of Chrome (currently chrome:90) or Chromium (see below).
- `puppeteer-cluster` - for running Chrome browsers in parallel
- `pywb` - in recording mode for capturing the content
@ -247,6 +250,19 @@ The crawl produces a single pywb collection, at `/crawls/collections/<collection
To access the contents of the crawl, the `/crawls` directory in the container should be mounted to a volume (default in the Docker Compose setup).
### Building with Custom Browser Image / Building on Apple M1
Browsertrix Crawler can be built on the new ARM M1 chip (for development). However, since there is no Linux build of Chrome for ARM, Chromium can be used instead. Currently, Webrecorder provides the `oldwebtoday/chromium:91-arm` image for running Browsertrix Crawler on ARM-based systems.
For example, to build with this Chromium image on an Apple M1 machine, run:
```
docker-compose build --build-arg BROWSER_IMAGE_BASE=oldwebtoday/chromium --build-arg "BROWSER_VERSION=91-arm" --build-arg BROWSER_BIN=chromium-browser
```
You should then be able to run Browsertrix Crawler natively on M1.
The build arguments specify the base image, version and browser binary. More generally, this approach can be used to install a different browser from any Debian-based Docker image.
### Example Usage

@ -26,7 +26,7 @@ const TextExtract = require("./util/textextract");
const { ScreenCaster } = require("./util/screencaster");
const { parseArgs } = require("./util/argParser");
const { CHROME_PATH, BEHAVIOR_LOG_FUNC, HTML_TYPES } = require("./util/constants");
const { BROWSER_BIN, BEHAVIOR_LOG_FUNC, HTML_TYPES } = require("./util/constants");
// ============================================================================
class Crawler {
@ -91,7 +91,7 @@ class Crawler {
let version = process.env.BROWSER_VERSION;
try {
version = child_process.execFileSync(CHROME_PATH, ["--product-version"], {encoding: "utf8"}).trim();
version = child_process.execFileSync(BROWSER_BIN, ["--product-version"], {encoding: "utf8"}).trim();
} catch(e) {
console.log(e);
}
@ -171,7 +171,7 @@ class Crawler {
// Puppeteer Options
return {
headless: this.params.headless,
executablePath: CHROME_PATH,
executablePath: BROWSER_BIN,
ignoreHTTPSErrors: true,
args: this.chromeArgs,
userDataDir: this.profileDir,
@ -255,9 +255,8 @@ class Crawler {
const warcVersion = "WARC/1.1";
const type = "warcinfo";
const packageFileJSON = JSON.parse(await fsp.readFile("../app/package.json"));
const version = await fsp.readFile("/usr/local/lib/python3.8/site-packages/pywb/version.py", "utf8");
const pywbVersion = version.split("\n")[0].split("=")[1].trim().replace(/['"]+/g, "");
const warcioPackageJson = JSON.parse(await fsp.readFile("/app/node_modules/warcio/package.json"));
const pywbVersion = child_process.execSync("pywb -V", {encoding: "utf8"}).trim().split(" ")[1];
const info = {
"software": `Browsertrix-Crawler ${packageFileJSON["version"]} (with warcio.js ${warcioPackageJson["version"]} pywb ${pywbVersion})`,

@ -2,7 +2,7 @@ version: '3.5'
services:
crawler:
image: webrecorder/browsertrix-crawler:0.4.0-beta.1
image: webrecorder/browsertrix-crawler:0.4.0-beta.2
build:
context: ./

@ -1,6 +1,6 @@
{
"name": "browsertrix-crawler",
"version": "0.4.0-beta.1",
"version": "0.4.0-beta.2",
"main": "browsertrix-crawler",
"repository": "https://github.com/webrecorder/browsertrix-crawler",
"author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",

@ -1,4 +1,4 @@
pywb>=2.6.0b2
pywb>=2.6.0b3
#git+https://github.com/webrecorder/pywb@main
uwsgi
wacz>=0.3.0
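
The crawler change in this commit reads the pywb version from the output of `pywb -V` rather than from a file on disk. The parsing step can be sketched in isolation as below. This is only an illustration: it assumes the command prints a line of the form `pywb <version>` (which is what taking the second space-separated token in the diff implies), and `parsePywbVersion` is a hypothetical helper name that does not appear in the repository.

```javascript
// Hypothetical helper: extract the version from the text printed by `pywb -V`.
// Assumes output shaped like "pywb <version>"; returns undefined when the
// output has no second token, rather than throwing.
function parsePywbVersion(output) {
  const parts = output.trim().split(/\s+/);
  return parts.length >= 2 ? parts[1] : undefined;
}
```

Splitting on any run of whitespace, instead of a single space, keeps the sketch tolerant of a trailing newline or extra spacing in the command output.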

@ -2,5 +2,5 @@
module.exports.HTML_TYPES = ["text/html", "application/xhtml", "application/xhtml+xml"];
module.exports.WAIT_UNTIL_OPTS = ["load", "domcontentloaded", "networkidle0", "networkidle2"];
module.exports.BEHAVIOR_LOG_FUNC = "__bx_log";
module.exports.CHROME_PATH = "google-chrome";
module.exports.BROWSER_BIN = process.env.BROWSER_BIN || "google-chrome";
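
Taken together, the `BROWSER_BIN` constant and the version lookup in the crawler amount to "use the configured binary, and fall back to the build-time version if it cannot be run". A minimal sketch of that logic follows. It is not the crawler's actual code: `resolveBrowserBin` and `getBrowserVersion` are hypothetical names, and the `run` parameter is an assumption added only so the logic can be exercised without a browser installed.

```javascript
const child_process = require("child_process");

// Hypothetical helper: pick the browser binary from the environment,
// falling back to google-chrome, the default named in this commit.
function resolveBrowserBin(env = process.env) {
  return env.BROWSER_BIN || "google-chrome";
}

// Hypothetical helper: ask the binary for its version with --product-version
// (the flag used in the diff), falling back to BROWSER_VERSION from the
// environment when the binary cannot be executed.
function getBrowserVersion(env = process.env, run = child_process.execFileSync) {
  try {
    return run(resolveBrowserBin(env), ["--product-version"], {encoding: "utf8"}).trim();
  } catch (e) {
    return env.BROWSER_VERSION;
  }
}
```

For example, `resolveBrowserBin({BROWSER_BIN: "chromium-browser"})` returns `"chromium-browser"`, matching the build argument used in the Apple M1 example in the README.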