Mirror of https://github.com/webrecorder/browsertrix-crawler.git, synced 2025-10-19 14:33:17 +00:00
New Docker Image, Customizable Browser Source + Binary (#62)
* switch docker image to ubuntu base, install python3.8 + node manually (reduces image size as well!)
* add BROWSER_BIN build arg and env var to support building and running with a different browser (defaults to google-chrome, but can be chromium, etc.)
* github action ci: use system unzip
* update to latest pywb beta; get pywb version from the `pywb -V` command instead of parsing a .py file
* update README with info on customizing the build image
* bump version to 0.4.0-beta.2
parent 3ebe511b32
commit f57818f2f6

8 changed files with 49 additions and 19 deletions
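The browser customization introduced here is driven entirely by Docker build args plus one runtime env var. A minimal sketch of how they combine, using the Chromium values from the README change below (the local image tag and the crawl invocation are illustrative):

```
# build against Chromium instead of the default Chrome base image
docker build \
  --build-arg BROWSER_IMAGE_BASE=oldwebtoday/chromium \
  --build-arg "BROWSER_VERSION=91-arm" \
  --build-arg BROWSER_BIN=chromium-browser \
  -t browsertrix-crawler-chromium .

# BROWSER_BIN is also read from the environment at runtime,
# so the binary can be overridden per container run
docker run -e BROWSER_BIN=chromium-browser browsertrix-crawler-chromium crawl --url https://example.com/
```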
.github/workflows/ci.yaml (2 changes)

@@ -45,7 +45,7 @@ jobs:
       - name: validate existing wacz
         run: docker-compose run crawler wacz validate --file collections/wr-net/wr-net.wacz
       - name: unzip wacz
-        run: docker-compose run crawler unzip collections/wr-net/wr-net.wacz -d collections/wr-net/wacz
+        run: sudo unzip crawls/collections/wr-net/wr-net.wacz -d crawls/collections/wr-net/wacz
       - name: run jest
         run: sudo yarn jest
Dockerfile (27 changes)
@@ -1,17 +1,31 @@
 ARG BROWSER_VERSION=90
+ARG BROWSER_IMAGE_BASE=oldwebtoday/chrome
+ARG BROWSER_BIN=google-chrome
 
-FROM oldwebtoday/chrome:${BROWSER_VERSION} as chrome
+FROM ${BROWSER_IMAGE_BASE}:${BROWSER_VERSION} as chrome
 
-FROM nikolaik/python-nodejs:python3.8-nodejs14
+FROM ubuntu:bionic
 
-RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add -
-
-RUN apt-get update -y \
-    && apt-get install --no-install-recommends -qqy fonts-stix locales-all redis-server xvfb \
+RUN apt-get update -y && apt-get install --no-install-recommends -qqy software-properties-common \
+    && add-apt-repository -y ppa:deadsnakes \
+    && apt-get update -y \
+    && apt-get install --no-install-recommends -qqy build-essential fonts-stix locales-all redis-server xvfb gpg-agent curl git \
+       python3.8 python3.8-distutils python3.8-dev gpg ca-certificates \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+
+RUN curl -sS https://dl.yarnpkg.com/debian/pubkey.gpg | apt-key add - \
+    && echo "deb https://dl.yarnpkg.com/debian/ stable main" | tee /etc/apt/sources.list.d/yarn.list \
+    && curl -sL https://deb.nodesource.com/setup_16.x -o /tmp/nodesource_setup.sh && bash /tmp/nodesource_setup.sh \
+    && apt-get update -y && apt-get install -qqy nodejs yarn \
+    && curl https://bootstrap.pypa.io/get-pip.py | python3.8 \
+    && pip install -U setuptools
+
+# needed to add args to main build stage
+ARG BROWSER_VERSION
+ARG BROWSER_BIN
 
 ENV PROXY_HOST=localhost \
     PROXY_PORT=8080 \

@@ -19,7 +33,8 @@ ENV PROXY_HOST=localhost \
     PROXY_CA_FILE=/tmp/proxy-ca.pem \
     DISPLAY=:99 \
     GEOMETRY=1360x1020x16 \
-    BROWSER_VERSION=${BROWSER_VERSION}
+    BROWSER_VERSION=${BROWSER_VERSION} \
+    BROWSER_BIN=${BROWSER_BIN}
 
 COPY --from=chrome /tmp/*.deb /deb/
 COPY --from=chrome /app/libpepflashplayer.so /app/libpepflashplayer.so
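The `# needed to add args to main build stage` comment marks a Docker scoping rule: `ARG`s declared before the first `FROM` are visible only to `FROM` lines, so `BROWSER_VERSION` and `BROWSER_BIN` must be redeclared inside the final stage before the `ENV` block can reference them. A quick way to confirm the values were baked in (image tag illustrative; assumes the image runs arbitrary commands, as the CI steps above do):

```
docker build -t btrix-arg-check .
docker run --rm btrix-arg-check printenv BROWSER_BIN BROWSER_VERSION
```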
README.md (22 changes)
@@ -37,13 +37,14 @@ Here's how you can use some of the command-line options to configure the crawl:
 - To run more than one browser worker and crawl in parallel, and `--workers N` where N is number of browsers to run in parallel. More browsers will require more CPU and network bandwidth, and does not guarantee faster crawling.
 
 - To crawl into a new directory, specify a different name for the `--collection` param, or, if omitted, a new collection directory based on current time will be created.
-
 Browsertrix Crawler includes a number of additional command-line options, explained below.
 
 ## Crawling Configuration Options
 
-The Browsertrix Crawler docker image currently accepts the following parameters:
+<details>
+<summary><b>The Browsertrix Crawler docker image currently accepts the following parameters:</b></summary>
 
 ```
 crawler [options]

@@ -136,6 +137,8 @@ Options:
                                      command line will take precedence.
                                                                         [string]
 ```
+</details>
+
 For the `--waitUntil` flag, see [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/main/docs/api.md#pagegotourl-options).

@@ -238,7 +241,7 @@ The current profile creation script is still experimental and the script attempt
 The Docker container provided here packages up several components used in Browsertrix.
 
 The system uses:
-- `oldwebtoday/chrome` - to install a recent version of Chrome (currently chrome:84)
+- `oldwebtoday/chrome` or `oldwebtoday/chromium` - to install a recent version of Chrome (currently chrome:90) or Chromium (see below).
 - `puppeteer-cluster` - for running Chrome browsers in parallel
 - `pywb` - in recording mode for capturing the content

@@ -247,6 +250,19 @@ The crawl produces a single pywb collection, at `/crawls/collections/<collection
 
 To access the contents of the crawl, the `/crawls` directory in the container should be mounted to a volume (default in the Docker Compose setup).
 
+### Building with Custom Browser Image / Building on Apple M1
+
+Browsertrix Crawler can be built on the new ARM M1 chip (for development). However, since there is no Linux build of Chrome for ARM, Chromium can be used instead. Currently, Webrecorder provides the `oldwebtoday/chromium:91-arm` image for running Browsertrix Crawler on ARM-based systems.
+
+For example, to build with this Chromium image on an Apple M1 machine, run:
+
+```
+docker-compose build --build-arg BROWSER_IMAGE_BASE=oldwebtoday/chromium --build-arg "BROWSER_VERSION=91-arm" --build-arg BROWSER_BIN=chromium-browser
+```
+
+You should then be able to run Browsertrix Crawler natively on M1.
+
+The build arguments specify the base image, version, and browser binary. This approach can also be used to install a different browser in general from any Debian-based Docker image.
+
 ### Example Usage
crawler.js

@@ -26,7 +26,7 @@ const TextExtract = require("./util/textextract");
 const { ScreenCaster } = require("./util/screencaster");
 const { parseArgs } = require("./util/argParser");
 
-const { CHROME_PATH, BEHAVIOR_LOG_FUNC, HTML_TYPES } = require("./util/constants");
+const { BROWSER_BIN, BEHAVIOR_LOG_FUNC, HTML_TYPES } = require("./util/constants");
 
 // ============================================================================
 class Crawler {

@@ -91,7 +91,7 @@ class Crawler {
     let version = process.env.BROWSER_VERSION;
 
     try {
-      version = child_process.execFileSync(CHROME_PATH, ["--product-version"], {encoding: "utf8"}).trim();
+      version = child_process.execFileSync(BROWSER_BIN, ["--product-version"], {encoding: "utf8"}).trim();
     } catch(e) {
       console.log(e);
     }

@@ -171,7 +171,7 @@ class Crawler {
     // Puppeter Options
     return {
       headless: this.params.headless,
-      executablePath: CHROME_PATH,
+      executablePath: BROWSER_BIN,
       ignoreHTTPSErrors: true,
       args: this.chromeArgs,
       userDataDir: this.profileDir,

@@ -255,9 +255,8 @@ class Crawler {
     const warcVersion = "WARC/1.1";
     const type = "warcinfo";
     const packageFileJSON = JSON.parse(await fsp.readFile("../app/package.json"));
-    const version = await fsp.readFile("/usr/local/lib/python3.8/site-packages/pywb/version.py", "utf8");
-    const pywbVersion = version.split("\n")[0].split("=")[1].trim().replace(/['"]+/g, "");
     const warcioPackageJson = JSON.parse(await fsp.readFile("/app/node_modules/warcio/package.json"));
+    const pywbVersion = child_process.execSync("pywb -V", {encoding: "utf8"}).trim().split(" ")[1];
 
     const info = {
       "software": `Browsertrix-Crawler ${packageFileJSON["version"]} (with warcio.js ${warcioPackageJson["version"]} pywb ${pywbVersion})`,
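The new `pywb -V` lookup assumes the CLI prints the program name followed by the version on one line, which is what `.trim().split(" ")[1]` extracts; sketched in shell (example output based on the `pywb>=2.6.0b3` pin below):

```
$ pywb -V
pywb 2.6.0b3
$ pywb -V | awk '{print $2}'   # same field the crawler code extracts
2.6.0b3
```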
docker-compose.yml

@@ -2,7 +2,7 @@ version: '3.5'
 
 services:
   crawler:
-    image: webrecorder/browsertrix-crawler:0.4.0-beta.1
+    image: webrecorder/browsertrix-crawler:0.4.0-beta.2
     build:
       context: ./
package.json

@@ -1,6 +1,6 @@
 {
   "name": "browsertrix-crawler",
-  "version": "0.4.0-beta.1",
+  "version": "0.4.0-beta.2",
   "main": "browsertrix-crawler",
   "repository": "https://github.com/webrecorder/browsertrix-crawler",
   "author": "Ilya Kreymer <ikreymer@gmail.com>, Webrecorder Software",
requirements.txt

@@ -1,4 +1,4 @@
-pywb>=2.6.0b2
+pywb>=2.6.0b3
 #git+https://github.com/webrecorder/pywb@main
 uwsgi
 wacz>=0.3.0
util/constants.js

@@ -2,5 +2,5 @@
 module.exports.HTML_TYPES = ["text/html", "application/xhtml", "application/xhtml+xml"];
 module.exports.WAIT_UNTIL_OPTS = ["load", "domcontentloaded", "networkidle0", "networkidle2"];
 module.exports.BEHAVIOR_LOG_FUNC = "__bx_log";
-module.exports.CHROME_PATH = "google-chrome";
+module.exports.BROWSER_BIN = process.env.BROWSER_BIN || "google-chrome";
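Because `BROWSER_BIN` now falls back to `process.env.BROWSER_BIN` before defaulting to `google-chrome`, the browser binary can be swapped per run without rebuilding the image; for example (the binary must already exist in the image, and the crawl invocation follows the README's usage):

```
docker run -e BROWSER_BIN=chromium-browser webrecorder/browsertrix-crawler:0.4.0-beta.2 crawl --url https://example.com/ --collection example
```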