Build simplification: use ':latest' version by default + README update (#71)

- docker-compose: use the ':latest' tag for local builds, so users working with the local docker-compose.yml can simply build the latest image
- ci: add the 'latest' tag to the release CI build, so that 'latest' is automatically updated as well
- README: remove '[VERSION]' and refer to the latest version of the image in all examples
- README: mention using a specific released tag version for production
Ilya Kreymer 2021-07-22 17:46:10 -07:00 committed by GitHub
parent f4c6b6a99f
commit bd44190ab2
6 changed files with 25 additions and 18 deletions


@@ -25,7 +25,7 @@ jobs:
elif [[ $GITHUB_REF == refs/pull/* ]]; then
VERSION=pr-${{ github.event.number }}
fi
TAGS="${DOCKER_IMAGE}:${VERSION}"
TAGS="${DOCKER_IMAGE}:${VERSION},latest"
echo ::set-output name=tags::${TAGS}
    - name: Set up QEMU
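To make the effect of this change concrete, here is a standalone bash sketch (not part of the workflow) of how the tag list resolves for a hypothetical release ref; the `refs/tags` branch of the conditional is an assumption, since only the `refs/pull` branch is visible in this hunk:

```bash
# Hypothetical standalone rendition of the tag logic above.
DOCKER_IMAGE="webrecorder/browsertrix-crawler"
GITHUB_REF="refs/tags/v0.4.1"               # assumed example release ref

if [[ $GITHUB_REF == refs/tags/* ]]; then
  VERSION="${GITHUB_REF#refs/tags/v}"       # assumed tag-to-version mapping
elif [[ $GITHUB_REF == refs/pull/* ]]; then
  VERSION="pr-123"                          # placeholder for github.event.number
fi

TAGS="${DOCKER_IMAGE}:${VERSION},${DOCKER_IMAGE}:latest"
echo "${TAGS}"
# -> webrecorder/browsertrix-crawler:0.4.1,webrecorder/browsertrix-crawler:latest
```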


@@ -169,21 +169,21 @@ See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/ma
Browsertrix Crawler supports the use of a yaml file to set parameters for a crawl. This can be used by passing a valid yaml file to the `--config` option.
-The YAML file can contain the same parameters as the command-line arguments. If a parameter is set on the command-line and in the yaml file, the value from the command-line will be used. For example, the following should start a crawl with the config in `crawl-config.yaml` (where [VERSION] represents the version of the browsertrix-crawler image you're working with). The current [VERSION] can be found by checking the package.json file.
+The YAML file can contain the same parameters as the command-line arguments. If a parameter is set on the command-line and in the yaml file, the value from the command-line will be used. For example, the following should start a crawl with the config in `crawl-config.yaml`.
```
-docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --config /app/crawl-config.yaml
+docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yaml
```
The config can also be passed via stdin, which can simplify the command. Note that this requires running `docker run` with the `-i` flag. To read the config from stdin, pass `--config stdin`.
```
-cat ./crawl-config.yaml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --config stdin
+cat ./crawl-config.yaml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config stdin
```
-An example config file might contain:
+An example config file (e.g. crawl-config.yaml) might contain:
```
seeds:
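The diff truncates the example config at this point. As a sketch of what such a file can contain (the YAML keys mirror command-line options shown elsewhere in this README, per the rule above; the values are illustrative):

```bash
# Illustrative: write a minimal crawl-config.yaml. The YAML keys mirror CLI
# options (--workers, --generateWACZ) per the config rule described above.
cat > crawl-config.yaml << 'EOF'
seeds:
  - https://example.com/
  - https://webrecorder.net/
workers: 2
generateWACZ: true
EOF
```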
@@ -202,7 +202,7 @@ The URL seed file should be a text file formatted so that each line of the file
The seed file must be passed as a volume to the docker container. To do that, you can format your docker command similar to the following:
```
-docker run -v $PWD/seedFile.txt:/app/seedFile.txt -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --seedFile /app/seedFile.txt
+docker run -v $PWD/seedFile.txt:/app/seedFile.txt -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --seedFile /app/seedFile.txt
```
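As a quick sketch (the file name matches the command above; the URLs are illustrative), the seed file is plain text with one URL per line:

```bash
# Illustrative: create a seed file with one URL per line.
printf '%s\n' \
  'https://example.com/' \
  'https://webrecorder.net/' > seedFile.txt
```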
#### Per-Seed Settings
@@ -308,7 +308,7 @@ With version 0.4.0, Browsertrix Crawler includes an experimental 'screencasting'
To enable it, add the `--screencastPort` command-line option and also map that port on the docker container. An example command might be:
```
-docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --url https://www.example.com --screencastPort 9037
+docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url https://www.example.com --screencastPort 9037
```
Then, you can open `http://localhost:9037/` and watch the crawl.
@@ -318,7 +318,7 @@ Note: If specifying multiple workers, the crawler should additionally be instructe
For example,
```
-docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --url https://www.example.com --screencastPort 9037 --newContext window --workers 3
+docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url https://www.example.com --screencastPort 9037 --newContext window --workers 3
```
will start a crawl with 3 workers, and show the screen of each of the workers from `http://localhost:9037/`.
@@ -335,7 +335,7 @@ The script profile creation system also takes a screenshot so you can check if th
For example, to create a profile logged in to Twitter, you can run:
```bash
-docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler:[VERSION] create-login-profile --url "https://twitter.com/login"
+docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/login"
```
The script will then prompt you for login credentials, attempt to log in, and create a tar.gz file in `./crawls/profiles/profile.tar.gz`.
@@ -367,7 +367,7 @@ Browsertrix Crawler will then create a profile as before using the current state
For example, to start in interactive profile creation mode, run:
```
-docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/output/ -it webrecorder/browsertrix-crawler:[VERSION] create-login-profile --interactive --url "https://example.com/"
+docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --interactive --url "https://example.com/"
```
Then, open a browser pointing to `http://localhost:9223/` and use the embedded browser to log in to any sites or configure any settings as needed.
@@ -376,7 +376,7 @@ Click 'Create Profile' at the top when done. The profile will then be created in
It is also possible to extend an existing profile by passing one in via the `--profile` flag. In this way, it is possible to build new profiles by extending previous browsing sessions as needed.
```
-docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/profiles -it webrecorder/browsertrix-crawler:[VERSION] create-login-profile --interactive --filename /profiles/newProfile.tar.gz --url "https://example.com/" --profile /profiles/oldProfile.tar.gz
+docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/profiles -it webrecorder/browsertrix-crawler create-login-profile --interactive --filename /profiles/newProfile.tar.gz --url "https://example.com/" --profile /profiles/oldProfile.tar.gz
```
### Using Browser Profile with a Crawl
@@ -390,6 +390,15 @@ After running the above command, you can now run a crawl with the profile, as fo
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --profile /crawls/profiles/profile.tar.gz --url https://twitter.com/ --generateWACZ --collection test-with-profile
```
+## Published Releases / Production Use
+When using Browsertrix Crawler in production, it is recommended to use a specific published version of the image, e.g. `webrecorder/browsertrix-crawler:[VERSION]` instead of `webrecorder/browsertrix-crawler`, where `[VERSION]` corresponds to one of the published release tags.
+All released Docker images are available from Docker Hub, listed by release tag here: https://hub.docker.com/r/webrecorder/browsertrix-crawler/tags?page=1&ordering=last_updated
+Details for each release tag are also available on GitHub at: https://github.com/webrecorder/browsertrix-crawler/releases
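For instance, a minimal sketch of pinning to a release (the tag `0.4.1` and the collection name are illustrative, `0.4.1` being the version previously pinned in the docker-compose.yml below):

```bash
# Pull a specific published release rather than the moving :latest tag.
docker pull webrecorder/browsertrix-crawler:0.4.1

# Run a crawl against the pinned version (flags as shown elsewhere in this README).
docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:0.4.1 crawl --url https://www.example.com/ --collection pinned-test
```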
## Architecture
The Docker container provided here packages up several components used in Browsertrix.


@@ -2,7 +2,7 @@ version: '3.5'
services:
  crawler:
-    image: webrecorder/browsertrix-crawler:0.4.1
+    image: webrecorder/browsertrix-crawler:latest
    build:
      context: ./
@@ -14,3 +14,4 @@ services:
      - SYS_ADMIN
    shm_size: 1gb
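With the compose file now tagging local builds as `:latest`, a local build-and-run workflow might look like the following sketch (the `crawler` service name comes from the compose file above; the crawl flags are ones shown elsewhere in this README):

```bash
# Build the image locally; per the compose file above it is tagged
# webrecorder/browsertrix-crawler:latest.
docker-compose build

# Run a one-off crawl with the locally built :latest image.
docker-compose run crawler crawl --url https://www.example.com/ --workers 2
```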


@@ -10,8 +10,7 @@ function runCrawl(name, config, commandExtra = "") {
const configYaml = yaml.dump(config);
try {
-const version = require("../package.json").version;
-const proc = child_process.execSync(`docker run -i -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler:${version} crawl --config stdin ${commandExtra}`, {input: configYaml, stdin: "inherit", encoding: "utf8"});
+const proc = child_process.execSync(`docker run -i -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --config stdin ${commandExtra}`, {input: configYaml, stdin: "inherit", encoding: "utf8"});
console.log(proc);
}


@@ -9,8 +9,7 @@ test("pass config file via stdin", async () => {
const config = yaml.load(configYaml);
try {
-const version = require("../package.json").version;
-const proc = child_process.execSync(`docker run -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler:${version} crawl --config stdin --scopeExcludeRx webrecorder.net/202`, {input: configYaml, stdin: "inherit", encoding: "utf8"});
+const proc = child_process.execSync("docker run -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler crawl --config stdin --scopeExcludeRx webrecorder.net/202", {input: configYaml, stdin: "inherit", encoding: "utf8"});
console.log(proc);
}


@@ -7,8 +7,7 @@ test("check that the warcinfo file works as expected on the command line", async
try{
const configYaml = fs.readFileSync("tests/fixtures/crawl-2.yaml", "utf8");
-const version = require("../package.json").version;
-const proc = child_process.execSync(`docker run -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler:${version} crawl --config stdin --limit 1 --collection warcinfo --combineWARC`, {input: configYaml, stdin: "inherit", encoding: "utf8"});
+const proc = child_process.execSync("docker run -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler crawl --config stdin --limit 1 --collection warcinfo --combineWARC", {input: configYaml, stdin: "inherit", encoding: "utf8"});
console.log(proc);
}