mirror of
https://github.com/webrecorder/browsertrix-crawler.git
synced 2025-10-19 06:23:16 +00:00
Build simplification: Use :latest Version By default + README update (#71)
* docker-compose: just use the ':latest' tag for local builds, allowing users working with a local docker-compose.yml to build the latest image
* ci: add the 'latest' tag to the release CI build so releases automatically update 'latest' as well
* README: remove '[VERSION]' and refer to the latest version of the image in all examples
* README: mention using a specific released tag version for production
parent f4c6b6a99f
commit bd44190ab2

6 changed files with 25 additions and 18 deletions
.github/workflows/release.yaml (2 changes, vendored)
@@ -25,7 +25,7 @@ jobs:
           elif [[ $GITHUB_REF == refs/pull/* ]]; then
             VERSION=pr-${{ github.event.number }}
           fi
-          TAGS="${DOCKER_IMAGE}:${VERSION}"
+          TAGS="${DOCKER_IMAGE}:${VERSION},latest"
           echo ::set-output name=tags::${TAGS}
       - name: Set up QEMU
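The tag computation changed in this hunk can be exercised outside of CI with a small standalone sketch (illustrative: only the `refs/pull/*` branch is visible in the hunk, the other ref types are elided from the diff, and the example `GITHUB_REF` value is hypothetical):

```shell
#!/bin/sh
# Sketch of the tag string built in .github/workflows/release.yaml.
# Only the refs/pull/* branch appears in the hunk; other ref types
# handled by the workflow are omitted here.
DOCKER_IMAGE="webrecorder/browsertrix-crawler"
GITHUB_REF="refs/pull/71/merge"   # supplied by GitHub Actions in CI

VERSION=unknown
case "$GITHUB_REF" in
  refs/pull/*)
    # In the workflow this comes from ${{ github.event.number }}
    VERSION="pr-$(echo "$GITHUB_REF" | cut -d/ -f3)"
    ;;
esac

# Before this commit the line read: TAGS="${DOCKER_IMAGE}:${VERSION}"
TAGS="${DOCKER_IMAGE}:${VERSION},latest"
echo "$TAGS"
```

Note that `docker/build-push-action` treats `tags` as a comma-separated list of full image references, so the bare `latest` entry would likely need to be written as `${DOCKER_IMAGE}:latest` to resolve against the same image.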
README.md (29 changes)

@@ -169,21 +169,21 @@ See [page.goto waitUntil options](https://github.com/puppeteer/puppeteer/blob/ma
 Browsertrix Crawler supports the use of a yaml file to set parameters for a crawl. This can be used by passing a valid yaml file to the `--config` option.
 
-The YAML file can contain the same parameters as the command-line arguments. If a parameter is set on the command-line and in the yaml file, the value from the command-line will be used. For example, the following should start a crawl with config in `crawl-config.yaml` (where [VERSION] represents the version of the browsertrix-crawler image you're working with). The current [VERSION] can be found by checking the package.json file.
+The YAML file can contain the same parameters as the command-line arguments. If a parameter is set on the command-line and in the yaml file, the value from the command-line will be used. For example, the following should start a crawl with config in `crawl-config.yaml`.
 
 ```
-docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --config /app/crawl-config.yaml
+docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yaml
 ```
 
 The config can also be passed via stdin, which can simplify the command. Note that this requires running `docker run` with the `-i` flag. To read the config from stdin, pass `--config stdin`.
 
 ```
-cat ./crawl-config.yaml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --config stdin
+cat ./crawl-config.yaml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config stdin
 ```
 
-An example config file might contain:
+An example config file (eg. crawl-config.yaml) might contain:
 
 ```
 seeds:
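The example config is cut off by the diff view at `seeds:`; a minimal illustrative crawl-config.yaml (the values are hypothetical, and the field names mirror command-line options mentioned elsewhere in the README) might look like:

```yaml
seeds:
  - https://example.com/
workers: 2
generateWACZ: true
collection: example-crawl
```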
@@ -202,7 +202,7 @@ The URL seed file should be a text file formatted so that each line of the file
 The seed file must be passed as a volume to the docker container. To do that, you can format your docker command similar to the following:
 
 ```
-docker run -v $PWD/seedFile.txt:/app/seedFile.txt -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --seedFile /app/seedFile.txt
+docker run -v $PWD/seedFile.txt:/app/seedFile.txt -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --seedFile /app/seedFile.txt
 ```
 
 #### Per-Seed Settings
@@ -308,7 +308,7 @@ With version 0.4.0, Browsertrix Crawler includes an experimental 'screencasting'
 To enable, add the `--screencastPort` command-line option and also map the port on the docker container. An example command might be:
 
 ```
-docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --url https://www.example.com --screencastPort 9037
+docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url https://www.example.com --screencastPort 9037
 ```
 
 Then, you can open `http://localhost:9037/` and watch the crawl.
@@ -318,7 +318,7 @@ Note: If specifying multiple workers, the crawler should additionally be instructed
 For example,
 
 ```
-docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler:[VERSION] crawl --url https://www.example.com --screencastPort 9037 --newContext window --workers 3
+docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url https://www.example.com --screencastPort 9037 --newContext window --workers 3
 ```
 
 will start a crawl with 3 workers, and show the screen of each of the workers from `http://localhost:9037/`.
@@ -335,7 +335,7 @@ The script profile creation system also takes a screenshot so you can check if the
 For example, to create a profile logged in to Twitter, you can run:
 
 ```bash
-docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler:[VERSION] create-login-profile --url "https://twitter.com/login"
+docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/login"
 ```
 
 The script will then prompt you for login credentials, attempt to log in, and create a tar.gz file in `./crawls/profiles/profile.tar.gz`.
@@ -367,7 +367,7 @@ Browsertrix Crawler will then create a profile as before using the current state
 For example, to start in interactive profile creation mode, run:
 
 ```
-docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/output/ -it webrecorder/browsertrix-crawler:[VERSION] create-login-profile --interactive --url "https://example.com/"
+docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --interactive --url "https://example.com/"
 ```
 
 Then, open a browser pointing to `http://localhost:9223/` and use the embedded browser to log in to any sites or configure any settings as needed.
@@ -376,7 +376,7 @@ Click 'Create Profile' at the top when done. The profile will then be created in
 It is also possible to extend an existing profile by also passing in an existing profile via the `--profile` flag. In this way, it is possible to build new profiles by extending previous browsing sessions as needed.
 
 ```
-docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/profiles -it webrecorder/browsertrix-crawler:[VERSION] create-login-profile --interactive --filename /profiles/newProfile.tar.gz --url "https://example.com/" --profile /profiles/oldProfile.tar.gz
+docker run -p 9222:9222 -p 9223:9223 -v $PWD/profiles:/profiles -it webrecorder/browsertrix-crawler create-login-profile --interactive --filename /profiles/newProfile.tar.gz --url "https://example.com/" --profile /profiles/oldProfile.tar.gz
 ```
 
 ### Using Browser Profile with a Crawl
@@ -390,6 +390,15 @@ After running the above command, you can now run a crawl with the profile, as fo
 docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --profile /crawls/profiles/profile.tar.gz --url https://twitter.com/ --generateWACZ --collection test-with-profile
 ```
 
+## Published Releases / Production Use
+
+When using Browsertrix Crawler in production, it is recommended to use a specific, published version of the image, eg. `webrecorder/browsertrix-crawler:[VERSION]` instead of `webrecorder/browsertrix-crawler`, where `[VERSION]` corresponds to one of the published release tags.
+
+All released Docker images are available from Docker Hub, listed by release tag here: https://hub.docker.com/r/webrecorder/browsertrix-crawler/tags?page=1&ordering=last_updated
+
+Details for each release tag are also available on GitHub at: https://github.com/webrecorder/browsertrix-crawler/releases
+
+
 ## Architecture
 
 The Docker container provided here packages up several components used in Browsertrix.
docker-compose.yml

@@ -2,7 +2,7 @@ version: '3.5'
 services:
   crawler:
-    image: webrecorder/browsertrix-crawler:0.4.1
+    image: webrecorder/browsertrix-crawler:latest
     build:
       context: ./

@@ -14,3 +14,4 @@ services:
       - SYS_ADMIN
 
+    shm_size: 1gb
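Combining the two hunks, the crawler service in docker-compose.yml after this commit looks roughly like the following (reconstructed from the visible diff lines only; any fields elided by the diff, such as volumes or port mappings, are omitted):

```yaml
version: '3.5'

services:
  crawler:
    image: webrecorder/browsertrix-crawler:latest
    build:
      context: ./

    cap_add:
      - SYS_ADMIN

    shm_size: 1gb
```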
@@ -10,8 +10,7 @@ function runCrawl(name, config, commandExtra = "") {
   const configYaml = yaml.dump(config);
 
   try {
-    const version = require("../package.json").version;
-    const proc = child_process.execSync(`docker run -i -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler:${version} crawl --config stdin ${commandExtra}`, {input: configYaml, stdin: "inherit", encoding: "utf8"});
+    const proc = child_process.execSync(`docker run -i -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --config stdin ${commandExtra}`, {input: configYaml, stdin: "inherit", encoding: "utf8"});
 
     console.log(proc);
   }
@@ -9,8 +9,7 @@ test("pass config file via stdin", async () => {
   const config = yaml.load(configYaml);
 
   try {
-    const version = require("../package.json").version;
-    const proc = child_process.execSync(`docker run -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler:${version} crawl --config stdin --scopeExcludeRx webrecorder.net/202`, {input: configYaml, stdin: "inherit", encoding: "utf8"});
+    const proc = child_process.execSync("docker run -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler crawl --config stdin --scopeExcludeRx webrecorder.net/202", {input: configYaml, stdin: "inherit", encoding: "utf8"});
 
     console.log(proc);
   }
@@ -7,8 +7,7 @@ test("check that the warcinfo file works as expected on the command line", async
 
   try{
     const configYaml = fs.readFileSync("tests/fixtures/crawl-2.yaml", "utf8");
-    const version = require("../package.json").version;
-    const proc = child_process.execSync(`docker run -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler:${version} crawl --config stdin --limit 1 --collection warcinfo --combineWARC`, {input: configYaml, stdin: "inherit", encoding: "utf8"});
+    const proc = child_process.execSync("docker run -i -v $PWD/crawls:/crawls webrecorder/browsertrix-crawler crawl --config stdin --limit 1 --collection warcinfo --combineWARC", {input: configYaml, stdin: "inherit", encoding: "utf8"});
 
     console.log(proc);
   }