# Browsertrix Crawler
Browsertrix Crawler is a simplified (Chrome) browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses `puppeteer-cluster` and `puppeteer` to control one or more browsers in parallel.
## Features
Thus far, Browsertrix Crawler supports:
- Single-container, browser-based crawling with multiple headless/headful browsers.
- Support for custom browser behaviors, using Browsertrix Behaviors, including autoscroll, video autoplay, and site-specific behaviors.
- Optimized (non-browser) capture of non-HTML resources.
- Extensible Puppeteer driver script for customizing behavior per crawl or page.
- Ability to create and reuse browser profiles with user/password login.
## Getting Started
Browsertrix Crawler requires Docker to be installed on the machine running the crawl.

Assuming Docker is installed, you can run a crawl and test your archive with the following steps.

You don't even need to clone this repo: just choose a directory where you'd like the crawl data to be placed, and then run the following commands. Replace `[URL]` with the web site you'd like to crawl.
- Run:

  ```
  docker pull webrecorder/browsertrix-crawler
  docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection test
  ```
- The crawl will now run, and its progress will be output to the console. Depending on the size of the site, this may take a bit!
- Once the crawl is finished, a WACZ file will be created in `crawls/collection/test/test.wacz` from the directory you ran the crawl!
- You can go to [ReplayWeb.page](https://replayweb.page) and open the generated WACZ file and browse your newly crawled archive!
Here's how you can use some of the command-line options to configure the crawl (a combined example follows this list):

- To include automated text extraction for full text search, add the `--text` flag.
- To limit the crawl to a maximum number of pages, add `--limit P` where P is the number of pages that will be crawled.
- To run more than one browser worker and crawl in parallel, add `--workers N` where N is the number of browsers to run in parallel. More browsers will require more CPU and network bandwidth, and do not guarantee faster crawling.
- To crawl into a new directory, specify a different name for the `--collection` param; if omitted, a new collection directory based on the current time will be created.
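For example, a crawl combining several of these options might look like the following sketch (replace `[URL]` as before; the page limit and collection name are arbitrary):

```
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --text --limit 100 --workers 2 --collection my-crawl
```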
Browsertrix Crawler includes a number of additional command-line options, explained below.
## Crawling Configuration Options
The Browsertrix Crawler docker image currently accepts the following parameters:
```
crawler [options]

Options:
  --help                           Show help                           [boolean]
  --version                        Show version number                 [boolean]
  -u, --url                        The URL to start crawling from
                                                             [string] [required]
  -w, --workers                    The number of workers to run in parallel
                                                           [number] [default: 1]
  --newContext                     The context for each new capture, can be a
                                   new: page, session or browser
                                                      [string] [default: "page"]
  --waitUntil                      Puppeteer page.goto() condition to wait for
                                   before continuing, can be multiple separated
                                   by ','         [default: "load,networkidle0"]
  --limit                          Limit crawl to this number of pages
                                                           [number] [default: 0]
  --timeout                        Timeout for each page to load (in seconds)
                                                          [number] [default: 90]
  --scope                          Regex of page URLs that should be included
                                   in the crawl (defaults to the immediate
                                   directory of URL)
  --exclude                        Regex of page URLs that should be excluded
                                   from the crawl
  -c, --collection                 Collection name to crawl to (replay will be
                                   accessible under this name in pywb preview)
                                   [string] [default: "capture-YYYY-MM-DDTHH-MM-SS"]
  --headless                       Run in headless mode, otherwise start xvfb
                                                      [boolean] [default: false]
  --driver                         JS driver for the crawler
                                      [string] [default: "/app/defaultDriver.js"]
  --generateCDX, --generatecdx,    If set, generate index (CDXJ) for use with
  --generateCdx                    pywb after crawl is done
                                                      [boolean] [default: false]
  --combineWARC, --combinewarc,    If set, combine the warcs
  --combineWarc                                       [boolean] [default: false]
  --rolloverSize                   If set, declare the rollover size
                                                  [number] [default: 1000000000]
  --generateWACZ, --generatewacz,  If set, generate wacz
  --generateWacz                                      [boolean] [default: false]
  --logging                        Logging options for crawler, can include:
                                   stats, pywb, behaviors, behaviors-debug
                                                     [string] [default: "stats"]
  --text                           If set, extract text to the pages.jsonl file
                                                      [boolean] [default: false]
  --cwd                            Crawl working directory for captures (pywb
                                   root). If not set, defaults to process.cwd()
                                                    [string] [default: "/crawls"]
  --mobileDevice                   Emulate mobile device by name from:
                                   https://github.com/puppeteer/puppeteer/blob/main/src/common/DeviceDescriptors.ts
                                                                        [string]
  --userAgent                      Override user-agent with specified string
                                                                        [string]
  --userAgentSuffix                Append suffix to existing browser user-agent
                                   (ex: +MyCrawler, info@example.com)   [string]
  --useSitemap                     If enabled, check for sitemaps at
                                   /sitemap.xml, or custom URL if URL is
                                   specified
  --statsFilename                  If set, output stats as JSON to this file.
                                   (Relative filename resolves to crawl working
                                   directory)
  --behaviors                      Which background behaviors to enable on each
                                   page
                           [string] [default: "autoplay,autofetch,siteSpecific"]
  --profile                        Path to tar.gz file which will be extracted
                                   and used as the browser profile      [string]
```
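As an illustrative sketch (the site and regexes here are made-up placeholders), `--scope` and `--exclude` can be combined to keep a crawl within one section of a site while skipping unwanted pages:

```
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://example.com/blog/ --scope "https://example.com/blog/" --exclude "https://example.com/blog/search.*" --collection blog-only
```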
For the `--waitUntil` flag, see the Puppeteer `page.goto()` waitUntil options. The default is `load`, but for static sites, `--waitUntil domcontentloaded` may be used to speed up the crawl (to avoid waiting for ads to load, for example), while `--waitUntil networkidle0` may make sense for dynamic sites.
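For instance, a sketch of a crawl of a mostly static site (replace `[URL]`; the collection name is arbitrary):

```
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --waitUntil domcontentloaded --collection static-site
```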
## Behaviors
Browsertrix Crawler also supports automatically running customized in-browser behaviors. These behaviors can auto-play videos (when possible), auto-fetch content that is not loaded by default, and run site-specific behaviors on certain sites.
Behaviors to run can be specified via a comma-separated list passed to the `--behaviors` option. The auto-scroll behavior is not enabled by default, as it may slow down crawling. To enable it, add `--behaviors autoscroll`, or to enable all behaviors, add `--behaviors autoscroll,autoplay,autofetch,siteSpecific`.
See Browsertrix Behaviors for more info on all of the currently available behaviors.
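For example, a sketch of a crawl with all behaviors enabled (replace `[URL]`; the collection name is arbitrary):

```
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --behaviors autoscroll,autoplay,autofetch,siteSpecific --collection all-behaviors
```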
## Creating and Using Browser Profiles
Browsertrix Crawler also includes a way to use existing browser profiles when running a crawl. This allows pre-configuring the browser, such as by logging in to certain sites or adjusting other settings, and then running a crawl with exactly those settings. By creating a logged-in profile, the actual login credentials are not included in the crawl, only (temporary) session cookies.
Browsertrix Crawler currently includes a script to log in to a single website with supplied credentials and then save the profile. It can also take a screenshot so you can check whether the login succeeded. The `--url` parameter should specify the URL of a login page.
For example, to create a profile logged in to Twitter, you can run:
```
docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://twitter.com/login"
```
The script will then prompt you for login credentials, attempt to log in, and create a tar.gz file at `./crawls/profiles/profile.tar.gz`.
- To specify a custom filename, pass the `--filename` parameter.
- To specify the username and password on the command line (for automated profile creation), pass the `--username` and `--password` flags.
- To run in headless mode, add the `--headless` flag. Note that for crawls run with the `--headless` flag, it is recommended to also create the profile with `--headless` to ensure the profile is compatible.
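For example, a sketch of an automated, headless profile creation run (the site, credentials, and filename here are placeholders, and `--filename` is assumed to accept a path within the mounted `/output/` directory):

```
docker run -v $PWD/crawls/profiles:/output/ -it webrecorder/browsertrix-crawler create-login-profile --url "https://example.com/login" --username myuser --password mypassword --filename /output/example-profile.tar.gz --headless
```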
The `--profile` flag can then be used to specify a Chrome profile stored as a tarball when running the regular `crawl` command. With this option, it is possible to crawl with the browser already pre-configured. To ensure compatibility, the profile should be created using the mechanism described above.
After running the above command, you can now run a crawl with the profile, as follows:
```
docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --profile /crawls/profiles/profile.tar.gz --url https://twitter.com/ --generateWACZ --collection test-with-profile
```
The current profile creation script is still experimental: it attempts to detect the username and password fields on a site as generically as possible, but may not work for all sites. Additional profile functionality, such as support for custom profile creation scripts, may be added in the future.
## Architecture
The Docker container provided here packages up several components used in Browsertrix.
The system uses:
- `oldwebtoday/chrome` - to install a recent version of Chrome (currently chrome:84)
- `puppeteer-cluster` - for running Chrome browsers in parallel
- `pywb` - in recording mode for capturing the content
The crawl produces a single pywb collection, at `/crawls/collections/<collection name>` in the Docker container.

To access the contents of the crawl, the `/crawls` directory in the container should be mounted to a volume (the default in the Docker Compose setup).
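As a rough sketch (the exact layout and filenames vary by crawl and by the options used), a finished collection directory might look like:

```
crawls/collections/<collection name>/
├── archive/       # WARC files recorded by pywb
├── indexes/       # CDXJ indexes (with --generateCDX)
├── pages/         # pages.jsonl page list (text included with --text)
└── <name>.wacz    # packaged archive (with --generateWACZ)
```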
## Example Usage
### With Docker-Compose
The Docker Compose file can simplify building and running a crawl, and includes some required settings for `docker run`, including mounting a volume.
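A minimal sketch of what such a Compose service might look like (the repo ships its own `docker-compose.yml`; this is an illustrative approximation based on the flags described below, not a copy of that file):

```yaml
# illustrative approximation -- see the repo's docker-compose.yml for the real file
version: '3.5'

services:
  crawler:
    image: webrecorder/browsertrix-crawler
    build: .
    volumes:
      - ./crawls:/crawls      # expose crawl output on the host
    cap_add:                  # capabilities Chrome needs inside Docker
      - NET_ADMIN
      - SYS_ADMIN
    shm_size: 1gb             # Chrome needs a larger shared memory area
```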
For example, the following commands demonstrate building the image and running a simple crawl with 2 workers:

```
docker-compose build
docker-compose run crawler crawl --url https://webrecorder.net/ --generateCDX --collection wr-net --workers 2
```
In this example, the crawl data is written to `./crawls/collections/wr-net` by default.

While the crawl is running, the status of the crawl (provided by puppeteer-cluster monitoring) is printed to the Docker log.
When done, you can use the browsertrix-crawler image to start a local pywb instance to preview the crawl:

```
docker run -it -v $(pwd)/crawls:/crawls -p 8080:8080 webrecorder/browsertrix-crawler pywb
```

Then, loading `http://localhost:8080/wr-net/https://webrecorder.net/` should show your recent crawl of the https://webrecorder.net/ site.
### With docker run
Browsertrix Crawler can also be run directly with `docker run`, but this requires a few more options. In particular, the `--cap-add` and `--shm-size` flags are needed to run Chrome in Docker.
```
docker run -v $PWD/crawls:/crawls --cap-add=SYS_ADMIN --cap-add=NET_ADMIN --shm-size=1g -it webrecorder/browsertrix-crawler crawl --url https://webrecorder.net/ --workers 2
```
## Support
Initial support for development of Browsertrix Crawler was provided by Kiwix.

Initial functionality for Browsertrix Crawler was developed to support the zimit project in a collaboration between Webrecorder and Kiwix, and this project has been split off from zimit into a core component of Webrecorder.