Update Behaviors Docs (#820)

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
This commit is contained in:
Ilya Kreymer 2025-04-10 09:58:07 +02:00 committed by GitHub
parent f2dac05577
commit 1cb1b2edb9
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
5 changed files with 246 additions and 25 deletions

View file

@ -1,31 +1,63 @@
# Browser Behaviors
Browsertrix Crawler supports automatically running customized in-browser behaviors. The behaviors auto-play videos (when possible), auto-fetch content that is not loaded by default, and also run custom behaviors on certain sites.
Browsertrix Crawler supports automatically running customized behaviors on each page. Several types of behaviors are supported, including built-in, background, and site-specific behaviors. It is also possible to add fully user-defined custom behaviors that can be added to trigger specific actions on certain pages.
To run behaviors, specify them via a comma-separated list passed to the `--behaviors` option. All behaviors are enabled by default, the equivalent of `--behaviors autoscroll,autoplay,autofetch,siteSpecific`. To enable only a single behavior, such as autoscroll, use `--behaviors autoscroll`.
## Built-In Behaviors
The site-specific behavior (or autoscroll) will start running after the page is finished its initial load (as defined by the `--waitUntil` settings). The behavior will then run until finished or until the behavior timeout is exceeded. This timeout can be set (in seconds) via the `--behaviorTimeout` flag (90 seconds by default). Setting the timeout to 0 will allow the behavior to run until it is finished.
The built-in behaviors include the following background behaviors which run 'in the background' continually checking for changes:
- Autoplay: find and start playing (when possible) any video or audio on the page (and in each iframe).
- Autofetch: find and start fetching any URLs that may not be fetched by default, such as other resolutions in `img` tags, `data-*`, lazy-loaded resources, etc.
- Autoclick: select all tags (default: `a` tag, customizable via `--clickSelector`) that may be clickable and attempt to click them while avoiding navigation away from the page.
See [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors) for more info on all of the currently available behaviors.
There is also a built-in 'main' behavior, which runs to completion (or until a timeout is reached):
Browsertrix Crawler includes a `--pageExtraDelay`/`--delay` option, which can be used to have the crawler sleep for a configurable number of seconds after behaviors before moving on to the next page.
- Autoscroll: Determine if a page might need scrolling, and scroll either up or down while new elements are being added. Continue until timeout is reached or scrolling is no longer possible.
To disable behaviors for a crawl, use `--behaviors ""`.
## Site-Specific Behaviors
## Additional Custom Behaviors
Browsertrix also comes with several 'site-specific' behaviors, which run only on specific sites. These behaviors will run instead of Autoscroll and will run until completion or timeout. Currently, site-specific behaviors include major social media sites.
Custom behaviors can be mounted into the crawler and ran from there, or downloaded from a URL.
Refer to [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors) for the latest list of site-specific behaviors.
Each behavior should contain a single class that implements the behavior interface. See [the behaviors tutorial](https://github.com/webrecorder/browsertrix-behaviors/blob/main/docs/TUTORIAL.md) for more info on how to write behaviors.
User-defined custom behaviors are also considered site-specific.
## Enabling Behaviors
The first behavior which returns true for `isMatch()` will be run on a given page.
To enable built-in behaviors, specify them via a comma-separated list passed to the `--behaviors` option. All behaviors except Autoclick are enabled by default, the equivalent of `--behaviors autoscroll,autoplay,autofetch,siteSpecific`. To enable only a single behavior, such as Autoscroll, use `--behaviors autoscroll`.
The repeatable `--customBehaviors` flag can accept:
To only use Autoclick but not Autoscroll, use `--behaviors autoclick,autoplay,autofetch,siteSpecific`.
- A path to a directory of behavior files
- A path to a single behavior file
- A URL for a single behavior file to download
- A URL for a git repository of the form `git+https://git.example.com/repo.git`, with optional query parameters `branch` (to specify a particular branch to use) and `path` (to specify a relative path to a directory within the git repository where the custom behaviors are located)
The `--siteSpecific` flag enables all site-specific behaviors to be enabled, but only one behavior can be run per site. Each site-specific behavior specifies which site it should run on.
To disable all behaviors, use `--behaviors ""`.
## Behavior and Page Timeouts
Browsertrix includes a number of timeouts, including before, during and after running behaviors.
The timeouts are as follows:
- `--waitUntil`: how long to wait for page to finish loading, *before* doing anything else.
- `--postLoadDelay`: how long to wait *before* starting any behaviors, but after page has finished loading. A custom behavior can override this (see below).
- `--behaviorTimeout`: maximum time to spend on running site-specific / Autoscroll behaviors (can be less if behavior finishes early).
- `--pageExtraDelay`: how long to wait *after* finishing behaviors (or after `behaviorTimeout` has been reached) before moving on to next page.
A site-specific behavior (or Autoscroll) will start after the page is loaded (at most after `--waitUntil` seconds) and exactly after `--postLoadDelay` seconds.
The behavior will then run until finished or at most until `--behaviorTimeout` is reached (90 seconds by default).
## Loading Custom Behaviors
Browsertrix Crawler also supports fully user-defined behaviors, which have all the capabilities of the built-in behaviors.
They can use a library of provided functions, and run on one or more pages in the crawl.
Custom behaviors are specified with the `--customBehaviors` flag, which can be repeated and can accept the following options.
- A path to a single behavior file. This can be mounted into the crawler as a volume.
- A path to a directory of behavior files. This can be mounted into the crawler as a volume.
- A URL for a single behavior file to download. This should be a URL that the crawler has access to.
- A URL for a git repository of the form `git+https://git.example.com/repo.git`, with optional query parameters `branch` (to specify a particular branch to use) and `path` (to specify a relative path to a directory within the git repository where the custom behaviors are located). This should be a git repo the crawler has access to without additional auth.
### Examples
@ -52,3 +84,186 @@ docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --u
```sh
docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com/ --customBehaviors "git+https://git.example.com/custom-behaviors?branch=dev&path=path/to/behaviors"
```
## Creating Custom Behaviors
A custom behavior file can be in one of the following supported formats:
- JSON User Flow
- JavaScript / Typescript (compiled to JavaScript)
### JSON Flow Behaviors
Browsertrix Crawler 1.6 and up supports replaying the JSON User Flow format generated by [DevTools Recorder](https://developer.chrome.com/docs/devtools/recorder), which is built-in to Chrome devtools.
This format can be generated by using the DevTools Recorder to create a series of steps, which are serialized to JSON.
The format represents a series of steps that should happen on a particular page.
The recorder is capable of picking the right selectors interactively and supports events such as `click`, `change`, `waitForElement` and more. See the [feature reference](https://developer.chrome.com/docs/devtools/recorder/reference) for a more complete list.
#### User Flow Extensions
Browsertrix extends the functionality compared to DevTools Recorder in the following ways:
- Browsertrix Crawler will attempt to continue even if initial step fails, for up to 3 failures.
- If a step is repeated 3 or more times, Browsertrix Crawler will attempt to repeat the step as far as it can until the step fails.
- Browsertrix Crawler ignores the `navigate` and `viewport` step. The `navigate` event is used to match when a particular user flow should run, but does not navigate away from the page.
- If `navigate` step is removed, user flow can run on every page in the crawler.
- A `customStep` step with name `runOncePerCrawl` can be added to indicate that a user flow should run only once for a given crawl.
### JavaScript Behaviors
The main native format of custom behaviors is a Javascript class.
There should be a single class per file, and it should be of the following format:
#### Behavior Class
```javascript
class MyBehavior
{
// required: an id for this behavior, will be displayed in the logs
// when the behavior is run.
static id = "My Behavior Id";
// required: a function that checks if a behavior should be run
// for a given page.
// This function can check the DOM / window.location to determine
// what page it is on. The first behavior that returns 'true'
// for a given page is used on that page.
static isMatch() {
return window.location.href === "https://my-site.example.com/";
}
// optional: if true, will also check isMatch() and possibly run
// this behavior in each iframe.
// if false, or not defined, this behavior will be skipped for iframes.
static runInIframes = false;
// optional: if defined, provides a way to define a custom way to determine
// when a page has finished loading beyond the standard 'load' event.
//
// if defined, the crawler will await 'awaitPageLoad()' before moving on to
// post-crawl processing operations, including link extraction, screenshots,
// and running main behavior
async awaitPageLoad() {
}
// required: the main behavior async iterator, which should yield for
// each 'step' in the behavior.
// When the iterator finishes, the behavior is done.
// (See below for more info)
async* run(ctx) {
//... yield ctx.getState("starting behavior");
// do something
//... yield ctx.getState("a step has been performed");
}
}
```
#### Behavior run() loop
The `run()` loop provides the main loop for the behavior to run. It must be an [async iterator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/AsyncIterator), which means that it can optionally call `yield` to return state to the crawler and allow it to print the state.
For example, a behavior that iterates over elements and then clicks them either once or twice (based on the value of a custom `.clickTwice` property) could be written as follows:
```javascript
async* run(ctx) {
let click = 0;
let dblClick = 0;
for await (const elem of document.querySelectorAll(".my-selector")) {
if (elem.clickTwice) {
elem.click();
elem.click();
dblClick++;
} else {
elem.click();
click++;
}
ctx.log({msg: "Clicked on elem", click, dblClick});
}
}
```
This behavior will run to completion and log every time a click event is made. However, this behavior can not be paused and resumed (supported in ArchiveWeb.page) and generally can not be interrupted.
One approach is to yield after every major 'step' in the behavior, for example:
```javascript
async* run(ctx) {
let click = 0;
let dblClick = 0;
for await (const elem of document.querySelectorAll(".my-selector")) {
if (elem.clickTwice) {
elem.click();
elem.click();
dblClick++;
// allows behavior to be paused here
yield {msg: "Double-clicked on elem", click, dblClick};
} else {
elem.click();
click++;
// allows behavior to be paused here
yield {msg: "Single-clicked on elem", click, dblClick};
}
}
}
```
The data that is yielded will be logged in the `behaviorScriptCustom` context.
This allows for the behavior to log the current state of the behavior and allow for it to be gracefully
interrupted after each logical 'step'.
#### getState() function
A common pattern is to increment a particular counter, and then return the whole state.
A convenience function `getState()` is provided to simplify this and avoid the need to create custom counters.
Using this standard function, the above code might be condensed as follows:
```javascript
async* run(ctx) {
const { Lib } = ctx;
for await (const elem of document.querySelectorAll(".my-selector")) {
if (elem.clickTwice) {
elem.click();
elem.click();
yield Lib.getState("Double-Clicked on elem", "dblClick");
} else {
elem.click();
yield Lib.getState("Single-Clicked on elem", "click");
}
}
}
```
#### Utility Functions
In addition to `getState()`, Browsertrix Behaviors includes [a small library of other utility functions](https://github.com/webrecorder/browsertrix-behaviors/blob/main/src/lib/utils.ts) which are available to behaviors under `ctx.Lib`.
Some of these functions which may be of use to behaviors authors are:
- `scrollAndClick`: scroll element into view and click
- `sleep`: sleep for specified timeout (ms)
- `waitUntil`: wait until a certain predicate is true
- `waitUntilNode`: wait until a DOM node exists
- `xpathNode`: find a DOM node by xpath
- `xpathNodes`: find and iterate all DOM nodes by xpath
- `xpathString`: find a string attribute by xpath
- `iterChildElem`: iterate over all child elements of given element
- `iterChildMatches`: iterate over all child elements that match a specific xpath
- `isInViewport`: determine if a given element is in the visible viewport
- `scrollToOffset`: scroll to particular offset
- `scrollIntoView`: smoothly scroll particular element into view
- `getState`: increment a state counter and return all state counters + string message
More detailed references will be added in the future.

View file

@ -19,8 +19,7 @@ Options:
crawl configuration (can also be se
t via CRAWL_ID env var), defaults to
combination of Docker container hos
tname and collection
[string] [default: "@hostname-@id"]
tname and collection [string]
--waitUntil Puppeteer page.goto() condition to w
ait for before continuing, can be mu
ltiple separated by ','
@ -88,6 +87,9 @@ Options:
[number] [default: 1000000000]
--generateWACZ, --generatewacz, --ge If set, generate WACZ on disk
nerateWacz [boolean] [default: false]
--useSHA1 If set, sha-1 instead of sha-256 has
hes will be used for creating record
s [boolean] [default: false]
--logging Logging options for crawler, can inc
lude: stats (enabled by default), js
errors, debug
@ -100,16 +102,17 @@ Options:
[array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
, "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
inks", "sitemap", "wacz", "replay", "proxy"] [default: []]
orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
[]]
--logExcludeContext Comma-separated list of contexts to
NOT include in logs
[array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
, "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
inks", "sitemap", "wacz", "replay", "proxy"] [default: ["recorderNetwork","jsE
rror","screencast"]]
orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
["recorderNetwork","jsError","screencast"]]
--text Extract initial (default) or final t
ext to pages.jsonl or WARC resource
record(s)
@ -236,6 +239,9 @@ Options:
[array] [default: []]
--logErrorsToRedis If set, write error messages to redi
s [boolean] [default: false]
--logBehaviorsToRedis If set, write behavior script messag
es to redis
[boolean] [default: false]
--writePagesToRedis If set, write page objects to redis
[boolean] [default: false]
--maxPageRetries, --retries If set, number of times to retry a p

View file

@ -74,7 +74,7 @@ Only key-based authentication is supposed for SSH proxies for now.
## Browser Profiles
The above proxy settings also apply to [Browser Profile Creation](../browser-profiles), and browser profiles can also be created using proxies, for example:
The above proxy settings also apply to [Browser Profile Creation](browser-profiles.md), and browser profiles can also be created using proxies, for example:
```sh
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts

View file

@ -63,7 +63,7 @@ nav:
markdown_extensions:
- toc:
toc_depth: 3
toc_depth: 4
permalink: true
- pymdownx.highlight:
anchor_linenums: true

View file

@ -1,6 +1,6 @@
{
"name": "browsertrix-crawler",
"version": "1.6.0-beta.1",
"version": "1.6.0",
"main": "browsertrix-crawler",
"type": "module",
"repository": "https://github.com/webrecorder/browsertrix-crawler",