Update Behaviors Docs (#820)

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-12-08 06:09:48 +00:00 · 2025-04-10 09:58:07 +02:00 · 2025-04-10 09:58:07 +02:00 · 1cb1b2edb9
commit 1cb1b2edb9
parent f2dac05577
5 changed files with 246 additions and 25 deletions
--- a/docs/docs/user-guide/behaviors.md
+++ b/docs/docs/user-guide/behaviors.md
@ -1,31 +1,63 @@
 # Browser Behaviors

-Browsertrix Crawler supports automatically running customized in-browser behaviors. The behaviors auto-play videos (when possible), auto-fetch content that is not loaded by default, and also run custom behaviors on certain sites.
+Browsertrix Crawler supports automatically running customized behaviors on each page. Several types of behaviors are supported, including built-in, background, and site-specific behaviors. It is also possible to add fully user-defined custom behaviors that trigger specific actions on certain pages.

-To run behaviors, specify them via a comma-separated list passed to the `--behaviors` option. All behaviors are enabled by default, the equivalent of `--behaviors autoscroll,autoplay,autofetch,siteSpecific`. To enable only a single behavior, such as autoscroll, use `--behaviors autoscroll`.
+## Built-In Behaviors

-The site-specific behavior (or autoscroll) will start running after the page is finished its initial load (as defined by the `--waitUntil` settings). The behavior will then run until finished or until the behavior timeout is exceeded. This timeout can be set (in seconds) via the `--behaviorTimeout` flag (90 seconds by default). Setting the timeout to 0 will allow the behavior to run until it is finished.
+The built-in behaviors include the following background behaviors, which run continually 'in the background', checking for changes:
 
-See [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors) for more info on all of the currently available behaviors.
+- Autoplay: find and start playing (when possible) any video or audio on the page (and in each iframe).
+- Autofetch: find and start fetching any URLs that may not be fetched by default, such as other resolutions in `img` tags, `data-*` attributes, lazy-loaded resources, etc.
+- Autoclick: select all tags (default: `a` tag, customizable via `--clickSelector`) that may be clickable and attempt to click them while avoiding navigation away from the page.

-Browsertrix Crawler includes a `--pageExtraDelay`/`--delay` option, which can be used to have the crawler sleep for a configurable number of seconds after behaviors before moving on to the next page.
+There is also a built-in 'main' behavior, which runs to completion (or until a timeout is reached):

-To disable behaviors for a crawl, use `--behaviors ""`.
+- Autoscroll: Determine if a page might need scrolling, and scroll either up or down while new elements are being added. Continue until timeout is reached or scrolling is no longer possible.

-## Additional Custom Behaviors
+## Site-Specific Behaviors

-Custom behaviors can be mounted into the crawler and ran from there, or downloaded from a URL.
+Browsertrix also comes with several 'site-specific' behaviors, which run only on specific sites. These behaviors will run instead of Autoscroll and will run until completion or timeout. Currently, site-specific behaviors include major social media sites.

-Each behavior should contain a single class that implements the behavior interface. See [the behaviors tutorial](https://github.com/webrecorder/browsertrix-behaviors/blob/main/docs/TUTORIAL.md) for more info on how to write behaviors.
+Refer to [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors) for the latest list of site-specific behaviors.

-The first behavior which returns true for `isMatch()` will be run on a given page.
+User-defined custom behaviors are also considered site-specific.
 
-The repeatable `--customBehaviors` flag can accept:
+## Enabling Behaviors

- A path to a directory of behavior files
- A path to a single behavior file
- A URL for a single behavior file to download
- A URL for a git repository of the form `git+https://git.example.com/repo.git`, with optional query parameters `branch` (to specify a particular branch to use) and `path` (to specify a relative path to a directory within the git repository where the custom behaviors are located)
+To enable built-in behaviors, specify them via a comma-separated list passed to the `--behaviors` option. All behaviors except Autoclick are enabled by default, which is equivalent to `--behaviors autoscroll,autoplay,autofetch,siteSpecific`. To enable only a single behavior, such as Autoscroll, use `--behaviors autoscroll`.
+
+To enable Autoclick in place of Autoscroll, while keeping the other default behaviors, use `--behaviors autoclick,autoplay,autofetch,siteSpecific`.
+
+The `siteSpecific` value (passed to `--behaviors`) enables all site-specific behaviors, but only one behavior can be run per site. Each site-specific behavior specifies which site it should run on.
+
+To disable all behaviors, use `--behaviors ""`.
+
+## Behavior and Page Timeouts
+
+Browsertrix includes a number of timeouts and delays that apply before, during, and after running behaviors.
+They are as follows:
+
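+- `--waitUntil`: the page load condition(s), passed to Puppeteer's `page.goto()`, that must be met *before* doing anything else. This is a condition rather than a duration; the maximum time to wait for a page to load is set separately via `--pageLoadTimeout`.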
+- `--postLoadDelay`: how long to wait *before* starting any behaviors, but after page has finished loading. A custom behavior can override this (see below).
+- `--behaviorTimeout`: maximum time to spend on running site-specific / Autoscroll behaviors (can be less if behavior finishes early).
+- `--pageExtraDelay`: how long to wait *after* finishing behaviors (or after `behaviorTimeout` has been reached) before moving on to next page.
+
+A site-specific behavior (or Autoscroll) will start once the page has loaded (as determined by `--waitUntil`) and the `--postLoadDelay` has then elapsed.
+
+The behavior will then run until finished or at most until `--behaviorTimeout` is reached (90 seconds by default).
+
+## Loading Custom Behaviors
+
+Browsertrix Crawler also supports fully user-defined behaviors, which have all the capabilities of the built-in behaviors.
+
+They can use a library of provided functions, and run on one or more pages in the crawl.
+
+Custom behaviors are specified with the `--customBehaviors` flag, which can be repeated and accepts any of the following:
+
+- A path to a single behavior file. This can be mounted into the crawler as a volume.
+- A path to a directory of behavior files. This can be mounted into the crawler as a volume.
+- A URL for a single behavior file to download. This should be a URL that the crawler has access to.
+- A URL for a git repository of the form `git+https://git.example.com/repo.git`, with optional query parameters `branch` (to specify a particular branch to use) and `path` (to specify a relative path to a directory within the git repository where the custom behaviors are located). This should be a git repo the crawler has access to without additional auth.

 ### Examples

@ -52,3 +84,186 @@ docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --u
 ```sh
 docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com/ --customBehaviors "git+https://git.example.com/custom-behaviors?branch=dev&path=path/to/behaviors"
 ```
+
+## Creating Custom Behaviors
+
+A custom behavior file can be in one of the following supported formats:
+- JSON User Flow
+- JavaScript / TypeScript (compiled to JavaScript)
+
+### JSON Flow Behaviors
+
+Browsertrix Crawler 1.6 and up supports replaying the JSON User Flow format generated by [DevTools Recorder](https://developer.chrome.com/docs/devtools/recorder), which is built into Chrome DevTools.
+
+This format can be generated by using the DevTools Recorder to create a series of steps, which are serialized to JSON.
+
+The format represents a series of steps that should happen on a particular page.
+
+The recorder is capable of picking the right selectors interactively and supports events such as `click`, `change`, `waitForElement` and more. See the [feature reference](https://developer.chrome.com/docs/devtools/recorder/reference) for a more complete list.
+
+#### User Flow Extensions
+
+Browsertrix extends the functionality compared to DevTools Recorder in the following ways:
+
+- Browsertrix Crawler will attempt to continue even if the initial step fails, for up to 3 failures.
+
+- If a step is repeated 3 or more times, Browsertrix Crawler will attempt to repeat the step as far as it can until the step fails.
+
+- Browsertrix Crawler ignores the `navigate` and `viewport` steps. The `navigate` step is used to match when a particular user flow should run, but does not navigate away from the page.
+
+- If the `navigate` step is removed, the user flow can run on every page in the crawl.
+
+- A `customStep` step with name `runOncePerCrawl` can be added to indicate that a user flow should run only once for a given crawl.
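
The extensions above can be sketched as a small user flow. This is an illustration only: the step types (`navigate`, `waitForElement`, `click`, `customStep`) and the `runOncePerCrawl` name come from the text above, while the field names (`title`, `url`, `selectors`, `offsetX`, `offsetY`, `parameters`) are assumptions based on the DevTools Recorder export format, and the URL and selector are placeholders. The flow is written as a JavaScript object and serialized to JSON, which is the form a behavior file would take.

```javascript
// Hypothetical user flow; the URL and selectors are placeholders.
const userFlow = {
  title: "Load more items (hypothetical example)",
  steps: [
    // Per the list above, `navigate` is only used to match which pages
    // the flow should run on; the crawler does not navigate away.
    { type: "navigate", url: "https://my-site.example.com/" },
    { type: "waitForElement", selectors: [["#load-more"]] },
    { type: "click", selectors: [["#load-more"]], offsetX: 1, offsetY: 1 },
    // Browsertrix extension: run this flow only once for the whole crawl.
    { type: "customStep", name: "runOncePerCrawl", parameters: {} },
  ],
};

// Serialize to the JSON text that would be saved as the behavior file.
const serialized = JSON.stringify(userFlow, null, 2);
console.log(serialized);
```

Removing the `navigate` entry from `steps` would, per the list above, let the flow run on every page rather than only on the matching one.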
+
+### JavaScript Behaviors
+
+The main native format of custom behaviors is a JavaScript class.
+
+There should be a single class per file, and it should be of the following format:
+
+#### Behavior Class
+
+```javascript
+class MyBehavior
+{
+  // required: an id for this behavior, will be displayed in the logs
+  // when the behavior is run.
+  static id = "My Behavior Id";
+
+  // required: a function that checks if a behavior should be run
+  // for a given page.
+  // This function can check the DOM / window.location to determine
+  // what page it is on. The first behavior that returns 'true'
+  // for a given page is used on that page.
+  static isMatch() {
+    return window.location.href === "https://my-site.example.com/";
+  }
+
+  // optional: if true, will also check isMatch() and possibly run
+  // this behavior in each iframe.
+  // if false, or not defined, this behavior will be skipped for iframes.
+  static runInIframes = false;
+
+  // optional: if defined, provides a way to define a custom way to determine
+  // when a page has finished loading beyond the standard 'load' event.
+  //
+  // if defined, the crawler will await 'awaitPageLoad()' before moving on to
+  // post-crawl processing operations, including link extraction, screenshots,
+  // and running main behavior
+  async awaitPageLoad() {
+
+  }
+
+  // required: the main behavior async iterator, which should yield for
+  // each 'step' in the behavior.
+  // When the iterator finishes, the behavior is done.
+  // (See below for more info)
+  async* run(ctx) {
+    //... yield ctx.getState("starting behavior");
+
+    // do something
+
+    //... yield ctx.getState("a step has been performed");
+  }
+}
+```
+
+#### Behavior run() loop
+
+The `run()` loop provides the main loop for the behavior to run. It must be an [async iterator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/AsyncIterator), which means that it can optionally call `yield` to return state to the crawler and allow it to print the state.
+
+For example, a behavior that iterates over elements and then clicks them either once or twice (based on the value of a custom `.clickTwice` property) could be written as follows:
+
+```javascript
+  async* run(ctx) {
+    let click = 0;
+    let dblClick = 0;
+    for await (const elem of document.querySelectorAll(".my-selector")) {
+      if (elem.clickTwice) {
+        elem.click();
+        elem.click();
+        dblClick++;
+      } else {
+        elem.click();
+        click++;
+      }
+      ctx.log({msg: "Clicked on elem", click, dblClick});
+    }
+  }
+```
+
+This behavior will run to completion and log every time a click event is made. However, this behavior cannot be paused and resumed (supported in ArchiveWeb.page) and generally cannot be interrupted.
+
+One approach is to yield after every major 'step' in the behavior, for example:
+
+```javascript
+  async* run(ctx) {
+    let click = 0;
+    let dblClick = 0;
+    for await (const elem of document.querySelectorAll(".my-selector")) {
+      if (elem.clickTwice) {
+        elem.click();
+        elem.click();
+        dblClick++;
+        // allows behavior to be paused here
+        yield {msg: "Double-clicked on elem", click, dblClick};
+      } else {
+        elem.click();
+        click++;
+        // allows behavior to be paused here
+        yield {msg: "Single-clicked on elem", click, dblClick};
+      }
+    }
+  }
+```
+
+The data that is yielded will be logged in the `behaviorScriptCustom` context.
+
+This allows for the behavior to log the current state of the behavior and allow for it to be gracefully
+interrupted after each logical 'step'.
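
To see why yielding makes a behavior interruptible, consider a minimal sketch of a consumer that pulls one step at a time and may stop between steps. The `drive` function and its `shouldStop` callback are hypothetical names used for illustration; this is not how the crawler itself is implemented. The stand-in `run()` avoids the DOM so the sketch can be run directly in Node.js.

```javascript
// Stand-in behavior: yields once per logical step, with no DOM access.
async function* run() {
  for (let step = 1; step <= 5; step++) {
    yield { msg: "Finished step", step };
  }
}

// Hypothetical driver: consumes yielded state and checks a stop condition
// after each step. Control only returns here when the behavior yields.
async function drive(iterator, shouldStop) {
  const seen = [];
  for await (const state of iterator) {
    seen.push(state);
    if (shouldStop(state)) {
      // Leaving the loop early also calls iterator.return(),
      // which closes the generator cleanly.
      break;
    }
  }
  return seen;
}

// Stop after the third step, even though the behavior has five.
const finished = drive(run(), (state) => state.step >= 3);
finished.then((seen) => console.log(seen.map((s) => s.step)));
```

A behavior that never yields gives the consumer no such checkpoint, which is why the logging-only version cannot be paused.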
+
+#### getState() function
+
+A common pattern is to increment a particular counter, and then return the whole state.
+
+A convenience function `getState()` is provided to simplify this and avoid the need to create custom counters.
+
+Using this standard function, the above code might be condensed as follows:
+
+```javascript
+  async* run(ctx) {
+    const { Lib } = ctx;
+    for await (const elem of document.querySelectorAll(".my-selector")) {
+      if (elem.clickTwice) {
+        elem.click();
+        elem.click();
+        yield Lib.getState("Double-Clicked on elem", "dblClick");
+      } else {
+        elem.click();
+        yield Lib.getState("Single-Clicked on elem", "click");
+      }
+    }
+  }
+```
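
For intuition about the counter bookkeeping, here is a hypothetical stand-in that mimics the described effect of `getState()`: increment a named counter, then return all counters along with a message. It is not the library's implementation, and both the simplified signature and the `{ state, msg }` return shape are assumptions made for this sketch.

```javascript
// Hypothetical stand-in, NOT the Browsertrix Behaviors implementation.
// makeGetState is an illustrative helper that keeps counters in a closure.
function makeGetState() {
  const state = {};
  return function getState(msg, counterName) {
    if (counterName) {
      // Increment the named counter, starting from zero if unseen.
      state[counterName] = (state[counterName] || 0) + 1;
    }
    // Return a copy so earlier snapshots are not changed by later calls.
    return { state: { ...state }, msg };
  };
}

const getState = makeGetState();
getState("Single-Clicked on elem", "click");
getState("Double-Clicked on elem", "dblClick");
const latest = getState("Single-Clicked on elem", "click");
console.log(latest);
```

Each yielded value therefore carries the full set of counters so far, which is what removes the need for the manual `click` and `dblClick` variables in the earlier version.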
+
+#### Utility Functions
+
+In addition to `getState()`, Browsertrix Behaviors includes [a small library of other utility functions](https://github.com/webrecorder/browsertrix-behaviors/blob/main/src/lib/utils.ts) which are available to behaviors under `ctx.Lib`.
+
+Some of these functions which may be of use to behavior authors are:
+
+- `scrollAndClick`: scroll element into view and click
+- `sleep`: sleep for specified timeout (ms)
+- `waitUntil`: wait until a certain predicate is true
+- `waitUntilNode`: wait until a DOM node exists
+- `xpathNode`: find a DOM node by xpath
+- `xpathNodes`: find and iterate all DOM nodes by xpath
+- `xpathString`: find a string attribute by xpath
+- `iterChildElem`: iterate over all child elements of given element
+- `iterChildMatches`: iterate over all child elements that match a specific xpath
+- `isInViewport`: determine if a given element is in the visible viewport
+- `scrollToOffset`: scroll to particular offset
+- `scrollIntoView`: smoothly scroll particular element into view
+- `getState`: increment a state counter and return all state counters + string message
+
+More detailed references will be added in the future.
--- a/docs/docs/user-guide/cli-options.md
+++ b/docs/docs/user-guide/cli-options.md
@ -19,8 +19,7 @@ Options:
                                             crawl configuration (can also be se
                                            t via CRAWL_ID env var), defaults to
                                             combination of Docker container hos
-                                            tname and collection
-                                             [string] [default: "@hostname-@id"]
+                                            tname and collection        [string]
      --waitUntil                           Puppeteer page.goto() condition to w
                                            ait for before continuing, can be mu
                                            ltiple separated by ','
@ -88,6 +87,9 @@ Options:
                                                  [number] [default: 1000000000]
      --generateWACZ, --generatewacz, --ge  If set, generate WACZ on disk
      nerateWacz                                      [boolean] [default: false]
+      --useSHA1                             If set, sha-1 instead of sha-256 has
+                                            hes will be used for creating record
+                                            s         [boolean] [default: false]
      --logging                             Logging options for crawler, can inc
                                            lude: stats (enabled by default), js
                                            errors, debug
@ -100,16 +102,17 @@ Options:
  [array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
  , "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
  ", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
-  orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
-                      inks", "sitemap", "wacz", "replay", "proxy"] [default: []]
+  orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
+  atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
+                                                                             []]
      --logExcludeContext                   Comma-separated list of contexts to
                                            NOT include in logs
  [array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
  , "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
  ", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
-  orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
-  inks", "sitemap", "wacz", "replay", "proxy"] [default: ["recorderNetwork","jsE
-                                                            rror","screencast"]]
+  orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
+  atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
+                                     ["recorderNetwork","jsError","screencast"]]
      --text                                Extract initial (default) or final t
                                            ext to pages.jsonl or WARC resource
                                            record(s)
@ -236,6 +239,9 @@ Options:
                                                           [array] [default: []]
      --logErrorsToRedis                    If set, write error messages to redi
                                            s         [boolean] [default: false]
+      --logBehaviorsToRedis                 If set, write behavior script messag
+                                            es to redis
+                                                      [boolean] [default: false]
      --writePagesToRedis                   If set, write page objects to redis
                                                      [boolean] [default: false]
      --maxPageRetries, --retries           If set, number of times to retry a p
--- a/docs/docs/user-guide/proxies.md
+++ b/docs/docs/user-guide/proxies.md
@ -74,7 +74,7 @@ Only key-based authentication is supposed for SSH proxies for now.

 ## Browser Profiles

-The above proxy settings also apply to [Browser Profile Creation](../browser-profiles), and browser profiles can also be created using proxies, for example:
+The above proxy settings also apply to [Browser Profile Creation](browser-profiles.md), and browser profiles can also be created using proxies, for example:

 ```sh
 docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@ -63,7 +63,7 @@ nav:

 markdown_extensions:
  - toc:
-      toc_depth: 3
+      toc_depth: 4
      permalink: true
  - pymdownx.highlight:
      anchor_linenums: true
--- a/package.json
+++ b/package.json
@ -1,6 +1,6 @@
 {
  "name": "browsertrix-crawler",
-  "version": "1.6.0-beta.1",
+  "version": "1.6.0",
  "main": "browsertrix-crawler",
  "type": "module",
  "repository": "https://github.com/webrecorder/browsertrix-crawler",