mirror of
https://github.com/webrecorder/browsertrix-crawler.git
synced 2025-10-19 06:23:16 +00:00
Update Behaviors Docs (#820)
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
parent
f2dac05577
commit
1cb1b2edb9
5 changed files with 246 additions and 25 deletions
|
@ -1,31 +1,63 @@
|
|||
# Browser Behaviors
|
||||
|
||||
Browsertrix Crawler supports automatically running customized in-browser behaviors. The behaviors auto-play videos (when possible), auto-fetch content that is not loaded by default, and also run custom behaviors on certain sites.
|
||||
Browsertrix Crawler supports automatically running customized behaviors on each page. Several types of behaviors are supported, including built-in, background, and site-specific behaviors. It is also possible to add fully user-defined custom behaviors that trigger specific actions on certain pages.
|
||||
|
||||
To run behaviors, specify them via a comma-separated list passed to the `--behaviors` option. All behaviors are enabled by default, the equivalent of `--behaviors autoscroll,autoplay,autofetch,siteSpecific`. To enable only a single behavior, such as autoscroll, use `--behaviors autoscroll`.
|
||||
## Built-In Behaviors
|
||||
|
||||
The site-specific behavior (or autoscroll) will start running after the page has finished its initial load (as defined by the `--waitUntil` settings). The behavior will then run until finished or until the behavior timeout is exceeded. This timeout can be set (in seconds) via the `--behaviorTimeout` flag (90 seconds by default). Setting the timeout to 0 will allow the behavior to run until it is finished.
|
||||
The built-in behaviors include the following background behaviors which run 'in the background' continually checking for changes:
|
||||
|
||||
See [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors) for more info on all of the currently available behaviors.
|
||||
- Autoplay: find and start playing (when possible) any video or audio on the page (and in each iframe).
|
||||
- Autofetch: find and start fetching any URLs that may not be fetched by default, such as other resolutions in `img` tags, `data-*`, lazy-loaded resources, etc.
|
||||
- Autoclick: select all tags (default: `a` tag, customizable via `--clickSelector`) that may be clickable and attempt to click them while avoiding navigation away from the page.
|
||||
|
||||
Browsertrix Crawler includes a `--pageExtraDelay`/`--delay` option, which can be used to have the crawler sleep for a configurable number of seconds after behaviors before moving on to the next page.
|
||||
There is also a built-in 'main' behavior, which runs to completion (or until a timeout is reached):
|
||||
|
||||
To disable behaviors for a crawl, use `--behaviors ""`.
|
||||
- Autoscroll: Determine if a page might need scrolling, and scroll either up or down while new elements are being added. Continue until timeout is reached or scrolling is no longer possible.
|
||||
|
||||
## Additional Custom Behaviors
|
||||
## Site-Specific Behaviors
|
||||
|
||||
Custom behaviors can be mounted into the crawler and run from there, or downloaded from a URL.
|
||||
Browsertrix also comes with several 'site-specific' behaviors, which run only on specific sites. These behaviors will run instead of Autoscroll and will run until completion or timeout. Currently, site-specific behaviors include major social media sites.
|
||||
|
||||
Each behavior should contain a single class that implements the behavior interface. See [the behaviors tutorial](https://github.com/webrecorder/browsertrix-behaviors/blob/main/docs/TUTORIAL.md) for more info on how to write behaviors.
|
||||
Refer to [Browsertrix Behaviors](https://github.com/webrecorder/browsertrix-behaviors) for the latest list of site-specific behaviors.
|
||||
|
||||
The first behavior which returns true for `isMatch()` will be run on a given page.
|
||||
User-defined custom behaviors are also considered site-specific.
|
||||
|
||||
The repeatable `--customBehaviors` flag can accept:
|
||||
## Enabling Behaviors
|
||||
|
||||
- A path to a directory of behavior files
|
||||
- A path to a single behavior file
|
||||
- A URL for a single behavior file to download
|
||||
- A URL for a git repository of the form `git+https://git.example.com/repo.git`, with optional query parameters `branch` (to specify a particular branch to use) and `path` (to specify a relative path to a directory within the git repository where the custom behaviors are located)
|
||||
To enable built-in behaviors, specify them via a comma-separated list passed to the `--behaviors` option. All behaviors except Autoclick are enabled by default, which is equivalent to `--behaviors autoscroll,autoplay,autofetch,siteSpecific`. To enable only a single behavior, such as Autoscroll, use `--behaviors autoscroll`.
|
||||
|
||||
To use Autoclick instead of Autoscroll, use `--behaviors autoclick,autoplay,autofetch,siteSpecific`.
|
||||
|
||||
The `--siteSpecific` flag enables all site-specific behaviors, but only one behavior can run per site. Each site-specific behavior specifies which site it should run on.
|
||||
|
||||
To disable all behaviors, use `--behaviors ""`.
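
The rules above can be modelled with a short sketch. `parseBehaviors` and `DEFAULT_BEHAVIORS` are hypothetical names used only for illustration; this is not the crawler's own option parser, just a way to see how the default list, a single behavior, and the empty string relate.

```javascript
// Hypothetical helper, not part of Browsertrix Crawler: models how a
// --behaviors value maps to the list of enabled built-in behaviors.
const DEFAULT_BEHAVIORS = ["autoscroll", "autoplay", "autofetch", "siteSpecific"];

function parseBehaviors(value) {
  // Option not given at all: fall back to the documented defaults.
  if (value === undefined) {
    return [...DEFAULT_BEHAVIORS];
  }
  // An empty string (--behaviors "") leaves nothing enabled.
  return value
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0);
}

console.log(parseBehaviors(undefined));    // the four defaults
console.log(parseBehaviors("autoscroll")); // only Autoscroll
console.log(parseBehaviors(""));           // nothing enabled
```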
|
||||
|
||||
## Behavior and Page Timeouts
|
||||
|
||||
Browsertrix includes a number of timeouts that apply before, during, and after running behaviors.
|
||||
The timeouts are as follows:
|
||||
|
||||
- `--waitUntil`: the Puppeteer `page.goto()` condition(s) to wait for *before* doing anything else; this determines when the page counts as loaded.
- `--postLoadDelay`: how long to wait *before* starting any behaviors, but after the page has finished loading. A custom behavior can override this (see below).
- `--behaviorTimeout`: maximum time to spend running site-specific / Autoscroll behaviors (can be less if the behavior finishes early).
- `--pageExtraDelay`: how long to wait *after* finishing behaviors (or after `behaviorTimeout` has been reached) before moving on to the next page.
|
||||
|
||||
A site-specific behavior (or Autoscroll) will start once the page has loaded (as defined by `--waitUntil`) and a further `--postLoadDelay` seconds have passed.
|
||||
|
||||
The behavior will then run until finished or at most until `--behaviorTimeout` is reached (90 seconds by default).
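
As a rough illustration of how these settings add up for a single page, the sketch below estimates the time spent after the page has loaded. The function name, its parameter object, and the `behaviorDuration` input are assumptions made for this example; the defaults of 0 for the two delays are also assumed, while the 90-second behavior timeout and the "0 means run until finished" rule come from the text.

```javascript
// Hypothetical model of the per-page timing described above; the function
// and field names are illustrative, not crawler APIs. All values in seconds.
function estimateTimeAfterLoad({
  postLoadDelay = 0,    // assumed default
  behaviorTimeout = 90, // documented default
  behaviorDuration,     // how long the behavior would take if never cut off
  pageExtraDelay = 0,   // assumed default
}) {
  // A timeout of 0 lets the behavior run until it finishes.
  const behaviorTime =
    behaviorTimeout === 0
      ? behaviorDuration
      : Math.min(behaviorDuration, behaviorTimeout);
  return postLoadDelay + behaviorTime + pageExtraDelay;
}

// A behavior that would need 200s is cut off at 90s: 5 + 90 + 10.
console.log(
  estimateTimeAfterLoad({ postLoadDelay: 5, behaviorDuration: 200, pageExtraDelay: 10 })
); // prints 105
```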
|
||||
|
||||
## Loading Custom Behaviors
|
||||
|
||||
Browsertrix Crawler also supports fully user-defined behaviors, which have all the capabilities of the built-in behaviors.
|
||||
|
||||
They can use a library of provided functions, and run on one or more pages in the crawl.
|
||||
|
||||
Custom behaviors are specified with the `--customBehaviors` flag, which can be repeated and accepts any of the following:
|
||||
|
||||
- A path to a single behavior file. This can be mounted into the crawler as a volume.
|
||||
- A path to a directory of behavior files. This can be mounted into the crawler as a volume.
|
||||
- A URL for a single behavior file to download. This should be a URL that the crawler has access to.
|
||||
- A URL for a git repository of the form `git+https://git.example.com/repo.git`, with optional query parameters `branch` (to specify a particular branch to use) and `path` (to specify a relative path to a directory within the git repository where the custom behaviors are located). This should be a git repo the crawler has access to without additional auth.
|
||||
|
||||
### Examples
|
||||
|
||||
|
@ -52,3 +84,186 @@ docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --u
|
|||
```sh
docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com/ --customBehaviors "git+https://git.example.com/custom-behaviors?branch=dev&path=path/to/behaviors"
```
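
To make the structure of such a git URL explicit, here is a minimal sketch that splits it into the repository URL and the optional `branch` and `path` query parameters. `parseGitBehaviorSource` is a hypothetical helper written for this example and is not claimed to be how the crawler itself parses the value.

```javascript
// Hypothetical helper: splits a git+https custom-behaviors source into parts.
function parseGitBehaviorSource(source) {
  const prefix = "git+";
  if (!source.startsWith(prefix)) {
    throw new Error("not a git+ URL: " + source);
  }
  const url = new URL(source.slice(prefix.length));
  const branch = url.searchParams.get("branch"); // null when omitted
  const path = url.searchParams.get("path");     // null when omitted
  url.search = ""; // drop the query string to leave the plain repo URL
  return { repoUrl: url.toString(), branch, path };
}

const parsed = parseGitBehaviorSource(
  "git+https://git.example.com/custom-behaviors?branch=dev&path=path/to/behaviors"
);
console.log(parsed.repoUrl, parsed.branch, parsed.path);
```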
|
||||
|
||||
## Creating Custom Behaviors
|
||||
|
||||
A custom behavior file can be in one of the following supported formats:
|
||||
- JSON User Flow
|
||||
- JavaScript / TypeScript (compiled to JavaScript)
|
||||
|
||||
### JSON Flow Behaviors
|
||||
|
||||
Browsertrix Crawler 1.6 and up supports replaying the JSON User Flow format generated by [DevTools Recorder](https://developer.chrome.com/docs/devtools/recorder), which is built into Chrome DevTools.
|
||||
|
||||
This format can be generated by using the DevTools Recorder to create a series of steps, which are serialized to JSON.
|
||||
|
||||
The format represents a series of steps that should happen on a particular page.
|
||||
|
||||
The recorder is capable of picking the right selectors interactively and supports events such as `click`, `change`, `waitForElement` and more. See the [feature reference](https://developer.chrome.com/docs/devtools/recorder/reference) for a more complete list.
|
||||
|
||||
#### User Flow Extensions
|
||||
|
||||
Browsertrix Crawler extends the functionality of DevTools Recorder in the following ways:
|
||||
|
||||
- Browsertrix Crawler will attempt to continue even if an initial step fails, for up to 3 failures.

- If a step is repeated 3 or more times, Browsertrix Crawler will attempt to repeat the step as far as it can until the step fails.

- Browsertrix Crawler ignores the `navigate` and `viewport` steps. The `navigate` step is used to match when a particular user flow should run, but does not navigate away from the page.

- If the `navigate` step is removed, the user flow can run on every page in the crawl.

- A `customStep` step with name `runOncePerCrawl` can be added to indicate that a user flow should run only once for a given crawl.
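
For orientation, a user flow using these extensions might look roughly like the object below, written here as JavaScript and serialized to JSON. The title, URL, and selectors are made up, and the field layout (`title`, `steps`, `type`, `selectors`, `name`, `parameters`) reflects the DevTools Recorder export format as commonly seen; treat it as an assumption and compare against a flow exported from the Recorder itself.

```javascript
// Illustrative user flow; the title, URL and selectors are made up.
const userFlow = {
  title: "Expand comments",
  steps: [
    // Not executed as navigation; used to match the page this flow runs on.
    { type: "navigate", url: "https://my-site.example.com/" },
    { type: "click", selectors: [["#load-more"]], offsetX: 1, offsetY: 1 },
    { type: "waitForElement", selectors: [[".comment"]] },
    // Browsertrix extension: run this flow only once per crawl.
    { type: "customStep", name: "runOncePerCrawl", parameters: {} },
  ],
};

// Serialize to the JSON text that would be saved as a behavior file.
const json = JSON.stringify(userFlow, null, 2);
console.log(json);
```

The resulting JSON file would then be passed to the crawler via `--customBehaviors`, like any other custom behavior file.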
|
||||
|
||||
### JavaScript Behaviors
|
||||
|
||||
The main native format of custom behaviors is a JavaScript class.
|
||||
|
||||
There should be a single class per file, and it should be of the following format:
|
||||
|
||||
#### Behavior Class
|
||||
|
||||
```javascript
class MyBehavior
{
  // required: an id for this behavior, will be displayed in the logs
  // when the behavior is run.
  static id = "My Behavior Id";

  // required: a function that checks if a behavior should be run
  // for a given page.
  // This function can check the DOM / window.location to determine
  // what page it is on. The first behavior that returns 'true'
  // for a given page is used on that page.
  static isMatch() {
    return window.location.href === "https://my-site.example.com/";
  }

  // optional: if true, will also check isMatch() and possibly run
  // this behavior in each iframe.
  // if false, or not defined, this behavior will be skipped for iframes.
  static runInIframes = false;

  // optional: if defined, provides a way to define a custom way to determine
  // when a page has finished loading beyond the standard 'load' event.
  //
  // if defined, the crawler will await 'awaitPageLoad()' before moving on to
  // post-crawl processing operations, including link extraction, screenshots,
  // and running main behavior
  async awaitPageLoad() {

  }

  // required: the main behavior async iterator, which should yield for
  // each 'step' in the behavior.
  // When the iterator finishes, the behavior is done.
  // (See below for more info)
  async* run(ctx) {
    //... yield ctx.Lib.getState("starting behavior");

    // do something

    //... yield ctx.Lib.getState("a step has been performed");
  }
}
```
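
The matching rule noted in the class comments (the first behavior whose `isMatch()` returns true is used) can be sketched as follows. `pickBehavior` and the two example classes are hypothetical, and a plain `href` constant stands in for `window.location.href` so that the sketch also runs outside a browser.

```javascript
// Hypothetical selection logic; in a real crawl isMatch() inspects
// window.location or the DOM, here it checks a plain `href` constant.
const href = "https://my-site.example.com/";

class SiteBehavior {
  static id = "Site Behavior";
  static isMatch() {
    return href === "https://my-site.example.com/";
  }
}

class FallbackBehavior {
  static id = "Fallback";
  static isMatch() {
    return true; // matches every page
  }
}

function pickBehavior(behaviorClasses) {
  // find() returns the first class whose isMatch() is true, or undefined.
  return behaviorClasses.find((cls) => cls.isMatch());
}

console.log(pickBehavior([SiteBehavior, FallbackBehavior]).id); // prints "Site Behavior"
```

Because order decides the winner, a catch-all behavior such as the fallback above would need to come last in this model.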
|
||||
|
||||
#### Behavior run() loop
|
||||
|
||||
The `run()` loop provides the main loop for the behavior to run. It must be an [async iterator](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/AsyncIterator), which means that it can optionally call `yield` to return state to the crawler and allow it to print the state.
|
||||
|
||||
For example, a behavior that iterates over elements and then clicks them either once or twice (based on the value of a custom `.clickTwice` property) could be written as follows:
|
||||
|
||||
```javascript
async* run(ctx) {
  let click = 0;
  let dblClick = 0;
  for await (const elem of document.querySelectorAll(".my-selector")) {
    if (elem.clickTwice) {
      elem.click();
      elem.click();
      dblClick++;
    } else {
      elem.click();
      click++;
    }
    ctx.log({msg: "Clicked on elem", click, dblClick});
  }
}
```
|
||||
|
||||
This behavior will run to completion and log every time a click event is made. However, this behavior cannot be paused and resumed (supported in ArchiveWeb.page) and generally cannot be interrupted.
|
||||
|
||||
One approach is to yield after every major 'step' in the behavior, for example:
|
||||
|
||||
```javascript
async* run(ctx) {
  let click = 0;
  let dblClick = 0;
  for await (const elem of document.querySelectorAll(".my-selector")) {
    if (elem.clickTwice) {
      elem.click();
      elem.click();
      dblClick++;
      // allows behavior to be paused here
      yield {msg: "Double-clicked on elem", click, dblClick};
    } else {
      elem.click();
      click++;
      // allows behavior to be paused here
      yield {msg: "Single-clicked on elem", click, dblClick};
    }
  }
}
```
|
||||
|
||||
The data that is yielded will be logged in the `behaviorScriptCustom` context.
|
||||
|
||||
This allows the behavior to log its current state and to be gracefully interrupted after each logical 'step'.
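
To see why yielding makes interruption possible, consider a simplified driver that consumes the iterator and checks a stop condition between steps. `CountingBehavior`, `runBehavior`, and `shouldStop` are hypothetical names for this sketch; the crawler's real driver is not shown in this document and may work differently.

```javascript
// Stand-in behavior: yields one state object per step, as described above.
class CountingBehavior {
  async* run(ctx) {
    for (let step = 1; step <= 5; step++) {
      yield { msg: "step done", step };
    }
  }
}

// Hypothetical driver: consumes the iterator and may stop between yields.
async function runBehavior(behavior, ctx, shouldStop) {
  const logged = [];
  for await (const state of behavior.run(ctx)) {
    logged.push(state); // a real crawler would write this to its log
    if (shouldStop(state)) {
      break; // leaving the loop also finishes the generator cleanly
    }
  }
  return logged;
}

runBehavior(new CountingBehavior(), {}, (state) => state.step === 3)
  .then((logged) => console.log(logged.length)); // prints 3
```

A behavior that never yields gives such a driver no point at which to check the stop condition, which is the limitation of the first `run()` example.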
|
||||
|
||||
#### getState() function
|
||||
|
||||
A common pattern is to increment a particular counter, and then return the whole state.
|
||||
|
||||
A convenience function `getState()` is provided to simplify this and avoid the need to create custom counters.
|
||||
|
||||
Using this standard function, the above code might be condensed as follows:
|
||||
|
||||
```javascript
async* run(ctx) {
  const { Lib } = ctx;
  for await (const elem of document.querySelectorAll(".my-selector")) {
    if (elem.clickTwice) {
      elem.click();
      elem.click();
      yield Lib.getState("Double-Clicked on elem", "dblClick");
    } else {
      elem.click();
      yield Lib.getState("Single-Clicked on elem", "click");
    }
  }
}
```
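
Conceptually, a helper of this kind keeps a set of named counters, increments the one that is named, and returns all of them with the message. The stand-in below illustrates that idea only; `makeGetState` is a hypothetical name, and the shape of the returned object is an assumption rather than the library's documented return value.

```javascript
// Hypothetical stand-in for the library helper: keeps named counters,
// increments one, and returns every counter together with a message.
function makeGetState() {
  const state = {};
  return function getState(msg, counterName) {
    if (counterName) {
      state[counterName] = (state[counterName] || 0) + 1;
    }
    // Copy so that earlier results are not changed by later calls.
    return { state: { ...state }, msg };
  };
}

const getState = makeGetState();
getState("Single-Clicked on elem", "click");
console.log(getState("Double-Clicked on elem", "dblClick"));
```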
|
||||
|
||||
#### Utility Functions
|
||||
|
||||
In addition to `getState()`, Browsertrix Behaviors includes [a small library of other utility functions](https://github.com/webrecorder/browsertrix-behaviors/blob/main/src/lib/utils.ts) which are available to behaviors under `ctx.Lib`.
|
||||
|
||||
Some of these functions which may be of use to behavior authors are:
|
||||
|
||||
- `scrollAndClick`: scroll element into view and click
- `sleep`: sleep for specified timeout (ms)
- `waitUntil`: wait until a certain predicate is true
- `waitUntilNode`: wait until a DOM node exists
- `xpathNode`: find a DOM node by xpath
- `xpathNodes`: find and iterate all DOM nodes by xpath
- `xpathString`: find a string attribute by xpath
- `iterChildElem`: iterate over all child elements of given element
- `iterChildMatches`: iterate over all child elements that match a specific xpath
- `isInViewport`: determine if a given element is in the visible viewport
- `scrollToOffset`: scroll to particular offset
- `scrollIntoView`: smoothly scroll particular element into view
- `getState`: increment a state counter and return all state counters + string message
|
||||
|
||||
More detailed references will be added in the future.
|
||||
|
|
|
@ -19,8 +19,7 @@ Options:
|
|||
crawl configuration (can also be se
|
||||
t via CRAWL_ID env var), defaults to
|
||||
combination of Docker container hos
|
||||
tname and collection
|
||||
[string] [default: "@hostname-@id"]
|
||||
tname and collection [string]
|
||||
--waitUntil Puppeteer page.goto() condition to w
|
||||
ait for before continuing, can be mu
|
||||
ltiple separated by ','
|
||||
|
@ -88,6 +87,9 @@ Options:
|
|||
[number] [default: 1000000000]
|
||||
--generateWACZ, --generatewacz, --ge If set, generate WACZ on disk
|
||||
nerateWacz [boolean] [default: false]
|
||||
--useSHA1 If set, sha-1 instead of sha-256 has
|
||||
hes will be used for creating record
|
||||
s [boolean] [default: false]
|
||||
--logging Logging options for crawler, can inc
|
||||
lude: stats (enabled by default), js
|
||||
errors, debug
|
||||
|
@ -100,16 +102,17 @@ Options:
|
|||
[array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
|
||||
, "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
|
||||
", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
|
||||
orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
|
||||
inks", "sitemap", "wacz", "replay", "proxy"] [default: []]
|
||||
orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
|
||||
atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
|
||||
[]]
|
||||
--logExcludeContext Comma-separated list of contexts to
|
||||
NOT include in logs
|
||||
[array] [choices: "general", "worker", "recorder", "recorderNetwork", "writer"
|
||||
, "state", "redis", "storage", "text", "exclusion", "screenshots", "screencast
|
||||
", "originOverride", "healthcheck", "browser", "blocking", "behavior", "behavi
|
||||
orScript", "jsError", "fetch", "pageStatus", "memoryStatus", "crawlStatus", "l
|
||||
inks", "sitemap", "wacz", "replay", "proxy"] [default: ["recorderNetwork","jsE
|
||||
rror","screencast"]]
|
||||
orScript", "behaviorScriptCustom", "jsError", "fetch", "pageStatus", "memorySt
|
||||
atus", "crawlStatus", "links", "sitemap", "wacz", "replay", "proxy"] [default:
|
||||
["recorderNetwork","jsError","screencast"]]
|
||||
--text Extract initial (default) or final t
|
||||
ext to pages.jsonl or WARC resource
|
||||
record(s)
|
||||
|
@ -236,6 +239,9 @@ Options:
|
|||
[array] [default: []]
|
||||
--logErrorsToRedis If set, write error messages to redi
|
||||
s [boolean] [default: false]
|
||||
--logBehaviorsToRedis If set, write behavior script messag
|
||||
es to redis
|
||||
[boolean] [default: false]
|
||||
--writePagesToRedis If set, write page objects to redis
|
||||
[boolean] [default: false]
|
||||
--maxPageRetries, --retries If set, number of times to retry a p
|
||||
|
|
|
@ -74,7 +74,7 @@ Only key-based authentication is supported for SSH proxies for now.
|
|||
|
||||
## Browser Profiles
|
||||
|
||||
The above proxy settings also apply to [Browser Profile Creation](../browser-profiles), and browser profiles can also be created using proxies, for example:
|
||||
The above proxy settings also apply to [Browser Profile Creation](browser-profiles.md), and browser profiles can also be created using proxies, for example:
|
||||
|
||||
```sh
|
||||
docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts
|
||||
|
|
|
@ -63,7 +63,7 @@ nav:
|
|||
|
||||
markdown_extensions:
|
||||
- toc:
|
||||
toc_depth: 3
|
||||
toc_depth: 4
|
||||
permalink: true
|
||||
- pymdownx.highlight:
|
||||
anchor_linenums: true
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
{
|
||||
"name": "browsertrix-crawler",
|
||||
"version": "1.6.0-beta.1",
|
||||
"version": "1.6.0",
|
||||
"main": "browsertrix-crawler",
|
||||
"type": "module",
|
||||
"repository": "https://github.com/webrecorder/browsertrix-crawler",
|
||||
|
|