Add documentation for --failOnContentCheck and update CLI options in docs (#869)

Related to #860 

This will give us something we can link to from Browsertrix/the
Browsertrix User Guide for up-to-date information on this option.
Tessa Walsh 2025-07-23 15:54:12 -04:00 committed by GitHub
parent 1a4341bfbc
commit 66402c2e53
GPG key ID: B5690EEEBB952194
2 changed files with 31 additions and 0 deletions


@@ -268,3 +268,26 @@ Some of these functions which may be of use to behaviors authors are:
- `getState`: increment a state counter and return all state counters + string message
More detailed references will be added in the future.
## Fail On Content Check
In Browsertrix Crawler 1.7.0 and higher, the `--failOnContentCheck` option causes a crawl to fail if a behavior detects the presence or absence of certain content on a page in its `awaitPageLoad()` callback. By default, this is used to fail a crawl if site-specific behaviors determine that the user is not logged in on the following sites:
- Facebook
- Instagram
- TikTok
- X
It is also used to fail crawls with YouTube videos if one of the videos is found not to play.
It is possible to add content checks to custom behaviors. To do so, include an `awaitPageLoad` method on the behavior and use the `ctx.Lib` function `assertContentValid` to check for content and fail the behavior with a specified reason if it is not found.
For an example, see the following `awaitPageLoad` method from the site-specific behavior for X:
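```typescript
async awaitPageLoad(ctx: any) {
  const { sleep, assertContentValid } = ctx.Lib;
  // Pause briefly before inspecting the page
  await sleep(5);
  // Fail with reason "not_logged_in" if the page contains a "Log In" prompt
  assertContentValid(() => !document.documentElement.outerHTML.match(/Log In/i), "not_logged_in");
}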
```
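As a rough sketch of how a custom behavior might use the same mechanism, the class below adds a content check for a made-up members-only site. The class name, `id`, URL pattern, "Sign in to continue" text, and reason string are all hypothetical, and the `static id` / `isMatch` / `init` / `run` layout is assumed to match the general shape of custom behaviors described elsewhere in the docs; only `awaitPageLoad`, `ctx.Lib`, and `assertContentValid(check, reason)` come from the text above.

```javascript
// Hypothetical custom behavior: names, URL, and reason string are illustrative only.
class ExampleLoginCheckBehavior {
  static id = "Example Login Check";

  // Only run on the (made-up) site this behavior targets
  static isMatch() {
    return /^https:\/\/members\.example\.com\//.test(window.location.href);
  }

  static init() {
    return {};
  }

  // Content check: passes only if no sign-in prompt is visible on the page
  async awaitPageLoad(ctx) {
    const { assertContentValid } = ctx.Lib;
    assertContentValid(
      () => !document.body.innerText.match(/Sign in to continue/i),
      "not_logged_in",
    );
  }

  // Page interaction (scrolling, clicking, etc.) would go here
  async *run(ctx) {}
}
```

Per the option description, a failed check like this is only expected to fail the whole crawl when the crawler is started with `--failOnContentCheck`.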


@@ -261,6 +261,10 @@ Options:
ailOnFailedSeed may result in crawl
failing due to non-200 responses
[boolean] [default: false]
--failOnContentCheck If set, allows for behaviors to fail
a crawl with custom reason based on
content (e.g. logged out)
[boolean] [default: false]
--customBehaviors Custom behavior files to inject. Val
id values: URL to file, path to file
, path to directory of behaviors, UR
@@ -272,6 +276,10 @@ Options:
git+https://git.example.com/repo.git
?branch=dev&path=some/dir"
[array] [default: []]
--saveStorage if set, will store the localStorage/
sessionStorage data for each page as
part of WARC-JSON-Metadata field
[boolean]
--debugAccessRedis if set, runs internal redis without
protected mode to allow external acc
ess (for debugging) [boolean]