mirror of
https://github.com/webrecorder/browsertrix-crawler.git
synced 2025-10-19 14:33:17 +00:00
Add documentation for --failOnContentCheck and update CLI options in docs (#869)
Related to #860. This gives us something to link to from Browsertrix and the Browsertrix User Guide for up-to-date information on this option.
parent 1a4341bfbc
commit 66402c2e53
2 changed files with 31 additions and 0 deletions
@@ -268,3 +268,26 @@ Some of these functions which may be of use to behaviors authors are:

- `getState`: increment a state counter and return all state counters + string message

More detailed references will be added in the future.

## Fail On Content Check

In Browsertrix Crawler 1.7.0 and higher, the `--failOnContentCheck` option causes a crawl to fail if a behavior detects the presence or absence of certain content on a page in its `awaitPageLoad()` callback. By default, site-specific behaviors use this to fail a crawl when they determine that the user is not logged in on the following sites:

- Facebook
- Instagram
- TikTok
- X

It is also used to fail crawls that include YouTube videos if one of the videos is found not to play.

Custom behaviors can add their own content checks. To do so, include an `awaitPageLoad` method on the behavior and use the `ctx.Lib` function `assertContentValid` to check for content and fail the behavior with a specified reason if the check does not pass.

For example, see the following `awaitPageLoad` method from the site-specific behavior for X:

```javascript
async awaitPageLoad(ctx: any) {
  const { sleep, assertContentValid } = ctx.Lib;
  await sleep(5);
  assertContentValid(() => !document.documentElement.outerHTML.match(/Log In/i), "not_logged_in");
}
```
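
The same pattern might be carried over to a custom behavior roughly as follows. This is a minimal sketch in plain JavaScript (without the `ctx: any` type annotation seen in the X excerpt), not a definitive implementation. The class name `LoginCheckBehavior`, the helper `looksLoggedOut`, the hostname `my-site.example.com`, and the matching pattern are all hypothetical; only `sleep` and `assertContentValid` are taken from the excerpt above, and they are called the same way it calls them.

```javascript
// Hypothetical helper: returns true when the markup appears to contain a
// login prompt. The pattern is a crude assumption for illustration only and
// should be adapted to the site being crawled.
function looksLoggedOut(html) {
  return /log\s*in/i.test(html);
}

// Hypothetical custom behavior; the name, id, and hostname are placeholders.
class LoginCheckBehavior {
  // Identifier for this behavior.
  static id = "Login Check (example)";

  // Only run on the (assumed) site that requires a login.
  static isMatch() {
    return window.location.hostname === "my-site.example.com";
  }

  // Content check, mirroring the structure of the X excerpt above.
  async awaitPageLoad(ctx) {
    const { sleep, assertContentValid } = ctx.Lib;
    // Same call as in the X excerpt, made before inspecting the page.
    await sleep(5);
    // Fail with the reason "not_logged_in" if a login prompt is present.
    assertContentValid(
      () => !looksLoggedOut(document.documentElement.outerHTML),
      "not_logged_in",
    );
  }

  // This sketch performs no interactive steps, so the iterator is empty.
  async* run(ctx) {}
}
```

Keeping the pattern match in a small standalone function makes it possible to try the check against sample markup outside the browser before loading the behavior with `--customBehaviors` and enabling `--failOnContentCheck`.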
@@ -261,6 +261,10 @@ Options:
                                            ailOnFailedSeed may result in crawl
                                            failing due to non-200 responses
                                                      [boolean] [default: false]
      --failOnContentCheck                  If set, allows for behaviors to fail
                                            a crawl with custom reason based on
                                            content (e.g. logged out)
                                                      [boolean] [default: false]
      --customBehaviors                     Custom behavior files to inject. Val
                                            id values: URL to file, path to file
                                            , path to directory of behaviors, UR
@@ -272,6 +276,10 @@ Options:
                                            git+https://git.example.com/repo.git
                                            ?branch=dev&path=some/dir"
                                                           [array] [default: []]
      --saveStorage                         if set, will store the localStorage/
                                            sessionStorage data for each page as
                                            part of WARC-JSON-Metadata field
                                                                       [boolean]
      --debugAccessRedis                    if set, runs internal redis without
                                            protected mode to allow external acc
                                            ess (for debugging)        [boolean]