Support custom css selectors for extracting links (#689)

Support an array of selectors via the --selectLinks option, in the
form [css selector]->[property] or [css selector]->@[attribute].
Author: Ilya Kreymer, 2024-11-08 08:04:41 -08:00 (committed by GitHub)
parent 2a9b152531
commit d04509639a
GPG key ID: B5690EEEBB952194
11 changed files with 194 additions and 109 deletions


@@ -50,6 +50,11 @@ Options:
 e-page-application crawling or when
 different hashtags load dynamic cont
 ent
+--selectLinks one or more selectors for extracting
+links, in the format [css selector]
+->[property to use],[css selector]->
+@[attribute to use]
+[array] [default: ["a[href]->href"]]
 --blockRules Additional rules for blocking certai
 n URLs from being loaded, by URL reg
 ex and optionally via text match in
@@ -70,8 +75,7 @@ Options:
 [string] [default: "crawl-@ts"]
 --headless Run in headless mode, otherwise star
 t xvfb [boolean] [default: false]
---driver JS driver for the crawler
-[string] [default: "./defaultDriver.js"]
+--driver JS driver for the crawler [string]
 --generateCDX, --generatecdx, --gene If set, generate index (CDXJ) for us
 rateCdx e with pywb after crawl is done
 [boolean] [default: false]
@@ -248,8 +252,8 @@ Options:
 [boolean] [default: false]
 --customBehaviors Custom behavior files to inject. Val
 ues can be URLs, paths to individual
-behavior files, or paths to a direct
-ory of behavior files.
+behavior files, or paths to a direc
+tory of behavior files
 [array] [default: []]
 --debugAccessRedis if set, runs internal redis without
 protected mode to allow external acc
@@ -289,14 +293,14 @@ Options:
 --version Show version number [boolean]
 --url The URL of the login page [string] [required]
 --user The username for the login. If not specified, will b
-e prompted
+e prompted [string]
 --password The password for the login. If not specified, will b
-e prompted (recommended)
+e prompted (recommended) [string]
 --filename The filename for the profile tarball, stored within
 /crawls/profiles if absolute path not provided
-[default: "/crawls/profiles/profile.tar.gz"]
+[string] [default: "/crawls/profiles/profile.tar.gz"]
 --debugScreenshot If specified, take a screenshot after login and save
-as this filename
+as this filename [boolean] [default: false]
 --headless Run in headless mode, otherwise start xvfb
 [boolean] [default: false]
 --automated Start in automated mode, no interactive browser


@@ -17,6 +17,16 @@ can be used to specify additional seconds to wait after the page appears to have
 (On the other hand, the `--pageExtraDelay`/`--delay` option adds an extra delay after all post-load actions have taken place, and can be useful for rate-limiting.)
 
+## Link Extraction
+
+By default, the crawler will extract all `href` properties from all `<a>` tags that have an `href`.
+This can be customized with the `--selectLinks` option, which can provide alternative selectors of the form:
+`[css selector]->[property to use]` or `[css selector]->@[attribute to use]`. The default value is `a[href]->href`.
+
+For example, to specify the default, but also include all `div`s that have class `mylink` and use the `custom-href` attribute as the link, use `--selectLinks 'a[href]->href' --selectLinks 'div.mylink->@custom-href'`.
+
+Any number of selectors can be specified in this way, and each will be applied in sequence on each page.
+
 ## Ad Blocking
 
 Brave Browser, the browser used by Browsertrix Crawler for crawling, has some ad and tracker blocking features enabled by default. These [Shields](https://brave.com/shields/) can be disabled or customized using [Browser Profiles](browser-profiles.md).
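For illustration, the `[css selector]->[property]` / `[css selector]->@[attribute]` format documented above can be sketched in a few lines of JavaScript. This is a hedged sketch of the format's semantics, not the crawler's actual implementation; the names `parseLinkSelector` and `extractLinks` are hypothetical.

```javascript
// Sketch only: interprets a --selectLinks spec such as "a[href]->href" or
// "div.mylink->@custom-href" (not the crawler's real code).
// Everything before the last "->" is the CSS selector; after it, a leading "@"
// means "read this attribute via getAttribute", otherwise read the DOM property.
function parseLinkSelector(spec) {
  const idx = spec.lastIndexOf("->");
  if (idx === -1) {
    // No "->" given: fall back to the href property, matching the default shape.
    return { selector: spec, extract: "href", isAttribute: false };
  }
  const target = spec.slice(idx + 2);
  return target.startsWith("@")
    ? { selector: spec.slice(0, idx), extract: target.slice(1), isAttribute: true }
    : { selector: spec.slice(0, idx), extract: target, isAttribute: false };
}

// Apply one parsed spec to a DOM-like root (e.g. `document` in page context),
// returning the extracted link values with empty matches dropped.
function extractLinks(root, parsed) {
  return Array.from(root.querySelectorAll(parsed.selector))
    .map((el) =>
      parsed.isAttribute ? el.getAttribute(parsed.extract) : el[parsed.extract],
    )
    .filter((v) => v != null);
}
```

Per the docs above, each spec is applied in sequence on each page, so passing multiple `--selectLinks` values simply means running `extractLinks` once per parsed spec and collecting the results.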