mirror of
https://github.com/webrecorder/browsertrix-crawler.git
synced 2025-10-19 06:23:16 +00:00
Support custom css selectors for extracting links (#689)
Support array of selectors via --selectLinks property in the form [css selector]->[property] or [css selector]->@[attribute].
parent
2a9b152531
commit
d04509639a
11 changed files with 194 additions and 109 deletions
@@ -50,6 +50,11 @@ Options:
 e-page-application crawling or when
 different hashtags load dynamic cont
 ent
+--selectLinks one or more selectors for extracting
+links, in the format [css selector]
+->[property to use],[css selector]->
+@[attribute to use]
+[array] [default: ["a[href]->href"]]
 --blockRules Additional rules for blocking certai
 n URLs from being loaded, by URL reg
 ex and optionally via text match in
@@ -70,8 +75,7 @@ Options:
 [string] [default: "crawl-@ts"]
 --headless Run in headless mode, otherwise star
 t xvfb [boolean] [default: false]
---driver JS driver for the crawler
-[string] [default: "./defaultDriver.js"]
+--driver JS driver for the crawler [string]
 --generateCDX, --generatecdx, --gene If set, generate index (CDXJ) for us
 rateCdx e with pywb after crawl is done
 [boolean] [default: false]
@@ -248,8 +252,8 @@ Options:
 [boolean] [default: false]
 --customBehaviors Custom behavior files to inject. Val
 ues can be URLs, paths to individual
-behavior files, or paths to a direct
-ory of behavior files.
+behavior files, or paths to a direc
+tory of behavior files
 [array] [default: []]
 --debugAccessRedis if set, runs internal redis without
 protected mode to allow external acc
@@ -289,14 +293,14 @@ Options:
 --version Show version number [boolean]
 --url The URL of the login page [string] [required]
 --user The username for the login. If not specified, will b
-e prompted
+e prompted [string]
 --password The password for the login. If not specified, will b
-e prompted (recommended)
+e prompted (recommended) [string]
 --filename The filename for the profile tarball, stored within
 /crawls/profiles if absolute path not provided
-[default: "/crawls/profiles/profile.tar.gz"]
+[string] [default: "/crawls/profiles/profile.tar.gz"]
 --debugScreenshot If specified, take a screenshot after login and save
-as this filename
+as this filename [boolean] [default: false]
 --headless Run in headless mode, otherwise start xvfb
 [boolean] [default: false]
 --automated Start in automated mode, no interactive browser
@@ -17,6 +17,16 @@ can be used to specify additional seconds to wait after the page appears to have
 
 (On the other hand, the `--pageExtraDelay`/`--delay` adds an extra delay after all post-load actions have taken place, and can be useful for rate-limiting.)
 
+## Link Extraction
+
+By default, the crawler will extract all `href` properties from all `<a>` tags that have an `href`.
+This can be customized with the `--selectLinks` option, which can provide alternative selectors of the form:
+`[css selector]->[property to use]` or `[css selector]->@[attribute to use]`. The default value is `a[href]->href`.
+
+For example, to specify the default, but also include all `div` elements that have class `mylink` and use the `custom-href` attribute as the link, use `--selectLinks 'a[href]->href' --selectLinks 'div.mylink->@custom-href'`.
+
+Any number of selectors can be specified in this way, and each will be applied in sequence on each page.
+
 ## Ad Blocking
 
 Brave Browser, the browser used by Browsertrix Crawler for crawling, has some ad and tracker blocking features enabled by default. These [Shields](https://brave.com/shields/) can be disabled or customized using [Browser Profiles](browser-profiles.md).
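The `--selectLinks` format described in the Link Extraction docs above, `[css selector]->[property to use]` or `[css selector]->@[attribute to use]`, could be parsed roughly as follows. This is only a minimal sketch and not the code from this commit: the `LinkSelector` type, the `parseSelectLink` function, and the choice to reject malformed values with an error are all assumptions made for illustration.

```typescript
// Hypothetical shape for one parsed --selectLinks entry; the names here are
// illustrative and not taken from the crawler's source.
interface LinkSelector {
  selector: string; // CSS selector used to find candidate elements
  extract: string; // name of the property or attribute to read
  isAttribute: boolean; // true when the value used the "@" prefix
}

// Split "[css selector]->[property]" or "[css selector]->@[attribute]".
function parseSelectLink(spec: string): LinkSelector {
  // Split on the last "->" so the selector part is kept whole.
  const idx = spec.lastIndexOf("->");
  if (idx <= 0) {
    // Assumption: a value with no selector or no "->" is treated as an error.
    throw new Error(`invalid link selector: ${spec}`);
  }
  const selector = spec.slice(0, idx).trim();
  const target = spec.slice(idx + 2).trim();
  const isAttribute = target.startsWith("@");
  const extract = isAttribute ? target.slice(1) : target;
  if (!selector || !extract) {
    throw new Error(`invalid link selector: ${spec}`);
  }
  return { selector, extract, isAttribute };
}

// The two selectors from the documentation example.
const parsed = ["a[href]->href", "div.mylink->@custom-href"].map(parseSelectLink);
console.log(parsed);
```

The property/attribute distinction matters in the DOM: reading the `href` property of an `<a>` element yields the resolved absolute URL, while `getAttribute()` returns the raw attribute string, which is the only way to read a non-standard attribute such as `custom-href`.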