Support custom css selectors for extracting links (#689)

Support an array of selectors via the --selectLinks option, in the
form [css selector]->[property] or [css selector]->@[attribute].
Author: Ilya Kreymer, 2024-11-08 08:04:41 -08:00 (committed by GitHub)
parent 2a9b152531
commit d04509639a
GPG key ID: B5690EEEBB952194
11 changed files with 194 additions and 109 deletions


@@ -50,6 +50,11 @@ Options:
 e-page-application crawling or when
 different hashtags load dynamic cont
 ent
+--selectLinks one or more selectors for extracting
+links, in the format [css selector]
+->[property to use],[css selector]->
+@[attribute to use]
+[array] [default: ["a[href]->href"]]
 --blockRules Additional rules for blocking certai
 n URLs from being loaded, by URL reg
 ex and optionally via text match in
@@ -70,8 +75,7 @@ Options:
 [string] [default: "crawl-@ts"]
 --headless Run in headless mode, otherwise star
 t xvfb [boolean] [default: false]
---driver JS driver for the crawler
-[string] [default: "./defaultDriver.js"]
+--driver JS driver for the crawler [string]
 --generateCDX, --generatecdx, --gene If set, generate index (CDXJ) for us
 rateCdx e with pywb after crawl is done
 [boolean] [default: false]
@@ -248,8 +252,8 @@ Options:
 [boolean] [default: false]
 --customBehaviors Custom behavior files to inject. Val
 ues can be URLs, paths to individual
-behavior files, or paths to a direct
-ory of behavior files.
+behavior files, or paths to a direc
+tory of behavior files
 [array] [default: []]
 --debugAccessRedis if set, runs internal redis without
 protected mode to allow external acc
@@ -289,14 +293,14 @@ Options:
 --version Show version number [boolean]
 --url The URL of the login page [string] [required]
 --user The username for the login. If not specified, will b
-e prompted
+e prompted [string]
 --password The password for the login. If not specified, will b
-e prompted (recommended)
+e prompted (recommended) [string]
 --filename The filename for the profile tarball, stored within
 /crawls/profiles if absolute path not provided
-[default: "/crawls/profiles/profile.tar.gz"]
+[string] [default: "/crawls/profiles/profile.tar.gz"]
 --debugScreenshot If specified, take a screenshot after login and save
-as this filename
+as this filename [boolean] [default: false]
 --headless Run in headless mode, otherwise start xvfb
 [boolean] [default: false]
 --automated Start in automated mode, no interactive browser


@@ -17,6 +17,16 @@ can be used to specify additional seconds to wait after the page appears to have
 (On the other hand, the `--pageExtraDelay`/`--delay` option adds an extra delay after all post-load actions have taken place, and can be useful for rate-limiting.)
 
+## Link Extraction
+
+By default, the crawler will extract all `href` properties from all `<a>` tags that have an `href`.
+This can be customized with the `--selectLinks` option, which can provide alternative selectors of the form:
+`[css selector]->[property to use]` or `[css selector]->@[attribute to use]`. The default value is `a[href]->href`.
+
+For example, to specify the default, but also include all `div`s that have class `mylink` and use the `custom-href` attribute as the link, use `--selectLinks 'a[href]->href' --selectLinks 'div.mylink->@custom-href'`.
+
+Any number of selectors can be specified in this way, and each will be applied in sequence on each page.
+
 ## Ad Blocking
 
 Brave Browser, the browser used by Browsertrix Crawler for crawling, has some ad and tracker blocking features enabled by default. These [Shields](https://brave.com/shields/) can be disabled or customized using [Browser Profiles](browser-profiles.md).
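For illustration, the `[css selector]->[property]` / `[css selector]->@[attribute]` format documented above can be sketched in a few lines of JavaScript. This is a hedged sketch of the format's semantics, not the crawler's actual implementation; the names `parseLinkSelector` and `extractLinks` are hypothetical.

```javascript
// Sketch only: interprets a --selectLinks spec such as "a[href]->href" or
// "div.mylink->@custom-href" (not the crawler's real code).
// Everything before the last "->" is the CSS selector; after it, a leading "@"
// means "read this attribute via getAttribute", otherwise read the DOM property.
function parseLinkSelector(spec) {
  const idx = spec.lastIndexOf("->");
  if (idx === -1) {
    // No "->" given: fall back to the href property, matching the default shape.
    return { selector: spec, extract: "href", isAttribute: false };
  }
  const target = spec.slice(idx + 2);
  return target.startsWith("@")
    ? { selector: spec.slice(0, idx), extract: target.slice(1), isAttribute: true }
    : { selector: spec.slice(0, idx), extract: target, isAttribute: false };
}

// Apply one parsed spec to a DOM-like root (e.g. `document` in page context),
// returning the extracted link values with empty matches dropped.
function extractLinks(root, parsed) {
  return Array.from(root.querySelectorAll(parsed.selector))
    .map((el) =>
      parsed.isAttribute ? el.getAttribute(parsed.extract) : el[parsed.extract],
    )
    .filter((v) => v != null);
}
```

Per the docs above, each spec is applied in sequence on each page, so passing multiple `--selectLinks` values simply means running `extractLinks` once per parsed spec and collecting the results.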