mirror of https://github.com/webrecorder/browsertrix-crawler.git, synced 2025-12-08 06:09:48 +00:00
{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"],"fields":{"title":{"boost":1000.0},"text":{"boost":1.0},"tags":{"boost":1000000.0}}},"docs":[{"location":"","title":"Home","text":"<p>Welcome to the Browsertrix Crawler official documentation.</p> <p>Browsertrix Crawler is a simplified browser-based high-fidelity crawling system, designed to run a complex, customizable browser-based crawl in a single Docker container. Browsertrix Crawler uses Puppeteer to control one or more Brave Browser browser windows in parallel. Data is captured through the Chrome Devtools Protocol (CDP) in the browser.</p> <p>Browsertrix Crawler is a command line application responsible for the core features of Browsertrix, Webrecorder's cloud-based web archiving service. See the Browsertrix documentation for more information about Browsertrix, the cloud platform.</p> <p>Note</p> <p>This documentation applies to Browsertrix Crawler versions 1.0.0 and above. Documentation for earlier versions of the crawler is available in the Browsertrix Crawler Github repository's README file in older commits.</p>"},{"location":"#features","title":"Features","text":"<ul> <li>Single-container, browser based crawling with a headless/headful browser running pages in multiple windows.</li> <li>Support for custom browser behaviors, using Browsertrix Behaviors including autoscroll, video autoplay, and site-specific behaviors.</li> <li>YAML-based configuration, passed via file or via stdin.</li> <li>Seed lists and per-seed scoping rules.</li> <li>URL blocking rules to block capture of specific URLs (including by iframe URL and/or by iframe contents).</li> <li>Screencasting: Ability to watch crawling in real-time.</li> <li>Screenshotting: Ability to take thumbnails, full page screenshots, and/or screenshots of the initial page view.</li> <li>Optimized (non-browser) capture of non-HTML resources.</li> <li>Extensible Puppeteer driver script for customizing behavior per crawl or page.</li> <li>Ability to create and reuse browser profiles interactively or via automated user/password login using an embedded browser.</li> <li>Multi-platform support \u2014 prebuilt Docker images available for Intel/AMD and Apple Silicon (M1/M2) CPUs.</li> <li>Quality Assurance (QA) crawling \u2014 analyze the replay of existing crawls (via WACZ) and produce stats comparing what the browser encountered on a website during crawling against the replay of the crawl WACZ.</li> </ul>"},{"location":"#documentation","title":"Documentation","text":"<p>If something is missing, unclear, or seems incorrect, please open an issue and we'll try to make sure that your questions get answered here in the future!</p>"},{"location":"#code","title":"Code","text":"<p>Browsertrix Crawler is free and open source software, with all code available in the main repository on Github.</p>"},{"location":"develop/","title":"Development","text":""},{"location":"develop/#usage-with-docker-compose","title":"Usage with Docker Compose","text":"<p>Many examples in User Guide demonstrate running Browsertrix Crawler with <code>docker run</code>.</p> <p>Docker Compose is recommended for building the image and for simple configurations. 
A simple Docker Compose configuration file is included in the Git repository.</p> <p>To build the latest image, run:</p> <pre><code>docker-compose build\n</code></pre> <p>Docker Compose also simplifies some config options, such as mounting the volume for the crawls.</p> <p>The following command starts a crawl with 2 workers and generates the CDX:</p> <pre><code>docker-compose run crawler crawl --url https://webrecorder.net/ --generateCDX --collection wr-net --workers 2\n</code></pre> <p>In this example, the crawl data is written to <code>./crawls/collections/wr-net</code> by default.</p> <p>While the crawl is running, the status of the crawl prints the progress to the JSON-L log output. This can be disabled by using the <code>--logging</code> option and not including <code>stats</code>.</p>"},{"location":"develop/#multi-platform-build-support-for-apple-silicon","title":"Multi-Platform Build / Support for Apple Silicon","text":"<p>Browsertrix Crawler uses a browser image which supports amd64 and arm64.</p> <p>This means Browsertrix Crawler can be built natively on Apple Silicon systems using the default settings. Running <code>docker-compose build</code> on an Apple Silicon should build a native version that should work for development.</p>"},{"location":"develop/#modifying-browser-image","title":"Modifying Browser Image","text":"<p>It is also possible to build Browsertrix Crawler with a different browser image. Currently, browser images using Brave Browser and Chrome/Chromium (depending on host system chip architecture) are supported via browsertrix-browser-base, however, only Brave Browser receives regular version updates from us.</p> <p>The browser base image used is specified and can be changed at the top of the Dockerfile in the Browsertrix Crawler repo.</p> <p>Custom browser images can be used by forking browsertrix-browser-base, locally building or publishing an image, and then modifying the Dockerfile in this repo to build from that image.</p>"},{"location":"develop/docs/","title":"Documentation","text":"<p>This documentation is built with the Mkdocs static site generator.</p>"},{"location":"develop/docs/#docs-setup","title":"Docs Setup","text":"<p>Python is required to build the docs, then run:</p> <pre><code>pip install mkdocs-material\n</code></pre>"},{"location":"develop/docs/#docs-server","title":"Docs Server","text":"<p>To start the docs server, simply run:</p> <pre><code>mkdocs serve\n</code></pre> <p>The documentation will then be available on <code>http://localhost:8000/</code></p> <p>The command-line options are rebuilt using the <code>docs/gen-cli.sh</code> script.</p> <p>Refer to the Mkdocs and Material for MkDocs pages for more info about the documentation.</p>"},{"location":"user-guide/","title":"Browsertrix Crawler User Guide","text":"<p>Welcome to the Browsertrix Crawler User Guide. This page covers the basics of using Browsertrix Crawler, Webrecorder's browser-based high-fidelity crawling system, designed to run a complex, customizable, browser-based crawl in a single Docker container.</p>"},{"location":"user-guide/#getting-started","title":"Getting Started","text":"<p>Browsertrix Crawler requires Docker to be installed on the machine running the crawl.</p> <p>Assuming Docker is installed, you can run a crawl and test your archive with the following steps.</p> <p>You don't even need to clone the Browsertrix Crawler repo, just choose a directory where you'd like the crawl data to be placed, and then run the following commands. 
Replace <code>[URL]</code> with the website you'd like to crawl.</p> <ol> <li>Run <code>docker pull webrecorder/browsertrix-crawler</code></li> <li><code>docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url [URL] --generateWACZ --text --collection test</code></li> <li>The crawl will now run and logs in JSON Lines format will be output to the console. Depending on the size of the site, this may take a bit!</li> <li>Once the crawl is finished, a WACZ file will be created in <code>crawls/collections/test/test.wacz</code> from the directory you ran the crawl!</li> <li>You can go to ReplayWeb.page and open the generated WACZ file and browse your newly crawled archive!</li> </ol>"},{"location":"user-guide/#getting-started-with-command-line-options","title":"Getting Started with Command-Line Options","text":"<p>Here's how you can use some of the more common command-line options to configure the crawl:</p> <ul> <li> <p>To include automated text extraction for full text search to pages.jsonl, add the <code>--text</code> flag. To write extracted text to WARCs instead of or in addition to pages.jsonl, see Text Extraction.</p> </li> <li> <p>To limit the crawl to a maximum number of pages, add <code>--limit P</code> where P is the number of pages that will be crawled.</p> </li> <li> <p>To limit the crawl to a maximum size, set <code>--sizeLimit</code> (size in bytes).</p> </li> <li> <p>To limit the crawl time, set <code>--timeLimit</code> (in seconds).</p> </li> <li> <p>To run more than one browser worker and crawl in parallel, and <code>--workers N</code> where N is number of browsers to run in parallel. More browsers will require more CPU and network bandwidth, and does not guarantee faster crawling.</p> </li> <li> <p>To crawl into a new directory, specify a different name for the <code>--collection</code> param. If omitted, a new collection directory based on current time will be created. Adding the <code>--overwrite</code> flag will delete the collection directory at the start of the crawl, if it exists.</p> </li> </ul> <p>Browsertrix Crawler includes a number of additional command-line options, explained in detail throughout this User Guide.</p>"},{"location":"user-guide/#published-releases-production-use","title":"Published Releases / Production Use","text":"<p>When using Browsertrix Crawler in production, it is recommended to use a specific, published version of the image, eg. <code>webrecorder/browsertrix-crawler:[VERSION]</code> instead of <code>webrecorder/browsertrix-crawler</code> where <code>[VERSION]</code> corresponds to one of the published release tag.</p> <p>All released Docker Images are available from Docker Hub, listed by release tag here.</p> <p>Details for each corresponding release tag are also available on GitHub under Releases.</p>"},{"location":"user-guide/behaviors/","title":"Browser Behaviors","text":"<p>Browsertrix Crawler supports automatically running customized behaviors on each page. Several types of behaviors are supported, including built-in, background, and site-specific behaviors. 
It is also possible to add fully user-defined custom behaviors that can be added to trigger specific actions on certain pages.</p>"},{"location":"user-guide/behaviors/#built-in-behaviors","title":"Built-In Behaviors","text":"<p>The built-in behaviors include the following background behaviors which run 'in the background' continually checking for changes:</p> <ul> <li>Autoplay: find and start playing (when possible) any video or audio on the page (and in each iframe).</li> <li>Autofetch: find and start fetching any URLs that may not be fetched by default, such as other resolutions in <code>img</code> tags, <code>data-*</code>, lazy-loaded resources, etc.</li> <li>Autoclick: select all tags (default: <code>a</code> tag, customizable via <code>--clickSelector</code>) that may be clickable and attempt to click them while avoiding navigation away from the page.</li> </ul> <p>There is also a built-in 'main' behavior, which runs to completion (or until a timeout is reached):</p> <ul> <li>Autoscroll: Determine if a page might need scrolling, and scroll either up or down while new elements are being added. Continue until timeout is reached or scrolling is no longer possible.</li> </ul>"},{"location":"user-guide/behaviors/#site-specific-behaviors","title":"Site-Specific Behaviors","text":"<p>Browsertrix also comes with several 'site-specific' behaviors, which run only on specific sites. These behaviors will run instead of Autoscroll and will run until completion or timeout. Currently, site-specific behaviors include major social media sites.</p> <p>Refer to Browsertrix Behaviors for the latest list of site-specific behaviors.</p> <p>User-defined custom behaviors are also considered site-specific.</p>"},{"location":"user-guide/behaviors/#enabling-behaviors","title":"Enabling Behaviors","text":"<p>To enable built-in behaviors, specify them via a comma-separated list passed to the <code>--behaviors</code> option. All behaviors except Autoclick are enabled by default, the equivalent of <code>--behaviors autoscroll,autoplay,autofetch,siteSpecific</code>. To enable only a single behavior, such as Autoscroll, use <code>--behaviors autoscroll</code>.</p> <p>To only use Autoclick but not Autoscroll, use <code>--behaviors autoclick,autoplay,autofetch,siteSpecific</code>.</p> <p>The <code>--siteSpecific</code> flag enables all site-specific behaviors to be enabled, but only one behavior can be run per site. Each site-specific behavior specifies which site it should run on.</p> <p>To disable all behaviors, use <code>--behaviors \"\"</code>.</p>"},{"location":"user-guide/behaviors/#behavior-and-page-timeouts","title":"Behavior and Page Timeouts","text":"<p>Browsertrix includes a number of timeouts, including before, during and after running behaviors.</p> <p>The timeouts are as follows:</p> <ul> <li><code>--pageLoadTimeout</code>: how long to wait for page to finish loading, before doing anything else.</li> <li><code>--postLoadDelay</code>: how long to wait before starting any behaviors, but after page has finished loading. 
A custom behavior can override this (see below).</li> <li><code>--behaviorTimeout</code>: maximum time to spend on running site-specific / Autoscroll behaviors (can be less if behavior finishes early).</li> <li><code>--pageExtraDelay</code>: how long to wait after finishing behaviors (or after <code>behaviorTimeout</code> has been reached) before moving on to next page.</li> </ul> <p>A site-specific behavior (or Autoscroll) will start after the page is loaded (at most after <code>--pageLoadTimeout</code> seconds) and exactly after <code>--postLoadDelay</code> seconds.</p> <p>The behavior will then run until finished or at most until <code>--behaviorTimeout</code> is reached (90 seconds by default).</p>"},{"location":"user-guide/behaviors/#loading-custom-behaviors","title":"Loading Custom Behaviors","text":"<p>Browsertrix Crawler also supports fully user-defined behaviors, which have all the capabilities of the built-in behaviors.</p> <p>They can use a library of provided functions, and run on one or more pages in the crawl.</p> <p>Custom behaviors are specified with the <code>--customBehaviors</code> flag, which can be repeated and can accept the following options.</p> <ul> <li>A path to a single behavior file. This can be mounted into the crawler as a volume.</li> <li>A path to a directory of behavior files. This can be mounted into the crawler as a volume.</li> <li>A URL for a single behavior file to download. This should be a URL that the crawler has access to.</li> <li>A URL for a git repository of the form <code>git+https://git.example.com/repo.git</code>, with optional query parameters <code>branch</code> (to specify a particular branch to use) and <code>path</code> (to specify a relative path to a directory within the git repository where the custom behaviors are located). 
This should be a git repo the crawler has access to without additional auth.</li> </ul>"},{"location":"user-guide/behaviors/#examples","title":"Examples","text":""},{"location":"user-guide/behaviors/#local-filepath-directory","title":"Local filepath (directory)","text":"<pre><code>docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/custom-behaviors/:/custom-behaviors/ webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net --customBehaviors /custom-behaviors/\n</code></pre>"},{"location":"user-guide/behaviors/#local-filepath-file","title":"Local filepath (file)","text":"<pre><code>docker run -v $PWD/test-crawls:/crawls -v $PWD/tests/custom-behaviors/:/custom-behaviors/ webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net --customBehaviors /custom-behaviors/custom.js\n</code></pre>"},{"location":"user-guide/behaviors/#url","title":"URL","text":"<pre><code>docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://specs.webrecorder.net --customBehaviors https://example.com/custom-behavior-1 --customBehaviors https://example.org/custom-behavior-2 \n</code></pre>"},{"location":"user-guide/behaviors/#git-repository","title":"Git repository","text":"<pre><code>docker run -v $PWD/test-crawls:/crawls webrecorder/browsertrix-crawler crawl --url https://example.com/ --customBehaviors \"git+https://git.example.com/custom-behaviors?branch=dev&path=path/to/behaviors\"\n</code></pre>"},{"location":"user-guide/behaviors/#creating-custom-behaviors","title":"Creating Custom Behaviors","text":"<p>A custom behavior file can be in one of the following supported formats: - JSON User Flow - JavaScript / Typescript (compiled to JavaScript)</p>"},{"location":"user-guide/behaviors/#json-flow-behaviors","title":"JSON Flow Behaviors","text":"<p>Browsertrix Crawler 1.6 and up supports replaying the JSON User Flow format generated by DevTools Recorder, which is built-in to Chrome devtools.</p> <p>This format can be generated by using the DevTools Recorder to create a series of steps, which are serialized to JSON.</p> <p>The format represents a series of steps that should happen on a particular page.</p> <p>The recorder is capable of picking the right selectors interactively and supports events such as <code>click</code>, <code>change</code>, <code>waitForElement</code> and more. See the feature reference for a more complete list.</p>"},{"location":"user-guide/behaviors/#user-flow-extensions","title":"User Flow Extensions","text":"<p>Browsertrix extends the functionality compared to DevTools Recorder in the following ways:</p> <ul> <li> <p>Browsertrix Crawler will attempt to continue even if initial step fails, for up to 3 failures.</p> </li> <li> <p>If a step is repeated 3 or more times, Browsertrix Crawler will attempt to repeat the step as far as it can until the step fails.</p> </li> <li> <p>Browsertrix Crawler ignores the <code>navigate</code> and <code>viewport</code> step. 
The <code>navigate</code> event is used to match when a particular user flow should run, but does not navigate away from the page.</p> </li> <li> <p>If <code>navigate</code> step is removed, user flow can run on every page in the crawler.</p> </li> <li> <p>A <code>customStep</code> step with name <code>runOncePerCrawl</code> can be added to indicate that a user flow should run only once for a given crawl.</p> </li> </ul>"},{"location":"user-guide/behaviors/#javascript-behaviors","title":"JavaScript Behaviors","text":"<p>The main native format of custom behaviors is a Javascript class.</p> <p>There should be a single class per file, and it should be of the following format:</p>"},{"location":"user-guide/behaviors/#behavior-class","title":"Behavior Class","text":"<pre><code>class MyBehavior\n{\n // required: an id for this behavior, will be displayed in the logs\n // when the behavior is run.\n static id = \"My Behavior Id\";\n\n // required: a function that checks if a behavior should be run\n // for a given page.\n // This function can check the DOM / window.location to determine\n // what page it is on. The first behavior that returns 'true'\n // for a given page is used on that page.\n static isMatch() {\n return window.location.href === \"https://my-site.example.com/\";\n }\n\n // optional: if true, will also check isMatch() and possibly run\n // this behavior in each iframe.\n // if false, or not defined, this behavior will be skipped for iframes.\n static runInIframe = false;\n\n // optional: if defined, provides a way to define a custom way to determine\n // when a page has finished loading beyond the standard 'load' event.\n //\n // if defined, the crawler will await 'awaitPageLoad()' before moving on to\n // post-crawl processing operations, including link extraction, screenshots,\n // and running main behavior\n async awaitPageLoad() {\n\n }\n\n // required: the main behavior async iterator, which should yield for\n // each 'step' in the behavior.\n // When the iterator finishes, the behavior is done.\n // (See below for more info)\n async* run(ctx) {\n //... yield ctx.getState(\"starting behavior\");\n\n // do something\n\n //... yield ctx.getState(\"a step has been performed\");\n }\n}\n</code></pre>"},{"location":"user-guide/behaviors/#behavior-run-loop","title":"Behavior run() loop","text":"<p>The <code>run()</code> loop provides the main loop for the behavior to run. It must be an async iterator, which means that it can optionally call <code>yield</code> to return state to the crawler and allow it to print the state.</p> <p>For example, a behavior that iterates over elements and then clicks them either once or twice (based on the value of a custom <code>.clickTwice</code> property) could be written as follows:</p> <pre><code> async* run(ctx) {\n let click = 0;\n let dblClick = 0;\n for await (const elem of document.querySelectorAll(\".my-selector\")) {\n if (elem.clickTwice) {\n elem.click();\n elem.click();\n dblClick++;\n } else {\n elem.click();\n click++;\n }\n ctx.log({msg: \"Clicked on elem\", click, dblClick});\n }\n }\n</code></pre> <p>This behavior will run to completion and log every time a click event is made. 
However, this behavior can not be paused and resumed (supported in ArchiveWeb.page) and generally can not be interrupted.</p> <p>One approach is to yield after every major 'step' in the behavior, for example:</p> <pre><code> async* run(ctx) {\n let click = 0;\n let dblClick = 0;\n for await (const elem of document.querySelectorAll(\".my-selector\")) {\n if (elem.clickTwice) {\n elem.click();\n elem.click();\n dblClick++;\n // allows behavior to be paused here\n yield {msg: \"Double-clicked on elem\", click, dblClick};\n } else {\n elem.click();\n click++;\n // allows behavior to be paused here\n yield {msg: \"Single-clicked on elem\", click, dblClick};\n }\n }\n }\n</code></pre> <p>The data that is yielded will be logged in the <code>behaviorScriptCustom</code> context.</p> <p>This allows for the behavior to log the current state of the behavior and allow for it to be gracefully interrupted after each logical 'step'.</p>"},{"location":"user-guide/behaviors/#getstate-function","title":"getState() function","text":"<p>A common pattern is to increment a particular counter, and then return the whole state.</p> <p>A convenience function <code>getState()</code> is provided to simplify this and avoid the need to create custom counters.</p> <p>Using this standard function, the above code might be condensed as follows:</p> <pre><code> async* run(ctx) {\n const { Lib } = ctx;\n for await (const elem of document.querySelectorAll(\".my-selector\")) {\n if (elem.clickTwice) {\n elem.click();\n elem.click();\n yield Lib.getState(\"Double-Clicked on elem\", \"dblClick\");\n } else {\n elem.click();\n yield Lib.getState(\"Single-Clicked on elem\", \"click\");\n }\n }\n }\n</code></pre>"},{"location":"user-guide/behaviors/#utility-functions","title":"Utility Functions","text":"<p>In addition to <code>getState()</code>, Browsertrix Behaviors includes a small library of other utility functions which are available to behaviors under <code>ctx.Lib</code>.</p> <p>Some of these functions which may be of use to behaviors authors are:</p> <ul> <li><code>scrollAndClick</code>: scroll element into view and click</li> <li><code>sleep</code>: sleep for specified timeout (ms)</li> <li><code>waitUntil</code>: wait until a certain predicate is true</li> <li><code>waitUntilNode</code>: wait until a DOM node exists</li> <li><code>xpathNode</code>: find a DOM node by xpath</li> <li><code>xpathNodes</code>: find and iterate all DOM nodes by xpath</li> <li><code>xpathString</code>: find a string attribute by xpath</li> <li><code>iterChildElem</code>: iterate over all child elements of given element</li> <li><code>iterChildMatches</code>: iterate over all child elements that match a specific xpath</li> <li><code>isInViewport</code>: determine if a given element is in the visible viewport</li> <li><code>scrollToOffset</code>: scroll to particular offset</li> <li><code>scrollIntoView</code>: smoothly scroll particular element into view</li> <li><code>getState</code>: increment a state counter and return all state counters + string message</li> <li><code>addLink</code>: add a given URL to the crawl queue</li> </ul> <p>More detailed references will be added in the future.</p>"},{"location":"user-guide/behaviors/#fail-on-content-check","title":"Fail On Content Check","text":"<p>In Browsertrix Crawler 1.7.0 and higher, the <code>--failOnContentCheck</code> option will result in a crawl failing if a behavior detects the presence or absence of certain content on a page in its <code>awaitPageLoad()</code> callback. 
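As a minimal, illustrative sketch (the URL, profile path, and collection name are placeholders, not from these docs), the flag is simply added to a crawl that relies on logged-in, site-specific behaviors:</p> <pre><code># illustrative sketch: fail the crawl if a behavior's content check (e.g. logged-out detection) fails\ndocker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://social.example.com/ --profile /crawls/profiles/profile.tar.gz --failOnContentCheck --generateWACZ --collection logged-in-check\n</code></pre> <p>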
By default, this is used to fail a crawl if site-specific behaviors determine that the user is not logged in on the following sites:</p> <ul> <li>Facebook</li> <li>Instagram</li> <li>TikTok</li> <li>X</li> </ul> <p>It is also used to fail crawls with YouTube videos if one of the videos is found not to play.</p> <p>It is possible to add content checks to custom behaviors. To do so, include an <code>awaitPageLoad</code> method on the behavior and use the <code>ctx.Lib</code> function <code>assertContentValid</code> to check for content and fail the behavior with a specified reason if it is not found.</p> <p>For an example, see the following <code>awaitPageLoad</code> example from the site-specific behavior for X:</p> <pre><code>async awaitPageLoad(ctx: any) {\n const { sleep, assertContentValid } = ctx.Lib;\n await sleep(5);\n assertContentValid(() => !document.documentElement.outerHTML.match(/Log In/i), \"not_logged_in\");\n}\n</code></pre>"},{"location":"user-guide/browser-profiles/","title":"Creating and Using Browser Profiles","text":"<p>Browsertrix Crawler can use existing browser profiles when running a crawl. This allows the browser to be pre-configured by logging in to certain sites or changing other settings, before running a crawl. By creating a logged in profile, the actual login credentials are not included in the crawl, only (temporary) session cookies.</p>"},{"location":"user-guide/browser-profiles/#interactive-profile-creation","title":"Interactive Profile Creation","text":"<p>Interactive profile creation is used for creating profiles of more complex sites, or logging in to multiple sites at once.</p> <p>To use this mode, don't specify <code>--username</code> or <code>--password</code> flags and expose two ports on the Docker container to allow DevTools to connect to the browser and to serve a status page.</p> <p>In profile creation mode, Browsertrix Crawler launches a browser which uses a VNC server (via noVNC) running on port 6080 to provide a 'remote desktop' for interacting with the browser.</p> <p>After interactively logging into desired sites or configuring other settings, Create Profile should be clicked to initiate profile creation. Browsertrix Crawler will then stop the browser, and save the browser profile.</p> <p>To start in interactive profile creation mode, run:</p> <pre><code>docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --url \"https://example.com/\"\n</code></pre> <p>Then, open a browser pointing to <code>http://localhost:9223/</code> and use the embedded browser to log in to any sites or configure any settings as needed.</p> <p>Click Create Profile at the top when done. The profile will then be created in <code>./crawls/profiles/profile.tar.gz</code> containing the settings of this browsing session.</p> <p>It is also possible to use an existing profile via the <code>--profile</code> flag. This allows previous browsing sessions to be extended as needed.</p> <pre><code>docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -it webrecorder/browsertrix-crawler create-login-profile --url \"https://example.com/\" --filename \"/crawls/profiles/newProfile.tar.gz\" --profile \"/crawls/profiles/oldProfile.tar.gz\"\n</code></pre>"},{"location":"user-guide/browser-profiles/#headless-vs-headful-profiles","title":"Headless vs Headful Profiles","text":"<p>Browsertrix Crawler supports both headful and headless crawling. 
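A crawl can be run in headless mode by adding the <code>--headless</code> flag; as a minimal, illustrative sketch (the URL and collection name are placeholders):</p> <pre><code># illustrative sketch: run a crawl with a headless browser\ndocker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://example.com/ --headless --generateWACZ --collection headless-test\n</code></pre> <p>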
We have historically recommended headful crawling as the closest match to the real user experience; however, headless crawling may be faster, and in recent versions of Chromium-based browsers it should be much closer in fidelity to headful crawling.</p> <p>To use profiles in headless mode, the profiles should also be created with the <code>--headless</code> flag.</p> <p>When creating a browser profile in headless mode, Browsertrix Crawler will use the DevTools protocol on port 9222 to stream the browser interface.</p> <p>To create a profile in headless mode, run:</p> <pre><code>docker run -p 9222:9222 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ -it webrecorder/browsertrix-crawler create-login-profile --headless --url \"https://example.com/\"\n</code></pre>"},{"location":"user-guide/browser-profiles/#automated-profile-creation-for-user-login","title":"Automated Profile Creation for User Login","text":"<p>If the <code>--automated</code> flag is provided, Browsertrix Crawler will attempt to create a profile automatically after logging in to sites with a username and password. The username and password can be provided via the <code>--username</code> and <code>--password</code> flags or, if omitted, from a command-line prompt.</p> <p>When using <code>--automated</code> or <code>--username</code> / <code>--password</code>, Browsertrix Crawler will not launch an interactive browser and instead will attempt to finish automatically.</p> <p>The automated profile creation system will log in to a single website with the supplied credentials and then save the profile.</p> <p>The profile creation script also takes a screenshot so you can check whether the login succeeded.</p> <p>Example: Launch a browser and log in to the digipres.club Mastodon instance</p> <p>To automatically create a logged-in browser profile, run:</p> <pre><code>docker run -v $PWD/crawls/profiles:/crawls/profiles -it webrecorder/browsertrix-crawler create-login-profile --url \"https://digipres.club/\"\n</code></pre> <p>The script will then prompt you for login credentials, attempt to log in, and create a tar.gz file in <code>./crawls/profiles/profile.tar.gz</code>.</p> <ul> <li> <p>The <code>--url</code> parameter should specify the URL of a login page.</p> </li> <li> <p>To specify a custom filename, pass the <code>--filename</code> parameter.</p> </li> <li> <p>To specify the username and password on the command line (for automated profile creation), pass the <code>--username</code> and <code>--password</code> flags.</p> </li> <li> <p>To specify headless mode, add the <code>--headless</code> flag. Note that for crawls run with the <code>--headless</code> flag, it is recommended to also create the profile with <code>--headless</code> to ensure the profile is compatible.</p> </li> <li> <p>To specify the window size for the profile creation embedded browser, specify <code>--windowSize WIDTH,HEIGHT</code>. (The default is 1600x900)</p> </li> </ul> <p>The profile creation script attempts to detect the username and password fields on a site as generically as possible, but may not work for all sites.</p>"},{"location":"user-guide/browser-profiles/#using-browser-profile-with-a-crawl","title":"Using Browser Profile with a Crawl","text":"<p>To use a previously created profile with a crawl, use the <code>--profile</code> flag or <code>profile</code> option. The <code>--profile</code> flag can then be used to specify any Brave Browser profile stored as a tarball. 
Browser profile can be either stored locally and provided as a path, or available online at any HTTP(S) URL which will be downloaded before starting the crawl. Using profiles created with same or older version of Browsertrix Crawler is recommended to ensure compatibility. This option allows running a crawl with the browser already pre-configured, logged in to certain sites, language settings configured, etc.</p> <p>After running the above command, you can now run a crawl with the profile, as follows:</p> <pre><code>docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --profile /crawls/profiles/profile.tar.gz --url https://digipres.club/ --generateWACZ --collection test-with-profile\n</code></pre> <p>Profiles can also be loaded from an http/https URL, eg. <code>--profile https://example.com/path/to/profile.tar.gz</code>.</p>"},{"location":"user-guide/cli-options/","title":"All Command-Line Options","text":"<p>The Browsertrix Crawler Docker image currently accepts the following parameters, broken down by entrypoint:</p>"},{"location":"user-guide/cli-options/#crawler","title":"crawler","text":"<pre><code>Options:\n --help Show help [boolean]\n --version Show version number [boolean]\n --seeds, --url The URL to start crawling from\n [array] [default: []]\n --seedFile, --urlFile If set, read a list of seed urls, on\n e per line, from the specified\n [string]\n -w, --workers The number of workers to run in para\n llel [number] [default: 1]\n --crawlId, --id A user provided ID for this crawl or\n crawl configuration (can also be se\n t via CRAWL_ID env var), defaults to\n combination of Docker container hos\n tname and collection [string]\n --waitUntil Puppeteer page.goto() condition to w\n ait for before continuing, can be mu\n ltiple separated by ','\n [array] [choices: \"load\", \"domcontentloaded\", \"networkidle0\", \"networkidle2\"]\n [default: [\"load\",\"networkidle2\"]]\n --depth The depth of the crawl for all seeds\n [number] [default: -1]\n --extraHops Number of extra 'hops' to follow, be\n yond the current scope\n [number] [default: 0]\n --pageLimit, --limit Limit crawl to this number of pages\n [number] [default: 0]\n --maxPageLimit Maximum pages to crawl, overriding\n pageLimit if both are set\n [number] [default: 0]\n --pageLoadTimeout, --timeout Timeout for each page to load (in se\n conds) [number] [default: 90]\n --scopeType A predefined scope of the crawl. For\n more customization, use 'custom' an\n d set scopeIncludeRx regexes\n [string] [choices: \"page\", \"page-spa\", \"prefix\", \"host\", \"domain\", \"any\", \"cus\n tom\"]\n --scopeIncludeRx, --include Regex of page URLs that should be in\n cluded in the crawl (defaults to the\n immediate directory of URL)[string]\n --scopeExcludeRx, --exclude Regex of page URLs that should be ex\n cluded from the crawl. 
[string]\n --allowHashUrls Allow Hashtag URLs, useful for singl\n e-page-application crawling or when\n different hashtags load dynamic cont\n ent\n --selectLinks, --linkSelector One or more selectors for extracting\n links, in the format [css selector]\n ->[property to use],[css selector]->\n @[attribute to use]\n [array] [default: [\"a[href]->href\"]]\n --clickSelector Selector for elements to click when\n using the autoclick behavior\n [string] [default: \"a\"]\n --blockRules Additional rules for blocking certai\n n URLs from being loaded, by URL reg\n ex and optionally via text match in\n an iframe [array] [default: []]\n --blockMessage If specified, when a URL is blocked,\n a record with this error message is\n added instead[string] [default: \"\"]\n --blockAds, --blockads If set, block advertisements from be\n ing loaded (based on Stephen Black's\n blocklist)\n [boolean] [default: false]\n --adBlockMessage If specified, when an ad is blocked,\n a record with this error message is\n added instead[string] [default: \"\"]\n -c, --collection Collection name / directory to crawl\n into[string] [default: \"crawl-@ts\"]\n --headless Run in headless mode, otherwise star\n t xvfb [boolean] [default: false]\n --driver Custom driver for the crawler, if an\n y [string]\n --generateCDX, --generatecdx, --gene If set, generate merged index in CDX\n rateCdx J format [boolean] [default: false]\n --combineWARC, --combinewarc, --comb If set, combine the warcs\n ineWarc [boolean] [default: false]\n --rolloverSize If set, declare the rollover size\n [number] [default: 1000000000]\n --generateWACZ, --generatewacz, --ge If set, generate WACZ on disk\n nerateWacz [boolean] [default: false]\n --useSHA1 If set, sha-1 instead of sha-256 has\n hes will be used for creating record\n s [boolean] [default: false]\n --logging Logging options for crawler, can inc\n lude: stats (enabled by default), js\n errors, debug\n [array] [default: [\"stats\"]]\n --logLevel Comma-separated list of log levels t\n o include in logs\n [array] [default: []]\n --context, --logContext Comma-separated list of contexts to\n include in logs\n [array] [choices: \"general\", \"worker\", \"recorder\", \"recorderNetwork\", \"writer\"\n , \"state\", \"redis\", \"storage\", \"text\", \"exclusion\", \"screenshots\", \"screencast\n \", \"originOverride\", \"healthcheck\", \"browser\", \"blocking\", \"behavior\", \"behavi\n orScript\", \"behaviorScriptCustom\", \"jsError\", \"fetch\", \"pageStatus\", \"memorySt\n atus\", \"crawlStatus\", \"links\", \"sitemap\", \"wacz\", \"replay\", \"proxy\", \"scope\",\n \"robots\"] [default: []]\n --logExcludeContext Comma-separated list of contexts to\n NOT include in logs\n [array] [choices: \"general\", \"worker\", \"recorder\", \"recorderNetwork\", \"writer\"\n , \"state\", \"redis\", \"storage\", \"text\", \"exclusion\", \"screenshots\", \"screencast\n \", \"originOverride\", \"healthcheck\", \"browser\", \"blocking\", \"behavior\", \"behavi\n orScript\", \"behaviorScriptCustom\", \"jsError\", \"fetch\", \"pageStatus\", \"memorySt\n atus\", \"crawlStatus\", \"links\", \"sitemap\", \"wacz\", \"replay\", \"proxy\", \"scope\",\n \"robots\"] [default: [\"recorderNetwork\",\"jsError\",\"screencast\"]]\n --text Extract initial (default) or final t\n ext to pages.jsonl or WARC resource\n record(s)\n [array] [choices: \"to-pages\", \"to-warc\", \"final-to-warc\"]\n --cwd Crawl working directory for captures\n . 
If not set, defaults to process.cw\n d() [string] [default: \"/crawls\"]\n --mobileDevice Emulate mobile device by name from:\n https://github.com/puppeteer/puppete\n er/blob/main/src/common/DeviceDescri\n ptors.ts [string]\n --userAgent Override user-agent with specified s\n tring [string]\n --userAgentSuffix Append suffix to existing browser us\n er-agent (ex: +MyCrawler, info@examp\n le.com) [string]\n --useSitemap, --sitemap If enabled, check for sitemaps at /s\n itemap.xml, or custom URL if URL is\n specified\n --sitemapFromDate, --sitemapFrom If set, filter URLs from sitemaps to\n those greater than or equal to (>=)\n provided ISO Date string (YYYY-MM-D\n D or YYYY-MM-DDTHH:MM:SS or partial\n date) [string]\n --sitemapToDate, --sitemapTo If set, filter URLs from sitemaps to\n those less than or equal to (<=) pr\n ovided ISO Date string (YYYY-MM-DD o\n r YYYY-MM-DDTHH:MM:SS or partial dat\n e) [string]\n --statsFilename If set, output stats as JSON to this\n file. (Relative filename resolves t\n o crawl working directory) [string]\n --behaviors Which background behaviors to enable\n on each page\n [array] [default: [\"autoplay\",\"autofetch\",\"autoscroll\",\"siteSpecific\"]]\n --behaviorTimeout If >0, timeout (in seconds) for in-p\n age behavior will run on each page.\n If 0, a behavior can run until finis\n h. [number] [default: 90]\n --postLoadDelay If >0, amount of time to sleep (in s\n econds) after page has loaded, befor\n e taking screenshots / getting text\n / running behaviors\n [number] [default: 0]\n --pageExtraDelay, --delay If >0, amount of time to sleep (in s\n econds) after behaviors before movin\n g on to next page\n [number] [default: 0]\n --profile, --loadProfile Path or HTTP(S) URL to tar.gz file w\n hich contains the browser profile di\n rectory [string]\n --saveProfile If set, save profile if crawl succee\n ded successfully. If no value provid\n ed, save back to save location as sp\n ecified in --profile\n --screenshot Screenshot options for crawler, can\n include: view, thumbnail, fullPage,\n fullPageFinal\n [array] [choices: \"view\", \"thumbnail\", \"fullPage\", \"fullPageFinal\"] [default:\n []]\n --screencastPort If set to a non-zero value, starts a\n n HTTP server with screencast access\n ible on this port\n [number] [default: 0]\n --screencastRedis If set, will use the state store red\n is pubsub for screencasting. Require\n s --redisStoreUrl to be set\n [boolean] [default: false]\n --warcInfo, --warcinfo Optional fields added to the warcinf\n o record in combined WARCs\n --redisStoreUrl If set, url for remote redis server\n to store state. Otherwise, using loc\n al redis instance\n [string] [default: \"redis://localhost:6379/0\"]\n --saveState If the crawl state should be seriali\n zed to the crawls/ directory. 
Defaul\n ts to 'partial', only saved when cra\n wl is interrupted\n [string] [choices: \"never\", \"partial\", \"always\"] [default: \"partial\"]\n --saveStateInterval If save state is set to 'always', al\n so save state during the crawl at th\n is interval (in seconds)\n [number] [default: 300]\n --saveStateHistory Number of save states to keep during\n the duration of a crawl\n [number] [default: 5]\n --sizeLimit If set, save state and exit if size\n limit exceeds this value\n [number] [default: 0]\n --diskUtilization If set, save state and exit if disk\n utilization exceeds this percentage\n value [number] [default: 0]\n --timeLimit If set, save state and exit after ti\n me limit, in seconds\n [number] [default: 0]\n --healthCheckPort port to run healthcheck on\n [number] [default: 0]\n --overwrite overwrite current crawl data: if set\n , existing collection directory will\n be deleted before crawl is started\n [boolean] [default: false]\n --waitOnDone if set, wait for interrupt signal wh\n en finished instead of exiting\n [boolean] [default: false]\n --restartsOnError if set, assume will be restarted if\n interrupted, don't run post-crawl pr\n ocesses on interrupt\n [boolean] [default: false]\n --netIdleWait number of seconds to wait for networ\n k idle after page load and after beh\n aviors are done (default: 2)\n [number] [default: 2]\n --netIdleMaxRequests max active requests allowed for netw\n ork to be considered idle\n [default: 1]\n --lang if set, sets the language used by th\n e browser, should be ISO 639 languag\n e[-country] code [string]\n --title If set, write supplied title into WA\n CZ datapackage.json metadata[string]\n --description, --desc If set, write supplied description i\n nto WACZ datapackage.json metadata\n [string]\n --originOverride if set, will redirect requests from\n each origin in key to origin in the\n value, eg. --originOverride https://\n host:port=http://alt-host:alt-port\n [array] [default: []]\n --logErrorsToRedis If set, write error messages to redi\n s [boolean] [default: false]\n --logBehaviorsToRedis If set, write behavior script messag\n es to redis\n [boolean] [default: false]\n --writePagesToRedis If set, write page objects to redis\n [boolean] [default: false]\n --maxPageRetries, --retries If set, number of times to retry a p\n age that failed to load before page\n is considered to have failed\n [number] [default: 2]\n --failOnFailedSeed If set, crawler will fail with exit\n code 1 if any seed fails. When combi\n ned with --failOnInvalidStatus,will\n result in crawl failing with exit co\n de 1 if any seed has a 4xx/5xx respo\n nse [boolean] [default: false]\n --failOnFailedLimit If set, save state and exit if numbe\n r of failed pages exceeds this value\n [number] [default: 0]\n --failOnInvalidStatus If set, will treat pages with 4xx or\n 5xx response as failures. When comb\n ined with --failOnFailedLimit or --f\n ailOnFailedSeed may result in crawl\n failing due to non-200 responses\n [boolean] [default: false]\n --failOnContentCheck If set, allows for behaviors to fail\n a crawl with custom reason based on\n content (e.g. logged out)\n [boolean] [default: false]\n --customBehaviors Custom behavior files to inject. Val\n id values: URL to file, path to file\n , path to directory of behaviors, UR\n L to Git repo of behaviors (prefixed\n with git+, optionally specify branc\n h and relative path to a directory w\n ithin repo as branch and path query\n parameters, e.g. 
--customBehaviors \"\n git+https://git.example.com/repo.git\n ?branch=dev&path=some/dir\"\n [array] [default: []]\n --saveStorage if set, will store the localStorage/\n sessionStorage data for each page as\n part of WARC-JSON-Metadata field\n [boolean]\n --debugAccessRedis if set, runs internal redis without\n protected mode to allow external acc\n ess (for debugging) [boolean]\n --debugAccessBrowser if set, allow debugging browser on p\n ort 9222 via CDP [boolean]\n --warcPrefix prefix for WARC files generated, inc\n luding WARCs added to WACZ [string]\n --serviceWorker, --sw service worker handling: disabled, e\n nabled, or disabled with custom prof\n ile\n [choices: \"disabled\", \"disabled-if-profile\", \"enabled\"] [default: \"disabled\"]\n --proxyServer if set, will use specified proxy ser\n ver. Takes precedence over any env v\n ar proxy settings [string]\n --proxyServerPreferSingleProxy if set, and both proxyServer and pro\n xyServerConfig are provided, the pro\n xyServer value will be preferred\n [boolean] [default: false]\n --proxyServerConfig if set, path to yaml/json file that\n configures multiple path servers per\n URL regex [string]\n --dryRun If true, no archive data is written\n to disk, only pages and logs (and op\n tionally saved state). [boolean]\n --qaSource Required for QA mode. Source (WACZ o\n r multi WACZ) for QA [string]\n --qaDebugImageDiff if specified, will write crawl.png,\n replay.png and diff.png for each pag\n e where they're different [boolean]\n --sshProxyPrivateKeyFile path to SSH private key for SOCKS5 o\n ver SSH proxy connection [string]\n --sshProxyKnownHostsFile path to SSH known hosts file for SOC\n KS5 over SSH proxy connection\n [string]\n --extraChromeArgs Extra arguments to pass directly to\n the Chrome instance (space-separated\n or multiple --extraChromeArgs)\n [array] [default: []]\n --useRobots, --robots If set, fetch and respect page disal\n lows specified in per-host robots.tx\n t [boolean] [default: false]\n --robotsAgent Agent to check in addition to '*' fo\n r robots rules\n [string] [default: \"Browsertrix/1.x\"]\n --config Path to YAML config file\n</code></pre>"},{"location":"user-guide/cli-options/#create-login-profile","title":"create-login-profile","text":"<pre><code>Options:\n --help Show help [boolean]\n --version Show version number [boolean]\n --url The URL of the login page [string] [required]\n --user The username for the login. If not specified, will b\n e prompted [string]\n --password The password for the login. If not specified, will b\n e prompted (recommended) [string]\n --filename The filename for the profile tarball, stored within\n /crawls/profiles if absolute path not provided\n [string] [default: \"/crawls/profiles/profile.tar.gz\"]\n --debugScreenshot If specified, take a screenshot after login and save\n as this filename [boolean] [default: false]\n --headless Run in headless mode, otherwise start xvfb\n [boolean] [default: false]\n --automated Start in automated mode, no interactive browser\n [boolean] [default: false]\n --interactive Deprecated. 
Now the default option!\n [boolean] [default: false]\n --shutdownWait Shutdown browser in interactive after this many seco\n nds, if no pings received [number] [default: 0]\n --profile Path or HTTP(S) URL to tar.gz file which contains th\n e browser profile directory [string] [default: \"\"]\n --windowSize Browser window dimensions, specified as: width,heigh\n t [string] [default: \"1360,1020\"]\n --cookieDays If >0, set all cookies, including session cookies, t\n o have this duration in days before saving profile\n [number] [default: 7]\n --proxyServer if set, will use specified proxy server. Takes prece\n dence over any env var proxy settings [string]\n --proxyServerConfig if set, path to yaml/json file that configures multi\n ple path servers per URL regex [string]\n --sshProxyPrivateKeyFile path to SSH private key for SOCKS5 over SSH proxy co\n nnection [string]\n --sshProxyKnownHostsFile path to SSH known hosts file for SOCKS5 over SSH pro\n xy connection [string]\n</code></pre>"},{"location":"user-guide/common-options/","title":"Commonly-Used Options","text":""},{"location":"user-guide/common-options/#waiting-for-page-load","title":"Waiting for Page Load","text":"<p>One of the key nuances of browser-based crawling is determining when a page is finished loading. This can be configured with the <code>--waitUntil</code> flag.</p> <p>The default is <code>load,networkidle2</code>, which waits until page load and \u22642 requests remain, but for static sites, <code>--wait-until domcontentloaded</code> may be used to speed up the crawl (to avoid waiting for ads to load for example). <code>--waitUntil networkidle0</code> may make sense for sites where absolutely all requests must be waited until before proceeding.</p> <p>See page.goto waitUntil options for more info on the options that can be used with this flag from the Puppeteer docs.</p> <p>The <code>--pageLoadTimeout</code>/<code>--timeout</code> option sets the timeout in seconds for page load, defaulting to 90 seconds. Behaviors will run on the page once either the page load condition or the page load timeout is met, whichever happens first.</p>"},{"location":"user-guide/common-options/#additional-wait","title":"Additional Wait","text":"<p>Occasionally, a page may seem to have loaded, but performs dynamic initialization / additional loading. This is can be hard to detect, and the <code>--postLoadDelay</code> flag can be used to specify additional seconds to wait after the page appears to have loaded, before moving on to post-processing actions, such as link extraction, screenshotting and text extraction (see below).</p> <p>(On the other hand, the <code>--pageExtraDelay</code>/<code>--delay</code> adds an extra after all post-load actions have taken place, and can be useful for rate-limiting.)</p>"},{"location":"user-guide/common-options/#link-extraction","title":"Link Extraction","text":"<p>By default, the crawler will extract all <code>href</code> properties from all <code><a></code> tags that have an <code>href</code>. This can be customized with the <code>--selectLinks</code> option, which can provide alternative selectors of the form: <code>[css selector]->[property to use]</code> or <code>[css selector]->@[attribute to use]</code>. 
The default value is <code>a[href]->href</code>.</p> <p>For example, to specify the default, but also include all <code>divs</code> that have class <code>mylink</code> and use <code>custom-href</code> attribute as the link, use <code>--selectLinks 'a[href]->href' --selectLinks 'div.mylink->@custom-href'</code>.</p> <p>Any number of selectors can be specified in this way, and each will be applied in sequence on each page.</p>"},{"location":"user-guide/common-options/#ad-blocking","title":"Ad Blocking","text":"<p>Brave Browser, the browser used by Browsertrix Crawler for crawling, has some ad and tracker blocking features enabled by default. These Shields be disabled or customized using Browser Profiles.</p> <p>Browsertrix Crawler also supports blocking ads from being loaded during capture based on Stephen Black's list of known ad hosts. To enable ad blocking based on this list, use the <code>--blockAds</code> option. If <code>--adBlockMessage</code> is set, a record with the specified error message will be added in the ad's place.</p>"},{"location":"user-guide/common-options/#sitemap-parsing","title":"Sitemap Parsing","text":"<p>The <code>--sitemap</code> option can be used to have the crawler parse a sitemap and queue any found URLs while respecting the crawl's scoping rules and limits. Browsertrix Crawler is able to parse regular sitemaps as well as sitemap indices that point out to nested sitemaps.</p> <p>By default, <code>--sitemap</code> will look for a sitemap at <code><your-seed>/sitemap.xml</code>. If a website's sitemap is hosted at a different URL, pass the URL with the flag like <code>--sitemap <sitemap url></code>.</p> <p>The <code>--sitemapFrom</code>/<code>--sitemapFromDate</code> and <code>--sitemapTo</code>/<code>--sitemapToDate</code> options allow for only extracting pages within a specific date range. If set, these options will filter URLs from sitemaps to those greater than or equal to (>=) or lesser than or equal to (<=) a provided ISO Date string (<code>YYYY-MM-DD</code>, <code>YYYY-MM-DDTHH:MM:SS</code>, or partial date), respectively.</p>"},{"location":"user-guide/common-options/#custom-warcinfo-fields","title":"Custom Warcinfo Fields","text":"<p>Custom fields can be added to the <code>warcinfo</code> WARC record, generated for each combined WARC. 
The fields can be specified in the YAML config under <code>warcinfo</code> section or specifying individually via the command-line.</p> <p>For example, the following are equivalent ways to add additional warcinfo fields:</p> <p>via yaml config:</p> <pre><code>warcinfo:\n operator: my-org\n hostname: hostname.my-org\n</code></pre> <p>via command-line:</p> <pre><code>--warcinfo.operator my-org --warcinfo.hostname hostname.my-org\n</code></pre>"},{"location":"user-guide/common-options/#screenshots","title":"Screenshots","text":"<p>Browsertrix Crawler includes the ability to take screenshots of each page crawled via the <code>--screenshot</code> option.</p> <p>Three screenshot options are available:</p> <ul> <li><code>--screenshot view</code>: Takes a png screenshot of the initially visible viewport (1920x1080)</li> <li><code>--screenshot fullPage</code>: Takes a png screenshot of the full page</li> <li><code>--screenshot thumbnail</code>: Takes a jpeg thumbnail of the initially visible viewport (1920x1080)</li> </ul> <p>These can be combined using a comma-separated list passed via the <code>--screenshot</code> option, e.g.: <code>--screenshot thumbnail,view,fullPage</code> or passed in separately <code>--screenshot thumbnail --screenshot view --screenshot fullPage</code>.</p> <p>Screenshots are written into a <code>screenshots.warc.gz</code> WARC file in the <code>archives/</code> directory. If the <code>--generateWACZ</code> command line option is used, the screenshots WARC is written into the <code>archive</code> directory of the WACZ file and indexed alongside the other WARCs.</p>"},{"location":"user-guide/common-options/#screencasting","title":"Screencasting","text":"<p>Browsertrix Crawler includes a screencasting option which allows watching the crawl in real-time via screencast (connected via a websocket).</p> <p>To enable, add <code>--screencastPort</code> command-line option and also map the port on the docker container. An example command might be:</p> <pre><code>docker run -p 9037:9037 -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url https://www.example.com --screencastPort 9037\n</code></pre> <p>Then, open <code>http://localhost:9037/</code> and watch the crawl!</p>"},{"location":"user-guide/common-options/#text-extraction","title":"Text Extraction","text":"<p>Browsertrix Crawler supports text extraction via the <code>--text</code> flag, which accepts one or more of the following extraction options:</p> <ul> <li><code>--text to-pages</code> \u2014 Extract initial text and add it to the text field in pages.jsonl</li> <li><code>--text to-warc</code> \u2014 Extract initial page text and add it to a <code>urn:text:<url></code> WARC resource record</li> <li><code>--text final-to-warc</code> \u2014 Extract the final page text after all behaviors have run and add it to a <code>urn:textFinal:<url></code> WARC resource record</li> </ul> <p>The options can be separate or combined into a comma separate list, eg. <code>--text to-warc,final-to-warc</code> or <code>--text to-warc --text final-to-warc</code> are equivalent. 
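A complete crawl command using these options might look like the following sketch (the URL and collection name are placeholders):</p> <pre><code># illustrative sketch: extract initial and final page text into WARC resource records\ndocker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://example.com/ --text to-warc,final-to-warc --generateWACZ --collection text-test\n</code></pre> <p>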
For backwards compatibility, <code>--text</code> alone is equivalent to <code>--text to-pages</code>.</p>"},{"location":"user-guide/common-options/#uploading-crawl-outputs-to-s3-compatible-storage","title":"Uploading Crawl Outputs to S3-Compatible Storage","text":"<p>Browsertrix Crawler includes support for uploading WACZ files to S3-compatible storage, and notifying a webhook when the upload succeeds.</p> <p>S3 upload is only supported when WACZ output is enabled and will not work for WARC output.</p> <p>This feature can currently be enabled by setting environment variables (for security reasons, these settings are not passed in as part of the command-line or YAML config at this time).</p> <p>Environment variables for S3-uploads include:</p> <ul> <li><code>STORE_ACCESS_KEY</code> / <code>STORE_SECRET_KEY</code> \u2014 S3 credentials</li> <li><code>STORE_ENDPOINT_URL</code> \u2014 S3 endpoint URL</li> <li><code>STORE_PATH</code> \u2014 optional path appended to endpoint, if provided</li> <li><code>STORE_FILENAME</code> \u2014 filename or template for filename to put on S3</li> <li><code>STORE_USER</code> \u2014 optional username to pass back as part of the webhook callback</li> <li><code>STORE_REGION</code> - optional region to pass to S3 endpoint. Defaults to <code>us-east-1</code> if unspecified.</li> <li><code>CRAWL_ID</code> \u2014 unique crawl id (defaults to container hostname)</li> <li><code>WEBHOOK_URL</code> \u2014 the URL of the webhook (can be http://, https://, or redis://)</li> </ul>"},{"location":"user-guide/common-options/#webhook-notification","title":"Webhook Notification","text":"<p>The webhook URL can be an HTTP URL which receives a JSON POST request OR a Redis URL, which specifies a redis list key to which the JSON data is pushed as a string.</p> <p>Webhook notification JSON includes:</p> <ul> <li><code>id</code> \u2014 crawl id (value of <code>CRAWL_ID</code>)</li> <li><code>userId</code> \u2014 user id (value of <code>STORE_USER</code>)</li> <li><code>filename</code> \u2014 bucket path + filename of the file</li> <li><code>size</code> \u2014 size of WACZ file</li> <li><code>hash</code> \u2014 SHA-256 of WACZ file</li> <li><code>completed</code> \u2014 boolean of whether crawl fully completed or partially (due to interrupt signal or other error).</li> </ul>"},{"location":"user-guide/common-options/#saving-crawl-state-interrupting-and-restarting-the-crawl","title":"Saving Crawl State: Interrupting and Restarting the Crawl","text":"<p>A crawl can be gracefully interrupted with Ctrl-C (SIGINT) or a SIGTERM (see below for more details).</p> <p>When a crawl is interrupted, the current crawl state is written to the <code>crawls</code> subdirectory inside the collection directory. The crawl state includes the current YAML config, if any, plus the current state of the crawl.</p> <p>This crawl state YAML file can then be used as <code>--config</code> option to restart the crawl from where it was left of previously. When restarting a crawl you will need to include any command line options you used to start the original crawl (e.g. <code>--url</code>), since these are not persisted to the crawl state.</p> <p>By default, the crawl interruption waits for current pages to finish. A subsequent SIGINT will cause the crawl to stop immediately. Any unfinished pages are recorded in the <code>pending</code> section of the crawl state (if gracefully finished, the section will be empty).</p> <p>By default, the crawl state is only written when a crawl is interrupted before completing. 
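As an illustrative sketch (the state filename and collection name below are hypothetical), a crawl could be resumed by passing the saved state file as <code>--config</code> while repeating the original command-line options such as <code>--url</code>:</p> <pre><code># illustrative sketch: resume an interrupted crawl from its saved state file\ndocker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --config /crawls/collections/my-crawl/crawls/crawl-state.yaml --url https://example.com/ --collection my-crawl --generateWACZ\n</code></pre> <p>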
The <code>--saveState</code> CLI option can be set to <code>always</code> or <code>never</code> to control when the crawl state file should be written.</p>"},{"location":"user-guide/common-options/#periodic-state-saving","title":"Periodic State Saving","text":"<p>When <code>--saveState</code> is set to <code>always</code>, Browsertrix Crawler will also save the state automatically during the crawl, as set by the <code>--saveStateInterval</code> setting. The crawler will keep the last <code>--saveStateHistory</code> save states and delete older ones. This provides an extra backup: in the event that the crawl fails unexpectedly or is not terminated via Ctrl-C, several previous crawl states are still available.</p>"},{"location":"user-guide/common-options/#crawl-interruption-options","title":"Crawl Interruption Options","text":"<p>Browsertrix Crawler has different crawl interruption modes, and does everything it can to ensure the WARC data written is always valid when a crawl is interrupted. The following are three interruption scenarios:</p>"},{"location":"user-guide/common-options/#1-graceful-shutdown","title":"1. Graceful Shutdown","text":"<p>Initiated when a single SIGINT (Ctrl+C) or SIGTERM (<code>docker kill -s SIGINT</code>, <code>docker kill -s SIGTERM</code>, <code>kill</code>) signal is received.</p> <p>The crawler will attempt to finish current pages, finish any pending async requests, write all WARCs, generate WACZ files and finish other post-processing, save state from Redis, and then exit.</p>"},{"location":"user-guide/common-options/#2-less-graceful-quick-shutdown","title":"2. Less-Graceful, Quick Shutdown","text":"<p>If a second SIGINT / SIGTERM is received, the crawler will close the browser immediately, interrupting any ongoing network requests. Any asynchronous fetching will not be finished. However, anything in the WARC queue will be written and WARC files will be flushed. WACZ files and other post-processing will not be generated, but the current state from Redis will still be saved if enabled (see above). WARC records should be fully finished and WARC files should be valid, though they may not contain all the data for the pages being processed during the interruption.</p>"},{"location":"user-guide/common-options/#3-violent-immediate-shutdown","title":"3. Violent / Immediate Shutdown","text":"<p>If the crawler is killed, eg. with a SIGKILL signal (<code>docker kill</code>, <code>kill -9</code>), the crawler container / process will be immediately shut down. It will not have a chance to finish any WARC files, and there is no guarantee that WARC files will be valid, but the crawler will of course exit right away.</p>"},{"location":"user-guide/common-options/#recommendations","title":"Recommendations","text":"<p>It is recommended to gracefully stop the crawler by sending a SIGINT or SIGTERM signal, which can be done via Ctrl+C or <code>docker kill -s SIGINT <containerid></code>. Repeating the command will result in a faster, slightly less-graceful shutdown. 
Using SIGKILL is not recommended except for last resort, and only when data is to be discarded.</p> <p>Note: When using the crawler in the Browsertrix app / in Kubernetes general, stopping a crawl / stopping a pod always results in option #1 (sending a single SIGTERM signal) to the crawler pod(s)</p>"},{"location":"user-guide/crawl-scope/","title":"Crawl Scope","text":""},{"location":"user-guide/crawl-scope/#configuring-pages-included-or-excluded-from-a-crawl","title":"Configuring Pages Included or Excluded from a Crawl","text":"<p>The crawl scope can be configured globally for all seeds, or customized per seed, by specifying the <code>--scopeType</code> command-line option or setting the <code>type</code> property for each seed.</p> <p>The <code>depth</code> option also limits how many pages will be crawled for that seed, while the <code>limit</code> option sets the total number of pages crawled from any seed.</p> <p>The scope controls which linked pages are included and which pages are excluded from the crawl.</p> <p>To make this configuration as simple as possible, there are several predefined scope types. The available types are:</p> <ul> <li> <p><code>page</code> \u2014 crawl only this page and no additional links.</p> </li> <li> <p><code>page-spa</code> \u2014 crawl only this page, but load any links that include different hashtags. Useful for single-page apps that may load different content based on hashtag.</p> </li> <li> <p><code>prefix</code> \u2014 crawl any pages in the same directory, eg. starting from <code>https://example.com/path/page.html</code>, crawl anything under <code>https://example.com/path/</code> (default)</p> </li> <li> <p><code>host</code> \u2014 crawl pages that share the same host.</p> </li> <li> <p><code>domain</code> \u2014 crawl pages that share the same domain and subdomains, eg. given <code>https://example.com/</code> will also crawl <code>https://anysubdomain.example.com/</code></p> </li> <li> <p><code>any</code> \u2014 crawl any and all pages linked from this page..</p> </li> <li> <p><code>custom</code> \u2014 crawl based on the <code>--include</code> regular expression rules.</p> </li> </ul> <p>The scope settings for multi-page crawls (page-spa, prefix, host, domain) also include http/https versions, eg. given a prefix of <code>http://example.com/path/</code>, <code>https://example.com/path/</code> is also included.</p>"},{"location":"user-guide/crawl-scope/#custom-scope-inclusion-rules","title":"Custom Scope Inclusion Rules","text":"<p>Instead of setting a scope type, it is possible to configure a custom scope regular expression (regex) by setting <code>--include</code> to one or more regular expressions. 
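For example, a custom scope might be set on the command line like this (a sketch; the include pattern shown is hypothetical):</p> <pre><code>docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url https://example.com/blog/ --include \"example\\.com/blog/.*\"\n</code></pre> <p>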
If using the YAML config, the <code>include</code> field can contain a list of regexes.</p> <p>Extracted links that match the regular expression will be considered 'in scope' and included.</p>"},{"location":"user-guide/crawl-scope/#custom-scope-exclusion-rules","title":"Custom Scope Exclusion Rules","text":"<p>In addition to the inclusion rules, Browsertrix Crawler supports a separate list of exclusion regexes, that if matched, override and exclude a URL from the crawl.</p> <p>The exclusion regexes are often used with a custom scope, but could be used with a predefined scopeType as well.</p>"},{"location":"user-guide/crawl-scope/#extra-hops-beyond-current-scope","title":"Extra 'Hops' Beyond Current Scope","text":"<p>Occasionally, it may be useful to augment the scope by allowing extra links N 'hops' beyond the current scope.</p> <p>For example, this is most useful when crawling with a <code>host</code> or <code>prefix</code> scope, but also wanting to include 'one extra hop' \u2014 any link to external pages beyond the current host \u2014 but not following any of the links on those pages. This is possible with the <code>extraHops</code> setting, which defaults to 0, but can be set to a higher value N (usually 1) to go beyond the current scope.</p> <p>The <code>--extraHops</code> setting can be set globally or per seed to allow expanding the current inclusion scope N 'hops' beyond the configured scope. Note that this mechanism only expands the inclusion scope, and any exclusion rules are still applied. If a URL is to be excluded via the exclusion rules, that will take precedence over the <code>--extraHops</code>.</p>"},{"location":"user-guide/crawl-scope/#scope-rule-examples","title":"Scope Rule Examples","text":"<p>Regular expression exclude rules</p> <p>A crawl started with this config will start on <code>https://example.com/startpage.html</code> and crawl all pages on the <code>https://example.com/</code> domain except pages that match the exclusion rules \u2014 URLs that contain the strings <code>example.com/skip</code> or <code>example.com/search</code> followed by any number of characters, and URLs that contain the string <code>postfeed</code>.</p> <p><code>https://example.com/page.html</code> will be crawled but <code>https://example.com/skip/postfeed</code>, <code>https://example.com/skip/this-page.html</code>, and <code>https://example.com/search?q=searchstring</code> will not.</p> <pre><code>seeds:\n - url: https://example.com/startpage.html\n scopeType: \"host\"\n exclude:\n - example.com/skip.*\n - example.com/search.*\n - postfeed\n</code></pre> <p>Regular expression include and exclude rules</p> <p>In this example config, the scope includes regular expressions that will crawl all page URLs that match <code>example.com/(crawl-this|crawl-that)</code>, and exclude any URLs that terminate with exactly <code>skip</code>.</p> <p><code>https://example.com/crawl-this/page.html</code> and <code>https://example.com/crawl-this/page/skipme/not</code> will be crawled but <code>https://example.com/crawl-this/page/skip</code> will not.</p> <pre><code>seeds:\n - url: https://example.com/startpage.html\n include: example.com/(crawl-this|crawl-that)\n exclude:\n - skip$\n</code></pre> <p>More complicated regular expressions</p> <p>This example exclusion rule targets characters and numbers after <code>search</code> until the string <code>ID=</code>, followed by any amount of numbers.</p> <p><code>https://example.com/search/ID=5819</code>, <code>https://example.com/search/6vH8R4Tm</code>, and 
<code>https://example.com/search/2o3Jq89cID=5ag8h19</code> will be crawled but <code>https://example.com/search/6vH8R4TmID=5819</code> will not.</p> <pre><code>seeds:\n - url: https://example.com/startpage.html\n scopeType: \"host\"\n exclude:\n - example.com/search/[A-Za-z0-9]+ID=[0-9]+\n</code></pre> <p>The <code>include</code>, <code>exclude</code>, <code>scopeType</code>, and <code>depth</code> settings can be configured per seed or globally for the entire crawl.</p> <p>The per-seed settings override the per-crawl settings, if any.</p> <p>See the test suite tests/scopes.test.js for additional examples of configuring scope inclusion and exclusion rules.</p> <p>Note</p> <p>Include and exclude rules are always regular expressions. For rules to match, you may have to escape special characters that commonly appear in URLs like <code>?</code>, <code>+</code>, or <code>.</code> by placing a <code>\\</code> before the character. For example: <code>youtube.com/watch\\?rdwz7QiG0lk</code>.</p> <p>Browsertrix Crawler does not log excluded URLs.</p>"},{"location":"user-guide/crawl-scope/#page-resource-block-rules","title":"Page Resource Block Rules","text":"<p>While scope rules define which pages are to be crawled, it is also possible to block page resources, that is, URLs loaded within a page or within an iframe on a page.</p> <p>For example, this is useful for blocking ads or other unwanted content that is loaded within multiple pages.</p> <p>The page resource block rules can be specified as a list in the <code>blockRules</code> field. Each rule can contain the following fields:</p> <ul> <li> <p><code>url</code>: regex for URL to match (required)</p> </li> <li> <p><code>type</code>: can be <code>block</code> or <code>allowOnly</code>. The block rule blocks the specified match, while allowOnly inverts the match and allows only the matched URLs, while blocking all others.</p> </li> <li> <p><code>inFrameUrl</code>: if specified, indicates that the rule only applies when <code>url</code> is loaded in a specific iframe or top-level frame.</p> </li> <li> <p><code>frameTextMatch</code>: if specified, the text of the specified URL is checked for the regex, and the rule applies only if there is an additional match. When specified, this field makes the block rule apply only to frame-level resources, eg. 
URLs loaded directly in an iframe or top-level frame.</p> </li> </ul> <p>For example, a very simple block rule that blocks all URLs from 'googleanalytics.com' on any page can be added with:</p> <pre><code>blockRules:\n - url: googleanalytics.com\n</code></pre> <p>To instead block 'googleanalytics.com' only if loaded within pages or iframes that match the regex 'example.com/no-analytics', add:</p> <pre><code>blockRules:\n - url: googleanalytics.com\n inFrameUrl: example.com/no-analytics\n</code></pre> <p>For additional examples of block rules, see the tests/blockrules.test.js file in the test suite.</p> <p>If the <code>--blockMessage</code> option is also specified, a blocked URL is replaced with the specified message (added as a WARC resource record).</p>"},{"location":"user-guide/crawl-scope/#page-resource-block-rules-vs-scope-rules","title":"Page Resource Block Rules vs Scope Rules","text":"<p>If it seems confusing which rules should be used, here is a quick way to determine:</p> <ul> <li> <p>If you'd like to restrict the pages that are being crawled, use the crawl scope rules (defined above).</p> </li> <li> <p>If you'd like to restrict parts of a page that are being loaded, use the page resource block rules described in this section.</p> </li> </ul> <p>The blockRules add a filter to each URL loaded on a page and incur extra overhead. They should only be used in advanced use cases where part of a page needs to be blocked.</p> <p>These rules cannot be used to prevent entire pages from loading \u2014 use the scope exclusion rules for that (a warning will be printed if a page resource block rule matches a top-level page).</p>"},{"location":"user-guide/exit-codes/","title":"Exit codes","text":"<p>The crawler uses the following exit codes to indicate the crawl result.</p> Code Name Description 0 Success Crawl completed normally 1 GenericError Unspecified error, check logs for more details 3 OutOfSpace Disk is already full 9 Failed Crawl failed unexpectedly, might be worth retrying 10 BrowserCrashed Browser used to fetch pages has crashed 11 SignalInterrupted Crawl stopped gracefully in response to SIGINT signal 12 FailedLimit Limit on number of failed pages, configured with <code>--failOnFailedLimit</code>, has been reached 13 SignalInterruptedForce Crawl stopped forcefully in response to SIGTERM or repeated SIGINT signal 14 SizeLimit Limit on maximum WARC size, configured with <code>--sizeLimit</code>, has been reached 15 TimeLimit Limit on maximum crawl duration, configured with <code>--timeLimit</code>, has been reached 16 DiskUtilization Limit on maximum disk usage, configured with <code>--diskUtilization</code>, has been reached 17 Fatal A fatal (non-retryable) error occurred 21 ProxyError Unable to establish connection with proxy"},{"location":"user-guide/outputs/","title":"Outputs","text":"<p>This page covers the outputs created by Browsertrix Crawler for both crawls and browser profiles.</p>"},{"location":"user-guide/outputs/#crawl-outputs","title":"Crawl Outputs","text":"<p>Browsertrix Crawler crawl outputs are organized into collections, which can be found in the <code>/crawls/collections</code> directory. Each crawl creates a new collection by default, which can be named with the <code>-c</code> or <code>--collection</code> argument. If a collection name is not provided, Browsertrix Crawler will generate a unique collection name which includes the <code>crawl-</code> prefix followed by a timestamp of when the collection was created. 
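For example, naming the collection explicitly keeps the output in a predictable location, in this case <code>/crawls/collections/my-crawl</code> inside the container (a sketch; the collection name is hypothetical):</p> <pre><code>docker run -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --url https://example.com/ --collection my-crawl\n</code></pre> <p>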
Collections can be overwritten by specifying an existing collection name.</p> <p>Each collection is a directory which contains at minimum:</p> <ul> <li><code>archive/</code>: A directory containing gzipped WARC files with the web traffic recorded during crawling.</li> <li><code>logs/</code>: A directory containing one or more crawler log files in JSON-Lines format.</li> <li><code>pages/</code>: A directory containing one or more \"Page\" files in JSON-Lines format. At minimum, this directory will contain a <code>pages.jsonl</code> file with information about the seed URLs provided to the crawler. If additional pages were discovered and in scope during crawling, information about those non-seed pages is written to <code>extraPages.jsonl</code>. For more information about the contents of Page files, see the WACZ specification.</li> <li><code>warc-cdx/</code>: A directory containing one or more CDXJ index files created while recording traffic to WARC files. These index files are merged into the final CDXJ index in <code>indexes/</code> when the <code>--generateCDX</code> or <code>--generateWACZ</code> arguments are provided.</li> </ul> <p>Additionally, the collection may include:</p> <ul> <li>A WACZ file named after the collection, if the <code>--generateWACZ</code> argument is provided.</li> <li>An <code>indexes/</code> directory containing merged CDXJ index files for the crawl, if the <code>--generateCDX</code> or <code>--generateWACZ</code> arguments are provided. If the combined size of the CDXJ files in the <code>warc-cdx/</code> directory is over 50 KB, the resulting final CDXJ file will be gzipped.</li> <li>A single combined gzipped WARC file for the crawl, if the <code>--combineWARC</code> argument is provided.</li> <li>A <code>crawls/</code> directory including YAML files describing the crawl state, if the <code>--saveState</code> argument is provided with a value of \"always\", or if the crawl is interrupted and <code>--saveState</code> is not set to \"never\". These files can be used to restart a crawl from its saved state.</li> </ul>"},{"location":"user-guide/outputs/#profile-outputs","title":"Profile Outputs","text":"<p>Browser profiles that are saved by Browsertrix Crawler are written into the <code>crawls/profiles</code> directory.</p>"},{"location":"user-guide/proxies/","title":"Crawling with Proxies","text":"<p>Browsertrix Crawler supports crawling through HTTP and SOCKS5 proxies, including through a SOCKS5 proxy over an SSH tunnel.</p> <p>To specify a proxy, the <code>PROXY_SERVER</code> environment variable or <code>--proxyServer</code> CLI flag can be passed in. 
If both are provided, the <code>--proxyServer</code> CLI flag will take precedence.</p> <p>The proxy server can be specified as a <code>http://</code>, <code>socks5://</code>, or <code>ssh://</code> URL.</p>"},{"location":"user-guide/proxies/#http-proxies","title":"HTTP Proxies","text":"<p>To crawl through an HTTP proxy running at <code>http://path-to-proxy-host.example.com:9000</code>, run the crawler with:</p> <pre><code>docker run -v $PWD/crawls/:/crawls/ -e PROXY_SERVER=http://path-to-proxy-host.example.com:9000 webrecorder/browsertrix-crawler crawl --url https://example.com/\n</code></pre> <p>or</p> <pre><code>docker run -v $PWD/crawls/:/crawls/ webrecorder/browsertrix-crawler crawl --url https://example.com/ --proxyServer http://path-to-proxy-host.example.com:9000\n</code></pre> <p>The crawler does not support authentication for HTTP proxies, as that is not supported by the browser.</p> <p>(For backwards compatibility with crawler 0.x, the <code>PROXY_HOST</code> and <code>PROXY_PORT</code> environment variables can be used to specify an HTTP proxy instead of <code>PROXY_SERVER</code>, which takes precedence if provided.)</p>"},{"location":"user-guide/proxies/#socks5-proxies","title":"SOCKS5 Proxies","text":"<p>To use a SOCKS5 proxy running at <code>path-to-proxy-host.example.com:9001</code>, run the crawler with:</p> <pre><code>docker run -v $PWD/crawls/:/crawls/ -e PROXY_SERVER=socks5://path-to-proxy-host.example.com:9001 webrecorder/browsertrix-crawler crawl --url https://example.com/\n</code></pre> <p>The crawler does support password authentication for SOCKS5 proxies, which can be provided as <code>user:password</code> in the proxy URL:</p> <pre><code>docker run -v $PWD/crawls/:/crawls/ -e PROXY_SERVER=socks5://user:password@path-to-proxy-host.example.com:9001 webrecorder/browsertrix-crawler crawl --url https://example.com/\n</code></pre>"},{"location":"user-guide/proxies/#ssh-proxies","title":"SSH Proxies","text":"<p>Starting with 1.3.0, the crawler also supports crawling through a SOCKS5 proxy that is established over an SSH tunnel, via <code>ssh -D</code>. With this option, the crawler can SSH into a remote machine that has SSH and port forwarding enabled and crawl through that machine's network.</p> <p>To use this proxy, the private SSH key file must be provided via the <code>--sshProxyPrivateKeyFile</code> CLI flag.</p> <p>The private key and public host key should be mounted as volumes into a path in the container, as shown below.</p> <p>For example, to connect via SSH to host <code>path-to-ssh-host.example.com</code> as user <code>user</code> with private key stored in <code>./my-proxy-private-key</code>, run:</p> <pre><code>docker run -v $PWD/crawls/:/crawls/ -v $PWD/my-proxy-private-key:/tmp/private-key webrecorder/browsertrix-crawler crawl --url https://httpbin.org/ip --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key\n</code></pre> <p>To also provide the host public key (eg. 
<code>./known_hosts</code> file) for additional verification, run:</p> <pre><code>docker run -v $PWD/crawls/:/crawls/ -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler crawl --url https://httpbin.org/ip --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts\n</code></pre> <p>The host key will only be checked if provided in a file via <code>--sshProxyKnownHostsFile</code>.</p> <p>A custom SSH port can be provided with <code>--proxyServer ssh://user@path-to-ssh-host.example.com:2222</code>, otherwise the connection will be attempted via the default SSH port (port 22).</p> <p>The SSH connection establishes a tunnel on a local port in the container (9722) which will forward inbound/outbound traffic through the remote proxy. The <code>autossh</code> utility is used to automatically restart the SSH connection, if needed.</p> <p>Only key-based authentication is supported for SSH proxies for now.</p>"},{"location":"user-guide/proxies/#browser-profiles","title":"Browser Profiles","text":"<p>The above proxy settings also apply to Browser Profile Creation, and browser profiles can also be created using proxies, for example:</p> <pre><code>docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles -v $PWD/my-proxy-private-key:/tmp/private-key -v $PWD/known_hosts:/tmp/known_hosts webrecorder/browsertrix-crawler create-login-profile --url https://example.com/ --proxyServer ssh://user@path-to-ssh-host.example.com --sshProxyPrivateKeyFile /tmp/private-key --sshProxyKnownHostsFile /tmp/known_hosts\n</code></pre>"},{"location":"user-guide/proxies/#host-specific-proxies","title":"Host-Specific Proxies","text":"<p>With the 1.7.0 release, the crawler also supports running with multiple proxies, defined in a separate proxy YAML config file. The file contains a <code>matchHosts</code> section, which maps host regexes to named proxies.</p> <p>For example, the following YAML file can be passed via the <code>--proxyServerConfig</code> option:</p> <pre><code>matchHosts:\n # load all URLs from example.com through 'example-1-proxy'\n example.com/.*: example-1-proxy\n\n # load all URLs from https://my-social.example.com/.*/posts/ through\n # a different proxy\n https://my-social.example.com/.*/posts/: social-proxy\n\n # optional default proxy\n \"\": default-proxy\n\nproxies:\n # SOCKS5 proxy just needs a URL\n example-1-proxy: socks5://username:password@my-socks-5-proxy.example.com\n\n # SSH proxy also should have at least a 'privateKeyFile'\n social-proxy:\n url: ssh://user@my-social-proxy.example.com\n privateKeyFile: /proxies/social-proxy-private-key\n # optional\n publicHostsFile: /proxies/social-proxy-public-hosts\n\n default-proxy:\n url: ssh://user@my-social-proxy.example.com\n privateKeyFile: /proxies/default-proxy-private-key\n</code></pre> <p>If the above config is stored in <code>./proxies/proxyConfig.yaml</code> along with the SSH private keys and known public hosts files, the crawler can be started with:</p> <pre><code>docker run -v $PWD/crawls:/crawls -v $PWD/proxies:/proxies -it webrecorder/browsertrix-crawler crawl --url https://example.com/ --proxyServerConfig /proxies/proxyConfig.yaml\n</code></pre> <p>Note that if SSH proxies are provided, an SSH tunnel must be opened for each one before the crawl starts. The crawl will not start if any of the SSH proxy connections fail, even if a host-specific proxy is not actually used. 
SOCKS5 and HTTP proxy connections are attempted only on first use.</p> <p>The same <code>--proxyServerConfig</code> option can also be used in browser profile creation with the <code>create-login-profile</code> command in the same way.</p>"},{"location":"user-guide/proxies/#proxy-precedence","title":"Proxy Precedence","text":"<p>If both <code>--proxyServerConfig</code> and <code>--proxyServer</code>/<code>PROXY_SERVER</code> env var are specified, the <code>--proxyServerConfig</code> option takes precedence on matching hosts. To have the single <code>--proxyServer</code> option always take precedence instead, pass the <code>--proxyServerPreferSingleProxy</code> option.</p>"},{"location":"user-guide/qa/","title":"Quality Assurance","text":""},{"location":"user-guide/qa/#overview","title":"Overview","text":"<p>Browsertrix Crawler can analyze an existing crawl to compare what the browser encountered on a website during crawling against the replay of the crawl WACZ. The WACZ produced by this analysis run includes additional comparison data (stored as WARC <code>resource</code> records) for the pages found during crawling against their replay in ReplayWeb.page. This works along several dimensions, including screenshot, extracted text, and page resource comparisons.</p> <p>Note</p> <p>QA features described on this page are available in Browsertrix Crawler releases 1.1.0 and later.</p>"},{"location":"user-guide/qa/#getting-started","title":"Getting started","text":"<p>To be able to run QA on a crawl, you must first have an existing crawl, for example:</p> <pre><code>docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl --url https://webrecorder.net/ --collection example-crawl --text to-warc --screenshot view --generateWACZ\n</code></pre> <p>Note that this crawl must be run with <code>--generateWACZ</code> flag as QA requires a WACZ to work with, and also ideally the <code>--text to-warc</code> and <code>--screenshot view</code> flags as well (see below for more details on comparison dimensions).</p> <p>To analyze this crawl, call Browsertrix Crawler with the <code>qa</code> entrypoint, passing the original crawl WACZ as the <code>qaSource</code>:</p> <pre><code>docker run -v $PWD/crawls/:/crawls/ -it webrecorder/browsertrix-crawler qa --qaSource /crawls/collections/example-crawl/example-crawl.wacz --collection example-qa --generateWACZ\n</code></pre> <p>The <code>qaSource</code> can be: - A local WACZ file path or a URL - A single WACZ or a JSON file containing a list of WACZ files in the <code>resources</code> json (Multi-WACZ)</p> <p>This assumes an existing crawl that was created in the <code>example-crawl</code> collection.</p> <p>A new WACZ for the analysis run will be created in the resulting <code>example-qa</code> collection.</p> <p>By default, the analysis crawl will visit all of the pages (as read from the source WACZ file(s)), however pages can further be limited by adding <code>--include</code> and <code>--exclude</code> regexes. 
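For example, to analyze only a subset of the crawled pages (a sketch; the include pattern shown is hypothetical):</p> <pre><code>docker run -v $PWD/crawls/:/crawls/ -it webrecorder/browsertrix-crawler qa --qaSource /crawls/collections/example-crawl/example-crawl.wacz --collection example-qa --generateWACZ --include \"webrecorder\\.net/blog/.*\"\n</code></pre> <p>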
The <code>--limit</code> flag will also limit how many pages are tested.</p> <p>The analysis crawl will skip over any non-HTML pages such as PDFs which can be relied upon to be bit-for-bit identical as long as the resource was fully fetched.</p>"},{"location":"user-guide/qa/#comparison-dimensions","title":"Comparison Dimensions","text":""},{"location":"user-guide/qa/#screenshot-match","title":"Screenshot Match","text":"<p>One way to compare crawl and replay is to compare the screenshots of a page while it is being crawled with when it is being replayed. The initial viewport screenshots of each page from the crawl and replay are compared on the basis of pixel value similarity. This results in a score between 0 and 1.0 representing the percentage match between the crawl and replay screenshots for each page. The screenshots are stored in <code>urn:view:<url></code> WARC resource records.</p> <p>To enable comparison on this dimension, the crawl must be run with at least the <code>--screenshot view</code> option. (Additional screenshot options can be added as well).</p>"},{"location":"user-guide/qa/#text-match","title":"Text Match","text":"<p>Another way to compare the crawl and replay results is to use the text extracted from the HTML. This is done by comparing the extracted text from crawl and replay on the basis of Levenshtein distance. This results in a score between 0 and 1.0 representing the percentage match between the crawl and replay text for each page. The extracted text is stored in <code>urn:text:<url></code> WARC resource records.</p> <p>To enable comparison on this dimension, the original crawl must be run with at least the <code>--text to-warc</code> option. (Additional text options can be added as well)</p>"},{"location":"user-guide/qa/#resources-and-page-info","title":"Resources and Page Info","text":"<p>The <code>pageinfo</code> records produced by the crawl and analysis runs include a JSON document containing information about the resources loaded on each page, such as CSS stylesheets, JavaScript scripts, fonts, images, and videos. The URL, status code, MIME type, and resource type of each resource is saved in the <code>pageinfo</code> record for each page.</p> <p>Since <code>pageinfo</code> records are produced for all crawls, this data is always available.</p>"},{"location":"user-guide/qa/#comparison-data","title":"Comparison Data","text":"<p>Comparison data is also added to the QA crawl's <code>pageinfo</code> records. The comparison data may look as follows:</p> <pre><code>\"comparison\": {\n \"screenshotMatch\": 0.95,\n \"textMatch\": 0.9,\n \"resourceCounts\": {\n \"crawlGood\": 10,\n \"crawlBad\": 0,\n \"replayGood\": 9,\n \"replayBad\": 1\n }\n}\n</code></pre> <p>This data indicates that:</p> <ul> <li>When comparing <code>urn:view:<url></code> records for crawl and replay, the screenshots are 95% similar.</li> <li>When comparing <code>urn:text:<url></code> records from crawl and replay WACZs, the text is 90% similar.</li> <li>When comparing <code>urn:pageinfo:<url></code> resource entries from crawl and replay, the crawl record had 10 good responses (2xx/3xx status code) and 0 bad responses (4xx/5xx status code), while replay had 9 good and 1 bad.</li> </ul>"},{"location":"user-guide/yaml-config/","title":"YAML Crawl Config","text":"<p>Browsertix Crawler supports the use of a YAML file to set parameters for a crawl. 
This can be used by passing a valid YAML file to the <code>--config</code> option.</p> <p>The YAML file can contain the same parameters as the command-line arguments. If a parameter is set on the command-line and in the YAML file, the value from the command-line will be used. For example, the following starts a crawl with the config in <code>crawl-config.yaml</code>:</p> <pre><code>docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yaml\n</code></pre> <p>The config can also be passed via stdin, which can simplify the command. Note that this requires running <code>docker run</code> with the <code>-i</code> flag. To read the config from stdin, pass <code>--config stdin</code>:</p> <pre><code>cat ./crawl-config.yaml | docker run -i -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --config stdin\n</code></pre> <p>An example config file (eg. crawl-config.yaml) might contain:</p> <pre><code>seeds:\n - https://example.com/\n - https://www.iana.org/\n\ncombineWARC: true\n</code></pre> <p>The list of seeds can be loaded via an external file by specifying the filename via the <code>seedFile</code> config or command-line option.</p>"},{"location":"user-guide/yaml-config/#seed-file","title":"Seed File","text":"<p>The URL seed file should be a text file formatted so that each line of the file is a URL string. An example file is available in the Github repository's fixture folder as urlSeedFile.txt.</p> <p>The seed file must be passed as a volume to the docker container. Your Docker command should look similar to the following:</p> <pre><code>docker run -v $PWD/seedFile.txt:/app/seedFile.txt -v $PWD/crawls:/crawls/ webrecorder/browsertrix-crawler crawl --seedFile /app/seedFile.txt\n</code></pre>"},{"location":"user-guide/yaml-config/#per-seed-settings","title":"Per-Seed Settings","text":"<p>Certain settings such as scope type, scope includes and excludes, and depth can also be configured per-seed directly in the YAML file, for example:</p> <pre><code>seeds:\n - url: https://webrecorder.net/\n depth: 1\n scopeType: \"prefix\"\n</code></pre>"},{"location":"user-guide/yaml-config/#http-auth","title":"HTTP Auth","text":"<p>HTTP basic auth credentials are written to the archive.</p> <p>We recommend exercising caution and only archiving with dedicated archival accounts, changing your password or deleting the account when finished.</p> <p>Browsertrix Crawler supports HTTP Basic Auth, which can be provided on a per-seed basis as part of the URL, for example: <code>--url https://username:password@example.com/</code>.</p> <p>Alternatively, credentials can be added to the <code>auth</code> field for each seed:</p> <pre><code>seeds:\n - url: https://example.com/\n auth: username:password\n</code></pre>"}]}