Mirror of https://github.com/webrecorder/browsertrix-crawler.git, synced 2025-10-19 14:33:17 +00:00
README tweaks

This commit is contained in:
parent a875aa90d3
commit e2bce2f30d

1 changed file with 14 additions and 5 deletions
README.md (19 changed lines)
@@ -1,14 +1,23 @@
 # Browsertrix Crawler
 
-Browsertrix Crwaler is a simplified browser-based high-fidelity crawling system, designed to run a single crawl in a single Docker container.
-
-It is designed as part of a more streamlined replacement of the original [Browsertrix](https://github.com/webrecorder/browsertrix).
+Browsertrix Crawler is a simplified browser-based high-fidelity crawling system, designed to run a single crawl in a single Docker container. It is designed as part of a more streamlined replacement of the original [Browsertrix](https://github.com/webrecorder/browsertrix).
+
+The original Browsertrix may be too complex for situations where a single crawl is needed, and requires managing multiple containers.
+
+This is an attempt to refactor Browsertrix into a core crawling system, driven by [puppeteer-cluster](https://github.com/thomasdondorf/puppeteer-cluster)
+and [puppeteer](https://github.com/puppeteer/puppeteer)
 
 ## Features
 
 Thus far, Browsertrix Crawler supports:
 
 - Single-container, browser based crawling with multiple headless/headful browsers
 - Support for some behaviors: autoplay to capture video/audio, scrolling
 - Support for direct capture for non-HTML resources
 - Extensible driver script for customizing behavior per crawl or page via Puppeteer
+
+## Architecture
+
+The Docker container provided here packages up several components used in Browsertrix.
+
+The system uses:
@@ -17,9 +26,9 @@ The system uses:
 - `pywb` - in recording mode for capturing the content
 
-The crawl produces a single pywb collection, at `/output/collections/<collection name>`.
+The crawl produces a single pywb collection, at `/output/collections/<collection name>` in the Docker container.
 
-The collection can be mounted as a Docker volume and then accessed in pywb.
+To access the contents of the crawl, the `/output` directory should be mounted to a volume (default in the Docker Compose setup).
 
 
 ## Crawling Parameters
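The diff above says the crawl writes a pywb collection to `/output/collections/<collection name>` inside the container, and that `/output` should be mounted to a volume to reach it from the host. A minimal sketch of that mount, assuming a host directory `./crawls` and the image name `webrecorder/browsertrix-crawler` (both illustrative assumptions, not taken from this commit):

```shell
#!/bin/sh
# Sketch: bind-mount a host directory at /output so the pywb collection
# written to /output/collections/<collection name> survives the container.
# The image name and host path are assumptions for illustration.
CRAWLS_DIR="$PWD/crawls"
mkdir -p "$CRAWLS_DIR"

# Printed as a dry run rather than executed, since the crawler's CLI flags
# for this early version are not shown in the diff.
echo "docker run -v $CRAWLS_DIR:/output webrecorder/browsertrix-crawler"
```

After a crawl, the resulting `collections/` directory on the host could then be served with pywb, as the README describes.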