browsertrix-crawler/docs/docs
Tessa Walsh 1d15a155f2
Add option to respect robots.txt disallows (#888)
Fixes #631 
- Adds a --robots flag which enables checking robots.txt on each host for each page, before the page is queued for further crawling.
- Supports a --robotsAgent flag which configures the user agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'.
- Robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by URL in Redis using an LRU, retaining the last 100 robots.txt entries, each up to 100K.
- Non-200 responses are treated as an empty robots.txt, and an empty robots.txt is treated as 'allow all'.
- Multiple requests for the same robots.txt are batched into a single fetch, waiting up to 10 seconds per fetch.
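
The allow/disallow semantics above can be sketched as follows. This is a minimal illustration, not the crawler's actual implementation (which delegates parsing to the robots-parser library); the function names are hypothetical, and the user-agent matching is simplified to a single-line check rather than full RFC 9309 group handling.

```typescript
// Hypothetical sketch of the robots.txt check described above.
// An empty body (e.g. from a non-200 response) means "allow all".
type RobotsRules = { disallow: string[] };

// parseRobots: collect Disallow rules from groups matching `agent` or '*'.
// Simplified: each User-agent line toggles whether following rules apply.
function parseRobots(body: string, agent: string): RobotsRules {
  const disallow: string[] = [];
  let applies = false;
  for (const raw of body.split("\n")) {
    const line = raw.split("#")[0].trim(); // strip comments
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    switch (field.trim().toLowerCase()) {
      case "user-agent":
        applies =
          value === "*" ||
          agent.toLowerCase().startsWith(value.toLowerCase());
        break;
      case "disallow":
        if (applies && value) disallow.push(value);
        break;
    }
  }
  return { disallow };
}

// isAllowed: a page is allowed unless a matching Disallow prefix applies.
function isAllowed(body: string, agent: string, path: string): boolean {
  const { disallow } = parseRobots(body, agent);
  return !disallow.some((prefix) => path.startsWith(prefix));
}

const robots = "User-agent: *\nDisallow: /private/\n";
console.log(isAllowed(robots, "Browsertrix/1.x", "/private/page")); // false
console.log(isAllowed(robots, "Browsertrix/1.x", "/public/page")); // true
console.log(isAllowed("", "Browsertrix/1.x", "/private/page")); // true (empty = allow all)
```

The real implementation also handles Allow rules and longest-match precedence, which robots-parser provides; the sketch only shows the disallow-prefix and empty-body behavior called out in the bullets above.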

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-11-26 19:00:06 -08:00
assets      | Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494) | 2024-03-16 14:59:32 -07:00
develop     | Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494) | 2024-03-16 14:59:32 -07:00
overrides   | Add MKDocs documentation site for Browsertrix Crawler 1.0.0 (#494) | 2024-03-16 14:59:32 -07:00
stylesheets | docs: Update header font (#785) | 2025-03-05 14:21:00 -08:00
user-guide  | Add option to respect robots.txt disallows (#888) | 2025-11-26 19:00:06 -08:00
CNAME       | CNAME: keep CNAME in docs/docs for mkdocs | 2024-03-16 15:24:54 -07:00
index.md    | Add crawler QA docs (#551) | 2024-04-18 16:18:22 -04:00