Stowage/browsertrix-crawler

mirror of https://github.com/webrecorder/browsertrix-crawler.git synced 2025-12-07 21:59:48 +00:00

Commit graph

Author

SHA1

Message

Date

Tessa Walsh

ff5619e624

Rename robots flag to --useRobots, keep --robots as alias (#932)

Follow-up to
https://github.com/webrecorder/browsertrix-crawler/issues/631

Based on feedback from
https://github.com/webrecorder/browsertrix/pull/3029

Renaming `--robots` to `--useRobots` will allow us to keep the
Browsertrix backend API more consistent with similar flags like
`--useSitemap`. `--robots` is kept as a shorthand alias.
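
As an illustration of the aliasing described above, here is a minimal sketch in which both spellings resolve to the same canonical option. The `FLAG_ALIASES` table and `parseBooleanFlags` helper are hypothetical names invented for this example; they are not taken from the crawler's argument parser.

```typescript
// Hypothetical alias table: alias name -> canonical flag name.
const FLAG_ALIASES: Record<string, string> = { robots: "useRobots" };

// Collects bare boolean flags (e.g. "--useRobots") into an options object,
// rewriting any known alias to its canonical name first.
function parseBooleanFlags(argv: string[]): Record<string, boolean> {
  const flags: Record<string, boolean> = {};
  for (const arg of argv) {
    if (!arg.startsWith("--")) continue; // ignore positional arguments
    const name = arg.slice(2);
    flags[FLAG_ALIASES[name] ?? name] = true;
  }
  return flags;
}
```

With this arrangement, code that consumes the options only needs to check `useRobots`, whichever spelling was passed on the command line.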

2025-12-02 15:55:25 -08:00

Tessa Walsh

1d15a155f2

Add option to respect robots.txt disallows (#888)

Fixes #631 
- Adds a --robots flag that enables checking the robots.txt of each page's host before the page is queued for further crawling.
- Supports --robotsAgent flag which configures agent to check in robots.txt, in addition to '*'. Defaults to 'Browsertrix/1.x'
- Robots.txt bodies are parsed and checked for page allow/disallow status
using the https://github.com/samclarke/robots-parser library, which is
the most active and well-maintained implementation I could find with
TypeScript types.
- Fetched robots.txt bodies are cached by their URL in Redis using an LRU, retaining the last 100 robots entries, each up to 100K
- Non-200 responses are treated as empty robots, and empty robots are treated as 'allow all'
- Multiple requests to the same robots.txt are batched to perform only one fetch, waiting up to 10 seconds per fetch.
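
The caching and batching behaviour listed above can be sketched roughly as follows. This is not the code from the commit: it swaps the Redis-backed LRU for an in-memory `Map`, and `RobotsCache`, `Fetcher`, `FetchResult`, and `withTimeout` are hypothetical names. It also assumes that "100K" means about 100,000 characters and that a timed-out or failed fetch is treated like an empty robots.txt; the commit message does not specify either detail.

```typescript
type FetchResult = { status: number; body: string };
type Fetcher = (url: string) => Promise<FetchResult>;

const MAX_ENTRIES = 100; // keep the 100 most recently used robots.txt bodies
const MAX_BODY_SIZE = 100_000; // assumed meaning of "100K": characters per body
const FETCH_TIMEOUT_MS = 10_000; // wait at most 10 seconds per fetch

// Rejects if the wrapped promise has not settled within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

class RobotsCache {
  private fetcher: Fetcher;
  private timeoutMs: number;
  private cache = new Map<string, string>(); // insertion order doubles as LRU order
  private pending = new Map<string, Promise<string>>(); // in-flight fetches by URL

  constructor(fetcher: Fetcher, timeoutMs: number = FETCH_TIMEOUT_MS) {
    this.fetcher = fetcher;
    this.timeoutMs = timeoutMs;
  }

  // Returns the robots.txt body for a URL; "" means "allow all".
  async getBody(robotsUrl: string): Promise<string> {
    const cached = this.cache.get(robotsUrl);
    if (cached !== undefined) {
      this.store(robotsUrl, cached); // re-insert to mark as most recently used
      return cached;
    }
    // Batch concurrent requests for the same URL onto one in-flight fetch.
    let inflight = this.pending.get(robotsUrl);
    if (!inflight) {
      inflight = this.fetchBody(robotsUrl).finally(() => {
        this.pending.delete(robotsUrl);
      });
      this.pending.set(robotsUrl, inflight);
    }
    return inflight;
  }

  private async fetchBody(robotsUrl: string): Promise<string> {
    let body = "";
    try {
      const res = await withTimeout(this.fetcher(robotsUrl), this.timeoutMs);
      // Non-200 responses are treated as an empty robots.txt.
      if (res.status === 200) body = res.body.slice(0, MAX_BODY_SIZE);
    } catch (e) {
      body = ""; // assumption: timeouts and network errors also mean "allow all"
    }
    this.store(robotsUrl, body);
    return body;
  }

  private store(url: string, body: string): void {
    this.cache.delete(url);
    this.cache.set(url, body);
    if (this.cache.size > MAX_ENTRIES) {
      // The first key in the Map is the least recently used entry.
      const oldest = this.cache.keys().next().value;
      if (oldest !== undefined) this.cache.delete(oldest);
    }
  }
}
```

The returned body would then be handed to a robots.txt parser (the commit uses the robots-parser library) to decide whether a given page URL is allowed for the configured agent.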

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>

2025-11-26 19:00:06 -08:00

2 commits