:mod:`urllib.robotparser` ---  Parser for robots.txt
====================================================

.. module:: urllib.robotparser
   :synopsis: Load a robots.txt file and answer questions about
              fetchability of other URLs.

.. sectionauthor:: Skip Montanaro <skip@pobox.com>

**Source code:** :source:`Lib/urllib/robotparser.py`

.. index::
   single: WWW
   single: World Wide Web
   single: URL
   single: robots.txt

--------------

This module provides a single class, :class:`RobotFileParser`, which answers
questions about whether or not a particular user agent can fetch a URL on the
web site that published the :file:`robots.txt` file.  For more details on the
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.
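
For instance, a minimal :file:`robots.txt` file that blocks one directory for
all crawlers might look like this (an illustrative example, not taken from any
real site)::

   User-agent: *
   Disallow: /private/
   Crawl-delay: 10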


.. class:: RobotFileParser(url='')

   This class provides methods to read, parse and answer questions about the
   :file:`robots.txt` file at *url*.

   .. method:: set_url(url)

      Sets the URL referring to a :file:`robots.txt` file.

   .. method:: read()

      Reads the :file:`robots.txt` URL and feeds it to the parser.

   .. method:: parse(lines)

      Parses the *lines* argument, a list of lines from a :file:`robots.txt`
      file.
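
      For example, rules can be supplied directly instead of being fetched
      with :meth:`read` (a sketch; the rules shown are made up)::

         >>> import urllib.robotparser
         >>> rp = urllib.robotparser.RobotFileParser()
         >>> rp.parse(["User-agent: *", "Disallow: /private/"])  # made-up rules
         >>> rp.can_fetch("*", "http://example.com/private/page.html")
         False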

   .. method:: can_fetch(useragent, url)

      Returns ``True`` if the *useragent* is allowed to fetch the *url*
      according to the rules contained in the parsed :file:`robots.txt`
      file.

   .. method:: mtime()

      Returns the time the ``robots.txt`` file was last fetched.  This is
      useful for long-running web spiders that need to check for new
      ``robots.txt`` files periodically.
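
      For instance, a long-running spider might re-fetch stale rules (a
      sketch, given a parser instance ``rp``; the one-hour threshold is an
      arbitrary choice)::

         >>> import time
         >>> if time.time() - rp.mtime() > 3600:  # 3600 s is an arbitrary cutoff
         ...     rp.read()   # re-fetch and re-parse the rules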

   .. method:: modified()

      Sets the time the ``robots.txt`` file was last fetched to the current
      time.

   .. method:: crawl_delay(useragent)

      Returns the value of the ``Crawl-delay`` parameter from ``robots.txt``
      for the *useragent* in question.  If there is no such parameter, or it
      doesn't apply to the *useragent* specified, or the ``robots.txt`` entry
      for this parameter has invalid syntax, return ``None``.
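
      A polite crawler might sleep for this many seconds between requests (a
      sketch, given a parser instance ``rp``; ``ExampleBot`` is a placeholder
      user agent)::

         >>> import time
         >>> delay = rp.crawl_delay("ExampleBot")  # "ExampleBot" is a placeholder
         >>> if delay is not None:                 # None means no delay was given
         ...     time.sleep(delay)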

      .. versionadded:: 3.6

   .. method:: request_rate(useragent)

      Returns the contents of the ``Request-rate`` parameter from
      ``robots.txt`` as a :term:`named tuple` ``RequestRate(requests, seconds)``.
      If there is no such parameter, or it doesn't apply to the *useragent*
      specified, or the ``robots.txt`` entry for this parameter has invalid
      syntax, return ``None``.
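
      For example, a rate of ``RequestRate(3, 20)`` allows three requests
      every twenty seconds, which a crawler might turn into a minimum pause
      between requests (a sketch, given a parser instance ``rp``)::

         >>> rrate = rp.request_rate("ExampleBot")  # "ExampleBot" is a placeholder
         >>> if rrate is not None:
         ...     pause = rrate.seconds / rrate.requests  # seconds per request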

      .. versionadded:: 3.6

   .. method:: site_maps()

      Returns the contents of the ``Sitemap`` parameter from
      ``robots.txt`` in the form of a :func:`list`. If there is no such
      parameter or the ``robots.txt`` entry for this parameter has
      invalid syntax, return ``None``.
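
      For example (a sketch; the returned URLs depend entirely on the site's
      ``robots.txt``)::

         >>> rp.site_maps()   # doctest: +SKIP
         ['http://example.com/sitemap.xml']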

      .. versionadded:: 3.8


The following example demonstrates basic use of the :class:`RobotFileParser`
class::

   >>> import urllib.robotparser
   >>> rp = urllib.robotparser.RobotFileParser()
   >>> rp.set_url("http://www.musi-cal.com/robots.txt")
   >>> rp.read()
   >>> rrate = rp.request_rate("*")
   >>> rrate.requests
   3
   >>> rrate.seconds
   20
   >>> rp.crawl_delay("*")
   6
   >>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
   False
   >>> rp.can_fetch("*", "http://www.musi-cal.com/")
   True