\section{\module{robotparser} ---
         Parser for robots.txt}
\declaremodule{standard}{robotparser}
\modulesynopsis{Accepts as input a list of lines or a URL that refers to a
                \file{robots.txt} file, parses the file, then builds a
                set of rules from that list, and answers questions
                about the fetchability of other URLs.}
\sectionauthor{Skip Montanaro}{skip@mojam.com}

\index{WWW}
\index{World-Wide Web}
\index{URL}
\index{robots.txt}

This module provides a single class, \class{RobotFileParser}, which answers
questions about whether or not a particular user agent can fetch a URL on
the Web site that published the \file{robots.txt} file.  For more details
on the structure of \file{robots.txt} files, see
\url{http://info.webcrawler.com/mak/projects/robots/norobots.html}.
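
As a brief illustration of the format described there, a \file{robots.txt}
file consists of one or more records, each naming a user agent and the URL
prefixes it may not fetch; the robot name and paths below are hypothetical:

\begin{verbatim}
# Keep the robot named "figtree" out entirely, and keep
# all other robots out of /cgi-bin/ and /tmp/.
User-agent: figtree
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
\end{verbatim}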

\begin{classdesc}{RobotFileParser}{}

This class provides a set of methods to read, parse, and answer questions
about a single \file{robots.txt} file.

\begin{methoddesc}{set_url}{url}
Sets the URL referring to a \file{robots.txt} file.
\end{methoddesc}

\begin{methoddesc}{read}{}
Reads the \file{robots.txt} URL and feeds it to the parser.
\end{methoddesc}

\begin{methoddesc}{parse}{lines}
Parses the \var{lines} argument, a list of lines from a
\file{robots.txt} file.
\end{methoddesc}

\begin{methoddesc}{can_fetch}{useragent, url}
Returns true if the \var{useragent} is allowed to fetch the \var{url}
according to the rules contained in the parsed \file{robots.txt} file.
\end{methoddesc}

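A spider that fetches \file{robots.txt} files itself can hand the file's
contents to \method{parse()} as a list of lines and then query
\method{can_fetch()} directly, without calling \method{read()}.  The rules
below are a hypothetical example; the \code{try}/\code{except} import also
covers environments where the module is available as
\module{urllib.robotparser}.

\begin{verbatim}
try:
    import robotparser                 # module name used in this document
except ImportError:
    from urllib import robotparser     # alternate location of the module

# A hypothetical robots.txt, already split into a list of lines:
lines = [
    "User-agent: figtree",
    "Disallow: /",
    "",
    "User-agent: *",
    "Disallow: /cgi-bin/",
]

rp = robotparser.RobotFileParser()
rp.parse(lines)

print(rp.can_fetch("*", "/index.html"))        # true: not under /cgi-bin/
print(rp.can_fetch("*", "/cgi-bin/search"))    # false: disallowed for all agents
print(rp.can_fetch("figtree", "/index.html"))  # false: figtree is barred entirely
\end{verbatim}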
\begin{methoddesc}{mtime}{}
Returns the time the \file{robots.txt} file was last fetched.  This is
useful for long-running web spiders that need to check for new
\file{robots.txt} files periodically.
\end{methoddesc}

\begin{methoddesc}{modified}{}
Sets the time the \file{robots.txt} file was last fetched to the current
time.
\end{methoddesc}

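Taken together, \method{mtime()} and \method{modified()} support a simple
refresh policy for such spiders.  The sketch below is illustrative only:
the one-hour maximum age and the \code{maybe_refresh()} helper are
hypothetical, and the parser is assumed to have been given its URL with
\method{set_url()} beforehand.

\begin{verbatim}
import time

MAX_AGE = 3600   # hypothetical policy: refresh copies older than one hour

def maybe_refresh(rp):
    # Re-fetch and re-parse the robots.txt file when the copy held by
    # the RobotFileParser instance rp is stale (or was never fetched,
    # in which case mtime() returns its initial value of 0).
    if time.time() - rp.mtime() > MAX_AGE:
        rp.read()        # re-read the file from rp's URL
        rp.modified()    # record the fetch time for the next check
\end{verbatim}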
\end{classdesc}

The following example demonstrates basic use of the
\class{RobotFileParser} class.

\begin{verbatim}
>>> import robotparser
>>> rp = robotparser.RobotFileParser()
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
>>> rp.read()
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
0
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
1
\end{verbatim}