.. _urllib-howto:

***********************************************************
  HOWTO Fetch Internet Resources Using The urllib Package
***********************************************************

:Author: `Michael Foord <https://agileabstractions.com/>`_


Introduction
============

.. sidebar:: Related Articles

    You may also find useful the following article on fetching web resources
    with Python:

    * `Basic Authentication <https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`_

        A tutorial on *Basic Authentication*, with examples in Python.

**urllib.request** is a Python module for fetching URLs
(Uniform Resource Locators). It offers a very simple interface, in the form of
the *urlopen* function. This is capable of fetching URLs using a variety of
different protocols. It also offers a slightly more complex interface for
handling common situations - like basic authentication, cookies, proxies and so
on. These are provided by objects called handlers and openers.

urllib.request supports fetching URLs for many "URL schemes" (identified by the string
before the ``":"`` in the URL - for example ``"ftp"`` is the URL scheme of
``"ftp://python.org/"``) using their associated network protocols (e.g. FTP, HTTP).
This tutorial focuses on the most common case, HTTP.

For straightforward situations *urlopen* is very easy to use. But as soon as you
encounter errors or non-trivial cases when opening HTTP URLs, you will need some
understanding of the HyperText Transfer Protocol. The most comprehensive and
authoritative reference to HTTP is :rfc:`2616`. This is a technical document and
not intended to be easy to read. This HOWTO aims to illustrate using *urllib*,
with enough detail about HTTP to help you through. It is not intended to replace
the :mod:`urllib.request` docs, but is supplementary to them.


Fetching URLs
=============

The simplest way to use urllib.request is as follows::

    import urllib.request
    with urllib.request.urlopen('http://python.org/') as response:
       html = response.read()

If you wish to retrieve a resource via URL and store it in a temporary
location, you can do so via the :func:`shutil.copyfileobj` and
:func:`tempfile.NamedTemporaryFile` functions::

    import shutil
    import tempfile
    import urllib.request

    with urllib.request.urlopen('http://python.org/') as response:
        with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
            shutil.copyfileobj(response, tmp_file)

    with open(tmp_file.name) as html:
        pass

Many uses of urllib will be that simple (note that instead of an 'http:' URL we
could have used a URL starting with 'ftp:', 'file:', etc.).  However, it's the
purpose of this tutorial to explain the more complicated cases, concentrating on
HTTP.

HTTP is based on requests and responses - the client makes requests and servers
send responses. urllib.request mirrors this with a ``Request`` object which represents
the HTTP request you are making. In its simplest form you create a Request
object that specifies the URL you want to fetch. Calling ``urlopen`` with this
Request object returns a response object for the URL requested. This response is
a file-like object, which means you can for example call ``.read()`` on the
response::

    import urllib.request

    req = urllib.request.Request('http://python.org/')
    with urllib.request.urlopen(req) as response:
       the_page = response.read()

Note that urllib.request makes use of the same Request interface to handle all URL
schemes.  For example, you can make an FTP request like so::

    req = urllib.request.Request('ftp://example.com/')

In the case of HTTP, there are two extra things that Request objects allow you
to do: First, you can pass data to be sent to the server.  Second, you can pass
extra information ("metadata") *about* the data or about the request itself, to
the server - this information is sent as HTTP "headers".  Let's look at each of
these in turn.

Data
----

Sometimes you want to send data to a URL (often the URL will refer to a CGI
(Common Gateway Interface) script or other web application). With HTTP,
this is often done using what's known as a **POST** request. This is often what
your browser does when you submit an HTML form that you filled in on the web. Not
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
to your own application. In the common case of HTML forms, the data needs to be
encoded in a standard way, and then passed to the Request object as the ``data``
argument. The encoding is done using a function from the :mod:`urllib.parse`
library. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name' : 'Michael Foord',
              'location' : 'Northampton',
              'language' : 'Python' }

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii') # data should be bytes
    req = urllib.request.Request(url, data)
    with urllib.request.urlopen(req) as response:
       the_page = response.read()

Note that other encodings are sometimes required (e.g. for file upload from HTML
forms - see `HTML Specification, Form Submission
<https://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
details).

If you do not pass the ``data`` argument, urllib uses a **GET** request. One
way in which GET and POST requests differ is that POST requests often have
"side-effects": they change the state of the system in some way (for example by
placing an order with the website for a hundredweight of tinned spam to be
delivered to your door).  Though the HTTP standard makes it clear that POSTs are
intended to *always* cause side-effects, and GET requests *never* to cause
side-effects, nothing prevents a GET request from having side-effects, nor a
POST request from having no side-effects. Data can also be passed in an HTTP
GET request by encoding it in the URL itself.

This is done as follows::

    >>> import urllib.request
    >>> import urllib.parse
    >>> data = {}
    >>> data['name'] = 'Somebody Here'
    >>> data['location'] = 'Northampton'
    >>> data['language'] = 'Python'
    >>> url_values = urllib.parse.urlencode(data)
    >>> print(url_values)  # The order may differ from below.  #doctest: +SKIP
    name=Somebody+Here&language=Python&location=Northampton
    >>> url = 'http://www.example.com/example.cgi'
    >>> full_url = url + '?' + url_values
    >>> data = urllib.request.urlopen(full_url)

Notice that the full URL is created by adding a ``?`` to the URL, followed by
the encoded values.

Headers
-------

We'll discuss here one particular HTTP header, to illustrate how to add headers
to your HTTP request.

Some websites [#]_ dislike being browsed by programs, or send different versions
to different browsers [#]_. By default urllib identifies itself as
``Python-urllib/x.y`` (where ``x`` and ``y`` are the major and minor version
numbers of the Python release,
e.g. ``Python-urllib/2.5``), which may confuse the site, or just plain
not work. The way a browser identifies itself is through the
``User-Agent`` header [#]_. When you create a Request object you can
pass a dictionary of headers in. The following example makes the same
request as above, but identifies itself as a version of Internet
Explorer [#]_. ::

    import urllib.parse
    import urllib.request

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python' }
    headers = {'User-Agent': user_agent}

    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')
    req = urllib.request.Request(url, data, headers)
    with urllib.request.urlopen(req) as response:
       the_page = response.read()

The response also has two useful methods. See the section on `info and geturl`_
which comes after we have a look at what happens when things go wrong.


Handling Exceptions
===================

*urlopen* raises :exc:`URLError` when it cannot handle a response (though as
usual with Python APIs, built-in exceptions such as :exc:`ValueError`,
:exc:`TypeError` etc. may also be raised).

:exc:`HTTPError` is the subclass of :exc:`URLError` raised in the specific case of
HTTP URLs.

The exception classes are exported from the :mod:`urllib.error` module.

URLError
--------

Often, URLError is raised because there is no network connection (no route to
the specified server), or the specified server doesn't exist.  In this case, the
exception raised will have a 'reason' attribute, which is a tuple containing an
error code and a text error message.

e.g. ::

    >>> req = urllib.request.Request('http://www.pretend_server.org')
    >>> try: urllib.request.urlopen(req)
    ... except urllib.error.URLError as e:
    ...     print(e.reason)      #doctest: +SKIP
    ...
    (4, 'getaddrinfo failed')


HTTPError
---------

Every HTTP response from the server contains a numeric "status code". Sometimes
the status code indicates that the server is unable to fulfil the request. The
default handlers will handle some of these responses for you (for example, if
the response is a "redirection" that requests the client fetch the document from
a different URL, urllib will handle that for you). For those it can't handle,
urlopen will raise an :exc:`HTTPError`. Typical errors include '404' (page not
found), '403' (request forbidden), and '401' (authentication required).

See section 10 of :rfc:`2616` for a reference on all the HTTP error codes.

The :exc:`HTTPError` instance raised will have an integer 'code' attribute, which
corresponds to the error sent by the server.

Error Codes
~~~~~~~~~~~

Because the default handlers handle redirects (codes in the 300 range), and
codes in the 100--299 range indicate success, you will usually only see error
codes in the 400--599 range.

:attr:`http.server.BaseHTTPRequestHandler.responses` is a useful dictionary of
response codes that shows all the response codes used by :rfc:`2616`. The
dictionary is reproduced here for convenience ::

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
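
As a quick illustration (a minimal sketch, not part of the reproduced table):
on recent Python 3 releases this mapping is keyed by :class:`http.HTTPStatus`
members, so a code is most reliably looked up that way::

    >>> from http import HTTPStatus
    >>> from http.server import BaseHTTPRequestHandler
    >>> BaseHTTPRequestHandler.responses[HTTPStatus.NOT_FOUND]  #doctest: +SKIP
    ('Not Found', 'Nothing matches the given URI')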

When an error is raised the server responds by returning an HTTP error code
*and* an error page. You can use the :exc:`HTTPError` instance as a response on the
page returned. This means that as well as the code attribute, it also has read,
geturl, and info methods as returned by the ``urllib.response`` module::

    >>> req = urllib.request.Request('http://www.python.org/fish.html')
    >>> try:
    ...     urllib.request.urlopen(req)
    ... except urllib.error.HTTPError as e:
    ...     print(e.code)
    ...     print(e.read())  #doctest: +ELLIPSIS, +NORMALIZE_WHITESPACE
    ...
    404
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
      ...
      <title>Page Not Found</title>\n
      ...

Wrapping it Up
--------------

So if you want to be prepared for :exc:`HTTPError` *or* :exc:`URLError` there are two
basic approaches. I prefer the second approach.

Number 1
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError, HTTPError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError as e:
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    except URLError as e:
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    else:
        # everything is fine


.. note::

    The ``except HTTPError`` *must* come first, otherwise ``except URLError``
    will *also* catch an :exc:`HTTPError`.

Number 2
~~~~~~~~

::

    from urllib.request import Request, urlopen
    from urllib.error import URLError
    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError as e:
        if hasattr(e, 'reason'):
            print('We failed to reach a server.')
            print('Reason: ', e.reason)
        elif hasattr(e, 'code'):
            print('The server couldn\'t fulfill the request.')
            print('Error code: ', e.code)
    else:
        # everything is fine


info and geturl
===============

The response returned by urlopen (or the :exc:`HTTPError` instance) has two
useful methods, :meth:`info` and :meth:`geturl`, and is defined in the module
:mod:`urllib.response`.

**geturl** - this returns the real URL of the page fetched. This is useful
because ``urlopen`` (or the opener object used) may have followed a
redirect. The URL of the page fetched may not be the same as the URL requested.

**info** - this returns a dictionary-like object that describes the page
fetched, particularly the headers sent by the server. It is currently an
:class:`http.client.HTTPMessage` instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the
`Quick Reference to HTTP Headers <https://jkorpela.fi/http.html>`_
for a useful listing of HTTP headers with brief explanations of their meaning
and use.
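
For example, in the style of the earlier examples (a minimal sketch; the exact
output depends on the server you contact)::

    >>> import urllib.request
    >>> with urllib.request.urlopen('http://python.org/') as response:  #doctest: +SKIP
    ...     print(response.geturl())                # the URL actually fetched, after any redirect
    ...     print(response.info()['Content-Type'])  # one header from the HTTPMessage
    ...
    https://www.python.org/
    text/html; charset=utf-8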


Openers and Handlers
====================

When you fetch a URL you use an opener (an instance of the perhaps
confusingly named :class:`urllib.request.OpenerDirector`). Normally we have been using
the default opener - via ``urlopen`` - but you can create custom
openers. Openers use handlers. All the "heavy lifting" is done by the
handlers. Each handler knows how to open URLs for a particular URL scheme (http,
ftp, etc.), or how to handle an aspect of URL opening, for example HTTP
redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers
installed, for example to get an opener that handles cookies, or to get an
opener that does not handle redirections.

To create an opener, instantiate an ``OpenerDirector``, and then call
``.add_handler(some_handler_instance)`` repeatedly.

Alternatively, you can use ``build_opener``, which is a convenience function for
creating opener objects with a single function call.  ``build_opener`` adds
several handlers by default, but provides a quick way to add more and/or
override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication,
and other common but slightly specialised situations.

``install_opener`` can be used to make an ``opener`` object the (global) default
opener. This means that calls to ``urlopen`` will use the opener you have
installed.

Opener objects have an ``open`` method, which can be called directly to fetch
URLs in the same way as the ``urlopen`` function: there's no need to call
``install_opener``, except as a convenience.
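
Putting these pieces together, a minimal sketch of building and using a custom
opener might look like this (``http://www.example.com/`` is only a placeholder
URL, and the cookie handler is just one example of an extra handler you could
add)::

    import urllib.request

    # build_opener() returns an OpenerDirector with the default handlers,
    # plus any extra handler instances passed in.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())

    # Use the opener directly ...
    with opener.open('http://www.example.com/') as response:
        the_page = response.read()

    # ... or install it, so that plain urlopen() uses it from now on.
    urllib.request.install_opener(opener)
    with urllib.request.urlopen('http://www.example.com/') as response:
        the_page = response.read()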
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | Basic Authentication
 | 
					
						
							|  |  |  | ====================
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | To illustrate creating and installing a handler we will use the
 | 
					
						
							|  |  |  | ``HTTPBasicAuthHandler``. For a more detailed discussion of this subject --
 | 
					
						
							|  |  |  | including an explanation of how Basic Authentication works - see the `Basic
 | 
					
						
							|  |  |  | Authentication Tutorial
 | 
					
						
							| 
									
										
										
										
											2023-04-22 11:24:47 -03:00
										 |  |  | <https://web.archive.org/web/20201215133350/http://www.voidspace.org.uk/python/articles/authentication.shtml>`__.
 | 
					
						
							| 
									
										
										
										
											2007-08-15 14:28:22 +00:00
										 |  |  | 
 | 
					
						
							|  |  |  | When authentication is required, the server sends a header (as well as the 401
 | 
					
						
							|  |  |  | error code) requesting authentication.  This specifies the authentication scheme
 | 
					
						
							| 
									
										
										
										
											2013-12-24 11:04:36 +02:00
										 |  |  | and a 'realm'. The header looks like: ``WWW-Authenticate: SCHEME
 | 
					
						
							| 
									
										
										
										
											2007-08-15 14:28:22 +00:00
										 |  |  | realm="REALM"``.
 | 
					
						
							|  |  |  | 
 | 
					
						
							| 
									
										
										
										
											2018-04-08 19:18:04 +03:00
										 |  |  | e.g.
 | 
					
						
							|  |  |  | 
 | 
					
						
							|  |  |  | .. code-block:: none
 | 
					
						
							| 
									
										
										
										
											2007-08-15 14:28:22 +00:00
										 |  |  | 
 | 
					
						
							| 
									
										
										
										
											2012-04-24 17:36:41 +02:00
										 |  |  |     WWW-Authenticate: Basic realm="cPanel Users"
 | 
					
						
							| 
									
										
										
										
											2007-08-15 14:28:22 +00:00
										 |  |  | 
 | 
					
						
							|  |  |  | 
 | 
					
						
The client should then retry the request with the appropriate name and password
for the realm included as a header in the request. This is 'basic
authentication'. In order to simplify this process we can create an instance of
``HTTPBasicAuthHandler`` and an opener to use this handler.

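Behind the scenes, what ultimately gets sent is an ``Authorization`` header
containing the base64-encoded ``user:password`` string. A minimal sketch of
doing this by hand (the credentials and URL are made up for the sketch)::

    import base64
    import urllib.request

    # the 'Basic' scheme is simply "user:password" encoded with base64
    # (made-up credentials, purely for illustration)
    credentials = base64.b64encode(b"joe:secret").decode("ascii")

    req = urllib.request.Request("http://example.com/foo/")
    req.add_header("Authorization", "Basic " + credentials)
    response = urllib.request.urlopen(req)

Using ``HTTPBasicAuthHandler`` saves you from constructing this header yourself,
and it only sends the credentials in response to a challenge from the server.
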
The ``HTTPBasicAuthHandler`` uses an object called a password manager to handle
the mapping of URLs and realms to passwords and usernames. If you know what the
realm is (from the authentication header sent by the server), then you can use
an ``HTTPPasswordMgr``. Frequently one doesn't care what the realm is. In that
case, it is convenient to use ``HTTPPasswordMgrWithDefaultRealm``. This allows
you to specify a default username and password for a URL, which will be used
unless you provide a different combination for a specific realm. We indicate
this by providing ``None`` as the realm argument to the ``add_password`` method.

The top-level URL is the first URL that requires authentication. URLs "deeper"
than the URL you pass to ``add_password()`` will also match. ::

    # create a password manager
    password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of None.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib.request.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib.request.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib.request.urlopen use our opener.
    urllib.request.install_opener(opener)

.. note::

    In the above example we only supplied our ``HTTPBasicAuthHandler`` to
    ``build_opener``. By default openers have the handlers for normal situations
    -- ``ProxyHandler`` (if a proxy setting such as an :envvar:`http_proxy`
    environment variable is set), ``UnknownHandler``, ``HTTPHandler``,
    ``HTTPDefaultErrorHandler``, ``HTTPRedirectHandler``, ``FTPHandler``,
    ``FileHandler``, ``DataHandler``, ``HTTPErrorProcessor``.

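If you do want additional non-default handlers as well, ``build_opener`` simply
takes them all as arguments. A small sketch, reusing ``password_mgr`` from the
example above and adding cookie support purely as an illustration::

    opener = urllib.request.build_opener(
        urllib.request.HTTPBasicAuthHandler(password_mgr),
        urllib.request.HTTPCookieProcessor())
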
``top_level_url`` is in fact *either* a full URL (including the 'http:' scheme
component and the hostname and optionally the port number)
e.g. ``"http://example.com/"`` *or* an "authority" (i.e. the hostname,
optionally including the port number) e.g. ``"example.com"`` or ``"example.com:8080"``
(the latter example includes a port number).  The authority, if present, must
NOT contain the "userinfo" component - for example ``"joe:password@example.com"`` is
not correct.

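For example, either form is accepted by ``add_password`` (reusing
``password_mgr``, ``username`` and ``password`` from the example above)::

    # a full URL, including the scheme ...
    password_mgr.add_password(None, "http://example.com/", username, password)

    # ... or just an authority, optionally with a port number
    password_mgr.add_password(None, "example.com:8080", username, password)
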

Proxies
=======

**urllib** will auto-detect your proxy settings and use those. This is through
the ``ProxyHandler``, which is part of the normal handler chain when a proxy
setting is detected.  Normally that's a good thing, but there are occasions
when it may not be helpful [#]_. One way to prevent urllib from using the proxy
in such cases is to set up our own ``ProxyHandler`` with no proxies defined.
This is done using similar steps to setting up a `Basic Authentication`_
handler: ::

    >>> proxy_support = urllib.request.ProxyHandler({})
    >>> opener = urllib.request.build_opener(proxy_support)
    >>> urllib.request.install_opener(opener)

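Conversely, to force requests through one particular proxy rather than
disabling proxies altogether, pass ``ProxyHandler`` a mapping of URL schemes to
proxy URLs. For example::

    >>> # the proxy address here is just a placeholder
    >>> proxy_handler = urllib.request.ProxyHandler(
    ...     {'http': 'http://proxy.example.com:3128'})
    >>> opener = urllib.request.build_opener(proxy_handler)
    >>> urllib.request.install_opener(opener)
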
.. note::

    Currently ``urllib.request`` *does not* support fetching of ``https`` locations
    through a proxy.  However, this can be enabled by extending urllib.request as
    shown in the recipe [#]_.

.. note::

    ``HTTP_PROXY`` will be ignored if a variable ``REQUEST_METHOD`` is set; see
    the documentation on :func:`~urllib.request.getproxies`.

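You can inspect the proxy configuration that urllib has picked up with
:func:`~urllib.request.getproxies`; a small sketch (the result depends on your
environment and platform settings)::

    import urllib.request

    # a mapping of URL scheme to proxy URL, built from environment
    # variables such as http_proxy (plus platform-specific settings)
    proxies = urllib.request.getproxies()
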

Sockets and Layers
==================

The Python support for fetching resources from the web is layered.  urllib uses
the :mod:`http.client` library, which in turn uses the socket library.

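You can see the layering by talking to the lower level directly; a minimal
sketch using :mod:`http.client` (the host name is just an example)::

    import http.client

    # urllib.request builds on http.client, which can also be used directly
    conn = http.client.HTTPConnection('www.example.com')
    conn.request('GET', '/')
    response = conn.getresponse()
    print(response.status, response.reason)
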
As of Python 2.3 you can specify how long a socket should wait for a response
before timing out. This can be useful in applications which have to fetch web
pages. By default the socket module has *no timeout* and can hang. You can
either pass a *timeout* argument to :func:`~urllib.request.urlopen` for a
single request (sketched after the example below), or set the default timeout
globally for all sockets using ::

    import socket
    import urllib.request

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib.request.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib.request.Request('http://www.voidspace.org.uk')
    response = urllib.request.urlopen(req)

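If you only need a timeout for one particular request, ``urlopen`` also accepts
a *timeout* argument directly. A brief sketch (the URL is just an example)::

    import urllib.request

    # blocking operations such as the connection attempt will raise an
    # error after 10 seconds, for this request only
    response = urllib.request.urlopen('http://www.example.com/', timeout=10)
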

-------


Footnotes
=========

This document was reviewed and revised by John Lee.

.. [#] Google for example.
.. [#] Browser sniffing is a very bad practice for website design - building
       sites using web standards is much more sensible. Unfortunately a lot of
       sites still send different versions to different browsers.
.. [#] The user agent for MSIE 6 is
       *'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)'*
.. [#] For details of more HTTP request headers, see
       `Quick Reference to HTTP Headers`_.
.. [#] In my case I have to use a proxy to access the internet at work. If you
       attempt to fetch *localhost* URLs through this proxy it blocks them. IE
       is set to use the proxy, which urllib picks up on. In order to test
       scripts with a localhost server, I have to prevent urllib from using
       the proxy.
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
       <https://code.activestate.com/recipes/456195/>`_.