cpython/Lib/urlparse.py
Christian Heimes faf2f63faf Merged revisions 59703-59773 via svnmerge from
svn+ssh://pythondev@svn.python.org/python/trunk

........
  r59704 | christian.heimes | 2008-01-04 04:15:05 +0100 (Fri, 04 Jan 2008) | 1 line

  Moved include "Python.h" in front of other imports to silence a warning.
........
  r59706 | raymond.hettinger | 2008-01-04 04:22:53 +0100 (Fri, 04 Jan 2008) | 10 lines

  Minor fix-ups to named tuples:

  * Make the _replace() method respect subclassing.

  * Using property() to make _fields read-only wasn't a good idea.
    It caused len(Point._fields) to fail.

  * Add note to _cast() about length checking and alternative with the star-operator.
........
  r59707 | jeffrey.yasskin | 2008-01-04 09:01:23 +0100 (Fri, 04 Jan 2008) | 3 lines

  Make math.{floor,ceil}({int,long}) return float again for backwards
  compatibility after r59671 made them return integral types.
........
  r59709 | christian.heimes | 2008-01-04 14:21:07 +0100 (Fri, 04 Jan 2008) | 1 line

  Bug #1713: posixpath.ismount() claims symlink to a mountpoint is a mountpoint.
........
  r59712 | lars.gustaebel | 2008-01-04 15:00:33 +0100 (Fri, 04 Jan 2008) | 5 lines

  Issue #1735: TarFile.extractall() now correctly sets
  directory permissions and times.

  (will backport to 2.5)
........
  r59714 | andrew.kuchling | 2008-01-04 15:47:17 +0100 (Fri, 04 Jan 2008) | 1 line

  Update links to bug/patch tracker
........
  r59716 | christian.heimes | 2008-01-04 16:23:30 +0100 (Fri, 04 Jan 2008) | 1 line

  Added interface to Windows' WSAIoctl and a simple example for a network sniffer.
........
  r59717 | christian.heimes | 2008-01-04 16:29:00 +0100 (Fri, 04 Jan 2008) | 1 line

  And here is the rest of Hirokazu Yamamoto's patch for VS6.0 support. Thanks Hiro!
........
  r59719 | christian.heimes | 2008-01-04 16:34:06 +0100 (Fri, 04 Jan 2008) | 1 line

  Reverted last transaction. It's the wrong branch.
........
  r59721 | christian.heimes | 2008-01-04 16:48:06 +0100 (Fri, 04 Jan 2008) | 1 line

  socket.ioctl is only available on Windows
........
  r59722 | andrew.kuchling | 2008-01-04 19:24:41 +0100 (Fri, 04 Jan 2008) | 1 line

  Fix markup
........
  r59723 | andrew.kuchling | 2008-01-04 19:25:05 +0100 (Fri, 04 Jan 2008) | 1 line

  Fix markup
........
  r59725 | guido.van.rossum | 2008-01-05 01:59:59 +0100 (Sat, 05 Jan 2008) | 3 lines

  Patch #1725 by Mark Dickinson, fixes incorrect conversion of -1e1000
  and adds errors for -0x.
........
  r59726 | guido.van.rossum | 2008-01-05 02:21:57 +0100 (Sat, 05 Jan 2008) | 2 lines

  Patch #1698 by Senthil: allow '@' in username when parsed by urlparse.py.
........
  r59727 | raymond.hettinger | 2008-01-05 02:35:43 +0100 (Sat, 05 Jan 2008) | 1 line

  Improve namedtuple's _cast() method with a docstring, new name, and error-checking.
........
  r59728 | raymond.hettinger | 2008-01-05 03:17:24 +0100 (Sat, 05 Jan 2008) | 1 line

  Add error-checking to namedtuple's _replace() method.
........
  r59730 | fred.drake | 2008-01-05 05:38:38 +0100 (Sat, 05 Jan 2008) | 2 lines

  clean up a comment
........
  r59731 | jeffrey.yasskin | 2008-01-05 09:47:13 +0100 (Sat, 05 Jan 2008) | 11 lines

  Continue rolling back pep-3141 changes that changed behavior from 2.5. This
  round included:
   * Revert round to its 2.6 behavior (half away from 0).
   * Because round, floor, and ceil always return float again, it's no
     longer necessary to have them delegate to __xxx___, so I've ripped
     that out of their implementations and the Real ABC. This also helps
     in implementing types that work in both 2.6 and 3.0: you return int
     from the __xxx__ methods, and let it get enabled by the version
     upgrade.
   * Make pow(-1, .5) raise a ValueError again.
........
  r59736 | andrew.kuchling | 2008-01-05 16:13:49 +0100 (Sat, 05 Jan 2008) | 1 line

  Fix comment typo
........
  r59738 | thomas.heller | 2008-01-05 18:15:44 +0100 (Sat, 05 Jan 2008) | 1 line

  Add myself.
........
  r59739 | georg.brandl | 2008-01-05 18:49:17 +0100 (Sat, 05 Jan 2008) | 2 lines

  Fix C++-style comment.
........
  r59742 | georg.brandl | 2008-01-05 20:28:16 +0100 (Sat, 05 Jan 2008) | 2 lines

  Remove with_statement future imports from 2.6 docs.
........
  r59743 | georg.brandl | 2008-01-05 20:29:45 +0100 (Sat, 05 Jan 2008) | 2 lines

  Simplify index entries; fix #1712.
........
  r59744 | georg.brandl | 2008-01-05 20:44:22 +0100 (Sat, 05 Jan 2008) | 2 lines

  Doc patch #1730 from Robin Stocker; minor corrections mostly to os.rst.
........
  r59749 | georg.brandl | 2008-01-05 21:29:13 +0100 (Sat, 05 Jan 2008) | 2 lines

  Revert socket.rst to unix-eol.
........
  r59750 | georg.brandl | 2008-01-05 21:33:46 +0100 (Sat, 05 Jan 2008) | 2 lines

  Set native svn:eol-style property for text files.
........
  r59752 | georg.brandl | 2008-01-05 21:46:29 +0100 (Sat, 05 Jan 2008) | 2 lines

  #1719: capitalization error in "UuidCreate".
........
  r59753 | georg.brandl | 2008-01-05 22:02:25 +0100 (Sat, 05 Jan 2008) | 2 lines

  Repair markup.
........
  r59754 | georg.brandl | 2008-01-05 22:10:50 +0100 (Sat, 05 Jan 2008) | 2 lines

  Use markup.
........
  r59757 | christian.heimes | 2008-01-05 22:35:52 +0100 (Sat, 05 Jan 2008) | 1 line

  Final adjustments for #1601
........
  r59758 | guido.van.rossum | 2008-01-05 23:19:06 +0100 (Sat, 05 Jan 2008) | 3 lines

  Patch #1637: fix urlparse for URLs like 'http://x.com?arg=/foo'.
  Fix by John Nagle.
........
  r59759 | guido.van.rossum | 2008-01-05 23:20:01 +0100 (Sat, 05 Jan 2008) | 2 lines

  Add John Nagle (of issue #1637).
........
  r59765 | raymond.hettinger | 2008-01-06 10:02:24 +0100 (Sun, 06 Jan 2008) | 1 line

  Small code simplification.  Forgot that classmethods can be called from intances.
........
  r59766 | martin.v.loewis | 2008-01-06 11:09:48 +0100 (Sun, 06 Jan 2008) | 2 lines

  Use vcbuild for VS 2009.
........
  r59767 | martin.v.loewis | 2008-01-06 12:03:43 +0100 (Sun, 06 Jan 2008) | 2 lines

  Package using VS 2008.
........
  r59768 | martin.v.loewis | 2008-01-06 12:13:16 +0100 (Sun, 06 Jan 2008) | 2 lines

  Don't try to package msvcr90 for the moment.
........
  r59769 | georg.brandl | 2008-01-06 15:17:36 +0100 (Sun, 06 Jan 2008) | 4 lines

  #1696393: don't check for '.' and '..' in ntpath.walk since
  they aren't returned from os.listdir anymore.
  Reported by Michael Haggerty.
........
  r59770 | georg.brandl | 2008-01-06 15:27:15 +0100 (Sun, 06 Jan 2008) | 3 lines

  #1742: don't raise exception on os.path.relpath("a", "a"), but return os.curdir.
  Reported by Jesse Towner.
........
  r59771 | georg.brandl | 2008-01-06 15:33:52 +0100 (Sun, 06 Jan 2008) | 2 lines

  #1591: Clarify docstring of Popen3.
........
  r59772 | georg.brandl | 2008-01-06 16:30:34 +0100 (Sun, 06 Jan 2008) | 2 lines

  #1680: fix context manager example function name.
........
  r59773 | georg.brandl | 2008-01-06 16:34:57 +0100 (Sun, 06 Jan 2008) | 2 lines

  #1755097: document default values for [].sort() and sorted().
........
2008-01-06 16:59:19 +00:00

373 lines
12 KiB
Python

"""Parse (absolute and relative) URLs.
See RFC 1808: "Relative Uniform Resource Locators", by R. Fielding,
UC Irvine, June 1995.
"""
__all__ = ["urlparse", "urlunparse", "urljoin", "urldefrag",
"urlsplit", "urlunsplit"]
# A classification of schemes ('' means apply by default)
uses_relative = ['ftp', 'http', 'gopher', 'nntp', 'imap',
'wais', 'file', 'https', 'shttp', 'mms',
'prospero', 'rtsp', 'rtspu', '', 'sftp']
uses_netloc = ['ftp', 'http', 'gopher', 'nntp', 'telnet',
'imap', 'wais', 'file', 'mms', 'https', 'shttp',
'snews', 'prospero', 'rtsp', 'rtspu', 'rsync', '',
'svn', 'svn+ssh', 'sftp']
non_hierarchical = ['gopher', 'hdl', 'mailto', 'news',
'telnet', 'wais', 'imap', 'snews', 'sip', 'sips']
uses_params = ['ftp', 'hdl', 'prospero', 'http', 'imap',
'https', 'shttp', 'rtsp', 'rtspu', 'sip', 'sips',
'mms', '', 'sftp']
uses_query = ['http', 'wais', 'imap', 'https', 'shttp', 'mms',
'gopher', 'rtsp', 'rtspu', 'sip', 'sips', '']
uses_fragment = ['ftp', 'hdl', 'http', 'gopher', 'news',
'nntp', 'wais', 'https', 'shttp', 'snews',
'file', 'prospero', '']
# Characters valid in scheme names
scheme_chars = ('abcdefghijklmnopqrstuvwxyz'
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
'0123456789'
'+-.')
MAX_CACHE_SIZE = 20
_parse_cache = {}
def clear_cache():
"""Clear the parse cache."""
global _parse_cache
_parse_cache = {}
class BaseResult(tuple):
"""Base class for the parsed result objects.
This provides the attributes shared by the two derived result
objects as read-only properties. The derived classes are
responsible for checking the right number of arguments were
supplied to the constructor.
"""
__slots__ = ()
# Attributes that access the basic components of the URL:
@property
def scheme(self):
return self[0]
@property
def netloc(self):
return self[1]
@property
def path(self):
return self[2]
@property
def query(self):
return self[-2]
@property
def fragment(self):
return self[-1]
# Additional attributes that provide access to parsed-out portions
# of the netloc:
@property
def username(self):
netloc = self.netloc
if "@" in netloc:
userinfo = netloc.rsplit("@", 1)[0]
if ":" in userinfo:
userinfo = userinfo.split(":", 1)[0]
return userinfo
return None
@property
def password(self):
netloc = self.netloc
if "@" in netloc:
userinfo = netloc.rsplit("@", 1)[0]
if ":" in userinfo:
return userinfo.split(":", 1)[1]
return None
@property
def hostname(self):
netloc = self.netloc
if "@" in netloc:
netloc = netloc.rsplit("@", 1)[1]
if ":" in netloc:
netloc = netloc.split(":", 1)[0]
return netloc.lower() or None
@property
def port(self):
netloc = self.netloc
if "@" in netloc:
netloc = netloc.rsplit("@", 1)[1]
if ":" in netloc:
port = netloc.split(":", 1)[1]
return int(port, 10)
return None
class SplitResult(BaseResult):
__slots__ = ()
def __new__(cls, scheme, netloc, path, query, fragment):
return BaseResult.__new__(
cls, (scheme, netloc, path, query, fragment))
def geturl(self):
return urlunsplit(self)
class ParseResult(BaseResult):
__slots__ = ()
def __new__(cls, scheme, netloc, path, params, query, fragment):
return BaseResult.__new__(
cls, (scheme, netloc, path, params, query, fragment))
@property
def params(self):
return self[3]
def geturl(self):
return urlunparse(self)
def urlparse(url, scheme='', allow_fragments=True):
"""Parse a URL into 6 components:
<scheme>://<netloc>/<path>;<params>?<query>#<fragment>
Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
Note that we don't break the components up in smaller bits
(e.g. netloc is a single string) and we don't expand % escapes."""
tuple = urlsplit(url, scheme, allow_fragments)
scheme, netloc, url, query, fragment = tuple
if scheme in uses_params and ';' in url:
url, params = _splitparams(url)
else:
params = ''
return ParseResult(scheme, netloc, url, params, query, fragment)
def _splitparams(url):
if '/' in url:
i = url.find(';', url.rfind('/'))
if i < 0:
return url, ''
else:
i = url.find(';')
return url[:i], url[i+1:]
def _splitnetloc(url, start=0):
delim = len(url) # position of end of domain part of url, default is end
for c in '/?#': # look for delimiters; the order is NOT important
wdelim = url.find(c, start) # find first of this delim
if wdelim >= 0: # if found
delim = min(delim, wdelim) # use earliest delim position
return url[start:delim], url[delim:] # return (domain, rest)
def urlsplit(url, scheme='', allow_fragments=True):
"""Parse a URL into 5 components:
<scheme>://<netloc>/<path>?<query>#<fragment>
Return a 5-tuple: (scheme, netloc, path, query, fragment).
Note that we don't break the components up in smaller bits
(e.g. netloc is a single string) and we don't expand % escapes."""
allow_fragments = bool(allow_fragments)
key = url, scheme, allow_fragments, type(url), type(scheme)
cached = _parse_cache.get(key, None)
if cached:
return cached
if len(_parse_cache) >= MAX_CACHE_SIZE: # avoid runaway growth
clear_cache()
netloc = query = fragment = ''
i = url.find(':')
if i > 0:
if url[:i] == 'http': # optimize the common case
scheme = url[:i].lower()
url = url[i+1:]
if url[:2] == '//':
netloc, url = _splitnetloc(url, 2)
if allow_fragments and '#' in url:
url, fragment = url.split('#', 1)
if '?' in url:
url, query = url.split('?', 1)
v = SplitResult(scheme, netloc, url, query, fragment)
_parse_cache[key] = v
return v
for c in url[:i]:
if c not in scheme_chars:
break
else:
scheme, url = url[:i].lower(), url[i+1:]
if scheme in uses_netloc and url[:2] == '//':
netloc, url = _splitnetloc(url, 2)
if allow_fragments and scheme in uses_fragment and '#' in url:
url, fragment = url.split('#', 1)
if scheme in uses_query and '?' in url:
url, query = url.split('?', 1)
v = SplitResult(scheme, netloc, url, query, fragment)
_parse_cache[key] = v
return v
def urlunparse(components):
"""Put a parsed URL back together again. This may result in a
slightly different, but equivalent URL, if the URL that was parsed
originally had redundant delimiters, e.g. a ? with an empty query
(the draft states that these are equivalent)."""
scheme, netloc, url, params, query, fragment = components
if params:
url = "%s;%s" % (url, params)
return urlunsplit((scheme, netloc, url, query, fragment))
def urlunsplit(components):
scheme, netloc, url, query, fragment = components
if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
if url and url[:1] != '/': url = '/' + url
url = '//' + (netloc or '') + url
if scheme:
url = scheme + ':' + url
if query:
url = url + '?' + query
if fragment:
url = url + '#' + fragment
return url
def urljoin(base, url, allow_fragments=True):
"""Join a base URL and a possibly relative URL to form an absolute
interpretation of the latter."""
if not base:
return url
if not url:
return base
bscheme, bnetloc, bpath, bparams, bquery, bfragment = \
urlparse(base, '', allow_fragments)
scheme, netloc, path, params, query, fragment = \
urlparse(url, bscheme, allow_fragments)
if scheme != bscheme or scheme not in uses_relative:
return url
if scheme in uses_netloc:
if netloc:
return urlunparse((scheme, netloc, path,
params, query, fragment))
netloc = bnetloc
if path[:1] == '/':
return urlunparse((scheme, netloc, path,
params, query, fragment))
if not (path or params or query):
return urlunparse((scheme, netloc, bpath,
bparams, bquery, fragment))
segments = bpath.split('/')[:-1] + path.split('/')
# XXX The stuff below is bogus in various ways...
if segments[-1] == '.':
segments[-1] = ''
while '.' in segments:
segments.remove('.')
while 1:
i = 1
n = len(segments) - 1
while i < n:
if (segments[i] == '..'
and segments[i-1] not in ('', '..')):
del segments[i-1:i+1]
break
i = i+1
else:
break
if segments == ['', '..']:
segments[-1] = ''
elif len(segments) >= 2 and segments[-1] == '..':
segments[-2:] = ['']
return urlunparse((scheme, netloc, '/'.join(segments),
params, query, fragment))
def urldefrag(url):
"""Removes any existing fragment from URL.
Returns a tuple of the defragmented URL and the fragment. If
the URL contained no fragments, the second element is the
empty string.
"""
if '#' in url:
s, n, p, a, q, frag = urlparse(url)
defrag = urlunparse((s, n, p, a, q, ''))
return defrag, frag
else:
return url, ''
test_input = """
http://a/b/c/d
g:h = <URL:g:h>
http:g = <URL:http://a/b/c/g>
http: = <URL:http://a/b/c/d>
g = <URL:http://a/b/c/g>
./g = <URL:http://a/b/c/g>
g/ = <URL:http://a/b/c/g/>
/g = <URL:http://a/g>
//g = <URL:http://g>
?y = <URL:http://a/b/c/d?y>
g?y = <URL:http://a/b/c/g?y>
g?y/./x = <URL:http://a/b/c/g?y/./x>
. = <URL:http://a/b/c/>
./ = <URL:http://a/b/c/>
.. = <URL:http://a/b/>
../ = <URL:http://a/b/>
../g = <URL:http://a/b/g>
../.. = <URL:http://a/>
../../g = <URL:http://a/g>
../../../g = <URL:http://a/../g>
./../g = <URL:http://a/b/g>
./g/. = <URL:http://a/b/c/g/>
/./g = <URL:http://a/./g>
g/./h = <URL:http://a/b/c/g/h>
g/../h = <URL:http://a/b/c/h>
http:g = <URL:http://a/b/c/g>
http: = <URL:http://a/b/c/d>
http:?y = <URL:http://a/b/c/d?y>
http:g?y = <URL:http://a/b/c/g?y>
http:g?y/./x = <URL:http://a/b/c/g?y/./x>
"""
def test():
import sys
base = ''
if sys.argv[1:]:
fn = sys.argv[1]
if fn == '-':
fp = sys.stdin
else:
fp = open(fn)
else:
from io import StringIO
fp = StringIO(test_input)
while 1:
line = fp.readline()
if not line: break
words = line.split()
if not words:
continue
url = words[0]
parts = urlparse(url)
print('%-10s : %s' % (url, parts))
abs = urljoin(base, url)
if not base:
base = abs
wrapped = '<URL:%s>' % abs
print('%-10s = %s' % (url, wrapped))
if len(words) == 3 and words[1] == '=':
if wrapped != words[2]:
print('EXPECTED', words[2], '!!!!!!!!!!')
if __name__ == '__main__':
test()