author Nick Coghlan <ncoghlan@gmail.com> 2010-11-30 15:48:08 +0000
committer Nick Coghlan <ncoghlan@gmail.com> 2010-11-30 15:48:08 +0000
commit 061e682c3d9117988172db53cd10ce70a490f92a
tree efc7c838ad39f69a83e291e2528d42a0ae364d12 /Doc
parent 49707d7b6eb5d1fa589bf5fa20e0b7e25bd1c978
Issue 9873: the URL parsing functions now accept ASCII encoded byte sequences in addition to character strings
Diffstat (limited to 'Doc')
-rw-r--r-- Doc/library/urllib.parse.rst 231
-rw-r--r-- Doc/whatsnew/3.2.rst 8
2 files changed, 178 insertions, 61 deletions
diff --git a/Doc/library/urllib.parse.rst b/Doc/library/urllib.parse.rst
index a15ff4770a..eab218e0c3 100644
--- a/Doc/library/urllib.parse.rst
+++ b/Doc/library/urllib.parse.rst
@@ -24,7 +24,15 @@ following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.
-The :mod:`urllib.parse` module defines the following functions:
+The :mod:`urllib.parse` module defines functions that fall into two broad
+categories: URL parsing and URL quoting. These are covered in detail in
+the following sections.
+
+URL Parsing
+-----------
+
+The URL parsing functions focus on splitting a URL string into its components,
+or on combining URL components into a URL string.
.. function:: urlparse(urlstring, scheme='', allow_fragments=True)
@@ -242,6 +250,161 @@ The :mod:`urllib.parse` module defines the following functions:
string. If there is no fragment identifier in *url*, return *url* unmodified
and an empty string.
+ The return value is actually an instance of a subclass of :class:`tuple`. This
+ class has the following additional read-only convenience attributes:
+
+ +------------------+-------+-------------------------+----------------------+
+ | Attribute | Index | Value | Value if not present |
+ +==================+=======+=========================+======================+
+ | :attr:`url` | 0 | URL with no fragment | empty string |
+ +------------------+-------+-------------------------+----------------------+
+ | :attr:`fragment` | 1 | Fragment identifier | empty string |
+ +------------------+-------+-------------------------+----------------------+
+
+ See section :ref:`urlparse-result-object` for more information on the result
+ object.
+
+ .. versionchanged:: 3.2
+ Result is a structured object rather than a simple 2-tuple
+
+
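A minimal sketch of the attribute access described in the table above. The URL is only an illustrative value; the attribute names ``url`` and ``fragment`` come from the table.

```python
from urllib.parse import urldefrag

# Illustrative URL; any URL with a fragment behaves the same way.
result = urldefrag("http://www.python.org/doc/#intro")

# The result still unpacks like a plain 2-tuple...
url, fragment = result

# ...and also exposes the same values as named attributes.
print(result.url)       # http://www.python.org/doc/
print(result.fragment)  # intro

# With no fragment present, the attribute is an empty string.
no_fragment = urldefrag("http://www.python.org/doc/")
print(repr(no_fragment.fragment))  # ''
```

Existing code that unpacks the return value as a 2-tuple keeps working, since the result is still a tuple subclass.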
+Parsing ASCII Encoded Bytes
+---------------------------
+
+The URL parsing functions were originally designed to operate on character
+strings only. In practice, it is useful to be able to manipulate properly
+quoted and encoded URLs as sequences of ASCII bytes. Accordingly, the
+URL parsing functions in this module all operate on :class:`bytes` and
+:class:`bytearray` objects in addition to :class:`str` objects.
+
+If :class:`str` data is passed in, the result will also contain only
+:class:`str` data. If :class:`bytes` or :class:`bytearray` data is
+passed in, the result will contain only :class:`bytes` data.
+
+Attempting to mix :class:`str` data with :class:`bytes` or
+:class:`bytearray` in a single function call will result in a
+:exc:`TypeError` being raised, while attempting to pass in non-ASCII
+byte values will trigger :exc:`UnicodeDecodeError`.
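The type rules above can be sketched as follows. The URLs are illustrative values, and :func:`urljoin` is used for the mixed-type case only because it takes two URL arguments.

```python
from urllib.parse import urljoin, urlsplit

# str in -> str out; bytes or bytearray in -> bytes out.
text_result = urlsplit("http://www.python.org/doc/?q=1")
bytes_result = urlsplit(b"http://www.python.org/doc/?q=1")
bytearray_result = urlsplit(bytearray(b"http://www.python.org/doc/?q=1"))

print(text_result.netloc)                      # www.python.org
print(bytes_result.netloc)                     # b'www.python.org'
print(type(bytearray_result.netloc).__name__)  # bytes

# Mixing str and bytes arguments in one call raises TypeError.
mixed_error = None
try:
    urljoin("http://www.python.org/doc/", b"faq.html")
except TypeError as exc:
    mixed_error = exc

# Byte values outside the ASCII range raise UnicodeDecodeError.
non_ascii_error = None
try:
    urlsplit(b"http://www.python.org/caf\xe9")
except UnicodeDecodeError as exc:
    non_ascii_error = exc
```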
+
+To support easier conversion of result objects between :class:`str` and
+:class:`bytes`, all return values from URL parsing functions provide
+either an :meth:`encode` method (when the result contains :class:`str`
+data) or a :meth:`decode` method (when the result contains :class:`bytes`
+data). The signatures of these methods match those of the corresponding
+:class:`str` and :class:`bytes` methods (except that the default encoding
+is ``'ascii'`` rather than ``'utf-8'``). Each produces a value of a
+corresponding type that contains either :class:`bytes` data (for
+:meth:`encode` methods) or :class:`str` data (for
+:meth:`decode` methods).
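As a small sketch of the conversion methods described above (the URL is an illustrative value), a :class:`str` result can be encoded to its :class:`bytes` counterpart and decoded back again:

```python
from urllib.parse import SplitResult, SplitResultBytes, urlsplit

text_result = urlsplit("http://www.python.org/doc/")

# encode() defaults to the 'ascii' encoding and returns the bytes variant.
bytes_result = text_result.encode()
print(type(bytes_result).__name__)  # SplitResultBytes
print(bytes_result.path)            # b'/doc/'

# decode() converts back to the str variant.
round_trip = bytes_result.decode()
print(type(round_trip).__name__)    # SplitResult
print(round_trip == text_result)    # True
```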
+
+Applications that need to operate on potentially improperly quoted URLs
+that may contain non-ASCII data will need to do their own decoding from
+bytes to characters before invoking the URL parsing functions.
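One way an application might handle such input is sketched below. It assumes the incoming bytes are UTF-8 encoded; that encoding and the ``example.com`` URL are assumptions for illustration only, and the application has to know or detect the real encoding.

```python
from urllib.parse import urlsplit

# Improperly quoted URL: the path contains a raw non-ASCII character.
raw = "http://example.com/caf\u00e9".encode("utf-8")

# Decode to str first (UTF-8 is assumed here), then parse the text.
text = raw.decode("utf-8")
parts = urlsplit(text)

print(parts.netloc)  # example.com
print(parts.path)    # /café
```

Passing ``raw`` directly to :func:`urlsplit` would instead fail, because the parsing functions only accept ASCII byte sequences.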
+
+The behaviour described in this section applies only to the URL parsing
+functions. The URL quoting functions use their own rules when producing
+or consuming byte sequences as detailed in the documentation of the
+individual URL quoting functions.
+
+.. versionchanged:: 3.2
+ URL parsing functions now accept ASCII encoded byte sequences
+
+
+.. _urlparse-result-object:
+
+Structured Parse Results
+------------------------
+
+The result objects from the :func:`urlparse`, :func:`urlsplit` and
+:func:`urldefrag` functions are subclasses of the :class:`tuple` type.
+These subclasses add the attributes listed in the documentation for
+those functions, the encoding and decoding support described in the
+previous section, as well as an additional method:
+
+.. method:: urllib.parse.SplitResult.geturl()
+
+ Return the re-combined version of the original URL as a string. This may
+ differ from the original URL in that the scheme may be normalized to lower
+ case and empty components may be dropped. Specifically, empty parameters,
+ queries, and fragment identifiers will be removed.
+
+ For :func:`urldefrag` results, only empty fragment identifiers will be removed.
+ For :func:`urlsplit` and :func:`urlparse` results, all noted changes will be
+ made to the URL returned by this method.
+
+ The result of this method remains unchanged if passed back through the original
+ parsing function:
+
+ >>> from urllib.parse import urlsplit
+ >>> url = 'HTTP://www.Python.org/doc/#'
+ >>> r1 = urlsplit(url)
+ >>> r1.geturl()
+ 'http://www.Python.org/doc/'
+ >>> r2 = urlsplit(r1.geturl())
+ >>> r2.geturl()
+ 'http://www.Python.org/doc/'
+
+
+The following classes provide the implementations of the structured parse
+results when operating on :class:`str` objects:
+
+.. class:: DefragResult(url, fragment)
+
+ Concrete class for :func:`urldefrag` results containing :class:`str`
+ data. The :meth:`encode` method returns a :class:`DefragResultBytes`
+ instance.
+
+ .. versionadded:: 3.2
+
+.. class:: ParseResult(scheme, netloc, path, params, query, fragment)
+
+ Concrete class for :func:`urlparse` results containing :class:`str`
+ data. The :meth:`encode` method returns a :class:`ParseResultBytes`
+ instance.
+
+.. class:: SplitResult(scheme, netloc, path, query, fragment)
+
+ Concrete class for :func:`urlsplit` results containing :class:`str`
+ data. The :meth:`encode` method returns a :class:`SplitResultBytes`
+ instance.
+
+
+The following classes provide the implementations of the parse results when
+operating on :class:`bytes` or :class:`bytearray` objects:
+
+.. class:: DefragResultBytes(url, fragment)
+
+ Concrete class for :func:`urldefrag` results containing :class:`bytes`
+ data. The :meth:`decode` method returns a :class:`DefragResult`
+ instance.
+
+ .. versionadded:: 3.2
+
+.. class:: ParseResultBytes(scheme, netloc, path, params, query, fragment)
+
+ Concrete class for :func:`urlparse` results containing :class:`bytes`
+ data. The :meth:`decode` method returns a :class:`ParseResult`
+ instance.
+
+ .. versionadded:: 3.2
+
+.. class:: SplitResultBytes(scheme, netloc, path, query, fragment)
+
+ Concrete class for :func:`urlsplit` results containing :class:`bytes`
+ data. The :meth:`decode` method returns a :class:`SplitResult`
+ instance.
+
+ .. versionadded:: 3.2
+
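The result classes can also be constructed directly from their components, using the signatures listed above. The sketch below uses illustrative component values and shows that the :class:`str` and :class:`bytes` variants mirror each other:

```python
from urllib.parse import SplitResult, SplitResultBytes

# Build a str result from its five components.
text_parts = SplitResult("http", "www.python.org", "/doc/", "", "")
print(text_parts.geturl())   # http://www.python.org/doc/

# The bytes variant takes the same components as bytes.
bytes_parts = SplitResultBytes(b"http", b"www.python.org", b"/doc/", b"", b"")
print(bytes_parts.geturl())  # b'http://www.python.org/doc/'

# Each variant converts to the other.
print(text_parts.encode() == bytes_parts)  # True
print(bytes_parts.decode() == text_parts)  # True
```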
+
+URL Quoting
+-----------
+
+The URL quoting functions focus on taking program data and making it safe
+for use as URL components by quoting special characters and appropriately
+encoding non-ASCII text. They also support reversing these operations to
+recreate the original data from the contents of a URL component if that
+task isn't already covered by the URL parsing functions above.
.. function:: quote(string, safe='/', encoding=None, errors=None)
@@ -322,8 +485,7 @@ The :mod:`urllib.parse` module defines the following functions:
If it is a :class:`str`, unescaped non-ASCII characters in *string*
are encoded into UTF-8 bytes.
- Example: ``unquote_to_bytes('a%26%EF')`` yields
- ``b'a&\xef'``.
+ Example: ``unquote_to_bytes('a%26%EF')`` yields ``b'a&\xef'``.
.. function:: urlencode(query, doseq=False, safe='', encoding=None, errors=None)
@@ -340,12 +502,13 @@ The :mod:`urllib.parse` module defines the following functions:
the optional parameter *doseq* evaluates to *True*, individual
``key=value`` pairs separated by ``'&'`` are generated for each element of
the value sequence for the key. The order of parameters in the encoded
- string will match the order of parameter tuples in the sequence. This module
- provides the functions :func:`parse_qs` and :func:`parse_qsl` which are used
- to parse query strings into Python data structures.
+ string will match the order of parameter tuples in the sequence.
When the *query* parameter is a :class:`str`, the *safe*, *encoding* and *errors*
- parameters are sent the :func:`quote_plus` for encoding.
+ parameters are passed down to :func:`quote_plus` for encoding.
+
+ To reverse this encoding process, :func:`parse_qs` and :func:`parse_qsl` are
+ provided in this module to parse query strings into Python data structures.
.. versionchanged:: 3.2
Query parameter supports bytes and string objects.
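A short sketch of the round trip between :func:`urlencode` and the query parsing functions; the parameter names and values are illustrative only.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

# A sequence of (key, value) tuples keeps its order in the encoded string.
params = [("q", "url parsing"), ("page", "2")]
query = urlencode(params)
print(query)  # q=url+parsing&page=2

# parse_qsl() recovers the ordered list of pairs.
pairs = parse_qsl(query)
print(pairs)  # [('q', 'url parsing'), ('page', '2')]

# parse_qs() groups the values for each key into a list.
grouped = parse_qs(query)
print(grouped)  # {'q': ['url parsing'], 'page': ['2']}
```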
@@ -376,57 +539,3 @@ The :mod:`urllib.parse` module defines the following functions:
:rfc:`1738` - Uniform Resource Locators (URL)
This specifies the formal syntax and semantics of absolute URLs.
-
-
-.. _urlparse-result-object:
-
-Results of :func:`urlparse` and :func:`urlsplit`
-------------------------------------------------
-
-The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
-subclasses of the :class:`tuple` type. These subclasses add the attributes
-described in those functions, as well as provide an additional method:
-
-.. method:: ParseResult.geturl()
-
- Return the re-combined version of the original URL as a string. This may differ
- from the original URL in that the scheme will always be normalized to lower case
- and empty components may be dropped. Specifically, empty parameters, queries,
- and fragment identifiers will be removed.
-
- The result of this method is a fixpoint if passed back through the original
- parsing function:
-
- >>> import urllib.parse
- >>> url = 'HTTP://www.Python.org/doc/#'
-
- >>> r1 = urllib.parse.urlsplit(url)
- >>> r1.geturl()
- 'http://www.Python.org/doc/'
-
- >>> r2 = urllib.parse.urlsplit(r1.geturl())
- >>> r2.geturl()
- 'http://www.Python.org/doc/'
-
-
-The following classes provide the implementations of the parse results:
-
-.. class:: BaseResult
-
- Base class for the concrete result classes. This provides most of the
- attribute definitions. It does not provide a :meth:`geturl` method. It is
- derived from :class:`tuple`, but does not override the :meth:`__init__` or
- :meth:`__new__` methods.
-
-
-.. class:: ParseResult(scheme, netloc, path, params, query, fragment)
-
- Concrete class for :func:`urlparse` results. The :meth:`__new__` method is
- overridden to support checking that the right number of arguments are passed.
-
-
-.. class:: SplitResult(scheme, netloc, path, query, fragment)
-
- Concrete class for :func:`urlsplit` results. The :meth:`__new__` method is
- overridden to support checking that the right number of arguments are passed.
-
diff --git a/Doc/whatsnew/3.2.rst b/Doc/whatsnew/3.2.rst
index 2d1f0efe1b..056e2facd2 100644
--- a/Doc/whatsnew/3.2.rst
+++ b/Doc/whatsnew/3.2.rst
@@ -573,6 +573,14 @@ New, Improved, and Deprecated Modules
(Contributed by Rodolpho Eckhardt and Nick Coghlan, :issue:`10220`.)
.. XXX: Mention inspect.getattr_static (Michael Foord)
+.. XXX: Mention urllib.parse changes
+ Issue 9873 (Nick Coghlan):
+ - ASCII byte sequence support in URL parsing
+ - named tuple for urldefrag return value
+ Issue 5468 (Dan Mahn) for urlencode:
+ - bytes input support
+ - non-UTF8 percent encoding of non-ASCII characters
+ Issue 2987 for IPv6 (RFC2732) support in urlparse
Multi-threading
===============