Issue 9873: the URL parsing functions now accept ASCII encoded byte sequences in addition to character strings

author: Nick Coghlan <ncoghlan@gmail.com> 2010-11-30 15:48:08 +0000
committer: Nick Coghlan <ncoghlan@gmail.com> 2010-11-30 15:48:08 +0000
commit: 061e682c3d9117988172db53cd10ce70a490f92a (patch)
tree: efc7c838ad39f69a83e291e2528d42a0ae364d12 /Doc
parent: 49707d7b6eb5d1fa589bf5fa20e0b7e25bd1c978 (diff)
download: cpython-061e682c3d9117988172db53cd10ce70a490f92a.tar.gz
2 files changed, 178 insertions, 61 deletions
diff --git a/Doc/library/urllib.parse.rst b/Doc/library/urllib.parse.rst
index a15ff4770a..eab218e0c3 100644
--- a/Doc/library/urllib.parse.rst
+++ b/Doc/library/urllib.parse.rst
@@ -24,7 +24,15 @@ following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
 ``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
 ``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.
 
-The :mod:`urllib.parse` module defines the following functions:
+The :mod:`urllib.parse` module defines functions that fall into two broad
+categories: URL parsing and URL quoting. These are covered in detail in
+the following sections.
+
+URL Parsing
+-----------
+
+The URL parsing functions focus on splitting a URL string into its components,
+or on combining URL components into a URL string.
 
 .. function:: urlparse(urlstring, scheme='', allow_fragments=True)
 
@@ -242,6 +250,161 @@ The :mod:`urllib.parse` module defines the following functions:
    string.  If there is no fragment identifier in *url*, return *url* unmodified
    and an empty string.
 
+   The return value is actually an instance of a subclass of :class:`tuple`.  This
+   class has the following additional read-only convenience attributes:
+
+   +------------------+-------+-------------------------+----------------------+
+   | Attribute        | Index | Value                   | Value if not present |
+   +==================+=======+=========================+======================+
+   | :attr:`url`      | 0     | URL with no fragment    | empty string         |
+   +------------------+-------+-------------------------+----------------------+
+   | :attr:`fragment` | 1     | Fragment identifier     | empty string         |
+   +------------------+-------+-------------------------+----------------------+
+
+   See section :ref:`urlparse-result-object` for more information on the result
+   object.
+
+   .. versionchanged:: 3.2
+      Result is a structured object rather than a simple 2-tuple
+
+
+Parsing ASCII Encoded Bytes
+---------------------------
+
+The URL parsing functions were originally designed to operate on character
+strings only. In practice, it is useful to be able to manipulate properly
+quoted and encoded URLs as sequences of ASCII bytes. Accordingly, the
+URL parsing functions in this module all operate on :class:`bytes` and
+:class:`bytearray` objects in addition to :class:`str` objects.
+
+If :class:`str` data is passed in, the result will also contain only
+:class:`str` data. If :class:`bytes` or :class:`bytearray` data is
+passed in, the result will contain only :class:`bytes` data.
+
+Attempting to mix :class:`str` data with :class:`bytes` or
+:class:`bytearray` in a single function call will result in a
+:exc:`TypeError` being raised, while attempting to pass in non-ASCII
+byte values will trigger :exc:`UnicodeDecodeError`.
+
+To support easier conversion of result objects between :class:`str` and
+:class:`bytes`, all return values from URL parsing functions provide
+either an :meth:`encode` method (when the result contains :class:`str`
+data) or a :meth:`decode` method (when the result contains :class:`bytes`
+data). The signatures of these methods match those of the corresponding
+:class:`str` and :class:`bytes` methods (except that the default encoding
+is ``'ascii'`` rather than ``'utf-8'``). Each produces a value of a
+corresponding type that contains either :class:`bytes` data (for
+:meth:`encode` methods) or :class:`str` data (for
+:meth:`decode` methods).
+
+Applications that need to operate on potentially improperly quoted URLs
+that may contain non-ASCII data will need to do their own decoding from
+bytes to characters before invoking the URL parsing functions.
+
+The behaviour described in this section applies only to the URL parsing
+functions. The URL quoting functions use their own rules when producing
+or consuming byte sequences as detailed in the documentation of the
+individual URL quoting functions.
+
+.. versionchanged:: 3.2
+   URL parsing functions now accept ASCII encoded byte sequences
+
+
+.. _urlparse-result-object:
+
+Structured Parse Results
+------------------------
+
+The result objects from the :func:`urlparse`, :func:`urlsplit` and
+:func:`urldefrag` functions are subclasses of the :class:`tuple` type.
+These subclasses add the attributes listed in the documentation for
+those functions, the encoding and decoding support described in the
+previous section, as well as an additional method:
+
+.. method:: urllib.parse.SplitResult.geturl()
+
+   Return the re-combined version of the original URL as a string. This may
+   differ from the original URL in that the scheme may be normalized to lower
+   case and empty components may be dropped. Specifically, empty parameters,
+   queries, and fragment identifiers will be removed.
+
+   For :func:`urldefrag` results, only empty fragment identifiers will be removed.
+   For :func:`urlsplit` and :func:`urlparse` results, all noted changes will be
+   made to the URL returned by this method.
+
+   The result of this method remains unchanged if passed back through the original
+   parsing function:
+
+      >>> from urllib.parse import urlsplit
+      >>> url = 'HTTP://www.Python.org/doc/#'
+      >>> r1 = urlsplit(url)
+      >>> r1.geturl()
+      'http://www.Python.org/doc/'
+      >>> r2 = urlsplit(r1.geturl())
+      >>> r2.geturl()
+      'http://www.Python.org/doc/'
+
+
+The following classes provide the implementations of the structured parse
+results when operating on :class:`str` objects:
+
+.. class:: DefragResult(url, fragment)
+
+   Concrete class for :func:`urldefrag` results containing :class:`str`
+   data. The :meth:`encode` method returns a :class:`DefragResultBytes`
+   instance.
+
+   .. versionadded:: 3.2
+
+.. class:: ParseResult(scheme, netloc, path, params, query, fragment)
+
+   Concrete class for :func:`urlparse` results containing :class:`str`
+   data. The :meth:`encode` method returns a :class:`ParseResultBytes`
+   instance.
+
+.. class:: SplitResult(scheme, netloc, path, query, fragment)
+
+   Concrete class for :func:`urlsplit` results containing :class:`str`
+   data. The :meth:`encode` method returns a :class:`SplitResultBytes`
+   instance.
+
+
+The following classes provide the implementations of the parse results when
+operating on :class:`bytes` or :class:`bytearray` objects:
+
+.. class:: DefragResultBytes(url, fragment)
+
+   Concrete class for :func:`urldefrag` results containing :class:`bytes`
+   data. The :meth:`decode` method returns a :class:`DefragResult`
+   instance.
+
+   .. versionadded:: 3.2
+
+.. class:: ParseResultBytes(scheme, netloc, path, params, query, fragment)
+
+   Concrete class for :func:`urlparse` results containing :class:`bytes`
+   data. The :meth:`decode` method returns a :class:`ParseResult`
+   instance.
+
+   .. versionadded:: 3.2
+
+.. class:: SplitResultBytes(scheme, netloc, path, query, fragment)
+
+   Concrete class for :func:`urlsplit` results containing :class:`bytes`
+   data. The :meth:`decode` method returns a :class:`SplitResult`
+   instance.
+
+   .. versionadded:: 3.2
+
+
+URL Quoting
+-----------
+
+The URL quoting functions focus on taking program data and making it safe
+for use as URL components by quoting special characters and appropriately
+encoding non-ASCII text. They also support reversing these operations to
+recreate the original data from the contents of a URL component if that
+task isn't already covered by the URL parsing functions above.
 
 .. function:: quote(string, safe='/', encoding=None, errors=None)
 
@@ -322,8 +485,7 @@ The :mod:`urllib.parse` module defines the following functions:
    If it is a :class:`str`, unescaped non-ASCII characters in *string*
    are encoded into UTF-8 bytes.
 
-   Example: ``unquote_to_bytes('a%26%EF')`` yields
-   ``b'a&\xef'``.
+   Example: ``unquote_to_bytes('a%26%EF')`` yields ``b'a&\xef'``.
 
 
 .. function:: urlencode(query, doseq=False, safe='', encoding=None, errors=None)
@@ -340,12 +502,13 @@ The :mod:`urllib.parse` module defines the following functions:
    the optional parameter *doseq* is evaluates to *True*, individual
    ``key=value`` pairs separated by ``'&'`` are generated for each element of
    the value sequence for the key.  The order of parameters in the encoded
-   string will match the order of parameter tuples in the sequence. This module
-   provides the functions :func:`parse_qs` and :func:`parse_qsl` which are used
-   to parse query strings into Python data structures.
+   string will match the order of parameter tuples in the sequence.
 
    When *query* parameter is a :class:`str`, the *safe*, *encoding* and *error*
-   parameters are sent the :func:`quote_plus` for encoding.
+   parameters are passed down to :func:`quote_plus` for encoding.
+
+   To reverse this encoding process, :func:`parse_qs` and :func:`parse_qsl` are
+   provided in this module to parse query strings into Python data structures.
 
    .. versionchanged:: 3.2
       Query parameter supports bytes and string objects.
@@ -376,57 +539,3 @@ The :mod:`urllib.parse` module defines the following functions:
 
    :rfc:`1738` - Uniform Resource Locators (URL)
       This specifies the formal syntax and semantics of absolute URLs.
-
-
-.. _urlparse-result-object:
-
-Results of :func:`urlparse` and :func:`urlsplit`
-------------------------------------------------
-
-The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
-subclasses of the :class:`tuple` type.  These subclasses add the attributes
-described in those functions, as well as provide an additional method:
-
-.. method:: ParseResult.geturl()
-
-   Return the re-combined version of the original URL as a string. This may differ
-   from the original URL in that the scheme will always be normalized to lower case
-   and empty components may be dropped. Specifically, empty parameters, queries,
-   and fragment identifiers will be removed.
-
-   The result of this method is a fixpoint if passed back through the original
-   parsing function:
-
-      >>> import urllib.parse
-      >>> url = 'HTTP://www.Python.org/doc/#'
-
-      >>> r1 = urllib.parse.urlsplit(url)
-      >>> r1.geturl()
-      'http://www.Python.org/doc/'
-
-      >>> r2 = urllib.parse.urlsplit(r1.geturl())
-      >>> r2.geturl()
-      'http://www.Python.org/doc/'
-
-
-The following classes provide the implementations of the parse results:
-
-.. class:: BaseResult
-
-   Base class for the concrete result classes.  This provides most of the
-   attribute definitions.  It does not provide a :meth:`geturl` method.  It is
-   derived from :class:`tuple`, but does not override the :meth:`__init__` or
-   :meth:`__new__` methods.
-
-
-.. class:: ParseResult(scheme, netloc, path, params, query, fragment)
-
-   Concrete class for :func:`urlparse` results.  The :meth:`__new__` method is
-   overridden to support checking that the right number of arguments are passed.
-
-
-.. class:: SplitResult(scheme, netloc, path, query, fragment)
-
-   Concrete class for :func:`urlsplit` results.  The :meth:`__new__` method is
-   overridden to support checking that the right number of arguments are passed.
-
diff --git a/Doc/whatsnew/3.2.rst b/Doc/whatsnew/3.2.rst
index 2d1f0efe1b..056e2facd2 100644
--- a/Doc/whatsnew/3.2.rst
+++ b/Doc/whatsnew/3.2.rst
@@ -573,6 +573,14 @@ New, Improved, and Deprecated Modules
   (Contributed by Rodolpho Eckhardt and Nick Coghlan, :issue:`10220`.)
 
 .. XXX: Mention inspect.getattr_static (Michael Foord)
+.. XXX: Mention urllib.parse changes
+          Issue 9873 (Nick Coghlan):
+            - ASCII byte sequence support in URL parsing
+            - named tuple for urldefrag return value
+          Issue 5468 (Dan Mahn) for urlencode:
+            - bytes input support
+            - non-UTF8 percent encoding of non-ASCII characters
+          Issue 2987 for IPv6 (RFC2732) support in urlparse
 
 Multi-threading
 ===============
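
Illustration (not part of the commit): the sketch below exercises the behaviour
described in the new "Parsing ASCII Encoded Bytes" and "Structured Parse
Results" sections. It assumes a Python 3 interpreter that includes this change
(3.2 or later). The URLs and variable names are made up for the example.

```python
from urllib.parse import (
    DefragResult, SplitResult, SplitResultBytes, urldefrag, urljoin, urlsplit,
)

URL = 'http://www.python.org/doc/?q=1#top'  # illustrative URL, not from the commit

# str in, str out; bytes in, bytes out.
text_result = urlsplit(URL)
bytes_result = urlsplit(URL.encode('ascii'))
assert isinstance(text_result, SplitResult)
assert isinstance(bytes_result, SplitResultBytes)
assert text_result.netloc == 'www.python.org'
assert bytes_result.netloc == b'www.python.org'

# encode()/decode() convert between the two result types (default 'ascii').
assert bytes_result.decode() == text_result
assert text_result.encode() == bytes_result

# Mixing str and bytes arguments in one call is rejected with TypeError.
try:
    urljoin('http://www.python.org/', b'doc/')
except TypeError:
    mixed_types_rejected = True
else:
    mixed_types_rejected = False

# Non-ASCII byte values are rejected with UnicodeDecodeError.
try:
    urlsplit(b'http://www.python.org/caf\xc3\xa9')
except UnicodeDecodeError:
    non_ascii_rejected = True
else:
    non_ascii_rejected = False

# urldefrag() returns a structured result that still unpacks like a 2-tuple.
defragged = urldefrag('http://www.python.org/doc/#intro')
assert isinstance(defragged, DefragResult)
url, fragment = defragged
```

Callers holding improperly quoted, non-ASCII byte data would need to decode it
themselves (for example with ``data.decode('utf-8')``) before calling the
parsing functions, as the documentation text above notes.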