Diffstat (limited to 'Doc/library/codecs.rst')
-rw-r--r-- Doc/library/codecs.rst | 208
1 files changed, 142 insertions, 66 deletions
diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 762bb9821e..009ae26c99 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -22,6 +22,25 @@ manages the codec and error handling lookup process.
It defines the following functions:
+.. function:: encode(obj, encoding='utf-8', errors='strict')
+
+ Encodes *obj* using the codec registered for *encoding*.
+
+ The *errors* argument may be given to set the desired error handling scheme.
+ The default error handler is ``'strict'``, meaning that encoding errors raise
+ :exc:`ValueError` (or a more codec-specific subclass, such as
+ :exc:`UnicodeEncodeError`). Refer to :ref:`codec-base-classes` for more
+ information on codec error handling.
+
+.. function:: decode(obj, encoding='utf-8', errors='strict')
+
+ Decodes *obj* using the codec registered for *encoding*.
+
+ The *errors* argument may be given to set the desired error handling scheme.
+ The default error handler is ``'strict'``, meaning that decoding errors raise
+ :exc:`ValueError` (or a more codec-specific subclass, such as
+ :exc:`UnicodeDecodeError`). Refer to :ref:`codec-base-classes` for more
+ information on codec error handling.
.. function:: register(search_function)
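A minimal sketch of the ``codecs.encode`` and ``codecs.decode`` functions documented in the hunk above, assuming a Python version that includes them. The sample string and the variable names are illustrative only.

```python
import codecs

# Text -> bytes and back, using the default 'strict' error handler.
encoded = codecs.encode("caf\u00e9", "utf-8")
decoded = codecs.decode(encoded, "utf-8")

# With 'strict', unencodable input raises UnicodeEncodeError,
# which is a subclass of ValueError.
try:
    codecs.encode("caf\u00e9", "ascii")
    strict_failed = False
except ValueError as exc:
    strict_failed = isinstance(exc, UnicodeEncodeError)

# A different error handling scheme can be selected through *errors*.
replaced = codecs.encode("caf\u00e9", "ascii", errors="replace")
```

Catching ``ValueError`` rather than ``UnicodeEncodeError`` mirrors the wording of the documentation: the subclass is what the text codecs raise in practice.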
@@ -46,9 +65,9 @@ It defines the following functions:
The various functions or classes take the following arguments:
*encode* and *decode*: These must be functions or methods which have the same
- interface as the :meth:`encode`/:meth:`decode` methods of Codec instances (see
- Codec Interface). The functions/methods are expected to work in a stateless
- mode.
+ interface as the :meth:`~Codec.encode`/:meth:`~Codec.decode` methods of Codec
+ instances (see :ref:`Codec Interface <codec-objects>`). The functions/methods
+ are expected to work in a stateless mode.
*incrementalencoder* and *incrementaldecoder*: These have to be factory
functions providing the following interface:
@@ -65,7 +84,7 @@ It defines the following functions:
``factory(stream, errors='strict')``
The factory functions must return objects providing the interfaces defined by
- the base classes :class:`StreamWriter` and :class:`StreamReader`, respectively.
+ the base classes :class:`StreamReader` and :class:`StreamWriter`, respectively.
Stream codecs can maintain state.
Possible values for errors are
@@ -78,7 +97,11 @@ It defines the following functions:
reference (for encoding only)
* ``'backslashreplace'``: replace with backslashed escape sequences (for
encoding only)
- * ``'surrogateescape'``: replace with surrogate U+DCxx, see :pep:`383`
+ * ``'surrogateescape'``: on decoding, replace undecodable bytes with lone
+ surrogate code points ranging from U+DC80 to U+DCFF. These surrogate code
+ points will then be turned back into the same bytes when the
+ ``'surrogateescape'`` error handler is used when encoding the data.
+ (See :pep:`383` for more.)
as well as any other error handling name defined via :func:`register_error`.
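The round trip described for ``'surrogateescape'`` in the hunk above can be sketched as follows. The input bytes are an arbitrary example of data that is not valid UTF-8.

```python
# Bytes 0xFF and 0x80 cannot be decoded as UTF-8 on their own.
raw = b"abc\xff\x80"

# Each undecodable byte 0xNN becomes the code point U+DCNN.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
restored = text.encode("utf-8", errors="surrogateescape")

# Without the handler, the escaped code points cannot be encoded.
try:
    text.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True
```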
@@ -155,13 +178,16 @@ functions which use :func:`lookup` for the codec lookup:
when *name* is specified as the errors parameter.
For encoding *error_handler* will be called with a :exc:`UnicodeEncodeError`
- instance, which contains information about the location of the error. The error
- handler must either raise this or a different exception or return a tuple with a
- replacement for the unencodable part of the input and a position where encoding
- should continue. The encoder will encode the replacement and continue encoding
- the original input at the specified position. Negative position values will be
- treated as being relative to the end of the input string. If the resulting
- position is out of bound an :exc:`IndexError` will be raised.
+ instance, which contains information about the location of the error. The
+ error handler must either raise this or a different exception or return a
+ tuple with a replacement for the unencodable part of the input and a position
+ where encoding should continue. The replacement may be either :class:`str` or
+ :class:`bytes`. If the replacement is bytes, the encoder will simply copy
+ them into the output buffer. If the replacement is a string, the encoder will
+ encode the replacement. Encoding then continues on the original input at the
+ specified position. Negative position values will be treated as being
+ relative to the end of the input string. If the resulting position is out of
+ bounds, an :exc:`IndexError` will be raised.
Decoding and translating work similarly, except that :exc:`UnicodeDecodeError` or
:exc:`UnicodeTranslateError` will be passed to the handler and that the
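As an illustration of the encoding case described in the hunk above, here is a sketch of an error handler that returns a :class:`bytes` replacement. The handler name ``hex_placeholder`` and its output format are hypothetical and not part of the standard library; the ``bytes`` ``%`` formatting it uses assumes Python 3.5 or later.

```python
import codecs

def hex_placeholder(exc):
    """Hypothetical handler: emit b'[U+XXXX]' for each unencodable character."""
    if not isinstance(exc, UnicodeEncodeError):
        # This sketch only covers the encoding case.
        raise exc
    bad = exc.object[exc.start:exc.end]
    # A bytes replacement is copied into the output without being encoded.
    replacement = b"".join(b"[U+%04X]" % ord(ch) for ch in bad)
    # Resume encoding right after the unencodable span.
    return replacement, exc.end

codecs.register_error("hex_placeholder", hex_placeholder)
result = "price: 5\u20ac".encode("ascii", errors="hex_placeholder")
```

Returning a :class:`str` instead would cause the encoder to encode the replacement itself, so the replacement would then have to be representable in the target encoding.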
@@ -307,11 +333,13 @@ implement the file protocols.
The :class:`Codec` class defines the interface for stateless encoders/decoders.
-To simplify and standardize error handling, the :meth:`encode` and
-:meth:`decode` methods may implement different error handling schemes by
+To simplify and standardize error handling, the :meth:`~Codec.encode` and
+:meth:`~Codec.decode` methods may implement different error handling schemes by
providing the *errors* string argument. The following string values are defined
and implemented by all standard Python codecs:
+.. tabularcolumns:: |l|L|
+
+-------------------------+-----------------------------------------------+
| Value                   | Meaning                                       |
+=========================+===============================================+
@@ -400,12 +428,14 @@ interfaces of the stateless encoder and decoder:
The :class:`IncrementalEncoder` and :class:`IncrementalDecoder` classes provide
the basic interface for incremental encoding and decoding. Encoding/decoding the
input isn't done with one call to the stateless encoder/decoder function, but
-with multiple calls to the :meth:`encode`/:meth:`decode` method of the
-incremental encoder/decoder. The incremental encoder/decoder keeps track of the
-encoding/decoding process during method calls.
-
-The joined output of calls to the :meth:`encode`/:meth:`decode` method is the
-same as if all the single inputs were joined into one, and this input was
+with multiple calls to the
+:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method of
+the incremental encoder/decoder. The incremental encoder/decoder keeps track of
+the encoding/decoding process during method calls.
+
+The joined output of calls to the
+:meth:`~IncrementalEncoder.encode`/:meth:`~IncrementalDecoder.decode` method is
+the same as if all the single inputs were joined into one, and this input was
encoded/decoded with the stateless encoder/decoder.
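The equivalence stated in the hunk above (joined incremental output equals the stateless result) can be checked with a short sketch. Feeding the decoder one byte at a time is an arbitrary choice that forces multi-byte sequences to be split across calls.

```python
import codecs

data = "h\u00e9llo \u20ac".encode("utf-8")

decoder = codecs.getincrementaldecoder("utf-8")()
# Feed one byte per call; incomplete sequences are buffered by the decoder.
pieces = [decoder.decode(data[i:i + 1]) for i in range(len(data))]
# Signal the end of input so that a truncated sequence would be reported.
pieces.append(decoder.decode(b"", final=True))

joined = "".join(pieces)
stateless = data.decode("utf-8")
```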
@@ -458,7 +488,8 @@ define in order to be compatible with the Python codec registry.
.. method:: reset()
- Reset the encoder to the initial state.
+ Reset the encoder to the initial state. Any pending output is discarded: call
+ ``.encode('', final=True)`` instead to reset the encoder and to get the output.
.. method:: IncrementalEncoder.getstate()
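A small sketch of encoder state and ``reset()``, using the ``utf-8-sig`` incremental encoder as an assumed example of a stateful encoder: it emits the byte order mark only on its first call. Other codecs may keep different state or buffer output.

```python
import codecs

encoder = codecs.getincrementalencoder("utf-8-sig")()

first = encoder.encode("a")              # first call: BOM followed by b'a'
second = encoder.encode("b")             # later calls: no BOM
tail = encoder.encode("", final=True)    # flush whatever is still pending

# After reset() the encoder behaves like a fresh instance again.
encoder.reset()
after_reset = encoder.encode("c")
```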
@@ -684,7 +715,7 @@ compatible with the Python codec registry.
Read one line from the input stream and return the decoded data.
*size*, if given, is passed as size argument to the stream's
- :meth:`readline` method.
+ :meth:`read` method.
If *keepends* is false line-endings will be stripped from the lines
returned.
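For the ``readline`` behaviour touched by the hunk above, a minimal sketch with an in-memory stream; the sample text is arbitrary.

```python
import codecs
import io

raw = io.BytesIO("first\nsecond\n".encode("utf-8"))
reader = codecs.getreader("utf-8")(raw)

line_with_end = reader.readline()                 # line ending kept by default
line_stripped = reader.readline(keepends=False)   # line ending removed
```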
@@ -786,11 +817,9 @@ methods and attributes from the underlying stream.
Encodings and Unicode
---------------------
-Strings are stored internally as sequences of codepoints (to be precise
-as :c:type:`Py_UNICODE` arrays). Depending on the way Python is compiled (either
-via ``--without-wide-unicode`` or ``--with-wide-unicode``, with the
-former being the default) :c:type:`Py_UNICODE` is either a 16-bit or 32-bit data
-type. Once a string object is used outside of CPU and memory, CPU endianness
+Strings are stored internally as sequences of code points in the range
+``0x0`` to ``0x10FFFF`` (see :pep:`393` for more details about the implementation).
+Once a string object is used outside of CPU and memory, CPU endianness
and how these arrays are stored as bytes become an issue. Transforming a
string object into a sequence of bytes is called encoding and recreating the
string object from the sequence of bytes is known as decoding. There are many
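To make the byte-order point in the hunk above concrete, this sketch encodes the same two code points with the little-endian and big-endian UTF-16 codecs. The sample text is arbitrary.

```python
text = "A\u20ac"  # U+0041 and U+20AC

little = text.encode("utf-16-le")  # low byte of each 16-bit unit first
big = text.encode("utf-16-be")     # high byte of each 16-bit unit first

# Decoding with the matching codec recovers the string;
# decoding with the wrong byte order yields different code points.
round_trip = big.decode("utf-16-be")
mismatched = little.decode("utf-16-be")
```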
@@ -901,6 +930,15 @@ is meant to be exhaustive. Notice that spelling alternatives that only differ in
case or use a hyphen instead of an underscore are also valid aliases; therefore,
e.g. ``'utf-8'`` is a valid alias for the ``'utf_8'`` codec.
+.. impl-detail::
+
+ Some common encodings can bypass the codecs lookup machinery to
+ improve performance. These optimization opportunities are only
+ recognized by CPython for a limited set of aliases: utf-8, utf8,
+ latin-1, latin1, iso-8859-1, mbcs (Windows only), ascii, utf-16,
+ and utf-32. Using alternative spellings for these encodings may
+ result in slower execution.
+
Many of the character sets support the same languages. They vary in individual
characters (e.g. whether the EURO SIGN is supported or not), and in the
assignment of characters to code positions. For the European languages in
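The alias rule mentioned in the hunk above (case and hyphen/underscore variants name the same codec) can be observed through ``codecs.lookup``. This sketch only shows that the spellings resolve to one codec; the CPython fast path itself is an implementation detail and is not visible this way.

```python
import codecs

spellings = ("utf-8", "UTF8", "utf_8", "U8")
# Map each spelling to the canonical name reported by the codec registry.
resolved = {alias: codecs.lookup(alias).name for alias in spellings}

latin_name = codecs.lookup("latin-1").name
```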
@@ -915,6 +953,8 @@ particular, the following variants typically exist:
* an IBM PC code page, which is ASCII compatible
+.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|
+
+-----------------+--------------------------------+--------------------------------+
| Codec           | Aliases                        | Languages                      |
+=================+================================+================================+
@@ -1003,6 +1043,11 @@ particular, the following variants typically exist:
+-----------------+--------------------------------+--------------------------------+
| cp1258          | windows-1258                   | Vietnamese                     |
+-----------------+--------------------------------+--------------------------------+
+| cp65001         |                                | Windows only: Windows UTF-8    |
+|                 |                                | (``CP_UTF8``)                  |
+|                 |                                |                                |
+|                 |                                | .. versionadded:: 3.3          |
++-----------------+--------------------------------+--------------------------------+
| euc_jp          | eucjp, ujis, u-jis             | Japanese                       |
+-----------------+--------------------------------+--------------------------------+
| euc_jis_2004    | jisx0213, eucjis2004           | Japanese                       |
@@ -1122,7 +1167,21 @@ particular, the following variants typically exist:
| utf_8_sig       |                                | all languages                  |
+-----------------+--------------------------------+--------------------------------+
-.. XXX fix here, should be in above table
+Python Specific Encodings
+-------------------------
+
+A number of predefined codecs are specific to Python, so their codec names have
+no meaning outside Python. These are listed in the tables below based on the
+expected input and output types (note that while text encodings are the most
+common use case for codecs, the underlying codec infrastructure supports
+arbitrary data transforms rather than just text encodings). For asymmetric
+codecs, the stated purpose describes the encoding direction.
+
+The following codecs provide :class:`str` to :class:`bytes` encoding and
+:term:`bytes-like object` to :class:`str` decoding, similar to the Unicode text
+encodings.
+
+.. tabularcolumns:: |l|p{0.3\linewidth}|p{0.3\linewidth}|
+--------------------+---------+---------------------------+
| Codec              | Aliases | Purpose                   |
@@ -1160,45 +1219,61 @@ particular, the following variants typically exist:
| unicode_internal   |         | Return the internal       |
|                    |         | representation of the     |
|                    |         | operand                   |
+|                    |         |                           |
+|                    |         | .. deprecated:: 3.3       |
+--------------------+---------+---------------------------+
-The following codecs provide bytes-to-bytes mappings.
-
-+--------------------+---------------------------+---------------------------+
-| Codec              | Aliases                   | Purpose                   |
-+====================+===========================+===========================+
-| base64_codec       | base64, base-64           | Convert operand to MIME   |
-|                    |                           | base64                    |
-+--------------------+---------------------------+---------------------------+
-| bz2_codec          | bz2                       | Compress the operand      |
-|                    |                           | using bz2                 |
-+--------------------+---------------------------+---------------------------+
-| hex_codec          | hex                       | Convert operand to        |
-|                    |                           | hexadecimal               |
-|                    |                           | representation, with two  |
-|                    |                           | digits per byte           |
-+--------------------+---------------------------+---------------------------+
-| quopri_codec       | quopri, quoted-printable, | Convert operand to MIME   |
-|                    | quotedprintable           | quoted printable          |
-+--------------------+---------------------------+---------------------------+
-| uu_codec           | uu                        | Convert the operand using |
-|                    |                           | uuencode                  |
-+--------------------+---------------------------+---------------------------+
-| zlib_codec         | zip, zlib                 | Compress the operand      |
-|                    |                           | using gzip                |
-+--------------------+---------------------------+---------------------------+
-
-The following codecs provide string-to-string mappings.
-
-+--------------------+---------------------------+---------------------------+
-| Codec              | Aliases                   | Purpose                   |
-+====================+===========================+===========================+
-| rot_13             | rot13                     | Returns the Caesar-cypher |
-|                    |                           | encryption of the operand |
-+--------------------+---------------------------+---------------------------+
+The following codecs provide :term:`bytes-like object` to :class:`bytes`
+mappings.
+
+
+.. tabularcolumns:: |l|L|L|
+
++----------------------+---------------------------+------------------------------+
+| Codec                | Purpose                   | Encoder/decoder              |
++======================+===========================+==============================+
+| base64_codec [#b64]_ | Convert operand to MIME   | :meth:`base64.b64encode`,    |
+|                      | base64 (the result always | :meth:`base64.b64decode`     |
+|                      | includes a trailing       |                              |
+|                      | ``'\n'``)                 |                              |
++----------------------+---------------------------+------------------------------+
+| bz2_codec            | Compress the operand      | :meth:`bz2.compress`,        |
+|                      | using bz2                 | :meth:`bz2.decompress`       |
++----------------------+---------------------------+------------------------------+
+| hex_codec            | Convert operand to        | :meth:`base64.b16encode`,    |
+|                      | hexadecimal               | :meth:`base64.b16decode`     |
+|                      | representation, with two  |                              |
+|                      | digits per byte           |                              |
++----------------------+---------------------------+------------------------------+
+| quopri_codec         | Convert operand to MIME   | :meth:`quopri.encodestring`, |
+|                      | quoted printable          | :meth:`quopri.decodestring`  |
++----------------------+---------------------------+------------------------------+
+| uu_codec             | Convert the operand using | :meth:`uu.encode`,           |
+|                      | uuencode                  | :meth:`uu.decode`            |
++----------------------+---------------------------+------------------------------+
+| zlib_codec           | Compress the operand      | :meth:`zlib.compress`,       |
+|                      | using gzip                | :meth:`zlib.decompress`      |
++----------------------+---------------------------+------------------------------+
+
+.. [#b64] Rather than accepting any :term:`bytes-like object`,
+   ``'base64_codec'`` accepts only :class:`bytes` and :class:`bytearray` for
+   encoding and only :class:`bytes`, :class:`bytearray`, and ASCII-only
+   instances of :class:`str` for decoding.
+
+
+The following codecs provide :class:`str` to :class:`str` mappings.
+
+.. tabularcolumns:: |l|L|
+
++--------------------+---------------------------+
+| Codec              | Purpose                   |
++====================+===========================+
+| rot_13             | Returns the Caesar-cypher |
+|                    | encryption of the operand |
++--------------------+---------------------------+
.. versionadded:: 3.2
- bytes-to-bytes and string-to-string codecs.
+ bytes-to-bytes and str-to-str codecs.
:mod:`encodings.idna` --- Internationalized Domain Names in Applications
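The bytes-to-bytes and str-to-str codecs listed in the hunk above are not text encodings, so this sketch reaches them through ``codecs.encode`` and ``codecs.decode`` rather than ``str.encode``. The inputs are arbitrary sample data.

```python
import codecs

# bytes-like object -> bytes transforms
hex_bytes = codecs.encode(b"hi", "hex_codec")
b64_bytes = codecs.encode(b"hi", "base64_codec")   # note the trailing newline
packed = codecs.encode(b"a" * 100, "zlib_codec")
unpacked = codecs.decode(packed, "zlib_codec")

# str -> str transform
rotated = codecs.encode("Hello", "rot_13")
restored = codecs.decode(rotated, "rot_13")
```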
@@ -1272,12 +1347,13 @@ functions can be used directly if desired.
.. module:: encodings.mbcs
:synopsis: Windows ANSI codepage
-Encode operand according to the ANSI codepage (CP_ACP). This codec only
-supports ``'strict'`` and ``'replace'`` error handlers to encode, and
-``'strict'`` and ``'ignore'`` error handlers to decode.
+Encode operand according to the ANSI codepage (CP_ACP).
Availability: Windows only.
+.. versionchanged:: 3.3
+ Support any error handler.
+
.. versionchanged:: 3.2
Before 3.2, the *errors* argument was ignored; ``'replace'`` was always used
to encode, and ``'ignore'`` to decode.