From 11c014dd9fefd18303c037258aac4101bfbae340 Mon Sep 17 00:00:00 2001
From: Andrew Kuchling
Date: Wed, 19 Mar 2014 16:23:01 -0400
Subject: #13437: link to the source code for a few more modules

---
 Doc/library/codecs.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'Doc/library/codecs.rst')

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 3729dac8ae..36144e9ef2 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -7,6 +7,7 @@
 .. sectionauthor:: Marc-André Lemburg
 .. sectionauthor:: Martin v. Löwis

+**Source code:** :source:`Lib/codecs.py`

 .. index::
    single: Unicode
@@ -1418,4 +1419,3 @@ This module implements a variant of the UTF-8 codec: On encoding a UTF-8 encoded
 BOM will be prepended to the UTF-8 encoded bytes. For the stateful encoder this
 is only done once (on the first write to the byte stream). For decoding an
 optional UTF-8 encoded BOM at the start of the data will be skipped.
-
-- 
cgit v1.2.1

From db4ad80ec7958ad28530e990b75c8aff03e2bd38 Mon Sep 17 00:00:00 2001
From: Victor Stinner
Date: Fri, 1 Aug 2014 12:28:48 +0200
Subject: Issue #18395: Rename ``_Py_char2wchar()`` to :c:func:`Py_DecodeLocale`, rename ``_Py_wchar2char()`` to :c:func:`Py_EncodeLocale`, and document these functions.

---
 Doc/library/codecs.rst | 1 +
 1 file changed, 1 insertion(+)

(limited to 'Doc/library/codecs.rst')

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 36144e9ef2..4c2a023570 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -318,6 +318,7 @@ and writing to platform dependent files:
    encodings.


+.. _surrogateescape:
 .. _codec-base-classes:

 Codec Base Classes
-- 
cgit v1.2.1

From 11e25614459b076ba1c6b12d6a08cf491212d9db Mon Sep 17 00:00:00 2001
From: Serhiy Storchaka
Date: Tue, 25 Nov 2014 13:57:17 +0200
Subject: Issue #19676: Added the "namereplace" error handler.
---
 Doc/library/codecs.rst | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

(limited to 'Doc/library/codecs.rst')

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 4c2a023570..ea4c450de7 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -98,6 +98,8 @@ It defines the following functions:
      reference (for encoding only)
    * ``'backslashreplace'``: replace with backslashed escape sequences (for
      encoding only)
+   * ``'namereplace'``: replace with ``\N{...}`` escape sequences (for
+     encoding only)
    * ``'surrogateescape'``: on decoding, replace with code points in the Unicode
      Private Use Area ranging from U+DC80 to U+DCFF. These private code
      points will then be turned back into the same bytes when the
@@ -232,6 +234,11 @@ functions which use :func:`lookup` for the codec lookup:
    Implements the ``backslashreplace`` error handling (for encoding only): the
    unencodable character is replaced by a backslashed escape sequence.

+.. function:: namereplace_errors(exception)
+
+   Implements the ``namereplace`` error handling (for encoding only): the
+   unencodable character is replaced by a ``\N{...}`` escape sequence.
+
 To simplify working with encoded files or stream, the module also defines these
 utility functions:

@@ -363,6 +370,9 @@ and implemented by all standard Python codecs:
 | ``'backslashreplace'``  | Replace with backslashed escape sequences     |
 |                         | (only for encoding).                          |
 +-------------------------+-----------------------------------------------+
+| ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
+|                         | (only for encoding).                          |
++-------------------------+-----------------------------------------------+
 | ``'surrogateescape'``   | Replace byte with surrogate U+DCxx, as defined|
 |                         | in :pep:`383`.                                |
 +-------------------------+-----------------------------------------------+
@@ -384,6 +394,9 @@ schemes:
 .. versionchanged:: 3.4
    The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs.

+.. versionadded:: 3.4
+   The ``'namereplace'`` error handler.
+
 The set of allowed values can be extended via :meth:`register_error`.


@@ -477,6 +490,8 @@ define in order to be compatible with the Python codec registry.

    * ``'backslashreplace'`` Replace with backslashed escape sequences.

+   * ``'namereplace'`` Replace with ``\N{...}`` escape sequences.
+
    The *errors* argument will be assigned to an attribute of the same name.
    Assigning to this attribute makes it possible to switch between different error
    handling strategies during the lifetime of the :class:`IncrementalEncoder`
@@ -625,6 +640,8 @@ compatible with the Python codec registry.

    * ``'backslashreplace'`` Replace with backslashed escape sequences.

+   * ``'namereplace'`` Replace with ``\N{...}`` escape sequences.
+
    The *errors* argument will be assigned to an attribute of the same name.
    Assigning to this attribute makes it possible to switch between different error
    handling strategies during the lifetime of the :class:`StreamWriter` object.
-- 
cgit v1.2.1

From 552c9aa2fb119fa6c4215b8ddff0f70171d46b72 Mon Sep 17 00:00:00 2001
From: Berker Peksag
Date: Tue, 25 Nov 2014 18:59:20 +0200
Subject: Issue #19676: Tweak documentation a bit.

* Updated version info to 3.5
* Fixed a markup error
* Added a versionadded directive to namereplace_errors documentation
---
 Doc/library/codecs.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

(limited to 'Doc/library/codecs.rst')

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index ea4c450de7..06bce84b7b 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -239,6 +239,8 @@ functions which use :func:`lookup` for the codec lookup:
    Implements the ``namereplace`` error handling (for encoding only): the
    unencodable character is replaced by a ``\N{...}`` escape sequence.

+   .. versionadded:: 3.5
+
 To simplify working with encoded files or stream, the module also defines these
 utility functions:

@@ -394,7 +396,7 @@ schemes:
 .. versionchanged:: 3.4
    The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs.

-.. versionadded:: 3.4
+.. versionadded:: 3.5
    The ``'namereplace'`` error handler.

 The set of allowed values can be extended via :meth:`register_error`.
-- 
cgit v1.2.1

From 5a43a543a4fdc8af1dbe10a0719e2114a5d6a7a4 Mon Sep 17 00:00:00 2001
From: Nick Coghlan
Date: Wed, 7 Jan 2015 13:14:47 +1000
Subject: Issue #19548: clean up merge issues in codecs docs

Patch by Martin Panter to clean up some problems with the merge of the codecs docs changes from Python 3.4.
---
 Doc/library/codecs.rst | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

(limited to 'Doc/library/codecs.rst')

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 0227d9b963..b67e653c14 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -256,7 +256,6 @@ and writing to platform dependent files:
    encodings.


-.. _surrogateescape:
 .. _codec-base-classes:

 Codec Base Classes
@@ -273,6 +272,7 @@ implement the file protocols. Codec authors also need to define how the
 codec will handle encoding and decoding errors.


+.. _surrogateescape:
 .. _error-handlers:

 Error Handlers
@@ -319,7 +319,8 @@ The following error handlers are only applicable to
 |                         | :func:`backslashreplace_errors`.              |
 +-------------------------+-----------------------------------------------+
 | ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
-|                         | (only for encoding).                          |
+|                         | (only for encoding). Implemented in           |
+|                         | :func:`namereplace_errors`.                   |
 +-------------------------+-----------------------------------------------+
 | ``'surrogateescape'``   | On decoding, replace byte with individual     |
 |                         | surrogate code ranging from ``U+DC80`` to     |
@@ -422,7 +423,8 @@ functions:

 .. function:: namereplace_errors(exception)

-   Implements the ``namereplace`` error handling (for encoding only): the
+   Implements the ``'namereplace'`` error handling (for encoding with
+   :term:`text encodings <text encoding>` only): the
    unencodable character is replaced by a ``\N{...}`` escape sequence.

    .. versionadded:: 3.5
-- 
cgit v1.2.1

From 166e44f9d20cd9b84df33d93346ba6221d35b2c1 Mon Sep 17 00:00:00 2001
From: Georg Brandl
Date: Wed, 14 Jan 2015 08:26:30 +0100
Subject: Closes #23181: codepoint -> code point

---
 Doc/library/codecs.rst | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

(limited to 'Doc/library/codecs.rst')

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index b67e653c14..3510f69ebd 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -841,7 +841,7 @@ methods and attributes from the underlying stream.
 Encodings and Unicode
 ---------------------

-Strings are stored internally as sequences of codepoints in
+Strings are stored internally as sequences of code points in
 range ``0x0``-``0x10FFFF``. (See :pep:`393` for
 more details about the implementation.)
 Once a string object is used outside of CPU and memory, endianness
@@ -852,23 +852,23 @@ There are a variety of different text serialisation codecs, which are
 collectivity referred to as :term:`text encodings <text encoding>`.

 The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
-the codepoints 0-255 to the bytes ``0x0``-``0xff``, which means that a string
-object that contains codepoints above ``U+00FF`` can't be encoded with this
+the code points 0-255 to the bytes ``0x0``-``0xff``, which means that a string
+object that contains code points above ``U+00FF`` can't be encoded with this
 codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
 like the following (although the details of the error message may differ):
 ``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
 position 3: ordinal not in range(256)``.

 There's another group (the so called charmap encodings) that choose
-a different subset of all Unicode code points and how these codepoints are
+a different subset of all Unicode code points and how these code points are
 mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.

-All of these encodings can only encode 256 of the 1114112 codepoints
+All of these encodings can only encode 256 of the 1114112 code points
 defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as four consecutive bytes. There are two
+code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
-- 
cgit v1.2.1

From f7b3d343578c306d612261a925e648edd60daee6 Mon Sep 17 00:00:00 2001
From: Serhiy Storchaka
Date: Sun, 25 Jan 2015 22:56:57 +0200
Subject: Issue #22286: The "backslashreplace" error handlers now works with decoding and translating.

---
 Doc/library/codecs.rst | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

(limited to 'Doc/library/codecs.rst')

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 3510f69ebd..048f0e9167 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -314,8 +314,8 @@ The following error handlers are only applicable to
 |                         | reference (only for encoding). Implemented    |
 |                         | in :func:`xmlcharrefreplace_errors`.          |
 +-------------------------+-----------------------------------------------+
-| ``'backslashreplace'``  | Replace with backslashed escape sequences     |
-|                         | (only for encoding). Implemented in           |
+| ``'backslashreplace'``  | Replace with backslashed escape sequences.    |
+|                         | Implemented in                                |
 |                         | :func:`backslashreplace_errors`.              |
 +-------------------------+-----------------------------------------------+
 | ``'namereplace'``       | Replace with ``\N{...}`` escape sequences     |
@@ -350,6 +350,10 @@ In addition, the following error handler is specific to the given codecs:
 .. versionadded:: 3.5
    The ``'namereplace'`` error handler.

+.. versionchanged:: 3.5
+   The ``'backslashreplace'`` error handlers now works with decoding and
+   translating.
+
 The set of allowed values can be extended by registering a new named error
 handler:

@@ -417,9 +421,9 @@ functions:

 .. function:: backslashreplace_errors(exception)

-   Implements the ``'backslashreplace'`` error handling (for encoding with
-   :term:`text encodings <text encoding>` only): the
-   unencodable character is replaced by a backslashed escape sequence.
+   Implements the ``'backslashreplace'`` error handling (for
+   :term:`text encodings <text encoding>` only): malformed data is
+   replaced by a backslashed escape sequence.

 .. function:: namereplace_errors(exception)

-- 
cgit v1.2.1

From e1f8c740d420c781334995b7e6b7aa0c6e43e030 Mon Sep 17 00:00:00 2001
From: Serhiy Storchaka
Date: Tue, 12 May 2015 23:16:55 +0300
Subject: Issue #22682: Added support for the kz1048 encoding.
---
 Doc/library/codecs.rst | 4 ++++
 1 file changed, 4 insertions(+)

(limited to 'Doc/library/codecs.rst')

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 0430cb92a4..b3bd6af530 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -1162,6 +1162,10 @@ particular, the following variants typically exist:
 +-----------------+--------------------------------+--------------------------------+
 | koi8_u          |                                | Ukrainian                      |
 +-----------------+--------------------------------+--------------------------------+
+| kz1048          | kz_1048, strk1048_2002, rk1048 | Kazakh                         |
+|                 |                                |                                |
+|                 |                                | .. versionadded:: 3.5          |
++-----------------+--------------------------------+--------------------------------+
 | mac_cyrillic    | maccyrillic                    | Bulgarian, Byelorussian,       |
 |                 |                                | Macedonian, Russian, Serbian   |
 +-----------------+--------------------------------+--------------------------------+
-- 
cgit v1.2.1

From dd55d4dbde9f5d4ffe40d1f7f66c23b1108946ba Mon Sep 17 00:00:00 2001
From: Serhiy Storchaka
Date: Tue, 12 May 2015 23:24:19 +0300
Subject: Issue #22681: Added support for the koi8_t encoding.

---
 Doc/library/codecs.rst | 4 ++++
 1 file changed, 4 insertions(+)

(limited to 'Doc/library/codecs.rst')

diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index b3bd6af530..743ccba58f 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -1160,6 +1160,10 @@ particular, the following variants typically exist:
 +-----------------+--------------------------------+--------------------------------+
 | koi8_r          |                                | Russian                        |
 +-----------------+--------------------------------+--------------------------------+
+| koi8_t          |                                | Tajik                          |
+|                 |                                |                                |
+|                 |                                | .. versionadded:: 3.5          |
++-----------------+--------------------------------+--------------------------------+
 | koi8_u          |                                | Ukrainian                      |
 +-----------------+--------------------------------+--------------------------------+
 | kz1048          | kz_1048, strk1048_2002, rk1048 | Kazakh                         |
-- 
cgit v1.2.1
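
Reviewer note: the error-handler behaviour documented by the patches above can be checked with a short script. This is a minimal sketch, assuming Python 3.5 or later (the version in which these patches say ``'namereplace'`` was added and ``'backslashreplace'`` gained decoding support). The sample strings and variable names are illustrative and do not come from the patches.

```python
# Illustrative sample text: U+00E9 and U+2192 cannot be encoded as ASCII.
text = "caf\u00e9 \u2192 ok"

# 'namereplace' (encoding only): each unencodable character becomes \N{...}.
named = text.encode("ascii", errors="namereplace")

# 'backslashreplace' on encoding: unencodable characters become \xNN or \uNNNN.
escaped = text.encode("ascii", errors="backslashreplace")

# 'backslashreplace' on decoding (3.5+): undecodable bytes become \xNN.
raw = b"abc\xff"
decoded = raw.decode("utf-8", errors="backslashreplace")

# 'surrogateescape': undecodable bytes map to U+DC80..U+DCFF on decoding
# and are turned back into the same bytes on encoding.
lossless = raw.decode("utf-8", errors="surrogateescape")
roundtrip = lossless.encode("utf-8", errors="surrogateescape")

print(named)
print(escaped)
print(decoded)
print(roundtrip == raw)
```

Passing ``errors=`` by name to ``str.encode`` and ``bytes.decode`` keeps the example independent of the ``codecs.*_errors`` callback functions, which are the registry-level implementations of the same handlers.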