
==============
Input Encoding
==============

:Author: Günter Milde
:Discussions-To: docutils-develop@lists.sf.net
:Status: Draft
:Type: API
:Created: 2022-07-02
:Docutils-Version: > 0.19 (see `open issues`_)
:Replaces: undocumented behaviour in 0.19
:Resolution: None


Abstract
========

When the `input_encoding`_ setting is not specified, Docutils uses
a heuristic to determine or guess the source's encoding.

The actual behaviour is not documented and depends on the Python version.


Motivation
==========

  | Errors should never pass silently.
  | Unless explicitly silenced.

  -- ``import this``

The behaviour of Docutils when the `input_encoding`_ configuration
setting is kept at its default value ``None`` is currently suboptimal
and underdocumented.

If the source encoding cannot be determined from the source and
decoding the source with UTF-8 fails, the file is decoded with the
locale encoding or 'latin1' (hard-coded 2nd fallback),
even in `UTF-8 mode`_.

If the file is actually corrupt UTF-8 or in another encoding (UTF-16 or
a different legacy encoding), this leads to a character mix-up
(`mojibake`) without reporting an error.
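
A minimal sketch of this failure mode, using only Python's standard
codecs. The sample text and the variable names are illustrative
assumptions and are not taken from Docutils:

.. code:: python

    # Greek sample text stored in a legacy 8-bit encoding.
    text = "αβγ"
    data = text.encode("iso8859-7")      # b'\xe1\xe2\xe3'

    # Decoding as UTF-8 fails, as it should ...
    try:
        data.decode("utf-8")
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # ... but "latin1" maps every byte to some character, so this
    # fallback can never fail and silently returns the wrong text.
    fallback = data.decode("latin1")     # 'áâã' instead of 'αβγ'

Because "latin1" accepts any byte sequence, a last-resort fallback to
it cannot raise an error.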


Rationale
=========

The "hard coded" second fallback encoding "latin1" may have been
practical at a time when "latin1" was the most commonly used
encoding for text files. It is far from optimal today, when using
legacy 8-bit encodings without specifying them (via `input_encoding`_ or
in the document) can be considered an error.

The same holds for using a non-UTF-8 locale encoding as fallback when
Python's `UTF-8 mode`_ is active.


Specification
=============

.. Describe the syntax and semantics of any new feature.

The encoding of a reStructuredText source is determined from the
`input_encoding`_ setting or an `explicit encoding declaration`_
(BOM or special comment).

The default input encoding is UTF-8 (codec 'utf-8-sig').

If the encoding is unspecified and decoding with UTF-8 fails,
the `preferred encoding`_ is used as a fallback
(if it maps to a valid codec and differs from UTF-8).


Differences from the default behaviour of Python's `open()`:

- The UTF-8 encoding is always tried first.
  (This is almost sure to fail if the actual source encoding differs.)

- An `explicit encoding declaration`_ takes precedence over
  the `preferred encoding`_.

- An optional BOM_ is removed from UTF-8 encoded sources.

.. _preferred encoding:
   https://docs.python.org/3/library/locale.html#locale.getpreferredencoding
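
The decoding order described above could be sketched as follows. This
is not the Docutils implementation: ``decode_source`` and its
parameters are hypothetical names, ``declared_encoding`` stands for the
result of an explicit encoding declaration, and giving the setting
precedence over the declaration is an assumption of this sketch.

.. code:: python

    import codecs
    import locale


    def decode_source(data, input_encoding=None, declared_encoding=None):
        """Decode `data` (bytes) to str (hypothetical helper)."""
        # 1. A setting or an explicit declaration decides the encoding.
        #    (Assumption: the setting wins if both are given.)
        encoding = input_encoding or declared_encoding
        if encoding:
            return data.decode(encoding)
        # 2. Default: UTF-8, dropping an optional BOM.
        try:
            return data.decode("utf-8-sig")
        except UnicodeDecodeError as utf8_error:
            # 3. Fallback: the preferred (locale) encoding, but only if
            #    it maps to a valid codec and differs from UTF-8.
            fallback = locale.getpreferredencoding(False)
            try:
                codec_name = codecs.lookup(fallback).name
            except (LookupError, TypeError):
                raise utf8_error
            if codec_name in ("utf-8", "utf-8-sig"):
                raise utf8_error
            return data.decode(fallback)

With a UTF-8 locale or in UTF-8 mode, step 3 re-raises the original
error instead of guessing.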

Explicit encoding declaration
-----------------------------

A `Unicode byte order mark` (BOM_) in the source is interpreted as
an encoding declaration.

The encoding of a reStructuredText source file can also be given by a
"magic comment" similar to :PEP:`263`.
This makes the input encoding both *visible* and *changeable*
on a per-source file basis.

To declare the input encoding, a comment like ::

  .. text encoding: <encoding name>

must be placed in the source file as either the first or the second line.

Examples: (using formats recognized by popular editors) ::

  .. -*- mode: rst -*-
     -*- coding: latin1 -*-

or::

  .. vim: set fileencoding=cp737 :

More precisely, the first and second line are searched for the following
regular expression::

  coding[:=]\s*([-\w.]+)

The first group of this expression is then interpreted as encoding name.
If the first line matches, the second line is ignored.
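
A minimal sketch of this detection, working on the raw bytes of the
source. The function name ``declared_encoding``, the ``BOMS`` table,
and the restriction to UTF-8 and UTF-16 byte order marks are
assumptions made for illustration, not the Docutils implementation:

.. code:: python

    import codecs
    import re

    # The regular expression given above, compiled for bytes.
    CODING_SLUG = re.compile(rb"coding[:=]\s*([-\w.]+)")

    # Assumption: only these byte order marks are checked here.
    BOMS = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )


    def declared_encoding(data):
        """Return the encoding declared in `data` (bytes), or None."""
        for bom, encoding in BOMS:
            if data.startswith(bom):
                return encoding
        # Search the first line, then the second; the first match wins.
        for line in data.splitlines()[:2]:
            match = CODING_SLUG.search(line)
            if match:
                return match.group(1).decode("ascii")
        return None

Applied to the examples above, the Emacs-style comment would yield
``"latin1"`` (from its second line) and the Vim modeline ``"cp737"``.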


Backwards Compatibility
=======================

.. Describe potential impact and severity on pre-existing code.

The following incompatible changes are expected:

- Only remove BOM (U+FEFF ZWNBSP at start of data), no other ZWNBSPs.

  This is the behaviour known from the "utf-8-sig" and "utf-16" codecs_.

- Raise UnicodeError (instead of decoding with 'latin1') if decoding the
  source with UTF-8 fails and the locale encoding is either not set or
  is UTF-8.

- Raise UnicodeError (instead of decoding with the locale encoding)
  if Python is started in `UTF-8 mode`_.
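
The BOM handling named in the first item can be observed directly with
Python's standard codecs. The byte string below is a made-up example:

.. code:: python

    # UTF-8 bytes for: BOM, "a", U+FEFF (ZWNBSP), "b"
    data = b"\xef\xbb\xbfa\xef\xbb\xbfb"

    plain = data.decode("utf-8")        # keeps both U+FEFF characters
    cleaned = data.decode("utf-8-sig")  # drops only the leading BOM

Here ``plain`` is ``"\ufeffa\ufeffb"`` and ``cleaned`` is
``"a\ufeffb"``: the ZWNBSP inside the text survives.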


Security Implications
=====================

No security implications are expected.


How to Teach This
=================

* Document the specification_.

* Document the `special comment`.

* Recommend specifying the source encoding (via `input_encoding`_ or
  with BOM or special comment), especially if it is not UTF-8.

* "To avoid erroneous application of a locale encoding but keep
  detection of an `explicit encoding declaration`_ in the source,
  start Python in `UTF-8 mode`_."


Reference Implementation
========================

.. Link to any existing implementation and details about its state, e.g.
   proof-of-concept.


Rejected Ideas
==============

.. Why certain ideas that were brought up while discussing this proposal
   were not ultimately pursued.


Open Issues
===========

.. Any points that are still being decided/discussed.

* When shall we implement the incompatible API changes?

  - 2 minor versions after announcing.

  - Faster/immediately, because the current behaviour is a bug.

* Change the default `input_encoding`_ value to "UTF-8"?

* Keep the auto-detection (as opt-in or as default)?

  | +1  convenient for users with differently encoded sources
  | -1  complicates code

  Adherence to an encoding-specification in the source (BOM or "magic
  comment") remains the default behaviour for Python source code:

    An explicit encoding declaration takes precedence over the default.

    -- :PEP:`3120`


Better feedback

* Warning or Error, when `input_encoding` value differs from the
  encoding declared in the source (BOM or special comment).

* Info or Warning, when using the "locale" fallback encoding.

* More helpful report of UnicodeDecodeError

  - Hint at "input_encoding_error_handler"?

  - Report the line number of the undecodable character.

  - Print context around the undecodable character
    (decode with "replace" error handler, print slice around error)?
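
One possible shape for such a report, as a sketch only:
``describe_decode_error`` is a hypothetical helper and not part of
Docutils. Counting ``\n`` bytes assumes UTF-8 or an 8-bit encoding, and
the slice may cut a multi-byte character at its edges.

.. code:: python

    def describe_decode_error(data, error, context=20):
        """Return a hint for a UnicodeDecodeError raised for `data`."""
        # Line number: count the newlines before the offending byte.
        line = data.count(b"\n", 0, error.start) + 1
        # Context: decode a slice around the error with "replace",
        # so that undecodable bytes show up as U+FFFD.
        begin = max(0, error.start - context)
        snippet = data[begin:error.end + context].decode(
            error.encoding, errors="replace")
        return f"line {line}: {error.reason}; context: {snippet!r}"


    data = b"first line\nsecond \xff line\n"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        message = describe_decode_error(data, err)

For this sample, ``message`` reports line 2 together with the text
surrounding the undecodable byte.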


References
==========

`<input-encoding-tests.py>`_
  Script for the exploration of the handling of input encoding
  in Python and Docutils.

Patches #194 Deprecate PEP 263 coding slugs support
  https://sourceforge.net/p/docutils/patches/194/

.. _input_encoding:
    https://docutils.sourceforge.io/docs/user/config.html#input-encoding

.. _UTF-8 mode: https://docs.python.org/3/library/os.html#utf8-mode
.. _codecs:
    https://docs.python.org/3/library/codecs.html#encodings-and-unicode
.. _BOM: https://docs.python.org/3/library/codecs.html#codecs.BOM


Copyright
=========

This document is placed in the public domain or under the
CC0-1.0-Universal license, whichever is more permissive.


..
    Local Variables:
    mode: indented-text
    indent-tabs-mode: nil
    sentence-end-double-space: t
    fill-column: 70
    coding: utf-8
    End: