author milde <milde@929543f6-e4f2-0310-98a6-ba3bd3dd1d04> 2022-07-06 13:59:57 +0000
committer milde <milde@929543f6-e4f2-0310-98a6-ba3bd3dd1d04> 2022-07-06 13:59:57 +0000
commit b1a4a86e1548d19aa9fbdb0c75e15cff07800a8c
tree 7479b1174f5cfd88eb56774b07ed0cb042747a93
parent 5938461c64dae3ac01029d3fc835dab8d04f019c
Documentation updates
Document future changes regarding front-end tools.
Clarify/update planned changes to the input-encoding handling.
Various smaller fixes. Add links.

git-svn-id: https://svn.code.sf.net/p/docutils/code/trunk@9107 929543f6-e4f2-0310-98a6-ba3bd3dd1d04
-rw-r--r-- docutils/HISTORY.txt 8
-rw-r--r-- docutils/README.txt 18
-rw-r--r-- docutils/RELEASE-NOTES.txt 71
-rw-r--r-- docutils/docs/api/publisher.txt 8
-rw-r--r-- docutils/docs/ref/rst/directives.txt 26
-rw-r--r-- docutils/docs/user/config.txt 21
-rw-r--r-- docutils/docs/user/html.txt 2
-rw-r--r-- docutils/docs/user/links.txt 14
-rw-r--r-- docutils/docs/user/tools.txt 36
-rwxr-xr-x docutils/docutils/__main__.py 4
-rw-r--r-- sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt 165
11 files changed, 230 insertions, 143 deletions
diff --git a/docutils/HISTORY.txt b/docutils/HISTORY.txt
index c3032c34b..b8acc43ed 100644
--- a/docutils/HISTORY.txt
+++ b/docutils/HISTORY.txt
@@ -138,9 +138,13 @@ Release 0.19 (2022-07-05)
- Fix help output.
- Actual code moved to docutils.__main__.py.
+* tools/rst2odt_prepstyles.py
-Release 0.18.1
-==============
+ - Options ``-h`` and ``--help`` print short usage message.
+
+
+Release 0.18.1 (2021-11-23)
+===========================
* docutils/nodes.py
diff --git a/docutils/README.txt b/docutils/README.txt
index 095cb0a66..795536d33 100644
--- a/docutils/README.txt
+++ b/docutils/README.txt
@@ -30,7 +30,7 @@ This is for those who want to get up & running quickly.
Try for example::
rst2html.py FAQ.txt FAQ.html (Unix)
- python tools/rst2html.py FAQ.txt FAQ.html (Windows)
+ docutils FAQ.txt FAQ.html (Unix and Windows)
See Usage_ below for details.
@@ -47,7 +47,8 @@ following sources has been implemented:
* `PEPs (Python Enhancement Proposals)`_.
-Support for the following sources is planned:
+Support for the following sources is planned or provided by
+`third party tools`_:
* Inline documentation from Python modules and packages, extracted
with namespace context.
@@ -63,6 +64,7 @@ Support for the following sources is planned:
.. _PEPs (Python Enhancement Proposals):
https://peps.python.org/pep-0012
+.. _third party tools: docs/user/links.html#related-applications
Dependencies
@@ -151,18 +153,18 @@ Installation
* Run ``setup.py install``.
See also OS-specific installation instructions below.
+ Optional steps:
+
+ * `Running the test suite`_
+
+ * `Converting the documentation`_
+
* For installing "by hand" or in "development mode", see the
`editable installs`_ section in the `Docutils version repository`_
documentation.
.. _editable installs: docs/dev/repository.html#editable-installs
-Optional steps:
-
-* `Running the test suite`_
-
-* `Converting the documentation`_
-
GNU/Linux, BSDs, Unix, Mac OS X, etc.
-------------------------------------
diff --git a/docutils/RELEASE-NOTES.txt b/docutils/RELEASE-NOTES.txt
index 9d6c12c9f..52325abda 100644
--- a/docutils/RELEASE-NOTES.txt
+++ b/docutils/RELEASE-NOTES.txt
@@ -20,28 +20,45 @@ For a more detailed list of changes, please see the Docutils `HISTORY`_.
Future changes
==============
-* Setup:
+* Usage:
- - Do not install the auxilliary script ``tools/rst2odt_prepstyles.py``
- in the binary PATH.
+ - The ``rst2*.py`` `front end tools`_ will be renamed to ``rst2*``
+ (dropping the ``.py`` extension). [#]_
- - Remove file ``install.py``. There are better ways, see README.txt__.
+ Exceptions:
+ The auxiliary script ``rst2odt_prepstyles.py`` will become
+ available via ``python -m docutils.writers.odf_odt.prepstyles``.
- __ README.html#installation
+ The ``rstpep2html.py`` script will be retired.
+ Use ``docutils --reader=pep --writer=pep_html`` for a PEP preview. [#]_
-* Input encoding:
+ .. [#] Some Linux distributions already use the short names.
+ .. [#] The final rendering is done by a Sphinx-based build system
+ (see :PEP:`676`).
- - Reduce chances of unrecognised character mix-ups (mojibake).
- Stop using the hard-coded fallback "latin-1" if the
- `input-encoding`_ is not specified and decoding with UTF-8 fails.
+ - The default HTML writer "html__" will change from "html4css1"
+ to "html5" in Docutils 2.0.
- Don't use the locale encoding as fallback in `UTF-8 mode`_.
+ Use ``get_writer_by_name('html')`` or the rst2html.py_ front end,
+ if you want the output to be up-to-date automatically.
- - Only remove BOM (U+FEFF ZWNBSP at start of data), no other ZWNBSPs.
+ Use the "html4" writer or ``rst2html4.py``, if you depend on
+ stability of the generated HTML code, e.g. because you use a
+ custom style sheet or post-processing that may break otherwise.
+
+ __ docs/user/html.html#html
+
+* `Input encoding`_:
- .. _input-encoding: docs/user/config.html#input-encoding
- .. _UTF-8 mode: https://docs.python.org/3/library/os.html#utf8-mode
+ - Raise UnicodeError (instead of falling back to the locale encoding)
+ if decoding the source with the default encoding (UTF-8) fails and
+ Python is started in `UTF-8 mode`_.
+
+ Raise UnicodeError (instead of falling back to "latin1") if both,
+ default and locale encoding, fail.
+
+ - Only remove BOM (U+FEFF ZWNBSP at start of data), no other ZWNBSPs.
* `html5` writer:
@@ -63,8 +80,6 @@ Future changes
- Remove option ``--embed-images`` (obsoleted by "image_loading_")
in Docutils 2.0.
- .. _image_loading: docs/user/config.html#image-loading
-
* `latex2e` writer:
- Change default of use_latex_citations_ setting to True
@@ -81,9 +96,10 @@ Future changes
* `xetex` writer:
- - Settings in the "latex2e writer" `configuration file section`__ will
- be ignored in Docutils 0.20. Move settings intended for both, `xetex`
- and `latex2e` writers to section "latex writers".
+ - Settings in the [latex2e writer] `configuration file section`__
+ will be ignored by the `xetex` writer in Docutils 0.20.
+ Move settings intended for both, `xetex` and `latex2e` writers
+ to section [latex writers].
__ docs/user/config.html#configuration-file-sections-entries
@@ -95,24 +111,23 @@ Future changes
* Drop support for `old-format configuration files`_ in Docutils 2.0.
+* Remove file ``install.py``. See README.txt__ for alternatives.
+
+ __ README.html#installation
+
* Move math format conversion from docutils/utils/math (called from
docutils/writers/_html_base.py) to a transform__.
__ docs/ref/transforms.html
-* The default HTML writer "html" with frontend ``rst2html.py`` will change
- from "html4css1" to "html5" in Docutils 2.0.
-
- Use ``get_writer_by_name('html')`` or the rst2html.py_ front end, if you
- want the output to be up-to-date automatically.
-
- Use the "html4" writer or ``rst2html4.py``, if you depend on
- stability of the generated HTML code, e.g. because you use a custom
- style sheet or post-processing that may break otherwise.
-
* Remove the ``--html-writer`` option of the ``buildhtml.py`` application
(obsoleted by the `"writer" option`_) in Docutils 2.0.
+.. _front end tools: docs/user/tools.html
+.. _input encoding:
+.. _input_encoding: docs/user/config.html#input-encoding
+.. _UTF-8 mode: https://docs.python.org/3/library/os.html#utf8-mode
+.. _image_loading: docs/user/config.html#image-loading
.. _old-format configuration files:
docs/user/config.html#old-format-configuration-files
.. _rst2html.py: docs/user/tools.html#rst2html-py
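Reviewer's note: the planned input-encoding behaviour announced in the release notes above can be illustrated with a minimal sketch. The helper ``decode_source`` and its parameters are hypothetical, not part of the Docutils API; the locale encoding and the UTF-8 mode flag are passed in explicitly so the rules are easy to follow and test.

```python
import codecs


def decode_source(data, locale_encoding=None, utf8_mode=False):
    """Decode `data` following the planned rules (hypothetical helper).

    1. Try UTF-8 first ('utf-8-sig' also strips a leading BOM).
    2. Fall back to the locale encoding, unless UTF-8 mode is active or
       the locale encoding is unset, unknown, or UTF-8 itself.
    3. Otherwise raise the UnicodeError instead of guessing "latin1".
    """
    try:
        return data.decode('utf-8-sig')
    except UnicodeDecodeError as error:
        utf8_error = error  # keep a reference; `error` is unbound later
    if utf8_mode or not locale_encoding:
        raise utf8_error
    try:
        codec = codecs.lookup(locale_encoding)
    except LookupError:
        raise utf8_error
    if codec.name == 'utf-8':
        raise utf8_error
    # May itself raise UnicodeDecodeError if the locale encoding fails too.
    return data.decode(codec.name)
```

In real use, the two extra arguments could be taken from ``locale.getpreferredencoding(False)`` and ``sys.flags.utf8_mode``. How Docutils will actually wire this up is not specified in this commit.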
diff --git a/docutils/docs/api/publisher.txt b/docutils/docs/api/publisher.txt
index 86f2affe1..8f5e4a0d6 100644
--- a/docutils/docs/api/publisher.txt
+++ b/docutils/docs/api/publisher.txt
@@ -97,10 +97,18 @@ details about individual settings.
Encodings
---------
+The default input encoding is UTF-8.
+A different encoding can be specified with the `input_encoding`_ setting
+or an `explicit encoding declaration` (BOM or special comment).
+The locale encoding may be used as a fallback.
+
The default output encoding of Docutils is UTF-8.
+A different encoding can be specified with the `output_encoding`_ setting.
Docutils may introduce some non-ASCII text if you use
`auto-symbol footnotes`_ or the `"contents" directive`_.
+.. _input_encoding: ../user/config.html#input-encoding
+.. _output_encoding: ../user/config.html#output-encoding
.. _auto-symbol footnotes:
../ref/rst/restructuredtext.html#auto-symbol-footnotes
.. _"contents" directive:
diff --git a/docutils/docs/ref/rst/directives.txt b/docutils/docs/ref/rst/directives.txt
index f4607ea22..4a57f27d5 100644
--- a/docutils/docs/ref/rst/directives.txt
+++ b/docutils/docs/ref/rst/directives.txt
@@ -166,33 +166,33 @@ value.
There are two image directives: "image" and "figure".
-.. attention::
+.. attention::
It is up to the author to ensure compatibility of the image data format
with the output format or user agent (LaTeX engine, `HTML browser`__).
The following, non exhaustive table provides an overview:
-
+
=========== ====== ====== ===== ===== ===== ===== ===== ===== ===== =====
.. vector image raster image moving image [#]_
----------- ------------- ----------------------------- -----------------
.. SVG PDF PNG JPG GIF APNG AVIF WebM MP4 OGG
=========== ====== ====== ===== ===== ===== ===== ===== ===== ===== =====
HTML4_ ✓ [#]_ ✓ ✓ ✓ (✓) (✓)
-
+
HTML5_ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓
-
+
LaTeX_ [#]_ ✓ ✓ ✓
-
+
ODT_ ✓ ✓ ✓ ✓ ✓
=========== ====== ====== ===== ===== ===== ===== ===== ===== ===== =====
-
+
.. [#] The `html5 writer`_ uses the ``<video>`` tag if the image URI
ends with an extension matching one of the listed video formats
(since Docutils 0.17).
-
+
.. [#] The html4 writer uses an ``<object>`` tag for SVG images
for better compatibility with older browsers.
-
+
.. [#] When compiling with ``pdflatex``, ``xelatex``, or ``lualatex``.
The original ``latex`` engine supports only the EPS image format.
Some build systems, e.g. rubber_ support additional formats
@@ -201,7 +201,7 @@ There are two image directives: "image" and "figure".
__ https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Image_types
.. _HTML4:
.. _html4 writer: ../../user/html.html#html4css1
-.. _HTML5:
+.. _HTML5:
.. _html5 writer: ../../user/html.html#html5-polyglot
.. _LaTeX: ../../user/latex.html#image-inclusion
.. _ODT: ../../user/odt.html
@@ -890,7 +890,7 @@ The following options are recognized:
``encoding`` : string
The text encoding of the external CSV data (file or URL).
- Defaults to the document's input_encoding_.
+ Defaults to the document's `input_encoding`_.
``escape`` : char
A one-character string used to escape the
@@ -1488,8 +1488,8 @@ The following options are recognized:
Works only with ``code`` or ``literal``.
``encoding`` : string
- The text encoding of the external data file. Defaults to the
- document's input_encoding_.
+ The text encoding of the external file.
+ Defaults to the document's input_encoding_.
.. _input_encoding: ../../user/config.html#input-encoding
@@ -1583,7 +1583,7 @@ The following options are recognized:
``encoding`` : string
The text encoding of the external raw data (file or URL).
- Defaults to the document's encoding (if specified).
+ Defaults to the document's `input_encoding`_ (if specified).
and the common option class_.
diff --git a/docutils/docs/user/config.txt b/docutils/docs/user/config.txt
index 69020f0b8..f6c17ec0b 100644
--- a/docutils/docs/user/config.txt
+++ b/docutils/docs/user/config.txt
@@ -67,8 +67,6 @@ the three implicit ones listed above (or the ones defined by the
``DOCUTILSCONFIG`` environment variable), and its entries will have
priority.
-.. _Docutils Runtime Settings: ../api/runtime-settings.html
-
-------------------------
Configuration File Syntax
@@ -155,7 +153,7 @@ Below are the Docutils runtime settings, listed by config file
section. **Any setting may be specified in any section, but only
settings from active sections will be used.** Sections correspond to
Docutils components (module name or alias; section names are always in
-lowercase letters). Each `Docutils application`_ uses a specific set
+lowercase letters). Each Docutils application_ uses a specific set
of components; corresponding configuration file sections are applied
when the application is used. Configuration sections are applied in
general-to-specific order, as follows:
@@ -198,7 +196,7 @@ Some knowledge of Python_ is assumed for some attributes.
.. _Python: https://www.python.org/
.. _RFC 822: https://www.rfc-editor.org/rfc/rfc822.txt
.. _front-end tool:
-.. _Docutils application: tools.html
+.. _application: tools.html
[general]
@@ -299,7 +297,7 @@ error_encoding_error_handler
----------------------------
The error handler for unencodable characters in error output.
-Acceptable values are the `Error Handlers`_ of Python's "encoding" module.
+Acceptable values are the `Error Handlers`_ of Python's "codecs" module.
See also output_encoding_error_handler_.
Default: "backslashreplace"
@@ -375,7 +373,7 @@ input_encoding_error_handler
----------------------------
The error handler for undecodable characters in the input.
-Acceptable values are the `Error Handlers`_ of Python's "encoding" module,
+Acceptable values are the `Error Handlers`_ of Python's "codecs" module,
including:
strict
@@ -425,7 +423,7 @@ output_encoding_error_handler
-----------------------------
The error handler for unencodable characters in the output.
-Acceptable values are the `Error Handlers`_ of Python's "encoding" module,
+Acceptable values are the `Error Handlers`_ of Python's "codecs" module,
including:
strict
@@ -2161,9 +2159,9 @@ New in 0.17. Obsoletes the ``html_writer`` option.
[docutils application]
--------------------------
-New in 0.17. Config file support added in 0.18.
-Renamed in 0.19 (the old name "docutils-cli application" is kept as alias).
-Support for reader/parser import names added in 0.19.
+| New in 0.17. Config file support added in 0.18.
+| Renamed in 0.19 (the old name "docutils-cli application" is kept as alias).
+| Support for reader/parser import names added in 0.19.
reader
~~~~~~
@@ -2320,6 +2318,7 @@ pep_template [pep_html writer] template
.. _block quote: ../ref/rst/restructuredtext.html#block-quotes
.. _citations: ../ref/rst/restructuredtext.html#citations
.. _bullet lists: ../ref/rst/restructuredtext.html#bullet-lists
+.. _Docutils Runtime Settings: ../api/runtime-settings.html
.. _enumerated lists: ../ref/rst/restructuredtext.html#enumerated-lists
.. _field lists: ../ref/rst/restructuredtext.html#field-lists
.. _field names: ../ref/rst/restructuredtext.html#field-names
@@ -2332,5 +2331,5 @@ pep_template [pep_html writer] template
.. _tables: ../ref/rst/restructuredtext.html#tables
.. _table of contents: ../ref/rst/directives.html#contents
-.. _Error Handlers:
+.. _Error Handlers:
https://docs.python.org/3/library/codecs.html#error-handlers
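Reviewer's note: the error handlers referenced by the ``*_encoding_error_handler`` settings are the standard ones from Python's "codecs" module. A small sketch, independent of Docutils, shows how some of them treat undecodable input and unencodable output. The helper names ``decode_with`` and ``encode_with`` are made up for this illustration.

```python
# Latin-1 bytes for "café"; the trailing byte 0xE9 is not valid UTF-8.
data = 'café'.encode('latin1')


def decode_with(handler):
    """Decode `data` as UTF-8 using the named codecs error handler."""
    return data.decode('utf-8', errors=handler)


def encode_with(handler):
    """Encode a non-ASCII string as ASCII using the named error handler."""
    return 'café'.encode('ascii', errors=handler)


if __name__ == '__main__':
    for handler in ('replace', 'ignore', 'backslashreplace'):
        print(handler, repr(decode_with(handler)))
    for handler in ('backslashreplace', 'xmlcharrefreplace'):
        print(handler, repr(encode_with(handler)))
    try:
        decode_with('strict')
    except UnicodeDecodeError as error:
        print('strict:', error)
```

With "strict", the undecodable byte raises an error, while the other handlers substitute or drop it. That is why "strict" is the safer choice when silent data corruption must be avoided.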
diff --git a/docutils/docs/user/html.txt b/docutils/docs/user/html.txt
index 0ec334de5..df151ae5e 100644
--- a/docutils/docs/user/html.txt
+++ b/docutils/docs/user/html.txt
@@ -9,6 +9,8 @@ Docutils HTML writers
html
----
+:front-end: rst2html.py_
+
`html` is an alias for the default Docutils HTML writer.
The default may change with the development of HTML, browsers, Docutils,
diff --git a/docutils/docs/user/links.txt b/docutils/docs/user/links.txt
index 2f8f43617..41d2bc08b 100644
--- a/docutils/docs/user/links.txt
+++ b/docutils/docs/user/links.txt
@@ -361,6 +361,7 @@ Applications using docutils/reStructuredText and helper applications.
.. _Text-Restructured: https://metacpan.org/dist/Text-Restructured
.. _repository: ../dev/repository.html
+
Tools
`````
@@ -384,24 +385,25 @@ Development
```````````
* Sphinx_ extends the ReStructuredText syntax to better support the
- documentation of Software (and other) *projects* (but other documents
+ documentation of Software projects (but other documents
can be written with it too).
- The `Python documentation`_ is based on reStructuredText and Sphinx.
-
- .. _Python documentation: https://docs.python.org/
+* `Sphinx Extensions`_ allow automatic testing of code snippets,
+ inclusion of docstrings from Python modules (API docs), and more.
* Trac_, a project management and bug/issue tracking system, supports
`using reStructuredText
<https://trac.edgewall.org/wiki/WikiRestructuredText>`__ as an
alternative to wiki markup.
- .. _Trac: https://trac.edgewall.org/
* PyLit_ provides a bidirectional text <--> code converter for *literate
programming with reStructuredText*.
- .. _PyLit: https://repo.or.cz/pylit.git
+.. _Sphinx extensions: https://www.sphinx-doc.org/en/master/usage/extensions/
+.. _Python documentation: https://docs.python.org/
+.. _Trac: https://trac.edgewall.org/
+.. _PyLit: https://codeberg.org/milde/pylit
CMS Systems
diff --git a/docutils/docs/user/tools.txt b/docutils/docs/user/tools.txt
index 3aae93cce..b0831e3d9 100644
--- a/docutils/docs/user/tools.txt
+++ b/docutils/docs/user/tools.txt
@@ -12,6 +12,13 @@
.. contents::
+.. note::
+ Docutils front-end tool support is currently `under discussion`__.
+ Tool names, install details and the set of auto-installed tools
+ will `change in future Docutils versions`__.
+
+ __ https://sourceforge.net/p/docutils/feature-requests/88/
+ __ ../../RELEASE-NOTES.html#future-changes
--------------
Introduction
@@ -27,10 +34,12 @@ understands the syntax of the text), and a "Writer" (which knows how
to generate a specific data format).
Most [#]_ front ends have common options and the same command-line usage
-pattern (see `the tools`_ below for concrete examples)::
+pattern::
toolname [options] [<source> [<destination>]]
+See `the tools`_ below for concrete examples.
+
Each tool has a "``--help``" option which lists the
`command-line options`_ and arguments it supports.
Processing can also be customized with `configuration files`_.
@@ -40,11 +49,6 @@ one argument (source) is specified, the standard output (stdout) is
used for the destination. If no arguments are specified, the standard
input (stdin) is used for the source.
-.. note::
- Docutils front-end tool support is currently under discussion.
- Tool names, install details and the set of auto-installed tools
- may change in future Docutils versions.
-
.. [#] The exceptions are buildhtml.py_, quicktest.py_ and
rst2odt_prepstyles.py_.
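Reviewer's note: the usage pattern ``toolname [options] [<source> [<destination>]]`` with its stdin/stdout defaults can be sketched with ``argparse``. This is only an illustration of the pattern; ``parse_cli`` and ``describe`` are hypothetical names and this is not how the Docutils front ends are implemented.

```python
import argparse


def parse_cli(argv):
    """Parse ``toolname [<source> [<destination>]]`` (hypothetical sketch)."""
    parser = argparse.ArgumentParser(prog='toolname')
    # Both positional arguments are optional; None stands for stdin/stdout.
    parser.add_argument('source', nargs='?', default=None)
    parser.add_argument('destination', nargs='?', default=None)
    return parser.parse_args(argv)


def describe(args):
    """Return a readable summary of where input comes from and output goes."""
    source = args.source or '<stdin>'
    destination = args.destination or '<stdout>'
    return f'{source} -> {destination}'
```

For example, ``describe(parse_cli(['FAQ.txt']))`` reports that output goes to stdout, matching the rule that a single argument is taken as the source.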
@@ -70,11 +74,11 @@ list.
Generic Command Line Front End
==============================
-:Readers: Standalone, PEP
-:Parsers: reStructuredText, Markdown (requires 3rd party packages)
-:Writers: html_, html4css1_, html5_, latex__, manpage_,
+:Readers: Standalone (default), PEP
+:Parsers: reStructuredText (default), Markdown (requires 3rd party packages)
+:Writers: html_, html4css1_, html5_ (default), latex__, manpage_,
odt_, pep_html_, pseudo-xml_, s5_html_, xelatex_, xml_,
-:Config_: See `[docutils application]`_
+:Config_: `[docutils application]`_
The generic front end allows combining "reader", "parser", and
"writer" components from the Docutils package or 3rd party plug-ins.
@@ -111,7 +115,7 @@ buildhtml.py
------------
:Readers: Standalone, PEP
-:Parser: reStructuredText
+:Parser: reStructuredText
:Writers: html_, html5_, pep_html_
:Config_: `[buildhtml application]`_
@@ -251,8 +255,18 @@ For example, to process a PEP into HTML::
cd <path-to-docutils>/docs/peps
rstpep2html.py pep-0287.txt pep-0287.html
+The same result can be achieved with the generic front end::
+
+ cd <path-to-docutils>/docs/peps
+ docutils --reader=pep --writer=pep_html pep-0287.txt pep-0287.html
+
+The rendering of published PEPs is done by a Sphinx-based build system
+(see :PEP:`676`).
+
+
.. _pep_html: html.html#pep-html
+
rst2s5.py
---------
diff --git a/docutils/docutils/__main__.py b/docutils/docutils/__main__.py
index d528b031d..cbb724842 100755
--- a/docutils/docutils/__main__.py
+++ b/docutils/docutils/__main__.py
@@ -31,8 +31,8 @@ class CliSettingsSpec(docutils.SettingsSpec):
Configurable reader, parser, and writer components.
- The "--writer" default will change to 'html' when this becomes
- an alias for 'html5'.
+ The "--writer" default will change to 'html' in Docutils 2.0
+ when 'html' becomes an alias for the current value 'html5'.
"""
settings_spec = (
diff --git a/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt b/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt
index 5a5fa6ebb..c2b584015 100644
--- a/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt
+++ b/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt
@@ -1,19 +1,19 @@
-:Title: Input Encodings
+:Title: Input Encoding
:Author: Günter Milde
:Discussions-To: docutils-develop@lists.sf.net
:Status: Draft
:Type: API
:Created: 2022-07-02
-:Docutils-Version: 0.19 or later
-:Replaces: undocumented behaviour in 0.18
+:Docutils-Version: > 0.19 (see `open issues`_)
+:Replaces: undocumented behaviour in 0.19
:Resolution: None
Abstract
========
-When the `input_encoding`_ setting is not specified, Docutils tries a
-heuristic to determine a "successfull encoding".
+When the `input_encoding`_ setting is not specified, Docutils uses
+a heuristic to determine or guess the source's encoding.
The actual behaviour is not documented and depends on the Python version.
@@ -21,63 +21,36 @@ The actual behaviour is not documented and depends on the Python version.
Motivation
==========
-The behaviour of Docutils when the `input_encoding`_ configuration
-setting is kept at its default value ``None`` is inconsistent and
-underdocumented.
-
-* If the source encoding cannot be determined from the data and
- decoding the source with UTF-8 fails, the file is decoded with the
- locale encoding or 'latin1' (hard-coded 2nd fallback).
-
- This leads to a character mix up (`mojibake`), if the file is actually
- in another encoding (corrupt UTF-8, UTF-16, or a different legacy
- encoding) without reporting an error.
-
-With the end of Python 2.7 support, some results from reading a file
-under Python 3.x will be seen as a regression by "late adopters".
-The following critical use cases are fixed in 0.19b2:
-
-* An encoding-specification in the file (BOM or special comment) is
- ignored under Python 3.x (unless reading with Python's default fails).
+ | Errors should never pass silently.
+ | Unless explicitly silenced.
-* If the user's locale is set to an 8-bit encoding ("latin1", say)
- UTF-8 encoded files are opened as UTF-8 under Python 2.x and Python 3.15
- but assuming the locale encoding under Python 3.7 … 3.14. [#]_
- The second case leads to character mix up.
+ -- ``import this``
-* A BOM in an utf-8 file is not removed, if the locale encoding is UTF-8.
+The behaviour of Docutils when the `input_encoding`_ configuration
+setting is kept at its default value ``None`` is currently suboptimal
+and underdocumented.
+If the source encoding cannot be determined from the source and
+decoding it with UTF-8 fails, the file is decoded with the
+locale encoding or 'latin1' (hard-coded 2nd fallback),
+even in `UTF-8 mode`_.
-.. [#] Assuming the `UTF-8 mode`_ at its default value, which will change
- in Python 3.15 (:PEP:`686`).
+This leads to a character mix up (`mojibake`), if the file is actually
+corrupt UTF-8 or in another encoding (UTF-16 or a different legacy encoding)
+without reporting an error.
Rationale
=========
- | Errors should never pass silently.
- | Unless explicitly silenced.
-
- -- ``import this``
-
The "hard coded" second fallback encoding "latin1" may have been
-practical in times where "latin1" was the most commonly used 8-bit
+practical in times where "latin1" was the most commonly used
encoding for text files. It is far from optimal in times where using
legacy 8-bit encodings without specifying them (via `input_encoding`_ or
in the document) can be considered an error.
-The same holds (for a lesser extend) for a non-UTF-8 locale encoding.
-
-
-Docutils supports a syntax to declare the encoding of a reStructuredText
-source file, similar to :PEP:`263`. This support should be kept.
-
-Adherence to an encoding-specification in the file (BOM or "magic
-comment") remains the default behaviour for Python source code:
-
- An explicit encoding declaration takes precedence over the default.
-
- -- :PEP:`3120`
+The same holds for using a non-UTF-8 locale encoding as fallback when
+Python's `UTF-8 mode`_ is active.
Specification
@@ -86,25 +59,65 @@ Specification
.. Describe the syntax and semantics of any new feature.
The encoding of a reStructuredText source is determined from the
-`input_encoding`_ setting or an `explicit encoding declaration`
+`input_encoding`_ setting or an `explicit encoding declaration`_
(BOM or special comment).
-The default encoding is UTF-8 (codec 'utf-8-sig').
+The default input encoding is UTF-8 (codec 'utf-8-sig').
If the encoding is unspecified and decoding with UTF-8 fails,
the `preferred encoding`_ is used as a fallback
(if it maps to a valid codec and differs from UTF-8).
-Differences to Python's default `open()`:
+Differences to the default behaviour of Python's `open()`:
- The UTF-8 encoding is always tried first.
- (This is almost sure to fail if the true encoding differs.)
+ (This is almost sure to fail if the actual source encoding differs.)
+
+- An `explicit encoding declaration`_ takes precedence over
+ the `preferred encoding`_.
+
+- An optional BOM_ is removed from UTF-8 encoded sources.
.. _preferred encoding:
https://docs.python.org/3/library/locale.html#locale.getpreferredencoding
+Explicit encoding declaration
+-----------------------------
+
+A `Unicode byte order mark` (BOM_) in the source is interpreted as
+encoding declaration.
+
+The encoding of a reStructuredText source file can also be given by a
+"magic comment" similar to :PEP:`263`.
+This makes the input encoding both *visible* and *changeable*
+on a per-source file basis.
+
+To declare the input encoding, a comment like ::
+
+ .. text encoding: <encoding name>
+
+must be placed into the source file either as first or second line.
+
+Examples: (using formats recognized by popular editors) ::
+
+ .. -*- mode: rst -*-
+ -*- coding: latin1 -*-
+
+or::
+
+ .. vim: set fileencoding=cp737 :
+
+More precisely, the first and second line are searched for the following
+regular expression::
+
+ coding[:=]\s*([-\w.]+)
+
+The first group of this expression is then interpreted as encoding name.
+If the first line matches, the second line is ignored.
+
+
Backwards Compatibility
=======================
@@ -118,12 +131,10 @@ The following incompatible changes are expected:
- Raise UnicodeError (instead of decoding with 'latin1') if decoding the
source with UTF-8 fails and the locale encoding is not set or UTF-8.
-
+
- Raise UnicodeError (instead of decoding with the locale encoding)
if Python is started in `UTF-8 mode`_.
-.. _codecs:
- https://docs.python.org/3/library/codecs.html#encodings-and-unicode
Security Implications
=====================
@@ -136,14 +147,14 @@ How to Teach This
* Document the specification_.
-* Document the _`special comment`.
+* Document the `special comment`.
* Recommend specifying the source encoding (via `input_encoding`_ or
with BOM or special comment), especially if it is not UTF-8.
-* "To avoid erroneous application of a locale encoding
- but keep detection of an encoding-specification in the source
- (BOM or special comment), start Python in `UTF-8 mode`_."
+* "To avoid erroneous application of a locale encoding but keep
+ detection of an `explicit encoding declaration`_ in the source,
+ start Python in `UTF-8 mode`_."
Reference Implementation
@@ -168,9 +179,9 @@ Open Issues
* When shall we implement the incompatible API changes?
- 2 minor versions after announcing.
-
+
- Faster/immediately, because the current behaviour is a bug.
-
+
* Change the default `input_encoding`_ value to "UTF-8"?
* Keep the auto-detection (as opt-in or as default)?
@@ -178,18 +189,48 @@ Open Issues
+1 convenient for users with differently encoded sources
-1 complicates code
+ Adherence to an encoding-specification in the source (BOM or "magic
+ comment") remains the default behaviour for Python source code:
+
+ An explicit encoding declaration takes precedence over the default.
+
+ -- :PEP:`3120`
+
+
+Better feedback
+
+* Warning or Error, when `input_encoding` value differs from the
+ encoding declared in the source (BOM or special comment).
+
+* Info or Warning, when using the "locale" fallback encoding.
+
+* More helpful report of UnicodeDecodeError
+
+ - Hint at "input_encoding_error_handler"?
+
+ - Report the line number of the undecodable character.
+
+ - Print context around the undecodable character
+ (decode with "replace" error handler, print slice around error)?
+
References
==========
`<input-encoding-tests.py>`_
- Test script for the exploration of the handling of input encoding
+ Script for the exploration of the handling of input encoding
in Python and Docutils.
+Patches #194 Deprecate PEP 263 coding slugs support
+ https://sourceforge.net/p/docutils/patches/194/
+
.. _input_encoding:
https://docutils.sourceforge.io/docs/user/config.html#input-encoding
.. _UTF-8 mode: https://docs.python.org/3/library/os.html#utf8-mode
+.. _codecs:
+ https://docs.python.org/3/library/codecs.html#encodings-and-unicode
+.. _BOM: https://docs.python.org/3/library/codecs.html#codecs.BOM
Copyright