author | milde <milde@929543f6-e4f2-0310-98a6-ba3bd3dd1d04> | 2022-07-06 13:59:57 +0000 |
---|---|---|
committer | milde <milde@929543f6-e4f2-0310-98a6-ba3bd3dd1d04> | 2022-07-06 13:59:57 +0000 |
commit | b1a4a86e1548d19aa9fbdb0c75e15cff07800a8c (patch) | |
tree | 7479b1174f5cfd88eb56774b07ed0cb042747a93 | |
parent | 5938461c64dae3ac01029d3fc835dab8d04f019c (diff) | |
download | docutils-b1a4a86e1548d19aa9fbdb0c75e15cff07800a8c.tar.gz |
Documentation updates
Document future changes regarding front-end tools.
Clarify/update planned changes to the input-encoding handling.
Various smaller fixes. Add links.
git-svn-id: https://svn.code.sf.net/p/docutils/code/trunk@9107 929543f6-e4f2-0310-98a6-ba3bd3dd1d04
-rw-r--r-- | docutils/HISTORY.txt | 8 |
-rw-r--r-- | docutils/README.txt | 18 |
-rw-r--r-- | docutils/RELEASE-NOTES.txt | 71 |
-rw-r--r-- | docutils/docs/api/publisher.txt | 8 |
-rw-r--r-- | docutils/docs/ref/rst/directives.txt | 26 |
-rw-r--r-- | docutils/docs/user/config.txt | 21 |
-rw-r--r-- | docutils/docs/user/html.txt | 2 |
-rw-r--r-- | docutils/docs/user/links.txt | 14 |
-rw-r--r-- | docutils/docs/user/tools.txt | 36 |
-rwxr-xr-x | docutils/docutils/__main__.py | 4 |
-rw-r--r-- | sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt | 165 |
11 files changed, 230 insertions, 143 deletions
diff --git a/docutils/HISTORY.txt b/docutils/HISTORY.txt index c3032c34b..b8acc43ed 100644 --- a/docutils/HISTORY.txt +++ b/docutils/HISTORY.txt @@ -138,9 +138,13 @@ Release 0.19 (2022-07-05) - Fix help output. - Actual code moved to docutils.__main__.py. +* tools/rst2odt_prepstyles.py -Release 0.18.1 -============== + - Options ``-h`` and ``--help`` print short usage message. + + +Release 0.18.1 (2021-11-23) +=========================== * docutils/nodes.py diff --git a/docutils/README.txt b/docutils/README.txt index 095cb0a66..795536d33 100644 --- a/docutils/README.txt +++ b/docutils/README.txt @@ -30,7 +30,7 @@ This is for those who want to get up & running quickly. Try for example:: rst2html.py FAQ.txt FAQ.html (Unix) - python tools/rst2html.py FAQ.txt FAQ.html (Windows) + docutils FAQ.txt FAQ.html (Unix and Windows) See Usage_ below for details. @@ -47,7 +47,8 @@ following sources has been implemented: * `PEPs (Python Enhancement Proposals)`_. -Support for the following sources is planned: +Support for the following sources is planned or provided by +`third party tools`_: * Inline documentation from Python modules and packages, extracted with namespace context. @@ -63,6 +64,7 @@ Support for the following sources is planned: .. _PEPs (Python Enhancement Proposals): https://peps.python.org/pep-0012 +.. _third party tools: docs/user/links.html#related-applications Dependencies @@ -151,18 +153,18 @@ Installation * Run ``setup.py install``. See also OS-specific installation instructions below. + Optional steps: + + * `Running the test suite`_ + + * `Converting the documentation`_ + * For installing "by hand" or in "development mode", see the `editable installs`_ section in the `Docutils version repository`_ documentation. .. _editable installs: docs/dev/repository.html#editable-installs -Optional steps: - -* `Running the test suite`_ - -* `Converting the documentation`_ - GNU/Linux, BSDs, Unix, Mac OS X, etc. 
------------------------------------- diff --git a/docutils/RELEASE-NOTES.txt b/docutils/RELEASE-NOTES.txt index 9d6c12c9f..52325abda 100644 --- a/docutils/RELEASE-NOTES.txt +++ b/docutils/RELEASE-NOTES.txt @@ -20,28 +20,45 @@ For a more detailed list of changes, please see the Docutils `HISTORY`_. Future changes ============== -* Setup: +* Usage: - - Do not install the auxilliary script ``tools/rst2odt_prepstyles.py`` - in the binary PATH. + - The ``rst2*.py`` `front end tools`_ will be renamed to ``rst2*`` + (dropping the ``.py`` extension). [#]_ - - Remove file ``install.py``. There are better ways, see README.txt__. + Exceptions: + The auxilliary script ``rst2odt_prepstyles.py`` will become + available via ``python -m docutils.writers.odf_odt.prepstyles``. - __ README.html#installation + The ``rstpep2html.py`` script will be retired. + Use ``docutils --reader=pep --writer=pep_html`` for a PEP preview. [#]_ -* Input encoding: + .. [#] Some Linux distributions already use the short names. + .. [#] The final rendering is done by a Sphinx-based build system + (see :PEP:`676`). - - Reduce chances of unrecognised character mix-ups (mojibake). - Stop using the hard-coded fallback "latin-1" if the - `input-encoding`_ is not specified and decoding with UTF-8 fails. + - The default HTML writer "html__" will change from "html4css1" + to "html5" in Docutils 2.0. - Don't use the locale encoding as fallback in `UTF-8 mode`_. + Use ``get_writer_by_name('html')`` or the rst2html.py_ front end, + if you want the output to be up-to-date automatically. - - Only remove BOM (U+FEFF ZWNBSP at start of data), no other ZWNBSPs. + Use the "html4" writer or ``rst2html4.py``, if you depend on + stability of the generated HTML code, e.g. because you use a + custom style sheet or post-processing that may break otherwise. + + __ docs/user/html.html#html + +* `Input encoding`_: - .. _input-encoding: docs/user/config.html#input-encoding - .. 
_UTF-8 mode: https://docs.python.org/3/library/os.html#utf8-mode + - Raise UnicodeError (instead of falling back to the locale encoding) + if decoding the source with the default encoding (UTF-8) fails and + Python is started in `UTF-8 mode`_. + + Raise UnicodeError (instead of falling back to "latin1") if both, + default and locale encoding, fail. + + - Only remove BOM (U+FEFF ZWNBSP at start of data), no other ZWNBSPs. * `html5` writer: @@ -63,8 +80,6 @@ Future changes - Remove option ``--embed-images`` (obsoleted by "image_loading_") in Docutils 2.0. - .. _image_loading: docs/user/config.html#image-loading - * `latex2e` writer: - Change default of use_latex_citations_ setting to True @@ -81,9 +96,10 @@ Future changes * `xetex` writer: - - Settings in the "latex2e writer" `configuration file section`__ will - be ignored in Docutils 0.20. Move settings intended for both, `xetex` - and `latex2e` writers to section "latex writers". + - Settings in the [latex2e writer] `configuration file section`__ + will be ignored by the `xetex` writer in Docutils 0.20. + Move settings intended for both, `xetex` and `latex2e` writers + to section [latex writers]. __ docs/user/config.html#configuration-file-sections-entries @@ -95,24 +111,23 @@ Future changes * Drop support for `old-format configuration files`_ in Docutils 2.0. +* Remove file ``install.py``. See README.txt__ for alternatives. + + __ README.html#installation + * Move math format conversion from docutils/utils/math (called from docutils/writers/_html_base.py) to a transform__. __ docs/ref/transforms.html -* The default HTML writer "html" with frontend ``rst2html.py`` will change - from "html4css1" to "html5" in Docutils 2.0. - - Use ``get_writer_by_name('html')`` or the rst2html.py_ front end, if you - want the output to be up-to-date automatically. - - Use the "html4" writer or ``rst2html4.py``, if you depend on - stability of the generated HTML code, e.g. 
because you use a custom - style sheet or post-processing that may break otherwise. - * Remove the ``--html-writer`` option of the ``buildhtml.py`` application (obsoleted by the `"writer" option`_) in Docutils 2.0. +.. _front end tools: docs/user/tools.html +.. _input encoding: +.. _input_encoding: docs/user/config.html#input-encoding +.. _UTF-8 mode: https://docs.python.org/3/library/os.html#utf8-mode +.. _image_loading: docs/user/config.html#image-loading .. _old-format configuration files: docs/user/config.html#old-format-configuration-files .. _rst2html.py: docs/user/tools.html#rst2html-py diff --git a/docutils/docs/api/publisher.txt b/docutils/docs/api/publisher.txt index 86f2affe1..8f5e4a0d6 100644 --- a/docutils/docs/api/publisher.txt +++ b/docutils/docs/api/publisher.txt @@ -97,10 +97,18 @@ details about individual settings. Encodings --------- +The default input encoding is UTF-8. +A different encoding can be specified with the `input_encoding`_ setting +or an `explicit encoding declaration` (BOM or special comment). +The locale encoding may be used as a fallback. + The default output encoding of Docutils is UTF-8. +A different encoding can be specified with the `output_encoding`_ setting. Docutils may introduce some non-ASCII text if you use `auto-symbol footnotes`_ or the `"contents" directive`_. +.. _input_encoding: ../user/config.html#input-encoding +.. _output_encoding: ../user/config.html#output-encoding .. _auto-symbol footnotes: ../ref/rst/restructuredtext.html#auto-symbol-footnotes .. _"contents" directive: diff --git a/docutils/docs/ref/rst/directives.txt b/docutils/docs/ref/rst/directives.txt index f4607ea22..4a57f27d5 100644 --- a/docutils/docs/ref/rst/directives.txt +++ b/docutils/docs/ref/rst/directives.txt @@ -166,33 +166,33 @@ value. There are two image directives: "image" and "figure". -.. attention:: +.. 
attention:: It is up to the author to ensure compatibility of the image data format with the output format or user agent (LaTeX engine, `HTML browser`__). The following, non exhaustive table provides an overview: - + =========== ====== ====== ===== ===== ===== ===== ===== ===== ===== ===== .. vector image raster image moving image [#]_ ----------- ------------- ----------------------------- ----------------- .. SVG PDF PNG JPG GIF APNG AVIF WebM MP4 OGG =========== ====== ====== ===== ===== ===== ===== ===== ===== ===== ===== HTML4_ ✓ [#]_ ✓ ✓ ✓ (✓) (✓) - + HTML5_ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ ✓ - + LaTeX_ [#]_ ✓ ✓ ✓ - + ODT_ ✓ ✓ ✓ ✓ ✓ =========== ====== ====== ===== ===== ===== ===== ===== ===== ===== ===== - + .. [#] The `html5 writer`_ uses the ``<video>`` tag if the image URI ends with an extension matching one of the listed video formats (since Docutils 0.17). - + .. [#] The html4 writer uses an ``<object>`` tag for SVG images for better compatibility with older browsers. - + .. [#] When compiling with ``pdflatex``, ``xelatex``, or ``lualatex``. The original ``latex`` engine supports only the EPS image format. Some build systems, e.g. rubber_ support additional formats @@ -201,7 +201,7 @@ There are two image directives: "image" and "figure". __ https://developer.mozilla.org/en-US/docs/Web/Media/Formats/Image_types .. _HTML4: .. _html4 writer: ../../user/html.html#html4css1 -.. _HTML5: +.. _HTML5: .. _html5 writer: ../../user/html.html#html5-polyglot .. _LaTeX: ../../user/latex.html#image-inclusion .. _ODT: ../../user/odt.html @@ -890,7 +890,7 @@ The following options are recognized: ``encoding`` : string The text encoding of the external CSV data (file or URL). - Defaults to the document's input_encoding_. + Defaults to the document's `input_encoding`_. ``escape`` : char A one-character string used to escape the @@ -1488,8 +1488,8 @@ The following options are recognized: Works only with ``code`` or ``literal``. 
``encoding`` : string - The text encoding of the external data file. Defaults to the - document's input_encoding_. + The text encoding of the external file. + Defaults to the document's input_encoding_. .. _input_encoding: ../../user/config.html#input-encoding @@ -1583,7 +1583,7 @@ The following options are recognized: ``encoding`` : string The text encoding of the external raw data (file or URL). - Defaults to the document's encoding (if specified). + Defaults to the document's `input_encoding`_ (if specified). and the common option class_. diff --git a/docutils/docs/user/config.txt b/docutils/docs/user/config.txt index 69020f0b8..f6c17ec0b 100644 --- a/docutils/docs/user/config.txt +++ b/docutils/docs/user/config.txt @@ -67,8 +67,6 @@ the three implicit ones listed above (or the ones defined by the ``DOCUTILSCONFIG`` environment variable), and its entries will have priority. -.. _Docutils Runtime Settings: ../api/runtime-settings.html - ------------------------- Configuration File Syntax @@ -155,7 +153,7 @@ Below are the Docutils runtime settings, listed by config file section. **Any setting may be specified in any section, but only settings from active sections will be used.** Sections correspond to Docutils components (module name or alias; section names are always in -lowercase letters). Each `Docutils application`_ uses a specific set +lowercase letters). Each Docutils application_ uses a specific set of components; corresponding configuration file sections are applied when the application is used. Configuration sections are applied in general-to-specific order, as follows: @@ -198,7 +196,7 @@ Some knowledge of Python_ is assumed for some attributes. .. _Python: https://www.python.org/ .. _RFC 822: https://www.rfc-editor.org/rfc/rfc822.txt .. _front-end tool: -.. _Docutils application: tools.html +.. 
_application: tools.html [general] @@ -299,7 +297,7 @@ error_encoding_error_handler ---------------------------- The error handler for unencodable characters in error output. -Acceptable values are the `Error Handlers`_ of Python's "encoding" module. +Acceptable values are the `Error Handlers`_ of Python's "codecs" module. See also output_encoding_error_handler_. Default: "backslashreplace" @@ -375,7 +373,7 @@ input_encoding_error_handler ---------------------------- The error handler for undecodable characters in the input. -Acceptable values are the `Error Handlers`_ of Python's "encoding" module, +Acceptable values are the `Error Handlers`_ of Python's "codecs" module, including: strict @@ -425,7 +423,7 @@ output_encoding_error_handler ----------------------------- The error handler for unencodable characters in the output. -Acceptable values are the `Error Handlers`_ of Python's "encoding" module, +Acceptable values are the `Error Handlers`_ of Python's "codecs" module, including: strict @@ -2161,9 +2159,9 @@ New in 0.17. Obsoletes the ``html_writer`` option. [docutils application] -------------------------- -New in 0.17. Config file support added in 0.18. -Renamed in 0.19 (the old name "docutils-cli application" is kept as alias). -Support for reader/parser import names added in 0.19. +| New in 0.17. Config file support added in 0.18. +| Renamed in 0.19 (the old name "docutils-cli application" is kept as alias). +| Support for reader/parser import names added in 0.19. reader ~~~~~~ @@ -2320,6 +2318,7 @@ pep_template [pep_html writer] template .. _block quote: ../ref/rst/restructuredtext.html#block-quotes .. _citations: ../ref/rst/restructuredtext.html#citations .. _bullet lists: ../ref/rst/restructuredtext.html#bullet-lists +.. _Docutils Runtime Settings: ../api/runtime-settings.html .. _enumerated lists: ../ref/rst/restructuredtext.html#enumerated-lists .. _field lists: ../ref/rst/restructuredtext.html#field-lists .. 
_field names: ../ref/rst/restructuredtext.html#field-names @@ -2332,5 +2331,5 @@ pep_template [pep_html writer] template .. _tables: ../ref/rst/restructuredtext.html#tables .. _table of contents: ../ref/rst/directives.html#contents -.. _Error Handlers: +.. _Error Handlers: https://docs.python.org/3/library/codecs.html#error-handlers diff --git a/docutils/docs/user/html.txt b/docutils/docs/user/html.txt index 0ec334de5..df151ae5e 100644 --- a/docutils/docs/user/html.txt +++ b/docutils/docs/user/html.txt @@ -9,6 +9,8 @@ Docutils HTML writers html ---- +:front-end: rst2html.py_ + `html` is an alias for the default Docutils HTML writer. The default may change with the development of HTML, browsers, Docutils, diff --git a/docutils/docs/user/links.txt b/docutils/docs/user/links.txt index 2f8f43617..41d2bc08b 100644 --- a/docutils/docs/user/links.txt +++ b/docutils/docs/user/links.txt @@ -361,6 +361,7 @@ Applications using docutils/reStructuredText and helper applications. .. _Text-Restructured: https://metacpan.org/dist/Text-Restructured .. _repository: ../dev/repository.html + Tools ````` @@ -384,24 +385,25 @@ Development ``````````` * Sphinx_ extends the ReStructuredText syntax to better support the - documentation of Software (and other) *projects* (but other documents + documentation of Software projects (but other documents can be written with it too). - The `Python documentation`_ is based on reStructuredText and Sphinx. - - .. _Python documentation: https://docs.python.org/ +* `Sphinx Extensions`_ allow automatic testing of code snippets, + inclusion of docstrings from Python modules (API docs), and more. * Trac_, a project management and bug/issue tracking system, supports `using reStructuredText <https://trac.edgewall.org/wiki/WikiRestructuredText>`__ as an alternative to wiki markup. - .. _Trac: https://trac.edgewall.org/ * PyLit_ provides a bidirectional text <--> code converter for *literate programming with reStructuredText*. - .. 
_PyLit: https://repo.or.cz/pylit.git +.. _Sphinx extensions: https://www.sphinx-doc.org/en/master/usage/extensions/ +.. _Python documentation: https://docs.python.org/ +.. _Trac: https://trac.edgewall.org/ +.. _PyLit: https://codeberg.org/milde/pylit CMS Systems diff --git a/docutils/docs/user/tools.txt b/docutils/docs/user/tools.txt index 3aae93cce..b0831e3d9 100644 --- a/docutils/docs/user/tools.txt +++ b/docutils/docs/user/tools.txt @@ -12,6 +12,13 @@ .. contents:: +.. note:: + Docutils front-end tool support is currently `under discussion`__. + Tool names, install details and the set of auto-installed tools + will `change in future Docutils versions`__. + + __ https://sourceforge.net/p/docutils/feature-requests/88/ + __ ../../RELEASE-NOTES.html#future-changes -------------- Introduction @@ -27,10 +34,12 @@ understands the syntax of the text), and a "Writer" (which knows how to generate a specific data format). Most [#]_ front ends have common options and the same command-line usage -pattern (see `the tools`_ below for concrete examples):: +pattern:: toolname [options] [<source> [<destination>]] +See `the tools`_ below for concrete examples. + Each tool has a "``--help``" option which lists the `command-line options`_ and arguments it supports. Processing can also be customized with `configuration files`_. @@ -40,11 +49,6 @@ one argument (source) is specified, the standard output (stdout) is used for the destination. If no arguments are specified, the standard input (stdin) is used for the source. -.. note:: - Docutils front-end tool support is currently under discussion. - Tool names, install details and the set of auto-installed tools - may change in future Docutils versions. - .. [#] The exceptions are buildhtml.py_, quicktest.py_ and rst2odt_prepstyles.py_. @@ -70,11 +74,11 @@ list. 
Generic Command Line Front End ============================== -:Readers: Standalone, PEP -:Parsers: reStructuredText, Markdown (requires 3rd party packages) -:Writers: html_, html4css1_, html5_, latex__, manpage_, +:Readers: Standalone (default), PEP +:Parsers: reStructuredText (default), Markdown (requires 3rd party packages) +:Writers: html_, html4css1_, html5_ (default), latex__, manpage_, odt_, pep_html_, pseudo-xml_, s5_html_, xelatex_, xml_, -:Config_: See `[docutils application]`_ +:Config_: `[docutils application]`_ The generic front end allows combining "reader", "parser", and "writer" components from the Docutils package or 3rd party plug-ins. @@ -111,7 +115,7 @@ buildhtml.py ------------ :Readers: Standalone, PEP -:Parser: reStructuredText +:Parser: reStructuredText :Writers: html_, html5_, pep_html_ :Config_: `[buildhtml application]`_ @@ -251,8 +255,18 @@ For example, to process a PEP into HTML:: cd <path-to-docutils>/docs/peps rstpep2html.py pep-0287.txt pep-0287.html +The same result can be achieved with the genric front end:: + + cd <path-to-docutils>/docs/peps + docutils --reader=pep --writer=pep_html pep-0287.txt pep-0287.html + +The rendering of published PEPs is done by a Sphinx-based build system +(see :PEP:`676`). + + .. _pep_html: html.html#pep-html + rst2s5.py --------- diff --git a/docutils/docutils/__main__.py b/docutils/docutils/__main__.py index d528b031d..cbb724842 100755 --- a/docutils/docutils/__main__.py +++ b/docutils/docutils/__main__.py @@ -31,8 +31,8 @@ class CliSettingsSpec(docutils.SettingsSpec): Configurable reader, parser, and writer components. - The "--writer" default will change to 'html' when this becomes - an alias for 'html5'. + The "--writer" default will change to 'html' in Docutils 2.0 + when 'html' becomes an alias for the current value 'html5'. 
""" settings_spec = ( diff --git a/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt b/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt index 5a5fa6ebb..c2b584015 100644 --- a/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt +++ b/sandbox/enhancement-proposals/input-encoding/dep-999-input-encoding.txt @@ -1,19 +1,19 @@ -:Title: Input Encodings +:Title: Input Encoding :Author: Günter Milde :Discussions-To: docutils-develop@lists.sf.net :Status: Draft :Type: API :Created: 2022-07-02 -:Docutils-Version: 0.19 or later -:Replaces: undocumented behaviour in 0.18 +:Docutils-Version: > 0.19 (see `open issues`_) +:Replaces: undocumented behaviour in 0.19 :Resolution: None Abstract ======== -When the `input_encoding`_ setting is not specified, Docutils tries a -heuristic to determine a "successfull encoding". +When the `input_encoding`_ setting is not specified, Docutils uses +a heuristic to determine or guess the source's encoding. The actual behaviour is not documented and depends on the Python version. @@ -21,63 +21,36 @@ The actual behaviour is not documented and depends on the Python version. Motivation ========== -The behaviour of Docutils when the `input_encoding`_ configuration -setting is kept at its default value ``None`` is inconsistent and -underdocumented. - -* If the source encoding cannot be determined from the data and - decoding the source with UTF-8 fails, the file is decoded with the - locale encoding or 'latin1' (hard-coded 2nd fallback). - - This leads to a character mix up (`mojibake`), if the file is actually - in another encoding (corrupt UTF-8, UTF-16, or a different legacy - encoding) without reporting an error. - -With the end of Python 2.7 support, some results from reading a file -under Python 3.x will be seen as a regression by "late adopters". 
-The following critical use cases are fixed in 0.19b2: - -* An encoding-specification in the file (BOM or special comment) is - ignored under Python 3.x (unless reading with Python's default fails). + | Errors should never pass silently. + | Unless explicitly silenced. -* If the user's locale is set to an 8-bit encoding ("latin1", say) - UTF-8 encoded files are opened as UTF-8 under Python 2.x and Python 3.15 - but assuming the locale encoding under Python 3.7 … 3.14. [#]_ - The second case leads to character mix up. + -- ``import this`` -* A BOM in an utf-8 file is not removed, if the locale encoding is UTF-8. +The behaviour of Docutils when the `input_encoding`_ configuration +setting is kept at its default value ``None`` is currently suboptimal +and underdocumented. +If the source encoding cannot be determined from the source and +decoding the with UTF-8 fails, the file is decoded with the +locale encoding or 'latin1' (hard-coded 2nd fallback), +even in `UTF-8 mode`_. -.. [#] Assuming the `UTF-8 mode`_ at its default value, which will change - in Python 3.15 (:PEP:`686`). +This leads to a character mix up (`mojibake`), if the file is actually +corrupt UTF-8 or in another encoding (UTF-16 or a different legacy encoding) +without reporting an error. Rationale ========= - | Errors should never pass silently. - | Unless explicitly silenced. - - -- ``import this`` - The "hard coded" second fallback encoding "latin1" may have been -practical in times where "latin1" was the most commonly used 8-bit +practical in times where "latin1" was the most commonly used encoding for text files. It is far from optimal in times where using legacy 8-bit encodings without specifying them (via `input_encoding`_ or in the document) can be considered an error. -The same holds (for a lesser extend) for a non-UTF-8 locale encoding. - - -Docutils supports a syntax to declare the encoding of a reStructuredText -source file, similar to :PEP:`263`. This support should be kept. 
- -Adherence to an encoding-specification in the file (BOM or "magic -comment") remains the default behaviour for Python source code: - - An explicit encoding declaration takes precedence over the default. - - -- :PEP:`3120` +The same holds for using a non-UTF-8 locale encoding as fallback when +Python's `UTF-8 mode`_ is active. Specification @@ -86,25 +59,65 @@ Specification .. Describe the syntax and semantics of any new feature. The encoding of a reStructuredText source is determined from the -`input_encoding`_ setting or an `explicit encoding declaration` +`input_encoding`_ setting or an `explicit encoding declaration`_ (BOM or special comment). -The default encoding is UTF-8 (codec 'utf-8-sig'). +The default input encoding is UTF-8 (codec 'utf-8-sig'). If the encoding is unspecified and decoding with UTF-8 fails, the `preferred encoding`_ is used as a fallback (if it maps to a valid codec and differs from UTF-8). -Differences to Python's default `open()`: +Differences to the default behaviour of Python's `open()`: - The UTF-8 encoding is always tried first. - (This is almost sure to fail if the true encoding differs.) + (This is almost sure to fail if the actual source encoding differs.) + +- An `explicit encoding declaration`_ takes precedence over + the `preferred encoding`_. + +- An optional BOM_ is removed from UTF-8 encoded sources. .. _preferred encoding: https://docs.python.org/3/library/locale.html#locale.getpreferredencoding +Explicit encoding declaration +----------------------------- + +A `Unicode byte order mark` (BOM_) in the source is interpreted as +encoding declaration. + +The encoding of a reStructuredText source file can also be given by a +"magic comment" similar to :PEP:`263`. +This makes the input encoding both *visible* and *changeable* +on a per-source file basis. + +To declare the input encoding, the a comment like :: + + .. text encoding: <encoding name> + +must be placed into the source file either as first or second line. 
+ +Examples: (using formats recognized by popular editors) :: + + .. -*- mode: rst -*- + -*- coding: latin1 -*- + +or:: + + .. vim: set fileencoding=cp737 : + +More precisely, the first and second line are searched for the following +regular expression:: + + coding[:=]\s*([-\w.]+) + +The first group of this expression is then interpreted as encoding name. +If the first line matches the second line is ignored. + + Backwards Compatibility ======================= @@ -118,12 +131,10 @@ The following incompatible changes are expected: - Raise UnicodeError (instead of decoding with 'latin1') if decoding the source with UTF-8 fails and the locale encoding is not set or UTF-8. - + - Raise UnicodeError (instead of decoding with the locale encoding) if Python is started in `UTF-8 mode`_. -.. _codecs: - https://docs.python.org/3/library/codecs.html#encodings-and-unicode Security Implications ===================== @@ -136,14 +147,14 @@ How to Teach This * Document the specification_. -* Document the _`special comment`. +* Document the `special comment`. * Recommend specifying the source encoding (via `input_encoding`_ or with BOM or special comment), especially if it is not UTF-8. -* "To avoid erroneous application of a locale encoding - but keep detection of an encoding-specification in the source - (BOM or special comment), start Python in `UTF-8 mode`_." +* "To avoid erroneous application of a locale encoding but keep + detection of an `explicit encoding declaration`_ in the source, + start Python in `UTF-8 mode`_." Reference Implementation @@ -168,9 +179,9 @@ Open Issues * When shall we implement the incompatible API changes? - 2 minor versions after announcing. - + - Faster/immediately, because the current behaviour is a bug. - + * Change the default `input_encoding`_ value to "UTF-8"? * Keep the auto-detection (as opt-in or as default)? 
@@ -178,18 +189,48 @@ Open Issues +1 convenient for users with differently encoded sources -1 complicates code + Adherence to an encoding-specification in the source (BOM or "magic + comment") remains the default behaviour for Python source code: + + An explicit encoding declaration takes precedence over the default. + + -- :PEP:`3120` + + +Better feedback + +* Warning or Error, when `input_encoding` value differs from the + encoding declared in the source (BOM or special comment). + +* Info or Warning, when using the "locale" fallback encoding. + +* More helpful report of UnicodeDecodeError + + - Hint at "input_encoding_error_handler"? + + - Report the line number of the undecodable character. + + - Print context around the undecodable character + (decode with "replace" error handler, print slice around error)? + References ========== `<input-encoding-tests.py>`_ - Test script for the exploration of the handling of input encoding + Script for the exploration of the handling of input encoding in Python and Docutils. +Patches #194 Deprecate PEP 263 coding slugs support + https://sourceforge.net/p/docutils/patches/194/ + .. _input_encoding: https://docutils.sourceforge.io/docs/user/config.html#input-encoding .. _UTF-8 mode: https://docs.python.org/3/library/os.html#utf8-mode +.. _codecs: + https://docs.python.org/3/library/codecs.html#encodings-and-unicode +.. _BOM: https://docs.python.org/3/library/codecs.html#codecs.BOM Copyright |