author | Georg Brandl <georg@python.org> | 2014-11-06 13:18:32 +0100 |
---|---|---|
committer | Georg Brandl <georg@python.org> | 2014-11-06 13:18:32 +0100 |
commit | 8c0814068d229cfbf67f9e3a070bcdaa089c7ffa (patch) | |
tree | 3297ab209f67532ff71c9e8b82b6edd1f8b984a6 /doc | |
parent | 69e83eb0856666d2594c96b1e8fae42dbeb92318 (diff) | |
Update docs w.r.t. encodings.
Diffstat (limited to 'doc')
-rw-r--r-- | doc/docs/cmdline.rst | 8 |
-rw-r--r-- | doc/docs/lexers.rst | 2 |
-rw-r--r-- | doc/docs/unicode.rst | 20 |
3 files changed, 21 insertions, 9 deletions
diff --git a/doc/docs/cmdline.rst b/doc/docs/cmdline.rst
index bf0177a3..165af969 100644
--- a/doc/docs/cmdline.rst
+++ b/doc/docs/cmdline.rst
@@ -136,9 +136,13 @@ Pygments tries to be smart regarding encodings in the formatting process:
 * If you give an ``outencoding`` option, it will override ``encoding``
   as the output encoding.
 
+* If you give an ``inencoding`` option, it will override ``encoding``
+  as the input encoding.
+
 * If you don't give an encoding and have given an output file, the default
-  encoding for lexer and formatter is ``latin1`` (which will pass through
-  all non-ASCII characters).
+  encoding for lexer and formatter is the terminal encoding or the default
+  locale encoding of the system. As a last resort, ``latin1`` is used (which
+  will pass through all non-ASCII characters).
 
 * If you don't give an encoding and haven't given an output file (that means
   output is written to the console), the default encoding for lexer and
diff --git a/doc/docs/lexers.rst b/doc/docs/lexers.rst
index 914b53ef..fefc940e 100644
--- a/doc/docs/lexers.rst
+++ b/doc/docs/lexers.rst
@@ -27,7 +27,7 @@ Currently, **all lexers** support these options:
 `encoding`
     If given, must be an encoding name (such as ``"utf-8"``). This encoding
     will be used to convert the input string to Unicode (if it is not already
-    a Unicode string). The default is ``"latin1"``.
+    a Unicode string). The default is ``"guess"``.
 
     If this option is set to ``"guess"``, a simple UTF-8 vs. Latin-1
     detection is used, if it is set to ``"chardet"``, the
diff --git a/doc/docs/unicode.rst b/doc/docs/unicode.rst
index e79b4bec..7291a3b2 100644
--- a/doc/docs/unicode.rst
+++ b/doc/docs/unicode.rst
@@ -6,12 +6,20 @@ Since Pygments 0.6, all lexers use unicode strings internally. Because of that
 you might encounter the occasional :exc:`UnicodeDecodeError` if you pass strings
 with the wrong encoding.
 
-Per default all lexers have their input encoding set to `latin1`.
-If you pass a lexer a string object (not unicode), it tries to decode the data
-using this encoding.
-You can override the encoding using the `encoding` lexer option. If you have the
-`chardet`_ library installed and set the encoding to ``chardet`` if will ananlyse
-the text and use the encoding it thinks is the right one automatically:
+Per default all lexers have their input encoding set to `guess`. This means
+that the following encodings are tried:
+
+* UTF-8 (including BOM handling)
+* The locale encoding (i.e. the result of `locale.getpreferredencoding()`)
+* As a last resort, `latin1`
+
+If you pass a lexer a byte string object (not unicode), it tries to decode the
+data using this encoding.
+
+You can override the encoding using the `encoding` or `inencoding` lexer
+options. If you have the `chardet`_ library installed and set the encoding to
+``chardet`` if will ananlyse the text and use the encoding it thinks is the
+right one automatically:
 
 .. sourcecode:: python
 
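The fallback order that the updated unicode.rst text gives for the `guess` encoding (UTF-8 with BOM handling, then the locale encoding, then `latin1`) can be illustrated with a small standard-library sketch. This is not Pygments' own code: the function name `guess_decode_sketch` is hypothetical, and using the `utf-8-sig` codec for BOM handling is an assumption made here for illustration.

```python
import locale
from typing import Tuple


def guess_decode_sketch(data: bytes) -> Tuple[str, str]:
    """Decode bytes by trying UTF-8, then the locale encoding, then latin1.

    Hypothetical helper mirroring the order listed in the docs; it is not
    the implementation shipped with Pygments.
    """
    # 1. UTF-8. The "utf-8-sig" codec also strips a leading BOM if present
    #    (assumed here to be what "BOM handling" amounts to).
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 2. The locale's preferred encoding.
    preferred = locale.getpreferredencoding(False)
    try:
        return data.decode(preferred), preferred
    except (UnicodeDecodeError, LookupError):
        # LookupError covers locale names Python has no codec for.
        pass

    # 3. Last resort: latin1 maps every byte to a code point, so it never fails.
    return data.decode("latin1"), "latin1"


if __name__ == "__main__":
    text, used = guess_decode_sketch("h\u00e9llo".encode("utf-8"))
    print(repr(text), used)
```

Because `latin1` accepts any byte sequence, the last step always succeeds. This is why the documentation can describe it as passing through all non-ASCII characters, although the resulting text may be wrong if the input was really in another encoding.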