-rw-r--r--  doc/docs/cmdline.rst       8
-rw-r--r--  doc/docs/lexers.rst        2
-rw-r--r--  doc/docs/unicode.rst      20
-rw-r--r--  pygments/lexers/dylan.py   2
4 files changed, 22 insertions, 10 deletions
diff --git a/doc/docs/cmdline.rst b/doc/docs/cmdline.rst
index bf0177a3..165af969 100644
--- a/doc/docs/cmdline.rst
+++ b/doc/docs/cmdline.rst
@@ -136,9 +136,13 @@ Pygments tries to be smart regarding encodings in the formatting process:
 * If you give an ``outencoding`` option, it will override ``encoding``
   as the output encoding.
 
+* If you give an ``inencoding`` option, it will override ``encoding``
+  as the input encoding.
+
 * If you don't give an encoding and have given an output file, the default
-  encoding for lexer and formatter is ``latin1`` (which will pass through
-  all non-ASCII characters).
+  encoding for lexer and formatter is the terminal encoding or the default
+  locale encoding of the system. As a last resort, ``latin1`` is used (which
+  will pass through all non-ASCII characters).
 
 * If you don't give an encoding and haven't given an output file (that means
   output is written to the console), the default encoding for lexer and
diff --git a/doc/docs/lexers.rst b/doc/docs/lexers.rst
index 914b53ef..fefc940e 100644
--- a/doc/docs/lexers.rst
+++ b/doc/docs/lexers.rst
@@ -27,7 +27,7 @@ Currently, **all lexers** support these options:
 `encoding`
     If given, must be an encoding name (such as ``"utf-8"``). This encoding
     will be used to convert the input string to Unicode (if it is not already
-    a Unicode string). The default is ``"latin1"``.
+    a Unicode string). The default is ``"guess"``.
 
     If this option is set to ``"guess"``, a simple UTF-8 vs. Latin-1
     detection is used, if it is set to ``"chardet"``, the
diff --git a/doc/docs/unicode.rst b/doc/docs/unicode.rst
index e79b4bec..7291a3b2 100644
--- a/doc/docs/unicode.rst
+++ b/doc/docs/unicode.rst
@@ -6,12 +6,20 @@ Since Pygments 0.6, all lexers use unicode strings internally. Because of that
 you might encounter the occasional :exc:`UnicodeDecodeError` if you pass strings
 with the wrong encoding.
 
-Per default all lexers have their input encoding set to `latin1`.
-If you pass a lexer a string object (not unicode), it tries to decode the data
-using this encoding.
-You can override the encoding using the `encoding` lexer option. If you have the
-`chardet`_ library installed and set the encoding to ``chardet`` if will ananlyse
-the text and use the encoding it thinks is the right one automatically:
+Per default all lexers have their input encoding set to `guess`. This means
+that the following encodings are tried:
+
+* UTF-8 (including BOM handling)
+* The locale encoding (i.e. the result of `locale.getpreferredencoding()`)
+* As a last resort, `latin1`
+
+If you pass a lexer a byte string object (not unicode), it tries to decode the
+data using this encoding.
+
+You can override the encoding using the `encoding` or `inencoding` lexer
+options. If you have the `chardet`_ library installed and set the encoding to
+``chardet`` if will ananlyse the text and use the encoding it thinks is the
+right one automatically:
 
 .. sourcecode:: python
 
diff --git a/pygments/lexers/dylan.py b/pygments/lexers/dylan.py
index 5e95a5e3..9875fc08 100644
--- a/pygments/lexers/dylan.py
+++ b/pygments/lexers/dylan.py
@@ -88,7 +88,7 @@ class DylanLexer(RegexLexer):
                 'type-error-value', 'type-for-copy', 'type-union', 'union', 'values',
                 'vector', 'zero?'))
 
-    valid_name = '\\\\?[a-z0-9!&*<>|^$%@_\\-+~?/=]+'
+    valid_name = '\\\\?[\\w!&*<>|^$%@\\-+~?/=]+'
 
     def get_tokens_unprocessed(self, text):
         for index, token, value in RegexLexer.get_tokens_unprocessed(self, text):
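
The fallback order that the updated unicode.rst text describes for `guess` (UTF-8 with BOM handling, then the locale encoding, then `latin1`) can be sketched with the standard library alone. This is only an illustration of the documented order, not the code from this commit: `guess_decode_sketch` is a hypothetical name, and using the `utf-8-sig` codec for the BOM handling is an assumption.

```python
import locale


def guess_decode_sketch(data):
    """Decode bytes using the order described in the docs change.

    Hypothetical helper for illustration only; returns (text, encoding_used).
    """
    # 1. UTF-8 with BOM handling: "utf-8-sig" strips a leading BOM if present
    #    and otherwise behaves like plain UTF-8 (assumed approach).
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 2. The locale encoding, i.e. the result of locale.getpreferredencoding().
    try:
        encoding = locale.getpreferredencoding()
        return data.decode(encoding), encoding
    except (UnicodeDecodeError, LookupError):
        pass

    # 3. Last resort: latin1 maps every byte to a code point, so it never fails.
    return data.decode("latin1"), "latin1"
```

Because the last step cannot raise, the helper always returns a string; which branch handles non-UTF-8 input depends on the locale of the machine running it.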