-rw-r--r--  doc/docs/cmdline.rst      8
-rw-r--r--  doc/docs/lexers.rst       2
-rw-r--r--  doc/docs/unicode.rst     20
-rw-r--r--  pygments/lexers/dylan.py  2
4 files changed, 22 insertions, 10 deletions
diff --git a/doc/docs/cmdline.rst b/doc/docs/cmdline.rst
index bf0177a3..165af969 100644
--- a/doc/docs/cmdline.rst
+++ b/doc/docs/cmdline.rst
@@ -136,9 +136,13 @@ Pygments tries to be smart regarding encodings in the formatting process:
* If you give an ``outencoding`` option, it will override ``encoding``
as the output encoding.
+* If you give an ``inencoding`` option, it will override ``encoding``
+ as the input encoding.
+
* If you don't give an encoding and have given an output file, the default
- encoding for lexer and formatter is ``latin1`` (which will pass through
- all non-ASCII characters).
+ encoding for lexer and formatter is the terminal encoding or the default
+ locale encoding of the system. As a last resort, ``latin1`` is used (which
+ will pass through all non-ASCII characters).
* If you don't give an encoding and haven't given an output file (that means
output is written to the console), the default encoding for lexer and
diff --git a/doc/docs/lexers.rst b/doc/docs/lexers.rst
index 914b53ef..fefc940e 100644
--- a/doc/docs/lexers.rst
+++ b/doc/docs/lexers.rst
@@ -27,7 +27,7 @@ Currently, **all lexers** support these options:
`encoding`
If given, must be an encoding name (such as ``"utf-8"``). This encoding
will be used to convert the input string to Unicode (if it is not already
- a Unicode string). The default is ``"latin1"``.
+ a Unicode string). The default is ``"guess"``.
If this option is set to ``"guess"``, a simple UTF-8 vs. Latin-1
detection is used, if it is set to ``"chardet"``, the
diff --git a/doc/docs/unicode.rst b/doc/docs/unicode.rst
index e79b4bec..7291a3b2 100644
--- a/doc/docs/unicode.rst
+++ b/doc/docs/unicode.rst
@@ -6,12 +6,20 @@ Since Pygments 0.6, all lexers use unicode strings internally. Because of that
you might encounter the occasional :exc:`UnicodeDecodeError` if you pass strings
with the wrong encoding.
-Per default all lexers have their input encoding set to `latin1`.
-If you pass a lexer a string object (not unicode), it tries to decode the data
-using this encoding.
-You can override the encoding using the `encoding` lexer option. If you have the
-`chardet`_ library installed and set the encoding to ``chardet`` if will ananlyse
-the text and use the encoding it thinks is the right one automatically:
+Per default all lexers have their input encoding set to `guess`. This means
+that the following encodings are tried:
+
+* UTF-8 (including BOM handling)
+* The locale encoding (i.e. the result of `locale.getpreferredencoding()`)
+* As a last resort, `latin1`
+
+If you pass a lexer a byte string object (not unicode), it tries to decode the
+data using this encoding.
+
+You can override the encoding using the `encoding` or `inencoding` lexer
+options. If you have the `chardet`_ library installed and set the encoding to
+``chardet`` if will ananlyse the text and use the encoding it thinks is the
+right one automatically:
.. sourcecode:: python
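Reviewer note: the fallback order that the ``unicode.rst`` hunk above documents for ``guess`` (UTF-8 with BOM handling, then the locale encoding, then ``latin1``) could be sketched roughly as follows. This is only an illustration of the documented order, not the code Pygments actually ships; the function name ``sketch_guess_decode`` is hypothetical, and using the ``utf-8-sig`` codec for BOM handling is an assumption of this sketch.

```python
import locale


def sketch_guess_decode(data):
    """Decode bytes using the order described in the docs hunk above.

    Hypothetical helper for illustration only; returns (text, encoding_used).
    """
    # 1. UTF-8 first. The "utf-8-sig" codec also strips a leading BOM
    #    (assumption: this is how BOM handling is modelled in this sketch).
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 2. The locale encoding, as reported by locale.getpreferredencoding().
    try:
        encoding = locale.getpreferredencoding()
        return data.decode(encoding), encoding
    except (UnicodeDecodeError, LookupError):
        pass

    # 3. Last resort: latin1 maps every byte to a code point, so it never fails.
    return data.decode("latin1"), "latin1"


if __name__ == "__main__":
    print(sketch_guess_decode("h\u00e9llo".encode("utf-8")))
```

Because ``latin1`` accepts any byte sequence, the last step guarantees a result, at the cost of possibly producing the wrong characters for input in some other encoding.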
diff --git a/pygments/lexers/dylan.py b/pygments/lexers/dylan.py
index 5e95a5e3..9875fc08 100644
--- a/pygments/lexers/dylan.py
+++ b/pygments/lexers/dylan.py
@@ -88,7 +88,7 @@ class DylanLexer(RegexLexer):
'type-error-value', 'type-for-copy', 'type-union', 'union', 'values',
'vector', 'zero?'))
- valid_name = '\\\\?[a-z0-9!&*<>|^$%@_\\-+~?/=]+'
+ valid_name = '\\\\?[\\w!&*<>|^$%@\\-+~?/=]+'
def get_tokens_unprocessed(self, text):
for index, token, value in RegexLexer.get_tokens_unprocessed(self, text):
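Reviewer note: the effect of the ``valid_name`` change above can be checked in isolation. The sketch below copies both pattern strings verbatim from the hunk and compares them under Python 3 ``str`` semantics, where ``\w`` matches Unicode word characters. The helper ``matches`` and the use of ``re.IGNORECASE`` are assumptions made for this illustration; the flags the lexer itself compiles with are not visible in this diff.

```python
import re

# Pattern strings copied verbatim from the removed and added lines of the hunk.
OLD_VALID_NAME = '\\\\?[a-z0-9!&*<>|^$%@_\\-+~?/=]+'
NEW_VALID_NAME = '\\\\?[\\w!&*<>|^$%@\\-+~?/=]+'


def matches(pattern, name):
    """Hypothetical helper: True if the whole of `name` matches `pattern`."""
    # re.IGNORECASE is an assumption of this sketch, not taken from the diff.
    return re.fullmatch(pattern, name, re.IGNORECASE) is not None


if __name__ == "__main__":
    for name in ("type-union", "zero?", "\\zero?", "caf\u00e9"):
        print(repr(name), matches(OLD_VALID_NAME, name), matches(NEW_VALID_NAME, name))
```

ASCII names such as ``type-union`` match both patterns; a name containing a non-ASCII letter, such as ``café``, matches only the new ``\w``-based pattern.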