author | Georg Brandl <georg@python.org> | 2014-01-18 21:16:32 +0100 |
---|---|---|
committer | Georg Brandl <georg@python.org> | 2014-01-18 21:16:32 +0100 |
commit | ff3a8dea781fb0492de4abbd4da48a5b1c110974 (patch) | |
tree | 5aaf665818ca148242ba821fc95940b396009f17 /doc/docs/unicode.rst | |
parent | 97703d63f39e6086d497a6a749c9eee3293dcbeb (diff) | |
download | pygments-ff3a8dea781fb0492de4abbd4da48a5b1c110974.tar.gz |
New docs + website using Sphinx.
Diffstat (limited to 'doc/docs/unicode.rst')
-rw-r--r-- | doc/docs/unicode.rst | 50 |
1 files changed, 50 insertions, 0 deletions
diff --git a/doc/docs/unicode.rst b/doc/docs/unicode.rst
new file mode 100644
index 00000000..e79b4bec
--- /dev/null
+++ b/doc/docs/unicode.rst
@@ -0,0 +1,50 @@
+=====================
+Unicode and Encodings
+=====================
+
+Since Pygments 0.6, all lexers use unicode strings internally. Because of that
+you might encounter the occasional :exc:`UnicodeDecodeError` if you pass strings
+with the wrong encoding.
+
+Per default all lexers have their input encoding set to `latin1`.
+If you pass a lexer a string object (not unicode), it tries to decode the data
+using this encoding.
+You can override the encoding using the `encoding` lexer option. If you have the
+`chardet`_ library installed and set the encoding to ``chardet``, it will analyse
+the text and use the encoding it thinks is the right one automatically:
+
+.. sourcecode:: python
+
+    from pygments.lexers import PythonLexer
+    lexer = PythonLexer(encoding='chardet')
+
+The best way is to pass Pygments unicode objects. In that case you can't get
+unexpected output.
+
+The formatters now send Unicode objects to the stream if you don't set the
+output encoding. You can do so by passing the formatters an `encoding` option:
+
+.. sourcecode:: python
+
+    from pygments.formatters import HtmlFormatter
+    f = HtmlFormatter(encoding='utf-8')
+
+**You will have to set this option if you have non-ASCII characters in the
+source and the output stream does not accept Unicode written to it!**
+This is the case for all regular files and for terminals.
+
+Note: The Terminal formatter tries to be smart: if its output stream has an
+`encoding` attribute, and you haven't set the option, it will encode any
+Unicode string with this encoding before writing it. This is the case for
+`sys.stdout`, for example. The other formatters don't have that behavior.
+
+Another note: If you call Pygments via the command line (`pygmentize`),
+encoding is handled differently, see :doc:`the command line docs <cmdline>`.
+
+.. versionadded:: 0.7
+   The formatters now also accept an `outencoding` option which will override
+   the `encoding` option if given. This makes it possible to use a single
+   options dict with lexers and formatters, and still have different input and
+   output encodings.
+
+.. _chardet: http://chardet.feedparser.org/
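
The encoding rules described in the added file can be illustrated with a small standard-library sketch. This is not Pygments code: the helper names `decode_input`, `resolve_output_encoding` and `encode_output` are hypothetical and only mirror the documented behaviour (byte input is decoded with the input encoding, `outencoding` takes precedence over `encoding` on output, and output stays unicode when no output encoding is set).

```python
# Illustrative sketch only: these helpers are hypothetical and are NOT part
# of Pygments. They mimic the rules described in the documentation above.


def decode_input(text, encoding="latin1"):
    """Return unicode text; byte strings are decoded with `encoding`."""
    if isinstance(text, bytes):
        return text.decode(encoding)
    return text


def resolve_output_encoding(options):
    """Pick the output encoding: `outencoding` wins over `encoding`."""
    return options.get("outencoding") or options.get("encoding")


def encode_output(text, options):
    """Encode unicode output only if an output encoding was requested."""
    encoding = resolve_output_encoding(options)
    return text.encode(encoding) if encoding else text


# A single options dict shared by the input and output side.
options = {"encoding": "latin1", "outencoding": "utf-8"}

raw = "caf\xe9".encode("latin1")                # bytes as read from a latin1 file
text = decode_input(raw, options["encoding"])   # unicode string used internally
out = encode_output(text, options)              # bytes, encoded as UTF-8
plain = encode_output(text, {})                 # no encoding set: stays unicode
```

With the shared dict, the input side reads `encoding` while the output side prefers `outencoding`, which is the use case the `versionadded` note gives for the option.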