author | Jharrod LaFon <jharrod.lafon@gmail.com> | 2014-04-14 14:01:51 -0400 |
---|---|---|
committer | Jharrod LaFon <jharrod.lafon@gmail.com> | 2014-04-14 14:01:51 -0400 |
commit | acd5cf2113bb179731a94984bb826528a31fcb06 (patch) | |
tree | 66da33b35d68d2dc3ecbc6cfb9b401ac6351a6a9 /doc/docs/unicode.rst | |
parent | a88ed45d9bdc2e158fe7d69be8e01f798ded7b8e (diff) | |
parent | 5d57fe78405ac06a306f5ed2dd1b630a909cbdfb (diff) | |
Merged head
Diffstat (limited to 'doc/docs/unicode.rst')
-rw-r--r-- | doc/docs/unicode.rst | 50 |
1 files changed, 50 insertions, 0 deletions
diff --git a/doc/docs/unicode.rst b/doc/docs/unicode.rst
new file mode 100644
index 00000000..e79b4bec
--- /dev/null
+++ b/doc/docs/unicode.rst
@@ -0,0 +1,50 @@
+=====================
+Unicode and Encodings
+=====================
+
+Since Pygments 0.6, all lexers use unicode strings internally. Because of that
+you might encounter the occasional :exc:`UnicodeDecodeError` if you pass strings
+with the wrong encoding.
+
+By default all lexers have their input encoding set to `latin1`.
+If you pass a lexer a string object (not unicode), it tries to decode the data
+using this encoding.
+You can override the encoding using the `encoding` lexer option. If you have the
+`chardet`_ library installed and set the encoding to ``chardet``, it will analyse
+the text and use the encoding it thinks is the right one automatically:
+
+.. sourcecode:: python
+
+    from pygments.lexers import PythonLexer
+    lexer = PythonLexer(encoding='chardet')
+
+The best way is to pass Pygments unicode objects. In that case you can't get
+unexpected output.
+
+The formatters now send Unicode objects to the stream if you don't set the
+output encoding. You can do so by passing the formatters an `encoding` option:
+
+.. sourcecode:: python
+
+    from pygments.formatters import HtmlFormatter
+    f = HtmlFormatter(encoding='utf-8')
+
+**You will have to set this option if you have non-ASCII characters in the
+source and the output stream does not accept Unicode written to it!**
+This is the case for all regular files and for terminals.
+
+Note: The Terminal formatter tries to be smart: if its output stream has an
+`encoding` attribute, and you haven't set the option, it will encode any
+Unicode string with this encoding before writing it. This is the case for
+`sys.stdout`, for example. The other formatters don't have that behavior.
+
+Another note: If you call Pygments via the command line (`pygmentize`),
+encoding is handled differently, see :doc:`the command line docs <cmdline>`.
+
+.. versionadded:: 0.7
+   The formatters now also accept an `outencoding` option which will override
+   the `encoding` option if given. This makes it possible to use a single
+   options dict with lexers and formatters, and still have different input and
+   output encodings.
+
+.. _chardet: http://chardet.feedparser.org/
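
The `outencoding` note in the added documentation can be sketched roughly as follows. This is a minimal illustration, not part of the commit: the sample source text and the choice of `utf-8` input and `latin1` output are assumptions made for the example, and it relies on the lexer ignoring options it does not use.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer

# Hypothetical sample input: UTF-8 encoded bytes with one non-ASCII character.
source_bytes = u"# caf\u00e9\nprint(1)\n".encode("utf-8")

# A single options dict shared by the lexer and the formatter.
# The lexer decodes its input using `encoding`; the formatter lets
# `outencoding` override `encoding` for its output.
options = {"encoding": "utf-8", "outencoding": "latin1"}

lexer = PythonLexer(**options)
formatter = HtmlFormatter(**options)

# Because an output encoding is set, highlight() returns encoded bytes
# rather than a Unicode string.
result = highlight(source_bytes, lexer, formatter)
```

Here `result` holds the HTML as `latin1` bytes even though the input was decoded as UTF-8. Dropping `outencoding` from the dict would make the formatter fall back to `encoding` and emit UTF-8 instead.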