Internationalization, Localization and Unicode

Author: James Gardner
Updated: 2006-12-11

Note

This is a work in progress. We hope the internationalization, localization and Unicode support in Pylons is now robust and flexible but we would appreciate hearing about any issues you have. Just drop a line to the pylons-discuss mailing list on Google Groups.

This is the first draft of the full document including Unicode. Expect some typos and spelling mistakes!


Internationalization and localization are means of adapting software for non-native environments, especially for other nations and cultures.

Parts of an application which might need to be localized include:

  • the language and text used, including error and status messages
  • dates and times
  • number and currency formats
  • images, icons and colors with cultural connotations

The distinction between internationalization and localization is subtle but important. Internationalization is the adaptation of products for potential use virtually everywhere, while localization is the addition of special features for use in a specific locale.

For example, in terms of language used in software, internationalization is the process of marking up all strings that might need to be translated whilst localization is the process of producing translations for a particular locale.

Pylons provides built-in support for internationalizing the language used in your application, but leaves you to handle any other aspects of internationalization which might be appropriate to your application.

Note

Internationalization is often abbreviated as I18N (or i18n or I18n) where the number 18 refers to the number of letters omitted. Localization is often abbreviated L10n or l10n in the same manner. These abbreviations also avoid picking one spelling (internationalisation vs. internationalization, etc.) over the other.

In order to represent characters from multiple languages you will need to use Unicode, so this documentation will start with a description of why Unicode is useful, its history, and how to use Unicode in Python.

1   Understanding Unicode

If you've ever come across text in a foreign language that contains lots of ???? characters or have written some Python code and received a message such as UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: ordinal not in range(128) then you have run into a problem with character sets, encodings, Unicode and the like.

The truth is that many developers are put off by Unicode because most of the time it is possible to muddle through rather than take the time to learn the basics. To make matters worse, if you have a system that fudges the issues and just about works, starting to do things properly with Unicode often highlights problems in other parts of your code.

The good news is that Python has great Unicode support, so the rest of this article will show you how to correctly use Unicode in Pylons to avoid unwanted ? characters and UnicodeDecodeErrors.

1.1   What is Unicode?

When computers were first being used the characters that were most important were unaccented English letters. Each of these letters could be represented by a number between 32 and 127 and thus was born ASCII, a character set where space was 32, the letter "A" was 65 and everything could be stored in 7 bits.

Most computers in those days were using 8-bit bytes so people quickly realized that they could use the codes 128-255 for their own purposes. Different people used the codes 128-255 to represent different characters and before long these different sets of characters were themselves standardized into code pages. This meant that if you needed some non-ASCII characters in a document you could also specify a code page which would define which extra characters were available. For example Israeli DOS used a code page called 862, while Greek users used 737. This just about worked for Western languages, provided you didn't want to write an Israeli document with Greek characters, but it didn't work at all for Asian languages, where there are many more characters than can be represented in 8 bits.

Unicode is a character set that solves these problems by uniquely defining every character that is used anywhere in the world. Rather than defining a character as a particular combination of bits in the way ASCII does, each character is assigned a code point. For example the word hello is made from code points U+0048 U+0065 U+006C U+006C U+006F. The full list of code points can be found at http://www.unicode.org/charts/.

There are lots of different ways of encoding Unicode code points into bits but the most popular encoding is UTF-8. Using UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3 or 4 bytes (early definitions of UTF-8 allowed up to 6). This has the useful side effect that English text looks exactly the same in UTF-8 as it did in ASCII, because for every ASCII character with hexadecimal value 0xXY, the corresponding Unicode code point is U+00XY. This backwards compatibility is why, if you are developing an application that is only used by English speakers, you can often get away without handling characters properly and still expect things to work most of the time. Of course, if you use a different encoding such as UTF-16 this doesn't apply, since none of the code points are encoded to 8 bits.

The important things to note from the discussion so far are that:

  • Unicode can represent pretty much any character in any writing system in widespread use today

  • Unicode uses code points to represent characters and the way these map to bits in memory depends on the encoding

  • The most popular encoding is UTF-8 which has several convenient properties:
    1. It can handle any Unicode code point
    2. A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can't handle zero bytes
    3. A string of ASCII text is also valid UTF-8 text
    4. UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
    5. If bytes are corrupted or lost, it's possible to determine the start of the next UTF-8-encoded code point and resynchronize.

Note

Since Unicode 3.1, some extensions have even been defined so that the defined range is now U+000000 to U+10FFFF (21 bits), and formally the character set was defined as 31 bits to allow for future expansion. It is a myth that there are 65,536 Unicode code points and that every Unicode letter can really be squeezed into two bytes. It is also incorrect to think that UTF-8 can represent fewer characters than UTF-16. UTF-8 simply uses a variable number of bytes for a character, sometimes just one byte (8 bits).

1.2   Unicode in Python

In Python, Unicode strings are expressed as instances of the built-in unicode type. Under the hood, Python represents Unicode strings as sequences of either 16-bit or 32-bit integers, depending on how the Python interpreter was compiled.

The unicode() constructor has the signature unicode(string[, encoding, errors]). All of its arguments should be 8-bit strings. The first argument is converted to Unicode using the specified encoding; if you leave off the encoding argument, the ASCII encoding is used for the conversion, so characters greater than 127 will be treated as errors:

>>> unicode('hello')
u'hello'
>>> s = unicode('hello')
>>> type(s)
<type 'unicode'>
>>> unicode('hello' + chr(255))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
                    ordinal not in range(128)

The errors argument specifies what to do if the string can't be decoded with the given encoding. Legal values for this argument are 'strict' (raise a UnicodeDecodeError exception), 'replace' (replace the character that can't be decoded with another one), or 'ignore' (just leave the character out of the Unicode result).

>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
                    ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'

It is important to understand the difference between encoding and decoding. A Unicode string is considered to be the Unicode code points themselves, but any representation of the string has to be encoded to something else, for example UTF-8 or ASCII. So when you are converting an ASCII or UTF-8 string to Unicode you are decoding it, and when you are converting from Unicode to UTF-8 or ASCII you are encoding it. This is why the error in the example above says that the ASCII codec cannot decode the byte 0x80 because it is not in range(128), i.e. 0-127. In fact 0x80 is hex for 128, which is the first number outside the ASCII range. However, if we tell Python that the byte 0x80 is encoded with the 'latin-1', 'iso_8859_1' or '8859' character sets (which incidentally are different names for the same thing) we get the result we expected:
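
>>> unicode('\x80abc', 'latin-1')
u'\x80abc'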

Note

The character encodings Python supports are listed at http://docs.python.org/lib/standard-encodings.html

Unicode objects in Python have most of the same methods that normal Python strings provide. Python will try to use the 'ascii' codec to convert strings to Unicode if you do an operation on both types:
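
>>> u'hello'.upper()
u'HELLO'
>>> s = u'Hello '
>>> s + 'world'
u'Hello world'
>>> s + chr(229)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
                    ordinal not in range(128)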

You can encode a Unicode string using a particular encoding like this:
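
>>> u'Hello World!'.encode('utf-8')
'Hello World!'
>>> u'Caf\xe9'.encode('utf-8')
'Caf\xc3\xa9'
>>> u'Caf\xe9'.encode('ascii', 'replace')
'Caf?'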

1.3   Unicode Literals in Python Source Code

In Python source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character:
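
>>> s = u'hello'
>>> s
u'hello'
>>> type(s)
<type 'unicode'>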

You can also use the u"...", u'''...''' and u"""...""" forms too. For example:
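
>>> u"hello"
u'hello'
>>> u'''A triple-quoted
... Unicode string'''
u'A triple-quoted\nUnicode string'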

Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects eight hex digits instead of four. Unicode literals can also use the same escape sequences as 8-bit strings, including \x, but \x only takes two hex digits so it can't express all the available code points. You can create one-character Unicode strings with the unichr() built-in function and find out the code point of a character with ord().

Here is an example demonstrating the different alternatives:
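
>>> s = u"a\xac\u1234\u20ac\U00008000"
>>> for c in s: print ord(c),
...
97 172 4660 8364 32768
>>> unichr(8364)
u'\u20ac'
>>> ord(u'\u20ac')
8364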

Using escape sequences for code points greater than 127 is fine in small doses, but Python 2.4 and above support writing Unicode literals in any encoding as long as you declare the encoding being used by including a special comment as either the first or second line of the source file:
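
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# The file itself must actually be saved as UTF-8 for this to work
u = u'abcdé'
print ord(u[-1])        # prints 233, the code point of é (U+00E9)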

If you don't include such a comment, the default encoding used will be ASCII. Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default encoding for string literals; in Python 2.4, characters greater than 127 still work but result in a warning. For example, the following program has no encoding declaration:
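
#!/usr/bin/env python
u = u'abcdé'     # this file is saved as Latin-1, so the é is byte 0xE9
print ord(u[-1])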

When you run it with Python 2.4, it will output the following warning:

sys:1: DeprecationWarning: Non-ASCII character '\xe9' in file testas.py on
line 2, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details

and then the following output:

233

For real-world use it is recommended that you use the UTF-8 encoding for your files, but you must be sure that your text editor actually saves them as UTF-8; otherwise the Python interpreter will parse the literals as UTF-8 when the bytes are actually stored in some other encoding.

Note

Windows users who use the SciTE editor can specify the encoding of their file using the File->Encoding menu.

Note

If you are working with Unicode in detail you might also be interested in the unicodedata module which can be used to find out Unicode properties such as a character's name, category, numeric value and the like.
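
For example:

>>> import unicodedata
>>> unicodedata.name(u'\u00e9')
'LATIN SMALL LETTER E WITH ACUTE'
>>> unicodedata.category(u'\u00e9')
'Ll'
>>> unicodedata.numeric(u'\u00bd')
0.5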

1.4   Input and Output

We now know how to use Unicode in Python source code, but input and output need some thought too. Of course, some libraries natively support Unicode, and if these libraries return Unicode objects you will not have to do anything special to support them. XML parsers and SQL databases frequently support Unicode, for example.

If you remember from the discussion earlier, Unicode data consists of code points. In order to send Unicode data via a socket or write it to a file you usually need to encode it to a series of bytes, and then decode the data back to Unicode when reading it. You could perform the encoding and decoding manually, reading a byte at a time, but since encodings such as UTF-8 can use a variable number of bytes per character it is usually much easier to use Python's built-in support in the form of the codecs module.

The codecs module includes a version of the open() function that returns a file-like object that assumes the file's contents are in a specified encoding and accepts Unicode parameters for methods such as .read() and .write().

The function's parameters are open(filename, mode='rb', encoding=None, errors='strict', buffering=1). mode can be 'r', 'w', or 'a', just like the corresponding parameter to the regular built-in open() function, and you can add a + character to update the file. buffering is similar to the standard function's parameter. encoding is a string giving the encoding to use; if it is not specified or is specified as None, a regular Python file object that accepts 8-bit strings is returned. Otherwise, a wrapper object is returned, and data written to or read from the wrapper object will be converted as needed. errors specifies the action for encoding errors and can be one of the usual values of 'strict', 'ignore' or 'replace' which we saw earlier in this document when decoding strings.

Here is an example of how to read Unicode from a UTF-8 encoded file:
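
import codecs

# Assumes a UTF-8 encoded file called unicode.txt in the current directory
f = codecs.open('unicode.txt', encoding='utf-8')
for line in f:
    print repr(line)
f.close()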

It's also possible to open files in update mode, allowing both reading and writing:
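
import codecs

f = codecs.open('unicode.txt', mode='r+', encoding='utf-8')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])     # prints u'\u4500'
f.close()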

Notice that we used the repr() function to display the Unicode data. This is very useful because if you tried to print the Unicode data directly, Python would need to encode it before it could be sent to the console, and depending on which characters were present and the character set used by the console, an error might be raised. This is avoided if you use repr().

The Unicode character U+FEFF is used as a byte-order mark or BOM, and is often written as the first character of a file in order to assist with auto-detection of the file's byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file, but with others such as UTF-8 it isn't necessary.

When such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as 'utf-16-le' and 'utf-16-be' for little-endian and big-endian encodings, that specify one particular byte ordering and don't skip the BOM.

Note

Some editors including SciTE will put a byte order mark (BOM) in the text file when saved as UTF-8, which is strange because UTF-8 doesn't need BOMs.

1.5   Unicode Filenames

Most modern operating systems support the use of Unicode filenames. The filenames are transparently converted to the underlying filesystem encoding. The type of encoding depends on the operating system.

  • On Windows 9x, the encoding is mbcs.

  • On Mac OS X, the encoding is utf-8.

  • On Unix, the encoding is the user's preference according to the result of nl_langinfo(CODESET), or None if nl_langinfo(CODESET) fails.

  • On Windows NT+, file names are Unicode natively, so no conversion is performed. sys.getfilesystemencoding() still returns 'mbcs', as this is the encoding that applications should use when they explicitly want to convert Unicode strings to byte strings that are equivalent when used as file names.

mbcs is a special Windows encoding that effectively means "use whichever encoding is appropriate". In Python 2.3 and above you can find out the filesystem encoding with sys.getfilesystemencoding().

Most file and directory functions and methods support Unicode. For example:
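
import os

fn = u'Sample\u1388'     # a filename containing a non-ASCII character
f = open(fn, 'w')        # open() accepts Unicode filenames directly
f.write('blah\n')
f.close()
print os.stat(fn)        # as do os.stat(), os.remove() and friends
os.remove(fn)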

Other functions such as os.listdir() will return Unicode if you pass a Unicode argument and will try to return strings if you pass an ordinary 8-bit string. For example, running the following as test.py:
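
# test.py
import os

fn = u'Sample\u1388'
f = open(fn, 'w')
f.close()
print os.listdir('.'), os.listdir(u'.')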

will produce the following output:

['Sample?', 'test.py'] [u'Sample\u1388', u'test.py']

2   Applying this to Web Programming

So far we've seen how to use encoding in source files and seen how to decode text to Unicode and encode it back to text. We've also seen that Unicode objects can be manipulated in similar ways to strings and we've seen how to perform input and output operations on files. Next we are going to look at how best to use Unicode in a web app.

The main rule is this:

Your application should use Unicode for all strings internally, decoding
any input to Unicode as soon as it enters the application and encoding the
Unicode to UTF-8 or another encoding only on output.

If you fail to do this you will find that UnicodeDecodeErrors will start popping up in unexpected places when Unicode strings are used with normal 8-bit strings, because Python's default encoding is ASCII and it will try to decode the text to ASCII and fail. It is always better to do any encoding or decoding at the edges of your application, otherwise you will end up patching lots of different parts of your application unnecessarily as and when errors pop up.

Unless you have a very good reason not to, it is wise to use UTF-8 as the default encoding since it is so widely supported.

The second rule is:

Always test your application with characters above 127 and above 255
wherever possible.

If you fail to do this you might think your application is working fine, but as soon as your users put in non-ASCII characters you will have problems. Arabic is always a good test and www.google.ae is a good source of sample text.

The third rule is:

Always do any checking of a string for illegal characters once it's in the
form that will be used or stored, otherwise the illegal characters might be
disguised.

For example, let's say you have a content management system that takes a Unicode filename, and you want to disallow paths with a '/' character. You might write this code:
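
def read_file(filename, encoding):
    # Check for illegal characters in the *encoded* form...
    if '/' in filename:
        raise ValueError("'/' not allowed in filenames")
    # ...then decode and use the result -- this ordering is the bug
    unicode_name = filename.decode(encoding)
    f = open(unicode_name, 'r')
    # ... return contents of file ...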

This is INCORRECT. If an attacker could specify the 'base64' encoding, they could pass L2V0Yy9wYXNzd2Q=, which is the base-64 encoded form of the string '/etc/passwd', a file you clearly don't want an attacker to get hold of. The above code looks for / characters in the encoded form and misses the dangerous character in the resulting decoded form.

Those are the three basic rules so now we will look at some of the places you might want to perform Unicode decoding in a Pylons application.

2.1   Request Parameters

Currently the Pylons input values come from request.params but these are not decoded to Unicode by default because not all input should be assumed to be Unicode data.

If you would like to decode them to Unicode, however, you can use two helper functions along the lines of the sketch below (the function names here are illustrative, not part of the Pylons API):
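
def decode_value(value, encoding='utf-8', errors='strict'):
    # Decode a single request parameter to Unicode
    return value.decode(encoding, errors)

def decode_params(params, encoding='utf-8', errors='strict'):
    # Return a new dict with the keys and values of the request's
    # params decoded to Unicode.  Values without a decode method
    # (such as file uploads) are left untouched; note that repeated
    # parameters are collapsed in this simple sketch.
    decoded = {}
    for key, value in params.items():
        if isinstance(value, str):
            value = value.decode(encoding, errors)
        decoded[key.decode(encoding, errors)] = value
    return decoded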

These can then be used as follows:
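
class HelloController(BaseController):
    def index(self):
        # decode_params is the helper sketched above
        params = decode_params(request.params)
        name = params.get(u'name', u'World')
        return Response(u'Hello %s!' % name)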

This code is discussed in ticket 135 but shouldn't be used with file uploads since these shouldn't ordinarily be decoded to Unicode.

2.2   Templating

Pylons uses Myghty as its default templating language and Myghty 1.1 and above fully supports Unicode. The Myghty documentation explains how to use Unicode at http://www.myghty.org/docs/unicode.myt, but the important idea is that you can use Unicode literals pretty much anywhere you can use normal 8-bit strings, including in m.write() and m.comp(). You can also pass Unicode data to Pylons' render_response() and Response() callables.

Any Unicode data output by Myghty is automatically encoded to whichever encoding you have chosen. The default is UTF-8, but you can choose which encoding to use by editing your project's config/environment.py file and adding an option like the one below (a sketch assuming the myghty options dictionary available in a Pylons 0.9-era environment.py):
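
# in config/environment.py
myghty['output_encoding'] = 'UTF-8'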

replacing UTF-8 with the encoding you wish to use.

If you need to disable Unicode support altogether you can set this (a sketch assuming Myghty's disable_unicode option):
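
# in config/environment.py
myghty['disable_unicode'] = True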

but again, you would have to have a good reason to want to do this.

2.3   Output Encoding

Web pages should be generated with a specific encoding, most likely UTF-8. At the very least, that means you should specify the following in the <head> section:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

You should also set the charset in the Content-Type header:
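
# A sketch inside a controller action; assumes the headers dictionary
# on the Pylons response object
def index(self):
    resp = render_response('/mytemplate.myt')
    resp.headers['Content-Type'] = 'text/html; charset=UTF-8'
    return resp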

If you specify that your output is UTF-8, generally the web browser will give you UTF-8. If you want the browser to submit data using a different character set, you can set the encoding by adding the accept-charset attribute to your form. Here is an example:

<form accept-charset="US-ASCII" ...>

However, be forewarned that if the user tries to give you non-ASCII text, then:

  • Firefox will translate the non-ASCII text into HTML entities.
  • IE will ignore your suggested encoding and give you UTF-8 anyway.

The lesson to be learned is that if you output UTF-8, you had better be prepared to accept UTF-8 by decoding the data in request.params as described in the section above entitled "Request Parameters".

Another technique which is sometimes used to determine the character set is to use an algorithm to analyse the input and guess the encoding based on probabilities.

For instance, if you get a file, and you don't know what encoding it is encoded in, you can often rename the file with a .txt extension and then try to open it in Firefox. Then you can use the "View->Character Encoding" menu to try to auto-detect the encoding.

2.4   Databases

Your database driver should automatically convert from Unicode objects to a particular charset when writing and back again when reading. Again it is normal to use UTF-8 which is well supported.

You should check your database's documentation for information on how it handles Unicode.

For example, MySQL's Unicode documentation is at http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html

Also note that you need to consider both the encoding of the database and the encoding used by the database driver.

If you're using MySQL together with SQLAlchemy, see the following, as there are some bugs in MySQLdb that you'll need to work around:

http://www.mail-archive.com/sqlalchemy@googlegroups.com/msg00366.html

3   Internationalization and Localization

By now you should have a good idea of what Unicode is, how to use it in Python, and which areas of your application need specific attention to decoding and encoding Unicode data.

This final section will look at the issue of making your application work with multiple languages.

3.1   Getting Started

Everywhere in your code where you want strings to be available in different languages you wrap them in the _() function. There are also a number of other translation functions which are documented in the API reference at http://pylonshq.com/docs/module-pylons.i18n.translation.html

Note

The _() function is a reference to the ugettext() function. _() is a convention for marking text to be translated and saves on keystrokes. ugettext() is the Unicode version of gettext().

In our example we want the string 'Hello' to appear in three different languages: English, French and Spanish. We also want to display the word 'Hello' in the default language. We'll then go on to use some plural words too.

Let's call our project translate_demo:

paster create --template=pylons translate_demo

Now let's add a friendly controller that says hello:

cd translate_demo
paster controller hello

Edit the controllers/hello.py controller to look something like this, making use of the _() function everywhere the string Hello appears (a sketch assuming set_lang() can be imported from pylons.i18n):
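
from translate_demo.lib.base import *
from pylons.i18n import set_lang

class HelloController(BaseController):

    def index(self):
        resp = Response()
        resp.write('Default: %s<br />' % _('Hello'))
        for lang in ['fr', 'en', 'es']:
            set_lang(lang)
            resp.write("%s: %s<br />" % (lang, _('Hello')))
        return resp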

When writing your controllers it is important not to piece sentences together manually, because the grammar of other languages might require a different word order. For example, this is bad:
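
# Bad: pieces the sentence together from fragments, assuming
# English word order
msg = _('Hello ') + name + _(', how are you?')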

but this is perfectly acceptable:
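
# Good: one complete string with a placeholder the translator can move
msg = _('Hello %(name)s, how are you?') % {'name': name}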

The controller has now been internationalized but it will raise a LanguageError until we have specified the alternative languages.

Pylons uses GNU gettext to handle internationalization. GNU gettext uses three types of files in the translation framework.

POT (Portable Object Template) files

The first step in the localization process. A program is used to search through your project's source code and pick out every string passed to one of the translation functions, such as _(). This list is put together in a specially-formatted template file that will form the basis of all translations. This is the .pot file.

PO (Portable Object) files

The second step in the localization process. Using the POT file as a template, the list of messages are translated and saved as a .po file.

MO (Machine Object) files

The final step in the localization process. The PO file is run through a program that turns it into an optimized machine-readable binary file, which is the .mo file. Compiling the translations into this binary format makes it much faster for the localized program to retrieve the translations while it is running.

Versions of Pylons prior to 0.9.4 came with a setuptools extension to help with the extraction of strings and production of a .mo file. The implementation supported neither Unicode nor the ungettext function and was therefore dropped in Pylons 0.9.4.

You will therefore need to use an external program to perform these tasks. You may use whichever you prefer but xgettext is highly recommended. Python's gettext utility has some bugs, especially regarding plurals.

Here are some compatible tools and projects:

The Rosetta Project (https://launchpad.ubuntu.com/rosetta/)

The Ubuntu Linux project has a web site that allows you to translate messages without even looking at a PO or POT file, and export directly to a MO.

poEdit (http://www.poedit.org/)

An open source program for Windows and UNIX/Linux which provides an easy-to-use GUI for editing PO files and generating MO files.

KBabel (http://i18n.kde.org/tools/kbabel/)

Another open source PO editing program for KDE.

GNU Gettext (http://www.gnu.org/software/gettext/)

The official Gettext tools package contains command-line tools for creating POTs, manipulating POs, and generating MOs. For those comfortable with a command shell.

As an example we will quickly discuss the use of poEdit which is cross platform and has a GUI which makes it easier to get started with.

To use poEdit with the translate_demo you would do the following:

  1. Download and install poEdit.
  2. Start poEdit and select File->New catalog. In the dialog that pops up, fill in all the fields you can on the Project Info tab, enter the path to your project on the Paths tab (e.g. /path/to/translate_demo) and enter the following keywords on separate lines on the Keywords tab: _, N_, ugettext, gettext, ngettext, ungettext.
  3. Click OK

poEdit will search your source tree and find all the strings you have marked up. You can then enter your translations in whatever charset you chose in the project info tab. UTF-8 is a good choice.

Finally, after entering your translations you then save the catalog and rename the .mo file produced to translate_demo.mo and put it in the translate_demo/i18n/es/LC_MESSAGES directory or whatever is appropriate for your translation.

You will need to repeat the process of creating a .mo file for the fr, es and en translations.

The relevant lines from i18n/en/LC_MESSAGES/translate_demo.po look like this:

#: translate_demo\controllers\hello.py:6 translate_demo\controllers\hello.py:9
msgid "Hello"
msgstr "Hello"

The relevant lines from i18n/es/LC_MESSAGES/translate_demo.po look like this:

#: translate_demo\controllers\hello.py:6 translate_demo\controllers\hello.py:9
msgid "Hello"
msgstr "°Hola!"

The relevant lines from i18n/fr/LC_MESSAGES/translate_demo.po look like this:

#: translate_demo\controllers\hello.py:6 translate_demo\controllers\hello.py:9
msgid "Hello"
msgstr "Bonjour"

Whichever tools you use you should end up with an i18n directory that looks like this when you have finished:

i18n/en/LC_MESSAGES/translate_demo.po
i18n/en/LC_MESSAGES/translate_demo.mo
i18n/es/LC_MESSAGES/translate_demo.po
i18n/es/LC_MESSAGES/translate_demo.mo
i18n/fr/LC_MESSAGES/translate_demo.po
i18n/fr/LC_MESSAGES/translate_demo.mo

3.2   Testing the Application

Start the server with the following command:

paster serve --reload development.ini

Test your controller by visiting http://localhost:5000/hello. You should see the following output:

Default: Hello
fr: Bonjour
en: Hello
es: ¡Hola!

You can now set the language used in a controller on the fly.

For example this could be used to allow a user to set which language they wanted your application to work in. You could save the chosen value to the session object, along these lines (the action name and parameter below are illustrative):
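
def set_language(self):
    # Remember the user's choice (here taken from a request
    # parameter) in the session
    session['lang'] = request.params.get('lang', 'en')
    session.save()
    return Response(_('Language set'))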

Then, on each request, the language to be used can be read from the session and set in your controller's __before__() method so that the pages remain in the language that was previously chosen:
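
class HelloController(BaseController):

    def __before__(self):
        # Restore the language chosen earlier, if any; assumes
        # set_lang is imported from pylons.i18n as before
        lang = session.get('lang')
        if lang is not None:
            set_lang(lang)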

One more useful thing to be able to do is to set the default language in the configuration file. Just add a lang variable together with the code of the language you want to use to your development.ini file. For example, to set the default language to Spanish you would add lang = es. The relevant part of the file might look something like this:
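
# development.ini (extract; other options omitted)
[app:main]
use = egg:translate_demo
lang = es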

If you are running the server with the --reload option the server will automatically restart if you change the development.ini file. Otherwise restart the server manually and the output would this time be as follows:

Default: ¡Hola!
fr: Bonjour
en: Hello
es: ¡Hola!

3.3   Missing Translations

If your code calls _() with a string that doesn't exist in your language catalogue, the string passed to _() is returned instead.

Modify the last resp.write() call (the one inside the loop) of the hello controller to look like this:
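
            resp.write("%s: %s %s<br />" % (lang, _('Hello'), _('World!')))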

Warning

Of course, in real life breaking up sentences in this way is very dangerous because some grammars might require the order of the words to be different.

If you run the example again the output will be:

Default: ¡Hola!
fr: Bonjour World!
en: Hello World!
es: ¡Hola! World!

This is because we never provided a translation for the string 'World!' so the string itself is used.

3.4   Translations Within Templates

You can also use the _() function within templates in exactly the same way you do in code. For example:
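
<p><% _('Hello') %></p>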

would produce the string 'Hello' in the language you had set.

There is one complication though. gettext's xgettext command can only extract strings that need translating from Python code in .py files. This means that if you write _('Hello') in a template such as a Myghty template, xgettext will not find the string 'Hello' as one which needs translating.

As long as xgettext can find a string marked for translation with one of the translation functions somewhere in the Python code in your project, the translation will also be used when the same string is marked for translation in a Myghty template.

One solution to ensure all strings are picked up for translation is to create a file in lib with an appropriate filename, i18n.py for example, containing a list of all the strings which appear in your templates. Your translation tool can then extract the strings in lib/i18n.py for translation, and the translated versions will be used in your templates as well.

For example if you wanted to ensure the translated string 'Good Morning' was available in all templates you could create a lib/i18n.py file that looked something like this:
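
# lib/i18n.py
# This module only needs to exist so that the extraction tool finds
# these strings; it does not need to be imported at runtime.  The
# import location of _() is an assumption; use whichever module
# exposes it in your project.
from pylons.i18n import _

_('Good Morning')
_('Good Afternoon')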

This approach requires quite a lot of work and is rather fragile. If you are using a templating system such as Myghty or Cheetah, which compile templates to Python files, the best solution is to use a Makefile to ensure that every template has been compiled to Python before running the extraction tool, so that every template is scanned.

Of course, if your cache directory is in the default location or elsewhere within your project's filesystem, you will probably find that all templates have been compiled as Python files during the course of the development process. This means that your tool's extraction command will successfully pick up strings to translate from the cached files anyway.

You may also find that your extraction tool is capable of extracting the strings correctly from the template anyway, particularly if the templating language is quite similar to Python. It is best not to rely on this though.

3.5   Producing a Python Egg

Finally you can produce an egg of your project which includes the translation files like this:

python setup.py bdist_egg

The setup.py automatically includes the .mo language catalogs your application needs so that your application can be distributed as an egg. This is done with the following line in your setup.py file:

package_data={'translate_demo': ['i18n/*/LC_MESSAGES/*.mo']},

Internationalization support is zip safe so your application can be run directly from the egg without the need for easy_install to extract it.

3.6   Plural Forms

Pylons also defines ungettext() and ngettext() functions which can be imported from pylons.i18n. They are designed for internationalizing plural words and can be used as follows:
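
from pylons.i18n import ungettext

# ungettext(singular, plural, n) selects the correct form for the
# current language based on n
def files_found(num):
    return ungettext('There is %(num)d file here',
                     'There are %(num)d files here',
                     num) % {'num': num}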

If you wish to use plural forms in your application you need to add the appropriate headers to the .po files for the language you are using. You can read more about this at http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC150
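
For example, the Plural-Forms header for English looks like this (a language such as Slovenian needs nplurals=4 and a correspondingly more complex plural expression):

"Plural-Forms: nplurals=2; plural=(n != 1);\n"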

One thing to keep in mind is that other languages don't have the same plural forms as English. While English only has two plural forms, singular and plural, Slovenian has four! That means that you must use gettext's support for pluralization if you hope to get pluralization right. Specifically, the following will not work:
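
# Bad: hard-codes the English rule that anything other than 1 is plural
if num == 1:
    msg = _('There is one file here')
else:
    msg = _('There are %d files here') % num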

4   Summary

Hopefully you now understand the history of Unicode, how to use it in Python and where to apply Unicode encoding and decoding in a Pylons application. You should also be able to use Unicode in your web app, remembering the basic rule: use UTF-8 to talk to the world and do the encoding and decoding at the edges of your application.

You should also be able to internationalize and then localize your application using Pylons' support for GNU gettext.

5   Further Reading

This information is based partly on the following articles, which can be consulted for further information:

http://www.joelonsoftware.com/articles/Unicode.html

http://www.amk.ca/python/howto/unicode

http://en.wikipedia.org/wiki/Internationalization

Please feel free to report any mistakes to the Pylons mailing list or to the author. Any corrections or clarifications would be gratefully received.