author | Jarkko Hietaniemi <jhi@iki.fi> | 2002-03-20 00:55:54 +0000 |
---|---|---|
committer | Jarkko Hietaniemi <jhi@iki.fi> | 2002-03-20 00:55:54 +0000 |
commit | b310b0538cc1a7948587a9e5ff30683fec2a3ece (patch) | |
tree | f3f8fa0dd8ad9ba1aecae60fa5a8079b3fb69237 /pod/perlunicode.pod | |
parent | 563aca73f1f51dd4b10089896b544a6ae41316cb (diff) | |
If it looks like UTF-8 (either nl_langinfo or locale variables),
think UTF-8, embrace your inner UTF-8, as suggested by Larry.
(And as suggested by Markus Kuhn.)
While we are at it, document also the case of
mixed hash keys as a known potential troublemaker.
(Since it's locale-related, sometimes.)
p4raw-id: //depot/perl@15350
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 34 |
1 files changed, 33 insertions, 1 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 518d239dd6..34e00c8076 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -828,6 +828,29 @@ are specifically discussed. There is no C<utfebcdic> pragma or
 the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
 for more discussion of the issues.
 
+=head2 Locales
+
+Usually locale settins and Unicode do not affect each other, but
+there are a couple of exceptions:
+
+=over 4
+
+=item *
+
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8.
+
+=item *
+
+Perl tries really hard to work both with Unicode and the old byte
+oriented world: most often this is nice, but sometimes this causes
+problems. See L</BUGS> for example how sometimes using locales
+with Unicode can be a good thing.
+
+=back
+
 =head2 Using Unicode in XS
 
 If you want to handle Perl Unicode in XS extensions, you may find
@@ -936,7 +959,16 @@ Use of locales with Unicode data may lead to odd results. Currently
 there is some attempt to apply 8-bit locale info to characters in the
 range 0..255, but this is demonstrably incorrect for locales that use
 characters above that range when mapped into Unicode. It will also
-tend to run slower. Avoidance of locales is strongly encouraged.
+tend to run slower. Avoidance of locales is strongly encouraged,
+with one known expection, see the next paragraph.
+
+If the keys of a hash are "mixed", that is, some keys are Unicode,
+while some keys are "byte", the keys may behave differently in regular
+expressions since the definition of character classes like C</\w/>
+is different for byte strings and character strings. This problem can
+sometimes be helped by using an appropriate locale (see L<perllocale>).
+Another way is to force all the strings to be character encoded by
+using utf8::upgrade() (see L<utf8>).
 
 Some functions are slower when working on UTF-8 encoded strings than
 on byte encoded strings. All functions that need to hop over
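
The "mixed" key problem described in the second hunk can be illustrated with a minimal sketch. It is not part of the patch. The key value and the helper C<has_word_char> are hypothetical, and the sketch assumes a perl running with its default matching rules (no C<use locale>, no C<unicode_strings> feature, no C</u> or C</a> regex flags), where the difference between byte strings and character strings shows up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string contains a \w character
# under the matching rules in effect here, else 0.
sub has_word_char {
    my ($s) = @_;
    return $s =~ /\w/ ? 1 : 0;
}

# Hypothetical hash key: code point 0xE9 (e with acute), held as one byte.
my $byte_key = "\xE9";

# Copy it and force the copy into perl's internal UTF-8 representation,
# as the patch suggests doing with utf8::upgrade().
my $char_key = $byte_key;
utf8::upgrade($char_key);

# The two strings still compare equal with eq.
my $same = ($byte_key eq $char_key) ? 1 : 0;

printf "equal: %d\n", $same;
printf "byte key matches \\w: %d\n", has_word_char($byte_key);
printf "char key matches \\w: %d\n", has_word_char($char_key);
```

Under the stated assumptions the two keys compare equal but only the upgraded one is treated as a word character, which is the inconsistency the new BUGS paragraph warns about. Upgrading every key before matching, or matching under an appropriate locale, makes the results consistent.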