author: Jarkko Hietaniemi <jhi@iki.fi> 2002-03-20 00:55:54 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2002-03-20 00:55:54 +0000
commit: b310b0538cc1a7948587a9e5ff30683fec2a3ece
tree: f3f8fa0dd8ad9ba1aecae60fa5a8079b3fb69237 /pod/perlunicode.pod
parent: 563aca73f1f51dd4b10089896b544a6ae41316cb
If it looks like UTF-8 (either nl_langinfo or locale variables),
think UTF-8, embrace your inner UTF-8, as suggested by Larry.
(And as suggested by Markus Kuhn.) While we are at it, document
also the case of mixed hash keys as a known potential troublemaker.
(Since it's locale-related, sometimes.)

p4raw-id: //depot/perl@15350
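
The "mixed hash keys" problem named in this commit message can be reproduced with a short script. This is only a sketch under stated assumptions: the hash `%seen`, its two keys, and the helper `count_word_chars` are made up for illustration, and the script is assumed to run under Perl's default regex rules, with no `use locale`, no `use feature 'unicode_strings'`, and no regex semantics modifiers.

```perl
use strict;
use warnings;

# Hypothetical data: one "byte" key and one "character" key in the same hash.
my %seen = (
    "caf\xE9"          => 1,   # byte string (not UTF-8 encoded internally)
    "na\xEFve\x{263A}" => 1,   # has a code point above 255, so character encoded
);

# Count the characters of $s that match \w under the default rules.
sub count_word_chars {
    my ($s) = @_;
    my $n = () = $s =~ /\w/g;
    return $n;
}

my (%before, %after);
for my $key (keys %seen) {
    my $label = utf8::is_utf8($key) ? 'character' : 'byte';
    $before{$label} = count_word_chars($key);

    # Work on a copy: same characters, but forced to be character encoded.
    my $copy = $key;
    utf8::upgrade($copy);
    $after{$label} = count_word_chars($copy);
}

printf "%-9s before=%d after=%d\n", $_, $before{$_}, $after{$_}
    for sort keys %before;
```

Under those assumptions the byte key reports 3 word characters before the upgrade and 4 after it, because `\xE9` only counts as `\w` once the string is character encoded. The character key reports 5 both times.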
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- pod/perlunicode.pod 34
1 files changed, 33 insertions, 1 deletions
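
The locale rule that this patch adds to perlunicode.pod (the environment variables LANGUAGE, LC_ALL, LC_CTYPE, LANG containing 'UTF-8' or 'UTF8', matched case-insensitively) can be written as a small predicate. This is a sketch of the rule as worded in the patch, not perl's actual implementation. The function name `locale_looks_utf8` is hypothetical, and it takes the environment as a hash reference so it can be tried without changing the real `%ENV`.

```perl
use strict;
use warnings;

# Return 1 if any of the locale variables named in the patch contains
# "UTF-8" or "UTF8" (case-insensitive), otherwise 0.
sub locale_looks_utf8 {
    my ($env) = @_;
    for my $var (qw(LANGUAGE LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$var};
        next unless defined $value;
        return 1 if $value =~ /utf-?8/i;
    }
    return 0;
}

print "current environment: ",
    (locale_looks_utf8(\%ENV) ? "looks like UTF-8" : "does not look like UTF-8"),
    "\n";
```

Passing `{ LANG => 'en_US.UTF-8' }` gives 1 and `{ LANG => 'C' }` gives 0. The commit message also mentions nl_langinfo as a source of the same hint, which this sketch does not cover.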
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 518d239dd6..34e00c8076 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -828,6 +828,29 @@ are specifically discussed. There is no C<utfebcdic> pragma or
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.
+=head2 Locales
+
+Usually locale settings and Unicode do not affect each other, but
+there are a couple of exceptions:
+
+=over 4
+
+=item *
+
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8.
+
+=item *
+
+Perl tries really hard to work both with Unicode and the old byte
+oriented world: most often this is nice, but sometimes this causes
+problems. See L</BUGS> for an example of how using locales
+with Unicode can sometimes be a good thing.
+
+=back
+
=head2 Using Unicode in XS
If you want to handle Perl Unicode in XS extensions, you may find
@@ -936,7 +959,16 @@ Use of locales with Unicode data may lead to odd results. Currently
there is some attempt to apply 8-bit locale info to characters in the
range 0..255, but this is demonstrably incorrect for locales that use
characters above that range when mapped into Unicode. It will also
-tend to run slower. Avoidance of locales is strongly encouraged.
+tend to run slower. Avoidance of locales is strongly encouraged,
+with one known exception; see the next paragraph.
+
+If the keys of a hash are "mixed", that is, some keys are Unicode,
+while some keys are "byte", the keys may behave differently in regular
+expressions since the definition of character classes like C</\w/>
+is different for byte strings and character strings. This problem can
+sometimes be helped by using an appropriate locale (see L<perllocale>).
+Another way is to force all the strings to be character encoded by
+using utf8::upgrade() (see L<utf8>).
Some functions are slower when working on UTF-8 encoded strings than
on byte encoded strings. All functions that need to hop over