If it looks like UTF-8 (either nl_langinfo or locale variables),

think UTF-8, embrace your inner UTF-8, as suggested by Larry. (And as suggested by Markus Kuhn.) While we are at it, document also the case of mixed hash keys as a known potential troublemaker. (Since it's locale-related, sometimes.)

p4raw-id: //depot/perl@15350
author: Jarkko Hietaniemi <jhi@iki.fi> 2002-03-20 00:55:54 +0000
committer: Jarkko Hietaniemi <jhi@iki.fi> 2002-03-20 00:55:54 +0000
commit: b310b0538cc1a7948587a9e5ff30683fec2a3ece (patch)
tree: f3f8fa0dd8ad9ba1aecae60fa5a8079b3fb69237 /pod/perlunicode.pod
parent: 563aca73f1f51dd4b10089896b544a6ae41316cb (diff)
download: perl-b310b0538cc1a7948587a9e5ff30683fec2a3ece.tar.gz
1 files changed, 33 insertions, 1 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 518d239dd6..34e00c8076 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -828,6 +828,29 @@ are specifically discussed. There is no C<utfebcdic> pragma or
 the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
 for more discussion of the issues.
 
+=head2 Locales
+
+Usually locale settings and Unicode do not affect each other, but
+there are a couple of exceptions:
+
+=over 4
+
+=item *
+
+If your locale environment variables (LANGUAGE, LC_ALL, LC_CTYPE, LANG)
+contain the strings 'UTF-8' or 'UTF8' (case-insensitive matching),
+the default encoding of your STDIN, STDOUT, and STDERR, and of
+B<any subsequent file open>, is UTF-8.
+
+=item *
+
+Perl tries really hard to work both with Unicode and the old byte
+oriented world: most often this is nice, but sometimes this causes
+problems.  See L</BUGS> for an example of how using locales
+with Unicode can sometimes be a good thing.
+
+=back
+
 =head2 Using Unicode in XS
 
 If you want to handle Perl Unicode in XS extensions, you may find
@@ -936,7 +959,16 @@ Use of locales with Unicode data may lead to odd results.  Currently
 there is some attempt to apply 8-bit locale info to characters in the
 range 0..255, but this is demonstrably incorrect for locales that use
 characters above that range when mapped into Unicode.  It will also
-tend to run slower.  Avoidance of locales is strongly encouraged.
+tend to run slower.  Avoidance of locales is strongly encouraged,
+with one known exception; see the next paragraph.
+
+If the keys of a hash are "mixed", that is, some keys are Unicode,
+while some keys are "byte", the keys may behave differently in regular
+expressions since the definition of character classes like C</\w/>
+is different for byte strings and character strings.  This problem can
+sometimes be helped by using an appropriate locale (see L<perllocale>).
+Another way is to force all the strings to be character encoded by
+using utf8::upgrade() (see L<utf8>).
 
 Some functions are slower when working on UTF-8 encoded strings than
 on byte encoded strings. All functions that need to hop over
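
The locale rule documented in the new "Locales" section (any of LANGUAGE, LC_ALL, LC_CTYPE, LANG containing 'UTF-8' or 'UTF8', matched case-insensitively) can be sketched in plain Perl as below. This is only an illustration of the documented rule, not the code perl itself runs; the helper name looks_like_utf8_locale and the idea of passing the environment in as a hash reference are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if any of the locale variables named in
# the documentation contains "UTF-8" or "UTF8" (case-insensitive), else 0.
# Takes a hash reference so it can be exercised without touching %ENV.
sub looks_like_utf8_locale {
    my ($env) = @_;
    for my $name (qw(LANGUAGE LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        next unless defined $value;
        return 1 if $value =~ /UTF-?8/i;
    }
    return 0;
}

# Check the environment of the current process.
print looks_like_utf8_locale(\%ENV)
    ? "locale variables look like UTF-8\n"
    : "locale variables do not look like UTF-8\n";
```

Passing { LANG => 'en_US.UTF-8' } or { LC_ALL => 'fi_FI.utf8' } to the helper yields 1, while { LANG => 'C' } yields 0.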
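
The "mixed" hash key problem described in the BUGS hunk can be reproduced with a small script. The sketch below assumes a perl running without "use locale" and without the unicode_strings feature; the hash contents and the helper name count_word_keys are made up for the illustration.

```perl
use strict;
use warnings;

# A hash with "mixed" keys: one byte string, one character string.
my %h = (
    "\xE9"    => 'byte key',        # U+00E9 stored as a single byte
    "\x{100}" => 'character key',   # U+0100 forces character (UTF-8) storage
);

# Count the keys that consist of exactly one \w character,
# optionally upgrading a copy of each key first.
sub count_word_keys {
    my ($upgrade) = @_;
    my $n = 0;
    for my $key (keys %h) {
        my $copy = $key;
        utf8::upgrade($copy) if $upgrade;   # force character semantics
        $n++ if $copy =~ /\A\w\z/;
    }
    return $n;
}

my $as_stored = count_word_keys(0);   # byte key fails \w
my $upgraded  = count_word_keys(1);   # both keys match \w
print "as stored: $as_stored, upgraded: $upgraded\n";
```

Under those assumptions the byte key "\xE9" does not match \w until it has been upgraded, so the two counts differ (1 versus 2), which is the inconsistency the patch warns about.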