author | Karl Williamson <public@khwilliamson.com> | 2010-12-01 16:33:54 -0700 |
---|---|---|
committer | Father Chrysostomos <sprout@cpan.org> | 2010-12-01 18:23:45 -0800 |
commit | 20db750130061015fab1ffed94ff374c2bd38af3 (patch) | |
tree | f5852919978cf5cb1d80e098e92a3eef5972abf1 /pod/perlunifaq.pod | |
parent | 4ee7c0eabacb52cfaad975a33feeb842bbf347b3 (diff) | |
download | perl-20db750130061015fab1ffed94ff374c2bd38af3.tar.gz |
Document Unicode doc fix
Diffstat (limited to 'pod/perlunifaq.pod')
-rw-r--r-- | pod/perlunifaq.pod | 42 |
1 files changed, 21 insertions, 21 deletions
diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
index 877e4d15e6..9fd2b38056 100644
--- a/pod/perlunifaq.pod
+++ b/pod/perlunifaq.pod
@@ -138,27 +138,27 @@ concern, and you can just C<eval> dumped data as always.
 
 =head2 Why do some characters not uppercase or lowercase correctly?
 
-It seemed like a good idea at the time, to keep the semantics the same for
-standard strings, when Perl got Unicode support. The plan is to fix this
-in the future, and the casing component has in fact mostly been fixed, but we
-have to deal with the fact that Perl treats equal strings differently,
-depending on the internal state.
-
-First the casing. Just put a C<use feature 'unicode_strings'> near the
-beginning of your program. Within its lexical scope, C<uc>, C<lc>, C<ucfirst>,
-C<lcfirst>, and the regular expression escapes C<\U>, C<\L>, C<\u>, C<\l> use
-Unicode semantics for changing case regardless of whether the UTF8 flag is on
-or not. However, if you pass strings to subroutines in modules outside the
-pragma's scope, they currently likely won't behave this way, and you have to
-try one of the solutions below. There is another exception as well: if you
-have furnished your own casing functions to override the default, these will
-not be called unless the UTF8 flag is on)
-
-This remains a problem for the regular expression constructs
-C</.../i>, C<(?i:...)>, and C</[[:posix:]]/>.
-
-To force Unicode semantics, you can upgrade the internal representation to
-by doing C<utf8::upgrade($string)>. This can be used
+Starting in Perl 5.14 (and partially in Perl 5.12), just put a
+C<use feature 'unicode_strings'> near the beginning of your program.
+Within its lexical scope you shouldn't have this problem. It also is
+automatically enabled under C<use feature ':5.12'> or using C<-E> on the
+command line for Perl 5.12 or higher.
+
+The rationale for requiring this is to not break older programs that
+rely on the way things worked before Unicode came along. Those older
+programs knew only about the ASCII character set, and so may not work
+properly for additional characters. When a string is encoded in UTF-8,
+Perl assumes that the program is prepared to deal with Unicode, but when
+the string isn't, Perl assumes that only ASCII (unless it is an EBCDIC
+platform) is wanted, and so those characters that are not ASCII
+characters aren't recognized as to what they would be in Unicode.
+C<use feature 'unicode_strings'> tells Perl to treat all characters as
+Unicode, whether the string is encoded in UTF-8 or not, thus avoiding
+the problem.
+
+However, on earlier Perls, or if you pass strings to subroutines outside
+the feature's scope, you can force Unicode semantics by changing the
+encoding to UTF-8 by doing C<utf8::upgrade($string)>. This can be used
 safely on any string, as it checks and does not change strings that have
 already been upgraded.
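
The behaviour described by the new text can be tried with a short script. This is a minimal sketch and not part of the commit. It assumes an ASCII (non-EBCDIC) platform, Perl 5.12 or later, and no C<use locale> in effect. The variable names are illustrative only.

```perl
use strict;
use warnings;

# "\xE9" is LATIN SMALL LETTER E WITH ACUTE. Written this way it is stored
# as a single native byte, so the string is not UTF-8 encoded internally.
my $native = "\xE9";

# Outside the feature's scope, uc() applies ASCII-only rules to such a
# string, so the character is left unchanged.
my $plain = uc $native;

# Workaround from the patch text: upgrade the internal encoding to UTF-8
# first. utf8::upgrade() is always available; no "use utf8" is needed.
my $upgraded = $native;
utf8::upgrade($upgraded);
my $forced = uc $upgraded;

my $scoped;
{
    # Lexically scoped: only code inside this block gets Unicode semantics.
    use feature 'unicode_strings';
    $scoped = uc $native;
}

printf "plain:  U+%04X\n", ord $plain;    # prints "plain:  U+00E9"
printf "forced: U+%04X\n", ord $forced;   # prints "forced: U+00C9"
printf "scoped: U+%04X\n", ord $scoped;   # prints "scoped: U+00C9"
```

The upgraded copy and the original still compare equal with C<eq>. Only the internal encoding differs, which is why the same string can uppercase differently depending on its internal state.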