author | Karl Williamson <public@khwilliamson.com> | 2010-12-01 16:33:54 -0700 |
---|---|---|
committer | Father Chrysostomos <sprout@cpan.org> | 2010-12-01 18:23:45 -0800 |
commit | 20db750130061015fab1ffed94ff374c2bd38af3 (patch) | |
tree | f5852919978cf5cb1d80e098e92a3eef5972abf1 | |
parent | 4ee7c0eabacb52cfaad975a33feeb842bbf347b3 (diff) | |
download | perl-20db750130061015fab1ffed94ff374c2bd38af3.tar.gz |
Document Unicode doc fix
-rw-r--r-- | lib/feature.pm | 21 |
-rw-r--r-- | pod/perldelta.pod | 33 |
-rw-r--r-- | pod/perlre.pod | 44 |
-rw-r--r-- | pod/perlunicode.pod | 57 |
-rw-r--r-- | pod/perlunifaq.pod | 42 |
5 files changed, 105 insertions, 92 deletions
diff --git a/lib/feature.pm b/lib/feature.pm
index f8a9078234..c70010df6e 100644
--- a/lib/feature.pm
+++ b/lib/feature.pm
@@ -105,11 +105,22 @@ See L<perlsub/"Persistent Private Variables"> for details.
 
 =head2 the 'unicode_strings' feature
 
-C<use feature 'unicode_strings'> tells the compiler to treat
-all strings outside of C<use locale> and C<use bytes> as Unicode. It is
-available starting with Perl 5.11.3, but is not fully implemented.
-
-See L<perlunicode/The "Unicode Bug"> for details.
+C<use feature 'unicode_strings'> tells the compiler to use Unicode semantics
+in all string operations executed within its scope (unless they are also
+within the scope of either C<use locale> or C<use bytes>). The same applies
+to all regular expressions compiled within the scope, even if executed outside
+it.
+
+C<no feature 'unicode_strings'> tells the compiler to use the traditional
+Perl semantics wherein the native character set semantics is used unless it is
+clear to Perl that Unicode is desired. This can lead to some surprises
+when the behavior suddenly changes. (See
+L<perlunicode/The "Unicode Bug"> for details.) For this reason, if you are
+potentially using Unicode in your program, the
+C<use feature 'unicode_strings'> subpragma is B<strongly> recommended.
+
+This subpragma is available starting with Perl 5.11.3, but was not fully
+implemented until 5.13.8.
 
 =head1 FEATURE BUNDLES
 
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index cfeff1f352..b7d710bdcc 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -2,7 +2,6 @@
 
 =for comment
 This has been completed up to 779bcb7d, except for:
-1b9f127-fad448f (Karl Williamson says he will do this)
 ad9e76a8629ed1ac483f0a7ed0e4da40ac5a1a00
 d9a4b459f94297889956ac3adc42707365f274c2
 
@@ -81,6 +80,18 @@ method support still works as expected:
     open my $fh, ">", $file;
     $fh->autoflush(1); # IO::File not loaded
 
+=head2 Full functionality for C<use feature 'unicode_strings'>
+
+This release provides full functionality for C<use feature
+'unicode_strings'>. Under its scope, all string operations executed and
+regular expressions compiled (even if executed outside its scope) have
+Unicode semantics. See L<feature>.
+
+This feature avoids the "Unicode Bug" (See
+L<perlunicode/The "Unicode Bug"> for details.) If their is a
+possibility that your code will process Unicode strings, you are
+B<strongly> encouraged to use this subpragma to avoid nasty surprises.
+
 =head1 Security
 
 XXX Any security-related notices go here. In particular, any security
@@ -492,12 +503,6 @@ L<[perl #79178]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=79178>.
 
 =item *
 
-A number of bugs with regular expression bracketed character classes
-have been fixed, mostly having to do with matching characters in the
-non-ASCII Latin-1 range.
-
-=item *
-
 A closure containing an C<if> statement followed by a constant or variable
 is no longer treated as a constant
 L<[perl #63540]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=63540>.
@@ -514,6 +519,20 @@ A regular expression optimisation would sometimes cause a match with a
 C<{n,m}> quantifier to fail when it should match
 L<[perl #79152]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=79152>.
 
+=item *
+
+What has become known as the "Unicode Bug" is resolved in this release.
+Under C<use feature 'unicode_strings'>, the internal storage format of a
+string no longer affects the external semantics. There are two known
+exceptions. User-defined case changing functions, which are planned to
+be deprecated in 5.14, require utf8-encoded strings to function; and the
+character C<LATIN SMALL LETTER SHARP S> in regular expression
+case-insensitive matching has a somewhat different set of bugs depending
+on the internal storage format. Case-insensitive matching of all
+characters that have multi-character matches, as this one does, is
+problematical in Perl.
+L<[perl #58182]|http://rt.perl.org/rt3/Public/Bug/Display.html?id=58182>.
+
 =back
 
 =head1 Known Problems
diff --git a/pod/perlre.pod b/pod/perlre.pod
index acc1ad57a7..f415a16ffd 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -646,31 +646,37 @@ locale, and can differ from one match to another if there is an intervening
 call of the
 L<setlocale() function|perllocale/The setlocale function>.
 This modifier is automatically set if the regular expression is compiled
-within the scope of a C<"use locale"> pragma.
+within the scope of a C<"use locale"> pragma. Results are not
+well-defined when using this and matching against a utf8-encoded string.
 
 C<"u"> means to use Unicode semantics when pattern matching. It is
-automatically set if the regular expression is compiled within the scope
-of a L<C<"use feature 'unicode_strings">|feature> pragma (and isn't
-also in the scope of L<C<"use locale">|locale> nor
-L<C<"use bytes">|bytes> pragmas. It is not fully implemented at the
-time of this writing, but work is being done to complete the job. On
-EBCDIC platforms this currently has no effect, but on ASCII platforms,
-it effectively turns them into Latin-1 platforms. That is, the ASCII
-characters remain as ASCII characters (since ASCII is a subset of
-Latin-1), but the non-ASCII code points are treated as Latin-1
-characters. Right now, this only applies to the C<"\b">, C<"\s">, and
-C<"\w"> pattern matching operators, plus their complements. For
-example, when this option is not on, C<"\w"> matches precisely
-C<[A-Za-z0-9_]> (on a non-utf8 string). When the option is on, it
-matches not just those, but all the Latin-1 word characters (such as an
-"n" with a tilde). It thus matches exactly the same set of code points
-from 0 to 255 as it would if the string were encoded in utf8.
+automatically set if the regular expression is encoded in utf8, or is
+compiled within the scope of a
+L<C<"use feature 'unicode_strings">|feature> pragma (and isn't also in
+the scope of L<C<"use locale">|locale> nor L<C<"use bytes">|bytes>
+pragmas. On ASCII platforms, the code points between 128 and 255 take on their
+Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
+in strict ASCII their meanings are undefined. Thus the platform
+effectively becomes a Unicode platform. The ASCII characters remain as
+ASCII characters (since ASCII is a subset of Latin-1 and Unicode). For
+example, when this option is not on, on a non-utf8 string, C<"\w">
+matches precisely C<[A-Za-z0-9_]>. When the option is on, it matches
+not just those, but all the Latin-1 word characters (such as an "n" with
+a tilde). On EBCDIC platforms, which already are equivalent to Latin-1,
+this modifier changes behavior only when the C<"/i"> modifier is also
+specified, and affects only two characters, giving them full Unicode
+semantics: the C<MICRO SIGN> will match the Greek capital and
+small letters C<MU>; otherwise not; and the C<LATIN CAPITAL LETTER SHARP
+S> will match any of C<SS>, C<Ss>, C<sS>, and C<ss>, otherwise not.
+(This last case is buggy, however.)
 
 C<"d"> means to use the traditional Perl pattern matching behavior.
 This is dualistic (hence the name C<"d">, which also could stand for
-"default"). When this is in effect, Perl matches utf8-encoded strings
+"depends"). When this is in effect, Perl matches utf8-encoded strings
 using Unicode rules, and matches non-utf8-encoded strings using the
-platform's native character set rules.
+platform's native character set rules. (If the regular expression
+itself is encoded in utf8, Unicode rules are used regardless of the
+target string's encoding.)
 See L<perlunicode/The "Unicode Bug">. It is automatically selected by
 default if the regular expression is compiled neither within the scope
 of a C<"use locale"> pragma nor a <C<"use feature 'unicode_strings">
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 20acb55114..925ae36da2 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -1450,7 +1450,8 @@ The term, the "Unicode bug" has been applied to an inconsistency with the
 Unicode characters whose ordinals are in the Latin-1 Supplement block, that
 is, between 128 and 255. Without a locale specified, unlike all other
 characters or code points, these characters have very different semantics in
-byte semantics versus character semantics.
+byte semantics versus character semantics, unless
+C<use feature 'unicode_strings'> is specified.
 
 In character semantics they are interpreted as Unicode code points, which means
 they have the same semantics as Latin-1 (ISO-8859-1).
@@ -1514,45 +1515,21 @@ ASCII range (except in a locale), along with Perl's desire to add Unicode
 support seamlessly. The result wasn't seamless: these characters were
 orphaned.
 
-Work is being done to correct this, but only some of it is complete.
-What has been finished is:
-
-=over
-
-=item *
-
-the matching of C<\b>, C<\s>, C<\w> and the Posix
-character classes and their complements in regular expressions
-
-=item *
-
-case changing (but not user-defined casing)
-
-=item *
-
-case-insensitive (C</i>) regular expression matching for [bracketed
-character classes] only, except for some bugs with C<LATIN SMALL
-LETTER SHARP S> (which is supposed to match the two character sequence
-"ss" (or "Ss" or "sS" or "SS"), but Perl has a number of bugs for all
-such multi-character case insensitive characters, of which this is just
-one example.
-
-=back
-
-Due to concerns, and some evidence, that older code might
-have come to rely on the existing behavior, the new behavior must be explicitly
-enabled by the feature C<unicode_strings> in the L<feature> pragma, even though
-no new syntax is involved.
-
-See L<perlfunc/lc> for details on how this pragma works in combination with
-various others for casing.
-
-Even though the implementation is incomplete, it is planned to have this
-pragma affect all the problematic behaviors in later releases: you can't
-have one without them all.
-
-In the meantime, a workaround is to always call utf8::upgrade($string), or to
-use the standard module L<Encode>. Also, a scalar that has any characters
+Starting in Perl 5.14, C<use feature 'unicode_strings'> can be used to
+cause Perl to use Unicode semantics on all string operations within the
+scope of the feature subpragma. Regular expressions compiled in its
+scope retain that behavior even when executed or compiled into larger
+regular expressions outside the scope. (The pragma does not, however,
+affect user-defined case changing operations. These still require a
+UTF-8 encoded string to operate.)
+
+In Perl 5.12, the subpragma affected casing changes, but not regular
+expressions. See L<perlfunc/lc> for details on how this pragma works in
+combination with various others for casing.
+
+For earlier Perls, or when a string is passed to a function outside the
+subpragma's scope, a workaround is to always call C<utf8::upgrade($string)>,
+or to use the standard module L<Encode>. Also, a scalar that has any characters
 whose ordinal is above 0x100, or which were specified using either of the
 C<\N{...}> notations will automatically have character semantics.
 
diff --git a/pod/perlunifaq.pod b/pod/perlunifaq.pod
index 877e4d15e6..9fd2b38056 100644
--- a/pod/perlunifaq.pod
+++ b/pod/perlunifaq.pod
@@ -138,27 +138,27 @@ concern, and you can just C<eval> dumped data as always.
 
 =head2 Why do some characters not uppercase or lowercase correctly?
 
-It seemed like a good idea at the time, to keep the semantics the same for
-standard strings, when Perl got Unicode support. The plan is to fix this
-in the future, and the casing component has in fact mostly been fixed, but we
-have to deal with the fact that Perl treats equal strings differently,
-depending on the internal state.
-
-First the casing. Just put a C<use feature 'unicode_strings'> near the
-beginning of your program. Within its lexical scope, C<uc>, C<lc>, C<ucfirst>,
-C<lcfirst>, and the regular expression escapes C<\U>, C<\L>, C<\u>, C<\l> use
-Unicode semantics for changing case regardless of whether the UTF8 flag is on
-or not. However, if you pass strings to subroutines in modules outside the
-pragma's scope, they currently likely won't behave this way, and you have to
-try one of the solutions below. There is another exception as well: if you
-have furnished your own casing functions to override the default, these will
-not be called unless the UTF8 flag is on)
-
-This remains a problem for the regular expression constructs
-C</.../i>, C<(?i:...)>, and C</[[:posix:]]/>.
-
-To force Unicode semantics, you can upgrade the internal representation to
-by doing C<utf8::upgrade($string)>. This can be used
+Starting in Perl 5.14 (and partially in Perl 5.12), just put a
+C<use feature 'unicode_strings'> near the beginning of your program.
+Within its lexical scope you shouldn't have this problem. It also is
+automatically enabled under C<use feature ':5.12'> or using C<-E> on the
+command line for Perl 5.12 or higher.
+
+The rationale for requiring this is to not break older programs that
+rely on the way things worked before Unicode came along. Those older
+programs knew only about the ASCII character set, and so may not work
+properly for additional characters. When a string is encoded in UTF-8,
+Perl assumes that the program is prepared to deal with Unicode, but when
+the string isn't, Perl assumes that only ASCII (unless it is an EBCDIC
+platform) is wanted, and so those characters that are not ASCII
+characters aren't recognized as to what they would be in Unicode.
+C<use feature 'unicode_strings'> tells Perl to treat all characters as
+Unicode, whether the string is encoded in UTF-8 or not, thus avoiding
+the problem.
+
+However, on earlier Perls, or if you pass strings to subroutines outside
+the feature's scope, you can force Unicode semantics by changing the
+encoding to UTF-8 by doing C<utf8::upgrade($string)>. This can be used
 safely on any string, as it checks and does not change strings that have
 already been upgraded.
 
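
The behavior these documentation changes describe can be sketched in a few lines of Perl. This is an illustration only and is not part of the commit: the variable names are made up for the example, and it assumes Perl 5.14 or later on an ASCII (non-EBCDIC) platform with no locale pragma in effect. It compares `\w` matching and `uc` on the Latin-1 character E-acute (code point 0xE9) under the traditional semantics, under `use feature 'unicode_strings'`, and after the `utf8::upgrade` workaround.

```perl
use strict;
use warnings;

# "\xE9" is LATIN SMALL LETTER E WITH ACUTE, held here as a native,
# non-UTF-8-encoded one-character string (no "use utf8", no \N{...}).
my $e_acute = "\xE9";

# Traditional semantics: outside the feature's scope, a non-UTF-8 string
# gets native (ASCII-only) rules for \w and for case changing.
my $is_word_default = $e_acute =~ /^\w$/ ? 1 : 0;
my $uc_default      = uc $e_acute;

# Unicode semantics: inside the feature's lexical scope, the same string
# is treated as Unicode regardless of its internal encoding.
my ($is_word_unicode, $uc_unicode);
{
    use feature 'unicode_strings';
    $is_word_unicode = $e_acute =~ /^\w$/ ? 1 : 0;
    $uc_unicode      = uc $e_acute;
}

# Workaround outside the feature's scope: upgrade a copy of the string
# to the UTF-8 internal encoding, which also selects Unicode rules.
my $upgraded = $e_acute;
utf8::upgrade($upgraded);
my $is_word_upgraded = $upgraded =~ /^\w$/ ? 1 : 0;

printf "default:  \\w=%d uc=0x%02X\n", $is_word_default, ord $uc_default;
printf "feature:  \\w=%d uc=0x%02X\n", $is_word_unicode, ord $uc_unicode;
printf "upgraded: \\w=%d\n", $is_word_upgraded;
# default:  \w=0 uc=0xE9
# feature:  \w=1 uc=0xC9
# upgraded: \w=1
```

The code points are printed with `ord` rather than as characters so the output does not depend on the terminal's encoding or on any I/O layers.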