author | Rafael Garcia-Suarez <rgs@consttype.org> | 2009-12-20 16:23:36 +0100 |
---|---|---|
committer | Rafael Garcia-Suarez <rgs@consttype.org> | 2009-12-20 16:28:36 +0100 |
commit | 1863b87966ed39b042c45e12d1b4e0b90b9cc071 (patch) | |
tree | eae5c03c697269b036352d4b007f9c1294f189c9 /lib/feature.pm | |
parent | 1d5fe431325abdb0f3947d563ebdef67bd4cb7cd (diff) | |
Introduce C<use feature "unicode_strings">

This turns on Unicode semantics for uc/lc/ucfirst/lcfirst
operations on strings that do not have the UTF8 bit set but contain
characters with code points above 127. This replaces the "legacy"
pragma experiment.

Note that this feature currently sets both a bit in $^H and
an (unused) key in %^H. The bit in $^H could be replaced by
a flag on the uc/lc/etc op. It's probably not feasible to
test a key in %^H in pp_uc and friends each time we want to
know which semantics to apply.
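
The effect described in the commit message can be seen with a one-character string. The following is a minimal sketch, assuming an ASCII platform, no C<use locale>, and a perl recent enough to ship the feature; the variable names are illustrative and not part of the patch.

```perl
use strict;
use warnings;

# U+00E0 (a with grave), stored as a single byte without the UTF8 flag.
my $s = "\xe0";

# Outside the feature's lexical scope the character is treated as caseless,
# so uc() returns it unchanged.
my $legacy = uc $s;

my $unicode;
{
    use feature 'unicode_strings';
    # Inside the scope Unicode casing applies, giving U+00C0.
    $unicode = uc $s;
}

printf "legacy:  U+%04X\n", ord $legacy;    # U+00E0
printf "unicode: U+%04X\n", ord $unicode;   # U+00C0
```

Because the feature is lexically scoped, only the C<uc> call inside the block changes behavior; the same string uppercased outside the block keeps its legacy result.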
Diffstat (limited to 'lib/feature.pm')
-rw-r--r-- | lib/feature.pm | 98 |
1 files changed, 90 insertions, 8 deletions
diff --git a/lib/feature.pm b/lib/feature.pm
index 915b5c75b7..649ccb3e5c 100644
--- a/lib/feature.pm
+++ b/lib/feature.pm
@@ -1,19 +1,24 @@
 package feature;
 
-our $VERSION = '1.13';
+our $VERSION = '1.14';
 
 # (feature name) => (internal name, used in %^H)
 my %feature = (
-    switch => 'feature_switch',
-    say => "feature_say",
-    state => "feature_state",
+    switch => 'feature_switch',
+    say => "feature_say",
+    state => "feature_state",
+    unicode_strings => "feature_unicode",
 );
 
+# This gets set (for now) in $^H as well as in %^H,
+# for runtime speed of the uc/lc/ucfirst/lcfirst functions.
+our $hint_uni8bit = 0x00000800;
+
 # NB. the latest bundle must be loaded by the -E switch (see toke.c)
 
 my %feature_bundle = (
     "5.10" => [qw(switch say state)],
-    "5.11" => [qw(switch say state)],
+    "5.11" => [qw(switch say state unicode_strings)],
 );
 
 # special case
@@ -43,9 +48,9 @@ feature - Perl pragma to enable new syntactic features
 
 It is usually impossible to add new syntax to Perl without breaking
 some existing programs. This pragma provides a way to minimize that
-risk. New syntactic constructs can be enabled by C<use feature 'foo'>,
-and will be parsed only when the appropriate feature pragma is in
-scope.
+risk. New syntactic constructs, or new semantic meanings to older
+constructs, can be enabled by C<use feature 'foo'>, and will be parsed
+only when the appropriate feature pragma is in scope.
 
 =head2 Lexical effect
 
@@ -95,6 +100,80 @@ variables.
 
 See L<perlsub/"Persistent Private Variables"> for details.
 
+=head2 the 'unicode_strings' feature
+
+C<use feature 'unicode_strings'> tells the compiler to treat
+strings with codepoints larger than 128 as Unicode. It is available
+starting with Perl 5.11.3.
+
+In greater detail:
+
+This feature modifies the semantics for the 128 characters on ASCII
+systems that have the 8th bit set. (See L</EBCDIC platforms> below for
+EBCDIC systems.) By default, unless C<S<use locale>> is specified, or the
+scalar containing such a character is known by Perl to be encoded in UTF8,
+the semantics are essentially that the characters have an ordinal number,
+and that's it. They are caseless, and aren't anything: they're not
+controls, not letters, not punctuation, ..., not anything.
+
+This behavior stems from when Perl did not support Unicode, and ASCII was the
+only known character set outside of C<S<use locale>>. In order to not
+possibly break pre-Unicode programs, these characters have retained their old
+non-meanings, except when it is clear to Perl that Unicode is what is meant,
+for example by calling utf8::upgrade() on a scalar, or if the scalar also
+contains characters that are only available in Unicode. Then these 128
+characters take on their Unicode meanings.
+
+The problem with this behavior is that a scalar that encodes these characters
+has a different meaning depending on if it is stored as utf8 or not.
+In general, the internal storage method should not affect the
+external behavior.
+
+The behavior is known to have effects on these areas:
+
+=over 4
+
+=item *
+
+Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
+and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
+substitutions.
+
+=item *
+
+Using caseless (C</i>) regular expression matching
+
+=item *
+
+Matching a number of properties in regular expressions, such as C<\w>
+
+=item *
+
+User-defined case change mappings. You can create a C<ToUpper()> function, for
+example, which overrides Perl's built-in case mappings. The scalar must be
+encoded in utf8 for your function to actually be invoked.
+
+=back
+
+B<This lack of semantics for these characters is currently the default,>
+outside of C<use locale>. See below for EBCDIC.
+
+To turn on B<case changing semantics only> for these characters, use
+C<use feature "unicode_strings">.
+
+The other old (legacy) behaviors regarding these characters are currently
+unaffected by this pragma.
+
+=head4 EBCDIC platforms
+
+On EBCDIC platforms, the situation is somewhat different. The legacy
+semantics are whatever the underlying semantics of the native C language
+library are. Each of the three EBCDIC encodings currently known by Perl is an
+isomorph of the Latin-1 character set. That means every character in Latin-1
+has a corresponding EBCDIC equivalent, and vice-versa. Specifying C<S<no
+legacy>> currently makes sure that all EBCDIC characters have the same
+B<casing only> semantics as their corresponding Latin-1 characters.
+
 =head1 FEATURE BUNDLES
 
 It's possible to load a whole slew of features in one go, using
@@ -164,6 +243,7 @@ sub import {
             unknown_feature($name);
         }
         $^H{$feature{$name}} = 1;
+        $^H |= $hint_uni8bit if $name eq 'unicode_strings';
     }
 }
 
@@ -173,6 +253,7 @@ sub unimport {
     # A bare C<no feature> should disable *all* features
     if (!@_) {
         delete @^H{ values(%feature) };
+        $^H &= ~ $hint_uni8bit;
         return;
     }
 
@@ -194,6 +275,7 @@ sub unimport {
         }
         else {
             delete $^H{$feature{$name}};
+            $^H &= ~ $hint_uni8bit if $name eq 'unicode_strings';
         }
     }
 }
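
The hint-bit bookkeeping added to C<import> and C<unimport> is plain bit masking. The sketch below mirrors it on an ordinary scalar rather than on the real C<$^H>, so it is safe to run on any perl. The constant reuses the value of C<$hint_uni8bit> from the patch; C<$hints>, C<enable> and C<disable> are hypothetical names introduced only for illustration.

```perl
use strict;
use warnings;

# Same value as $hint_uni8bit in the patch.
my $HINT_UNI8BIT = 0x00000800;

# Stand-in for $^H with one unrelated bit already set (hypothetical state).
my $hints = 0x00000100;

# What import() does for 'unicode_strings': set the bit, keep the rest.
sub enable  { my ($h) = @_; return $h | $HINT_UNI8BIT }

# What unimport() does: clear only that bit, keep the rest.
sub disable { my ($h) = @_; return $h & ~$HINT_UNI8BIT }

my $on  = enable($hints);
my $off = disable($on);

printf "on:  0x%08X\n", $on;    # 0x00000900
printf "off: 0x%08X\n", $off;   # 0x00000100
```

Clearing with C<& ~mask> leaves every other hint untouched, which is why a bare C<no feature> can reset this bit without disturbing unrelated compile-time flags.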