path: root/lib/feature.pm
author: Rafael Garcia-Suarez <rgs@consttype.org> 2009-12-20 16:23:36 +0100
committer: Rafael Garcia-Suarez <rgs@consttype.org> 2009-12-20 16:28:36 +0100
commit: 1863b87966ed39b042c45e12d1b4e0b90b9cc071
tree: eae5c03c697269b036352d4b007f9c1294f189c9 /lib/feature.pm
parent: 1d5fe431325abdb0f3947d563ebdef67bd4cb7cd
Introduce C<use feature "unicode_strings">
This turns on the Unicode semantics for uc/lc/ucfirst/lcfirst operations on strings without the UTF8 bit set but with characters whose ordinals are higher than 127. This replaces the "legacy" pragma experiment.

Note that currently this feature sets both a bit in $^H and an (unused) key in %^H. The bit in $^H could be replaced by a flag on the uc/lc/etc op. It's probably not feasible to test a key in %^H in pp_uc and friends each time we want to know which semantics to apply.
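
As a rough illustration of the behavior change described above (this is not part of the patch), the sketch below upper-cases a one-byte string holding code point 0xE9 with and without the feature. It assumes an ASCII platform, no "use locale" in effect, and a perl that ships this feature (5.11.3 or later). The variable names are made up for the example.

```perl
use strict;
use warnings;

# "\xE9" is LATIN SMALL LETTER E WITH ACUTE. The literal contains no
# code point above 0xFF, so perl stores it as a single byte without
# the UTF8 flag.
my $e_acute = "\xE9";

# Legacy semantics: the character is caseless, so uc() leaves it alone.
my $legacy = do {
    no feature 'unicode_strings';
    uc $e_acute;
};

# Unicode semantics: uc() maps U+00E9 to U+00C9 even though the
# scalar is not stored as UTF-8.
my $unicode = do {
    use feature 'unicode_strings';
    uc $e_acute;
};

printf "legacy:  U+%04X\n", ord $legacy;     # U+00E9
printf "unicode: U+%04X\n", ord $unicode;    # U+00C9
```

Because the pragma is lexical, each do block sees only the setting in force where it was compiled; the same input scalar is used in both.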
Diffstat (limited to 'lib/feature.pm')
-rw-r--r-- lib/feature.pm 98
1 files changed, 90 insertions, 8 deletions
diff --git a/lib/feature.pm b/lib/feature.pm
index 915b5c75b7..649ccb3e5c 100644
--- a/lib/feature.pm
+++ b/lib/feature.pm
@@ -1,19 +1,24 @@
package feature;
-our $VERSION = '1.13';
+our $VERSION = '1.14';
# (feature name) => (internal name, used in %^H)
my %feature = (
- switch => 'feature_switch',
- say => "feature_say",
- state => "feature_state",
+ switch => 'feature_switch',
+ say => "feature_say",
+ state => "feature_state",
+ unicode_strings => "feature_unicode",
);
+# This gets set (for now) in $^H as well as in %^H,
+# for runtime speed of the uc/lc/ucfirst/lcfirst functions.
+our $hint_uni8bit = 0x00000800;
+
# NB. the latest bundle must be loaded by the -E switch (see toke.c)
my %feature_bundle = (
"5.10" => [qw(switch say state)],
- "5.11" => [qw(switch say state)],
+ "5.11" => [qw(switch say state unicode_strings)],
);
# special case
@@ -43,9 +48,9 @@ feature - Perl pragma to enable new syntactic features
It is usually impossible to add new syntax to Perl without breaking
some existing programs. This pragma provides a way to minimize that
-risk. New syntactic constructs can be enabled by C<use feature 'foo'>,
-and will be parsed only when the appropriate feature pragma is in
-scope.
+risk. New syntactic constructs, or new semantic meanings to older
+constructs, can be enabled by C<use feature 'foo'>, and will be parsed
+only when the appropriate feature pragma is in scope.
=head2 Lexical effect
@@ -95,6 +100,80 @@ variables.
See L<perlsub/"Persistent Private Variables"> for details.
+=head2 the 'unicode_strings' feature
+
+C<use feature 'unicode_strings'> tells the compiler to treat
+strings with codepoints larger than 127 as Unicode. It is available
+starting with Perl 5.11.3.
+
+In greater detail:
+
+This feature modifies the semantics for the 128 characters on ASCII
+systems that have the 8th bit set. (See L</EBCDIC platforms> below for
+EBCDIC systems.) By default, unless C<S<use locale>> is specified, or the
+scalar containing such a character is known by Perl to be encoded in UTF8,
+the semantics are essentially that the characters have an ordinal number,
+and that's it. They are caseless, and aren't anything: they're not
+controls, not letters, not punctuation, ..., not anything.
+
+This behavior stems from when Perl did not support Unicode, and ASCII was the
+only known character set outside of C<S<use locale>>. In order to not
+possibly break pre-Unicode programs, these characters have retained their old
+non-meanings, except when it is clear to Perl that Unicode is what is meant,
+for example by calling utf8::upgrade() on a scalar, or if the scalar also
+contains characters that are only available in Unicode. Then these 128
+characters take on their Unicode meanings.
+
+The problem with this behavior is that a scalar that encodes these characters
+has a different meaning depending on whether it is stored as utf8 or not.
+In general, the internal storage method should not affect the
+external behavior.
+
+The behavior is known to have effects on these areas:
+
+=over 4
+
+=item *
+
+Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
+and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in regular expression
+substitutions.
+
+=item *
+
+Using caseless (C</i>) regular expression matching
+
+=item *
+
+Matching a number of properties in regular expressions, such as C<\w>
+
+=item *
+
+User-defined case change mappings. You can create a C<ToUpper()> function, for
+example, which overrides Perl's built-in case mappings. The scalar must be
+encoded in utf8 for your function to actually be invoked.
+
+=back
+
+B<This lack of semantics for these characters is currently the default,>
+outside of C<use locale>. See below for EBCDIC.
+
+To turn on B<case changing semantics only> for these characters, use
+C<use feature "unicode_strings">.
+
+The other old (legacy) behaviors regarding these characters are currently
+unaffected by this pragma.
+
+=head4 EBCDIC platforms
+
+On EBCDIC platforms, the situation is somewhat different. The legacy
+semantics are whatever the underlying semantics of the native C language
+library are. Each of the three EBCDIC encodings currently known by Perl is an
+isomorph of the Latin-1 character set. That means every character in Latin-1
+has a corresponding EBCDIC equivalent, and vice-versa. Specifying
+C<S<use feature 'unicode_strings'>> currently makes sure that all EBCDIC
+characters have the same B<casing only> semantics as their Latin-1 counterparts.
+
=head1 FEATURE BUNDLES
It's possible to load a whole slew of features in one go, using
@@ -164,6 +243,7 @@ sub import {
unknown_feature($name);
}
$^H{$feature{$name}} = 1;
+ $^H |= $hint_uni8bit if $name eq 'unicode_strings';
}
}
@@ -173,6 +253,7 @@ sub unimport {
# A bare C<no feature> should disable *all* features
if (!@_) {
delete @^H{ values(%feature) };
+ $^H &= ~ $hint_uni8bit;
return;
}
@@ -194,6 +275,7 @@ sub unimport {
}
else {
delete $^H{$feature{$name}};
+ $^H &= ~ $hint_uni8bit if $name eq 'unicode_strings';
}
}
}
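
The $^H bookkeeping that import and unimport perform in the hunks above can be observed from user code at compile time. The following is only a sketch under stated assumptions: the constant HINT_UNI_8_BIT and the array @seen are names invented for this example, and the value 0x00000800 is copied from $hint_uni8bit in this patch, so a perl that uses a different bit would give different results.

```perl
use strict;
use warnings;

# Hypothetical name; the value mirrors $hint_uni8bit from the patch.
use constant HINT_UNI_8_BIT => 0x00000800;

# Filled in by the BEGIN blocks below while the file is being compiled.
our @seen;

{
    use feature 'unicode_strings';
    # import() has OR-ed the bit into $^H for this scope.
    BEGIN { push @seen, ($^H & HINT_UNI_8_BIT) ? 1 : 0 }

    {
        no feature 'unicode_strings';
        # unimport() has cleared the bit for the inner scope only.
        BEGIN { push @seen, ($^H & HINT_UNI_8_BIT) ? 1 : 0 }
    }

    # Leaving the inner block restores the outer scope's hints.
    BEGIN { push @seen, ($^H & HINT_UNI_8_BIT) ? 1 : 0 }
}

print "@seen\n";    # 1 0 1
```

Reading $^H inside BEGIN blocks is what makes this work: the variable holds compile-time hints for the scope currently being compiled, which is the same information pp_uc and friends consult cheaply at run time instead of looking up a key in %^H.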