Add /a regex modifier

This restricts certain constructs, like \w, to matching in the ASCII range only.
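
A minimal sketch of the behavior described above, not code from this commit. It uses the infix form C<(?a:...)> because the perldelta entry says the suffix form is unavailable in 5.14. The helper C<matches>, the C<%result> hash, and the sample characters are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches $regex, else 0.
sub matches {
    my ($string, $regex) = @_;
    return $string =~ $regex ? 1 : 0;
}

# Sample characters, written as escapes so the source stays pure ASCII.
my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $e_acute      = "\x{E9}";     # LATIN SMALL LETTER E WITH ACUTE
my $nbsp         = "\x{A0}";     # NO-BREAK SPACE
my $kelvin       = "\x{212A}";   # KELVIN SIGN

my %result = (
    # Under (?u:...) the classes follow Unicode; under (?a:...) they are
    # restricted to the ASCII range.
    digit_u     => matches($arabic_three, qr/\A(?u:\d)\z/),
    digit_a     => matches($arabic_three, qr/\A(?a:\d)\z/),
    nondigit_a  => matches($arabic_three, qr/\A(?a:\D)\z/),
    word_u      => matches($e_acute,      qr/\A(?u:\w)\z/),
    word_a      => matches($e_acute,      qr/\A(?a:\w)\z/),
    space_u     => matches($nbsp,         qr/\A(?u:\s)\z/),
    space_a     => matches($nbsp,         qr/\A(?a:\s)\z/),
    # Case-insensitive matching keeps Unicode semantics under (?a).
    kelvin_ai   => matches($kelvin,       qr/\A(?ai:k)\z/),
);

printf "%-11s %d\n", $_, $result{$_} for sort keys %result;
```

The C<\D> case shows the complement rule: a non-ASCII digit counts as a non-digit once C<\d> is limited to C<[0-9]>.
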
author: Karl Williamson <public@khwilliamson.com> 2011-01-17 08:58:53 -0700
committer: Karl Williamson <public@khwilliamson.com> 2011-01-17 09:20:20 -0700
commit: cfaf538b6276c6a8ef80ff6c66e106c6a4f1caaa (patch)
tree: b452229efc219b8936089921181cd3bedb77718a /pod
parent: 0c6e81ebcf01f01349b1260a05c55b61266c80d4 (diff)
download: perl-cfaf538b6276c6a8ef80ff6c66e106c6a4f1caaa.tar.gz
4 files changed, 44 insertions, 9 deletions
diff --git a/pod/perldelta.pod b/pod/perldelta.pod
index b1ea785631..13b79a188a 100644
--- a/pod/perldelta.pod
+++ b/pod/perldelta.pod
@@ -28,6 +28,26 @@ here, but most should go in the L</Performance Enhancements> section.
 
 [ List each enhancement as a =head2 entry ]
 
+=head2 New regular expression modifier C</a>
+
+The C</a> regular expression modifier restricts C<\s> to match precisely
+the five characters C<[ \f\n\r\t]>, C<\d> to match precisely the 10
+characters C<[0-9]>, C<\w> to match precisely the 63 characters
+C<[A-Za-z0-9_]>, and the Posix (C<[[:posix:]]>) character classes to
+match only the appropriate ASCII characters.  The complements, of
+course, match everything but; and C<\b> and C<\B> are correspondingly
+affected.  Otherwise, C</a> behaves like the C</u> modifier, in that
+case-insensitive matching uses Unicode semantics; for example, "k" will
+match the Unicode C<\N{KELVIN SIGN}> under C</i> matching, and code
+points in the Latin1 range above ASCII will have Unicode semantics when
+it comes to case-insensitive matching.  Like its cousins (C</u>, C</l>,
+and C</d>), and in spite of the terminology, C</a> in 5.14 will not
+actually be able to be used as a suffix at the end of a regular
+expression (this restriction is planned to be lifted in 5.16).  It must
+occur either as an infix modifier, such as C<(?a:...)> or C<(?a)...>,
+or it can be turned on within the lexical scope of C<use re '/a'>.
+Turning on C</a> turns off the other "character set" modifiers.
+
 =head2 Any unsigned value can be encoded as a character
 
 With this release, Perl is adopting a model that any unsigned value can
diff --git a/pod/perlre.pod b/pod/perlre.pod
index b74618f575..39840fc8c7 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -596,9 +596,9 @@ whitespace formatting, a simple C<#> will suffice.  Note that Perl closes
 the comment as soon as it sees a C<)>, so there is no way to put a literal
 C<)> in the comment.
 
-=item C<(?dlupimsx-imsx)>
+=item C<(?adlupimsx-imsx)>
 
-=item C<(?^lupimsx)>
+=item C<(?^alupimsx)>
 X<(?)> X<(?^)>
 
 One or more embedded pattern-match modifiers, to be turned on (or
@@ -636,8 +636,8 @@ after the C<"?"> is a shorthand equivalent to C<d-imsx>.  Flags (except
 C<"d">) may follow the caret to override it.
 But a minus sign is not legal with it.
 
-Also, starting in Perl 5.14, are modifiers C<"d">, C<"l">, and C<"u">,
-which for 5.14 may not be used as suffix modifiers.
+Also, starting in Perl 5.14, are modifiers C<"a">, C<"d">, C<"l">, and
+C<"u">, which for 5.14 may not be used as suffix modifiers.
 
 C<"l"> means to use a locale (see L<perllocale>) when pattern matching.
 The locale used will be the one in effect at the time of execution of
@@ -659,7 +659,7 @@ Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
 in strict ASCII their meanings are undefined.  Thus the platform
 effectively becomes a Unicode platform.  The ASCII characters remain as
 ASCII characters (since ASCII is a subset of Latin-1 and Unicode).  For
-example, when this option is not on, on a non-utf8 string, C<"\w">
+example, when this option is XXX not on, on a non-utf8 string, C<"\w">
 matches precisely C<[A-Za-z0-9_]>.  When the option is on, it matches
 not just those, but all the Latin-1 word characters (such as an "n" with
 a tilde).  On EBCDIC platforms, which already are equivalent to Latin-1,
@@ -670,6 +670,16 @@ small letters C<MU>; otherwise not; and the C<LATIN CAPITAL LETTER SHARP
 S> will match any of C<SS>, C<Ss>, C<sS>, and C<ss>, otherwise not.
 (This last case is buggy, however.)
 
+C<"a"> is the same as C<"u">, but C<\d>, C<\s>, C<\w>, and the Posix
+character classes are restricted to matching in the ASCII range only.
+That is, with this modifier, C<\d> always means precisely the digits
+C<"0"> to C<"9">; C<\s> means the five characters C<[ \f\n\r\t]>;
+C<\w> means the 63 characters C<[A-Za-z0-9_]>; and likewise, all the
+Posix classes such as C<[[:print:]]> match only the appropriate
+ASCII-range characters.  As you would expect, this modifier causes, for
+example, C<\D> to mean the same thing as C<[^0-9]>.  C<"a"> behaves the
+same as C<"u"> with regard to case-insensitive matches.
+
 C<"d"> means to use the traditional Perl pattern matching behavior.
 This is dualistic (hence the name C<"d">, which also could stand for
 "depends").  When this is in effect, Perl matches utf8-encoded strings
@@ -695,9 +705,9 @@ anywhere in a pattern has a global effect.
 =item C<(?:pattern)>
 X<(?:)>
 
-=item C<(?dluimsx-imsx:pattern)>
+=item C<(?adluimsx-imsx:pattern)>
 
-=item C<(?^luimsx:pattern)>
+=item C<(?^aluimsx:pattern)>
 X<(?^:)>
 
 This is for clustering, not capturing; it groups subexpressions like
@@ -713,7 +723,7 @@ but doesn't spit out extra fields.  It's also cheaper not to capture
 characters if you don't need to.
 
 Any letters between C<?> and C<:> act as flags modifiers as with
-C<(?dluimsx-imsx)>.  For example,
+C<(?adluimsx-imsx)>.  For example,
 
     /(?s-i:more.*than).*million/i
 
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 1b0689b558..6422a2dd24 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -366,6 +366,10 @@ digit, while the character class C<\s> matches any whitespace character.
 New in perl 5.10.0 are the classes C<\h> and C<\v> which match horizontal
 and vertical whitespace characters.
 
+The exact set of characters matched by C<\d>, C<\s>, and C<\w> varies
+depending on various pragmas and regular expression modifiers.  See
+L<perlre>.
+
 The uppercase variants (C<\W>, C<\D>, C<\S>, C<\H>, and C<\V>) are
 character classes that match, respectively, any character that isn't a
 word character, digit, whitespace, horizontal whitespace, or vertical
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 3329d60808..1baeb1672a 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -84,7 +84,8 @@ characters the locale considers decimal digits.  Without a locale, C<\d>
 matches just the digits '0' to '9'.
 
 Unicode digits may cause some confusion, and some security issues.  In UTF-8
-strings, C<\d> matches the same characters matched by
+strings, unless the C<"a"> modifier is specified, C<\d> matches the same
+characters matched by
 C<\p{General_Category=Decimal_Number}>, or synonymously,
 C<\p{General_Category=Digit}>.  Starting with Unicode version 4.1, this is the
 same set of characters matched by C<\p{Numeric_Type=Decimal}>.
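
The security concern about Unicode digits, and the lexically scoped C<use re '/a'> form mentioned in the perldelta entry, can be sketched as follows. This is an illustration under assumptions rather than code from this commit: the variable names and the sample string are hypothetical, and it requires a perl in which C<use re '/a'> is available (5.14 or later).

```perl
use strict;
use warnings;

# Two ASCII digits followed by ARABIC-INDIC DIGIT THREE (U+0663).
# The code point above 255 makes this a UTF-8 string internally.
my $mixed = "12\x{0663}";
my $ascii = "123";

# Default semantics: on a UTF-8 string, \d accepts any Unicode decimal digit.
my $mixed_default = $mixed =~ /\A\d+\z/ ? 1 : 0;

my ($mixed_ascii_only, $ascii_ascii_only);
{
    # Within this block every regex is compiled as if /a were given.
    use re '/a';
    $mixed_ascii_only = $mixed =~ /\A\d+\z/ ? 1 : 0;
    $ascii_ascii_only = $ascii =~ /\A\d+\z/ ? 1 : 0;
}

print "default, mixed digits: $mixed_default\n";
print "/a scope, mixed digits: $mixed_ascii_only\n";
print "/a scope, ASCII digits: $ascii_ascii_only\n";
```

Validation code that later treats the matched text as a number is the typical place where restricting C<\d> to C<[0-9]> matters.
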