author | Karl Williamson <public@khwilliamson.com> | 2010-09-23 23:36:40 -0600 |
---|---|---|
committer | Jesse Vincent <jesse@bestpractical.com> | 2010-10-15 23:14:29 +0900 |
commit | a12cf05f80a65e40fe339b086ab2d10e18d838c1 (patch) | |
tree | bd1254d24bac6bb121801a2a06d01c7e17703b92 /pod/perlre.pod | |
parent | bdc22dd52e899130c8c4111c985fcbd7eec164a5 (diff) | |
download | perl-a12cf05f80a65e40fe339b086ab2d10e18d838c1.tar.gz |
Subject: [perl #58182] partial: Add uni \s,\w matching
This commit causes regex sequences \b, \s, and \w (and complements) to
match in the latin1 range in the scope of feature 'unicode_strings' or
with the /u regex modifier.
It uses the previously unused flags field in the respective regnodes to
indicate the type of matching, and in regexec.c, uses that to decide
which of the handy.h macros to use, native or Latin1.
I chose this for now rather than create new nodes for each type of
match. An earlier version of this patch did that, and in every case the
switch case: statements were adjacent, offering no performance
advantage. If regexec were modified to use in-line functions or more
macros for various short sections of it, then it would be faster to have
new nodes rather than using the flags field. But, using that field
simplified things, as this change flies under the radar in a number of
places where it would not if separate nodes were used.
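To illustrate the flags-field approach described above, here is a minimal sketch in C. It is not the code from regexec.c: the names SEM_NATIVE, SEM_LATIN1, is_word_ascii, is_word_latin1, and matches_word are hypothetical stand-ins for the regnode flags and the handy.h macros, which this message does not name. It assumes an ASCII platform and a non-utf8 string, so each character is one byte.

```c
#include <stdbool.h>

/* Hypothetical values for the regnode flags field; the real names
 * used in the patch are not shown in this commit message. */
enum word_semantics { SEM_NATIVE = 0, SEM_LATIN1 = 1 };

/* Stand-in for the "native" test: exactly [A-Za-z0-9_]. */
static bool is_word_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9') || c == '_';
}

/* Stand-in for the Latin-1 test: the ASCII word characters plus the
 * Latin-1 code points that Unicode treats as word characters
 * (U+00AA, U+00B5, U+00BA, U+00C0-U+00D6, U+00D8-U+00F6,
 * U+00F8-U+00FF). */
static bool is_word_latin1(unsigned char c)
{
    if (c < 0x80)
        return is_word_ascii(c);
    return c == 0xAA || c == 0xB5 || c == 0xBA
        || (c >= 0xC0 && c <= 0xD6)
        || (c >= 0xD8 && c <= 0xF6)
        || c >= 0xF8;
}

/* One node type for \w; the flags value selects the semantics, so no
 * separate node (and no extra switch case) is needed per type. */
static bool matches_word(unsigned char c, unsigned char flags)
{
    switch (flags) {
    case SEM_LATIN1:
        return is_word_latin1(c);
    default:
        return is_word_ascii(c);
    }
}
```

Under these assumptions, matches_word(0xF1, SEM_NATIVE) is false and matches_word(0xF1, SEM_LATIN1) is true, since 0xF1 is "n" with a tilde in Latin-1. The multiplication sign, 0xD7, fails under both settings.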
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- | pod/perlre.pod | 20 |
1 files changed, 17 insertions, 3 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 7329bd8f2e..d4e6599b90 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -646,9 +646,23 @@ L<setlocale() function|perllocale/The setlocale function>.
 This modifier is automatically set if the regular expression is compiled
 within the scope of a C<"use locale"> pragma.
 
-C<"u"> has no effect currently. It is automatically set if the regular
-expression is compiled within the scope of a
-L<C<"use feature 'unicode_strings">|feature> pragma.
+C<"u"> means to use Unicode semantics when pattern matching. It is
+automatically set if the regular expression is compiled within the scope
+of a L<C<"use feature 'unicode_strings">|feature> pragma (and isn't
+also in the scope of L<C<"use locale">|locale> nor
+L<C<"use bytes">|bytes> pragmas. It is not fully implemented at the
+time of this writing, but work is being done to complete the job. On
+EBCDIC platforms this currently has no effect, but on ASCII platforms,
+it effectively turns them into Latin-1 platforms. That is, the ASCII
+characters remain as ASCII characters (since ASCII is a subset of
+Latin-1), but the non-ASCII code points are treated as Latin-1
+characters. Right now, this only applies to the C<"\b">, C<"\s">, and
+C<"\w"> pattern matching operators, plus their complements. For
+example, when this option is not on, C<"\w"> matches precisely
+C<[A-Za-z0-9_]> (on a non-utf8 string). When the option is on, it
+matches not just those, but all the Latin-1 word characters (such as an
+"n" with a tilde). It thus matches exactly the same set of code points
+from 0 to 255 as it would if the string were encoded in utf8.
 
 C<"d"> means to use the traditional Perl pattern matching behavior.
 This is dualistic (hence the name C<"d">, which also could stand for