author: Karl Williamson <public@khwilliamson.com> 2010-09-23 23:36:40 -0600
committer: Jesse Vincent <jesse@bestpractical.com> 2010-10-15 23:14:29 +0900
commit: a12cf05f80a65e40fe339b086ab2d10e18d838c1
tree: bd1254d24bac6bb121801a2a06d01c7e17703b92
parent: bdc22dd52e899130c8c4111c985fcbd7eec164a5
Subject: [perl #58182] partial: Add uni \s,\w matching
This commit causes the regex sequences \b, \s, and \w (and their complements) to match in the Latin1 range in the scope of feature 'unicode_strings' or with the /u regex modifier. It uses the previously unused flags field in the respective regnodes to indicate the type of matching, and regexec.c uses that field to decide which of the handy.h macros to use, native or Latin1. I chose this for now rather than creating new nodes for each type of match. An earlier version of this patch did that, and in every case the switch case: statements were adjacent, offering no performance advantage. If regexec were modified to use in-line functions or more macros for various short sections of it, then it would be faster to have new nodes rather than using the flags field. But using that field simplified things, as this change flies under the radar in a number of places where it would not if separate nodes were used.
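The user-visible effect described above can be illustrated with a minimal sketch. It assumes Perl 5.14 or later on an ASCII platform with no C<use locale> in effect; the helper names (word_default, word_unicode, space_default, space_unicode) are hypothetical and exist only for this illustration. Note that the script deliberately avoids "use 5.014", because that would itself enable the unicode_strings feature for the whole file.

```perl
use strict;
use warnings;

# Runtime version check only; unlike "use 5.014" this does not
# turn on the 'unicode_strings' feature.
require 5.014;

# Hypothetical helpers: the same patterns compiled outside and
# inside the lexical scope of the feature.
sub word_default  { my ($s) = @_; return $s =~ /^\w$/ ? 1 : 0 }
sub space_default { my ($s) = @_; return $s =~ /^\s$/ ? 1 : 0 }

{
    use feature 'unicode_strings';
    sub word_unicode  { my ($s) = @_; return $s =~ /^\w$/ ? 1 : 0 }
    sub space_unicode { my ($s) = @_; return $s =~ /^\s$/ ? 1 : 0 }
}

my $e_acute = "\xE9";   # LATIN SMALL LETTER E WITH ACUTE, stored as one byte
my $nbsp    = "\xA0";   # NO-BREAK SPACE, stored as one byte

# Same character, but upgraded to the internal UTF-8 representation.
my $upgraded = $e_acute;
utf8::upgrade($upgraded);

printf "\\w default, byte string:     %d\n", word_default($e_acute);
printf "\\w default, upgraded string: %d\n", word_default($upgraded);
printf "\\w unicode_strings:          %d\n", word_unicode($e_acute);
printf "\\s default, byte string:     %d\n", space_default($nbsp);
printf "\\s unicode_strings:          %d\n", space_unicode($nbsp);
```

Under those assumptions, the default-scope patterns should fail on the byte-encoded strings and succeed on the upgraded one (the inconsistency that perlunicode calls "The Unicode Bug"), while the patterns compiled under the feature should match regardless of the internal encoding.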
Diffstat (limited to 'pod/perlrecharclass.pod')
-rw-r--r-- pod/perlrecharclass.pod | 8
1 files changed, 7 insertions, 1 deletions
diff --git a/pod/perlrecharclass.pod b/pod/perlrecharclass.pod
index 5aa93486d5..7cb2f78ebc 100644
--- a/pod/perlrecharclass.pod
+++ b/pod/perlrecharclass.pod
@@ -682,7 +682,8 @@ nor EBCDIC, they match the ASCII defaults (0 to 9 for C<\d>; 52 letters,
A regular expression is marked for Unicode semantics if it is encoded in
utf8 (usually as a result of including a literal character whose code
point is above 255), or if it contains a C<\N{U+...}> or C<\N{I<name>}>
-construct.
+construct, or (starting in Perl 5.14) if it was compiled in the scope of a
+C<S<use feature "unicode_strings">> pragma.
The differences in behavior between locale and non-locale semantics
can affect any character whose code point is 255 or less. The
@@ -693,6 +694,11 @@ L<perlunicode/The "Unicode Bug">.
For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.
+That way you can control whether you want matching of just characters in
+the ASCII character set, or any Unicode characters.
+C<S<use feature "unicode_strings">> will allow seamless Unicode behavior
+no matter what the internal encodings are, but won't allow restricting
+to just the ASCII characters.
=head4 Examples