author Karl Williamson <public@khwilliamson.com> 2011-02-02 12:01:34 -0700
committer Karl Williamson <public@khwilliamson.com> 2011-02-02 16:31:23 -0700
commit 56ca34cada940c7f6aae9a59da266e541530041e
tree 98fd450cd1ce016ebeddfdbe4d2241925b1fc618 /pod/perlunicode.pod
parent 19c4061aa8fa454637e29db1afd668c3f66d3a01
Move ANYOF folding from regexec to regcomp
This is for security as well as performance. It allows Unicode properties to be matched case-insensitively. As a result, the swash inversion hash is converted from having UTF-8 keys to having numeric code point keys. It also fixes, for the first time, the bug where /i doesn't work for a code point that is not at the end of a range in a bracketed character class and has a multi-character fold.
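The user-visible effect documented by this patch can be illustrated with a short script. This is only a sketch: it assumes a perl that includes this change (5.14 or later is assumed here), and the sample characters and the %got hash are choices made for the example, not part of the patch.

```perl
use strict;
use warnings;

# Sample characters (chosen for this illustration only).
my $small_a     = "a";         # U+0061, a Lowercase_Letter
my $small_roman = "\x{2170}";  # SMALL ROMAN NUMERAL ONE: Lowercase, but not a letter
my $tamil_a     = "\x{0B85}";  # TAMIL LETTER A, from a script without case

# Record 1 for a match and 0 for a failed match.
my %got = (
    # Uppercase_Letter matches Cased_Letter under /i.
    lu_plain      => ($small_a =~ /\p{Uppercase_Letter}/      ? 1 : 0),
    lu_i          => ($small_a =~ /\p{Uppercase_Letter}/i     ? 1 : 0),

    # Uppercase matches Cased under /i, which covers Roman numerals.
    upper_roman   => ($small_roman =~ /\p{Uppercase}/         ? 1 : 0),
    upper_roman_i => ($small_roman =~ /\p{Uppercase}/i        ? 1 : 0),

    # Roman numerals are not letters, so they are not Cased_Letter.
    lu_roman_i    => ($small_roman =~ /\p{Uppercase_Letter}/i ? 1 : 0),

    # Most properties, such as a script property, ignore /i.
    tamil_plain   => ($tamil_a =~ /\p{Tamil}/                 ? 1 : 0),
    tamil_i       => ($tamil_a =~ /\p{Tamil}/i                ? 1 : 0),
    latin_tamil_i => ($small_a =~ /\p{Tamil}/i                ? 1 : 0),
);

printf "%-14s %d\n", $_, $got{$_} for sort keys %got;
```

Only the entries that combine /i with one of the case-related properties should differ from their case-sensitive counterparts; the Tamil entries are expected to give the same answer with and without /i.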
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- pod/perlunicode.pod 27
1 files changed, 27 insertions, 0 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 360af1d767..7f3a795198 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -349,6 +349,27 @@ You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
(^) between the first brace and the property name: C<\p{^Tamil}> is
equal to C<\P{Tamil}>.
+Almost all properties are immune to case-insensitive matching. That is,
+adding a C</i> regular expression modifier does not change what they
+match. There are two sets that are affected.
+The first set is
+C<Uppercase_Letter>,
+C<Lowercase_Letter>,
+and C<Titlecase_Letter>,
+all of which match C<Cased_Letter> under C</i> matching.
+And the second set is
+C<Uppercase>,
+C<Lowercase>,
+and C<Titlecase>,
+all of which match C<Cased> under C</i> matching.
+This set also includes its subsets C<PosixUpper> and C<PosixLower> both
+of which under C</i> matching match C<PosixAlpha>.
+(The difference between these sets is that some things, such as Roman
+Numerals come in both upper and lower case so they are C<Cased>, but aren't considered to be
+letters, so they aren't C<Cased_Letter>s.)
+L<perluniprops> includes a notation for all forms that have C</i>
+differences.
+
=head3 B<General_Category>
Every Unicode character is assigned a general category, which is the "most
@@ -773,6 +794,12 @@ C<\p> or C<\P> construct.
Note that the effect is compile-time and immutable once defined.
+However the subroutines are passed a single parameter which is 0 if
+case-sensitive matching is in effect, and non-zero if caseless matching
+is in effect. The subroutine may return different values depending on
+the value of the flag, and one set of values will immutably be in effect
+for all case-sensitive matches; the other set for all case-insensitive
+matches.
The subroutines must return a specially-formatted string, with one
or more newline-separated lines. Each line must be one of the following:
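
The caseless flag described in the second hunk might be used as in the following sketch. InAbc is a hypothetical property name invented for this illustration, and the returned strings use the tab-separated hexadecimal range form of the user-defined property format. It again assumes a perl that includes this change.

```perl
use strict;
use warnings;

# Hypothetical user-defined property (the name must start with "In" or "Is").
# The single argument is 0 for case-sensitive matching and non-zero when
# caseless matching is in effect.
sub InAbc {
    my ($caseless) = @_;
    return $caseless
        ? "0041\t0043\n0061\t0063\n"   # A-C and a-c under /i
        : "0061\t0063\n";              # a-c only otherwise
}

# Record 1 for a match and 0 for a failed match.
my %got = (
    lower_plain => ("b" =~ /\p{InAbc}/  ? 1 : 0),
    upper_plain => ("B" =~ /\p{InAbc}/  ? 1 : 0),
    upper_i     => ("B" =~ /\p{InAbc}/i ? 1 : 0),
    other_i     => ("Z" =~ /\p{InAbc}/i ? 1 : 0),
);

printf "%-12s %d\n", $_, $got{$_} for sort keys %got;
```

Because the result for each value of the flag is fixed once it has been computed, the subroutine should depend only on its argument and not on any state that changes while the program runs.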