author | Karl Williamson <public@khwilliamson.com> | 2011-02-02 12:01:34 -0700 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2011-02-02 16:31:23 -0700 |
commit | 56ca34cada940c7f6aae9a59da266e541530041e (patch) | |
tree | 98fd450cd1ce016ebeddfdbe4d2241925b1fc618 /pod/perlunicode.pod | |
parent | 19c4061aa8fa454637e29db1afd668c3f66d3a01 (diff) | |
download | perl-56ca34cada940c7f6aae9a59da266e541530041e.tar.gz |
Move ANYOF folding from regexec to regcomp
This is for security as well as performance. It allows Unicode properties to
not be matched case sensitively. As a result the swash inversion hash is
converted from having utf8 keys to numeric, code point, keys.
It also, for the first time, fixes the bug where /i doesn't work for a code point
that has a multi-character fold and is not at the end of a range in a bracketed
character class.
Diffstat (limited to 'pod/perlunicode.pod')
-rw-r--r-- | pod/perlunicode.pod | 27 |
1 files changed, 27 insertions, 0 deletions
diff --git a/pod/perlunicode.pod b/pod/perlunicode.pod
index 360af1d767..7f3a795198 100644
--- a/pod/perlunicode.pod
+++ b/pod/perlunicode.pod
@@ -349,6 +349,27 @@ You can also use negation in both C<\p{}> and C<\P{}> by introducing a caret
 (^) between the first brace and the property name: C<\p{^Tamil}> is
 equal to C<\P{Tamil}>.
 
+Almost all properties are immune to case-insensitive matching. That is,
+adding a C</i> regular expression modifier does not change what they
+match. There are two sets that are affected.
+The first set is
+C<Uppercase_Letter>,
+C<Lowercase_Letter>,
+and C<Titlecase_Letter>,
+all of which match C<Cased_Letter> under C</i> matching.
+And the second set is
+C<Uppercase>,
+C<Lowercase>,
+and C<Titlecase>,
+all of which match C<Cased> under C</i> matching.
+This set also includes its subsets C<PosixUpper> and C<PosixLower> both
+of which under C</i> matching match C<PosixAlpha>.
+(The difference between these sets is that some things, such as Roman
+Numerals come in both upper and lower case so they are C<Cased>, but aren't considered to be
+letters, so they aren't C<Cased_Letter>s.)
+L<perluniprops> includes a notation for all forms that have C</i>
+differences.
+
 =head3 B<General_Category>
 
 Every Unicode character is assigned a general category, which is the "most
@@ -773,6 +794,12 @@ C<\p> or C<\P> construct.
 
 
 Note that the effect is compile-time and immutable once defined.
+However the subroutines are passed a single parameter which is 0 if
+case-sensitive matching is in effect, and non-zero if caseless matching
+is in effect. The subroutine may return different values depending on
+the value of the flag, and one set of values will immutably be in effect
+for all case-sensitive matches; the other set for all case-insensitive
+matches.
 
 The subroutines must return a specially-formatted string, with one
 or more newline-separated lines. Each line must be one of the following:
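The behavior documented by the patch above can be sketched as follows. This is a minimal illustration, not part of the commit; it assumes a perl that contains this change (5.14 or later appears to be the first stable release that does). The property name `InUpperVowel` and all variable names are hypothetical and chosen only for the example.

```perl
use strict;
use warnings;

# Built-in cased properties: under /i, Uppercase_Letter is documented
# to match the same characters as Cased_Letter.
my $lu_plain    = "a" =~ /\p{Uppercase_Letter}/  ? 1 : 0;  # case-sensitive
my $lu_caseless = "a" =~ /\p{Uppercase_Letter}/i ? 1 : 0;  # caseless

# Hypothetical user-defined property.  The subroutine receives one
# parameter: 0 for case-sensitive matching, non-zero for caseless.
# It returns newline-separated hexadecimal code points.
sub InUpperVowel {
    my ($caseless) = @_;
    my $upper = "41\n45\n49\n4F\n55\n";   # A E I O U
    my $lower = "61\n65\n69\n6F\n75\n";   # a e i o u
    return $caseless ? $upper . $lower : $upper;
}

my $udp_plain    = "e" =~ /\p{InUpperVowel}/  ? 1 : 0;
my $udp_caseless = "e" =~ /\p{InUpperVowel}/i ? 1 : 0;

print "Lu plain=$lu_plain caseless=$lu_caseless\n";
print "InUpperVowel plain=$udp_plain caseless=$udp_caseless\n";
```

Because each set of returned values is fixed once the property is first used, the subroutine should be deterministic for a given flag value; the sketch returns constant strings for that reason.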