regex: Add pseudo-Posix class: 'cased'

/[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property \p{Cased}. This commit introduces a pseudo-Posix class, internally named 'cased', to represent this. The class can't be specified directly by the user; it arises only through /[[:upper:]]/i or /[[:lower:]]/i. Debug output will say ':cased:'.

When the regex parser encounters :lower: or :upper: under /i, it changes them into :cased:, which the already existing logic handles just like any other class.

This commit fixes the regression introduced in 3018b823898645e44b8c37c70ac5c6302b031381, and also fixes the fact that these classes have never worked under 'use locale'. The next commit will un-TODO the tests for these things.
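The rewrite described in the message can be sketched roughly as follows. This is a standalone illustration, not the code in regcomp.c or regexec.c: the `cc_class` enum and the functions `effective_class` and `class_matches` are hypothetical names, and the standard `<ctype.h>` tests stand in for Perl's `isLOWER_LC` and `isUPPER_LC` macros.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical stand-ins for Perl's _CC_ENUM_* values; not the real enum. */
typedef enum { CC_LOWER, CC_UPPER, CC_CASED, CC_DIGIT } cc_class;

/* Parse-time step: under case-insensitive matching, :lower: and :upper:
 * are both replaced by the pseudo-class :cased:.  Other classes are left
 * unchanged. */
cc_class effective_class(cc_class cls, bool fold)
{
    if (fold && (cls == CC_LOWER || cls == CC_UPPER))
        return CC_CASED;
    return cls;
}

/* Match-time step: test one byte against a class in the current locale.
 * :cased: matches if either :lower: or :upper: would. */
bool class_matches(cc_class cls, unsigned char c)
{
    switch (cls) {
        case CC_LOWER: return islower(c) != 0;
        case CC_UPPER: return isupper(c) != 0;
        case CC_CASED: return islower(c) || isupper(c);
        case CC_DIGIT: return isdigit(c) != 0;
    }
    assert(0 && "unhandled class");
    return false;
}
```

With this sketch, `class_matches(effective_class(CC_UPPER, true), 'a')` is true in the default "C" locale, whereas `class_matches(CC_UPPER, 'a')` is false.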
author: Karl Williamson <public@khwilliamson.com> 2012-12-30 21:14:58 -0700
committer: Karl Williamson <public@khwilliamson.com> 2012-12-31 11:03:28 -0700
commit: b0d691b286d92d66e559deb75501333ab819383b (patch)
tree: a3148f7f77ccb80faadea0a5f30774485f7f36b1 /regexec.c
parent: e8d596e06a8502f992b53ea859e136ec40f7497c (diff)
download: perl-b0d691b286d92d66e559deb75501333ab819383b.tar.gz
1 files changed, 13 insertions, 1 deletions
diff --git a/regexec.c b/regexec.c
index df7288a893..4e2008a577 100644
--- a/regexec.c
+++ b/regexec.c
@@ -438,6 +438,8 @@ S_isFOO_lc(pTHX_ const U8 classnum, const U8 character)
         case _CC_ENUM_ALPHA:     return isALPHA_LC(character);
         case _CC_ENUM_ASCII:     return isASCII_LC(character);
         case _CC_ENUM_BLANK:     return isBLANK_LC(character);
+        case _CC_ENUM_CASED:     return isLOWER_LC(character)
+                                        || isUPPER_LC(character);
         case _CC_ENUM_CNTRL:     return isCNTRL_LC(character);
         case _CC_ENUM_DIGIT:     return isDIGIT_LC(character);
         case _CC_ENUM_GRAPH:     return isGRAPH_LC(character);
@@ -7330,7 +7332,17 @@ S_reginclass(pTHX_ regexp * const prog, const regnode * const n, const U8* const
                  * will be 1, so the exclusive or will reverse things, so we
                  * are testing for \W.  On the third iteration, 'to_complement'
                  * will be 0, and we would be testing for \s; the fourth
-                 * iteration would test for \S, etc. */
+                 * iteration would test for \S, etc.
+                 *
+                 * Note that this code assumes that all the classes are closed
+                 * under folding.  For example, if a character matches \w, then
+                 * its fold does too; and vice versa.  This should be true for
+                 * any well-behaved locale for all the currently defined Posix
+                 * classes, except for :lower: and :upper:, which are handled
+                 * by the pseudo-class :cased: which matches if either of the
+                 * other two does.  To get rid of this assumption, an outer
+                 * loop could be used below to iterate over both the source
+                 * character, and its fold (if different) */
 
                 int count = 0;
                 int to_complement = 0;
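The new comment mentions that an outer loop over the source character and its fold would remove the closed-under-folding assumption. A minimal sketch of that idea, assuming single-byte characters and using `tolower`/`toupper` from `<ctype.h>` as a crude approximation of folding, might look like the following. The names `class_pred` and `matches_with_folds` are hypothetical and do not appear in regexec.c.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Any <ctype.h>-style classification function, e.g. isupper or isdigit. */
typedef int (*class_pred)(int);

/* Returns true if the character, or one of its case variants, satisfies
 * the class predicate.  Trying the variants explicitly means the class
 * itself no longer has to be closed under folding. */
bool matches_with_folds(class_pred pred, unsigned char c)
{
    assert(pred != NULL);

    /* The source character first, then its lower- and uppercase forms. */
    const int candidates[3] = { c, tolower(c), toupper(c) };

    for (size_t i = 0; i < 3; i++) {
        /* Skip a case variant that is identical to the source character. */
        if (i > 0 && candidates[i] == c)
            continue;
        if (pred(candidates[i]))
            return true;
    }
    return false;
}
```

For example, `matches_with_folds(isupper, 'a')` succeeds in the default "C" locale because the uppercase variant 'A' is tried, while `matches_with_folds(isdigit, 'a')` still fails.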