Teach regex optimizer to handle above-Latin1

Until this commit, the regular expression optimizer has essentially punted on above-Latin1 code points. Under some circumstances they would be taken into account, more or less, but often the generated synthetic start class would end up matching all above-Latin1 code points.

With the advent of inversion lists, it becomes feasible to fully handle such code points, as inversion lists are a convenient way to express arbitrary lists of code points and take their union, intersection, etc. This commit changes the optimizer to use inversion lists for operating on the code points the synthetic start class can match.

I don't much understand the overall operation of the optimizer. I'm told that previous porters found that perturbing it caused unexpected behaviors. I had promised to get this change into 5.18, but didn't. I'm trying to get it into the 5.20 preliminary series early enough that any problems will surface before 5.20 ships.

This commit doesn't change the macro-level logic, but does significantly change various micro-level things. Thus the 'and' and 'or' subroutines have been rewritten to use inversion lists. I'm pretty confident that they do what their names suggest. I re-derived the equations for what these operations should do, getting the same results in some cases, but extending others where the previous code mostly punted. The derivations are given in comments in the respective routines.

Some of the code is greatly simplified, as it no longer has to treat above-Latin1 specially.

It is now feasible for /i matching of above-Latin1 code points to know explicitly the folds that should be in the synthetic start class. But more preparatory work needs to be done before putting that into place. ...
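For readers unfamiliar with inversion lists, the following is a minimal, standalone sketch of the idea the message relies on: a sorted array of code points in which each element flips membership, so that union and intersection reduce to a single merge pass. It is not the code in regcomp.c; the names invlist_contains and invlist_combine and the use of plain uint32_t arrays are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustration only: an inversion list here is a strictly increasing array
 * of code points.  Element 0 starts a range that is in the set, element 1
 * starts the following range that is not, and so on, alternating.  A list
 * with an odd number of elements extends to infinity. */

/* Return 1 if cp is in the set described by list[0..len-1], else 0. */
static int invlist_contains(const uint32_t *list, size_t len, uint32_t cp)
{
    size_t lo = 0, hi = len;

    assert(list != NULL || len == 0);
    /* Binary search for the number of elements <= cp; it ends up in lo. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (list[mid] <= cp)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* An odd count means cp falls inside an "in the set" range. */
    return (int)(lo & 1);
}

/* Merge a and b into out, which must have room for alen + blen elements.
 * want_both == 0 yields the union, want_both != 0 the intersection.
 * Returns the number of elements written to out. */
static size_t invlist_combine(const uint32_t *a, size_t alen,
                              const uint32_t *b, size_t blen,
                              int want_both, uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;
    int in_a = 0, in_b = 0, in_out = 0;

    while (i < alen || j < blen) {
        uint32_t cp;
        int now;

        /* Pick the smallest boundary not yet consumed. */
        if (j >= blen || (i < alen && a[i] <= b[j]))
            cp = a[i];
        else
            cp = b[j];

        /* Crossing a boundary flips membership in whichever list owns it. */
        if (i < alen && a[i] == cp) { in_a = !in_a; i++; }
        if (j < blen && b[j] == cp) { in_b = !in_b; j++; }

        /* Record a boundary in the result only when its state changes. */
        now = want_both ? (in_a && in_b) : (in_a || in_b);
        if (now != in_out) {
            out[n++] = cp;
            in_out = now;
        }
    }
    return n;
}
```

In this representation "every code point above Latin1" is the one-element list {0x100}, which is why a start class no longer needs a separate flag to describe that range.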
author: Karl Williamson <public@khwilliamson.com> 2013-09-22 21:36:29 -0600
committer: Karl Williamson <public@khwilliamson.com> 2013-09-24 11:36:19 -0600
commit: cdd87c1d4df41f9a54cccff996fa64d291adcee8 (patch)
tree: 433feb19de18b9fb38c9a4dc8c1830ad3ae4be6c /regcomp.h
parent: fb38762fa113a105b623d0eb7681d2cc03b0c161 (diff)
download: perl-cdd87c1d4df41f9a54cccff996fa64d291adcee8.tar.gz
1 files changed, 2 insertions, 8 deletions
diff --git a/regcomp.h b/regcomp.h
index f0153fc12c..eccb46690a 100644
--- a/regcomp.h
+++ b/regcomp.h
@@ -357,14 +357,6 @@ struct regnode_ssc {
 
 #define ANYOF_FLAGS_ALL		(0xff & ~0x10)
 
-/* These are the flags that ANYOF_INVERT being set or not doesn't affect
- * whether they are operative or not.  e.g., the node still has LOCALE
- * regardless of being inverted; whereas ANYOF_ABOVE_LATIN1_ALL means something
- * different if inverted */
-#define INVERSION_UNAFFECTED_FLAGS (ANYOF_LOCALE                        \
-	                           |ANYOF_LOC_FOLD                      \
-	                           |ANYOF_POSIXL                         \
-	                           |ANYOF_NONBITMAP_NON_UTF8)
 #define ANYOF_LOCALE_FLAGS (ANYOF_LOCALE                        \
                            |ANYOF_LOC_FOLD                      \
                            |ANYOF_POSIXL)
@@ -482,6 +474,8 @@ struct regnode_ssc {
 #define ANYOF_POSIXL_OR(source, dest) STMT_START { (dest)->classflags |= (source)->classflags ; } STMT_END
 #define ANYOF_CLASS_OR(source, dest) ANYOF_POSIXL_OR((source), (dest))
 
+#define ANYOF_POSIXL_AND(source, dest) STMT_START { (dest)->classflags &= (source)->classflags ; } STMT_END
+
 #define ANYOF_BITMAP_ZERO(ret)	Zero(((struct regnode_charclass*)(ret))->bitmap, ANYOF_BITMAP_SIZE, char)
 #define ANYOF_BITMAP(p)		(((struct regnode_charclass*)(p))->bitmap)
 #define ANYOF_BITMAP_BYTE(p, c)	(ANYOF_BITMAP(p)[(((U8)(c)) >> 3) & 31])
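The effect of the new ANYOF_POSIXL_AND macro next to the existing ANYOF_POSIXL_OR can be seen in isolation with a small sketch. The two macro bodies are copied from the hunk above; everything else is an assumption for illustration: struct demo_ssc is a hypothetical cut-down node holding only classflags, the DEMO_* bit values are made up, and STMT_START / STMT_END are defined locally as stand-ins for perl's statement wrappers.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for perl's STMT_START / STMT_END statement wrappers. */
#define STMT_START do
#define STMT_END   while (0)

/* Hypothetical, cut-down node: only the field the macros touch. */
struct demo_ssc { uint32_t classflags; };

/* Macro bodies as they appear in the diff. */
#define ANYOF_POSIXL_OR(source, dest) STMT_START { (dest)->classflags |= (source)->classflags ; } STMT_END
#define ANYOF_POSIXL_AND(source, dest) STMT_START { (dest)->classflags &= (source)->classflags ; } STMT_END

/* Made-up bit assignments, not the real POSIX class numbering. */
enum { DEMO_ALPHA = 1 << 0, DEMO_DIGIT = 1 << 1, DEMO_SPACE = 1 << 2 };

/* Apply ANYOF_POSIXL_AND to a pair of flag words and return the result. */
static uint32_t demo_and(uint32_t source_flags, uint32_t dest_flags)
{
    struct demo_ssc s = { source_flags }, d = { dest_flags };

    ANYOF_POSIXL_AND(&s, &d);
    /* Intersection can only clear bits in dest, never add new ones. */
    assert((d.classflags & ~dest_flags) == 0);
    return d.classflags;
}

/* Apply ANYOF_POSIXL_OR to a pair of flag words and return the result. */
static uint32_t demo_or(uint32_t source_flags, uint32_t dest_flags)
{
    struct demo_ssc s = { source_flags }, d = { dest_flags };

    ANYOF_POSIXL_OR(&s, &d);
    /* Union keeps every bit that dest already had. */
    assert((d.classflags & dest_flags) == dest_flags);
    return d.classflags;
}
```

With these definitions, AND keeps only the classes set in both nodes and OR keeps the classes set in either, mirroring the intersection and union performed on the code point lists.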