author | Karl Williamson <public@khwilliamson.com> | 2014-01-12 23:39:43 -0700 |
---|---|---|
committer | Karl Williamson <public@khwilliamson.com> | 2014-01-22 11:45:56 -0700 |
commit | 710680787cad21825395c0224606ac1535624c52 (patch) | |
tree | c4f06a3bbc9eaa17894e69081b84b73c4baea6f0 /regcomp.h | |
parent | beab9ebe349dffa8fc22a2912b83f62d2365e594 (diff) | |
download | perl-710680787cad21825395c0224606ac1535624c52.tar.gz |
Use bit instead of node for regex SSC
The flag bits in regular expression ANYOF nodes are perennially in short
supply. However there are still plenty of regex nodes possible. So one
solution to needing to pass more information is to create a node that
encapsulates what is needed. That is what commit
9aa1e39f96ac28f6ce5d814d9a1eccf1464aba4a did to tell regexec.c that a
particular ANYOF node is for the synthetic start class (SSC). However
this solution introduces other issues. If you have to express two
things, then you need a regnode for A, a regnode for B, a regnode for
both A and B, and another regnode for neither A nor B. With three
things, you need 8 regnodes to express all possible combinations. This
becomes unwieldy to write code for. The number of combinations goes way
down if some of them are mutually exclusive. At the time of that
commit, I thought that an SSC need not ever warn if matching against an
above-Unicode code point. I was wrong, and that has been corrected
earlier in the 5.19 series.
But it finally came to me how to tell regexec that an ANYOF node is
for the SSC without taking up a flag bit and without requiring a regnode
type. The 'next_off' field in a regnode tells the engine the offset in
the regex program to the node it's supposed to go to after processing
this one. Since the SSC stands alone, its 'next_off' field is unused,
and we can put anything we want in it. That, however, is not true of
other ANYOF regnodes. But it turns out that there are certain values
that will never be legitimate in the 'next_off' field in these, and so
this commit uses one of those to signal that this ANYOF node is an SSC.
regnodes come in various sizes, and the offset is measured in units of
the smallest one: it counts how many of those lie between this node and
the next node to look at. Since ANYOF nodes are large, the offset is
always > 1, and so this commit uses 1 to indicate an SSC.
Diffstat (limited to 'regcomp.h')
-rw-r--r-- | regcomp.h | 20 |
1 files changed, 16 insertions, 4 deletions
@@ -204,6 +204,19 @@ struct regnode_ssc {
     SV* invlist;                /* list of code points matched */
 };
 
+/* We take advantage of 'next_off' not otherwise being used in the SSC by
+ * actually using it: by setting it to 1.  This allows us to test and
+ * distinguish between an SSC and other ANYOF node types, as 'next_off' cannot
+ * otherwise be 1, because it is the offset to the next regnode expressed in
+ * units of regnodes.  Since an ANYOF node contains extra fields, it adds up
+ * to 12 regnode units on 32-bit systems, (hence the minimum this can be (if
+ * not 0) is 11 there.  Even if things get tightly packed on a 64-bit system,
+ * it still would be more than 1. */
+#define set_ANYOF_SYNTHETIC(n) STMT_START{ OP(n) = ANYOF;              \
+                                           NEXT_OFF(n) = 1;            \
+                               } STMT_END
+#define is_ANYOF_SYNTHETIC(n) (OP(n) == ANYOF && NEXT_OFF(n) == 1)
+
 /* XXX fix this description.
    Impose a limit of REG_INFTY on various pattern matching operations
    to limit stack growth and to avoid "infinite" recursions.
@@ -322,10 +335,9 @@ struct regnode_ssc {
  * performance penalty.  Better would be to split it off into a separate node,
  * which actually would improve performance a bit by allowing regexec.c to test
  * for a UTF-8 character being above 255 without having to call a function nor
- * calculate its code point value.  However, this solution might need to have a
- * second node type, ANYOF_SYNTHETIC_ABOVE_LATIN1_ALL.  Several flags are not
- * used in synthetic start class (SSC) nodes, so could be shared should new
- * flags be needed for SSCs. */
+ * calculate its code point value.  Several flags are not used in synthetic
+ * start class (SSC) nodes, so could be shared should new flags be needed for
+ * SSCs. */
 
 /* regexec.c is expecting this to be in the low bit */
 #define ANYOF_INVERT            0x01