This new regnode is used to handle interpolated already-compiled regex
sets inside outer regex sets.
If it isn't present, it will mean that what appears to be a nested,
interpolated set really isn't.
I created a new regnode structure to hold a pointer. This has to be
temporary as pointers can be invalidated. I thought of just having a
regnode without a pointer as a marker, and using a parallel array to
store the data, rather than creating a whole new regnode structure for
just pointers, but parallel data structures can get out of sync, so this
seemed best.
This commit just sets up the regnode; a future commit will actually use
it.
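A minimal sketch of the idea (the struct name, field names, and pointer
type are illustrative assumptions, not the actual regcomp.h layout):

    #include <stdint.h>

    typedef uint8_t  U8;
    typedef uint16_t U16;

    /* Sketch: a regnode variant whose argument is a pointer.  Whatever
     * it points to must be consumed before compilation finishes, since
     * the pointer can later be invalidated. */
    struct regnode_p {
        U8    flags;
        U8    type;
        U16   next_off;   /* offset to the next regnode */
        void *arg;        /* the interpolated, already-compiled set */
    };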
|
Under the given circumstances, these work precisely like others that
already have good descriptions.
|
It turns out that one isn't supposed to fill in the offset to the next
regnode at node creation time. And this node is EXACTish, so the
string portion isn't accounted for in its regcomp.sym definition.
|
This node is like ANYOFHb, but is used when more than one leading byte
is the same in all the matched code points.
ANYOFHb is used to avoid having to convert from UTF-8 to code point for
something that won't match. It checks that the first byte in the UTF-8
encoded target is the desired one, thus ruling out most of the possible
code points.
But for higher code points that require longer UTF-8 sequences, many
non-matching code points pass this filter. For code points above
0xFFFF, it is ineffective for almost 200K of them.
This commit creates a new node type that addresses this problem.
Instead of a single byte, it stores as many leading bytes as are the
same for all code points that match the class. For many classes, that
will cut down the number of possible false positives by a huge amount
before having to convert to code point to make the final determination.
This regnode adds a UTF-8 string at the end. It is still much smaller,
even in the rare worst case, than a plain ANYOF node because the maximum
string length, 15 bytes, is still shorter than the 32-byte bitmap that
is present in a plain ANYOF. Most of the time the added string will
instead be at most 4 bytes.
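A sketch of the kind of filtering this enables (a hypothetical helper,
not the actual regexec.c code): compare the stored common prefix
against the start of the target's UTF-8 sequence before doing any full
decode.

    #include <string.h>

    /* Sketch: reject quickly when the target's UTF-8 sequence cannot
     * possibly encode a matching code point.  'prefix' and
     * 'prefix_len' would come from the ANYOFHs node; the names here
     * are illustrative. */
    static int
    anyofhs_may_match(const unsigned char *s,
                      const unsigned char *prefix, size_t prefix_len)
    {
        /* All matching code points share these leading bytes, so a
         * mismatch rules the character out without decoding it. */
        return memcmp(s, prefix, prefix_len) == 0;
    }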
|
I don't think the lack of these affects anything, but they were
inconsistent.
|
This is like the ANYOFR regnode added in the previous commit, but all
code points in the range it matches are known to have the same first
UTF-8 start byte. That means it can't match UTF-8 invariant characters,
like ASCII, because each of those has a different "start" byte, so this
node could only match a range of 1, and the compiler wouldn't generate
it for that; it uses an EXACT node instead.
Pattern matching can rule out most code points by looking at the first
character of their UTF-8 representation, before having to convert from
UTF-8.
On ASCII platforms this simple comparison rules out all but 64 of the
2-byte UTF-8 characters; for 3-byte characters it's up to 4096, and for
4-byte, 2**18, so the test is less effective for higher code points.
I believe that most UTF-8 patterns that otherwise would compile to
ANYOFR will instead compile to this, as I can't envision real life
applications wanting to match large single ranges. Even the 2048
surrogates all have the same first byte.
|
This matches a single range of code points. It is both faster and
smaller than other ANYOF-type nodes, requiring, after set-up, a single
subtraction and conditional branch.
The vast majority of Unicode properties match a single range (though
most of the properties likely to be used in real world applications have
more than a single range). But things like [ij] are a single range, and
those are quite commonly encountered. This new regnode matches them more
efficiently than a bitmap would, and doesn't require the space for one
either.
The flags field is used to store the minimum matchable start byte for
UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH
nodes which have a similar mechanism, allows for quick weeding out of
many possible matches without having to convert the UTF-8 to its
corresponding code point.
This regnode packs the 32-bit argument with 20 bits for the minimum
code point the node matches, and 12 bits for the length of the range.
If the input doesn't fit in those fields, it simply won't compile to
this regnode, instead going to one of the ANYOFH flavors.
ANYOFR is sufficient to match all of Unicode except for the final
(private use) 65K plane.
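A sketch of the packing and the run-time test (macro and function
names are invented for illustration):

    #include <stdint.h>

    #define ANYOFR_DELTA_BITS 12
    #define ANYOFR_DELTA_MASK ((1u << ANYOFR_DELTA_BITS) - 1)

    /* Sketch: 20 bits of minimum code point, 12 bits of range length */
    static uint32_t
    anyofr_pack(uint32_t min_cp, uint32_t delta)
    {
        return (min_cp << ANYOFR_DELTA_BITS) | (delta & ANYOFR_DELTA_MASK);
    }

    /* Sketch: unsigned wraparound makes the match one subtraction plus
     * one compare; cp below min_cp wraps to a huge value and fails. */
    static int
    anyofr_matches(uint32_t arg, uint32_t cp)
    {
        uint32_t min_cp = arg >> ANYOFR_DELTA_BITS;
        return cp - min_cp <= (arg & ANYOFR_DELTA_MASK);
    }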
|
Having this enabled me to more quickly understand what's going on.
A trailing period is removed from some long descriptions to make them
slightly shorter.
|
The new name is shorter and, I believe, clearer.
|
This is like LEXACT, but it is known that only strings encoded in UTF-8
will match it, so we don't even have to try if that condition isn't met.
|
This commit adds a new regnode for strings that don't fit in a regular
one, and adds a structure for that regnode to use. Actually using them
is deferred to the next commit.
This new regnode structure is needed because the previous structure
only allows for an 8-bit length field, 255 bytes maximum. This commit
instead puts the length in a new field, in the same place that
single-argument regnodes put their argument. Hence this long-string
node carries an extra 32 bits of overhead, but at no string length is
it ever bigger than the combination of the smaller nodes it replaces.
I also considered simply combining the original 8-bit length field
(which is now unused) with the first byte of the string field to get a
16-bit length, with the actual string offset by 1. But I rejected that
because the string would then usually not be aligned, slowing down
memory accesses.
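A sketch of the resulting layout (illustrative names; the real
definition lives in regcomp.h):

    #include <stdint.h>

    typedef uint8_t  U8;
    typedef uint16_t U16;
    typedef uint32_t U32;

    /* Sketch: a long-string EXACT node.  The 32-bit slot that
     * single-argument regnodes use for their argument holds the
     * string length instead, and the string follows, fully aligned. */
    struct regnode_lstring {
        U8   flags;
        U8   type;
        U16  next_off;
        U32  str_len;      /* replaces the old 8-bit length */
        char string[1];    /* str_len bytes follow the fixed header */
    };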
This new LEXACT regnode can hold up to what 1024 regular EXACT ones hold,
using 4K fewer overhead bytes to do so. That means it can handle
strings containing 262000 bytes. The comments give ideas for expanding
that should it become necessary or desirable.
Besides the space advantage, any hardware acceleration in memcmp
can be done in much bigger chunks, and otherwise the memcmp inner loop
(often written in assembly) will run many more times in a row, and our
outer loop that calls it, correspondingly fewer.
|
EXACTFU nodes now always fold their strings; the information here had
not been updated to reflect that change.
And the descriptions of several EXACTish nodes are changed to be
slightly shorter and to remove mention of the string length, which is
problematic, and is covered in the description for EXACT.
|
I spent some time in this code trying to understand some things, and as
a result I'm commenting previously undocumented features. The comments
about what an entry in regcomp.sym should look like are moved to that
file, rather than the file that reads it. The former is the one most
often touched, and the two had gotten out of sync. Things now make more
sense to me, and hopefully will to anyone using this in the future.
|
The length of an EXACTish node is the same bits as the FLAGS field in
other nodes; it doesn't "precede the length", as previously claimed.
|
This commit adds a new regnode, ANYOFHr, like ANYOFH, but it also has a
loose upper bound for the first UTF-8 byte matchable by the node. (The
'r' stands for 'range'). It would be nice to have a tight upper bound,
but to do so requires 4 more bits than are available without changing
the node arguments types, and hence increasing the node size. Having a
loose bound is better than no bound, and comes essentially free, by
using two unused bits in the current ANYOFH node, and requiring only a
few extra pipeline-able mask and shift instructions at run time, with no
extra conditionals. Any ANYOFH nodes that would benefit from having an upper
bound will instead be compiled into this node type.
Its use is based on the following observations.
There are 64 possible start bytes, so the full range can be expressed in
6 bits. This means that the flags field in ANYOFH nodes containing the
start byte has two extra bits that can be used for something else.
An ANYOFH node is only generated when there is no matching code point
in the bit map, so the smallest code point it could match is 256. The
start byte for that is C4, so there are actually only 60 possible start
bytes.
(perl can be compiled with a larger bit map in which case the minimum
start byte would be even higher.)
A second observation is that knowing the highest start byte is above F0
is no better than knowing it's F0. This is because the highest code
point whose start byte is F0 is U+3FFFF, and all code points above that
that are currently allocated are all very specialized and rarely
encountered. And there's no likelihood of that changing anytime soon as
there's plenty of unallocated space below that. So if an ANYOFH node's
highest start byte is F0 or above, there's no advantage to knowing what
the actual max possible start byte is, so leave it as ANYOFH.
That means the highest start byte we care about in ANYOFHr is EF. That
cuts the number of start bytes we care about down to 43, still 6 bits
required to represent them, but it allows for the following scheme:
Populate the flags field by subtracting C0 from the lowest start byte
and shifting left 2 bits. That leaves the bottom two bits unused.
We use them as follows, where x is the start byte of the lowest code
point in the node:
bits
----
11 The upper limit of the range can be as much as (EF - x) / 8
10 The upper limit of the range can be as much as (EF - x) / 4
01 The upper limit of the range can be as much as (EF - x) / 2
00 The upper limit of the range can be as much as EF
That partitions the loose upper bound into 4 possible ranges, with it
being tighter the closer it is to the strict lower bound. This makes
the loose upper bound more meaningful when there is most to gain by
having one.
Some examples of what the various upper bounds would be for all the
possibilities of these two bits are:
Upper bound given the 2 bits
Low bound 11 10 01 00
--------- -- -- -- --
C4 C9 CE D9 EF
D0 D3 D7 DF EF
E0 E1 E3 E7 EF
Start bytes of E0 and above represent many more code points each than
lower ones, as they are 3 byte sequences instead of two. This scheme
provides tighter bounds for them, which is also a point in its favor.
Thus we have provided a loose upper bound using two otherwise unused
bits. An alternate scheme could have had the intervals all the same,
but this provides a tighter bound when it makes the most sense to.
For EBCDIC the corresponding range is from C8 to F4.
Tests will be added in a later commit.
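The scheme decodes without branches. A sketch (function names invented
for illustration; C0 and EF are the bounds described above):

    #include <stdint.h>

    /* Sketch: encode the low start byte plus the 2-bit width code 'b'
     * into the 8-bit flags field.  b: 3 => span/8, 2 => span/4,
     * 1 => span/2, 0 => upper bound is EF itself. */
    static uint8_t
    anyofhr_encode(uint8_t low_start_byte, uint8_t b)
    {
        return (uint8_t)(((low_start_byte - 0xC0) << 2) | b);
    }

    /* Sketch: one shift replaces all four cases, dividing the span by
     * 8, 4, 2, or 1 according to the bottom two bits. */
    static uint8_t
    anyofhr_loose_max(uint8_t flags)
    {
        uint8_t low = (uint8_t)(0xC0 + (flags >> 2));
        return (uint8_t)(low + ((0xEF - low) >> (flags & 3)));
    }

Checking this against the C4 row of the table above: EF - C4 is 43,
and 43 shifted right by 3, 2, 1, and 0 gives 5, 10, 21, and 43, i.e.
upper bounds C9, CE, D9, and EF.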
|
This commit adds a lower bound for the first UTF-8 byte matchable by an
ANYOFH node. The flags field is otherwise unused, and using it for this
purpose allows code to rule out match possibilities without having to
convert from UTF-8 to code point.
It might be better to do the inverse instead, to have the field be an
upper bound. The reason is that the conversion is cheap for smaller
numbers. The following commit mostly addresses this.
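A sketch of the check this provides (illustrative names): the first
byte of each UTF-8 character in the target is compared against the
stored bound, with no decoding needed.

    /* Sketch: 'first_byte' is the first byte of a UTF-8 encoded
     * character in the target; 'low_bound' is the minimum start byte
     * stored in the node's flags field.  Anything below the bound
     * cannot start a matching character. */
    static int
    anyofh_candidate(unsigned char first_byte, unsigned char low_bound)
    {
        return first_byte >= low_bound;
    }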
|
This is easier to read, especially when a third type is added a few
commits ahead.
|
Simplify the description for ANYOFHb
|
ANYOFHb will be for nodes where all the matching code points share the
first UTF-8 byte. ANYOFH will be for all others. Neither of these has a
bitmap.
I noticed that we can omit some execution conditionals by splitting the
nodes.
|
These were misleading, as elsewhere a leading 'N' in the name means the
complement. Instead, move the N to the end of the name.
|
An ANYOFH regnode is generated instead of a plain ANYOF one when
nothing it can match is in the bitmap used in ANYOF nodes. It is
therefore smaller, as the 4-word (or more) bitmap is omitted.
This means that for it to match a target string, that string must be
UTF-8 (since the bitmap is for at least the lowest 256 code points).
And only in rare circumstances are there any flags associated with it in
the regnode flags field.
This commit changes things so that the flags field in an ANYOFH node is
repurposed to be the first UTF-8 encoded byte of every code point
matched by the class if there is a common byte for all of them; or 0 if
some have different first bytes.
(That means that those rare cases where the flags field isn't otherwise
empty can no longer be ANYOFH nodes.)
The purpose of this is so that a future commit can take advantage of
this, and more quickly scan the target string for places that this node
can match. A problem with ANYOF nodes is that they are code point
oriented (U32 or U64), and the target string is UTF-8, so conversion has
to be done. By having the partial conversion compiled in, we can look
for that byte at runtime instead of having to examine every character
in the scan.
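A sketch of the faster scan this enables (illustrative, not the actual
regexec.c code): when the node records a common first byte, candidate
match positions can be found with memchr rather than decoding every
character.

    #include <string.h>

    /* Sketch: skip directly to the next possible match position.
     * 'first_byte' comes from the node's repurposed flags field;
     * 0 means the matched code points have differing first bytes. */
    static const char *
    next_candidate(const char *s, const char *end,
                   unsigned char first_byte)
    {
        if (first_byte == 0)
            return s;   /* no common byte: every position qualifies */
        return (const char *) memchr(s, first_byte, (size_t)(end - s));
    }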
|
See [perl #132367].
|
This commit adds a regnode for the case where nothing in the bit map
matches. This allows the bitmap to be omitted, saving 32 bytes of
otherwise wasted space per node. Many non-Latin Unicode properties have
this characteristic. Further, since this node applies only to code
points above 255, which are representable only in UTF-8, we can
trivially fail a match where the target string isn't in UTF-8. Time
savings also accrue from skipping the bitmap look-up. When swashes are
removed, even more time will be saved.
|
The ANYOFM/NANYOFM regnodes are generalizations of these. They have
more masks and shifts than the removed nodes, but not more branches, so
are effectively the same speed. Remove the ASCII/NASCII nodes in favor
of having less code to maintain.
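A sketch of the mask test these nodes share (illustrative names; the
real macros live in the perl source). The removed ASCII node is the
special case mask = 0x80, arg = 0:

    /* Sketch: an ANYOFM node matches byte c exactly when
     * (c & mask) == arg; NANYOFM matches when it doesn't.  The
     * removed ASCII test is the special case (c & 0x80) == 0. */
    static int
    anyofm_matches(unsigned char c, unsigned char mask, unsigned char arg)
    {
        return (c & mask) == arg;
    }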
|
Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for
handling an 's' at the leading edge, at the trailing edge, and at both
edges of nodes under /di that contain nothing else that would prevent
them from being EXACTFU nodes. If two of these get joined, it could
create an 'ss' sequence, which can't be in an EXACTFU node, because
U+DF would match it unconditionally; under /di it should match 'ss' if
and only if the target string is UTF-8 encoded.
I realized later that having three types becomes harder to deal with
when adding yet more node types, so this commit turns the three into
just one node type, indicating that at least one edge of the node is an
's'.
It also simplifies the parsing of the pattern and determining which node
to use.
|
EXACTFUP was created by the previous commit to handle a problematic case
in which not all the code points in an EXACTFU node are /i foldable at
compile time. Doing so will allow a future commit to use the pre-folded
EXACTFU nodes (done in a prior commit), saving execution time for the
common case. The only problematic code point is the MICRO SIGN. Most
patterns don't use this character.
EXACTFU_SS is problematic in a different way. It contains the sequence
'ss', which LATIN SMALL LETTER SHARP S folds to, but everything in it
can be pre-folded (unless it also contains a MICRO SIGN). The reason
this is problematic is that it is the only non-UTF-8 node where folding
can change the length. To process it at runtime, the more
general fold equivalence function is used that is capable of handling
length disparities, but is slower than the functions otherwise used for
non-UTF-8.
What I've chosen to do for now is to make a single node type for all the
problematic cases (which at this time means just the two aforementioned
ones). If we didn't do this, we'd have to add a third node type for
patterns that contain both 'ss' and MICRO. Or artificially split the
pattern so the two never were in the same node, but we can't do that
because it can cause bugs in handling multi-character folds. If more
special handling is found to be needed, there'd be a combinatorial
explosion of additional node types to handle all possible combinations.
What this effectively means is that the slower, more general foldEQ
function is used for portions of patterns containing the MICRO sign when
the pattern isn't in UTF-8, even though there is no inherent reason to
do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
|
If a non-UTF-8 pattern contains a MICRO SIGN, this special node is now
created. This character is the only one that doesn't itself need UTF-8
to be represented but whose fold does, which causes some issues, so it
has to be specially handled. When matching against a non-UTF-8 target
string, the
pattern is effectively folded, but not if the target is UTF-8. By
creating this node, we can remove the special handling required for the
nodes that don't have a MICRO SIGN, in a future commit.
|
So, we don't have to re-fold them.
Previous commits have caused any EXACTFAA nodes to be pre-folded, and we
now have the infrastructure in regexec.c to take advantage of this,
including in non-UTF-8 patterns. This commit changes the code to do so.
The only EXACTFAA nodes not considered pre-folded are non-UTF-8 nodes
matched against a UTF-8 target string. The reason is that the MICRO
SIGN folds to something representable only in UTF-8, so if both pattern
and target are non-UTF-8, the node effectively is folded, and if the
pattern is in UTF-8, the sign gets folded to the proper character.
In order for non-UTF-8 /iaa nodes to always be fully folded, there would
need to be a separate node for ones that contain the MICRO SIGN, and
then only that one wouldn't be considered folded when the target is
UTF-8. I don't think it's worth it, as the only gain would be in
matching a non-UTF-8 /iaa node against a UTF-8 target string. I suspect
/iaa will be used mostly against non-UTF-8 target strings. Comments
have been added to point this out in case someone thinks it should be
implemented.
|
It turns out that now, the regular methods for handling multi-character
folds work for the ones involving LATIN SMALL LETTER SHARP S when the
pattern is in UTF-8. So the special code for handling this case can be
removed, and a regular EXACTFU node is generated. This has the
advantage of being trie-able, and requiring fewer operations at run
time, as the pattern is pre-folded at compile time, and doesn't have to
be re-folded during each backtracking at run-time.
This means that the EXACTFU_SS node type will only be generated for
non-UTF-8 patterns, and the handling of it is unchanged in these cases.
|
The previous two commits fixed bugs where it would be possible during
optimization to join two EXACTFish nodes together, and the result would
not work properly with LATIN SMALL LETTER SHARP S. But in doing so,
the commits prevented all non-UTF-8 EXACTFU nodes that begin or end
with [Ss] from being trieable.
This commit changes things so that the only ones that are non-trieable
are the ones that, when joined, have the sequence [Ss][Ss] in them. To
do so, I created three new node types that indicate if the
node begins with [Ss] or ends with them, or both. These preclude having
to examine the node contents at joining to determine this. And since
there are plenty of node types available, it seemed the best choice.
But other options would be available should we run out of nodes.
Examining the first and final characters of a node is not expensive, for
example.
|
This is a regnode that otherwise would be an EXACTFU except that it
contains a code point that requires UTF-8 to match, including all the
possible folds involving it. Hence if the target string isn't UTF-8, we
know it can't possibly match, without needing to try.
For completeness, there could also be an EXACTFAA_ONLY8 and an
EXACTFL_ONLY8 created, but I think these are unlikely to actually appear
in the wild, since using /aa is mainly about ASCII, and /l mostly will
involve characters that don't require UTF-8.
|
This is a regnode that otherwise would be an EXACT except that it
contains a code point that requires UTF-8 to represent. Hence if the
target string isn't UTF-8, we know it can't possibly match, without
needing to try.
|
This matches when the existing node ANYOFM would not match; i.e., they
are complements.
I almost didn't create this node, but it turns out to significantly
speed up various classes of matches. For example, qr/[^g]/, both with
/i and without, turns into this node; and something like

    (("a" x $large_number) . "b") =~ /[^a]/

goes through the string a word at a time, instead of byte-by-byte as
before. Benchmarks are at the end of this message.
This node gets generated when complementing any single ASCII character
and when complementing any ASCII case pair, like /[^Gg]/. It never gets
generated if the class includes a character that isn't ASCII (actually
UTF-8 invariant, which matters only on EBCDIC platforms).
The details of when this node gets constructed are complicated. It
happens when the bit patterns of the characters in the class happen to
have certain very particular characteristics, depending on the vagaries
of the character set. [BbCc] will do so, but [AaBb] does not. [^01]
does, but not [^12]. Precisely, look at all the bit patterns of the
characters in the set, and count the total number of differing bits,
calling it 'n'. If and only if the number of characters is 2**n, this
node gets generated. As an example, on both ASCII and EBCDIC, the last
4 bits of '0' are 0000; of '1' are 0001; of '2' are 0010; and of '3' are
0011. The other 4 bits are the same for each of these 4 digits. That
means that only 2 bits differ among the 4 characters, and 2**2==4, so
the NANYOFM node will get generated. Similarly, 8=1000 and 0=0000
differ only in one bit so 2**1==2, and so [^08] will generate this node.
In the future, we could consider an extension where, if the input
doesn't work to generate this node, we construct the closure of that
input to generate it, which would have false positives that would then
have to be tested for. The speedup from this node is so significant
that that could still be faster than what we have today.
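A sketch of the construction test described above (illustrative code;
the real logic is in regcomp.c). The bits that vary across the set
form a mask, and the node is usable exactly when the set contains
every combination of those bits:

    #include <stddef.h>

    /* Sketch: decide whether a set of n distinct bytes can be matched
     * by a single (c & mask) == arg test, as ANYOFM/NANYOFM require.
     * Returns 1 and fills *mask/*arg on success. */
    static int
    mask_representable(const unsigned char *set, size_t n,
                       unsigned char *mask, unsigned char *arg)
    {
        unsigned char diff = 0;
        size_t i, combos = 1;

        for (i = 1; i < n; i++)
            diff |= (unsigned char)(set[0] ^ set[i]); /* bits that vary */

        for (i = 0; i < 8; i++)       /* combos = 2**(differing bits) */
            if (diff & (1u << i))
                combos <<= 1;

        if (combos != n)              /* every combination must occur */
            return 0;
        *mask = (unsigned char) ~diff;
        *arg  = (unsigned char)(set[0] & *mask);
        return 1;
    }

For [BbCc] the differing bits are 0x21, two bits, and 2**2 equals the
four characters, so it qualifies; [AaBb] has three differing bits but
only four characters, so it is rejected, matching the rule above.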
The benchmarks are for a 64-bit word. 32-bits would not be as good.
Key:
Ir Instruction read
Dr Data read
Dw Data write
COND conditional branches
IND indirect branches
The numbers (except for the final column) represent raw counts per loop
iteration. The higher the number in the final column, the faster.
(('a' x 1) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 2782.0 2648.0 105.1
Dr 845.0 799.0 105.8
Dw 531.0 500.0 106.2
COND 431.0 419.0 102.9
IND 22.0 22.0 100.0
(('a' x 10) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 3358.0 2671.0 125.7
Dr 998.0 801.0 124.6
Dw 630.0 500.0 126.0
COND 503.0 424.0 118.6
IND 22.0 22.0 100.0
(('a' x 100) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 9118.0 2773.0 328.8
Dr 2528.0 814.0 310.6
Dw 1620.0 500.0 324.0
COND 1223.0 450.0 271.8
IND 22.0 22.0 100.0
(('a' x 1000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 66718.0 3650.0 1827.9
Dr 17828.0 923.0 1931.5
Dw 11520.0 500.0 2304.0
COND 8423.0 668.0 1260.9
IND 22.0 22.0 100.0
(('a' x 10000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir 642718.0 12650.0 5080.8
Dr 170828.0 2048.0 8341.2
Dw 110520.0 500.0 22104.0
COND 80423.0 2918.0 2756.1
IND 22.0 22.0 100.0
(('a' x 100000) . 'b') =~ /[^a]/
blead nanyof Ratio %
-------- -------- --------
Ir Inf 102654.8 6237.1
Dr Inf 13299.3 12788.9
Dw Inf 500.9 219708.7
COND 800424.1 25419.1 3148.9
IND 22.0 22.0 100.0
|
The comments could lead one to think one could specify any of the
argument fields that nodes can have. But in fact, the value is a
boolean, 0 meaning to use the normal offset field of all regnodes; and 1
meaning to use the ARG field that some regnodes have. If a regnode had
more than just the one argument field, the one that corresponds to that
would be used.
This commit enforces that, and changes regcomp.sym to not use '2',
which is misleading.
It also clarifies the comments about this and about what '.' means in
the flags field.
|
This changes regcomp.sym to generate the correct lengths for ANYOF
nodes, which means they don't have to be special-cased in regcomp.c,
leading to simplification.
|
This is like ANYOFL, but has runtime matches of /[[:posix:]]/ in it,
which requires extra space. Adding this will allow a future commit to
simplify handling for ANYOF nodes.
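A sketch of why extra space is needed (an illustrative layout, not the
exact regcomp.h definition): beyond the usual bitmap, the node must
record which POSIX classes have to be tested against the runtime
locale.

    #include <stdint.h>

    typedef uint8_t  U8;
    typedef uint16_t U16;
    typedef uint32_t U32;

    /* Sketch: an ANYOFL-style node extended with a word saying which
     * /[[:posix:]]/ classes must be checked at match time. */
    struct regnode_charclass_posixl {
        U8   flags;
        U8   type;
        U16  next_off;
        U32  arg1;
        char bitmap[32];    /* the usual 256-bit class bitmap */
        U32  classflags;    /* one bit per POSIX class to test */
    };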
|
There are currently two similar backtracking states for simple
non-greedy pattern repeats:

    CURLY_B_min
    CURLY_B_min_known

The latter is a variant of the former for when the character which must
follow the repeat is known, e.g. /(...)*?X.../, which allows quick
skipping to the next viable position.
The code for the two cases:

    case CURLY_B_min_fail:
    case CURLY_B_min_known_fail:

shares a lot of similarities. This commit merges the two states into a
single CURLY_B_min state, with an associated single CURLY_B_min_fail
fail state.
That one code block can handle both types, with a single

    if (ST.c1 == CHRTEST_VOID) ...

test to choose between the two variant parts of the code.
This makes the code smaller and more maintainable, at the cost of one
extra test per backtrack.
|
The EXACTFA nodes are in fact not generated by /a, but by /aa. Change
the name to EXACTFAA to correspond.
I found myself getting confused by this.
|
This uses a mask instead of a bitmap, and is restricted to representing
invariant characters under UTF-8 that meet particular bit patterns.
|
These will be used in a future commit.
|
To be used in the implementation thereof.
|
RT #126697
In a regex, after executing a (?{...}) code block, if we fail and
backtrack over the code block, we're supposed to unwind the savestack,
so that, for example, any local()s within the code block are undone.
It turns out that a backtracking state isn't pushed for (?{...}), only
for postponed evals (i.e. (??{...})). This means that it relies on one
of the earlier backtracking states to clear the savestack on its
behalf. This can't always be relied upon, and the ticket above
contains code where this falls down; in particular:
    'ABC' =~ m{
        \A
        (?:
            (?: AB | A | BC )
            (?{
                local $count = $count + 1;
                print "! count=$count; ; pos=${\pos}\n";
            })
        )*
        \z
    }x
Here we end up relying on TRIE_next to do the cleaning up, but TRIE_next
doesn't, since there's nothing it would be responsible for that needs
cleaning up.
The solution to this is to push a backtrack state for every (?{...}) as
well as every (??{...}). The sole job of that state is to do a
LEAVE_SCOPE(ST.lastcp).
The existing backtrack state EVAL_AB has been renamed EVAL_postponed_AB
to make it clear it's only used on postponed /(??{A})B/ regexes, and a new
state has been added, EVAL_B, which is only called when backtracking after
failing something in the B in /(?{...})B/.
|
GOSTART is a special case of GOSUB; by unifying them we can remove a
lot of offset twiddling and other special casing, at pretty much no
cost.
GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as
a U32 and an I32 respectively. ARG() holds the "parno" we will recurse
into. ARG2L() holds a signed offset to the relevant start node for the
recursion.
Prior to this patch the argument to GOSUB would always be >= 1, and
unlike other parts of our logic we would not use 0 to represent the
start/end of the pattern; GOSTART was used for "recurse to beginning
of pattern" instead. After this patch we use 0 to represent
"start/end", and a lot of complexity goes away along with the GOSTART
regop.
|