| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
this way we can avoid pushing every buffer, we only need to push
the nestroot of the ref.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This eliminates the regnode_2L data structure, and merges it with the older
regnode_2 data structure. At the same time it makes each "arg" property of the
various regnode types that have one be consistently structured as an anonymous
union like this:
    union {
        U32 arg1u;
        I32 arg1i;
        struct {
            U16 arg1a;
            U16 arg1b;
        };
    };
We then expose four macros for accessing each slot: ARG1u(), ARG1i(),
ARG1a() and ARG1b(). Code then explicitly designates which it wants. The old
logic used ARG() to access a U32 arg1, and ARG1() to access an I32 arg1,
which was confusing to say the least. The regnode_2L structure had a U32 arg1
and an I32 arg2, and the regnode_2 data structure had two I32 args. With the new
set of macros we use the regnode_2 for both, and use the appropriate macros to
show whether we want signed or unsigned values.
This also renames the regnode_4 to regnode_3. The 3 stands for "three 32-bit
args". However, as each slot can also store two U16s, a regnode_3 can hold up
to 6 U16s, or 3 I32s, or a combination. For instance the CURLY style nodes
use regnode_3 to store 4 values, ARG1i() for min count, ARG2i() for max count
and ARG3a() and ARG3b() for parens before and inside the quantifier.
It also changes the functions reganode() to reg1node() and changes reg2Lanode()
to reg2node(). The 2L thing was just confusing.
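The slot layout described above can be sketched in C as follows. This is a
simplified illustration, not the actual perl header: the struct name
sketch_regnode_3, the SK_ARG_SLOT helper macro, the typedefs and
sketch_fill_curly() are assumptions made for the example, and anonymous
unions and structs require C11. It follows the CURLY usage described above
(ARG1i min, ARG2i max, ARG3a/ARG3b paren counts).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for perl's integer typedefs. */
typedef uint8_t  U8;
typedef uint16_t U16;
typedef uint32_t U32;
typedef int32_t  I32;

/* One 32-bit argument slot, viewable as unsigned, signed, or two
 * 16-bit halves. Anonymous members need C11. */
#define SK_ARG_SLOT(n)                                \
    union {                                           \
        U32 arg##n##u;                                \
        I32 arg##n##i;                                \
        struct { U16 arg##n##a; U16 arg##n##b; };     \
    }

/* Simplified node with three 32-bit argument slots. */
struct sketch_regnode_3 {
    U8  flags;
    U8  type;
    U16 next_off;
    SK_ARG_SLOT(1);
    SK_ARG_SLOT(2);
    SK_ARG_SLOT(3);
};

/* Accessors that state explicitly which view of the slot is wanted. */
#define ARG1u(p) ((p)->arg1u)
#define ARG1i(p) ((p)->arg1i)
#define ARG2i(p) ((p)->arg2i)
#define ARG3a(p) ((p)->arg3a)
#define ARG3b(p) ((p)->arg3b)

/* Store a quantifier's four values in one node: min and max as signed
 * 32-bit values, the two paren counts as 16-bit halves of slot 3. */
static void sketch_fill_curly(struct sketch_regnode_3 *node,
                              I32 min, I32 max, U16 before, U16 inside)
{
    assert(min >= 0 && min <= max);
    ARG1i(node) = min;
    ARG2i(node) = max;
    ARG3a(node) = before;
    ARG3b(node) = inside;
}
```

Reading a slot through a different view than it was written with is the
point of the union: a value stored via ARG1i() can be read back unsigned via
ARG1u() without a cast at the call site.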
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This way we can do the required paren restoration only when it is in use. When
we match a REF type node which is potentially a reference to an unclosed paren
we push the match context information, currently for "everything", but in a
future patch we can teach it to be more efficient by adding a new parameter to
the REF regop to track which parens it should save.
This converts the backtracking changes from the previous commit, so that it is
run only when specifically enabled via the define RE_PESSIMISTIC_PARENS which
is by default 0. We don't make the new fields in the struct conditional as the
stack frames are large and our changes don't make any real difference and it
keeps things simpler to not have conditional members, especially since some of
the structures have to line up with each other.
If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means
something is sensitive to us not necessarily restoring the parens properly on
failure. We make some assumptions that the paren state after a failing state
will be corrected by a future successful state, or that the state of the
parens is irrelevant as we will fail anyway. This can be made not true by
EVAL, backrefs, and potentially some other scenarios. Thus I have left this
inefficient logic in place but guarded by the flag.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
the same time. When a branch fails it should reset any capture buffers
that might be touched by its branch.
We change BRANCH and BRANCHJ to store the number of parens before the
branch, and the number of parens after the branch was completed. When
a BRANCH operation fails, we clear the buffers it contains before we
continue on.
It is a bit more complex than it should be because we have BRANCHJ
and BRANCH. (One of these days we should merge them together.)
This is also made somewhat more complex because TRIE nodes are actually
branches, and may need to track capture buffers also, at two levels:
the overall TRIE op, and, for jump tries especially, where we emulate
the behavior of branches. So we have to do the same clearing logic if
a trie branch fails as well.
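A minimal sketch of the clearing step, assuming a flat table of capture
offsets. The struct, constants and function name are hypothetical and much
simpler than the real match state; the only idea taken from the text is that
a branch records the paren count before and after itself and unsets that
range on failure. For /((a)(b)|(a))+/ the first alternative would record 1
paren before and 3 after, so groups 2 and 3 are the ones cleared.

```c
#include <assert.h>

#define SK_MAX_PARENS 8
#define SK_UNSET (-1)

/* Hypothetical capture table: start/end offsets per group, SK_UNSET
 * when the group has no value. Index 0 is unused. */
struct sketch_caps {
    int start[SK_MAX_PARENS + 1];
    int end[SK_MAX_PARENS + 1];
};

/* Unset groups parens_before+1 .. parens_after, i.e. exactly the
 * groups opened inside the branch that just failed. */
static void sketch_branch_failed(struct sketch_caps *caps,
                                 int parens_before, int parens_after)
{
    assert(0 <= parens_before && parens_before <= parens_after
           && parens_after <= SK_MAX_PARENS);
    for (int i = parens_before + 1; i <= parens_after; i++) {
        caps->start[i] = SK_UNSET;
        caps->end[i]   = SK_UNSET;
    }
}
```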
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This was originally a patch which made somewhat drastic changes to how
we represent capture buffers, which Dave M and I are still
discussing offline and which has a larger impact than is acceptable to
address at the current time. As such I have reverted the controversial
parts of this patch for now, while keeping most of it intact even if in
some cases the changes are unused except for debugging purposes.
This patch still contains valuable changes, for instance teaching CURLYX
and CURLYM about how many parens there are before the curly[1] (which
will be useful in follow up patches even if strictly speaking they are
not directly used yet), tests and other cleanups. Also this patch is
sufficiently large that reverting it out would have a large effect on
the patches that were made on top of it.
Thus keeping most of this patch while eliminating the controversial
parts of it for now seemed the best approach, especially as some of the
changes it introduces and the follow up patches based on it are very
useful in cleaning up the structures we use to represent regops.
[1] Curly is the regexp internals term for quantifiers, named after
x{min,max} "curly brace" quantifiers.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This splits a bunch of the subcomponents of the regex engine into
smaller files.
regcomp_debug.c
regcomp_internal.h
regcomp_invlist.c
regcomp_study.c
regcomp_trie.c
The only real change, besides the build changes needed to achieve the split,
is to also add some new defines which can be used in embed.fnc to control
exports without having to enumerate /every/ regex engine file. For
instance all of regcomp*.c defines PERL_IN_REGCOMP_ANY, and this is used
in embed.fnc to manage exports.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Having internal tabs causes confusion in diffs and reviews. In the
following patch I will move a lot of code around, creating new files
and they will all be whitespace clean: no trailing whitespace,
tabs expanded to the next tabstop properly, and no trailing empty
lines at the bottom of the file.
This patch prepares for that split, and future splits and changes to
the regex engine by precleaning the main regex engine files with the
same rules.
It should show no changes under '-w'.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It turns out that any character class whose UTF-8 representation is two
bytes long, and where all elements share the same first byte can be
represented by a compact, fast regnode designed for the purpose.
This commit adds that regnode, ANYOFHbbm. ANYOFHb already exists for
classes where all elements have the same first byte, and this just
changes the two-byte ones to use a bitmap instead of an inversion list.
The advantages of this are that no conversion to code point is required
(the continuation byte is just looked up in the bitmap) and no inversion
list is needed. The inversion list would occupy more space, from 4 to
34 extra 64-bit words, plus an AV and SV, depending on what elements the
class matches.
Many characters in the Latin, Greek, Cyrillic, Hebrew, Arabic,
and several other (lesser-known) scripts are of this form.
It would be possible to extend this technique to larger bitmaps, but
this commit is a start.
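The lookup described above can be sketched as follows, assuming standard
UTF-8 on an ASCII platform, where a two-byte character has 64 possible
continuation bytes (0x80..0xBF). The struct layout and function names are
illustrative assumptions, not the real regnode definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node: one shared start byte plus a 64-bit bitmap with
 * one bit per possible continuation byte. */
struct sketch_bbm {
    uint8_t  first_byte;  /* shared UTF-8 start byte */
    uint64_t bitmap;      /* bit i set: continuation byte 0x80+i matches */
};

/* Add the character <first_byte, cont> to the class. */
static void sketch_bbm_add(struct sketch_bbm *node, uint8_t cont)
{
    assert((cont & 0xC0) == 0x80);      /* must be a continuation byte */
    node->bitmap |= (uint64_t)1 << (cont & 0x3F);
}

/* Match one character at s without decoding it to a code point. */
static bool sketch_bbm_match(const struct sketch_bbm *node,
                             const uint8_t *s, size_t len)
{
    if (len < 2 || s[0] != node->first_byte)
        return false;
    if ((s[1] & 0xC0) != 0x80)          /* malformed: not a continuation */
        return false;
    return (node->bitmap >> (s[1] & 0x3F)) & 1;
}
```

For example, a class containing GREEK SMALL LETTER ALPHA (CE B1) and GAMMA
(CE B3) would store first byte CE and set two bits; BETA (CE B2) passes the
first-byte test but fails the bitmap test.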
|
|
|
|
|
|
| |
This previously was lumped in with plain ANYOF. A future commit will be
easier if this is separated out, and doing so leads to some
simplifications, and removes the need to know all the OPs of this type.
|
| |
|
|
|
|
|
|
|
| |
This reverts commit d62feba66bf43f35d092bb026694f927e9f94d38,
as explained in its commit message. It adds some comments to point out
that the commit exists, for the curious.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Several of the POSIXA classes are a single range on ASCII platforms, and
[:digit:] is a single range on both ASCII and EBCDIC. This regnode was
designed to replace the POSIXA regnode for such classes to get a bit of
performance by not needing to do an array lookup. Instead it encodes
some bits in the flags field that with shifting and masking get the
right values for the single range's bounds for any such node.
However, performance tests conducted by Sergey Aleynikov showed this was
actually slower than what it intended to replace. Rather than
completely drop this work, I'm adding it to blead, and immediately
reverting it, so that should parts of it ever become useful, it would be
available.
A few tests fail; those are skipped for the purposes of this commit so
that it doesn't interfere with bisecting.
The code also isn't completely commented.
One could add a regnode for each posix class it was decided should have
the expected performance boost. But regnodes are a finite resource, and
the boost is probably not large enough to justify doing so.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were not validating, when (?<=a|ab) matched, that the right hand
side of the match lined up with the position of the assertion. Similarly
for (?<!a|ab) and related patterns, e.g. (*positive_lookbehind:).
Note these problems do NOT affect lookahead.
Part of the difficulty here was that the SUCCEED node was serving too
many purposes, necessitating a new regop LOOKBEHIND_END.
Includes more tests for various lookahead or lookbehind cases.
|
|
|
|
|
|
|
| |
These categorize the many types of EXACT nodes, so that code can refer
to a particular subset of such nodes without having to list all of them
out. This simplifies some 'if' statements, and makes updating things
easier.
|
|
|
|
|
| |
These are often tested together. By making them adjacent we can use
inRANGE.
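The benefit of adjacency is that a membership test over several node types
collapses into one range check. The sketch below is not perl's inRANGE
definition; the function name and the opcode enum are hypothetical, and it
only illustrates the usual single-comparison trick with unsigned arithmetic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical opcode numbering where the EXACT-like nodes are adjacent. */
enum sketch_op { SK_END, SK_EXACT, SK_EXACTF, SK_EXACTFU, SK_ANYOF };

/* True when lo <= c <= hi, using one subtraction and one comparison:
 * values below lo wrap around to a huge unsigned number and fail. */
static bool sketch_in_range(unsigned c, unsigned lo, unsigned hi)
{
    assert(lo <= hi);
    return (c - lo) <= (hi - lo);
}
```

With this ordering, "is this op EXACT, EXACTF or EXACTFU" becomes
sketch_in_range(op, SK_EXACT, SK_EXACTFU) instead of three comparisons.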
|
| |
|
|
|
|
|
|
|
|
| |
These are mostly used in regexec.c in three functions. Two of the
functions use less than half the available ones, as case labels in a
switch() statement. By moving all the ones used by those functions to
be nearly contiguous at the beginning, compilers can generate smaller
jump tables for the switch().
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This new regnode is used to handle interpolated already-compiled regex
sets inside outer regex sets.
If it isn't present, it will mean that what appears to be a nested,
interpolated set really isn't.
I created a new regnode structure to hold a pointer. This has to be
temporary as pointers can be invalidated. I thought of just having a
regnode without a pointer as a marker, and using a parallel array to
store the data, rather than creating a whole new regnode structure for
just pointers, but parallel data structures can get out of sync, so this
seemed best.
This commit just sets up the regnode; a future commit will actually use
it.
|
| |
|
|
|
|
|
| |
Under the given circumstances, these work precisely like others that
already have good descriptions.
|
|
|
|
|
|
| |
It turns out that one isn't supposed to fill in the offset to the next
regnode at node creation time. And this node is like EXACTish, so the
string stuff isn't accounted for in its regcomp.sym definition.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This node is like ANYOFHb, but is used when more than one leading byte
is the same in all the matched code points.
ANYOFHb is used to avoid having to convert from UTF-8 to code point for
something that won't match. It checks that the first byte in the UTF-8
encoded target is the desired one, thus ruling out most of the possible
code points.
But for higher code points that require longer UTF-8 sequences, a great
many non-matching code points pass this filter. For code points above
0xFFFF, it is ineffective for almost 200K of them.
This commit creates a new node type that addresses this problem.
Instead of a single byte, it stores as many leading bytes that are the
same for all code points that match the class. For many classes, that
will cut down the number of possible false positives by a huge amount
before having to convert to code point to make the final determination.
This regnode adds a UTF-8 string at the end. It is still much smaller,
even in the rare worst case, than a plain ANYOF node because the maximum
string length, 15 bytes, is still shorter than the 32-byte bitmap that
is present in a plain ANYOF. Most of the time the added string will
instead be at most 4 bytes.
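A minimal sketch of the prefix filter, assuming the node simply carries the
shared leading bytes and their count. The struct and function names are
hypothetical; only the 15-byte maximum is taken from the text above. A target
that fails this check can be rejected without converting to a code point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical node: the leading UTF-8 bytes shared by every code
 * point the class can match. */
struct sketch_anyofhs {
    uint8_t len;          /* number of shared leading bytes */
    uint8_t prefix[15];   /* the bytes themselves */
};

/* Cheap pre-filter: can the character at s possibly match? */
static bool sketch_prefix_ok(const struct sketch_anyofhs *node,
                             const uint8_t *s, size_t avail)
{
    assert(node->len <= sizeof node->prefix);
    return avail >= node->len
        && memcmp(s, node->prefix, node->len) == 0;
}
```

For a class covering U+1F600..U+1F64F every member starts with F0 9F, so
U+10000 (F0 90 80 80) passes a first-byte-only test but is rejected by the
two-byte prefix.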
|
|
|
|
| |
I don't think lack of these affects anything, but they were inconsistent.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is like the ANYOFR regnode added in the previous commit, but all
code points in the range it matches are known to have the same first
UTF-8 start byte. That means it can't match UTF-8 invariant characters,
like ASCII, because the "start" byte is different on each one, so it
could only match a range of 1, and the compiler wouldn't generate this
node for that; instead using an EXACT.
Pattern matching can rule out most code points by looking at the first
character of their UTF-8 representation, before having to convert from
UTF-8.
On ASCII this rules out all but 64 2-byte UTF-8 characters from this
simple comparison. 3-byte it's up to 4096, and 4-byte, 2**18, so the
test is less effective for higher code points.
I believe that most UTF-8 patterns that otherwise would compile to
ANYOFR will instead compile to this, as I can't envision real life
applications wanting to match large single ranges. Even the 2048
surrogates all have the same first byte.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This matches a single range of code points. It is both faster and
smaller than other ANYOF-type nodes, requiring, after set-up, a single
subtraction and conditional branch.
The vast majority of Unicode properties match a single range (though
most of the properties likely to be used in real world applications have
more than a single range). But things like [ij] are a single range, and
those are quite commonly encountered. This new regnode matches them more
efficiently than a bitmap would, and doesn't require the space for one
either.
The flags field is used to store the minimum matchable start byte for
UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH
nodes which have a similar mechanism, allows for quick weeding out of
many possible matches without having to convert the UTF-8 to its
corresponding code point.
This regnode packs the 32 bit argument with 20 bits for the minimum code
point the node matches, and 12 bits for the maximum range. If the input
is a value outside these, it simply won't compile to this regnode,
instead going to one of the ANYOFH flavors.
ANYOFR is sufficient to match all of Unicode except for the final
(private use) 65K plane.
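The packing and the single-subtraction test can be sketched like this. Which
end of the 32-bit argument holds the base and which holds the range size is
an assumption here (base in the low 20 bits, delta in the high 12), as are
all the names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_BASE_BITS 20
#define SK_BASE_MASK ((UINT32_C(1) << SK_BASE_BITS) - 1)         /* 0xFFFFF */
#define SK_DELTA_MAX ((UINT32_C(1) << (32 - SK_BASE_BITS)) - 1)  /* 0xFFF */

/* Pack the inclusive range [lo, hi] into one 32-bit argument.
 * Returns false when it does not fit, in which case the caller would
 * have to fall back to some other node type. */
static bool sketch_anyofr_pack(uint32_t lo, uint32_t hi, uint32_t *arg)
{
    assert(arg);
    if (lo > hi || lo > SK_BASE_MASK || hi - lo > SK_DELTA_MAX)
        return false;
    *arg = ((hi - lo) << SK_BASE_BITS) | lo;
    return true;
}

/* After unpacking, matching is one subtraction and one comparison:
 * code points below the base wrap to a large unsigned value and fail. */
static bool sketch_anyofr_match(uint32_t arg, uint32_t cp)
{
    uint32_t base  = arg & SK_BASE_MASK;
    uint32_t delta = arg >> SK_BASE_BITS;
    return cp - base <= delta;
}
```

With 20 bits for the base, any range starting at or above 0x100000 (the last
plane) cannot be packed, which is consistent with the limit stated above.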
|
|
|
|
|
|
|
| |
Having this enabled me to more quickly understand what's going on.
A trailing period is removed from some long descriptions to make them
slightly shorter.
|
|
|
|
| |
The new name is shorter and, I believe, clearer.
|
|
|
|
|
| |
This is like LEXACT, but it is known that only strings encoded in UTF-8
will match it, so we don't even have to try if that condition isn't met.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a new regnode for strings that don't fit in a regular
one, and adds a structure for that regnode to use. Actually using them
is deferred to the next commit.
This new regnode structure is needed because the previous structure only
allows for an 8 bit length field, 255 max bytes. This commit puts the
length instead in a new field, the same place single-argument regnodes
put their argument. Hence this long string is an extra 32 bits of
overhead, but at no string length is this node ever bigger than the
combination of the smaller nodes it replaces.
I also considered simply combining the original 8 bit length field
(which is now unused) with the first byte of the string field to get a
16 bit length, and have the actual string be offset by 1. But I
rejected that because it would mean the string would usually not be
aligned, slowing down memory accesses.
This new LEXACT regnode can hold up to what 1024 regular EXACT ones hold,
using 4K fewer overhead bytes to do so. That means it can handle
strings containing 262000 bytes. The comments give ideas for expanding
that should it become necessary or desirable.
Besides the space advantage, any hardware acceleration in memcmp
can be done in much bigger chunks, and otherwise the memcmp inner loop
(often written in assembly) will run many more times in a row, and our
outer loop that calls it, correspondingly fewer.
|
|
|
|
|
|
|
|
|
| |
EXACTFU nodes always now fold their strings; the information here had
not been updated to reflect that change.
And the descriptions of several EXACTish nodes are now changed to be
slightly shorter and to remove mention of the string length, which is
problematic, and is covered in the description for EXACT.
|
|
|
|
|
|
|
|
|
| |
I spent some time in this code trying to understand some things, and as
a result I'm commenting previously undocumented features. The comments
about what an entry in regcomp.sym should look like are moved to that
file, rather than the file that reads it. The former is most often
touched, and they had gotten out-of-sync in the latter. Things now make
more sense to me, and hopefully anyone using this in the future.
|
|
|
|
|
| |
The length of an EXACTish node is the same bits as the FLAGS field in
other nodes; it doesn't "precede the length", as previously claimed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a new regnode, ANYOFHr, like ANYOFH, but it also has a
loose upper bound for the first UTF-8 byte matchable by the node. (The
'r' stands for 'range'). It would be nice to have a tight upper bound,
but to do so requires 4 more bits than are available without changing
the node arguments types, and hence increasing the node size. Having a
loose bound is better than no bound, and comes essentially free, by
using two unused bits in the current ANYOFH node, and requiring only a
few extra, pipeline-able, mask, etc instructions at run time, no extra
conditionals. Any ANYOFH nodes that would benefit from having an upper
bound will instead be compiled into this node type.
Its use is based on the following observations.
There are 64 possible start bytes, so the full range can be expressed in
6 bits. This means that the flags field in ANYOFH nodes containing the
start byte has two extra bits that can be used for something else.
An ANYOFH node only happens when there is no matching code point in the
bit map, so the smallest code point that could be is 256. The start
byte for that is C4, so there are actually only 60 possible start bytes.
(perl can be compiled with a larger bit map in which case the minimum
start byte would be even higher.)
A second observation is that knowing the highest start byte is above F0
is no better than knowing it's F0. This is because the highest code
point whose start byte is F0 is U+3FFFF, and all code points above that
that are currently allocated are all very specialized and rarely
encountered. And there's no likelihood of that changing anytime soon as
there's plenty of unallocated space below that. So if an ANYOFH node's
highest start byte is F0 or above, there's no advantage to knowing what
the actual max possible start byte is, so leave it as ANYOFH.
That means the highest start byte we care about in ANYOFHr is EF. That
cuts the number of start bytes we care about down to 43, still 6 bits
required to represent them, but it allows for the following scheme:
Populate the flags field by subtracting C0 from the lowest start byte
and shift left 2 bits. That leaves the bottom two bits unused.
We use them as follows, where x is the start byte of the lowest code
point in the node:
    bits
    ----
     11   The upper limit of the range can be as much as (EF - x) / 8
     10   The upper limit of the range can be as much as (EF - x) / 4
     01   The upper limit of the range can be as much as (EF - x) / 2
     00   The upper limit of the range can be as much as EF
That partitions the loose upper bound into 4 possible ranges, with it
being tighter the closer it is to the strict lower bound. This makes
the loose upper bound more meaningful when there is most to gain by
having one.
Some examples of what the various upper bounds would be for all the
possibilities of these two bits are:
                 Upper bound given the 2 bits
    Low bound      11    10    01    00
    ---------      --    --    --    --
       C4          C9    CE    D9    EF
       D0          D3    D7    DF    EF
       E0          E1    E3    E7    EF
Start bytes of E0 and above represent many more code points each than
lower ones, as they are 3 byte sequences instead of two. This scheme
provides tighter bounds for them, which is also a point in its favor.
Thus we have provided a loose upper bound using two otherwise unused
bits. An alternate scheme could have had the intervals all the same,
but this provides a tighter bound when it makes the most sense to.
For EBCDIC the range is from C8 to F4.
Tests will be added in a later commit.
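The scheme can be sketched in C as below, for ASCII platforms only. The
function names are hypothetical, and the formula for the upper bound (lower
bound plus (EF - x) shifted right by the two-bit value) is my reading of the
table of examples above, which it reproduces; it is not copied from the
implementation.

```c
#include <assert.h>
#include <stdint.h>

#define SK_BASE      0xC0u  /* subtracted from the lowest start byte */
#define SK_MAX_START 0xEFu  /* highest start byte this node cares about */

/* Strict lower bound: the top six bits of the flags field. */
static uint8_t sketch_hr_lower(uint8_t flags)
{
    return (uint8_t)(SK_BASE + (flags >> 2));
}

/* Loose upper bound: bits 00 give EF, 01/10/11 give x plus one half,
 * quarter or eighth of the distance from x to EF. */
static uint8_t sketch_hr_upper(uint8_t flags)
{
    unsigned low  = sketch_hr_lower(flags);
    unsigned bits = flags & 3u;
    return (uint8_t)(low + ((SK_MAX_START - low) >> bits));
}

/* Build the flags byte, choosing the tightest of the four loose
 * bounds that still covers the real highest start byte. */
static uint8_t sketch_hr_encode(uint8_t low, uint8_t high)
{
    assert(low >= SK_BASE && low <= high && high <= SK_MAX_START);
    for (unsigned bits = 3; bits > 0; bits--) {
        if (low + ((SK_MAX_START - low) >> bits) >= high)
            return (uint8_t)(((low - SK_BASE) << 2) | bits);
    }
    return (uint8_t)((low - SK_BASE) << 2);
}
```

For instance, a class whose start bytes run from D0 to D5 cannot use the
tightest bound (D3), so the encoder settles on bits 10, giving a loose upper
bound of D7.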
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a lower bound for the first UTF-8 byte matchable by an
ANYOFH node. The flags field is otherwise unused, and using it for this
purpose allows code to rule out match possibilities without having to
convert from UTF-8 to code point.
It might be better to do the inverse instead, to have the field be an
upper bound. The reason is that the conversion is cheap for smaller
numbers. The commit following mostly addresses this.
|
|
|
|
|
| |
This is easier to read, especially when a third type is added a few
commits ahead.
|
|
|
|
| |
Simplify the description for ANYOFb
|
|
|
|
|
|
|
|
|
| |
ANYOFHb will be for nodes where all the matching code points share the
first UTF-8 byte. ANYOFH will be for all others. Neither of these has a
bitmap.
I noticed that we can omit some execution conditionals by splitting the
nodes.
|
| |
|
|
|
|
|
| |
These were misleading, as elsewhere a leading 'N' in the name means the
complement. Instead move the N to the end of the name.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An ANYOFH regnode is generated instead of a plain ANYOF one when
nothing it can match is in the bitmap used in ANYOF nodes. It is
therefore smaller as the 4 word (or more) bitmap is omitted.
This means that for it to match a target string, that string must be
UTF-8 (since the bitmap is for at least the lowest 256 code points).
And only in rare circumstances are there any flags associated with it in
the regnode flags field.
This commit changes things so that the flags field in an ANYOFH node is
repurposed to be the first UTF-8 encoded byte of every code point
matched by the class if there is a common byte for all of them; or 0 if
some have different first bytes.
(That means that those rare cases where the flags field isn't otherwise
empty can no longer be ANYOFH nodes.)
The purpose of this is so that a future commit can take advantage of
this, and more quickly scan the target string for places that this node
can match. A problem with ANYOF nodes is that they are code point
oriented (U32 or U64), and the target string is UTF-8, so conversion has
to be done. By having the partial conversion compiled in, we can look
for that at runtime instead of having to look at every character in the
scan.
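How the value for the flags field could be derived is sketched below,
assuming standard UTF-8 on an ASCII platform and code points no higher than
U+10FFFF (perl itself accepts larger ones). Both function names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* First byte of the standard UTF-8 encoding of cp. */
static uint8_t sketch_utf8_start_byte(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return (uint8_t)cp;
    if (cp < 0x800)   return (uint8_t)(0xC0 | (cp >> 6));
    if (cp < 0x10000) return (uint8_t)(0xE0 | (cp >> 12));
    return (uint8_t)(0xF0 | (cp >> 18));
}

/* The start byte shared by every code point in cps, or 0 when they
 * do not all share one (or the list is empty). */
static uint8_t sketch_common_start_byte(const uint32_t *cps, size_t n)
{
    if (n == 0)
        return 0;
    uint8_t first = sketch_utf8_start_byte(cps[0]);
    for (size_t i = 1; i < n; i++) {
        if (sketch_utf8_start_byte(cps[i]) != first)
            return 0;
    }
    return first;
}
```

A scanner can then compare each candidate position's first byte against the
stored value and skip the UTF-8 decoding entirely when it differs.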
|
|
|
|
| |
See [perl #132367].
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a regnode for the case where nothing in the bit map has
matches. This allows the bitmap to be omitted, saving 32 bytes of
otherwise wasted space per node. Many non-Latin Unicode properties have
this characteristic. Further, since this node applies only to code
points above 255, which are representable only in UTF-8, we can
trivially fail a match where the target string isn't in UTF-8. Time
savings also accrue from skipping the bitmap look-up. When swashes are
removed, even more time will be saved.
|
|
|
|
|
|
|
| |
The ANYOFM/NANYOFM regnodes are generalizations of these. They have
more masks and shifts than the removed nodes, but not more branches, so
are effectively the same speed. Remove the ASCII/NASCII nodes in favor
of having less code to maintain.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 8a100c918ec81926c0536594df8ee1fcccb171da created node types for
handling an 's' at the leading edge, at the trailing edge, and at both
edges for nodes under /di that contain nothing else that would
prevent them from being EXACTFU nodes. If two of these get joined, it
could create an 'ss' sequence which can't be an EXACTFU node, for U+DF
would match them unconditionally. Instead, under /di it should match
if and only if the target string is UTF-8 encoded.
I realized later that having three types becomes harder to deal with
when adding yet more node types, so this commit turns the three into
just one node type, indicating that at least one edge of the node is an
's'.
It also simplifies the parsing of the pattern and determining which node
to use.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
EXACTFUP was created by the previous commit to handle a problematic case
in which not all the code points in an EXACTFU node are /i foldable at
compile time. Doing so will allow a future commit to use the pre-folded
EXACTFU nodes (done in a prior commit), saving execution time for the
common case. The only problematic code point is the MICRO SIGN. Most
patterns don't use this character.
EXACTFU_SS is problematic in a different way. It contains the sequence
'ss' which is folded to by LATIN SMALL LETTER SHARP S, but everything in
it can be pre-folded (unless it also contains a MICRO SIGN). The reason
this is problematic is that it is the only non-UTF-8 node where the
length in folding can change. To process it at runtime, the more
general fold equivalence function is used that is capable of handling
length disparities, but is slower than the functions otherwise used for
non-UTF-8.
What I've chosen to do for now is to make a single node type for all the
problematic cases (which at this time means just the two aforementioned
ones). If we didn't do this, we'd have to add a third node type for
patterns that contain both 'ss' and MICRO. Or artificially split the
pattern so the two never were in the same node, but we can't do that
because it can cause bugs in handling multi-character folds. If more
special handling is found to be needed, there'd be a combinatorial
explosion of additional node types to handle all possible combinations.
What this effectively means is that the slower, more general foldEQ
function is used for portions of patterns containing the MICRO sign when
the pattern isn't in UTF-8, even though there is no inherent reason to
do so for non-UTF-8 strings that don't also contain the 'ss' sequence.
|
|
|
|
|
|
|
|
|
|
| |
If a non-UTF-8 pattern contains a MICRO SIGN, this special node is now
created. This character is the only one not needing UTF-8 to represent,
but its fold does need UTF-8, which causes some issues, so it has to be
specially handled. When matching against a non-UTF-8 target string, the
pattern is effectively folded, but not if the target is UTF-8. By
creating this node, we can remove the special handling required for the
nodes that don't have a MICRO SIGN, in a future commit.
|