| Commit message | Author | Age | Files | Lines |
|
|
|
|
| |
This makes it consistent with the other inversion lists for this sort of
thing, and finishes the fix for GH #17154
|
| |
|
|
|
|
|
|
| |
It turns out that one isn't supposed to fill in the offset to the next
regnode at node creation time. And this node is like EXACTish, so the
string stuff isn't accounted for in its regcomp.sym definition
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This node is like ANYOFHb, but is used when more than one leading byte
is the same in all the matched code points.
ANYOFHb is used to avoid having to convert from UTF-8 to code point for
something that won't match. It checks that the first byte in the UTF-8
encoded target is the desired one, thus ruling out most of the possible
code points.
But for higher code points that require longer UTF-8 sequences, many,
many non-matching code points pass this filter. For code points above
0xFFFF, it is ineffective for almost 200K of them.
This commit creates a new node type that addresses this problem.
Instead of a single byte, it stores as many leading bytes that are the
same for all code points that match the class. For many classes, that
will cut down the number of possible false positives by a huge amount
before having to convert to code point to make the final determination.
This regnode adds a UTF-8 string at the end. It is still much smaller,
even in the rare worst case, than a plain ANYOF node because the maximum
string length, 15 bytes, is still shorter than the 32-byte bitmap that
is present in a plain ANYOF. Most of the time the added string will
instead be at most 4 bytes.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This matches a single range of code points. It is both faster and
smaller than other ANYOF-type nodes, requiring, after set-up, a single
subtraction and conditional branch.
The vast majority of Unicode properties match a single range (though
most of the properties likely to be used in real world applications have
more than a single range). But things like [ij] are a single range, and
those are quite commonly encountered. This new regnode matches them more
efficiently than a bitmap would, and doesn't require the space for one
either.
The flags field is used to store the minimum matchable start byte for
UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH
nodes which have a similar mechanism, allows for quick weeding out of
many possible matches without having to convert the UTF-8 to its
corresponding code point.
This regnode packs the 32 bit argument with 20 bits for the minimum code
point the node matches, and 12 bits for the maximum delta above that
minimum. If the input
is a value outside these, it simply won't compile to this regnode,
instead going to one of the ANYOFH flavors.
ANYOFR is sufficient to match all of Unicode except for the final
(private use) 65K plane.
|
| |
|
|
|
|
| |
The new name is shorter and, I believe, clearer.
|
|
|
|
|
| |
This is like LEXACT, but it is known that only strings encoded in UTF-8
will match it, so we don't even have to try if that condition isn't met.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a new regnode for strings that don't fit in a regular
one, and adds a structure for that regnode to use. Actually using them
is deferred to the next commit.
This new regnode structure is needed because the previous structure only
allows for an 8 bit length field, 255 max bytes. This commit puts the
length instead in a new field, the same place single-argument regnodes
put their argument. Hence this long string is an extra 32 bits of
overhead, but at no string length is this node ever bigger than the
combination of the smaller nodes it replaces.
I also considered simply combining the original 8 bit length field
(which is now unused) with the first byte of the string field to get a
16 bit length, and have the actual string be offset by 1. But I
rejected that because it would mean the string would usually not be
aligned, slowing down memory accesses.
This new LEXACT regnode can hold up to what 1024 regular EXACT ones hold,
using 4K fewer overhead bytes to do so. That means it can handle
strings containing 262000 bytes. The comments give ideas for expanding
that should it become necessary or desirable.
Besides the space advantage, any hardware acceleration in memcmp
can be done in much bigger chunks, and otherwise the memcmp inner loop
(often written in assembly) will run many more times in a row, and our
outer loop that calls it, correspondingly fewer.
|
| |
|
|
|
|
| |
This is no longer used
|
|
|
|
|
| |
This is in preparation for a later commit, in which the current
mechanism will become an illegal lhs.
|
|
|
|
| |
This is always a good idea
|
|
|
|
| |
This macro has the same definition as another.
|
|
|
|
| |
in qr//
|
|
|
|
|
| |
This referred to the commit message, but now that the number has been
determined, use that.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit adds a new regnode, ANYOFHr, like ANYOFH, but it also has a
loose upper bound for the first UTF-8 byte matchable by the node. (The
'r' stands for 'range'). It would be nice to have a tight upper bound,
but to do so requires 4 more bits than are available without changing
the node arguments types, and hence increasing the node size. Having a
loose bound is better than no bound, and comes essentially free, by
using two unused bits in the current ANYOFH node, and requiring only a
few extra pipeline-able instructions (mask, etc.) at run time, and no
extra conditionals. Any ANYOFH nodes that would benefit from having an upper
bound will instead be compiled into this node type.
Its use is based on the following observations.
There are 64 possible start bytes, so the full range can be expressed in
6 bits. This means that the flags field in ANYOFH nodes containing the
start byte has two extra bits that can be used for something else.
An ANYOFH node only happens when there is no matching code point in the
bit map, so the smallest code point that could be is 256. The start
byte for that is C4, so there are actually only 60 possible start bytes.
(perl can be compiled with a larger bit map in which case the minimum
start byte would be even higher.)
A second observation is that knowing the highest start byte is above F0
is no better than knowing it's F0. This is because the highest code
point whose start byte is F0 is U+3FFFF, and all code points above that
that are currently allocated are all very specialized and rarely
encountered. And there's no likelihood of that changing anytime soon as
there's plenty of unallocated space below that. So if an ANYOFH node's
highest start byte is F0 or above, there's no advantage to knowing what
the actual max possible start byte is, so leave it as ANYOFH.
That means the highest start byte we care about in ANYOFHr is EF. That
cuts the number of start bytes we care about down to 43, still 6 bits
required to represent them, but it allows for the following scheme:
Populate the flags field by subtracting C0 from the lowest start byte
and shifting left 2 bits. That leaves the bottom two bits unused.
We use them as follows, where x is the start byte of the lowest code
point in the node:
bits
----
11 The upper limit of the range can be as much as x + (EF - x) / 8
10 The upper limit of the range can be as much as x + (EF - x) / 4
01 The upper limit of the range can be as much as x + (EF - x) / 2
00 The upper limit of the range can be as much as EF
That partitions the loose upper bound into 4 possible ranges, with it
being tighter the closer it is to the strict lower bound. This makes
the loose upper bound more meaningful when there is most to gain by
having one.
Some examples of what the various upper bounds would be for all the
possibilities of these two bits are:
Upper bound given the 2 bits
Low bound 11 10 01 00
--------- -- -- -- --
C4 C9 CE D9 EF
D0 D3 D7 DF EF
E0 E1 E3 E7 EF
Start bytes of E0 and above represent many more code points each than
lower ones, as they are 3 byte sequences instead of two. This scheme
provides tighter bounds for them, which is also a point in its favor.
Thus we have provided a loose upper bound using two otherwise unused
bits. An alternate scheme could have had the intervals all the same,
but this provides a tighter bound when it makes the most sense to.
For EBCDIC the range is from C8 to F4.
Tests will be added in a later commit
|
|
|
|
|
|
|
|
|
|
|
| |
This patch, started by Yves Orton, and refined in consultation with Tony
Cook, imposes a maximum depth of unclosed left parentheses, at which
point it croaks. This is to prevent the segfault in the ticket.
The patch adds a variable that can be set to increase or decrease this
limit at run time (actually regex compilation time) should this be
desired, and hence our pre-determined limit of 1000 can be changed if
necessary.
|
|
|
|
| |
regexes no longer use these
|
| |
|
|
|
|
|
|
|
|
| |
Per LGTM analysis: https://lgtm.com/projects/g/Perl/perl5/alerts/?mode=tree&ruleFocus=2163210746
and LGTM recommendation: https://lgtm.com/rules/2163210746/
For: RT 133699
|
| |
|
|
|
|
|
| |
This commit makes the v (verbose) modifier to -Dr do something: turn on
all possible regex debugging.
|
| |
|
|
|
|
|
|
|
|
| |
The previous commit removed the sizing pass, but to minimize the
difference listing, it left in all the references it could to the
various passes, with the first pass set to FALSE. This commit now
removes those references, as well as to some variables that are no
longer used.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit removes the sizing pass for regular expression compilation.
It attempts to be the minimum required to do this. Future patches are
in the works that improve it, and there is certainly lots more that
could be done.
This is being done now for security reasons, as there have been several
bugs leading to CVEs where the sizing pass computed the size improperly,
and a crafted pattern could allow an attack. This means that simple
bugs can too easily become attack vectors.
This is NOT the AST that people would like, but it should make it easier
to move the code in that direction.
Instead of a sizing pass, as the pattern is parsed, new space is
malloc'd for each regnode found. To minimize the number of such mallocs
that actually go out and request memory, an initial guess is made, based
on the length of the pattern being compiled. The guessed amount is
malloc'd and then immediately released. Hopefully that memory won't be
gobbled up by another process before we actually gain control of it.
The guess is currently simply the number of bytes in the pattern.
Patches and/or suggestions are welcome on improving the guess or this
method.
This commit does not mean, however, that only one pass is done in all
cases. Currently there are several situations that require extra
passes. These are:
a) If the pattern isn't UTF-8, but encounters a construct that
requires it to become UTF-8, the parse is immediately stopped,
the translation is done, and the parse restarted. This is
unchanged from how it worked prior to this commit.
b) If the pattern uses /d rules and it encounters a construct that
requires it to become /u, the parse is immediately stopped and
restarted using /u rules. A future enhancement is to only
restart if something has been encountered that would generate
something different than what has already been generated, as
many operations are the same under both /d and /u. Prior to
this commit, only in rare circumstances was the parse immediately
restarted: just those few constructs that changed the sizing did so.
Instead the sizing pass was allowed to complete and then the
generation pass ran, using /u. Some CVEs were caused by faulty
implementation here.
c) Very large patterns may need to have long jumps in their
program. Prior to this commit, that was determined in the
sizing pass, and all jumps were made long during generation.
Now, the first time the need for a long jump is detected, the
parse is immediately restarted, and all jumps are made long. I
haven't investigated enough to be sure, but it might be
sufficient to just redo the current jump, making it long, and
then switch to using just long jumps, without having to restart
the parse from the beginning.
d) If a reference that could be to capturing parentheses doesn't
find such parentheses, a flag is set. For references that could
be octal constants, they are assumed to be those constants
instead of a capturing group. At the end of the parse, if the
flag indicates either that the assumption(s) were wrong or that
it is a fatal reference to a non-existent group, the pattern is
reparsed knowing the total number of these groups.
e) If (?R) or (?0) are encountered, the flag listed in item d)
above is set to force a reparse. I did have code in place that
avoided doing the reparse, but I wasn't confident enough that
our test suite exercises that area of the code enough to have
exposed all the potential interaction bugs, and I think this
construct is used rarely enough to not worry about avoiding the
reparse at this point in the development.
f) If (?|pattern) is encountered, the behavior is the same as in
item e) above. The pattern will end up being reparsed after the
total number of parenthesized groups are known. I decided not
to invest the effort at this time in trying to get this to work
without a reparse.
It might be that if we are continuing the parse to just count
parentheses, and we encounter a construct that normally would restart
the parse immediately, that we could defer that restart. This would cut
down the maximum number of parses required. As of this commit, the
worst case is we find something that requires knowing all the
parentheses; later we have to switch to /u rules and so the parse is
restarted. Still later we have to switch to long jumps, and the parse
is restarted again. Still later we have to upgrade to UTF-8, and the
parse is restarted yet again. Then the parse is completed, and the
final total of parentheses is known, so everything is redone a final
time. Deferring those intermediate restarts would save a bunch of
reparsing.
Prior to this commit, warnings were issued only during the code
generation pass, which didn't get called unless the sizing pass(es)
completed successfully. But now, we don't know if the pass will
succeed, fail, or whether it will have to be restarted. To avoid
outputting the same warning more than once, the position in the parse of
the last warning generated is kept (across parses). The code looks at
that position when it is about to generate a warning. If the parsing
has previously gotten that far, it assumes that the warning has already
been generated, and suppresses it this time. The current state of
parsing is such that I believe this assumption is valid. If the parses
had divergent paths, that assumption could become invalid.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the pattern parsing to use offsets from the first node in
the pattern program, rather than direct addresses of such nodes. This
is in preparation for a later change in which more mallocs will be done
which will change those addresses, whereas the offsets will remain
constant. Once the final program space is allocated, real addresses are
used as currently. This limits the necessary changes to a few
functions. Also, real addresses are used if they are constant across a
function; again this limits the changes.
Doing this introduces a new typedef for clarity 'regnode_offset' which
is not a pointer, but a count. This necessitates changing a bunch of
things to use 0 instead of NULL to indicate an error.
A new boolean is also required to indicate if we are in the first or
second passes of the pattern. And separate heap space is allocated for
scratch during the first pass.
|
|
|
|
|
|
| |
This changes regcomp.sym to generate the correct lengths for ANYOF
nodes, which means they don't have to be special cased in regcomp.c,
leading to simplification
|
|
|
|
|
|
|
| |
This struct has two names. I previously left the less descriptive one
as the primary because of back compat issues. But I now realize that
regcomp.h is only used in the core, so it's ok to swap for the better
name to be primary.
|
|
|
|
|
|
|
|
|
|
| |
These are used to allow the bit map of run-time /[[:posix:]]/l classes to
be stored in a variable, and not just in the argument of an ANYOF node.
This will enable the next commit to use such a variable. The current
macros are rewritten to just call the new ones with the proper arguments.
A macro of a different sort is also created to allow one to set the
entire bit field in the node at once.
|
|
|
|
|
| |
I had kept these macros around for backwards compatibility. But now I
realize regcomp.h is only for core use, so no need to retain them.
|
| |
|
|
|
|
|
|
| |
This is a more fundamental operation than the pre-existing
FILL_ADVANCE_NODE, which is changed to use this for the portion that
fills the node but doesn't advance the pointer.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit doubles the upper limit that unbounded regular expression
quantifiers can match up to. Things like {m,}, "+", and "*" can now match
up to U16_MAX times.
We probably should make this a 32 bit value, but doing this doubling was
easy and has fewer potential implications.
See http://nntp.perl.org/group/perl.perl5.porters/251413 and
followups
|
|
|
|
|
| |
The ANYOF regnode type (generated by bracketed character classes and
\p{}) was allocating too much space because the argument field was being counted twice
|
| |
|
|
|
|
|
| |
This is in preparation for future commits that will use it in multiple
places
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 2bfbbbaf9ef1783ba914ff9e9270e877fbbb6aba changed things so -Dr
output could be changed through an environment variable to truncate
the output differently than the default.
For most purposes, the default is good enough, but for someone trying to
debug the regcomp internals, sometimes one wants to see more than is
output by default.
That commit did not catch all the places. This one changes the handling
so that any place that used the previous default maximum now uses the
environment variable (if set) instead.
|
|
|
|
| |
However, we do preserve it outside PERL_CORE for the use of XS authors.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[perl #129140] attempting double-free
This fixes some leaks and double frees in regexes which contain code
blocks.
During compilation, an array of struct reg_code_block's is malloced.
Initially this is just attached to the RExC_state_t struct local var in
Perl_re_op_compile(). Later it may be attached to a pattern. The difficulty
is ensuring that the array is free()d (and the ref counts contained within
decremented) should compilation croak early, while avoiding double frees
once the array has been attached to a regex.
The current mechanism of making the array the PVX of an SV is a bit flaky,
as the array can be realloced(), and code can be re-entered when utf8 is
detected mid-compilation.
This commit changes the array into separately malloced head and body.
The body contains the actual array, and can be realloced. The head
contains a pointer to the array, plus size and an 'attached' boolean.
This indicates whether the struct has been attached to a regex, and is
effectively a 1-bit ref count.
Whenever a head is allocated, SAVEDESTRUCTOR_X() is used to call
S_free_codeblocks() to free the head and body on scope exit. This function
skips the freeing if 'attached' is true, and this flag is set only at the
point where the head gets attached to the regex.
In one way this complicates the code, since the num_code_blocks field is now
not always available (it's only there if a head has been allocated), but
mainly it simplifies, since all the book-keeping is now done in the two
new static functions S_alloc_code_blocks() and S_free_codeblocks().
|
|
|
|
|
|
|
|
| |
This indents code and reflows the comments to account for the enclosing
block added by the previous commit.
At the same time, it adds some other miscellaneous white space changes,
and adds, revises other comments.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Clang has taken it upon itself to warn when an equality is wrapped in
double parentheses, e.g.
((foo == bar))
Which is a bit dumb, as any code along the lines of
#define isBAR (foo == BAR)
if (isBAR) {}
will trigger the warning.
This commit shuts clang up by putting in a harmless cast:
#define isBAR cBOOL(foo == BAR)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The way we tracked if pattern recursion was infinite did not work
properly. A pattern like
"a"=~/(.(?2))((?<=(?=(?1)).))/
would loop forever, slowly eating up all available RAM as it added
pattern recursion stack frames.
This patch changes the rules for recursion so that recursively
entering a given pattern "subroutine" twice from the same position
fails the match. This means that where previously we might have
seen a fatal exception, we will now simply fail. This means that
"aaabbb"=~/a(?R)?b/
succeeds with $& equal to "aaabbb".
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GOSTART is a special case of GOSUB, we can remove a lot of offset twiddling,
and other special casing by unifying them, at pretty much no cost.
GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as
a U32 and an I32 respectively. ARG() holds the "parno" we will recurse
into. ARG2L() holds a signed offset to the relevant start node for the
recursion.
Prior to this patch the argument to GOSUB would always be >= 1, and unlike
other parts of our logic we would not use 0 to represent the "start/end" of
the pattern, as GOSTART was used for "recurse to beginning of pattern".
After this patch we use 0 to represent "start/end", and a lot of
complexity "goes away" along with the GOSTART regops.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The regex engine when displaying debugging info, say under -Dr, will elide
data in order to keep the output from getting too long. For example,
the number of code points in all of Unicode matched by \w is quite
large, and so when displaying a pattern that matches this, only the
first so many of them are printed, and the rest are truncated,
represented by "...".
Sometimes, one wants to see more than what the
compiled-into-the-engine max shows. This commit creates code to read
this environment variable to override the default max lengths. This
changes the lengths for everything to the input number, even if they
have different compiled maximums in the absence of this variable.
I'm not currently documenting this variable, as I don't think it works
properly under threads, and we may want to alter the behavior in various
ways as a result of gaining experience with using it.
|
|
|
|
| |
So, it's better not to have a mask that includes the unused ones.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently most pp_leavefoo subs have something along the lines of
POPBLOCK(cx);
POPFOO(cx);
where POPBLOCK does cxstack_ix-- and sets cx to point to the top CX stack
entry. It then restores a bunch of PL_ vars saved in the CX struct.
Then POPFOO does any type-specific restoration, e.g. POPSUB decrements the
ref count of the cv that was just executed.
However, this is logically the wrong order. When we *enter* a scope, we do
PUSHBLOCK;
PUSHFOO;
so undoing the PUSHBLOCK should be the last thing we do. As it happens,
it doesn't really make any difference to the running, which is why we've
never fixed it before.
Reordering it has two advantages.
First, it allows the steps for scope exit to be the exact logical reverse
of scope entry, which makes understanding what's going on and debugging
easier.
Second, it allows us to make the code cleaner.
This commit also removes the cxstack_ix-- and setting cx steps from
POPBLOCK; now cx is expected to already be set (which it usually is)
and we do the cxstack_ix-- ourselves. This also means we can remove a
whole bunch of cxstack_ix++'s that were added immediately after the
POPBLOCK in order to prevent the context being inadvertently overwritten
before we've finished using it.
So in full,
POPBLOCK(cx);
POPFOO(cx);
is now implemented as:
cx = &cxstack[cxstack_ix];
... other stuff done with cx ...
POPFOO(cx);
POPBLOCK(cx);
cxstack_ix--;
Finally, this commit also tweaks PL_curcop in pp_leaveeval, since
otherwise PL_curcop could temporarily be NULL when debugging code is
called in the presence of 'use re Debug'. It also stops the debugging code
crashing if PL_curcop is still NULL.
|