| Commit message (Collapse) | Author | Age | Files | Lines |
| |
|
|
|
|
|
| |
This commit makes the v (verbose) modifier to -Dr do something: turn on
all possible regex debugging.
|
| |
|
|
|
|
|
|
|
|
| |
The previous commit removed the sizing pass, but to minimize the
difference listing, it left in all the references it could to the
various passes, with the first pass set to FALSE. This commit now
removes those references, as well as to some variables that are no
longer used.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit removes the sizing pass for regular expression compilation.
It attempts to be the minimum required to do this. Future patches are
in the works that improve it,, and there is certainly lots more that
could be done.
This is being done now for security reasons, as there have been several
bugs leading to CVEs where the sizing pass computed the size improperly,
and a crafted pattern could allow an attack. This means that simple
bugs can too easily become attack vectors.
This is NOT the AST that people would like, but it should make it easier
to move the code in that direction.
Instead of a sizing pass, as the pattern is parsed, new space is
malloc'd for each regnode found. To minimize the number of such mallocs
that actually go out and request memory, an initial guess is made, based
on the length of the pattern being compiled. The guessed amount is
malloc'd and then immediately released. Hopefully that memory won't be
gobbled up by another process before we actually gain control of it.
The guess is currently simply the number of bytes in the pattern.
Patches and/or suggestions are welcome on improving the guess or this
method.
This commit does not mean, however, that only one pass is done in all
cases. Currently there are several situations that require extra
passes. These are:
a) If the pattern isn't UTF-8, but encounters a construct that
requires it to become UTF-8, the parse is immediately stopped,
the translation is done, and the parse restarted. This is
unchanged from how it worked prior to this commit.
b) If the pattern uses /d rules and it encounters a construct that
requires it to become /u, the parse is immediately stopped and
restarted using /u rules. A future enhancement is to only
restart if something has been encountered that would generate
something different than what has already been generated, as
many operations are the same under both /d and /u. Prior to
this commit, in rare circumstances was the parse immediately
restarted. Only those few that changed the sizing did so.
Instead the sizing pass was allowed to complete and then the
generation pass ran, using /u. Some CVEs were caused by faulty
implementation here.
c) Very large patterns may need to have long jumps in their
program. Prior to this commit, that was determined in the
sizing pass, and all jumps were made long during generation.
Now, the first time the need for a long jump is detected, the
parse is immediately restarted, and all jumps are made long. I
haven't investigated enough to be sure, but it might be
sufficient to just redo the current jump, making it long, and
then switch to using just long jumps, without having to restart
the parse from the beginning.
d) If a reference that could be to capturing parentheses doesn't
find such parentheses, a flag is set. For references that could
be octal constants, they are assumed to be those constants
instead of a capturing group. At the end of the parse, if the
flag indicates either that the assumption(s) were wrong or that
it is a fatal reference to a non-existent group, the pattern is
reparsed knowing the total number of these groups.
e) If (?R) or (?0) are encountered, the flag listed in item d)
above is set to force a reparse. I did have code in place that
avoided doing the reparse, but I wasn't confident enough that
our test suite exercises that area of the code enough to have
exposed all the potential interaction bugs, and I think this
construct is used rarely enough to not worry about avoiding the
reparse at this point in the development.
f) If (?|pattern) is encountered, the behavior is the same as in
item e) above. The pattern will end up being reparsed after the
total number of parenthesized groups are known. I decided not
to invest the effort at this time in trying to get this to work
without a reparse.
It might be that if we are continuing the parse to just count
parentheses, and we encounter a construct that normally would restart
the parse immediately, that we could defer that restart. This would cut
down the maximum number of parses required. As of this commit, the
worst case is we find something that requires knowing all the
parentheses; later we have to switch to /u rules and so the parse is
restarted. Still later we have to switch to long jumps, and the parse
is restarted again. Still later we have to upgrade to UTF-8, and the
parse is restarted yet again. Then the parse is completed, and the
final total of parentheses is known, so everything is redone a final
time. Deferring those intermediate restarts would save a bunch of
reparsing.
Prior to this commit, warnings were issued only during the code
generation pass, which didn't get called unless the sizing pass(es)
completed successfully. But now, we don't know if the pass will
succeed, fail, or whether it will have to be restarted. To avoid
outputting the same warning more than once, the position in the parse of
the last warning generated is kept (across parses). The code looks at
that position when it is about to generate a warning. If the parsing
has previously gotten that far, it assumes that the warning has already
been generated, and suppresses it this time. The current state of
parsing is such that I believe this assumption is valid. If the parses
had divergent paths, that assumption could become invalid.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This changes the pattern parsing to use offsets from the first node in
the pattern program, rather than direct addresses of such nodes. This
is in preparation for a later change in which more mallocs will be done
which will change those addresses, whereas the offsets will remain
constant. Once the final program space is allocated, real addresses are
used as currently. This limits the necessary changes to a few
functions. Also, real addresses are used if they are constant across a
function; again this limits the changes.
Doing this introduces a new typedef for clarity 'regnode_offset' which
is not a pointer, but a count. This necessitates changing a bunch of
things to use 0 instead of NULL to indicate an error.
A new boolean is also required to indicate if we are in the first or
second passes of the pattern. And separate heap space is allocated for
scratch during the first pass.
|
|
|
|
|
|
| |
This changes regcomp.sym to generate the correct lengths for ANYOF
nodes, which means they don't have to be special cased in regcomp.c,
leading to simplification
|
|
|
|
|
|
|
| |
This struct has two names. I previously left the less descriptive one
as the primary because of back compat issues. But I now realize that
regcomp.h is only used in the core, so it's ok to swap for the better
name to be primary.
|
|
|
|
|
|
|
|
|
|
| |
These are use to allow the bit map of run-time /[[:posix:]]/l classes to
be stored in a variable, and not just in the argument of an ANYOF node.
This will enable the next commit to use such a variable. The current
macros are rewritten to just call the new ones with the proper arguments.
A macro of a different sort is also created to allow one to set the
entire bit field in the node at once.
|
|
|
|
|
| |
I had kept these macros around for backwards compatibility. But now I
realize regcomp.h is only for core use, so no need to retain them.
|
| |
|
|
|
|
|
|
| |
This is a more fundamental operation than the pre-existing
FILL_ADVANCE_NODE, which is changed to use this for the portion that
fills the node but doesn't advance the pointer.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit doubles the upper limit that unbounded regular expression
quantifiers can match up to. Things like {m,} "+" and "*" now can match
up to U16_MAX times.
We probably should make this a 32 bit value, but doing this doubling was
easy and has fewer potential implications.
See http://nntp.perl.org/group/perl.perl5.porters/251413 and
followups
|
|
|
|
|
| |
The ANYOF regnode type (generated by bracketed character classes and
\p{}) was allocating too much space because the argument field was being counted twice
|
| |
|
|
|
|
|
| |
This is in preparation for future commits that will use it in multiple
places
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit 2bfbbbaf9ef1783ba914ff9e9270e877fbbb6aba changed things so -Dr
output could be changed through an environment variable to truncate
the output differently than the default.
For most purposes, the default is good enough, but for someone trying to
debug the regcomp internals, sometimes one wants to see more than is
output by default.
That commit did not catch all the places. This one changes the handling
so that any place that use the previous default maximum now uses the
environment variable (if set) instead.
|
|
|
|
| |
However, we do preserve it outside PERL_CORE for the use of XS authors.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[perl #129140] attempting double-free
Thus fixes some leaks and double frees in regexes which contain code
blocks.
During compilation, an array of struct reg_code_block's is malloced.
Initially this is just attached to the RExC_state_t struct local var in
Perl_re_op_compile(). Later it may be attached to a pattern. The difficulty
is ensuring that the array is free()d (and the ref counts contained within
decremented) should compilation croak early, while avoiding double frees
once the array has been attached to a regex.
The current mechanism of making the array the PVX of an SV is a bit flaky,
as the array can be realloced(), and code can be re-entered when utf8 is
detected mid-compilation.
This commit changes the array into separately malloced head and body.
The body contains the actual array, and can be realloced. The head
contains a pointer to the array, plus size and an 'attached' boolean.
This indicates whether the struct has been attached to a regex, and is
effectively a 1-bit ref count.
Whenever a head is allocated, SAVEDESTRUCTOR_X() is used to call
S_free_codeblocks() to free the head and body on scope exit. This function
skips the freeing if 'attached' is true, and this flag is set only at the
point where the head gets attached to the regex.
In one way this complicates the code, since the num_code_blocks field is now
not always available (it's only there is a head has been allocated), but
mainly its simplifies, since all the book-keeping is now done in the two
new static functions S_alloc_code_blocks() and S_free_codeblocks()
|
|
|
|
|
|
|
|
| |
This indents code and reflows the comments to account for the enclosing
block added by the previous commit.
At the same time, it adds some other miscellaneous white space changes,
and adds, revises other comments.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Clang has taken it upon itself to warn when an equality is wrapped in
double parentheses, e.g.
((foo == bar))
Which is a bit dumb, as any code along the lines of
#define isBAR (foo == BAR)
if (isBAR) {}
will trigger the warning.
This commit shuts clang up by putting in a harmless cast:
#define isBAR cBOOL(foo == BAR)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The way we tracked if pattern recursion was infinite did not work
properly. A pattern like
"a"=~/(.(?2))((?<=(?=(?1)).))/
would loop forever, slowly eat up all available ram as it added
pattern recursion stack frames.
This patch changes the rules for recursion so that recursively
entering a given pattern "subroutine" twice from the same position
fails the match. This means that where previously we might have
seen fatal exception we will now simply fail. This means that
"aaabbb"=~/a(?R)?b/
succeeds with $& equal to "aaabbb".
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
GOSTART is a special case of GOSUB, we can remove a lot of offset twiddling,
and other special casing by unifying them, at pretty much no cost.
GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as
a U32 and an I32 respectively. ARG() holds the "parno" we will recurse
into. ARG2L() holds a signed offset to the relevant start node for the
recursion.
Prior to this patch the argument to GOSUB would always be >=, and unlike
other parts of our logic we would not use 0 to represent "start/end" of
pattern, as GOSTART would be used for "recurse to beginning of pattern",
after this patch we use 0 to represent "start/end", and a lot of
complexity "goes away" along with GOSTART regops.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The regex engine when displaying debugging info, say under -Dr, will elide
data in order to keep the output from getting too long. For example,
the number of code points in all of Unicode matched by \w is quite
large, and so when displaying a pattern that matches this, only the
first some number of them are printed, and the rest are truncated,
represented by "...".
Sometimes, one wants to see more than what the
compiled-into-the-engine-max shows. This commit creates code to read
this environment variable to override the default max lengths. This
changes the lengths for everything to the input number, even if they
have different compiled maximums in the absence of this variable.
I'm not currently documenting this variable, as I don't think it works
properly under threads, and we may want to alter the behavior in various
ways as a result of gaining experience with using it.
|
|
|
|
| |
So, it's better to not have a mask to include the unused ones.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently most pp_leavefoo subs have something along the lines of
POPBLOCK(cx);
POPFOO(cx);
where POPBLOCK does cxstack_ix-- and sets cx to point to the top CX stack
entry. It then restores a bunch of PL_ vars saved in the CX struct.
Then POPFOO does any type-specific restoration, e.g. POPSUB decrements the
ref count of the cv that was just executed.
However, this is logically the wrong order. When we *enter* a scope, we do
PUSHBLOCK;
PUSHFOO;
so undoing the PUSHBLOCK should be the last thing we do. As it happens,
it doesn't really make any difference to the running, which is why we've
never fixed it before.
Reordering it has two advantages.
First, it allows the steps for scope exit to be the exact logical reverse
of scope exit, which makes understanding what's going on and debugging
easier.
It allows us to make the code cleaner.
This commit also removes the cxstack_ix-- and setting cx steps from
POPBLOCK; now we already expect cx to be set (which it usually already is)
and we do the cxstack_ix-- ourselves. This also means we can remove a
whole bunch of cxstack_ix++'s that were added immediately after the
POPBLOCK in order to prevent the context being inadvertently overwritten
before we've finished using it.
So in full,
POPBLOCK(cx);
POPFOO(cx);
is now implemented as:
cx = &cxstack[cxstack_ix];
... other stuff done with cx ...
POPFOO(cx);
POPBLOCK(cx);
cxstack_ix--;
Finally, this commit also tweaks PL_curcop in pp_leaveeval, since
otherwise PL_curcop could temporarily be NULL when debugging code is
called in the presence of 'use re Debug'. It also stops the debugging code
crashing if PL_curcop is still NULL.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds the final Unicode boundary type previously missing from core
Perl: the LineBreak one. This feature is already available in the
Unicode::LineBreak module, but I've been told that there are portability
and some other issues with that module. What's added here is a
light-weight version that is lacking the customizable features of the
module.
This implements the default Line Breaking algorithm, but with the
customizations that Unicode is expecting everybody to add, as their
test file tests for them. In other words, this passes Unicode's fairly
extensive furnished tests, but wouldn't if it didn't include certain
customizations specified by Unicode beyond the basic algorithm.
The implementation uses a look-up table of the characters surrounding a
boundary to see if it is a suitable place to break a line. In a few
cases, context needs to be taken into account, so there is code in
addition to the lookup table to handle those.
This should meet the needs for line breaking of many applications,
without having to load the module.
The algorithm is somewhat independent of the Unicode version, just like
the other boundary types. Only if new rules are added, or existing ones
modified is there need to go in and change this code. Otherwise,
running regen/mk_invlists.pl should be sufficient when a new Unicode
release is done to keep it up-to-date, again like the other Unicode
boundary types.
|
|
|
|
| |
Better to not have this clutter.
|
|
|
|
|
|
|
|
|
|
|
| |
The mask removed here was to make sure that right shifting didn't
propagate the sign bit, but is unnecessary as the value shifted is
unsigned. And confining things to a U8 with that mask assumes that the
bit vector being operated on has 256 elements max. This isn't
necessarily true these days, as one can change ANYOF_BITMAP_SIZE.
In fact changing that number was failing until this commit.
It also adds white space to make it easier to read.
|
|
|
|
|
| |
Instead of having this code repeated in several places, call
the more base macro from the others.
|
|
|
|
|
|
|
|
|
|
|
| |
I've long been confronted with trying to do things to create a spare bit
to use. I thought it easier now, while it's fresh in my mind, to free
up one for future use, rather than re-learn things when it next becomes
necessary. It would have been a different story if the freed bit had
required a performance penalty.
This commit also updates the comments about how to create even more
spare bits should it become necessary.
|
|
|
|
| |
Some of the names are expanded slightly and not shortened
|
| |
|
| |
|
|
|
|
|
|
| |
This commit sets a flag at pattern compilation time to indicate if
a rare case is present that requires special handling, so that that
handling can be avoided unless necessary.
|
|
|
|
|
|
| |
This changes the spare bit to be adjacent to the LOC_FOLD bit, in
preparation for the next commit, which will use that bit for a
LOC_FOLD-related use.
|
|
|
|
|
|
|
|
| |
This is done by combining 2 mutually exclusive bits into one. I hadn't
seen this possibility before because the name of one of them misled me.
It also misled me into turning on one that flag unnecessarily, and to
miss opportunities to not have to create a swash at runtime. This
commit corrects those things as well.
|
| |
|
|
|
|
|
| |
This places the flag bits of like-type flags adjacent for convenience in
reading the code. It also improves the commentary about their purposes.
|
|
|
|
|
|
| |
Since commit a0bd1a30d379f2625c307657d63fc50173d7a56d, a synthetic start
class node can be just an ANYOF-type node. I don't think this causes a
bug, just misses a potential optimisation.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously use of this under /l regex rules was a compile time error.
Now it works like \b{wb} and \b{sb}, which compile under locale rules
and always work like Unicode says they should. A UTF-8 locale implies
Unicode rules, and the goal is for it to work seamlessly with the rest
of perl. This construct was the only one I am aware of that didn't work
seamlessly (not counting OS interfaces) under UTF-8 LC_CTYPE locales.
For all three of these constructs, use with a non-UTF-8 runtime locale
raises a warning, and Unicode rules are used anyway.
UTF-8 locale collation still has problems, but this is low priority to
fix, as it's a lot of work, and if one really cares, one should be using
Unicode::Collate.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ANYOF_FLAGS bits are all used up, but a future commit wants one.
This commit frees up a bit by sharing two of the existing
comparatively-rarely-used ones. One bit is used only under /d matching
rules, while the other is used only when not under /d. Only the latter
bit is used in synthetic start classes. The previous commit introduced
an ANYOFD node type corresponding to /d. An SSC never is this type.
Thus, the bits have mutually exclusive meanings, and we can use the node
type to distinguish between the two meanings of the combined bit.
An alternative implementation would have been to use the
ANYOF_HAS_NONBITMAP_NON_UTF8_MATCHES non-/d bit instead of the one
chosen. But this is used more frequently, so the disambiguation would
have been exercised more frequently, slowing execution down ever so
slightly; more importantly, this one required fewer code changes, by a
slight amount.
|
|
|
|
|
|
| |
This horrible thing broke encapsulation and was as buggy as a very buggy
thing. It's been officially deprecated since 5.20.0 and now it can finally
die die die!!!!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
An empty cpan/.dir-locals.el stops Emacs using the core defaults for
code imported from CPAN.
Committer's work:
To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed
to be incremented in many files, including throughout dist/PathTools.
perldelta entry for module updates.
Add two Emacs control files to MANIFEST; re-sort MANIFEST.
For: RT #124119.
|
|
|
|
|
|
|
| |
This was experimentally introduced in 5.18, and no issues were raised,
except that it got us to thinking and spurred us to stop allowing $^X,
where 'X' is a non-printable control character, and that change caused
some issues.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
A function implements seeing if the space between any two characters is
a grapheme cluster break. Afer I wrote this, I realized that an array
lookup might be a better implementation, but the deadline for v5.22 was
too close to change it. I did see that my gcc optimized it down to
an array lookup.
This makes the implementation of \X go from being complicated to
trivial.
|