author | Karl Williamson <khw@cpan.org> | 2018-10-19 09:48:34 -0600 |
---|---|---|
committer | Karl Williamson <khw@cpan.org> | 2018-10-20 00:09:56 -0600 |
commit | 7c932d07cab18751bfc7515b4320436273a459e2 (patch) | |
tree | 795bd8462e38ea535f4246c0fac3c7ebe0db3671 /regcomp.h | |
parent | deeb899527fafbf46c2ae732d30fbaedd79d1b84 (diff) | |
download | perl-7c932d07cab18751bfc7515b4320436273a459e2.tar.gz |
Remove sizing pass from regular expression compiler
This commit removes the sizing pass for regular expression compilation.
It attempts to be the minimum required to do this. Future patches are
in the works that improve it, and there is certainly lots more that
could be done.
This is being done now for security reasons, as there have been several
bugs leading to CVEs where the sizing pass computed the size improperly,
and a crafted pattern could allow an attack. This means that simple
bugs can too easily become attack vectors.
This is NOT the AST that people would like, but it should make it easier
to move the code in that direction.
Instead of a sizing pass, as the pattern is parsed, new space is
malloc'd for each regnode found. To minimize the number of such mallocs
that actually go out and request memory, an initial guess is made, based
on the length of the pattern being compiled. The guessed amount is
malloc'd and then immediately released. Hopefully that memory won't be
gobbled up by another process before we actually gain control of it.
The guess is currently simply the number of bytes in the pattern.
Patches and/or suggestions are welcome on improving the guess or this
method.
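The allocation scheme described above can be illustrated with a small sketch. This is not the code from regcomp.c: `node_t`, `program_t`, and the three helper functions are hypothetical names, and the sketch simply keeps its initial allocation and doubles it on demand rather than allocating and releasing the guess as the commit does. It only shows the general idea of sizing the first request from the pattern length and growing as nodes are emitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a regnode; the real layout is in regcomp.h. */
typedef struct {
    unsigned char  type;
    unsigned char  flags;
    unsigned short next_off;
} node_t;

typedef struct {
    node_t *nodes;     /* program emitted so far */
    size_t  used;      /* number of nodes emitted */
    size_t  capacity;  /* number of nodes allocated */
} program_t;

/* Initial guess: one node per byte of pattern, mirroring the guess
 * described in the commit message.  Returns 1 on success, 0 on failure. */
static int program_init(program_t *p, size_t pattern_len)
{
    size_t guess = pattern_len ? pattern_len : 1;

    p->nodes = NULL;
    p->used = 0;
    p->capacity = 0;
    if (guess > SIZE_MAX / sizeof *p->nodes)
        return 0;                       /* byte count would overflow */
    p->nodes = malloc(guess * sizeof *p->nodes);
    if (p->nodes == NULL)
        return 0;
    p->capacity = guess;
    return 1;
}

/* Append one node, doubling the buffer when the guess proves too small.
 * Returns the index of the new node, or (size_t)-1 on failure. */
static size_t program_emit(program_t *p, unsigned char type)
{
    assert(p->nodes != NULL && p->capacity > 0);
    if (p->used == p->capacity) {
        node_t *grown;

        if (p->capacity > SIZE_MAX / (2 * sizeof *p->nodes))
            return (size_t)-1;          /* doubling would overflow */
        grown = realloc(p->nodes, 2 * p->capacity * sizeof *p->nodes);
        if (grown == NULL)
            return (size_t)-1;          /* the old buffer is still valid */
        p->nodes = grown;
        p->capacity *= 2;
    }
    p->nodes[p->used].type = type;
    p->nodes[p->used].flags = 0;
    p->nodes[p->used].next_off = 0;
    return p->used++;
}

static void program_free(program_t *p)
{
    free(p->nodes);
    p->nodes = NULL;
    p->used = p->capacity = 0;
}
```

With a two-byte pattern, the third call to `program_emit()` is the first one that has to go back to the allocator; a better guess would reduce how often that happens.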
This commit does not mean, however, that only one pass is done in all
cases. Currently there are several situations that require extra
passes. These are:
a) If the pattern isn't UTF-8, but encounters a construct that
requires it to become UTF-8, the parse is immediately stopped,
the translation is done, and the parse restarted. This is
unchanged from how it worked prior to this commit.
b) If the pattern uses /d rules and it encounters a construct that
requires it to become /u, the parse is immediately stopped and
restarted using /u rules. A future enhancement is to only
restart if something has been encountered that would generate
something different than what has already been generated, as
many operations are the same under both /d and /u. Prior to
this commit, the parse was immediately restarted only in the
rare cases where the switch changed the sizing. Otherwise, the
sizing pass was allowed to complete and then the generation
pass ran, using /u. Some CVEs were caused by faulty
implementation here.
c) Very large patterns may need to have long jumps in their
program. Prior to this commit, that was determined in the
sizing pass, and all jumps were made long during generation.
Now, the first time the need for a long jump is detected, the
parse is immediately restarted, and all jumps are made long. I
haven't investigated enough to be sure, but it might be
sufficient to just redo the current jump, making it long, and
then switch to using just long jumps, without having to restart
the parse from the beginning.
d) If a reference that could be to capturing parentheses doesn't
find such parentheses, a flag is set. For references that could
be octal constants, they are assumed to be those constants
instead of a capturing group. At the end of the parse, if the
flag indicates either that the assumption(s) were wrong or that
it is a fatal reference to a non-existent group, the pattern is
reparsed knowing the total number of these groups.
e) If (?R) or (?0) are encountered, the flag listed in item d)
above is set to force a reparse. I did have code in place that
avoided doing the reparse, but I wasn't confident enough that
our test suite exercises that area of the code enough to have
exposed all the potential interaction bugs, and I think this
construct is used rarely enough to not worry about avoiding the
reparse at this point in the development.
f) If (?|pattern) is encountered, the behavior is the same as in
item e) above. The pattern will end up being reparsed after the
total number of parenthesized groups is known. I decided not
to invest the effort at this time in trying to get this to work
without a reparse.
If we are continuing the parse just to count parentheses and we
encounter a construct that would normally restart the parse
immediately, it might be possible to defer that restart. This would cut
down the maximum number of parses required. As of this commit, the
worst case is we find something that requires knowing all the
parentheses; later we have to switch to /u rules and so the parse is
restarted. Still later we have to switch to long jumps, and the parse
is restarted again. Still later we have to upgrade to UTF-8, and the
parse is restarted yet again. Then the parse is completed, and the
final total of parentheses is known, so everything is redone a final
time. Deferring those intermediate restarts would save a bunch of
reparsing.
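The restart behavior and the worst case just described can be modeled with a toy driver. Everything here is an assumption made for illustration: the marker characters ('u', 'J', 'U', 'R') stand in for real constructs, and `parse_state`, `parse_once()`, and `compile()` are hypothetical names, not functions from regcomp.c. The sketch only counts how many parses are needed when each restart happens immediately and the parenthesis count is resolved at the end.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

typedef enum { PARSE_OK, PARSE_RESTART } parse_status;

/* State that survives across restarts (all names are hypothetical). */
typedef struct {
    bool utf8;              /* pattern has been upgraded to UTF-8 */
    bool unicode_rules;     /* /u rules are in effect instead of /d */
    bool long_jumps;        /* all jumps are emitted as long jumps */
    bool need_paren_count;  /* something needs the total group count */
    int  total_parens;      /* -1 until a full parse has counted them */
    int  passes;            /* number of parses started */
} parse_state;

/* One parse over a toy pattern.  Marker characters (assumed):
 *   '(' opens a group       'u' needs /u rules
 *   'J' needs long jumps    'U' needs UTF-8
 *   'R' needs the total number of groups */
static parse_status parse_once(const char *pat, parse_state *st)
{
    int parens = 0;
    const char *p;

    st->passes++;
    for (p = pat; *p; p++) {
        switch (*p) {
        case '(':
            parens++;
            break;
        case 'u':
            if (!st->unicode_rules) { st->unicode_rules = true; return PARSE_RESTART; }
            break;
        case 'J':
            if (!st->long_jumps) { st->long_jumps = true; return PARSE_RESTART; }
            break;
        case 'U':
            if (!st->utf8) { st->utf8 = true; return PARSE_RESTART; }
            break;
        case 'R':
            if (st->total_parens < 0)
                st->need_paren_count = true;   /* resolved after a full parse */
            break;
        default:
            break;
        }
    }
    if (st->need_paren_count && st->total_parens < 0) {
        st->total_parens = parens;             /* now known: parse once more */
        return PARSE_RESTART;
    }
    return PARSE_OK;
}

/* Parse until no restart is requested; returns the number of parses. */
static int compile(const char *pat, parse_state *st)
{
    memset(st, 0, sizeof *st);
    st->total_parens = -1;
    while (parse_once(pat, st) == PARSE_RESTART)
        continue;
    return st->passes;
}
```

In this model the toy pattern "RuJU(" takes five parses, one for each immediate restart plus the final one after the group count is known, whereas "(abc)" takes one. Deferring the three intermediate restarts while counting parentheses would bring the first case down to two.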
Prior to this commit, warnings were issued only during the code
generation pass, which didn't get called unless the sizing pass(es)
completed successfully. But now, we don't know if the pass will
succeed, fail, or whether it will have to be restarted. To avoid
outputting the same warning more than once, the position in the parse of
the last warning generated is kept (across parses). The code looks at
that position when it is about to generate a warning. If the parsing
has previously gotten that far, it assumes that the warning has already
been generated, and suppresses it this time. The current state of
parsing is such that I believe this assumption is valid. If the parses
had divergent paths, that assumption could become invalid.
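A minimal sketch of the suppression idea follows. The names `warn_state` and `maybe_warn()` are hypothetical, and the real bookkeeping in regcomp.c may differ; the sketch only shows remembering the furthest position that has produced a warning and staying silent when a later parse reaches that position again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Warning bookkeeping that is NOT reset when the parse restarts. */
typedef struct {
    bool   have_warned;   /* has any warning been emitted yet? */
    size_t last_offset;   /* pattern offset of the latest warning emitted */
    int    emitted;       /* count of warnings actually printed */
} warn_state;

/* Emit msg for the construct at `offset` unless an earlier parse already
 * got that far.  Returns 1 if the warning was printed, 0 if suppressed.
 * Limitation of this sketch: a second, different warning at the same
 * offset is also suppressed. */
static int maybe_warn(warn_state *ws, size_t offset, const char *msg)
{
    if (ws->have_warned && offset <= ws->last_offset)
        return 0;         /* assume it was reported by a previous parse */
    fprintf(stderr, "warning at offset %lu: %s\n",
            (unsigned long)offset, msg);
    ws->have_warned = true;
    ws->last_offset = offset;
    ws->emitted++;
    return 1;
}
```

As the commit message notes, this relies on every parse visiting the same positions in the same order; if two parses took divergent paths, a warning could be wrongly suppressed.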
Diffstat (limited to 'regcomp.h')
-rw-r--r-- | regcomp.h | 3 |
1 files changed, 1 insertions, 2 deletions
@@ -381,10 +381,9 @@ struct regnode_ssc {
 #define REG_MAGIC 0234
-#define SIZE_ONLY RExC_pass1
+#define SIZE_ONLY FALSE
 #define PASS1 SIZE_ONLY
 #define PASS2 (! SIZE_ONLY)
-
 /* An ANYOF node is basically a bitmap with the index being a code point. If
  * the bit for that code point is 1, the code point matches; if 0, it doesn't
  * match (complemented if inverted). There is an additional mechanism to deal
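The effect of this one-line change on code that still tests the old macros can be sketched as follows. The three macro definitions are taken from the diff; `emit_node()` and its counters are hypothetical and exist only to show that a branch guarded by `PASS1` can no longer be taken once `SIZE_ONLY` is the constant `FALSE`.

```c
#include <assert.h>

#ifndef FALSE
#  define FALSE 0
#endif

/* As defined after this commit (from the diff above). */
#define SIZE_ONLY FALSE
#define PASS1 SIZE_ONLY
#define PASS2 (! SIZE_ONLY)

/* Hypothetical caller written in the old two-pass style. */
static int emit_node(int *size_estimate, int *nodes_emitted)
{
    if (PASS1) {
        /* Sizing branch: unreachable now that PASS1 is constant false. */
        (*size_estimate)++;
        return 0;
    }
    /* Generation branch: always taken. */
    (*nodes_emitted)++;
    return 1;
}
```

Existing `if (PASS1)` and `if (PASS2)` guards therefore keep compiling unchanged, which fits the commit's stated aim of being the minimum required change.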