Remove sizing pass from regular expression compiler

This commit removes the sizing pass for regular expression compilation. It attempts to be the minimum required to do this. Future patches are in the works that improve it,, and there is certainly lots more that could be done. This is being done now for security reasons, as there have been several bugs leading to CVEs where the sizing pass computed the size improperly, and a crafted pattern could allow an attack. This means that simple bugs can too easily become attack vectors. This is NOT the AST that people would like, but it should make it easier to move the code in that direction. Instead of a sizing pass, as the pattern is parsed, new space is malloc'd for each regnode found. To minimize the number of such mallocs that actually go out and request memory, an initial guess is made, based on the length of the pattern being compiled. The guessed amount is malloc'd and then immediately released. Hopefully that memory won't be gobbled up by another process before we actually gain control of it. The guess is currently simply the number of bytes in the pattern. Patches and/or suggestions are welcome on improving the guess or this method. This commit does not mean, however, that only one pass is done in all cases. Currently there are several situations that require extra passes. These are: a) If the pattern isn't UTF-8, but encounters a construct that requires it to become UTF-8, the parse is immediately stopped, the translation is done, and the parse restarted. This is unchanged from how it worked prior to this commit. b) If the pattern uses /d rules and it encounters a construct that requires it to become /u, the parse is immediately stopped and restarted using /u rules. A future enhancement is to only restart if something has been encountered that would generate something different than what has already been generated, as many operations are the same under both /d and /u. Prior to this commit, in rare circumstances was the parse immediately restarted. Only those few that changed the sizing did so. Instead the sizing pass was allowed to complete and then the generation pass ran, using /u. Some CVEs were caused by faulty implementation here. c) Very large patterns may need to have long jumps in their program. Prior to this commit, that was determined in the sizing pass, and all jumps were made long during generation. Now, the first time the need for a long jump is detected, the parse is immediately restarted, and all jumps are made long. I haven't investigated enough to be sure, but it might be sufficient to just redo the current jump, making it long, and then switch to using just long jumps, without having to restart the parse from the beginning. d) If a reference that could be to capturing parentheses doesn't find such parentheses, a flag is set. For references that could be octal constants, they are assumed to be those constants instead of a capturing group. At the end of the parse, if the flag indicates either that the assumption(s) were wrong or that it is a fatal reference to a non-existent group, the pattern is reparsed knowing the total number of these groups. e) If (?R) or (?0) are encountered, the flag listed in item d) above is set to force a reparse. I did have code in place that avoided doing the reparse, but I wasn't confident enough that our test suite exercises that area of the code enough to have exposed all the potential interaction bugs, and I think this construct is used rarely enough to not worry about avoiding the reparse at this point in the development. f) If (?|pattern) is encountered, the behavior is the same as in item e) above. The pattern will end up being reparsed after the total number of parenthesized groups are known. I decided not to invest the effort at this time in trying to get this to work without a reparse. It might be that if we are continuing the parse to just count parentheses, and we encounter a construct that normally would restart the parse immediately, that we could defer that restart. This would cut down the maximum number of parses required. As of this commit, the worst case is we find something that requires knowing all the parentheses; later we have to switch to /u rules and so the parse is restarted. Still later we have to switch to long jumps, and the parse is restarted again. Still later we have to upgrade to UTF-8, and the parse is restarted yet again. Then the parse is completed, and the final total of parentheses is known, so everything is redone a final time. Deferring those intermediate restarts would save a bunch of reparsing. Prior to this commit, warnings were issued only during the code generation pass, which didn't get called unless the sizing pass(es) completed successfully. But now, we don't know if the pass will succeed, fail, or whether it will have to be restarted. To avoid outputting the same warning more than once, the position in the parse of the last warning generated is kept (across parses). The code looks at that position when it is about to generate a warning. If the parsing has previously gotten that far, it assumes that the warning has already been generated, and suppresses it this time. The current state of parsing is such that I believe this assumption is valid. If the parses had divergent paths, that assumption could become invalid.
author: Karl Williamson <khw@cpan.org> 2018-10-19 09:48:34 -0600
committer: Karl Williamson <khw@cpan.org> 2018-10-20 00:09:56 -0600
commit: 7c932d07cab18751bfc7515b4320436273a459e2 (patch)
tree: 795bd8462e38ea535f4246c0fac3c7ebe0db3671 /regcomp.h
parent: deeb899527fafbf46c2ae732d30fbaedd79d1b84 (diff)
download: perl-7c932d07cab18751bfc7515b4320436273a459e2.tar.gz
1 files changed, 1 insertions, 2 deletions
diff --git a/regcomp.h b/regcomp.h
index f2ce68fb71..c0b2c07ec5 100644
--- a/regcomp.h
+++ b/regcomp.h
@@ -381,10 +381,9 @@ struct regnode_ssc {
 
 #define REG_MAGIC 0234
 
-#define SIZE_ONLY RExC_pass1
+#define SIZE_ONLY FALSE
 #define PASS1 SIZE_ONLY
 #define PASS2 (! SIZE_ONLY)
-
 /* An ANYOF node is basically a bitmap with the index being a code point.  If
  * the bit for that code point is 1, the code point matches;  if 0, it doesn't
  * match (complemented if inverted).  There is an additional mechanism to deal
author	Karl Williamson <khw@cpan.org>	2018-10-19 09:48:34 -0600
committer	Karl Williamson <khw@cpan.org>	2018-10-20 00:09:56 -0600
commit	7c932d07cab18751bfc7515b4320436273a459e2 (patch)
tree	795bd8462e38ea535f4246c0fac3c7ebe0db3671 /regcomp.h
parent	deeb899527fafbf46c2ae732d30fbaedd79d1b84 (diff)
download	perl-7c932d07cab18751bfc7515b4320436273a459e2.tar.gz