Commit message  Author  Age  Files  Lines
* Finalize perldelta for 5.29.4 [v5.29.4]  Aaron Crane  2018-10-20  1  -309/+55
|
* Update Module::Corelist for 5.29.4  Aaron Crane  2018-10-20  1  -2/+25
|
* Merge branch 'remove sizing pass' into blead  Karl Williamson  2018-10-20  15  -1748/+1855
|\
| * perldelta  Karl Williamson  2018-10-20  1  -1/+3
| |
| * regcomp.c: White space only  Karl Williamson  2018-10-20  1  -205/+204
| | This out/indents the blocks that were removed/added by the previous commit.
| * Remove references to passes from regex compiler  Karl Williamson  2018-10-20  2  -265/+36
| | The previous commit removed the sizing pass, but to minimize the difference listing, it left in all the references it could to the various passes, with the first pass set to FALSE. This commit now removes those references, as well as to some variables that are no longer used.
| * Remove sizing pass from regular expression compiler  Karl Williamson  2018-10-20  2  -228/+393
| | This commit removes the sizing pass for regular expression compilation. It attempts to be the minimum required to do this. Future patches are in the works that improve it, and there is certainly lots more that could be done.
| |
| | This is being done now for security reasons, as there have been several bugs leading to CVEs where the sizing pass computed the size improperly, and a crafted pattern could allow an attack. This means that simple bugs can too easily become attack vectors.
| |
| | This is NOT the AST that people would like, but it should make it easier to move the code in that direction.
| |
| | Instead of a sizing pass, as the pattern is parsed, new space is malloc'd for each regnode found. To minimize the number of such mallocs that actually go out and request memory, an initial guess is made, based on the length of the pattern being compiled. The guessed amount is malloc'd and then immediately released. Hopefully that memory won't be gobbled up by another process before we actually gain control of it. The guess is currently simply the number of bytes in the pattern. Patches and/or suggestions are welcome on improving the guess or this method.
| |
| | This commit does not mean, however, that only one pass is done in all cases. Currently there are several situations that require extra passes. These are:
| |
| | a) If the pattern isn't UTF-8, but encounters a construct that requires it to become UTF-8, the parse is immediately stopped, the translation is done, and the parse restarted. This is unchanged from how it worked prior to this commit.
| |
| | b) If the pattern uses /d rules and it encounters a construct that requires it to become /u, the parse is immediately stopped and restarted using /u rules. A future enhancement is to only restart if something has been encountered that would generate something different than what has already been generated, as many operations are the same under both /d and /u. Prior to this commit, the parse was immediately restarted only in rare circumstances: only those few constructs that changed the sizing did so. Otherwise the sizing pass was allowed to complete and then the generation pass ran, using /u. Some CVEs were caused by faulty implementation here.
| |
| | c) Very large patterns may need to have long jumps in their program. Prior to this commit, that was determined in the sizing pass, and all jumps were made long during generation. Now, the first time the need for a long jump is detected, the parse is immediately restarted, and all jumps are made long. I haven't investigated enough to be sure, but it might be sufficient to just redo the current jump, making it long, and then switch to using just long jumps, without having to restart the parse from the beginning.
| |
| | d) If a reference that could be to capturing parentheses doesn't find such parentheses, a flag is set. References that could be octal constants are assumed to be those constants instead of a capturing group. At the end of the parse, if the flag indicates either that the assumption(s) were wrong or that it is a fatal reference to a non-existent group, the pattern is reparsed knowing the total number of these groups.
| |
| | e) If (?R) or (?0) are encountered, the flag listed in item d) above is set to force a reparse. I did have code in place that avoided doing the reparse, but I wasn't confident enough that our test suite exercises that area of the code enough to have exposed all the potential interaction bugs, and I think this construct is used rarely enough to not worry about avoiding the reparse at this point in the development.
| |
| | f) If (?|pattern) is encountered, the behavior is the same as in item e) above. The pattern will end up being reparsed after the total number of parenthesized groups is known. I decided not to invest the effort at this time in trying to get this to work without a reparse.
| |
| | It might be that if we are continuing the parse to just count parentheses, and we encounter a construct that normally would restart the parse immediately, we could defer that restart. This would cut down the maximum number of parses required. As of this commit, the worst case is we find something that requires knowing all the parentheses; later we have to switch to /u rules and so the parse is restarted. Still later we have to switch to long jumps, and the parse is restarted again. Still later we have to upgrade to UTF-8, and the parse is restarted yet again. Then the parse is completed, and the final total of parentheses is known, so everything is redone a final time. Deferring those intermediate restarts would save a bunch of reparsing.
| |
| | Prior to this commit, warnings were issued only during the code generation pass, which didn't get called unless the sizing pass(es) completed successfully. But now, we don't know if the pass will succeed, fail, or whether it will have to be restarted. To avoid outputting the same warning more than once, the position in the parse of the last warning generated is kept (across parses). The code looks at that position when it is about to generate a warning. If the parsing has previously gotten that far, it assumes that the warning has already been generated, and suppresses it this time. The current state of parsing is such that I believe this assumption is valid. If the parses had divergent paths, that assumption could become invalid.
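The allocation strategy described in this commit message (seed the program space from the pattern length, then grow as regnodes are found) can be sketched roughly as below. This is only an illustration under stated assumptions: `toy_node`, `toy_program`, and the `toy_program_*` functions are hypothetical names that do not appear in regcomp.c, plain `malloc`/`realloc` stand in for Perl's own allocators, and the doubling growth policy is a swapped-in simplification rather than what the commit does (the commit allocates per regnode after a malloc-and-release of the guessed amount).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a regnode; the real layout lives in regcomp.h. */
typedef struct {
    unsigned char type;
    unsigned char flags;
    unsigned short next_off;
} toy_node;

/* Hypothetical growable program buffer. */
typedef struct {
    toy_node *nodes;  /* program under construction */
    size_t used;      /* nodes emitted so far */
    size_t cap;       /* nodes currently allocated */
} toy_program;

/* Seed the capacity with a guess: the number of bytes in the pattern. */
static int toy_program_init(toy_program *p, const char *pattern)
{
    size_t guess;

    assert(p != NULL && pattern != NULL);
    guess = strlen(pattern);
    if (guess == 0)
        guess = 1;                      /* never request zero nodes */
    p->nodes = malloc(guess * sizeof *p->nodes);
    if (p->nodes == NULL)
        return -1;
    p->used = 0;
    p->cap = guess;
    return 0;
}

/* Append one node, growing the buffer only when the guess was too small.
 * Returns the index of the new node, or (size_t)-1 on allocation failure. */
static size_t toy_program_push(toy_program *p, unsigned char type)
{
    assert(p != NULL);
    if (p->used == p->cap) {
        size_t new_cap = p->cap * 2;    /* assumption: simple doubling */
        toy_node *grown = realloc(p->nodes, new_cap * sizeof *grown);
        if (grown == NULL)
            return (size_t)-1;
        p->nodes = grown;
        p->cap = new_cap;
    }
    p->nodes[p->used].type = type;
    p->nodes[p->used].flags = 0;
    p->nodes[p->used].next_off = 0;
    return p->used++;
}

static void toy_program_free(toy_program *p)
{
    free(p->nodes);
    p->nodes = NULL;
    p->used = p->cap = 0;
}
```

With a two-byte pattern, the first two pushes fit in the guessed space and the third triggers one growth step. Because growth may move the buffer, callers should hold indices rather than pointers into `nodes`.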
| * regcomp.c: Avoid potential NULL ptr dereference  Karl Williamson  2018-10-20  1  -0/+7
| | This commit checks that the passed-in variable is non-NULL before dereferencing it, as defensive coding practice. A future commit causes this to matter.
| * regcomp.c: Test that code block exists before cleaning  Karl Williamson  2018-10-20  2  -3/+5
| | This is defensive coding, to avoid dereferencing a NULL ptr.
| * regcomp.c: Extract code into a function  Karl Williamson  2018-10-20  4  -73/+89
| | This should have no changes in behavior, and is for a future commit where this code will be called from a second place.
| * regcomp.c: Test for having /u earlier for \p{}  Karl Williamson  2018-10-20  1  -3/+4
| | When \p{} or \P{} are encountered during parsing, that indicates that the pattern should be compiled not under /d, but under /u, as these are Unicode constructs. This commit moves the test for that to somewhat earlier. This saves only a little work currently, but in a future commit it saves a lot more wasted work.
| * regcomp.c: Remove variable in favor of struct element  Karl Williamson  2018-10-20  1  -86/+85
| | The code has a structure element that means the same thing part way through a function as a variable. Just use the struct element all the way through.
| * regcomp.c: Move fcn call out of loop  Karl Williamson  2018-10-20  1  -2/+2
| | The loop in this case is formed by a goto label, and the function determines if there are runtime code blocks in the pattern. That doesn't change if we have to reparse, so the return from the function doesn't change, so we only have to call it once.
| * regcomp.c: Extract code into a function  Karl Williamson  2018-10-20  4  -1/+14
| | This is in preparation for a later commit where the code will become more complex and be called from more than one place.
| * regcomp.c: Consolidate checks for warnings fatality  Karl Williamson  2018-10-20  1  -21/+12
| | This adds code so that whenever a warning is about to be emitted, it first checks to see if the warning is fatal, and if so mortalizes the SV that otherwise would leak. This partially fixes ticket [perl #133589]. It doesn't help if the warnings are called through a subroutine outside of regcomp.c.
| * regcomp.c: Add macro for warnings output  Karl Williamson  2018-10-20  1  -0/+13
| | This macro does nothing for now. It is being added in this separate commit to lessen the number of differences in the future commit that will need it, so that these don't distract from the main intent of that commit. The code is moving away from emitting all warnings in the code generation pass, to emitting them as soon as encountered. But to avoid emitting them more than once should the pattern have to be reparsed, they will not be emitted unless the parse has gotten further this time than it got earlier. This commit prepares for that.
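The rule described here (emit a warning only if the parse has gotten further this time than it got earlier) can be sketched as a small position check. This is a minimal illustration, not the macro the commit adds: `toy_warn_state` and `toy_warn_at` are hypothetical names, warnings go to `stderr` rather than through Perl's warning machinery, and the state is assumed to be preserved across restarts of the parse.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical state that survives restarts of the parse. */
typedef struct {
    size_t warned_upto;  /* one past the furthest pattern offset that has
                            already produced a warning; 0 means none yet */
    int emitted;         /* number of warnings actually output */
} toy_warn_state;

/* Emit msg for the construct at pattern offset `offset`, unless an earlier
 * parse already warned at or beyond that offset. Returns 1 if the warning
 * was output and 0 if it was suppressed.
 * Simplification: a second, different warning at the same offset would also
 * be suppressed by this sketch. */
static int toy_warn_at(toy_warn_state *st, size_t offset, const char *msg)
{
    assert(st != NULL && msg != NULL);
    if (offset < st->warned_upto)
        return 0;                       /* a previous parse got this far */
    fprintf(stderr, "warning at offset %lu: %s\n",
            (unsigned long)offset, msg);
    st->warned_upto = offset + 1;
    st->emitted++;
    return 1;
}
```

If a first parse warns at offsets 3 and 7 and is then restarted, the second parse stays silent at 3 and 7 and only reports warnings beyond offset 7. As the sizing-pass commit message notes, this relies on the restarted parse following the same path as the earlier one.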
| * regcomp.c: Defer setting the OP variant of a regnode  Karl Williamson  2018-10-20  1  -2/+11
| | This is in preparation for a future commit. We have allocated space and know that the regnode will be some type of ANYOF one. But we don't need to know precisely which one until later than the point where we had been setting it.
| * regcomp.c: Defer setting regnode operand  Karl Williamson  2018-10-20  1  -4/+4
| | Don't set this until we know that we are actually going to have a regnode that requires this operand.
| * regcomp.c: Defer setting flags of a regnode  Karl Williamson  2018-10-20  1  -9/+10
| | This is in preparation for a future commit where we won't know what the regnode is until later in the code.
| * regcomp.c: Move some declarations  Karl Williamson  2018-10-20  1  -4/+6
| | This is in preparation for some blocks to be removed in a future commit, so the declarations have to be at the top of the enclosing block.
| * regcomp.c: Add some const's to static fcn  Karl Williamson  2018-10-20  3  -4/+6
| | Since it's static, the compiler can figure out these are consts, but knowing this helped me read the code.
| * regcomp.c: Use an equivalent 'if' condition  Karl Williamson  2018-10-20  1  -2/+2
| | By inspection of the code, I see that this 'if' won't get executed unless the variable is non-null. The function whose return sets it will raise an error if it would otherwise return NULL unexpectedly.
| * regcomp.c: Reorder 'if' clauses  Karl Williamson  2018-10-20  1  -2/+3
| | This is for the sake of clarity, so that the comment applies to the adjacent clause.
| * regcomp.c: Rmv unnecessary else  Karl Williamson  2018-10-20  1  -3/+1
| | The 'if', if executed, aborts, so control would never reach here unless the 'else' would be taken.
| * regcomp.c: Use SvREFCNT_inc_NN()  Karl Williamson  2018-10-20  1  -1/+1
| | We know this is non-null as it panics a little ways above unless that's the case.
| * regcomp.c: Use an equivalent 'if' condition  Karl Williamson  2018-10-20  1  -2/+2
| | By inspection of the code, I see that this 'if' won't get executed unless the variable is non-null. The function whose return sets it will raise an error if it would otherwise return NULL unexpectedly.
| * regcomp.c: Combine expression into a macro  Karl Williamson  2018-10-20  1  -5/+6
| | This expression will change in a future commit, so this isolates that change.
| * regcomp.c: Split variable into two  Karl Williamson  2018-10-20  1  -5/+11
| | This commit causes there to now be two variables that count the number of parentheses in a pattern:
| | a) the current number as the pattern is parsed
| | b) the total number, not known until the pattern has been completely parsed
| | This will prove useful in later commits.
| * Consolidate code into a single macro  Karl Williamson  2018-10-20  1  -14/+15
| | If we die during the code generation phase, we set the regex SV to be freed during cleanup. This consolidates many of those instances into one macro, so that it can be easily changed. And instead of tying it to the particular phase, we clean up whenever that SV actually exists. This requires initializing it to NULL.
| * regcomp.c: Omit warning if error about to be raised  Karl Williamson  2018-10-20  1  -5/+7
| | This commit changes the code to skip a warning when it knows an error is about to happen. Currently this doesn't matter, as the warning would be emitted only in a later pass, and the error would actually happen first, so the warning doesn't get output at all. But future commits will change that, so this commit is in preparation for that.
| * regcomp.c: Add ability to not warn during substitute parse  Karl Williamson  2018-10-20  1  -4/+18
| | Under certain conditions, regcomp.c will pretend something other than the input pattern is to be parsed. There is a mechanism to seamlessly show the original code when that substitute expression contains the original as a subset. But there are cases where the entire substitute is constructed by regcomp.c, and has none of the original pattern in it. Since it is our construction, it should be legal and devoid of warnings, but if something somehow did generate a warning, it could lead to seg faults, etc. This commit adds and uses a mechanism to turn off warnings while parsing these constructs. Should an attempt be made to output a warning, a panic error message giving debugging details is output instead of a seg fault occurring.
| * regcomp.c: Generalize conditions to output warnings  Karl Williamson  2018-10-20  1  -9/+14
| | Warnings are deferred until the code generation pass. This doesn't change that, but makes the condition into a macro call so that it can be changed in a future commit.
| * regcomp.c: Move some more code to earlier  Karl Williamson  2018-10-20  1  -4/+4
| | It is better defensive coding to restore as soon as possible, rather than deferring it.
| * regcomp.c: Move some code to earlier  Karl Williamson  2018-10-20  1  -5/+6
| | It is better defensive coding to restore as soon as possible, rather than deferring it.
| * regcomp.c: Consolidate 2nd pass for warnings  Karl Williamson  2018-10-20  1  -36/+24
| | Warnings generally have to be delayed until the 2nd pass, as the first pass can be restarted multiple times, and so the same warning could be output multiple times if the restarted code outputs a warning. Prior to this commit, there was an assert that the warnings are being output in the 2nd pass. This commit changes it so that the assert is turned into an 'if' in common code, and the dispersed 'if's that formerly were used are removed as much as possible. If that removes an indented block, this commit also outdents the block contents.
| * regcomp.c: Add macro for warning experimental features  Karl Williamson  2018-10-20  1  -13/+15
| | This consolidates the code that warns that an experimental feature is being called into a common macro.
| * regcomp.c: Put common code in a macro  Karl Williamson  2018-10-20  1  -55/+59
| | This trivial code is extracted into a common macro in preparation for a future commit when it will become non-trivial, and hence that logic will only have to occur once.
| * regcomp.c: Rename macro and label  Karl Williamson  2018-10-20  1  -29/+29
| | These move away from talking about pass 1, in preparation for future commits.
| * regcomp.c: Use another macro consistently  Karl Williamson  2018-10-20  1  -7/+7
| | Sometimes the code refers to the union member explicitly, and sometimes it uses the macro that evaluates to that member. This commit changes to consistently use the latter.
| * regcomp.c: Use name consistently  Karl Williamson  2018-10-20  1  -39/+37
| | In the function Perl_re_op_compile(), there is a stack variable 'ri' and an element 'rxi' of a stack structure. Part way through the function one is set to the other, so that at times one is used and at other times the other is used. This commit removes the variable in favor of the struct element, making consistent use throughout the function (and since the struct is passed to its callees, it makes this value available to them always).
| * regcomp.c: Use regnode offsets during parsing  Karl Williamson  2018-10-20  5  -336/+393
| | This changes the pattern parsing to use offsets from the first node in the pattern program, rather than direct addresses of such nodes. This is in preparation for a later change in which more mallocs will be done which will change those addresses, whereas the offsets will remain constant. Once the final program space is allocated, real addresses are used as currently. This limits the necessary changes to a few functions. Also, real addresses are used if they are constant across a function; again this limits the changes.
| |
| | Doing this introduces a new typedef for clarity, 'regnode_offset', which is not a pointer, but a count. This necessitates changing a bunch of things to use 0 instead of NULL to indicate an error.
| |
| | A new boolean is also required to indicate if we are in the first or second passes of the pattern. And separate heap space is allocated for scratch during the first pass.
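Why offsets survive reallocation while addresses do not can be shown with a small sketch. Everything here is hypothetical and only loosely modelled on the description above: `toy_offset` plays the role the message gives to 'regnode_offset', the `TOY_NODE_AT` and `TOY_OFFSET_OF` macros are invented stand-ins for offset/address conversion, and reserving slot 0 so that offset 0 can signal an error is an assumption of this sketch, not something the commit message states.

```c
#include <assert.h>
#include <stdlib.h>

typedef size_t toy_offset;   /* a count of nodes from the program start,
                                not a pointer; 0 signals an error */

typedef struct {
    int op;                  /* which operation this node performs */
    toy_offset next;         /* link to another node, stored as an offset */
} toy_node;

typedef struct {
    toy_node *base;          /* may move whenever the program grows */
    size_t used;
    size_t cap;
} toy_prog;

/* Hypothetical conversion macros between offsets and real addresses. */
#define TOY_NODE_AT(prog, off)    ((prog)->base + (off))
#define TOY_OFFSET_OF(prog, node) ((toy_offset)((node) - (prog)->base))

static int toy_prog_init(toy_prog *p)
{
    assert(p != NULL);
    p->cap = 2;
    p->used = 1;             /* assumption: slot 0 is reserved so that
                                offset 0 never names a real node */
    p->base = calloc(p->cap, sizeof *p->base);
    return p->base != NULL ? 0 : -1;
}

/* Emit a node and return its offset, or 0 on allocation failure. */
static toy_offset toy_emit(toy_prog *p, int op)
{
    assert(p != NULL);
    if (p->used == p->cap) {
        size_t new_cap = p->cap * 2;
        toy_node *grown = realloc(p->base, new_cap * sizeof *grown);
        if (grown == NULL)
            return 0;
        p->base = grown;     /* addresses of existing nodes may now differ */
        p->cap = new_cap;
    }
    p->base[p->used].op = op;
    p->base[p->used].next = 0;
    return p->used++;
}
```

An offset returned by an early `toy_emit` call still names the same node after later calls have grown (and possibly moved) the buffer, whereas a `toy_node *` saved before the growth could dangle. Converting to a real address with `TOY_NODE_AT` is only safe for as long as no further node is emitted.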
| * regcomp.sym: Add lengths for ANYOF nodes  Karl Williamson  2018-10-20  5  -46/+33
| | This changes regcomp.sym to generate the correct lengths for ANYOF nodes, which means they don't have to be special cased in regcomp.c, leading to simplification.
| * regcomp.h: Swap struct vs typedef  Karl Williamson  2018-10-20  2  -4/+3
| | This struct has two names. I previously left the less descriptive one as the primary because of back compat issues. But I now realize that regcomp.h is only used in the core, so it's ok to swap for the better name to be primary.
| * regcomp.c: Generate new regnode for /[[:posix:]]/l  Karl Williamson  2018-10-20  2  -48/+38
| | This follows on to the previous commit and generates code to use the new regnode. This allows some simplifications. The determination of the regnode is moved later in the function that does this; the code that backed this out if we guessed wrong is excised.
| * regcomp.sym: Add node type ANYOF_POSIXL  Karl Williamson  2018-10-20  4  -161/+176
| | This is like ANYOFL, but has runtime matches of /[[:posix:]]/ in it, which requires extra space. Adding this will allow a future commit to simplify handling for ANYOF nodes.
| * regcomp.h: Add some macros  Karl Williamson  2018-10-20  1  -4/+13
| | These are used to allow the bit map of run-time /[[:posix:]]/l classes to be stored in a variable, and not just in the argument of an ANYOF node. This will enable the next commit to use such a variable. The current macros are rewritten to just call the new ones with the proper arguments. A macro of a different sort is also created to allow one to set the entire bit field in the node at once.
| * regcomp.h: Remove unused macros  Karl Williamson  2018-10-20  1  -4/+0
| | I had kept these macros around for backwards compatibility. But now I realize regcomp.h is only for core use, so no need to retain them.
| * regcomp.c: White-space, comment only  Karl Williamson  2018-10-20  1  -171/+187
| |
| * regcomp.h: White-space, comments only  Karl Williamson  2018-10-20  1  -7/+11
| |
| * regcomp.c: Add conversion macros and use them  Karl Williamson  2018-10-20  1  -14/+19
| | This adds macros that hide the details in finding the regnode address from the offset from the beginning of the pattern's program, and vice versa.