summaryrefslogtreecommitdiff
path: root/regcomp.h
Commit message (Collapse)AuthorAgeFilesLines
* regcomp.h: Clarify commentKarl Williamson2021-06-121-1/+1
|
* regcomp.h: Fix typos in commentsKarl Williamson2021-05-281-2/+2
|
* speed up regex ops under DEBUGGING when -Dr not usedYves Orton2021-02-121-2/+2
|
* style: Detabify indentation of the C code maintained by the core.Michael G. Schwern2021-01-171-17/+17
| | | | | | | | | | | This just detabifies to get rid of the mixed tab/space indentation. Applying consistent indentation and dealing with other tabs are another issue. Done with `expand -i`. * vutil.* left alone, it's part of version. * Left regen managed files alone for now.
* regcomp.h: Restrict scope of symbols to coreKarl Williamson2020-11-021-1/+2
| | | | Everything defined here are internal symbols
* handy.h: Create nBIT_UMAX() macroKarl Williamson2020-07-171-1/+1
| | | | This encapsulates a common paradigm
* handy.h: Create nBIT_MASK(n) macroKarl Williamson2020-07-171-5/+5
| | | | | This encapsulates a common paradigm, making sure that it is done correctly for the platform's size.
* Use dNOOP for otherwise empty declarationsKarl Williamson2020-04-261-3/+3
| | | | | | | | Otherwise, the semi-colon the code has afterwards can be interpreted by some compilers as ending the block of declarations, and treat any later declarations in the same block as erors in C89. For an example, see the ticket this fixes #17725
* regcomp.c: Rmv C undefined behaviorKarl Williamson2020-04-121-4/+3
| | | | | | | | | | | | | | | One analyzer said that what this commit changes was C undefined behavior, in casting void* pointers. Right now, the only actual type it is called with is SV*, but I made it void*, because I thought it might be used more generally. But, it turns out that Unicode is planning on changing its regular expression processing requirements to where what I have no longer will make sense. And, since only SV* is actually used, this commit changes the void* to SV*, removing any undefined behavior, with no changes to program logic. The changes for the new Unicode direction will come in probably 5.34; their document is still in draft, but I anticipate it will soon be finalized.
* regcomp.h: Add macroKarl Williamson2020-03-181-0/+16
| | | | | | | This macro is for files outside of regcomp.c/regexec.c (and re_comp.c/re_exec.c) to have access to the re.pm Debug options. Right now pp_match could benefit from this.
* regcomp.h: Combine two macros into oneKarl Williamson2020-03-111-13/+14
| | | | One of these macros is no longer used, so just combine them.
* Allow wildcard pattern debuggingKarl Williamson2020-03-051-1/+5
| | | | | | | | | | | | | | | | | | | | | | This fixes #17026 Patterns can have subpatterns since 5.30. These are processed when encountered, by suspending the main pattern compilation, compiling the subpattern, and then matching that against the set of all legal possibilities, which Perl knows about. Debugging info for the compilation portion of the subpattern was added by be8790133a0ce8fc67454e55e7849a47a0858d32, without fanfare. But, prior to this new commit, debugging info was not available for that matching portion of the compilation, except under DEBUGGING builds, with -Drv. This commit adds a new option to 'use re qw(Debug ...)', WILDCARD, to enable subpattern match debugging. Whatever other match debugging options have been turned on will show up when a wildcard subpattern is compiled iff WILDCARD is specified. The output of this may be voluminous, which is why you have to ask for it specifically. Or, the EXTRA option turns it on, along with several other things.
* regex: Change internal macro nameKarl Williamson2020-03-051-2/+2
| | | | | It wasn't clear to me that the macro did more than a declaration, given its name. Rename it to be clear as to what it does.
* regcomp.h: remove redundant commentHugo van der Sanden2020-03-051-30/+0
| | | | This 1989 comment stopped being true in 1997 with commit c277df4222.
* regcomp.h -reorder regexp_internal to close x86-64 alignment holesRichard Leach2020-03-021-1/+1
|
* regcomp.h: Use pre-existing macro to hide variableKarl Williamson2020-02-261-25/+25
| | | | | Instead of repeating this variable name all over the place, use the macro that had been created to refer to it.
* regcomp.sym: Add new regnode type for (?[])Karl Williamson2020-02-191-0/+19
| | | | | | | | | | | | | | | | | | This new regnode is used to handle interpolated already-compiled regex sets inside outer regex sets. If it isn't present, it will mean that what appears to be a nested, interpolated set really isn't. I created a new regnode structure to hold a pointer. This has to be temporary as pointers can be invalidated. I thought of just having a regnode without a pointer as a marker, and using a parallel array to store the data, rather than creating a whole new regnode structure for just pointers, but parallel data structures can get out of sync, so this seemed best. This commit just sets up the regnode; a future commit will actually use it.
* Move data for PL_InBitmap to charclass_invlists.hKarl Williamson2019-11-261-15/+0
| | | | | This makes it consistent with the other inversion lists for this sort of thing, and finishes the fix for GH #17154
* regcomp.h: Delete obsolote macroKarl Williamson2019-11-221-2/+0
|
* PATCH: gh #17319 SegfaultKarl Williamson2019-11-221-1/+1
| | | | | | It turns out that one isn't supposed to fill in the offset to the next regnode at node creation time. And this node is like EXACTish, so the string stuff isn't accounted for in its regcomp.sym definition
* Add ANYOFHs regnodeKarl Williamson2019-11-201-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | This node is like ANYOFHb, but is used when more than one leading byte is the same in all the matched code points. ANYOFHb is used to avoid having to convert from UTF-8 to code point for something that won't match. It checks that the first byte in the UTF-8 encoded target is the desired one, thus ruling out most of the possible code points. But for higher code points that require longer UTF-8 sequences, many many non-matching code points pass this filter. Its almost 200K that it is ineffective for for code points above 0xFFFF. This commit creates a new node type that addresses this problem. Instead of a single byte, it stores as many leading bytes that are the same for all code points that match the class. For many classes, that will cut down the number of possible false positives by a huge amount before having to convert to code point to make the final determination. This regnode adds a UTF-8 string at the end. It is still much smaller, even in the rare worst case, than a plain ANYOF node because the maximum string length, 15 bytes, is still shorter than the 32-byte bitmap that is present in a plain ANYOF. Most of the time the added string will instead be at most 4 bytes.
* regcomp.h: Fix up commentKarl Williamson2019-11-171-1/+1
|
* Add ANYOFR regnodeKarl Williamson2019-11-171-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | This matches a single range of code points. It is both faster and smaller than other ANYOF-type nodes, requiring, after set-up, a single subtraction and conditional branch. The vast majority of Unicode properties match a single range (though most of the properties likely to be used in real world applications have more than a single range). But things like [ij] are a single range, and those are quite commonly encountered. This new regnode matches them more efficiently than a bitmap would, and doesn't require the space for one either. The flags field is used to store the minimum matchable start byte for UTF-8 strings, and is ignored for non-UTF-8 targets. This, like ANYOFH nodes which have a similar mechanism, allows for quick weeding out of many possible matches without having to convert the UTF-8 to its corresponding code point. This regnode packs the 32 bit argument with 20 bits for the minimum code point the node matches, and 12 bits for the maximum range. If the input is a value outside these, it simply won't compile to this regnode, instead going to one of the ANYOFH flavors. ANYOFR is sufficient to match all of Unicode except for the final (private use) 65K plane.
* regcomp.h: Fix commentKarl Williamson2019-11-161-1/+1
|
* Change the names of some regnodesKarl Williamson2019-10-291-7/+7
| | | | The new name is shorter and I believe, clearer.
* regex: Add LEXACT_ONLY8 node typeKarl Williamson2019-09-291-7/+13
| | | | | This is like LEXACT, but it is known that only strings encoded in UTF-8 will match it, so don't even have to try if that condition isn't met.
* Add regnode LEXACT, for long stringsKarl Williamson2019-09-291-5/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a new regnode for strings that don't fit in a regular one, and adds a structure for that regnode to use. Actually using them is deferred to the next commit. This new regnode structure is needed because the previous structure only allows for an 8 bit length field, 255 max bytes. This commit puts the length instead in a new field, the same place single-argument regnodes put their argument. Hence this long string is an extra 32 bits of overhead, but at no string length is this node ever bigger than the combination of the smaller nodes it replaces. I also considered simply combining the original 8 bit length field (which is now unused) with the first byte of the string field to get a 16 bit length, and have the actual string be offset by 1. But I rejected that because it would mean the string would usually not be aligned, slowing down memory accesses. This new LEXACT regnode can hold up to what 1024 regular EXACT ones hold, using 4K fewer overhead bytes to do so. That means it can handle strings containing 262000 bytes. The comments give ideas for expanding that should it become necessary or desirable. Besides the space advantage, any hardware acceleration in memcmp can be done in much bigger chunks, and otherwise the memcmp inner loop (often written in assembly) will run many more times in a row, and our outer loop that calls it, correspondingly fewer.
* regcomp.h: Add commentsKarl Williamson2019-09-291-0/+9
|
* regcomp.h: Remove obsolete macroKarl Williamson2019-09-291-2/+0
| | | | This is no longer used
* regcomp: Use new set macro to store a valueKarl Williamson2019-09-291-0/+2
| | | | | This is in preparation for the current mechanism in a later commit to become a not legal lhs
* regcomp.h: Parenthesize param in macro expansionKarl Williamson2019-09-271-1/+1
| | | | This is always a good idea
* regcomp.h: Remove duplicate macro expansionKarl Williamson2019-09-271-1/+2
| | | | This macro has the same definition as another.
* Add ability to dump pre-optimized compiled patternKarl Williamson2019-08-261-9/+13
| | | | in qr//
* regcomp.h: Use actual commit numberKarl Williamson2019-06-261-1/+2
| | | | | This referred to the commit message, but now that the number has been determined, use that.
* Add ANYOFHr regnodeKarl Williamson2019-06-261-0/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit adds a new regnode, ANYOFHr, like ANYOFH, but it also has a loose upper bound for the first UTF-8 byte matchable by the node. (The 'r' stands for 'range'). It would be nice to have a tight upper bound, but to do so requires 4 more bits than are available without changing the node arguments types, and hence increasing the node size. Having a loose bound is better than no bound, and comes essentially free, by using two unused bits in the current ANYOFH node, and requiring only a few extra, pipeline-able, mask, etc instructions at run time, no extra conditionals. Any ANYOFH nodes that would benefit from having an upper bound will instead be compiled into this node type. Its use is based on the following observations. There are 64 possible start bytes, so the full range can be expressed in 6 bits. This means that the flags field in ANYOFH nodes containing the start byte has two extra bits that can be used for something else. An ANYOFH node only happens when there is no matching code point in the bit map, so the smallest code point that could be is 256. The start byte for that is C4, so there are actually only 60 possible start bytes. (perl can be compiled with a larger bit map in which case the minimum start byte would be even higher.) A second observation is that knowing the highest start byte is above F0 is no better than knowing it's F0. This is because the highest code point whose start byte is F0 is U+3FFFF, and all code points above that that are currently allocated are all very specialized and rarely encountered. And there's no likelihood of that changing anytime soon as there's plenty of unallocated space below that. So if an ANYOFH node's highest start byte is F0 or above, there's no advantage to knowing what the actual max possible start byte is, so leave it as ANYOFH,. That means the highest start byte we care about in ANYOFHr is EF. That cuts the number of start bytes we care about down to 43, still 6 bits required to represent them, but it allows for the following scheme: Populate the flags field by subtracting C0 from the lowest start byte and shift left 2 bits. That leaves the the bottom two bits unused. We use them as follows, where x is the start byte of the lowest code point in the node: bits ---- 11 The upper limit of the range can be as much as (EF - x) / 8 10 The upper limit of the range can be as much as (EF - x) / 4 01 The upper limit of the range can be as much as (EF - x) / 2 00 The upper limit of the range can be as much as EF That partitions the loose upper bound into 4 possible ranges, with it being tighter the closer it is to the strict lower bound. This makes the loose upper bound more meaningful when there is most to gain by having one. Some examples of what the various upper bounds would be for all the possibilities of these two bits are: Upper bound given the 2 bits Low bound 11 10 01 00 --------- -- -- -- -- C4 C9 CE D9 EF D0 D3 D7 DF EF E0 E1 E3 E7 EF Start bytes of E0 and above represent many more code points each than lower ones, as they are 3 byte sequences instead of two. This scheme provides tighter bounds for them, which is also a point in its favor. Thus we have provided a loose upper bound using two otherwise unused bits. An alternate scheme could have had the intervals all the same, but this provides a tighter bound when it makes the most sense to. For EBCDIC the range is is from C8 to F4, Tests will be added in a later commit
* PATCH: [perl #131551] Too deep regex compilation recursionKarl Williamson2019-03-171-0/+3
| | | | | | | | | | | This patch, started by Yves Orton, and refined in consultation with Tony Cook, imposes a maximum depth of unclosed left parentheses, at which point it croaks. This is to prevent the segfault in the ticket. The patch adds a variable that can be set to increase or decrease this limit at run time (actually regex compilation time) should this be desired, and hence our pre-determined limit of 1000 can be changed if necessary.
* regcomp.h: Rmv obsolete references to 'swash'Karl Williamson2019-03-131-13/+14
| | | | regexes no longer use these
* regcomp.h: Fix typo in commentKarl Williamson2018-12-091-1/+1
|
* Provide header guards to prevent re-inclusionJames E Keenan2018-12-041-0/+6
| | | | | | | | Per LGTM analysis: https://lgtm.com/projects/g/Perl/perl5/alerts/?mode=tree&ruleFocus=2163210746 and LGTM recommendation: https://lgtm.com/rules/2163210746/ For: RT 133699
* regcomp.h: Clarify commentsKarl Williamson2018-11-261-3/+3
|
* -Drv now turns on all regex debuggingKarl Williamson2018-11-161-24/+24
| | | | | This commit makes the v (verbose) modifier to -Dr do something: turn on all possible regex debugging.
* regcomp.h: Delete duplicate macro defnKarl Williamson2018-11-161-2/+0
|
* Remove references to passes from regex compilerKarl Williamson2018-10-201-3/+0
| | | | | | | | The previous commit removed the sizing pass, but to minimize the difference listing, it left in all the references it could to the various passes, with the first pass set to FALSE. This commit now removes those references, as well as to some variables that are no longer used.
* Remove sizing pass from regular expression compilerKarl Williamson2018-10-201-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit removes the sizing pass for regular expression compilation. It attempts to be the minimum required to do this. Future patches are in the works that improve it,, and there is certainly lots more that could be done. This is being done now for security reasons, as there have been several bugs leading to CVEs where the sizing pass computed the size improperly, and a crafted pattern could allow an attack. This means that simple bugs can too easily become attack vectors. This is NOT the AST that people would like, but it should make it easier to move the code in that direction. Instead of a sizing pass, as the pattern is parsed, new space is malloc'd for each regnode found. To minimize the number of such mallocs that actually go out and request memory, an initial guess is made, based on the length of the pattern being compiled. The guessed amount is malloc'd and then immediately released. Hopefully that memory won't be gobbled up by another process before we actually gain control of it. The guess is currently simply the number of bytes in the pattern. Patches and/or suggestions are welcome on improving the guess or this method. This commit does not mean, however, that only one pass is done in all cases. Currently there are several situations that require extra passes. These are: a) If the pattern isn't UTF-8, but encounters a construct that requires it to become UTF-8, the parse is immediately stopped, the translation is done, and the parse restarted. This is unchanged from how it worked prior to this commit. b) If the pattern uses /d rules and it encounters a construct that requires it to become /u, the parse is immediately stopped and restarted using /u rules. A future enhancement is to only restart if something has been encountered that would generate something different than what has already been generated, as many operations are the same under both /d and /u. Prior to this commit, in rare circumstances was the parse immediately restarted. Only those few that changed the sizing did so. Instead the sizing pass was allowed to complete and then the generation pass ran, using /u. Some CVEs were caused by faulty implementation here. c) Very large patterns may need to have long jumps in their program. Prior to this commit, that was determined in the sizing pass, and all jumps were made long during generation. Now, the first time the need for a long jump is detected, the parse is immediately restarted, and all jumps are made long. I haven't investigated enough to be sure, but it might be sufficient to just redo the current jump, making it long, and then switch to using just long jumps, without having to restart the parse from the beginning. d) If a reference that could be to capturing parentheses doesn't find such parentheses, a flag is set. For references that could be octal constants, they are assumed to be those constants instead of a capturing group. At the end of the parse, if the flag indicates either that the assumption(s) were wrong or that it is a fatal reference to a non-existent group, the pattern is reparsed knowing the total number of these groups. e) If (?R) or (?0) are encountered, the flag listed in item d) above is set to force a reparse. I did have code in place that avoided doing the reparse, but I wasn't confident enough that our test suite exercises that area of the code enough to have exposed all the potential interaction bugs, and I think this construct is used rarely enough to not worry about avoiding the reparse at this point in the development. f) If (?|pattern) is encountered, the behavior is the same as in item e) above. The pattern will end up being reparsed after the total number of parenthesized groups are known. I decided not to invest the effort at this time in trying to get this to work without a reparse. It might be that if we are continuing the parse to just count parentheses, and we encounter a construct that normally would restart the parse immediately, that we could defer that restart. This would cut down the maximum number of parses required. As of this commit, the worst case is we find something that requires knowing all the parentheses; later we have to switch to /u rules and so the parse is restarted. Still later we have to switch to long jumps, and the parse is restarted again. Still later we have to upgrade to UTF-8, and the parse is restarted yet again. Then the parse is completed, and the final total of parentheses is known, so everything is redone a final time. Deferring those intermediate restarts would save a bunch of reparsing. Prior to this commit, warnings were issued only during the code generation pass, which didn't get called unless the sizing pass(es) completed successfully. But now, we don't know if the pass will succeed, fail, or whether it will have to be restarted. To avoid outputting the same warning more than once, the position in the parse of the last warning generated is kept (across parses). The code looks at that position when it is about to generate a warning. If the parsing has previously gotten that far, it assumes that the warning has already been generated, and suppresses it this time. The current state of parsing is such that I believe this assumption is valid. If the parses had divergent paths, that assumption could become invalid.
* regcomp.c: Use regnode offsets during parsingKarl Williamson2018-10-201-22/+22
| | | | | | | | | | | | | | | | | | | This changes the pattern parsing to use offsets from the first node in the pattern program, rather than direct addresses of such nodes. This is in preparation for a later change in which more mallocs will be done which will change those addresses, whereas the offsets will remain constant. Once the final program space is allocated, real addresses are used as currently. This limits the necessary changes to a few functions. Also, real addresses are used if they are constant across a function; again this limits the changes. Doing this introduces a new typedef for clarity 'regnode_offset' which is not a pointer, but a count. This necessitates changing a bunch of things to use 0 instead of NULL to indicate an error. A new boolean is also required to indicate if we are in the first or second passes of the pattern. And separate heap space is allocated for scratch during the first pass.
* regcomp.sym: Add lengths for ANYOF nodesKarl Williamson2018-10-201-21/+14
| | | | | | This changes regcomp.sym to generate the correct lengths for ANYOF nodes, which means they don't have to be special cased in regcomp.c, leading to simplification
* regcomp.h: Swap struct vs typedefKarl Williamson2018-10-201-1/+1
| | | | | | | This struct has two names. I previously left the less descriptive one as the primary because of back compat issues. But I now realize that regcomp.h is only used in the core, so it's ok to swap for the better name to be primary.
* regcomp.h: Add some macrosKarl Williamson2018-10-201-4/+13
| | | | | | | | | | These are use to allow the bit map of run-time /[[:posix:]]/l classes to be stored in a variable, and not just in the argument of an ANYOF node. This will enable the next commit to use such a variable. The current macros are rewritten to just call the new ones with the proper arguments. A macro of a different sort is also created to allow one to set the entire bit field in the node at once.
* regcomp.h: Remove unused macrosKarl Williamson2018-10-201-4/+0
| | | | | I had kept these macros around for backwards compatibility. But now I realize regcomp.h is only for core use, so no need to retain them.
* regcomp.h: White-space, comments onlyKarl Williamson2018-10-201-7/+11
|