path: root/regcomp.h
Commit message | Author | Age | Files | Lines
* regcomp.h: Clarify comments | Karl Williamson | 2018-11-26 | 1 | -3/+3
|
* -Drv now turns on all regex debugging | Karl Williamson | 2018-11-16 | 1 | -24/+24
| | | | | This commit makes the v (verbose) modifier to -Dr do something: turn on all possible regex debugging.
* regcomp.h: Delete duplicate macro defn | Karl Williamson | 2018-11-16 | 1 | -2/+0
|
* Remove references to passes from regex compiler | Karl Williamson | 2018-10-20 | 1 | -3/+0
| | | | | | | | The previous commit removed the sizing pass, but to minimize the difference listing, it left in all the references it could to the various passes, with the first pass set to FALSE. This commit now removes those references, as well as to some variables that are no longer used.
* Remove sizing pass from regular expression compiler | Karl Williamson | 2018-10-20 | 1 | -2/+1
|
| This commit removes the sizing pass for regular expression compilation. It attempts to be the minimum required to do this. Future patches are in the works that improve it, and there is certainly lots more that could be done.
|
| This is being done now for security reasons, as there have been several bugs leading to CVEs where the sizing pass computed the size improperly, and a crafted pattern could allow an attack. This means that simple bugs can too easily become attack vectors.
|
| This is NOT the AST that people would like, but it should make it easier to move the code in that direction.
|
| Instead of a sizing pass, as the pattern is parsed, new space is malloc'd for each regnode found. To minimize the number of such mallocs that actually go out and request memory, an initial guess is made, based on the length of the pattern being compiled. The guessed amount is malloc'd and then immediately released. Hopefully that memory won't be gobbled up by another process before we actually gain control of it. The guess is currently simply the number of bytes in the pattern. Patches and/or suggestions are welcome on improving the guess or this method.
|
| This commit does not mean, however, that only one pass is done in all cases. Currently there are several situations that require extra passes. These are:
|
| a) If the pattern isn't UTF-8, but encounters a construct that requires it to become UTF-8, the parse is immediately stopped, the translation is done, and the parse restarted. This is unchanged from how it worked prior to this commit.
|
| b) If the pattern uses /d rules and it encounters a construct that requires it to become /u, the parse is immediately stopped and restarted using /u rules. A future enhancement is to only restart if something has been encountered that would generate something different than what has already been generated, as many operations are the same under both /d and /u. Prior to this commit, only in rare circumstances was the parse immediately restarted: only those few constructs that changed the sizing did so. Otherwise the sizing pass was allowed to complete and then the generation pass ran, using /u. Some CVEs were caused by faulty implementation here.
|
| c) Very large patterns may need to have long jumps in their program. Prior to this commit, that was determined in the sizing pass, and all jumps were made long during generation. Now, the first time the need for a long jump is detected, the parse is immediately restarted, and all jumps are made long. I haven't investigated enough to be sure, but it might be sufficient to just redo the current jump, making it long, and then switch to using just long jumps, without having to restart the parse from the beginning.
|
| d) If a reference that could be to capturing parentheses doesn't find such parentheses, a flag is set. For references that could be octal constants, they are assumed to be those constants instead of a capturing group. At the end of the parse, if the flag indicates either that the assumption(s) were wrong or that it is a fatal reference to a non-existent group, the pattern is reparsed knowing the total number of these groups.
|
| e) If (?R) or (?0) are encountered, the flag listed in item d) above is set to force a reparse. I did have code in place that avoided doing the reparse, but I wasn't confident enough that our test suite exercises that area of the code enough to have exposed all the potential interaction bugs, and I think this construct is used rarely enough to not worry about avoiding the reparse at this point in the development.
|
| f) If (?|pattern) is encountered, the behavior is the same as in item e) above. The pattern will end up being reparsed after the total number of parenthesized groups is known. I decided not to invest the effort at this time in trying to get this to work without a reparse.
|
| It might be that if we are continuing the parse to just count parentheses, and we encounter a construct that normally would restart the parse immediately, that we could defer that restart. This would cut down the maximum number of parses required. As of this commit, the worst case is we find something that requires knowing all the parentheses; later we have to switch to /u rules and so the parse is restarted. Still later we have to switch to long jumps, and the parse is restarted again. Still later we have to upgrade to UTF-8, and the parse is restarted yet again. Then the parse is completed, and the final total of parentheses is known, so everything is redone a final time. Deferring those intermediate restarts would save a bunch of reparsing.
|
| Prior to this commit, warnings were issued only during the code generation pass, which didn't get called unless the sizing pass(es) completed successfully. But now, we don't know if the pass will succeed, fail, or whether it will have to be restarted. To avoid outputting the same warning more than once, the position in the parse of the last warning generated is kept (across parses). The code looks at that position when it is about to generate a warning. If the parsing has previously gotten that far, it assumes that the warning has already been generated, and suppresses it this time. The current state of parsing is such that I believe this assumption is valid. If the parses had divergent paths, that assumption could become invalid.
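The warning suppression described in the last paragraph of this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in regcomp.c: the `warn_state` type, its field names, and `maybe_warn()` are hypothetical, and the sketch assumes (as the message does) that every reparse walks the same path through the pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state that survives parse restarts; the names are
 * illustrative and are not taken from regcomp.c. */
typedef struct {
    size_t next_unwarned;    /* first pattern offset not yet warned about */
    int    warnings_emitted; /* stands in for actually printing a warning */
} warn_state;

/* Returns 1 if a warning at pattern offset `pos` is emitted, 0 if it is
 * suppressed because an earlier parse already got at least this far. */
static int maybe_warn(warn_state *st, size_t pos)
{
    assert(st != NULL);
    if (pos < st->next_unwarned)
        return 0;                 /* already reported by a previous parse */
    st->next_unwarned = pos + 1;  /* remember how far warnings have reached */
    st->warnings_emitted++;
    return 1;
}
```

If a first parse warns at offsets 3 and 7 and is then restarted, calling `maybe_warn()` again for 3 and 7 returns 0, while a new warning at offset 9 is still emitted. The scheme breaks down exactly where the message says it would: if two parses could reach different warnings at the same or an earlier offset.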
* regcomp.c: Use regnode offsets during parsing | Karl Williamson | 2018-10-20 | 1 | -22/+22
| | | | | | | | | | | | | | | | | | | This changes the pattern parsing to use offsets from the first node in the pattern program, rather than direct addresses of such nodes. This is in preparation for a later change in which more mallocs will be done which will change those addresses, whereas the offsets will remain constant. Once the final program space is allocated, real addresses are used as currently. This limits the necessary changes to a few functions. Also, real addresses are used if they are constant across a function; again this limits the changes. Doing this introduces a new typedef for clarity 'regnode_offset' which is not a pointer, but a count. This necessitates changing a bunch of things to use 0 instead of NULL to indicate an error. A new boolean is also required to indicate if we are in the first or second passes of the pattern. And separate heap space is allocated for scratch during the first pass.
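The reason offsets survive reallocation while addresses do not can be shown with a small sketch. Everything here is an assumption made for illustration: `node`, `node_offset`, `program`, `emit()` and `node_at()` are hypothetical stand-ins, not the real `regnode` or `regnode_offset` definitions. Slot 0 is reserved so that an offset of 0 can signal an error, mirroring the switch from NULL to 0 mentioned in the message.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t node;        /* stand-in for a regnode */
typedef size_t   node_offset; /* a count from the program start, not a pointer */

typedef struct {
    node  *base; /* may move whenever the program grows */
    size_t used; /* slot 0 is reserved so that offset 0 can mean "error" */
    size_t cap;
} program;

static int program_init(program *p)
{
    assert(p != NULL);
    p->cap  = 2;
    p->used = 1;
    p->base = malloc(p->cap * sizeof *p->base);
    if (p->base == NULL)
        return 0;
    p->base[0] = 0;
    return 1;
}

/* Appends a node and returns its offset, or 0 on allocation failure. */
static node_offset emit(program *p, node value)
{
    assert(p != NULL);
    if (p->used == p->cap) {
        size_t new_cap = p->cap * 2;
        node *grown = realloc(p->base, new_cap * sizeof *grown);
        if (grown == NULL)
            return 0;
        p->base = grown; /* old addresses into the program are now stale */
        p->cap  = new_cap;
    }
    p->base[p->used] = value;
    return p->used++;
}

/* Converts an offset to a real address; valid only until the next emit(). */
static node *node_at(program *p, node_offset off)
{
    return p->base + off;
}
```

A caller that stores the `node_offset` returned by `emit()` can still find its node after a hundred further `emit()` calls, whereas a saved `node *` may dangle after the first `realloc()`.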
* regcomp.sym: Add lengths for ANYOF nodes | Karl Williamson | 2018-10-20 | 1 | -21/+14
| | | | | | This changes regcomp.sym to generate the correct lengths for ANYOF nodes, which means they don't have to be special-cased in regcomp.c, leading to simplification.
* regcomp.h: Swap struct vs typedef | Karl Williamson | 2018-10-20 | 1 | -1/+1
| | | | | | | This struct has two names. I previously left the less descriptive one as the primary because of back compat issues. But I now realize that regcomp.h is only used in the core, so it's ok to swap for the better name to be primary.
* regcomp.h: Add some macros | Karl Williamson | 2018-10-20 | 1 | -4/+13
| | | | | | | | | | These are used to allow the bit map of run-time /[[:posix:]]/l classes to be stored in a variable, and not just in the argument of an ANYOF node. This will enable the next commit to use such a variable. The current macros are rewritten to just call the new ones with the proper arguments. A macro of a different sort is also created to allow one to set the entire bit field in the node at once.
* regcomp.h: Remove unused macros | Karl Williamson | 2018-10-20 | 1 | -4/+0
| | | | | I had kept these macros around for backwards compatibility. But now I realize regcomp.h is only for core use, so no need to retain them.
* regcomp.h: White-space, comments only | Karl Williamson | 2018-10-20 | 1 | -7/+11
|
* regcomp.h: Create FILL_NODE macro and use it | Karl Williamson | 2018-10-20 | 1 | -2/+10
| | | | | | This is a more fundamental operation than the pre-existing FILL_ADVANCE_NODE, which is changed to use this for the portion that fills the node but doesn't advance the pointer.
* Change REG_INFTY to 2**16-1, instead of 2**15-1 | Karl Williamson | 2018-10-01 | 1 | -7/+7
| | | | | | | | | | | | This commit doubles the upper limit that unbounded regular expression quantifiers can match up to. Things like {m,} "+" and "*" now can match up to U16_MAX times. We probably should make this a 32 bit value, but doing this doubling was easy and has fewer potential implications. See http://nntp.perl.org/group/perl.perl5.porters/251413 and followups
* Don't overallocate space for /[foo]/ | Karl Williamson | 2018-09-20 | 1 | -2/+4
| | | | | The ANYOF regnode type (generated by bracketed character classes and \p{}) was allocating too much space because the argument field was being counted twice.
* regcomp.h: Add comment | Karl Williamson | 2017-12-25 | 1 | -1/+1
|
* Change number to mnemonic | Karl Williamson | 2017-12-18 | 1 | -0/+2
| | | | | This is in preparation for future commits that will use it in multiple places
* regcomp.h: Add comment | Karl Williamson | 2017-10-28 | 1 | -0/+1
|
* Change upper limit handling of -Dr output | Karl Williamson | 2017-10-27 | 1 | -15/+12
| | | | | | | | | | | | | | Commit 2bfbbbaf9ef1783ba914ff9e9270e877fbbb6aba changed things so -Dr output could be changed through an environment variable to truncate the output differently than the default. For most purposes, the default is good enough, but for someone trying to debug the regcomp internals, sometimes one wants to see more than is output by default. That commit did not catch all the places. This one changes the handling so that any place that used the previous default maximum now uses the environment variable (if set) instead.
* Don't use VOL internally, because "volatile" works just fine | Aaron Crane | 2017-10-21 | 1 | -1/+1
| | | | However, we do preserve it outside PERL_CORE for the use of XS authors.
* better handle freeing of code blocks in /(?{...})/ | David Mitchell | 2017-01-24 | 1 | -2/+1
|
| [perl #129140] attempting double-free
|
| This fixes some leaks and double frees in regexes which contain code blocks.
|
| During compilation, an array of struct reg_code_block's is malloced. Initially this is just attached to the RExC_state_t struct local var in Perl_re_op_compile(). Later it may be attached to a pattern. The difficulty is ensuring that the array is free()d (and the ref counts contained within decremented) should compilation croak early, while avoiding double frees once the array has been attached to a regex.
|
| The current mechanism of making the array the PVX of an SV is a bit flaky, as the array can be realloced(), and code can be re-entered when utf8 is detected mid-compilation.
|
| This commit changes the array into separately malloced head and body. The body contains the actual array, and can be realloced. The head contains a pointer to the array, plus size and an 'attached' boolean. This indicates whether the struct has been attached to a regex, and is effectively a 1-bit ref count.
|
| Whenever a head is allocated, SAVEDESTRUCTOR_X() is used to call S_free_codeblocks() to free the head and body on scope exit. This function skips the freeing if 'attached' is true, and this flag is set only at the point where the head gets attached to the regex.
|
| In one way this complicates the code, since the num_code_blocks field is now not always available (it's only there if a head has been allocated), but mainly it simplifies, since all the book-keeping is now done in the two new static functions S_alloc_code_blocks() and S_free_codeblocks().
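The head/body split with an 'attached' flag can be sketched as below. This is a simplified illustration under stated assumptions: `code_block`, `code_blocks_head`, `alloc_code_blocks()` and `free_code_blocks_on_scope_exit()` are hypothetical names, the real structures carry more fields and ref counts, and the scope-exit hook (SAVEDESTRUCTOR_X in the real code) is represented here by calling the cleanup function by hand.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int start, end; } code_block; /* stand-in for struct reg_code_block */

typedef struct {
    code_block *body;     /* the actual array; may be realloc'd independently */
    size_t      count;
    int         attached; /* 1 once a regex owns the head: a 1-bit ref count */
} code_blocks_head;

static code_blocks_head *alloc_code_blocks(size_t n)
{
    code_blocks_head *h = malloc(sizeof *h);
    if (h == NULL)
        return NULL;
    h->body = calloc(n ? n : 1, sizeof *h->body);
    if (h->body == NULL) {
        free(h);
        return NULL;
    }
    h->count    = n;
    h->attached = 0;
    return h;
}

/* Cleanup that would be registered to run on scope exit.
 * Returns 1 if it freed the head and body, 0 if it skipped them. */
static int free_code_blocks_on_scope_exit(code_blocks_head *h)
{
    assert(h != NULL);
    if (h->attached)
        return 0; /* the regex owns it now; freeing here would be a double free */
    free(h->body);
    free(h);
    return 1;
}
```

Because the cleanup receives the head rather than the array, reallocating `body` mid-compilation does not invalidate what the cleanup was registered with, which is the weakness the message attributes to the earlier PVX-based scheme.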
* regcomp.[ch]: Comments, White-space, only | Karl Williamson | 2016-07-16 | 1 | -1/+1
| | | | | | | | This indents code and reflows the comments to account for the enclosing block added by the previous commit. At the same time, it makes some other miscellaneous white space changes, and adds and revises other comments.
* regcomp.h: Use #define mnemonic, not hard-coded number | Karl Williamson | 2016-07-16 | 1 | -1/+1
|
* silence -Wparentheses-equality | David Mitchell | 2016-03-28 | 1 | -1/+1
|
| Clang has taken it upon itself to warn when an equality is wrapped in double parentheses, e.g.
|
|     ((foo == bar))
|
| Which is a bit dumb, as any code along the lines of
|
|     #define isBAR (foo == BAR)
|     if (isBAR) {}
|
| will trigger the warning.
|
| This commit shuts clang up by putting in a harmless cast:
|
|     #define isBAR cBOOL(foo == BAR)
* fix Perl #126182, out of memory due to infinite pattern recursion | Yves Orton | 2016-03-06 | 1 | -2/+0
|
| The way we tracked if pattern recursion was infinite did not work properly. A pattern like "a"=~/(.(?2))((?<=(?=(?1)).))/ would loop forever, slowly eating up all available RAM as it added pattern recursion stack frames.
|
| This patch changes the rules for recursion so that recursively entering a given pattern "subroutine" twice from the same position fails the match. This means that where previously we might have seen a fatal exception we will now simply fail. This means that "aaabbb"=~/a(?R)?b/ succeeds with $& equal to "aaabbb".
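The rule stated above (entering the same pattern "subroutine" twice from the same position fails the match) can be illustrated with a small sketch. It assumes a simple explicit stack of frames; the type and function names are hypothetical, and the real engine may well track this differently.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 64 /* hypothetical cap, only to keep the sketch bounded */

typedef struct { unsigned parno; size_t pos; } frame;

typedef struct {
    frame  frames[MAX_DEPTH];
    size_t depth;
} recursion_stack;

/* Returns 1 if recursion into group `parno` at input position `pos` may
 * proceed, or 0 if the same group was already entered from the same
 * position, in which case this branch of the match should simply fail. */
static int enter_subpattern(recursion_stack *rs, unsigned parno, size_t pos)
{
    size_t i;
    assert(rs != NULL);
    for (i = 0; i < rs->depth; i++) {
        if (rs->frames[i].parno == parno && rs->frames[i].pos == pos)
            return 0; /* no input consumed since the last entry: infinite loop */
    }
    if (rs->depth == MAX_DEPTH)
        return 0;
    rs->frames[rs->depth].parno = parno;
    rs->frames[rs->depth].pos   = pos;
    rs->depth++;
    return 1;
}

static void leave_subpattern(recursion_stack *rs)
{
    assert(rs != NULL);
    if (rs->depth > 0)
        rs->depth--;
}
```

With /a(?R)?b/ each recursion happens after another "a" has been consumed, so the position differs at every level and the check never fires; a pattern that recurses without consuming input hits the same (group, position) pair and fails instead of growing the stack.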
* Unify GOSTART and GOSUB | Yves Orton | 2016-03-06 | 1 | -1/+2
| | | | | | | | | | | | | | | | GOSTART is a special case of GOSUB, we can remove a lot of offset twiddling, and other special casing by unifying them, at pretty much no cost. GOSUB has 2 arguments, ARG() and ARG2L(), which are interpreted as a U32 and an I32 respectively. ARG() holds the "parno" we will recurse into. ARG2L() holds a signed offset to the relevant start node for the recursion. Prior to this patch the argument to GOSUB would always be >=, and unlike other parts of our logic we would not use 0 to represent "start/end" of pattern, as GOSTART would be used for "recurse to beginning of pattern", after this patch we use 0 to represent "start/end", and a lot of complexity "goes away" along with GOSTART regops.
* Add environment variable for -Dr: PERL_DUMP_RE_MAX_LEN | Karl Williamson | 2016-02-19 | 1 | -9/+12
| | | | | | | | | | | | | | | | | | | The regex engine when displaying debugging info, say under -Dr, will elide data in order to keep the output from getting too long. For example, the number of code points in all of Unicode matched by \w is quite large, and so when displaying a pattern that matches this, only the first some number of them are printed, and the rest are truncated, represented by "...". Sometimes, one wants to see more than what the compiled-into-the-engine-max shows. This commit creates code to read this environment variable to override the default max lengths. This changes the lengths for everything to the input number, even if they have different compiled maximums in the absence of this variable. I'm not currently documenting this variable, as I don't think it works properly under threads, and we may want to alter the behavior in various ways as a result of gaining experience with using it.
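Reading an override like this from the environment might look roughly like the sketch below. Only the variable name PERL_DUMP_RE_MAX_LEN comes from the commit; the helper names and the choice to fall back to the compiled default on unparsable input are assumptions, since the message does not say how invalid values are treated.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Parses `text` as a non-negative decimal length. Falls back to
 * `compiled_default` when the text is missing or malformed (an assumed
 * policy; the commit message does not specify one). */
static size_t parse_max_len(const char *text, size_t compiled_default)
{
    char *end;
    unsigned long value;

    assert(compiled_default > 0); /* a zero default would elide everything */
    if (text == NULL || !isdigit((unsigned char)text[0]))
        return compiled_default;
    errno = 0;
    value = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0')
        return compiled_default;
    return (size_t)value;
}

/* One limit is applied everywhere, matching the message's statement that
 * the variable changes the lengths for everything to the input number. */
static size_t dump_max_len(size_t compiled_default)
{
    return parse_max_len(getenv("PERL_DUMP_RE_MAX_LEN"), compiled_default);
}
```

Splitting the parsing from the `getenv()` call keeps the policy testable without touching the process environment.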
* regcomp.h: Not all ANYOF flags are in use. | Karl Williamson | 2016-02-18 | 1 | -1/+1
| | | | So, it's better to not have a mask to include the unused ones.
* reverse the order of POPBLOCK; POPFOO | David Mitchell | 2016-02-03 | 1 | -1/+1
|
| Currently most pp_leavefoo subs have something along the lines of
|
|     POPBLOCK(cx);
|     POPFOO(cx);
|
| where POPBLOCK does cxstack_ix-- and sets cx to point to the top CX stack entry. It then restores a bunch of PL_ vars saved in the CX struct. Then POPFOO does any type-specific restoration, e.g. POPSUB decrements the ref count of the cv that was just executed.
|
| However, this is logically the wrong order. When we *enter* a scope, we do
|
|     PUSHBLOCK;
|     PUSHFOO;
|
| so undoing the PUSHBLOCK should be the last thing we do. As it happens, it doesn't really make any difference to the running, which is why we've never fixed it before.
|
| Reordering it has two advantages. First, it allows the steps for scope exit to be the exact logical reverse of scope entry, which makes understanding what's going on and debugging easier. Second, it allows us to make the code cleaner.
|
| This commit also removes the cxstack_ix-- and setting cx steps from POPBLOCK; now we already expect cx to be set (which it usually already is) and we do the cxstack_ix-- ourselves. This also means we can remove a whole bunch of cxstack_ix++'s that were added immediately after the POPBLOCK in order to prevent the context being inadvertently overwritten before we've finished using it.
|
| So in full,
|
|     POPBLOCK(cx);
|     POPFOO(cx);
|
| is now implemented as:
|
|     cx = &cxstack[cxstack_ix];
|     ... other stuff done with cx ...
|     POPFOO(cx);
|     POPBLOCK(cx);
|     cxstack_ix--;
|
| Finally, this commit also tweaks PL_curcop in pp_leaveeval, since otherwise PL_curcop could temporarily be NULL when debugging code is called in the presence of 'use re Debug'. It also stops the debugging code crashing if PL_curcop is still NULL.
* Add qr/\b{lb}/ | Karl Williamson | 2016-01-19 | 1 | -0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the final Unicode boundary type previously missing from core Perl: the LineBreak one. This feature is already available in the Unicode::LineBreak module, but I've been told that there are portability and some other issues with that module. What's added here is a light-weight version that is lacking the customizable features of the module. This implements the default Line Breaking algorithm, but with the customizations that Unicode is expecting everybody to add, as their test file tests for them. In other words, this passes Unicode's fairly extensive furnished tests, but wouldn't if it didn't include certain customizations specified by Unicode beyond the basic algorithm. The implementation uses a look-up table of the characters surrounding a boundary to see if it is a suitable place to break a line. In a few cases, context needs to be taken into account, so there is code in addition to the lookup table to handle those. This should meet the needs for line breaking of many applications, without having to load the module. The algorithm is somewhat independent of the Unicode version, just like the other boundary types. Only if new rules are added, or existing ones modified is there need to go in and change this code. Otherwise, running regen/mk_invlists.pl should be sufficient when a new Unicode release is done to keep it up-to-date, again like the other Unicode boundary types.
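The shape of "a look-up table plus extra code for the few context-dependent cases" can be sketched as follows. This is a toy, not the generated tables in Perl: the four classes, the table entries, and the single context rule (no break after an opening punctuation even across intervening spaces) are a deliberately simplified illustration and should not be read as the Unicode line breaking data.

```c
#include <assert.h>
#include <stddef.h>

/* Toy line-break classes; the real algorithm has many more. */
typedef enum { LB_AL, LB_SP, LB_OP, LB_CL, LB_COUNT } lb_class;

typedef enum { NO_BREAK, BREAK, NEEDS_CONTEXT } lb_action;

/* lb_table[before][after]: what to do at the boundary between two classes.
 * The entries are illustrative only. */
static const lb_action lb_table[LB_COUNT][LB_COUNT] = {
    /* after:    AL             SP        OP             CL       */
    /* AL */ { NO_BREAK,      NO_BREAK, NO_BREAK,      NO_BREAK },
    /* SP */ { NEEDS_CONTEXT, NO_BREAK, NEEDS_CONTEXT, NO_BREAK },
    /* OP */ { NO_BREAK,      NO_BREAK, NO_BREAK,      NO_BREAK },
    /* CL */ { BREAK,         NO_BREAK, BREAK,         NO_BREAK },
};

/* Returns 1 if a line may be broken between text[i-1] and text[i]. */
static int break_allowed(const lb_class *text, size_t len, size_t i)
{
    lb_action action;
    size_t j;

    assert(text != NULL && i > 0 && i < len);
    action = lb_table[text[i - 1]][text[i]];
    if (action != NEEDS_CONTEXT)
        return action == BREAK; /* the common case: one table lookup */

    /* Context case: look back past the run of spaces. If it was opened by
     * an OP, the break is forbidden; otherwise breaking after spaces is fine. */
    j = i - 1;
    while (j > 0 && text[j] == LB_SP)
        j--;
    return text[j] != LB_OP;
}
```

Most boundaries are settled by the table alone; only the entries marked `NEEDS_CONTEXT` fall through to hand-written code, which is the division of labour the commit message describes.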
* regcomp.h: Remove extraneous 'struct's | Karl Williamson | 2015-12-26 | 1 | -3/+3
| | | | Better to not have this clutter.
* regcomp.h: Fix shift and mask | Karl Williamson | 2015-12-26 | 1 | -1/+1
| | | | | | | | | | | The mask removed here was to make sure that right shifting didn't propagate the sign bit, but is unnecessary as the value shifted is unsigned. And confining things to a U8 with that mask assumes that the bit vector being operated on has 256 elements max. This isn't necessarily true these days, as one can change ANYOF_BITMAP_SIZE. In fact changing that number was failing until this commit. It also adds white space to make it easier to read.
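The effect of such a mask on a bitmap larger than 256 bits can be demonstrated with a small sketch. The commit message does not show the macro itself, so the `& 31` below is an assumed stand-in for a mask that confines the byte index to a 32-byte (256-bit) map, and `BITMAP_BITS`, `set_bit()`, `test_bit()` and `test_bit_masked()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define BITMAP_BITS 512u /* hypothetical: larger than the traditional 256 */

static void set_bit(uint8_t *bitmap, unsigned c)
{
    assert(c < BITMAP_BITS);
    bitmap[c >> 3] |= (uint8_t)(1u << (c & 7));
}

/* Unsigned right shift cannot propagate a sign bit, so no mask is needed
 * on the byte index. */
static int test_bit(const uint8_t *bitmap, unsigned c)
{
    assert(c < BITMAP_BITS);
    return (bitmap[c >> 3] >> (c & 7)) & 1;
}

/* Assumed old-style lookup: masking the byte index to 0..31 silently wraps
 * any bit number of 256 or above back into the first 32 bytes. */
static int test_bit_masked(const uint8_t *bitmap, unsigned c)
{
    return (bitmap[(c >> 3) & 31] >> (c & 7)) & 1;
}
```

With a 512-bit map, bit 300 lives in byte 37; the masked lookup reads byte 5 instead and reports the bit as unset, which is the kind of failure one would expect when enlarging the bitmap with the mask still in place.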
* regcomp.h: Use more basic macro in #defines | Karl Williamson | 2015-12-26 | 1 | -2/+2
| | | | | Instead of having this code repeated in several places, call the more base macro from the others.
* regcomp.h: Free up bit in ANYOF FLAGS field | Karl Williamson | 2015-12-26 | 1 | -71/+64
| | | | | | | | | | | I've long been confronted with trying to do things to create a spare bit to use. I thought it easier now, while it's fresh in my mind, to free up one for future use, rather than re-learn things when it next becomes necessary. It would have been a different story if the freed bit had required a performance penalty. This commit also updates the comments about how to create even more spare bits should it become necessary.
* regcomp.h: Shorten, clarify names of internal flags | Karl Williamson | 2015-12-26 | 1 | -16/+17
| | | | Some of the names are expanded slightly and not shortened
* regcomp.h: reword some comments | Karl Williamson | 2015-12-22 | 1 | -33/+32
|
* regcomp.h: Add comments | Karl Williamson | 2015-12-17 | 1 | -40/+119
|
* regex matching: Don't do unnecessary work | Karl Williamson | 2015-12-17 | 1 | -1/+3
| | | | | | This commit sets a flag at pattern compilation time to indicate if a rare case is present that requires special handling, so that that handling can be avoided unless necessary.
* regcomp.h: Renumber 2 flag bits | Karl Williamson | 2015-12-17 | 1 | -4/+4
| | | | | | This changes the spare bit to be adjacent to the LOC_FOLD bit, in preparation for the next commit, which will use that bit for a LOC_FOLD-related use.
* regex: Free a ANYOF node bit | Karl Williamson | 2015-12-17 | 1 | -10/+16
| | | | | | | | This is done by combining 2 mutually exclusive bits into one. I hadn't seen this possibility before because the name of one of them misled me. It also misled me into turning on that flag unnecessarily, and to miss opportunities to not have to create a swash at runtime. This commit corrects those things as well.
* regcomp.[ch]: Comment additions, fixes | Karl Williamson | 2015-09-03 | 1 | -23/+46
|
* regcomp.h: Reorder some flag definitions. | Karl Williamson | 2015-09-03 | 1 | -15/+16
| | | | | This places the flag bits of like-type flags adjacent for convenience in reading the code. It also improves the commentary about their purposes.
* regcomp.h: SSC no longer has to be strict ANYOF | Karl Williamson | 2015-09-03 | 1 | -1/+1
| | | | | | Since commit a0bd1a30d379f2625c307657d63fc50173d7a56d, a synthetic start class node can be just an ANYOF-type node. I don't think this causes a bug, just misses a potential optimisation.
* Make qr/(?[ ])/ work in UTF-8 locales | Karl Williamson | 2015-08-24 | 1 | -2/+6
| | | | | | | | | | | | | | | | Previously use of this under /l regex rules was a compile time error. Now it works like \b{wb} and \b{sb}, which compile under locale rules and always work like Unicode says they should. A UTF-8 locale implies Unicode rules, and the goal is for it to work seamlessly with the rest of perl. This construct was the only one I am aware of that didn't work seamlessly (not counting OS interfaces) under UTF-8 LC_CTYPE locales. For all three of these constructs, use with a non-UTF-8 runtime locale raises a warning, and Unicode rules are used anyway. UTF-8 locale collation still has problems, but this is low priority to fix, as it's a lot of work, and if one really cares, one should be using Unicode::Collate.
* regcomp.h: Fold 2 ANYOF flags into a single one | Karl Williamson | 2015-08-24 | 1 | -8/+13
| | | | | | | | | | | | | | | | | | The ANYOF_FLAGS bits are all used up, but a future commit wants one. This commit frees up a bit by sharing two of the existing comparatively-rarely-used ones. One bit is used only under /d matching rules, while the other is used only when not under /d. Only the latter bit is used in synthetic start classes. The previous commit introduced an ANYOFD node type corresponding to /d. An SSC never is this type. Thus, the bits have mutually exclusive meanings, and we can use the node type to distinguish between the two meanings of the combined bit. An alternative implementation would have been to use the ANYOF_HAS_NONBITMAP_NON_UTF8_MATCHES non-/d bit instead of the one chosen. But this is used more frequently, so the disambiguation would have been exercised more frequently, slowing execution down ever so slightly; more importantly, this one required fewer code changes, by a slight amount.
* remove deprecated /\C/ RE character class | David Mitchell | 2015-06-19 | 1 | -2/+1
| | | | | | This horrible thing broke encapsulation and was as buggy as a very buggy thing. It's been officially deprecated since 5.20.0 and now it can finally die die die!!!!
* Replace common Emacs file-local variables with dir-locals | Dagfinn Ilmari Mannsåker | 2015-03-22 | 1 | -6/+0
| | | | | | | | | | | | | | | | An empty cpan/.dir-locals.el stops Emacs using the core defaults for code imported from CPAN. Committer's work: To keep t/porting/cmp_version.t and t/porting/utils.t happy, $VERSION needed to be incremented in many files, including throughout dist/PathTools. perldelta entry for module updates. Add two Emacs control files to MANIFEST; re-sort MANIFEST. For: RT #124119.
* \s matching VT is no longer experimental | Karl Williamson | 2015-02-21 | 1 | -2/+0
| | | | | | | This was experimentally introduced in 5.18, and no issues were raised, except that it got us to thinking and spurred us to stop allowing $^X, where 'X' is a non-printable control character, and that change caused some issues.
* Add \b{sb} | Karl Williamson | 2015-02-19 | 1 | -0/+1
|
* Add qr/\b{wb}/ | Karl Williamson | 2015-02-19 | 1 | -1/+2
|
* Add qr/\b{gcb}/ | Karl Williamson | 2015-02-19 | 1 | -0/+5
| | | | | | | | | | | A function implements seeing if the space between any two characters is a grapheme cluster break. After I wrote this, I realized that an array lookup might be a better implementation, but the deadline for v5.22 was too close to change it. I did see that my gcc optimized it down to an array lookup. This makes the implementation of \X go from being complicated to trivial.