path: root/regcomp.c
Commit message | Author | Age | Files | Lines
* Fix rules for parsing numeric escapes in regexes | Yves Orton | 2013-06-25 | 1 | -9/+22
    Commit 726ee55d introduced better handling of things like \87 in a regex, but as an unfortunate side effect broke latex2html.

    The rules for handling backslashes in regexen are a bit arcane. Anything starting with \0 is octal. The sequences \1 through \9 are always backrefs. Any other sequence is interpreted as a decimal, and if there are that many capture buffers defined in the pattern at that point then the sequence is a backreference. If however it is larger than the number of buffers the sequence is treated as an octal escape.

    A consequence of this is that \118 could be a backreference to the 118th capture buffer, or it could be the string "\11" . "8". In other words, depending on the context we might even use a different number of digits for the escape!

    This also left an awkward edge case, of multi-digit sequences starting with 8 or 9 like m/\87/, which would result in us parsing as though we had seen /87/ (iow a null byte at the start) or worse like /\x{00}87/, which is clearly wrong.

    This patch fixes the cases where the capture buffers are defined, and causes things like \87 or \97 to throw the same error that /\8/ would. One might argue we should complain about an illegal octal sequence, but this seems more consistent with an error like /\9/ and IMO will be less surprising in an error message.

    This patch includes exhaustive tests of patterns of the form /(a)\1/, /((a))\2/ etc., so that we don't break this again if we change the logic more.
* regcomp.c:regdump_intflags: rem unused var | Father Chrysostomos | 2013-06-22 | 1 | -1/+0
* Fix and add tests for *PRUNE/*THEN plus leading non-greedy + | Yves Orton | 2013-06-22 | 1 | -4/+9
    "aaabc" should match /a+?(*THEN)bc/ with "abc".
* Show intflags as well as extflags | Yves Orton | 2013-06-22 | 1 | -1/+27
* regcomp.c: Reorder tests to avoid throwing away work. | Karl Williamson | 2013-06-17 | 1 | -6/+6
    Prior to this patch, we did work based on a test, and that work could be thrown away as a result of the next test. This reorders the tests so that doesn't happen.
* Possessive and non greedy quantifier modifiers are mutually exclusive | Yves Orton | 2013-06-13 | 1 | -10/+2
    When I added support for possessive modifiers it was possible to build perl so that they could be combined even if it made no sense to do so.

    Later on in relation to Perl #118375 I got confused and thought they were undocumented but legal.

    So to prevent further confusion, and since nobody has ever mentioned it since they were added, I am removing the unused conditional build logic, and clearly documenting why they aren't allowed.
* do not warn when optimizing away /x{0,0}?+/ and /x{0,0}+/ | Yves Orton | 2013-06-12 | 1 | -3/+10
    In c37d14f947f7998211b0455e453160fb7e15b22e Karl fixed an issue reported in [perl #118375] "5.18 regex regression Quantifier follows nothing in regex", but he fixed only the non-greedy modifier mentioned in the ticket, and did not include support for the other quantifier modifiers like the non-greedy possessive (stupid but not illegal), and the possessive (useful) modifiers. Hopefully this covers them all.

    Note that because Karl already included support for m/x {0,0} ?/x I have done so as well for the new cases. I do not necessarily endorse the idea that it is legal or should be tested for. I am inclined to think that '{0,0}?+' should be indivisible even under /x.
* Quantifier follows nothing in regex | Karl Williamson | 2013-06-10 | 1 | -0/+6
    This changes the code so that any '?' immediately (subject to /x rules) following a {0} quantifier is absorbed immediately as part of that quantifier. This allows the optimization to NOTHING to work for non-greedy matches.

    Thanks to Yves Orton for doing most of the leg work on this.
* Stop /(a|b)(?=a){3}/ from warning twice | Father Chrysostomos | 2013-06-09 | 1 | -4/+11
    [sprout@dromedary-001 perl2.git]$ ../perl.git/Porting/bisect.pl --target=miniperl --start=perl-5.8.0 --end=v5.10.0 -- ./miniperl -Ilib -we 'BEGIN { $SIG{__WARN__} = sub { die if $_[0] =~ /Quantifier/ && $warned++; warn shift }}; ""=~/(N|N)(?{})?/'
    ...
    07be1b83a6b2d24b492356181ddf70e1c7917ae3 is the first bad commit
    commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3
    Author: Yves Orton <demerphq@gmail.com>
    Date: Fri Jun 9 02:56:37 2006 +0200

        Re: [PATCH] Better version of the Aho-Corasick patch and lots of benchmarks.
        Message-ID: <9b18b3110606081556t779de698r82f361d82a05fbc8@mail.gmail.com>

        (with tweaks)

        p4raw-id: //depot/perl@28373

    Since that commit, it has been possible for S_study_chunk to be called twice if the trie optimisation kicks in (which happens for /(a|b)/).

    ‘Quantifier unexpected on zero-length expression’ is the only warning in S_study_chunk. Now it can appear twice if the quantified zero-length expression is in the same regexp as a trie optimisation.

    So pass a flag to S_study_chunk when ‘restudying’ to indicate that the warning should be skipped.

    There are two code paths that call S_study_chunk, one for when there is no top-level alternation, which triggers the error in this case, and one for when there is a top-level alternation (/a|b/). I wasn’t able to figure out how to trigger the double warning in the second case, but I passed the flag for the restudy in that code path anyway, since I don’t think it can be wrong.
* Allow qr/(?[ [a] ])/ interpolation in (?[...]) | Father Chrysostomos | 2013-06-07 | 1 | -0/+5
    Interpolation fails if the interpolated extended character class contains any bracketed character classes itself.

    The sizing pass looks for [ and passes control to the regular character class parser. When the charclass is finished, it begins scanning for [ again. If it finds ], it assumes it is the end.

    That fails with (?[ (?a:(?[ [a] ])) ]). The sizing pass hands [ [a] ]... off to the charclass parser, which parses [ [a] and hands control back to the sizing pass. It then sees ‘ ])) ])’, assumes that the first ]) is the end of the entire construct, so the main regexp parser sees the parenthesis following and dies.

    If we change the sizing pass to look for ?[ we can simply record the depth (depth++) and then when we see ] decrement the depth or exit the loop if it is zero.
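The depth-counting idea in the last paragraph can be sketched as a standalone scanner. This is a simplified illustration, not the sizing pass itself: `find_ext_class_end` is a hypothetical name, and the inner loop that skips an ordinary bracketed class stands in for the real character class parser (it ignores escapes and a leading `]`).

```c
#include <assert.h>
#include <stddef.h>

/* `s` points just past the opening "(?[" of an extended character class.
 * Returns the offset of the ']' that closes it, or -1 if there is none. */
long find_ext_class_end(const char *s)
{
    int depth = 0;      /* number of nested "?[" constructs currently open */
    size_t i = 0;

    assert(s != NULL);

    while (s[i] != '\0') {
        if (s[i] == '?' && s[i + 1] == '[') {
            depth++;                    /* nested (?[ ... ]) opens */
            i += 2;
            continue;
        }
        if (s[i] == '[') {
            /* ordinary bracketed class: skip to its closing ']' */
            i++;
            while (s[i] != '\0' && s[i] != ']')
                i++;
            if (s[i] == '\0')
                return -1;
            i++;
            continue;
        }
        if (s[i] == ']') {
            if (depth == 0)
                return (long)i;         /* end of the outermost construct */
            depth--;                    /* closes a nested construct */
        }
        i++;
    }
    return -1;
}
```

For the failing pattern from the message, the text after the first `(?[` is ` (?a:(?[ [a] ])) ])`, and the scanner skips the nested `])` and stops at the final `]`.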
* Don’t leak when compiling /(?[\d\d])/ | Father Chrysostomos | 2013-06-06 | 1 | -0/+2
    The ‘Operand with no preceding operator’ error was leaking the last two operands.
* Free operand when encountering unmatched ')' in (?[]) | Father Chrysostomos | 2013-06-06 | 1 | -0/+1
    I only need to free the operand (current), not the left-paren token that turns out not to be a paren (lparen). For lparen to leak, there would have to be two operands in a row on the charclass parsing stack, which currently never happens.
* Stop /(?[\p{...}])/ compilation from leaking | Father Chrysostomos | 2013-06-06 | 1 | -0/+1
    The swash returned by utf8_heavy.pl was not being freed in the code path to handle returning character classes to the (?[...]) parser (when ret_invlist is true).
* Stop (?[]) operators from leaking | Father Chrysostomos | 2013-06-06 | 1 | -6/+14
    When a (?[]) extended charclass is compiled, the various operands are stored as inversion lists in separate SVs and then combined together into new inversion lists. The functions that take care of combining inversion lists only ever free one operand, and sometimes free none. Most of the operators in (?[]) were trusting the invlist functions to free everything that was no longer needed, causing (?[]) compilation to leak invlists.
* In regcomp.c, Set_Node_Cur_Length() uses parse_start, so explicitly pass it. | Nicholas Clark | 2013-06-06 | 1 | -11/+10
    The macro Set_Node_Cur_Length() had been referring to the variable parse_start within its body. This somewhat secret reference is potentially risky, as it was always taking a parameter node, hence one might assume that that was all that it used, and change the value stored in parse_start. (Specifically, there is one place that assigns RExC_parse - 1 to parse_start, and later uses parse_start + 1, which looks like an obvious cleanup candidate.)

    So make parse_start an explicit parameter.

    Also, eliminate the never-used macro Set_Cur_Node_Length. This was added as part of commit fac927409d5ddf11 (April 2001) but never used.
* In S_regatom, declare parse_start when RE_TRACK_PATTERN_OFFSETS is defined. | Nicholas Clark | 2013-06-06 | 1 | -0/+2
    Commit 779fedd7c3021f01 (March 2013) moved code which unconditionally used parse_start into another block. Hence the variable is now only needed when RE_TRACK_PATTERN_OFFSETS is defined, so wrap its declaration in #ifdef/#endif to avoid C compiler warnings.
* Don’t leak the /(?[])/ parsing stack on error | Father Chrysostomos | 2013-06-05 | 1 | -2/+1
    Instead of creating the parsing stack and then freeing it after parsing the (?[...]) construct (leaking it whenever one of the various errors scattered throughout the parsing code occurs), mortalise it to begin with and let the mortals stack take care of it.
* [perl #118297] Fix interpolating downgraded variables into upgraded regexp | Dagfinn Ilmari Mannsåker | 2013-06-04 | 1 | -3/+2
    The code already upgraded the pattern if interpolating an upgraded string into it, but not vice versa. Just use sv_catsv_nomg() instead of sv_catpvn_nomg(), so that it can upgrade as necessary.
* eliminate PL_regdummy | David Mitchell | 2013-06-02 | 1 | -2/+5
    This global (per-interpreter) var is just used during regex compilation as a placeholder to point RExC_emit at during the first (non-emitting) pass, to indicate that nothing should be emitted.

    There's no need for it to be a global var: just add it as an extra field in the RExC_state_t struct instead.
* eliminate PL_reg_state | David Mitchell | 2013-06-02 | 1 | -11/+0
    This is a struct that holds all the global state of the current regex match.

    The previous set of commits have gradually removed all the fields of this struct (by making things local rather than global state). Since the struct is now empty, the PL_reg_state var can be removed, along with the SAVEt_RE_STATE save type which was used to save and restore those fields on recursive re-entry to the regex engine.
* eliminate PL_reg_poscache, PL_reg_poscache_size | David Mitchell | 2013-06-02 | 1 | -3/+0
    Eliminate these two global vars (well, fields in the global PL_reg_state) that hold the regex super-linear cache.

    PL_reg_poscache_size gets replaced with a field in the local regmatch_info struct, while PL_reg_poscache (which needs freeing at end of pattern execution or on croak()) goes in the regmatch_info_aux struct.

    Note that this includes a slight change in behaviour. Each regex execution now has its own private poscache pointer, initially null. If the super-linear behaviour is detected, the cache is malloced, used for the duration of the pattern match, then freed.

    The former behaviour allocated a global poscache on first use, which was retained between regex executions. Since the poscache could be between 0.25 and 2x the size of the string being matched, that could potentially be a big buffer lying around unused. So we save memory at the expense of a new malloc/free for every regex that triggers super-linear behaviour.

    The old behaviour saved the old pointer on reentrancy, then restored the old one (and possibly freed the new buffer) at exit. Except it didn't for (?{}), so /(?{ m{something-that-triggers-super-linear-cache} })/ would leak each time the inner regex was called. This is now fixed automatically.
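The new lifetime of the cache (null at the start of each match, allocated only when needed, freed when the match ends) might look roughly like this. It is a minimal sketch, not the engine's code: `match_info` and the three functions are hypothetical names, and it folds the size and pointer into one struct, whereas the commit keeps them in regmatch_info and regmatch_info_aux respectively.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-match state; each match owns its own cache. */
typedef struct {
    unsigned char *poscache;    /* NULL until super-linear behaviour is seen */
    size_t poscache_size;
} match_info;

/* Called on entry to a match: no cache exists yet. */
void match_begin(match_info *m)
{
    assert(m != NULL);
    m->poscache = NULL;
    m->poscache_size = 0;
}

/* Called when super-linear behaviour is detected.
 * Allocates the cache on first use; returns 0 on success, -1 on failure. */
int match_need_cache(match_info *m, size_t size)
{
    assert(m != NULL);
    if (m->poscache == NULL) {
        m->poscache = calloc(size ? size : 1, 1);
        if (m->poscache == NULL)
            return -1;
        m->poscache_size = size;
    }
    return 0;
}

/* Called at the end of the match (or from a cleanup handler on error). */
void match_end(match_info *m)
{
    assert(m != NULL);
    free(m->poscache);          /* free(NULL) is a no-op */
    m->poscache = NULL;
    m->poscache_size = 0;
}
```

Because the pointer lives in per-match state, a nested match started from inside a code block gets its own `match_info` and cannot clobber or leak the outer cache, which is the property the last paragraph relies on.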
* eliminate PL_reg_maxiter, PL_reg_leftiter | David Mitchell | 2013-06-02 | 1 | -2/+0
    Move these two fields of PL_reg_state into the regmatch_info struct, so they are local to each match.
* add regmatch_eval_state struct | David Mitchell | 2013-06-02 | 1 | -7/+0
    Replace several PL_reg* vars with a new struct. This is part of the goal of removing all global regex state.

    These particular vars are used in the case of a regex with (?{}) code blocks. In this case, when the code in a block is called, various bits of state (such as $1, pos($_)) are temporarily set up, even though the match has not yet completed.

    This involves updating the current PL_curpm to point to a fake PMOP which points to the regex currently being executed. That regex has all its current fields that are associated with captures (such as subbeg) temporarily saved and overwritten with the current partial match results. Similarly, $_ is temporarily aliased to the current match string, and any old pos() position is saved. This saving was formerly done to the various PL_reg* vars.

    When the regex has finished executing (or if the code block croaks), its fields are restored to the original values. Since this can happen in a croak, it may be done using SAVEDESTRUCTOR_X() on the save stack. This precludes just moving the PL_reg* vars into the regmatch_info struct, since that is just allocated as a local var in regexec_flags(), and would have already been abandoned and possibly overwritten after the croak and longjmp, but before the SAVEDESTRUCTOR_X() action is taken.

    So instead we put all the vars into a new struct, and malloc that on entry to the regex engine when we know we need to copy the various fields. We save a pointer to that in the regmatch_info struct, as well as passing it to SAVEDESTRUCTOR_X(). The destructor may get called up to twice in the non-croak case: first it's called explicitly at the end of regexec_flags(), which restores subbeg etc.; then again from the savestack, which just free()s the struct. In the croak case, it's called just once, and does both the restoring and the freeing.

    The vars / PL_reg_state fields this commit eliminates are:

        re_state_eval_setup_done
        PL_reg_oldsaved
        PL_reg_oldsavedlen
        PL_reg_oldsavedoffset
        PL_reg_oldsavedcoffset
        PL_reg_magic
        PL_reg_oldpos
        PL_nrs
        PL_reg_oldcurpm
* Make ‘Escape literal pattern white space’ a default warning | Father Chrysostomos | 2013-05-27 | 1 | -1/+1
    All deprecated warnings are supposed to be default warnings.
* perldiag: miscellaneous clean-up | Father Chrysostomos | 2013-05-26 | 1 | -0/+2
    • ‘Corrupted regexp opcode’ is a ‘can’t happen’ error, so it belongs in the P category.
    • Two spaces after dots for consistency
    • Rewrap for slightly better splain output
    • The description usually begins on the same line as the category, so do so consistently
    • Reorder alphabetically
    • Missing category
    • Single, not double, backslash
    • Squash two adjacent (due to reordering) entries with identical descriptions
    • ‘given’ does not depend on lexical $_ any more
    • Remove duplicate entries (and placate diag.t with diag_listed_as)
* regcomp.c: Actually emit proper warning | Karl Williamson | 2013-05-22 | 1 | -4/+8
    Before this commit, /\g/ raised the wrong warning:

        Reference to invalid group 0

    This rearranges the code so that the proper warning is emitted:

        Unterminated \g... pattern
* regcomp.c: Add comment | Karl Williamson | 2013-05-20 | 1 | -1/+1
* Fix multi-char fold edge case | Karl Williamson | 2013-05-20 | 1 | -29/+54
        use locale;
        fc("\N{LATIN CAPITAL LETTER SHARP S}") eq 2 x fc("\N{LATIN SMALL LETTER LONG S}")

    should return true, as the SHARP S folds to two 's's in a row, and the LONG S is an antique variant of 's', and folds to s. Until this commit, the expression was false.

    Similarly, the following should match, but didn't until this commit:

        "\N{LATIN SMALL LETTER SHARP S}" =~ /\N{LATIN SMALL LETTER LONG S}{2}/iaa

    The reason these didn't work properly is that in both cases the actual fold to 's' is disallowed. In the first case because of locale; and in the second because of /aa. And the code wasn't smart enough to realize that these were legal.

    The fix is to special case these so that the fold of sharp s (both capital and small) is two LONG S's under /aa; as is the fold of the capital sharp s under locale. The latter is user-visible, and the documentation of fc() now points that out. I believe this is such an edge case that no mention of it need be done in perldelta.
* Expand flags parameter from boolean in _to_fold_latin1 | Karl Williamson | 2013-05-20 | 1 | -1/+1
    This will be used in future commits to pass more flags.
* regcomp.c: Remove always-true test | Karl Williamson | 2013-05-20 | 1 | -2/+1
    In this code, j is guaranteed to be above 255, so no need to test for that.
* regcomp.c: White-space only | Karl Williamson | 2013-05-20 | 1 | -68/+66
    The previous commit allows us to outdent a largish block, reflowing things to fit into the extra available width, and saving a few vertical pixels.
* regcomp.c: Reorder two 'if's so shorter branches are first | Karl Williamson | 2013-05-20 | 1 | -28/+32
    This makes it easier to understand what is going on.
* regcomp.c: Clarify comment | Karl Williamson | 2013-05-20 | 1 | -11/+11
* regcomp.c: White-space only | Karl Williamson | 2013-05-20 | 1 | -2/+2
    Change this to follow perl coding conventions.
* regcomp.c: White-space only, wrap comment to fit | Karl Williamson | 2013-05-20 | 1 | -1/+2
* regcomp.c: Use mnemonic instead of number | Karl Williamson | 2013-05-20 | 1 | -1/+1
* Fix compiler warnings in regcomp.c | Karl Williamson | 2013-05-18 | 1 | -23/+23
* Fix regex /il and /iaa failures for single element [] class | Karl Williamson | 2013-05-09 | 1 | -4/+12
    This was a regression introduced in the v5.17 series. It only affected UTF-8 encoded patterns. Basically, the code here should have corresponded to, and didn't, similar logic located after the defchar: label in this file, which is executed for the general case (not stemming from a single element [bracketed] character class node).

    We don't fold code points 0-255 under locale, as those aren't known until run time. Similarly, we don't allow folds that cross the 255/256 boundary, as those aren't well-defined; and under /aa we don't allow folds that cross the 127/128 boundary.
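The boundary rules in the second paragraph can be written as a small predicate. This is one reading of the message, offered as a sketch: `fold_allowed_at_compile_time` and `fold_mode` are hypothetical names, and treating the 255/256 rule as applying under locale only is an assumption drawn from the sentence structure.

```c
#include <assert.h>

typedef enum { FOLD_UNICODE, FOLD_LOCALE, FOLD_AA } fold_mode;

/* Returns 1 if folding code point `from` to `to` may be done when the
 * pattern is compiled, 0 if the fold must be skipped under `mode`. */
int fold_allowed_at_compile_time(unsigned long from, unsigned long to,
                                 fold_mode mode)
{
    switch (mode) {
    case FOLD_LOCALE:
        /* 0-255 are only known at run time, and a fold may not cross
         * the 255/256 boundary, so both ends must be above 255. */
        return from > 255 && to > 255;
    case FOLD_AA:
        /* A fold may not cross the 127/128 (ASCII) boundary. */
        return (from < 128) == (to < 128);
    default:
        assert(mode == FOLD_UNICODE);
        return 1;               /* no boundary restriction in this sketch */
    }
}
```

For example, KELVIN SIGN (U+212A) folding to 'k' crosses both boundaries, so this predicate rejects it under locale and under /aa, while GREEK CAPITAL SIGMA (U+03A3) folding to U+03C3 stays above 255 and is accepted in both modes.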
* make /(?p:...)/ keep RXf_PMf_KEEPCOPY flag | David Mitchell | 2013-05-06 | 1 | -1/+2
    RT #117135

    The /p flag, when used internally within a pattern, isn't like the other internal patterns, e.g. (?i:...), in that once seen, the pattern should have the RXf_PMf_KEEPCOPY flag globally set and not just enabled within the scope of the (?p:...).
* Deprecate spaces/comments in some regex tokens | Karl Williamson | 2013-05-02 | 1 | -3/+20
    This commit deprecates having space/comments between the first two characters of regular expression forms '(*VERB:ARG)' and '(?...)'. That is, the '(' should be immediately followed by the '*' or '?'. Previously, things like:

        qr/((?# This is a comment in the middle of a token)?:foo)/

    were silently accepted.

    The problem is that during regex parsing, the '(' is seen, and the input pointer advanced, skipping comments and, under /x, white space, without regard to whether the left parenthesis is part of a bigger token or not. S_reg() handles the parsing of what comes next in the input, and it just assumes that the first character it sees is the one that immediately followed the input parenthesis.

    Since the parenthesis may or may not be a part of a bigger token, and given the current structure of handling things, I don't see an elegant way to fix this. What I did is flag the single call to S_reg() where this could be an issue, and have S_reg check for adjacency if the parenthesis is part of a bigger token, and if so, warn if not-adjacent.
* PATCH: [perl #117327]: Sequence (?#...) not recognized in regex | Karl Williamson | 2013-05-02 | 1 | -0/+12
    This reverts commit 504858073fe16afb61d66a8b6748851780e51432, which was made under the erroneous belief that certain code was unreachable. This code appears to be reachable, however, only if the input has a 2-character lexical token split by space or a comment. The token should be indivisible, and was accepted only through an accident of implementation. A later commit will deprecate splitting it, with the view of eventually making splitting it a fatal error. In the meantime, though, this reverts to the original behavior, and adds a (temporary) test to verify that.
* Handle /@a/ array expansion within regex engine | David Mitchell | 2013-04-20 | 1 | -17/+73
    Previously /a@{b}c/ would be parsed as

        regcomp('a', join($", @b), 'c')

    This means that the array was flattened and its contents stringified before hitting the regex engine.

    Change it so that it is parsed as

        regcomp('a', @b, 'c')

    (but where the array isn't flattened, but rather just the AV itself is pushed onto the stack, c.f. push @b, ....).

    This means that the regex engine itself can process any qr// objects within the array, and correctly extract out any previously-compiled code blocks (thus preserving the correct closure behaviour). This is an improvement on 5.16.x behaviour, and brings it into line with the newish 5.17.x behaviour for *scalar* vars where they happen to hold regex objects.

    It also fixes a regression from 5.16.x, which meant that you suddenly needed a 'use re eval' in scope if any of the elements of the array were a qr// object with code blocks (RT #115004).

    It also means that 'qr' overloading is now handled within interpolated arrays as well as scalars:

        use overload 'qr' => sub { return qr/a/ };
        my $o = bless [];
        my @a = ($o);
        "a" =~ /^$o$/; # always worked
        "a" =~ /^@a$/; # now works too
* S_pat_upgrade_to_utf8(): add num_code_blocks arg | David Mitchell | 2013-04-20 | 1 | -5/+7
    This function was added a few commits ago in this branch. It's intended to upgrade a pattern string to utf8, while simultaneously adjusting the start/end byte indices of any code blocks. In two of the three places it is called from, all code blocks will already have been processed, so the number of code blocks equals pRExC_state->num_code_blocks. In the third place however (S_concat_pat), not all code blocks have yet been processed, so using num_code_blocks causes us to fall off the end of the index array.

    Add an extra arg to S_pat_upgrade_to_utf8() to tell it how many code blocks exist so far.
* Perl_re_op_compile() re-indent code | David Mitchell | 2013-04-20 | 1 | -37/+37
    Re-indent code after the previous commit removed a block scope. Only whitespace changes.
* re_op_compile: eliminate a local var and scope | David Mitchell | 2013-04-20 | 1 | -17/+11
    Eliminate a local var and the block scope it is declared in. (There should be no functional changes.)

    Re-indenting will be in the next commit.
* combine regex concat overload loops | David Mitchell | 2013-04-20 | 1 | -17/+14
    Currently when the components of a runtime regex (e.g. the "a", $x, "-" in /a$x-/) are concatenated into a single pattern string, the handling of magic and various types of overloading is done within two separate loops (in perlish pseudocode):

        foreach (@arg) {
            SvGETMAGIC($_);
            apply 'qr' overloading to $_;
        }
        foreach (@arg) {
            $pat .= $_, allowing for '.' and '""' overloading;
        }

    This commit changes it to:

        foreach (@arg) {
            SvGETMAGIC($_);
            apply 'qr' overloading to $_;
            $pat .= $_, allowing for '.' and '""' overloading;
        }

    Note that this is in theory a user-visible change in behaviour, since the order in which various perl-level tie and overload functions get called may change. But that was just a quirk of the current implementation, rather than a documented feature.
* Perl_re_op_compile(): extract concatting code | David Mitchell | 2013-04-20 | 1 | -117/+148
    Extract out the big chunk of code that concatenates the components of a pattern string, into the new static function S_concat_pat().

    As well as being tidier, it will shortly allow us to recursively concatenate, and allow us to directly interpolate arrays such as /@foo/, rather than relying on pp_join to do it for us.
* Perl_re_op_compile(): handle utf8 concating better | David Mitchell | 2013-04-20 | 1 | -14/+14
    When concatting the list of arguments together to form a final pattern string, the code formerly did a quick scan of all the args first, and if any of them were SvUTF8, it set the (empty) destination string to UTF8 before concatting all the individual args. This avoided the pattern getting upgraded to utf8 halfway through, and thus the indices for code blocks becoming invalid.

    However this was not 100% reliable because, as an "XXX" code comment of mine pointed out, when overloading is involved it is possible for an arg to appear initially not to be utf8, but to be utf8 when its value is finally accessed. This results in an obscure bug (as shown in the test added for this commit), where literal /(?{code})/ still required 'use re "eval"'.

    The fix for this is to instead adjust the code block indices on the fly if the pattern string happens to get upgraded to utf8. This is easy(er) now that we have the new S_pat_upgrade_to_utf8() function.

    As well as fixing the bug, this also simplifies the main concat loop in the code, which will make it easier to handle interpolating arrays (e.g. /@foo/) when we move the interpolation from the join op into the regex engine itself shortly.
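Adjusting code block indices on the fly during an upgrade can be sketched as follows. This is not S_pat_upgrade_to_utf8() itself: `upgrade_with_blocks` and `code_block` are hypothetical names, the source is assumed to be Latin-1 bytes on an ASCII platform, blocks are assumed sorted and non-overlapping, and `end` is assumed to be the inclusive offset of a block's last byte.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    size_t start;   /* offset of the block's first byte */
    size_t end;     /* offset of the block's last byte (inclusive) */
} code_block;

/* Converts `len` Latin-1 bytes to UTF-8 and rewrites the offsets of the
 * first `nblocks` code blocks to point into the new string.
 * Returns a malloc'ed, NUL-terminated buffer (caller frees) or NULL. */
unsigned char *upgrade_with_blocks(const unsigned char *src, size_t len,
                                   code_block *blocks, size_t nblocks,
                                   size_t *out_len)
{
    unsigned char *dst;
    size_t s, d = 0, n = 0;
    int in_block = 0;

    assert(src != NULL && out_len != NULL);

    dst = malloc(2 * len + 1);      /* each byte grows to at most two */
    if (dst == NULL)
        return NULL;

    for (s = 0; s < len; s++) {
        /* remap a block boundary that falls on source offset s */
        if (n < nblocks) {
            if (!in_block && blocks[n].start == s) {
                blocks[n].start = d;
                in_block = 1;
            } else if (in_block && blocks[n].end == s) {
                blocks[n].end = d;
                in_block = 0;
                n++;
            }
        }
        if (src[s] < 0x80) {
            dst[d++] = src[s];
        } else {
            /* 0x80-0xFF become a two-byte UTF-8 sequence */
            dst[d++] = (unsigned char)(0xC0 | (src[s] >> 6));
            dst[d++] = (unsigned char)(0x80 | (src[s] & 0x3F));
        }
    }
    dst[d] = '\0';
    *out_len = d;
    return dst;
}
```

The explicit `nblocks` parameter mirrors the num_code_blocks argument discussed in the neighbouring commit: a caller that has only processed some blocks so far passes that smaller count, so the loop never reads past the entries that are valid.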
* Perl_re_op_compile: eliminate clunky if (0) {} | David Mitchell | 2013-04-20 | 1 | -13/+13
    There was a bit of code that did

        if (0) {
            redo_first_pass:
            ...foo...;
        }

    to allow us to jump back and repeat the first pass, doing some extra stuff the second time round. Since foo has now been abstracted into a separate function, we can instead call it each time directly before jumping, allowing us to remove the ugly if (0).
* Perl_re_op_compile(): eliminate xend var | David Mitchell | 2013-04-20 | 1 | -5/+2
    Its value is easily (re)calculated, and we no longer have to worry about making sure we update it everywhere.