path: root/regcomp.c
Commit message | Author | Age | Files | Lines
* PATCH: [perl #120799] 5.18 regression with [:^ascii] and \x80-\xFF
  Karl Williamson, 2013-12-19, 1 file, -0/+9

  Posix classes generally match different sets of characters under /d rules than otherwise. This isn't true for [:ascii:], but the handling for it is shared with the others, so it needs to use the same mechanism to deal with that. I forgot this in commit bb9ee97444732c84b33c2f2432aa28e52e4651dc which created this regression.

  Our tests for this only use regexes with a single element, and an optimization added in 5.18 causes this bug to be bypassed. These tests should be enhanced to force both code paths, but not for this commit, which should be suitable for a maintenance release.

  (cherry picked from commit 46c10357a881cd92500e4ade81cbc8813e49e2cb)
* PATCH [perl #119713] Regex \8 and \9 after literals no longer work
  Karl Williamson, 2013-11-27, 1 file, -0/+7

  Commit 726ee55d introduced a regression that has been fixed in blead by commit f1e1b256. However the later commit changed some buggy behavior into errors instead of warnings, and so is contraindicated in a maintenance release. This current commit attempts to fix the regression without changing other behavior. It includes the pat.t tests from f1e1b256.
* RT #118213: handle $r=qr/.../; /$r/p properly
  David Mitchell, 2013-11-19, 1 file, -18/+34

  (cherry-picked from 5b0e71e9d506. Some of the new tests are unsuitable for 5.18.x and fail with this commit; they'll be disabled in the next commit)

  In the case where a qr// regex is directly used by PMOP (rather than being interpolated with some other stuff and a new regex created, such as /a$r/p), then the PMf_KEEPCOPY flag will be set on the PMOP, but the corresponding RXf_PMf_KEEPCOPY flag *won't* be set on the regex.

  Since most of the regex handling for copying the string and extracting out ${^PREMATCH} etc is done based on the RXf_PMf_KEEPCOPY flag in the regex, this is a bit of a problem.

  Prior to 5.18.0 this wasn't so noticeable, since various other bugs around //p handling meant that ${^PREMATCH} etc often accidentally got set anyway. 5.18.0 fixed these bugs, and so as a side-effect, exposed the PMOP versus regex flag issue. In particular, this stopped working in 5.18.0:

      my $pat = qr/a/;
      'aaaa' =~ /$pat/gp or die;
      print "MATCH=[${^MATCH}]\n";

  (prints 'a' in 5.16.0, undef in 5.18.0). The presence of /g caused the engine to copy the string anyway by luck.

  We can't just set the RXf_PMf_KEEPCOPY flag on the regex if we see the PMf_KEEPCOPY flag on the PMOP, otherwise stuff like this will be wrong:

      $r = qr/..../;
      /$r/p; # set RXf_PMf_KEEPCOPY on $r
      /$r/;  # does a /p match by mistake

  Since for 5.19.x onwards COW is enabled by default (and cheap copies are always made regardless of /p), then this fix is mainly for PERL_NO_COW builds and for backporting to 5.18.x. (Although it still applies to strings that can't be COWed for whatever reason).

  Since we can't set a flag in the rx, we fix this by:

  1) when calling the regex engine (which may attempt to copy part or all of the capture string), make sure we pass REXEC_COPY_STR, but neither of REXEC_COPY_SKIP_PRE, REXEC_COPY_SKIP_POST when we call regexec() from pp_match or pp_subst when the corresponding PMOP has PMf_KEEPCOPY set.

  2) in Perl_reg_numbered_buff_fetch() etc, check for PMf_KEEPCOPY in PL_curpm as well as for RXf_PMf_KEEPCOPY in the current rx before deciding whether to process ${^PREMATCH} etc.

  As well as adding new tests to t/re/reg_pmod.t, I also changed the string to be matched against from being '12...' to '012...', to ensure that the lengths of ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} would all be different.
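Step 2 of the fix above comes down to consulting two flag words instead of one. A minimal sketch in C, assuming made-up bit values and a hypothetical helper name; the real check sits in Perl_reg_numbered_buff_fetch() and reads PL_curpm, and nothing below is copied from the perl source:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit values, for illustration only. */
#define SKETCH_PMf_KEEPCOPY     (1u << 0)  /* /p written on the match op  */
#define SKETCH_RXf_PMf_KEEPCOPY (1u << 1)  /* /p compiled into the regex  */

/* True if ${^PREMATCH} etc. should be processed: either the regex itself
 * was compiled with /p, or the op currently using it was written with /p. */
static bool sketch_keepcopy_in_effect(uint32_t pmop_flags, uint32_t rx_flags)
{
    return (pmop_flags & SKETCH_PMf_KEEPCOPY) != 0
        || (rx_flags & SKETCH_RXf_PMf_KEEPCOPY) != 0;
}
```

Because the op's flag is only read and never copied onto the regex, a later /$r/ without /p sees neither bit set, which is the behaviour the commit message asks for.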
* Don’t leak when compiling /(?[\d\d])/
  Father Chrysostomos, 2013-07-31, 1 file, -0/+2

  The ‘Operand with no preceding operator’ error was leaking the last two operands.

  (cherry picked from commit b573e7000fd9c1cfae30ae5fb328a25b9bf3870a)
* Free operand when encountering unmatched ')' in (?[])
  Father Chrysostomos, 2013-07-31, 1 file, -0/+1

  I only need to free the operand (current), not the left-paren token that turns out not to be a paren (lparen). For lparen to leak, there would have to be two operands in a row on the charclass parsing stack, which currently never happens.

  (cherry picked from commit 4bc5d08976b7df23b63a56cc017a20ac5766fbbc)
* Stop /(?[\p{...}])/ compilation from leaking
  Father Chrysostomos, 2013-07-31, 1 file, -0/+1

  The swash returned by utf8_heavy.pl was not being freed in the code path to handle returning character classes to the (?[...]) parser (when ret_invlist is true).

  (cherry picked from commit c80d037c54749655d40eac068936c5222ce9d8ee)
* Stop (?[]) operators from leaking
  Father Chrysostomos, 2013-07-31, 1 file, -6/+14

  When a (?[]) extended charclass is compiled, the various operands are stored as inversion lists in separate SVs and then combined together into new inversion lists. The functions that take care of combining inversion lists only ever free one operand, and sometimes free none. Most of the operators in (?[]) were trusting the invlist functions to free everything that was no longer needed, causing (?[]) compilation to leak invlists.

  (cherry picked from commit a84e671a269f736a404a62f21caacc8a431c2aca)
* Don’t leak the /(?[])/ parsing stack on error
  Father Chrysostomos, 2013-07-31, 1 file, -2/+1

  Instead of creating the parsing stack and then freeing it after parsing the (?[...]) construct (leaking it whenever one of the various errors scattered throughout the parsing code occurs), mortalise it to begin with and let the mortals stack take care of it.

  (cherry picked from commit 1e4f088863436a8019c7d864691903ffdafeefda)
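The idea of mortalising the stack can be sketched without any perl internals. The following is a minimal illustration in C, assuming a hypothetical deferred-free list standing in for perl's mortals stack; all names are invented for the example and do not appear in regcomp.c:

```c
#include <stdlib.h>

/* Hypothetical stand-in for the mortals stack: pointers registered here
 * are freed in one place, whichever way the parse exits. */
#define SKETCH_MAX_TEMPS 16
static void  *sketch_temps[SKETCH_MAX_TEMPS];
static size_t sketch_ntemps = 0;
static int    sketch_frees  = 0;  /* counts frees so the effect is observable */

/* Allocate a buffer and register it for deferred freeing. */
static void *sketch_mortal_alloc(size_t size)
{
    void *p;
    if (sketch_ntemps == SKETCH_MAX_TEMPS)
        return NULL;
    p = malloc(size);
    if (p)
        sketch_temps[sketch_ntemps++] = p;
    return p;
}

/* Free everything registered so far (the analogue of FREETMPS). */
static void sketch_free_temps(void)
{
    while (sketch_ntemps > 0) {
        free(sketch_temps[--sketch_ntemps]);
        sketch_frees++;
    }
}

/* Returns 0 on success, -1 on a parse error. Neither exit path frees the
 * stack itself; the caller's cleanup does. */
static int sketch_parse(const char *input)
{
    int *stack = sketch_mortal_alloc(8 * sizeof *stack);
    if (!stack)
        return -1;
    if (input[0] == ')')   /* pretend this is one of many error exits */
        return -1;
    stack[0] = input[0];
    return 0;
}
```

The error exit needs no cleanup code of its own, so adding another error path later cannot reintroduce the leak.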
* Fix regex segfault 5.18 regression
  Karl Williamson, 2013-07-14, 1 file, -10/+0

  This segfault is a result of an optimization that can leave the compilation in an inconsistent state. /f{0}/ doesn't match anything, and hence should be removable from the regex for all f. However, qr{(?&foo){0}(?<foo>)} caused a segfault. What was happening prior to this commit is that (?&foo) refers to a named capture group further along in the regex. The "{0}" caused the "(?&foo)" to be discarded prior to setting up the pointers between the two related subexpressions; a segfault follows.

  This commit removes the optimization, and should be suitable for a maintenance release. One might think that no one would be writing code like this, but this example was distilled from machine-generated code in Regexp::Grammars.

  Perhaps this optimization can be done, but the location I chose for checking it was during parsing, which turns out to be premature. It would be better to do it in the optimization phase of regex compilation. Another option would be to retain it where it was, but for it to operate only on a limited set of nodes, such as EXACTish, which would have no unintended consequences. But that is for looking at in the future; the important thing is to have a simple patch suitable for fixing this regression in a maintenance release.

  For the record, the code being reverted was mistakenly added by me in commit 3018b823898645e44b8c37c70ac5c6302b031381, and wasn't even mentioned in that commit message. It should have had its own commit.

  Conflicts: regcomp.c
* [perl #118297] Fix interpolating downgraded variables into upgraded regexp
  Dagfinn Ilmari Mannsåker, 2013-06-06, 1 file, -3/+2

  The code already upgraded the pattern if interpolating an upgraded string into it, but not vice versa. Just use sv_catsv_nomg() instead of sv_catpvn_nomg(), so that it can upgrade as necessary.
* Fix regex /il and /iaa failures for single element [] class
  Karl Williamson, 2013-05-09, 1 file, -4/+12

  This was a regression introduced in the v5.17 series. It only affected UTF-8 encoded patterns. Basically, the code here should have corresponded to, and didn't, similar logic located after the defchar: label in this file, which is executed for the general case (not stemming from a single element [bracketed] character class node).

  We don't fold code points 0-255 under locale, as those aren't known until run time. Similarly, we don't allow folds that cross the 255/256 boundary, as those aren't well-defined; and under /aa we don't allow folds that cross the 127/128 boundary.
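The three restrictions in the last paragraph can be read as a single predicate on a (code point, fold) pair. The sketch below is one possible reading, written in C with a hypothetical function name; it assumes the 255/256 rule applies only under locale, which is how the sentence is phrased, and it is not the logic actually found after the defchar: label:

```c
#include <stdbool.h>
#include <stdint.h>

/* Decide at compile time whether `from` may be treated as folding to `to`.
 * under_locale corresponds to /l, under_aa to /aa. */
static bool sketch_fold_allowed(uint32_t from, uint32_t to,
                                bool under_locale, bool under_aa)
{
    if (under_locale) {
        /* Folds of 0-255 depend on the locale, unknown until run time. */
        if (from < 256)
            return false;
        /* A fold crossing the 255/256 boundary is not well-defined. */
        if (to < 256)
            return false;
    }
    if (under_aa) {
        /* /aa forbids folds between ASCII and non-ASCII code points. */
        if ((from < 128) != (to < 128))
            return false;
    }
    return true;
}
```

For instance, KELVIN SIGN (U+212A) case-folds to "k" (U+006B); that pair crosses the 127/128 boundary, so the predicate rejects it under /aa and accepts it otherwise.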
* make /(?p:...)/ keep RXf_PMf_KEEPCOPY flag
  David Mitchell, 2013-05-06, 1 file, -1/+2

  RT #117135

  The /p flag, when used internally within a pattern, isn't like the other internal patterns, e.g. (?i:...), in that once seen, the pattern should have the RXf_PMf_KEEPCOPY flag globally set and not just enabled within the scope of the (?p:...).
* Deprecate spaces/comments in some regex tokens
  Karl Williamson, 2013-05-02, 1 file, -3/+20

  This commit deprecates having space/comments between the first two characters of regular expression forms '(*VERB:ARG)' and '(?...)'. That is, the '(' should be immediately followed by the '*' or '?'. Previously, things like:

      qr/((?# This is a comment in the middle of a token)?:foo)/

  were silently accepted.

  The problem is that during regex parsing, the '(' is seen, and the input pointer advanced, skipping comments and, under /x, white space, without regard to whether the left parenthesis is part of a bigger token or not. S_reg() handles the parsing of what comes next in the input, and it just assumes that the first character it sees is the one that immediately followed the input parenthesis.

  Since the parenthesis may or may not be a part of a bigger token, and the current structure of handling things, I don't see an elegant way to fix this. What I did is flag the single call to S_reg() where this could be an issue, and have S_reg check for adjacency if the parenthesis is part of a bigger token, and if so, warn if not-adjacent.
* PATCH: [perl #117327]: Sequence (?#...) not recognized in regex
  Karl Williamson, 2013-05-02, 1 file, -0/+12

  This reverts commit 504858073fe16afb61d66a8b6748851780e51432, which was made under the erroneous belief that certain code was unreachable. This code appears to be reachable, however, only if the input has a 2-character lexical token split by space or a comment. The token should be indivisible, and was accepted only through an accident of implementation. A later commit will deprecate splitting it, with the view of eventually making splitting it a fatal error. In the meantime, though, this reverts to the original behavior, and adds a (temporary) test to verify that.
* Handle /@a/ array expansion within regex engine
  David Mitchell, 2013-04-20, 1 file, -17/+73

  Previously /a@{b}c/ would be parsed as

      regcomp('a', join($", @b), 'c')

  This means that the array was flattened and its contents stringified before hitting the regex engine.

  Change it so that it is parsed as

      regcomp('a', @b, 'c')

  (but where the array isn't flattened, but rather just the AV itself is pushed onto the stack, c.f. push @b, ....).

  This means that the regex engine itself can process any qr// objects within the array, and correctly extract out any previously-compiled code blocks (thus preserving the correct closure behaviour). This is an improvement on 5.16.x behaviour, and brings it into line with the newish 5.17.x behaviour for *scalar* vars where they happen to hold regex objects.

  It also fixes a regression from 5.16.x, which meant that you suddenly needed a 'use re eval' in scope if any of the elements of the array were a qr// object with code blocks (RT #115004).

  It also means that 'qr' overloading is now handled within interpolated arrays as well as scalars:

      use overload 'qr' => sub { return qr/a/ };
      my $o = bless [];
      my @a = ($o);
      "a" =~ /^$o$/; # always worked
      "a" =~ /^@a$/; # now works too
* S_pat_upgrade_to_utf8(): add num_code_blocks arg
  David Mitchell, 2013-04-20, 1 file, -5/+7

  This function was added a few commits ago in this branch. It's intended to upgrade a pattern string to utf8, while simultaneously adjusting the start/end byte indices of any code blocks. In two of the three places it is called from, all code blocks will already have been processed, so the number of code blocks equals pRExC_state->num_code_blocks. In the third place however (S_concat_pat), not all code blocks have yet been processed, so using num_code_blocks causes us to fall off the end of the index array.

  Add an extra arg to S_pat_upgrade_to_utf8() to tell it how many code blocks exist so far.
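A rough picture of what such a function has to do, and why the count must be passed in explicitly, is given by the following C sketch. It assumes a Latin-1 source pattern and uses invented names and types throughout; it is not the body of S_pat_upgrade_to_utf8():

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record of one (?{...}) block: byte offsets into the pattern. */
struct sketch_code_block { size_t start, end; };

/* Number of bytes >= 0x80 in src[0..upto); each grows by one byte in UTF-8. */
static size_t sketch_high_bytes_before(const unsigned char *src, size_t upto)
{
    size_t i, n = 0;
    for (i = 0; i < upto; i++)
        if (src[i] >= 0x80)
            n++;
    return n;
}

/* Convert a Latin-1 pattern of `len` bytes to UTF-8 and remap the offsets of
 * the first `n_blocks` entries only; later entries are not yet filled in.
 * Returns a malloc'd NUL-terminated buffer, or NULL on allocation failure. */
static char *sketch_upgrade(const unsigned char *src, size_t len,
                            struct sketch_code_block *blocks, size_t n_blocks,
                            size_t *new_len)
{
    char *dst = malloc(2 * len + 1);
    size_t s, d = 0, b;

    if (!dst)
        return NULL;
    for (s = 0; s < len; s++) {
        unsigned char c = src[s];
        if (c < 0x80) {
            dst[d++] = (char)c;
        } else {
            dst[d++] = (char)(0xC0 | (c >> 6));    /* lead byte  */
            dst[d++] = (char)(0x80 | (c & 0x3F));  /* trail byte */
        }
    }
    dst[d] = '\0';
    for (b = 0; b < n_blocks; b++) {
        blocks[b].start += sketch_high_bytes_before(src, blocks[b].start);
        blocks[b].end   += sketch_high_bytes_before(src, blocks[b].end);
    }
    *new_len = d;
    return dst;
}
```

Passing n_blocks rather than the capacity of the array is the point of the commit: a caller that has filled in only some of the entries must not have the remainder read or adjusted.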
* Perl_re_op_compile() re-indent code
  David Mitchell, 2013-04-20, 1 file, -37/+37

  Re-indent code after the previous commit removed a block scope. Only whitespace changes.
* re_op_compile: eliminate a local var and scope
  David Mitchell, 2013-04-20, 1 file, -17/+11

  Eliminate a local var and the block scope it is declared in (There should be no functional changes). Re-indenting will be in the next commit.
* combine regex concat overload loops
  David Mitchell, 2013-04-20, 1 file, -17/+14

  Currently when the components of a runtime regex (e.g. the "a", $x, "-" in /a$x-/) are concatenated into a single pattern string, the handling of magic and various types of overloading is done within two separate loops (in perlish pseudocode):

      foreach (@arg) {
          SvGETMAGIC($_);
          apply 'qr' overloading to $_;
      }
      foreach (@arg) {
          $pat .= $_, allowing for '.' and '""' overloading;
      }

  This commit changes it to:

      foreach (@arg) {
          SvGETMAGIC($_);
          apply 'qr' overloading to $_;
          $pat .= $_, allowing for '.' and '""' overloading;
      }

  Note that this is in theory a user-visible change in behaviour, since the order in which various perl-level tie and overload functions get called may change. But that was just a quirk of the current implementation, rather than a documented feature.
* Perl_re_op_compile(): extract concatting code
  David Mitchell, 2013-04-20, 1 file, -117/+148

  Extract out the big chunk of code that concatenates the components of a pattern string, into the new static function S_concat_pat().

  As well as being tidier, it will shortly allow us to recursively concatenate, and allow us to directly interpolate arrays such as /@foo/, rather than relying on pp_join to do it for us.
* Perl_re_op_compile(): handle utf8 concating better
  David Mitchell, 2013-04-20, 1 file, -14/+14

  When concatting the list of arguments together to form a final pattern string, the code formerly did a quick scan of all the args first, and if any of them were SvUTF8, it set the (empty) destination string to UTF8 before concatting all the individual args. This avoided the pattern getting upgraded to utf8 halfway through, and thus the indices for code blocks becoming invalid.

  However this was not 100% reliable because, as an "XXX" code comment of mine pointed out, when overloading is involved it is possible for an arg to appear initially not to be utf8, but to be utf8 when its value is finally accessed. This results in an obscure bug (as shown in the test added for this commit), where literal /(?{code})/ still required 'use re "eval"'.

  The fix for this is to instead adjust the code block indices on the fly if the pattern string happens to get upgraded to utf8. This is easy(er) now that we have the new S_pat_upgrade_to_utf8() function.

  As well as fixing the bug, this also simplifies the main concat loop in the code, which will make it easier to handle interpolating arrays (e.g. /@foo/) when we move the interpolation from the join op into the regex engine itself shortly.
* Perl_re_op_compile: eliminate clunky if (0) {}
  David Mitchell, 2013-04-20, 1 file, -13/+13

  There was a bit of code that did

      if (0) {
          redo_first_pass:
          ...foo...;
      }

  to allow us to jump back and repeat the first pass, doing some extra stuff the second time round. Since foo has now been abstracted into a separate function, we can instead call it each time directly before jumping, allowing us to remove the ugly if (0).
* Perl_re_op_compile(): eliminate xend var
  David Mitchell, 2013-04-20, 1 file, -5/+2

  Its value is easily (re)calculated, and we no longer have to worry about making sure we update it everywhere.
* Perl_re_op_compile(): add S_pat_upgrade_to_utf8()
  David Mitchell, 2013-04-20, 1 file, -46/+58

  Extract out the code that upgrades the pattern string to utf8 (and adjusts any code-block indices) into a separate function, S_pat_upgrade_to_utf8(). As well as being tidier, we'll be calling the code from other places shortly.
* fix comment typo in regcomp.c
  David Mitchell, 2013-04-15, 1 file, -1/+1
* fix runtime /(?{})/ with overload::constant qr
  David Mitchell, 2013-04-12, 1 file, -24/+13

  There are two issues fixed here.

  First, when a pattern has a run-time code-block included, such as

      $code = '(?{...})'
      /foo$code/

  the mechanism used to parse those run-time blocks: of feeding the resultant pattern into a call to eval_sv() with the string

      qr'foo(?{...})'

  and then extracting out any resulting opcode trees from the returned qr object -- suffered from the re-parsed qr'..' also being subject to overload::constant qr processing, which could result in Bad Things happening.

  Since we now have the PL_parser->lex_re_reparsing flag in scope throughout the parsing of the pattern, this is easy to detect and avoid.

  The second issue is a mechanism to avoid recursion when getting false positives in S_has_runtime_code() for code like '[(?{})]'. For patterns like this, we would suspect that the pattern may have code (even though it doesn't), so feed it into qr'...' and reparse, and again it looks like runtime code, so feed it in, rinse and repeat. The thing to stop recursion was when we saw a qr with a single OP_CONST string, we assumed it couldn't have any run-time component, and thus no run-time code blocks.

  However, this broke qr/foo/ in the presence of overload::constant qr overloading, which could convert foo into a string containing code blocks.

  The fix for this is to change the recursion-avoidance mechanism (in a way which also turns out to be simpler). Basically, when we fake up a qr'...' and eval it, we turn off any 'use re eval' in scope: it's not needed, since we know the .... will be a constant string without any overloading. Then we use the lack of 'use re eval' in scope to skip calling S_has_runtime_code() and just assume that the code has no run-time patterns (if it has, then eventually the regex parser will rightly complain about 'Eval-group not allowed at runtime').

  This commit also adds some fairly comprehensive tests for this.
* add lex_re_reparsing boolean to yy_parser struct
  David Mitchell, 2013-04-12, 1 file, -2/+1

  When re-parsing a pattern for run-time (?{}) code blocks, we end up with the EVAL_RE_REPARSING flag set in PL_in_eval. Currently we clear this flag as soon as scan_str() returns, to ensure that it's not set if we happen to parse further patterns (e.g. within the (?{ ... }) code itself).

  However, a soon-to-be-applied bugfix requires us to know the reparsing state beyond this point. To solve this, we add a new boolean flag to the parser struct, which is set from PL_in_eval in S_sublex_push() (with the old value being saved). This allows us to have the flag around for the entire pattern string parsing phase, without it affecting nested pattern compilation.
* Eliminate PL_reg_state.re_reparsing, part 2
  David Mitchell, 2013-04-12, 1 file, -2/+0

  The previous commit added an alternative flag mechanism to PL_reg_state.re_reparsing, but kept the old one around for consistency checking. Remove the old one now.
* Eliminate PL_reg_state.re_reparsing, part 1
  David Mitchell, 2013-04-12, 1 file, -5/+6

  PL_reg_state.re_reparsing is a hacky flag used to allow runtime code blocks to be included in patterns. Basically, since code blocks are now handled by the perl parser within literal patterns, runtime patterns are handled by taking the (assembled at runtime) pattern, and feeding it back through the parser via the equivalent of eval q{qr'the_pattern'}, so that run-time (?{..})'s appear to be literal code blocks. When this happens, the global flag PL_reg_state.re_reparsing is set, which modifies lexing and parsing in minor ways (such as whether \\ is stripped).

  Now, I'm in the slow process of trying to eliminate global regex state (i.e. gradually removing the fields of PL_reg_state), and also a change which will be coming a few commits ahead requires the info which this flag indicates to linger for longer (currently it is cleared immediately after the call to scan_str()).

  For those two reasons, this commit adds a new mechanism to indicate this: a new flag to eval_sv(), G_RE_REPARSING (which sets OPpEVAL_RE_REPARSING in the entereval op), which sets the EVAL_RE_REPARSING bit in PL_in_eval.

  It's still a yukky global flag hack, but it's a *different* global flag hack now.

  For this commit, we add the new flag(s) but keep the old PL_reg_state.re_reparsing flag and assert that the two mechanisms always match. The next commit will remove re_reparsing.
* re_op_compile(): reapply debugging statements
  David Mitchell, 2013-04-12, 1 file, -0/+8

  These were temporarily removed a few commits ago to make rebasing easier. (And since the code's been simplified in the conflicting branch, not so many debug statements had to be added back as were in the original).
* Handle overloading properly in compile-time regex
  David Mitchell, 2013-04-12, 1 file, -87/+61

  [perl #116823]

  In re_op_compile(), there were two different code paths for compile-time patterns (/foo(?{1})bar/) and runtime (/$foo(?{1})bar/).

  The code in question is where the various components of the pattern are concatenated into a single string, for example, 'foo', '(?{1})' and 'bar' in the first pattern.

  In the run-time branch, the code assumes that each component (e.g. the value of $foo) can be absolutely anything, and full magic and overload handling is applied as each component is retrieved and appended to the pattern string.

  The compile-time branch on the other hand, was a lot simpler because it "knew" that each component is just a simple constant SV attached to an OP_CONST op. This turned out to be an incorrect assumption, due to overload::constant qr overloading; here, a simple constant part of a compile-time pattern, such as 'foo', can be converted into whatever the overload function returns; in particular, an object blessed into an overloaded class. So the "simple" SVs that get attached to OP_CONST ops can in fact be complex and need full magic, overloading etc applied to them.

  The quickest solution to this turned out to be, for the compile-time case, extract out the SV from each OP_CONST and assemble them into a temporary SV** array; then from then onwards, treat it the same as the run-time case (which expects an array of SVs).
* re-indent after last change
  David Mitchell, 2013-04-12, 1 file, -50/+50

  (only whitespace changes)
* re_op_compile(): unify 1-op and N-op branches
  David Mitchell, 2013-04-12, 1 file, -17/+31

  When assembling a compile-time pattern from a list of OP_CONSTs (and possibly embedded code-blocks), there were separate code paths for a single arg (a lone OP_CONST) and a list of OP_CONST / DO's. Unify the branches into single loop. This will make a subsequent commit easier, where we will need to do more processing of each "constant".

  Re-indenting has been left to the commit that follows this.
* re_op_compile(): simplify a code snippet
  David Mitchell, 2013-04-12, 1 file, -4/+1

  and eliminate one local var.
* re-indent code after previous commit
  David Mitchell, 2013-04-12, 1 file, -99/+99

  (whitespace changes only)
* regex and overload: unify 1 and N arg branches
  David Mitchell, 2013-04-12, 1 file, -26/+20

  When compiling a regex, something like /a$b/ that parses to two args, was treated in a different code path than /$a/ say, which is only one arg. In particular the 1-arg code path, where it handled "" overloading, didn't check for a loop (where the ""-sub returns the overloaded object itself) - the N-arg branch did handle that. By unifying the branches, we get that fix for free, and ensure that any future fixes don't have to be applied to two separate branches.

  Re-indenting has been left to the commit that follows this.
* re_op_compile(): temp remove some debugging code
  David Mitchell, 2013-04-12, 1 file, -10/+0

  These four DEBUG_PARSE_r()'s were recently added to a block of code which I have just been extensively reworking in a separate branch. Temporarily remove these statements to allow my branch to be rebased; I'll re-add them (or similar) afterwards.
* rework split() special case interaction with regex engine
  Yves Orton, 2013-03-27, 1 file, -5/+25

  This patch resolves several issues at once. The parts are sufficiently interconnected that it is hard to break it down into smaller commits. The tickets open for these issues are:

      RT #94490 - split and constant folding
      RT #116086 - split "\x20" doesn't work as documented

  It additionally corrects some issues with cached regexes that were exposed by the split changes (and applied to them).

  It effectively reverts 5255171e6cd0accee6f76ea2980e32b3b5b8e171 and cccd1425414e6518c1fc8b7bcaccfb119320c513.

  Prior to this patch the special RXf_SKIPWHITE behavior of

      split(" ", $thing)

  was only available if Perl could resolve the first argument to split at compile time, meaning under various arcane situations.

  This manifested as oddities like

      my $delim = $cond ? " " : qr/\s+/;
      split $delim, $string;

  and

      split $cond ? " " : qr/\s+/, $string

  not behaving the same as:

      ($cond ? split(" ", $string) : split(/\s+/, $string))

  which isn't very convenient.

  This patch changes this by adding a new flag to the op_pmflags, PMf_SPLIT which enables pp_regcomp() to know whether it was called as part of split, which allows the RXf_SPLIT to be passed into run time regex compilation. We also preserve the original flags so pattern caching works properly, by adding a new property to the regexp structure, "compflags", and related macros for accessing it. We preserve the original flags passed into the compilation process, so we can compare when we are trying to decide if we need to recompile.

  Note that this is essentially the opposite fix from the one applied originally to fix #94490 in 5255171e6cd0accee6f76ea2980e32b3b5b8e171. The reverted patch was meant to make:

      split( 0 || " ", $thing ) #1

  consistent with

      my $x=0; split( $x || " ", $thing ) #2

  and not with

      split( " ", $thing ) #3

  This was reverted because it broke C<split("\x{20}", $thing)>, and because one might argue that it is not that #1 does the wrong thing, but rather that the behavior of #2 is wrong. In other words we might expect that all three should behave the same as #3, and that instead of "fixing" the behavior of #1 to be like #2, we should really fix the behavior of #2 to behave like #3. (Which is what we did.)

  Also, it doesn't make sense to move the special case detection logic further from the regex engine. We really want the regex engine to decide this stuff itself, otherwise split " ", ... wouldn't work properly with an alternate engine. (Imagine we add a special regexp meta pattern that behaves the same as " " does in a split /.../. For instance we might make split /(*SPLITWHITE)/ trigger the same behavior as split " ".)

  The other major change as a result of this patch is it effectively reverts commit cccd1425414e6518c1fc8b7bcaccfb119320c513, which was intended to get rid of RXf_SPLIT and RXf_SKIPWHITE, and free up bits in the regex flags structure.

  But we don't want to get rid of these vars, and it turns out that RXf_SEEN_LOOKBEHIND is used only in the same situation as the new RXf_MODIFIES_VARS. So I have renamed RXf_SEEN_LOOKBEHIND to RXf_NO_INPLACE_SUBST, and then instead of using two vars we use only the one. Which in turn allows RXf_SPLIT and RXf_SKIPWHITE to have their bits back.
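The caching point in this commit (keep the flags as originally requested, separate from whatever the compiler turned them into) can be illustrated with a small C sketch. The struct layout, field names other than compflags, and the flag value are assumptions made for the example, not perl's definitions:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RXf_SPLIT (1u << 0)   /* hypothetical bit value */

struct sketch_regexp {
    const char *pattern;   /* source text the regex was compiled from      */
    uint32_t compflags;    /* flags exactly as passed in to the compiler   */
    uint32_t extflags;     /* flags after the compiler has adjusted them   */
};

/* A cached regex may be reused only if both the pattern text and the
 * originally requested flags match; extflags is deliberately ignored,
 * since it no longer reflects what the caller asked for. */
static bool sketch_can_reuse(const struct sketch_regexp *rx,
                             const char *pattern, uint32_t requested_flags)
{
    return rx->compflags == requested_flags
        && strcmp(rx->pattern, pattern) == 0;
}
```

With this comparison, the same pattern text " " compiled once for split and once for an ordinary match is treated as two different regexes, which is what the split special case needs.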
* simplify regcomp.c by using vars to avoid repeated macros
  Yves Orton, 2013-03-27, 1 file, -14/+7

  Use two temporary variables to simplify the logic, and maybe speed up a nanosecond or two.

  Also chainsaw some long dead logic. (I #ifdef'ed it out years ago)
* regcomp.c: silence compiler warning
  David Mitchell, 2013-03-23, 1 file, -1/+2

  add a cast before doing a printf "%x" on a pointer
* Revert "PATCH: regex longjmp flaws"
  Nicholas Clark, 2013-03-19, 1 file, -4/+2

  This reverts commit 595598ee1f247e72e06e4cfbe0f98406015df5cc.

  The netbsd - 5.0.2 compiler pointed out that the recent changes to add longjmps to speed up some regex compilations can result in clobbering a few values. These depend on the compiled code, and so didn't show up in other compiler's warnings. This patch reinitializes them after a longjmp.

  [With a lot of hand editing in regcomp.c, to propagate the changes through subsequent commits.]
* In Perl_re_op_compile(), tidy up after removing setjmp().
  Nicholas Clark, 2013-03-19, 1 file, -23/+15

  Remove volatile qualifiers. Remove the variable jump_ret. Move the initialisation of restudied back to the declaration. This reverts several of the changes made by commits 5d51ce98fae3de07 and bbd61b5ffb7621c2.

  However, I can't see a cleaner way to avoid code duplication when restarting the parse than the approach I've taken here - the label redo_first_pass is now inside an if (0) block, which is clear but ugly.
* Replace the longjmp()s in Perl_re_op_compile() with goto.
  Nicholas Clark, 2013-03-19, 1 file, -30/+8

  The regex parse needs to be restarted if it turns out that it should be done as UTF-8, not bytes. Using setjmp()/longjmp() complicates compilation considerably, causing warnings about missing use of volatile, and hitting code generation errors from clang's ASAN. Using goto is much clearer.
* Move the longjmp() that implements REQUIRE_UTF8 up to Perl_re_op_compile().
  Nicholas Clark, 2013-03-19, 1 file, -1/+2

  With longjmp() and setjmp() now in the same function (and all tests passing), it becomes easy to replace the pair with a goto. Still evil, but "the lesser of two evils".
* Add a flag RESTART_UTF8 to the reg*() routines in regcomp.c
  Nicholas Clark, 2013-03-19, 1 file, -11/+75

  Add a flag RESTART_UTF8 along with infrastructure to the reg*() routines to permit the parse to be restarted without using longjmp(). However, it's not used yet.
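Taken together, the RESTART_UTF8 flag and the goto that replaces longjmp() form a simple pattern: the inner routine reports the need to restart through its flags argument, and the top-level function jumps back to a label. A minimal C sketch follows, with invented names, a made-up flag value, and a stand-in trigger (any byte >= 0x80) in place of whatever construct really forces UTF-8:

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_RESTART_UTF8 0x1   /* hypothetical value */

struct sketch_state {
    bool utf8;    /* are we parsing as UTF-8?            */
    int  passes;  /* how many times the first pass ran   */
};

/* Pretend parser. In byte mode it cannot handle a high byte, so it asks
 * the caller to start over by setting a flag and returning NULL. */
static const char *sketch_parse(struct sketch_state *st, const char *pat,
                                int *flagp)
{
    const unsigned char *p;
    *flagp = 0;
    for (p = (const unsigned char *)pat; *p; p++) {
        if (*p >= 0x80 && !st->utf8) {
            *flagp |= SKETCH_RESTART_UTF8;
            return NULL;
        }
    }
    return pat;
}

/* Top-level compile: no setjmp(), no volatile locals, just a label. */
static const char *sketch_compile(struct sketch_state *st, const char *pat)
{
    const char *ret;
    int flags;

    st->utf8 = false;
    st->passes = 0;
  redo_first_pass:
    st->passes++;
    ret = sketch_parse(st, pat, &flags);
    if (ret == NULL && (flags & SKETCH_RESTART_UTF8)) {
        st->utf8 = true;          /* switch mode, then repeat the pass */
        goto redo_first_pass;
    }
    return ret;
}
```

Since the restart is an ordinary return followed by a local jump, every local variable keeps a well-defined value, which is what removes the need for the volatile qualifiers mentioned in the neighbouring commits.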
* In S_regclass(), create listsv as a mortal, claiming a reference if needed.
  Nicholas Clark, 2013-03-19, 1 file, -19/+5

  The SV listsv is sometimes stored in an array generated near the end of S_regclass(). In other cases it is not used, and it needs to be freed if any of the warnings that S_regclass() can trigger turn out to be fatal.

  The simplest solution to this problem is to declare it from the start as a mortal, and claim a (new) reference to it if it is *not* to be freed. This permits the removal of all other code related to ensuring that it is freed at the right time, but not freed prematurely if a call to a warning returns.
* Document when and why S_reg{,branch,piece,atom,class}() return NULL.Nicholas Clark2013-03-191-35/+79
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As documented in pod/perlreguts.pod, the call graph for regex parsing involves several levels of functions in regcomp.c, sometimes recursing more than once. The top level compiling function, S_reg(), calls S_regbranch() to parse each single branch of an alternation. In turn, that calls S_regpiece() to parse a simple pattern followed by quantifier, which calls S_regatom() to parse that simple pattern. S_regatom() can call S_regclass() to handle classes, but can also recurse into S_reg() to handle subpatterns and some other constructions. Some other routines call call S_reg(), sometimes using an alternative pattern that they generate dynamically to represent their input. These routines all return a pointer to a regnode structure, and take a pointer to an integer that holds flags, which is also used to return information. Historically, it has not been clear when and why they return NULL, and whether the return value can be ignored. In particular, "Jumbo regexp patch" (commit c277df42229d99fe, from Nov 1997), added code with two calls from S_reg() to S_regbranch(), one of which checks the return value and generates a LONGJMP node if it returns NULL, the other of which is called in void context, and so both ignores any return value, or the possibility that it is NULL. After some analysis I have untangled the possible return values from these 5 functions (and related functions which call S_reg()). Starting from the top: S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma- like constructions that it handles. Otherwise, historically it would return NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece() returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL. 
S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with TRYAGAIN, but (historically) if S_regatom() returns NULL without setting the flags to TRYAGAIN, S_regpiece() would too. Where S_regatom() calls S_reg() it has similar behaviour when passing back return values, although often it is able to loop instead on getting a TRYAGAIN. Which gets us back to S_reg(), which can only *generate* NULL in conjunction with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it called generated it. All other functions that these call that return regnode structures cannot return NULL. Hence 1) in the loop of functions called, there is no source for a return value of NULL without the TRYAGAIN flag being set 2) a return value of NULL with TRYAGAIN set from an inner function does not propagate out past S_regbranch() Hence the only return values that most functions can generate are non-NULL, or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot return NULL. The longest sequence of functions that can return NULL (with TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch(). Rapidly returning right round the loop back to S_reg() is not possible. Hence code added by commit c277df42229d99fe to handle a NULL return from S_regbranch(), along with some other code, is dead. I have replaced all unreachable code with FAIL()s that panic.
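The return-value protocol described in this commit message can be illustrated with a toy model. This is a minimal sketch, not the regcomp.c code: every name here (toy_atom, toy_piece, toy_branch, toy_count_pieces, TOY_TRYAGAIN) is hypothetical, the flag value is arbitrary, and '#' merely stands in for a pragma-like construct that produces no node. It assumes the atom level is the only source of NULL and always pairs it with the flag, so the branch level can loop and never return NULL itself.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TRYAGAIN  0x1   /* hypothetical flag bit, not perl's actual value */
#define TOY_MAX_NODES 32

typedef struct toy_node { char kind; } toy_node;

static toy_node toy_pool[TOY_MAX_NODES];
static size_t   toy_used;

/* Append a node to the toy program and return a pointer to it. */
static toy_node *toy_emit(char kind) {
    assert(toy_used < TOY_MAX_NODES);
    toy_pool[toy_used].kind = kind;
    return &toy_pool[toy_used++];
}

/* Atom level: '#' consumes input but yields no node, so it returns
 * NULL *and* sets TOY_TRYAGAIN. Any other character yields a node. */
static toy_node *toy_atom(const char **p, int *flagp) {
    char c = *(*p)++;
    if (c == '#') {
        *flagp |= TOY_TRYAGAIN;
        return NULL;
    }
    return toy_emit(c);
}

/* Piece level: passes NULL-with-TRYAGAIN upwards. NULL without the
 * flag has no source, so that case is treated as a panic. */
static toy_node *toy_piece(const char **p, int *flagp) {
    int flags = 0;
    toy_node *ret = toy_atom(p, &flags);
    if (ret == NULL) {
        assert(flags & TOY_TRYAGAIN);   /* otherwise unreachable */
        *flagp |= TOY_TRYAGAIN;
    }
    return ret;
}

/* Branch level: catches TRYAGAIN by looping, and always returns the
 * head node it emitted, so it cannot return NULL. */
static toy_node *toy_branch(const char **p, size_t *pieces) {
    toy_node *head = toy_emit('B');
    *pieces = 0;
    while (**p != '\0' && **p != '|') {
        int flags = 0;
        toy_node *latest = toy_piece(p, &flags);
        if (latest == NULL) {
            assert(flags & TOY_TRYAGAIN);
            continue;                   /* nothing emitted; try again */
        }
        (*pieces)++;
    }
    return head;
}

/* Parse one branch of `pattern` and report how many pieces produced nodes. */
int toy_count_pieces(const char *pattern) {
    size_t pieces = 0;
    toy_used = 0;
    toy_node *head = toy_branch(&pattern, &pieces);
    assert(head != NULL);
    return (int)pieces;
}
```

In this model, toy_count_pieces("a#b") counts two pieces, and a pattern made only of '#' characters still yields a non-NULL branch with zero pieces, which mirrors the argument that a NULL check on the branch level's return value is dead code.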
* Return orig_emit from S_regclass() when ret_invlist is true. Nicholas Clark 2013-03-19 1 -1/+1
| | | | | | The return value isn't used (yet). Previously the code was returning END, which is a macro for the regnode number for "End of program" which happens to be 0. It happens that 0 is valid C for a NULL pointer, but using it in this way makes the intent unclear. Not only does orig_emit have a valid type, it's semantically the correct thing to return, as it is the most recently added node.
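The point about END can be shown in isolation. The following is a hedged sketch with hypothetical names (TOY_END, toy_regnode, toy_class_before, toy_class_after), not the actual S_regclass() code: an opcode macro that expands to the constant 0 is accepted by the compiler as a null pointer, even though the reader sees an opcode being returned from a pointer-returning function.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_END 0   /* hypothetical opcode number for "end of program" */

typedef struct toy_regnode { unsigned char type; } toy_regnode;

/* Before: compiles because the constant 0 converts to a null pointer,
 * but it reads as if an opcode number were being returned. */
toy_regnode *toy_class_before(toy_regnode *orig_emit) {
    (void)orig_emit;
    return TOY_END;
}

/* After: return the node pointer itself, which has the right type
 * and states the intent directly. */
toy_regnode *toy_class_after(toy_regnode *orig_emit) {
    return orig_emit;
}
```

If the opcode's number were ever changed to a nonzero value, the first form would stop compiling cleanly, which is another reason to prefer returning a real pointer.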
* Remove unreachable duplicate (?#...) parsing code from S_reg() Nicholas Clark 2013-03-19 1 -8/+0
| | | | | | | | | | | I believe that this code was rendered unreachable when perl 5.001 added code to S_nextchar() to skip over embedded comments. Adrian Enache noted this in March 2003, and proposed a patch which removed it. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-03/msg00840.html The patch wasn't applied at that time, and when he sent it again in August, he omitted that hunk. See http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-08/msg01820.html That version was applied as commit e994fd663a4d8acc.
* fix a segfault in run-time qr//s with (?{}) David Mitchell 2013-03-18 1 -2/+9
| | | | | | While assembling the regex, it was examining CONSTs in the optree using the wrong pad. When consts are moved into the pad on threaded builds, segfaults could result.