| Commit message (Collapse) | Author | Age | Files | Lines |
| |
Posix classes generally match different sets of characters under /d
rules than otherwise. This isn't true for [:ascii:], but the handling
for it is shared with the others, so it needs to use the same mechanism
to deal with that. I forgot this in commit
bb9ee97444732c84b33c2f2432aa28e52e4651dc which created this regression.
Our tests for this only use regexes with a single element, and an
optimization added in 5.18 causes this bug to be bypassed. These tests
should be enhanced to force both code paths, but not for this commit,
which should be suitable for a maintenance release.
(cherry picked from commit 46c10357a881cd92500e4ade81cbc8813e49e2cb)
| |
Commit 726ee55d introduced a regression that has been fixed in blead by
commit f1e1b256. However the later commit changed some buggy behavior
into errors instead of warnings, and so is contraindicated in a
maintenance release. This current commit attempts to fix the regression
without changing other behavior. It includes the pat.t tests from f1e1b256.
| |
(cherry-picked from 5b0e71e9d506. Some of the new tests are unsuitable for
5.18.x and fail with this commit; they'll be disabled in the next commit)
In the case where a qr// regex is directly used by PMOP (rather than being
interpolated with some other stuff and a new regex created, such as
/a$r/p), then the PMf_KEEPCOPY flag will be set on the PMOP, but the
corresponding RXf_PMf_KEEPCOPY flag *won't* be set on the regex.
Since most of the regex handling for copying the string and extracting out
${^PREMATCH} etc is done based on the RXf_PMf_KEEPCOPY flag in the regex,
this is a bit of a problem.
Prior to 5.18.0 this wasn't so noticeable, since various other bugs around
//p handling meant that ${^PREMATCH} etc often accidentally got set
anyway. 5.18.0 fixed these bugs, and so as a side-effect, exposed the
PMOP versus regex flag issue. In particular, this stopped working in
5.18.0:
my $pat = qr/a/;
'aaaa' =~ /$pat/gp or die;
print "MATCH=[${^MATCH}]\n";
(prints 'a' in 5.16.0, undef in 5.18.0).
The presence of /g caused the engine to copy the string anyway by luck.
We can't just set the RXf_PMf_KEEPCOPY flag on the regex if we see the
PMf_KEEPCOPY flag on the PMOP, otherwise stuff like this will be wrong:
$r = qr/..../;
/$r/p; # set RXf_PMf_KEEPCOPY on $r
/$r/; # does a /p match by mistake
Since for 5.19.x onwards COW is enabled by default (and cheap copies are
always made regardless of /p), then this fix is mainly for PERL_NO_COW
builds and for backporting to 5.18.x. (Although it still applies to
strings that can't be COWed for whatever reason).
Since we can't set a flag in the rx, we fix this by:
1) when calling the regex engine (which may attempt to copy part or all of
the capture string), make sure we pass REXEC_COPY_STR, but neither
REXEC_COPY_SKIP_PRE nor REXEC_COPY_SKIP_POST, when we call regexec() from
pp_match or pp_subst when the corresponding PMOP has PMf_KEEPCOPY set.
2) in Perl_reg_numbered_buff_fetch() etc, check for PMf_KEEPCOPY in
PL_curpm as well as for RXf_PMf_KEEPCOPY in the current rx before deciding
whether to process ${^PREMATCH} etc.
As well as adding new tests to t/re/reg_pmod.t, I also changed the
string to be matched against from being '12...' to '012...', to ensure that
the lengths of ${^PREMATCH}, ${^MATCH}, ${^POSTMATCH} would all be
different.
| |
The ‘Operand with no preceding operator’ error was leaking the last
two operands.
(cherry picked from commit b573e7000fd9c1cfae30ae5fb328a25b9bf3870a)
| |
I only need to free the operand (current), not the left-paren token
that turns out not to be a paren (lparen).
For lparen to leak, there would have to be two operands in a row on
the charclass parsing stack, which currently never happens.
(cherry picked from commit 4bc5d08976b7df23b63a56cc017a20ac5766fbbc)
| |
The swash returned by utf8_heavy.pl was not being freed in the code
path to handle returning character classes to the (?[...]) parser
(when ret_invlist is true).
(cherry picked from commit c80d037c54749655d40eac068936c5222ce9d8ee)
| |
When a (?[]) extended charclass is compiled, the various operands are
stored as inversion lists in separate SVs and then combined together
into new inversion lists. The functions that take care of combining
inversion lists only ever free one operand, and sometimes free none.
Most of the operators in (?[]) were trusting the invlist functions to
free everything that was no longer needed, causing (?[]) compilation
to leak invlists.
(cherry picked from commit a84e671a269f736a404a62f21caacc8a431c2aca)
| |
Instead of creating the parsing stack and then freeing it after parsing
the (?[...]) construct (leaking it whenever one of the various
errors scattered throughout the parsing code occurs), mortalise it to
begin with and let the mortals stack take care of it.
(cherry picked from commit 1e4f088863436a8019c7d864691903ffdafeefda)
| |
This segfault is a result of an optimization that can leave the
compilation in an inconsistent state.
/f{0}/
doesn't match anything, and hence should be removable from the regex for
all f. However,
qr{(?&foo){0}(?<foo>)}
caused a segfault. What was happening prior to this commit is that
(?&foo) refers to a named capture group further along in the regex.
The "{0}" caused the "(?&foo)" to be discarded prior to setting up the
pointers between the two related subexpressions; a segfault follows.
This commit removes the optimization, and should be suitable for a
maintenance release.
One might think that no one would be writing code like this, but this
example was distilled from machine-generated code in Regexp::Grammars.
Perhaps this optimization can be done, but the location I chose for
checking it was during parsing, which turns out to be premature. It
would be better to do it in the optimization phase of regex compilation.
Another option would be to retain it where it was, but for it to operate
only on a limited set of nodes, such as EXACTish, which would have no
unintended consequences. But that is for looking at in the future; the
important thing is to have a simple patch suitable for fixing this
regression in a maintenance release.
For the record, the code being reverted was mistakenly added by me in
commit 3018b823898645e44b8c37c70ac5c6302b031381, and wasn't even
mentioned in that commit message. It should have had its own commit.
Conflicts:
regcomp.c
| |
The code already upgraded the pattern if interpolating an upgraded
string into it, but not vice versa. Just use sv_catsv_nomg() instead
of sv_catpvn_nomg(), so that it can upgrade as necessary.
| |
This was a regression introduced in the v5.17 series. It only affected
UTF-8 encoded patterns. Basically, the code here should have
corresponded to, and didn't, similar logic located after the defchar:
label in this file, which is executed for the general case (not stemming
from a single element [bracketed] character class node).
We don't fold code points 0-255 under locale, as those aren't known
until run time. Similarly, we don't allow folds that cross the 255/256
boundary, as those aren't well-defined; and under /aa we don't allow
folds that cross the 127/128 boundary.
| |
RT #117135
The /p flag, when used internally within a pattern, isn't like the
other internal patterns, e.g. (?i:...), in that once seen, the
pattern should have the RXf_PMf_KEEPCOPY flag globally set and not
just enabled within the scope of the (?p:...).
| |
This commit deprecates having space/comments between the first two
characters of regular expression forms '(*VERB:ARG)' and '(?...)'.
That is, the '(' should be immediately followed by the '*' or '?'.
Previously, things like:
qr/((?# This is a comment in the middle of a token)?:foo)/
were silently accepted.
The problem is that during regex parsing, the '(' is seen, and the input
pointer advanced, skipping comments and, under /x, white space, without
regard to whether the left parenthesis is part of a bigger token or not.
S_reg() handles the parsing of what comes next in the input, and it
just assumes that the first character it sees is the one that
immediately followed the input parenthesis.
Since the parenthesis may or may not be a part of a bigger token, and
given the current structure of handling things, I don't see an elegant way to
fix this. What I did is flag the single call to S_reg() where this
could be an issue, and have S_reg() check for adjacency if the
parenthesis is part of a bigger token, and if so, warn if not-adjacent.
| |
This reverts commit 504858073fe16afb61d66a8b6748851780e51432, which
was made under the erroneous belief that certain code was unreachable.
This code appears to be reachable, however, only if the input has
a 2-character lexical token split by space or a comment. The token
should be indivisible, and was accepted only through an accident of
implementation. A later commit will deprecate splitting it, with the
view of eventually making splitting it a fatal error. In the meantime,
though, this reverts to the original behavior, and adds a (temporary)
test to verify that.
| |
Previously /a@{b}c/ would be parsed as
regcomp('a', join($", @b), 'c')
This means that the array was flattened and its contents stringified before
hitting the regex engine.
Change it so that it is parsed as
regcomp('a', @b, 'c')
(but where the array isn't flattened, but rather just the AV itself is
pushed onto the stack, cf. push @b, ....).
This means that the regex engine itself can process any qr// objects
within the array, and correctly extract out any previously-compiled
code blocks (thus preserving the correct closure behaviour). This is
an improvement on 5.16.x behaviour, and brings it into line with the
newish 5.17.x behaviour for *scalar* vars where they happen to hold
regex objects.
It also fixes a regression from 5.16.x, which meant that you suddenly
needed a 'use re eval' in scope if any of the elements of the array were
a qr// object with code blocks (RT #115004).
It also means that 'qr' overloading is now handled within interpolated
arrays as well as scalars:
use overload 'qr' => sub { return qr/a/ };
my $o = bless [];
my @a = ($o);
"a" =~ /^$o$/; # always worked
"a" =~ /^@a$/; # now works too
| |
This function was added a few commits ago in this branch. It's intended
to upgrade a pattern string to utf8, while simultaneously adjusting the
start/end byte indices of any code blocks. In two of the three places
it is called from, all code blocks will already have been processed,
so the number of code blocks equals pRExC_state->num_code_blocks.
In the third place however (S_concat_pat), not all code blocks have yet
been processed, so using num_code_blocks causes us to fall off the end of
the index array.
Add an extra arg to S_pat_upgrade_to_utf8() to tell it how many code
blocks exist so far.
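
The idea can be sketched in standalone C. This is only an illustration of
the technique, not the code in regcomp.c: the struct, the function name
upgrade_pat() and its signature are all made up here, and the input is
assumed to be Latin-1 with the code blocks sorted and non-overlapping. The
n_blocks argument plays the role of the new "how many code blocks exist so
far" parameter, so entries past it are never touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one code block's byte range in the pattern. */
struct code_block {
    size_t start;   /* offset of the block's first byte */
    size_t end;     /* offset of the block's last byte  */
};

/*
 * Convert a Latin-1 buffer to UTF-8 and rewrite the first n_blocks entries
 * of blocks[] so that they still cover the same characters. Entries at
 * index n_blocks and beyond are left alone. Returns a malloc()ed,
 * NUL-terminated buffer and stores its length in *out_len, or returns NULL
 * if the allocation fails.
 */
unsigned char *
upgrade_pat(const unsigned char *src, size_t len,
            struct code_block *blocks, size_t n_blocks, size_t *out_len)
{
    unsigned char *dst = malloc(2 * len + 1);  /* each byte grows to at most two */
    size_t s, d = 0, n = 0;
    int in_block = 0;

    if (dst == NULL)
        return NULL;

    for (s = 0; s < len; s++) {
        if (n < n_blocks && !in_block && blocks[n].start == s) {
            blocks[n].start = d;          /* new offset of the first byte */
            in_block = 1;
        }
        if (src[s] < 0x80) {
            dst[d++] = src[s];            /* ASCII is unchanged in UTF-8 */
        } else {
            dst[d++] = (unsigned char)(0xC0 | (src[s] >> 6));
            dst[d++] = (unsigned char)(0x80 | (src[s] & 0x3F));
        }
        if (in_block && blocks[n].end == s) {
            assert(d > 0);
            blocks[n].end = d - 1;        /* new offset of the last byte */
            in_block = 0;
            n++;
        }
    }
    dst[d] = '\0';
    *out_len = d;
    return dst;
}
```

Passing a count smaller than the size of the array is what keeps the loop
from reading entries that have not been filled in yet.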
| |
Re-indent code after the previous commit removed a block scope.
Only whitespace changes.
| |
Eliminate a local var and the block scope it is declared in
(There should be no functional changes).
Re-indenting will be in the next commit.
| |
Currently when the components of a runtime regex (e.g. the "a", $x, "-"
in /a$x-/) are concatenated into a single pattern string, the handling of
magic and various types of overloading is done within two separate loops:
(in perlish pseudocode):
foreach (@arg) {
SvGETMAGIC($_);
apply 'qr' overloading to $_;
}
foreach (@arg) {
$pat .= $_, allowing for '.' and '""' overloading;
}
This commit changes it to:
foreach (@arg) {
SvGETMAGIC($_);
apply 'qr' overloading to $_;
$pat .= $_, allowing for '.' and '""' overloading;
}
Note that this is in theory a user-visible change in behaviour, since
the order in which various perl-level tie and overload functions get
called may change. But that was just a quirk of the current
implementation, rather than a documented feature.
| |
Extract out the big chunk of code that concatenates the components
of a pattern string, into the new static function S_concat_pat().
As well as being tidier, it will shortly allow us to recursively
concatenate, and allow us to directly interpolate arrays such as
/@foo/, rather than relying on pp_join to do it for us.
| |
When concatting the list of arguments together to form a final pattern
string, the code formerly did a quick scan of all the args first, and
if any of them were SvUTF8, it set the (empty) destination string to UTF8
before concatting all the individual args. This avoided the pattern
getting upgraded to utf8 halfway through, and thus the indices for code
blocks becoming invalid.
However this was not 100% reliable because, as an "XXX" code comment of
mine pointed out, when overloading is involved it is possible for an arg
to appear initially not to be utf8, but to be utf8 when its value is
finally accessed. This results in an obscure bug (as shown in the test added
for this commit), where literal /(?{code})/ still required 'use re
"eval"'.
The fix for this is to instead adjust the code block indices on the fly
if the pattern string happens to get upgraded to utf8. This is easy(er)
now that we have the new S_pat_upgrade_to_utf8() function.
As well as fixing the bug, this also simplifies the main concat loop in
the code, which will make it easier to handle interpolating arrays (e.g.
/@foo/) when we move the interpolation from the join op into the regex
engine itself shortly.
| |
There was a bit of code that did
if (0) {
redo_first_pass:
...foo...;
}
to allow us to jump back and repeat the first pass, doing some extra stuff
the second time round. Since foo has now been abstracted into a separate
function, we can instead call it each time directly before jumping,
allowing us to remove the ugly if (0).
| |
its value is easily (re)calculated, and we no longer have to worry
about making sure we update it everywhere.
| |
Extract out the code that upgrades the pattern string to utf8 (and adjusts
any code-block indices) into a separate function, S_pat_upgrade_to_utf8().
As well as being tidier, we'll be calling the code from other places
shortly.
| |
There are two issues fixed here.
First, when a pattern has a run-time code-block included, such as
$code = '(?{...})'
/foo$code/
the mechanism used to parse those run-time blocks -- feeding the
resultant pattern into a call to eval_sv() with the string
qr'foo(?{...})'
and then extracting out any resulting opcode trees from the returned
qr object -- suffered from the re-parsed qr'..' also being subject to
overload::constant qr processing, which could result in Bad Things
happening.
Since we now have the PL_parser->lex_re_reparsing flag in scope throughout
the parsing of the pattern, this is easy to detect and avoid.
The second issue is a mechanism to avoid recursion when getting false
positives in S_has_runtime_code() for code like '[(?{})]'.
For patterns like this, we would suspect that the pattern may have code
(even though it doesn't), so feed it into qr'...' and reparse, and
again it looks like runtime code, so feed it in, rinse and repeat.
The thing to stop recursion was when we saw a qr with a single OP_CONST
string, we assumed it couldn't have any run-time component, and thus no
run-time code blocks.
However, this broke qr/foo/ in the presence of overload::constant qr
overloading, which could convert foo into a string containing code blocks.
The fix for this is to change the recursion-avoidance mechanism (in a way
which also turns out to be simpler). Basically, when we fake up a
qr'...' and eval it, we turn off any 'use re eval' in scope: it's not
needed, since we know the .... will be a constant string without any
overloading. Then we use the lack of 'use re eval' in scope to
skip calling S_has_runtime_code() and just assume that the code has no
run-time patterns (if it has, then eventually the regex parser will
rightly complain about 'Eval-group not allowed at runtime').
This commit also adds some fairly comprehensive tests for this.
| |
When re-parsing a pattern for run-time (?{}) code blocks,
we end up with the EVAL_RE_REPARSING flag set in PL_in_eval.
Currently we clear this flag as soon as scan_str() returns, to ensure that
it's not set if we happen to parse further patterns (e.g. within the
(?{ ... }) code itself).
However, a soon-to-be-applied bugfix requires us to know the reparsing
state beyond this point. To solve this, we add a new boolean flag to the
parser struct, which is set from PL_in_eval in S_sublex_push() (with the
old value being saved). This allows us to have the flag around for the
entire pattern string parsing phase, without it affecting nested pattern
compilation.
| |
The previous commit added an alternative flag mechanism to
PL_reg_state.re_reparsing, but kept the old one around for consistency
checking. Remove the old one now.
| |
PL_reg_state.re_reparsing is a hacky flag used to allow runtime
code blocks to be included in patterns. Basically, since code blocks
are now handled by the perl parser within literal patterns, runtime
patterns are handled by taking the (assembled at runtime) pattern,
and feeding it back through the parser via the equivalent of
eval q{qr'the_pattern'},
so that run-time (?{..})'s appear to be literal code blocks.
When this happens, the global flag PL_reg_state.re_reparsing is set,
which modifies lexing and parsing in minor ways (such as whether \\ is
stripped).
Now, I'm in the slow process of trying to eliminate global regex state
(i.e. gradually removing the fields of PL_reg_state), and also a change
which will be coming a few commits ahead requires the info which this flag
indicates to linger for longer (currently it is cleared immediately after
the call to scan_str()).
For those two reasons, this commit adds a new mechanism to indicate this:
a new flag to eval_sv(), G_RE_REPARSING (which sets OPpEVAL_RE_REPARSING
in the entereval op), which sets the EVAL_RE_REPARSING bit in PL_in_eval.
It's still a yukky global flag hack, but it's a *different* global flag hack
now.
For this commit, we add the new flag(s) but keep the old
PL_reg_state.re_reparsing flag and assert that the two mechanisms always
match. The next commit will remove re_reparsing.
| |
These were temporarily removed a few commits ago to make rebasing easier.
(And since the code's been simplified in the conflicting branch, not so
many debug statements had to be added back as were in the original).
| |
[perl #116823]
In re_op_compile(), there were two different code paths for compile-time
patterns (/foo(?{1})bar/) and runtime (/$foo(?{1})bar/).
The code in question is where the various components of the pattern
are concatenated into a single string, for example, 'foo', '(?{1})' and
'bar' in the first pattern.
In the run-time branch, the code assumes that each component (e.g. the
value of $foo) can be absolutely anything, and full magic and overload
handling is applied as each component is retrieved and appended to the
pattern string.
The compile-time branch on the other hand, was a lot simpler because it
"knew" that each component is just a simple constant SV attached to an
OP_CONST op. This turned out to be an incorrect assumption, due to
overload::constant qr overloading; here, a simple constant part of a
compile-time pattern, such as 'foo', can be converted into whatever the
overload function returns; in particular, an object blessed into an
overloaded class. So the "simple" SVs that get attached to OP_CONST ops
can in fact be complex and need full magic, overloading etc applied to
them.
The quickest solution to this turned out to be, for the compile-time case,
extract out the SV from each OP_CONST and assemble them into a temporary
SV** array; then from then onwards, treat it the same as the run-time case
(which expects an array of SVs).
| |
(only whitespace changes)
| |
When assembling a compile-time pattern from a list of OP_CONSTs (and
possibly embedded code-blocks), there were separate code paths for a
single arg (a lone OP_CONST) and a list of OP_CONST / DO's.
Unify the branches into single loop.
This will make a subsequent commit easier, where we will need to do more
processing of each "constant".
Re-indenting has been left to the commit that follows this.
| |
and eliminate one local var.
| |
(whitespace changes only)
| |
When compiling a regex, something like /a$b/ that parses to two args,
was treated in a different code path than /$a/ say, which is only one arg.
In particular the 1-arg code path, where it handled "" overloading, didn't
check for a loop (where the ""-sub returns the overloaded object itself) -
the N-arg branch did handle that. By unifying the branches, we get that
fix for free, and ensure that any future fixes don't have to be applied to
two separate branches.
Re-indenting has been left to the commit that follows this.
| |
These four DEBUG_PARSE_r()'s were recently added to a block of code
which I have just been extensively reworking in a separate branch.
Temporarily remove these statements to allow my branch to be rebased;
I'll re-add them (or similar) afterwards.
| |
This patch resolves several issues at once. The parts are
sufficiently interconnected that it is hard to break it down
into smaller commits. The tickets open for these issues are:
RT #94490 - split and constant folding
RT #116086 - split "\x20" doesn't work as documented
It additionally corrects some issues with cached regexes that
were exposed by the split changes (and applied to them).
It effectively reverts 5255171e6cd0accee6f76ea2980e32b3b5b8e171
and cccd1425414e6518c1fc8b7bcaccfb119320c513.
Prior to this patch the special RXf_SKIPWHITE behavior of
split(" ", $thing)
was only available if Perl could resolve the first argument to
split at compile time, meaning under various arcane situations.
This manifested as oddities like
my $delim = $cond ? " " : qr/\s+/;
split $delim, $string;
and
split $cond ? " " : qr/\s+/, $string
not behaving the same as:
($cond ? split(" ", $string) : split(/\s+/, $string))
which isn't very convenient.
This patch changes this by adding a new flag to the op_pmflags,
PMf_SPLIT which enables pp_regcomp() to know whether it was called
as part of split, which allows the RXf_SPLIT to be passed into run
time regex compilation. We also preserve the original flags so
pattern caching works properly, by adding a new property to the
regexp structure, "compflags", and related macros for accessing it.
We preserve the original flags passed into the compilation process,
so we can compare when we are trying to decide if we need to
recompile.
Note that this is essentially the opposite fix from the one applied
originally to fix #94490 in 5255171e6cd0accee6f76ea2980e32b3b5b8e171.
The reverted patch was meant to make:
split( 0 || " ", $thing ) #1
consistent with
my $x=0; split( $x || " ", $thing ) #2
and not with
split( " ", $thing ) #3
This was reverted because it broke C<split("\x{20}", $thing)>, and
because one might argue that it is not that #1 does the wrong thing,
but rather that the behavior of #2 is wrong. In other words
we might expect that all three should behave the same as #3, and
that instead of "fixing" the behavior of #1 to be like #2, we should
really fix the behavior of #2 to behave like #3. (Which is what we did.)
Also, it doesn't make sense to move the special case detection logic
further from the regex engine. We really want the regex engine to decide
this stuff itself, otherwise split " ", ... wouldn't work properly with
an alternate engine. (Imagine we add a special regexp meta pattern that behaves
the same as " " does in a split /.../. For instance we might make
split /(*SPLITWHITE)/ trigger the same behavior as split " ".)
The other major change as a result of this patch is it effectively
reverts commit cccd1425414e6518c1fc8b7bcaccfb119320c513, which
was intended to get rid of RXf_SPLIT and RXf_SKIPWHITE, and
free up bits in the regex flags structure.
But we don't want to get rid of these vars, and it turns out that
RXf_SEEN_LOOKBEHIND is used only in the same situation as the new
RXf_MODIFIES_VARS. So I have renamed RXf_SEEN_LOOKBEHIND to
RXf_NO_INPLACE_SUBST, and then instead of using two vars we use
only the one. Which in turn allows RXf_SPLIT and RXf_SKIPWHITE to
have their bits back.
| |
Use two temporary variables to simplify the logic, and maybe
speed up a nanosecond or two.
Also chainsaw some long dead logic. (I #ifdef'ed it out years ago)
| |
add a cast before doing a printf "%x" on a pointer
| |
This reverts commit 595598ee1f247e72e06e4cfbe0f98406015df5cc.
The netbsd - 5.0.2 compiler pointed out that the recent changes to add
longjmps to speed up some regex compilations can result in clobbering a
few values. These depend on the compiled code, and so didn't show up in
other compilers' warnings. This patch reinitializes them after a
longjmp.
[With a lot of hand editing in regcomp.c, to propagate the changes through
subsequent commits.]
| |
Remove volatile qualifiers. Remove the variable jump_ret. Move the
initialisation of restudied back to the declaration. This reverts several of
the changes made by commits 5d51ce98fae3de07 and bbd61b5ffb7621c2.
However, I can't see a cleaner way to avoid code duplication when restarting
the parse than the approach I've taken here - the label redo_first_pass is
now inside an if (0) block, which is clear but ugly.
| |
The regex parse needs to be restarted if it turns out that it should be done
as UTF-8, not bytes. Using setjmp()/longjmp() complicates compilation
considerably, causing warnings about missing use of volatile, and hitting
code generation errors from clang's ASAN. Using goto is much clearer.
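
A toy version of the flag-and-goto shape, for illustration only. The names
RESTART_UTF8 and redo_first_pass come from these commit messages; the flag's
value, the structs and both functions below are invented, and the "parser"
merely scans for a byte >= 0x80 rather than parsing anything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RESTART_UTF8 0x1   /* name from the commits; the value is made up */

/* Hypothetical compile state: current mode plus a pass counter. */
struct parse_state {
    bool utf8;
    int  passes;
};

/*
 * Toy inner routine. In bytes mode, meeting a byte >= 0x80 means the whole
 * parse must be redone as UTF-8. Instead of calling longjmp() it sets a
 * flag and returns failure, so each caller unwinds in the ordinary way.
 */
int parse_atom(const struct parse_state *st, const unsigned char *p,
               size_t len, unsigned *flagp)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (!st->utf8 && p[i] >= 0x80) {
            *flagp |= RESTART_UTF8;
            return -1;
        }
    }
    return 0;
}

/* Toy top-level routine: restart the first pass when asked to. */
int compile(const unsigned char *pat, size_t len, struct parse_state *st)
{
    unsigned flags;

    st->utf8 = false;
    st->passes = 0;

  redo_first_pass:
    flags = 0;
    st->passes++;
    if (parse_atom(st, pat, len, &flags) != 0) {
        if (flags & RESTART_UTF8) {
            assert(!st->utf8);     /* a restart can be requested only once */
            st->utf8 = true;
            goto redo_first_pass;
        }
        return -1;                 /* some other failure */
    }
    return 0;
}
```

Because no stack frames are skipped, no local needs to be volatile and
nothing has to be reinitialised by hand after the jump.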
| |
With longjmp() and setjmp() now in the same function (and all tests passing),
it becomes easy to replace the pair with a goto. Still evil, but "the lesser
of two evils".
| |
Add a flag RESTART_UTF8 along with infrastructure to the reg*() routines to
permit the parse to be restarted without using longjmp(). However, it's not
used yet.
| |
The SV listsv is sometimes stored in an array generated near the end of
S_regclass(). In other cases it is not used, and it needs to be freed if
any of the warnings that S_regclass() can trigger turn out to be fatal.
The simplest solution to this problem is to declare it from the start as a
mortal, and claim a (new) reference to it if it is *not* to be freed. This
permits the removal of all other code related to ensuring that it is freed
at the right time, but not freed prematurely if a call to a warning returns.
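
The "mortal from the start, claim a reference to keep it" pattern can be
shown with a toy reference counter. Everything below is hypothetical (the
struct, the fixed-size temporaries stack and all function names); in perl
itself the corresponding roles are played by sv_2mortal(), SvREFCNT_inc()
and the temporaries stack.

```c
#include <assert.h>
#include <stdlib.h>

struct obj { int refcnt; };

static int live_objects;            /* how many objects are currently allocated */

#define MAX_TMPS 16
static struct obj *tmps[MAX_TMPS];  /* toy "mortals" stack */
static int tmps_top;

struct obj *obj_new(void)
{
    struct obj *o = malloc(sizeof *o);
    if (o == NULL)
        abort();
    o->refcnt = 1;
    live_objects++;
    return o;
}

struct obj *obj_inc(struct obj *o) { o->refcnt++; return o; }

void obj_dec(struct obj *o)
{
    if (--o->refcnt == 0) {
        free(o);
        live_objects--;
    }
}

/* Hand one reference to the temporaries stack; it is dropped later. */
struct obj *obj_mortal(struct obj *o)
{
    assert(tmps_top < MAX_TMPS);
    tmps[tmps_top++] = o;
    return o;
}

/* Drop every reference held by the temporaries stack. */
void free_tmps(void)
{
    while (tmps_top > 0)
        obj_dec(tmps[--tmps_top]);
}

/*
 * Toy stand-in for a routine with many early exits. The scratch list is
 * mortal from the start, so an early return needs no cleanup code; only
 * the path that wants to keep the list claims its own reference.
 */
struct obj *build_class(int fail, int keep)
{
    struct obj *list = obj_mortal(obj_new());

    if (fail)
        return NULL;             /* e.g. a warning turned out to be fatal */
    if (keep)
        return obj_inc(list);    /* the caller now owns one reference */
    return NULL;                 /* list not needed; the mortal stack frees it */
}
```

The design choice is that the default outcome is "freed", and keeping the
object is the one case that must be spelled out.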
| |
As documented in pod/perlreguts.pod, the call graph for regex parsing
involves several levels of functions in regcomp.c, sometimes recursing more
than once.
The top level compiling function, S_reg(), calls S_regbranch() to parse each
single branch of an alternation. In turn, that calls S_regpiece() to parse
a simple pattern followed by quantifier, which calls S_regatom() to parse
that simple pattern. S_regatom() can call S_regclass() to handle classes,
but can also recurse into S_reg() to handle subpatterns and some other
constructions. Some other routines call S_reg(), sometimes using an
alternative pattern that they generate dynamically to represent their input.
These routines all return a pointer to a regnode structure, and take a
pointer to an integer that holds flags, which is also used to return
information.
Historically, it has not been clear when and why they return NULL, and
whether the return value can be ignored. In particular, "Jumbo regexp patch"
(commit c277df42229d99fe, from Nov 1997), added code with two calls from
S_reg() to S_regbranch(), one of which checks the return value and generates
a LONGJMP node if it returns NULL, the other of which is called in void
context, and so ignores both any return value and the possibility that it is
NULL.
After some analysis I have untangled the possible return values from these
5 functions (and related functions which call S_reg()).
Starting from the top:
S_reg() will return NULL and set the flags to TRYAGAIN at the end of pragma-
like constructions that it handles. Otherwise, historically it would return
NULL if S_regbranch() returned NULL. In turn, S_regbranch() would return
NULL if S_regpiece() returned NULL without setting TRYAGAIN. If S_regpiece()
returns TRYAGAIN, S_regbranch() loops, and ultimately will not return NULL.
S_regpiece() returns NULL with TRYAGAIN if S_regatom() returns NULL with
TRYAGAIN, but (historically) if S_regatom() returns NULL without setting
the flags to TRYAGAIN, S_regpiece() would too. Where S_regatom() calls
S_reg() it has similar behaviour when passing back return values, although
often it is able to loop instead on getting a TRYAGAIN.
Which gets us back to S_reg(), which can only *generate* NULL in conjunction
with TRYAGAIN. NULL without TRYAGAIN could only be returned if a routine it
called generated it. All other functions that these call that return regnode
structures cannot return NULL. Hence
1) in the loop of functions called, there is no source for a return value of
NULL without the TRYAGAIN flag being set
2) a return value of NULL with TRYAGAIN set from an inner function does not
propagate out past S_regbranch()
Hence the only return values that most functions can generate are non-NULL,
or NULL with TRYAGAIN set, and as S_regbranch() catches these, it cannot
return NULL. The longest sequence of functions that can return NULL (with
TRYAGAIN set) is S_reg() -> S_regatom() -> S_regpiece() -> S_regbranch().
Rapidly returning right round the loop back to S_reg() is not possible.
Hence code added by commit c277df42229d99fe to handle a NULL return from
S_regbranch(), along with some other code, is dead.
I have replaced all unreachable code with FAIL()s that panic.
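
The resulting contract can be modelled in a few lines of C. This is a toy
with invented names and a made-up flag value, not the regcomp.c routines:
'#' in the input stands in for a pragma-like construct that emits no node.
The piece parser returns NULL only together with TRYAGAIN, and the branch
loop absorbs that case and panics on a bare NULL.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define TRYAGAIN 0x1   /* flag name from the text; the value is made up */

struct node { char ch; };

static struct node pool[64];   /* toy node storage */
static int pool_used;

/*
 * Parse one piece at *pp. A '#' consumes input, emits no node and reports
 * TRYAGAIN; any other character emits one node.
 */
struct node *parse_piece(const char **pp, int *flagp)
{
    char c = *(*pp)++;

    *flagp = 0;
    if (c == '#') {
        *flagp |= TRYAGAIN;
        return NULL;
    }
    assert(pool_used < 64);
    pool[pool_used].ch = c;
    return &pool[pool_used++];
}

/*
 * Parse a whole branch and return the number of nodes emitted. A NULL with
 * TRYAGAIN just means "nothing emitted, keep going", so NULL never reaches
 * the caller. A NULL without TRYAGAIN should be impossible, so it panics.
 */
int parse_branch(const char *p)
{
    int emitted = 0;

    while (*p != '\0') {
        int flags;
        struct node *n = parse_piece(&p, &flags);

        if (n == NULL) {
            if (flags & TRYAGAIN)
                continue;
            fprintf(stderr, "panic: piece returned NULL, flags=%#x\n",
                    (unsigned)flags);
            abort();
        }
        emitted++;
    }
    return emitted;
}
```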
| |
The return value isn't used (yet). Previously the code was returning END,
which is a macro for the regnode number for "End of program" which happens to
be 0. It happens that 0 is valid C for a NULL pointer, but using it in this
way makes the intent unclear. Not only is orig_emit a valid type, it's
semantically the correct thing to return, as it's the most recently added node.
| |
I believe that this code was rendered unreachable when perl 5.001 added
code to S_nextchar() to skip over embedded comments. Adrian Enache noted
this in March 2003, and proposed a patch which removed it. See
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-03/msg00840.html
The patch wasn't applied at that time, and when he sent it again in August,
he omitted that hunk. See
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2003-08/msg01820.html
That version was applied as commit e994fd663a4d8acc.
| |
While assembling the regex, it was examining CONSTs in the optree
using the wrong pad. When consts are moved into the pad on threaded
builds, segvs might be the result.