| Commit message | Author | Age | Files | Lines |
| |
Commit 726ee55d introduced better handling of things like \87 in a
regex, but as an unfortunate side effect broke latex2html.
The rules for handling backslashes in regexen are a bit arcane.
Anything starting with \0 is octal.
The sequences \1 through \9 are always backrefs.
Any other sequence is interpreted as a decimal number, and if at least
that many capture buffers have been defined in the pattern at that point
then the sequence is a backreference. If however it is larger
than the number of buffers, the sequence is treated as an octal escape.
A consequence of this is that \118 could be a backreference to
the 118th capture buffer, or it could be the string "\11" . "8". In
other words depending on the context we might even use a different
number of digits for the escape!
This also left an awkward edge case: multi-digit sequences
starting with 8 or 9, like m/\87/, which would result in us parsing
as though we had seen /87/, or worse /\x{00}87/ (i.e. with a NUL byte
at the start), which is clearly wrong.
This patch fixes the cases where the capture buffers are defined,
and causes things like \87 or \97 to throw the same error that
/\8/ would. One might argue we should complain about an illegal
octal sequence, but this seems more consistent with an error like
/\9/ and IMO will be less surprising in an error message.
This patch includes exhaustive tests of patterns of the form
/(a)\1/, /((a))\2/, etc., so that we don't break this again if we
change the logic further.
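Python's re module follows broadly similar (though not identical) rules, which makes the ambiguity easy to illustrate; a minimal sketch, assuming CPython's re, where a backreference to a group that does not exist is a hard error rather than an octal escape:

```python
import re

# \0 always starts an octal escape: \0 is a NUL, followed by a literal "8"
assert re.fullmatch(r"\08", "\x008")

# \1 is a backreference when group 1 exists
assert re.fullmatch(r"(a)\1", "aa")

# \8 can never be octal, so with no group 8 defined it is simply an error
try:
    re.compile(r"\8")
except re.error:
    pass
else:
    raise AssertionError("expected re.error for \\8")
```

Note that Perl differs in the multi-digit case: Perl may reinterpret a too-large decimal sequence as octal, which is exactly the arcane behaviour described above.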
|
| |
"aaabc" should match /a+?(*THEN)bc/ with "abc".
|
| |
Prior to this patch, we did work based on a test that could be thrown
away as a result of the next test. This reorders the tests so that doesn't happen.
|
| |
When I added support for possessive modifiers it was possible to
build perl so that they could be combined even if it made no sense
to do so.
Later on in relation to Perl #118375 I got confused and thought
they were undocumented but legal.
So to prevent further confusion, and since nobody has ever mentioned
it since they were added, I am removing the unused conditional
build logic, and clearly documenting why they aren't allowed.
|
| |
In c37d14f947f7998211b0455e453160fb7e15b22e Karl fixed an issue
reported in [perl #118375] "5.18 regex regression Quantifier follows nothing in regex"
but he fixed only the non-greedy modifier mentioned in the ticket,
and did not include support for the other quantifier modifiers like the
non-greedy possessive (stupid but not illegal), and the possessive
(useful) modifiers.
Hopefully this covers them all.
Note that because Karl already included support for m/x {0,0} ?/x
I have done so as well for the new cases. I do not necessarily endorse
the idea that it is legal or should be tested for. I am inclined to
think that '{0,0}?+' should be indivisible even under /x.
|
| |
This changes the code so that any '?' immediately (subject to /x rules)
following a {0} quantifier is absorbed immediately as part of that
quantifier. This allows the optimization to NOTHING to work for
non-greedy matches.
Thanks to Yves Orton for doing most of the leg work on this
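The behaviour being optimised — that a {0}-quantified atom, greedy or not, matches nothing at all — holds in any backtracking engine; a sketch using Python's re, which accepts the same {0} and {0}? forms:

```python
import re

# An atom quantified with {0} contributes nothing to the match...
assert re.fullmatch(r"ab{0}c", "ac")

# ...and the non-greedy form b{0}? is exactly equivalent, which is
# what allows both to be optimised away to NOTHING.
assert re.fullmatch(r"ab{0}?c", "ac")
assert not re.fullmatch(r"ab{0}?c", "abc")
```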
|
| |
[sprout@dromedary-001 perl2.git]$ ../perl.git/Porting/bisect.pl --target=miniperl --start=perl-5.8.0 --end=v5.10.0 -- ./miniperl -Ilib -we 'BEGIN { $SIG{__WARN__} = sub { die if $_[0] =~ /Quantifier/ && $warned++; warn shift }}; ""=~/(N|N)(?{})?/'
...
07be1b83a6b2d24b492356181ddf70e1c7917ae3 is the first bad commit
commit 07be1b83a6b2d24b492356181ddf70e1c7917ae3
Author: Yves Orton <demerphq@gmail.com>
Date: Fri Jun 9 02:56:37 2006 +0200
Re: [PATCH] Better version of the Aho-Corasick patch and lots of benchmarks.
Message-ID: <9b18b3110606081556t779de698r82f361d82a05fbc8@mail.gmail.com>
(with tweaks)
p4raw-id: //depot/perl@28373
Since that commit, it has been possible for S_study_chunk to be called
twice if the trie optimisation kicks in (which happens for /(a|b)/).
‘Quantifier unexpected on zero-length expression’ is the only warning
in S_study_chunk. Now it can appear twice if the quantified zero-
length expression is in the same regexp as a trie optimisation.
So pass a flag to S_study_chunk when ‘restudying’ to indicate that the
warning should be skipped.
There are two code paths that call S_study_chunk, one for when there
is no top-level alternation, which triggers the error in this case,
and one for when there is a top-level alternation (/a|b/). I wasn’t
able to figure out how to trigger the double warning in the second
case, but I passed the flag for the restudy in that code path anyway,
since I don’t think it can be wrong.
|
| |
Interpolation fails if the interpolated extended character class
contains any bracketed character classes itself.
The sizing pass looks for [ and passes control to the regular
character class parser. When the charclass is finished, it begins
scanning for [ again. If it finds ], it assumes it is the end.
That fails with (?[ (?a:(?[ [a] ])) ]). The sizing pass hands
[ [a] ]... off to the charclass parser, which parses [ [a] and
hands control back to the sizing pass. It then sees ‘ ])) ])’,
assumes that the first ]) is the end of the entire construct, so the
main regexp parser sees the parenthesis following and dies.
If we change the sizing pass to look for ?[ we can simply record the
depth (depth++) and then, when we see ], decrement the depth, or exit
the loop if it is zero.
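The depth-counting idea can be sketched outside the engine; this is an illustrative helper (hypothetical name, not the actual C code) that finds the ] closing a (?[ ... ]) construct while accounting for nesting, ignoring escapes and quoting for brevity:

```python
def find_extended_class_end(pat, start):
    """Return the index of the ']' that closes the (?[ opened
    just before `start`.  Every nested '(?[' or plain '[' raises
    the depth; every ']' lowers it, or ends the construct at depth 0.
    (Illustrative only -- ignores backslash escapes.)"""
    depth = 0
    i = start
    while i < len(pat):
        if pat.startswith("(?[", i):
            depth += 1
            i += 3
            continue
        if pat[i] == "[":
            depth += 1
        elif pat[i] == "]":
            if depth == 0:
                return i
            depth -= 1
        i += 1
    raise ValueError("unterminated (?[ ... ])")

# The failing case from the message: the scanner now finds the real end
pat = "(?[ (?a:(?[ [a] ])) ])"
assert pat[find_extended_class_end(pat, 3)] == "]"
```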
|
| |
The ‘Operand with no preceding operator’ error was leaking the last
two operands.
|
| |
I only need to free the operand (current), not the left-paren token
that turns out not to be a paren (lparen).
For lparen to leak, there would have to be two operands in a row on
the charclass parsing stack, which currently never happens.
|
| |
The swash returned by utf8_heavy.pl was not being freed in the code
path to handle returning character classes to the (?[...]) parser
(when ret_invlist is true).
|
| |
When a (?[]) extended charclass is compiled, the various operands are
stored as inversion lists in separate SVs and then combined together
into new inversion lists. The functions that take care of combining
inversion lists only ever free one operand, and sometimes free none.
Most of the operators in (?[]) were trusting the invlist functions to
free everything that was no longer needed, causing (?[]) compilation
to leak invlists.
|
| |
The macro Set_Node_Cur_Length() had been referring to the variable
parse_start within its body. This somewhat secret reference is potentially
risky, as it was always taking a parameter node, hence one might assume that
that was all that it used, and change the value stored in parse_start.
(Specifically, one place assigns RExC_parse - 1 to parse_start, and later
uses parse_start + 1, which looks like an obvious cleanup candidate.)
So make parse_start an explicit parameter.
Also, eliminate the never-used macro Set_Cur_Node_Length. This was added as
part of commit fac927409d5ddf11 (April 2001) but never used.
|
| |
Commit 779fedd7c3021f01 (March 2013) moved code which unconditionally used
parse_start into another block. Hence the variable is now only needed when
RE_TRACK_PATTERN_OFFSETS is defined, so wrap its declaration in #ifdef/#endif
to avoid C compiler warnings.
|
| |
Instead of creating the parsing stack and then freeing it after
parsing the (?[...]) construct (leaking it whenever one of the various
errors scattered throughout the parsing code occurs), mortalise it to
begin with and let the mortals stack take care of it.
|
| |
The code already upgraded the pattern if interpolating an upgraded
string into it, but not vice versa. Just use sv_catsv_nomg() instead
of sv_catpvn_nomg(), so that it can upgrade as necessary.
|
| |
This global (per-interpreter) var is just used during regex compilation as
a placeholder to point RExC_emit at during the first (non-emitting) pass,
to indicate to not to emit anything. There's no need for it to be a global
var: just add it as an extra field in the RExC_state_t struct instead.
|
| |
This is a struct that holds all the global state of the current regex
match.
The previous set of commits have gradually removed all the fields of this
struct (by making things local rather than global state). Since the struct
is now empty, the PL_reg_state var can be removed, along with the
SAVEt_RE_STATE save type which was used to save and restore those fields
on recursive re-entry to the regex engine.
|
| |
Eliminate these two global vars (well, fields in the global
PL_reg_state) that hold the regex super-linear cache.
PL_reg_poscache_size gets replaced with a field in the local regmatch_info
struct, while PL_reg_poscache (which needs freeing at end of pattern
execution or on croak()), goes in the regmatch_info_aux struct.
Note that this includes a slight change in behaviour.
Each regex execution now has its own private poscache pointer, initially
null. If the super-linear behaviour is detected, the cache is malloced,
used for the duration of the pattern match, then freed.
The former behaviour allocated a global poscache on first use, which was
retained between regex executions. Since the poscache could be between 0.25
and 2x the size of the string being matched, that could potentially be a
big buffer lying around unused. So we save memory at the expense of a new
malloc/free for every regex that triggers super-linear behaviour.
The old behaviour saved the old pointer on reentrancy, then restored the
old one (and possibly freed the new buffer) at exit. Except it didn't for
(?{}), so /(?{ m{something-that-triggers-super-linear-cache} })/ would
leak each time the inner regex was called. This is now fixed
automatically.
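The idea behind the super-linear ("pos") cache — remember positions already tried so the matcher never re-explores them — can be sketched with a toy backtracking matcher; this is an illustrative analogy in Python, not the engine's code:

```python
from functools import lru_cache

def matches_star_b(s):
    """Toy backtracking matcher for the pattern (a|aa)*b against `s`.

    Without the cache the recursion is exponential in the number of
    leading a's; memoising on position makes it linear, which is the
    same trade the regex engine's pos cache makes (memory for time).
    """
    n = len(s)

    @lru_cache(maxsize=None)      # the "pos cache": one entry per position
    def star(i):
        if i == n - 1 and s[i] == "b":               # trailing literal b
            return True
        if i < n and s[i] == "a" and star(i + 1):    # branch: consume 'a'
            return True
        if s[i:i + 2] == "aa" and star(i + 2):       # branch: consume 'aa'
            return True
        return False

    return star(0)

# Fails fast even on input that would be pathological without the cache
assert not matches_star_b("a" * 40 + "c")
```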
|
| |
Move these two fields of PL_reg_state into the regmatch_info struct, so
they are local to each match.
|
| |
Replace several PL_reg* vars with a new struct. This is part of the
goal of removing all global regex state.
These particular vars are used in the case of a regex with (?{}) code
blocks. In this case, when the code in a block is called, various bits of
state (such as $1, pos($_)) are temporarily set up, even though the match
has not yet completed.
This involves updating the current PL_curpm to point to a fake PMOP which
points to the regex currently being executed. That regex has all its
current fields that are associated with captures (such as subbeg)
temporarily saved and overwritten with the current partial match results.
Similarly, $_ is temporarily aliased to the current match string, and any
old pos() position is saved. This saving was formerly done to the various
PL_reg* vars.
When the regex has finished executing (or if the code block croaks),
its fields are restored to the original values. Since this can happen in a
croak, it may be done using SAVEDESTRUCTOR_X() on the save stack. This
precludes just moving the PL_reg* vars into the regmatch_info struct,
since that is just allocated as a local var in regexec_flags(), and would
have already been abandoned and possibly overwritten after the croak and
longjmp, but before the SAVEDESTRUCTOR_X() action is taken.
So instead we put all the vars into a new struct, and malloc that on entry to
the regex engine when we know we need to copy the various fields.
We save a pointer to that in the regmatch_info struct, as well as passing
it to SAVEDESTRUCTOR_X(). The destructor may get called up to twice in the
non-croak case: first it's called explicitly at the end of regexec_flags(),
which restores subbeg etc; then again from the savestack, which just
free()s the struct. In the croak case, it's called just once, and does
both the restoring and the freeing.
The vars / PL_reg_state fields this commit eliminates are:
re_state_eval_setup_done
PL_reg_oldsaved
PL_reg_oldsavedlen
PL_reg_oldsavedoffset
PL_reg_oldsavedcoffset
PL_reg_magic
PL_reg_oldpos
PL_nrs
PL_reg_oldcurpm
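The restore-then-free destructor described above must be safe to run twice: once explicitly at the end of the match, and once from the save stack. That pattern can be sketched like this (all names here are illustrative, not perl's C internals):

```python
class SavedMatchState:
    """Sketch of a double-call-safe cleanup: the first call restores
    the saved fields into the target; any later call only releases
    the saved copy and does nothing else."""

    def __init__(self, target, saved_fields):
        self.target = target              # e.g. the regex's capture fields
        self.saved_fields = saved_fields  # e.g. {'subbeg': original_value}
        self.restored = False

    def destructor(self):
        if not self.restored:             # first call: restore originals
            self.target.update(self.saved_fields)
            self.restored = True
        self.saved_fields = None          # subsequent calls: free only
```

In the croak case only the save-stack call happens, so a single destructor call must do both the restoring and the freeing, which the `restored` flag arranges.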
|
| |
All deprecated warnings are supposed to be default warnings.
|
| |
• ‘Corrupted regexp opcode’ is a ‘can’t happen’ error, so it belongs
in the P category.
• Two spaces after dots for consistency
• Rewrap for slightly better splain output
• The description usually begins on the same line as the category, so
do so consistently
• Reorder alphabetically
• Missing category
• Single, not double, backslash
• Squash two adjacent (due to reordering) entries with identical
descriptions
• ‘given’ does not depend on lexical $_ any more
• Remove duplicate entries (and placate diag.t with diag_listed_as)
|
| |
Before this commit, /\g/ raised the wrong warning:
    Reference to invalid group 0
This rearranges the code so that the proper warning is emitted:
    Unterminated \g... pattern
|
| |
use locale;
fc("\N{LATIN CAPITAL LETTER SHARP S}")
eq 2 x fc("\N{LATIN SMALL LETTER LONG S}")
should return true, as the SHARP S folds to two 's's in a row, and the
LONG S is an antique variant of 's', and folds to s. Until this commit,
the expression was false.
Similarly, the following should match, but didn't until this commit:
"\N{LATIN SMALL LETTER SHARP S}" =~ /\N{LATIN SMALL LETTER LONG S}{2}/iaa
The reason these didn't work properly is that in both cases the actual
fold to 's' is disallowed. In the first case because of locale; and in
the second because of /aa. And the code wasn't smart enough to realize
that these were legal.
The fix is to special case these so that the fold of sharp s (both
capital and small) is two LONG S's under /aa; as is the fold of the
capital sharp s under locale. The latter is user-visible, and the
documentation of fc() now points that out. I believe this is such an
edge case that no mention of it need be done in perldelta.
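Python's str.casefold() exposes the same full Unicode case folding, so it can show why the equality is expected to hold (this sketches only the Unicode-level facts; the locale and /aa restrictions are what the commit actually changes):

```python
# LATIN SMALL LETTER SHARP S folds to "ss" under full case folding
assert "\N{LATIN SMALL LETTER SHARP S}".casefold() == "ss"

# LATIN SMALL LETTER LONG S is an archaic 's' and folds to "s"
assert "\N{LATIN SMALL LETTER LONG S}".casefold() == "s"

# so a sharp s and two long s's agree once both sides are folded
assert ("\N{LATIN SMALL LETTER SHARP S}".casefold()
        == (2 * "\N{LATIN SMALL LETTER LONG S}").casefold())

# the capital sharp s folds the same way
assert "\N{LATIN CAPITAL LETTER SHARP S}".casefold() == "ss"
```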
|
| |
This will be used in future commits to pass more flags.
|
| |
In this code, j is guaranteed to be above 255, so no need to test for
that.
|
| |
The previous commit allows us to outdent a largish block, reflowing
things to fit into the extra available width, and saving a few vertical
pixels.
|
| |
This makes it easier to understand what is going on
|
| |
Change this to follow perl coding conventions
|
| |
This was a regression introduced in the v5.17 series. It only affected
UTF-8 encoded patterns. Basically, the code here should have
corresponded to, and didn't, similar logic located after the defchar:
label in this file, which is executed for the general case (not stemming
from a single element [bracketed] character class node).
We don't fold code points 0-255 under locale, as those aren't known
until run time. Similarly, we don't allow folds that cross the 255/256
boundary, as those aren't well-defined; and under /aa we don't allow
folds that cross the 127/128 boundary.
|
| |
RT #117135
The /p flag, when used inline within a pattern, isn't like the
other inline flags, e.g. (?i:...), in that once seen, the
pattern should have the RXf_PMf_KEEPCOPY flag globally set and not
just enabled within the scope of the (?p:...).
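The scoped-versus-global distinction is visible in any engine with inline flags; for instance in Python's re, (?i:...) is scoped while a leading (?i) applies to the whole pattern — after this change, /p behaves like the global form even when written in the scoped spelling:

```python
import re

# scoped: case-insensitivity applies only inside (?i:...)
assert re.fullmatch(r"(?i:a)b", "Ab")
assert not re.fullmatch(r"(?i:a)b", "aB")

# global: a leading (?i) affects the entire pattern
assert re.fullmatch(r"(?i)ab", "AB")
```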
|
| |
This commit deprecates having space/comments between the first two
characters of regular expression forms '(*VERB:ARG)' and '(?...)'.
That is, the '(' should immediately be followed by the '*' or '?'.
Previously, things like:
qr/((?# This is a comment in the middle of a token)?:foo)/
were silently accepted.
The problem is that during regex parsing, the '(' is seen, and the input
pointer advanced, skipping comments and, under /x, white space, without
regard to whether the left parenthesis is part of a bigger token or not.
S_reg() handles the parsing of what comes next in the input, and it
just assumes that the first character it sees is the one that
immediately followed the input parenthesis.
Since the parenthesis may or may not be a part of a bigger token, and
the current structure of handling things, I don't see an elegant way to
fix this. What I did is flag the single call to S_reg() where this
could be an issue, and have S_reg() check for adjacency if the
parenthesis is part of a bigger token, and if so, warn if they are not adjacent.
|
| |
This reverts commit 504858073fe16afb61d66a8b6748851780e51432, which
was made under the erroneous belief that certain code was unreachable.
This code appears to be reachable, however, only if the input has
a 2-character lexical token split by space or a comment. The token
should be indivisible, and was accepted only through an accident of
implementation. A later commit will deprecate splitting it, with the
view of eventually making splitting it a fatal error. In the meantime,
though, this reverts to the original behavior, and adds a (temporary)
test to verify that.
|
| |
Previously /a@{b}c/ would be parsed as
regcomp('a', join($", @b), 'c')
This means that the array was flattened and its contents stringified before
hitting the regex engine.
Change it so that it is parsed as
regcomp('a', @b, 'c')
(but where the array isn't flattened, but rather just the AV itself is
pushed onto the stack, c.f. push @b, ....).
This means that the regex engine itself can process any qr// objects
within the array, and correctly extract out any previously-compiled
code blocks (thus preserving the correct closure behaviour). This is
an improvement on 5.16.x behaviour, and brings it into line with the
newish 5.17.x behaviour for *scalar* vars where they happen to hold
regex objects.
It also fixes a regression from 5.16.x, which meant that you suddenly
needed a 'use re eval' in scope if any of the elements of the array were
a qr// object with code blocks (RT #115004).
It also means that 'qr' overloading is now handled within interpolated
arrays as well as scalars:
use overload 'qr' => sub { return qr/a/ };
my $o = bless [];
my @a = ($o);
"a" =~ /^$o$/; # always worked
"a" =~ /^@a$/; # now works too
|
| |
This function was added a few commits ago in this branch. It's intended
to upgrade a pattern string to utf8, while simultaneously adjusting the
start/end byte indices of any code blocks. In two of the three places
it is called from, all code blocks will already have been processed,
so the number of code blocks equals pRExC_state->num_code_blocks.
In the third place however (S_concat_pat), not all code blocks have yet
been processed, so using num_code_blocks causes us to fall off the end of
the index array.
Add an extra arg to S_pat_upgrade_to_utf8() to tell it how many code
blocks exist so far.
|
| |
Re-indent code after the previous commit removed a block scope.
Only whitespace changes.
|
| |
Eliminate a local var and the block scope it is declared in
(There should be no functional changes).
Re-indenting will be in the next commit.
|
| |
Currently when the components of a runtime regex (e.g. the "a", $x, "-"
in /a$x-/) are concatenated into a single pattern string, the handling of
magic and various types of overloading is done within two separate loops:
(in perlish pseudocode):
foreach (@arg) {
SvGETMAGIC($_);
apply 'qr' overloading to $_;
}
foreach (@arg) {
$pat .= $_, allowing for '.' and '""' overloading;
}
This commit changes it to:
foreach (@arg) {
SvGETMAGIC($_);
apply 'qr' overloading to $_;
$pat .= $_, allowing for '.' and '""' overloading;
}
Note that this is in theory a user-visible change in behaviour, since
the order in which various perl-level tie and overload functions get
called may change. But that was just a quirk of the current
implementation, rather than a documented feature.
|
| |
Extract out the big chunk of code that concatenates the components
of a pattern string, into the new static function S_concat_pat().
As well as being tidier, it will shortly allow us to recursively
concatenate, and allow us to directly interpolate arrays such as
/@foo/, rather than relying on pp_join to do it for us.
|
| |
When concatting the list of arguments together to form a final pattern
string, the code formerly did a quick scan of all the args first, and
if any of them were SvUTF8, it set the (empty) destination string to UTF8
before concatting all the individual args. This avoided the pattern
getting upgraded to utf8 halfway through, and thus the indices for code
blocks becoming invalid.
However this was not 100% reliable because, as an "XXX" code comment of
mine pointed out, when overloading is involved it is possible for an arg
to appear initially not to be utf8, but to be utf8 when its value is
finally accessed. This results in an obscure bug (as shown in the test added
for this commit), where literal /(?{code})/ still required 'use re
"eval"'.
The fix for this is to instead adjust the code block indices on the fly
if the pattern string happens to get upgraded to utf8. This is easy(er)
now that we have the new S_pat_upgrade_to_utf8() function.
As well as fixing the bug, this also simplifies the main concat loop in
the code, which will make it easier to handle interpolating arrays (e.g.
/@foo/) when we move the interpolation from the join op into the regex
engine itself shortly.
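The on-the-fly adjustment can be illustrated with byte strings: when a Latin-1 buffer is upgraded to UTF-8, every byte >= 0x80 before a saved offset grows to two bytes, so the offset must be shifted by the count of such bytes (hypothetical helper name; the real code adjusts the code-block start/end fields):

```python
def adjust_offset(latin1_bytes, offset):
    """Map a byte offset valid in `latin1_bytes` to the equivalent
    offset after the buffer is re-encoded as UTF-8: each byte >= 0x80
    doubles, so the offset grows by the count of such bytes before it."""
    return offset + sum(1 for b in latin1_bytes[:offset] if b >= 0x80)

pat = "caf\u00e9{code}"            # pretend {code} is a code block
latin1 = pat.encode("latin-1")     # the e-acute is one byte here...
utf8 = pat.encode("utf-8")         # ...but two bytes here

start = latin1.index(b"{")         # code-block start, in Latin-1 bytes
assert utf8[adjust_offset(latin1, start)] == ord("{")
```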
|
| |
There was a bit of code that did
if (0) {
redo_first_pass:
...foo...;
}
to allow us to jump back and repeat the first pass, doing some extra stuff
the second time round. Since foo has now been abstracted into a separate
function, we can instead call it each time directly before jumping,
allowing us to remove the ugly if (0).
|
| |
Its value is easily (re)calculated, and we no longer have to worry
about making sure we update it everywhere.
|