| Commit message | Author | Age | Files | Lines |
|
|
|
| |
It is no longer needed as of 1067df30ae9.
|
|
|
|
|
|
|
|
| |
We get an integer overflow message when we left shift a 1 into the
highest bit of a word. This changes the 1's into 1U's to indicate
unsigned. This is done for all the flag bits in the affected word, as
they could get reordered by someone in the future, unintentionally
reintroducing this problem.
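A minimal sketch of the pattern described above. The flag names are hypothetical (not the ones in the affected word), and it assumes unsigned int is at least 32 bits wide:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag names, for illustration only.
 * With a plain 1, (1 << 31) shifts into the sign bit of a 32-bit int,
 * which is undefined behaviour in C; 1U keeps the arithmetic unsigned.
 * Every flag uses 1U, so reordering the bits later stays safe. */
#define DEMO_FLAG_LOW   (1U << 0)
#define DEMO_FLAG_MID   (1U << 15)
#define DEMO_FLAG_HIGH  (1U << 31)

/* Returns nonzero if the highest flag bit is set in the word. */
static int demo_has_high(uint32_t flags)
{
    return (flags & DEMO_FLAG_HIGH) != 0;
}
```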
|
|
|
|
|
|
|
|
|
|
| |
It is planned for a future Perl release to have /xx mean something
different from just /x. To prepare for this, this commit raises a
deprecation warning if someone currently has this usage. A grep of CPAN
did not turn up any instances of this, but this is to be safe anyway.
The added code is more general than actually needed, in case we want to
do this for another flag.
|
|
|
|
|
|
| |
This doesn't actually use the flag yet.
We no longer have to make version-dependent changes to
ext/Devel-Peek/t/Peek.t, (it being in /ext) so this doesn't
|
| |
|
|
|
|
|
|
| |
This sets a #define to point in the middle of the free-space, so that
bits at either end can be added without having to adjust many other
defines.
|
|
|
|
|
|
|
| |
This #defined a symbol then did a compile time check that it was the
same as another symbol. This commit simply defines it as the other
symbol directly, and moves it to above the other definitions, which it
no longer is part of. This prepares for the next commit.
|
|
|
|
|
|
| |
We do not need a placeholder for unused flag bits. And removing them
makes the generated regnodes.h more accurate as to what bits are
available.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This moves three bits to create a block of unused bits at the beginning.
The first bit had to be moved to make space for other uses that are
coming in future commits. This breaks binary compatibility, so might as
well move the other two bits so that all the unused bits are
consolidated at the beginning.
This pool of unused bits is the boundary between the bits that are
common to op.h and regexp.h (and in op_reg_common.h) and those that are
separate. It's best to have all the unused bits there, so when we need
to use one, it can be taken from either side, as needed, without us
being trapped into having an available bit, but of the wrong kind.
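The layout idea can be sketched roughly as follows. All names and bit positions here are hypothetical, chosen only to show a pool of free bits sitting between the shared flags and the private ones:

```c
#include <assert.h>

/* Hypothetical layout, for illustration only:
 *   bits 0..4   shared flags (grow upward)
 *   bits 5..7   unused pool
 *   bits 8..    private flags (defined relative to DEMO_BASE_SHIFT) */
#define DEMO_SHARED_NEXT  5   /* first bit not used by the shared flags */
#define DEMO_BASE_SHIFT   8   /* first bit used by the private flags    */

#define DEMO_SHARED_FOLD  (1U << 2)
#define DEMO_PRIV_A       (1U << (DEMO_BASE_SHIFT + 0))
#define DEMO_PRIV_B       (1U << (DEMO_BASE_SHIFT + 1))

/* Number of free bits sitting between the two groups. */
static int demo_free_bits(void)
{
    return DEMO_BASE_SHIFT - DEMO_SHARED_NEXT;
}
```

A new shared flag would take the lowest free bit and a new private flag the highest, so neither side can run out while the other still has spare bits.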
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- after return/croak/die/exit, return/break are pointless
(break is not a terminator/separator, it's a goto)
- after goto, another goto (!) is pointless
- in some cases (usually function ends) introduce explicit NOT_REACHED
to make the noreturn nature clearer (do not do this everywhere, though,
since that would mean adding NOT_REACHED after every croak)
- for the added NOT_REACHED also add /* NOTREACHED */ since
NOT_REACHED is for gcc (and VC), while the comment is for linters
- declaring variables in switch blocks is just too fragile:
it kind of works for narrowing the scope (which is nice),
but breaks the moment there are initializations for the variables
(the initializations will be skipped since the flow will bypass
the start of the block); in some easy cases simply hoist the declarations
out of the block and move them earlier
Note 1: Since after this patch the core is not yet -Wunreachable-code
clean, not enabling that via cflags.SH, one needs to -Accflags=... it.
Note 2: At least with the older gcc 4.4.7 there are far too many
"unreachable code" warnings, which seem to go away with gcc 4.8,
maybe better flow control analysis. Therefore, the warning should
eventually be enabled only for modernish gccs (what about clang and
Intel cc?)
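To illustrate the switch-block point, here is a small sketch with a hypothetical function (not taken from the core): the fragile form is shown only in a comment, and the compiled code uses the hoisted declaration.

```c
#include <assert.h>

/* Fragile form, shown only as a comment because the initialization is
 * skipped (control jumps straight to a case label):
 *
 *   switch (kind) {
 *       int bonus = 10;       // never executed
 *   case 1: return bonus;     // reads an uninitialized value
 *   }
 */
static int demo_score(int kind)
{
    int bonus = 10;   /* hoisted out of the switch, so always initialized */

    switch (kind) {
    case 1:
        return bonus;
    case 2:
        return bonus * 2;
    default:
        return 0;
    }
}
```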
|
|
|
|
|
|
|
| |
This reverts commit 8c2b19724d117cecfa186d044abdbf766372c679.
I don't understand - smoke-me came back happy with three
separate reports... oh well, some other time.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
- after croak/die/exit (or return), break (or return!) are pointless
(break is not a terminator/separator, it's a promise of a jump)
- after goto, another goto (!) is pointless
- in some cases (usually function ends) introduce explicit NOT_REACHED
to make the noreturn nature clearer (do not do this everywhere, though,
since that would mean adding NOT_REACHED after every croak)
- for the added NOT_REACHED also add /* NOTREACHED */ since
NOT_REACHED is for gcc (and VC), while the comment is for linters
- declaring variables in switch blocks is just too fragile:
it kind of works for narrowing the scope (which is nice),
but breaks the moment there are initializations for the variables
(they will be skipped!); in some easy cases simply hoist the declarations
out of the block and move them earlier
There are still a few places left.
|
|
|
|
| |
(For some odd reason assert() cannot be found and Jenkins becomes apoplectic.)
|
|
|
|
|
|
|
|
| |
The default case really is impossible because all the valid
enum values are already covered in the switch.
The NOT_REACHED; is for the compiler (from perl.h),
the /* NOTREACHED */ is for static analyzers.
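A sketch of the shape this takes. DEMO_NOT_REACHED is a hypothetical stand-in for perl.h's NOT_REACHED (here it simply aborts), and the enum is invented for illustration:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for perl.h's NOT_REACHED; here it just aborts. */
#define DEMO_NOT_REACHED abort()

enum demo_kind { DEMO_SMALL, DEMO_LARGE };

static int demo_size(enum demo_kind k)
{
    switch (k) {
    case DEMO_SMALL: return 1;
    case DEMO_LARGE: return 100;
    default:
        DEMO_NOT_REACHED; /* NOTREACHED */
    }
    return -1; /* never reached; keeps stricter compilers quiet */
}
```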
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Commit v5.19.8-533-g63baef5 changed the handling of locale-dependent
regexes so that the pattern was considered tainted at compile-time, rather
than determining it each time at run-time whenever it executed a
locale-dependent node. Unfortunately due to the conflating of two flags,
RXf_TAINTED and RXf_TAINTED_SEEN, it had the side effect of permanently
marking a pattern as tainted once it had had a single tainted result.
E.g.
use re qw(taint);
use Scalar::Util qw(tainted);
for ($^X, "abc") {
/(.*)/ or die;
print "not " unless tainted("$1"); print "tainted\n";
};
which from 5.19.9 onwards output:
tainted
tainted
but with this commit (and with 5.19.8 and earlier), it now outputs:
tainted
not tainted
The RXf_TAINTED flag indicates that the pattern itself is tainted, e.g.
$r = qr/$tainted_value/
while the RXf_TAINTED_SEEN flag means that the results of the last match
are tainted, e.g.
use re 'taint';
$tainted =~ /(.*)/;
# $1 is tainted
Pre 63baef5, the code used to look like:
    at run-time:
        turn off RXf_TAINTED_SEEN;
        while (nodes to execute) {
            switch(node) {
            case BOUNDL: /* and other locale-specific ops */
                turn on RXf_TAINTED_SEEN;
                ...;
            }
        }
        if (tainted || RXf_TAINTED)
            turn on RXf_TAINTED_SEEN;
63baef5 changed it to:
    at compile-time:
        if (pattern has locale ops)
            turn on RXf_TAINTED_SEEN;
    at run-time:
        while (nodes to execute) {
            ...
        }
        if (tainted || RXf_TAINTED)
            turn on RXf_TAINTED_SEEN;
This commit changes it to:
    at compile-time:
        if (pattern has locale ops)
            turn on RXf_TAINTED;
    at run-time:
        turn off RXf_TAINTED_SEEN;
        while (nodes to execute) {
            ...
        }
        if (tainted || RXf_TAINTED)
            turn on RXf_TAINTED_SEEN;
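The final scheme can be modelled in a few lines of C. This is only a sketch: the struct, function names and bit values are hypothetical, and only the two flag names come from the text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values and struct; only the flag names are real. */
#define DEMO_RXf_TAINTED       (1U << 0)  /* the pattern itself is tainted  */
#define DEMO_RXf_TAINTED_SEEN  (1U << 1)  /* last match's results tainted   */

struct demo_regex { unsigned flags; };

/* Compile time: locale ops taint the pattern, not the match results. */
static void demo_compile(struct demo_regex *rx, bool has_locale_ops)
{
    rx->flags = has_locale_ops ? DEMO_RXf_TAINTED : 0;
}

/* Run time: reset the per-match flag first, then set it again if this
 * match saw tainted input or the pattern itself is tainted. */
static bool demo_exec(struct demo_regex *rx, bool input_tainted)
{
    rx->flags &= ~DEMO_RXf_TAINTED_SEEN;
    /* ... execute nodes ... */
    if (input_tainted || (rx->flags & DEMO_RXf_TAINTED))
        rx->flags |= DEMO_RXf_TAINTED_SEEN;
    return (rx->flags & DEMO_RXf_TAINTED_SEEN) != 0;
}
```

Because the per-match flag is cleared on entry, a tainted match followed by an untainted one no longer leaves the second result marked as tainted.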
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently prog->substrs->data[] is a 3 element array of structures.
Elements 0 and 1 record the longest anchored and floating substrings,
while element 2 ('check'), is a copy of the longest of 0 and 1.
Record in a new field, prog->substrs->check_ix, the index of which element
was copied. (Eventually I intend to remove the copy altogether.)
Also for the anchored substr, set max_offset equal to min offset.
Previously it was left as zero and ignored, although if copied to check,
the check copy of max *was* set equal to min. Having this always set will
allow us to make the code simpler.
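A simplified model of the arrangement, under stated assumptions: the struct layouts are hypothetical (the real ones carry more fields), and breaking a length tie in favour of the anchored substring is a guess made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified layout. */
struct demo_substr { ptrdiff_t min_offset, max_offset; size_t len; };
struct demo_substrs {
    struct demo_substr data[3];  /* 0 = anchored, 1 = floating, 2 = check */
    unsigned char check_ix;      /* which of 0 or 1 was copied to data[2] */
};

static void demo_set_check(struct demo_substrs *s)
{
    /* For the anchored substring the maximum offset equals the minimum. */
    s->data[0].max_offset = s->data[0].min_offset;
    /* Assumption: on equal lengths, prefer the anchored substring. */
    s->check_ix = (s->data[1].len > s->data[0].len) ? 1 : 0;
    s->data[2] = s->data[s->check_ix];
}
```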
|
|
|
|
|
| |
In particular, specify that the various offset fields are char rather
than byte counts.
|
| |
|
|
|
|
|
|
|
|
|
| |
The flag tells us that a pattern may match an infinitely long string.
The new member in the regexp struct tells us how long the string might
be.
With these two items we can implement regexp based $/
|
|
|
|
|
|
|
|
|
|
| |
RXf_IS_ANCHORED as a replacement
The only requirement outside of the regex engine is to identify that there is
an anchor involved at all. So we move the 4 anchor flags to intflags and replace
them with a single aggregate flag RXf_IS_ANCHORED in extflags.
This frees up another 3 bits in extflags.
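As a rough sketch of the aggregate-flag idea: the names and bit values below are invented for illustration and are not the real flag definitions.

```c
#include <assert.h>

/* Hypothetical names and bit values, for illustration only. */
#define DEMO_INT_ANCH_A  (1U << 0)
#define DEMO_INT_ANCH_B  (1U << 1)
#define DEMO_INT_ANCH_C  (1U << 2)
#define DEMO_INT_ANCH_D  (1U << 3)
#define DEMO_INT_ANCHORED \
    (DEMO_INT_ANCH_A | DEMO_INT_ANCH_B | DEMO_INT_ANCH_C | DEMO_INT_ANCH_D)

#define DEMO_EXT_IS_ANCHORED (1U << 0)   /* the single public bit */

struct demo_regex {
    unsigned intflags;   /* private to the engine      */
    unsigned extflags;   /* visible outside the engine */
};

/* Publish "some anchor is present" as one bit in the external flags. */
static void demo_sync_anchor(struct demo_regex *rx)
{
    if (rx->intflags & DEMO_INT_ANCHORED)
        rx->extflags |= DEMO_EXT_IS_ANCHORED;
    else
        rx->extflags &= ~DEMO_EXT_IS_ANCHORED;
}
```

Four detailed bits in the internal word cost one bit in the external word, which is where the net saving of three bits comes from.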
|
|
|
|
| |
So they stay stable as I move other flags from extflags to intflags
|
|
|
|
|
|
|
|
| |
This required removing the RXf_GPOS_CHECK mask as it uses one flag
that will stay in extflags for now (RXf_ANCH_GPOS), and one flag that
moves to intflags (RXf_GPOS_SEEN). This mask is strange however, as
you can't have RXf_ANCH_GPOS without having RXf_GPOS_SEEN so I don't
know why we test both. Further investigation required.
|
| |
|
|
|
|
|
| |
Includes some improvements to how we dump regexps so that when a regexp
is for the standard perl engine we also show the intflags for the engine
|
|
|
|
| |
plus some typo fixes. I probably changed some things in perlintern, too.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As part of getting the regexp engine to handle long strings, this
commit changes any variables, parameters and struct members that hold
lengths of the string being matched against (or parts thereof) to use
SSize_t or STRLEN instead of [IU]32.
To avoid having to change any logic, I kept the signedness the same.
I did not change anything that affects the length of the regular
expression itself, so regexps are still practically limited to
I32_MAX. Changing that would involve changing the size of regnodes,
which would be a lot more involved.
These changes should fix bugs, but are very hard to test. In most
cases, I don’t know the regexp engine well enough to come up with test
cases that test the paths in question with long strings. In other
cases I don’t have a box with enough memory to test the fix.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Using I32 for the fields that record information about the location of
a fixed string that must be found for a regular expression to match
can result in match failures, because I32 is not large enough to store
offsets >= 2**31.
SSize_t is appropriate, since it is 64 bits on 64-bit platforms and 32
bits on 32-bit platforms.
This commit changes enough instances of I32 to SSize_t to get the
added test passing and suppress compiler warnings. A later commit
will change many more.
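The size limit is easy to demonstrate in isolation. This sketch uses the standard int32_t and int64_t types as stand-ins; the helper name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if an offset can be stored in a 32-bit signed field
 * without loss, 0 otherwise. */
static int demo_fits_in_i32(int64_t offset)
{
    return offset >= INT32_MIN && offset <= INT32_MAX;
}
```

An offset of exactly 2**31 is the first value that fails the check, which matches the ">= 2**31" threshold given above.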
|
| |
|
|
|
|
|
|
|
|
|
| |
Change the internal fields for storing positions so that //g in scalar
context can move past the 2**31 character threshold. Before this
commit, the numbers would wrap, resulting in assertion failures.
The changes in this commit are only enough to get the added test
passing. Stay tuned for more.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The value of pos() is stored as a byte offset. If it is stored on a
tied variable or a reference (or glob), then the stringification could
change, resulting in pos() now pointing to a different character
offset or pointing to the middle of a character:
$ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, a; print pos $x'
2
$ ./perl -Ilib -le '$x = bless [], chr 256; pos $x=1; bless $x, "\x{1000}"; print pos $x'
Malformed UTF-8 character (unexpected end of string) in match position at -e line 1.
0
So pos() should be stored as a character offset.
The regular expression engine expects byte offsets always, so allow it
to store bytes when possible (a pure non-magical string) but use
characters otherwise.
This does result in more complexity than I should like, but the
alternative (always storing a character offset) would slow down regular
expressions, which is a big no-no.
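The cost of the character-offset form is the conversion back to bytes. A minimal sketch of that conversion, assuming valid UTF-8 input (the helper is hypothetical, not a core function):

```c
#include <assert.h>
#include <stddef.h>

/* Converts a character offset into a byte offset for a UTF-8 buffer by
 * stepping over one lead byte and then any continuation bytes
 * (those of the form 10xxxxxx). Assumes the buffer is valid UTF-8. */
static size_t demo_char_to_byte(const unsigned char *s, size_t len,
                                size_t chars)
{
    size_t i = 0;
    while (i < len && chars > 0) {
        i++;                                      /* lead byte          */
        while (i < len && (s[i] & 0xC0) == 0x80)  /* continuation bytes */
            i++;
        chars--;
    }
    return i;
}
```

The walk is linear in the offset, which is why a stored byte offset is kept whenever the string cannot change underneath it.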
|
|
|
|
|
|
| |
In the API, rename the 'screamer' arg to be 'sv' instead;
update the description of the function's args;
improve the documentation of the REXEC_* flags for the 'flags' arg.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
On something like:
$_ = "123456789";
pos = 6;
s/.(?=.\G)/X/g;
each iteration could in theory start with pos one character to the left
of the previous position, and with the substitution replacing bits that
it has already replaced. Since that way madness lies, ban any attempt by
s/// to substitute to the left of a previous position.
To implement this, add a new flag to regexec(), REXEC_FAIL_ON_UNDERFLOW.
This tells regexec() to return failure even if the match itself succeeded,
but where the start of $& is before the passed stringarg point.
This change caused one existing test to fail (which was added about a year
ago):
$_="abcdef";
s/bc|(.)\G(.)/$1 ? "[$1-$2]" : "XX"/ge;
print; # used to print "aXX[c-d][d-e][e-f]"; now prints "aXXdef"
I think that that test relies on ambiguous behaviour, and that my change
makes things saner.
Note that s/// with \G is generally very under-tested.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Normally a /g match starts its processing at the previous pos() (or at
char 0 if pos is not set); however in the case of something like /abc\G/
we actually need to start 3 characters before pos. This has been handled
by the *callers* of regexec() subtracting prog->gofs from the stringarg
arg before calling it, or by setting stringarg to strbeg for floating,
such as /\w+\G/.
This is clearly wrong: the callers of regexec() shouldn't need to worry
about the details of getting \G right: move this code into regexec()
itself.
(Note that although this commit passes all tests, it quite possibly isn't
logically correct. It will get fixed up further during the next few
commits)
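The start-point adjustment can be sketched as below. This is a simplified model with a hypothetical helper: gofs is counted in bytes here, and returning NULL when there is not enough room before pos is an assumption made for the sketch, not a statement about what regexec() does.

```c
#include <assert.h>
#include <stddef.h>

/* gofs: how far before pos the match must begin (3 for /abc\G/).
 * floating: nonzero when \G follows a variable-length part such as
 * /\w+\G/, in which case the search starts at strbeg. */
static const char *demo_start_for_G(const char *strbeg, const char *pos,
                                    size_t gofs, int floating)
{
    if (floating)
        return strbeg;
    if ((size_t)(pos - strbeg) < gofs)
        return NULL;   /* assumption: no room before pos, cannot match */
    return pos - gofs;
}
```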
|
| |
|
|
|
|
|
|
|
|
|
|
| |
This is a struct that holds all the global state of the current regex
match.
The previous set of commits have gradually removed all the fields of this
struct (by making things local rather than global state). Since the struct
is now empty, the PL_reg_state var can be removed, along with the
SAVEt_RE_STATE save type which was used to save and restore those fields
on recursive re-entry to the regex engine.
|
|
|
|
|
|
|
|
| |
It's only used for printing debugging messages, and its value is already
available as the startpos local var in S_regmatch().
Whoo hoo! This var was the last field within the PL_reg_state global state
struct.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently PL_reg_curpm is actually #deffed to a field within PL_reg_state;
promote it into a fully autonomous perl-interpreter variable.
PL_reg_curpm points to a fake PMOP that's used to temporarily point
PL_curpm to, that we can hang the current regex off, so that this works:
"a" =~ /^(.)(?{ print $1 })/ # prints 'a'
It turns out that it doesn't need to be saved and restored when we
recursively enter the regex engine; that is already handled by saving and
restoring which regex is currently attached to PL_reg_curpm.
So we just need a single global (per interpreter) placeholder.
Since we're shortly going to get rid of PL_reg_state, we need to move it
out of that struct.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Eliminate these two global vars (well, fields in the global
PL_reg_state), that hold the regex super-linear cache.
PL_reg_poscache_size gets replaced with a field in the local regmatch_info
struct, while PL_reg_poscache (which needs freeing at end of pattern
execution or on croak()), goes in the regmatch_info_aux struct.
Note that this includes a slight change in behaviour.
Each regex execution now has its own private poscache pointer, initially
null. If the super-linear behaviour is detected, the cache is malloced,
used for the duration of the pattern match, then freed.
The former behaviour allocated a global poscache on first use, which was
retained between regex executions. Since the poscache could be between 0.25
and 2x the size of the string being matched, that could potentially be a
big buffer lying around unused. So we save memory at the expense of a new
malloc/free for every regex that triggers super-linear behaviour.
The old behaviour saved the old pointer on reentrancy, then restored the
old one (and possibly freed the new buffer) at exit. Except it didn't for
(?{}), so /(?{ m{something-that-triggers-super-linear-cache} })/ would
leak each time the inner regex was called. This is now fixed
automatically.
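The new lifetime can be modelled as follows. This is a sketch only: it folds the pointer and its size into one hypothetical struct, whereas the text puts them in two different structs, and it uses plain calloc/free rather than the core's allocators.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-match state; starts out with a null cache pointer. */
struct demo_match_aux {
    char  *poscache;
    size_t poscache_size;
};

/* Allocate the cache lazily, only once super-linear behaviour is seen.
 * Returns 1 on success, 0 if the allocation failed. */
static int demo_poscache_ensure(struct demo_match_aux *aux, size_t size)
{
    if (aux->poscache)
        return 1;
    aux->poscache = calloc(size, 1);
    if (!aux->poscache)
        return 0;
    aux->poscache_size = size;
    return 1;
}

/* Runs at the end of the match (or on an error unwind). */
static void demo_cleanup(struct demo_match_aux *aux)
{
    free(aux->poscache);
    aux->poscache = NULL;
    aux->poscache_size = 0;
}
```

Since each match owns its own pointer, a nested match cannot leak or clobber the outer one's buffer.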
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous commit reorganised state save and cleanup at the end of regex
execution. Use this new mechanism, by recording the original values
of PL_regmatch_slab and PL_regmatch_state in the regmatch_info_aux struct,
and restoring them and freeing higher slabs as part of the general
S_cleanup_regmatch_info_aux() destructor, rather than pushing the old
values directly onto the savestack and using another specific destructor.
Also, make the initial allocating of (up to) 3 PL_regmatch_state slots
more efficient by doing it in a loop.
We also skip the first slot; this may already be in use if we're called
reentrantly.
try 1
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously the regmatch_info struct was allocated as a local var on the C
stack, while some extra state (only needed for regexes having (?{})) was
malloced (as a regmatch_eval_state struct) as needed - and a destructor set
up to clean it up afterwards. This being because the stuff being cleaned
up couldn't be allocated on the C stack as it needed to hang around after
a croak().
Reorganise this so that:
* regmatch_info is on the C stack as before.
* a new struct, regmatch_info_aux is allocated within the first slot of the
regmatch_state stack, for fields which must always exist but which need
cleanup afterwards. This is currently unused, but will be shortly.
* a new struct, regmatch_info_aux_eval (which is just a renamed
regmatch_eval_state struct), is optionally allocated in the second
slot of regmatch_state. This is logically part of regmatch_info_aux,
except that splitting it in two stops it being too large to fit in a
regmatch_state slot (we can fit it in two instead).
(The second and third structs aren't allocated when we're intuit()
rather than regexec()).
Doing it like this simplifies allocation and cleanup: there's no need for
a malloc(), and we are already going to allocate a slab's worth of
regmatch_state slots, so using an extra one or two of them is effectively
free; and the cleanup just requires calling a single overall destructor.
In the next few commits, more of the regexec() state setup and tear-down
will be integrated into this new regime. And in particular, the new
regmatch_info_aux struct will give us somewhere to hang things like
PL_reg_poscache once it stops being global (it being local state that
needs cleanup).
|
|
|
|
|
| |
Move these two fields of PL_reg_state into the regmatch_info struct, so
they are local to each match.
|
|
|
|
|
| |
Earlier commits made the use of this var just local to the current
match, so move it to the local regmatch_info struct instead.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since this value is actually just always equal to cBOOL(RX_UTF8(rx)),
there's no need to save the old value of the local boolean
(as u.eval.saved_utf8_pat) when switching back and forwards between
regexes with (??{}); instead, just re-calculate it whenever we switch,
and update reginfo->is_utf8_pat and its cached value in the is_utf8_pat
local var accordingly.
Also, pass reginfo as an arg to S_setup_EXACTISH_ST_c1_c2() rather than
is_utf8_pat directly; this will allow us to eliminate PL_reg_match_utf8
shortly.
A new test is included that detects a mistake I made while working up
this change: I recalculated is_utf8_pat, but forgot to update
reginfo->is_utf8_pat too.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The way that the regex engine knows that the match string is utf8 is
currently a complete mess. It's partially signalled by the utf8 flag of
the passed SV, but also by the RXf_MATCH_UTF8 flag in the regex itself,
and the value of PL_reg_match_utf8.
Currently all the callers of the engine (such as pp_match, pp_split etc)
initially use RX_MATCH_UTF8_set() before calling the engine. This sets both
the RXf_MATCH_UTF8 flag on the regex, and PL_reg_match_utf8.
Then the two entry points to the engine (regexec_flags() and
re_intuit_start()) initially repeat the RX_MATCH_UTF8_set()
themselves.
Remove the usage of RX_MATCH_UTF8_set() by the callers of the engine,
and instead just rely on the engine to do it.
Also, remove the "secret" setting of PL_reg_match_utf8 by
RX_MATCH_UTF8_set(), and do it explicitly.
This is a prelude to eliminating PL_reg_match_utf8.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Replace several PL_reg* vars with a new struct. This is part of the
goal of removing all global regex state.
These particular vars are used in the case of a regex with (?{}) code
blocks. In this case, when the code in a block is called, various bits of
state (such as $1, pos($_)) are temporarily set up, even though the match
has not yet completed.
This involves updating the current PL_curpm to point to a fake PMOP which
points to the regex currently being executed. That regex has all its
current fields that are associated with captures (such as subbeg)
temporarily saved and overwritten with the current partial match results.
Similarly, $_ is temporarily aliased to the current match string, and any
old pos() position is saved. This saving was formerly done to the various
PL_reg* vars.
When the regex has finished executing (or if the code block croaks),
its fields are restored to the original values. Since this can happen in a
croak, it may be done using SAVEDESTRUCTOR_X() on the save stack. This
precludes just moving the PL_reg* vars into the regmatch_info struct,
since that is just allocated as a local var in regexec_flags(), and would
have already been abandoned and possibly overwritten after the croak and
longjmp, but before the SAVEDESTRUCTOR_X() action is taken.
So instead we put all the vars into a new struct, and malloc that on entry to
the regex engine when we know we need to copy the various fields.
We save a pointer to that in the regmatch_info struct, as well as passing
it to SAVEDESTRUCTOR_X(). The destructor may get called up to twice in the
non-croak case: first it's called explicitly at the end of regexec_flags(),
which restores subbeg etc; then again from the savestack, which just
free()s the struct. In the croak case, it's called just once, and does
both the restoring and the freeing.
The vars / PL_reg_state fields this commit eliminates are:
re_state_eval_setup_done
PL_reg_oldsaved
PL_reg_oldsavedlen
PL_reg_oldsavedoffset
PL_reg_oldsavedcoffset
PL_reg_magic
PL_reg_oldpos
PL_nrs
PL_reg_oldcurpm
|
| |
|
|
|
|
|
| |
by moving it from the global PL_reg_state struct to the local reginfo
struct.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(note that this is a change both to the perl API and the regex engine
plugin API).
Currently, Perl_re_intuit_start() is passed an SV, plus pointers to:
where in the string to start matching (strpos); and to the end of the
string (strend).
Unlike Perl_regexec_flags(), it doesn't also have a strbeg arg.
Because of this, it guesses strbeg: based on the passed SV (if it's
SvPOK()); or just set to strpos otherwise. This latter can happen if for
example the SV is overloaded. Note also that this latter guess is wrong,
and could in theory make /\b.../ fail.
But just to confuse matters, although Perl_re_intuit_start() itself uses
its guesstimate strbeg var, some of the functions it calls use the global
value of PL_bostr instead. To make this work, the *callers* of
Perl_re_intuit_start() currently set PL_bostr first. This is why \b
doesn't actually break.
The fix to this unholy mess is to simply add a strbeg arg to
Perl_re_intuit_start(). It's also the first step to eliminating PL_bostr
altogether.
|
|
|
|
|
|
|
| |
This is another global regex state variable (actually a field of
PL_reg_state). Eliminate it by moving it into the regmatch_info struct
instead, which is local to each match. Also, rename it to strend, which is
a less misleading description in these exciting days of multi-line matches.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
regmatch_info is a small struct that is currently directly allocated as a
local var in Perl_regexec_flags(), and has a few fields that maintain part
of the state of the current pattern match. It is passed as an arg to
various functions that regexec_flags() calls, such as regtry().
In some ways it's a rival to PL_reg_state, which also maintains state for
the current match, but which is a global variable (whose state needs
saving and restoring whenever the regex engine goes reentrant). It makes
more sense to store state in the regmatch_info struct, and as a first step
in moving more state to there, this commit makes more use of
regmatch_info.
In particular, it makes Perl_re_intuit_start() also allocate such a
struct, so that now *both* the main execution entry points to the regex
engine make use of it. It's also now passed as an arg to more of the static
functions that these two op-level ones call.
Two changes of special note. First, whether S_find_byclass() got called
with a null reginfo pointer or not indicated whether it had been called
from Perl_regexec_flags() (with a valid reginfo pointer), or from
Perl_re_intuit_start() (null pointer). Since they both pass non-null
reginfo pointers now, instead we add an extra field, reginfo->intuit that
indicates who's the top-level caller.
Secondly, to allow in future for various macros to uniformly refer to
values like reginfo->foo, where the structure is actually allocated as a
local var in Perl_regexec_flags(), we change the reginfo from being the
struct itself to being a pointer to the struct, (so Perl_regexec_flags
itself now uses reginfo->foo too rather than reginfo.foo).
In summary, all the above is essentially window dressing that makes no
functional changes to the code, but will facilitate future changes.
|