| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
| |
Simply adding scan + max causes undefined behaviour per ANSI C if the
result points outside of the object scan points at.
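A minimal sketch of the safe alternative, assuming a hypothetical helper name (clamp_scan_limit is not from the actual source): compare lengths first, and only then form the pointer, so that nothing beyond the end of the buffer is ever computed.

```c
#include <stddef.h>

/* Hypothetical helper: return the furthest position we may scan to,
 * without ever forming a pointer beyond 'end'. */
static const char *
clamp_scan_limit(const char *scan, const char *end, size_t max)
{
    /* (end - scan) is well defined because both pointers refer to
     * the same buffer (or one past its last element). */
    size_t avail = (size_t)(end - scan);

    /* Only add 'max' when the result is known to stay in range. */
    return (max < avail) ? scan + max : end;
}
```

With a huge max, the naive 'scan + max' would be evaluated before any comparison could reject it; here the comparison happens on integers.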
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
Slightly reorganise the set-up code in these two functions,
for example by gathering all 'reginfo->foo =' assignments into a single
block. This is just swapping the order of a few lines, and shouldn't make
any functional difference.
|
|
|
|
|
|
|
|
|
|
| |
This is a struct that holds all the global state of the current regex
match.
The previous set of commits have gradually removed all the fields of this
struct (by making things local rather than global state). Since the struct
is now empty, the PL_reg_state var can be removed, along with the
SAVEt_RE_STATE save type which was used to save and restore those fields
on recursive re-entry to the regex engine.
|
|
|
|
|
|
|
|
| |
It's only used for printing debugging messages, and its value is already
available as the startpos local var in S_regmatch().
Whoo hoo! This var was the last field within the PL_reg_state global state
struct.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently PL_reg_curpm is actually #defined to a field within PL_reg_state;
promote it into a fully autonomous perl-interpreter variable.
PL_reg_curpm points to a fake PMOP that's used to temporarily point
PL_curpm to, that we can hang the current regex off, so that this works:
"a" =~ /^(.)(?{ print $1 })/ # prints 'a'
It turns out that it doesn't need to be saved and restored when we
recursively enter the regex engine; that is already handled by saving and
restoring which regex is currently attached to PL_reg_curpm.
So we just need a single global (per interpreter) placeholder.
Since we're shortly going to get rid of PL_reg_state, we need to move it
out of that struct.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Eliminate these two global vars (well, fields in the global
PL_reg_state), that hold the regex super-linear cache.
PL_reg_poscache_size gets replaced with a field in the local regmatch_info
struct, while PL_reg_poscache (which needs freeing at end of pattern
execution or on croak()), goes in the regmatch_info_aux struct.
Note that this includes a slight change in behaviour.
Each regex execution now has its own private poscache pointer, initially
null. If the super-linear behaviour is detected, the cache is malloced,
used for the duration of the pattern match, then freed.
The former behaviour allocated a global poscache on first use, which was
retained between regex executions. Since the poscache could be between 0.25
and 2x the size of the string being matched, that could potentially be a
big buffer lying around unused. So we save memory at the expense of a new
malloc/free for every regex that triggers super-linear behaviour.
The old behaviour saved the old pointer on reentrancy, then restored the
old one (and possibly freed the new buffer) at exit. Except it didn't for
(?{}), so /(?{ m{something-that-triggers-super-linear-cache} })/ would
leak each time the inner regex was called. This is now fixed
automatically.
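The new lifetime of the cache can be sketched roughly as follows. This is an illustration only: struct match_aux and both function names are hypothetical stand-ins, not the real regmatch_info_aux code.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the per-match auxiliary state. */
struct match_aux {
    char  *poscache;       /* NULL until super-linear behaviour is seen */
    size_t poscache_size;
};

/* Allocate the cache on first use only; returns 0 on success. */
static int
aux_need_poscache(struct match_aux *aux, size_t size)
{
    if (aux->poscache)
        return 0;                    /* already allocated for this match */
    aux->poscache = calloc(size, 1);
    if (!aux->poscache)
        return -1;
    aux->poscache_size = size;
    return 0;
}

/* Single cleanup point, run at the end of the match or on error. */
static void
aux_cleanup(struct match_aux *aux)
{
    free(aux->poscache);             /* free(NULL) is a no-op */
    aux->poscache = NULL;
    aux->poscache_size = 0;
}
```

Because each match owns its own pointer, a nested match inside a code block cannot leave a buffer behind in the outer match's state.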
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The previous commit reorganised state save and cleanup at the end of regex
execution. Use this new mechanism, by recording the original values
of PL_regmatch_slab and PL_regmatch_state in the regmatch_info_aux struct,
and restoring them and freeing higher slabs as part of the general
S_cleanup_regmatch_info_aux() destructor, rather than pushing the old
values directly onto the savestack and using another specific destructor.
Also, make the initial allocating of (up to) 3 PL_regmatch_state slots
more efficient by doing it in a loop.
We also skip the first slot; this may already be in use if we're called
reentrantly.
try 1
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously the regmatch_info struct was allocated as a local var on the C
stack, while some extra state (only needed for regexes having (?{})) was
malloced (as a regmatch_eval_state struct) as needed, with a destructor set
up to clean it up afterwards. This was because the stuff being cleaned
up couldn't be allocated on the C stack, as it needed to hang around after
a croak().
Reorganise this so that:
* regmatch_info is on the C stack as before.
* a new struct, regmatch_info_aux is allocated within the first slot of the
regmatch_state stack, for fields which must always exist but which need
cleanup afterwards. This is currently unused, but will be shortly.
* a new struct, regmatch_info_aux_eval (which is just a renamed
regmatch_eval_state struct), is optionally allocated in the second
slot of regmatch_state. This is logically part of regmatch_info_aux,
except that splitting it in two stops it being too large to fit in a
regmatch_state slot (we can fit it in two instead).
(The second and third structs aren't allocated when we're in intuit()
rather than regexec()).
Doing it like this simplifies allocation and cleanup: there's no need for
a malloc(), and we are already going to allocate a slab's worth of
regmatch_state slots, so using an extra one or two of them is effectively
free; and the cleanup just requires calling a single overall destructor.
In the next few commits, more of the regexec() state setup and tear-down
will be integrated into this new regime. And in particular, the new
regmatch_info_aux struct will give us somewhere to hang things like
PL_reg_poscache once it stops being global (it being local state that
needs cleanup).
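The slot layout described above might look something like the sketch below. All type names, field names and sizes are invented for illustration; the real structs in regexp.h differ.

```c
#include <stddef.h>

/* Hypothetical always-present state that needs cleanup. */
struct match_aux      { void *poscache; void *old_slab; void *old_state; };

/* Hypothetical extra state, only needed for (?{}) patterns. */
struct match_aux_eval { void *oldcurpm; char *subbeg;
                        size_t sublen, suboffset, subcoffset; };

/* One slot of the backtracking stack; the aux structs reuse a slot
 * instead of being malloc()ed separately. */
union state_slot {
    struct { int type; void *resume; long data[6]; } frame;
    struct match_aux      aux;
    struct match_aux_eval aux_eval;
};

struct slab { union state_slot slots[8]; };

/* Slot 0 holds the aux struct, slot 1 the optional eval part,
 * and ordinary backtrack frames start at slot 2. */
static struct match_aux *
slab_aux(struct slab *s)          { return &s->slots[0].aux; }

static struct match_aux_eval *
slab_aux_eval(struct slab *s)     { return &s->slots[1].aux_eval; }

static union state_slot *
slab_first_frame(struct slab *s)  { return &s->slots[2]; }
```

Since the slab is allocated anyway, the two structs cost no extra allocation, and one destructor that knows about slots 0 and 1 can clean up both.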
|
|
|
|
|
|
|
|
|
|
|
|
| |
regexec() calls regtry() for each match start position until a match is
found. regtry() has some code that says: "if this regex contains (?{})'s,
and if we haven't done so already, set up the extra state needed".
Move the setup one level higher, into regexec(). Here, we just do it once,
and don't have to keep checking whether we've already done it.
This is part of an effort to consolidate all regex state initialisation
into one place.
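In outline, the change is a loop-invariant hoist. The sketch below uses hypothetical names (struct info, setup_eval_state, try_at, exec_match) and a trivial "match" test purely to show the shape; it is not the regexec.c code.

```c
#include <stdbool.h>
#include <stddef.h>

struct info {
    bool has_eval;   /* pattern contains code blocks */
    int  setups;     /* how many times setup ran (for demonstration) */
};

static void
setup_eval_state(struct info *ri)
{
    ri->setups++;
}

/* Stand-in for one match attempt at a given start position. */
static bool
try_at(struct info *ri, const char *s, char want)
{
    (void)ri;
    return *s == want;
}

static const char *
exec_match(struct info *ri, const char *str, char want)
{
    /* Done once here, rather than checked on every attempt. */
    if (ri->has_eval)
        setup_eval_state(ri);

    for (const char *s = str; *s; s++) {
        if (try_at(ri, s, want))
            return s;
    }
    return NULL;
}
```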
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, S_regmatch() has, in its outermost scope:
oldsave = PL_savestack_ix;
SAVEDESTRUCTOR_X(S_clear_backtrack_stack, NULL);
SAVEVPTR(PL_regmatch_slab);
SAVEVPTR(PL_regmatch_state);
... rest of function ....
/* clean up; in particular, free all slabs above current one */
LEAVE_SCOPE(oldsave);
This means that at the end of regmatch(), all slabs in the regmatch_state
stack above where we started, are freed.
Hoist this two levels higher into Perl_regexec_flags(). Now, since
a) the main activity of regexec() is to call regmatch() (via regtry()) for
each possible string start position until a match is found;
b) there isn't any other savestack manipulation between the two functions;
the main effect of this change is that higher slabs in the regmatch_state
stack are only freed at the end of all match attempts from all positions,
rather than after each fail at a particular start point. Since the
repeated calls to regmatch() are likely to have a similar pattern of
regmatch_state stack usage, this should usually be an efficiency win.
It is also part of the plan to consolidate all the setting up of local match
state in one place.
|
|
|
|
|
| |
Move these two fields of PL_reg_state into the regmatch_info struct, so
they are local to each match.
|
|
|
|
|
|
|
|
|
|
| |
Now that the RXf_MATCH_UTF8 flag on a regex is just used to indicate
whether the captures on a successful match are utf8, only set
this flag at the end of a successful match, rather than at the start of
the match.
This should make no functional difference the way things stand at the
moment, but makes things conceptually cleaner.
|
|
|
|
|
|
|
|
|
| |
A typo in this function meant that the same value rex->suboffset was
saved to both eval_state->suboffset and eval_state->subcoffset.
However, I can't think of a test case where this will fail, since each
qr// gets its own set of suboffset vars these days (and thus I wonder
whether there's even any need to save them when calling (?{})).
|
|
|
|
|
| |
Earlier commits made the use of this var just local to the current
match, so move it to the local regmatch_info struct instead.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since this value is actually just always equal to cBOOL(RX_UTF8(rx)),
there's no need to save the old value of the local boolean
(as u.eval.saved_utf8_pat) when switching back and forwards between
regexes with (??{}); instead, just re-calculate it whenever we switch,
and update reginfo->is_utf8_pat and its cached value in the is_utf8_pat
local var accordingly.
Also, pass reginfo as an arg to S_setup_EXACTISH_ST_c1_c2() rather than
is_utf8_pat directly; this will allow us to eliminate PL_reg_match_utf8
shortly.
A new test is included that detects a mistake I made while working up
this change: I recalculated is_utf8_pat, but forgot to update
reginfo->is_utf8_pat too.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The way that the regex engine knows that the match string is utf8 is
currently a complete mess. It's partially signalled by the utf8 flag of
the passed SV, but also by the RXf_MATCH_UTF8 flag in the regex itself,
and the value of PL_reg_match_utf8.
Currently all the callers of the engine (such as pp_match, pp_split etc)
initially use RX_MATCH_UTF8_set() before calling the engine. This sets both
the RXf_MATCH_UTF8 flag on the regex, and PL_reg_match_utf8.
Then the two entry points to the engine (regexec_flags() and
re_intuit_start()) initially repeat the RX_MATCH_UTF8_set()
themselves.
Remove the usage of RX_MATCH_UTF8_set() by the callers of the engine,
and instead just rely on the engine to do it.
Also, remove the "secret" setting of PL_reg_match_utf8 by
RX_MATCH_UTF8_set(), and do it explicitly.
This is a prelude to eliminating PL_reg_match_utf8.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Replace several PL_reg* vars with a new struct. This is part of the
goal of removing all global regex state.
These particular vars are used in the case of a regex with (?{}) code
blocks. In this case, when the code in a block is called, various bits of
state (such as $1, pos($_)) are temporarily set up, even though the match
has not yet completed.
This involves updating the current PL_curpm to point to a fake PMOP which
points to the regex currently being executed. That regex has all its
current fields that are associated with captures (such as subbeg)
temporarily saved and overwritten with the current partial match results.
Similarly, $_ is temporarily aliased to the current match string, and any
old pos() position is saved. This saving was formerly done to the various
PL_reg* vars.
When the regex has finished executing (or if the code block croaks),
its fields are restored to the original values. Since this can happen in a
croak, it may be done using SAVEDESTRUCTOR_X() on the save stack. This
precludes just moving the PL_reg* vars into the regmatch_info struct,
since that is just allocated as a local var in regexec_flags(), and would
have already been abandoned and possibly overwritten after the croak and
longjmp, but before the SAVEDESTRUCTOR_X() action is taken.
So instead we put all the vars into a new struct, and malloc that on entry to
the regex engine when we know we need to copy the various fields.
We save a pointer to that in the regmatch_info struct, as well as passing
it to SAVEDESTRUCTOR_X(). The destructor may get called up to twice in the
non-croak case: first it's called explicitly at the end of regexec_flags(),
which restores subbeg etc; then again from the savestack, which just
free()s the struct. In the croak case, it's called just once, and does
both the restoring and the freeing.
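The "called up to twice" behaviour can be sketched as below. This is an assumption-laden illustration: the struct, the 'final' parameter and the 'frees' counter are all hypothetical, and the real destructor distinguishes the two calls in its own way.

```c
#include <stdbool.h>
#include <stdlib.h>

static int frees = 0;    /* counts frees, for demonstration only */

struct eval_state {
    int  *target;        /* field temporarily overwritten during the match */
    int   saved;         /* its original value */
    bool  restored;      /* set once the original value has been put back */
};

/* Save the old value and install a temporary one. */
static struct eval_state *
eval_state_new(int *target, int tmp)
{
    struct eval_state *st = malloc(sizeof *st);
    if (!st)
        return NULL;
    st->target   = target;
    st->saved    = *target;
    st->restored = false;
    *target = tmp;
    return st;
}

/* final == false: the explicit call at normal exit (restore only).
 * final == true:  the call from the save stack (restore if that has
 *                 not happened yet, then free). */
static void
eval_state_cleanup(struct eval_state *st, bool final)
{
    if (!st->restored) {
        *st->target  = st->saved;
        st->restored = true;
    }
    if (final) {
        free(st);
        frees++;
    }
}
```

On the normal path the second call sees restored == true and only frees; on the croak path the single call does both jobs.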
The vars / PL_reg_state fields this commit eliminates are:
re_state_eval_setup_done
PL_reg_oldsaved
PL_reg_oldsavedlen
PL_reg_oldsavedoffset
PL_reg_oldsavedcoffset
PL_reg_magic
PL_reg_oldpos
PL_nrs
PL_reg_oldcurpm
|
|
|
|
|
|
|
|
|
|
| |
This function only set up a destructor if reginfo->sv existed. If not,
some stuff which was set up wouldn't be restored if the code died within a
(?{}) block.
As it happens, the way the regex engine is called from perl core means we
always pass a valid SV; but in theory someone could call the engine from
XS while passing just a string and no SV.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There's a block of code in S_regtry() that looks a bit like:
if ((prog->extflags & RXf_EVAL_SEEN) && not_yet_done)
{
...
}
Move this block of code out into a separate static function,
S_setup_eval_state(). No functional changes.
Also, rename the corresponding static cleanup/destructor function from
restore_pos() to S_restore_eval_state(), to better reflect what it does
these days (restoring pos() being only a small part of it).
|
| |
|
|
|
|
|
| |
by moving it from the global PL_reg_state struct to the local reginfo
struct.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(note that this is a change both to the perl API and the regex engine
plugin API).
Currently, Perl_re_intuit_start() is passed an SV, plus pointers to:
where in the string to start matching (strpos); and to the end of the
string (strend).
Unlike Perl_regexec_flags(), it doesn't also have a strbeg arg.
Because of this, it guesses strbeg: based on the passed SV (if it's
SvPOK()); or just set to strpos otherwise. This latter can happen if for
example the SV is overloaded. Note also that this latter guess is wrong,
and could in theory make /\b.../ fail.
But just to confuse matters, although Perl_re_intuit_start() itself uses
its guesstimate strbeg var, some of the functions it calls use the global
value of PL_bostr instead. To make this work, the *callers* of
Perl_re_intuit_start() currently set PL_bostr first. This is why \b
doesn't actually break.
The fix to this unholy mess is to simply add a strbeg arg to
Perl_re_intuit_start(). It's also the first step to eliminating PL_bostr
altogether.
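To see why a wrong strbeg could matter for \b, here is a simplified ASCII-only boundary check. The function names are hypothetical and this is not how regexec.c implements \b; it only shows that the answer depends on where the string is believed to start.

```c
#include <ctype.h>
#include <stdbool.h>

static bool
is_word(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True if 'pos' sits between a word char and a non-word char
 * (or a string edge). Looking at pos[-1] is only legal when
 * pos > strbeg, so the caller must pass the real string start. */
static bool
at_word_boundary(const char *strbeg, const char *pos, const char *strend)
{
    bool before = (pos > strbeg) && is_word((unsigned char)pos[-1]);
    bool after  = (pos < strend) && is_word((unsigned char)*pos);
    return before != after;
}
```

For the string "foobar" at offset 3, passing the true start gives "no boundary", while guessing strbeg = strpos hides the preceding 'o' and wrongly reports a boundary.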
|
|
|
|
|
|
| |
Remove the is_utf8_pat arg from these two static functions in regexec.c.
Since both these functions are now passed a valid reginfo pointer, this
info is already available as one of the fields in that struct.
|
|
|
|
|
|
|
| |
This is another global regex state variable (actually a field of
PL_reg_state). Eliminate it by moving it into the regmatch_info struct
instead, which is local to each match. Also, rename it to strend, which is
a less misleading description in these exciting days of multi-line matches.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
regmatch_info is a small struct that is currently directly allocated as a
local var in Perl_regexec_flags(), and has a few fields that maintain part
of the state of the current pattern match. It is passed as an arg to
various functions that regexec_flags() calls, such as regtry().
In some ways it's a rival to PL_reg_state, which also maintains state for
the current match, but which is a global variable (whose state needs
saving and restoring whenever the regex engine goes reentrant). It makes
more sense to store state in the regmatch_info struct, and as a first step
in moving more state to there, this commit makes more use of
regmatch_info.
In particular, it makes Perl_re_intuit_start() also allocate such a
struct, so that now *both* the main execution entry points to the regex
engine make use of it. It's also now passed as an arg to more of the static
functions that these two op-level ones call.
Two changes of special note. First, whether S_find_byclass() got called
with a null reginfo pointer or not indicated whether it had been called
from Perl_regexec_flags() (with a valid reginfo pointer), or from
Perl_re_intuit_start() (null pointer). Since they both pass non-null
reginfo pointers now, instead we add an extra field, reginfo->intuit that
indicates who's the top-level caller.
Secondly, to allow in future for various macros to uniformly refer to
values like reginfo->foo, where the structure is actually allocated as a
local var in Perl_regexec_flags(), we change the reginfo from being the
struct itself to being a pointer to the struct, (so Perl_regexec_flags
itself now uses reginfo->foo too rather than reginfo.foo).
In summary, all the above is essentially window dressing that makes no
functional changes to the code, but will facilitate future changes.
|
|
|
|
| |
This will be used in future commits to pass more flags.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Two non-API macros were added with 5.17.1 to support the more
complex calling conventions required by /(?{})/ code blocks:
PUSH_MULTICALL_WITHDEPTH(the_cv, depth)
CHANGE_MULTICALL_WITHDEPTH(the_cv, depth)
which allowed us to do the same as the API versions, but to optionally
not increment the caller depth, and to change the current CV.
Replace these with two new macros:
PUSH_MULTICALL_FLAGS(the_cv, flags)
CHANGE_MULTICALL_FLAGS(the_cv, flags)
which instead allow us to set extra flags in cx->cx_type.
The depth increment skip is handled by the new CXp_SUB_RE_FAKE flag,
and all (?{}) calls set the new CXp_SUB_RE flag.
These two new flags will shortly allow us to change how caller() and
__SUB__ handle code blocks.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When re-parsing a pattern for run-time (?{}) code blocks,
we end up with the EVAL_RE_REPARSING flag set in PL_in_eval.
Currently we clear this flag as soon as scan_str() returns, to ensure that
it's not set if we happen to parse further patterns (e.g. within the
(?{ ... }) code itself).
However, a soon-to-be-applied bugfix requires us to know the reparsing
state beyond this point. To solve this, we add a new boolean flag to the
parser struct, which is set from PL_in_eval in S_sublex_push() (with the
old value being saved). This allows us to have the flag around for the
entire pattern string parsing phase, without it affecting nested pattern
compilation.
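The save-on-push idea can be sketched like this. Everything here is a hypothetical stand-in: 'in_eval_reparsing' models the bit in PL_in_eval, 'struct parser' models the parser struct, and the real S_sublex_push() saves the old value via the save stack rather than returning it.

```c
#include <stdbool.h>

static bool in_eval_reparsing;      /* stand-in for the global bit */

struct parser {
    bool re_reparsing;              /* per-pattern copy of the flag */
};

/* Copy the global into the parser on entry to a pattern, returning
 * the previous value so that the caller can restore it later. */
static bool
sublex_push(struct parser *p)
{
    bool old = p->re_reparsing;
    p->re_reparsing = in_eval_reparsing;
    return old;
}

/* Restore the value that was current before the matching push. */
static void
sublex_pop(struct parser *p, bool old)
{
    p->re_reparsing = old;
}
```

The global can then be cleared early, as before, while the outer pattern keeps its own copy for the whole parse and a nested pattern sees a cleared flag.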
|
|
|
|
|
|
| |
The previous commit added an alternative flag mechanism to
PL_reg_state.re_reparsing, but kept the old one around for consistency
checking. Remove the old one now.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PL_reg_state.re_reparsing is a hacky flag used to allow runtime
code blocks to be included in patterns. Basically, since code blocks
are now handled by the perl parser within literal patterns, runtime
patterns are handled by taking the (assembled at runtime) pattern,
and feeding it back through the parser via the equivalent of
eval q{qr'the_pattern'},
so that run-time (?{..})'s appear to be literal code blocks.
When this happens, the global flag PL_reg_state.re_reparsing is set,
which modifies lexing and parsing in minor ways (such as whether \\ is
stripped).
Now, I'm in the slow process of trying to eliminate global regex state
(i.e. gradually removing the fields of PL_reg_state), and also a change
which will be coming a few commits ahead requires the info which this flag
indicates to linger for longer (currently it is cleared immediately after
the call to scan_str()).
For those two reasons, this commit adds a new mechanism to indicate this:
a new flag to eval_sv(), G_RE_REPARSING (which sets OPpEVAL_RE_REPARSING
in the entereval op), which sets the EVAL_RE_REPARSING bit in PL_in_eval.
It's still a yukky global flag hack, but it's a *different* global flag hack
now.
For this commit, we add the new flag(s) but keep the old
PL_reg_state.re_reparsing flag and assert that the two mechanisms always
match. The next commit will remove re_reparsing.
|
|
|
|
| |
Apparently this breaks the z/OS 1.13 C preprocessor. (Yay, now I can say I have written code for z/OS!)
|
|
|
|
|
|
| |
This code does a save_re_context() and then calls swash_init, which also
does a save_re_context. This is unnecessary; the save should be done in
the lowest possible level.
|
| |
|
| |
|
|
|
|
| |
This appears to fix the regression introduced in c7304fe2c
|
|
|
|
| |
The latter is more clearly named to indicate it includes the underscore.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
/[[:upper:]]/i and /[[:lower:]]/i should match the Unicode property
\p{Cased}. This commit introduces a pseudo-Posix class, internally named
'cased', to represent this. This class isn't specifiable by the user,
except through using either /[[:upper:]]/i or /[[:lower:]]/i. Debug
output will say ':cased:'.
The regex parsing either of :lower: or :upper: will change them into
:cased:, where already existing logic can handle this, just like any
other class.
This commit fixes the regression introduced in
3018b823898645e44b8c37c70ac5c6302b031381, and also the fact that these have
never worked under 'use locale'. The next commit will un-TODO the tests for
these things.
|
|
|
|
|
|
|
|
| |
Until recently, these needed to be (or it made sense for them to be) in
order of the numerical value that the rhs of each #define evaluates to. But now,
they are all initialized to something else, and the numerical value is
not even apparent. Alphabetical order gives a logical ordering to help
a reader find things.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This creates a regnode specifically for the synthetic start class, which
is a type of ANYOF node. The flag bit previously used to denote this is
removed. This paves the way for this bit to be freed up, but first the
other use of this bit must also be removed, which will be done in the
next commit.
There are now three ANYOF-type regnodes. This one should be called only
in one place in regexec.c. The other special one is ANYOF_WARN_SUPER.
A synthetic start class node should not do any warning, so there is no
issue of having something need to be both types.
|
|
|
|
|
|
|
| |
This uses a regnode type, of which we have many available, to free up
a bit in the ANYOF regnode flag field, of which we have none, and are
trying to have the same bit do double duty. This will enable us to
remove some of that double duty in the next commit.
|
|
|
|
|
|
|
|
|
|
|
| |
This global flag is cleared at the start of execution, and then set if
any locale-based nodes are executed. At the end of execution, the
RXf_TAINTED_SEEN flag on the regex is set/cleared based on RF_tainted.
We eliminate RF_tainted by simply directly setting RXf_TAINTED_SEEN
each time a taintable node is executed.
This is the final step before eliminating PL_reg_flags.
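Schematically, the flag now lives only on the regex. In the sketch below the bit value, struct regex and the node kinds are hypothetical, and clearing the bit at the start of execution is an assumption carried over from the old behaviour.

```c
#include <stddef.h>

#define RXF_TAINTED_SEEN 0x1u   /* hypothetical bit value */

struct regex { unsigned extflags; };

enum node_kind { NODE_PLAIN, NODE_LOCALE };

static void
exec_nodes(struct regex *rx, const enum node_kind *nodes, size_t n)
{
    /* Assumed: the bit is cleared at the start of execution. */
    rx->extflags &= ~RXF_TAINTED_SEEN;

    for (size_t i = 0; i < n; i++) {
        /* Set directly on the regex; no intermediate global flag. */
        if (nodes[i] == NODE_LOCALE)
            rx->extflags |= RXF_TAINTED_SEEN;
    }
}
```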
|
|
|
|
|
|
|
|
| |
This global flag indicates whether the currently executing regex has
issued a recursion limit warning yet.
Replace it with a boolean var local to the regmatch_info struct.
This is a second step to eliminating PL_reg_flags.
|
|
|
|
|
|
|
|
| |
This global flag indicates whether the currently executing regex is utf8.
Replace it with a boolean var local to the matching function, and pass
it around via function args, or as a member of the regmatch_info struct.
This is a first step to eliminating PL_reg_flags.
|
|
|
|
|
|
|
| |
Perl_re_intuit_start would set, but never unset, the RF_utf8 flag in
PL_reg_flags. This meant that two successive patterns, the first utf8 and
the second not, that are processed using only intuit, will get the flag
wrong on the second one. The fix is trivial.
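The bug pattern, reduced to its essentials (the names and bit value here are hypothetical stand-ins for PL_reg_flags and RF_utf8):

```c
#define RF_UTF8 0x1u            /* hypothetical bit value */

static unsigned reg_flags;      /* stand-in for the global flags word */

/* Before: the bit is set for a utf8 pattern but never cleared, so it
 * leaks into the next call. */
static void
intuit_buggy(int is_utf8)
{
    if (is_utf8)
        reg_flags |= RF_UTF8;
}

/* After: the bit always reflects the current pattern. */
static void
intuit_fixed(int is_utf8)
{
    if (is_utf8)
        reg_flags |= RF_UTF8;
    else
        reg_flags &= ~RF_UTF8;
}
```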
|
| |
|
|
|
|
|
|
|
|
| |
These two instances of 'if (a) { b } if (c) { b }' are combined to
if (a || c) { b }
The final instance is made into an else since the 'if' before it does a
break, so that the break is eliminated.
|