| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
| |
We use U16 for various internal logic related to parens. If we
exceed this count things will go silently wrong. We might as
well throw a proper error during compilation to detect this.
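A minimal sketch of the kind of compile-time guard this describes; the
function name and message are illustrative stand-ins, not the actual
regcomp.c code:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Parens are tracked in U16 fields internally, so refuse any pattern
     * whose capture count cannot be represented in 16 bits. */
    static void check_paren_count(uint32_t paren_count) {
        if (paren_count > UINT16_MAX) {
            fprintf(stderr, "Too many capture groups in regex (limit %u)\n",
                    (unsigned) UINT16_MAX);
            exit(1);
        }
    }

    int main(void) {
        check_paren_count(70000);   /* dies during "compilation" */
        return 0;
    }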
|
|
|
|
|
|
| |
This category is only used in the regex engine, so we should be able
to disable it specifically, as it seems we will never actually
remove support for the things it warns about.
|
| |
|
|
|
|
|
|
|
| |
We will move various members of the regexp structure to a new
structure which just contains information about the match. Wrapping
the members in the standard macros means that change can be made
less invasively. We already did all of this in regexec.c.
|
|
|
|
|
|
|
|
|
|
| |
This insulates access to the regexp match offset data so we can
fix the define later and move the offset structure into a new struct.
The RXp_OFFSp() macro was introduced in a recent commit to deliberately
break anything using RXp_OFFS() directly. It is deliberately hard to
type; nothing but the internals should use it. Everything else should
use one of the wrappers around it.
|
|
|
|
|
| |
This way we can avoid pushing every buffer; we only need to push
the nestroot of the ref.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This eliminates the regnode_2L data structure and merges it with the older
regnode_2 data structure. At the same time it makes each "arg" property of the
various regnode types that have one be consistently structured as an anonymous
union like this:
    union {
        U32 arg1u;
        I32 arg1i;
        struct {
            U16 arg1a;
            U16 arg1b;
        };
    };
We then expose four macros for accessing each slot: ARG1u(), ARG1i(), ARG1a()
and ARG1b(). Code then explicitly designates which one it wants. The old
logic used ARG() to access a U32 arg1, and ARG1() to access an I32 arg1,
which was confusing to say the least. The regnode_2L structure had a U32 arg1
and an I32 arg2, and the regnode_2 data structure had two I32 args. With the
new set of macros we use regnode_2 for both, and use the appropriate macros to
show whether we want signed or unsigned values.
This also renames regnode_4 to regnode_3. The 3 stands for "three 32-bit
args". However, as each slot can also store two U16s, a regnode_3 can hold up
to six U16s, or three I32s, or a combination. For instance the CURLY style
nodes use regnode_3 to store four values: ARG1i() for the min count, ARG2i()
for the max count, and ARG3a() and ARG3b() for the parens before and inside
the quantifier.
It also renames the function reganode() to reg1node() and reg2Lanode() to
reg2node(). The 2L naming was just confusing.
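As a rough standalone illustration of the accessor pattern described above
(this is a toy, not the perl headers; only the union layout and the macro
names come from the commit message, everything else is assumed):

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t U32;
    typedef int32_t  I32;
    typedef uint16_t U16;

    /* Simplified stand-in for a regnode's first 32-bit argument slot. */
    struct toy_regnode_1 {
        union {
            U32 arg1u;
            I32 arg1i;
            struct {
                U16 arg1a;
                U16 arg1b;
            };
        };
    };

    /* In the spirit of ARG1u()/ARG1i()/ARG1a()/ARG1b(): the caller states
     * explicitly whether it wants the slot as unsigned, signed, or as two
     * packed U16 halves. */
    #define ARG1u(p) ((p)->arg1u)
    #define ARG1i(p) ((p)->arg1i)
    #define ARG1a(p) ((p)->arg1a)
    #define ARG1b(p) ((p)->arg1b)

    int main(void) {
        struct toy_regnode_1 n = {0};
        ARG1a(&n) = 2;   /* e.g. parens before a quantifier */
        ARG1b(&n) = 5;   /* e.g. parens inside it */
        printf("same slot as one U32: %u\n", (unsigned) ARG1u(&n));
        return 0;
    }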
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This way we can do the required paren restoration only when it is in use. When
we match a REF type node which is potentially a reference to an unclosed paren,
we push the match context information, currently for "everything", but in a
future patch we can teach it to be more efficient by adding a new parameter to
the REF regop to track which parens it should save.
This converts the backtracking changes from the previous commit so that they
run only when specifically enabled via the define RE_PESSIMISTIC_PARENS, which
is 0 by default. We don't make the new fields in the struct conditional: the
stack frames are large, our changes don't make any real difference, and it
keeps things simpler to not have conditional members, especially since some of
the structures have to line up with each other.
If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug, it means something
is sensitive to us not necessarily restoring the parens properly on failure.
We make some assumptions that the paren state after a failing state will be
corrected by a future successful state, or that the state of the parens is
irrelevant as we will fail anyway. This can be made untrue by EVAL, backrefs,
and potentially some other scenarios. Thus I have left this inefficient logic
in place but guarded by the flag.
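A trivial sketch of the default-off guard pattern being described; only the
define name comes from the commit message, the rest is illustrative:

    #include <stdio.h>

    /* Default-off: the pessimistic paren save/restore is compiled in only
     * when explicitly requested, e.g. by building with
     * -DRE_PESSIMISTIC_PARENS=1. */
    #ifndef RE_PESSIMISTIC_PARENS
    #  define RE_PESSIMISTIC_PARENS 0
    #endif

    int main(void) {
    #if RE_PESSIMISTIC_PARENS
        puts("pessimistic: save full paren state at every backtracking frame");
    #else
        puts("default: save paren state only where it is known to be needed");
    #endif
        return 0;
    }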
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at
the same time. When a branch fails it should reset any capture buffers
that might have been touched by that branch.
We change BRANCH and BRANCHJ to store the number of parens before the
branch, and the number of parens after the branch was completed. When
a BRANCH operation fails, we clear the buffers it contains before we
continue on.
It is a bit more complex than it should be because we have both BRANCHJ
and BRANCH. (One of these days we should merge them together.)
It is also made somewhat more complex because TRIE nodes are actually
branches and may need to track capture buffers too, at two levels: the
overall TRIE op, and, especially for jump tries, the individual branches
whose behavior we emulate. So we have to do the same clearing logic if
a trie branch fails as well.
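An illustrative toy of the clearing step (this is not the regexec.c code):
when a failed branch covers capture buffers paren_before+1 .. paren_after,
just those buffers are invalidated before the next alternative is tried:

    #include <stdio.h>

    typedef struct { long start, end; } capture;   /* assumed offset pair */

    /* Reset only the capture buffers opened inside a failed branch. */
    static void clear_branch_captures(capture *offs, int paren_before,
                                      int paren_after) {
        for (int n = paren_before + 1; n <= paren_after; n++)
            offs[n].start = offs[n].end = -1;
    }

    int main(void) {
        /* For /((a)(b)|(a))+/ the first alternative holds buffers 2..3.
         * Suppose it tentatively set buffer 2 before failing on (b): */
        capture offs[5] = { {0,1}, {0,1}, {0,1}, {-1,-1}, {-1,-1} };
        clear_branch_captures(offs, 1, 3);
        printf("$2 still set? %s\n", offs[2].start != -1 ? "yes" : "no");
        return 0;
    }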
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This was originally a patch which made somewhat drastic changes to how
we represent capture buffers, which Dave M and I are still discussing
offline and which has a larger impact than is acceptable to address at
the current time. As such I have reverted the controversial parts of
this patch for now, while keeping most of it intact even if in some
cases the changes are unused except for debugging purposes.
This patch still contains valuable changes, for instance teaching CURLYX
and CURLYM about how many parens there are before the curly[1] (which
will be useful in follow-up patches even if, strictly speaking, they are
not directly used yet), tests, and other cleanups. Also, this patch is
sufficiently large that reverting it entirely would have a large effect
on the patches that were made on top of it.
Thus keeping most of this patch while eliminating the controversial
parts of it for now seemed the best approach, especially as some of the
changes it introduces and the follow-up patches based on it are very
useful in cleaning up the structures we use to represent regops.
[1] Curly is the regexp internals term for quantifiers, named after the
x{min,max} "curly brace" quantifiers.
|
|
|
|
|
|
| |
The comment in the code was wrong. The workaround therein is for
RT #133135, not for what the comment said. But 133135 is GH #16520, so
change to the new number.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
And by removing it we silence the following build warning:
    regcomp.c:2285:28: warning: format specifies type 'int' but the argument has type 'I32' (aka 'long') [-Wformat]
        parno, RExC_parno_to_logical[parno], RExC_parno_to_logical_next[parno]);
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ./regcomp_internal.h:212:37: note: expanded from macro 'RExC_parno_to_logical'
    #define RExC_parno_to_logical (pRExC_state->parno_to_logical)
                                  ^
    regcomp.c:2285:58: warning: format specifies type 'int' but the argument has type 'I32' (aka 'long') [-Wformat]
        parno, RExC_parno_to_logical[parno], RExC_parno_to_logical_next[parno]);
                                             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    ./regcomp_internal.h:213:37: note: expanded from macro 'RExC_parno_to_logical_next'
    #define RExC_parno_to_logical_next (pRExC_state->parno_to_logical_next)
                                       ^
    2 warnings generated.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Dave M pointed out that this idea was flawed, and after some testing I have
come to agree with him. This removes it. It was only available in 5.37.8,
so there is no deprecation cycle involved.
The point of (**{ ... }) was to have a postponed eval that does not disable
optimizations. But some of those optimizations are disabled for a reason: if
they are not, we do not match correctly, as the optimizations will make
unwarranted assumptions about the pattern, assumptions which can be incorrect
depending on what pattern is returned from the code block. The original idea
was proposed because (?{ ... }) was treated as though it was (??{ ... }) and
disabled many optimizations, when in fact it doesn't interact with
optimizations at all. When I added (*{ ... }) as the optimistic version of
(?{ ... }) I used "completeness" as the justification for also adding
(**{ ... }), when it does not make sense to do so.
|
|
|
|
|
|
|
|
| |
In fe5492d916201ce31a107839a36bcb1435fe7bf0 I made a fencepost error
copying the logical_to_parno and related data structures. They all needed a
+1 on their size as the paren counts (logical and physical) as they have
to account for capture buffer 0 which is always present which represents
the entire match.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds (*{ ... }) and (**{ ... }) as equivalents to (?{ ... }) and
(??{ ... }). The only difference is that the star variants are
"optimistic" and are defined to never disable optimisations. This is
especially relevant now that use of (?{ ... }) prevents important
optimisations anywhere in the pattern, instead of the older and inconsistent
rules where it only affected the parts that contained the EVAL.
It is also very useful for injecting debugging-style expressions into a
pattern to understand what the regex engine is actually doing. The older
style (?{ ... }) variants would change the regex engine's behavior, meaning
this was not as effective a tool as it could have been.
Similarly, it is now possible to test that a given regex optimisation
works correctly using (*{ ... }), which was not possible with (?{ ... }).
|
|
|
|
|
|
|
|
| |
This allows us to resolve a test inconsistency between CURLYX, CURLY,
and CURLYM, which have different maximums. We use I32 and not U32 because
the existing count logic uses -1 internally, and using an I32 for the
min/max prevents warnings about comparing signed and unsigned values when
the count is compared against the min or max.
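A small standalone illustration of the signed/unsigned issue motivating the
choice of I32 (toy values, not the engine's):

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int32_t count = -1;    /* the existing count logic uses -1 internally */

        /* If the max were a U32, "count < max" would promote count to
         * unsigned, turning -1 into 4294967295, and compilers emit a
         * -Wsign-compare warning for the mixed comparison: */
        uint32_t umax = 65535;
        printf("unsigned max: %d\n", (uint32_t) count < umax);   /* 0: wrong */

        /* With the min/max held as I32 the comparison behaves as intended
         * and there is no signedness warning: */
        int32_t imax = 65535;
        printf("signed max:   %d\n", count < imax);              /* 1: right */
        return 0;
    }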
|
|
|
|
| |
The tight & is hard to read.
|
|
|
|
| |
They are unused these days.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Branch reset was hacked in without much thought about how it might interact
with other features. Over time we added named captures and recursive patterns
with GOSUB, but I guess because branch reset is somewhat esoteric we didn't
notice the accumulating issues related to it.
The main problem was that my original hack used a fairly simple device to give
multiple OPEN/CLOSE opcodes the same target buffer id. When it was introduced
this was fine. When GOSUB was added later, however, we overlooked that this
broke a key part of the book-keeping for GOSUB.
A GOSUB regop needs to know where to jump to, and which close paren to stop
at. However, the structure of the regexp program can change from the time the
regop is created. This means we keep track of every OPEN/CLOSE regop we
encounter during parsing, and when something is inserted into the middle of
the program we make sure to move the offsets we store for the OPEN/CLOSE data.
This is essentially keyed and scaled to the number of parens we have seen.
When branch reset is used, however, the number of OPEN/CLOSE regops is more
than the number of logical buffers we have seen, and we only move one of the
OPEN/CLOSE buffers that is in the branch reset, which of course breaks things.
Another issue with branch reset is that it creates weird artifacts like this:
/(?|(?<a>a)|(?<b>b))(?&a)(?&b)/, where the (?&b) actually maps to the (?<a>a)
capture buffer because they both have the same id. Another case is that you
cannot check whether $+{b} matched and $+{a} did not, because conceptually
they were the same buffer under the hood.
These bugs are now fixed. The "aliasing" of capture buffers to each other is
now done virtually, and under the hood each capture buffer is distinct. We
introduce the concept of a "logical parno", which is the user-visible capture
buffer id, and keep it distinct from the true capture buffer id. Most of the
internal logic uses the "true parno" for its business, so a bunch of problems
go away, and we keep maps from logical to physical parnos, and vice versa,
along with a map that gives us the "next physical parno with the same
logical parno". Thus we can quickly skip through the physical capture buffers
to find the one that matched. This means we also have to introduce a
logical_total_parens as well, to complement the already existing total_parens.
The latter refers to the true number of capture buffers; the former represents
the logical number visible to the user.
It is helpful to consider the following table:
    Logical:  $1 $2 $3 $2 $3 $4 $2 $5
    Physical:  1  2  3  4  5  6  7  8
    Next:      0  4  5  7  0  0  0  0
    Pattern: /(pre)(?|(?<a>a)(?<b>b)|(?<c>c)(?<d>d)(?<e>e)|(?<f>))(post)/
The names are mapped to physical buffers. So $+{b} will show what is in
physical buffer 3, but $3 will show whichever of buffer 3 or 5 matched.
Similarly @{^CAPTURE} will contain 5 elements, not 8, while %+ will contain
all 6 named buffers.
Since the need to map these values is rare, we only store these maps when they
are needed and branch reset has been used; when they are NULL it is assumed
that physical and logical buffers are identical.
Currently the way this change is implemented will likely break plug-in regexp
engines, because they will be missing the new logical_total_parens field at
the very least. Given that the perl internals code is somewhat poorly
abstracted from the regexp engine, with parts of the abstraction leaking out,
I think this is acceptable. If we want to make plug-in regexp engines work
properly, IMO we need to add some more hooks for them to implement than we
currently do. For instance, mg.c does more work than it should. Given there
are only a few plug-in regexp engines and that it is specialized work, I
think this is acceptable. We can work with the authors to refine the API
properly later.
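As an illustrative toy of the "next physical parno with the same logical
parno" chain described above (this is not the perl implementation; the array
contents are transcribed from the table, everything else is assumed):

    #include <stdio.h>

    /* Data for /(pre)(?|(?<a>a)(?<b>b)|(?<c>c)(?<d>d)(?<e>e)|(?<f>))(post)/,
     * taken from the table above; index 0 is unused in this toy. */
    static const int logical_to_first_physical[] = { 0, 1, 2, 3, 6, 8 };
    static const int next_same_logical[]         = { 0, 0, 4, 5, 7, 0, 0, 0, 0 };

    /* Find which physical buffer matched for a user-visible (logical) paren
     * by walking the chain until one has a valid start offset. */
    static int physical_for_logical(int logical, const long *start) {
        for (int p = logical_to_first_physical[logical]; p;
             p = next_same_logical[p])
            if (start[p] != -1)
                return p;
        return 0;   /* none of the aliased buffers matched */
    }

    int main(void) {
        /* Pretend the second alternative matched: physical 4, 5, 6 are set. */
        long start[9] = { 0, 0, -1, -1, 3, 4, 5, -1, 6 };
        printf("$2 is physical buffer %d\n", physical_for_logical(2, start));
        printf("$3 is physical buffer %d\n", physical_for_logical(3, start));
        return 0;
    }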
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
RX_OFFS() exposes a bit too much about how capture buffers are represented.
This adds RX_OFFS_START(), RX_OFFS_END() and RX_OFFS_VALID() to replace
most of the uses of the RX_OFFS() macro or direct access to the rx->offs[]
array. (We add RX_OFFSp() for those rare cases that should have direct
access to the array.) This allows us to replace this logic with more
complicated macros in the future. Pretty much anything using RX_OFFS() is
going to be broken by future changes, so changing the define allows us to
track it down easily.
Not all uses of the rx->offs[] array are converted; some uses are required
for the regex engine internals, but anything outside of the regex engine
should be using the replacement macros, and most things in the regex
internals should use them also.
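For illustration only (these are not the real definitions from the perl
headers, and the struct layout is assumed), accessor macros of the kind
being added generally look like this:

    #include <stdio.h>

    typedef struct { long start, end; } paren_pair;   /* assumed layout */
    struct toy_regexp { paren_pair offs[10]; };

    /* In the spirit of RX_OFFS_START()/RX_OFFS_END()/RX_OFFS_VALID():
     * callers never index the array themselves, so the representation
     * behind these macros can change later without breaking them. */
    #define TOY_OFFS_START(rx, n)  ((rx)->offs[(n)].start)
    #define TOY_OFFS_END(rx, n)    ((rx)->offs[(n)].end)
    #define TOY_OFFS_VALID(rx, n)  (TOY_OFFS_START(rx, n) != -1 && \
                                    TOY_OFFS_END(rx, n)   != -1)

    int main(void) {
        struct toy_regexp rx = { { { 0, 3 }, { -1, -1 } } };
        printf("whole match valid? %d\n", (int) TOY_OFFS_VALID(&rx, 0));
        printf("$1 valid?          %d\n", (int) TOY_OFFS_VALID(&rx, 1));
        return 0;
    }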
|
|
|
|
|
|
|
|
|
|
| |
We were only handling "synthetic start classes", not ones that are in
the program itself, because those don't have an entry in the data array.
So, when copying the program, once we have ruled out that the regstclass
is synthetic we can assume that if it is non-null it points into the
program itself, and simply set up the copy accordingly.
Fixes https://github.com/Perl/perl5/issues/9373
|
|
|
|
|
|
| |
Many of the typos in this file reported in GH 20435 have been cleaned up
since that pull request was originally filed. This commit cleans up the
remaining ones and a few others spotted along the way.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This splits a bunch of the subcomponents of the regex engine into
smaller files:
regcomp_debug.c
regcomp_internal.h
regcomp_invlist.c
regcomp_study.c
regcomp_trie.c
The only real change, besides those to the build machinery needed to achieve
the split, is to also add some new defines which can be used in embed.fnc to
control exports without having to enumerate /every/ regex engine file. For
instance all of regcomp*.c defines PERL_IN_REGCOMP_ANY, and this is used
in embed.fnc to manage exports.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Having internal tabs causes confusion in diffs and reviews. In the
following patch I will move a lot of code around, creating new files,
and they will all be whitespace clean: no trailing whitespace,
tabs expanded to the next tabstop properly, and no trailing empty
lines at the bottom of the file.
This patch prepares for that split, and for future splits and changes to
the regex engine, by precleaning the main regex engine files with the
same rules.
It should show no changes under '-w'.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We used the ARG() macro to access the parno data for the OPEN
and CLOSE regops. This made it difficult to find what needed to
change when the type, size or location of this data in the
node was modified. Replacing this access with a specific macro
makes the code more legible and future proof.
This was actually backported from finding everything that broke
when changing the regnode type for OPEN and CLOSE to 2L and moving
the paren parameter to the 2L slot. We might do something like this
in the future, and separating the PARNO() macros from their
implementation will make it easier.
|
| |
|
|
|
|
|
| |
RX_PRECOMP() has for some time been an expression that
will always be true, thus there is no need to test it.
|
|
|
|
|
|
|
|
|
|
|
| |
`av_push`, `av_store`, and `av_fetch` have been changed to their
inline `_simple` variants in a number of places. No test failures
have resulted from these changes.
The remaining uses of the full-fat functions are not suitable for
conversion because either the AVs have magic, or existing tests show
that the `key` parameter is not > -1, which is a prerequisite of the
replacement inline functions.
|
| |
|
|
|
|
|
| |
If AvARRAY() was NULL (and it is in one test) this would cause
undefined behaviour in the S_concat_pat() outer for loop.
|
|
|
|
| |
CID 353002
|
| |
|
| |
|
| |
|
|
|
|
| |
The code confusingly uses type and kind as synonyms. Let's end that bad habit.
|
|
|
|
|
|
|
|
|
|
| |
This adds REGNODE_AFTER_varies(), which is used when the caller *knows*
that the current regnode is variable length. We then use it to handle
EXACTish style nodes, as determined by PL_regnode_arg_len_varies.
As part of this patch Perl_regnext(), Perl_regnode_after() and
Perl_check_regnode_after() are moved to reginline.h, which is loaded via
regcomp.c only when we are compiling the regex engine.
|
|
|
|
|
| |
Now that REGNODE_AFTER() can handle all cases it makes sense
to remove the _dynamic() suffix.
|
|
|
|
|
|
| |
This is a first step: we add the support for EXACTish nodes
here, but we do not use it yet. In a following commit we will move
it to a new file. This patch is just to keep the move clean.
|
| |
|
| |
|
| |
|
| |
|
|
|
|
|
| |
This is in preparation for a future patch, so we can access
PL_reg_off_by_arg() from an inline function in regexec.c
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is really easy to get confused about the difference between
NEXTOPER() and regnext() of a regnode. The two concepts are related,
similar, but importantly distinct. NEXTOPER() is also defined in such a
way that it is easy to abuse and misunderstand and encourages producing
code that is fragile to larger change, effectively "baking in"
assumptions to the code that are difficult to discover by searching.
Changing the type and storage requirements of a regnode may break things
in subtle and hard to debug ways.
An example of how NEXTOPER() is problematic is that this:
NEXTOPER(NEXTOPER(branch)) does not mean "find the second node after the
branch node", it means "jump forward by a regnode which happens to be
two regnodes large". In other words NEXTOPER is just a fancy way of
writing "node+1".
This patch replaces NEXTOPER() with three new macros:
REGNODE_AFTER_dynamic(node)
REGNODE_AFTER_opcode(node,op)
REGNODE_AFTER_type(node,tregnode_OPNAME)
The first is the most generic case, it jumps forward by the size of the
node, and determines that size by consulting OP(node). The second is
where you have already extracted OP(node), and the third is where you
know the actual structure that you want to jump forward by. Every
regnode type has a corresponding type, which is known at compile time,
so using the third will produce the most efficient code. However in many
cases the code operates on one of several types, whose size may be the
same now, but may change in the future, in which case one of the other
forms is preferred. The run time logic in regexec.c should probably
only use the REGNODE_AFTER_type() interface.
Note that there is also a REGNODE_BEFORE() which replaces PREVOPER(),
which is used in a specific piece of legacy logic but should not be
used otherwise. It is not safe to go backwards from an arbitrary node,
we simply have no way to know how large the previous node is and thus
where it starts.
This patch includes some logic that validates assumptions during DEBUG
mode which should catch errors from resizing regnodes.
After this patch changing the size of an existing regnode should be
relatively safe and errors related to sizing should trigger assertion
fails.
This patch includes changes to perlreguts.pod to explain this stuff
better.
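A toy illustration of why a size-aware successor macro is safer than a
fixed "node + 1" step (the node layout and sizes below are assumed; this
is not the perl implementation):

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed base node layout; real regnodes differ. */
    typedef struct { uint8_t op; uint8_t flags; uint16_t next_off; } toy_regnode;

    /* Assumed per-opcode size table, in units of one base node. */
    static const int node_size[] = { 1, 2, 3 };   /* no-arg, one-arg, two-arg */

    /* NEXTOPER-style: always step by exactly one base node, whatever OP is. */
    #define NEXTOPER(p)            ((p) + 1)

    /* REGNODE_AFTER_opcode-style: step by the size this opcode occupies. */
    #define REGNODE_AFTER_op(p)    ((p) + node_size[(p)->op])

    int main(void) {
        /* A two-arg node (3 base units) followed by a one-arg node. */
        toy_regnode prog[4] = { { .op = 2 }, { .op = 0 }, { .op = 0 },
                                { .op = 1 } };
        printf("NEXTOPER lands at index %d (inside the first node)\n",
               (int)(NEXTOPER(prog) - prog));
        printf("REGNODE_AFTER lands at index %d (the real next node)\n",
               (int)(REGNODE_AFTER_op(prog) - prog));
        return 0;
    }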
|
|
|
|
|
|
|
|
|
| |
In a follow-up patch we will use this data from regexec.c, which
currently cannot see the variable.
This changes a comment in regen/mk_invlists.pl, which necessitated
rebuilding several files related to Unicode. Only the hashes associated
with mk_invlists.pl were changed.
|
|
|
|
|
|
| |
This declutters the code and allows us to remove the casting as well.
As a byproduct the loop control logic is a bit simplified.
|
|
|
|
| |
Declutter code.
|
|
|
|
|
|
|
|
|
|
| |
retarray is a straightforward array created and populated with
SvIVX(sv_dat) elements within this function. As the
number of elements is known and all elements will be used, a
correctly-sized array can be created using newAV_alloc_x.
Since it's a straightforward array, av_push_simple can be used in
place of av_push. This will be more efficient.