author Yves Orton <demerphq@gmail.com> 2022-12-29 12:07:22 +0100
committer Yves Orton <demerphq@gmail.com> 2023-01-12 09:11:51 +0100
commit fe5492d916201ce31a107839a36bcb1435fe7bf0
tree 91da4178b61e1e788e5979b534396dc923760d35 /regcomp_internal.h
parent 17419a88100044035ee6dd9f8947f7d411d94863
regcomp.c etc - rework branch reset so it works properly

Branch reset was hacked in without much thought about how it might interact with other features. Over time we added named capture and recursive patterns with GOSUB, but I guess because branch reset is somewhat esoteric we didn't notice the accumulating issues related to it.

The main problem was that my original hack used a fairly simple device to give multiple OPEN/CLOSE opcodes the same target buffer id. When it was introduced this was fine. When GOSUB was added later, however, we overlooked that this broke a key part of the book-keeping for GOSUB.

A GOSUB regop needs to know where to jump to, and which close paren to stop at. However, the structure of the regexp program can change after the regop is created. This means we keep track of every OPEN/CLOSE regop we encounter during parsing, and when something is inserted into the middle of the program we make sure to move the offsets we store for the OPEN/CLOSE data. This is essentially keyed and scaled to the number of parens we have seen. When branch reset is used, however, the number of OPEN/CLOSE regops is more than the number of logical buffers we have seen, and we only move one of the OPEN/CLOSE buffers that is in the branch reset. Which of course breaks things.

Another issue with branch reset is that it creates weird artifacts like this:

    /(?|(?<a>a)|(?<b>b))(?&a)(?&b)/

where the (?&b) actually maps to the (?<a>a) capture buffer because they both have the same id. Another case is that you cannot check if $+{b} matched and $+{a} did not, because conceptually they were the same buffer under the hood.

These bugs are now fixed. The "aliasing" of capture buffers to each other is now done virtually, and under the hood each capture buffer is distinct. We introduce the concept of a "logical parno", which is the user visible capture buffer id, and keep it distinct from the true capture buffer id. Most of the internal logic uses the "true parno" for its business, so a bunch of problems go away. We keep maps from logical to physical parnos, and vice versa, along with a map that gives us the "next physical parno with the same logical parno". Thus we can quickly skip through the physical capture buffers to find the one that matched. This means we also have to introduce a logical_total_parens as well, to complement the already existing total_parens. The latter refers to the true number of capture buffers. The former represents the logical number visible to the user.

It is helpful to consider the following table:

    Logical:    $1       $2     $3      $2     $3     $4      $2     $5
    Physical:    1        2      3       4      5      6       7      8
    Next:        0        4      5       7      0      0       0      0
    Pattern:  /(pre)(?|(?<a>a)(?<b>b)|(?<c>c)(?<d>d)(?<e>e)|(?<f>))(post)/

The names are mapped to physical buffers. So $+{b} will show what is in physical buffer 3. But $3 will show whichever of buffer 3 or 5 matched. Similarly @{^CAPTURE} will contain 5 elements, not 8. But %+ will contain all 6 named buffers.

Since the need to map these values is rare, we only store these maps when they are needed and branch reset has been used. When they are NULL it is assumed that physical and logical buffers are identical.

Currently the way this change is implemented will likely break plug-in regexp engines, because they will be missing the new logical_total_parens field at the very least. Given that the perl internals code is somewhat poorly abstracted from the regexp engine, with parts of the abstraction leaking out, I think this is acceptable. If we want to make plug-in regexp engines work properly, IMO we need to add more hooks for them to implement than we currently provide. For instance mg.c does more work than it should. Given there are only a few plug-in regexp engines and that it is specialized work, I think this is acceptable. We can work with the authors to refine the API properly later.

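As a rough illustration of the lookup described above (and not the code perl itself uses), the following C sketch walks the "next" chain for the pattern in the table. The `capture_span` type, the `resolve_logical` helper, the `sample_` arrays and the match offsets are hypothetical names and data invented for this example; the offsets assume the pattern was matched against the string "precdepost". Only the map contents are taken from the table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one physical capture buffer: start and end
 * offsets into the subject string, with -1 meaning "did not match". */
typedef struct {
    int start;
    int end;
} capture_span;

/* Maps for /(pre)(?|(?<a>a)(?<b>b)|(?<c>c)(?<d>d)(?<e>e)|(?<f>))(post)/
 * Index 0 is the whole pattern, as with "par" 0 in the struct comment. */
static const int sample_logical_to_parno[]      = { 0, 1, 2, 3, 6, 8 };
static const int sample_parno_to_logical_next[] = { 0, 0, 4, 5, 7, 0, 0, 0, 0 };

/* Assumed result of matching "precdepost": the second branch matched. */
static const capture_span sample_spans[9] = {
    {  0, 10 },  /* 0: whole match       */
    {  0,  3 },  /* 1: (pre)             */
    { -1, -1 },  /* 2: (?<a>a), no match */
    { -1, -1 },  /* 3: (?<b>b), no match */
    {  3,  4 },  /* 4: (?<c>c)           */
    {  4,  5 },  /* 5: (?<d>d)           */
    {  5,  6 },  /* 6: (?<e>e)           */
    { -1, -1 },  /* 7: (?<f>), no match  */
    {  6, 10 },  /* 8: (post)            */
};

/* Return the physical parno that matched for a logical parno (>= 1),
 * or 0 if none did. NULL maps mean branch reset was not used, so the
 * logical and physical ids are identical. */
static int
resolve_logical(int logical, const int *logical_to_parno,
                const int *parno_to_logical_next, const capture_span *spans)
{
    int parno;

    assert(logical >= 1);
    if (logical_to_parno == NULL || parno_to_logical_next == NULL)
        return spans[logical].end >= 0 ? logical : 0;

    /* Start at the first physical buffer and follow the chain. */
    for (parno = logical_to_parno[logical]; parno != 0;
         parno = parno_to_logical_next[parno])
    {
        if (spans[parno].end >= 0)
            return parno;
    }
    return 0;
}
```

With these sample offsets, logical capture 2 resolves to physical buffer 4 and logical capture 3 to physical buffer 5, which is the "$3 will show whichever of buffer 3 or 5 matched" behaviour described above.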
Diffstat (limited to 'regcomp_internal.h')
-rw-r--r-- regcomp_internal.h 48
1 files changed, 48 insertions, 0 deletions
diff --git a/regcomp_internal.h b/regcomp_internal.h
index f1b81625a0..a895452511 100644
--- a/regcomp_internal.h
+++ b/regcomp_internal.h
@@ -66,15 +66,57 @@ struct RExC_state_t {
* independent warning is raised for any given spot */
Size_t latest_warn_offset;
+ /* Branch reset /(?|...|...)/ gives us two concepts of capture buffer id.
+ * "Logical Parno" is the user visible view with branch reset taken into
+ * account. "Parno" (or physical parno) is the actual capture buffers in
+ * the pattern *NOT* taking into account branch reset. We also maintain
+ * a map of "next" pointers which allow us to skip to the next physical
+ * capture buffer with the same logical id, with 0 representing "none".
+ *
+ * As we compile we keep track of the two different counts using the
+ * 'logical_npar' and 'npar' members, and we keep track of the upper bound
+ * of both in 'total_par' and 'logical_total_par', we also populate
+ * the 'logical_to_parno' map, which gives us the first physical parno
+ * for a given logical parno, and the `parno_to_logical` array which gives
+ * us the logical id for each physical parno. When compilation is
+ * completed we construct the 'parno_to_logical_next' array from the
+ * 'parno_to_logical' array. (We do not bother constructing it during
+ * compilation as we do not need it, and we can construct it in O(N) time
+ * once we are done, but would need more complicated logic during the
+ * compile, because we want the next pointers to go from smallest to
+ * largest, eg, left to right.)
+ *
+ *  Logical:  $1      $2  $3  $4    $2  $3    $2  $5
+ * Physical:   1       2   3   4     5   6     7   8
+ *     Next:   0       5   6   0     7   0     0   0
+ *  Pattern  /(a) (?| (b) (c) (d) | (e) (f) | (g) ) (h)/
+ *
+ * As much as possible the internals use and store the physical id of
+ * capture buffers. We decode the physical to the logical only when
+ * we need to, for instance when someone uses $2.
+ *
+ * Note that when branch reset is not used logical and physical are the
+ * same and the next data would be all zero. So when branch reset is not
+ * used we do not need to populate this data into the final regexp.
+ *
+ */
+ I32 *logical_to_parno; /* logical_parno to parno */
+ I32 *parno_to_logical; /* parno to logical_parno */
+ I32 *parno_to_logical_next; /* parno to next (greater value)
+ parno with the same
+ logical_parno as parno.*/
+
I32 npar; /* Capture buffer count so far in the
parse, (OPEN) plus one. ("par" 0 is
the whole pattern)*/
+ I32 logical_npar; /* Logical version of npar */
I32 total_par; /* During initial parse, is either 0,
or -1; the latter indicating a
reparse is needed. After that pass,
it is what 'npar' became after the
pass. Hence, it being > 0 indicates
we are in a reparse situation */
+ I32 logical_total_par; /* Logical version of total_par */
I32 nestroot; /* root parens we are in - used by
accept */
I32 seen_zerolen;
@@ -157,6 +199,11 @@ struct RExC_state_t {
#define RExC_seen (pRExC_state->seen)
#define RExC_size (pRExC_state->size)
#define RExC_maxlen (pRExC_state->maxlen)
+#define RExC_logical_npar (pRExC_state->logical_npar)
+#define RExC_logical_total_parens (pRExC_state->logical_total_par)
+#define RExC_logical_to_parno (pRExC_state->logical_to_parno)
+#define RExC_parno_to_logical (pRExC_state->parno_to_logical)
+#define RExC_parno_to_logical_next (pRExC_state->parno_to_logical_next)
#define RExC_npar (pRExC_state->npar)
#define RExC_total_parens (pRExC_state->total_par)
#define RExC_parens_buf_size (pRExC_state->parens_buf_size)
@@ -1194,4 +1241,5 @@ static const scan_data_t zero_scan_data = {
CLEAR_OPTSTART; \
node = dumpuntil(r,start,(b),(e),last,sv,indent+1,depth+1);
+
#endif /* REGCOMP_INTERNAL_H */
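
The struct comment in the first hunk notes that 'parno_to_logical_next' can be built in O(N) time from 'parno_to_logical' once compilation is finished. One way such a pass might look is sketched below. This is an assumption-laden illustration, not the routine from regcomp.c: the `build_next_map` helper, its parameters and the `sample_parno_to_logical` array are hypothetical, plain `int` and `calloc` stand in for perl's own types and allocators, and both counts are assumed to include index 0 (the whole pattern). The sample data is the table from the comment.

```c
#include <assert.h>
#include <stdlib.h>

/* parno_to_logical for /(a) (?| (b) (c) (d) | (e) (f) | (g) ) (h)/
 * Index 0 is the whole pattern. */
static const int sample_parno_to_logical[9] = { 0, 1, 2, 3, 4, 2, 3, 2, 5 };

/* Build the "next physical parno with the same logical parno" map.
 * total_par and logical_total_par are the array sizes, counting index 0.
 * Returns a calloc'd array the caller must free, or NULL on failure. */
static int *
build_next_map(const int *parno_to_logical, int total_par,
               int logical_total_par)
{
    int *next = calloc((size_t)total_par, sizeof *next);
    /* seen[l] holds the lowest physical parno visited so far for l. */
    int *seen = calloc((size_t)logical_total_par, sizeof *seen);
    int parno;

    if (next == NULL || seen == NULL) {
        free(next);
        free(seen);
        return NULL;
    }

    /* Walk right to left so each entry points at the nearest greater
     * physical parno sharing its logical id; 0 means "none". */
    for (parno = total_par - 1; parno >= 1; parno--) {
        int logical = parno_to_logical[parno];
        assert(logical >= 1 && logical < logical_total_par);
        next[parno] = seen[logical];
        seen[logical] = parno;
    }

    free(seen);
    return next;
}
```

Walking from the highest parno down is what keeps this a single pass: each chain ends up ordered from smallest to largest physical id, left to right in the pattern, as the comment asks for.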