path: root/regcomp_debug.c
* regcomp.h - use a common union for head and args across all regnodes. (Yves Orton, 2023-03-29, 1 file, -11/+11)
  This helps with HPUX builds, where we need to ensure everything is aligned
  the same way (on 32-bit boundaries). It also strongly encourages everything
  to use the accessor macros rather than access the members directly. By using
  a union for the variadic fields we make it more obvious that some regops use
  the field in different ways. This patch also converts all the arg unions into
  a standardized union with standardized member names.

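  As a rough, self-contained illustration of the alignment point (the type and
  member names below are invented for the example, not the real regcomp.h
  definitions), putting both the fixed head and each variadic argument behind
  32-bit unions means every node variant is just a run of identically aligned
  32-bit slots:

      /* Illustrative only: toy regnode-like layout.  All names hypothetical. */
      #include <stdint.h>
      #include <stdio.h>

      union node_head {                 /* fixed head, forced into one 32-bit slot */
          struct {
              uint8_t  type;
              uint8_t  flags;
              uint16_t next_off;
          } fields;
          uint32_t as_u32;
      };

      union node_arg {                  /* variadic arg slot, three views */
          uint32_t u;
          int32_t  i;
          struct { uint16_t a, b; } hi;
      };

      struct node_1arg { union node_head head; union node_arg arg1; };
      struct node_2arg { union node_head head; union node_arg arg1, arg2; };

      /* accessor macros, so code never touches the members directly */
      #define NODE_ARG1u(n) ((n)->arg1.u)
      #define NODE_ARG1i(n) ((n)->arg1.i)

      int main(void) {
          /* every variant is an exact multiple of the 32-bit slot size */
          printf("head=%zu 1arg=%zu 2arg=%zu\n", sizeof(union node_head),
                 sizeof(struct node_1arg), sizeof(struct node_2arg));
          return 0;
      }

  Because every slot has the same size and alignment, a strict-alignment
  target sees the same layout for every node variant.
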
* regexec.c - use RXp_LASTPAREN(rex) to access rex->lastparen (Yves Orton, 2023-03-13, 1 file, -1/+1)
  This field will be moving to a new struct. Converting this to a macro will
  make that move easier.

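  The mechanical shape of the change is presumably just this kind of wrapper
  (toy types below; the actual definition lives in the perl headers):

      /* Illustrative only: wrapping direct field access in a macro so the
       * field can later move to another struct without touching callers. */
      #include <stdio.h>

      struct toy_regexp { unsigned lastparen; };

      #define RXp_LASTPAREN(rx) ((rx)->lastparen)   /* presumed current shape */

      int main(void) {
          struct toy_regexp rex = { .lastparen = 3 };
          /* callers say RXp_LASTPAREN(&rex), never rex.lastparen, so when
           * lastparen moves only the macro body has to change */
          printf("lastparen = %u\n", RXp_LASTPAREN(&rex));
          return 0;
      }
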
* regcomp.c - extend REF to hold the paren it needs to regcppush (Yves Orton, 2023-03-13, 1 file, -3/+6)
  This way we can avoid pushing every buffer; we only need to push the
  nestroot of the ref.

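  A toy sketch of the saving this buys (helper names invented for the example;
  the real code goes through regcppush() and the save stack): instead of
  saving every capture buffer before the REF is attempted, only the buffer
  recorded on the node needs to be saved:

      /* Illustrative only: the cost difference between pushing every capture
       * buffer and pushing just the one the REF node recorded.  Toy types. */
      #include <stdio.h>

      typedef struct { long start, end; } capture;

      static int save_all(const capture *c, capture *saved, int total_parens) {
          for (int i = 0; i <= total_parens; i++) saved[i] = c[i];
          return total_parens + 1;              /* entries pushed */
      }

      static int save_one(const capture *c, capture *saved, int parno) {
          saved[parno] = c[parno];              /* just the ref's nest root */
          return 1;
      }

      int main(void) {
          capture bufs[8] = {{0, 0}}, saved[8];
          printf("old: pushed %d entries\n", save_all(bufs, saved, 7));
          printf("new: pushed %d entry\n",   save_one(bufs, saved, 3));
          return 0;
      }
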
* regex engine - simplify regnode structures and make them consistent (Yves Orton, 2023-03-13, 1 file, -16/+16)
  This eliminates the regnode_2L data structure and merges it with the older
  regnode_2 data structure. At the same time it makes each "arg" property of
  the various regnode types that have one be consistently structured as an
  anonymous union like this:

      union {
          U32 arg1u;
          I32 arg1i;
          struct {
              U16 arg1a;
              U16 arg1b;
          };
      };

  We then expose four macros for accessing each slot: ARG1u(), ARG1i(),
  ARG1a() and ARG1b(). Code then explicitly designates which view it wants.
  The old logic used ARG() to access a U32 arg1 and ARG1() to access an I32
  arg1, which was confusing to say the least. The regnode_2L structure had a
  U32 arg1 and an I32 arg2, and the regnode_2 data structure had two I32
  args. With the new set of macros we use regnode_2 for both, and use the
  appropriate macros to show whether we want signed or unsigned values.

  This also renames regnode_4 to regnode_3. The 3 stands for "three 32-bit
  args". However, as each slot can also store two U16s, a regnode_3 can hold
  up to 6 U16s, or 3 I32s, or a combination. For instance the CURLY style
  nodes use regnode_3 to store 4 values: ARG1i() for the min count, ARG2i()
  for the max count, and ARG3a() and ARG3b() for the parens before and inside
  the quantifier.

  It also renames the function reganode() to reg1node(), and reg2Lanode() to
  reg2node(). The 2L thing was just confusing.

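  A compilable sketch of how accessors like these might sit on top of that
  union, using the CURLY example above (toy definitions; the real macros and
  regnode layout in regcomp.h differ in detail):

      /* Illustrative only: toy versions of the per-slot accessors described
       * above, and a CURLY-style node storing four values in three slots. */
      #include <stdint.h>
      #include <stdio.h>

      union node_arg {                          /* one 32-bit slot, three views */
          uint32_t u;
          int32_t  i;
          struct { uint16_t a, b; } hi;
      };

      struct node_3arg { union node_arg arg[3]; };   /* "regnode_3": 3 slots */

      #define ARG1i(n) ((n)->arg[0].i)
      #define ARG2i(n) ((n)->arg[1].i)
      #define ARG3a(n) ((n)->arg[2].hi.a)
      #define ARG3b(n) ((n)->arg[2].hi.b)

      int main(void) {
          struct node_3arg curly = {{{ .u = 0 }}};
          ARG1i(&curly) = 2;     /* min count                    */
          ARG2i(&curly) = 7;     /* max count                    */
          ARG3a(&curly) = 1;     /* parens before the quantifier */
          ARG3b(&curly) = 4;     /* parens inside the quantifier */
          printf("min=%d max=%d before=%d inside=%d\n",
                 (int)ARG1i(&curly), (int)ARG2i(&curly),
                 (int)ARG3a(&curly), (int)ARG3b(&curly));
          return 0;
      }
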
* regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffers (Yves Orton, 2023-03-13, 1 file, -2/+10)
  In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at the
  same time. When a branch fails it should reset any capture buffers that
  might be touched by its branch.

  We change BRANCH and BRANCHJ to store the number of parens before the
  branch, and the number of parens after the branch was completed. When a
  BRANCH operation fails, we clear the buffers it contains before we continue
  on.

  It is a bit more complex than it should be because we have both BRANCHJ and
  BRANCH. (One of these days we should merge them together.) This is also
  made somewhat more complex because TRIE nodes are actually branches, and
  may need to track capture buffers as well, at two levels: the overall TRIE
  op, and, especially for jump tries, the individual branches whose behavior
  we emulate. So we have to do the same clearing logic if a trie branch fails
  as well.

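  A minimal sketch of the reset, assuming a node that records the paren
  counts before and after it (toy types; the real regexec.c bookkeeping is
  more involved, especially for tries):

      /* Illustrative only: clearing the capture buffers a failed branch may
       * have set, using the paren counts stored on the branch node. */
      #include <stdio.h>

      typedef struct { long start, end; } capture;        /* -1,-1: unset */
      struct branch_info { int parens_before, parens_after; };

      static void reset_failed_branch(capture *caps, struct branch_info b) {
          for (int i = b.parens_before + 1; i <= b.parens_after; i++)
              caps[i].start = caps[i].end = -1;
      }

      int main(void) {
          /* /((a)(b)|(a))+/ : buffers 2 and 3 belong to the first branch,
           * buffer 4 to the second branch. */
          capture caps[5] = { {-1,-1}, {0,2}, {0,1}, {1,2}, {0,1} };
          struct branch_info second = { .parens_before = 3, .parens_after = 4 };

          reset_failed_branch(caps, second);   /* the second branch failed */
          printf("$4 after reset: %ld..%ld\n", caps[4].start, caps[4].end);
          return 0;
      }
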
* regcomp.c - track parens related to CURLYX and CURLYM (Yves Orton, 2023-03-13, 1 file, -0/+2)
  This was originally a patch which made somewhat drastic changes to how we
  represent capture buffers, which Dave M and I are still discussing offline
  and which has a larger impact than is acceptable to address at the current
  time. As such I have reverted the controversial parts of this patch for
  now, while keeping most of it intact, even if in some cases the changes are
  unused except for debugging purposes.

  This patch still contains valuable changes, for instance teaching CURLYX
  and CURLYM about how many parens there are before the curly[1] (which will
  be useful in follow-up patches even if, strictly speaking, they are not
  directly used yet), tests, and other cleanups. Also, this patch is
  sufficiently large that reverting it out would have a large effect on the
  patches that were made on top of it. Thus keeping most of this patch while
  eliminating the controversial parts for now seemed the best approach,
  especially as some of the changes it introduces, and the follow-up patches
  based on it, are very useful in cleaning up the structures we use to
  represent regops.

  [1] "Curly" is the regexp internals term for quantifiers, named after the
  x{min,max} "curly brace" quantifiers.

* regcomp.c - add optimistic eval (*{ ... }) and (**{ ... }) (Yves Orton, 2023-01-19, 1 file, -1/+5)
  This adds (*{ ... }) and (**{ ... }) as equivalents to (?{ ... }) and
  (??{ ... }). The only difference is that the star variants are "optimistic"
  and are defined to never disable optimisations. This is especially relevant
  now that use of (?{ ... }) prevents important optimisations anywhere in the
  pattern, instead of the older and inconsistent rules where it only affected
  the parts that contained the EVAL.

  It is also very useful for injecting debugging-style expressions into the
  pattern to understand what the regex engine is actually doing. The older
  style (?{ ... }) variants would change the regex engine's behavior, meaning
  this was not as effective a tool as it could have been. Similarly, it is
  now possible to test that a given regex optimisation works correctly using
  (*{ ... }), which was not possible with (?{ ... }).

* regcomp_study.c - disable CURLYX optimizations when EVAL has been seen anywhere (Yves Orton, 2023-01-15, 1 file, -1/+5)
  Historically we disabled CURLYX optimizations when they *contained* an
  EVAL, on the assumption that the optimization might affect how many times,
  etc, the eval was called. However, this is also true for CURLYX with evals
  *afterwards*. If the CURLYN or CURLYM optimization can prune off the search
  space, then an eval afterwards will be affected. And when you take GOSUB
  into account, it means that an eval in front might be affected by an
  optimization after it. So for now we disable CURLYN and CURLYM in any
  pattern with an EVAL.

* regcomp.c etc - rework branch reset so it works properly (Yves Orton, 2023-01-12, 1 file, -2/+26)
  Branch reset was hacked in without much thought about how it might interact
  with other features. Over time we added named capture and recursive
  patterns with GOSUB, but I guess because branch reset is somewhat esoteric
  we didn't notice the accumulating issues related to it.

  The main problem was that my original hack used a fairly simple device to
  give multiple OPEN/CLOSE opcodes the same target buffer id. When it was
  introduced this was fine. When GOSUB was added later, however, we
  overlooked that this broke a key part of the book-keeping for GOSUB. A
  GOSUB regop needs to know where to jump to, and which close paren to stop
  at. However, the structure of the regexp program can change from the time
  the regop is created. This means we keep track of every OPEN/CLOSE regop we
  encounter during parsing, and when something is inserted into the middle of
  the program we make sure to move the offsets we store for the OPEN/CLOSE
  data. This is essentially keyed and scaled to the number of parens we have
  seen. When branch reset is used, however, the number of OPEN/CLOSE regops
  is more than the number of logical buffers we have seen, and we only move
  one of the OPEN/CLOSE buffers that is in the branch reset. Which of course
  breaks things.

  Another issue with branch reset is that it creates weird artifacts like
  this: /(?|(?<a>a)|(?<b>b))(?&a)(?&b)/, where the (?&b) actually maps to the
  (?<a>a) capture buffer because they both have the same id. Another case is
  that you cannot check whether $+{b} matched and $+{a} did not, because
  conceptually they were the same buffer under the hood.

  These bugs are now fixed. The "aliasing" of capture buffers to each other
  is now done virtually, and under the hood each capture buffer is distinct.
  We introduce the concept of a "logical parno", which is the user-visible
  capture buffer id, and keep it distinct from the true capture buffer id.
  Most of the internal logic uses the "true parno" for its business, so a
  bunch of problems go away, and we keep maps from logical to physical
  parnos, and vice versa, along with a map that gives us the "next physical
  parno with the same logical parno". Thus we can quickly skip through the
  physical capture buffers to find the one that matched. This means we also
  have to introduce a logical_total_parens to complement the already existing
  total_parens. The latter refers to the true number of capture buffers; the
  former represents the logical number visible to the user.

  It is helpful to consider the following table:

      Logical:    $1  $2  $3  $2  $3  $4  $2  $5
      Physical:    1   2   3   4   5   6   7   8
      Next:        0   4   5   7   0   0   0   0

      Pattern: /(pre)(?|(?<a>a)(?<b>b)|(?<c>c)(?<d>d)(?<e>e)|(?<f>))(post)/

  The names are mapped to physical buffers. So $+{b} will show what is in
  physical buffer 3, but $3 will show whichever of buffer 3 or 5 matched.
  Similarly @{^CAPTURE} will contain 5 elements, not 8, but %+ will contain
  all 6 named buffers. Since the need to map these values is rare, we only
  store these maps when they are needed and branch reset has been used; when
  they are NULL it is assumed that physical and logical buffers are
  identical.

  Currently the way this change is implemented will likely break plug-in
  regexp engines, because they will be missing the new logical_total_parens
  field at the very least. Given that the perl internals code is somewhat
  poorly abstracted from the regexp engine, with parts of the abstraction
  leaking out, I think this is acceptable. If we want to make plug-in regexp
  engines work properly, IMO we need to add more hooks for them to implement
  than we currently do. For instance, mg.c does more work than it should.
  Given there are only a few plug-in regexp engines and that it is
  specialized work, I think this is acceptable; we can work with the authors
  to refine the API properly later.

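  As a self-contained mock-up of that lookup, using the numbers from the
  table above (the array and function names are invented for the example;
  they are not the real regexp struct fields):

      /* Illustrative only: logical-to-physical capture buffer lookup for
       * /(pre)(?|(?<a>a)(?<b>b)|(?<c>c)(?<d>d)(?<e>e)|(?<f>))(post)/,
       * pretending the second alternative matched. */
      #include <stdio.h>

      #define TOTAL_PARENS          8   /* physical buffers            */
      #define LOGICAL_TOTAL_PARENS  5   /* what the user sees, $1..$5  */

      static const int logical_of[TOTAL_PARENS + 1] = {0, 1, 2, 3, 2, 3, 4, 2, 5};
      static const int first_phys[LOGICAL_TOTAL_PARENS + 1] = {0, 1, 2, 3, 6, 8};
      static const int next_same[TOTAL_PARENS + 1]  = {0, 0, 4, 5, 7, 0, 0, 0, 0};

      /* start offsets per physical buffer; -1 means "did not match" */
      static const long starts[TOTAL_PARENS + 1] = {0, 0, -1, -1, 3, 4, 5, -1, 6};

      static int physical_for_logical(int logical) {
          for (int p = first_phys[logical]; p; p = next_same[p])
              if (starts[p] != -1)
                  return p;                 /* first one that matched */
          return 0;                         /* none matched */
      }

      int main(void) {
          printf("$2 comes from physical buffer %d (logical $%d)\n",
                 physical_for_logical(2), logical_of[physical_for_logical(2)]);
          printf("$3 comes from physical buffer %d\n", physical_for_logical(3));
          return 0;
      }
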
* regexec engine - wrap and replace RX_OFFS() with better abstractions (Yves Orton, 2023-01-11, 1 file, -4/+4)
  RX_OFFS() exposes a bit too much about how capture buffers are represented.

  This adds RX_OFFS_START(), RX_OFFS_END() and RX_OFFS_VALID() to replace
  most of the uses of the RX_OFFS() macro or direct access to the rx->offs[]
  array. (We add RX_OFFSp() for those rare cases that should have direct
  access to the array.) This allows us to replace this logic with more
  complicated macros in the future. Pretty much anything using RX_OFFS() is
  going to be broken by future changes, so changing the define allows us to
  track it down easily.

  Not all uses of the rx->offs[] array are converted; some uses are required
  for the regex engine internals, but anything outside of the regex engine
  should be using the replacement macros, and most things in the regex
  internals should use them also.

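  A toy sketch of the kind of indirection the new macros provide (the
  definitions below are stand-ins, not the actual perl ones, and the real
  RX_OFFS_VALID() test may differ):

      /* Illustrative only: callers go through the macros, so if the offs[]
       * representation changes later, only the macro bodies change. */
      #include <stdio.h>

      typedef struct { long start, end; } toy_off;
      typedef struct { toy_off offs[3]; } toy_regexp;

      #define RX_OFFS_START(rx, n)  ((rx)->offs[n].start)
      #define RX_OFFS_END(rx, n)    ((rx)->offs[n].end)
      #define RX_OFFS_VALID(rx, n)  ((rx)->offs[n].start != -1)

      int main(void) {
          toy_regexp rx = { .offs = { {0, 3}, {1, 2}, {-1, -1} } };
          for (int n = 0; n < 3; n++) {
              if (RX_OFFS_VALID(&rx, n))
                  printf("$%d: %ld..%ld\n", n,
                         RX_OFFS_START(&rx, n), RX_OFFS_END(&rx, n));
              else
                  printf("$%d: did not match\n", n);
          }
          return 0;
      }
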
* regcomp.c - decompose into smaller files (Yves Orton, 2022-12-09, 1 file, -0/+1625)
  This splits a bunch of the subcomponents of the regex engine into smaller
  files:

      regcomp_debug.c
      regcomp_internal.h
      regcomp_invlist.c
      regcomp_study.c
      regcomp_trie.c

  The only real change, besides the build machinery needed to achieve the
  split, is the addition of some new defines which can be used in embed.fnc
  to control exports without having to enumerate /every/ regex engine file.
  For instance, all of regcomp*.c defines PERL_IN_REGCOMP_ANY, and this is
  used in embed.fnc to manage exports.

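  One plausible shape for that define, shown as a header-style fragment (the
  per-file macro names are assumed here from perl's usual PERL_IN_<FILE>_C
  convention; the actual mechanism may differ):

      /* Illustrative only: derive one catch-all define from the per-file
       * defines, so embed.fnc can gate exports on a single symbol instead
       * of enumerating every regcomp*.c file. */
      #if defined(PERL_IN_REGCOMP_C)         || defined(PERL_IN_REGCOMP_DEBUG_C) \
       || defined(PERL_IN_REGCOMP_INVLIST_C) || defined(PERL_IN_REGCOMP_STUDY_C) \
       || defined(PERL_IN_REGCOMP_TRIE_C)
      #  define PERL_IN_REGCOMP_ANY
      #endif

      #ifdef PERL_IN_REGCOMP_ANY
      /* declarations visible only to the regex engine's compilation units */
      #endif
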