summaryrefslogtreecommitdiff
path: root/regcomp_internal.h
Commit message (Collapse)AuthorAgeFilesLines
* regcomp.h - document RE_PESSIMISTIC_PARENS and VOLATILE_REF definesYves Orton2023-03-191-2/+0
| | | | | | | | | | | | | These two defines are related to each other, and even though VOLATILE_REF is not explicitly used in regexec.c which would require it being placed in regcomp.h, it is implicitly, and RE_PESSIMISTIC_PARENS *is* used in regexec.c. So put them both in regcomp.h and document them together. This adds copious documentation for what they both are for. RE_PESSIMISTIC_PARENS is effectively a "build option" (although intended for debugging regex engine bugs only). VOLATILE_REF is the name of a flag which is used to mark REF nodes as requiring special backtracking support in regexec.c
* regcomp_internal.h - make unused macros for deprecation warnings take ↵Yves Orton2023-03-181-11/+11
| | | | | | category as parameter we should have a deprecation category per type of deprecation
* regexec.c - make REF into a backtracking stateYves Orton2023-03-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | This way we can do the required paren restoration only when it is in use. When we match a REF type node which is potentially a reference to an unclosed paren we push the match context information, currently for "everything", but in a future patch we can teach it to be more efficient by adding a new parameter to the REF regop to track which parens it should save. This converts the backtracking changes from the previous commit, so that it is run only when specifically enabled via the define RE_PESSIMISTIC_PARENS which is by default 0. We don't make the new fields in the struct conditional as the stack frames are large and our changes don't make any real difference and it keeps things simpler to not have conditional members, especially since some of the structures have to line up with each other. If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means something is sensitive to us not necessarily restoring the parens properly on failure. We make some assumptions that the paren state after a failing state will be corrected by a future successful state, or that the state of the parens is irrelevant as we will fail anyway. This can be made not true by EVAL, backrefs, and potentially some other scenarios. Thus I have left this inefficient logic in place but guarded by the flag.
* regcomp_internal.h - move utility macros out of struct definitionYves Orton2023-02-091-9/+10
| | | | | It is bad enough we have conditional parts of the struct, lets not make it even worse by having defines in the middle of it.
* regcomp.c - add optimistic eval (*{ ... }) and (**{ ... })Yves Orton2023-01-191-0/+3
| | | | | | | | | | | | | | | | | This adds (*{ ... }) and (**{ ... }) as equivalents to (?{ ... }) and (??{ ... }). The only difference being that the star variants are "optimisitic" and are defined to never disable optimisations. This is especially relevant now that use of (?{ ... }) prevents important optimisations anywhere in the pattern, instead of the older and inconsistent rules where it only affected the parts that contained the EVAL. It is also very useful for injecting debugging style expressions to the pattern to understand what the regex engine is actually doing. The older style (?{ ... }) variants would change the regex engines behavior, meaning this was not as effective a tool as it could have been. Similarly it is now possible to test that a given regex optimisation works correctly using (*{ ... }), which was not possible with (?{ ... }).
* regcomp.c - increase size of CURLY nodes so the min/max is a I32Yves Orton2023-01-151-0/+2
| | | | | | | | This allows us to resolve a test inconsistency between CURLYX and CURLY and CURLYM, which have different maximums. We use I32 and not U32 because the existing count logic uses -1 internally and using an I32 for the min/max prevents warnings about comparing signed and unsigned values when the count is compared against the min or max.
* regcomp_study.c - Add a way to disable CURLYX optimisationsYves Orton2023-01-151-0/+6
| | | | | | Also break up the condition so there is one condition per line so it is more readable, and fold repeated binary tests together. This makes it more obvious what the expression is doing.
* regcomp_internal.h: Fix leak in regex testsKarl Williamson2023-01-121-0/+4
| | | | | | Commit fe5492d916201ce31a107839a36bcb1435fe7bf0 introduced leaks when a regex compilation fails. This commit uses the standard method we have to deal with these kinds of things.
* regcomp.c etc - rework branch reset so it works properlyYves Orton2023-01-121-0/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Branch reset was hacked in without much thought about how it might interact with other features. Over time we added named capture and recursive patterns with GOSUB, but I guess because branch reset is somewhat esoteric we didnt notice the accumulating issues related to it. The main problem was my original hack used a fairly simple device to give multiple OPEN/CLOSE opcodes the same target buffer id. When it was introduced this was fine. When GOSUB was added later however, we overlooked at that this broke a key part of the book-keeping for GOSUB. A GOSUB regop needs to know where to jump to, and which close paren to stop at. However the structure of the regexp program can change from the time the regop is created. This means we keep track of every OPEN/CLOSE regop we encounter during parsing, and when something is inserted into the middle of the program we make sure to move the offsets we store for the OPEN/CLOSE data. This is essentially keyed and scaled to the number of parens we have seen. When branch reset is used however the number of OPEN/CLOSE regops is more than the number of logical buffers we have seen, and we only move one of the OPEN/CLOSE buffers that is in the branch reset. Which of course breaks things. Another issues with branch reset is that it creates weird artifacts like this: /(?|(?<a>a)|(?<b>b))(?&a)(?&b)/ where the (?&b) actually maps to the (?<a>a) capture buffer because they both have the same id. Another case is that you cannot check if $+{b} matched and $+{a} did not, because conceptually they were the same buffer under the hood. These bugs are now fixed. The "aliasing" of capture buffers to each other is now done virtually, and under the hood each capture buffer is distinct. We introduce the concept of a "logical parno" which is the user visible capture buffer id, and keep it distinct from the true capture buffer id. Most of the internal logic uses the "true parno" for its business, so a bunch of problems go away, and we keep maps from logical to physical parnos, and vice versa, along with a map that gives use the "next physical parno with the same logical parno". Thus we can quickly skip through the physical capture buffers to find the one that matched. This means we also have to introduce a logical_total_parens as well, to complement the already existing total_parens. The latter refers to the true number of capture buffers. The former represents the logical number visible to the user. It is helpful to consider the following table: Logical: $1 $2 $3 $2 $3 $4 $2 $5 Physical: 1 2 3 4 5 6 7 8 Next: 0 4 5 7 0 0 0 0 Pattern: /(pre)(?|(?<a>a)(?<b>b)|(?<c>c)(?<d>d)(?<e>e)|(?<f>))(post)/ The names are mapped to physical buffers. So $+{b} will show what is in physical buffer 3. But $3 will show whichever of buffer 3 or 5 matched. Similarly @{^CAPTURE} will contain 5 elements, not 8. But %+ will contain all 6 named buffers. Since the need to map these values is rare, we only store these maps when they are needed and branch reset has been used, when they are NULL it is assumed that physical and logical buffers are identical. Currently the way this change is implemented will likely break plug in regexp engines because they will be missing the new logical_total_parens field at the very least. Given that the perl internals code is somewhat poorly abstracted from the regexp engine, with parts of the abstraction leaking out, I think this is acceptable. If we want to make plug in regexp engines work properly IMO we need to add some more hooks that they need to implement than we currently do. For instance mg.c does more work than it should. Given there are only a few plug in regexp engines and that it is specialized work, I think this is acceptable. We can work with the authors to refine the API properly later.
* regcomp_internal.h - remove typedefYves Orton2022-12-181-2/+3
| | | | | We already typedef scan_data_t in perl.h, we should not do so also in regcomp_internal.h, we just need to defined struct scan_data_t.
* regcomp.c - decompose into smaller filesYves Orton2022-12-091-0/+1196
This splits a bunch of the subcomponents of the regex engine into smaller files. regcomp_debug.c regcomp_internal.h regcomp_invlist.c regcomp_study.c regcomp_trie.c The only real change besides to the build machine to achieve the split is to also adds some new defines which can be used in embed.fnc to control exports without having to enumerate /every/ regex engine file. For instance all of regcomp*.c defines PERL_IN_REGCOMP_ANY, and this is used in embed.fnc to manage exports.