summaryrefslogtreecommitdiff
path: root/regcomp.c
Commit message (Collapse)AuthorAgeFilesLines
* regcomp.c - throw an error if we have more U16_MAX open parens.Yves Orton2023-03-191-0/+3
| | | | | | We use U16 for various internal logic related to parens. If we exceed this count stuff is going to go silently wrong. Might as well throw a proper error during compilation to detect this.
* warnings.pm - support deprecated::unicode_property_name categoryYves Orton2023-03-181-1/+1
| | | | | | This category is only used in the regex engine, we should be able to disable it specifically, as it seems like we will never actually remove demove support for the things it warns about.
* regcomp.c: use ARG1u_LOC macro to access fieldLukas Mai2023-03-141-1/+1
|
* regcomp.c - use macro wrappers to minimize impact of struct splitYves Orton2023-03-131-4/+4
| | | | | | | We will move various members of the regexp structure to a new structure which just contains information about the match. Wrapping the members in the standard macros means that change can be made less invasive. We already did all of this in regexec.c
* regcomp.c - Use RXp_OFFSp() to access offset dataYves Orton2023-03-131-7/+5
| | | | | | | | | | This insulates access to the regexp match offset data so we can fix the define later and move the offset structure into a new struct. The RXp_OFFSp() was introduced in a recent commit to deliberately break anything using RXp_OFFS() directly. It is hard to type deliberately, nothing but the internals should use it. Everything else should use one of the wrappers around it.
* regcomp.c - extend REF to hold the paren it needs to regcppushYves Orton2023-03-131-4/+6
| | | | | this way we can avoid pushing every buffer, we only need to push the nestroot of the ref.
* regex engine - simplify regnode structures and make them consistentYves Orton2023-03-131-69/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This eliminates the regnode_2L data structure, and merges it with the older regnode_2 data structure. At the same time it makes each "arg" property of the various regnode types that have one be consistently structured as an anonymous union like this: union { U32 arg1u; I32 arg2i; struct { U16 arg1a; U16 arg1b; }; }; We then expose four macros for accessing each slot: ARG1u() ARG1i() and ARG1a() and ARG1b(). Code then explicitly designates which they want. The old logic used ARG() to access an U32 arg1, and ARG1() to access an I32 arg1, which was confusing to say the least. The regnode_2L structure had a U32 arg1, and I32 arg2, and the regnode_2 data strucutre had two I32 args. With the new set of macros we use the regnode_2 for both, and use the appropriate macros to show whether we want to signed or unsigned values. This also renames the regnode_4 to regnode_3. The 3 stands for "three 32-bit args". However as each slot can also store two U16s, a regnode_3 can hold up to 6 U16s, or as 3 I32's, or a combination. For instance the CURLY style nodes use regnode_3 to store 4 values, ARG1i() for min count, ARG2i() for max count and ARG3a() and ARG3b() for parens before and inside the quantifier. It also changes the functions reganode() to reg1node() and changes reg2Lanode() to reg2node(). The 2L thing was just confusing.
* regexec.c - make REF into a backtracking stateYves Orton2023-03-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | This way we can do the required paren restoration only when it is in use. When we match a REF type node which is potentially a reference to an unclosed paren we push the match context information, currently for "everything", but in a future patch we can teach it to be more efficient by adding a new parameter to the REF regop to track which parens it should save. This converts the backtracking changes from the previous commit, so that it is run only when specifically enabled via the define RE_PESSIMISTIC_PARENS which is by default 0. We don't make the new fields in the struct conditional as the stack frames are large and our changes don't make any real difference and it keeps things simpler to not have conditional members, especially since some of the structures have to line up with each other. If enabling RE_PESSIMISTIC_PARENS fixes a backtracking bug then it means something is sensitive to us not necessarily restoring the parens properly on failure. We make some assumptions that the paren state after a failing state will be corrected by a future successful state, or that the state of the parens is irrelevant as we will fail anyway. This can be made not true by EVAL, backrefs, and potentially some other scenarios. Thus I have left this inefficient logic in place but guarded by the flag.
* regexec.c - teach BRANCH and BRANCHJ nodes to reset capture buffersYves Orton2023-03-131-4/+42
| | | | | | | | | | | | | | | | | | | | In /((a)(b)|(a))+/ we should not end up with $2 and $4 being set at the same time. When a branch fails it should reset any capture buffers that might be touched by its branch. We change BRANCH and BRANCHJ to store the number of parens before the branch, and the number of parens after the branch was completed. When a BRANCH operation fails, we clear the buffers it contains before we continue on. It is a bit more complex than it should be because we have BRANCHJ and BRANCH. (One of these days we should merge them together.) This is also made somewhat more complex because TRIE nodes are actually branches, and may need to track capture buffers also, at two levels. The overall TRIE op, and for jump tries especially where we emulate the behavior of branches. So we have to do the same clearing logic if a trie branch fails as well.
* regcomp.c - track parens related to CURLYX and CURLYMYves Orton2023-03-131-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | This was originally a patch which made somewhat drastic changes to how we represent capture buffers, which Dave M and I and are still discussing offline and which has a larger impact than is acceptable to address at the current time. As such I have reverted the controversial parts of this patch for now, while keeping most of it intact even if in some cases the changes are unused except for debugging purposes. This patch still contains valuable changes, for instance teaching CURLYX and CURLYM about how many parens there are before the curly[1] (which will be useful in follow up patches even if stricly speaking they are not directly used yet), tests and other cleanups. Also this patch is sufficiently large that reverting it out would have a large effect on the patches that were made on top of it. Thus keeping most of this patch while eliminating the controversial parts of it for now seemed the best approach, especially as some of the changes it introduces and the follow up patches based on it are very useful in cleaning up the structures we use to represent regops. [1] Curly is the regexp internals term for quantifiers, named after x{min,max} "curly brace" quantifiers.
* regcomp.c: Fix RT ticket number in commentKarl Williamson2023-03-021-1/+1
| | | | | | The comment in the code was wrong. The workaround therein is for RT #133135, not for what the comment said. But 133135 is GH #16520, so change to the new number.
* regcomp.c - remove unused debug logic that warns under clang.Yves Orton2023-02-261-4/+0
| | | | | | | | | | | | | | | | | | And by removing it we silence the following build warning: regcomp.c:2285:28: warning: format specifies type 'int' but the argument has type 'I32' (aka 'long') [-Wformat] parno, RExC_parno_to_logical[parno], RExC_parno_to_logical_next[parno]); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~ ./regcomp_internal.h:212:37: note: expanded from macro 'RExC_parno_to_logical' #define RExC_parno_to_logical (pRExC_state->parno_to_logical) ^ regcomp.c:2285:58: warning: format specifies type 'int' but the argument has type 'I32' (aka 'long') [-Wformat] parno, RExC_parno_to_logical[parno], RExC_parno_to_logical_next[parno]); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ./regcomp_internal.h:213:37: note: expanded from macro 'RExC_parno_to_logical_next' #define RExC_parno_to_logical_next (pRExC_state->parno_to_logical_next) ^ 2 warnings generated.
* Perl_re_op_compile: additional parens to satisfy CoverityRichard Leach2023-02-131-1/+1
|
* Move the macros which wrap sv_dup_inc() into sv.hPaul "LeoNerd" Evans2023-02-131-2/+0
|
* regcomp.c - remove (**{ ... }) from the regex engineYves Orton2023-02-081-6/+1
| | | | | | | | | | | | | | | | | Dave M pointed out that this idea was flawed, and after some testing I have come to agree with him. This removes it. It was only available for 5.37.8, so no deprecation cycle involved. The point of (**{ ... }) was to have a postponed eval that does not disable optimizations. But some of the optimizations are disabled because if they are not we do not match correctly as the optimizations will make unwarranted assumptions about the pattern, assumptions which can be incorrect depending on what pattern is returned from the codeblock. The original idea was proposed because (?{ ... }) was treated as though it was (??{ ... }) and disabled many optimizations, when in fact it doesn't interact with optimizations at all. When I added (*{ ... }) as the optimistic version of (?{ ... }) I used "completeness" as the justification for also adding (**{ ... }) when it does not make sense to do so.
* regcomp.c - fix fencepost error duping a regexYves Orton2023-01-211-6/+6
| | | | | | | | In fe5492d916201ce31a107839a36bcb1435fe7bf0 I made a fencepost error copying the logical_to_parno and related data structures. They all needed a +1 on their size as the paren counts (logical and physical) as they have to account for capture buffer 0 which is always present which represents the entire match.
* regcomp.c - add optimistic eval (*{ ... }) and (**{ ... })Yves Orton2023-01-191-10/+42
| | | | | | | | | | | | | | | | | This adds (*{ ... }) and (**{ ... }) as equivalents to (?{ ... }) and (??{ ... }). The only difference being that the star variants are "optimisitic" and are defined to never disable optimisations. This is especially relevant now that use of (?{ ... }) prevents important optimisations anywhere in the pattern, instead of the older and inconsistent rules where it only affected the parts that contained the EVAL. It is also very useful for injecting debugging style expressions to the pattern to understand what the regex engine is actually doing. The older style (?{ ... }) variants would change the regex engines behavior, meaning this was not as effective a tool as it could have been. Similarly it is now possible to test that a given regex optimisation works correctly using (*{ ... }), which was not possible with (?{ ... }).
* regcomp.c - increase size of CURLY nodes so the min/max is a I32Yves Orton2023-01-151-6/+6
| | | | | | | | This allows us to resolve a test inconsistency between CURLYX and CURLY and CURLYM, which have different maximums. We use I32 and not U32 because the existing count logic uses -1 internally and using an I32 for the min/max prevents warnings about comparing signed and unsigned values when the count is compared against the min or max.
* regcomp.c - add whitespace to binary operationYves Orton2023-01-151-2/+1
| | | | The tight & is hard to read.
* regcomp.h - get rid of EXTRA_STEP definesYves Orton2023-01-151-1/+0
| | | | They are unused these days.
* regcomp.c etc - rework branch reset so it works properlyYves Orton2023-01-121-10/+162
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Branch reset was hacked in without much thought about how it might interact with other features. Over time we added named capture and recursive patterns with GOSUB, but I guess because branch reset is somewhat esoteric we didnt notice the accumulating issues related to it. The main problem was my original hack used a fairly simple device to give multiple OPEN/CLOSE opcodes the same target buffer id. When it was introduced this was fine. When GOSUB was added later however, we overlooked at that this broke a key part of the book-keeping for GOSUB. A GOSUB regop needs to know where to jump to, and which close paren to stop at. However the structure of the regexp program can change from the time the regop is created. This means we keep track of every OPEN/CLOSE regop we encounter during parsing, and when something is inserted into the middle of the program we make sure to move the offsets we store for the OPEN/CLOSE data. This is essentially keyed and scaled to the number of parens we have seen. When branch reset is used however the number of OPEN/CLOSE regops is more than the number of logical buffers we have seen, and we only move one of the OPEN/CLOSE buffers that is in the branch reset. Which of course breaks things. Another issues with branch reset is that it creates weird artifacts like this: /(?|(?<a>a)|(?<b>b))(?&a)(?&b)/ where the (?&b) actually maps to the (?<a>a) capture buffer because they both have the same id. Another case is that you cannot check if $+{b} matched and $+{a} did not, because conceptually they were the same buffer under the hood. These bugs are now fixed. The "aliasing" of capture buffers to each other is now done virtually, and under the hood each capture buffer is distinct. We introduce the concept of a "logical parno" which is the user visible capture buffer id, and keep it distinct from the true capture buffer id. Most of the internal logic uses the "true parno" for its business, so a bunch of problems go away, and we keep maps from logical to physical parnos, and vice versa, along with a map that gives use the "next physical parno with the same logical parno". Thus we can quickly skip through the physical capture buffers to find the one that matched. This means we also have to introduce a logical_total_parens as well, to complement the already existing total_parens. The latter refers to the true number of capture buffers. The former represents the logical number visible to the user. It is helpful to consider the following table: Logical: $1 $2 $3 $2 $3 $4 $2 $5 Physical: 1 2 3 4 5 6 7 8 Next: 0 4 5 7 0 0 0 0 Pattern: /(pre)(?|(?<a>a)(?<b>b)|(?<c>c)(?<d>d)(?<e>e)|(?<f>))(post)/ The names are mapped to physical buffers. So $+{b} will show what is in physical buffer 3. But $3 will show whichever of buffer 3 or 5 matched. Similarly @{^CAPTURE} will contain 5 elements, not 8. But %+ will contain all 6 named buffers. Since the need to map these values is rare, we only store these maps when they are needed and branch reset has been used, when they are NULL it is assumed that physical and logical buffers are identical. Currently the way this change is implemented will likely break plug in regexp engines because they will be missing the new logical_total_parens field at the very least. Given that the perl internals code is somewhat poorly abstracted from the regexp engine, with parts of the abstraction leaking out, I think this is acceptable. If we want to make plug in regexp engines work properly IMO we need to add some more hooks that they need to implement than we currently do. For instance mg.c does more work than it should. Given there are only a few plug in regexp engines and that it is specialized work, I think this is acceptable. We can work with the authors to refine the API properly later.
* regexec engine - wrap and replace RX_OFFS() with better abstractionsYves Orton2023-01-111-2/+2
| | | | | | | | | | | | | | | | RX_OFFS() exposes a bit too much about how capture buffers are represented. This adds RX_OFFS_START() and RX_OFFS_END() and RX_OFFS_VALID() to replace most of the uses of the RX_OFFS() macro or direct access to the rx->off[] array. (We add RX_OFFSp() for those rare cases that should have direct access to the array.) This allows us to replace this logic with more complicated macros in the future. Pretty much anything using RX_OFFS() is going to be broken by future changes, so changing the define allows us to track it down easily. Not all use of the rx->offs[] array are converted; some uses are required for the regex engine internals, but anything outside of the regex engine should be using the replacement macros, and most things in the regex internals should use it also.
* regcomp.c - in regdupe_internal() set up "in program" stclass properlyYves Orton2023-01-061-0/+9
| | | | | | | | | | We were only handling "synthetic start classes", not ones that are in the program itself, because those dont have an entry in the data array. So after copying the program after ruling out that the regstclass is synthetic we can assume that if its non-null it points into the program itself, and simply set it up the copy accordingly. Fixes https://github.com/Perl/perl5/issues/9373
* regcomp.c: Manual correction of typos from GH 20435James E Keenan2022-12-291-13/+13
| | | | | | Many of the typos in this file reported in GH 20435 have been cleaned up since that pull request was originally filed. This commit cleans up the remaining ones and a few others spotted along the way.
* regcomp.c - decompose into smaller filesYves Orton2022-12-091-10375/+102
| | | | | | | | | | | | | | | | | This splits a bunch of the subcomponents of the regex engine into smaller files. regcomp_debug.c regcomp_internal.h regcomp_invlist.c regcomp_study.c regcomp_trie.c The only real change besides to the build machine to achieve the split is to also adds some new defines which can be used in embed.fnc to control exports without having to enumerate /every/ regex engine file. For instance all of regcomp*.c defines PERL_IN_REGCOMP_ANY, and this is used in embed.fnc to manage exports.
* regex engine - cleanup internal tabs and ws (use -w to ignore)Yves Orton2022-12-091-264/+264
| | | | | | | | | | | | | | Having internal tabs causes confusion in diffs and reviews. In the following patch I will move a lot of code around, creating new files and they will all be whitespace clean: no trailing whitespace, tabs expanded to the next tabstop properly, and no trailing empty lines at the bottom of the file. This patch prepares for that split, and future splits and changes to the regex engine by precleaning the main regex engine files with the same rules. It should show no changes under '-w'.
* regcomp.c - add a PARNO() macro to wrap the ARG() macroYves Orton2022-11-101-12/+14
| | | | | | | | | | | | | | We used the ARG() macro to access the parno data for the OPEN and CLOSE regops. This made it difficult to find what needed to change when the type and size or location of this data in the node was modified. Replacing this access with a specific macro makes the code more legible and future proof. This was actually backported from finding everything that broke by changing the regnode type for OPEN and CLOSE to 2L and moving the paren parameter to the 2L slot. We might do something like this in the future and separating the PARNO() macros from their implementation will make it easier.
* regcomp.c - correct commentYves Orton2022-11-101-3/+4
|
* regcomp.c - remove always true conditionYves Orton2022-11-041-1/+0
| | | | | RX_PRECOMP() has for some time been an expression that will always be true, thus there is not need to test it.
* regcomp.c - use _simple functions where apparently safe to do soRichard Leach2022-09-221-38/+37
| | | | | | | | | | | `av_push`, `av_store`, and `av_fetch` have been changed to their inline `_simple` variants in a number of places. No test failures have resulted from these changes. The remaining uses of the full-fat functions are not suitable to change because either the AVs have magic, or existing tests show that the `key` parameter is not > -1 when that is a pre-requisite of a replacement inline function.
* S_regclass - use SvPVCLEAR_FRESH on new SVRichard Leach2022-09-181-2/+2
|
* prevent calling recursing into S_concat_pat() for an empty arrayTony Cook2022-08-171-4/+9
| | | | | If AvARRAY() was NULL (and it is in one test) this would cause undefined behaviour in the S_concat_pat() outer for loop.
* avoid dereferencing prog which may be NULLTony Cook2022-08-081-1/+1
| | | | CID 353002
* regex engine - wrap PL_regnode_name with macro REGNODE_NAME()Yves Orton2022-08-061-9/+9
|
* regex engine - wrap PL_regnode_arg_len with macro REGNODE_ARG_LEN()Yves Orton2022-08-061-20/+20
|
* regex engine - wrap PL_regnode_off_by_arg with macro REGNODE_OFF_BY_ARG()Yves Orton2022-08-061-7/+7
|
* regex engine - wrap PL_regnode_kind with macro REGNODE_TYPE()Yves Orton2022-08-061-51/+51
| | | | The code confusing uses type and kind as synonyms. Lets end that bad habit
* regex engine - integrate regnode_after() support for EXACTish nodesYves Orton2022-08-031-67/+9
| | | | | | | | | | This adds REGNODE_AFTER_varies() which is used when the called *knows* that the current regnode is variable length. We then use it to handle EXACTish style nodes as determined by PL_regnode_arg_len_varies. As part of this patch Perl_regnext() Perl_regnode_after() and Perl_check_regnode_after() are moved to reginline.h, which is loaded via regcomp.c only when we are compiling the regex engine.
* regex engine - rename REGNODE_AFTER_dynamic() REGNODE_AFTER()Yves Orton2022-08-031-24/+24
| | | | | Now that REGNODE_AFTER() can handle all cases it makes sense to remove the dynamic() suffix.
* regcomp.c - initial support for EXACTish nodes in regnode_after()Yves Orton2022-08-031-14/+13
| | | | | | This is a first step, we add the support for EXACTish nodes here, but we do not use it. In a following commit we will move it to a new file. This patch is just to keep the move clean.
* regex engine - Rename PL_reg_name to PL_regnode_nameYves Orton2022-08-031-9/+9
|
* regex engine - Rename PL_reg_off_by_arg to PL_regnode_off_by_argYves Orton2022-08-031-8/+8
|
* regex engine - Rename PL_regkind to PL_regnode_kindYves Orton2022-08-031-51/+51
|
* regex engine - Rename PL_regarglen to PL_regnode_arg_lenYves Orton2022-08-031-21/+21
|
* regex engine - Rename reg_off_by_arg to PL_reg_off_by_argYves Orton2022-08-031-8/+8
| | | | | This is in preparation for a future patch, so we can access PL_reg_off_by_arg() from an inline function in regexec.c
* regcomp.c - rename NEXTOPER to REGNODE_AFTER and related logicYves Orton2022-08-031-70/+97
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is really easy to get confused about the difference between NEXTOPER() and regnext() of a regnode. The two concepts are related, similar, but importantly distinct. NEXTOPER() is also defined in such a way that it is easy to abuse and misunderstand and encourages producing code that is fragile to larger change, effectively "baking in" assumptions to the code that are difficult to discover by searching. Changing the type and storage requirements of a regnode may break things in subtle and hard to debug ways. An example of how NEXTOPER() is problematic is that this: NEXTOPER(NEXTOPER(branch)) does not mean "find the second node after the branch node", it means "jump forward by a regnode which happens to be two regnodes large". In other words NEXTOPER is just a fancy way of writing "node+1". This patch replaces NEXTOPER() with three new macros: REGNODE_AFTER_dynamic(node) REGNODE_AFTER_opcode(node,op) REGNODE_AFTER_type(node,tregnode_OPNAME) The first is the most generic case, it jumps forward by the size of the node, and determines that size by consulting OP(node). The second is where you have already extracted OP(node), and the third is where you know the actual structure that you want to jump forward by. Every regnode type has a corresponding type, which is known at compile time, so using the third will produce the most efficient code. However in many cases the code operates on one of several types, whose size may be the same now, but may change in the future, in which case one of the other forms is preferred. The run time logic in regexec.c should probably only use the REGNODE_AFTER_type() interface. Note that there is also a REGNODE_BEFORE() which replaces PREVOPER(), which is used in a specific piece of legacy logic but should not be used otherwise. It is not safe to go backwards from an arbitrary node, we simply have no way to know how large the previous node is and thus where it starts. This patch includes some logic that validates assumptions during DEBUG mode which should catch errors from resizing regnodes. After this patch changing the size of an existing regnode should be relatively safe and errors related to sizing should trigger assertion fails. This patch includes changes to perlreguts.pod to explain this stuff better.
* regen/regcomp.pl - Make regarglen available as PL_regarglen in regexec.cYves Orton2022-08-031-22/+22
| | | | | | | | | In a follow up patch we will use this data from regexec.c which currently cannot see the variable. This changes a comment in regen/mk_invlists.pl which necessitated rebuilding several files related to unicode. Only the hashes associated with mk_invlists.pl were changed.
* regcomp.c - replace OP(n) macro with variable 'op' in S_dumpuntil()Yves Orton2022-08-031-16/+14
| | | | | | This declutters the code and allows us to remove the casting as well. As a byproduct the loop control logic is a bit simplified.
* regcomp.c - replace repeated OP(n) with variable 'op'.Yves Orton2022-08-031-22/+22
| | | | Declutter code.
* Perl_reg_named_buff_fetch - simpler retarray creation & pushRichard Leach2022-08-021-2/+2
| | | | | | | | | | retarray is a straightforward array created and populated with SvIVX(sv_dat) number of elements within this function. As the number of elements is known and all elements will be used, a correctly-sized array can be created using newAV_alloc_x. Since it's a straightforward array, av_push_simple can be used in place of av_push. This will be more efficient.