summaryrefslogtreecommitdiff
path: root/regexec.c
Commit message (Collapse)AuthorAgeFilesLines
* regex engine - wrap PL_regnode_name with macro REGNODE_NAME()Yves Orton2022-08-061-3/+3
|
* regex engine - wrap PL_regnode_kind with macro REGNODE_TYPE()Yves Orton2022-08-061-12/+12
| | | | The code confusing uses type and kind as synonyms. Lets end that bad habit
* regex engine - rename REGNODE_AFTER_dynamic() REGNODE_AFTER()Yves Orton2022-08-031-4/+4
| | | | | Now that REGNODE_AFTER() can handle all cases it makes sense to remove the dynamic() suffix.
* regex engine - Rename PL_reg_name to PL_regnode_nameYves Orton2022-08-031-3/+3
|
* regex engine - Rename PL_regkind to PL_regnode_kindYves Orton2022-08-031-12/+12
|
* regcomp.c - rename NEXTOPER to REGNODE_AFTER and related logicYves Orton2022-08-031-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is really easy to get confused about the difference between NEXTOPER() and regnext() of a regnode. The two concepts are related, similar, but importantly distinct. NEXTOPER() is also defined in such a way that it is easy to abuse and misunderstand and encourages producing code that is fragile to larger change, effectively "baking in" assumptions to the code that are difficult to discover by searching. Changing the type and storage requirements of a regnode may break things in subtle and hard to debug ways. An example of how NEXTOPER() is problematic is that this: NEXTOPER(NEXTOPER(branch)) does not mean "find the second node after the branch node", it means "jump forward by a regnode which happens to be two regnodes large". In other words NEXTOPER is just a fancy way of writing "node+1". This patch replaces NEXTOPER() with three new macros: REGNODE_AFTER_dynamic(node) REGNODE_AFTER_opcode(node,op) REGNODE_AFTER_type(node,tregnode_OPNAME) The first is the most generic case, it jumps forward by the size of the node, and determines that size by consulting OP(node). The second is where you have already extracted OP(node), and the third is where you know the actual structure that you want to jump forward by. Every regnode type has a corresponding type, which is known at compile time, so using the third will produce the most efficient code. However in many cases the code operates on one of several types, whose size may be the same now, but may change in the future, in which case one of the other forms is preferred. The run time logic in regexec.c should probably only use the REGNODE_AFTER_type() interface. Note that there is also a REGNODE_BEFORE() which replaces PREVOPER(), which is used in a specific piece of legacy logic but should not be used otherwise. It is not safe to go backwards from an arbitrary node, we simply have no way to know how large the previous node is and thus where it starts. This patch includes some logic that validates assumptions during DEBUG mode which should catch errors from resizing regnodes. After this patch changing the size of an existing regnode should be relatively safe and errors related to sizing should trigger assertion fails. This patch includes changes to perlreguts.pod to explain this stuff better.
* Fix GH #20009, pattern match panicKarl Williamson2022-07-281-1/+1
| | | | | | This was the result of using the wrong case in a switch() Thanks to @bram-perl for the idea of the test case reduction
* regexec.c: avoid compiler warnings about possible uninitialized useYves Orton2022-07-151-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | There is a codepath where we do not set up folder and fold_array, and rely on later codepaths not triggering uninitialized access. Because of the nature of the switch and etc, it is hard for an analyzer or even a human to be 100% certain it is all set up right in all cases where it is used. Also the variable initialization inside of the switch is a bit awkward. So this patch moves the declarations up, inits the vars to zero and then asserts before we use any of these vars under DEBUGGING to ensure we didn't forget. Example warning: re_exec.c:7121:44: warning: ‘fold_array’ may be used uninitialized in this function [-Wmaybe-uninitialized] 7121 | && UCHARAT(s) != fold_array[nextbyte]) This was noticed when building with: ./Configure -d -Doptimize=-g -Dusedevel -DDEBUGGING \ -Accflags='-fsanitize=address -fsanitize=undefined \ -ggdb3' -Aldflags='-Wl,--no-as-needed -lasan -lubsan' \ -Dcc=ccache\ gcc -Dld=gcc
* regex: Add optimizing regnodeKarl Williamson2022-07-121-0/+42
| | | | | | | | | | | | | | | | | | | | | | It turns out that any character class whose UTF-8 representation is two bytes long, and where all elements share the same first byte can be represented by a compact, fast regnode designed for the purpose. This commit adds that regnode, ANYOFHbbm. ANYOFHb already exists for classes where all elements have the same first byte, and this just changes the two-byte ones to use a bitmap instead of an inversion list. The advantages of this are that no conversion to code point is required (the continuation byte is just looked up in the bitmap) and no inversion list is needed. The inversion list would occupy more space, from 4 to 34 extra 64-bit words, plus an AV and SV, depending on what elements the class matches. Many characters in the Latin, Greek, Cyrillic, Greek, Hebrew, Arabic, and several other (lesser-known) scripts are of this form. It would be possible to extend this technique to larger bitmaps, but this commit is a start.
* regex: Refactor bitmap vs non-bitmap of qr/[]/Karl Williamson2022-07-101-12/+15
| | | | | | | | | | | | | | | | | | A bracketed character class in a pattern is generally represented by some form of ANYOF node, with matches of characters in the Latin1 range handled by a bitmap, and an inversion list for higher code point matches. But some patterns only have low matches, and some only high, and some match everything that is high. This commit refactors a little so that the distinction between nothing high matches vs everything high matches is done through the same technique. Previously one was indicated by a flag, and the other by a special value in the node's structure. Now there are two special values, and the flag is freed up for a potential future use. In the past the meaning of the flags has had to be overloaded go accommodate all the needs. freeing of a flag means This all allows for some slight simplicfications.
* regex: White-space/comment onlyKarl Williamson2022-07-101-62/+73
| | | | Fix up after previous commit
* regex: Refactor a shared flagKarl Williamson2022-07-101-26/+22
| | | | | | | | | | | | | In ANYOF nodes (generated for qr/[]/), there is a bitmap component, and possibly a non-bitmap component. It turns out that a single flag can be used to indicate the existence of the latter. When looked at this way, the name of the flag becomes simpler, and incorporates the meaning of another bit, which was previously shared with yet another meaning. Thus that other meaning can become an unshared bit. This allows for some simplification, and being able to handle the uncommon Turkish locale with fewer main-line conditionals being executed at runtime.
* regexec.c: Hoist test of invariant value to outside loopsKarl Williamson2022-07-101-4/+10
| | | | No sense testing each iteration when it isn't going to change.
* regex: Change some internal macro names for clarityKarl Williamson2022-07-031-3/+3
| | | | | | | | | These long names are designed to remind the coder that they have multiple meanings. But move the reminder text to the end, as it obscures the purposes. And some have two halves for the separate meanings; change the names so the halves are split by two underscores to visually emphasize this.
* regex: Rmv unnecessary conditionalKarl Williamson2022-07-031-6/+4
| | | | | Within this block we know that *p must be one of [Ii]. If it's not 'i', it must be 'I' without needing to test
* regexec.c: Make the trivial case first in if..else..Karl Williamson2022-07-031-7/+7
| | | | It's more understandable
* locale: Change macro name to be C conformantKarl Williamson2022-07-031-27/+27
| | | | | | C reserves leading underscores for system use; this commit fixes _CHECK_AND_WARN_PROBLEMATIC_LOCALE to be CHECK_AND_WARN_PROBLEMATIC_LOCALE_
* Change loop type to avoid compiler warningsKarl Williamson2022-07-021-2/+2
| | | | | | On some configurations we're getting comparison of signed vs unsigned. This fixes GH #19915
* regexec.c: Avoid compiler msgsKarl Williamson2022-07-011-3/+3
| | | | | Solaris complains that these could be negative, though in practice they aren't.
* Revert "regex: Add POSIXA1R node"Karl Williamson2022-07-011-86/+0
| | | | | | | This reverts commit d62feba66bf43f35d092bb026694f927e9f94d38. As explained in its commit message. It adds some comments to point out that the commit exists, for the curious.
* regex: Add POSIXA1R nodeKarl Williamson2022-07-011-0/+86
| | | | | | | | | | | | | | | | | | | | | | | | Several of the POSIXA classes are a single range on ASCII platforms, and [:digit:] is a single range on both ASCII and EBCDIC. This regnode was designed to replace the POSIXA regnode for such classes to get a bit of performance by not needing to do an array lookup. Instead it encodes some bits in the flags field that with shifting and masking get the right values for the single range's bounds for any such node. However, performance tests conducted by Sergey Aleynikov showed this was actually slower than what it intended to replace. Rather than completely drop this work, I'm adding it to blead, and immediately reverting it, so that should parts of it ever become useful, it would be available. A few tests fail; those are skipped for the purposes of this commit so that it doesn't interfere with bisecting. The code also isn't completely commented. One could add a regnode for each posix class it was decided should have the expected performance boost. But regnodes are a finite resource, and the boost is probably not large enough to justify doing so.
* regexec.c: Add some branch predictions reorder condsKarl Williamson2022-06-291-1/+8
| | | | | This reorders the conditionals in an 'if' where all must be true to succeed, so the least likely one is done first, and adds some UNLIKELY
* Convert '!!' to cBOOL()Karl Williamson2022-06-141-1/+1
| | | | | | | | | | | | I believe the '!!' is somewhat obscure; I for one didn't know about it for years of programming C, and it was buggy in one compiler, which is why cBOOL was created, I believe. And it is graphically dense, and generally harder to read than the cBOOL() construct. This commit dates from before we moved to C99 where we can simply cast to (bool), and cBOOL() has been rewritten to do that. But the vast majority of code uses cBOOL(), and this commit brings the remainder of the core .[ch] files into uniformity.
* regexec.c: Refactor switch default()Karl Williamson2022-06-121-1/+5
| | | | | It seems clearer to me to have the panic at the end of the routine instead of as the default: of a switch().
* regexec.c: Use new macros; refactor switch()Karl Williamson2022-06-121-16/+19
| | | | | | | | | | These case statements in a switch all had the same prelude for checking if the locale is UTF-8 and handling that case separately. A few commits ago created macros closer to the base level. This commit factors out the common UTF-8 handling, and then puts the lower lever things in the switch(). Perhaps the C optimizer would have been smart enough to do this too, but we might as well do it ourselves, now that it is convenient.
* Change handy.h macro names to be C standard conformantKarl Williamson2022-06-121-46/+46
| | | | | | | C reserves symbols beginning with underscores for its own use. This commit moves the underscore so it is trailing, which is legal. The symbols changed here are many of the ones in handy.h that have significant uses outside it.
* Change handy.h macro names to be C standard conformantKarl Williamson2022-06-121-13/+13
| | | | | | | C reserves symbols beginning with underscores for its own use. This commit moves the underscore so it is trailing, which is legal. The symbols changed here are most of the ones in handy.h that have few uses outside it.
* Make fc(), qr//i thread-safe on participating platformsKarl Williamson2022-06-111-24/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A long standing bug in Perl that has gone undetected is that the array is global that is created when changing locales and tells fc() and qr//i matching what the folds are in the new locale. What this means is that any program only has one set of fold definitions that apply to all threads within it, even if we claim that the locales are thread-safe on the given platform. One possibility for this going undetected so long is that no one is using locales on multi-threaded systems much. Another possibility is that modern UTF-8 locales have the same set of folds as any other one. It is a simple matter to make the fold array per-thread instead of per-process, and that solves the problem transparently to other code. I discovered this stress-testing locale handling under threads. That test will be added in a future commit. In order to keep from having a dTHX inside foldEQ_locale, it has to have a pTHX_ parameter. This means that the other functions that function pointer variables get assigned to point to have to have an identical signature, which means adding pTHX_ to functions that don't require it. The bodies of all these are known to the compiler, since they are all in inline.h or in the same .c file as where they are called. Hence the compiler can optimize out the unused parameter. Two calls of STR_WITH_LEN also have to be changed because of C preprocessor limitations; perhaps there is another way to do it that I'm unfamiliar with.
* regexec.c: Handle Turkish locale if large ANYOF bitmapKarl Williamson2022-06-091-6/+24
| | | | | | Perl defaults to the bitmap for ANYOF nodes being for the lowest 256 characters, but it is possible to compile the bitmap to be up to size 2**16. Doing so, prior to this commit, broke Turkish locale handling.
* regexec.c: Avoid using a more general fcnKarl Williamson2022-06-091-18/+68
| | | | | | | The regnodes in the ANYOFH series by definition don't have a bitmap used in the other ANYOF-type nodes. Instead they have an inversion list. We can avoid the overhead of calling the general function that looks first in the bitmap.
* regex: Create a macro to avoid some #ifdef'sKarl Williamson2022-06-021-7/+1
| | | | Easier to read
* regexec.c: Use modern name for functionKarl Williamson2022-06-021-2/+2
| | | | | The newer name is not misleading; the old name has been retained only for backcompat; this removes its final use in the core.
* regcomp,regexec: Shorten #define nameKarl Williamson2022-06-021-4/+2
| | | | | | | | This names a flag bit whose meaning depends on context. The previous name contained both meanings, and was simply too long to be comprehensible. This commit splits it into two names, with the suffix '_shared'. It's not necessary to know precisely what the other meaning is when reading the code, only that it has another meaning.
* Rename internal functionKarl Williamson2022-06-021-2/+2
| | | | The old name had a negative in it, and its easier to grok positives
* regexec.c: Remove obsolete codeKarl Williamson2022-06-021-9/+0
| | | | The results of this are no longer used
* regexec.c: Restore overzealous removal for GH#8369Karl Williamson2022-05-281-0/+6
| | | | | I fixed this locally as a result of Hugo van der Sanden's comment in #19636, but neglected to commit it.
* regexec.c: Rmv obsolete recursion checks: GH #8369Karl Williamson2022-05-271-27/+1
| | | | | | There is no recursion to exceed limits of. This fixes #8369
* fix printf format warning in regexp debug outputHugo van der Sanden2022-04-121-2/+2
|
* regexec.c: make ACCEPT close logic handle SUCCEED/LOOKBEHIND_END opcodesYves Orton2022-04-081-14/+24
| | | | | | | | | | | | | | | | | | | | GH Issue #19484 reported that print "ABDE" =~ /(A (A|B(*ACCEPT)|C)+ D)(E)/x ? "yes: <$1-$2>" : "no"; does not output the expected 'AB-B', and instead does not match. Removing the + quantifier behaves as expected. This patch is 4/4 of the patches to fix this problem: SUCCEED and LOOKBEHIND_END regops are type 'END' which have a next_off of 0. This was causing regnext() to return null inside of the loop iterator for the logic in ACCEPT which closes any open capture buffers thus terminating the loop prematurely and preventing some of the capture buffers from being properly closed. SUCCEED is used to end a variety of structures, including lookahead IFMATCH and UNLESSM, SUSPEND, and CURLYM, and LOOKBEHIND_END serves the same purpose for lookbehind IFMATCH and UNLESSM. Thus this patch fixes the original bug but also fixes a variety of other cases involving ACCEPT.
* regexec.c: ACCEPT inside of a (...)+ should stop looping (CURLYM optimization)Yves Orton2022-04-081-4/+14
| | | | | | | | | | | | | | GH Issue #19484 reported that print "ABDE" =~ /(A (A|B(*ACCEPT)|C)+ D)(E)/x ? "yes: <$1-$2>" : "no"; does not output the expected 'AB-B', and instead does not match. Removing the + quantifier behaves as expected. This patch is 3/4 and fixes part of the problem: the CURLYM optimization was not terminating its loop properly when it contained an ACCEPT. This patch adds a new variable 'is_accepted' which is used to ensure that the CURLYM optimization stops after an ACCEPT regop is executed.
* regexec.c: Fix up for ACCEPT inside of a (...)+ set lastopen in CURLYMYves Orton2022-04-081-0/+1
| | | | | | | | | | | | | GH Issue #19484 reported that print "ABDE" =~ /(A (A|B(*ACCEPT)|C)+ D)(E)/x ? "yes: <$1-$2>" : "no"; does not output the expected 'AB-B', and instead does not match. Removing the + quantifier behaves as expected. This patch is 2/4 and fixes part of the problem: lastopen was not being set properly inside of the CURLYM optimization. lastopen is used by the ACCEPT logic to know which parens need to be closed.
* gh19557: restore match_end on early bailoutHugo van der Sanden2022-03-251-0/+1
| | | | | | After 271c3af797, early bailout from the inner one of a pair of nested lookbehinds would leave the desired match_end pointing at the wrong place, so the outer lookbehind could give the wrong answer.
* gh17746: add missing check on hardcountHugo van der Sanden2022-03-211-2/+5
| | | | Failing to check for max iterations caused an assertion failure.
* Inlined newSV_type(SVt_NULL) leaner than non-inlined newSV(0)Richard Leach2022-03-071-1/+1
| | | | | | | | | | | | When a function outside of sv.c creates a SV via newSV(0): * There is a call to Perl_newSV * A SV head is uprooted and its flags set * A runtime check is made to effectively see if 0 > 0 * The new SV* is returned Replacing newSV(0) with newSV_type(SVt_NULL) should be more efficient, because (assuming there are SV heads to uproot), the only step is: * A SV head is uprooted and its flags set
* regex engine: Issue #19168 - Fix variable length lookbehind matchesYves Orton2022-02-231-2/+22
| | | | | | | | | | | | We were not validating that when (?<=a|ab) matched that the right hand side of the match lined up with the position of the assertion. Similar for (?<!a|ab) and related patterns, eg, (*positive_lookbehind:). Note these problems do NOT affect lookahead. Part of the difficulty here was that the SUCCEED node was serving too many purposes, necessitating a new regop LOOKBEHIND_END. Includes more tests for various lookahead or lookbehind cases.
* regexec.c: White-space/comment onlyKarl Williamson2021-08-231-217/+218
|
* regexec.c: regrepeat() switch() on target utf8nessKarl Williamson2021-08-231-100/+119
| | | | | | | | | | | Many of the cases in the main switch() statement in regrepeat() begin with if (utf8_target) or similar. These branches can be saved, and the code easier to understand by including the utf8ness in the case statements, as this commit does; taking advantage of the changes in the previous commit.
* regexec.c: Add commentKarl Williamson2021-08-071-0/+1
|
* regexec.c: Refactor macro to generalize itKarl Williamson2021-08-071-11/+27
| | | | This is in preparation for a somewhat different use to be added.
* regexec.c: Use lsbit_pos32() to avoid iterationsKarl Williamson2021-07-301-26/+13
| | | | | | Before this commit, the code looped through a bitmap looking for a set bit. Now that we have a fast way to find where a set bit is, use it, and avoid the fruitless iterations.