summaryrefslogtreecommitdiff
path: root/t/re
diff options
context:
space:
mode:
authorYves Orton <demerphq@gmail.com>2023-01-14 11:54:07 +0100
committerYves Orton <demerphq@gmail.com>2023-03-13 21:26:08 +0800
commit38508ce8fc3a1bd12a3bb65e9d4ceb9b396a18db (patch)
tree86a30b06bfdd74b23389d43a7340c7bf804fb276 /t/re
parentacababb42be12ff2986b73c1bfa963b70bb5d54e (diff)
downloadperl-38508ce8fc3a1bd12a3bb65e9d4ceb9b396a18db.tar.gz
regexec.c - incredibly inefficient solution to backref problem
Backrefs to unclosed parens inside of a quantified group were not being properly handled, which revealed we are not unrolling the paren state properly on failure and backtracking. Much of the code assumes that when we execute a "conditional" operation (where more than one thing could match) that we need not concern ourself with the paren state unless the conditional operation itself represents a paren, and that generally opcodes only needed to concern themselves with parens to their right. When you exclude backrefs from the equation this is broadly reasonable (i think), as on failure we typically dont care about the state of the paren buffers. They either get reset as we find a new different accepting pathway, or their state is irrelevant if the overal match is rejected (eg it fails). However backreferences are different. Consider the following pattern from the tests "xa=xaaa" =~ /^(xa|=?\1a){2}\z/ in the first iteration through this the first branch matches, and in fact because the \1 is in the second branch it can't match on the first iteration at all. After this $1 = "xa". We then perform the second iteration. "xa" does not match "=xaaa" so we fall to the second branch. The '=?' matches, but sets up a backtracking action to not match if the rest of the pattern does not match. \1 matches 'xa', and then the 'a' matches, leaving an unmatched 'a' in the string, we exit the quantifier loop with $1 = "=xaa" and match \z against the remaining "a" in the pattern, and fail. Here is where things go wrong in the old code, we unwind to the outer loop, but we do not unwind the paren state. We then unwind further into the 2nds iteration of the loop, to the '=?' where we then try to match the tail with the quantifier matching the empty string. We then match the old $1 (which was not unwound) as "=xaa", and then the "a" matches, and we are the end of the string and we have incorrectly accpeted this string as matching the pattern. What should have happend was when the \1 was resolved the second time it should have returned the same string as it did when the =? matched '=', which then would have resulted in the tail matching again, and etc, eventually unwinding the entire pattern when the second iteration failed entirely. This patch is very crude. It simply pushes the state of the parens and creates an unwind point for every case where we do a transition to a B or _next operation, and we make the corresponding _next_fail do the appropriate unwinding. The objective was to achieve correctness and then work towards making it more efficient. We almost certainly overstore items on the stack. In the next patch we will keep track of the unclosed parens before the relevant operators and make sure that they are properly pushed and unwound at the correct times. In other words this is a first step patch to make sure things are correct, the next patch will change it so we do it quickly.
Diffstat (limited to 't/re')
-rw-r--r--t/re/re_tests16
1 files changed, 16 insertions, 0 deletions
diff --git a/t/re/re_tests b/t/re/re_tests
index bfccaa04ed..5599507df2 100644
--- a/t/re/re_tests
+++ b/t/re/re_tests
@@ -2141,6 +2141,22 @@ AB\s+\x{100} AB \x{100}X y - -
/(([ab]+)|([cd]+)|([ef]+))+/ acebd Ty $1-$2-$3-$4=$& d--d-=acebd
/(([ab]+)|([cd]+)|([ef]+))+/ acebdf y $1-$2-$3-$4=$& f---f=acebdf
/((a)(b)(c)|(a)(b)|(a))+/ abcaba y $1+$2-$3-$4+$5-$6+$7=$& a+--+-+a=abcaba
+
+/^(xa|=?\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|(?:=|zzzz|)\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|(?:=|zzzz)?\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|[Z=]?\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|([Z=])?\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|([Z=]|zzzz)?\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|(=|)\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|(?:[Z=])?\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|(?:[Z=]|zzzz)?\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|(?:=|)\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|=*\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|[Z=]*\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|(?:[Z=])*\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+/^(xa|(?:[Z=]|zzzz)*\1a){2}$/ xa=xaaa n - - # GH 10073 - RT72020
+
# Keep these lines at the end of the file
# pat string y/n/etc expr expected-expr skip-reason comment
# vim: softtabstop=0 noexpandtab