author | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2011-09-20 11:30:56 +0000 |
---|---|---|
committer | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2011-09-20 11:30:56 +0000 |
commit | 394e5f3cffccee4d0e00248171e2e539b298ccad (patch) | |
tree | 08b51e15583bfb82c34ccc19e8c6fb5c8c3cef14 | |
parent | 78d769a6c3dda0c35b15b4a66c023ec5b4bd5494 (diff) | |
download | pcre-394e5f3cffccee4d0e00248171e2e539b298ccad.tar.gz |
Fix miscompile of /(*ACCEPT)a/, which thought a match had to start with "a".
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@701 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- | ChangeLog | 103 |
-rw-r--r-- | pcre_compile.c | 9 |
-rw-r--r-- | testdata/testinput2 | 6 |
-rw-r--r-- | testdata/testoutput2 | 18 |
4 files changed, 84 insertions, 52 deletions
@@ -4,56 +4,61 @@ ChangeLog for PCRE
 Version 8.20 12-Sep-2011
 ------------------------
 
-1. Change 37 of 8.13 broke patterns like [:a]...[b:] because it thought it had
-   a POSIX class. After further experiments with Perl, which convinced me that
-   Perl has bugs and confusions, a closing square bracket is no longer allowed
-   in a POSIX name. This bug also affected patterns with classes that started
-   with full stops.
-
-2. If a pattern such as /(a)b|ac/ is matched against "ac", there is no captured
-   substring, but while checking the failing first alternative, substring 1 is
-   temporarily captured. If the output vector supplied to pcre_exec() was not
-   big enough for this capture, the yield of the function was still zero
-   ("insufficient space for captured substrings"). This cannot be totally fixed
-   without adding another stack variable, which seems a lot of expense for a
-   edge case. However, I have improved the situation in cases such as
-   /(a)(b)x|abc/ matched against "abc", where the return code indicates that
-   fewer than the maximum number of slots in the ovector have been set.
-
-3. Related to (2) above: when there are more back references in a pattern than
-   slots in the output vector, pcre_exec() uses temporary memory during
-   matching, and copies in the captures as far as possible afterwards. It was
-   using the entire output vector, but this conflicts with the specification
-   that only 2/3 is used for passing back captured substrings. Now it uses only
-   the first 2/3, for compatibility. This is, of course, another edge case.
-
-4. Zoltan Herczeg's just-in-time compiler support has been integrated into the
-   main code base, and can be used by building with --enable-jit. When this is
-   done, pcregrep automatically uses it unless --disable-pcregrep-jit or the
-   runtime --no-jit option is given.
-
-5. When the number of matches in a pcre_dfa_exec() run exactly filled the
-   ovector, the return from the function was zero, implying that there were
-   other matches that did not fit. The correct "exactly full" value is now
-   returned.
-
-6. If a subpattern that was called recursively or as a subroutine contained
-   (*PRUNE) or any other control that caused it to give a non-standard return,
-   invalid errors such as "Error -26 (nested recursion at the same subject
-   position)" or even infinite loops could occur.
-
-7. If a pattern such as /a(*SKIP)c|b(*ACCEPT)|/ was studied, it stopped
-   computing the minimum length on reaching *ACCEPT, and so ended up with the
-   wrong value of 1 rather than 0. Further investigation indicates that
-   computing a minimum subject length in the presence of *ACCEPT is difficult
-   (think back references, subroutine calls), and so I have changed the code so
-   that no minimum is registered for a pattern that contains *ACCEPT.
-
-8. If (*THEN) was present in the first (true) branch of a conditional group,
-   it was not handled as intended.
+1.  Change 37 of 8.13 broke patterns like [:a]...[b:] because it thought it had
+    a POSIX class. After further experiments with Perl, which convinced me that
+    Perl has bugs and confusions, a closing square bracket is no longer allowed
+    in a POSIX name. This bug also affected patterns with classes that started
+    with full stops.
+
+2.  If a pattern such as /(a)b|ac/ is matched against "ac", there is no
+    captured substring, but while checking the failing first alternative,
+    substring 1 is temporarily captured. If the output vector supplied to
+    pcre_exec() was not big enough for this capture, the yield of the function
+    was still zero ("insufficient space for captured substrings"). This cannot
+    be totally fixed without adding another stack variable, which seems a lot
+    of expense for a edge case. However, I have improved the situation in cases
+    such as /(a)(b)x|abc/ matched against "abc", where the return code
+    indicates that fewer than the maximum number of slots in the ovector have
+    been set.
+
+3.  Related to (2) above: when there are more back references in a pattern than
+    slots in the output vector, pcre_exec() uses temporary memory during
+    matching, and copies in the captures as far as possible afterwards. It was
+    using the entire output vector, but this conflicts with the specification
+    that only 2/3 is used for passing back captured substrings. Now it uses
+    only the first 2/3, for compatibility. This is, of course, another edge
+    case.
+
+4.  Zoltan Herczeg's just-in-time compiler support has been integrated into the
+    main code base, and can be used by building with --enable-jit. When this is
+    done, pcregrep automatically uses it unless --disable-pcregrep-jit or the
+    runtime --no-jit option is given.
+
+5.  When the number of matches in a pcre_dfa_exec() run exactly filled the
+    ovector, the return from the function was zero, implying that there were
+    other matches that did not fit. The correct "exactly full" value is now
+    returned.
+
+6.  If a subpattern that was called recursively or as a subroutine contained
+    (*PRUNE) or any other control that caused it to give a non-standard return,
+    invalid errors such as "Error -26 (nested recursion at the same subject
+    position)" or even infinite loops could occur.
+
+7.  If a pattern such as /a(*SKIP)c|b(*ACCEPT)|/ was studied, it stopped
+    computing the minimum length on reaching *ACCEPT, and so ended up with the
+    wrong value of 1 rather than 0. Further investigation indicates that
+    computing a minimum subject length in the presence of *ACCEPT is difficult
+    (think back references, subroutine calls), and so I have changed the code
+    so that no minimum is registered for a pattern that contains *ACCEPT.
+
+8.  If (*THEN) was present in the first (true) branch of a conditional group,
+    it was not handled as intended.
+
+9.  Replaced RunTest.bat with the much improved version provided by Sheri
+    Pierce.
 
-9. Replaced RunTest.bat with the much improved version provided by Sheri
-   Pierce.
+10. A pathological pattern such as /(*ACCEPT)a/ was miscompiled, thinking that
+    the first byte in a match must be "a".
 
 
 Version 8.13 16-Aug-2011
diff --git a/pcre_compile.c b/pcre_compile.c
index 23406ae..b6d3637 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -5045,6 +5045,9 @@ for (;; ptr++)
               PUT2INC(code, 0, oc->number);
               }
             *code++ = (cd->assert_depth > 0)? OP_ASSERT_ACCEPT : OP_ACCEPT;
+
+            /* Do not set firstbyte after *ACCEPT */
+            if (firstbyte == REQ_UNSET) firstbyte = REQ_NONE;
             }
 
           /* Handle other cases with/without an argument */
@@ -6323,7 +6326,7 @@ for (;; ptr++)
     byte, set it from this character, but revert to none on a zero repeat.
     Otherwise, leave the firstbyte value alone, and don't change it on a zero
     repeat. */
-
+
     if (firstbyte == REQ_UNSET)
       {
       zerofirstbyte = REQ_NONE;
@@ -6340,7 +6343,7 @@ for (;; ptr++)
       else firstbyte = reqbyte = REQ_NONE;
       }
 
-    /* firstbyte was previously set; we can set reqbyte only the length is
+    /* firstbyte was previously set; we can set reqbyte only if the length is
     1 or the matching is caseful. */
 
     else
@@ -7287,7 +7290,7 @@ re->top_bracket = cd->bracount;
 re->top_backref = cd->top_backref;
 re->flags = cd->external_flags;
 
-if (cd->had_accept) reqbyte = -1;   /* Must disable after (*ACCEPT) */
+if (cd->had_accept) reqbyte = REQ_NONE;   /* Must disable after (*ACCEPT) */
 
 /* If not reached end of pattern on success, there's an excess bracket. */
 
diff --git a/testdata/testinput2 b/testdata/testinput2
index 739c48f..c43cc80 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -3874,4 +3874,10 @@ with \Y. ---/
     abc\N\N
     bbb\N\N
 
+/(*ACCEPT)a/+I
+    bax
+
+/z(*ACCEPT)a/+I
+    baxzbx
+
 /-- End of testinput2 --/
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index e13e4df..6895e44 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -12304,4 +12304,22 @@ No match
  0: 
  0+ bb
 
+/(*ACCEPT)a/+I
+Capturing subpattern count = 0
+No options
+No first char
+No need char
+    bax
+ 0: 
+ 0+ bax
+
+/z(*ACCEPT)a/+I
+Capturing subpattern count = 0
+No options
+First char = 'z'
+No need char
+    baxzbx
+ 0: z
+ 0+ bx
+
 /-- End of testinput2 --/
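
The first-byte rule this patch introduces can be illustrated with a toy model. The sketch below is not PCRE code: `toy_first_byte`, `TOY_REQ_UNSET` and `TOY_REQ_NONE` are hypothetical names, the numeric sentinel values are assumptions, and the "pattern" grammar is reduced to literal bytes plus the verb `(*ACCEPT)`. It only shows why a literal that follows `(*ACCEPT)` must not be recorded as a required first byte, while a literal that precedes it still can be.

```c
#include <assert.h>
#include <string.h>

/* Sentinels modelled on the REQ_UNSET / REQ_NONE names in the diff.
   The numeric values here are assumptions made for this sketch. */
enum { TOY_REQ_UNSET = -2, TOY_REQ_NONE = -1 };

/* Toy first-byte analysis (hypothetical helper, not part of PCRE).
   The pattern may contain only literal bytes and the verb "(*ACCEPT)".
   Returns the byte every match must start with, or TOY_REQ_NONE. */
static int toy_first_byte(const char *pattern)
{
    static const char verb[] = "(*ACCEPT)";
    const size_t verb_len = sizeof(verb) - 1;
    int firstbyte = TOY_REQ_UNSET;

    assert(pattern != NULL);

    for (const char *p = pattern; *p != '\0'; ) {
        if (strncmp(p, verb, verb_len) == 0) {
            /* A match can end here, so a later literal cannot be a
               required first byte: mirror the added line in the patch. */
            if (firstbyte == TOY_REQ_UNSET) firstbyte = TOY_REQ_NONE;
            p += verb_len;
        } else {
            /* Only the first literal seen while still unset is recorded. */
            if (firstbyte == TOY_REQ_UNSET) firstbyte = (unsigned char)*p;
            p++;
        }
    }

    /* An empty pattern has no required first byte either. */
    return (firstbyte == TOY_REQ_UNSET) ? TOY_REQ_NONE : firstbyte;
}
```

Under these assumptions, `toy_first_byte("(*ACCEPT)a")` reports no first byte and `toy_first_byte("z(*ACCEPT)a")` reports `'z'`, which corresponds to the "No first char" and "First char = 'z'" lines in the new test output.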