re_intuit_start(): skip too short variant utf8 pat

RT #132187

This function searches in the target string for known fixed substrings of the pattern, either to quickly reject the match, or to find a minimum start point at which to run the full regex engine.

If the target string is utf8 and the pattern is non-utf8 but contains chars in the range 0x80-0xff, the fixed substring to be searched for will be upgraded to utf8, which causes its length to grow. This can defeat an early quick rejection test of "is the known substring longer than the target string", because that check is done before the upgrade.

It can also trigger the bug reported in the ticket above: a calculation of the maximum end-point within the target string to find the substring goes wrong, because (endpoint - N1) gets limited to the start point (since N1 is longer than the string length), and so the moral equivalent of ((endpoint - N1) + N2) then disappears off the end of the string.

The net effect of this bug is that a few bytes off the end of the string may be read, triggering complaints by ASAN etc, or even a SEGV. It makes no difference to the match (which should fail and does fail), except that it might match slower in the unlikely event that the bytes off the end of the string match the tail of the searched-for substring, in which case the full regex engine has to be run to finally reject it.

This commit:

1) adds a second length(substr) > length(target string) check at the point it's going to run the FBM substring search;

2) tidies up the code that moves the endpoint back, skipping an expensive utf8 hop-back in more cases.
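To make the length growth concrete, here is a minimal sketch in C. It is not code from regexec.c: the helper name latin1_to_utf8_len() is hypothetical, and it only assumes the standard UTF-8 rule that each byte in the range 0x80-0xff encodes as two bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of perl: byte length that a non-utf8
 * (Latin-1) string of n bytes would have after being upgraded to utf8.
 * Bytes 0x00-0x7f stay one byte; bytes 0x80-0xff become two bytes. */
size_t latin1_to_utf8_len(const unsigned char *s, size_t n)
{
    size_t len = n;
    size_t i;

    assert(s != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (s[i] >= 0x80)
            len++;              /* one extra byte per high-bit char */
    }
    return len;
}
```

For example, a three-byte substring "\xe9\xe9\xe9" gives 6. Against a four-byte target, a "substring longer than target" test done before the upgrade (3 > 4) does not reject, while the same test done after the upgrade (6 > 4) does.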
author: David Mitchell <davem@iabyn.com> 2017-12-03 16:38:37 +0000
committer: David Mitchell <davem@iabyn.com> 2017-12-03 17:57:37 +0000
commit: 2ce94a867b15d96bd49eb8807d39df950f3a1087 (patch)
tree: 722fe9fd09dc7138e71d1cfc19e3b59cd6b47d79 /regexec.c
parent: 4f193c3e96af4b9697510e396fb5e944776d38fb (diff)
download: perl-2ce94a867b15d96bd49eb8807d39df950f3a1087.tar.gz
1 files changed, 13 insertions, 4 deletions
diff --git a/regexec.c b/regexec.c
index a19ede95dc..a571be2c5b 100644
--- a/regexec.c
+++ b/regexec.c
@@ -907,19 +907,28 @@ Perl_re_intuit_start(pTHX_
             && prog->intflags & PREGf_ANCH
             && prog->check_offset_max != SSize_t_MAX)
         {
-            SSize_t len = SvCUR(check) - !!SvTAIL(check);
+            SSize_t check_len = SvCUR(check) - !!SvTAIL(check);
             const char * const anchor =
                         (prog->intflags & PREGf_ANCH_GPOS ? strpos : strbeg);
+            SSize_t targ_len = (char*)end_point - anchor;
+
+            if (check_len > targ_len) {
+                DEBUG_EXECUTE_r(Perl_re_printf( aTHX_
+			      "Anchored string too short...\n"));
+                goto fail_finish;
+            }
 
             /* do a bytes rather than chars comparison. It's conservative;
              * so it skips doing the HOP if the result can't possibly end
              * up earlier than the old value of end_point.
              */
-            if ((char*)end_point - anchor > prog->check_offset_max) {
+            assert(anchor + check_len <= (char *)end_point);
+            if (prog->check_offset_max + check_len < targ_len) {
                 end_point = HOP3lim((U8*)anchor,
                                 prog->check_offset_max,
-                                end_point -len)
-                            + len;
+                                end_point - check_len
+                            )
+                            + check_len;
             }
         }
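The patched logic in this hunk can be modelled roughly as follows. This is a simplified sketch that works on byte offsets only: limit_end_point() and its parameters are hypothetical names, and the real code operates on pointers and uses HOP3lim() to step over utf8 characters rather than plain addition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical byte-only model of the hunk above, not perl source.
 * Offsets are relative to the anchor. Returns 0 if the match can be
 * rejected outright, else 1 with *end_off set to the new end offset.
 * Assumes offset_max is small enough that the addition cannot overflow
 * (the real code excludes SSize_t_MAX before reaching this point). */
int limit_end_point(ptrdiff_t targ_len, ptrdiff_t check_len,
                    ptrdiff_t offset_max, ptrdiff_t *end_off)
{
    /* New check: the substring cannot fit in the target at all. */
    if (check_len > targ_len)
        return 0;

    /* From here on the subtraction below cannot go negative. */
    assert(check_len <= targ_len);
    *end_off = targ_len;

    /* Only move the end point back if that can actually shorten it. */
    if (offset_max + check_len < targ_len) {
        ptrdiff_t start = offset_max;
        ptrdiff_t limit = targ_len - check_len;

        /* Mirrors the limit argument of HOP3lim(). In this byte model
         * it never triggers; with multi-byte chars it can. */
        if (start > limit)
            start = limit;
        *end_off = start + check_len;
    }
    return 1;
}
```

With targ_len 10, check_len 3 and offset_max 2 the end offset becomes 5; with targ_len 4 and check_len 6 the function returns 0, which corresponds to the new "goto fail_finish" path.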