Fix problem with (*THEN) not backing up far enough.

git-svn-id: svn://vcs.exim.org/pcre/code/trunk@550 2f5784b3-3f2a-0410-8824-cb99058d5e15
author: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-10-10 16:24:11 +0000
committer: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-10-10 16:24:11 +0000
commit: a0ab6e7fdaff6d0b459689f4bc5c79792cfd85cd (patch)
tree: 5f5c6f147d3d5d3c2c06d64cfe7d6ac413991c59
parent: c9a0ce9d1827d3416a549f1fe8ba46f081ccfec2 (diff)
download: pcre-a0ab6e7fdaff6d0b459689f4bc5c79792cfd85cd.tar.gz
13 files changed, 196 insertions, 59 deletions
diff --git a/ChangeLog b/ChangeLog
index b2fef0d..9986cfa 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,6 +1,17 @@
 ChangeLog for PCRE
 ------------------
 
+Version 8.11 10-Oct-2010
+------------------------
+
+1.  (*THEN) was not working properly if there were untried alternatives prior
+    to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it 
+    backtracked to try for "b" instead of moving to the next alternative branch 
+    at the same level (in this case, to look for "c"). The Perl documentation 
+    is clear that when (*THEN) is backtracked onto, it goes to the "next 
+    alternative in the innermost enclosing group".
+
+
 Version 8.10 25-Jun-2010
 ------------------------
 
diff --git a/HACKING b/HACKING
index 8c6a42d..ee09132 100644
--- a/HACKING
+++ b/HACKING
@@ -4,6 +4,7 @@ Technical Notes about PCRE
 These are very rough technical notes that record potentially useful information 
 about PCRE internals.
 
+
 Historical note 1
 -----------------
 
@@ -22,6 +23,7 @@ the one matching the longest subset of the subject string was chosen. This did
 not necessarily maximize the individual wild portions of the pattern, as is
 expected in Unix and Perl-style regular expressions.
 
+
 Historical note 2
 -----------------
 
@@ -34,6 +36,7 @@ maximizing (or, optionally, minimizing in Perl) the amount of the subject that
 matches individual wild portions of the pattern. This is an "NFA algorithm" in
 Friedl's terminology.
 
+
 OK, here's the real stuff
 -------------------------
 
@@ -44,6 +47,7 @@ in the pattern, to save on compiling time. However, because of the greater
 complexity in Perl regular expressions, I couldn't do this. In any case, a
 first pass through the pattern is helpful for other reasons. 
 
+
 Computing the memory requirement: how it was
 --------------------------------------------
 
@@ -54,6 +58,7 @@ idea was that this would turn out faster than the Henry Spencer code because
 the first pass is degenerate and the second pass can just store stuff straight
 into the vector, which it knows is big enough.
 
+
 Computing the memory requirement: how it is
 -------------------------------------------
 
@@ -75,6 +80,7 @@ runs more slowly than before (30% or more, depending on the pattern) because it
 is doing a full analysis of the pattern. My hope was that this would not be a
 big issue, and in the event, nobody has commented on it.
 
+
 Traditional matching function
 -----------------------------
 
@@ -84,6 +90,7 @@ and the way that Perl works. This is not surprising, since it is intended to be
 as compatible with Perl as possible. This is the function most users of PCRE
 will use most of the time.
 
+
 Supplementary matching function
 -------------------------------
 
@@ -119,7 +126,6 @@ quantifiers) are always just two bytes long.
 
 A list of the opcodes follows:
 
-
 Opcodes with no following data
 ------------------------------
 
@@ -151,12 +157,24 @@ These items are all just one byte long
   OP_EXTUNI              match an extended Unicode character 
   OP_ANYNL               match any Unicode newline sequence 
   
-  OP_ACCEPT              ) These are Perl 5.10's "backtracking    
-  OP_COMMIT              ) control verbs". If OP_ACCEPT is inside
-  OP_FAIL                ) capturing parentheses, it may be preceded 
-  OP_PRUNE               ) by one or more OP_CLOSE, followed by a 2-byte 
-  OP_SKIP                ) number, indicating which parentheses must be
-  OP_THEN                ) closed.
+  OP_ACCEPT              ) These are Perl 5.10's "backtracking control   
+  OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
+  OP_FAIL                ) parentheses, it may be preceded by one or more
+  OP_PRUNE               ) OP_CLOSE, followed by a 2-byte number,
+  OP_SKIP                ) indicating which parentheses must be closed.
+  
+
+Backtracking control verbs with data
+------------------------------------
+ 
+OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the
+start of the current branch.
+
+OP_MARK is followed by the mark name, preceded by a one-byte length, and 
+followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments, 
+the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first 
+two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE 
+offset value.
   
 
 Repeating single characters
@@ -419,4 +437,4 @@ at compile time, and so does not cause anything to be put into the compiled
 data.
 
 Philip Hazel
-October 2009
+October 2010
diff --git a/configure.ac b/configure.ac
index 3f3d3a9..242e1b0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -9,9 +9,9 @@ dnl The PCRE_PRERELEASE feature is for identifying release candidates. It might
 dnl be defined as -RC2, for example. For real releases, it should be empty.
 
 m4_define(pcre_major, [8])
-m4_define(pcre_minor, [10])
-m4_define(pcre_prerelease, [])
-m4_define(pcre_date, [2010-06-25])
+m4_define(pcre_minor, [11])
+m4_define(pcre_prerelease, [-RC1])
+m4_define(pcre_date, [2010-10-09])
 
 # Libtool shared library interface versions (current:revision:age)
 m4_define(libpcre_version, [0:1:0])
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 64802f9..63f3df8 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -2630,10 +2630,10 @@ matching name is found, normal "bumpalong" of one character happens (the
 .sp
   (*THEN) or (*THEN:NAME)
 .sp
-This verb causes a skip to the next alternation if the rest of the pattern does
-not match. That is, it cancels pending backtracking, but only within the
-current alternation. Its name comes from the observation that it can be used
-for a pattern-based if-then-else block:
+This verb causes a skip to the next alternation in the innermost enclosing 
+group if the rest of the pattern does not match. That is, it cancels pending
+backtracking, but only within the current alternation. Its name comes from the
+observation that it can be used for a pattern-based if-then-else block:
 .sp
   ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
 .sp
@@ -2666,6 +2666,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 18 May 2010
+Last updated: 10 October 2010
 Copyright (c) 1997-2010 University of Cambridge.
 .fi
diff --git a/pcre_compile.c b/pcre_compile.c
index f1dc714..9ad335f 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -1724,9 +1724,12 @@ for (;;)
       case OP_MARK:
       case OP_PRUNE_ARG:
       case OP_SKIP_ARG:
-      case OP_THEN_ARG:
       code += code[1];
       break;
+
+      case OP_THEN_ARG:
+      code += code[1+LINK_SIZE];
+      break;
       }
 
     /* Add in the fixed length from the table */
@@ -1827,9 +1830,12 @@ for (;;)
       case OP_MARK:
       case OP_PRUNE_ARG:
       case OP_SKIP_ARG:
-      case OP_THEN_ARG:
       code += code[1];
       break;
+
+      case OP_THEN_ARG:
+      code += code[1+LINK_SIZE];
+      break;
       }
 
     /* Add in the fixed length from the table */
@@ -2105,10 +2111,13 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
     case OP_MARK:
     case OP_PRUNE_ARG:
     case OP_SKIP_ARG:
-    case OP_THEN_ARG:
     code += code[1];
     break;
 
+    case OP_THEN_ARG:
+    code += code[1+LINK_SIZE];
+    break;
+
     /* None of the remaining opcodes are required to match a character. */
 
     default:
@@ -4808,7 +4817,12 @@ for (;; ptr++)
               *errorcodeptr = ERR66;
               goto FAILED;
               }
-            *code++ = verbs[i].op;
+            *code = verbs[i].op;
+            if (*code++ == OP_THEN)
+              {
+              PUT(code, 0, code - bcptr->current_branch - 1);
+              code += LINK_SIZE; 
+              }  
             }
 
           else
@@ -4818,7 +4832,12 @@ for (;; ptr++)
               *errorcodeptr = ERR59;
               goto FAILED;
               }
-            *code++ = verbs[i].op_arg;
+            *code = verbs[i].op_arg;
+            if (*code++ == OP_THEN_ARG)
+              {
+              PUT(code, 0, code - bcptr->current_branch - 1);
+              code += LINK_SIZE; 
+              }  
             *code++ = arglen;
             memcpy(code, arg, arglen);
             code += arglen;
diff --git a/pcre_exec.c b/pcre_exec.c
index 8029bef..32f8c5a 100644
--- a/pcre_exec.c
+++ b/pcre_exec.c
@@ -748,18 +748,25 @@ for (;;)
 
     md->start_match_ptr = ecode + 2;
     RRETURN(MATCH_SKIP_ARG);
+    
+    /* For THEN (and THEN_ARG) we pass back the address of the bracket or
+    the alt that is at the start of the current branch. This makes it possible 
+    to skip back past alternatives that precede the THEN within the current 
+    branch. */ 
 
     case OP_THEN:
     RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
       ims, eptrb, flags, RM54);
     if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    md->start_match_ptr = ecode - GET(ecode, 1);
     MRRETURN(MATCH_THEN);
 
     case OP_THEN_ARG:
-    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
-      ims, eptrb, flags, RM58);
+    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE], 
+      offset_top, md, ims, eptrb, flags, RM58);
     if (rrc != MATCH_NOMATCH) RRETURN(rrc);
-    md->mark = ecode + 2;
+    md->start_match_ptr = ecode - GET(ecode, 1);
+    md->mark = ecode + LINK_SIZE + 2;
     RRETURN(MATCH_THEN);
 
     /* Handle a capturing bracket. If there is space in the offset vector, save
@@ -804,7 +811,9 @@ for (;;)
         {
         RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
           ims, eptrb, flags, RM1);
-        if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+        if (rrc != MATCH_NOMATCH &&
+            (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+          RRETURN(rrc);
         md->capture_last = save_capture_last;
         ecode += GET(ecode, 1);
         }
@@ -865,7 +874,9 @@ for (;;)
 
       RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, ims,
         eptrb, flags, RM2);
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
       ecode += GET(ecode, 1);
       }
     /* Control never reaches here. */
@@ -1066,7 +1077,8 @@ for (;;)
         ecode += 1 + LINK_SIZE + GET(ecode, LINK_SIZE + 2);
         while (*ecode == OP_ALT) ecode += GET(ecode, 1);
         }
-      else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+      else if (rrc != MATCH_NOMATCH &&
+              (rrc != MATCH_THEN || md->start_match_ptr != ecode))
         {
         RRETURN(rrc);         /* Need braces because of following else */
         }
@@ -1194,7 +1206,9 @@ for (;;)
         mstart = md->start_match_ptr;   /* In case \K reset it */
         break;
         }
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
       ecode += GET(ecode, 1);
       }
     while (*ecode == OP_ALT);
@@ -1228,7 +1242,9 @@ for (;;)
         do ecode += GET(ecode,1); while (*ecode == OP_ALT);
         break;
         }
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
       ecode += GET(ecode,1);
       }
     while (*ecode == OP_ALT);
@@ -1365,7 +1381,8 @@ for (;;)
             (pcre_free)(new_recursive.offset_save);
           MRRETURN(MATCH_MATCH);
           }
-        else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+        else if (rrc != MATCH_NOMATCH &&
+                (rrc != MATCH_THEN || md->start_match_ptr != ecode))
           {
           DPRINTF(("Recursion gave error %d\n", rrc));
           if (new_recursive.offset_save != stacksave)
@@ -1408,7 +1425,9 @@ for (;;)
         mstart = md->start_match_ptr;
         break;
         }
-      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH &&
+          (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+        RRETURN(rrc);
       ecode += GET(ecode,1);
       }
     while (*ecode == OP_ALT);
diff --git a/pcre_internal.h b/pcre_internal.h
index e293602..e84667f 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -1514,8 +1514,9 @@ in UTF-8 mode. The code that uses this table must know about such things. */
   3, 3,                          /* RREF, NRREF                            */ \
   1,                             /* DEF                                    */ \
   1, 1,                          /* BRAZERO, BRAMINZERO                    */ \
-  3, 1, 3,                       /* MARK, PRUNE, PRUNE_ARG,                */ \
-  1, 3, 1, 3,                    /* SKIP, SKIP_ARG, THEN, THEN_ARG,        */ \
+  3, 1, 3,                       /* MARK, PRUNE, PRUNE_ARG                 */ \
+  1, 3,                          /* SKIP, SKIP_ARG                         */ \
+  1+LINK_SIZE, 3+LINK_SIZE,      /* THEN, THEN_ARG                         */ \
   1, 1, 1, 3, 1                  /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO  */
 
 
diff --git a/pcre_printint.src b/pcre_printint.src
index 49d9317..cbaa3ef 100644
--- a/pcre_printint.src
+++ b/pcre_printint.src
@@ -537,10 +537,25 @@ for(;;)
     case OP_MARK:
     case OP_PRUNE_ARG:
     case OP_SKIP_ARG:
-    case OP_THEN_ARG:
     fprintf(f, "    %s %s", OP_names[*code], code + 2);
     extra += code[1];
     break;
+    
+    case OP_THEN:
+    if (print_lengths) 
+      fprintf(f, "    %s %d", OP_names[*code], GET(code, 1)); 
+    else   
+      fprintf(f, "    %s", OP_names[*code]); 
+    break;  
+
+    case OP_THEN_ARG:
+    if (print_lengths) 
+      fprintf(f, "    %s %d %s", OP_names[*code], GET(code, 1),
+        code + 2 + LINK_SIZE);
+    else     
+      fprintf(f, "    %s %s", OP_names[*code], code + 2 + LINK_SIZE);
+    extra += code[1+LINK_SIZE];
+    break;
 
     /* Anything else is just an item with no data*/
 
diff --git a/pcre_study.c b/pcre_study.c
index f5e0c3a..be321fa 100644
--- a/pcre_study.c
+++ b/pcre_study.c
@@ -419,10 +419,13 @@ for (;;)
     case OP_MARK:
     case OP_PRUNE_ARG:
     case OP_SKIP_ARG:
-    case OP_THEN_ARG:
     cc += _pcre_OP_lengths[op] + cc[1];
     break;
 
+    case OP_THEN_ARG:
+    cc += _pcre_OP_lengths[op] + cc[1+LINK_SIZE];
+    break;
+
     /* For the record, these are the opcodes that are matched by "default":
     OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
     OP_THEN. */
diff --git a/testdata/testinput10 b/testdata/testinput10
index 99afab8..7210cc5 100644
--- a/testdata/testinput10
+++ b/testdata/testinput10
@@ -132,4 +132,6 @@ is required for these tests. --/
 
 /[[:^alpha:]\S]+/8WB
 
+/abc(d|e)(*THEN)x(123(*THEN)4|567(b|q)(*THEN)xx)/B
+
 /-- End of testinput10 --/
diff --git a/testdata/testinput2 b/testdata/testinput2
index cc94bed..b999d8f 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -3464,22 +3464,22 @@ with \Y. ---/
     abcde
 
 /A\NB./BZ
-  ACBD
-  ** Failers
-  A\nB
-  ACB\n   
+    ACBD
+    *** Failers
+    A\nB
+    ACB\n   
 
 /A\NB./sBZ
-  ACBD
-  ACB\n 
-  ** Failers
-  A\nB  
+    ACBD
+    ACB\n 
+    *** Failers
+    A\nB  
   
 /A\NB/<crlf>
-  A\nB
-  A\rB
-  ** Failers
-  A\r\nB    
+    A\nB
+    A\rB
+    ** Failers
+    A\r\nB    
 
 /\R+b/BZ
 
@@ -3491,4 +3491,12 @@ with \Y. ---/
 
 /\s*\R/BZ
 
+/-- Perl treats this one differently, not failing the second string. I believe
+    that is a bug in Perl. --/
+
+/^((abc|abcx)(*THEN)y|abcd)/
+    abcd
+    *** Failers 
+    abcxy 
+
 /-- End of testinput2 --/
diff --git a/testdata/testoutput10 b/testdata/testoutput10
index 4994584..b88474c 100644
--- a/testdata/testoutput10
+++ b/testdata/testoutput10
@@ -707,4 +707,33 @@ Memory allocation (code space): 40
  18     End
 ------------------------------------------------------------------
 
+/abc(d|e)(*THEN)x(123(*THEN)4|567(b|q)(*THEN)xx)/B
+------------------------------------------------------------------
+  0  79 Bra
+  3     abc
+  9   7 CBra 1
+ 14     d
+ 16   5 Alt
+ 19     e
+ 21  12 Ket
+ 24     *THEN 24
+ 27     x
+ 29  16 CBra 2
+ 34     123
+ 40     *THEN 11
+ 43     4
+ 45  31 Alt
+ 48     567
+ 54   7 CBra 3
+ 59     b
+ 61   5 Alt
+ 64     q
+ 66  12 Ket
+ 69     *THEN 24
+ 72     xx
+ 76  47 Ket
+ 79  79 Ket
+ 82     End
+------------------------------------------------------------------
+
 /-- End of testinput10 --/
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index 2baa6e9..b2e2e9a 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -11043,13 +11043,13 @@ No match
         Ket
         End
 ------------------------------------------------------------------
-  ACBD
+    ACBD
  0: ACBD
-  ** Failers
+    *** Failers
 No match
-  A\nB
+    A\nB
 No match
-  ACB\n   
+    ACB\n   
 No match
 
 /A\NB./sBZ
@@ -11062,23 +11062,23 @@ No match
         Ket
         End
 ------------------------------------------------------------------
-  ACBD
+    ACBD
  0: ACBD
-  ACB\n 
+    ACB\n 
  0: ACB\x0a
-  ** Failers
+    *** Failers
 No match
-  A\nB  
+    A\nB  
 No match
   
 /A\NB/<crlf>
-  A\nB
+    A\nB
  0: A\x0aB
-  A\rB
+    A\rB
  0: A\x0dB
-  ** Failers
+    ** Failers
 No match
-  A\r\nB    
+    A\r\nB    
 No match
 
 /\R+b/BZ
@@ -11126,4 +11126,16 @@ No match
         End
 ------------------------------------------------------------------
 
+/-- Perl treats this one differently, not failing the second string. I believe
+    that is a bug in Perl. --/
+
+/^((abc|abcx)(*THEN)y|abcd)/
+    abcd
+ 0: abcd
+ 1: abcd
+    *** Failers 
+No match
+    abcxy 
+No match
+
 /-- End of testinput2 --/
author	ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>	2010-10-10 16:24:11 +0000
committer	ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>	2010-10-10 16:24:11 +0000
commit	a0ab6e7fdaff6d0b459689f4bc5c79792cfd85cd (patch)
tree	5f5c6f147d3d5d3c2c06d64cfe7d6ac413991c59
parent	c9a0ce9d1827d3416a549f1fe8ba46f081ccfec2 (diff)
download	pcre-a0ab6e7fdaff6d0b459689f4bc5c79792cfd85cd.tar.gz