In /x mode, allow white space before a possessive + character.

git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1396 2f5784b3-3f2a-0410-8824-cb99058d5e15
author: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-11-10 19:04:34 +0000
committer: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-11-10 19:04:34 +0000
commit: b79cc767bf7081781e78955af3c986c2119bcdd3 (patch)
tree: 583022a943abc9aa76252150bbdba279195cd362
parent: 7de890de6074833fd0b0ed433c69a431cd7bf0cb (diff)
download: pcre-b79cc767bf7081781e78955af3c986c2119bcdd3.tar.gz
7 files changed, 140 insertions, 49 deletions
diff --git a/ChangeLog b/ChangeLog
index ebfb0fc..7b806eb 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -173,6 +173,10 @@ Version 8.34 xx-xxxx-201x
     
 36. Perl no longer allows group names to start with digits, so I have made this
     change also in PCRE. It simplifies the code a bit. 
+    
+37. In extended mode, Perl ignores spaces before a + that indicates a 
+    possessive quantifier. PCRE allowed a space before the quantifier, but not 
+    before the possessive +. It now does.
 
 
 Version 8.33 28-May-2013
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index 6c2576d..1ec0760 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -651,15 +651,22 @@ documentation.
 .sp
   PCRE_EXTENDED
 .sp
-If this bit is set, white space data characters in the pattern are totally
-ignored except when escaped or inside a character class. White space did not
-used to include the VT character (code 11), because Perl did not treat this 
-character as white space. However, Perl changed at release 5.18, so PCRE
-followed at release 8.34, and VT is now treated as white space. PCRE_EXTENDED
-also causes characters between an unescaped # outside a character class and the
-next newline, inclusive, to be ignored. PCRE_EXTENDED is equivalent to
-Perl's /x option, and it can be changed within a pattern by a (?x) option
-setting.
+If this bit is set, most white space characters in the pattern are totally
+ignored except when escaped or inside a character class. However, white space
+is not allowed within sequences such as (?> that introduce various
+parenthesized subpatterns, nor within a numerical quantifier such as {1,3}.
+However, ignorable white space is permitted between an item and a following
+quantifier and between a quantifier and a following + that indicates 
+possessiveness.
+.P
+White space did not used to include the VT character (code 11), because Perl
+did not treat this character as white space. However, Perl changed at release
+5.18, so PCRE followed at release 8.34, and VT is now treated as white space.
+.P
+PCRE_EXTENDED also causes characters between an unescaped # outside a character
+class and the next newline, inclusive, to be ignored. PCRE_EXTENDED is
+equivalent to Perl's /x option, and it can be changed within a pattern by a
+(?x) option setting.
 .P
 Which characters are interpreted as newlines is controlled by the options
 passed to \fBpcre_compile()\fP or by a special sequence at the start of the
diff --git a/doc/pcrecompat.3 b/doc/pcrecompat.3
index 1f12cd3..b931efe 100644
--- a/doc/pcrecompat.3
+++ b/doc/pcrecompat.3
@@ -1,4 +1,4 @@
-.TH PCRECOMPAT 3 "05 November 2013" "PCRE 8.34"
+.TH PCRECOMPAT 3 "10 November 2013" "PCRE 8.34"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "DIFFERENCES BETWEEN PCRE AND PERL"
@@ -122,8 +122,8 @@ an error is given at compile time.
 .P
 15. Perl recognizes comments in some places that PCRE does not, for example,
 between the ( and ? at the start of a subpattern. If the /x modifier is set,
-Perl allows white space between ( and ? but PCRE never does, even if the
-PCRE_EXTENDED option is set.
+Perl allows white space between ( and ? (though current Perls warn that this is 
+deprecated) but PCRE never does, even if the PCRE_EXTENDED option is set.
 .P
 16. Perl, when in warning mode, gives warnings for character classes such as
 [A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE has no 
@@ -195,6 +195,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 05 November 2013
+Last updated: 10 November 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 741bb34..367f622 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -273,10 +273,11 @@ In a UTF mode, only ASCII numbers and letters have any special meaning after a
 backslash. All other characters (in particular, those whose codepoints are
 greater than 127) are treated as literals.
 .P
-If a pattern is compiled with the PCRE_EXTENDED option, white space in the
-pattern (other than in a character class) and characters between a # outside
-a character class and the next newline are ignored. An escaping backslash can
-be used to include a white space or # character as part of the pattern.
+If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
+pattern (other than in a character class), and characters between a # outside a
+character class and the next newline, inclusive, are ignored. An escaping
+backslash can be used to include a white space or # character as part of the
+pattern.
 .P
 If you want to remove the special meaning from a sequence of characters, you
 can do so by putting them between \eQ and \eE. This is different from Perl in
diff --git a/pcre_compile.c b/pcre_compile.c
index 903b466..688efe5 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -4446,7 +4446,7 @@ for (;; ptr++)
   /* Get next character in the pattern */
 
   c = *ptr;
-
+  
   /* If we are at the end of a nested substitution, revert to the outer level
   string. Nesting only happens one level deep. */
 
@@ -4548,8 +4548,37 @@ for (;; ptr++)
         }
       goto NORMAL_CHAR;
       }
+    /* Control does not reach here. */   
     }
 
+  /* In extended mode, skip white space and comments. We need a loop in order 
+  to check for more white space and more comments after a comment. */
+  
+  if ((options & PCRE_EXTENDED) != 0)
+    {
+    for (;;)
+      {
+      while (MAX_255(c) && (cd->ctypes[c] & ctype_space) != 0) c = *(++ptr);
+      if (c != CHAR_NUMBER_SIGN) break;
+      ptr++;
+      while (*ptr != CHAR_NULL)
+        {
+        if (IS_NEWLINE(ptr))         /* For non-fixed-length newline cases, */
+          {                          /* IS_NEWLINE sets cd->nllen. */         
+          ptr += cd->nllen;            
+          break;
+          }
+        ptr++;
+#ifdef SUPPORT_UTF
+        if (utf) FORWARDCHAR(ptr);
+#endif
+        }
+      c = *ptr;     /* Either NULL or the char after a newline */
+      }   
+    }
+
+  /* See if the next thing is a quantifier. */
+
   is_quantifier =
     c == CHAR_ASTERISK || c == CHAR_PLUS || c == CHAR_QUESTION_MARK ||
     (c == CHAR_LEFT_CURLY_BRACKET && is_counted_repeat(ptr+1));
@@ -4565,42 +4594,21 @@ for (;; ptr++)
     previous_callout = NULL;
     }
 
-  /* In extended mode, skip white space and comments. */
-
-  if ((options & PCRE_EXTENDED) != 0)
-    {
-    if (MAX_255(*ptr) && (cd->ctypes[c] & ctype_space) != 0) continue;
-    if (c == CHAR_NUMBER_SIGN)
-      {
-      ptr++;
-      while (*ptr != CHAR_NULL)
-        {
-        if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
-        ptr++;
-#ifdef SUPPORT_UTF
-        if (utf) FORWARDCHAR(ptr);
-#endif
-        }
-      if (*ptr != CHAR_NULL) continue;
-
-      /* Else fall through to handle end of string */
-      c = 0;
-      }
-    }
-
-  /* No auto callout for quantifiers, or while processing property strings that
-  are substituted for \w etc in UCP mode. */
+  /* Create auto callout, except for quantifiers, or while processing property
+  strings that are substituted for \w etc in UCP mode. */
 
   if ((options & PCRE_AUTO_CALLOUT) != 0 && !is_quantifier && nestptr == NULL)
     {
     previous_callout = code;
     code = auto_callout(code, ptr, cd);
     }
+    
+  /* Process the next pattern item. */
 
   switch(c)
     {
     /* ===================================================================*/
-    case 0:                        /* The branch terminates at string end */
+    case CHAR_NULL:                /* The branch terminates at string end */
     case CHAR_VERTICAL_LINE:       /* or | or ) */
     case CHAR_RIGHT_PARENTHESIS:
     *firstcharptr = firstchar;
@@ -5445,6 +5453,34 @@ for (;; ptr++)
     insert something before it. */
 
     tempcode = previous;
+    
+    /* Before checking for a possessive quantifier, we must skip over 
+    whitespace and comments in extended mode because Perl allows white space at 
+    this point. */
+ 
+    if ((options & PCRE_EXTENDED) != 0)
+      {
+      const pcre_uchar *p = ptr + 1;
+      for (;;)
+        {
+        while (MAX_255(*p) && (cd->ctypes[*p] & ctype_space) != 0) p++;
+        if (*p != CHAR_NUMBER_SIGN) break;
+        p++;
+        while (*p != CHAR_NULL)
+          {
+          if (IS_NEWLINE(p))         /* For non-fixed-length newline cases, */
+            {                        /* IS_NEWLINE sets cd->nllen. */         
+            p += cd->nllen;            
+            break;
+            }
+          p++;
+#ifdef SUPPORT_UTF
+          if (utf) FORWARDCHAR(p);
+#endif
+          }           /* Loop for comment characters */
+        }             /* Loop for multiple comments */
+      ptr = p - 1;    /* Character before the next significant one. */
+      }
 
     /* If the next character is '+', we have a possessive quantifier. This
     implies greediness, whatever the setting of the PCRE_UNGREEDY option.
@@ -7752,8 +7788,8 @@ for (;; ptr++)
 
     /* ===================================================================*/
     /* Handle a literal character. It is guaranteed not to be whitespace or #
-    when the extended flag is set. If we are in UTF-8 mode, it may be a
-    multi-byte literal character. */
+    when the extended flag is set. If we are in a UTF mode, it may be a
+    multi-unit literal character. */
 
     default:
     NORMAL_CHAR:
@@ -8899,7 +8935,7 @@ else
     cd->nl[0] = newline;
     }
   }
-
+  
 /* Maximum back reference and backref bitmap. The bitmap records up to 31 back
 references to help in deciding whether (.*) can be treated as anchored or not.
 */
@@ -8952,6 +8988,7 @@ outside can help speed up starting point checks. */
 ptr += skipatstart;
 code = cworkspace;
 *code = OP_BRA;
+
 (void)compile_regex(cd->external_options, &code, &ptr, &errorcode, FALSE,
   FALSE, 0, 0, &firstchar, &firstcharflags, &reqchar, &reqcharflags, NULL,
   cd, &length);
diff --git a/testdata/testinput1 b/testdata/testinput1
index 59024eb..b1c3752 100644
--- a/testdata/testinput1
+++ b/testdata/testinput1
@@ -2,7 +2,7 @@
     Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
     and 32-bit PCRE libraries. --/
     
-< forbid 8BCDIMWZ< 
+< forbid 8BCDIMOWZ< 
 
 /the quick brown fox/
     the quick brown fox
@@ -5633,4 +5633,22 @@ AbcdCBefgBhiBqz
 /^A\o{123}B/
     A\123B
 
+/ ^ a + + b $ /x
+    aaaab
+    
+/ ^ a + #comment
+  + b $ /x
+    aaaab
+    
+/ ^ a + #comment
+  #comment
+  + b $ /x
+    aaaab
+    
+/ ^ (?> a + ) b $ /x
+    aaaab 
+
+/ ^ ( a + ) + + \w $ /x
+    aaaab 
+
 /-- End of testinput1 --/
diff --git a/testdata/testoutput1 b/testdata/testoutput1
index 976d8a7..bc4a661 100644
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@@ -2,7 +2,7 @@
     Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
     and 32-bit PCRE libraries. --/
     
-< forbid 8BCDIMWZ< 
+< forbid 8BCDIMOWZ< 
 
 /the quick brown fox/
     the quick brown fox
@@ -9262,4 +9262,28 @@ No match
     A\123B
  0: ASB
 
+/ ^ a + + b $ /x
+    aaaab
+ 0: aaaab
+    
+/ ^ a + #comment
+  + b $ /x
+    aaaab
+ 0: aaaab
+    
+/ ^ a + #comment
+  #comment
+  + b $ /x
+    aaaab
+ 0: aaaab
+    
+/ ^ (?> a + ) b $ /x
+    aaaab 
+ 0: aaaab
+
+/ ^ ( a + ) + + \w $ /x
+    aaaab 
+ 0: aaaab
+ 1: aaaa
+
 /-- End of testinput1 --/
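
Illustrative note on the pcre_compile.c change above: once a quantifier has been read in extended mode, the compiler now skips white space and #-comments before checking whether the next significant character is the + that makes the quantifier possessive. The following is a minimal standalone sketch of that idea, not PCRE's code. The helper names skip_ignorable and is_possessive are hypothetical, a newline is assumed to be a single '\n' (PCRE uses IS_NEWLINE and cd->nllen), isspace() stands in for PCRE's ctype tables, and UTF handling is omitted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): advance past white space and
   #-comments, repeating so that white space and further comments that
   follow a comment are skipped too. Returns a pointer to the next
   significant character, or to the terminating NUL. */
static const char *skip_ignorable(const char *p)
{
  assert(p != NULL);
  for (;;)
    {
    while (*p != '\0' && isspace((unsigned char)*p)) p++;
    if (*p != '#') break;          /* Not a comment: stop skipping */
    p++;                           /* Step over the '#' */
    while (*p != '\0')             /* Consume the comment body */
      {
      if (*p == '\n') { p++; break; }   /* Newline is skipped as well */
      p++;
      }
    }
  return p;
}

/* Hypothetical helper: after_quantifier points just past a quantifier.
   Returns 1 if the next character that counts is '+'. In extended mode,
   ignorable text is skipped first; otherwise the '+' must be adjacent. */
static int is_possessive(const char *after_quantifier, int extended)
{
  const char *p = extended ? skip_ignorable(after_quantifier)
                           : after_quantifier;
  return *p == '+';
}
```

With these definitions, is_possessive(" + b", 1) reports a possessive quantifier, while the same text with extended set to 0 does not, which mirrors the behaviour that the new testinput1 cases exercise.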