author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-11-10 19:04:34 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-11-10 19:04:34 +0000
commit b79cc767bf7081781e78955af3c986c2119bcdd3
tree 583022a943abc9aa76252150bbdba279195cd362
parent 7de890de6074833fd0b0ed433c69a431cd7bf0cb
In /x mode, allow white space before a possessive + character.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1396 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- ChangeLog 4
-rw-r--r-- doc/pcreapi.3 25
-rw-r--r-- doc/pcrecompat.3 8
-rw-r--r-- doc/pcrepattern.3 9
-rw-r--r-- pcre_compile.c 97
-rw-r--r-- testdata/testinput1 20
-rw-r--r-- testdata/testoutput1 26
7 files changed, 140 insertions, 49 deletions
diff --git a/ChangeLog b/ChangeLog
index ebfb0fc..7b806eb 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -173,6 +173,10 @@ Version 8.34 xx-xxxx-201x
36. Perl no longer allows group names to start with digits, so I have made this
change also in PCRE. It simplifies the code a bit.
+
+37. In extended mode, Perl ignores spaces before a + that indicates a
+ possessive quantifier. PCRE allowed a space before the quantifier, but not
+ before the possessive +. It now does.
Version 8.33 28-May-2013
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index 6c2576d..1ec0760 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -651,15 +651,22 @@ documentation.
.sp
PCRE_EXTENDED
.sp
-If this bit is set, white space data characters in the pattern are totally
-ignored except when escaped or inside a character class. White space did not
-used to include the VT character (code 11), because Perl did not treat this
-character as white space. However, Perl changed at release 5.18, so PCRE
-followed at release 8.34, and VT is now treated as white space. PCRE_EXTENDED
-also causes characters between an unescaped # outside a character class and the
-next newline, inclusive, to be ignored. PCRE_EXTENDED is equivalent to
-Perl's /x option, and it can be changed within a pattern by a (?x) option
-setting.
+If this bit is set, most white space characters in the pattern are totally
+ignored except when escaped or inside a character class. However, white space
+is not allowed within sequences such as (?> that introduce various
+parenthesized subpatterns, nor within a numerical quantifier such as {1,3}.
+However, ignorable white space is permitted between an item and a following
+quantifier and between a quantifier and a following + that indicates
+possessiveness.
+.P
+White space did not used to include the VT character (code 11), because Perl
+did not treat this character as white space. However, Perl changed at release
+5.18, so PCRE followed at release 8.34, and VT is now treated as white space.
+.P
+PCRE_EXTENDED also causes characters between an unescaped # outside a character
+class and the next newline, inclusive, to be ignored. PCRE_EXTENDED is
+equivalent to Perl's /x option, and it can be changed within a pattern by a
+(?x) option setting.
.P
Which characters are interpreted as newlines is controlled by the options
passed to \fBpcre_compile()\fP or by a special sequence at the start of the
diff --git a/doc/pcrecompat.3 b/doc/pcrecompat.3
index 1f12cd3..b931efe 100644
--- a/doc/pcrecompat.3
+++ b/doc/pcrecompat.3
@@ -1,4 +1,4 @@
-.TH PCRECOMPAT 3 "05 November 2013" "PCRE 8.34"
+.TH PCRECOMPAT 3 "10 November 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "DIFFERENCES BETWEEN PCRE AND PERL"
@@ -122,8 +122,8 @@ an error is given at compile time.
.P
15. Perl recognizes comments in some places that PCRE does not, for example,
between the ( and ? at the start of a subpattern. If the /x modifier is set,
-Perl allows white space between ( and ? but PCRE never does, even if the
-PCRE_EXTENDED option is set.
+Perl allows white space between ( and ? (though current Perls warn that this is
+deprecated) but PCRE never does, even if the PCRE_EXTENDED option is set.
.P
16. Perl, when in warning mode, gives warnings for character classes such as
[A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE has no
@@ -195,6 +195,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 05 November 2013
+Last updated: 10 November 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 741bb34..367f622 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -273,10 +273,11 @@ In a UTF mode, only ASCII numbers and letters have any special meaning after a
backslash. All other characters (in particular, those whose codepoints are
greater than 127) are treated as literals.
.P
-If a pattern is compiled with the PCRE_EXTENDED option, white space in the
-pattern (other than in a character class) and characters between a # outside
-a character class and the next newline are ignored. An escaping backslash can
-be used to include a white space or # character as part of the pattern.
+If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
+pattern (other than in a character class), and characters between a # outside a
+character class and the next newline, inclusive, are ignored. An escaping
+backslash can be used to include a white space or # character as part of the
+pattern.
.P
If you want to remove the special meaning from a sequence of characters, you
can do so by putting them between \eQ and \eE. This is different from Perl in
diff --git a/pcre_compile.c b/pcre_compile.c
index 903b466..688efe5 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -4446,7 +4446,7 @@ for (;; ptr++)
/* Get next character in the pattern */
c = *ptr;
-
+
/* If we are at the end of a nested substitution, revert to the outer level
string. Nesting only happens one level deep. */
@@ -4548,8 +4548,37 @@ for (;; ptr++)
}
goto NORMAL_CHAR;
}
+ /* Control does not reach here. */
}
+ /* In extended mode, skip white space and comments. We need a loop in order
+ to check for more white space and more comments after a comment. */
+
+ if ((options & PCRE_EXTENDED) != 0)
+ {
+ for (;;)
+ {
+ while (MAX_255(c) && (cd->ctypes[c] & ctype_space) != 0) c = *(++ptr);
+ if (c != CHAR_NUMBER_SIGN) break;
+ ptr++;
+ while (*ptr != CHAR_NULL)
+ {
+ if (IS_NEWLINE(ptr)) /* For non-fixed-length newline cases, */
+ { /* IS_NEWLINE sets cd->nllen. */
+ ptr += cd->nllen;
+ break;
+ }
+ ptr++;
+#ifdef SUPPORT_UTF
+ if (utf) FORWARDCHAR(ptr);
+#endif
+ }
+ c = *ptr; /* Either NULL or the char after a newline */
+ }
+ }
+
+ /* See if the next thing is a quantifier. */
+
is_quantifier =
c == CHAR_ASTERISK || c == CHAR_PLUS || c == CHAR_QUESTION_MARK ||
(c == CHAR_LEFT_CURLY_BRACKET && is_counted_repeat(ptr+1));
@@ -4565,42 +4594,21 @@ for (;; ptr++)
previous_callout = NULL;
}
- /* In extended mode, skip white space and comments. */
-
- if ((options & PCRE_EXTENDED) != 0)
- {
- if (MAX_255(*ptr) && (cd->ctypes[c] & ctype_space) != 0) continue;
- if (c == CHAR_NUMBER_SIGN)
- {
- ptr++;
- while (*ptr != CHAR_NULL)
- {
- if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
- ptr++;
-#ifdef SUPPORT_UTF
- if (utf) FORWARDCHAR(ptr);
-#endif
- }
- if (*ptr != CHAR_NULL) continue;
-
- /* Else fall through to handle end of string */
- c = 0;
- }
- }
-
- /* No auto callout for quantifiers, or while processing property strings that
- are substituted for \w etc in UCP mode. */
+ /* Create auto callout, except for quantifiers, or while processing property
+ strings that are substituted for \w etc in UCP mode. */
if ((options & PCRE_AUTO_CALLOUT) != 0 && !is_quantifier && nestptr == NULL)
{
previous_callout = code;
code = auto_callout(code, ptr, cd);
}
+
+ /* Process the next pattern item. */
switch(c)
{
/* ===================================================================*/
- case 0: /* The branch terminates at string end */
+ case CHAR_NULL: /* The branch terminates at string end */
case CHAR_VERTICAL_LINE: /* or | or ) */
case CHAR_RIGHT_PARENTHESIS:
*firstcharptr = firstchar;
@@ -5445,6 +5453,34 @@ for (;; ptr++)
insert something before it. */
tempcode = previous;
+
+ /* Before checking for a possessive quantifier, we must skip over
+ whitespace and comments in extended mode because Perl allows white space at
+ this point. */
+
+ if ((options & PCRE_EXTENDED) != 0)
+ {
+ const pcre_uchar *p = ptr + 1;
+ for (;;)
+ {
+ while (MAX_255(*p) && (cd->ctypes[*p] & ctype_space) != 0) p++;
+ if (*p != CHAR_NUMBER_SIGN) break;
+ p++;
+ while (*p != CHAR_NULL)
+ {
+ if (IS_NEWLINE(p)) /* For non-fixed-length newline cases, */
+ { /* IS_NEWLINE sets cd->nllen. */
+ p += cd->nllen;
+ break;
+ }
+ p++;
+#ifdef SUPPORT_UTF
+ if (utf) FORWARDCHAR(p);
+#endif
+ } /* Loop for comment characters */
+ } /* Loop for multiple comments */
+ ptr = p - 1; /* Character before the next significant one. */
+ }
/* If the next character is '+', we have a possessive quantifier. This
implies greediness, whatever the setting of the PCRE_UNGREEDY option.
@@ -7752,8 +7788,8 @@ for (;; ptr++)
/* ===================================================================*/
/* Handle a literal character. It is guaranteed not to be whitespace or #
- when the extended flag is set. If we are in UTF-8 mode, it may be a
- multi-byte literal character. */
+ when the extended flag is set. If we are in a UTF mode, it may be a
+ multi-unit literal character. */
default:
NORMAL_CHAR:
@@ -8899,7 +8935,7 @@ else
cd->nl[0] = newline;
}
}
-
+
/* Maximum back reference and backref bitmap. The bitmap records up to 31 back
references to help in deciding whether (.*) can be treated as anchored or not.
*/
@@ -8952,6 +8988,7 @@ outside can help speed up starting point checks. */
ptr += skipatstart;
code = cworkspace;
*code = OP_BRA;
+
(void)compile_regex(cd->external_options, &code, &ptr, &errorcode, FALSE,
FALSE, 0, 0, &firstchar, &firstcharflags, &reqchar, &reqcharflags, NULL,
cd, &length);
diff --git a/testdata/testinput1 b/testdata/testinput1
index 59024eb..b1c3752 100644
--- a/testdata/testinput1
+++ b/testdata/testinput1
@@ -2,7 +2,7 @@
Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
and 32-bit PCRE libraries. --/
-< forbid 8BCDIMWZ<
+< forbid 8BCDIMOWZ<
/the quick brown fox/
the quick brown fox
@@ -5633,4 +5633,22 @@ AbcdCBefgBhiBqz
/^A\o{123}B/
A\123B
+/ ^ a + + b $ /x
+ aaaab
+
+/ ^ a + #comment
+ + b $ /x
+ aaaab
+
+/ ^ a + #comment
+ #comment
+ + b $ /x
+ aaaab
+
+/ ^ (?> a + ) b $ /x
+ aaaab
+
+/ ^ ( a + ) + + \w $ /x
+ aaaab
+
/-- End of testinput1 --/
diff --git a/testdata/testoutput1 b/testdata/testoutput1
index 976d8a7..bc4a661 100644
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@@ -2,7 +2,7 @@
Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
and 32-bit PCRE libraries. --/
-< forbid 8BCDIMWZ<
+< forbid 8BCDIMOWZ<
/the quick brown fox/
the quick brown fox
@@ -9262,4 +9262,28 @@ No match
A\123B
0: ASB
+/ ^ a + + b $ /x
+ aaaab
+ 0: aaaab
+
+/ ^ a + #comment
+ + b $ /x
+ aaaab
+ 0: aaaab
+
+/ ^ a + #comment
+ #comment
+ + b $ /x
+ aaaab
+ 0: aaaab
+
+/ ^ (?> a + ) b $ /x
+ aaaab
+ 0: aaaab
+
+/ ^ ( a + ) + + \w $ /x
+ aaaab
+ 0: aaaab
+ 1: aaaa
+
/-- End of testinput1 --/
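
The skipping logic this commit adds to pcre_compile.c can be illustrated outside PCRE with a minimal sketch. It is not the library's code: skip_extended() and is_possessive() are hypothetical names, and the sketch assumes an 8-bit pattern, "\n" as the only newline sequence, and no UTF handling. It only shows why a loop is needed (white space and further comments may follow a comment) and how the look-ahead for a possessive + works once ignorable text is skipped.

```c
#include <ctype.h>

/* Hypothetical helper, not part of PCRE: skip ignorable white space and
   #-comments starting at p. Returns a pointer to the next significant
   character, or to the terminating NUL. Assumes "\n" is the only newline. */
static const char *skip_extended(const char *p)
{
  for (;;)
  {
    /* isspace() also accepts VT (code 11), as PCRE does from 8.34 on. */
    while (*p != '\0' && isspace((unsigned char)*p)) p++;
    if (*p != '#') break;          /* Not a comment: stop skipping */
    p++;
    while (*p != '\0')             /* Consume the comment */
    {
      if (*p == '\n') { p++; break; }   /* The newline is ignored too */
      p++;
    }
  }                                /* Loop for multiple comments */
  return p;
}

/* Hypothetical helper: quantifier points at a quantifier character such as
   '+'. Returns nonzero if a possessive '+' follows it. In extended mode,
   white space and comments before that '+' are skipped first. */
static int is_possessive(const char *quantifier, int extended)
{
  const char *p = quantifier + 1;
  if (extended) p = skip_extended(p);
  return *p == '+';
}
```

Under these assumptions, is_possessive("+ +b", 1) reports a possessive quantifier, while the same call with extended set to 0 does not, which mirrors the behaviour exercised by the new tests in testdata/testinput1.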