author | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2013-11-10 19:04:34 +0000
committer | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2013-11-10 19:04:34 +0000
commit | b79cc767bf7081781e78955af3c986c2119bcdd3
tree | 583022a943abc9aa76252150bbdba279195cd362
parent | 7de890de6074833fd0b0ed433c69a431cd7bf0cb
download | pcre-b79cc767bf7081781e78955af3c986c2119bcdd3.tar.gz
In /x mode, allow white space before a possessive + character.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1396 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- | ChangeLog | 4
-rw-r--r-- | doc/pcreapi.3 | 25
-rw-r--r-- | doc/pcrecompat.3 | 8
-rw-r--r-- | doc/pcrepattern.3 | 9
-rw-r--r-- | pcre_compile.c | 97
-rw-r--r-- | testdata/testinput1 | 20
-rw-r--r-- | testdata/testoutput1 | 26
7 files changed, 140 insertions, 49 deletions
@@ -173,6 +173,10 @@ Version 8.34 xx-xxxx-201x
 
 36. Perl no longer allows group names to start with digits, so I have made this
     change also in PCRE. It simplifies the code a bit.
+
+37. In extended mode, Perl ignores spaces before a + that indicates a
+    possessive quantifier. PCRE allowed a space before the quantifier, but not
+    before the possessive +. It now does.
 
 
 Version 8.33 28-May-2013
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index 6c2576d..1ec0760 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -651,15 +651,22 @@ documentation.
 .sp
   PCRE_EXTENDED
 .sp
-If this bit is set, white space data characters in the pattern are totally
-ignored except when escaped or inside a character class. White space did not
-used to include the VT character (code 11), because Perl did not treat this
-character as white space. However, Perl changed at release 5.18, so PCRE
-followed at release 8.34, and VT is now treated as white space. PCRE_EXTENDED
-also causes characters between an unescaped # outside a character class and the
-next newline, inclusive, to be ignored. PCRE_EXTENDED is equivalent to
-Perl's /x option, and it can be changed within a pattern by a (?x) option
-setting.
+If this bit is set, most white space characters in the pattern are totally
+ignored except when escaped or inside a character class. However, white space
+is not allowed within sequences such as (?> that introduce various
+parenthesized subpatterns, nor within a numerical quantifier such as {1,3}.
+However, ignorable white space is permitted between an item and a following
+quantifier and between a quantifier and a following + that indicates
+possessiveness.
+.P
+White space did not used to include the VT character (code 11), because Perl
+did not treat this character as white space. However, Perl changed at release
+5.18, so PCRE followed at release 8.34, and VT is now treated as white space.
+.P
+PCRE_EXTENDED also causes characters between an unescaped # outside a character
+class and the next newline, inclusive, to be ignored. PCRE_EXTENDED is
+equivalent to Perl's /x option, and it can be changed within a pattern by a
+(?x) option setting.
 .P
 Which characters are interpreted as newlines is controlled by the options
 passed to \fBpcre_compile()\fP or by a special sequence at the start of the
diff --git a/doc/pcrecompat.3 b/doc/pcrecompat.3
index 1f12cd3..b931efe 100644
--- a/doc/pcrecompat.3
+++ b/doc/pcrecompat.3
@@ -1,4 +1,4 @@
-.TH PCRECOMPAT 3 "05 November 2013" "PCRE 8.34"
+.TH PCRECOMPAT 3 "10 November 2013" "PCRE 8.34"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "DIFFERENCES BETWEEN PCRE AND PERL"
@@ -122,8 +122,8 @@ an error is given at compile time.
 .P
 15. Perl recognizes comments in some places that PCRE does not, for example,
 between the ( and ? at the start of a subpattern. If the /x modifier is set,
-Perl allows white space between ( and ? but PCRE never does, even if the
-PCRE_EXTENDED option is set.
+Perl allows white space between ( and ? (though current Perls warn that this is
+deprecated) but PCRE never does, even if the PCRE_EXTENDED option is set.
 .P
 16. Perl, when in warning mode, gives warnings for character classes such as
 [A-\ed] or [a-[:digit:]]. It then treats the hyphens as literals. PCRE has no
@@ -195,6 +195,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 05 November 2013
+Last updated: 10 November 2013
 Copyright (c) 1997-2013 University of Cambridge.
 .fi
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 741bb34..367f622 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -273,10 +273,11 @@ In a UTF mode, only ASCII numbers and letters have any special meaning after a
 backslash. All other characters (in particular, those whose codepoints are
 greater than 127) are treated as literals.
 .P
-If a pattern is compiled with the PCRE_EXTENDED option, white space in the
-pattern (other than in a character class) and characters between a # outside
-a character class and the next newline are ignored. An escaping backslash can
-be used to include a white space or # character as part of the pattern.
+If a pattern is compiled with the PCRE_EXTENDED option, most white space in the
+pattern (other than in a character class), and characters between a # outside a
+character class and the next newline, inclusive, are ignored. An escaping
+backslash can be used to include a white space or # character as part of the
+pattern.
 .P
 If you want to remove the special meaning from a sequence of characters, you
 can do so by putting them between \eQ and \eE. This is different from Perl in
diff --git a/pcre_compile.c b/pcre_compile.c
index 903b466..688efe5 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -4446,7 +4446,7 @@ for (;; ptr++)
   /* Get next character in the pattern */
 
   c = *ptr;
-
+
   /* If we are at the end of a nested substitution, revert to the outer level
   string. Nesting only happens one level deep. */
 
@@ -4548,8 +4548,37 @@ for (;; ptr++)
         }
       goto NORMAL_CHAR;
       }
+    /* Control does not reach here. */
     }
 
+  /* In extended mode, skip white space and comments. We need a loop in order
+  to check for more white space and more comments after a comment. */
+
+  if ((options & PCRE_EXTENDED) != 0)
+    {
+    for (;;)
+      {
+      while (MAX_255(c) && (cd->ctypes[c] & ctype_space) != 0) c = *(++ptr);
+      if (c != CHAR_NUMBER_SIGN) break;
+      ptr++;
+      while (*ptr != CHAR_NULL)
+        {
+        if (IS_NEWLINE(ptr))         /* For non-fixed-length newline cases, */
+          {                          /* IS_NEWLINE sets cd->nllen. */
+          ptr += cd->nllen;
+          break;
+          }
+        ptr++;
+#ifdef SUPPORT_UTF
+        if (utf) FORWARDCHAR(ptr);
+#endif
+        }
+      c = *ptr;     /* Either NULL or the char after a newline */
+      }
+    }
+
+  /* See if the next thing is a quantifier. */
+
   is_quantifier =
     c == CHAR_ASTERISK || c == CHAR_PLUS || c == CHAR_QUESTION_MARK ||
      (c == CHAR_LEFT_CURLY_BRACKET && is_counted_repeat(ptr+1));
@@ -4565,42 +4594,21 @@ for (;; ptr++)
     previous_callout = NULL;
     }
 
-  /* In extended mode, skip white space and comments. */
-
-  if ((options & PCRE_EXTENDED) != 0)
-    {
-    if (MAX_255(*ptr) && (cd->ctypes[c] & ctype_space) != 0) continue;
-    if (c == CHAR_NUMBER_SIGN)
-      {
-      ptr++;
-      while (*ptr != CHAR_NULL)
-        {
-        if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
-        ptr++;
-#ifdef SUPPORT_UTF
-        if (utf) FORWARDCHAR(ptr);
-#endif
-        }
-      if (*ptr != CHAR_NULL) continue;
-
-      /* Else fall through to handle end of string */
-      c = 0;
-      }
-    }
-
-  /* No auto callout for quantifiers, or while processing property strings that
-  are substituted for \w etc in UCP mode. */
+  /* Create auto callout, except for quantifiers, or while processing property
+  strings that are substituted for \w etc in UCP mode. */
 
   if ((options & PCRE_AUTO_CALLOUT) != 0 && !is_quantifier && nestptr == NULL)
     {
     previous_callout = code;
     code = auto_callout(code, ptr, cd);
     }
+
+  /* Process the next pattern item. */
 
   switch(c)
     {
     /* ===================================================================*/
-    case 0:                        /* The branch terminates at string end */
+    case CHAR_NULL:                /* The branch terminates at string end */
     case CHAR_VERTICAL_LINE:       /* or | or ) */
     case CHAR_RIGHT_PARENTHESIS:
     *firstcharptr = firstchar;
@@ -5445,6 +5453,34 @@ for (;; ptr++)
     insert something before it. */
 
     tempcode = previous;
+
+    /* Before checking for a possessive quantifier, we must skip over
+    whitespace and comments in extended mode because Perl allows white space at
+    this point. */
+
+    if ((options & PCRE_EXTENDED) != 0)
+      {
+      const pcre_uchar *p = ptr + 1;
+      for (;;)
+        {
+        while (MAX_255(*p) && (cd->ctypes[*p] & ctype_space) != 0) p++;
+        if (*p != CHAR_NUMBER_SIGN) break;
+        p++;
+        while (*p != CHAR_NULL)
+          {
+          if (IS_NEWLINE(p))         /* For non-fixed-length newline cases, */
+            {                        /* IS_NEWLINE sets cd->nllen. */
+            p += cd->nllen;
+            break;
+            }
+          p++;
+#ifdef SUPPORT_UTF
+          if (utf) FORWARDCHAR(p);
+#endif
+          }           /* Loop for comment characters */
+        }             /* Loop for multiple comments */
+      ptr = p - 1;    /* Character before the next significant one. */
+      }
 
     /* If the next character is '+', we have a possessive quantifier. This
     implies greediness, whatever the setting of the PCRE_UNGREEDY option.
@@ -7752,8 +7788,8 @@ for (;; ptr++)
 
     /* ===================================================================*/
     /* Handle a literal character. It is guaranteed not to be whitespace or #
-    when the extended flag is set. If we are in UTF-8 mode, it may be a
-    multi-byte literal character. */
+    when the extended flag is set. If we are in a UTF mode, it may be a
+    multi-unit literal character. */
 
     default:
     NORMAL_CHAR:
@@ -8899,7 +8935,7 @@ else
     cd->nl[0] = newline;
     }
   }
-
+
 /* Maximum back reference and backref bitmap. The bitmap records up to 31 back
 references to help in deciding whether (.*) can be treated as anchored or not.
 */
@@ -8952,6 +8988,7 @@ outside can help speed up starting point checks. */
 ptr += skipatstart;
 code = cworkspace;
 *code = OP_BRA;
+
 (void)compile_regex(cd->external_options, &code, &ptr, &errorcode, FALSE,
   FALSE, 0, 0, &firstchar, &firstcharflags, &reqchar, &reqcharflags, NULL,
   cd, &length);
diff --git a/testdata/testinput1 b/testdata/testinput1
index 59024eb..b1c3752 100644
--- a/testdata/testinput1
+++ b/testdata/testinput1
@@ -2,7 +2,7 @@
     Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
     and 32-bit PCRE libraries. --/
 
-< forbid 8BCDIMWZ<
+< forbid 8BCDIMOWZ<
 
 /the quick brown fox/
     the quick brown fox
@@ -5633,4 +5633,22 @@ AbcdCBefgBhiBqz
 /^A\o{123}B/
     A\123B
 
+/ ^ a + + b $ /x
+    aaaab
+
+/ ^ a + #comment
+  + b $ /x
+    aaaab
+
+/ ^ a + #comment
+    #comment
+  + b $ /x
+    aaaab
+
+/ ^ (?> a + ) b $ /x
+    aaaab
+
+/ ^ ( a + ) + + \w $ /x
+    aaaab
+
 /-- End of testinput1 --/
diff --git a/testdata/testoutput1 b/testdata/testoutput1
index 976d8a7..bc4a661 100644
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@@ -2,7 +2,7 @@
     Perl >= 5.10, in non-UTF-8 mode. It should run clean for the 8-bit, 16-bit,
     and 32-bit PCRE libraries. --/
 
-< forbid 8BCDIMWZ<
+< forbid 8BCDIMOWZ<
 
 /the quick brown fox/
     the quick brown fox
@@ -9262,4 +9262,28 @@ No match
     A\123B
  0: ASB
 
+/ ^ a + + b $ /x
+    aaaab
+ 0: aaaab
+
+/ ^ a + #comment
+  + b $ /x
+    aaaab
+ 0: aaaab
+
+/ ^ a + #comment
+    #comment
+  + b $ /x
+    aaaab
+ 0: aaaab
+
+/ ^ (?> a + ) b $ /x
+    aaaab
+ 0: aaaab
+
+/ ^ ( a + ) + + \w $ /x
+    aaaab
+ 0: aaaab
+ 1: aaaa
+
 /-- End of testinput1 --/
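The look-ahead that this patch adds before the possessive + check can be illustrated outside PCRE. The following is only a simplified sketch, not the code from pcre_compile.c: the helper names skip_extended() and is_possessive() are hypothetical, isspace() stands in for the cd->ctypes[] lookup, and a newline is assumed to be a single '\n' rather than whatever IS_NEWLINE() and cd->nllen would report. UTF handling (FORWARDCHAR) is omitted.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): advance past ignorable white
   space and #-comments, looping so that white space and further comments
   after a comment are skipped as well. Assumes '\n' is the only newline. */
static const char *skip_extended(const char *p)
{
  assert(p != NULL);
  for (;;)
    {
    while (*p != '\0' && isspace((unsigned char)*p)) p++;
    if (*p != '#') break;              /* Not a comment: significant char */
    p++;
    while (*p != '\0')                 /* Consume the comment text */
      {
      if (*p == '\n') { p++; break; }  /* The newline is skipped too */
      p++;
      }
    }
  return p;
}

/* Hypothetical helper: q points at the last character of a quantifier.
   Returns 1 if the next significant character is '+', i.e. the quantifier
   is possessive. Ignorable text is skipped only when extended is non-zero,
   mirroring the PCRE_EXTENDED test in the patch. */
static int is_possessive(const char *q, int extended)
{
  const char *p = q + 1;
  if (extended) p = skip_extended(p);
  return *p == '+';
}
```

With these assumptions, is_possessive("+ +b", 1) reports a possessive quantifier, while the same call with extended set to 0 does not, which corresponds to the behaviour that the new tests such as / ^ a + + b $ /x exercise.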