author | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2008-04-10 19:55:57 +0000 |
---|---|---|
committer | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2008-04-10 19:55:57 +0000 |
commit | abbf313ccbce42c4e297c87550813203ec3cf81e | |
tree | 3aaf3c5d4adf3580b3e84b818b805ebf65f041a0 | |
parent | 049881b7af634cd0c59d5053c169642c0a06b1de | |
Add Oniguruma syntax \g<...> and \g'...' for subroutine calls.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@333 2f5784b3-3f2a-0410-8824-cb99058d5e15
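As a rough illustration of the distinction this commit introduces, the sketch below classifies what follows `\g` in a pattern. It is not PCRE code: `g_kind` and `classify_g` are hypothetical names, and the function only looks at the opening delimiter, whereas the real compiler also validates the closing delimiter and the name or number inside it.

```c
#include <ctype.h>

/* Hypothetical result type for this sketch; not a PCRE type. */
typedef enum { G_BACKREF, G_SUBROUTINE, G_INVALID } g_kind;

/* after_g points at the character that follows "\g" in the pattern. */
static g_kind classify_g(const char *after_g)
{
    const char *p = after_g;

    /* Oniguruma syntax: \g<...> and \g'...' are subroutine calls. */
    if (*p == '<' || *p == '\'') return G_SUBROUTINE;

    /* Perl syntax: \g{...} is a back reference (by number or by name). */
    if (*p == '{') return G_BACKREF;

    /* Perl syntax: a plain, optionally negative, number is a back reference. */
    if (*p == '-') p++;
    return isdigit((unsigned char)*p) ? G_BACKREF : G_INVALID;
}
```

Under these assumptions, `classify_g("<name>")` and `classify_g("'1'")` give `G_SUBROUTINE`, while `classify_g("{-1}")` and `classify_g("2")` give `G_BACKREF`.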
-rw-r--r-- | ChangeLog | 5 |
-rw-r--r-- | doc/pcrepattern.3 | 46 |
-rw-r--r-- | doc/pcresyntax.3 | 14 |
-rw-r--r-- | pcre_compile.c | 119 |
-rw-r--r-- | pcre_internal.h | 2 |
-rw-r--r-- | testdata/testinput2 | 54 |
-rw-r--r-- | testdata/testoutput2 | 130 |
7 files changed, 344 insertions, 26 deletions
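The relative forms `\g<+n>` and `\g<-n>` count from the position of the reference in the pattern. The sketch below shows one way to turn a signed number into an absolute group number. The helper name `resolve_group` and its parameters are hypothetical, and the arithmetic is an assumption inferred from the test patterns in this commit, not a copy of the code in pcre_compile.c.

```c
/* Hypothetical helper, not part of PCRE.
   sign          '+', '-', or 0 for an absolute number
   n             the number written in the pattern
   groups_opened capturing groups opened so far, left of the reference
   Returns the absolute group number, or 0 if the reference is invalid. */
static int resolve_group(char sign, int n, int groups_opened)
{
    int target;

    if (n == 0) return 0;          /* a numbered reference must not be zero */

    if (sign == '-')
        target = groups_opened - n + 1;   /* -1 is the most recently opened group */
    else if (sign == '+')
        target = groups_opened + n;       /* +1 is the next group to be opened */
    else
        target = n;                       /* absolute number */

    return target > 0 ? target : 0;
}
```

With this reading, `\g<-1>` in `/(^(a|b\g<-1>c))/` resolves to group 2 because two groups are open at that point, and `\g<+1>` in `/(?-i:\g<+1>)(?i:(a))/` resolves to group 1 because no group has been opened yet.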
@@ -45,6 +45,11 @@ Version 7.7 05-Mar-08

 10. Applied Alan Lehotsky's patch to add REG_STARTEND support to the POSIX
     matching function regexec().
+
+11. Added support for the Oniguruma syntax \g<name>, \g<n>, \g'name', \g'n',
+    which, however, unlike Perl's \g{...}, are subroutine calls, not back
+    references. PCRE supports relative numbers with this syntax (I don't think
+    Oniguruma does).


 Version 7.6 28-Jan-08
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 9276863..b2b4024 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -9,7 +9,12 @@ are described in detail below. There is a quick-reference syntax summary in the
 .\" HREF
 \fBpcresyntax\fP
 .\"
-page. Perl's regular expressions are described in its own documentation, and
+page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
+also supports some alternative regular expression syntax (which does not
+conflict with the Perl syntax) in order to provide some compatibility with
+regular expressions in Python, .NET, and Oniguruma.
+.P
+Perl's regular expressions are described in its own documentation, and
 regular expressions in general are covered in a number of books, some of which
 have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
 published by O'Reilly, covers regular expressions in great detail. This
@@ -310,6 +315,20 @@ parenthesized subpatterns.
 .\"
 .
 .
+.SS "Absolute and relative subroutine calls"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
+a number enclosed either in angle brackets or single quotes, is an alternative
+syntax for referencing a subpattern as a "subroutine". Details are discussed
+.\" HTML <a href="#onigurumasubroutines">
+.\" </a>
+later.
+.\"
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
+synonymous. The former is a back reference; the latter is a subroutine call.
+.
+.
 .SS "Generic character types"
 .rs
 .sp
@@ -2023,6 +2042,27 @@ It matches "abcabc". It does not match "abcABC" because the change of
 processing option does not affect the called subpattern.
 .
 .
+.\" HTML <a name="onigurumasubroutines"></a>
+.SH "ONIGURUMA SUBROUTINE SYNTAX"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
+a number enclosed either in angle brackets or single quotes, is an alternative
+syntax for referencing a subpattern as a subroutine, possibly recursively. Here
+are two of the examples used above, rewritten using this syntax:
+.sp
+  (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
+  (sens|respons)e and \eg'1'ibility
+.sp
+PCRE supports an extension to Oniguruma: if a number is preceded by a
+plus or a minus sign it is taken as a relative reference. For example:
+.sp
+  (abc)(?i:\eg<-1>)
+.sp
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
+synonymous. The former is a back reference; the latter is a subroutine call.
+.
+.
 .SH CALLOUTS
 .rs
 .sp
@@ -2192,6 +2232,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 17 September 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 10 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
 .fi
diff --git a/doc/pcresyntax.3 b/doc/pcresyntax.3
index 534441c..151f2d0 100644
--- a/doc/pcresyntax.3
+++ b/doc/pcresyntax.3
@@ -304,6 +304,8 @@ In PCRE, POSIX character set names recognize only ASCII characters. You can use
   (?<!...)       negative look behind
 .sp
 Each top-level branch of a look behind must be of a fixed length.
+.
+.
 .SH "BACKREFERENCES"
 .rs
 .sp
@@ -327,6 +329,14 @@ Each top-level branch of a look behind must be of a fixed length.
   (?-n)          call subpattern by relative number
   (?&name)       call subpattern by name (Perl)
   (?P>name)      call subpattern by name (Python)
+  \eg<name>      call subpattern by name (Oniguruma)
+  \eg'name'      call subpattern by name (Oniguruma)
+  \eg<n>         call subpattern by absolute number (Oniguruma)
+  \eg'n'         call subpattern by absolute number (Oniguruma)
+  \eg<+n>        call subpattern by relative number (PCRE extension)
+  \eg'+n'        call subpattern by relative number (PCRE extension)
+  \eg<-n>        call subpattern by relative number (PCRE extension)
+  \eg'-n'        call subpattern by relative number (PCRE extension)
 .
 .
 .SH "CONDITIONAL PATTERNS"
@@ -418,6 +428,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 14 November 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 09 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
 .fi
diff --git a/pcre_compile.c b/pcre_compile.c
index 2cacfba..aed8190 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -295,8 +295,8 @@ static const char error_texts[] =
   /* 55 */
   "repeating a DEFINE group is not allowed\0"
   "inconsistent NEWLINE options\0"
-  "\\g is not followed by a braced name or an optionally braced non-zero number\0"
-  "(?+ or (?- or (?(+ or (?(- must be followed by a non-zero number\0"
+  "\\g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number\0"
+  "a numbered reference must not be zero\0"
   "(*VERB) with an argument is not supported\0"
   /* 60 */
   "(*VERB) not recognized\0"
@@ -531,14 +531,31 @@ else
     *errorcodeptr = ERR37;
     break;

-    /* \g must be followed by a number, either plain or braced. If positive, it
-    is an absolute backreference. If negative, it is a relative backreference.
-    This is a Perl 5.10 feature. Perl 5.10 also supports \g{name} as a
-    reference to a named group. This is part of Perl's movement towards a
-    unified syntax for back references. As this is synonymous with \k{name}, we
-    fudge it up by pretending it really was \k. */
+    /* \g must be followed by one of a number of specific things:
+
+    (1) A number, either plain or braced. If positive, it is an absolute
+    backreference. If negative, it is a relative backreference. This is a Perl
+    5.10 feature.
+
+    (2) Perl 5.10 also supports \g{name} as a reference to a named group. This
+    is part of Perl's movement towards a unified syntax for back references. As
+    this is synonymous with \k{name}, we fudge it up by pretending it really
+    was \k.
+
+    (3) For Oniguruma compatibility we also support \g followed by a name or a
+    number either in angle brackets or in single quotes. However, these are
+    (possibly recursive) subroutine calls, _not_ backreferences. Just return
+    the -ESC_g code (cf \k). */

     case 'g':
+    if (ptr[1] == '<' || ptr[1] == '\'')
+      {
+      c = -ESC_g;
+      break;
+      }
+
+    /* Handle the Perl-compatible cases */
+
     if (ptr[1] == '{')
       {
       const uschar *p;
@@ -565,17 +582,23 @@ else
     while ((digitab[ptr[1]] & ctype_digit) != 0)
       c = c * 10 + *(++ptr) - '0';

-    if (c < 0)
+    if (c < 0)   /* Integer overflow */
       {
       *errorcodeptr = ERR61;
       break;
       }
-
-    if (c == 0 || (braced && *(++ptr) != '}'))
+
+    if (braced && *(++ptr) != '}')
       {
       *errorcodeptr = ERR57;
       break;
       }
+
+    if (c == 0)
+      {
+      *errorcodeptr = ERR58;
+      break;
+      }

     if (negated)
       {
@@ -611,7 +634,7 @@ else
       c -= '0';
       while ((digitab[ptr[1]] & ctype_digit) != 0)
         c = c * 10 + *(++ptr) - '0';
-      if (c < 0)
+      if (c < 0)   /* Integer overflow */
         {
         *errorcodeptr = ERR61;
         break;
@@ -4567,7 +4590,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
         references (?P=name) and recursion (?P>name), as well as falling through
         from the Perl recursion syntax (?&name). We also come here from
         the Perl \k<name> or \k'name' back reference syntax and the \k{name}
-        .NET syntax. */
+        .NET syntax, and the Oniguruma \g<...> and \g'...' subroutine syntax. */

         NAMED_REF_OR_RECURSE:
         name = ++ptr;
@@ -4645,6 +4668,15 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
         case '5': case '6': case '7': case '8': case '9':   /* subroutine */
           {
           const uschar *called;
+          terminator = ')';
+
+          /* Come here from the \g<...> and \g'...' code (Oniguruma
+          compatibility). However, the syntax has been checked to ensure that
+          the ... are a (signed) number, so that neither ERR63 nor ERR29 will
+          be called on this path, nor with the jump to OTHER_CHAR_AFTER_QUERY
+          ever be taken. */
+
+          HANDLE_NUMERICAL_RECURSION:

           if ((refsign = *ptr) == '+')
             {
@@ -4666,7 +4698,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
           while((digitab[*ptr] & ctype_digit) != 0)
             recno = recno * 10 + *ptr++ - '0';

-          if (*ptr != ')')
+          if (*ptr != terminator)
             {
             *errorcodeptr = ERR29;
             goto FAILED;
@@ -5062,7 +5094,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
     back references and those types that consume a character may be repeated. We
     can test for values between ESC_b and ESC_Z for the latter; this may have to
     change if any new ones are ever created. */
-
+
     case '\\':
     tempptr = ptr;
     c = check_escape(&ptr, errorcodeptr, cd->bracount, options, FALSE);
@@ -5089,6 +5121,63 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */

       zerofirstbyte = firstbyte;
       zeroreqbyte = reqbyte;
+
+      /* \g<name> or \g'name' is a subroutine call by name and \g<n> or \g'n'
+      is a subroutine call by number (Oniguruma syntax). In fact, the value
+      -ESC_g is returned only for these cases. So we don't need to check for <
+      or ' if the value is -ESC_g. For the Perl syntax \g{n} the value is
+      -ESC_REF+n, and for the Perl syntax \g{name} the result is -ESC_k (as
+      that is a synonym). */
+
+      if (-c == ESC_g)
+        {
+        const uschar *p;
+        terminator = (*(++ptr) == '<')? '>' : '\'';
+
+        /* These two statements stop the compiler for warning about possibly
+        unset variables caused by the jump to HANDLE_NUMERICAL_RECURSION. In
+        fact, because we actually check for a number below, the paths that
+        would actually be in error are never taken. */
+
+        skipbytes = 0;
+        reset_bracount = FALSE;
+
+        /* Test for a name */
+
+        if (ptr[1] != '+' && ptr[1] != '-')
+          {
+          BOOL isnumber = TRUE;
+          for (p = ptr + 1; *p != 0 && *p != terminator; p++)
+            {
+            if ((cd->ctypes[*p] & ctype_digit) == 0) isnumber = FALSE;
+            if ((cd->ctypes[*p] & ctype_word) == 0) break;
+            }
+          if (*p != terminator)
+            {
+            *errorcodeptr = ERR57;
+            break;
+            }
+          if (isnumber)
+            {
+            ptr++;
+            goto HANDLE_NUMERICAL_RECURSION;
+            }
+          is_recurse = TRUE;
+          goto NAMED_REF_OR_RECURSE;
+          }
+
+        /* Test a signed number in angle brackets or quotes. */
+
+        p = ptr + 2;
+        while ((digitab[*p] & ctype_digit) != 0) p++;
+        if (*p != terminator)
+          {
+          *errorcodeptr = ERR57;
+          break;
+          }
+        ptr++;
+        goto HANDLE_NUMERICAL_RECURSION;
+        }

       /* \k<name> or \k'name' is a back reference by name (Perl syntax).
       We also support \k{name} (.NET syntax) */
diff --git a/pcre_internal.h b/pcre_internal.h
index caf7b83..bca9564 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -613,7 +613,7 @@ character, that code will have to change. */

 enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s, ESC_W,
        ESC_w, ESC_dum1, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H, ESC_h,
-       ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_k, ESC_REF };
+       ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_g, ESC_k, ESC_REF };


 /* Opcode table: Starting from 1 (i.e. after OP_END), the values up to
diff --git a/testdata/testinput2 b/testdata/testinput2
index 1a13fa8..4ef241d 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -2589,4 +2589,58 @@ a random value. /Ix

 /[[:a\dz:]]/

+/^(?<name>a|b\g<name>c)/
+    aaaa
+    bacxxx
+    bbaccxxx
+    bbbacccxx
+
+/^(?<name>a|b\g'name'c)/
+    aaaa
+    bacxxx
+    bbaccxxx
+    bbbacccxx
+
+/^(a|b\g<1>c)/
+    aaaa
+    bacxxx
+    bbaccxxx
+    bbbacccxx
+
+/^(a|b\g'1'c)/
+    aaaa
+    bacxxx
+    bbaccxxx
+    bbbacccxx
+
+/^(a|b\g'-1'c)/
+    aaaa
+    bacxxx
+    bbaccxxx
+    bbbacccxx
+
+/(^(a|b\g<-1>c))/
+    aaaa
+    bacxxx
+    bbaccxxx
+    bbbacccxx
+
+/(^(a|b\g<-1'c))/
+
+/(^(a|b\g{-1}))/
+    bacxxx
+
+/(?-i:\g<name>)(?i:(?<name>a))/
+    XaaX
+    XAAX
+
+/(?i:\g<name>)(?-i:(?<name>a))/
+    XaaX
+    ** Failers
+    XAAX
+
+/(?-i:\g<+1>)(?i:(a))/
+    XaaX
+    XAAX
+
 / End of testinput2 /
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index dba227f..ae25cfa 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -8074,13 +8074,13 @@ No match
 Failed: reference to non-existent subpattern at offset 7

 /^(a)\g/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5

 /^(a)\g{0}/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 7
+Failed: a numbered reference must not be zero at offset 8

 /^(a)\g{3/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 8
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 8

 /^(a)\g{4a}/
 Failed: reference to non-existent subpattern at offset 9
@@ -8217,13 +8217,13 @@ No match
 No match

 /x(?-0)y/
-Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5

 /x(?-1)y/
 Failed: reference to non-existent subpattern at offset 5

 /x(?+0)y/
-Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5

 /x(?+1)y/
 Failed: reference to non-existent subpattern at offset 5
@@ -9385,4 +9385,124 @@ Failed: unknown POSIX class name at offset 6
 /[[:a\dz:]]/
 Failed: unknown POSIX class name at offset 3

+/^(?<name>a|b\g<name>c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(?<name>a|b\g'name'c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g<1>c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g'1'c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g'-1'c)/
+    aaaa
+ 0: a
+ 1: a
+    bacxxx
+ 0: bac
+ 1: bac
+    bbaccxxx
+ 0: bbacc
+ 1: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/(^(a|b\g<-1>c))/
+    aaaa
+ 0: a
+ 1: a
+ 2: a
+    bacxxx
+ 0: bac
+ 1: bac
+ 2: bac
+    bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ 2: bbacc
+    bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+ 2: bbbaccc
+
+/(^(a|b\g<-1'c))/
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 15
+
+/(^(a|b\g{-1}))/
+    bacxxx
+No match
+
+/(?-i:\g<name>)(?i:(?<name>a))/
+    XaaX
+ 0: aa
+ 1: a
+    XAAX
+ 0: AA
+ 1: A
+
+/(?i:\g<name>)(?-i:(?<name>a))/
+    XaaX
+ 0: aa
+ 1: a
+    ** Failers
+No match
+    XAAX
+No match
+
+/(?-i:\g<+1>)(?i:(a))/
+    XaaX
+ 0: aa
+ 1: a
+    XAAX
+ 0: AA
+ 1: A
+
 / End of testinput2 /
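To make the subroutine semantics of the new tests concrete, here is a small hand-written matcher for the single pattern `/^(a|b\g<1>c)/`. It is only a sketch: `match_group1` is a hypothetical function written for this one pattern, not how PCRE executes it. Group 1 is "either `a`, or `b` followed by group 1 again followed by `c`", so the subroutine call becomes a plain recursive function call.

```c
/* Hypothetical matcher for group 1 of /^(a|b\g<1>c)/.
   Returns the number of characters matched at s, or -1 if there is no match. */
static int match_group1(const char *s)
{
    /* First alternative: a literal 'a'. */
    if (s[0] == 'a') return 1;

    /* Second alternative: 'b', then group 1 called as a subroutine, then 'c'. */
    if (s[0] == 'b')
    {
        int inner = match_group1(s + 1);
        if (inner >= 0 && s[1 + inner] == 'c') return inner + 2;
    }

    return -1;
}
```

Applied to the subjects in the tests, this returns 1 for `aaaa`, 3 for `bacxxx`, 5 for `bbaccxxx`, and 7 for `bbbacccxx`, matching the lengths of `a`, `bac`, `bbacc`, and `bbbaccc` in the expected output. The back reference form `\g{-1}` in `/(^(a|b\g{-1}))/` does not behave this way, which is why `bacxxx` is reported as "No match" there.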