author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2008-04-10 19:55:57 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2008-04-10 19:55:57 +0000
commit abbf313ccbce42c4e297c87550813203ec3cf81e (patch)
tree 3aaf3c5d4adf3580b3e84b818b805ebf65f041a0
parent 049881b7af634cd0c59d5053c169642c0a06b1de (diff)
download pcre-abbf313ccbce42c4e297c87550813203ec3cf81e.tar.gz
Add Oniguruma syntax \g<...> and \g'...' for subroutine calls.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@333 2f5784b3-3f2a-0410-8824-cb99058d5e15
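The distinction this commit introduces (Perl-style \g forms are back references, Oniguruma-style \g<...> and \g'...' forms are subroutine calls) can be illustrated with a small standalone C sketch. This is not the code from pcre_compile.c: the names g_kind, classify_g, and is_word are hypothetical, and the function only checks the shape of the text that follows \g. It does not reject a zero number, look up whether the referenced group exists, or reproduce PCRE's error codes.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical classification of the text that follows "\g".
   These names are illustrative only; they are not PCRE identifiers. */
typedef enum {
    G_INVALID,     /* malformed, e.g. a missing or mismatched terminator */
    G_BACKREF,     /* Perl forms: \gN, \g-N, \g{N}, \g{-N}, \g{name} */
    G_SUBROUTINE   /* Oniguruma forms: \g<...> and \g'...' */
} g_kind;

/* A "word" character for the purposes of this sketch: letter, digit, '_'. */
static int is_word(int ch)
{
    return isalnum((unsigned char)ch) || ch == '_';
}

/* s points at the first character after "\g". Anything after the
   reference itself (the rest of the pattern) is ignored. */
static g_kind classify_g(const char *s)
{
    char term = 0;              /* 0 means the unbraced Perl form */
    g_kind kind = G_BACKREF;
    size_t n = 0;

    if (*s == '<')       { term = '>';  kind = G_SUBROUTINE; s++; }
    else if (*s == '\'') { term = '\''; kind = G_SUBROUTINE; s++; }
    else if (*s == '{')  { term = '}';  s++; }

    if (*s == '-' || (*s == '+' && kind == G_SUBROUTINE)) {
        /* Signed (relative) number: only digits may follow the sign. */
        s++;
        while (isdigit((unsigned char)s[n])) n++;
    } else if (term == 0) {
        /* The unbraced form takes a plain number only. */
        while (isdigit((unsigned char)s[n])) n++;
    } else {
        /* Delimited form: a name or an unsigned number. */
        while (is_word(s[n])) n++;
    }

    if (n == 0) return G_INVALID;   /* this sketch requires some content */
    if (term != 0 && s[n] != term) return G_INVALID;
    return kind;
}
```

Under these assumptions, classify_g("<name>") and classify_g("'-1'") report a subroutine call, classify_g("{-1}") reports a back reference, and the deliberately malformed classify_g("<-1'") from the new test data is reported as invalid.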
-rw-r--r-- ChangeLog 5
-rw-r--r-- doc/pcrepattern.3 46
-rw-r--r-- doc/pcresyntax.3 14
-rw-r--r-- pcre_compile.c 119
-rw-r--r-- pcre_internal.h 2
-rw-r--r-- testdata/testinput2 54
-rw-r--r-- testdata/testoutput2 130
7 files changed, 344 insertions, 26 deletions
diff --git a/ChangeLog b/ChangeLog
index 2e6e458..d9baa07 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -45,6 +45,11 @@ Version 7.7 05-Mar-08
10. Applied Alan Lehotsky's patch to add REG_STARTEND support to the POSIX
matching function regexec().
+
+11. Added support for the Oniguruma syntax \g<name>, \g<n>, \g'name', \g'n',
+ which, however, unlike Perl's \g{...}, are subroutine calls, not back
+ references. PCRE supports relative numbers with this syntax (I don't think
+ Oniguruma does).
Version 7.6 28-Jan-08
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 9276863..b2b4024 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -9,7 +9,12 @@ are described in detail below. There is a quick-reference syntax summary in the
.\" HREF
\fBpcresyntax\fP
.\"
-page. Perl's regular expressions are described in its own documentation, and
+page. PCRE tries to match Perl syntax and semantics as closely as it can. PCRE
+also supports some alternative regular expression syntax (which does not
+conflict with the Perl syntax) in order to provide some compatibility with
+regular expressions in Python, .NET, and Oniguruma.
+.P
+Perl's regular expressions are described in its own documentation, and
regular expressions in general are covered in a number of books, some of which
have copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
published by O'Reilly, covers regular expressions in great detail. This
@@ -310,6 +315,20 @@ parenthesized subpatterns.
.\"
.
.
+.SS "Absolute and relative subroutine calls"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
+a number enclosed either in angle brackets or single quotes, is an alternative
+syntax for referencing a subpattern as a "subroutine". Details are discussed
+.\" HTML <a href="#onigurumasubroutines">
+.\" </a>
+later.
+.\"
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
+synonymous. The former is a back reference; the latter is a subroutine call.
+.
+.
.SS "Generic character types"
.rs
.sp
@@ -2023,6 +2042,27 @@ It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
.
.
+.\" HTML <a name="onigurumasubroutines"></a>
+.SH "ONIGURUMA SUBROUTINE SYNTAX"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
+a number enclosed either in angle brackets or single quotes, is an alternative
+syntax for referencing a subpattern as a subroutine, possibly recursively. Here
+are two of the examples used above, rewritten using this syntax:
+.sp
+ (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
+ (sens|respons)e and \eg'1'ibility
+.sp
+PCRE supports an extension to Oniguruma: if a number is preceded by a
+plus or a minus sign it is taken as a relative reference. For example:
+.sp
+ (abc)(?i:\eg<-1>)
+.sp
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
+synonymous. The former is a back reference; the latter is a subroutine call.
+.
+.
.SH CALLOUTS
.rs
.sp
@@ -2192,6 +2232,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 17 September 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 10 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
.fi
diff --git a/doc/pcresyntax.3 b/doc/pcresyntax.3
index 534441c..151f2d0 100644
--- a/doc/pcresyntax.3
+++ b/doc/pcresyntax.3
@@ -304,6 +304,8 @@ In PCRE, POSIX character set names recognize only ASCII characters. You can use
(?<!...) negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
+.
+.
.SH "BACKREFERENCES"
.rs
.sp
@@ -327,6 +329,14 @@ Each top-level branch of a look behind must be of a fixed length.
(?-n) call subpattern by relative number
(?&name) call subpattern by name (Perl)
(?P>name) call subpattern by name (Python)
+ \eg<name> call subpattern by name (Oniguruma)
+ \eg'name' call subpattern by name (Oniguruma)
+ \eg<n> call subpattern by absolute number (Oniguruma)
+ \eg'n' call subpattern by absolute number (Oniguruma)
+ \eg<+n> call subpattern by relative number (PCRE extension)
+ \eg'+n' call subpattern by relative number (PCRE extension)
+ \eg<-n> call subpattern by relative number (PCRE extension)
+ \eg'-n' call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
@@ -418,6 +428,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 14 November 2007
-Copyright (c) 1997-2007 University of Cambridge.
+Last updated: 09 April 2008
+Copyright (c) 1997-2008 University of Cambridge.
.fi
diff --git a/pcre_compile.c b/pcre_compile.c
index 2cacfba..aed8190 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -295,8 +295,8 @@ static const char error_texts[] =
/* 55 */
"repeating a DEFINE group is not allowed\0"
"inconsistent NEWLINE options\0"
- "\\g is not followed by a braced name or an optionally braced non-zero number\0"
- "(?+ or (?- or (?(+ or (?(- must be followed by a non-zero number\0"
+ "\\g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number\0"
+ "a numbered reference must not be zero\0"
"(*VERB) with an argument is not supported\0"
/* 60 */
"(*VERB) not recognized\0"
@@ -531,14 +531,31 @@ else
*errorcodeptr = ERR37;
break;
- /* \g must be followed by a number, either plain or braced. If positive, it
- is an absolute backreference. If negative, it is a relative backreference.
- This is a Perl 5.10 feature. Perl 5.10 also supports \g{name} as a
- reference to a named group. This is part of Perl's movement towards a
- unified syntax for back references. As this is synonymous with \k{name}, we
- fudge it up by pretending it really was \k. */
+ /* \g must be followed by one of a number of specific things:
+
+ (1) A number, either plain or braced. If positive, it is an absolute
+ backreference. If negative, it is a relative backreference. This is a Perl
+ 5.10 feature.
+
+ (2) Perl 5.10 also supports \g{name} as a reference to a named group. This
+ is part of Perl's movement towards a unified syntax for back references. As
+ this is synonymous with \k{name}, we fudge it up by pretending it really
+ was \k.
+
+ (3) For Oniguruma compatibility we also support \g followed by a name or a
+ number either in angle brackets or in single quotes. However, these are
+ (possibly recursive) subroutine calls, _not_ backreferences. Just return
+ the -ESC_g code (cf \k). */
case 'g':
+ if (ptr[1] == '<' || ptr[1] == '\'')
+ {
+ c = -ESC_g;
+ break;
+ }
+
+ /* Handle the Perl-compatible cases */
+
if (ptr[1] == '{')
{
const uschar *p;
@@ -565,17 +582,23 @@ else
while ((digitab[ptr[1]] & ctype_digit) != 0)
c = c * 10 + *(++ptr) - '0';
- if (c < 0)
+ if (c < 0) /* Integer overflow */
{
*errorcodeptr = ERR61;
break;
}
-
- if (c == 0 || (braced && *(++ptr) != '}'))
+
+ if (braced && *(++ptr) != '}')
{
*errorcodeptr = ERR57;
break;
}
+
+ if (c == 0)
+ {
+ *errorcodeptr = ERR58;
+ break;
+ }
if (negated)
{
@@ -611,7 +634,7 @@ else
c -= '0';
while ((digitab[ptr[1]] & ctype_digit) != 0)
c = c * 10 + *(++ptr) - '0';
- if (c < 0)
+ if (c < 0) /* Integer overflow */
{
*errorcodeptr = ERR61;
break;
@@ -4567,7 +4590,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
references (?P=name) and recursion (?P>name), as well as falling
through from the Perl recursion syntax (?&name). We also come here from
the Perl \k<name> or \k'name' back reference syntax and the \k{name}
- .NET syntax. */
+ .NET syntax, and the Oniguruma \g<...> and \g'...' subroutine syntax. */
NAMED_REF_OR_RECURSE:
name = ++ptr;
@@ -4645,6 +4668,15 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
case '5': case '6': case '7': case '8': case '9': /* subroutine */
{
const uschar *called;
+ terminator = ')';
+
+ /* Come here from the \g<...> and \g'...' code (Oniguruma
+ compatibility). However, the syntax has been checked to ensure that
+ the ... are a (signed) number, so that neither ERR63 nor ERR29 will
+ be called on this path, nor with the jump to OTHER_CHAR_AFTER_QUERY
+ ever be taken. */
+
+ HANDLE_NUMERICAL_RECURSION:
if ((refsign = *ptr) == '+')
{
@@ -4666,7 +4698,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
while((digitab[*ptr] & ctype_digit) != 0)
recno = recno * 10 + *ptr++ - '0';
- if (*ptr != ')')
+ if (*ptr != terminator)
{
*errorcodeptr = ERR29;
goto FAILED;
@@ -5062,7 +5094,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
back references and those types that consume a character may be repeated.
We can test for values between ESC_b and ESC_Z for the latter; this may
have to change if any new ones are ever created. */
-
+
case '\\':
tempptr = ptr;
c = check_escape(&ptr, errorcodeptr, cd->bracount, options, FALSE);
@@ -5089,6 +5121,63 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
zerofirstbyte = firstbyte;
zeroreqbyte = reqbyte;
+
+ /* \g<name> or \g'name' is a subroutine call by name and \g<n> or \g'n'
+ is a subroutine call by number (Oniguruma syntax). In fact, the value
+ -ESC_g is returned only for these cases. So we don't need to check for <
+ or ' if the value is -ESC_g. For the Perl syntax \g{n} the value is
+ -ESC_REF+n, and for the Perl syntax \g{name} the result is -ESC_k (as
+ that is a synonym). */
+
+ if (-c == ESC_g)
+ {
+ const uschar *p;
+ terminator = (*(++ptr) == '<')? '>' : '\'';
+
+ /* These two statements stop the compiler for warning about possibly
+ unset variables caused by the jump to HANDLE_NUMERICAL_RECURSION. In
+ fact, because we actually check for a number below, the paths that
+ would actually be in error are never taken. */
+
+ skipbytes = 0;
+ reset_bracount = FALSE;
+
+ /* Test for a name */
+
+ if (ptr[1] != '+' && ptr[1] != '-')
+ {
+ BOOL isnumber = TRUE;
+ for (p = ptr + 1; *p != 0 && *p != terminator; p++)
+ {
+ if ((cd->ctypes[*p] & ctype_digit) == 0) isnumber = FALSE;
+ if ((cd->ctypes[*p] & ctype_word) == 0) break;
+ }
+ if (*p != terminator)
+ {
+ *errorcodeptr = ERR57;
+ break;
+ }
+ if (isnumber)
+ {
+ ptr++;
+ goto HANDLE_NUMERICAL_RECURSION;
+ }
+ is_recurse = TRUE;
+ goto NAMED_REF_OR_RECURSE;
+ }
+
+ /* Test a signed number in angle brackets or quotes. */
+
+ p = ptr + 2;
+ while ((digitab[*p] & ctype_digit) != 0) p++;
+ if (*p != terminator)
+ {
+ *errorcodeptr = ERR57;
+ break;
+ }
+ ptr++;
+ goto HANDLE_NUMERICAL_RECURSION;
+ }
/* \k<name> or \k'name' is a back reference by name (Perl syntax).
We also support \k{name} (.NET syntax) */
diff --git a/pcre_internal.h b/pcre_internal.h
index caf7b83..bca9564 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -613,7 +613,7 @@ character, that code will have to change. */
enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
ESC_W, ESC_w, ESC_dum1, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H, ESC_h,
- ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_k, ESC_REF };
+ ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_g, ESC_k, ESC_REF };
/* Opcode table: Starting from 1 (i.e. after OP_END), the values up to
diff --git a/testdata/testinput2 b/testdata/testinput2
index 1a13fa8..4ef241d 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -2589,4 +2589,58 @@ a random value. /Ix
/[[:a\dz:]]/
+/^(?<name>a|b\g<name>c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/^(?<name>a|b\g'name'c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/^(a|b\g<1>c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/^(a|b\g'1'c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/^(a|b\g'-1'c)/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/(^(a|b\g<-1>c))/
+ aaaa
+ bacxxx
+ bbaccxxx
+ bbbacccxx
+
+/(^(a|b\g<-1'c))/
+
+/(^(a|b\g{-1}))/
+ bacxxx
+
+/(?-i:\g<name>)(?i:(?<name>a))/
+ XaaX
+ XAAX
+
+/(?i:\g<name>)(?-i:(?<name>a))/
+ XaaX
+ ** Failers
+ XAAX
+
+/(?-i:\g<+1>)(?i:(a))/
+ XaaX
+ XAAX
+
/ End of testinput2 /
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index dba227f..ae25cfa 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -8074,13 +8074,13 @@ No match
Failed: reference to non-existent subpattern at offset 7
/^(a)\g/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5
/^(a)\g{0}/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 7
+Failed: a numbered reference must not be zero at offset 8
/^(a)\g{3/
-Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 8
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 8
/^(a)\g{4a}/
Failed: reference to non-existent subpattern at offset 9
@@ -8217,13 +8217,13 @@ No match
No match
/x(?-0)y/
-Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5
/x(?-1)y/
Failed: reference to non-existent subpattern at offset 5
/x(?+0)y/
-Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
+Failed: a numbered reference must not be zero at offset 5
/x(?+1)y/
Failed: reference to non-existent subpattern at offset 5
@@ -9385,4 +9385,124 @@ Failed: unknown POSIX class name at offset 6
/[[:a\dz:]]/
Failed: unknown POSIX class name at offset 3
+/^(?<name>a|b\g<name>c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(?<name>a|b\g'name'c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g<1>c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g'1'c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/^(a|b\g'-1'c)/
+ aaaa
+ 0: a
+ 1: a
+ bacxxx
+ 0: bac
+ 1: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+
+/(^(a|b\g<-1>c))/
+ aaaa
+ 0: a
+ 1: a
+ 2: a
+ bacxxx
+ 0: bac
+ 1: bac
+ 2: bac
+ bbaccxxx
+ 0: bbacc
+ 1: bbacc
+ 2: bbacc
+ bbbacccxx
+ 0: bbbaccc
+ 1: bbbaccc
+ 2: bbbaccc
+
+/(^(a|b\g<-1'c))/
+Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 15
+
+/(^(a|b\g{-1}))/
+ bacxxx
+No match
+
+/(?-i:\g<name>)(?i:(?<name>a))/
+ XaaX
+ 0: aa
+ 1: a
+ XAAX
+ 0: AA
+ 1: A
+
+/(?i:\g<name>)(?-i:(?<name>a))/
+ XaaX
+ 0: aa
+ 1: a
+ ** Failers
+No match
+ XAAX
+No match
+
+/(?-i:\g<+1>)(?i:(a))/
+ XaaX
+ 0: aa
+ 1: a
+ XAAX
+ 0: AA
+ 1: A
+
/ End of testinput2 /