Re: [PATCH] Initial attempt at named captures for perls regexp engine

Message-ID: <9b18b3110610061016x5ddce965u30d9a821f632d450@mail.gmail.com>
p4raw-id: //depot/perl@28957
author: Yves Orton <demerphq@gmail.com> 2006-10-06 21:16:01 +0200
committer: Rafael Garcia-Suarez <rgarciasuarez@gmail.com> 2006-10-07 14:30:32 +0000
commit: 81714fb9c03d91d66b66cab6e899e81bf64a2ca7 (patch)
tree: 40861dec0355f417fff2a7ff3c393082960066cc /pod/perlre.pod
parent: f5def3a2a0d8913110936f9f4e13e37835754c28 (diff)
download: perl-81714fb9c03d91d66b66cab6e899e81bf64a2ca7.tar.gz
1 files changed, 104 insertions, 38 deletions
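
Reviewer's sketch (not part of the patch): the short script below exercises the behavior this documentation change describes, assuming a perl of 5.10 or later. The sub names first_doubled, numbering_demo, and shared_name are hypothetical, chosen only for illustration. The patterns come from the new POD text, except the duplicate-name pattern in shared_name, which is an illustrative assumption.

```perl
use strict;
use warnings;
use 5.010;    # named captures and %+ need perl 5.10 or later

# Named capture plus named backreference: return the first doubled character.
sub first_doubled {
    my ($text) = @_;
    return $text =~ /(?<char>.)\k<char>/ ? $+{char} : undef;
}

# Named buffers are numbered like ordinary ones, so $+{foo} is also $2.
sub numbering_demo {
    my ($text) = @_;
    return unless $text =~ /(x)(?<foo>y)(z)/;
    return ($+{foo}, $2, $3);
}

# Two buffers share the name "n"; %+ and \k<n> use the leftmost defined one.
sub shared_name {
    my ($text) = @_;
    return $text =~ /(?:(?<n>a)|(?<n>b))\k<n>/ ? $+{n} : undef;
}

print first_doubled("bookkeeper"), "\n";
print join(",", numbering_demo("xyz")), "\n";
print shared_name("xbb"), "\n";
```

On "xbb" the first alternative never participates in the match, so the name resolves to the second buffer. This is the case the new text describes as otherwise requiring C<(??{})>.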
diff --git a/pod/perlre.pod b/pod/perlre.pod
index c4dd7c517a..7cc5decc22 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -191,20 +191,26 @@ X<metacharacter>
 X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
 X<word> X<whitespace>
 
-    \w	Match a "word" character (alphanumeric plus "_")
-    \W	Match a non-"word" character
-    \s	Match a whitespace character
-    \S	Match a non-whitespace character
-    \d	Match a digit character
-    \D	Match a non-digit character
-    \pP	Match P, named property.  Use \p{Prop} for longer names.
-    \PP	Match non-P
-    \X	Match eXtended Unicode "combining character sequence",
-        equivalent to (?:\PM\pM*)
-    \C	Match a single C char (octet) even under Unicode.
-	NOTE: breaks up characters into their UTF-8 bytes,
-	so you may end up with malformed pieces of UTF-8.
-	Unsupported in lookbehind.
+    \w	     Match a "word" character (alphanumeric plus "_")
+    \W	     Match a non-"word" character
+    \s	     Match a whitespace character
+    \S	     Match a non-whitespace character
+    \d	     Match a digit character
+    \D	     Match a non-digit character
+    \pP	     Match P, named property.  Use \p{Prop} for longer names.
+    \PP	     Match non-P
+    \X	     Match eXtended Unicode "combining character sequence",
+             equivalent to (?:\PM\pM*)
+    \C	     Match a single C char (octet) even under Unicode.
+	     NOTE: breaks up characters into their UTF-8 bytes,
+	     so you may end up with malformed pieces of UTF-8.
+	     Unsupported in lookbehind.
+    \1       Backreference to a specific group.
+             '1' may actually be any positive integer.
+    \k<name> Named backreference
+    \N{name} Named Unicode character, or Unicode escape.
+    \x12     Hexadecimal escape sequence
+    \x{1234} Long hexadecimal escape sequence
 
 A C<\w> matches a single alphanumeric character (an alphabetic
 character, or a decimal digit) or C<_>, not a whole word.  Use C<\w+>
@@ -403,7 +409,7 @@ X<\G>
 The bracketing construct C<( ... )> creates capture buffers.  To
 refer to the digit'th buffer use \<digit> within the
 match.  Outside the match use "$" instead of "\".  (The
-\<digit> notation works in certain circumstances outside 
+\<digit> notation works in certain circumstances outside
 the match.  See the warning below about \1 vs $1 for details.)
 Referring back to another part of the match is called a
 I<backreference>.
@@ -414,20 +420,38 @@ There is no limit to the number of captured substrings that you may
 use.  However Perl also uses \10, \11, etc. as aliases for \010,
 \011, etc.  (Recall that 0 means octal, so \011 is the character at
 number 9 in your coded character set; which would be the 10th character,
-a horizontal tab under ASCII.)  Perl resolves this 
-ambiguity by interpreting \10 as a backreference only if at least 10 
-left parentheses have opened before it.  Likewise \11 is a 
-backreference only if at least 11 left parentheses have opened 
-before it.  And so on.  \1 through \9 are always interpreted as 
+a horizontal tab under ASCII.)  Perl resolves this
+ambiguity by interpreting \10 as a backreference only if at least 10
+left parentheses have opened before it.  Likewise \11 is a
+backreference only if at least 11 left parentheses have opened
+before it.  And so on.  \1 through \9 are always interpreted as
 backreferences.
 
+Additionally, as of Perl 5.10 you may use named capture buffers and named
+backreferences. The notation is C<< (?<name>...) >> and C<< \k<name> >>
+(you may also use single quotes instead of angle brackets to quote the
+name). The only differences between named and unnamed capture buffers are
+that multiple buffers may have the same name and that the contents of
+named capture buffers are available via the C<%+> hash. When multiple
+groups share the same name, C<$+{name}> and C<< \k<name> >> refer to the
+leftmost defined group, so it is possible to do things with named capture
+buffers that would otherwise require C<(??{})> code to accomplish. Named
+capture buffers are numbered just as normal capture buffers are and may be
+referenced via the magic numeric variables or via numeric backreferences
+as well as by name.
+
 Examples:
 
     s/^([^ ]*) *([^ ]*)/$2 $1/;     # swap first two words
 
-     if (/(.)\1/) {                 # find first doubled char
-         print "'$1' is the first doubled character\n";
-     }
+    /(.)\1/                         # find first doubled char
+         and print "'$1' is the first doubled character\n";
+
+    /(?<char>.)\k<char>/            # ... a different way
+         and print "'$+{char}' is the first doubled character\n";
+
+    /(?<char>.)\1/                  # ... mix and match
+         and print "'$1' is the first doubled character\n";
 
     if (/Time: (..):(..):(..)/) {   # parse out values
 	$hours = $1;
@@ -443,7 +467,7 @@ everything before the matched string.  C<$'> returns everything
 after the matched string. And C<$^N> contains whatever was matched by
 the most-recently closed group (submatch). C<$^N> can be used in
 extended patterns (see below), for example to assign a submatch to a
-variable. 
+variable.
 X<$+> X<$^N> X<$&> X<$`> X<$'>
 
 The numbered match variables ($1, $2, $3, etc.) and the related punctuation
@@ -620,6 +644,48 @@ A zero-width negative look-behind assertion.  For example C</(?<!bar)foo/>
 matches any occurrence of "foo" that does not follow "bar".  Works
 only for fixed-width look-behind.
 
+=item C<(?'NAME'pattern)>
+
+=item C<< (?<NAME>pattern) >>
+X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
+
+A named capture buffer. Identical in every respect to normal capturing
+parens C<()> but for the additional fact that C<%+> may be used after
+a successful match to refer to a named buffer. See C<perlvar> for more
+details on the C<%+> hash.
+
+If multiple distinct capture buffers have the same name then
+C<$+{NAME}> will refer to the leftmost defined buffer in the match.
+
+The forms C<(?'NAME'pattern)> and C<< (?<NAME>pattern) >> are equivalent.
+
+B<NOTE:> While the notation of this construct is the same as the similar
+function in .NET regexes, the behavior is not. In Perl the buffers are
+numbered sequentially regardless of being named or not. Thus in the
+pattern
+
+  /(x)(?<foo>y)(z)/
+
+$+{foo} will be the same as $2, and $3 will contain 'z' instead of
+the opposite which is what a .NET regex hacker might expect.
+
+Currently NAME is restricted to word chars only. In other words, it
+must match C</^\w+$/>.
+
+=item C<< \k<name> >>
+
+=item C<< \k'name' >>
+
+Named backreference. Similar to numeric backreferences, except that
+the group is designated by name and not number. If multiple groups
+have the same name then it refers to the leftmost defined group in
+the current match.
+
+It is an error to refer to a name not defined by a C<< (?<NAME>) >>
+earlier in the pattern.
+
+Both forms are equivalent.
+
 =item C<(?{ code })>
 X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
 
@@ -726,7 +792,7 @@ Thus,
 
     ('a' x 100)=~/(??{'(.)' x 100})/
 
-B<will> match, it will B<not> set $1. 
+B<will> match, it will B<not> set $1.
 
 The C<code> is not interpolated.  As before, the rules to determine
 where the C<code> ends are currently somewhat convoluted.
@@ -762,21 +828,21 @@ X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
 B<WARNING>:  This extended regular expression feature is considered
 highly experimental, and may be changed or deleted without notice.
 
-Similar to C<(??{ code })> except it does not involve compiling any code, 
-instead it treats the contents of a capture buffer as an independent 
+Similar to C<(??{ code })> except it does not involve compiling any code,
+instead it treats the contents of a capture buffer as an independent
 pattern that must match at the current position.  Capture buffers
-contained by the pattern will have the value as determined by the 
+contained by the pattern will have the value as determined by the
 outermost recursion.
 
 PARNO is a sequence of digits not starting with 0 whose value
-reflects the paren-number of the capture buffer to recurse to. 
+reflects the paren-number of the capture buffer to recurse to.
 C<(?R)> curses to the beginning of the pattern.
 
-The following pattern matches a function foo() which may contain 
-balanced parenthesis as the argument. 
+The following pattern matches a function foo() which may contain
+balanced parentheses as the argument.
 
   $re = qr{ (                    # paren group 1 (full function)
-              foo              
+              foo
               (                  # paren group 2 (parens)
                 \(
                   (              # paren group 3 (contents of parens)
@@ -802,18 +868,18 @@ the output produced should be the following:
 
     $1 = foo(bar(baz)+baz(bop))
     $2 = (bar(baz)+baz(bop))
-    $3 = bar(baz)+baz(bop)      
+    $3 = bar(baz)+baz(bop)
 
-If there is no corresponding capture buffer defined, then it is a 
+If there is no corresponding capture buffer defined, then it is a
 fatal error.  Recursing deeper than 50 times without consuming any input
-string will also result in a fatal error.  The maximum depth is compiled 
+string will also result in a fatal error.  The maximum depth is compiled
 into perl, so changing it requires a custom build.
 
-B<Note> that this pattern does not behave the same way as the equivalent 
+B<Note> that this pattern does not behave the same way as the equivalent
 PCRE or Python construct of the same form. In perl you can backtrack into
 a recursed group, in PCRE and Python the recursed into group is treated
-as atomic. Also, constructs like (?i:(?1)) or (?:(?i)(?1)) do not affect 
-the pattern being recursed into. 
+as atomic. Also, constructs like (?i:(?1)) or (?:(?i)(?1)) do not affect
+the pattern being recursed into.
 
 =item C<< (?>pattern) >>
 X<backtrack> X<backtracking> X<atomic> X<possessive>