author: Yves Orton <demerphq@gmail.com> 2006-10-06 21:16:01 +0200
committer: Rafael Garcia-Suarez <rgarciasuarez@gmail.com> 2006-10-07 14:30:32 +0000
commit: 81714fb9c03d91d66b66cab6e899e81bf64a2ca7
tree: 40861dec0355f417fff2a7ff3c393082960066cc /pod/perlre.pod
parent: f5def3a2a0d8913110936f9f4e13e37835754c28
Re: [PATCH] Initial attempt at named captures for perls regexp engine
Message-ID: <9b18b3110610061016x5ddce965u30d9a821f632d450@mail.gmail.com>
p4raw-id: //depot/perl@28957
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- pod/perlre.pod 142
1 files changed, 104 insertions, 38 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index c4dd7c517a..7cc5decc22 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -191,20 +191,26 @@ X<metacharacter>
X<\w> X<\W> X<\s> X<\S> X<\d> X<\D> X<\X> X<\p> X<\P> X<\C>
X<word> X<whitespace>
- \w Match a "word" character (alphanumeric plus "_")
- \W Match a non-"word" character
- \s Match a whitespace character
- \S Match a non-whitespace character
- \d Match a digit character
- \D Match a non-digit character
- \pP Match P, named property. Use \p{Prop} for longer names.
- \PP Match non-P
- \X Match eXtended Unicode "combining character sequence",
- equivalent to (?:\PM\pM*)
- \C Match a single C char (octet) even under Unicode.
- NOTE: breaks up characters into their UTF-8 bytes,
- so you may end up with malformed pieces of UTF-8.
- Unsupported in lookbehind.
+ \w Match a "word" character (alphanumeric plus "_")
+ \W Match a non-"word" character
+ \s Match a whitespace character
+ \S Match a non-whitespace character
+ \d Match a digit character
+ \D Match a non-digit character
+ \pP Match P, named property. Use \p{Prop} for longer names.
+ \PP Match non-P
+ \X Match eXtended Unicode "combining character sequence",
+ equivalent to (?:\PM\pM*)
+ \C Match a single C char (octet) even under Unicode.
+ NOTE: breaks up characters into their UTF-8 bytes,
+ so you may end up with malformed pieces of UTF-8.
+ Unsupported in lookbehind.
+ \1 Backreference to a specific group.
+ '1' may actually be any positive integer
+ \k<name> Named backreference
+ \N{name} Named Unicode character, or Unicode escape.
+ \x12 Hexadecimal escape sequence
+ \x{1234} Long hexadecimal escape sequence
A C<\w> matches a single alphanumeric character (an alphabetic
character, or a decimal digit) or C<_>, not a whole word. Use C<\w+>
@@ -403,7 +409,7 @@ X<\G>
The bracketing construct C<( ... )> creates capture buffers. To
refer to the digit'th buffer use \<digit> within the
match. Outside the match use "$" instead of "\". (The
-\<digit> notation works in certain circumstances outside
+\<digit> notation works in certain circumstances outside
the match. See the warning below about \1 vs $1 for details.)
Referring back to another part of the match is called a
I<backreference>.
@@ -414,20 +420,38 @@ There is no limit to the number of captured substrings that you may
use. However Perl also uses \10, \11, etc. as aliases for \010,
\011, etc. (Recall that 0 means octal, so \011 is the character at
number 9 in your coded character set; which would be the 10th character,
-a horizontal tab under ASCII.) Perl resolves this
-ambiguity by interpreting \10 as a backreference only if at least 10
-left parentheses have opened before it. Likewise \11 is a
-backreference only if at least 11 left parentheses have opened
-before it. And so on. \1 through \9 are always interpreted as
+a horizontal tab under ASCII.) Perl resolves this
+ambiguity by interpreting \10 as a backreference only if at least 10
+left parentheses have opened before it. Likewise \11 is a
+backreference only if at least 11 left parentheses have opened
+before it. And so on. \1 through \9 are always interpreted as
backreferences.
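
The disambiguation rule for \10 described above can be checked with a short sketch. The test strings are made up for illustration; the behaviour shown assumes the rule exactly as this paragraph states it.

```perl
use strict;
use warnings;

# No capture groups have opened, so \10 is read as the octal
# escape \010, i.e. the character chr(8).
my $octal_match = ("\x08" =~ /^\10\z/) ? 1 : 0;

# Ten groups have opened before \10, so it is a backreference
# to group 10, which captured "j".
my $backref_match =
    ("abcdefghijj" =~ /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10\z/) ? 1 : 0;

# With ten groups open, \10 no longer means chr(8).
my $octal_after_ten =
    ("abcdefghij\x08" =~ /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10\z/) ? 1 : 0;

print "octal: $octal_match, backref: $backref_match, ",
      "octal after ten groups: $octal_after_ten\n";
```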
+Additionally, as of Perl 5.10 you may use named capture buffers and named
+backreferences. The notation is C<< (?<name>...) >> and C<< \k<name> >>
+(you may also use single quotes instead of angle brackets to quote the
+name). The only difference between named capture buffers and unnamed ones is
+that multiple buffers may have the same name and that the contents of
+named capture buffers are available via the C<%+> hash. When multiple
+groups share the same name C<$+{name}> and C<< \k<name> >> refer to the
+leftmost defined group, thus it's possible to do things with named capture
+buffers that would otherwise require C<(??{})> code to accomplish. Named
+capture buffers are numbered just as normal capture buffers are and may be
+referenced via the magic numeric variables or via numeric backreferences
+as well as by name.
+
Examples:
s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words
- if (/(.)\1/) { # find first doubled char
- print "'$1' is the first doubled character\n";
- }
+ /(.)\1/ # find first doubled char
+ and print "'$1' is the first doubled character\n";
+
+ /(?<char>.)\k<char>/ # ... a different way
+ and print "'$+{char}' is the first doubled character\n";
+
+ /(?<char>.)\1/ # ... mix and match
+ and print "'$1' is the first doubled character\n";
if (/Time: (..):(..):(..)/) { # parse out values
$hours = $1;
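
As a rough illustration of the named-capture paragraph added in this hunk, the sketch below shows a named backreference and two groups sharing one name. It assumes Perl 5.10 or later; the input strings and the group name C<val> are made up for the example.

```perl
use strict;
use warnings;

# Named capture plus named backreference: first doubled character.
my $doubled;
if ("bookkeeper" =~ /(?<char>.)\k<char>/) {
    $doubled = $+{char};
}

# Two groups share the name "val". $+{val} refers to the leftmost
# group that is defined in the current match.
my $re = qr/^(?:(?<val>\d+)|(?<val>[a-z]+))\z/;

my ($from_digits, $from_letters);
$from_digits  = $+{val} if "42"  =~ $re;   # first group took part
$from_letters = $+{val} if "abc" =~ $re;   # only the second group is defined

print "doubled=$doubled digits=$from_digits letters=$from_letters\n";
```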
@@ -443,7 +467,7 @@ everything before the matched string. C<$'> returns everything
after the matched string. And C<$^N> contains whatever was matched by
the most-recently closed group (submatch). C<$^N> can be used in
extended patterns (see below), for example to assign a submatch to a
-variable.
+variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>
The numbered match variables ($1, $2, $3, etc.) and the related punctuation
@@ -620,6 +644,48 @@ A zero-width negative look-behind assertion. For example C</(?<!bar)foo/>
matches any occurrence of "foo" that does not follow "bar". Works
only for fixed-width look-behind.
+=item C<(?'NAME'pattern)>
+
+=item C<< (?<NAME>pattern) >>
+X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
+
+A named capture buffer. Identical in every respect to normal capturing
+parens C<()> but for the additional fact that C<%+> may be used after
+a successful match to refer to a named buffer. See C<perlvar> for more
+details on the C<%+> hash.
+
+If multiple distinct capture buffers have the same name then the
+C<$+{NAME}> will refer to the leftmost defined buffer in the match.
+
+The forms C<(?'NAME'pattern)> and C<(?<NAME>pattern)> are equivalent.
+
+B<NOTE:> While the notation of this construct is the same as the similar
+function in .NET regexes, the behavior is not. In Perl the buffers are
+numbered sequentially regardless of being named or not. Thus in the
+pattern
+
+ /(x)(?<foo>y)(z)/
+
+$+{foo} will be the same as $2, and $3 will contain 'z' instead of
+the opposite which is what a .NET regex hacker might expect.
+
+Currently NAME is restricted to word chars only. In other words, it
+must match C</^\w+$/>.
+
+=item C<< \k<name> >>
+
+=item C<< \k'name' >>
+
+Named backreference. Similar to numeric backreferences, except that
+the group is designated by name and not number. If multiple groups
+have the same name then it refers to the leftmost defined group in
+the current match.
+
+It is an error to refer to a name not defined by a C<(?<NAME>)>
+earlier in the pattern.
+
+Both forms are equivalent.
+
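
The numbering rule and the two quoting forms documented in the items above can be sketched as follows. This assumes Perl 5.10 or later; the variable names and the group name C<nope> are hypothetical, chosen only for the example.

```perl
use strict;
use warnings;

# Named buffers are numbered in sequence with unnamed ones, so
# $+{foo} is the same buffer as $2 and $3 still holds "z".
my ($named, $second, $third);
if ("xyz" =~ /(x)(?<foo>y)(z)/) {
    ($named, $second, $third) = ($+{foo}, $2, $3);
}

# The single-quote forms are equivalent to the angle-bracket forms.
my $quoted = ("aa" =~ /^(?'c'.)\k'c'\z/) ? $+{c} : undef;

# Referring to an undefined name is a compile error. The pattern is
# built at run time here so the failure can be trapped with eval.
my $bad      = '\k<nope>';
my $compiled = eval { qr/$bad/; 1 } ? 1 : 0;

print "named=$named second=$second third=$third ",
      "quoted=$quoted compiled=$compiled\n";
```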
=item C<(?{ code })>
X<(?{})> X<regex, code in> X<regexp, code in> X<regular expression, code in>
@@ -726,7 +792,7 @@ Thus,
('a' x 100)=~/(??{'(.)' x 100})/
-B<will> match, it will B<not> set $1.
+B<will> match, it will B<not> set $1.
The C<code> is not interpolated. As before, the rules to determine
where the C<code> ends are currently somewhat convoluted.
@@ -762,21 +828,21 @@ X<regex, recursive> X<regexp, recursive> X<regular expression, recursive>
B<WARNING>: This extended regular expression feature is considered
highly experimental, and may be changed or deleted without notice.
-Similar to C<(??{ code })> except it does not involve compiling any code,
-instead it treats the contents of a capture buffer as an independent
+Similar to C<(??{ code })> except it does not involve compiling any code,
+instead it treats the contents of a capture buffer as an independent
pattern that must match at the current position. Capture buffers
-contained by the pattern will have the value as determined by the
+contained by the pattern will have the value as determined by the
outermost recursion.
PARNO is a sequence of digits not starting with 0 whose value
-reflects the paren-number of the capture buffer to recurse to.
+reflects the paren-number of the capture buffer to recurse to.
C<(?R)> recurses to the beginning of the pattern.
-The following pattern matches a function foo() which may contain
-balanced parenthesis as the argument.
+The following pattern matches a function foo() which may contain
+balanced parenthesis as the argument.
$re = qr{ ( # paren group 1 (full function)
- foo
+ foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
@@ -802,18 +868,18 @@ the output produced should be the following:
$1 = foo(bar(baz)+baz(bop))
$2 = (bar(baz)+baz(bop))
- $3 = bar(baz)+baz(bop)
+ $3 = bar(baz)+baz(bop)
-If there is no corresponding capture buffer defined, then it is a
+If there is no corresponding capture buffer defined, then it is a
fatal error. Recursing deeper than 50 times without consuming any input
-string will also result in a fatal error. The maximum depth is compiled
+string will also result in a fatal error. The maximum depth is compiled
into perl, so changing it requires a custom build.
-B<Note> that this pattern does not behave the same way as the equivalent
+B<Note> that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form. In perl you can backtrack into
a recursed group, in PCRE and Python the recursed into group is treated
-as atomic. Also, constructs like (?i:(?1)) or (?:(?i)(?1)) do not affect
-the pattern being recursed into.
+as atomic. Also, constructs like (?i:(?1)) or (?:(?i)(?1)) do not affect
+the pattern being recursed into.
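
A minimal sketch of recursing into a capture buffer with C<(?1)>, assuming Perl 5.10 or later. It is a simplified variant of the foo() pattern above with a single group, not the pattern from the documentation itself, and the test strings are made up.

```perl
use strict;
use warnings;

# Group 1 matches one balanced parenthesised chunk: an opening paren,
# then any mix of non-paren runs and nested chunks (by recursing into
# group 1 itself), then the closing paren.
my $call = qr/^foo(\((?:[^()]++|(?1))*\))\z/;

my $args;
$args = $1 if 'foo(bar(baz)+baz(bop))' =~ $call;

# An unbalanced argument list does not match.
my $unbalanced = ('foo(bar(baz)' =~ $call) ? 1 : 0;

print "args=$args unbalanced=$unbalanced\n";
```

The possessive C<[^()]++> is used here only to keep a failed match from backtracking through every split of the non-paren runs; a plain C<[^()]+> accepts the same strings.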
=item C<< (?>pattern) >>
X<backtrack> X<backtracking> X<atomic> X<possessive>