-rw-r--r-- pod/perlfaq4.pod | 6
-rw-r--r-- pod/perlfaq6.pod | 2
-rw-r--r-- pod/perlfaq9.pod | 4
-rw-r--r-- pod/perlglossary.pod | 11
-rw-r--r-- pod/perlre.pod | 129
-rw-r--r-- pod/perlrebackslash.pod | 33
-rw-r--r-- pod/perlrequick.pod | 10
-rw-r--r-- pod/perlretut.pod | 22
-rw-r--r-- pod/perlsyn.pod | 2
9 files changed, 115 insertions, 104 deletions
diff --git a/pod/perlfaq4.pod b/pod/perlfaq4.pod
index 45cc9e044d..2e35068c31 100644
--- a/pod/perlfaq4.pod
+++ b/pod/perlfaq4.pod
@@ -594,11 +594,11 @@ This won't expand C<"\n"> or C<"\t"> or any other special escapes.
You can use the substitution operator to find pairs of characters (or
runs of characters) and replace them with a single instance. In this
substitution, we find a character in C<(.)>. The memory parentheses
-store the matched character in the back-reference C<\1> and we use
+store the matched character in the back-reference C<\g1> and we use
that to require that the same thing immediately follow it. We replace
that part of the string with the character in C<$1>.
- s/(.)\1/$1/g;
+ s/(.)\g1/$1/g;
We can also use the transliteration operator, C<tr///>. In this
example, the search list side of our C<tr///> contains nothing, but
@@ -1162,7 +1162,7 @@ subsequent line.
sub fix {
local $_ = shift;
my ($white, $leader); # common whitespace and common leading string
- if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\1\2?.*\n)+$/) {
+ if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\g1\g2?.*\n)+$/) {
($white, $leader) = ($2, quotemeta($1));
} else {
($white, $leader) = (/^(\s+)/, '');
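Reviewer note: the new C<\g1> form in the first hunk can be checked with a small sketch like the one below. The sample strings and variable names are made up for illustration, and the second substitution (with C<+> after the backreference) is an assumed variation for whole runs, not something this patch adds. Perl 5.10 or later is assumed.

```perl
use strict;
use warnings;

# Collapse each pair of identical adjacent characters into one,
# using the \g1 spelling introduced by this patch.
my $pairs = 'bookkeeper';
( my $squeezed_pairs = $pairs ) =~ s/(.)\g1/$1/g;

# Variation: let the backreference repeat so a whole run collapses.
my $runs = 'aaabbbbc';
( my $squeezed_runs = $runs ) =~ s/(.)\g1+/$1/g;

print "$squeezed_pairs\n";    # bokeper
print "$squeezed_runs\n";     # abc
```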
diff --git a/pod/perlfaq6.pod b/pod/perlfaq6.pod
index b51884d063..fe91933bba 100644
--- a/pod/perlfaq6.pod
+++ b/pod/perlfaq6.pod
@@ -99,7 +99,7 @@ record read in.
$/ = ''; # read in whole paragraph, not just one line
while ( <> ) {
- while ( /\b([\w'-]+)(\s+\1)+\b/gi ) { # word starts alpha
+ while ( /\b([\w'-]+)(\s+\g1)+\b/gi ) { # word starts alpha
print "Duplicate $1 at paragraph $.\n";
}
}
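Reviewer note: a minimal sketch of the duplicate-word pattern from this hunk, run against an in-memory string rather than paragraph-mode input. The sample text and the C<@dups> array are assumptions made for the illustration. Perl 5.10 or later is assumed.

```perl
use strict;
use warnings;

# Hypothetical sample paragraph; the repeated "the" spans a newline.
my $text = "This is is a test of the the\nthe checker.";

my @dups;
while ( $text =~ /\b([\w'-]+)(\s+\g1)+\b/gi ) {
    push @dups, lc $1;    # record each repeated word once per run
}

print "Duplicate $_\n" for @dups;    # prints "is" and then "the"
```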
diff --git a/pod/perlfaq9.pod b/pod/perlfaq9.pod
index fa9ef11673..7ea553afa9 100644
--- a/pod/perlfaq9.pod
+++ b/pod/perlfaq9.pod
@@ -114,7 +114,7 @@ entities--like C<&lt;> for example.
Here's one "simple-minded" approach, that works for most files:
#!/usr/bin/perl -p0777
- s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
+ s/<(?:[^>'"]*|(['"]).*?\g1)*>//gs
If you want a more complete solution, see the 3-stage striphtml
program in
@@ -166,7 +166,7 @@ attribute is HREF and there are no other attributes.
# qxurl - tchrist@perl.com
print "$2\n" while m{
< \s*
- A \s+ HREF \s* = \s* (["']) (.*?) \1
+ A \s+ HREF \s* = \s* (["']) (.*?) \g1
\s* >
}gsix;
diff --git a/pod/perlglossary.pod b/pod/perlglossary.pod
index 858f24b12c..41dc412b57 100644
--- a/pod/perlglossary.pod
+++ b/pod/perlglossary.pod
@@ -234,13 +234,14 @@ some of its high-level ideas.
=item backreference
A substring L<captured|/capturing> by a subpattern within
-unadorned parentheses in a L</regex>, also referred to as a capture group.
-Backslashed decimal numbers
-(C<\1>, C<\2>, etc.) later in the same pattern refer back to the
-corresponding subpattern in the current match. Outside the pattern,
+unadorned parentheses in a L</regex>, also referred to as a capture group. The
+sequences (C<\g1>, C<\g2>, etc.) later in the same pattern refer back to
+the corresponding subpattern in the current match. Outside the pattern,
the numbered variables (C<$1>, C<$2>, etc.) continue to refer to these
same values, as long as the pattern was the last successful match of
-the current dynamic scope.
+the current dynamic scope. C<\g{-1}> can be used to refer to a group by
+relative rather than absolute position; and groups can also be named, and
+referred to later by name rather than number. See L<perlre/Capture Buffers>.
=item backtracking
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 9a7e4fef06..8f193c8acc 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -381,44 +381,29 @@ loop. Take care when using patterns that include C<\G> in an alternation.
The bracketing construct C<( ... )> creates capture groups (also referred to as
capture buffers). To refer to the current contents of a group later on, within
-same pattern, use \1 for the first, \2 for the second, and so on.
-Outside the match use "$" instead of "\". (The
-\<digit> notation works in certain circumstances outside
-the match. See L</Warning on \1 Instead of $1> below for details.)
-Referring back to another part of the match is called a
-I<backreference>.
+the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>)
+for the second, and so on.
+This is called a I<backreference>.
X<regex, capture buffer> X<regexp, capture buffer>
X<regex, capture group> X<regexp, capture group>
X<regular expression, capture buffer> X<backreference>
X<regular expression, capture group> X<backreference>
-
-There is no limit to the number of captured substrings that you may
-use. However Perl also uses \10, \11, etc. as aliases for \010,
-\011, etc. (Recall that 0 means octal, so \011 is the character at
-number 9 in your coded character set; which would be the 10th character,
-a horizontal tab under ASCII.) Perl resolves this
-ambiguity by interpreting \10 as a backreference only if at least 10
-left parentheses have opened before it. Likewise \11 is a
-backreference only if at least 11 left parentheses have opened
-before it. And so on. \1 through \9 are always interpreted as
-backreferences.
-If the bracketing group did not match, the associated backreference won't
-match either. (This can happen if the bracketing group is optional, or
-in a different branch of an alternation.)
-
X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
-In order to provide a safer and easier way to construct patterns using
-backreferences, Perl provides the C<\g{N}> notation (starting with perl
-5.10.0). The curly brackets are optional, however omitting them is less
-safe as the meaning of the pattern can be changed by text (such as digits)
-following it. When N is a positive integer the C<\g{N}> notation is
-exactly equivalent to using normal backreferences. When N is a negative
-integer then it is a relative backreference referring to the previous N'th
-capturing group. When the bracket form is used and N is not an integer, it
-is treated as a reference to a named group.
-
-Thus C<\g{-1}> refers to the last group, C<\g{-2}> refers to the
-group before that. For example:
+X<named capture buffer> X<regular expression, named capture buffer>
+X<named capture group> X<regular expression, named capture group>
+X<%+> X<$+{name}> X<< \k<name> >>
+There is no limit to the number of captured substrings that you may use.
+Groups are numbered with the leftmost open parenthesis being number 1, etc. If
+a group did not match, the associated backreference won't match either. (This
+can happen if the group is optional, or in a different branch of an
+alternation.)
+You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with
+this form, described below.
+
+You can also refer to capture groups relatively, by using a negative number, so
+that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture
+group, and C<\g-2> and C<\g{-2}> both refer to the group before it. For
+example:
/
(Y) # group 1
@@ -429,33 +414,62 @@ group before that. For example:
)
/x
-and would match the same as C</(Y) ( (X) \3 \1 )/x>.
-
-Additionally, as of Perl 5.10.0 you may use named capture groups and named
-backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >>
-to reference. You may also use apostrophes instead of angle brackets to delimit the
-name; and you may use the bracketed C<< \g{name} >> backreference syntax.
-It's possible to refer to a named capture group by absolute and relative number as well.
-Outside the pattern, a named capture group is available via the C<%+> hash.
-When different groups within the same pattern have the same name, C<$+{name}>
-and C<< \k<name> >> refer to the leftmost defined group. (Thus it's possible
-to do things with named capture groups that would otherwise require C<(??{})>
-code to accomplish.)
-X<named capture buffer> X<regular expression, named capture buffer>
-X<named capture group> X<regular expression, named capture group>
-X<%+> X<$+{name}> X<< \k<name> >>
+would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to
+interpolate regexes into larger regexes and not have to worry about the
+capture groups being renumbered.
+
+You can dispense with numbers altogether and create named capture groups.
+The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to
+reference. (To be compatible with .Net regular expressions, C<\g{I<name>}> may
+also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.)
+I<name> must not begin with a number, nor contain hyphens.
+When different groups within the same pattern have the same name, any reference
+to that name assumes the leftmost defined group. Named groups count in
+absolute and relative numbering, and so can also be referred to by those
+numbers.
+(It's possible to do things with named capture groups that would otherwise
+require C<(??{})>.)
+
+Capture group contents are dynamically scoped and available to you outside the
+pattern until the end of the enclosing block or until the next successful
+match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
+You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">,
+etc); or by name via the C<%+> hash, using C<"$+{I<name>}">.
+
+Braces are required in referring to named capture groups, but are optional for
+absolute or relative numbered ones. Braces are safer when creating a regex by
+concatenating smaller strings. For example if you have C<qr/$a$b/>, and C<$a>
+contained C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which
+is probably not what you intended.
+
+The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that
+there were no named nor relative numbered capture groups. Absolute numbered
+groups were referred to using C<\1>, C<\2>, etc, and this notation is still
+accepted (and likely always will be). But it leads to some ambiguities if
+there are more than 9 capture groups, as C<\10> could mean either the tenth
+capture group, or the character whose ordinal in octal is 010 (a backspace in
+ASCII). Perl resolves this ambiguity by interpreting C<\10> as a backreference
+only if at least 10 left parentheses have opened before it. Likewise C<\11> is
+a backreference only if at least 11 left parentheses have opened before it.
+And so on. C<\1> through C<\9> are always interpreted as backreferences. You
+can minimize the ambiguity by always using C<\g> if you mean capturing groups;
+and always using 3 digits for octal constants, with the first always "0" (which
+works if there are 63 (= \077) or fewer capture groups).
+
+(The C<\I<digit>> notation also works in certain circumstances outside
+the pattern. See L</Warning on \1 Instead of $1> below for details.)
Examples:
s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words
- /(.)\1/ # find first doubled char
+ /(.)\g1/ # find first doubled char
and print "'$1' is the first doubled character\n";
/(?<char>.)\k<char>/ # ... a different way
and print "'$+{char}' is the first doubled character\n";
- /(?'char'.)\1/ # ... mix and match
+ /(?'char'.)\g1/ # ... mix and match
and print "'$1' is the first doubled character\n";
if (/Time: (..):(..):(..)/) { # parse out values
@@ -475,14 +489,13 @@ extended patterns (see below), for example to assign a submatch to a
variable.
X<$+> X<$^N> X<$&> X<$`> X<$'>
-The numbered match variables ($1, $2, $3, etc.) and the related punctuation
-set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
+These special variables, like the C<%+> hash and the numbered match variables
+(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped
until the end of the enclosing block or until the next successful
match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
X<$+> X<$^N> X<$&> X<$`> X<$'>
X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
-
B<NOTE>: Failed matches in Perl do not reset the match variables,
which makes it easier to write code that tests for a series of more
specific cases and remembers the best match.
@@ -490,7 +503,7 @@ specific cases and remembers the best match.
B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
C<$'> anywhere in the program, it has to provide them for every
pattern match. This may substantially slow your program. Perl
-uses the same mechanism to produce $1, $2, etc, so you also pay a
+uses the same mechanism to produce C<$1>, C<$2>, etc, so you also pay a
price for each pattern that contains capturing parentheses. (To
avoid this cost while retaining the grouping behaviour, use the
extended regular expression C<(?: ... )> instead.) But if you never
@@ -586,7 +599,7 @@ include C<(?i)> at the front of the pattern. For example:
These modifiers are restored at the end of the enclosing group. For example,
- ( (?i) blah ) \s+ \1
+ ( (?i) blah ) \s+ \g1
will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
repetition of the previous word, assuming the C</x> modifier, and no C</i>
@@ -1141,8 +1154,8 @@ C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.
An effect similar to C<< (?>pattern) >> may be achieved by writing
-C<(?=(pattern))\1>. This matches the same substring as a standalone
-C<a+>, and the following C<\1> eats the matched string; it therefore
+C<(?=(pattern))\g1>. This matches the same substring as a standalone
+C<a+>, and the following C<\g1> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
@@ -1762,7 +1775,7 @@ I<n>th subpattern later in the pattern using the metacharacter
\I<n>. Subpatterns are numbered based on the left to right order
of their opening parenthesis. A backreference matches whatever
actually matched the subpattern in the string being examined, not
-the rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will
+the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
match "0x1234 0x4321", but not "0x1234 01234", because subpattern
1 matched "0x", even though the rule C<0|0x> could potentially match
the leading 0 in the second number.
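Reviewer note: the rewritten perlre text describes relative references (C<\g{-1}>) and named groups (C<\g{name}>, C<%+>). The sketch below is one way to see both behave as described. The fragment C<$doubled> and the sample strings are hypothetical. Perl 5.10 or later is assumed.

```perl
use strict;
use warnings;

# A fragment that refers to its own group relatively, so it keeps
# working wherever it is interpolated.
my $doubled = qr/(\w)\g{-1}/;

# Embedded after another group: the fragment's group becomes number 2,
# and \g{-1} still points at it (an absolute \g1 would mean (\d+)).
my ( $number, $letter );
if ( '42 book' =~ /(\d+) \s+ \w*? $doubled/x ) {
    ( $number, $letter ) = ( $1, $2 );
    print "number=$number doubled=$letter\n";    # number=42 doubled=o
}

# Named group: declared with (?<char>...), referenced with \g{char},
# and read back through the %+ hash after the match.
my $named;
if ( 'hello' =~ /(?<char>.)\g{char}/ ) {
    $named = $+{char};
    print "first doubled character: $named\n";   # l
}
```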
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 5e514ceec6..4f1bed67a5 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -227,10 +227,10 @@ as a character without special meaning by the regex engine, and will match
=head4 Caveat
-Octal escapes potentially clash with backreferences. They both consist
-of a backslash followed by numbers. So Perl has to use heuristics to
-determine whether it is a backreference or an octal escape. Perl uses
-the following rules:
+Octal escapes potentially clash with old-style backreferences (see L</Absolute
+referencing> below). They both consist of a backslash followed by numbers. So
+Perl has to use heuristics to determine whether it is a backreference or an
+octal escape. Perl uses the following rules:
=over 4
@@ -348,7 +348,6 @@ L<perlunicode/Unicode Character Properties>.
Mnemonic: I<p>roperty.
-
=head2 Referencing
If capturing parenthesis are used in a regular expression, we can refer
@@ -361,18 +360,18 @@ absolutely, relatively, and by name.
=head3 Absolute referencing
Either C<\gI<N>> (starting in Perl 5.10.0), or C<\I<N>> (old-style) where I<N>
-is an positive (unsigned) decimal number of any length is an absolute reference
+is a positive (unsigned) decimal number of any length is an absolute reference
to a capturing group.
-I<N> refers to the Nth set of parentheses - or more accurately, whatever has
+I<N> refers to the Nth set of parentheses - so C<\gI<N>> refers to whatever has
been matched by that set of parenthesis. Thus C<\g1> refers to the first
capture group in the regex.
The C<\gI<N>> form can be equivalently written as C<\g{I<N>}>
which avoids ambiguity when building a regex by concatenating shorter
-strings. Otherwise if you had a regex C</$a$b/>, and C<$a> contained C<"\g1">,
-and C<$b> contained C<"37">, you would get C</\g137/> which is probably not
-what you intended.
+strings. Otherwise if you had a regex C<qr/$a$b/>, and C<$a> contained
+C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which is
+probably not what you intended.
In the C<\I<N>> form, I<N> must not begin with a "0", and there must be at
least I<N> capturing groups, or else I<N> will be considered an octal escape
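Reviewer note: the concatenation hazard described in this hunk can be reproduced with a sketch along these lines. The variable names are hypothetical (the doc text uses C<$a> and C<$b>, which are avoided here because they are Perl's sort variables). It assumes that a C<\g> reference to a group that does not exist is a fatal regex compilation error, which is why the compile is wrapped in C<eval>. Perl 5.10 or later is assumed.

```perl
use strict;
use warnings;

my $digits = '37';

# Unbraced: the pieces join into \g137, a reference to group 137,
# which does not exist in a one-group pattern.
my $unbraced    = '(a)\g1' . $digits;
my $ok_unbraced = eval { qr/$unbraced/; 1 };
print "unbraced compiled: ", ( $ok_unbraced ? "yes" : "no" ), "\n";

# Braced: \g{1} ends at the closing brace, so 37 stays literal text.
my $braced         = '(a)\g{1}' . $digits;
my $braced_matches = 'aa37' =~ /^$braced$/ ? 1 : 0;
print "braced matches aa37: ", ( $braced_matches ? "yes" : "no" ), "\n";
```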
@@ -413,17 +412,15 @@ even if the larger pattern also contains capture groups.
=head3 Named referencing
-Also new in perl 5.10.0 is the use of named capture groups, which can be
-referred to by name. This is done with C<\g{name}>, which is a
-backreference to the capture group with the name I<name>.
+C<\g{I<name>}> (starting in Perl 5.10.0) can be used to back refer to a
+named capture group, dispensing completely with having to think about capture
+buffer positions.
To be compatible with .Net regular expressions, C<\g{name}> may also be
written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.
-Note that C<\g{}> has the potential to be ambiguous, as it could be a named
-reference, or an absolute or relative reference (if its argument is numeric).
-However, names are not allowed to start with digits, nor are they allowed to
-contain a hyphen, so there is no ambiguity.
+To prevent any ambiguity, I<name> must not start with a digit nor contain a
+hyphen.
=head4 Examples
@@ -582,7 +579,7 @@ Mnemonic: eI<X>tended Unicode character.
"\x{256}" =~ /^\C\C$/; # Match as chr (256) takes 2 octets in UTF-8.
$str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'
- $str =~ s/(.)\K\1//g; # Delete duplicated characters.
+ $str =~ s/(.)\K\g1//g; # Delete duplicated characters.
"\n" =~ /^\R$/; # Match, \n is a generic newline.
"\r" =~ /^\R$/; # Match, \r is a generic newline.
diff --git a/pod/perlrequick.pod b/pod/perlrequick.pod
index ded1e6cefc..1fde5588d6 100644
--- a/pod/perlrequick.pod
+++ b/pod/perlrequick.pod
@@ -298,13 +298,13 @@ indicated below it:
1 2 34
Associated with the matching variables C<$1>, C<$2>, ... are
-the B<backreferences> C<\1>, C<\2>, ... Backreferences are
+the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are
matching variables that can be used I<inside> a regex:
- /(\w\w\w)\s\1/; # find sequences like 'the the' in string
+ /(\w\w\w)\s\g1/; # find sequences like 'the the' in string
-C<$1>, C<$2>, ... should only be used outside of a regex, and C<\1>,
-C<\2>, ... only inside a regex.
+C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
+C<\g2>, ... only inside a regex.
=head2 Matching repetitions
@@ -347,7 +347,7 @@ Here are some examples:
/[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
# any number of digits
- /(\w+)\s+\1/; # match doubled words of arbitrary length
+ /(\w+)\s+\g1/; # match doubled words of arbitrary length
$year =~ /\d{2,4}/; # make sure year is at least 2 but not more
# than 4 digits
$year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 9eded21002..eae266a407 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -732,21 +732,21 @@ match).
=head2 Backreferences
Closely associated with the matching variables C<$1>, C<$2>, ... are
-the I<backreferences> C<\1>, C<\2>,... Backreferences are simply
+the I<backreferences> C<\g1>, C<\g2>,... Backreferences are simply
matching variables that can be used I<inside> a regexp. This is a
really nice feature; what matches later in a regexp is made to depend on
what matched earlier in the regexp. Suppose we wanted to look
for doubled words in a text, like 'the the'. The following regexp finds
all 3-letter doubles with a space in between:
- /\b(\w\w\w)\s\1\b/;
+ /\b(\w\w\w)\s\g1\b/;
-The grouping assigns a value to \1, so that the same 3 letter sequence
+The grouping assigns a value to \g1, so that the same 3 letter sequence
is used for both parts.
A similar task is to find words consisting of two identical parts:
- % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
+ % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words
beriberi
booboo
coco
@@ -755,10 +755,10 @@ A similar task is to find words consisting of two identical parts:
papa
The regexp has a single grouping which considers 4-letter
-combinations, then 3-letter combinations, etc., and uses C<\1> to look for
-a repeat. Although C<$1> and C<\1> represent the same thing, care should be
+combinations, then 3-letter combinations, etc., and uses C<\g1> to look for
+a repeat. Although C<$1> and C<\g1> represent the same thing, care should be
taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp
-and backreferences C<\1>, C<\2>,... only I<inside> a regexp; not doing
+and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing
so may lead to surprising and unsatisfactory results.
@@ -775,7 +775,7 @@ Another good reason in addition to readability and maintainability
for using relative backreferences is illustrated by the following example,
where a simple pattern for matching peculiar strings is used:
- $a99a = '([a-z])(\d)\2\1'; # matches a11a, g22g, x33x, etc.
+ $a99a = '([a-z])(\d)\g2\g1'; # matches a11a, g22g, x33x, etc.
Now that we have this pattern stored as a handy string, we might feel
tempted to use it as a part of some other pattern:
@@ -976,7 +976,7 @@ Here are some examples:
/[a-z]+\s+\d*/; # match a lowercase word, at least one space, and
# any number of digits
- /(\w+)\s+\1/; # match doubled words of arbitrary length
+ /(\w+)\s+\g1/; # match doubled words of arbitrary length
/y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes'
$year =~ /\d{2,4}/; # make sure year is at least 2 but not more
# than 4 digits
@@ -984,7 +984,7 @@ Here are some examples:
$year =~ /\d{2}(\d{2})?/; # same thing written differently. However,
# this produces $1 and the other does not.
- % simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier?
+ % simple_grep '^(\w+)\g1$' /usr/dict/words # isn't this easier?
beriberi
booboo
coco
@@ -2385,7 +2385,7 @@ The integer or name form of the C<condition> allows us to choose,
with more flexibility, what to match based on what matched earlier in the
regexp. This searches for words of the form C<"$x$x"> or C<"$x$y$y$x">:
- % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
+ % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words
beriberi
coco
couscous
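Reviewer note: the perlretut hunk above changes C<$a99a> to use C<\g2\g1>, which is still an absolute reference. A sketch of why the surrounding tutorial text recommends relative references when such a string is embedded in a larger pattern follows. The C<_abs> and C<_rel> variable names and the sample line are assumptions for illustration. Perl 5.10 or later is assumed.

```perl
use strict;
use warnings;

my $a99a_abs = '([a-z])(\d)\g2\g1';          # absolute references
my $a99a_rel = '([a-z])(\d)\g{-1}\g{-2}';    # relative references

my $line = 'code=e99e';

# Once embedded behind (\w+), the groups are renumbered: \g1 now means
# the (\w+) group, so the absolute version no longer matches.
my $abs_matches = $line =~ /^(\w+)=$a99a_abs$/ ? 1 : 0;

# The relative version counts backwards from where it appears, so it
# is unaffected by the extra leading group.
my $rel_matches = $line =~ /^(\w+)=$a99a_rel$/ ? 1 : 0;

print "absolute: $abs_matches, relative: $rel_matches\n";    # absolute: 0, relative: 1
```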
diff --git a/pod/perlsyn.pod b/pod/perlsyn.pod
index 29db5dafd5..18143d180a 100644
--- a/pod/perlsyn.pod
+++ b/pod/perlsyn.pod
@@ -931,7 +931,7 @@ C preprocessors: it matches the regular expression
# example: '# line 42 "new_filename.plx"'
/^\# \s*
line \s+ (\d+) \s*
- (?:\s("?)([^"]+)\2)? \s*
+ (?:\s("?)([^"]+)\g2)? \s*
$/x
with C<$1> being the line number for the next line, and C<$3> being