author | Karl Williamson <khw@khw-desktop.(none)> | 2010-06-22 14:29:10 -0600 |
---|---|---|
committer | Jesse Vincent <jesse@bestpractical.com> | 2010-06-28 22:30:04 -0400 |
commit | d8b950dcbc51bd501c5dc196cc12d87eaf47b60c (patch) | |
tree | fd00ef847f27621f035f8c4fd827df582fa1433d | |
parent | c27a5cfe2661343fcb3b4f58478604d8b59b20de (diff) | |
download | perl-d8b950dcbc51bd501c5dc196cc12d87eaf47b60c.tar.gz |
Prefer \g1 over \1 in pods
\g was added to avoid ambiguities that \digit causes. This updates the
pod documentation to use \g in examples, and to prefer it when
explaining the concepts. Some non-symmetrical outlined text dealing
with it was also cleaned up.
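
The ambiguity the message refers to can be sketched in a few lines of Perl. This is only an illustration under stated assumptions, not part of the commit: the variable names (`$octal`, `$backref`, `$head`, `$tail`, `$joined`) and the test strings are hypothetical, and the behaviour shown is the one described in the perlre.pod text of this patch.

```perl
use strict;
use warnings;

# With fewer than ten groups opened, \10 is read as the octal escape 010
# (a backspace in ASCII), not as a backreference.
my $octal = "\x08" =~ /^\10$/ ? 1 : 0;

# Once ten groups have opened before it, the same \10 is a backreference
# to group 10.
my $backref = "abcdefghijj" =~ /^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10$/ ? 1 : 0;

# \g{N} is always a backreference, and the braces keep digits that follow
# it from being absorbed into the group number when fragments are joined.
my ($head, $tail) = ('(\w)\g{1}', '37');   # hypothetical fragments
my $joined = "aa37" =~ /^$head$tail$/ ? 1 : 0;

print "octal=$octal backref=$backref joined=$joined\n";
```

Writing the first fragment as `(\w)\g1` instead would yield `\g137` after concatenation, which is the case the updated documentation warns about.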
-rw-r--r-- | pod/perlfaq4.pod | 6 |
-rw-r--r-- | pod/perlfaq6.pod | 2 |
-rw-r--r-- | pod/perlfaq9.pod | 4 |
-rw-r--r-- | pod/perlglossary.pod | 11 |
-rw-r--r-- | pod/perlre.pod | 129 |
-rw-r--r-- | pod/perlrebackslash.pod | 33 |
-rw-r--r-- | pod/perlrequick.pod | 10 |
-rw-r--r-- | pod/perlretut.pod | 22 |
-rw-r--r-- | pod/perlsyn.pod | 2 |
9 files changed, 115 insertions, 104 deletions
diff --git a/pod/perlfaq4.pod b/pod/perlfaq4.pod
index 45cc9e044d..2e35068c31 100644
--- a/pod/perlfaq4.pod
+++ b/pod/perlfaq4.pod
@@ -594,11 +594,11 @@ This won't expand C<"\n"> or C<"\t"> or any other special escapes.
 You can use the substitution operator to find pairs of characters (or
 runs of characters) and replace them with a single instance. In this
 substitution, we find a character in C<(.)>. The memory parentheses
-store the matched character in the back-reference C<\1> and we use
+store the matched character in the back-reference C<\g1> and we use
 that to require that the same thing immediately follow it. We replace
 that part of the string with the character in C<$1>.
 
-    s/(.)\1/$1/g;
+    s/(.)\g1/$1/g;
 
 We can also use the transliteration operator, C<tr///>. In this
 example, the search list side of our C<tr///> contains nothing, but
@@ -1162,7 +1162,7 @@ subsequent line.
     sub fix {
         local $_ = shift;
         my ($white, $leader); # common whitespace and common leading string
-        if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\1\2?.*\n)+$/) {
+        if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\g1\g2?.*\n)+$/) {
             ($white, $leader) = ($2, quotemeta($1));
         } else {
             ($white, $leader) = (/^(\s+)/, '');
diff --git a/pod/perlfaq6.pod b/pod/perlfaq6.pod
index b51884d063..fe91933bba 100644
--- a/pod/perlfaq6.pod
+++ b/pod/perlfaq6.pod
@@ -99,7 +99,7 @@ record read in.
 
     $/ = ''; # read in whole paragraph, not just one line
     while ( <> ) {
-        while ( /\b([\w'-]+)(\s+\1)+\b/gi ) { # word starts alpha
+        while ( /\b([\w'-]+)(\s+\g1)+\b/gi ) { # word starts alpha
             print "Duplicate $1 at paragraph $.\n";
         }
     }
diff --git a/pod/perlfaq9.pod b/pod/perlfaq9.pod
index fa9ef11673..7ea553afa9 100644
--- a/pod/perlfaq9.pod
+++ b/pod/perlfaq9.pod
@@ -114,7 +114,7 @@ entities--like C<&lt;> for example.
 Here's one "simple-minded" approach, that works for most files:
 
     #!/usr/bin/perl -p0777
-    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
+    s/<(?:[^>'"]*|(['"]).*?\g1)*>//gs
 
 If you want a more complete solution, see the 3-stage striphtml
 program in
@@ -166,7 +166,7 @@ attribute is HREF and there are no other attributes.
     # qxurl - tchrist@perl.com
     print "$2\n" while m{
         < \s*
-          A \s+ HREF \s* = \s* (["']) (.*?) \1
+          A \s+ HREF \s* = \s* (["']) (.*?) \g1
         \s* >
     }gsix;
 
diff --git a/pod/perlglossary.pod b/pod/perlglossary.pod
index 858f24b12c..41dc412b57 100644
--- a/pod/perlglossary.pod
+++ b/pod/perlglossary.pod
@@ -234,13 +234,14 @@ some of its high-level ideas.
 =item backreference
 
 A substring L<captured|/capturing> by a subpattern within
-unadorned parentheses in a L</regex>, also referred to as a capture group.
-Backslashed decimal numbers
-(C<\1>, C<\2>, etc.) later in the same pattern refer back to the
-corresponding subpattern in the current match. Outside the pattern,
+unadorned parentheses in a L</regex>, also referred to as a capture group. The
+sequences (C<\g1>, C<\g2>, etc.) later in the same pattern refer back to
+the corresponding subpattern in the current match. Outside the pattern,
 the numbered variables (C<$1>, C<$2>, etc.) continue to refer to these
 same values, as long as the pattern was the last successful match of
-the current dynamic scope.
+the current dynamic scope. C<\g{-1}> can be used to refer to a group by
+relative rather than absolute position; and groups can be also be named, and
+referred to later by name rather than number. See L<perlre/Capture Buffers>.
 
 =item backtracking
 
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 9a7e4fef06..8f193c8acc 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -381,44 +381,29 @@ loop. Take care when using patterns that include C<\G> in an alternation.
 
 The bracketing construct C<( ... )> creates capture groups (also referred to as
 capture buffers). To refer to the current contents of a group later on, within
-same pattern, use \1 for the first, \2 for the second, and so on.
-Outside the match use "$" instead of "\". (The
-\<digit> notation works in certain circumstances outside
-the match. See L</Warning on \1 Instead of $1> below for details.)
-Referring back to another part of the match is called a
-I<backreference>.
+the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>)
+for the second, and so on.
+This is called a I<backreference>.
 X<regex, capture buffer> X<regexp, capture buffer>
 X<regex, capture group> X<regexp, capture group>
 X<regular expression, capture buffer> X<backreference>
 X<regular expression, capture group> X<backreference>
-
-There is no limit to the number of captured substrings that you may
-use. However Perl also uses \10, \11, etc. as aliases for \010,
-\011, etc. (Recall that 0 means octal, so \011 is the character at
-number 9 in your coded character set; which would be the 10th character,
-a horizontal tab under ASCII.) Perl resolves this
-ambiguity by interpreting \10 as a backreference only if at least 10
-left parentheses have opened before it. Likewise \11 is a
-backreference only if at least 11 left parentheses have opened
-before it. And so on. \1 through \9 are always interpreted as
-backreferences.
-If the bracketing group did not match, the associated backreference won't
-match either. (This can happen if the bracketing group is optional, or
-in a different branch of an alternation.)
-
 X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference>
-In order to provide a safer and easier way to construct patterns using
-backreferences, Perl provides the C<\g{N}> notation (starting with perl
-5.10.0). The curly brackets are optional, however omitting them is less
-safe as the meaning of the pattern can be changed by text (such as digits)
-following it. When N is a positive integer the C<\g{N}> notation is
-exactly equivalent to using normal backreferences. When N is a negative
-integer then it is a relative backreference referring to the previous N'th
-capturing group. When the bracket form is used and N is not an integer, it
-is treated as a reference to a named group.
-
-Thus C<\g{-1}> refers to the last group, C<\g{-2}> refers to the
-group before that. For example:
+X<named capture buffer> X<regular expression, named capture buffer>
+X<named capture group> X<regular expression, named capture group>
+X<%+> X<$+{name}> X<< \k<name> >>
+There is no limit to the number of captured substrings that you may use.
+Groups are numbered with the leftmost open parenthesis being number 1, etc. If
+a group did not match, the associated backreference won't match either. (This
+can happen if the group is optional, or in a different branch of an
+alternation.)
+You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with
+this form, described below.
+
+You can also refer to capture groups relatively, by using a negative number, so
+that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture
+group, and C<\g-2> and C<\g{-2}> both refer to the group before it. For
+example:
 
     /
      (Y) # group 1
@@ -429,33 +414,62 @@ group before that. For example:
      )
     /x
 
-and would match the same as C</(Y) ( (X) \3 \1 )/x>.
-
-Additionally, as of Perl 5.10.0 you may use named capture groups and named
-backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >>
-to reference. You may also use apostrophes instead of angle brackets to delimit the
-name; and you may use the bracketed C<< \g{name} >> backreference syntax.
-It's possible to refer to a named capture group by absolute and relative number as well.
-Outside the pattern, a named capture group is available via the C<%+> hash.
-When different groups within the same pattern have the same name, C<$+{name}>
-and C<< \k<name> >> refer to the leftmost defined group. (Thus it's possible
-to do things with named capture groups that would otherwise require C<(??{})>
-code to accomplish.)
-X<named capture buffer> X<regular expression, named capture buffer>
-X<named capture group> X<regular expression, named capture group>
-X<%+> X<$+{name}> X<< \k<name> >>
+would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to
+interpolate regexes into larger regexes and not have to worry about the
+capture groups being renumbered.
+
+You can dispense with numbers altogether and create named capture groups.
+The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to
+reference. (To be compatible with .Net regular expressions, C<\g{I<name>}> may
+also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.)
+I<name> must not begin with a number, nor contain hyphens.
+When different groups within the same pattern have the same name, any reference
+to that name assumes the leftmost defined group. Named groups count in
+absolute and relative numbering, and so can also be referred to by those
+numbers.
+(It's possible to do things with named capture groups that would otherwise
+require C<(??{})>.)
+
+Capture group contents are dynamically scoped and available to you outside the
+pattern until the end of the enclosing block or until the next successful
+match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
+You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">,
+etc); or by name via the C<%+> hash, using C<"$+{I<name>}">.
+
+Braces are required in referring to named capture groups, but are optional for
+absolute or relative numbered ones. Braces are safer when creating a regex by
+concatenating smaller strings. For example if you have C<qr/$a$b/>, and C<$a>
+contained C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which
+is probably not what you intended.
+
+The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that
+there were no named nor relative numbered capture groups. Absolute numbered
+groups were referred to using C<\1>, C<\2>, etc, and this notation is still
+accepted (and likely always will be). But it leads to some ambiguities if
+there are more than 9 capture groups, as C<\10> could mean either the tenth
+capture group, or the character whose ordinal in octal is 010 (a backspace in
+ASCII). Perl resolves this ambiguity by interpreting C<\10> as a backreference
+only if at least 10 left parentheses have opened before it. Likewise C<\11> is
+a backreference only if at least 11 left parentheses have opened before it.
+And so on. C<\1> through C<\9> are always interpreted as backreferences. You
+can minimize the ambiguity by always using C<\g> if you mean capturing groups;
+and always using 3 digits for octal constants, with the first always "0" (which
+works if there are 63 (= \077) or fewer capture groups).
+
+The C<\I<digit>> notation also works in certain circumstances outside
+the pattern. See L</Warning on \1 Instead of $1> below for details.)
 
 Examples:
 
     s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words
 
-    /(.)\1/ # find first doubled char
+    /(.)\g1/ # find first doubled char
          and print "'$1' is the first doubled character\n";
 
     /(?<char>.)\k<char>/ # ... a different way
          and print "'$+{char}' is the first doubled character\n";
 
-    /(?'char'.)\1/ # ... mix and match
+    /(?'char'.)\g1/ # ... mix and match
          and print "'$1' is the first doubled character\n";
 
     if (/Time: (..):(..):(..)/) { # parse out values
@@ -475,14 +489,13 @@ extended patterns (see below), for example to assign a submatch to a
 variable.
 X<$+> X<$^N> X<$&> X<$`> X<$'>
 
-The numbered match variables ($1, $2, $3, etc.) and the related punctuation
-set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped
+These special variables, like the C<%+> hash and the numbered match variables
+(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped
 until the end of the enclosing block or until the next successful
 match, whichever comes first. (See L<perlsyn/"Compound Statements">.)
 X<$+> X<$^N> X<$&> X<$`> X<$'>
 X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9>
 
-
 B<NOTE>: Failed matches in Perl do not reset the match variables, which
 makes it easier to write code that tests for a series of more
 specific cases and remembers the best match.
@@ -490,7 +503,7 @@ specific cases and remembers the best match.
 B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or
 C<$'> anywhere in the program, it has to provide them for every
 pattern match. This may substantially slow your program. Perl
-uses the same mechanism to produce $1, $2, etc, so you also pay a
+uses the same mechanism to produce C<$1>, C<$2>, etc, so you also pay a
 price for each pattern that contains capturing parentheses. (To
 avoid this cost while retaining the grouping behaviour, use the
 extended regular expression C<(?: ... )> instead.) But if you never
@@ -586,7 +599,7 @@ include C<(?i)> at the front of the pattern. For example:
 
 These modifiers are restored at the end of the enclosing group. For example,
 
-    ( (?i) blah ) \s+ \1
+    ( (?i) blah ) \s+ \g1
 
 will match C<blah> in any case, some spaces, and an exact (I<including the case>!)
 repetition of the previous word, assuming the C</x> modifier, and no C</i>
@@ -1141,8 +1154,8 @@ C<a*ab> will match fewer characters than a standalone C<a*>, since
 this makes the tail match.
 
 An effect similar to C<< (?>pattern) >> may be achieved by writing
-C<(?=(pattern))\1>. This matches the same substring as a standalone
-C<a+>, and the following C<\1> eats the matched string; it therefore
+C<(?=(pattern))\g1>. This matches the same substring as a standalone
+C<a+>, and the following C<\g1> eats the matched string; it therefore
 makes a zero-length assertion into an analogue of C<< (?>...) >>.
 (The difference between these two constructs is that the second one
 uses a capturing group, thus shifting ordinals of backreferences
@@ -1762,7 +1775,7 @@ I<n>th subpattern later in the pattern using the metacharacter
 \I<n>. Subpatterns are numbered based on the left to right order
 of their opening parenthesis. A backreference matches whatever
 actually matched the subpattern in the string being examined, not
-the rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will
+the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
 match "0x1234 0x4321", but not "0x1234 01234", because subpattern
 1 matched "0x", even though the rule C<0|0x> could potentially match
 the leading 0 in the second number.
diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod
index 5e514ceec6..4f1bed67a5 100644
--- a/pod/perlrebackslash.pod
+++ b/pod/perlrebackslash.pod
@@ -227,10 +227,10 @@ as a character without special meaning by the regex engine, and will match
 
 =head4 Caveat
 
-Octal escapes potentially clash with backreferences. They both consist
-of a backslash followed by numbers. So Perl has to use heuristics to
-determine whether it is a backreference or an octal escape. Perl uses
-the following rules:
+Octal escapes potentially clash with old-style backreferences (see L</Absolute
+referencing> below). They both consist of a backslash followed by numbers. So
+Perl has to use heuristics to determine whether it is a backreference or an
+octal escape. Perl uses the following rules:
 
 =over 4
 
@@ -348,7 +348,6 @@ L<perlunicode/Unicode Character Properties>.
 
 Mnemonic: I<p>roperty.
 
-
 =head2 Referencing
 
 If capturing parenthesis are used in a regular expression, we can refer
@@ -361,18 +360,18 @@ absolutely, relatively, and by name.
 =head3 Absolute referencing
 
 Either C<\gI<N>> (starting in Perl 5.10.0), or C<\I<N>> (old-style) where I<N>
-is an positive (unsigned) decimal number of any length is an absolute reference
+is a positive (unsigned) decimal number of any length is an absolute reference
 to a capturing group.
 
-I<N> refers to the Nth set of parentheses - or more accurately, whatever has
+I<N> refers to the Nth set of parentheses - so C<\gI<N>> refers to whatever has
 been matched by that set of parenthesis. Thus C<\g1> refers to the first
 capture group in the regex.
 
 The C<\gI<N>> form can be equivalently written as C<\g{I<N>}>
 which avoids ambiguity when building a regex by concatenating shorter
-strings. Otherwise if you had a regex C</$a$b/>, and C<$a> contained C<"\g1">,
-and C<$b> contained C<"37">, you would get C</\g137/> which is probably not
-what you intended.
+strings. Otherwise if you had a regex C<qr/$a$b/>, and C<$a> contained
+C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which is
+probably not what you intended.
 
 In the C<\I<N>> form, I<N> must not begin with a "0", and there must be at
 least I<N> capturing groups, or else I<N> will be considered an octal escape
@@ -413,17 +412,15 @@ even if the larger pattern also contains capture groups.
 
 =head3 Named referencing
 
-Also new in perl 5.10.0 is the use of named capture groups, which can be
-referred to by name. This is done with C<\g{name}>, which is a
-backreference to the capture group with the name I<name>.
+C<\g{I<name>}> (starting in Perl 5.10.0) can be used to back refer to a
+named capture group, dispensing completely with having to think about capture
+buffer positions.
 
 To be compatible with .Net regular expressions, C<\g{name}> may also be
 written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>.
 
-Note that C<\g{}> has the potential to be ambiguous, as it could be a named
-reference, or an absolute or relative reference (if its argument is numeric).
-However, names are not allowed to start with digits, nor are they allowed to
-contain a hyphen, so there is no ambiguity.
+To prevent any ambiguity, I<name> must not start with a digit nor contain a
+hyphen.
 
 =head4 Examples
 
@@ -582,7 +579,7 @@ Mnemonic: eI<X>tended Unicode character.
  "\x{256}" =~ /^\C\C$/; # Match as chr (256) takes 2 octets in UTF-8.
 
  $str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz'
- $str =~ s/(.)\K\1//g; # Delete duplicated characters.
+ $str =~ s/(.)\K\g1//g; # Delete duplicated characters.
 
  "\n" =~ /^\R$/; # Match, \n is a generic newline.
  "\r" =~ /^\R$/; # Match, \r is a generic newline.
diff --git a/pod/perlrequick.pod b/pod/perlrequick.pod
index ded1e6cefc..1fde5588d6 100644
--- a/pod/perlrequick.pod
+++ b/pod/perlrequick.pod
@@ -298,13 +298,13 @@ indicated below it:
      1  2      34
 
 Associated with the matching variables C<$1>, C<$2>, ... are
-the B<backreferences> C<\1>, C<\2>, ... Backreferences are
+the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are
 matching variables that can be used I<inside> a regex:
 
-    /(\w\w\w)\s\1/; # find sequences like 'the the' in string
+    /(\w\w\w)\s\g1/; # find sequences like 'the the' in string
 
-C<$1>, C<$2>, ... should only be used outside of a regex, and C<\1>,
-C<\2>, ... only inside a regex.
+C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>,
+C<\g2>, ... only inside a regex.
 
 =head2 Matching repetitions
 
@@ -347,7 +347,7 @@ Here are some examples:
 
     /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and
                     # any number of digits
-    /(\w+)\s+\1/; # match doubled words of arbitrary length
+    /(\w+)\s+\g1/; # match doubled words of arbitrary length
     $year =~ /\d{2,4}/; # make sure year is at least 2 but not more
                         # than 4 digits
     $year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates
diff --git a/pod/perlretut.pod b/pod/perlretut.pod
index 9eded21002..eae266a407 100644
--- a/pod/perlretut.pod
+++ b/pod/perlretut.pod
@@ -732,21 +732,21 @@ match).
 =head2 Backreferences
 
 Closely associated with the matching variables C<$1>, C<$2>, ... are
-the I<backreferences> C<\1>, C<\2>,... Backreferences are simply
+the I<backreferences> C<\g1>, C<\g2>,... Backreferences are simply
 matching variables that can be used I<inside> a regexp. This is a
 really nice feature; what matches later in a regexp is made to depend on
 what matched earlier in the regexp. Suppose we wanted to look
 for doubled words in a text, like 'the the'. The following regexp finds
 all 3-letter doubles with a space in between:
 
-    /\b(\w\w\w)\s\1\b/;
+    /\b(\w\w\w)\s\g1\b/;
 
-The grouping assigns a value to \1, so that the same 3 letter sequence
+The grouping assigns a value to \g1, so that the same 3 letter sequence
 is used for both parts.
 
 A similar task is to find words consisting of two identical parts:
 
-    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words
+    % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words
     beriberi
     booboo
     coco
@@ -755,10 +755,10 @@ A similar task is to find words consisting of two identical parts:
     papa
 
 The regexp has a single grouping which considers 4-letter
-combinations, then 3-letter combinations, etc., and uses C<\1> to look for
-a repeat. Although C<$1> and C<\1> represent the same thing, care should be
+combinations, then 3-letter combinations, etc., and uses C<\g1> to look for
+a repeat. Although C<$1> and C<\g1> represent the same thing, care should be
 taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp
-and backreferences C<\1>, C<\2>,... only I<inside> a regexp; not doing
+and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing
 so may lead to surprising and unsatisfactory results.
 
 
@@ -775,7 +775,7 @@ Another good reason in addition to readability and maintainability
 for using relative backreferences is illustrated by the following example,
 where a simple pattern for matching peculiar strings is used:
 
-    $a99a = '([a-z])(\d)\2\1'; # matches a11a, g22g, x33x, etc.
+    $a99a = '([a-z])(\d)\g2\g1'; # matches a11a, g22g, x33x, etc.
 
 Now that we have this pattern stored as a handy string, we might feel
 tempted to use it as a part of some other pattern:
@@ -976,7 +976,7 @@ Here are some examples:
 
     /[a-z]+\s+\d*/; # match a lowercase word, at least one space, and
                     # any number of digits
-    /(\w+)\s+\1/; # match doubled words of arbitrary length
+    /(\w+)\s+\g1/; # match doubled words of arbitrary length
     /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes'
     $year =~ /\d{2,4}/; # make sure year is at least 2 but not more
                         # than 4 digits
@@ -984,7 +984,7 @@ Here are some examples:
     $year =~ /\d{2}(\d{2})?/; # same thing written differently. However,
                               # this produces $1 and the other does not.
 
-    % simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier?
+    % simple_grep '^(\w+)\g1$' /usr/dict/words # isn't this easier?
     beriberi
     booboo
     coco
@@ -2385,7 +2385,7 @@ The integer or name form of the C<condition> allows us to choose,
 with more flexibility, what to match based on what matched earlier in the
 regexp. This searches for words of the form C<"$x$x"> or C<"$x$y$y$x">:
 
-    % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words
+    % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words
     beriberi
     coco
     couscous
diff --git a/pod/perlsyn.pod b/pod/perlsyn.pod
index 29db5dafd5..18143d180a 100644
--- a/pod/perlsyn.pod
+++ b/pod/perlsyn.pod
@@ -931,7 +931,7 @@ C preprocessors: it matches the regular expression
     # example: '# line 42 "new_filename.plx"'
     /^\# \s*
       line \s+ (\d+) \s*
-      (?:\s("?)([^"]+)\2)? \s*
+      (?:\s("?)([^"]+)\g2)? \s*
      $/x
 
 with C<$1> being the line number for the next line, and C<$3> being
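
The relative and named forms documented in the perlre.pod and perlretut.pod hunks can be tried out with a short script. This is a minimal sketch under stated assumptions, not code from the patch: the fragment stored in `$a99a` is rewritten here with relative references, and the test strings and the variables `$embedded` and `$doubled` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical reusable fragment: a letter, a digit, then both mirrored.
# \g{-1} and \g{-2} count back from where they appear, so the fragment
# keeps working when an earlier group shifts the absolute numbering.
my $a99a = '([a-z])(\d)\g{-1}\g{-2}';

# Embedding the fragment after another capture group still matches.
my $embedded = "xa11a" =~ /^(x)$a99a$/ ? 1 : 0;

# A named group, referenced with \g{name} inside the pattern and
# through the %+ hash afterwards.
my $doubled = "";
if ("bookkeeper" =~ /(?<char>.)\g{char}/) {
    $doubled = $+{char};
}

print "embedded=$embedded doubled=$doubled\n";
```

With absolute references (`\g2\g1`) the embedded match would instead compare against the outer group, which is the renumbering problem the relative form avoids.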