-rw-r--r-- | pod/perlfaq4.pod | 6
-rw-r--r-- | pod/perlfaq6.pod | 2
-rw-r--r-- | pod/perlfaq9.pod | 4
-rw-r--r-- | pod/perlglossary.pod | 11
-rw-r--r-- | pod/perlre.pod | 129
-rw-r--r-- | pod/perlrebackslash.pod | 33
-rw-r--r-- | pod/perlrequick.pod | 10
-rw-r--r-- | pod/perlretut.pod | 22
-rw-r--r-- | pod/perlsyn.pod | 2
9 files changed, 115 insertions, 104 deletions
diff --git a/pod/perlfaq4.pod b/pod/perlfaq4.pod index 45cc9e044d..2e35068c31 100644 --- a/pod/perlfaq4.pod +++ b/pod/perlfaq4.pod @@ -594,11 +594,11 @@ This won't expand C<"\n"> or C<"\t"> or any other special escapes. You can use the substitution operator to find pairs of characters (or runs of characters) and replace them with a single instance. In this substitution, we find a character in C<(.)>. The memory parentheses -store the matched character in the back-reference C<\1> and we use +store the matched character in the back-reference C<\g1> and we use that to require that the same thing immediately follow it. We replace that part of the string with the character in C<$1>. - s/(.)\1/$1/g; + s/(.)\g1/$1/g; We can also use the transliteration operator, C<tr///>. In this example, the search list side of our C<tr///> contains nothing, but @@ -1162,7 +1162,7 @@ subsequent line. sub fix { local $_ = shift; my ($white, $leader); # common whitespace and common leading string - if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\1\2?.*\n)+$/) { + if (/^\s*(?:([^\w\s]+)(\s*).*\n)(?:\s*\g1\g2?.*\n)+$/) { ($white, $leader) = ($2, quotemeta($1)); } else { ($white, $leader) = (/^(\s+)/, ''); diff --git a/pod/perlfaq6.pod b/pod/perlfaq6.pod index b51884d063..fe91933bba 100644 --- a/pod/perlfaq6.pod +++ b/pod/perlfaq6.pod @@ -99,7 +99,7 @@ record read in. $/ = ''; # read in whole paragraph, not just one line while ( <> ) { - while ( /\b([\w'-]+)(\s+\1)+\b/gi ) { # word starts alpha + while ( /\b([\w'-]+)(\s+\g1)+\b/gi ) { # word starts alpha print "Duplicate $1 at paragraph $.\n"; } } diff --git a/pod/perlfaq9.pod b/pod/perlfaq9.pod index fa9ef11673..7ea553afa9 100644 --- a/pod/perlfaq9.pod +++ b/pod/perlfaq9.pod @@ -114,7 +114,7 @@ entities--like C<<> for example. 
Here's one "simple-minded" approach, that works for most files: #!/usr/bin/perl -p0777 - s/<(?:[^>'"]*|(['"]).*?\1)*>//gs + s/<(?:[^>'"]*|(['"]).*?\g1)*>//gs If you want a more complete solution, see the 3-stage striphtml program in @@ -166,7 +166,7 @@ attribute is HREF and there are no other attributes. # qxurl - tchrist@perl.com print "$2\n" while m{ < \s* - A \s+ HREF \s* = \s* (["']) (.*?) \1 + A \s+ HREF \s* = \s* (["']) (.*?) \g1 \s* > }gsix; diff --git a/pod/perlglossary.pod b/pod/perlglossary.pod index 858f24b12c..41dc412b57 100644 --- a/pod/perlglossary.pod +++ b/pod/perlglossary.pod @@ -234,13 +234,14 @@ some of its high-level ideas. =item backreference A substring L<captured|/capturing> by a subpattern within -unadorned parentheses in a L</regex>, also referred to as a capture group. -Backslashed decimal numbers -(C<\1>, C<\2>, etc.) later in the same pattern refer back to the -corresponding subpattern in the current match. Outside the pattern, +unadorned parentheses in a L</regex>, also referred to as a capture group. The +sequences (C<\g1>, C<\g2>, etc.) later in the same pattern refer back to +the corresponding subpattern in the current match. Outside the pattern, the numbered variables (C<$1>, C<$2>, etc.) continue to refer to these same values, as long as the pattern was the last successful match of -the current dynamic scope. +the current dynamic scope. C<\g{-1}> can be used to refer to a group by +relative rather than absolute position; and groups can be also be named, and +referred to later by name rather than number. See L<perlre/Capture Buffers>. =item backtracking diff --git a/pod/perlre.pod b/pod/perlre.pod index 9a7e4fef06..8f193c8acc 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -381,44 +381,29 @@ loop. Take care when using patterns that include C<\G> in an alternation. The bracketing construct C<( ... )> creates capture groups (also referred to as capture buffers). 
To refer to the current contents of a group later on, within -same pattern, use \1 for the first, \2 for the second, and so on. -Outside the match use "$" instead of "\". (The -\<digit> notation works in certain circumstances outside -the match. See L</Warning on \1 Instead of $1> below for details.) -Referring back to another part of the match is called a -I<backreference>. +the same pattern, use C<\g1> (or C<\g{1}>) for the first, C<\g2> (or C<\g{2}>) +for the second, and so on. +This is called a I<backreference>. X<regex, capture buffer> X<regexp, capture buffer> X<regex, capture group> X<regexp, capture group> X<regular expression, capture buffer> X<backreference> X<regular expression, capture group> X<backreference> - -There is no limit to the number of captured substrings that you may -use. However Perl also uses \10, \11, etc. as aliases for \010, -\011, etc. (Recall that 0 means octal, so \011 is the character at -number 9 in your coded character set; which would be the 10th character, -a horizontal tab under ASCII.) Perl resolves this -ambiguity by interpreting \10 as a backreference only if at least 10 -left parentheses have opened before it. Likewise \11 is a -backreference only if at least 11 left parentheses have opened -before it. And so on. \1 through \9 are always interpreted as -backreferences. -If the bracketing group did not match, the associated backreference won't -match either. (This can happen if the bracketing group is optional, or -in a different branch of an alternation.) - X<\g{1}> X<\g{-1}> X<\g{name}> X<relative backreference> X<named backreference> -In order to provide a safer and easier way to construct patterns using -backreferences, Perl provides the C<\g{N}> notation (starting with perl -5.10.0). The curly brackets are optional, however omitting them is less -safe as the meaning of the pattern can be changed by text (such as digits) -following it. 
When N is a positive integer the C<\g{N}> notation is -exactly equivalent to using normal backreferences. When N is a negative -integer then it is a relative backreference referring to the previous N'th -capturing group. When the bracket form is used and N is not an integer, it -is treated as a reference to a named group. - -Thus C<\g{-1}> refers to the last group, C<\g{-2}> refers to the -group before that. For example: +X<named capture buffer> X<regular expression, named capture buffer> +X<named capture group> X<regular expression, named capture group> +X<%+> X<$+{name}> X<< \k<name> >> +There is no limit to the number of captured substrings that you may use. +Groups are numbered with the leftmost open parenthesis being number 1, etc. If +a group did not match, the associated backreference won't match either. (This +can happen if the group is optional, or in a different branch of an +alternation.) +You can omit the C<"g">, and write C<"\1">, etc, but there are some issues with +this form, described below. + +You can also refer to capture groups relatively, by using a negative number, so +that C<\g-1> and C<\g{-1}> both refer to the immediately preceding capture +group, and C<\g-2> and C<\g{-2}> both refer to the group before it. For +example: / (Y) # group 1 @@ -429,33 +414,62 @@ group before that. For example: ) /x -and would match the same as C</(Y) ( (X) \3 \1 )/x>. - -Additionally, as of Perl 5.10.0 you may use named capture groups and named -backreferences. The notation is C<< (?<name>...) >> to declare and C<< \k<name> >> -to reference. You may also use apostrophes instead of angle brackets to delimit the -name; and you may use the bracketed C<< \g{name} >> backreference syntax. -It's possible to refer to a named capture group by absolute and relative number as well. -Outside the pattern, a named capture group is available via the C<%+> hash. 
-When different groups within the same pattern have the same name, C<$+{name}> -and C<< \k<name> >> refer to the leftmost defined group. (Thus it's possible -to do things with named capture groups that would otherwise require C<(??{})> -code to accomplish.) -X<named capture buffer> X<regular expression, named capture buffer> -X<named capture group> X<regular expression, named capture group> -X<%+> X<$+{name}> X<< \k<name> >> +would match the same as C</(Y) ( (X) \g3 \g1 )/x>. This allows you to +interpolate regexes into larger regexes and not have to worry about the +capture groups being renumbered. + +You can dispense with numbers altogether and create named capture groups. +The notation is C<(?E<lt>I<name>E<gt>...)> to declare and C<\g{I<name>}> to +reference. (To be compatible with .Net regular expressions, C<\g{I<name>}> may +also be written as C<\k{I<name>}>, C<\kE<lt>I<name>E<gt>> or C<\k'I<name>'>.) +I<name> must not begin with a number, nor contain hyphens. +When different groups within the same pattern have the same name, any reference +to that name assumes the leftmost defined group. Named groups count in +absolute and relative numbering, and so can also be referred to by those +numbers. +(It's possible to do things with named capture groups that would otherwise +require C<(??{})>.) + +Capture group contents are dynamically scoped and available to you outside the +pattern until the end of the enclosing block or until the next successful +match, whichever comes first. (See L<perlsyn/"Compound Statements">.) +You can refer to them by absolute number (using C<"$1"> instead of C<"\g1">, +etc); or by name via the C<%+> hash, using C<"$+{I<name>}">. + +Braces are required in referring to named capture groups, but are optional for +absolute or relative numbered ones. Braces are safer when creating a regex by +concatenating smaller strings. 
For example if you have C<qr/$a$b/>, and C<$a> +contained C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which +is probably not what you intended. + +The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that +there were no named nor relative numbered capture groups. Absolute numbered +groups were referred to using C<\1>, C<\2>, etc, and this notation is still +accepted (and likely always will be). But it leads to some ambiguities if +there are more than 9 capture groups, as C<\10> could mean either the tenth +capture group, or the character whose ordinal in octal is 010 (a backspace in +ASCII). Perl resolves this ambiguity by interpreting C<\10> as a backreference +only if at least 10 left parentheses have opened before it. Likewise C<\11> is +a backreference only if at least 11 left parentheses have opened before it. +And so on. C<\1> through C<\9> are always interpreted as backreferences. You +can minimize the ambiguity by always using C<\g> if you mean capturing groups; +and always using 3 digits for octal constants, with the first always "0" (which +works if there are 63 (= \077) or fewer capture groups). + +The C<\I<digit>> notation also works in certain circumstances outside +the pattern. See L</Warning on \1 Instead of $1> below for details.) Examples: s/^([^ ]*) *([^ ]*)/$2 $1/; # swap first two words - /(.)\1/ # find first doubled char + /(.)\g1/ # find first doubled char and print "'$1' is the first doubled character\n"; /(?<char>.)\k<char>/ # ... a different way and print "'$+{char}' is the first doubled character\n"; - /(?'char'.)\1/ # ... mix and match + /(?'char'.)\g1/ # ... mix and match and print "'$1' is the first doubled character\n"; if (/Time: (..):(..):(..)/) { # parse out values @@ -475,14 +489,13 @@ extended patterns (see below), for example to assign a submatch to a variable. X<$+> X<$^N> X<$&> X<$`> X<$'> -The numbered match variables ($1, $2, $3, etc.) 
and the related punctuation -set (C<$+>, C<$&>, C<$`>, C<$'>, and C<$^N>) are all dynamically scoped +These special variables, like the C<%+> hash and the numbered match variables +(C<$1>, C<$2>, C<$3>, etc.) are dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See L<perlsyn/"Compound Statements">.) X<$+> X<$^N> X<$&> X<$`> X<$'> X<$1> X<$2> X<$3> X<$4> X<$5> X<$6> X<$7> X<$8> X<$9> - B<NOTE>: Failed matches in Perl do not reset the match variables, which makes it easier to write code that tests for a series of more specific cases and remembers the best match. @@ -490,7 +503,7 @@ specific cases and remembers the best match. B<WARNING>: Once Perl sees that you need one of C<$&>, C<$`>, or C<$'> anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl -uses the same mechanism to produce $1, $2, etc, so you also pay a +uses the same mechanism to produce C<$1>, C<$2>, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression C<(?: ... )> instead.) But if you never @@ -586,7 +599,7 @@ include C<(?i)> at the front of the pattern. For example: These modifiers are restored at the end of the enclosing group. For example, - ( (?i) blah ) \s+ \1 + ( (?i) blah ) \s+ \g1 will match C<blah> in any case, some spaces, and an exact (I<including the case>!) repetition of the previous word, assuming the C</x> modifier, and no C</i> @@ -1141,8 +1154,8 @@ C<a*ab> will match fewer characters than a standalone C<a*>, since this makes the tail match. An effect similar to C<< (?>pattern) >> may be achieved by writing -C<(?=(pattern))\1>. This matches the same substring as a standalone -C<a+>, and the following C<\1> eats the matched string; it therefore +C<(?=(pattern))\g1>. 
This matches the same substring as a standalone +C<a+>, and the following C<\g1> eats the matched string; it therefore makes a zero-length assertion into an analogue of C<< (?>...) >>. (The difference between these two constructs is that the second one uses a capturing group, thus shifting ordinals of backreferences @@ -1762,7 +1775,7 @@ I<n>th subpattern later in the pattern using the metacharacter \I<n>. Subpatterns are numbered based on the left to right order of their opening parenthesis. A backreference matches whatever actually matched the subpattern in the string being examined, not -the rules for that subpattern. Therefore, C<(0|0x)\d*\s\1\d*> will +the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will match "0x1234 0x4321", but not "0x1234 01234", because subpattern 1 matched "0x", even though the rule C<0|0x> could potentially match the leading 0 in the second number. diff --git a/pod/perlrebackslash.pod b/pod/perlrebackslash.pod index 5e514ceec6..4f1bed67a5 100644 --- a/pod/perlrebackslash.pod +++ b/pod/perlrebackslash.pod @@ -227,10 +227,10 @@ as a character without special meaning by the regex engine, and will match =head4 Caveat -Octal escapes potentially clash with backreferences. They both consist -of a backslash followed by numbers. So Perl has to use heuristics to -determine whether it is a backreference or an octal escape. Perl uses -the following rules: +Octal escapes potentially clash with old-style backreferences (see L</Absolute +referencing> below). They both consist of a backslash followed by numbers. So +Perl has to use heuristics to determine whether it is a backreference or an +octal escape. Perl uses the following rules: =over 4 @@ -348,7 +348,6 @@ L<perlunicode/Unicode Character Properties>. Mnemonic: I<p>roperty. - =head2 Referencing If capturing parenthesis are used in a regular expression, we can refer @@ -361,18 +360,18 @@ absolutely, relatively, and by name. 
=head3 Absolute referencing Either C<\gI<N>> (starting in Perl 5.10.0), or C<\I<N>> (old-style) where I<N> -is an positive (unsigned) decimal number of any length is an absolute reference +is a positive (unsigned) decimal number of any length is an absolute reference to a capturing group. -I<N> refers to the Nth set of parentheses - or more accurately, whatever has +I<N> refers to the Nth set of parentheses - so C<\gI<N>> refers to whatever has been matched by that set of parenthesis. Thus C<\g1> refers to the first capture group in the regex. The C<\gI<N>> form can be equivalently written as C<\g{I<N>}> which avoids ambiguity when building a regex by concatenating shorter -strings. Otherwise if you had a regex C</$a$b/>, and C<$a> contained C<"\g1">, -and C<$b> contained C<"37">, you would get C</\g137/> which is probably not -what you intended. +strings. Otherwise if you had a regex C<qr/$a$b/>, and C<$a> contained +C<"\g1">, and C<$b> contained C<"37">, you would get C</\g137/> which is +probably not what you intended. In the C<\I<N>> form, I<N> must not begin with a "0", and there must be at least I<N> capturing groups, or else I<N> will be considered an octal escape @@ -413,17 +412,15 @@ even if the larger pattern also contains capture groups. =head3 Named referencing -Also new in perl 5.10.0 is the use of named capture groups, which can be -referred to by name. This is done with C<\g{name}>, which is a -backreference to the capture group with the name I<name>. +C<\g{I<name>}> (starting in Perl 5.10.0) can be used to back refer to a +named capture group, dispensing completely with having to think about capture +buffer positions. To be compatible with .Net regular expressions, C<\g{name}> may also be written as C<\k{name}>, C<< \k<name> >> or C<\k'name'>. -Note that C<\g{}> has the potential to be ambiguous, as it could be a named -reference, or an absolute or relative reference (if its argument is numeric). 
-However, names are not allowed to start with digits, nor are they allowed to -contain a hyphen, so there is no ambiguity. +To prevent any ambiguity, I<name> must not start with a digit nor contain a +hyphen. =head4 Examples @@ -582,7 +579,7 @@ Mnemonic: eI<X>tended Unicode character. "\x{256}" =~ /^\C\C$/; # Match as chr (256) takes 2 octets in UTF-8. $str =~ s/foo\Kbar/baz/g; # Change any 'bar' following a 'foo' to 'baz' - $str =~ s/(.)\K\1//g; # Delete duplicated characters. + $str =~ s/(.)\K\g1//g; # Delete duplicated characters. "\n" =~ /^\R$/; # Match, \n is a generic newline. "\r" =~ /^\R$/; # Match, \r is a generic newline. diff --git a/pod/perlrequick.pod b/pod/perlrequick.pod index ded1e6cefc..1fde5588d6 100644 --- a/pod/perlrequick.pod +++ b/pod/perlrequick.pod @@ -298,13 +298,13 @@ indicated below it: 1 2 34 Associated with the matching variables C<$1>, C<$2>, ... are -the B<backreferences> C<\1>, C<\2>, ... Backreferences are +the B<backreferences> C<\g1>, C<\g2>, ... Backreferences are matching variables that can be used I<inside> a regex: - /(\w\w\w)\s\1/; # find sequences like 'the the' in string + /(\w\w\w)\s\g1/; # find sequences like 'the the' in string -C<$1>, C<$2>, ... should only be used outside of a regex, and C<\1>, -C<\2>, ... only inside a regex. +C<$1>, C<$2>, ... should only be used outside of a regex, and C<\g1>, +C<\g2>, ... only inside a regex. =head2 Matching repetitions @@ -347,7 +347,7 @@ Here are some examples: /[a-z]+\s+\d*/; # match a lowercase word, at least some space, and # any number of digits - /(\w+)\s+\1/; # match doubled words of arbitrary length + /(\w+)\s+\g1/; # match doubled words of arbitrary length $year =~ /\d{2,4}/; # make sure year is at least 2 but not more # than 4 digits $year =~ /\d{4}|\d{2}/; # better match; throw out 3 digit dates diff --git a/pod/perlretut.pod b/pod/perlretut.pod index 9eded21002..eae266a407 100644 --- a/pod/perlretut.pod +++ b/pod/perlretut.pod @@ -732,21 +732,21 @@ match). 
=head2 Backreferences Closely associated with the matching variables C<$1>, C<$2>, ... are -the I<backreferences> C<\1>, C<\2>,... Backreferences are simply +the I<backreferences> C<\g1>, C<\g2>,... Backreferences are simply matching variables that can be used I<inside> a regexp. This is a really nice feature; what matches later in a regexp is made to depend on what matched earlier in the regexp. Suppose we wanted to look for doubled words in a text, like 'the the'. The following regexp finds all 3-letter doubles with a space in between: - /\b(\w\w\w)\s\1\b/; + /\b(\w\w\w)\s\g1\b/; -The grouping assigns a value to \1, so that the same 3 letter sequence +The grouping assigns a value to \g1, so that the same 3 letter sequence is used for both parts. A similar task is to find words consisting of two identical parts: - % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\1$' /usr/dict/words + % simple_grep '^(\w\w\w\w|\w\w\w|\w\w|\w)\g1$' /usr/dict/words beriberi booboo coco @@ -755,10 +755,10 @@ A similar task is to find words consisting of two identical parts: papa The regexp has a single grouping which considers 4-letter -combinations, then 3-letter combinations, etc., and uses C<\1> to look for -a repeat. Although C<$1> and C<\1> represent the same thing, care should be +combinations, then 3-letter combinations, etc., and uses C<\g1> to look for +a repeat. Although C<$1> and C<\g1> represent the same thing, care should be taken to use matched variables C<$1>, C<$2>,... only I<outside> a regexp -and backreferences C<\1>, C<\2>,... only I<inside> a regexp; not doing +and backreferences C<\g1>, C<\g2>,... only I<inside> a regexp; not doing so may lead to surprising and unsatisfactory results. @@ -775,7 +775,7 @@ Another good reason in addition to readability and maintainability for using relative backreferences is illustrated by the following example, where a simple pattern for matching peculiar strings is used: - $a99a = '([a-z])(\d)\2\1'; # matches a11a, g22g, x33x, etc. 
+ $a99a = '([a-z])(\d)\g2\g1'; # matches a11a, g22g, x33x, etc. Now that we have this pattern stored as a handy string, we might feel tempted to use it as a part of some other pattern: @@ -976,7 +976,7 @@ Here are some examples: /[a-z]+\s+\d*/; # match a lowercase word, at least one space, and # any number of digits - /(\w+)\s+\1/; # match doubled words of arbitrary length + /(\w+)\s+\g1/; # match doubled words of arbitrary length /y(es)?/i; # matches 'y', 'Y', or a case-insensitive 'yes' $year =~ /\d{2,4}/; # make sure year is at least 2 but not more # than 4 digits @@ -984,7 +984,7 @@ Here are some examples: $year =~ /\d{2}(\d{2})?/; # same thing written differently. However, # this produces $1 and the other does not. - % simple_grep '^(\w+)\1$' /usr/dict/words # isn't this easier? + % simple_grep '^(\w+)\g1$' /usr/dict/words # isn't this easier? beriberi booboo coco @@ -2385,7 +2385,7 @@ The integer or name form of the C<condition> allows us to choose, with more flexibility, what to match based on what matched earlier in the regexp. This searches for words of the form C<"$x$x"> or C<"$x$y$y$x">: - % simple_grep '^(\w+)(\w+)?(?(2)\2\1|\1)$' /usr/dict/words + % simple_grep '^(\w+)(\w+)?(?(2)\g2\g1|\g1)$' /usr/dict/words beriberi coco couscous diff --git a/pod/perlsyn.pod b/pod/perlsyn.pod index 29db5dafd5..18143d180a 100644 --- a/pod/perlsyn.pod +++ b/pod/perlsyn.pod @@ -931,7 +931,7 @@ C preprocessors: it matches the regular expression # example: '# line 42 "new_filename.plx"' /^\# \s* line \s+ (\d+) \s* - (?:\s("?)([^"]+)\2)? \s* + (?:\s("?)([^"]+)\g2)? \s* $/x with C<$1> being the line number for the next line, and C<$3> being |
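The notations this patch moves the documentation toward can be tried out with a short script. The following is only a sketch and not part of the commit: it assumes Perl 5.10.0 or later, and the sample strings and the variable names ($doubled, $prefix, $letter, $digit, $named, $head, $tail, $braced_ok) are made up for illustration. It exercises an absolute reference (\g1), relative references (\g{-1}, \g{-2}) inside an interpolated pattern, a named reference (\g{char}), and the braced form that keeps a group number apart from digits that follow it.

```perl
use strict;
use warnings;
use 5.010;    # \g and \k need Perl 5.10.0 or later

# Absolute reference: \g1 must repeat whatever group 1 captured.
my ($doubled) = "this is the the test" =~ /\b(\w+)\s+\g1\b/;

# Relative references: \g{-1} and \g{-2} count back from where they appear,
# so the pattern keeps working when it is interpolated after another group.
my $a99a = qr/([a-z])(\d)\g{-1}\g{-2}/;    # matches a11a, g22g, x33x, ...
my ($prefix, $letter, $digit) = "xx k77k" =~ /(x+)\s$a99a/;

# Named reference: no counting of parentheses at all.
my $named = "aabbc" =~ /(?<char>.)\g{char}/ ? $+{char} : undef;

# Braces keep the group number apart from the digits that follow it.
# Without them, '\g1' . '37' would be read as \g137.
my ($head, $tail) = ('(\w)\g{1}', '37');
my $braced_ok = "zz37" =~ /\A$head$tail\z/ ? 1 : 0;

print "doubled=$doubled letter=$letter digit=$digit named=$named braced=$braced_ok\n";
# prints: doubled=the letter=k digit=7 named=a braced=1
```

In the relative case the outer pattern already has one capture group, so the two groups inside $a99a become groups 2 and 3. With absolute references the interpolated pattern would have pointed at the wrong groups, which is the renumbering problem the perlre and perlretut changes above describe.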