author | Ilya Zakharevich <ilya@math.berkeley.edu> | 1998-07-18 19:11:13 -0400 |
---|---|---|
committer | Gurusamy Sarathy <gsar@cpan.org> | 1998-07-21 04:37:49 +0000 |
commit | c84d73f1f9f9cc1f8dd8d87123bb87479c2f2754 | |
tree | 33217be85a058b171e3aab620f7d3a6aaeeedb7a /pod/perlre.pod | |
parent | 9c4d0f167fb7643c3ff9ccfd4b79b6aeb9763e43 | |
applied RE doc patches, with tweaks to the prose
Date: Sat, 18 Jul 1998 23:11:13 -0400 (EDT)
Message-Id: <199807190311.XAA25080@monk.mps.ohio-state.edu>
Subject: [PATCH 5.004_72] Document irregular zero-length matches
--
Date: Sun, 19 Jul 1998 00:38:44 -0400 (EDT)
Message-Id: <199807190438.AAA26226@monk.mps.ohio-state.edu>
Subject: [PATCH 5.004_72] Another irregularity of expressions documented
p4raw-id: //depot/perl@1598
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- | pod/perlre.pod | 120 |
1 files changed, 120 insertions, 0 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index fc4d969466..924a2c4115 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -775,6 +775,126 @@ C<${1}000>. Basically, the operation of interpolation should not be confused
 with the operation of matching a backreference. Certainly they mean two
 different things on the I<left> side of the C<s///>.
 
+=head2 Repeated patterns matching zero-length substring
+
+WARNING: Difficult material (and prose) ahead. This section needs a rewrite.
+
+Regular expressions provide a terse and powerful programming language. As
+with most other power tools, power comes together with the ability
+to wreak havoc.
+
+A common abuse of this power stems from the ability to make infinite
+loops using regular expressions, with something as innocous as:
+
+    'foo' =~ m{ ( o? )* }x;
+
+The C<o?> can match at the beginning of C<'foo'>, and since the position
+in the string is not moved by the match, C<o?> would match again and again
+due to the C<*> modifier. Another common way to create a similar cycle
+is with the looping modifier C<//g>:
+
+    @matches = ( 'foo' =~ m{ o? }xg );
+
+or
+
+    print "match: <$&>\n" while 'foo' =~ m{ o? }xg;
+
+or the loop implied by split().
+
+However, long experience has shown that many programming tasks may
+be significantly simplified by using repeated subexpressions which
+may match zero-length substrings, with a simple example being:
+
+    @chars = split //, $string; # // is not magic in split
+    ($whitewashed = $string) =~ s/()/ /g; # parens avoid magic s// /
+
+Thus Perl allows the C</()/> construct, which I<forcefully breaks
+the infinite loop>. The rules for this are different for lower-level
+loops given by the greedy modifiers C<*+{}>, and for higher-level
+ones like the C</g> modifier or split() operator.
+
+The lower-level loops are I<interrupted> when it is detected that a
+repeated expression did match a zero-length substring, thus
+
+    m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x;
+
+is made equivalent to
+
+    m{ (?: NON_ZERO_LENGTH )*
+     |
+       (?: ZERO_LENGTH )?
+     }x;
+
+The higher level-loops preserve an additional state between iterations:
+whether the last match was zero-length. To break the loop, the following
+match after a zero-length match is prohibited to have a length of zero.
+This prohibition interacts with backtracking (see L<"Backtracking">),
+and so the I<second best> match is chosen if the I<best> match is of
+zero length.
+
+Say,
+
+    $_ = 'bar';
+    s/\w??/<$&>/g;
+
+results in C<"<><b><><a><><r><>">. At each position of the string the best
+match given by non-greedy C<??> is the zero-length match, and the I<second
+best> match is what is matched by C<\w>. Thus zero-length matches
+alternate with one-character-long matches.
+
+Similarly, for repeated C<m/()/g> the second-best match is the match at the
+position one notch further in the string.
+
+The additional state of being I<matched with zero-length> is associated to
+the matched string, and is reset by each assignment to pos().
+
+=head2 Creating custom RE engines
+
+Overloaded constants (see L<overload>) provide a simple way to extend
+the functionality of the RE engine.
+
+Suppose that we want to enable a new RE escape-sequence C<\Y|> which
+matches at boundary between white-space characters and non-whitespace
+characters. Note that C<(?=\S)(?<!\S)|(?!\S)(?<=\S)> matches exactly
+at these positions, so we want to have each C<\Y|> in the place of the
+more complicated version. We can create a module C<customre> to do
+this:
+
+    package customre;
+    use overload;
+
+    sub import {
+      shift;
+      die "No argument to customre::import allowed" if @_;
+      overload::constant 'qr' => \&convert;
+    }
+
+    sub invalid { die "/$_[0]/: invalid escape '\\$_[1]'"}
+
+    my %rules = ( '\\' => '\\',
+                  'Y|' => qr/(?=\S)(?<!\S)|(?!\S)(?<=\S)/ );
+    sub convert {
+      my $re = shift;
+      $re =~ s{
+                \\ ( \\ | Y . )
+              }
+              { $rules{$1} or invalid($re,$1) }sgex;
+      return $re;
+    }
+
+Now C<use customre> enables the new escape in constant regular
+expressions, i.e., those without any runtime variable interpolations.
+As documented in L<overload>, this conversion will work only over
+literal parts of regular expressions. For C<\Y|$re\Y|> the variable
+part of this regular expression needs to be converted explicitly
+(but only if the special meaning of C<\Y|> should be enabled inside $re):
+
+    use customre;
+    $re = <>;
+    chomp $re;
+    $re = customre::convert $re;
+    /\Y|$re\Y|/;
+
 =head2 SEE ALSO
 
 L<perlop/"Regexp Quote-Like Operators">.
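
The zero-length matching rules documented in this patch can be checked with a short script. The following is a minimal sketch, not part of the commit: the variable names `$terminated`, `$marked` and `@chars` are illustrative assumptions, and the expected substitution result is the one quoted in the added POD text.

```perl
use strict;
use warnings;

# Lower-level loop: the * quantifier is interrupted once an iteration
# matches a zero-length substring, so this match returns instead of
# looping forever.
my $terminated = ( 'foo' =~ m{ ( o? )* }x ) ? 1 : 0;

# Higher-level loop: under /g a zero-length match may not be followed
# by another zero-length match at the same position, so the second-best
# (one-character) match is taken next.
( my $marked = 'bar' ) =~ s/\w??/<$&>/g;

# split with an empty pattern relies on the same rule to advance
# through the string one character at a time.
my @chars = split //, 'bar';

print "terminated: $terminated\n";
print "marked: $marked\n";    # the patch documents <><b><><a><><r><>
print "chars: @chars\n";
```

If the script prints the alternating `<>` and one-character groups for `marked`, the higher-level rule described in the new section is in effect on the Perl being tested.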