author: Father Chrysostomos <sprout@cpan.org> 2011-03-09 12:58:03 -0800
committer: Father Chrysostomos <sprout@cpan.org> 2011-03-09 12:58:58 -0800
commit: 0b928c2f2427ad8eb633f553929143e5010eba1f
tree: 3c1cc099abaad0b1debb9575e076fb3c639c4a83
parent: 7b5a08e99644a51f13ce3555c781b9664368a4af
perlre clean-up
Mostly typos, grammatical errors and factual errors (mostly due to bitrot), but also:

• The section explaining how to work around the lack of look behind obviously has not been relevant for years. :-)
• Since we have relative backreferences, we might as well use them in the explanation of the (?>...) construct.
• Note that it’s possible to backtrack *past* (?>...), but not into it.
• (?:non-zero-length|zero-length)* is *not* equivalent to nzl*|zl? as "aaaaab" =~ /(?:a|(?{print "hello"})(?=(b)))*/ demonstrates.
• The custom re engine section doesn’t mention custom re engines. :-)
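The two behavioural points in the message can be checked with a short script. This is only a sketch: the variable names ($past, $branch, $into, $tries, $whole, $captured) are illustrative and do not appear in the commit, and a counter replaces the print "hello" so the number of attempts can be inspected afterwards. The patterns themselves come from the commit message and the patched text, apart from (?>a*)ab, which is the standard perlre illustration of an atomic group that never matches.

```perl
use strict;
use warnings;

# Backtracking *past* an atomic group: (?>a*) matches the empty string at
# position 0, "ar" then fails, and the engine is still free to try the
# second alternative, (?>b*).
my ($past, $branch) = (0, undef);
if ("bar" =~ /((?>a*)|(?>b*))ar/) {
    ($past, $branch) = (1, $1);
}

# Backtracking *into* an atomic group: (?>a*) keeps every "a" it took,
# so the literal "a" that follows can never match.
my $into = ("aaab" =~ /(?>a*)ab/) ? 1 : 0;

# Zero-length iteration: count how often the zero-length branch is tried.
# $tries is a package variable so the embedded code block can update it.
our $tries = 0;
my ($whole, $captured);
if ("aaaaab" =~ / (?: a | (?{ $tries++ }) (?=(b)) )* /x) {
    ($whole, $captured) = ($&, $1);
}

print "past=$past branch=$branch into=$into\n";
print "tries=$tries whole=$whole captured=$captured\n";
```

If the patched documentation is right, the zero-length branch is tried once, after the five "a" iterations, and the loop stops there instead of repeating the empty match.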
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- pod/perlre.pod 170
1 files changed, 101 insertions, 69 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod
index 0f5b436e08..bb92720ec1 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -106,7 +106,7 @@ be careful not to include the pattern delimiter in the comment--perl has
no way of knowing you did not intend to close the pattern early. See
the C-comment deletion code in L<perlop>. Also note that anything inside
a C<\Q...\E> stays unaffected by C</x>. And note that C</x> doesn't affect
-whether space interpretation within a single multi-character construct. For
+space interpretation within a single multi-character construct. For
example in C<\x{...}>, regardless of the C</x> modifier, there can be no
spaces. Same for a L<quantifier|/Quantifiers> such as C<{3}> or
C<{5,}>. Similarly, C<(?:...)> can't have a space between the C<?> and C<:>,
@@ -114,7 +114,7 @@ but can between the C<(> and C<?>. Within any delimiters for such a
construct, allowed spaces are not affected by C</x>, and depend on the
construct. For example, C<\x{...}> can't have spaces because hexadecimal
numbers don't have spaces in them. But, Unicode properties can have spaces, so
-in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
+in C<\p{...}> there can be spaces that follow the Unicode rules, for which see
L<perluniprops/Properties accessible through \p{} and \P{}>.
X</x>
@@ -129,7 +129,7 @@ within the scope of a C<"use locale"> pragma.
Perl only allows single-byte locales. This means that code points above
255 are treated as Unicode no matter what locale is in effect.
Under Unicode rules, there are a few case-insensitive matches that cross the
-boundary 255/256 boundary. These are disallowed. For example,
+255/256 boundary. These are disallowed. For example,
0xFF does not caselessly match the character at 0x178, LATIN CAPITAL
LETTER Y WITH DIAERESIS, because 0xFF may not be LATIN SMALL LETTER Y
in the current locale, and Perl has no way of knowing if that character
@@ -137,11 +137,11 @@ even exists in the locale, much less what code point it is.
X</l>
C</u> means to use Unicode semantics when pattern matching. It is
-automatically set if the regular expression is encoded in utf8, or is
-compiled within the scope of a
+automatically set if the regular expression is encoded in utf8 internally,
+or is compiled within the scope of a
L<C<"use feature 'unicode_strings">|feature> pragma (and isn't also in
-the scope of L<C<"use locale">|locale> nor L<C<"use bytes">|bytes>
-pragmas). On ASCII platforms, the code points between 128 and 255 take on their
+the scope of the L<C<"use locale">|locale> or the L<C<"use bytes">|bytes>
+pragma). On ASCII platforms, the code points between 128 and 255 take on their
Latin-1 (ISO-8859-1) meanings (which are the same as Unicode's), whereas
in strict ASCII their meanings are undefined. Thus the platform
effectively becomes a Unicode platform. The ASCII characters remain as
@@ -174,7 +174,7 @@ semantics; for example, "k" will match the Unicode C<\N{KELVIN SIGN}>
under C</i> matching, and code points in the Latin1 range, above ASCII
will have Unicode semantics when it comes to case-insensitive matching.
But writing two in "a"'s in a row will increase its effect, causing the
-Kelvin sign and all other non-ASCII characters to not match any ASCII
+Kelvin sign and all other non-ASCII characters not to match any ASCII
character under C</i> matching.
X</a>
@@ -227,7 +227,7 @@ newline within the string (except if the newline is the last character in
the string), and "$" will match before any newline. At the
cost of a little more overhead, you can do this by using the /m modifier
on the pattern match operator. (Older programs did this by setting C<$*>,
-but this practice has been removed in perl 5.9.)
+but this option was removed in perl 5.9.)
X<^> X<$> X</m>
To simplify multi-line substitutions, the "." character never matches a
@@ -247,7 +247,8 @@ X<metacharacter> X<quantifier> X<*> X<+> X<?> X<{n}> X<{n,}> X<{n,m}>
{n,} Match at least n times
{n,m} Match at least n but not more than m times
-(If a curly bracket occurs in any other context, it is treated
+(If a curly bracket occurs in any other context and does not form part of
+a backslashed sequence like C<\x{...}>, it is treated
as a regular character. In particular, the lower bound
is not optional.) The "*" quantifier is equivalent to C<{0,}>, the "+"
quantifier to C<{1,}>, and the "?" quantifier to C<{0,1}>. n and m are limited
@@ -268,7 +269,7 @@ X<?> X<*?> X<+?> X<??> X<{n}?> X<{n,}?> X<{n,m}?>
*? Match 0 or more times, not greedily
+? Match 1 or more times, not greedily
?? Match 0 or 1 time, not greedily
- {n}? Match exactly n times, not greedily
+ {n}? Match exactly n times, not greedily (redundant)
{n,}? Match at least n times, not greedily
{n,m}? Match at least n but not more than m times, not greedily
@@ -297,7 +298,8 @@ string" problem can be most efficiently performed when written as:
/"(?:[^"\\]++|\\.)*+"/
as we know that if the final quote does not match, backtracking will not
-help. See the independent subexpression C<< (?>...) >> for more details;
+help. See the independent subexpression
+L</C<< (?>pattern) >>> for more details;
possessive quantifiers are just syntactic sugar for that construct. For
instance the above example could also be written as follows:
@@ -305,7 +307,7 @@ instance the above example could also be written as follows:
=head3 Escape sequences
-Because patterns are processed as double quoted strings, the following
+Because patterns are processed as double-quoted strings, the following
also work:
\t tab (HT, TAB)
@@ -343,7 +345,7 @@ X<\g> X<\k> X<\K> X<backreference>
uppercase character.
\w [3] Match a "word" character (alphanumeric plus "_", plus
other connector punctuation chars plus Unicode
- marks
+ marks)
\W [3] Match a non-"word" character
\s [3] Match a whitespace character
\S [3] Match a non-whitespace character
@@ -442,10 +444,11 @@ The C<\G> assertion can be used to chain global matches (using
C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">.
It is also useful when writing C<lex>-like scanners, when you have
several patterns that you want to match against consequent substrings
-of your string, see the previous reference. The actual location
+of your string; see the previous reference. The actual location
where C<\G> will match can also be influenced by using C<pos()> as
an lvalue: see L<perlfunc/pos>. Note that the rule for zero-length
-matches is modified somewhat, in that contents to the left of C<\G> is
+matches (see L</"Repeated Patterns Matching a Zero-length Substring">)
+is modified somewhat, in that contents to the left of C<\G> are
not counted when determining the length of the match. Thus the following
will not match forever:
X<\G>
@@ -530,7 +533,8 @@ is probably not what you intended.
The C<\g> and C<\k> notations were introduced in Perl 5.10.0. Prior to that
there were no named nor relative numbered capture groups. Absolute numbered
-groups were referred to using C<\1>, C<\2>, etc, and this notation is still
+groups were referred to using C<\1>,
+C<\2>, etc., and this notation is still
accepted (and likely always will be). But it leads to some ambiguities if
there are more than 9 capture groups, as C<\10> could mean either the tenth
capture group, or the character whose ordinal in octal is 010 (a backspace in
@@ -655,7 +659,8 @@ consult L<perlop/"Gory details of parsing quoted constructs">.
=head2 Extended Patterns
Perl also defines a consistent extension syntax for features not
-found in standard tools like B<awk> and B<lex>. The syntax is a
+found in standard tools like B<awk> and
+B<lex>. The syntax for most of these is a
pair of parentheses with a question mark as the first thing within
the parentheses. The character after the question mark indicates
the extension.
@@ -669,7 +674,7 @@ status.
A question mark was chosen for this and for the minimal-matching
construct because 1) question marks are rare in older regular
expressions, and 2) whenever you see one, you should stop and
-"question" exactly what is going on. That's psychology...
+"question" exactly what is going on. That's psychology....
=over 10
@@ -692,8 +697,8 @@ the remainder of the enclosing pattern group (if any).
This is particularly useful for dynamic patterns, such as those read in from a
configuration file, taken from an argument, or specified in a table
-somewhere. Consider the case where some patterns want to be case
-sensitive and some do not: The case insensitive ones merely need to
+somewhere. Consider the case where some patterns want to be
+case-sensitive and some do not: The case-insensitive ones merely need to
include C<(?i)> at the front of the pattern. For example:
$pattern = "foobar";
@@ -729,7 +734,7 @@ Note that the C<a>, C<d>, C<l>, C<p>, and C<u> modifiers are special in
that they can only be enabled, not disabled, and the C<a>, C<d>, C<l>, and
C<u> modifiers are mutually exclusive: specifying one de-specifies the
others, and a maximum of one may appear in the construct. Thus, for
-example, C<(?-p)>, will warn when compiled under C<use warnings>;
+example, C<(?-p)> will warn when compiled under C<use warnings>;
C<(?-d:...)> and C<(?dl:...)> are fatal errors.
Note also that the C<p> modifier is special in that its presence
@@ -775,7 +780,7 @@ is equivalent to
(?x-ims:foo)
The caret tells Perl that this cluster doesn't inherit the flags of any
-surrounding pattern, but to go back to the system defaults (C<d-imsx>),
+surrounding pattern, but uses the system defaults (C<d-imsx>),
modified by any flags specified.
The caret allows for simpler stringification of compiled regular
@@ -810,7 +815,7 @@ following this construct will be numbered as though the construct
contained only one branch, that being the one with the most capture
groups in it.
-This construct will be useful when you want to capture one of a
+This construct is useful when you want to capture one of a
number of alternative matches.
Consider the following pattern. The numbers underneath show in
@@ -843,7 +848,7 @@ named C<< b >> are aliases for the group belonging to C<< $1 >>.
=item Look-Around Assertions
X<look-around assertion> X<lookaround assertion> X<look-around> X<lookaround>
-Look-around assertions are zero width patterns which match a specific
+Look-around assertions are zero-width patterns which match a specific
pattern without including it in C<$&>. Positive assertions match when
their subpattern matches, negative assertions match when their subpattern
fails. Look-behind matches text up to the current match position,
@@ -868,14 +873,7 @@ use this for look-behind.
If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/>
will not do what you want. That's because the C<(?!foo)> is just saying that
the next thing cannot be "foo"--and it's not, it's a "bar", so "foobar" will
-match. You would have to do something like C</(?!foo)...bar/> for that. We
-say "like" because there's the case of your "bar" not having three characters
-before it. You could cover that this way: C</(?:(?!foo)...|^.{0,2})bar/>.
-Sometimes it's still easier just to say:
-
- if (/bar/ && $` !~ /foo$/)
-
-For look-behind see below.
+match. Use look-behind instead (see below).
=item C<(?<=pattern)> C<\K>
X<(?<=)> X<look-behind, positive> X<lookbehind, positive> X<\K>
@@ -886,7 +884,7 @@ Works only for fixed-width look-behind.
There is a special form of this construct, called C<\K>, which causes the
regex engine to "keep" everything it had matched prior to the C<\K> and
-not include it in C<$&>. This effectively provides variable length
+not include it in C<$&>. This effectively provides variable-length
look-behind. The use of C<\K> inside of another look-around assertion
is allowed, but the behaviour is currently not well defined.
@@ -916,8 +914,10 @@ only for fixed-width look-behind.
X<< (?<NAME>) >> X<(?'NAME')> X<named capture> X<capture>
A named capture group. Identical in every respect to normal capturing
-parentheses C<()> but for the additional fact that C<%+> or C<%-> may be
-used after a successful match to refer to a named group. See L<perlvar>
+parentheses C<()> but for the additional fact that the group
+can be referred to by name in various regular expression
+constructs (like C<\g{NAME}>) and can be accessed by name
+after a successful match via C<%+> or C<%->. See L<perlvar>
for more details on the C<%+> and C<%-> hashes.
If multiple distinct capture groups have the same name then the
@@ -1022,7 +1022,7 @@ L<"Backtracking">.
For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
-variables contain results of C<qr//> operator (see
+variables contain results of the C<qr//> operator (see
L<perlop/"qr/STRINGE<sol>msixpodual">).
This restriction is due to the wide-spread and remarkably convenient
@@ -1062,7 +1062,7 @@ due to the effect of future optimisations in the regex engine.
This is a "postponed" regular subexpression. The C<code> is evaluated
at run time, at the moment this subexpression may match. The result
-of evaluation is considered as a regular expression and matched as
+of evaluation is considered a regular expression and matched as
if it were inserted instead of this construct. Note that this means
that the contents of capture groups defined inside an eval'ed pattern
are not available outside of the pattern, and vice versa, there is no
@@ -1094,7 +1094,7 @@ the same task.
For reasons of security, this construct is forbidden if the regular
expression involves run-time interpolation of variables, unless the
perilous C<use re 'eval'> pragma has been used (see L<re>), or the
-variables contain results of C<qr//> operator (see
+variables contain results of the C<qr//> operator (see
L<perlop/"qrE<sol>STRINGE<sol>msixpodual">).
In perl 5.12.x and earlier, because the regex engine was not re-entrant,
@@ -1268,9 +1268,9 @@ For example:
matches a chunk of non-parentheses, possibly included in parentheses
themselves.
-A special form is the C<(DEFINE)> predicate, which never executes directly
-its yes-pattern, and does not allow a no-pattern. This allows to define
-subpatterns which will be executed only by using the recursion mechanism.
+A special form is the C<(DEFINE)> predicate, which never executes its
+yes-pattern directly, and does not allow a no-pattern. This allows one to
+define subpatterns which will be executed only by the recursion mechanism.
This way, you can define a set of regular expression rules that can be
bundled into any pattern you choose.
@@ -1314,9 +1314,13 @@ group C<ab> (see L<"Backtracking">). In particular, C<a*> inside
C<a*ab> will match fewer characters than a standalone C<a*>, since
this makes the tail match.
+C<< (?>pattern) >> does not disable backtracking altogether once it has
+matched. It is still possible to backtrack past the construct, but not
+into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".
+
An effect similar to C<< (?>pattern) >> may be achieved by writing
-C<(?=(pattern))\g1>. This matches the same substring as a standalone
-C<a+>, and the following C<\g1> eats the matched string; it therefore
+C<(?=(pattern))\g{-1}>. This matches the same substring as a standalone
+C<a+>, and the following C<\g{-1}> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
@@ -1356,7 +1360,8 @@ hung. However, a tiny change to this pattern
which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
the time when used on a similar string with 1000000 C<a>s. Be aware,
-however, that this pattern currently triggers a warning message under
+however, that, when this construct is followed by a
+quantifier, it currently triggers a warning message under
the C<use warnings> pragma or B<-w> switch saying it
C<"matches null string many times in regex">.
@@ -1427,7 +1432,7 @@ C<(*MARK:NAME)> pattern executed. See the explanation for the
C<(*MARK:NAME)> verb below for more details.
B<NOTE:> C<$REGERROR> and C<$REGMARK> are not magic variables like C<$1>
-and most other regex related variables. They are not local to a scope, nor
+and most other regex-related variables. They are not local to a scope, nor
readonly, but instead are volatile package variables similar to C<$AUTOLOAD>.
Use C<local> to localize changes to them to a specific scope if necessary.
@@ -1475,7 +1480,7 @@ If we add a C<(*PRUNE)> before the count like the following
'aaab' =~ /a+b?(*PRUNE)(?{print "$&\n"; $count++})(*FAIL)/;
print "Count=$count\n";
-we prevent backtracking and find the count of the longest matching
+we prevent backtracking and find the count of the longest matching string
at each matching starting point like so:
aaab
@@ -1509,7 +1514,7 @@ encountered, then the C<(*SKIP)> operator has no effect. When used
without a name the "skip point" is where the match point was when
executing the (*SKIP) pattern.
-Compare the following to the examples in C<(*PRUNE)>, note the string
+Compare the following to the examples in C<(*PRUNE)>; note the string
is twice as long:
'aaabaaab' =~ /a+b?(*SKIP)(?{print "$&\n"; $count++})(*FAIL)/;
@@ -1646,7 +1651,7 @@ For instance:
'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
-be set. If another branch in the inner parentheses were matched, such as in the
+be set. If another branch in the inner parentheses was matched, such as in the
string 'ACDE', then the C<D> and C<E> would have to be matched as well.
=back
@@ -1805,7 +1810,7 @@ let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.
The search engine will initially match C<\D*> with "ABC". Then it will
-try to match C<(?!123> with "123", which fails. But because
+try to match C<(?!123)> with "123", which fails. But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.
@@ -1874,7 +1879,7 @@ character; "\\" matches a "\"). This escape mechanism is also required
for the character used as the pattern delimiter.
A series of characters matches that series of characters in the target
-string, so the pattern C<blurfl> would match "blurfl" in the target
+string, so the pattern C<blurfl> would match "blurfl" in the target
string.
You can specify a character class, by enclosing a list of characters
@@ -1913,9 +1918,9 @@ You can specify a series of alternatives for a pattern using "|" to
separate them, so that C<fee|fie|foe> will match any of "fee", "fie",
or "foe" in the target string (as would C<f(e|i|o)e>). The
first alternative includes everything from the last pattern delimiter
-("(", "[", or the beginning of the pattern) up to the first "|", and
+("(", "(?:", etc. or the beginning of the pattern) up to the first "|", and
the last alternative contains everything from the last "|" to the next
-pattern delimiter. That's why it's common practice to include
+closing pattern delimiter. That's why it's common practice to include
alternatives in parentheses: to minimize confusion about where they
start and end.
@@ -1933,7 +1938,7 @@ so if you write C<[fee|fie|foe]> you're really only matching C<[feio|]>.
Within a pattern, you may designate subpatterns for later reference
by enclosing them in parentheses, and you may refer back to the
I<n>th subpattern later in the pattern using the metacharacter
-\I<n>. Subpatterns are numbered based on the left to right order
+\I<n> or \gI<n>. Subpatterns are numbered based on the left to right order
of their opening parenthesis. A backreference matches whatever
actually matched the subpattern in the string being examined, not
the rules for that subpattern. Therefore, C<(0|0x)\d*\s\g1\d*> will
@@ -2013,12 +2018,34 @@ zero-length substring. Thus
is made equivalent to
- m{ (?: NON_ZERO_LENGTH )*
- |
- (?: ZERO_LENGTH )?
- }x;
+ m{ (?: NON_ZERO_LENGTH )* (?: ZERO_LENGTH )? }x;
+
+For example, this program
+
+ #!perl -l
+ "aaaaab" =~ /
+ (?:
+ a # non-zero
+ | # or
+ (?{print "hello"}) # print hello whenever this
+ # branch is tried
+ (?=(b)) # zero-width assertion
+ )* # any number of times
+ /x;
+ print $&;
+ print $1;
-The higher level-loops preserve an additional state between iterations:
+prints
+
+ hello
+ aaaaa
+ b
+
+Notice that "hello" is only printed once, as when Perl sees that the sixth
+iteration of the outermost C<(?:)*> matches a zero-length string, it stops
+the C<*>.
+
+The higher-level loops preserve an additional state between iterations:
whether the last match was zero-length. To break the loop, the following
match after a zero-length match is prohibited to have a length of zero.
This prohibition interacts with backtracking (see L<"Backtracking">),
@@ -2049,7 +2076,7 @@ Each of the elementary pieces of regular expressions which were described
before (such as C<ab> or C<\Z>) could match at most one substring
at the given position of the input string. However, in a typical regular
expression these elementary pieces are combined into more complicated
-patterns using combining operators C<ST>, C<S|T>, C<S*> etc
+patterns using combining operators C<ST>, C<S|T>, C<S*> etc.
(in these examples C<S> and C<T> are regular subexpressions).
Such combinations can include alternatives, leading to a problem of choice:
@@ -2078,11 +2105,11 @@ Consider two possible matches, C<AB> and C<A'B'>, C<A> and C<A'> are
substrings which can be matched by C<S>, C<B> and C<B'> are substrings
which can be matched by C<T>.
-If C<A> is better match for C<S> than C<A'>, C<AB> is a better
+If C<A> is a better match for C<S> than C<A'>, C<AB> is a better
match than C<A'B'>.
If C<A> and C<A'> coincide: C<AB> is a better match than C<AB'> if
-C<B> is better match for C<T> than C<B'>.
+C<B> is a better match for C<T> than C<B'>.
=item C<S|T>
@@ -2146,8 +2173,13 @@ than a match at a later position.
=head2 Creating Custom RE Engines
-Overloaded constants (see L<overload>) provide a simple way to extend
-the functionality of the RE engine.
+As of Perl 5.10.0, one can create custom regular expression engines. This
+is not for the faint of heart, as they have to plug in at the C level. See
+L<perlreapi> for more details.
+
+As an alternative, overloaded constants (see L<overload>) provide a simple
+way to extend the functionality of the RE engine, by substituting one
+pattern for another.
Suppose that we want to enable a new RE escape-sequence C<\Y|> which
matches at a boundary between whitespace characters and non-whitespace
@@ -2193,11 +2225,11 @@ part of this regular expression needs to be converted explicitly
$re = customre::convert $re;
/\Y|$re\Y|/;
-=head1 PCRE/Python Support
+=head2 PCRE/Python Support
-As of Perl 5.10.0, Perl supports several Python/PCRE specific extensions
+As of Perl 5.10.0, Perl supports several Python/PCRE-specific extensions
to the regex syntax. While Perl programmers are encouraged to use the
-Perl specific syntax, the following are also accepted:
+Perl-specific syntax, the following are also accepted:
=over 4
@@ -2217,11 +2249,11 @@ Subroutine call to a named capture group. Equivalent to C<< (?&NAME) >>.
=head1 BUGS
-There are numerous problems with case insensitive matching of characters
+There are numerous problems with case-insensitive matching of characters
outside the ASCII range, especially with those whose folds are multiple
characters, such as ligatures like C<LATIN SMALL LIGATURE FF>.
-In a bracketed character class with case insensitive matching, ranges only work
+In a bracketed character class with case-insensitive matching, ranges only work
for ASCII characters. For example,
C<m/[\N{CYRILLIC CAPITAL LETTER A}-\N{CYRILLIC CAPITAL LETTER YA}]/i>
doesn't match all the Russian upper and lower case letters.