=head1 NAME
X, C
=head2 Quoting metacharacters
Backslashed metacharacters in Perl are alphanumeric, such as C<\b>,
C<\w>, C<\n>. Unlike some other regular expression languages, there
are no backslashed symbols that aren't alphanumeric. So anything
that looks like C<\\>, C<\(>, C<\)>, C<\[>, C<\]>, C<\{>, or C<\}> is
always
interpreted as a literal character, not a metacharacter. This was
once used in a common idiom to disable or quote the special meanings
of regular expression metacharacters in a string that you want to
use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
(If C
is put into the special variable C<$^R>. This happens immediately, so
C<$^R> can be used from other C<(?{ code })> assertions inside the same
regular expression.
The assignment to C<$^R> above is properly localized, so the old
value of C<$^R> is restored if the assertion is backtracked; compare
L"Backtracking">.
Note that the special variable C<$^N> is particularly useful with code
blocks to capture the results of submatches in variables without having to
keep track of the number of nested parentheses. For example:
$_ = "The brown fox jumps over the lazy dog";
/the (\S+)(?{ $color = $^N }) (\S+)(?{ $animal = $^N })/i;
print "color = $color, animal = $animal\n";
=item C<(??{ code })>
X<(??{})>
X X X
B: Using this feature safely requires that you understand its
limitations. Code executed that has side effects may not perform
identically from version to version due to the effect of future
optimisations in the regex engine. For more information on this, see
L.
This is a "postponed" regular subexpression. It behaves in I the
same way as a C<(?{ code })> code block as described above, except that
its return value, rather than being assigned to C<$^R>, is treated as a
pattern, compiled if it's a string (or used as-is if its a qr// object),
then matched as if it were inserted instead of this construct.
During the matching of this sub-pattern, it has its own set of
captures which are valid during the sub-match, but are discarded once
control returns to the main pattern. For example, the following matches,
with the inner pattern capturing "B" and matching "BB", while the outer
pattern captures "A";
my $inner = '(.)\1';
"ABBA" =~ /^(.)(??{ $inner })\1/;
print $1; # prints "A";
Note that this means that there is no way for the inner pattern to refer
to a capture group defined outside. (The code block itself can use C<$1>,
I., to refer to the enclosing pattern's capture groups.) Thus, although
('a' x 100)=~/(??{'(.)' x 100})/
I match, it will I set C<$1> on exit.
The following pattern matches a parenthesized group:
$re = qr{
\(
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(??{ $re }) # Group with matching parens
)*
\)
}x;
See also
L)>|/(?PARNO) (?-PARNO) (?+PARNO) (?R) (?0)>
for a different, more efficient way to accomplish
the same task.
Executing a postponed regular expression too many times without
consuming any input string will also result in a fatal error. The depth
at which that happens is compiled into perl, so it can be changed with a
custom build.
=item C<(?I)> C<(?-I)> C<(?+I)> C<(?R)> C<(?0)>
X<(?PARNO)> X<(?1)> X<(?R)> X<(?0)> X<(?-1)> X<(?+1)> X<(?-PARNO)> X<(?+PARNO)>
X X X
X X X
Recursive subpattern. Treat the contents of a given capture buffer in the
current pattern as an independent subpattern and attempt to match it at
the current position in the string. Information about capture state from
the caller for things like backreferences is available to the subpattern,
but capture buffers set by the subpattern are not visible to the caller.
Similar to C<(??{ code })> except that it does not involve executing any
code or potentially compiling a returned pattern string; instead it treats
the part of the current pattern contained within a specified capture group
as an independent pattern that must match at the current position. Also
different is the treatment of capture buffers, unlike C<(??{ code })>
recursive patterns have access to their caller's match state, so one can
use backreferences safely.
I is a sequence of digits (not starting with 0) whose value reflects
the paren-number of the capture group to recurse to. C<(?R)> recurses to
the beginning of the whole pattern. C<(?0)> is an alternate syntax for
C<(?R)>. If I is preceded by a plus or minus sign then it is assumed
to be relative, with negative numbers indicating preceding capture groups
and positive ones following. Thus C<(?-1)> refers to the most recently
declared group, and C<(?+1)> indicates the next group to be declared.
Note that the counting for relative recursion differs from that of
relative backreferences, in that with recursion unclosed groups B
included.
The following pattern matches a function C which may contain
balanced parentheses as the argument.
$re = qr{ ( # paren group 1 (full function)
foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(?2) # Recurse to start of paren group 2
)*
)
\)
)
)
}x;
If the pattern was used as follows
'foo(bar(baz)+baz(bop))'=~/$re/
and print "\$1 = $1\n",
"\$2 = $2\n",
"\$3 = $3\n";
the output produced should be the following:
$1 = foo(bar(baz)+baz(bop))
$2 = (bar(baz)+baz(bop))
$3 = bar(baz)+baz(bop)
If there is no corresponding capture group defined, then it is a
fatal error. Recursing deeply without consuming any input string will
also result in a fatal error. The depth at which that happens is
compiled into perl, so it can be changed with a custom build.
The following shows how using negative indexing can make it
easier to embed recursive patterns inside of a C construct
for later use:
my $parens = qr/(\((?:[^()]++|(?-1))*+\))/;
if (/foo $parens \s+ \+ \s+ bar $parens/x) {
# do something here...
}
B that this pattern does not behave the same way as the equivalent
PCRE or Python construct of the same form. In Perl you can backtrack into
a recursed group, in PCRE and Python the recursed into group is treated
as atomic. Also, modifiers are resolved at compile time, so constructs
like C<(?i:(?1))> or C<(?:(?i)(?1))> do not affect how the sub-pattern will
be processed.
=item C<(?&NAME)>
X<(?&NAME)>
Recurse to a named subpattern. Identical to C<(?I)> except that the
parenthesis to recurse to is determined by name. If multiple parentheses have
the same name, then it recurses to the leftmost.
It is an error to refer to a name that is not declared somewhere in the
pattern.
B In order to make things easier for programmers with experience
with the Python or PCRE regex engines the pattern C<< (?P>NAME) >>
may be used instead of C<< (?&NAME) >>.
=item C<(?(condition)yes-pattern|no-pattern)>
X<(?()>
=item C<(?(condition)yes-pattern)>
Conditional expression. Matches C if C yields
a true value, matches C otherwise. A missing pattern always
matches.
C<(condition)> should be one of:
=over 4
=item an integer in parentheses
(which is valid if the corresponding pair of parentheses
matched);
=item a lookahead/lookbehind/evaluate zero-width assertion;
=item a name in angle brackets or single quotes
(which is valid if a group with the given name matched);
=item the special symbol C<(R)>
(true when evaluated inside of recursion or eval). Additionally the
C<"R"> may be
followed by a number, (which will be true when evaluated when recursing
inside of the appropriate group), or by C<&NAME>, in which case it will
be true only when evaluated during recursion in the named group.
=back
Here's a summary of the possible predicates:
=over 4
=item C<(1)> C<(2)> ...
Checks if the numbered capturing group has matched something.
Full syntax: C<< (?(1)then|else) >>
=item C<(EIE)> C<('I')>
Checks if a group with the given name has matched something.
Full syntax: C<< (?()then|else) >>
=item C<(?=...)> C<(?!...)> C<(?<=...)> C<(?
Checks whether the pattern matches (or does not match, for the C<"!">
variants).
Full syntax: C<< (?(?=lookahead)then|else) >>
=item C<(?{ I })>
Treats the return value of the code block as the condition.
Full syntax: C<< (?(?{ code })then|else) >>
=item C<(R)>
Checks if the expression has been evaluated inside of recursion.
Full syntax: C<< (?(R)then|else) >>
=item C<(R1)> C<(R2)> ...
Checks if the expression has been evaluated while executing directly
inside of the n-th capture group. This check is the regex equivalent of
if ((caller(0))[3] eq 'subname') { ... }
In other words, it does not check the full recursion stack.
Full syntax: C<< (?(R1)then|else) >>
=item C<(R&I)>
Similar to C<(R1)>, this predicate checks to see if we're executing
directly inside of the leftmost group with a given name (this is the same
logic used by C<(?&I)> to disambiguate). It does not check the full
stack, but only the name of the innermost active recursion.
Full syntax: C<< (?(R&name)then|else) >>
=item C<(DEFINE)>
In this case, the yes-pattern is never directly executed, and no
no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
See below for details.
Full syntax: C<< (?(DEFINE)definitions...) >>
=back
For example:
m{ ( \( )?
[^()]+
(?(1) \) )
}x
matches a chunk of non-parentheses, possibly included in parentheses
themselves.
A special form is the C<(DEFINE)> predicate, which never executes its
yes-pattern directly, and does not allow a no-pattern. This allows one to
define subpatterns which will be executed only by the recursion mechanism.
This way, you can define a set of regular expression rules that can be
bundled into any pattern you choose.
It is recommended that for this usage you put the DEFINE block at the
end of the pattern, and that you name any subpatterns defined within it.
Also, it's worth noting that patterns defined this way probably will
not be as efficient, as the optimizer is not very clever about
handling them.
An example of how this might be used is as follows:
/(?(?&NAME_PAT))(?(?&ADDRESS_PAT))
(?(DEFINE)
(?....)
(?....)
)/x
Note that capture groups matched inside of recursion are not accessible
after the recursion returns, so the extra layer of capturing groups is
necessary. Thus C<$+{NAME_PAT}> would not be defined even though
C<$+{NAME}> would be.
Finally, keep in mind that subpatterns created inside a DEFINE block
count towards the absolute and relative number of captures, so this:
my @captures = "a" =~ /(.) # First capture
(?(DEFINE)
(? 1 ) # Second capture
)/x;
say scalar @captures;
Will output 2, not 1. This is particularly important if you intend to
compile the definitions with the C operator, and later
interpolate them in another pattern.
=item C<< (?>pattern) >>
X X X X
An "independent" subexpression, one which matches the substring
that a I C would match if anchored at the given
position, and it matches I. This
construct is useful for optimizations of what would otherwise be
"eternal" matches, because it will not backtrack (see L"Backtracking">).
It may also be useful in places where the "grab all you can, and do not
give anything back" semantic is desirable.
For example: C<< ^(?>a*)ab >> will never match, since C<< (?>a*) >>
(anchored at the beginning of string, as above) will match I
characters C<"a"> at the beginning of string, leaving no C<"a"> for
C to match. In contrast, C will match the same as C,
since the match of the subgroup C is influenced by the following
group C (see L"Backtracking">). In particular, C inside
C will match fewer characters than a standalone C, since
this makes the tail match.
C<< (?>pattern) >> does not disable backtracking altogether once it has
matched. It is still possible to backtrack past the construct, but not
into it. So C<< ((?>a*)|(?>b*))ar >> will still match "bar".
An effect similar to C<< (?>pattern) >> may be achieved by writing
C<(?=(pattern))\g{-1}>. This matches the same substring as a standalone
C, and the following C<\g{-1}> eats the matched string; it therefore
makes a zero-length assertion into an analogue of C<< (?>...) >>.
(The difference between these two constructs is that the second one
uses a capturing group, thus shifting ordinals of backreferences
in the rest of a regular expression.)
Consider this pattern:
m{ \(
(
[^()]+ # x+
|
\( [^()]* \)
)+
\)
}x
That will efficiently match a nonempty group with matching parentheses
two levels deep or less. However, if there is no such group, it
will take virtually forever on a long string. That's because there
are so many different ways to split a long string into several
substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar
to a subpattern of the above pattern. Consider how the pattern
above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several
seconds, but that each extra letter doubles this time. This
exponential performance will make it appear that your program has
hung. However, a tiny change to this pattern
m{ \(
(
(?> [^()]+ ) # change x+ above to (?> x+ )
|
\( [^()]* \)
)+
\)
}x
which uses C<< (?>...) >> matches exactly when the one above does (verifying
this yourself would be a productive exercise), but finishes in a fourth
the time when used on a similar string with 1000000 C<"a">s. Be aware,
however, that, when this construct is followed by a
quantifier, it currently triggers a warning message under
the C pragma or B<-w> switch saying it
C<"matches null string many times in regex">.
On simple groups, such as the pattern C<< (?> [^()]+ ) >>, a comparable
effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>.
This was only 4 times slower on a string with 1000000 C<"a">s.
The "grab all you can, and do not give anything back" semantic is desirable
in many situations where on the first sight a simple C<()*> looks like
the correct solution. Suppose we parse text with comments being delimited
by C<"#"> followed by some optional (horizontal) whitespace. Contrary to
its appearance, C<#[ \t]*> I the correct subexpression to match
the comment delimiter, because it may "give up" some whitespace if
the remainder of the pattern can be made to match that way. The correct
answer is either one of these:
(?>#[ \t]*)
#[ \t]*(?![ \t])
For example, to grab non-empty comments into C<$1>, one should use either
one of these:
/ (?> \# [ \t]* ) ( .+ ) /x;
/ \# [ \t]* ( [^ \t] .* ) /x;
Which one you pick depends on which of these expressions better reflects
the above specification of comments.
In some literature this construct is called "atomic matching" or
"possessive matching".
Possessive quantifiers are equivalent to putting the item they are applied
to inside of one of these constructs. The following equivalences apply:
Quantifier Form Bracketing Form
--------------- ---------------
PAT*+ (?>PAT*)
PAT++ (?>PAT+)
PAT?+ (?>PAT?)
PAT{min,max}+ (?>PAT{min,max})
=item C<(?[ ])>
See L.
Note that this feature is currently L;
using it yields a warning in the C category.
=back
=head2 Backtracking
X X
NOTE: This section presents an abstract approximation of regular
expression behavior. For a more rigorous (and complicated) view of
the rules involved in selecting a match among possible alternatives,
see L.
A fundamental feature of regular expression matching involves the
notion called I, which is currently used (when needed)
by all regular non-possessive expression quantifiers, namely C<"*">, C<*?>, C<"+">,
C<+?>, C<{n,m}>, and C<{n,m}?>. Backtracking is often optimized
internally, but the general principle outlined here is valid.
For a regular expression to match, the I regular expression must
match, not just part of it. So if the beginning of a pattern containing a
quantifier succeeds in a way that causes later parts in the pattern to
fail, the matching engine backs up and recalculates the beginning
part--that's why it's called backtracking.
Here is an example of backtracking: Let's say you want to find the
word following "foo" in the string "Food is on the foo table.":
$_ = "Food is on the foo table.";
if ( /\b(foo)\s+(\w+)/i ) {
print "$2 follows $1.\n";
}
When the match runs, the first part of the regular expression (C<\b(foo)>)
finds a possible match right at the beginning of the string, and loads up
C<$1> with "Foo". However, as soon as the matching engine sees that there's
no whitespace following the "Foo" that it had saved in C<$1>, it realizes its
mistake and starts over again one character after where it had the
tentative match. This time it goes all the way until the next occurrence
of "foo". The complete regular expression matches this time, and you get
the expected output of "table follows foo."
Sometimes minimal matching can help a lot. Imagine you'd like to match
everything between "foo" and "bar". Initially, you write something
like this:
$_ = "The food is under the bar in the barn.";
if ( /foo(.*)bar/ ) {
print "got <$1>\n";
}
Which perhaps unexpectedly yields:
got
That's because C<.*> was greedy, so you get everything between the
I "foo" and the I "bar". Here it's more effective
to use minimal matching to make sure you get the text between a "foo"
and the first "bar" thereafter.
if ( /foo(.*?)bar/ ) { print "got <$1>\n" }
got
Here's another example. Let's say you'd like to match a number at the end
of a string, and you also want to keep the preceding part of the match.
So you write this:
$_ = "I have 2 numbers: 53147";
if ( /(.*)(\d*)/ ) { # Wrong!
print "Beginning is <$1>, number is <$2>.\n";
}
That won't work at all, because C<.*> was greedy and gobbled up the
whole string. As C<\d*> can match on an empty string the complete
regular expression matched successfully.
Beginning is , number is <>.
Here are some variants, most of which don't work:
$_ = "I have 2 numbers: 53147";
@pats = qw{
(.*)(\d*)
(.*)(\d+)
(.*?)(\d*)
(.*?)(\d+)
(.*)(\d+)$
(.*?)(\d+)$
(.*)\b(\d+)$
(.*\D)(\d+)$
};
for $pat (@pats) {
printf "%-12s ", $pat;
if ( /$pat/ ) {
print "<$1> <$2>\n";
} else {
print "FAIL\n";
}
}
That will print out:
(.*)(\d*) <>
(.*)(\d+) <7>
(.*?)(\d*) <> <>
(.*?)(\d+) <2>
(.*)(\d+)$ <7>
(.*?)(\d+)$ <53147>
(.*)\b(\d+)$ <53147>
(.*\D)(\d+)$ <53147>
As you see, this can be a bit tricky. It's important to realize that a
regular expression is merely a set of assertions that gives a definition
of success. There may be 0, 1, or several different ways that the
definition might succeed against a particular string. And if there are
multiple ways it might succeed, you need to understand backtracking to
know which variety of success you will achieve.
When using lookahead assertions and negations, this can all get even
trickier. Imagine you'd like to find a sequence of non-digits not
followed by "123". You might try to write that as
$_ = "ABC123";
if ( /^\D*(?!123)/ ) { # Wrong!
print "Yup, no 123 in $_\n";
}
But that isn't going to match; at least, not the way you're hoping. It
claims that there is no 123 in the string. Here's a clearer picture of
why that pattern matches, contrary to popular expectations:
$x = 'ABC123';
$y = 'ABC445';
print "1: got $1\n" if $x =~ /^(ABC)(?!123)/;
print "2: got $1\n" if $y =~ /^(ABC)(?!123)/;
print "3: got $1\n" if $x =~ /^(\D*)(?!123)/;
print "4: got $1\n" if $y =~ /^(\D*)(?!123)/;
This prints
2: got ABC
3: got AB
4: got ABC
You might have expected test 3 to fail because it seems to a more
general purpose version of test 1. The important difference between
them is that test 3 contains a quantifier (C<\D*>) and so can use
backtracking, whereas test 1 will not. What's happening is
that you've asked "Is it true that at the start of C<$x>, following 0 or more
non-digits, you have something that's not 123?" If the pattern matcher had
let C<\D*> expand to "ABC", this would have caused the whole pattern to
fail.
The search engine will initially match C<\D*> with "ABC". Then it will
try to match C<(?!123)> with "123", which fails. But because
a quantifier (C<\D*>) has been used in the regular expression, the
search engine can backtrack and retry the match differently
in the hope of matching the complete regular expression.
The pattern really, I wants to succeed, so it uses the
standard pattern back-off-and-retry and lets C<\D*> expand to just "AB" this
time. Now there's indeed something following "AB" that is not
"123". It's "C123", which suffices.
We can deal with this by using both an assertion and a negation.
We'll say that the first part in C<$1> must be followed both by a digit
and by something that's not "123". Remember that the lookaheads
are zero-width expressions--they only look, but don't consume any
of the string in their match. So rewriting this way produces what
you'd expect; that is, case 5 will fail, but case 6 succeeds:
print "5: got $1\n" if $x =~ /^(\D*)(?=\d)(?!123)/;
print "6: got $1\n" if $y =~ /^(\D*)(?=\d)(?!123)/;
6: got ABC
In other words, the two zero-width assertions next to each other work as though
they're ANDed together, just as you'd use any built-in assertions: C^$/>
matches only if you're at the beginning of the line AND the end of the
line simultaneously. The deeper underlying truth is that juxtaposition in
regular expressions always means AND, except when you write an explicit OR
using the vertical bar. C means match "a" AND (then) match "b",
although the attempted matches are made at different positions because "a"
is not a zero-width assertion, but a one-width assertion.
B: Particularly complicated regular expressions can take
exponential time to solve because of the immense number of possible
ways they can use backtracking to try for a match. For example, without
internal optimizations done by the regular expression engine, this will
take a painfully long time to run:
'aaaaaaaaaaaa' =~ /((a{0,5}){0,5})*[c]/
And if you used C<"*">'s in the internal groups instead of limiting them
to 0 through 5 matches, then it would take forever--or until you ran
out of stack space. Moreover, these internal optimizations are not
always applicable. For example, if you put C<{0,5}> instead of C<"*">
on the external group, no current optimization is applicable, and the
match takes a long time to finish.
A powerful tool for optimizing such beasts is what is known as an
"independent group",
which does not backtrack (see L
pattern) >>>). Note also that
zero-length lookahead/lookbehind assertions will not backtrack to make
the tail match, since they are in "logical" context: only
whether they match is considered relevant. For an example
where side-effects of lookahead I have influenced the
following match, see L