author | Gurusamy Sarathy <gsar@cpan.org> | 1999-05-24 07:24:11 +0000 |
---|---|---|
committer | Gurusamy Sarathy <gsar@cpan.org> | 1999-05-24 07:24:11 +0000 |
commit | 19799a22062ef658e4ac543ea06fa9193323512a (patch) | |
tree | ae9ae04d1351eb1dbbc2ea3cfe207cf056e56371 /pod/perlre.pod | |
parent | d92eb7b0e84a41728b3fbb642691f159dbe28882 (diff) | |
major pod update from Tom Christiansen
p4raw-id: //depot/perl@3460
Diffstat (limited to 'pod/perlre.pod')
-rw-r--r-- | pod/perlre.pod | 460 |
1 files changed, 246 insertions, 214 deletions
diff --git a/pod/perlre.pod b/pod/perlre.pod index 95d473439e..98d7b35066 100644 --- a/pod/perlre.pod +++ b/pod/perlre.pod @@ -6,13 +6,13 @@ perlre - Perl regular expressions This page describes the syntax of regular expressions in Perl. For a description of how to I<use> regular expressions in matching -operations, plus various examples of the same, see discussion +operations, plus various examples of the same, see discussions of C<m//>, C<s///>, C<qr//> and C<??> in L<perlop/"Regexp Quote-Like Operators">. -The matching operations can have various modifiers. The modifiers +Matching operations can have various modifiers. Modifiers that relate to the interpretation of the regular expression inside -are listed below. For the modifiers that alter the way a regular expression -is used by Perl, see L<perlop/"Regexp Quote-Like Operators"> and +are listed below. Modifiers that alter the way a regular expression +is used by Perl are detailed in L<perlop/"Regexp Quote-Like Operators"> and L<perlop/"Gory details of parsing quoted constructs">. =over 4 @@ -33,14 +33,15 @@ line anywhere within the string. =item s Treat string as single line. That is, change "." to match any character -whatsoever, even a newline, which it normally would not match. +whatsoever, even a newline, which normally it would not match. -The C</s> and C</m> modifiers both override the C<$*> setting. That is, no matter -what C<$*> contains, C</s> without C</m> will force "^" to match only at the -beginning of the string and "$" to match only at the end (or just before a -newline at the end) of the string. Together, as /ms, they let the "." match -any character whatsoever, while yet allowing "^" and "$" to match, -respectively, just after and just before newlines within the string. +The C</s> and C</m> modifiers both override the C<$*> setting. 
That +is, no matter what C<$*> contains, C</s> without C</m> will force +"^" to match only at the beginning of the string and "$" to match +only at the end (or just before a newline at the end) of the string. +Together, as /ms, they let the "." match any character whatsoever, +while yet allowing "^" and "$" to match, respectively, just after +and just before newlines within the string. =item x @@ -70,11 +71,11 @@ in L<perlop>. =head2 Regular Expressions -The patterns used in pattern matching are regular expressions such as -those supplied in the Version 8 regex routines. (In fact, the -routines are derived (distantly) from Henry Spencer's freely -redistributable reimplementation of the V8 routines.) -See L<Version 8 Regular Expressions> for details. +The patterns used in Perl pattern matching derive from supplied in +the Version 8 regex routines. (In fact, the routines are derived +(distantly) from Henry Spencer's freely redistributable reimplementation +of the V8 routines.) See L<Version 8 Regular Expressions> for +details. In particular the following metacharacters have their standard I<egrep>-ish meanings: @@ -177,12 +178,13 @@ In addition, Perl defines the following: equivalent to C<(?:\PM\pM*)> \C Match a single C char (octet) even under utf8. -A C<\w> matches a single alphanumeric character, not a whole -word. To match a word you'd need to say C<\w+>. If C<use locale> is in -effect, the list of alphabetic characters generated by C<\w> is taken -from the current locale. See L<perllocale>. You may use C<\w>, C<\W>, -C<\s>, C<\S>, C<\d>, and C<\D> within character classes (though not as -either end of a range). +A C<\w> matches a single alphanumeric character, not a whole word. +To match a word you'd need to say C<\w+>. If C<use locale> is in +effect, the list of alphabetic characters generated by C<\w> is +taken from the current locale. See L<perllocale>. 
You may use +C<\w>, C<\W>, C<\s>, C<\S>, C<\d>, and C<\D> within character classes +(though not as either end of a range). See L<utf8> for details +about C<\pP>, C<\PP>, and C<\X>. Perl defines the following zero-width assertions: @@ -193,41 +195,46 @@ Perl defines the following zero-width assertions: \z Match only at end of string \G Match only where previous m//g left off (works only with /g) -A word boundary (C<\b>) is defined as a spot between two characters that -has a C<\w> on one side of it and a C<\W> on the other side of it (in -either order), counting the imaginary characters off the beginning and -end of the string as matching a C<\W>. (Within character classes C<\b> -represents backspace rather than a word boundary.) The C<\A> and C<\Z> are -just like "^" and "$", except that they won't match multiple times when the -C</m> modifier is used, while "^" and "$" will match at every internal line -boundary. To match the actual end of the string, not ignoring newline, -you can use C<\z>. The C<\G> assertion can be used to chain global -matches (using C<m//g>), as described in -L<perlop/"Regexp Quote-Like Operators">. - -It is also useful when writing C<lex>-like scanners, when you have several -patterns that you want to match against consequent substrings of your -string, see the previous reference. -The actual location where C<\G> will match can also be influenced -by using C<pos()> as an lvalue. See L<perlfunc/pos>. - -When the bracketing construct C<( ... )> is used, \E<lt>digitE<gt> matches the -digit'th substring. Outside of the pattern, always use "$" instead of "\" -in front of the digit. (While the \E<lt>digitE<gt> notation can on rare occasion work -outside the current pattern, this should not be relied upon. See the -WARNING below.) The scope of $E<lt>digitE<gt> (and C<$`>, C<$&>, and C<$'>) -extends to the end of the enclosing BLOCK or eval string, or to the next -successful pattern match, whichever comes first. 
If you want to use -parentheses to delimit a subpattern (e.g., a set of alternatives) without -saving it as a subpattern, follow the ( with a ?:. +A word boundary (C<\b>) is defined as a spot between two characters +that has a C<\w> on one side of it and a C<\W> on the other side +of it (in either order), counting the imaginary characters off the +beginning and end of the string as matching a C<\W>. (Within +character classes C<\b> represents backspace rather than a word +boundary, just as it normally does in any double-quoted string.) +The C<\A> and C<\Z> are just like "^" and "$", except that they +won't match multiple times when the C</m> modifier is used, while +"^" and "$" will match at every internal line boundary. To match +the actual end of the string and not ignore an optional trailing +newline, use C<\z>. + +The C<\G> assertion can be used to chain global matches (using +C<m//g>), as described in L<perlop/"Regexp Quote-Like Operators">. +It is also useful when writing C<lex>-like scanners, when you have +several patterns that you want to match against consequent substrings +of your string, see the previous reference. The actual location +where C<\G> will match can also be influenced by using C<pos()> as +an lvalue. See L<perlfunc/pos>. + +When the bracketing construct C<( ... )> is used to create a capture +buffer, \E<lt>digitE<gt> matches the digit'th substring. Outside +of the pattern, always use "$" instead of "\" in front of the digit. +(While the \E<lt>digitE<gt> notation can on rare occasion work +outside the current pattern, this should not be relied upon. See +the WARNING below.) The scope of $E<lt>digitE<gt> (and C<$`>, +C<$&>, and C<$'>) extends to the end of the enclosing BLOCK or eval +string, or to the next successful pattern match, whichever comes +first. If you want to use parentheses to delimit a subpattern +(e.g., a set of alternatives) without saving it as a subpattern, +follow the ( with a ?:. 
You may have as many parentheses as you wish. If you have more -than 9 substrings, the variables $10, $11, ... refer to the -corresponding substring. Within the pattern, \10, \11, etc. refer back -to substrings if there have been at least that many left parentheses before -the backreference. Otherwise (for backward compatibility) \10 is the -same as \010, a backspace, and \11 the same as \011, a tab. And so -on. (\1 through \9 are always backreferences.) +than 9 captured substrings, the variables $10, $11, ... refer to +the corresponding substring. Within the pattern, \10, \11, etc. +refer back to substrings if there have been at least that many left +parentheses before the backreference. Otherwise (for backward +compatibility) \10 is the same as \010, a backspace, and \11 the +same as \011, a tab. And so on. (\1 through \9 are always +backreferences.) C<$+> returns whatever the last bracket match matched. C<$&> returns the entire matched string. (C<$0> used to return the same thing, but not any @@ -242,50 +249,88 @@ everything after the matched string. Examples: $seconds = $3; } -Once perl sees that you need one of C<$&>, C<$`> or C<$'> anywhere in +Once Perl sees that you need one of C<$&>, C<$`> or C<$'> anywhere in the program, it has to provide them on each and every pattern match. This can slow your program down. The same mechanism that handles these provides for the use of $1, $2, etc., so you pay the same price -for each pattern that contains capturing parentheses. But if you never +for each pattern that contains capturing parentheses. But if you never use $&, etc., in your script, then patterns I<without> capturing -parentheses won't be penalized. So avoid $&, $', and $` if you can, +parentheses won't be penalized. So avoid $&, $', and $` if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two. 
-Backslashed metacharacters in Perl are -alphanumeric, such as C<\b>, C<\w>, C<\n>. Unlike some other regular -expression languages, there are no backslashed symbols that aren't -alphanumeric. So anything that looks like \\, \(, \), \E<lt>, \E<gt>, -\{, or \} is always interpreted as a literal character, not a -metacharacter. This was once used in a common idiom to disable or -quote the special meanings of regular expression metacharacters in a -string that you want to use for a pattern. Simply quote all -non-alphanumeric characters: +Backslashed metacharacters in Perl are alphanumeric, such as C<\b>, +C<\w>, C<\n>. Unlike some other regular expression languages, there +are no backslashed symbols that aren't alphanumeric. So anything +that looks like \\, \(, \), \E<lt>, \E<gt>, \{, or \} is always +interpreted as a literal character, not a metacharacter. This was +once used in a common idiom to disable or quote the special meanings +of regular expression metacharacters in a string that you want to +use for a pattern. Simply quote all non-alphanumeric characters: $pattern =~ s/(\W)/\\$1/g; -Now it is much more common to see either the quotemeta() function or -the C<\Q> escape sequence used to disable all metacharacters' special -meanings like this: +In modern days, it is more common to see either the quotemeta() +function or the C<\Q> metaquoting escape sequence used to disable +all metacharacters' special meanings like this: /$unquoted\Q$quoted\E$unquoted/ -Perl defines a consistent extension syntax for regular expressions. -The syntax is a pair of parentheses with a question mark as the first -thing within the parentheses (this was a syntax error in older -versions of Perl). The character after the question mark gives the -function of the extension. 
Several extensions are already supported: +=head2 Extended Patterns + +For those situations where simple regular expression patterns are +not enough, Perl defines a consistent extension syntax for venturing +beyond simple patterns such as are found in standard tools like +B<awk> and B<lex>. That syntax is a pair of parentheses with a +question mark as the first thing within the parentheses (this was +a syntax error in older versions of Perl). The character after the +question mark gives the function of the extension. + +Many extensions are already supported, some for almost five years +now. Other, more exotic forms are very new, and should be considered +highly experimental, and are so marked. + +A question mark was chosen for this and for the new minimal-matching +construct because 1) question mark is pretty rare in older regular +expressions, and 2) whenever you see one, you should stop and "question" +exactly what is going on. That's psychology... =over 10 =item C<(?#text)> -A comment. The text is ignored. If the C</x> switch is used to enable -whitespace formatting, a simple C<#> will suffice. Note that perl closes +A comment. The text is ignored. If the C</x> modifier is used to enable +whitespace formatting, a simple C<#> will suffice. Note that Perl closes the comment as soon as it sees a C<)>, so there is no way to put a literal C<)> in the comment. +=item C<(?imsx-imsx)> + +One or more embedded pattern-match modifiers. This is particularly +useful for dynamic patterns, such as those read in from a configuration +file, read in as an argument, are specified in a table somewhere, +etc. Consider the case that some of which want to be case sensitive +and some do not. The case insensitive ones need to include merely +C<(?i)> at the front of the pattern. For example: + + $pattern = "foobar"; + if ( /$pattern/i ) { } + + # more flexible: + + $pattern = "(?i)foobar"; + if ( /$pattern/ ) { } + +Letters after a C<-> turn those modifiers off. 
These modifiers are +localized inside an enclosing group (if any). For example, + + ( (?i) blah ) \s+ \1 + +will match a repeated (I<including the case>!) word C<blah> in any +case, assuming C<x> modifier, and no C<i> modifier outside of this +group. + =item C<(?:pattern)> =item C<(?imsx-imsx:pattern)> @@ -299,10 +344,11 @@ is like @fields = split(/\b(a|b|c)\b/) -but doesn't spit out extra fields. +but doesn't spit out extra fields. It's also cheaper not to capture +characters if you don't need to. -The letters between C<?> and C<:> act as flags modifiers, see -L<C<(?imsx-imsx)>>. In particular, +Any letters between C<?> and C<:> act as flags modifiers as with +C<(?imsx-imsx)>. For example, /(?s-i:more.*than).*million/i @@ -312,15 +358,15 @@ is equivalent to more verbose =item C<(?=pattern)> -A zero-width positive lookahead assertion. For example, C</\w+(?=\t)/> +A zero-width positive look-ahead assertion. For example, C</\w+(?=\t)/> matches a word followed by a tab, without including the tab in C<$&>. =item C<(?!pattern)> -A zero-width negative lookahead assertion. For example C</foo(?!bar)/> +A zero-width negative look-ahead assertion. For example C</foo(?!bar)/> matches any occurrence of "foo" that isn't followed by "bar". Note -however that lookahead and lookbehind are NOT the same thing. You cannot -use this for lookbehind. +however that look-ahead and look-behind are NOT the same thing. You cannot +use this for look-behind. If you are looking for a "bar" that isn't preceded by a "foo", C</(?!foo)bar/> will not do what you want. That's because the C<(?!foo)> is just saying that @@ -332,29 +378,32 @@ Sometimes it's still easier just to say: if (/bar/ && $` !~ /foo$/) -For lookbehind see below. +For look-behind see below. =item C<(?E<lt>=pattern)> -A zero-width positive lookbehind assertion. For example, C</(?E<lt>=\t)\w+/> -matches a word following a tab, without including the tab in C<$&>. -Works only for fixed-width lookbehind. 
+A zero-width positive look-behind assertion. For example, C</(?E<lt>=\t)\w+/> +matches a word that follows a tab, without including the tab in C<$&>. +Works only for fixed-width look-behind. =item C<(?<!pattern)> -A zero-width negative lookbehind assertion. For example C</(?<!bar)foo/> -matches any occurrence of "foo" that isn't following "bar". -Works only for fixed-width lookbehind. +A zero-width negative look-behind assertion. For example C</(?<!bar)foo/> +matches any occurrence of "foo" that does not follow "bar". Works +only for fixed-width look-behind. =item C<(?{ code })> -Experimental "evaluate any Perl code" zero-width assertion. Always -succeeds. C<code> is not interpolated. Currently the rules to -determine where the C<code> ends are somewhat convoluted. +B<WARNING>: This extended regular expression feature is considered +highly experimental, and may be changed or deleted without notice. -The C<code> is properly scoped in the following sense: if the assertion -is backtracked (compare L<"Backtracking">), all the changes introduced after -C<local>isation are undone, so +This zero-width assertion evaluate any embedded Perl code. It +always succeeds, and its C<code> is not interpolated. Currently, +the rules to determine where the C<code> ends are somewhat convoluted. + +The C<code> is properly scoped in the following sense: If the assertion +is backtracked (compare L<"Backtracking">), all changes introduced after +C<local>ization are undone, so that $_ = 'a' x 8; m< @@ -370,51 +419,55 @@ C<local>isation are undone, so # location. >x; -will set C<$res = 4>. Note that after the match $cnt returns to the globally -introduced value 0, since the scopes which restrict C<local> statements +will set C<$res = 4>. Note that after the match, $cnt returns to the globally +introduced value, since the scopes which restrict C<local> operators are unwound. -This assertion may be used as L<C<(?(condition)yes-pattern|no-pattern)>> -switch. 
If I<not> used in this way, the result of evaluation of C<code> -is put into variable $^R. This happens immediately, so $^R can be used from -other C<(?{ code })> assertions inside the same regular expression. +This assertion may be used as a C<(?(condition)yes-pattern|no-pattern)> +switch. If I<not> used in this way, the result of evaluation of +C<code> is put into the special variable C<$^R>. This happens +immediately, so C<$^R> can be used from other C<(?{ code })> assertions +inside the same regular expression. -The above assignment to $^R is properly localized, thus the old value of $^R -is restored if the assertion is backtracked (compare L<"Backtracking">). +The assignment to C<$^R> above is properly localized, so the old +value of C<$^R> is restored if the assertion is backtracked; compare +L<"Backtracking">. -Due to security concerns, this construction is not allowed if the regular -expression involves run-time interpolation of variables, unless -C<use re 'eval'> pragma is used (see L<re>), or the variables contain -results of qr() operator (see L<perlop/"qr/STRING/imosx">). +For reasons of security, this construct is forbidden if the regular +expression involves run-time interpolation of variables, unless the +perilous C<use re 'eval'> pragma has been used (see L<re>), or the +variables contain results of C<qr//> operator (see +L<perlop/"qr/STRING/imosx">). -This restriction is due to the wide-spread (questionable) practice of -using the construct +This restriction is due to the wide-spread and remarkably convenient +custom of using run-time determined strings as patterns. For example: $re = <>; chomp $re; $string =~ /$re/; -without tainting. While this code is frowned upon from security point -of view, when C<(?{})> was introduced, it was considered bad to add -I<new> security holes to existing scripts. - -B<NOTE:> Use of the above insecure snippet without also enabling taint mode -is to be severely frowned upon. 
C<use re 'eval'> does not disable tainting -checks, thus to allow $re in the above snippet to contain C<(?{})> -I<with tainting enabled>, one needs both C<use re 'eval'> and untaint -the $re. +Prior to the execution of code in a pattern, this was completely +safe from a security point of view, although it could of course +raise an exception from an illegal pattern. If you turn on the +C<use re 'eval'>, though, it is no longer secure, so you should +only do so if you are also using taint checking. Better yet, use +the carefully constrained evaluation within a Safe module. See +L<perlsec> for details about both these mechanisms. =item C<(?p{ code })> -I<Very experimental> "postponed" regular subexpression. C<code> is evaluated -at runtime, at the moment this subexpression may match. The result of -evaluation is considered as a regular expression, and matched as if it -were inserted instead of this construct. +B<WARNING>: This extended regular expression feature is considered +highly experimental, and may be changed or deleted without notice. -C<code> is not interpolated. Currently the rules to -determine where the C<code> ends are somewhat convoluted. +This is a "postponed" regular subexpression. The C<code> is evaluated +at run time, at the moment this subexpression may match. The result +of evaluation is considered as a regular expression and matched as +if it were inserted instead of this construct. -The following regular expression matches matching parenthesized group: +C<code> is not interpolated. As before, the rules to determine +where the C<code> ends are currently somewhat convoluted. + +The following pattern matches a parenthesized group: $re = qr{ \( @@ -428,31 +481,33 @@ The following regular expression matches matching parenthesized group: =item C<(?E<gt>pattern)> -An "independent" subexpression. Matches the substring that a -I<standalone> C<pattern> would match if anchored at the given position, -B<and only this substring>. 
- -Say, C<^(?E<gt>a*)ab> will never match, since C<(?E<gt>a*)> (anchored -at the beginning of string, as above) will match I<all> characters -C<a> at the beginning of string, leaving no C<a> for C<ab> to match. -In contrast, C<a*ab> will match the same as C<a+b>, since the match of -the subgroup C<a*> is influenced by the following group C<ab> (see -L<"Backtracking">). In particular, C<a*> inside C<a*ab> will match -fewer characters than a standalone C<a*>, since this makes the tail match. - -An effect similar to C<(?E<gt>pattern)> may be achieved by - - (?=(pattern))\1 - -since the lookahead is in I<"logical"> context, thus matches the same -substring as a standalone C<a+>. The following C<\1> eats the matched -string, thus making a zero-length assertion into an analogue of -C<(?E<gt>...)>. (The difference between these two constructs is that the -second one uses a catching group, thus shifting ordinals of -backreferences in the rest of a regular expression.) - -This construct is useful for optimizations of "eternal" -matches, because it will not backtrack (see L<"Backtracking">). +B<WARNING>: This extended regular expression feature is considered +highly experimental, and may be changed or deleted without notice. + +An "independent" subexpression, one which matches the substring +that a I<standalone> C<pattern> would match if anchored at the given +position -- but it matches no more than this substring. This +construct is useful for optimizations of what would otherwise be +"eternal" matches, because it will not backtrack (see L<"Backtracking">). + +For example: C<^(?E<gt>a*)ab> will never match, since C<(?E<gt>a*)> +(anchored at the beginning of string, as above) will match I<all> +characters C<a> at the beginning of string, leaving no C<a> for +C<ab> to match. In contrast, C<a*ab> will match the same as C<a+b>, +since the match of the subgroup C<a*> is influenced by the following +group C<ab> (see L<"Backtracking">). 
In particular, C<a*> inside +C<a*ab> will match fewer characters than a standalone C<a*>, since +this makes the tail match. + +An effect similar to C<(?E<gt>pattern)> may be achieved by writing +C<(?=(pattern))\1>. This matches the same substring as a standalone +C<a+>, and the following C<\1> eats the matched string; it therefore +makes a zero-length assertion into an analogue of C<(?E<gt>...)>. +(The difference between these two constructs is that the second one +uses a capturing group, thus shifting ordinals of backreferences +in the rest of a regular expression.) + +Consider this pattern: m{ \( ( @@ -463,17 +518,16 @@ matches, because it will not backtrack (see L<"Backtracking">). \) }x -That will efficiently match a nonempty group with matching -two-or-less-level-deep parentheses. However, if there is no such group, -it will take virtually forever on a long string. That's because there are -so many different ways to split a long string into several substrings. -This is what C<(.+)+> is doing, and C<(.+)+> is similar to a subpattern -of the above pattern. Consider that the above pattern detects no-match -on C<((()aaaaaaaaaaaaaaaaaa> in several seconds, but that each extra -letter doubles this time. This exponential performance will make it -appear that your program has hung. - -However, a tiny modification of this pattern +That will efficiently match a nonempty group with matching parentheses +two levels deep or less. However, if there is no such group, it +will take virtually forever on a long string. That's because there +are so many different ways to split a long string into several +substrings. This is what C<(.+)+> is doing, and C<(.+)+> is similar +to a subpattern of the above pattern. Consider how the pattern +above detects no-match on C<((()aaaaaaaaaaaaaaaaaa> in several +seconds, but that each extra letter doubles this time. This +exponential performance will make it appear that your program has +hung. 
However, a tiny modification of this pattern m{ \( ( @@ -491,18 +545,21 @@ however, that this pattern currently triggers a warning message under B<-w> saying it C<"matches the null string many times">): On simple groups, such as the pattern C<(?E<gt> [^()]+ )>, a comparable -effect may be achieved by negative lookahead, as in C<[^()]+ (?! [^()] )>. +effect may be achieved by negative look-ahead, as in C<[^()]+ (?! [^()] )>. This was only 4 times slower on a string with 1000000 C<a>s. =item C<(?(condition)yes-pattern|no-pattern)> =item C<(?(condition)yes-pattern)> +B<WARNING>: This extended regular expression feature is considered +highly experimental, and may be changed or deleted without notice. + Conditional expression. C<(condition)> should be either an integer in parentheses (which is valid if the corresponding pair of parentheses -matched), or lookahead/lookbehind/evaluate zero-width assertion. +matched), or look-ahead/look-behind/evaluate zero-width assertion. -Say, +For example: m{ ( \( )? [^()]+ @@ -512,39 +569,8 @@ Say, matches a chunk of non-parentheses, possibly included in parentheses themselves. -=item C<(?imsx-imsx)> - -One or more embedded pattern-match modifiers. This is particularly -useful for patterns that are specified in a table somewhere, some of -which want to be case sensitive, and some of which don't. The case -insensitive ones need to include merely C<(?i)> at the front of the -pattern. For example: - - $pattern = "foobar"; - if ( /$pattern/i ) { } - - # more flexible: - - $pattern = "(?i)foobar"; - if ( /$pattern/ ) { } - -Letters after C<-> switch modifiers off. - -These modifiers are localized inside an enclosing group (if any). Say, - - ( (?i) blah ) \s+ \1 - -(assuming C<x> modifier, and no C<i> modifier outside of this group) -will match a repeated (I<including the case>!) word C<blah> in any -case. 
- =back -A question mark was chosen for this and for the new minimal-matching -construct because 1) question mark is pretty rare in older regular -expressions, and 2) whenever you see one, you should stop and "question" -exactly what is going on. That's psychology... - =head2 Backtracking A fundamental feature of regular expression matching involves the @@ -652,7 +678,7 @@ definition might succeed against a particular string. And if there are multiple ways it might succeed, you need to understand backtracking to know which variety of success you will achieve. -When using lookahead assertions and negations, this can all get even +When using look-ahead assertions and negations, this can all get even tricker. Imagine you'd like to find a sequence of non-digits not followed by "123". You might try to write that as @@ -702,7 +728,7 @@ time. Now there's indeed something following "AB" that is not We can deal with this by using both an assertion and a negation. We'll say that the first part in $1 must be followed by a digit, and in fact, it must also be followed by something that's not "123". Remember that the -lookaheads are zero-width expressions--they only look, but don't consume +look-aheads are zero-width expressions--they only look, but don't consume any of the string in their match. So rewriting this way produces what you'd expect; that is, case 5 will fail, but case 6 succeeds: @@ -712,7 +738,7 @@ you'd expect; that is, case 5 will fail, but case 6 succeeds: 6: got ABC In other words, the two zero-width assertions next to each other work as though -they're ANDed together, just as you'd use any builtin assertions: C</^$/> +they're ANDed together, just as you'd use any built-in assertions: C</^$/> matches only if you're at the beginning of the line AND the end of the line simultaneously. The deeper underlying truth is that juxtaposition in regular expressions always means AND, except when you write an explicit OR @@ -720,10 +746,10 @@ using the vertical bar. 
C</ab/> means match "a" AND (then) match "b", although the attempted matches are made at different positions because "a" is not a zero-width assertion, but a one-width assertion. -One warning: particularly complicated regular expressions can take -exponential time to solve due to the immense number of possible ways they -can use backtracking to try match. For example this will take a very long -time to run +B<WARNING>: particularly complicated regular expressions can take +exponential time to solve due to the immense number of possible +ways they can use backtracking to try match. For example, this will +take a very long time to run /((a{0,5}){0,5}){0,5}/ @@ -732,10 +758,10 @@ it would take literally forever--or until you ran out of stack space. A powerful tool for optimizing such beasts is "independent" groups, which do not backtrace (see L<C<(?E<gt>pattern)>>). Note also that -zero-length lookahead/lookbehind assertions will not backtrace to make +zero-length look-ahead/look-behind assertions will not backtrace to make the tail match, since they are in "logical" context: only the fact whether they match or not is considered relevant. For an example -where side-effects of a lookahead I<might> have influenced the +where side-effects of a look-ahead I<might> have influenced the following match, see L<C<(?E<gt>pattern)>>. =head2 Version 8 Regular Expressions @@ -810,7 +836,7 @@ match "0x1234 0x4321", but not "0x1234 01234", because subpattern 1 actually matched "0x", even though the rule C<0|0x> could potentially match the leading 0 in the second number. -=head2 WARNING on \1 vs $1 +=head2 Warning on \1 vs $1 Some people get too used to writing things like: @@ -837,7 +863,7 @@ different things on the I<left> side of the C<s///>. =head2 Repeated patterns matching zero-length substring -WARNING: Difficult material (and prose) ahead. This section needs a rewrite. +B<WARNING>: Difficult material (and prose) ahead. This section needs a rewrite. 
Regular expressions provide a terse and powerful programming language. As with most other power tools, power comes together with the ability @@ -873,8 +899,9 @@ the infinite loop>. The rules for this are different for lower-level loops given by the greedy modifiers C<*+{}>, and for higher-level ones like the C</g> modifier or split() operator. -The lower-level loops are I<interrupted> when it is detected that a -repeated expression did match a zero-length substring, thus +The lower-level loops are I<interrupted> (that is, the loop is +broken) when Perl detects that a repeated expression matched a +zero-length substring. Thus m{ (?: NON_ZERO_LENGTH | ZERO_LENGTH )* }x; @@ -892,7 +919,7 @@ This prohibition interacts with backtracking (see L<"Backtracking">), and so the I<second best> match is chosen if the I<best> match is of zero length. -Say, +For example: $_ = 'bar'; s/\w??/<$&>/g; @@ -905,7 +932,7 @@ alternate with one-character-long matches. Similarly, for repeated C<m/()/g> the second-best match is the match at the position one notch further in the string. -The additional state of being I<matched with zero-length> is associated to +The additional state of being I<matched with zero-length> is associated with the matched string, and is reset by each assignment to pos(). =head2 Creating custom RE engines @@ -955,7 +982,12 @@ part of this regular expression needs to be converted explicitly $re = customre::convert $re; /\Y|$re\Y|/; -=head2 SEE ALSO +=head1 BUGS + +This manpage is varies from difficult to understand to completely +and utterly opaque. + +=head1 SEE ALSO L<perlop/"Regexp Quote-Like Operators">. @@ -965,4 +997,4 @@ L<perlfunc/pos>. L<perllocale>. -I<Mastering Regular Expressions> (see L<perlbook>) by Jeffrey Friedl. +I<Mastering Regular Expressions> by Jeffrey Friedl. |
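
A few of the constructs described in the patched text can be exercised with a short script. This is only a minimal sketch, assuming a reasonably modern perl; the variable names and test strings are illustrative and are not part of the commit. It covers the embedded C<(?i)> modifier, look-ahead and look-behind, the C<(?!foo)bar> pitfall, the independent subexpression C<(?E<gt>pattern)>, the conditional expression, and the zero-length match example.

```perl
use strict;
use warnings;

# Embedded modifier: (?i) travels with the pattern string,
# so the match operator itself needs no /i.
my $pattern    = "(?i)foobar";
my $embedded_i = ("FooBar" =~ /$pattern/) ? 1 : 0;

# Look-ahead: a word followed by a tab; the tab is not consumed.
my ($word) = "key\tvalue" =~ /(\w+(?=\t))/;

# Negative look-behind: "foo" that does not follow "bar".
my $lb_plain = ("xfoo"   =~ /(?<!bar)foo/) ? 1 : 0;
my $lb_after = ("barfoo" =~ /(?<!bar)foo/) ? 1 : 0;

# Pitfall from the text: (?!foo)bar is NOT a look-behind.
# It still matches the "bar" inside "foobar".
my $pitfall = ("foobar" =~ /(?!foo)bar/) ? 1 : 0;

# Independent subexpression: (?>a*) takes every leading "a" and
# never gives one back, so the following "ab" cannot match.
my $independent = ("aaab" =~ /^(?>a*)ab/) ? 1 : 0;
my $ordinary    = ("aaab" =~ /^a*ab/)     ? 1 : 0;

# Conditional: require ")" only if group 1 matched "(".
# The ^ and $ anchors are added here for illustration.
my $cond       = qr{ ^ ( \( )? [^()]+ (?(1) \) ) $ }x;
my $cond_paren = ("(abc)" =~ $cond) ? 1 : 0;
my $cond_bare  = ("abc"   =~ $cond) ? 1 : 0;
my $cond_open  = ("(abc"  =~ $cond) ? 1 : 0;

# Zero-length matches alternate with one-character matches under /g.
(my $alternating = 'bar') =~ s/\w??/<$&>/g;

print "embedded (?i):       $embedded_i\n";
print "look-ahead word:     $word\n";
print "look-behind:         $lb_plain $lb_after\n";
print "(?!foo)bar pitfall:  $pitfall\n";
print "independent/plain:   $independent $ordinary\n";
print "conditional:         $cond_paren $cond_bare $cond_open\n";
print "zero-length:         $alternating\n";
```

The flags are stored as 1 or 0 so they print cleanly. Using C<$&> here follows the documentation's own example, although the text warns that it can slow down every match in a program.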