author | Yves Orton <demerphq@gmail.com> | 2006-11-06 14:06:28 +0100 |
---|---|---|
committer | Rafael Garcia-Suarez <rgarciasuarez@gmail.com> | 2006-11-07 10:21:25 +0000 |
commit | e2e6a0f1870d05ddb1ce18fd8556b71330dc694c (patch) | |
tree | 567ec172976f421f30d2eccf9516cf6cbc1c9914 /pod | |
parent | 9c6bc640227cd4fa081b32554378abe794cacfc0 (diff) | |
download | perl-e2e6a0f1870d05ddb1ce18fd8556b71330dc694c.tar.gz |
New regex syntax omnibus
Message-ID: <9b18b3110611060406u2fa1572as57073949a5df9e62@mail.gmail.com>
Plus a portability fix (in string comparison for regex verbs)
and doc tweaks / podchecker fixes
p4raw-id: //depot/perl@29222
Diffstat (limited to 'pod')
-rw-r--r-- | pod/perl595delta.pod | 4 |
-rw-r--r-- | pod/perldiag.pod | 28 |
-rw-r--r-- | pod/perlre.pod | 339 |
3 files changed, 245 insertions, 126 deletions
diff --git a/pod/perl595delta.pod b/pod/perl595delta.pod
index ff8efcd621..a7e3b40508 100644
--- a/pod/perl595delta.pod
+++ b/pod/perl595delta.pod
@@ -107,8 +107,8 @@ quantifiers. (Yves Orton)
 =item Backtracking control verbs
 
 The regex engine now supports a number of special purpose backtrack
-control verbs: (?COMMIT), (?CUT), (?ERROR) and (?FAIL). See L<perlre>
-for their descriptions.
+control verbs: (*COMMIT), (*MARK), (*CUT), (*ERROR), (*FAIL) and
+(*ACCEPT). See L<perlre> for their descriptions.
 
 =back
 
diff --git a/pod/perldiag.pod b/pod/perldiag.pod
index c20b0602c2..e9d23267bd 100644
--- a/pod/perldiag.pod
+++ b/pod/perldiag.pod
@@ -4291,6 +4291,13 @@ category that is unknown to perl at this point.
 
 Note that if you want to enable a warnings category registered by a module
 (e.g. C<use warnings 'File::Find'>), you must have imported this module
+
+=item Unknown verb pattern '%s' in regex; marked by <-- HERE in m/%s/
+
+(F) You either made a typo or have incorrectly put a C<*> quantifier
+after an open brace in your pattern. Check the pattern and review
+L<perlre> for details on legal verb patterns.
+
 first.
 
 =item unmatched [ in regex; marked by <-- HERE in m/%s/
@@ -4412,6 +4419,17 @@ character to get your parentheses to balance. See L<attributes>.
 compressed integer format and could not be converted to an integer.
 See L<perlfunc/pack>.
 
+=item Unterminated verb pattern in regex; marked by <-- HERE in m/%s/
+
+(F) You used a pattern of the form C<(*VERB)> but did not terminate
+the pattern with a C<)>. Fix the pattern and retry.
+
+=item Unterminated verb pattern argument in regex; marked by <-- HERE in m/%s/
+
+(F) You used a pattern of the form C<(*VERB:ARG)> but did not terminate
+the pattern with a C<)>. Fix the pattern and retry.
+
+
 =item Unterminated <> operator
 
 (F) The lexer saw a left angle bracket in a place where it was expecting
@@ -4807,6 +4825,16 @@ anonymous, using the C<sub {}> syntax. When inner anonymous subs that
 reference variables in outer subroutines are created, they
 are automatically rebound to the current values of such variables.
 
+=item Verb pattern '%s' has a mandatory argument in regex; marked by <-- HERE in m/%s/
+
+(F) You used a verb pattern that requires an argument. Supply an argument
+or check that you are using the right verb.
+
+=item Verb pattern '%s' may not have an argument in regex; marked by <-- HERE in m/%s/
+
+(F) You used a verb pattern that is not allowed an argument. Remove the
+argument or check that you are using the right verb.
+
 =item Version number must be a constant number
 
 (P) The attempt to translate a C<use Module n.n LIST> statement into
diff --git a/pod/perlre.pod b/pod/perlre.pod
index bce72914fb..45e41e5f54 100644
--- a/pod/perlre.pod
+++ b/pod/perlre.pod
@@ -933,14 +933,100 @@ the same name, then it recurses to the leftmost.
 It is an error to refer to a name that is not declared somewhere in the
 pattern.
 
-=item C<(?FAIL)> C<(?F)>
-X<(?FAIL)> X<(?F)>
+=item C<(?(condition)yes-pattern|no-pattern)>
+X<(?()>
 
-This pattern matches nothing and always fails. It can be used to force the
-engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
-fact, C<(?!)> gets optimised into C<(?FAIL)> internally.
+=item C<(?(condition)yes-pattern)>
 
-It is probably useful only when combined with C<(?{})> or C<(??{})>.
+Conditional expression. C<(condition)> should be either an integer in
+parentheses (which is valid if the corresponding pair of parentheses
+matched), a look-ahead/look-behind/evaluate zero-width assertion, a
+name in angle brackets or single quotes (which is valid if a buffer
+with the given name matched), or the special symbol (R) (true when
+evaluated inside of recursion or eval). Additionally the R may be
+followed by a number, (which will be true when evaluated when recursing
+inside of the appropriate group), or by C<&NAME>, in which case it will
+be true only when evaluated during recursion in the named group.
+
+Here's a summary of the possible predicates:
+
+=over 4
+
+=item (1) (2) ...
+
+Checks if the numbered capturing buffer has matched something.
+
+=item (<NAME>) ('NAME')
+
+Checks if a buffer with the given name has matched something.
+
+=item (?{ CODE })
+
+Treats the code block as the condition.
+
+=item (R)
+
+Checks if the expression has been evaluated inside of recursion.
+
+=item (R1) (R2) ...
+
+Checks if the expression has been evaluated while executing directly
+inside of the n-th capture group. This check is the regex equivalent of
+
+    if ((caller(0))[3] eq 'subname') { ... }
+
+In other words, it does not check the full recursion stack.
+
+=item (R&NAME)
+
+Similar to C<(R1)>, this predicate checks to see if we're executing
+directly inside of the leftmost group with a given name (this is the same
+logic used by C<(?&NAME)> to disambiguate). It does not check the full
+stack, but only the name of the innermost active recursion.
+
+=item (DEFINE)
+
+In this case, the yes-pattern is never directly executed, and no
+no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
+See below for details.
+
+=back
+
+For example:
+
+    m{ ( \( )?
+       [^()]+
+       (?(1) \) )
+     }x
+
+matches a chunk of non-parentheses, possibly included in parentheses
+themselves.
+
+A special form is the C<(DEFINE)> predicate, which never executes directly
+its yes-pattern, and does not allow a no-pattern. This allows to define
+subpatterns which will be executed only by using the recursion mechanism.
+This way, you can define a set of regular expression rules that can be
+bundled into any pattern you choose.
+
+It is recommended that for this usage you put the DEFINE block at the
+end of the pattern, and that you name any subpatterns defined within it.
+
+Also, it's worth noting that patterns defined this way probably will
+not be as efficient, as the optimiser is not very clever about
+handling them.
+
+An example of how this might be used is as follows:
+
+  /(?<NAME>(&NAME_PAT))(?<ADDR>(&ADDRESS_PAT))
+   (?(DEFINE)
+     (<NAME_PAT>....)
+     (<ADRESS_PAT>....)
+   )/x
+
+Note that capture buffers matched inside of recursion are not accessible
+after the recursion returns, so the extra layer of capturing buffers are
+necessary. Thus C<$+{NAME_PAT}> would not be defined even though
+C<$+{NAME}> would be.
 
 =item C<< (?>pattern) >>
 X<backtrack> X<backtracking> X<atomic> X<possessive>
@@ -973,12 +1059,12 @@ in the rest of a regular expression.)
 Consider this pattern:
 
     m{ \(
-          (
-            [^()]+              # x+
-          |
+          (
+            [^()]+              # x+
+          |
             \( [^()]* \)
           )+
-       \)
+       \)
      }x
 
 That will efficiently match a nonempty group with matching parentheses
@@ -992,13 +1078,13 @@ seconds, but that each extra letter doubles this time. This exponential
 performance will make it appear that your program has hung. However, a
 tiny change to this pattern
 
-    m{ \(
-          (
-            (?> [^()]+ )        # change x+ above to (?> x+ )
-          |
+    m{ \(
+          (
+            (?> [^()]+ )        # change x+ above to (?> x+ )
+          |
             \( [^()]* \)
           )+
-       \)
+       \)
      }x
 
 which uses C<< (?>...) >> matches exactly when the one above does (verifying
@@ -1046,13 +1132,50 @@ to inside of one of these constructs. The following equivalences apply:
     PAT?+               (?>PAT?)
     PAT{min,max}+       (?>PAT{min,max})
 
-=item C<(?COMMIT)>
-X<(?COMMIT)>
+=back
+
+=head2 Special Backtracking Control Verbs
+
+B<WARNING:> These patterns are experimental and subject to change or
+removal in a future version of perl. Their usage in production code should
+be noted to avoid problems during upgrades.
+
+These special patterns are generally of the form C<(*VERB:ARG)>. Unless
+otherwise stated the ARG argument is optional; in some cases, it is
+forbidden.
+
+Any pattern containing a special backtracking verb that allows an argument
+has the special behaviour that when executed it sets the current packages'
+C<$REGERROR> variable. In this case, the following rules apply:
+
+On failure, this variable will be set to the ARG value of the verb
+pattern, if the verb was involved in the failure of the match. If the ARG
+part of the pattern was omitted, then C<$REGERROR> will be set to TRUE.
+
+On a successful match this variable will be set to FALSE.
+
+B<NOTE:> C<$REGERROR> is not a magic variable in the same sense than
+C<$1> and most other regex related variables. It is not local to a
+scope, nor readonly but instead a volatile package variable similar to
+C<$AUTOLOAD>. Use C<local> to localize changes to it to a specific scope
+if necessary.
+
+If a pattern does not contain a special backtracking verb that allows an
+argument, then C<$REGERROR> is not touched at all.
+
+=over 4
+
+=item Verbs that take an argument
+
+=over 4
+
+=item C<(*NOMATCH)> C<(*NOMATCH:NAME)>
+X<(*NOMATCH)> X<(*NOMATCH:NAME)>
 
 This zero-width pattern commits the match at the current point, preventing
-the engine from back-tracking on failure to the left of the commit point.
-Consider the pattern C<A (?COMMIT) B>, where A and B are complex patterns.
-Until the C<(?COMMIT)> is reached, A may backtrack as necessary to match.
+the engine from backtracking on failure to the left of the this point.
+Consider the pattern C<A (*NOMATCH) B>, where A and B are complex patterns.
+Until the C<(*NOMATCH)> is reached, A may backtrack as necessary to match.
 Once it is reached, matching continues in B, which may also backtrack as
 necessary; however, should B not match, then no further backtracking will
 take place, and the pattern will fail outright at that starting position.
@@ -1060,7 +1183,7 @@ take place, and the pattern will fail outright at that starting position.
 The following example counts all the possible matching strings in a
 pattern (without actually matching any of them).
 
-    'aaab'=~/a+b?(?{print "$&\n"; $count++})(?FAIL)/;
+    'aaab' =~ /a+b?(?{print "$&\n"; $count++})(*FAIL)/;
     print "Count=$count\n";
 
 which produces:
@@ -1076,9 +1199,9 @@ which produces:
     a
     Count=9
 
-If we add a C<(?COMMIT)> before the count like the following
+If we add a C<(*NOMATCH)> before the count like the following
 
-    'aaab'=~/a+b?(?COMMIT)(?{print "$&\n"; $count++})(?FAIL)/;
+    'aaab' =~ /a+b?(*NOMATCH)(?{print "$&\n"; $count++})(*FAIL)/;
     print "Count=$count\n";
 
 we prevent backtracking and find the count of the longest matching
@@ -1089,23 +1212,47 @@ at each matching startpoint like so:
     ab
     Count=3
 
-Any number of C<(?COMMIT)> assertions may be used in a pattern.
+Any number of C<(*NOMATCH)> assertions may be used in a pattern.
 
 See also C<< (?>pattern) >> and possessive quantifiers for other ways to
 control backtracking.
 
-=item C<(?CUT)>
-X<(?CUT)>
-
-This zero-width pattern is similar to C<(?COMMIT)>, except that on
-failure it also signifies that whatever text that was matched leading
-up to the C<(?CUT)> pattern cannot match, I<even from another
-starting point>.
-
-Compare the following to the examples in C<(?COMMIT)>, note the string
+=item C<(*MARK)> C<(*MARK:NAME)>
+X<(*MARK)>
+
+This zero-width pattern can be used to mark the point in a string
+reached when a certain part of the pattern has been successfully
+matched. This mark may be given a name. A later C<(*CUT)> pattern
+will then cut at that point if backtracked into on failure. Any
+number of (*MARK) patterns are allowed, and the NAME portion is
+optional and may be duplicated.
+
+See C<*CUT> for more detail.
+
+=item C<(*CUT)> C<(*CUT:NAME)>
+X<(*CUT)>
+
+This zero-width pattern is similar to C<(*NOMATCH)>, except that on
+failure it also signifies that whatever text that was matched leading up
+to the C<(*CUT)> pattern being executed cannot be part of a match, I<even
+if started from a later point>. This effectively means that the regex
+engine moves forward to this position on failure and tries to match
+again, (assuming that there is sufficient room to match).
+
+The name of the C<(*CUT:NAME)> pattern has special significance. If a
+C<(*MARK:NAME)> was encountered while matching, then it is the position
+where that pattern was executed that is used for the "cut point" in the
+string. If no mark of that name was encountered, then the cut is done at
+the point where the C<(*CUT)> was. Similarly if no NAME is specified in
+the C<(*CUT)>, and if a C<(*MARK)> with any name (or none) is encountered,
+then that C<(*MARK)>'s cursor point will be used. If the C<(*CUT)> is not
+preceded by a C<(*MARK)>, then the cut point is where the string was when
+the C<(*CUT)> was encountered.
+
+Compare the following to the examples in C<(*NOMATCH)>, note the string
 is twice as long:
 
-    'aaabaaab'=~/a+b?(?CUT)(?{print "$&\n"; $count++})(?FAIL)/;
+    'aaabaaab' =~ /a+b?(*CUT)(?{print "$&\n"; $count++})(*FAIL)/;
     print "Count=$count\n";
 
 outputs
@@ -1114,17 +1261,17 @@ outputs
     aaab
     Count=2
 
-Once the 'aaab' at the start of the string has matched and the C<(?CUT)>
-executed the next startpoint will be where the cursor was when the
-C<(?CUT)> was executed.
+Once the 'aaab' at the start of the string has matched, and the C<(*CUT)>
+executed, the next startpoint will be where the cursor was when the
+C<(*CUT)> was executed.
 
-=item C<(?ERROR)>
-X<(?ERROR)>
+=item C<(*COMMIT)>
+X<(*COMMIT)>
 
-This zero-width pattern is similar to C<(?CUT)> except that it causes
+This zero-width pattern is similar to C<(*CUT)> except that it causes
 the match to fail outright. No attempts to match will occur again.
 
-    'aaabaaab'=~/a+b?(?ERROR)(?{print "$&\n"; $count++})(?FAIL)/;
+    'aaabaaab' =~ /a+b?(*COMMIT)(?{print "$&\n"; $count++})(*FAIL)/;
     print "Count=$count\n";
 
 outputs
@@ -1132,105 +1279,49 @@ outputs
     aaab
     Count=1
 
-In other words, once the C<(?ERROR)> has been entered and then pattern
-does not match then the regex engine will not try any further matching at
-all on the rest of the string.
-
-=item C<(?(condition)yes-pattern|no-pattern)>
-X<(?()>
-
-=item C<(?(condition)yes-pattern)>
+In other words, once the C<(*COMMIT)> has been entered, and if the pattern
+does not match, the regex engine will not try any further matching on the
+rest of the string.
 
-Conditional expression. C<(condition)> should be either an integer in
-parentheses (which is valid if the corresponding pair of parentheses
-matched), a look-ahead/look-behind/evaluate zero-width assertion, a
-name in angle brackets or single quotes (which is valid if a buffer
-with the given name matched), the special symbol (R) (true when
-evaluated inside of recursion or eval). Additionally the R may be
-followed by a number, (which will be true when evaluated when recursing
-inside of the appropriate group), or by C<&NAME> in which case it will
-be true only when evaluated during recursion in the named group.
+=back
 
-Here's a summary of the possible predicates:
+=item Verbs without an argument
 
 =over 4
 
-=item (1) (2) ...
-
-Checks if the numbered capturing buffer has matched something.
-
-=item (<NAME>) ('NAME')
+=item C<(*FAIL)> C<(*F)>
+X<(*FAIL)> X<(*F)>
 
-Checks if a buffer with the given name has matched something.
-
-=item (?{ CODE })
-
-Treats the code block as the condition
-
-=item (R)
-
-Checks if the expression has been evaluated inside of recursion.
-
-=item (R1) (R2) ...
+This pattern matches nothing and always fails. It can be used to force the
+engine to backtrack. It is equivalent to C<(?!)>, but easier to read. In
+fact, C<(?!)> gets optimised into C<(*FAIL)> internally.
 
-Checks if the expression has been evaluated while executing directly
-inside of the n-th capture group. This check is the regex equivalent of
+It is probably useful only when combined with C<(?{})> or C<(??{})>.
 
-    if ((caller(0))[3] eq 'subname') { .. }
+=item C<(*ACCEPT)>
+X<(*ACCEPT)>
 
-In other words, it does not check the full recursion stack.
+B<WARNING:> This feature is highly experimental. It is not recommended
+for production code.
 
-=item (R&NAME)
+This pattern matches nothing and causes the end of successful matching at
+the point at which the C<(*ACCEPT)> pattern was encountered, regardless of
+whether there is actually more to match in the string. When inside of a
+nested pattern, such as recursion or a dynamically generated subbpattern
+via C<(??{})>, only the innermost pattern is ended immediately.
 
-Similar to C<(R1)>, this predicate checks to see if we're executing
-directly inside of the leftmost group with a given name (this is the same
-logic used by C<(?&NAME)> to disambiguate). It does not check the full
-stack, but only the name of the innermost active recursion.
+If the C<(*ACCEPT)> is inside of capturing buffers then the buffers are
+marked as ended at the point at which the C<(*ACCEPT)> was encountered.
+For instance:
 
-=item (DEFINE)
+  'AB' =~ /(A (A|B(*ACCEPT)|C) D)(E)/x;
 
-In this case, the yes-pattern is never directly executed, and no
-no-pattern is allowed. Similar in spirit to C<(?{0})> but more efficient.
-See below for details.
+will match, and C<$1> will be C<AB> and C<$2> will be C<B>, C<$3> will not
+be set. If another branch in the inner parens were matched, such as in the
+string 'ACDE', then the C<D> and C<E> would have to be matched as well.
 
 =back
 
-For example:
-
-    m{ ( \( )?
-       [^()]+
-       (?(1) \) )
-     }x
-
-matches a chunk of non-parentheses, possibly included in parentheses
-themselves.
-
-A special form is the C<(DEFINE)> predicate, which never executes directly
-its yes-pattern, and does not allow a no-pattern. This allows to define
-subpatterns which will be executed only by using the recursion mechanism.
-This way, you can define a set of regular expression rules that can be
-bundled into any pattern you choose.
-
-It is recommended that for this usage you put the DEFINE block at the
-end of the pattern, and that you name any subpatterns defined within it.
-
-Also, it's worth noting that patterns defined this way probably will
-not be as efficient, as the optimiser is not very clever about
-handling them. YMMV.
-
-An example of how this might be used is as follows:
-
-  /(?<NAME>(&NAME_PAT))(?<ADDR>(&ADDRESS_PAT))
-   (?(DEFINE)
-     (<NAME_PAT>....)
-     (<ADRESS_PAT>....)
-   )/x
-
-Note that capture buffers matched inside of recursion are not accessible
-after the recursion returns, so the extra layer of capturing buffers are
-necessary. Thus C<$+{NAME_PAT}> would not be defined even though
-C<$+{NAME}> would be.
-
 =back
 
 =head2 Backtracking