author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-03-22 16:13:13 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2013-03-22 16:13:13 +0000
commit 712f2578028ec79534921d1b06f7b9d0fa1e643b
tree 210216515dd0488a595b09cbe3b12bc2353f5645 /doc/html/pcrepattern.html
parent 54b46b0215cca9f79390afab565a31db76372d74
Fix COMMIT in recursion; document backtracking verbs in assertions and
subroutines.

git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1298 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'doc/html/pcrepattern.html')
-rw-r--r-- doc/html/pcrepattern.html 269
1 files changed, 189 insertions, 80 deletions
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index ee55d06..c9eb7e5 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -806,7 +806,8 @@ Unicode table.
</P>
<P>
Specifying caseless matching does not affect these escape sequences. For
-example, \p{Lu} always matches only upper case letters.
+example, \p{Lu} always matches only upper case letters. This is different from
+the behaviour of current versions of Perl.
</P>
<P>
Matching characters by Unicode property is not fast, because PCRE has to do a
@@ -871,7 +872,8 @@ PCRE's additional properties
As well as the standard Unicode properties described above, PCRE supports four
more that make it possible to convert traditional escape sequences such as \w
and \s and POSIX character classes to use Unicode properties. PCRE uses these
-non-standard, non-Perl properties internally when PCRE_UCP is set. They are:
+non-standard, non-Perl properties internally when PCRE_UCP is set. However,
+they may also be used explicitly. These properties are:
<pre>
Xan Any alphanumeric character
Xps Any POSIX space character
@@ -883,6 +885,16 @@ property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
carriage return, and any other character that has the Z (separator) property.
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
same characters as Xan, plus underscore.
+</P>
+<P>
+There is another non-standard property, Xuc, which matches any character that
+can be represented by a Universal Character Name in C++ and other programming
+languages. These are the characters $, @, ` (grave accent), and all characters
+with Unicode code points greater than or equal to U+00A0, except for the
+surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
+excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH
+where H is a hexadecimal digit. Note that the Xuc property does not match these
+sequences but the characters that they represent.)
<a name="resetmatchstart"></a></P>
<br><b>
Resetting the match start
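
The Xuc paragraph added in the hunk above describes a set of characters that can be checked with simple code point arithmetic. The following is a minimal sketch of that membership test, not PCRE's implementation; the function name `is_xuc` is hypothetical and is used only for illustration.

```python
def is_xuc(ch: str) -> bool:
    """Return True if the single character ch is in the set described for Xuc.

    Sketch only: this mirrors the prose description ($, @, grave accent, and
    code points >= U+00A0 except the surrogates U+D800 to U+DFFF). It is not
    taken from the PCRE source code.
    """
    cp = ord(ch)
    # The three ASCII characters that are explicitly included.
    if ch in "$@`":
        return True
    # Surrogates are explicitly excluded.
    if 0xD800 <= cp <= 0xDFFF:
        return False
    # Everything else qualifies only from U+00A0 upwards.
    return cp >= 0xA0
```

Note that the test applies to the character itself, not to a textual sequence such as \u00E9, in line with the closing remark of the paragraph.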
@@ -1950,8 +1962,8 @@ except that it does not cause the current matching position to be changed.
Assertion subpatterns are not capturing subpatterns. If such an assertion
contains capturing subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern. However, substring
-capturing is carried out only for positive assertions, because it does not make
-sense for negative assertions.
+capturing is carried out only for positive assertions. (Perl sometimes, but not
+always, does do capturing in negative assertions.)
</P>
<P>
For compatibility with Perl, assertion subpatterns may be repeated; though
@@ -2605,7 +2617,14 @@ For example, this pattern has two callout points:
</pre>
If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are
automatically installed before each item in the pattern. They are all numbered
-255.
+255. If there is a conditional group in the pattern whose condition is an
+assertion, an additional callout is inserted just before the condition. An
+explicit callout may also be set at this position, as in this example:
+<pre>
+ (?(?C9)(?=a)abc|def)
+</pre>
+Note that this applies only to assertion conditions, not to other types of
+condition.
</P>
<P>
During matching, when PCRE reaches a callout point, the external function is
@@ -2620,38 +2639,37 @@ documentation.
<br><a name="SEC26" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
-are described in the Perl documentation as "experimental and subject to change
-or removal in a future version of Perl". It goes on to say: "Their usage in
-production code should be noted to avoid problems during upgrades." The same
+are still described in the Perl documentation as "experimental and subject to
+change or removal in a future version of Perl". It goes on to say: "Their usage
+in production code should be noted to avoid problems during upgrades." The same
remarks apply to the PCRE features described in this section.
</P>
<P>
-Since these verbs are specifically related to backtracking, most of them can be
-used only when the pattern is to be matched using one of the traditional
-matching functions, which use a backtracking algorithm. With the exception of
-(*FAIL), which behaves like a failing negative assertion, they cause an error
-if encountered by a DFA matching function.
-</P>
-<P>
-If any of these verbs are used in an assertion or in a subpattern that is
-called as a subroutine (whether or not recursively), their effect is confined
-to that subpattern; it does not extend to the surrounding pattern, with one
-exception: the name from a *(MARK), (*PRUNE), or (*THEN) that is encountered in
-a successful positive assertion <i>is</i> passed back when a match succeeds
-(compare capturing parentheses in assertions). Note that such subpatterns are
-processed as anchored at the point where they are tested. Note also that Perl's
-treatment of subroutines and assertions is different in some cases.
-</P>
-<P>
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
-depending on whether or not an argument is present. A name is any sequence of
+depending on whether or not a name is present. A name is any sequence of
characters that does not include a closing parenthesis. The maximum length of
-name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit library.
+name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit libraries.
If the name is empty, that is, if the closing parenthesis immediately follows
the colon, the effect is as if the colon were not there. Any number of these
verbs may occur in a pattern.
+</P>
+<P>
+Since these verbs are specifically related to backtracking, most of them can be
+used only when the pattern is to be matched using one of the traditional
+matching functions, because these use a backtracking algorithm. With the
+exception of (*FAIL), which behaves like a failing negative assertion, the
+backtracking control verbs cause an error if encountered by a DFA matching
+function.
+</P>
+<P>
+The behaviour of these verbs in
+<a href="#btrepeat">repeated groups,</a>
+<a href="#btassert">assertions,</a>
+and in
+<a href="#btsub">subpatterns called as subroutines</a>
+(whether or not recursively) is documented below.
<a name="nooptimize"></a></P>
<br><b>
Optimizations that affect backtracking verbs
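
The naming rules for (*VERB) and (*VERB:NAME) stated in the hunk above (no closing parenthesis in a name, a maximum length of 255 or 65535, and an empty name behaving as if the colon were absent) can be sketched as a small helper that builds the verb text. The helper `make_verb` and its parameters are hypothetical. Counting the length in code units of the library's width is an assumption; the documentation gives only the numeric limits.

```python
def make_verb(verb: str, name: str = "", code_unit_bits: int = 8) -> str:
    """Build the text of a backtracking verb, applying the documented name rules.

    Hypothetical helper for illustration; it is not part of any PCRE API.
    """
    if code_unit_bits not in (8, 16, 32):
        raise ValueError("code_unit_bits must be 8, 16 or 32")
    # A name may not contain a closing parenthesis.
    if ")" in name:
        raise ValueError("a name cannot contain ')'")
    # An empty name has the same effect as no colon at all.
    if name == "":
        return "(*" + verb + ")"
    # Assumption: the limit is measured in code units of the library width.
    if code_unit_bits == 8:
        units = len(name.encode("utf-8"))
    elif code_unit_bits == 16:
        units = len(name.encode("utf-16-le")) // 2
    else:
        units = len(name)
    limit = 255 if code_unit_bits == 8 else 65535
    if units > limit:
        raise ValueError("name is longer than %d code units" % limit)
    return "(*" + verb + ":" + name + ")"
```

For example, `make_verb("MARK", "A")` gives `(*MARK:A)`, and `make_verb("PRUNE")` gives `(*PRUNE)`. The helper does not check verb-specific rules, such as (*MARK) always requiring a name or (*COMMIT) not accepting one.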
@@ -2660,7 +2678,7 @@ Optimizations that affect backtracking verbs
PCRE contains some optimizations that are used to speed up matching by running
some checks at the start of each match attempt. For example, it may know the
minimum length of matching subject, or that a particular character must be
-present. When one of these optimizations suppresses the running of a match, any
+present. When one of these optimizations bypasses the running of a match, any
included backtracking verbs will not, of course, be processed. You can suppress
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the
@@ -2687,8 +2705,12 @@ followed by a name.
This verb causes the match to end successfully, skipping the remainder of the
pattern. However, when it is inside a subpattern that is called as a
subroutine, only that subpattern is ended successfully. Matching then continues
-at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so
-far is captured. For example:
+at the outer level. If (*ACCEPT) is triggered in a positive assertion, the
+assertion succeeds; in a negative assertion, the assertion fails.
+</P>
+<P>
+If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
+example:
<pre>
A((?:A|B(*ACCEPT)|C)D)
</pre>
@@ -2722,8 +2744,9 @@ A name is always required with this verb. There may be as many instances of
(*MARK) as you like in a pattern, and their names do not have to be unique.
</P>
<P>
-When a match succeeds, the name of the last-encountered (*MARK) on the matching
-path is passed back to the caller as described in the section entitled
+When a match succeeds, the name of the last-encountered (*MARK:NAME),
+(*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to the
+caller as described in the section entitled
<a href="pcreapi.html#extradata">"Extra data for <b>pcre_exec()</b>"</a>
in the
<a href="pcreapi.html"><b>pcreapi</b></a>
@@ -2744,13 +2767,13 @@ of obtaining this information than putting each alternative in its own
capturing parentheses.
</P>
<P>
-If (*MARK) is encountered in a positive assertion, its name is recorded and
-passed back if it is the last-encountered. This does not happen for negative
-assertions.
+If a verb with a name is encountered in a positive assertion, its name is
+recorded and passed back if it is the last-encountered. This does not happen
+for negative assertions.
</P>
<P>
-After a partial match or a failed match, the name of the last encountered
-(*MARK) in the entire match process is returned. For example:
+After a partial match or a failed match, the last encountered name in the
+entire match process is returned. For example:
<pre>
re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/K
data&#62; XP
@@ -2774,11 +2797,12 @@ Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching continues
with what follows, but if there is no subsequent match, causing a backtrack to
the verb, a failure is forced. That is, backtracking cannot pass to the left of
-the verb. However, when one of these verbs appears inside an atomic group, its
-effect is confined to that group, because once the group has been matched,
-there is never any backtracking into it. In this situation, backtracking can
-"jump back" to the left of the entire atomic group. (Remember also, as stated
-above, that this localization also applies in subroutine calls and assertions.)
+the verb. However, when one of these verbs appears inside an atomic group or an
+assertion, its effect is confined to that group, because once the group has
+been matched, there is never any backtracking into it. In this situation,
+backtracking can "jump back" to the left of the entire atomic group or
+assertion. (Remember also, as stated above, that this localization also applies
+in subroutine calls.)
</P>
<P>
These verbs differ in exactly what kind of failure occurs when backtracking
@@ -2787,10 +2811,12 @@ reaches them.
(*COMMIT)
</pre>
This verb, which may not be followed by a name, causes the whole match to fail
-outright if the rest of the pattern does not match. Even if the pattern is
-unanchored, no further attempts to find a match by advancing the starting point
-take place. Once (*COMMIT) has been passed, <b>pcre_exec()</b> is committed to
-finding a match at the current starting point, or not at all. For example:
+outright if there is a later matching failure that causes backtracking to reach
+it. Even if the pattern is unanchored, no further attempts to find a match by
+advancing the starting point take place. If (*COMMIT) is the only backtracking
+verb that is encountered, once it has been passed <b>pcre_exec()</b> is
+committed to finding a match at the current starting point, or not at all. For
+example:
<pre>
a+(*COMMIT)b
</pre>
@@ -2800,6 +2826,11 @@ recently passed (*MARK) in the path is passed back when (*COMMIT) forces a
match failure.
</P>
<P>
+If there is more than one backtracking verb in a pattern, a different one that
+follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a
+match does not always guarantee that a match must be at this starting point.
+</P>
+<P>
Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE's start-of-match optimizations are turned off, as shown in this
<b>pcretest</b> example:
@@ -2819,15 +2850,20 @@ starting points.
(*PRUNE) or (*PRUNE:NAME)
</pre>
This verb causes the match to fail at the current starting position in the
-subject if the rest of the pattern does not match. If the pattern is
-unanchored, the normal "bumpalong" advance to the next starting character then
-happens. Backtracking can occur as usual to the left of (*PRUNE), before it is
-reached, or when matching to the right of (*PRUNE), but if there is no match to
-the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
-(*PRUNE) is just an alternative to an atomic group or possessive quantifier,
-but there are some uses of (*PRUNE) that cannot be expressed in any other way.
-The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
-anchored pattern (*PRUNE) has the same effect as (*COMMIT).
+subject if there is a later matching failure that causes backtracking to reach
+it. If the pattern is unanchored, the normal "bumpalong" advance to the next
+starting character then happens. Backtracking can occur as usual to the left of
+(*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
+if there is no match to the right, backtracking cannot cross (*PRUNE). In
+simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
+possessive quantifier, but there are some uses of (*PRUNE) that cannot be
+expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
+as (*COMMIT).
+</P>
+<P>
+The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
+It is like (*MARK:NAME) in that the name is remembered for passing back to the
+caller. However, (*SKIP:NAME) searches only for names set with (*MARK).
<pre>
(*SKIP)
</pre>
@@ -2848,31 +2884,39 @@ instead of skipping on to "c".
<pre>
(*SKIP:NAME)
</pre>
-When (*SKIP) has an associated name, its behaviour is modified. If the
-following pattern fails to match, the previous path through the pattern is
-searched for the most recent (*MARK) that has the same name. If one is found,
-the "bumpalong" advance is to the subject position that corresponds to that
-(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
-matching name is found, the (*SKIP) is ignored.
+When (*SKIP) has an associated name, its behaviour is modified. When it is
+triggered, the previous path through the pattern is searched for the most
+recent (*MARK) that has the same name. If one is found, the "bumpalong" advance
+is to the subject position that corresponds to that (*MARK) instead of to where
+(*SKIP) was encountered. If no (*MARK) with a matching name is found, the
+(*SKIP) is ignored.
+</P>
+<P>
+Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores
+names that are set by (*PRUNE:NAME) or (*THEN:NAME).
<pre>
(*THEN) or (*THEN:NAME)
</pre>
-This verb causes a skip to the next innermost alternative if the rest of the
-pattern does not match. That is, it cancels pending backtracking, but only
-within the current alternative. Its name comes from the observation that it can
-be used for a pattern-based if-then-else block:
+This verb causes a skip to the next innermost alternative when backtracking
+reaches it. That is, it cancels any further backtracking within the current
+alternative. Its name comes from the observation that it can be used for a
+pattern-based if-then-else block:
<pre>
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
</pre>
If the COND1 pattern matches, FOO is tried (and possibly further items after
the end of the group if FOO succeeds); on failure, the matcher skips to the
-second alternative and tries COND2, without backtracking into COND1. The
-behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
+second alternative and tries COND2, without backtracking into COND1.
If (*THEN) is not inside an alternation, it acts like (*PRUNE).
</P>
<P>
-Note that a subpattern that does not contain a | character is just a part of
-the enclosing alternative; it is not a nested alternation with only one
+The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
+It is like (*MARK:NAME) in that the name is remembered for passing back to the
+caller. However, (*SKIP:NAME) searches only for names set with (*MARK).
+</P>
+<P>
+A subpattern that does not contain a | character is just a part of the
+enclosing alternative; it is not a nested alternation with only one
alternative. The effect of (*THEN) extends beyond such a subpattern to the
enclosing alternative. Consider this pattern, where A, B, etc. are complex
pattern fragments that do not contain any | characters at this level:
@@ -2892,7 +2936,7 @@ because there are no more alternatives to try. In this case, matching does now
backtrack into A.
</P>
<P>
-Note also that a conditional subpattern is not considered as having two
+Note that a conditional subpattern is not considered as having two
alternatives, because only one is ever used. In other words, the | character in
a conditional subpattern has a different meaning. Ignoring white space,
consider:
@@ -2916,17 +2960,82 @@ unanchored pattern). (*SKIP) is similar, except that the advance may be more
than one character. (*COMMIT) is the strongest, causing the entire match to
fail.
</P>
+<br><b>
+More than one backtracking verb
+</b><br>
+<P>
+If more than one backtracking verb is present in a pattern, the one that is
+backtracked onto first acts. For example, consider this pattern, where A, B,
+etc. are complex pattern fragments:
+<pre>
+ (A(*COMMIT)B(*THEN)C|ABD)
+</pre>
+If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to
+fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes
+the next alternative (ABD) to be tried. This behaviour is consistent, but is
+not always the same as Perl's. It means that if two or more backtracking verbs
+appear in succession, all but the last of them have no effect. Consider this
+example:
+<pre>
+ ...(*COMMIT)(*PRUNE)...
+</pre>
+If there is a matching failure to the right, backtracking onto (*PRUNE) causes
+it to be triggered, and its action is taken. There can never be a backtrack
+onto (*COMMIT).
+<a name="btrepeat"></a></P>
+<br><b>
+Backtracking verbs in repeated groups
+</b><br>
<P>
-If more than one such verb is present in a pattern, the "strongest" one wins.
-For example, consider this pattern, where A, B, etc. are complex pattern
-fragments:
+PCRE differs from Perl in its handling of backtracking verbs in repeated
+groups. For example, consider:
<pre>
- (A(*COMMIT)B(*THEN)C|D)
+ /(a(*COMMIT)b)+ac/
</pre>
-Once A has matched, PCRE is committed to this match, at the current starting
-position. If subsequently B matches, but C does not, the normal (*THEN) action
-of trying the next alternative (that is, D) does not happen because (*COMMIT)
-overrides.
+If the subject is "abac", Perl matches, but PCRE fails because the (*COMMIT) in
+the second repeat of the group acts.
+<a name="btassert"></a></P>
+<br><b>
+Backtracking verbs in assertions
+</b><br>
+<P>
+(*FAIL) in an assertion has its normal effect: it forces an immediate backtrack.
+</P>
+<P>
+(*ACCEPT) in a positive assertion causes the assertion to succeed without any
+further processing. In a negative assertion, (*ACCEPT) causes the assertion to
+fail without any further processing.
+</P>
+<P>
+The other backtracking verbs are not treated specially if they appear in an
+assertion. In particular, (*THEN) skips to the next alternative in the
+innermost enclosing group that has alternations, whether or not this is within
+the assertion.
+<a name="btsub"></a></P>
+<br><b>
+Backtracking verbs in subroutines
+</b><br>
+<P>
+These behaviours occur whether or not the subpattern is called recursively.
+Perl's treatment of subroutines is different in some cases.
+</P>
+<P>
+(*FAIL) in a subpattern called as a subroutine has its normal effect: it forces
+an immediate backtrack.
+</P>
+<P>
+(*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to
+succeed without any further processing. Matching then continues after the
+subroutine call.
+</P>
+<P>
+(*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine cause
+the subroutine match to fail.
+</P>
+<P>
+(*THEN) skips to the next alternative in the innermost enclosing group within
+the subpattern that has alternatives. If there is no such group within the
+subpattern, (*THEN) causes the subroutine match to fail.
</P>
<br><a name="SEC27" href="#TOC1">SEE ALSO</a><br>
<P>
@@ -2944,9 +3053,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC29" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 November 2012
+Last updated: 22 March 2013
<br>
-Copyright &copy; 1997-2012 University of Cambridge.
+Copyright &copy; 1997-2013 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.