Diffstat (limited to 'doc/pcrepattern.3')
-rw-r--r-- doc/pcrepattern.3 187
1 files changed, 155 insertions, 32 deletions
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 27afc4f..a2d02ca 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -2318,6 +2318,7 @@ description of the interface to the callout function is given in the
documentation.
.
.
+.\" HTML <a name="backtrackcontrol"></a>
.SH "BACKTRACKING CONTROL"
.rs
.sp
@@ -2339,15 +2340,27 @@ it does not extend to the surrounding pattern. Note that such subpatterns are
processed as anchored at the point where they are tested.
.P
The new verbs make use of what was previously invalid syntax: an opening
-parenthesis followed by an asterisk. In Perl, they are generally of the form
-(*VERB:ARG) but PCRE does not support the use of arguments, so its general
-form is just (*VERB). Any number of these verbs may occur in a pattern. There
-are two kinds:
+parenthesis followed by an asterisk. They are generally of the form
+(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
+depending on whether or not an argument is present. A name is a sequence of
+letters, digits, and underscores. If the name is empty, that is, if the closing
+parenthesis immediately follows the colon, the effect is as if the colon were
+not there. Any number of these verbs may occur in a pattern.
+.P
+PCRE contains some optimizations that are used to speed up matching by running
+some checks at the start of each match attempt. For example, it may know the
+minimum length of a matching subject, or that a particular character must be
+present. When one of these optimizations suppresses the running of a match, any
+included backtracking verbs will not, of course, be processed. You can suppress
+the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
+when calling \fBpcre_exec()\fP.
+.
.
.SS "Verbs that act immediately"
.rs
.sp
-The following verbs act as soon as they are encountered:
+The following verbs act as soon as they are encountered. They may not be
+followed by a name.
.sp
(*ACCEPT)
.sp
@@ -2374,43 +2387,141 @@ callout feature, as for example in this pattern:
A match with the string "aaaa" always fails, but the callout is taken before
each backtrack happens (in this example, 10 times).
.
+.
+.SS "Recording which path was taken"
+.rs
+.sp
+There is one verb whose main purpose is to track how a match was arrived at,
+though it also has a secondary use in conjunction with advancing the match
+starting point (see (*SKIP) below).
+.sp
+ (*MARK:NAME) or (*:NAME)
+.sp
+A name is always required with this verb. There may be as many instances of
+(*MARK) as you like in a pattern, and their names do not have to be unique.
+.P
+When a match succeeds, the name of the last-encountered (*MARK) is passed back
+to the caller via the \fIpcre_extra\fP data structure, as described in the
+.\" HTML <a href="pcreapi.html#extradata">
+.\" </a>
+section on \fIpcre_extra\fP
+.\"
+in the
+.\" HREF
+\fBpcreapi\fP
+.\"
+documentation. No data is returned for a partial match. Here is an example of
+\fBpcretest\fP output, where the /K modifier requests the retrieval and
+outputting of (*MARK) data:
+.sp
+ /X(*MARK:A)Y|X(*MARK:B)Z/K
+ XY
+ 0: XY
+ MK: A
+ XZ
+ 0: XZ
+ MK: B
+.sp
+The (*MARK) name is tagged with "MK:" in this output, and in this example it
+indicates which of the two alternatives matched. This is a more efficient way
+of obtaining this information than putting each alternative in its own
+capturing parentheses.
+.P
+A name may also be returned after a failed match if the final path through the
+pattern involves (*MARK). However, unless (*MARK) is used in conjunction with
+(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the
+starting point for matching is advanced, the final check is often with an empty
+string, causing a failure before (*MARK) is reached. For example:
+.sp
+ /X(*MARK:A)Y|X(*MARK:B)Z/K
+ XP
+ No match
+.sp
+There are three potential starting points for this match (starting with X,
+starting with P, and with an empty string). If the pattern is anchored, the
+result is different:
+.sp
+ /^X(*MARK:A)Y|^X(*MARK:B)Z/K
+ XP
+ No match, mark = B
+.sp
+PCRE's start-of-match optimizations can also interfere with this. For example,
+if, as a result of a call to \fBpcre_study()\fP, it knows the minimum
+subject length for a match, a shorter subject will not be scanned at all.
+.P
+Note that similar anomalies (though different in detail) exist in Perl, no
+doubt for the same reasons. The use of (*MARK) data after a failed match of an
+unanchored pattern is not recommended, unless (*COMMIT) is involved.
+.
+.
.SS "Verbs that act after backtracking"
.rs
.sp
The following verbs do nothing when they are encountered. Matching continues
-with what follows, but if there is no subsequent match, a failure is forced.
-The verbs differ in exactly what kind of failure occurs.
+with what follows, but if there is no subsequent match, causing a backtrack to
+the verb, a failure is forced. That is, backtracking cannot pass to the left of
+the verb. However, when one of these verbs appears inside an atomic group, its
+effect is confined to that group, because once the group has been matched,
+there is never any backtracking into it. In this situation, backtracking can
+"jump back" to the left of the entire atomic group. (Remember, as stated
+above, that this localization also applies in subroutine calls and assertions.)
+.P
+These verbs differ in exactly what kind of failure occurs when backtracking
+reaches them.
.sp
(*COMMIT)
.sp
-This verb causes the whole match to fail outright if the rest of the pattern
-does not match. Even if the pattern is unanchored, no further attempts to find
-a match by advancing the starting point take place. Once (*COMMIT) has been
-passed, \fBpcre_exec()\fP is committed to finding a match at the current
-starting point, or not at all. For example:
+This verb, which may not be followed by a name, causes the whole match to fail
+outright if the rest of the pattern does not match. Even if the pattern is
+unanchored, no further attempts to find a match by advancing the starting point
+take place. Once (*COMMIT) has been passed, \fBpcre_exec()\fP is committed to
+finding a match at the current starting point, or not at all. For example:
.sp
a+(*COMMIT)b
.sp
This matches "xxaab" but not "aacaab". It can be thought of as a kind of
-dynamic anchor, or "I've started, so I must finish."
-.sp
- (*PRUNE)
-.sp
-This verb causes the match to fail at the current position if the rest of the
-pattern does not match. If the pattern is unanchored, the normal "bumpalong"
-advance to the next starting character then happens. Backtracking can occur as
-usual to the left of (*PRUNE), or when matching to the right of (*PRUNE), but
-if there is no match to the right, backtracking cannot cross (*PRUNE).
-In simple cases, the use of (*PRUNE) is just an alternative to an atomic
-group or possessive quantifier, but there are some uses of (*PRUNE) that cannot
-be expressed in any other way.
+dynamic anchor, or "I've started, so I must finish." The name of the most
+recently passed (*MARK) in the path is passed back when (*COMMIT) forces a
+match failure.
+.P
+Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
+unless PCRE's start-of-match optimizations are turned off, as shown in this
+\fBpcretest\fP example:
+.sp
+ /(*COMMIT)abc/
+ xyzabc
+ 0: abc
+ xyzabc\eY
+ No match
+.sp
+PCRE knows that any match must start with "a", so the optimization skips along
+the subject to "a" before running the first match attempt, which succeeds. When
+the optimization is disabled by the \eY escape in the second subject, the match
+starts at "x" and so the (*COMMIT) causes it to fail without trying any other
+starting points.
+.sp
+ (*PRUNE) or (*PRUNE:NAME)
+.sp
+This verb causes the match to fail at the current starting position in the
+subject if the rest of the pattern does not match. If the pattern is
+unanchored, the normal "bumpalong" advance to the next starting character then
+happens. Backtracking can occur as usual to the left of (*PRUNE), before it is
+reached, or when matching to the right of (*PRUNE), but if there is no match to
+the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
+(*PRUNE) is just an alternative to an atomic group or possessive quantifier,
+but there are some uses of (*PRUNE) that cannot be expressed in any other way.
+The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
+match fails completely; the name is passed back if this is the final attempt.
+(*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored
+pattern (*PRUNE) has the same effect as (*COMMIT).
.sp
(*SKIP)
.sp
-This verb is like (*PRUNE), except that if the pattern is unanchored, the
-"bumpalong" advance is not to the next character, but to the position in the
-subject where (*SKIP) was encountered. (*SKIP) signifies that whatever text
-was matched leading up to it cannot be part of a successful match. Consider:
+This verb, when given without a name, is like (*PRUNE), except that if the
+pattern is unanchored, the "bumpalong" advance is not to the next character,
+but to the position in the subject where (*SKIP) was encountered. (*SKIP)
+signifies that whatever text was matched leading up to it cannot be part of a
+successful match. Consider:
.sp
a+(*SKIP)b
.sp
@@ -2421,7 +2532,17 @@ effect as this example; although it would suppress backtracking during the
first match attempt, the second attempt would start at the second character
instead of skipping on to "c".
.sp
- (*THEN)
+ (*SKIP:NAME)
+.sp
+When (*SKIP) has an associated name, its behaviour is modified. If the
+following pattern fails to match, the previous path through the pattern is
+searched for the most recent (*MARK) that has the same name. If one is found,
+the "bumpalong" advance is to the subject position that corresponds to that
+(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
+matching name is found, normal "bumpalong" of one character happens (the
+(*SKIP) is ignored).
+.sp
+ (*THEN) or (*THEN:NAME)
.sp
This verb causes a skip to the next alternation if the rest of the pattern does
not match. That is, it cancels pending backtracking, but only within the
@@ -2432,8 +2553,10 @@ for a pattern-based if-then-else block:
.sp
If the COND1 pattern matches, FOO is tried (and possibly further items after
the end of the group if FOO succeeds); on failure the matcher skips to the
-second alternative and tries COND2, without backtracking into COND1. If (*THEN)
-is used outside of any alternation, it acts exactly like (*PRUNE).
+second alternative and tries COND2, without backtracking into COND1. The
+behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
+overall match fails. If (*THEN) is not directly inside an alternation, it acts
+like (*PRUNE).
.
.
.SH "SEE ALSO"
@@ -2457,6 +2580,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 06 March 2010
+Last updated: 27 March 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
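
The difference between (*COMMIT), (*PRUNE) and (*SKIP) described in the patched text comes down to where the next match attempt starts after a failure to the right of the verb. The following is a minimal sketch in Python, not PCRE itself: it hand-simulates only the single pattern shape a+(*VERB)b used in the examples above, and the function name `search_a_plus_b` is a hypothetical helper introduced for illustration. It returns the match span (or None) together with the list of starting positions that were tried.

```python
def search_a_plus_b(subject, verb):
    """Toy model of matching a+(*VERB)b, where verb is COMMIT, PRUNE or SKIP.

    This is a hand-written simulation for illustration only; it does not
    call PCRE. Returns ((start, end) or None, list of start positions tried).
    """
    if verb not in ("COMMIT", "PRUNE", "SKIP"):
        raise ValueError("unsupported verb: " + verb)

    tried = []
    start = 0
    n = len(subject)
    while start < n:
        tried.append(start)

        # Greedy a+ : consume as many 'a' characters as possible.
        pos = start
        while pos < n and subject[pos] == "a":
            pos += 1

        if pos == start:
            # a+ failed before the verb was reached, so the verb has no
            # effect and the ordinary one-character bumpalong happens.
            start += 1
            continue

        # The verb has now been passed at subject position 'pos'.
        if pos < n and subject[pos] == "b":
            return (start, pos + 1), tried

        # No match to the right. Backtracking into a+ would have to cross
        # the verb, so the verb decides what kind of failure occurs.
        if verb == "COMMIT":
            return None, tried      # whole match fails outright
        if verb == "PRUNE":
            start += 1              # normal bumpalong to the next character
        else:
            start = pos             # SKIP: resume where the verb was met
    return None, tried


if __name__ == "__main__":
    for verb in ("COMMIT", "PRUNE", "SKIP"):
        print(verb, search_a_plus_b("aacaab", verb))
```

In this model (*SKIP) finds the same match as (*PRUNE) on "aacaab" but tries fewer starting positions, which is the saving the text describes. Real PCRE behaviour also depends on start-of-match optimizations, which this sketch ignores.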