author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-09-18 19:12:35 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-09-18 19:12:35 +0000
commit 20dd865c5c8f10036cda34b9870351b702399c08
tree 3a47dd7d7162f12a80b3fc947e16292b067ffa34 /doc/html/pcrepattern.html
parent eaa446db0f399010171263221963181144b026e0
Add more explanation about recursive subpatterns, and make it possible to
process the documentation without building a whole release.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@453 2f5784b3-3f2a-0410-8824-cb99058d5e15
Diffstat (limited to 'doc/html/pcrepattern.html')
-rw-r--r-- doc/html/pcrepattern.html 97
1 files changed, 81 insertions, 16 deletions
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index 5881bc3..e02a686 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -644,10 +644,10 @@ U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so
cannot be tested by PCRE, unless UTF-8 validity checking has been turned off
(see the discussion of PCRE_NO_UTF8_CHECK in the
<a href="pcreapi.html"><b>pcreapi</b></a>
-page).
+page). Perl does not support the Cs property.
</P>
<P>
-The long synonyms for these properties that Perl supports (such as \p{Letter})
+The long synonyms for property names that Perl supports (such as \p{Letter})
are not supported by PCRE, nor is it permitted to prefix any of these
properties with "Is".
</P>
@@ -1922,7 +1922,7 @@ recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
supports special syntax for recursion of the entire pattern, and also for
individual subpattern recursion. After its introduction in PCRE and Python,
-this kind of recursion was introduced into Perl at release 5.10.
+this kind of recursion was subsequently introduced into Perl at release 5.10.
</P>
<P>
A special item that consists of (? followed by a number greater than zero and a
@@ -1932,12 +1932,6 @@ call, which is described in the next section.) The special item (?R) or (?0) is
a recursive call of the entire regular expression.
</P>
<P>
-In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
-treated as an atomic group. That is, once it has matched some of the subject
-string, it is never re-entered, even if it contains untried alternatives and
-there is a subsequent matching failure.
-</P>
-<P>
This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
<pre>
@@ -2028,6 +2022,72 @@ recursing), whereas any characters are permitted at the outer level.
In this pattern, (?(R) is the start of a conditional subpattern, with two
different alternatives for the recursive and non-recursive cases. The (?R) item
is the actual recursive call.
+<a name="recursiondifference"></a></P>
+<br><b>
+Recursion difference from Perl
+</b><br>
+<P>
+In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
+treated as an atomic group. That is, once it has matched some of the subject
+string, it is never re-entered, even if it contains untried alternatives and
+there is a subsequent matching failure. This can be illustrated by the
+following pattern, which purports to match a palindromic string that contains
+an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
+<pre>
+ ^(.|(.)(?1)\2)$
+</pre>
+The idea is that it either matches a single character, or two identical
+characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
+it does not if the subject string is longer than three characters. Consider
+the subject string "abcba":
+</P>
+<P>
+At the top level, the first character is matched, but as it is not at the end
+of the string, the first alternative fails; the second alternative is taken
+and the recursion kicks in. The recursive call to subpattern 1 successfully
+matches the next character ("b"). (Note that the beginning and end of line
+tests are not part of the recursion.)
+</P>
+<P>
+Back at the top level, the next character ("c") is compared with what
+subpattern 2 matched, which was "a". This fails. Because the recursion is
+treated as an atomic group, there are now no backtracking points, and so the
+entire match fails. (Perl is able, at this point, to re-enter the recursion and
+try the second alternative.) However, if the pattern is written with the
+alternatives in the other order, things are different:
+<pre>
+ ^((.)(?1)\2|.)$
+</pre>
+This time, the recursing alternative is tried first, and continues to recurse
+until it runs out of characters, at which point the recursion fails. But this
+time we do have another alternative to try at the higher level. That is the big
+difference: in the previous case the remaining alternative is at a deeper
+recursion level, which PCRE cannot use.
+</P>
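The two orderings discussed above can be traced with a small simulation. This is only a sketch in Python, not PCRE or Perl code: the helper names `group1_ends` and `matches` are hypothetical, the `atomic` flag is an assumed way of modelling "the recursive call keeps only its first match", and the special handling of newlines by `.` is ignored.

```python
from itertools import islice


def group1_ends(s, i, atomic, recurse_first):
    r"""Yield the end offsets at which group 1 can match s starting at i.

    Models (.|(.)(?1)\2) when recurse_first is False and ((.)(?1)\2|.)
    when it is True. Offsets are produced in the order in which a
    backtracking matcher would try them.
    """
    if i >= len(s):
        return  # both alternatives need at least one character

    def single():
        # Alternative ".": consume exactly one character.
        yield i + 1

    def recursing():
        # Alternative "(.)(?1)\2": capture, recurse, then backreference.
        captured = s[i]
        inner = group1_ends(s, i + 1, atomic, recurse_first)
        if atomic:
            # Atomic recursion: only the first successful match of the
            # recursive call is kept; it is never re-entered.
            inner = islice(inner, 1)
        for j in inner:
            if j < len(s) and s[j] == captured:
                yield j + 1

    order = (recursing, single) if recurse_first else (single, recursing)
    for alternative in order:
        yield from alternative()


def matches(s, atomic, recurse_first=False):
    # The ^ and $ anchors apply at the top level only, which is not a
    # recursive call, so the top level may try all of its alternatives.
    return any(end == len(s)
               for end in group1_ends(s, 0, atomic, recurse_first))


print(matches("abcba", atomic=True))                      # False
print(matches("abcba", atomic=False))                     # True
print(matches("abcba", atomic=True, recurse_first=True))  # True
```

With `atomic=True` and the original ordering, the call at "b" returns after one character and cannot be retried, which is the failure described above; putting the recursing alternative first leaves the fallback at the outer level, where it is still available.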
+<P>
+To change the pattern so that it matches all palindromic strings, not just
+those with an odd number of characters, it is tempting to change the pattern to
+this:
+<pre>
+ ^((.)(?1)\2|.?)$
+</pre>
+Again, this works in Perl, but not in PCRE, and for the same reason. When a
+deeper recursion has matched a single character, it cannot be entered again in
+order to match an empty string. The solution is to separate the two cases, and
+write out the odd and even cases as alternatives at the higher level:
+<pre>
+ ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
+</pre>
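The failure of the `.?` variant on even-length strings can be traced in the same way. The following Python sketch is a hand-written model of `^((.)(?1)\2|.?)$`, not PCRE code: `group1_ends` and `matches` are hypothetical helper names, and the `atomic` flag is an assumption used to contrast a recursion that keeps only its first match with one that can be re-entered.

```python
from itertools import islice


def group1_ends(s, i, atomic):
    """Yield end offsets for ((.)(?1)\\2|.?) starting at offset i."""
    # Alternative 1: (.)(?1)\2
    if i < len(s):
        captured = s[i]
        inner = group1_ends(s, i + 1, atomic)
        if atomic:
            # Keep only the first match of the recursive call.
            inner = islice(inner, 1)
        for j in inner:
            if j < len(s) and s[j] == captured:
                yield j + 1
    # Alternative 2: .? is greedy, so one character is tried first,
    # and the empty string second.
    if i < len(s):
        yield i + 1
    yield i


def matches(s, atomic):
    # ^ and $ are applied at the top level, outside the recursion.
    return any(end == len(s) for end in group1_ends(s, 0, atomic))


print(matches("abba", atomic=True))   # False
print(matches("abba", atomic=False))  # True
print(matches("aba", atomic=True))    # True
```

For "abba" with `atomic=True`, the recursive call that starts at the second "b" settles for a single character and cannot be re-entered to match the empty string instead, so the outer backreference never lines up.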
+If you want to match typical palindromic phrases, the pattern has to ignore all
+non-word characters, which can be done like this:
+<pre>
+ ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
+</pre>
+If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
+man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
+the use of the possessive quantifier *+ to avoid backtracking into sequences of
+non-word characters. Without this, PCRE takes a great deal longer (ten times or
+more) to match typical phrases, and Perl takes so long that you think it has
+gone into a loop.
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
@@ -2138,6 +2198,12 @@ failing negative assertion, they cause an error if encountered by
<b>pcre_dfa_exec()</b>.
</P>
<P>
+If any of these verbs are used in an assertion subpattern, their effect is
+confined to that subpattern; it does not extend to the surrounding pattern.
+Note that assertion subpatterns are processed as anchored at the point where
+they are tested.
+</P>
+<P>
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. In Perl, they are generally of the form
(*VERB:ARG) but PCRE does not support the use of arguments, so its general
@@ -2154,14 +2220,13 @@ The following verbs act as soon as they are encountered:
</pre>
This verb causes the match to end successfully, skipping the remainder of the
pattern. When inside a recursion, only the innermost pattern is ended
-immediately. PCRE differs from Perl in what happens if the (*ACCEPT) is inside
-capturing parentheses. In Perl, the data so far is captured: in PCRE no data is
-captured. For example:
+immediately. If the (*ACCEPT) is inside capturing parentheses, the data so far
+is captured. (This feature was added to PCRE at release 8.00.) For example:
<pre>
- A(A|B(*ACCEPT)|C)D
+ A((?:A|B(*ACCEPT)|C)D)
</pre>
-This matches "AB", "AAD", or "ACD", but when it matches "AB", no data is
-captured.
+This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
+the outer parentheses.
<pre>
(*FAIL) or (*F)
</pre>
@@ -2253,7 +2318,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 April 2009
+Last updated: 18 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>