summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>2009-09-18 19:12:35 +0000
committerph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>2009-09-18 19:12:35 +0000
commit20dd865c5c8f10036cda34b9870351b702399c08 (patch)
tree3a47dd7d7162f12a80b3fc947e16292b067ffa34
parenteaa446db0f399010171263221963181144b026e0 (diff)
downloadpcre-20dd865c5c8f10036cda34b9870351b702399c08.tar.gz
Add more explanation about recursive subpatterns, and make it possible to
process the documenation without building a whole release. git-svn-id: svn://vcs.exim.org/pcre/code/trunk@453 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rwxr-xr-xPrepareRelease9
-rw-r--r--doc/html/pcre_dfa_exec.html44
-rw-r--r--doc/html/pcre_exec.html40
-rw-r--r--doc/html/pcreapi.html69
-rw-r--r--doc/html/pcrebuild.html12
-rw-r--r--doc/html/pcrecompat.html23
-rw-r--r--doc/html/pcredemo.html14
-rw-r--r--doc/html/pcregrep.html4
-rw-r--r--doc/html/pcrematching.html8
-rw-r--r--doc/html/pcrepartial.html135
-rw-r--r--doc/html/pcrepattern.html97
-rw-r--r--doc/html/pcreposix.html19
-rw-r--r--doc/html/pcretest.html24
-rw-r--r--doc/pcre.txt1495
-rw-r--r--doc/pcrecompat.316
-rw-r--r--doc/pcredemo.314
-rw-r--r--doc/pcregrep.txt411
-rw-r--r--doc/pcrepattern.373
-rw-r--r--doc/pcretest.txt296
-rw-r--r--testdata/testinput1118
-rw-r--r--testdata/testinput28
-rw-r--r--testdata/testoutput1157
-rw-r--r--testdata/testoutput233
23 files changed, 1648 insertions, 1271 deletions
diff --git a/PrepareRelease b/PrepareRelease
index de63e78..b083095 100755
--- a/PrepareRelease
+++ b/PrepareRelease
@@ -4,8 +4,9 @@
# processing of the documentation, detrails files, and creates pcre.h.generic
# and config.h.generic (for use by builders who can't run ./configure).
-# You must run this script before runnning "make dist". It makes use of the
-# following files:
+# You must run this script before runnning "make dist". If its first argument
+# is "doc", it stops after preparing the documentation. There are no other
+# arguments. The script makes use of the following files:
# 132html A Perl script that converts a .1 or .3 man page into HTML. It
# "knows" the relevant troff constructs that are used in the PCRE
@@ -119,6 +120,7 @@ done
# Exclude table of contents for function summaries. It seems that expr
# forces an anchored regex. Also exclude them for small pages that have
# only one section.
+
for file in *.3 ; do
base=`basename $file .3`
toc=-toc
@@ -134,10 +136,11 @@ for file in *.3 ; do
if [ $? != 0 ] ; then exit 1; fi
done
-# End of documentation processing
+# End of documentation processing; stop if only documentation required.
cd ..
echo Documentation done
+if [ "$1" = "doc" ] ; then exit; fi
# These files are detrailed; do not detrail the test data because there may be
# significant trailing spaces. The configure files are also omitted from the
diff --git a/doc/html/pcre_dfa_exec.html b/doc/html/pcre_dfa_exec.html
index a242f75..28bcf0d 100644
--- a/doc/html/pcre_dfa_exec.html
+++ b/doc/html/pcre_dfa_exec.html
@@ -48,27 +48,29 @@ matching function is <b>pcre_exec()</b>. The arguments for this function are:
</pre>
The options are:
<pre>
- PCRE_ANCHORED Match only at the first position
- PCRE_BSR_ANYCRLF \R matches only CR, LF, or CRLF
- PCRE_BSR_UNICODE \R matches all Unicode line endings
- PCRE_NEWLINE_ANY Recognize any Unicode newline sequence
- PCRE_NEWLINE_ANYCRLF Recognize CR, LF, and CRLF as newline sequences
- PCRE_NEWLINE_CR Set CR as the newline sequence
- PCRE_NEWLINE_CRLF Set CRLF as the newline sequence
- PCRE_NEWLINE_LF Set LF as the newline sequence
- PCRE_NOTBOL Subject is not the beginning of a line
- PCRE_NOTEOL Subject is not the end of a line
- PCRE_NOTEMPTY An empty string is not a valid match
- PCRE_NO_START_OPTIMIZE Do not do "start-match" optimizations
- PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
- validity (only relevant if PCRE_UTF8
- was set at compile time)
- PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial match
- PCRE_PARTIAL_SOFT ) if no full matches are found
- PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
- even if there is a full match as well
- PCRE_DFA_SHORTEST Return only the shortest match
- PCRE_DFA_RESTART This is a restart after a partial match
+ PCRE_ANCHORED Match only at the first position
+ PCRE_BSR_ANYCRLF \R matches only CR, LF, or CRLF
+ PCRE_BSR_UNICODE \R matches all Unicode line endings
+ PCRE_NEWLINE_ANY Recognize any Unicode newline sequence
+ PCRE_NEWLINE_ANYCRLF Recognize CR, LF, & CRLF as newline sequences
+ PCRE_NEWLINE_CR Recognize CR as the only newline sequence
+ PCRE_NEWLINE_CRLF Recognize CRLF as the only newline sequence
+ PCRE_NEWLINE_LF Recognize LF as the only newline sequence
+ PCRE_NOTBOL Subject is not the beginning of a line
+ PCRE_NOTEOL Subject is not the end of a line
+ PCRE_NOTEMPTY An empty string is not a valid match
+ PCRE_NOTEMPTY_ATSTART An empty string at the start of the subject
+ is not a valid match
+ PCRE_NO_START_OPTIMIZE Do not do "start-match" optimizations
+ PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
+ validity (only relevant if PCRE_UTF8
+ was set at compile time)
+ PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial
+ PCRE_PARTIAL_SOFT ) match if no full matches are found
+ PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
+ even if there is a full match as well
+ PCRE_DFA_SHORTEST Return only the shortest match
+ PCRE_DFA_RESTART Restart after a partial match
</pre>
There are restrictions on what may appear in a pattern when using this matching
function. Details are given in the
diff --git a/doc/html/pcre_exec.html b/doc/html/pcre_exec.html
index ccc3db1..fc0938a 100644
--- a/doc/html/pcre_exec.html
+++ b/doc/html/pcre_exec.html
@@ -44,25 +44,27 @@ offsets to captured substrings. Its arguments are:
</pre>
The options are:
<pre>
- PCRE_ANCHORED Match only at the first position
- PCRE_BSR_ANYCRLF \R matches only CR, LF, or CRLF
- PCRE_BSR_UNICODE \R matches all Unicode line endings
- PCRE_NEWLINE_ANY Recognize any Unicode newline sequence
- PCRE_NEWLINE_ANYCRLF Recognize CR, LF, and CRLF as newline sequences
- PCRE_NEWLINE_CR Set CR as the newline sequence
- PCRE_NEWLINE_CRLF Set CRLF as the newline sequence
- PCRE_NEWLINE_LF Set LF as the newline sequence
- PCRE_NOTBOL Subject is not the beginning of a line
- PCRE_NOTEOL Subject is not the end of a line
- PCRE_NOTEMPTY An empty string is not a valid match
- PCRE_NO_START_OPTIMIZE Do not do "start-match" optimizations
- PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
- validity (only relevant if PCRE_UTF8
- was set at compile time)
- PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial match
- PCRE_PARTIAL_SOFT ) if no full matches are found
- PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
- even if there is a full match as well
+ PCRE_ANCHORED Match only at the first position
+ PCRE_BSR_ANYCRLF \R matches only CR, LF, or CRLF
+ PCRE_BSR_UNICODE \R matches all Unicode line endings
+ PCRE_NEWLINE_ANY Recognize any Unicode newline sequence
+ PCRE_NEWLINE_ANYCRLF Recognize CR, LF, & CRLF as newline sequences
+ PCRE_NEWLINE_CR Recognize CR as the only newline sequence
+ PCRE_NEWLINE_CRLF Recognize CRLF as the only newline sequence
+ PCRE_NEWLINE_LF Recognize LF as the only newline sequence
+ PCRE_NOTBOL Subject string is not the beginning of a line
+ PCRE_NOTEOL Subject string is not the end of a line
+ PCRE_NOTEMPTY An empty string is not a valid match
+ PCRE_NOTEMPTY_ATSTART An empty string at the start of the subject
+ is not a valid match
+ PCRE_NO_START_OPTIMIZE Do not do "start-match" optimizations
+ PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
+ validity (only relevant if PCRE_UTF8
+ was set at compile time)
+ PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial
+ PCRE_PARTIAL_SOFT ) match if no full matches are found
+ PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
+ even if there is a full match as well
</pre>
For details of partial matching, see the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html
index fe08e74..1d1ca12 100644
--- a/doc/html/pcreapi.html
+++ b/doc/html/pcreapi.html
@@ -175,9 +175,10 @@ documentation describes how to compile and run it.
A second matching function, <b>pcre_dfa_exec()</b>, which is not
Perl-compatible, is also provided. This uses a different algorithm for the
matching. The alternative algorithm finds all possible matches (at a given
-point in the subject), and scans the subject just once. However, this algorithm
-does not return captured substrings. A description of the two matching
-algorithms and their advantages and disadvantages is given in the
+point in the subject), and scans the subject just once (unless there are
+lookbehind assertions). However, this algorithm does not return captured
+substrings. A description of the two matching algorithms and their advantages
+and disadvantages is given in the
<a href="pcrematching.html"><b>pcrematching</b></a>
documentation.
</P>
@@ -1017,10 +1018,10 @@ different for each compiled pattern.
<pre>
PCRE_INFO_OKPARTIAL
</pre>
-Return 1 if the pattern can be used for partial matching, otherwise 0. The
-fourth argument should point to an <b>int</b> variable. From release 8.00, this
-always returns 1, because the restrictions that previously applied to partial
-matching have been lifted. The
+Return 1 if the pattern can be used for partial matching with
+<b>pcre_exec()</b>, otherwise 0. The fourth argument should point to an
+<b>int</b> variable. From release 8.00, this always returns 1, because the
+restrictions that previously applied to partial matching have been lifted. The
<a href="pcrepartial.html"><b>pcrepartial</b></a>
documentation gives details of partial matching.
<pre>
@@ -1224,8 +1225,8 @@ PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the <i>flags</i> field. If the limit
is exceeded, <b>pcre_exec()</b> returns PCRE_ERROR_RECURSIONLIMIT.
</P>
<P>
-The <i>pcre_callout</i> field is used in conjunction with the "callout" feature,
-which is described in the
+The <i>callout_data</i> field is used in conjunction with the "callout" feature,
+and is described in the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation.
</P>
@@ -1248,8 +1249,9 @@ Option bits for <b>pcre_exec()</b>
<P>
The unused bits of the <i>options</i> argument for <b>pcre_exec()</b> must be
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_<i>xxx</i>,
-PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE,
-PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.
+PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
+PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and
+PCRE_PARTIAL_HARD.
<pre>
PCRE_ANCHORED
</pre>
@@ -1328,18 +1330,25 @@ match the empty string, the entire match fails. For example, if the pattern
<pre>
a?b?
</pre>
-is applied to a string not beginning with "a" or "b", it matches the empty
+is applied to a string not beginning with "a" or "b", it matches an empty
string at the start of the subject. With PCRE_NOTEMPTY set, this match is not
valid, so PCRE searches further into the string for occurrences of "a" or "b".
+<pre>
+ PCRE_NOTEMPTY_ATSTART
+</pre>
+This is like PCRE_NOTEMPTY, except that an empty string match that is not at
+the start of the subject is permitted. If the pattern is anchored, such a match
+can occur only if the pattern contains \K.
</P>
<P>
-Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a special case
-of a pattern match of the empty string within its <b>split()</b> function, and
-when using the /g modifier. It is possible to emulate Perl's behaviour after
-matching a null string by first trying the match again at the same offset with
-PCRE_NOTEMPTY and PCRE_ANCHORED, and then if that fails by advancing the
-starting offset (see below) and trying an ordinary match again. There is some
-code that demonstrates how to do this in the
+Perl has no direct equivalent of PCRE_NOTEMPTY or PCRE_NOTEMPTY_ATSTART, but it
+does make a special case of a pattern match of the empty string within its
+<b>split()</b> function, and when using the /g modifier. It is possible to
+emulate Perl's behaviour after matching a null string by first trying the match
+again at the same offset with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then
+if that fails, by advancing the starting offset (see below) and trying an
+ordinary match again. There is some code that demonstrates how to do this in
+the
<a href="pcredemo.html"><b>pcredemo</b></a>
sample program.
<pre>
@@ -1389,8 +1398,8 @@ PCRE_PARTIAL_HARD is set, <b>pcre_exec()</b> immediately returns
PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, matching continues
by testing any other alternatives. Only if they all fail is PCRE_ERROR_PARTIAL
returned (instead of PCRE_ERROR_NOMATCH). The portion of the string that
-provided the partial match is set as the first matching string. There is a more
-detailed discussion in the
+was inspected when the partial match was found is set as the first matching
+string. There is a more detailed discussion in the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
documentation.
</P>
@@ -1837,8 +1846,8 @@ a compiled pattern, using a matching algorithm that scans the subject string
just once, and does not backtrack. This has different characteristics to the
normal algorithm, and is not compatible with Perl. Some of the features of PCRE
patterns are not supported. Nevertheless, there are times when this kind of
-matching can be useful. For a discussion of the two matching algorithms, see
-the
+matching can be useful. For a discussion of the two matching algorithms, and a
+list of features that <b>pcre_dfa_exec()</b> does not support, see the
<a href="pcrematching.html"><b>pcrematching</b></a>
documentation.
</P>
@@ -1880,10 +1889,10 @@ Option bits for <b>pcre_dfa_exec()</b>
<P>
The unused bits of the <i>options</i> argument for <b>pcre_dfa_exec()</b> must be
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_<i>xxx</i>,
-PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD,
-PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
-four of these are exactly the same as for <b>pcre_exec()</b>, so their
-description is not repeated here.
+PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
+PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST,
+and PCRE_DFA_RESTART. All but the last four of these are exactly the same as
+for <b>pcre_exec()</b>, so their description is not repeated here.
<pre>
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
@@ -1896,8 +1905,8 @@ additional characters. This happens even if some complete matches have also
been found. When PCRE_PARTIAL_SOFT is set, the return code PCRE_ERROR_NOMATCH
is converted into PCRE_ERROR_PARTIAL if the end of the subject is reached,
there have been no complete matches, but there is still at least one matching
-possibility. The portion of the string that provided the longest partial match
-is set as the first matching string in both cases.
+possibility. The portion of the string that was inspected when the longest
+partial match was found is set as the first matching string in both cases.
<pre>
PCRE_DFA_SHORTEST
</pre>
@@ -2009,7 +2018,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 01 September 2009
+Last updated: 11 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrebuild.html b/doc/html/pcrebuild.html
index 7f06250..adff0e9 100644
--- a/doc/html/pcrebuild.html
+++ b/doc/html/pcrebuild.html
@@ -39,8 +39,14 @@ the library is compiled. It assumes use of the <b>configure</b> script, where
the optional features are selected or deselected by providing options to
<b>configure</b> before running the <b>make</b> command. However, the same
options can be selected in both Unix-like and non-Unix-like environments using
-the GUI facility of <b>CMakeSetup</b> if you are using <b>CMake</b> instead of
-<b>configure</b> to build PCRE.
+the GUI facility of <b>cmake-gui</b> if you are using <b>CMake</b> instead of
+<b>configure</b> to build PCRE.
+</P>
+<P>
+There is a lot more information about building PCRE in non-Unix-like
+environments in the file called <i>NON_UNIX_USE</i>, which is part of the PCRE
+distribution. You should consult this file as well as the <i>README</i> file if
+you are building in a non-Unix-like environment.
</P>
<P>
The complete list of options for <b>configure</b> (which includes the standard
@@ -339,7 +345,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC18" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 17 March 2009
+Last updated: 06 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrecompat.html b/doc/html/pcrecompat.html
index 5567c27..5ad6036 100644
--- a/doc/html/pcrecompat.html
+++ b/doc/html/pcrecompat.html
@@ -59,7 +59,10 @@ encountered by PCRE, an error is generated.
built with Unicode character property support. The properties that can be
tested with \p and \P are limited to the general category properties such as
Lu and Nd, script names such as Greek or Han, and the derived properties Any
-and L&.
+and L&. PCRE does support the Cs (surrogate) property, which Perl does not; the
+Perl documentation says "Because Perl hides the need for the user to understand
+the internal representation of Unicode characters, there is no need to
+implement the somewhat messy concept of surrogates."
</P>
<P>
7. PCRE does support the \Q...\E escape for quoting substrings. Characters in
@@ -79,7 +82,7 @@ The \Q...\E sequence is recognized both inside and outside character classes.
<P>
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This is not
-available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE "callout"
+available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE "callout"
feature allows an external function to be called during pattern matching. See
the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
@@ -87,7 +90,12 @@ documentation for details.
</P>
<P>
9. Subpatterns that are called recursively or as "subroutines" are always
-treated as atomic groups in PCRE. This is like Python, but unlike Perl.
+treated as atomic groups in PCRE. This is like Python, but unlike Perl. There
+is a discussion of an example that explains this in more detail in the
+<a href="pcrepattern.html#recursiondifference">section on recursion differences from Perl</a>
+in the
+<a href="pcrecompat.html"><b>pcrecompat</b></a>
+page.
</P>
<P>
10. There are some differences that are concerned with the settings of captured
@@ -97,8 +105,7 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
<P>
11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), (*FAIL), (*F),
(*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in the forms without an
-argument. PCRE does not support (*MARK). If (*ACCEPT) is within capturing
-parentheses, PCRE does not set that capture group; this is different to Perl.
+argument. PCRE does not support (*MARK).
</P>
<P>
12. PCRE provides some extensions to the Perl regular expression facilities.
@@ -130,8 +137,8 @@ question mark they are.
only at the first matching position in the subject string.
<br>
<br>
-(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAPTURE
-options for <b>pcre_exec()</b> have no Perl equivalents.
+(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, and
+PCRE_NO_AUTO_CAPTURE options for <b>pcre_exec()</b> have no Perl equivalents.
<br>
<br>
(g) The \R escape sequence can be restricted to match only CR, LF, or CRLF
@@ -170,7 +177,7 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 25 August 2009
+Last updated: 18 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcredemo.html b/doc/html/pcredemo.html
index 57b4d1d..3978560 100644
--- a/doc/html/pcredemo.html
+++ b/doc/html/pcredemo.html
@@ -240,12 +240,12 @@ if (namecount &lt;= 0) printf("No named substrings\n"); else
* *
* If the previous match WAS for an empty string, we can't do that, as it *
* would lead to an infinite loop. Instead, a special call of pcre_exec() *
-* is made with the PCRE_NOTEMPTY and PCRE_ANCHORED flags set. The first *
-* of these tells PCRE that an empty string is not a valid match; other *
-* possibilities must be tried. The second flag restricts PCRE to one *
-* match attempt at the initial string position. If this match succeeds, *
-* an alternative to the empty string match has been found, and we can *
-* proceed round the loop. *
+* is made with the PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED flags set. *
+* The first of these tells PCRE that an empty string at the start of the *
+* subject is not a valid match; other possibilities must be tried. The *
+* second flag restricts PCRE to one match attempt at the initial string *
+* position. If this match succeeds, an alternative to the empty string *
+* match has been found, and we can proceed round the loop. *
*************************************************************************/
if (!find_all)
@@ -268,7 +268,7 @@ for (;;)
if (ovector[0] == ovector[1])
{
if (ovector[0] == subject_length) break;
- options = PCRE_NOTEMPTY | PCRE_ANCHORED;
+ options = PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED;
}
/* Run the next matching operation */
diff --git a/doc/html/pcregrep.html b/doc/html/pcregrep.html
index 5256a67..8ab07fe 100644
--- a/doc/html/pcregrep.html
+++ b/doc/html/pcregrep.html
@@ -98,7 +98,7 @@ above options is used.
</P>
<P>
Patterns that can match an empty string are accepted, but empty string
-matches are not recognized. An example is the pattern "(super)?(man)?", in
+matches are never recognized. An example is the pattern "(super)?(man)?", in
which all components are optional. This pattern finds all occurrences of both
"super" and "man"; the output differs from matching with "super|man" when only
the matching substrings are being shown.
@@ -538,7 +538,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 12 August 2009
+Last updated: 13 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrematching.html b/doc/html/pcrematching.html
index 2a60581..9c2d86c 100644
--- a/doc/html/pcrematching.html
+++ b/doc/html/pcrematching.html
@@ -116,6 +116,12 @@ character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
</P>
<P>
+Although the general principle of this matching algorithm is that it scans the
+subject string only once, without backtracking, there is one exception: when a
+lookbehind assertion is encountered, the preceding characters have to be
+re-inspected.
+</P>
+<P>
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
</P>
@@ -209,7 +215,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 25 August 2009
+Last updated: 05 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrepartial.html b/doc/html/pcrepartial.html
index 3801320..80eb3bb 100644
--- a/doc/html/pcrepartial.html
+++ b/doc/html/pcrepartial.html
@@ -51,10 +51,10 @@ is very long and is not all available at once.
<P>
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling <b>pcre_exec()</b> or
-<b>pcre_dfa_exec()</b>. For backwards compatibility, PCRE_PARTIAL is a synonym
-for PCRE_PARTIAL_SOFT. The essential difference between the two options is
-whether or not a partial match is preferred to an alternative complete match,
-though the details differ between the two matching functions. If both options
+<b>pcre_dfa_exec()</b>. For backwards compatibility, PCRE_PARTIAL is a synonym
+for PCRE_PARTIAL_SOFT. The essential difference between the two options is
+whether or not a partial match is preferred to an alternative complete match,
+though the details differ between the two matching functions. If both options
are set, PCRE_PARTIAL_HARD takes precedence.
</P>
<P>
@@ -74,50 +74,69 @@ been matched. (In other words, a partial match can never be an empty string.)
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
continues as normal, and other alternatives in the pattern are tried. If no
complete match can be found, <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL
-instead of PCRE_ERROR_NOMATCH, and if there are at least two slots in the
-offsets vector, they are filled in with the offsets of the longest string that
-partially matched. Consider this pattern:
+instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
+vector, the first of them is set to the offset of the earliest character that
+was inspected when the partial match was found. For convenience, the second
+offset points to the end of the string so that a substring can easily be
+extracted.
+</P>
+<P>
+For the majority of patterns, the first offset identifies the start of the
+partially matched string. However, for patterns that contain lookbehind
+assertions, or \K, or begin with \b or \B, earlier characters have been
+inspected while carrying out the match. For example:
+<pre>
+ /(?&#60;=abc)123/
+</pre>
+This pattern matches "123", but only if it is preceded by "abc". If the subject
+string is "xyzabc12", the offsets after a partial match are for the substring
+"abc12", because all these characters are needed if another match is tried
+with extra characters added.
+</P>
+<P>
+If there is more than one partial match, the first one that was found provides
+the data that is returned. Consider this pattern:
<pre>
/123\w+X|dogY/
</pre>
If this is matched against the subject string "abc123dog", both
-alternatives fail to match, but the end of the subject is reached during
+alternatives fail to match, but the end of the subject is reached during
matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
-offsets are set to 3 and 9, identifying "123dog" as the longest partial match
+offsets are set to 3 and 9, identifying "123dog" as the first partial match
that was found. (In this example, there are two partial matches, because "dog"
on its own partially matches the second alternative.)
</P>
<P>
-If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns
+If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns
PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
search for possible complete matches. The difference between the two options
can be illustrated by a pattern such as:
<pre>
/dog(sbody)?/
</pre>
-This matches either "dog" or "dogsbody", greedily (that is, it prefers the
+This matches either "dog" or "dogsbody", greedily (that is, it prefers the
longer string if possible). If it is matched against the string "dog" with
-PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
-PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
+PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
+PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
if the pattern is made ungreedy the result is different:
<pre>
/dog(sbody)??/
</pre>
-In this case the result is always a complete match because <b>pcre_exec()</b>
-finds that first, and it never continues after finding a match. It might be
+In this case the result is always a complete match because <b>pcre_exec()</b>
+finds that first, and it never continues after finding a match. It might be
easier to follow this explanation by thinking of the two patterns like this:
<pre>
/dog(sbody)?/ is the same as /dogsbody|dog/
/dog(sbody)??/ is the same as /dog|dogsbody/
</pre>
-The second pattern will never match "dogsbody" when <b>pcre_exec()</b> is
+The second pattern will never match "dogsbody" when <b>pcre_exec()</b> is
used, because it will always find the shorter match first.
</P>
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec()</a><br>
<P>
-The <b>pcre_dfa_exec()</b> function moves along the subject string character by
-character, without backtracking, searching for all possible matches
-simultaneously. If the end of the subject is reached before the end of the
+The <b>pcre_dfa_exec()</b> function moves along the subject string character by
+character, without backtracking, searching for all possible matches
+simultaneously. If the end of the subject is reached before the end of the
pattern, there is the possibility of a partial match, again provided that at
least one character has matched.
</P>
@@ -125,40 +144,40 @@ least one character has matched.
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
-complete matches. The portion of the string that provided the longest partial
-match is set as the first matching string, provided there are at least two
-slots in the offsets vector.
+complete matches. The portion of the string that was inspected when the longest
+partial match was found is set as the first matching string, provided there are
+at least two slots in the offsets vector.
</P>
<P>
-Because <b>pcre_dfa_exec()</b> always searches for all possible matches, and
+Because <b>pcre_dfa_exec()</b> always searches for all possible matches, and
there is no difference between greedy and ungreedy repetition, its behaviour is
-different from <b>pcre_exec</b> when PCRE_PARTIAL_HARD is set. Consider the
+different from <b>pcre_exec</b> when PCRE_PARTIAL_HARD is set. Consider the
string "dog" matched against the ungreedy pattern shown above:
<pre>
/dog(sbody)??/
</pre>
-Whereas <b>pcre_exec()</b> stops as soon as it finds the complete match for
+Whereas <b>pcre_exec()</b> stops as soon as it finds the complete match for
"dog", <b>pcre_dfa_exec()</b> also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.
</P>
<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
<P>
-If a pattern ends with one of sequences \w or \W, which test for word
-boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
+If a pattern ends with one of sequences \w or \W, which test for word
+boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
results. Consider this pattern:
<pre>
/\bcat\b/
</pre>
This matches "cat", provided there is a word boundary at either end. If the
subject string is "the cat", the comparison of the final "t" with a following
-character cannot take place, so a partial match is found. However,
-<b>pcre_exec()</b> carries on with normal matching, which matches \b at the end
-of the subject when the last character is a letter, thus finding a complete
-match. The result, therefore, is <i>not</i> PCRE_ERROR_PARTIAL. The same thing
+character cannot take place, so a partial match is found. However,
+<b>pcre_exec()</b> carries on with normal matching, which matches \b at the end
+of the subject when the last character is a letter, thus finding a complete
+match. The result, therefore, is <i>not</i> PCRE_ERROR_PARTIAL. The same thing
happens with <b>pcre_dfa_exec()</b>, because it also finds the complete match.
</P>
<P>
-Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
+Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
then the partial match takes precedence.
</P>
<br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br>
@@ -236,10 +255,10 @@ facility can be used to pass very long subject strings to
</P>
<br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec()</a><br>
<P>
-From release 8.00, <b>pcre_exec()</b> can also be used to do multi-segment
-matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the
-previous match with a new segment of data. Instead, new data must be added to
-the previous subject string, and the entire match re-run, starting from the
+From release 8.00, <b>pcre_exec()</b> can also be used to do multi-segment
+matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the
+previous match with a new segment of data. Instead, new data must be added to
+the previous subject string, and the entire match re-run, starting from the
point where the partial match occurred. Earlier data can be discarded.
Consider an unanchored pattern that matches dates:
<pre>
@@ -247,15 +266,21 @@ Consider an unanchored pattern that matches dates:
data&#62; The date is 23ja\P
Partial match: 23ja
</pre>
-The this stage, an application could discard the text preceding "23ja", add on
-text from the next segment, and call <b>pcre_exec()</b> again. Unlike
-<b>pcre_dfa_exec()</b>, the entire matching string must always be available, and
-the complete matching process occurs for each call, so more memory and more
+The this stage, an application could discard the text preceding "23ja", add on
+text from the next segment, and call <b>pcre_exec()</b> again. Unlike
+<b>pcre_dfa_exec()</b>, the entire matching string must always be available, and
+the complete matching process occurs for each call, so more memory and more
processing time is needed.
</P>
+<P>
+<b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts
+with \b or \B, the string that is returned for a partial match will include
+characters that precede the partially matched string itself, because these must
+be retained when adding on more characters for a subsequent matching attempt.
+</P>
<br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
<P>
-Certain types of pattern may give problems with multi-segment matching,
+Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
</P>
<P>
@@ -264,18 +289,18 @@ to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
</P>
<P>
-2. If the pattern contains backward assertions (including \b or \B), you need
-to arrange for some overlap in the subject strings to allow for them to be
-correctly tested at the start of each substring. For example, using
-<b>pcre_dfa_exec()</b>, you could pass the subject in chunks that are 500 bytes
-long, but in a buffer of 700 bytes, with the starting offset set to 200 and the
-previous 200 bytes at the start of the buffer.
+2. Lookbehind assertions at the start of a pattern are catered for in the
+offsets that are returned for a partial match. However, in theory, a lookbehind
+assertion later in the pattern could require even earlier characters to be
+inspected, and it might not have been reached when a partial match occurs. This
+is probably an extremely unlikely case; you could guard against it to a certain
+extent by always including extra characters at the start.
</P>
<P>
3. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
-especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
-Word Boundaries" above describes an issue that arises if the pattern ends with
+especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
+Word Boundaries" above describes an issue that arises if the pattern ends with
\b or \B. Another kind of difference may occur when there are multiple
matching possibilities, because a partial match result is given only when there
are no completed matches. This means that as soon as the shortest match has
@@ -284,7 +309,7 @@ Consider again this <b>pcretest</b> example:
<pre>
re&#62; /dog(sbody)?/
data&#62; dogsb\P
- 0: dog
+ 0: dog
data&#62; do\P\D
Partial match: do
data&#62; gsb\R\P\D
@@ -308,17 +333,17 @@ matching multi-segment data. The example above then behaves differently:
<pre>
re&#62; /dog(sbody)?/
data&#62; dogsb\P\P
- Partial match: dogsb
+ Partial match: dogsb
data&#62; do\P\D
Partial match: do
data&#62; gsb\R\P\P\D
- Partial match: gsb
+ Partial match: gsb
</PRE>
</P>
<P>
4. Patterns that contain alternatives at the top level which do not all
-start with the same pattern item may not work as expected when
+start with the same pattern item may not work as expected when
<b>pcre_dfa_exec()</b> is used. For example, consider this pattern:
<pre>
1234|3789
@@ -335,7 +360,7 @@ patterns or patterns such as:
1234|ABCD
</pre>
where no string can be a partial match for both alternatives. This is not a
-problem if \fPpcre_exec()\fP is used, because the entire match has to be rerun
+problem if \fPpcre_exec()\fP is used, because the entire match has to be rerun
each time:
<pre>
re&#62; /1234|3789/
@@ -357,7 +382,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC11" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 31 August 2009
+Last updated: 05 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index 5881bc3..e02a686 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -644,10 +644,10 @@ U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so
cannot be tested by PCRE, unless UTF-8 validity checking has been turned off
(see the discussion of PCRE_NO_UTF8_CHECK in the
<a href="pcreapi.html"><b>pcreapi</b></a>
-page).
+page). Perl does not support the Cs property.
</P>
<P>
-The long synonyms for these properties that Perl supports (such as \p{Letter})
+The long synonyms for property names that Perl supports (such as \p{Letter})
are not supported by PCRE, nor is it permitted to prefix any of these
properties with "Is".
</P>
@@ -1922,7 +1922,7 @@ recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
supports special syntax for recursion of the entire pattern, and also for
individual subpattern recursion. After its introduction in PCRE and Python,
-this kind of recursion was introduced into Perl at release 5.10.
+this kind of recursion was subsequently introduced into Perl at release 5.10.
</P>
<P>
A special item that consists of (? followed by a number greater than zero and a
@@ -1932,12 +1932,6 @@ call, which is described in the next section.) The special item (?R) or (?0) is
a recursive call of the entire regular expression.
</P>
<P>
-In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
-treated as an atomic group. That is, once it has matched some of the subject
-string, it is never re-entered, even if it contains untried alternatives and
-there is a subsequent matching failure.
-</P>
-<P>
This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
<pre>
@@ -2028,6 +2022,72 @@ recursing), whereas any characters are permitted at the outer level.
In this pattern, (?(R) is the start of a conditional subpattern, with two
different alternatives for the recursive and non-recursive cases. The (?R) item
is the actual recursive call.
+<a name="recursiondifference"></a></P>
+<br><b>
+Recursion difference from Perl
+</b><br>
+<P>
+In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
+treated as an atomic group. That is, once it has matched some of the subject
+string, it is never re-entered, even if it contains untried alternatives and
+there is a subsequent matching failure. This can be illustrated by the
+following pattern, which purports to match a palindromic string that contains
+an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
+<pre>
+ ^(.|(.)(?1)\2)$
+</pre>
+The idea is that it either matches a single character, or two identical
+characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
+it does not if the pattern is longer than three characters. Consider the
+subject string "abcba":
+</P>
+<P>
+At the top level, the first character is matched, but as it is not at the end
+of the string, the first alternative fails; the second alternative is taken
+and the recursion kicks in. The recursive call to subpattern 1 successfully
+matches the next character ("b"). (Note that the beginning and end of line
+tests are not part of the recursion).
+</P>
+<P>
+Back at the top level, the next character ("c") is compared with what
+subpattern 2 matched, which was "a". This fails. Because the recursion is
+treated as an atomic group, there are now no backtracking points, and so the
+entire match fails. (Perl is able, at this point, to re-enter the recursion and
+try the second alternative.) However, if the pattern is written with the
+alternatives in the other order, things are different:
+<pre>
+ ^((.)(?1)\2|.)$
+</pre>
+This time, the recursing alternative is tried first, and continues to recurse
+until it runs out of characters, at which point the recursion fails. But this
+time we do have another alternative to try at the higher level. That is the big
+difference: in the previous case the remaining alternative is at a deeper
+recursion level, which PCRE cannot use.
+</P>
+<P>
+To change the pattern so that matches all palindromic strings, not just those
+with an odd number of characters, it is tempting to change the pattern to this:
+<pre>
+ ^((.)(?1)\2|.?)$
+</pre>
+Again, this works in Perl, but not in PCRE, and for the same reason. When a
+deeper recursion has matched a single character, it cannot be entered again in
+order to match an empty string. The solution is to separate the two cases, and
+write out the odd and even cases as alternatives at the higher level:
+<pre>
+ ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
+</pre>
+If you want to match typical palindromic phrases, the pattern has to ignore all
+non-word characters, which can be done like this:
+<pre>
+ ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
+</pre>
+If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
+man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
+the use of the possessive quantifier *+ to avoid backtracking into sequences of
+non-word characters. Without this, PCRE takes a great deal longer (ten times or
+more) to match typical phrases, and Perl takes so long that you think it has
+gone into a loop.
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
@@ -2138,6 +2198,12 @@ failing negative assertion, they cause an error if encountered by
<b>pcre_dfa_exec()</b>.
</P>
<P>
+If any of these verbs are used in an assertion subpattern, their effect is
+confined to that subpattern; it does not extend to the surrounding pattern.
+Note that assertion subpatterns are processed as anchored at the point where
+they are tested.
+</P>
+<P>
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. In Perl, they are generally of the form
(*VERB:ARG) but PCRE does not support the use of arguments, so its general
@@ -2154,14 +2220,13 @@ The following verbs act as soon as they are encountered:
</pre>
This verb causes the match to end successfully, skipping the remainder of the
pattern. When inside a recursion, only the innermost pattern is ended
-immediately. PCRE differs from Perl in what happens if the (*ACCEPT) is inside
-capturing parentheses. In Perl, the data so far is captured: in PCRE no data is
-captured. For example:
+immediately. If the (*ACCEPT) is inside capturing parentheses, the data so far
+is captured. (This feature was added to PCRE at release 8.00.) For example:
<pre>
- A(A|B(*ACCEPT)|C)D
+ A((?:A|B(*ACCEPT)|C)D)
</pre>
-This matches "AB", "AAD", or "ACD", but when it matches "AB", no data is
-captured.
+This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
+the outer parentheses.
<pre>
(*FAIL) or (*F)
</pre>
@@ -2253,7 +2318,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 April 2009
+Last updated: 18 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcreposix.html b/doc/html/pcreposix.html
index 2479d41..b0c00bf 100644
--- a/doc/html/pcreposix.html
+++ b/doc/html/pcreposix.html
@@ -66,6 +66,11 @@ POSIX interface often use it, this makes it easier to slot in PCRE as a
replacement library. Other POSIX options are not even defined.
</P>
<P>
+There are also some other options that are not defined by POSIX. These have
+been added at the request of users who want to make use of certain
+PCRE-specific features via the POSIX calling interface.
+</P>
+<P>
When PCRE is called via these functions, it is only the API that is POSIX-like
in style. The syntax and semantics of the regular expressions themselves are
still those of Perl, subject to the setting of various PCRE options, as
@@ -121,6 +126,12 @@ compiled with this flag is passed to <b>regexec()</b> for matching, the
<i>nmatch</i> and <i>pmatch</i> arguments are ignored, and no captured strings
are returned.
<pre>
+ REG_UNGREEDY
+</pre>
+The PCRE_UNGREEDY option is set when the regular expression is passed for
+compilation to the native function. Note that REG_UNGREEDY is not part of the
+POSIX standard.
+<pre>
REG_UTF8
</pre>
The PCRE_UTF8 option is set when the regular expression is passed for
@@ -134,7 +145,7 @@ This means the the regex is compiled with PCRE default semantics. In
particular, the way it handles newline characters in the subject string is the
Perl way, not the POSIX way. Note that setting PCRE_MULTILINE has only
<i>some</i> of the effects specified for REG_NEWLINE. It does not affect the way
-newlines are matched by . (they aren't) or by a negative class such as [^a]
+newlines are matched by . (they are not) or by a negative class such as [^a]
(they are).
</P>
<P>
@@ -222,6 +233,10 @@ strings is returned. The <i>nmatch</i> and <i>pmatch</i> arguments of
<b>regexec()</b> are ignored.
</P>
<P>
+If the value of <i>nmatch</i> is zero, or if the value <i>pmatch</i> is NULL,
+no data about any matched strings is returned.
+</P>
+<P>
Otherwise,the portion of the string that was matched, and also any captured
substrings, are returned via the <i>pmatch</i> argument, which points to an
array of <i>nmatch</i> structures of type <i>regmatch_t</i>, containing the
@@ -262,7 +277,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 15 August 2009
+Last updated: 02 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcretest.html b/doc/html/pcretest.html
index 88728ca..1a99d60 100644
--- a/doc/html/pcretest.html
+++ b/doc/html/pcretest.html
@@ -246,11 +246,11 @@ begins with a lookbehind assertion (including \b or \B).
</P>
<P>
If any call to <b>pcre_exec()</b> in a <b>/g</b> or <b>/G</b> sequence matches an
-empty string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
-flags set in order to search for another, non-empty, match at the same point.
-If this second match fails, the start offset is advanced by one, and the normal
-match is retried. This imitates the way Perl handles such cases when using the
-<b>/g</b> modifier or the <b>split()</b> function.
+empty string, the next call is done with the PCRE_NOTEMPTY_ATSTART and
+PCRE_ANCHORED flags set in order to search for another, non-empty, match at the
+same point. If this second match fails, the start offset is advanced by one
+character, and the normal match is retried. This imitates the way Perl handles
+such cases when using the <b>/g</b> modifier or the <b>split()</b> function.
</P>
<br><b>
Other modifiers
@@ -370,7 +370,8 @@ recognized:
ated by next non-alphanumeric character)
\L call pcre_get_substringlist() after a successful match
\M discover the minimum MATCH_LIMIT and MATCH_LIMIT_RECURSION settings
- \N pass the PCRE_NOTEMPTY option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
+ \N pass the PCRE_NOTEMPTY option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>; if used twice, pass the
+ PCRE_NOTEMPTY_ATSTART option
\Odd set the size of the output vector passed to <b>pcre_exec()</b> to dd (any number of digits)
\P pass the PCRE_PARTIAL_SOFT option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>; if used twice, pass the
PCRE_PARTIAL_HARD option
@@ -454,10 +455,11 @@ This section describes the output when the normal matching function,
<P>
When a match succeeds, pcretest outputs the list of captured substrings that
<b>pcre_exec()</b> returns, starting with number 0 for the string that matched
-the whole pattern. Otherwise, it outputs "No match" or "Partial match:"
-followed by the partially matching substring when <b>pcre_exec()</b> returns
-PCRE_ERROR_NOMATCH or PCRE_ERROR_PARTIAL, respectively, and otherwise the PCRE
-negative error number. Here is an example of an interactive <b>pcretest</b> run.
+the whole pattern. Otherwise, it outputs "No match" when the return is
+PCRE_ERROR_NOMATCH, and "Partial match:" followed by the partially matching
+substring when <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL. For any other
+returns, it outputs the PCRE negative error number. Here is an example of an
+interactive <b>pcretest</b> run.
<pre>
$ pcretest
PCRE version 7.0 30-Nov-2006
@@ -706,7 +708,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 29 August 2009
+Last updated: 11 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 5bbeb9a..ab98276 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -282,20 +282,25 @@ PCRE BUILD-TIME OPTIONS
script, where the optional features are selected or deselected by pro-
viding options to configure before running the make command. However,
the same options can be selected in both Unix-like and non-Unix-like
- environments using the GUI facility of CMakeSetup if you are using
- CMake instead of configure to build PCRE.
+ environments using the GUI facility of cmake-gui if you are using CMake
+ instead of configure to build PCRE.
+
+ There is a lot more information about building PCRE in non-Unix-like
+ environments in the file called NON_UNIX_USE, which is part of the PCRE
+ distribution. You should consult this file as well as the README file
+ if you are building in a non-Unix-like environment.
The complete list of options for configure (which includes the standard
- ones such as the selection of the installation directory) can be
+ ones such as the selection of the installation directory) can be
obtained by running
./configure --help
- The following sections include descriptions of options whose names
+ The following sections include descriptions of options whose names
begin with --enable or --disable. These settings specify changes to the
- defaults for the configure command. Because of the way that configure
- works, --enable and --disable always come in pairs, so the complemen-
- tary option always exists as well, but as it specifies the default, it
+ defaults for the configure command. Because of the way that configure
+ works, --enable and --disable always come in pairs, so the complemen-
+ tary option always exists as well, but as it specifies the default, it
is not described.
@@ -316,46 +321,46 @@ UTF-8 SUPPORT
--enable-utf8
- to the configure command. Of itself, this does not make PCRE treat
- strings as UTF-8. As well as compiling PCRE with this option, you also
- have have to set the PCRE_UTF8 option when you call the pcre_compile()
+ to the configure command. Of itself, this does not make PCRE treat
+ strings as UTF-8. As well as compiling PCRE with this option, you also
+ have have to set the PCRE_UTF8 option when you call the pcre_compile()
function.
- If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE
+ If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE
expects its input to be either ASCII or UTF-8 (depending on the runtime
- option). It is not possible to support both EBCDIC and UTF-8 codes in
- the same version of the library. Consequently, --enable-utf8 and
+ option). It is not possible to support both EBCDIC and UTF-8 codes in
+ the same version of the library. Consequently, --enable-utf8 and
--enable-ebcdic are mutually exclusive.
UNICODE CHARACTER PROPERTY SUPPORT
- UTF-8 support allows PCRE to process character values greater than 255
- in the strings that it handles. On its own, however, it does not pro-
+ UTF-8 support allows PCRE to process character values greater than 255
+ in the strings that it handles. On its own, however, it does not pro-
vide any facilities for accessing the properties of such characters. If
- you want to be able to use the pattern escapes \P, \p, and \X, which
+ you want to be able to use the pattern escapes \P, \p, and \X, which
refer to Unicode character properties, you must add
--enable-unicode-properties
- to the configure command. This implies UTF-8 support, even if you have
+ to the configure command. This implies UTF-8 support, even if you have
not explicitly requested it.
- Including Unicode property support adds around 30K of tables to the
- PCRE library. Only the general category properties such as Lu and Nd
+ Including Unicode property support adds around 30K of tables to the
+ PCRE library. Only the general category properties such as Lu and Nd
are supported. Details are given in the pcrepattern documentation.
CODE VALUE OF NEWLINE
- By default, PCRE interprets the linefeed (LF) character as indicating
- the end of a line. This is the normal newline character on Unix-like
- systems. You can compile PCRE to use carriage return (CR) instead, by
+ By default, PCRE interprets the linefeed (LF) character as indicating
+ the end of a line. This is the normal newline character on Unix-like
+ systems. You can compile PCRE to use carriage return (CR) instead, by
adding
--enable-newline-is-cr
- to the configure command. There is also a --enable-newline-is-lf
+ to the configure command. There is also a --enable-newline-is-lf
option, which explicitly specifies linefeed as the newline character.
Alternatively, you can specify that line endings are to be indicated by
@@ -367,35 +372,35 @@ CODE VALUE OF NEWLINE
--enable-newline-is-anycrlf
- which causes PCRE to recognize any of the three sequences CR, LF, or
+ which causes PCRE to recognize any of the three sequences CR, LF, or
CRLF as indicating a line ending. Finally, a fifth option, specified by
--enable-newline-is-any
causes PCRE to recognize any Unicode newline sequence.
- Whatever line ending convention is selected when PCRE is built can be
- overridden when the library functions are called. At build time it is
+ Whatever line ending convention is selected when PCRE is built can be
+ overridden when the library functions are called. At build time it is
conventional to use the standard for your operating system.
WHAT \R MATCHES
- By default, the sequence \R in a pattern matches any Unicode newline
- sequence, whatever has been selected as the line ending sequence. If
+ By default, the sequence \R in a pattern matches any Unicode newline
+ sequence, whatever has been selected as the line ending sequence. If
you specify
--enable-bsr-anycrlf
- the default is changed so that \R matches only CR, LF, or CRLF. What-
- ever is selected when PCRE is built can be overridden when the library
+ the default is changed so that \R matches only CR, LF, or CRLF. What-
+ ever is selected when PCRE is built can be overridden when the library
functions are called.
BUILDING SHARED AND STATIC LIBRARIES
- The PCRE building process uses libtool to build both shared and static
- Unix libraries by default. You can suppress one of these by adding one
+ The PCRE building process uses libtool to build both shared and static
+ Unix libraries by default. You can suppress one of these by adding one
of
--disable-shared
@@ -407,9 +412,9 @@ BUILDING SHARED AND STATIC LIBRARIES
POSIX MALLOC USAGE
When PCRE is called through the POSIX interface (see the pcreposix doc-
- umentation), additional working storage is required for holding the
- pointers to capturing substrings, because PCRE requires three integers
- per substring, whereas the POSIX interface provides only two. If the
+ umentation), additional working storage is required for holding the
+ pointers to capturing substrings, because PCRE requires three integers
+ per substring, whereas the POSIX interface provides only two. If the
number of expected substrings is small, the wrapper function uses space
on the stack, because this is faster than using malloc() for each call.
The default threshold above which the stack is no longer used is 10; it
@@ -422,112 +427,112 @@ POSIX MALLOC USAGE
HANDLING VERY LARGE PATTERNS
- Within a compiled pattern, offset values are used to point from one
- part to another (for example, from an opening parenthesis to an alter-
- nation metacharacter). By default, two-byte values are used for these
- offsets, leading to a maximum size for a compiled pattern of around
- 64K. This is sufficient to handle all but the most gigantic patterns.
- Nevertheless, some people do want to process enormous patterns, so it
- is possible to compile PCRE to use three-byte or four-byte offsets by
+ Within a compiled pattern, offset values are used to point from one
+ part to another (for example, from an opening parenthesis to an alter-
+ nation metacharacter). By default, two-byte values are used for these
+ offsets, leading to a maximum size for a compiled pattern of around
+ 64K. This is sufficient to handle all but the most gigantic patterns.
+ Nevertheless, some people do want to process enormous patterns, so it
+ is possible to compile PCRE to use three-byte or four-byte offsets by
adding a setting such as
--with-link-size=3
- to the configure command. The value given must be 2, 3, or 4. Using
- longer offsets slows down the operation of PCRE because it has to load
+ to the configure command. The value given must be 2, 3, or 4. Using
+ longer offsets slows down the operation of PCRE because it has to load
additional bytes when handling them.
AVOIDING EXCESSIVE STACK USAGE
When matching with the pcre_exec() function, PCRE implements backtrack-
- ing by making recursive calls to an internal function called match().
- In environments where the size of the stack is limited, this can se-
- verely limit PCRE's operation. (The Unix environment does not usually
+ ing by making recursive calls to an internal function called match().
+ In environments where the size of the stack is limited, this can se-
+ verely limit PCRE's operation. (The Unix environment does not usually
suffer from this problem, but it may sometimes be necessary to increase
- the maximum stack size. There is a discussion in the pcrestack docu-
- mentation.) An alternative approach to recursion that uses memory from
- the heap to remember data, instead of using recursive function calls,
- has been implemented to work round the problem of limited stack size.
+ the maximum stack size. There is a discussion in the pcrestack docu-
+ mentation.) An alternative approach to recursion that uses memory from
+ the heap to remember data, instead of using recursive function calls,
+ has been implemented to work round the problem of limited stack size.
If you want to build a version of PCRE that works this way, add
--disable-stack-for-recursion
- to the configure command. With this configuration, PCRE will use the
- pcre_stack_malloc and pcre_stack_free variables to call memory manage-
- ment functions. By default these point to malloc() and free(), but you
+ to the configure command. With this configuration, PCRE will use the
+ pcre_stack_malloc and pcre_stack_free variables to call memory manage-
+ ment functions. By default these point to malloc() and free(), but you
can replace the pointers so that your own functions are used.
- Separate functions are provided rather than using pcre_malloc and
- pcre_free because the usage is very predictable: the block sizes
- requested are always the same, and the blocks are always freed in
- reverse order. A calling program might be able to implement optimized
- functions that perform better than malloc() and free(). PCRE runs
+ Separate functions are provided rather than using pcre_malloc and
+ pcre_free because the usage is very predictable: the block sizes
+ requested are always the same, and the blocks are always freed in
+ reverse order. A calling program might be able to implement optimized
+ functions that perform better than malloc() and free(). PCRE runs
noticeably more slowly when built in this way. This option affects only
- the pcre_exec() function; it is not relevant for the the
+ the pcre_exec() function; it is not relevant for the the
pcre_dfa_exec() function.
LIMITING PCRE RESOURCE USAGE
- Internally, PCRE has a function called match(), which it calls repeat-
- edly (sometimes recursively) when matching a pattern with the
- pcre_exec() function. By controlling the maximum number of times this
- function may be called during a single matching operation, a limit can
- be placed on the resources used by a single call to pcre_exec(). The
- limit can be changed at run time, as described in the pcreapi documen-
- tation. The default is 10 million, but this can be changed by adding a
+ Internally, PCRE has a function called match(), which it calls repeat-
+ edly (sometimes recursively) when matching a pattern with the
+ pcre_exec() function. By controlling the maximum number of times this
+ function may be called during a single matching operation, a limit can
+ be placed on the resources used by a single call to pcre_exec(). The
+ limit can be changed at run time, as described in the pcreapi documen-
+ tation. The default is 10 million, but this can be changed by adding a
setting such as
--with-match-limit=500000
- to the configure command. This setting has no effect on the
+ to the configure command. This setting has no effect on the
pcre_dfa_exec() matching function.
- In some environments it is desirable to limit the depth of recursive
+ In some environments it is desirable to limit the depth of recursive
calls of match() more strictly than the total number of calls, in order
- to restrict the maximum amount of stack (or heap, if --disable-stack-
+ to restrict the maximum amount of stack (or heap, if --disable-stack-
for-recursion is specified) that is used. A second limit controls this;
- it defaults to the value that is set for --with-match-limit, which
- imposes no additional constraints. However, you can set a lower limit
+ it defaults to the value that is set for --with-match-limit, which
+ imposes no additional constraints. However, you can set a lower limit
by adding, for example,
--with-match-limit-recursion=10000
- to the configure command. This value can also be overridden at run
+ to the configure command. This value can also be overridden at run
time.
CREATING CHARACTER TABLES AT BUILD TIME
- PCRE uses fixed tables for processing characters whose code values are
- less than 256. By default, PCRE is built with a set of tables that are
- distributed in the file pcre_chartables.c.dist. These tables are for
+ PCRE uses fixed tables for processing characters whose code values are
+ less than 256. By default, PCRE is built with a set of tables that are
+ distributed in the file pcre_chartables.c.dist. These tables are for
ASCII codes only. If you add
--enable-rebuild-chartables
- to the configure command, the distributed tables are no longer used.
- Instead, a program called dftables is compiled and run. This outputs
+ to the configure command, the distributed tables are no longer used.
+ Instead, a program called dftables is compiled and run. This outputs
the source for new set of tables, created in the default locale of your
C runtime system. (This method of replacing the tables does not work if
- you are cross compiling, because dftables is run on the local host. If
- you need to create alternative tables when cross compiling, you will
+ you are cross compiling, because dftables is run on the local host. If
+ you need to create alternative tables when cross compiling, you will
have to do so "by hand".)
USING EBCDIC CODE
- PCRE assumes by default that it will run in an environment where the
- character code is ASCII (or Unicode, which is a superset of ASCII).
- This is the case for most computer operating systems. PCRE can, how-
+ PCRE assumes by default that it will run in an environment where the
+ character code is ASCII (or Unicode, which is a superset of ASCII).
+ This is the case for most computer operating systems. PCRE can, how-
ever, be compiled to run in an EBCDIC environment by adding
--enable-ebcdic
to the configure command. This setting implies --enable-rebuild-charta-
- bles. You should only use it if you know that you are in an EBCDIC
- environment (for example, an IBM mainframe operating system). The
+ bles. You should only use it if you know that you are in an EBCDIC
+ environment (for example, an IBM mainframe operating system). The
--enable-ebcdic option is incompatible with --enable-utf8.
@@ -541,7 +546,7 @@ PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
--enable-pcregrep-libbz2
to the configure command. These options naturally require that the rel-
- evant libraries are installed on your system. Configuration will fail
+ evant libraries are installed on your system. Configuration will fail
if they are not.
@@ -551,24 +556,24 @@ PCRETEST OPTION FOR LIBREADLINE SUPPORT
--enable-pcretest-libreadline
- to the configure command, pcretest is linked with the libreadline
- library, and when its input is from a terminal, it reads it using the
+ to the configure command, pcretest is linked with the libreadline
+ library, and when its input is from a terminal, it reads it using the
readline() function. This provides line-editing and history facilities.
Note that libreadline is GPL-licenced, so if you distribute a binary of
pcretest linked in this way, there may be licensing issues.
- Setting this option causes the -lreadline option to be added to the
- pcretest build. In many operating environments with a sytem-installed
+ Setting this option causes the -lreadline option to be added to the
+ pcretest build. In many operating environments with a sytem-installed
libreadline this is sufficient. However, in some environments (e.g. if
- an unmodified distribution version of readline is in use), some extra
- configuration may be necessary. The INSTALL file for libreadline says
+ an unmodified distribution version of readline is in use), some extra
+ configuration may be necessary. The INSTALL file for libreadline says
this:
"Readline uses the termcap functions, but does not link with the
termcap or curses library itself, allowing applications which link
with readline the to choose an appropriate library."
- If your environment has not been set up so that an appropriate library
+ If your environment has not been set up so that an appropriate library
is automatically included, you may need to add something like
LIBS="-ncurses"
@@ -590,7 +595,7 @@ AUTHOR
REVISION
- Last updated: 17 March 2009
+ Last updated: 06 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
@@ -696,67 +701,72 @@ THE ALTERNATIVE MATCHING ALGORITHM
at the fourth character of the subject. The algorithm does not automat-
ically move on to find matches that start at later positions.
+ Although the general principle of this matching algorithm is that it
+ scans the subject string only once, without backtracking, there is one
+ exception: when a lookbehind assertion is encountered, the preceding
+ characters have to be re-inspected.
+
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
- 1. Because the algorithm finds all possible matches, the greedy or
- ungreedy nature of repetition quantifiers is not relevant. Greedy and
+ 1. Because the algorithm finds all possible matches, the greedy or
+ ungreedy nature of repetition quantifiers is not relevant. Greedy and
ungreedy quantifiers are treated in exactly the same way. However, pos-
- sessive quantifiers can make a difference when what follows could also
+ sessive quantifiers can make a difference when what follows could also
match what is quantified, for example in a pattern like this:
^a++\w!
- This pattern matches "aaab!" but not "aaa!", which would be matched by
- a non-possessive quantifier. Similarly, if an atomic group is present,
- it is matched as if it were a standalone pattern at the current point,
- and the longest match is then "locked in" for the rest of the overall
+ This pattern matches "aaab!" but not "aaa!", which would be matched by
+ a non-possessive quantifier. Similarly, if an atomic group is present,
+ it is matched as if it were a standalone pattern at the current point,
+ and the longest match is then "locked in" for the rest of the overall
pattern.
2. When dealing with multiple paths through the tree simultaneously, it
- is not straightforward to keep track of captured substrings for the
- different matching possibilities, and PCRE's implementation of this
+ is not straightforward to keep track of captured substrings for the
+ different matching possibilities, and PCRE's implementation of this
algorithm does not attempt to do this. This means that no captured sub-
strings are available.
- 3. Because no substrings are captured, back references within the pat-
+ 3. Because no substrings are captured, back references within the pat-
tern are not supported, and cause errors if encountered.
- 4. For the same reason, conditional expressions that use a backrefer-
- ence as the condition or test for a specific group recursion are not
+ 4. For the same reason, conditional expressions that use a backrefer-
+ ence as the condition or test for a specific group recursion are not
supported.
- 5. Because many paths through the tree may be active, the \K escape
+ 5. Because many paths through the tree may be active, the \K escape
sequence, which resets the start of the match when encountered (but may
- be on some paths and not on others), is not supported. It causes an
+ be on some paths and not on others), is not supported. It causes an
error if encountered.
- 6. Callouts are supported, but the value of the capture_top field is
+ 6. Callouts are supported, but the value of the capture_top field is
always 1, and the value of the capture_last field is always -1.
- 7. The \C escape sequence, which (in the standard algorithm) matches a
- single byte, even in UTF-8 mode, is not supported because the alterna-
- tive algorithm moves through the subject string one character at a
+ 7. The \C escape sequence, which (in the standard algorithm) matches a
+ single byte, even in UTF-8 mode, is not supported because the alterna-
+ tive algorithm moves through the subject string one character at a
time, for all active paths through the tree.
- 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
- are not supported. (*FAIL) is supported, and behaves like a failing
+ 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
+ are not supported. (*FAIL) is supported, and behaves like a failing
negative assertion.
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
- Using the alternative matching algorithm provides the following advan-
+ Using the alternative matching algorithm provides the following advan-
tages:
1. All possible matches (at a single point in the subject) are automat-
- ically found, and in particular, the longest match is found. To find
+ ically found, and in particular, the longest match is found. To find
more than one match using the standard algorithm, you have to do kludgy
things with callouts.
- 2. Because the alternative algorithm scans the subject string just
- once, and never needs to backtrack, it is possible to pass very long
- subject strings to the matching function in several pieces, checking
+ 2. Because the alternative algorithm scans the subject string just
+ once, and never needs to backtrack, it is possible to pass very long
+ subject strings to the matching function in several pieces, checking
for partial matching each time.
@@ -764,8 +774,8 @@ DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
The alternative algorithm suffers from a number of disadvantages:
- 1. It is substantially slower than the standard algorithm. This is
- partly because it has to search for all possible matches, but is also
+ 1. It is substantially slower than the standard algorithm. This is
+ partly because it has to search for all possible matches, but is also
because it is less susceptible to optimization.
2. Capturing parentheses and back references are not supported.
@@ -783,7 +793,7 @@ AUTHOR
REVISION
- Last updated: 25 August 2009
+ Last updated: 05 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
@@ -902,12 +912,13 @@ PCRE API OVERVIEW
A second matching function, pcre_dfa_exec(), which is not Perl-compati-
ble, is also provided. This uses a different algorithm for the match-
ing. The alternative algorithm finds all possible matches (at a given
- point in the subject), and scans the subject just once. However, this
- algorithm does not return captured substrings. A description of the two
- matching algorithms and their advantages and disadvantages is given in
- the pcrematching documentation.
+ point in the subject), and scans the subject just once (unless there
+ are lookbehind assertions). However, this algorithm does not return
+ captured substrings. A description of the two matching algorithms and
+ their advantages and disadvantages is given in the pcrematching docu-
+ mentation.
- In addition to the main compiling and matching functions, there are
+ In addition to the main compiling and matching functions, there are
convenience functions for extracting captured substrings from a subject
string that is matched by pcre_exec(). They are:
@@ -922,91 +933,91 @@ PCRE API OVERVIEW
pcre_free_substring() and pcre_free_substring_list() are also provided,
to free the memory used for extracted strings.
- The function pcre_maketables() is used to build a set of character
- tables in the current locale for passing to pcre_compile(),
- pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
- provided for specialist use. Most commonly, no special tables are
- passed, in which case internal tables that are generated when PCRE is
+ The function pcre_maketables() is used to build a set of character
+ tables in the current locale for passing to pcre_compile(),
+ pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
+ provided for specialist use. Most commonly, no special tables are
+ passed, in which case internal tables that are generated when PCRE is
built are used.
- The function pcre_fullinfo() is used to find out information about a
- compiled pattern; pcre_info() is an obsolete version that returns only
- some of the available information, but is retained for backwards com-
- patibility. The function pcre_version() returns a pointer to a string
+ The function pcre_fullinfo() is used to find out information about a
+ compiled pattern; pcre_info() is an obsolete version that returns only
+ some of the available information, but is retained for backwards com-
+ patibility. The function pcre_version() returns a pointer to a string
containing the version of PCRE and its date of release.
- The function pcre_refcount() maintains a reference count in a data
- block containing a compiled pattern. This is provided for the benefit
+ The function pcre_refcount() maintains a reference count in a data
+ block containing a compiled pattern. This is provided for the benefit
of object-oriented applications.
- The global variables pcre_malloc and pcre_free initially contain the
- entry points of the standard malloc() and free() functions, respec-
+ The global variables pcre_malloc and pcre_free initially contain the
+ entry points of the standard malloc() and free() functions, respec-
tively. PCRE calls the memory management functions via these variables,
- so a calling program can replace them if it wishes to intercept the
+ so a calling program can replace them if it wishes to intercept the
calls. This should be done before calling any PCRE functions.
- The global variables pcre_stack_malloc and pcre_stack_free are also
- indirections to memory management functions. These special functions
- are used only when PCRE is compiled to use the heap for remembering
+ The global variables pcre_stack_malloc and pcre_stack_free are also
+ indirections to memory management functions. These special functions
+ are used only when PCRE is compiled to use the heap for remembering
data, instead of recursive function calls, when running the pcre_exec()
- function. See the pcrebuild documentation for details of how to do
- this. It is a non-standard way of building PCRE, for use in environ-
- ments that have limited stacks. Because of the greater use of memory
- management, it runs more slowly. Separate functions are provided so
- that special-purpose external code can be used for this case. When
- used, these functions are always called in a stack-like manner (last
- obtained, first freed), and always for memory blocks of the same size.
- There is a discussion about PCRE's stack usage in the pcrestack docu-
+ function. See the pcrebuild documentation for details of how to do
+ this. It is a non-standard way of building PCRE, for use in environ-
+ ments that have limited stacks. Because of the greater use of memory
+ management, it runs more slowly. Separate functions are provided so
+ that special-purpose external code can be used for this case. When
+ used, these functions are always called in a stack-like manner (last
+ obtained, first freed), and always for memory blocks of the same size.
+ There is a discussion about PCRE's stack usage in the pcrestack docu-
mentation.
The global variable pcre_callout initially contains NULL. It can be set
- by the caller to a "callout" function, which PCRE will then call at
- specified points during a matching operation. Details are given in the
+ by the caller to a "callout" function, which PCRE will then call at
+ specified points during a matching operation. Details are given in the
pcrecallout documentation.
NEWLINES
- PCRE supports five different conventions for indicating line breaks in
- strings: a single CR (carriage return) character, a single LF (line-
+ PCRE supports five different conventions for indicating line breaks in
+ strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
- ceding, or any Unicode newline sequence. The Unicode newline sequences
- are the three just mentioned, plus the single characters VT (vertical
- tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
+ ceding, or any Unicode newline sequence. The Unicode newline sequences
+ are the three just mentioned, plus the single characters VT (vertical
+ tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
- Each of the first three conventions is used by at least one operating
- system as its standard newline sequence. When PCRE is built, a default
- can be specified. The default default is LF, which is the Unix stan-
- dard. When PCRE is run, the default can be overridden, either when a
+ Each of the first three conventions is used by at least one operating
+ system as its standard newline sequence. When PCRE is built, a default
+ can be specified. The default default is LF, which is the Unix stan-
+ dard. When PCRE is run, the default can be overridden, either when a
pattern is compiled, or when it is matched.
At compile time, the newline convention can be specified by the options
- argument of pcre_compile(), or it can be specified by special text at
+ argument of pcre_compile(), or it can be specified by special text at
the start of the pattern itself; this overrides any other settings. See
the pcrepattern page for details of the special character sequences.
In the PCRE documentation the word "newline" is used to mean "the char-
- acter or pair of characters that indicate a line break". The choice of
- newline convention affects the handling of the dot, circumflex, and
+ acter or pair of characters that indicate a line break". The choice of
+ newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
- CRLF is a recognized line ending sequence, the match position advance-
+ CRLF is a recognized line ending sequence, the match position advance-
ment for a non-anchored pattern. There is more detail about this in the
section on pcre_exec() options below.
- The choice of newline convention does not affect the interpretation of
- the \n or \r escape sequences, nor does it affect what \R matches,
+ The choice of newline convention does not affect the interpretation of
+ the \n or \r escape sequences, nor does it affect what \R matches,
which is controlled in a similar way, but by separate options.
MULTITHREADING
- The PCRE functions can be used in multi-threading applications, with
+ The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout function pointed to by pcre_callout, are shared by all threads.
- The compiled form of a regular expression is not altered during match-
+ The compiled form of a regular expression is not altered during match-
ing, so the same compiled pattern can safely be used by several threads
at once.
@@ -1014,10 +1025,10 @@ MULTITHREADING
SAVING PRECOMPILED PATTERNS FOR LATER USE
The compiled form of a regular expression can be saved and re-used at a
- later time, possibly by a different program, and even on a host other
- than the one on which it was compiled. Details are given in the
- pcreprecompile documentation. However, compiling a regular expression
- with one version of PCRE for use with a different version is not guar-
+ later time, possibly by a different program, and even on a host other
+ than the one on which it was compiled. Details are given in the
+ pcreprecompile documentation. However, compiling a regular expression
+ with one version of PCRE for use with a different version is not guar-
anteed to work and may cause crashes.
@@ -1025,79 +1036,79 @@ CHECKING BUILD-TIME OPTIONS
int pcre_config(int what, void *where);
- The function pcre_config() makes it possible for a PCRE client to dis-
+ The function pcre_config() makes it possible for a PCRE client to dis-
cover which optional features have been compiled into the PCRE library.
- The pcrebuild documentation has more details about these optional fea-
+ The pcrebuild documentation has more details about these optional fea-
tures.
- The first argument for pcre_config() is an integer, specifying which
+ The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
- into which the information is placed. The following information is
+ into which the information is placed. The following information is
available:
PCRE_CONFIG_UTF8
- The output is an integer that is set to one if UTF-8 support is avail-
+ The output is an integer that is set to one if UTF-8 support is avail-
able; otherwise it is set to zero.
PCRE_CONFIG_UNICODE_PROPERTIES
- The output is an integer that is set to one if support for Unicode
+ The output is an integer that is set to one if support for Unicode
character properties is available; otherwise it is set to zero.
PCRE_CONFIG_NEWLINE
- The output is an integer whose value specifies the default character
- sequence that is recognized as meaning "newline". The four values that
+ The output is an integer whose value specifies the default character
+ sequence that is recognized as meaning "newline". The four values that
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
- and -1 for ANY. Though they are derived from ASCII, the same values
+ and -1 for ANY. Though they are derived from ASCII, the same values
are returned in EBCDIC environments. The default should normally corre-
spond to the standard sequence for your operating system.
PCRE_CONFIG_BSR
The output is an integer whose value indicates what character sequences
- the \R escape sequence matches by default. A value of 0 means that \R
- matches any Unicode line ending sequence; a value of 1 means that \R
+ the \R escape sequence matches by default. A value of 0 means that \R
+ matches any Unicode line ending sequence; a value of 1 means that \R
matches only CR, LF, or CRLF. The default can be overridden when a pat-
tern is compiled or matched.
PCRE_CONFIG_LINK_SIZE
- The output is an integer that contains the number of bytes used for
+ The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. The value is 2, 3, or
- 4. Larger values allow larger regular expressions to be compiled, at
- the expense of slower matching. The default value of 2 is sufficient
- for all but the most massive patterns, since it allows the compiled
+ 4. Larger values allow larger regular expressions to be compiled, at
+ the expense of slower matching. The default value of 2 is sufficient
+ for all but the most massive patterns, since it allows the compiled
pattern to be up to 64K in size.
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
- The output is an integer that contains the threshold above which the
- POSIX interface uses malloc() for output vectors. Further details are
+ The output is an integer that contains the threshold above which the
+ POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.
PCRE_CONFIG_MATCH_LIMIT
- The output is a long integer that gives the default limit for the num-
- ber of internal matching function calls in a pcre_exec() execution.
+ The output is a long integer that gives the default limit for the num-
+ ber of internal matching function calls in a pcre_exec() execution.
Further details are given with pcre_exec() below.
PCRE_CONFIG_MATCH_LIMIT_RECURSION
The output is a long integer that gives the default limit for the depth
- of recursion when calling the internal matching function in a
- pcre_exec() execution. Further details are given with pcre_exec()
+ of recursion when calling the internal matching function in a
+ pcre_exec() execution. Further details are given with pcre_exec()
below.
PCRE_CONFIG_STACKRECURSE
- The output is an integer that is set to one if internal recursion when
+ The output is an integer that is set to one if internal recursion when
running pcre_exec() is implemented by recursive function calls that use
- the stack to remember their state. This is the usual way that PCRE is
+ the stack to remember their state. This is the usual way that PCRE is
compiled. The output is zero if PCRE was compiled to use blocks of data
- on the heap instead of recursive function calls. In this case,
- pcre_stack_malloc and pcre_stack_free are called to manage memory
+ on the heap instead of recursive function calls. In this case,
+ pcre_stack_malloc and pcre_stack_free are called to manage memory
blocks on the heap, thus avoiding the use of the stack.
@@ -1114,56 +1125,56 @@ COMPILING A PATTERN
Either of the functions pcre_compile() or pcre_compile2() can be called
to compile a pattern into an internal form. The only difference between
- the two interfaces is that pcre_compile2() has an additional argument,
+ the two interfaces is that pcre_compile2() has an additional argument,
errorcodeptr, via which a numerical error code can be returned.
The pattern is a C string terminated by a binary zero, and is passed in
- the pattern argument. A pointer to a single block of memory that is
- obtained via pcre_malloc is returned. This contains the compiled code
+ the pattern argument. A pointer to a single block of memory that is
+ obtained via pcre_malloc is returned. This contains the compiled code
and related data. The pcre type is defined for the returned block; this
is a typedef for a structure whose contents are not externally defined.
It is up to the caller to free the memory (via pcre_free) when it is no
longer required.
- Although the compiled code of a PCRE regex is relocatable, that is, it
+ Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not
- fully relocatable, because it may contain a copy of the tableptr argu-
+ fully relocatable, because it may contain a copy of the tableptr argu-
ment, which is an address (see below).
The options argument contains various bit settings that affect the com-
- pilation. It should be zero if no options are required. The available
- options are described below. Some of them (in particular, those that
- are compatible with Perl, but also some others) can also be set and
- unset from within the pattern (see the detailed description in the
- pcrepattern documentation). For those options that can be different in
- different parts of the pattern, the contents of the options argument
+ pilation. It should be zero if no options are required. The available
+ options are described below. Some of them (in particular, those that
+ are compatible with Perl, but also some others) can also be set and
+ unset from within the pattern (see the detailed description in the
+ pcrepattern documentation). For those options that can be different in
+ different parts of the pattern, the contents of the options argument
specifies their initial settings at the start of compilation and execu-
- tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
+ tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
time of matching as well as at compile time.
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
- if compilation of a pattern fails, pcre_compile() returns NULL, and
+ if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error mes-
sage. This is a static string that is part of the library. You must not
try to free it. The offset from the start of the pattern to the charac-
ter where the error was discovered is placed in the variable pointed to
- by erroffset, which must not be NULL. If it is, an immediate error is
+ by erroffset, which must not be NULL. If it is, an immediate error is
given.
- If pcre_compile2() is used instead of pcre_compile(), and the error-
- codeptr argument is not NULL, a non-zero error code number is returned
- via this argument in the event of an error. This is in addition to the
+ If pcre_compile2() is used instead of pcre_compile(), and the error-
+ codeptr argument is not NULL, a non-zero error code number is returned
+ via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below.
- If the final argument, tableptr, is NULL, PCRE uses a default set of
- character tables that are built when PCRE is compiled, using the
- default C locale. Otherwise, tableptr must be an address that is the
- result of a call to pcre_maketables(). This value is stored with the
- compiled pattern, and used again by pcre_exec(), unless another table
+ If the final argument, tableptr, is NULL, PCRE uses a default set of
+ character tables that are built when PCRE is compiled, using the
+ default C locale. Otherwise, tableptr must be an address that is the
+ result of a call to pcre_maketables(). This value is stored with the
+ compiled pattern, and used again by pcre_exec(), unless another table
pointer is passed to it. For more discussion, see the section on locale
support below.
- This code fragment shows a typical straightforward call to pcre_com-
+ This code fragment shows a typical straightforward call to pcre_com-
pile():
pcre *re;
@@ -1176,137 +1187,137 @@ COMPILING A PATTERN
&erroffset, /* for error offset */
NULL); /* use default character tables */
- The following names for option bits are defined in the pcre.h header
+ The following names for option bits are defined in the pcre.h header
file:
PCRE_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it
- is constrained to match only at the first matching point in the string
- that is being searched (the "subject string"). This effect can also be
- achieved by appropriate constructs in the pattern itself, which is the
+ is constrained to match only at the first matching point in the string
+ that is being searched (the "subject string"). This effect can also be
+ achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.
PCRE_AUTO_CALLOUT
If this bit is set, pcre_compile() automatically inserts callout items,
- all with number 255, before each pattern item. For discussion of the
+ all with number 255, before each pattern item. For discussion of the
callout facility, see the pcrecallout documentation.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R escape
- sequence matches. The choice is either to match only CR, LF, or CRLF,
+ sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. The default is specified when
PCRE is built. It can be overridden from within the pattern, or by set-
ting an option when a compiled pattern is matched.
PCRE_CASELESS
- If this bit is set, letters in the pattern match both upper and lower
- case letters. It is equivalent to Perl's /i option, and it can be
- changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
- always understands the concept of case for characters whose values are
- less than 128, so caseless matching is always possible. For characters
- with higher values, the concept of case is supported if PCRE is com-
- piled with Unicode property support, but not otherwise. If you want to
- use caseless matching for characters 128 and above, you must ensure
- that PCRE is compiled with Unicode property support as well as with
+ If this bit is set, letters in the pattern match both upper and lower
+ case letters. It is equivalent to Perl's /i option, and it can be
+ changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
+ always understands the concept of case for characters whose values are
+ less than 128, so caseless matching is always possible. For characters
+ with higher values, the concept of case is supported if PCRE is com-
+ piled with Unicode property support, but not otherwise. If you want to
+ use caseless matching for characters 128 and above, you must ensure
+ that PCRE is compiled with Unicode property support as well as with
UTF-8 support.
PCRE_DOLLAR_ENDONLY
- If this bit is set, a dollar metacharacter in the pattern matches only
- at the end of the subject string. Without this option, a dollar also
- matches immediately before a newline at the end of the string (but not
- before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
- if PCRE_MULTILINE is set. There is no equivalent to this option in
+ If this bit is set, a dollar metacharacter in the pattern matches only
+ at the end of the subject string. Without this option, a dollar also
+ matches immediately before a newline at the end of the string (but not
+ before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
+ if PCRE_MULTILINE is set. There is no equivalent to this option in
Perl, and no way to set it within a pattern.
PCRE_DOTALL
If this bit is set, a dot metacharater in the pattern matches all char-
- acters, including those that indicate newline. Without it, a dot does
- not match when the current position is at a newline. This option is
- equivalent to Perl's /s option, and it can be changed within a pattern
- by a (?s) option setting. A negative class such as [^a] always matches
+ acters, including those that indicate newline. Without it, a dot does
+ not match when the current position is at a newline. This option is
+ equivalent to Perl's /s option, and it can be changed within a pattern
+ by a (?s) option setting. A negative class such as [^a] always matches
newline characters, independent of the setting of this option.
PCRE_DUPNAMES
- If this bit is set, names used to identify capturing subpatterns need
+ If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
- is known that only one instance of the named subpattern can ever be
- matched. There are more details of named subpatterns below; see also
+ is known that only one instance of the named subpattern can ever be
+ matched. There are more details of named subpatterns below; see also
the pcrepattern documentation.
PCRE_EXTENDED
- If this bit is set, whitespace data characters in the pattern are
+ If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class. White-
space does not include the VT character (code 11). In addition, charac-
ters between an unescaped # outside a character class and the next new-
- line, inclusive, are also ignored. This is equivalent to Perl's /x
- option, and it can be changed within a pattern by a (?x) option set-
+ line, inclusive, are also ignored. This is equivalent to Perl's /x
+ option, and it can be changed within a pattern by a (?x) option set-
ting.
- This option makes it possible to include comments inside complicated
- patterns. Note, however, that this applies only to data characters.
- Whitespace characters may never appear within special character
- sequences in a pattern, for example within the sequence (?( which
+ This option makes it possible to include comments inside complicated
+ patterns. Note, however, that this applies only to data characters.
+ Whitespace characters may never appear within special character
+ sequences in a pattern, for example within the sequence (?( which
introduces a conditional subpattern.
PCRE_EXTRA
- This option was invented in order to turn on additional functionality
- of PCRE that is incompatible with Perl, but it is currently of very
- little use. When set, any backslash in a pattern that is followed by a
- letter that has no special meaning causes an error, thus reserving
- these combinations for future expansion. By default, as in Perl, a
- backslash followed by a letter with no special meaning is treated as a
- literal. (Perl can, however, be persuaded to give a warning for this.)
- There are at present no other features controlled by this option. It
+ This option was invented in order to turn on additional functionality
+ of PCRE that is incompatible with Perl, but it is currently of very
+ little use. When set, any backslash in a pattern that is followed by a
+ letter that has no special meaning causes an error, thus reserving
+ these combinations for future expansion. By default, as in Perl, a
+ backslash followed by a letter with no special meaning is treated as a
+ literal. (Perl can, however, be persuaded to give a warning for this.)
+ There are at present no other features controlled by this option. It
can also be set by a (?X) option setting within a pattern.
PCRE_FIRSTLINE
- If this option is set, an unanchored pattern is required to match
- before or at the first newline in the subject string, though the
+ If this option is set, an unanchored pattern is required to match
+ before or at the first newline in the subject string, though the
matched text may continue over the newline.
PCRE_JAVASCRIPT_COMPAT
If this option is set, PCRE's behaviour is changed in some ways so that
- it is compatible with JavaScript rather than Perl. The changes are as
+ it is compatible with JavaScript rather than Perl. The changes are as
follows:
- (1) A lone closing square bracket in a pattern causes a compile-time
- error, because this is illegal in JavaScript (by default it is treated
+ (1) A lone closing square bracket in a pattern causes a compile-time
+ error, because this is illegal in JavaScript (by default it is treated
as a data character). Thus, the pattern AB]CD becomes illegal when this
option is set.
- (2) At run time, a back reference to an unset subpattern group matches
- an empty string (by default this causes the current matching alterna-
- tive to fail). A pattern such as (\1)(a) succeeds when this option is
- set (assuming it can find an "a" in the subject), whereas it fails by
+ (2) At run time, a back reference to an unset subpattern group matches
+ an empty string (by default this causes the current matching alterna-
+ tive to fail). A pattern such as (\1)(a) succeeds when this option is
+ set (assuming it can find an "a" in the subject), whereas it fails by
default, for Perl compatibility.
PCRE_MULTILINE
- By default, PCRE treats the subject string as consisting of a single
- line of characters (even if it actually contains newlines). The "start
- of line" metacharacter (^) matches only at the start of the string,
- while the "end of line" metacharacter ($) matches only at the end of
+ By default, PCRE treats the subject string as consisting of a single
+ line of characters (even if it actually contains newlines). The "start
+ of line" metacharacter (^) matches only at the start of the string,
+ while the "end of line" metacharacter ($) matches only at the end of
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
is set). This is the same as Perl.
- When PCRE_MULTILINE it is set, the "start of line" and "end of line"
- constructs match immediately following or immediately before internal
- newlines in the subject string, respectively, as well as at the very
- start and end. This is equivalent to Perl's /m option, and it can be
+ When PCRE_MULTILINE it is set, the "start of line" and "end of line"
+ constructs match immediately following or immediately before internal
+ newlines in the subject string, respectively, as well as at the very
+ start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no new-
- lines in a subject string, or no occurrences of ^ or $ in a pattern,
+ lines in a subject string, or no occurrences of ^ or $ in a pattern,
setting PCRE_MULTILINE has no effect.
PCRE_NEWLINE_CR
@@ -1315,32 +1326,32 @@ COMPILING A PATTERN
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
- These options override the default newline definition that was chosen
- when PCRE was built. Setting the first or the second specifies that a
- newline is indicated by a single character (CR or LF, respectively).
- Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
- two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
+ These options override the default newline definition that was chosen
+ when PCRE was built. Setting the first or the second specifies that a
+ newline is indicated by a single character (CR or LF, respectively).
+ Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
+ two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
that any of the three preceding sequences should be recognized. Setting
- PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
+ PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
recognized. The Unicode newline sequences are the three just mentioned,
- plus the single characters VT (vertical tab, U+000B), FF (formfeed,
- U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
- (paragraph separator, U+2029). The last two are recognized only in
+ plus the single characters VT (vertical tab, U+000B), FF (formfeed,
+ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
+ (paragraph separator, U+2029). The last two are recognized only in
UTF-8 mode.
- The newline setting in the options word uses three bits that are
+ The newline setting in the options word uses three bits that are
treated as a number, giving eight possibilities. Currently only six are
- used (default plus the five values above). This means that if you set
- more than one newline option, the combination may or may not be sensi-
+ used (default plus the five values above). This means that if you set
+ more than one newline option, the combination may or may not be sensi-
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
- PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
+ PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
cause an error.
- The only time that a line break is specially recognized when compiling
- a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
- character class is encountered. This indicates a comment that lasts
- until after the next line break sequence. In other circumstances, line
- break sequences are treated as literal data, except that in
+ The only time that a line break is specially recognized when compiling
+ a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
+ character class is encountered. This indicates a comment that lasts
+ until after the next line break sequence. In other circumstances, line
+ break sequences are treated as literal data, except that in
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
and are therefore ignored.
@@ -1350,46 +1361,46 @@ COMPILING A PATTERN
PCRE_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing paren-
- theses in the pattern. Any opening parenthesis that is not followed by
- ? behaves as if it were followed by ?: but named parentheses can still
- be used for capturing (and they acquire numbers in the usual way).
+ theses in the pattern. Any opening parenthesis that is not followed by
+ ? behaves as if it were followed by ?: but named parentheses can still
+ be used for capturing (and they acquire numbers in the usual way).
There is no equivalent of this option in Perl.
PCRE_UNGREEDY
- This option inverts the "greediness" of the quantifiers so that they
- are not greedy by default, but become greedy if followed by "?". It is
- not compatible with Perl. It can also be set by a (?U) option setting
+ This option inverts the "greediness" of the quantifiers so that they
+ are not greedy by default, but become greedy if followed by "?". It is
+ not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.
PCRE_UTF8
- This option causes PCRE to regard both the pattern and the subject as
- strings of UTF-8 characters instead of single-byte character strings.
- However, it is available only when PCRE is built to include UTF-8 sup-
- port. If not, the use of this option provokes an error. Details of how
- this option changes the behaviour of PCRE are given in the section on
+ This option causes PCRE to regard both the pattern and the subject as
+ strings of UTF-8 characters instead of single-byte character strings.
+ However, it is available only when PCRE is built to include UTF-8 sup-
+ port. If not, the use of this option provokes an error. Details of how
+ this option changes the behaviour of PCRE are given in the section on
UTF-8 support in the main pcre page.
PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
- automatically checked. There is a discussion about the validity of
- UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of
- bytes is found, pcre_compile() returns an error. If you already know
+ automatically checked. There is a discussion about the validity of
+ UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of
+ bytes is found, pcre_compile() returns an error. If you already know
that your pattern is valid, and you want to skip this check for perfor-
- mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is
- set, the effect of passing an invalid UTF-8 string as a pattern is
- undefined. It may cause your program to crash. Note that this option
- can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
+ mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is
+ set, the effect of passing an invalid UTF-8 string as a pattern is
+ undefined. It may cause your program to crash. Note that this option
+ can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
UTF-8 validity checking of subject strings.
COMPILATION ERROR CODES
- The following table lists the error codes than may be returned by
- pcre_compile2(), along with the error messages that may be returned by
- both compiling functions. As PCRE has developed, some error codes have
+ The following table lists the error codes than may be returned by
+ pcre_compile2(), along with the error messages that may be returned by
+ both compiling functions. As PCRE has developed, some error codes have
fallen out of use. To avoid confusion, they have not been re-used.
0 no error
@@ -1445,7 +1456,7 @@ COMPILATION ERROR CODES
50 [this code is not in use]
51 octal value is greater than \377 (not in UTF-8 mode)
52 internal error: overran compiling workspace
- 53 internal error: previously-checked referenced subpattern not
+ 53 internal error: previously-checked referenced subpattern not
found
54 DEFINE group contains more than one branch
55 repeating a DEFINE group is not allowed
@@ -1460,7 +1471,7 @@ COMPILATION ERROR CODES
63 digit expected after (?+
64 ] is an invalid data character in JavaScript compatibility mode
- The numbers 32 and 10000 in errors 48 and 49 are defaults; different
+ The numbers 32 and 10000 in errors 48 and 49 are defaults; different
values may be used if the limits were changed when PCRE was built.
@@ -1469,32 +1480,32 @@ STUDYING A PATTERN
pcre_extra *pcre_study(const pcre *code, int options
const char **errptr);
- If a compiled pattern is going to be used several times, it is worth
+ If a compiled pattern is going to be used several times, it is worth
spending more time analyzing it in order to speed up the time taken for
- matching. The function pcre_study() takes a pointer to a compiled pat-
+ matching. The function pcre_study() takes a pointer to a compiled pat-
tern as its first argument. If studying the pattern produces additional
- information that will help speed up matching, pcre_study() returns a
- pointer to a pcre_extra block, in which the study_data field points to
+ information that will help speed up matching, pcre_study() returns a
+ pointer to a pcre_extra block, in which the study_data field points to
the results of the study.
The returned value from pcre_study() can be passed directly to
- pcre_exec(). However, a pcre_extra block also contains other fields
- that can be set by the caller before the block is passed; these are
+ pcre_exec(). However, a pcre_extra block also contains other fields
+ that can be set by the caller before the block is passed; these are
described below in the section on matching a pattern.
- If studying the pattern does not produce any additional information
+ If studying the pattern does not produce any additional information
pcre_study() returns NULL. In that circumstance, if the calling program
- wants to pass any of the other fields to pcre_exec(), it must set up
+ wants to pass any of the other fields to pcre_exec(), it must set up
its own pcre_extra block.
- The second argument of pcre_study() contains option bits. At present,
+ The second argument of pcre_study() contains option bits. At present,
no options are defined, and this argument should always be zero.
- The third argument for pcre_study() is a pointer for an error message.
- If studying succeeds (even if no data is returned), the variable it
- points to is set to NULL. Otherwise it is set to point to a textual
+ The third argument for pcre_study() is a pointer for an error message.
+ If studying succeeds (even if no data is returned), the variable it
+ points to is set to NULL. Otherwise it is set to point to a textual
error message. This is a static string that is part of the library. You
- must not try to free it. You should test the error pointer for NULL
+ must not try to free it. You should test the error pointer for NULL
after calling pcre_study(), to be sure that it has run successfully.
This is a typical call to pcre_study():
@@ -1506,62 +1517,62 @@ STUDYING A PATTERN
&error); /* set to NULL or points to a message */
At present, studying a pattern is useful only for non-anchored patterns
- that do not have a single fixed starting character. A bitmap of possi-
+ that do not have a single fixed starting character. A bitmap of possi-
ble starting bytes is created.
LOCALE SUPPORT
- PCRE handles caseless matching, and determines whether characters are
- letters, digits, or whatever, by reference to a set of tables, indexed
- by character value. When running in UTF-8 mode, this applies only to
- characters with codes less than 128. Higher-valued codes never match
- escapes such as \w or \d, but can be tested with \p if PCRE is built
- with Unicode character property support. The use of locales with Uni-
- code is discouraged. If you are handling characters with codes greater
- than 128, you should either use UTF-8 and Unicode, or use locales, but
+ PCRE handles caseless matching, and determines whether characters are
+ letters, digits, or whatever, by reference to a set of tables, indexed
+ by character value. When running in UTF-8 mode, this applies only to
+ characters with codes less than 128. Higher-valued codes never match
+ escapes such as \w or \d, but can be tested with \p if PCRE is built
+ with Unicode character property support. The use of locales with Uni-
+ code is discouraged. If you are handling characters with codes greater
+ than 128, you should either use UTF-8 and Unicode, or use locales, but
not try to mix the two.
- PCRE contains an internal set of tables that are used when the final
- argument of pcre_compile() is NULL. These are sufficient for many
+ PCRE contains an internal set of tables that are used when the final
+ argument of pcre_compile() is NULL. These are sufficient for many
applications. Normally, the internal tables recognize only ASCII char-
acters. However, when PCRE is built, it is possible to cause the inter-
nal tables to be rebuilt in the default "C" locale of the local system,
which may cause them to be different.
- The internal tables can always be overridden by tables supplied by the
+ The internal tables can always be overridden by tables supplied by the
application that calls PCRE. These may be created in a different locale
- from the default. As more and more applications change to using Uni-
+ from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away.
- External tables are built by calling the pcre_maketables() function,
- which has no arguments, in the relevant locale. The result can then be
- passed to pcre_compile() or pcre_exec() as often as necessary. For
- example, to build and use tables that are appropriate for the French
- locale (where accented characters with values greater than 128 are
+ External tables are built by calling the pcre_maketables() function,
+ which has no arguments, in the relevant locale. The result can then be
+ passed to pcre_compile() or pcre_exec() as often as necessary. For
+ example, to build and use tables that are appropriate for the French
+ locale (where accented characters with values greater than 128 are
treated as letters), the following code could be used:
setlocale(LC_CTYPE, "fr_FR");
tables = pcre_maketables();
re = pcre_compile(..., tables);
- The locale name "fr_FR" is used on Linux and other Unix-like systems;
+ The locale name "fr_FR" is used on Linux and other Unix-like systems;
if you are using Windows, the name for the French locale is "french".
- When pcre_maketables() runs, the tables are built in memory that is
- obtained via pcre_malloc. It is the caller's responsibility to ensure
- that the memory containing the tables remains available for as long as
+ When pcre_maketables() runs, the tables are built in memory that is
+ obtained via pcre_malloc. It is the caller's responsibility to ensure
+ that the memory containing the tables remains available for as long as
it is needed.
The pointer that is passed to pcre_compile() is saved with the compiled
- pattern, and the same tables are used via this pointer by pcre_study()
+ pattern, and the same tables are used via this pointer by pcre_study()
and normally also by pcre_exec(). Thus, by default, for any single pat-
tern, compilation, studying and matching all happen in the same locale,
but different patterns can be compiled in different locales.
- It is possible to pass a table pointer or NULL (indicating the use of
- the internal tables) to pcre_exec(). Although not intended for this
- purpose, this facility could be used to match a pattern in a different
+ It is possible to pass a table pointer or NULL (indicating the use of
+ the internal tables) to pcre_exec(). Although not intended for this
+ purpose, this facility could be used to match a pattern in a different
locale from the one in which it was compiled. Passing table pointers at
run time is discussed below in the section on matching a pattern.
@@ -1571,15 +1582,15 @@ INFORMATION ABOUT A PATTERN
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
int what, void *where);
- The pcre_fullinfo() function returns information about a compiled pat-
+ The pcre_fullinfo() function returns information about a compiled pat-
tern. It replaces the obsolete pcre_info() function, which is neverthe-
less retained for backwards compability (and is documented below).
- The first argument for pcre_fullinfo() is a pointer to the compiled
- pattern. The second argument is the result of pcre_study(), or NULL if
- the pattern was not studied. The third argument specifies which piece
- of information is required, and the fourth argument is a pointer to a
- variable to receive the data. The yield of the function is zero for
+ The first argument for pcre_fullinfo() is a pointer to the compiled
+ pattern. The second argument is the result of pcre_study(), or NULL if
+ the pattern was not studied. The third argument specifies which piece
+ of information is required, and the fourth argument is a pointer to a
+ variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:
PCRE_ERROR_NULL the argument code was NULL
@@ -1587,9 +1598,9 @@ INFORMATION ABOUT A PATTERN
PCRE_ERROR_BADMAGIC the "magic number" was not found
PCRE_ERROR_BADOPTION the value of what was invalid
- The "magic number" is placed at the start of each compiled pattern as
- an simple check against passing an arbitrary memory pointer. Here is a
- typical call of pcre_fullinfo(), to obtain the length of the compiled
+ The "magic number" is placed at the start of each compiled pattern as
+ an simple check against passing an arbitrary memory pointer. Here is a
+ typical call of pcre_fullinfo(), to obtain the length of the compiled
pattern:
int rc;
@@ -1600,76 +1611,76 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
- The possible values for the third argument are defined in pcre.h, and
+ The possible values for the third argument are defined in pcre.h, and
are as follows:
PCRE_INFO_BACKREFMAX
- Return the number of the highest back reference in the pattern. The
- fourth argument should point to an int variable. Zero is returned if
+ Return the number of the highest back reference in the pattern. The
+ fourth argument should point to an int variable. Zero is returned if
there are no back references.
PCRE_INFO_CAPTURECOUNT
- Return the number of capturing subpatterns in the pattern. The fourth
+ Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.
PCRE_INFO_DEFAULT_TABLES
- Return a pointer to the internal default character tables within PCRE.
- The fourth argument should point to an unsigned char * variable. This
+ Return a pointer to the internal default character tables within PCRE.
+ The fourth argument should point to an unsigned char * variable. This
information call is provided for internal use by the pcre_study() func-
- tion. External callers can cause PCRE to use its internal tables by
+ tion. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer.
PCRE_INFO_FIRSTBYTE
- Return information about the first byte of any matched string, for a
- non-anchored pattern. The fourth argument should point to an int vari-
- able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
+ Return information about the first byte of any matched string, for a
+ non-anchored pattern. The fourth argument should point to an int vari-
+ able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
is still recognized for backwards compatibility.)
- If there is a fixed first byte, for example, from a pattern such as
+ If there is a fixed first byte, for example, from a pattern such as
(cat|cow|coyote), its value is returned. Otherwise, if either
- (a) the pattern was compiled with the PCRE_MULTILINE option, and every
+ (a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),
- -1 is returned, indicating that the pattern matches only at the start
- of a subject string or after any newline within the string. Otherwise
+ -1 is returned, indicating that the pattern matches only at the start
+ of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.
PCRE_INFO_FIRSTTABLE
- If the pattern was studied, and this resulted in the construction of a
+ If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
- matching string, a pointer to the table is returned. Otherwise NULL is
- returned. The fourth argument should point to an unsigned char * vari-
+ matching string, a pointer to the table is returned. Otherwise NULL is
+ returned. The fourth argument should point to an unsigned char * vari-
able.
PCRE_INFO_HASCRORLF
- Return 1 if the pattern contains any explicit matches for CR or LF
- characters, otherwise 0. The fourth argument should point to an int
- variable. An explicit match is either a literal CR or LF character, or
+ Return 1 if the pattern contains any explicit matches for CR or LF
+ characters, otherwise 0. The fourth argument should point to an int
+ variable. An explicit match is either a literal CR or LF character, or
\r or \n.
PCRE_INFO_JCHANGED
- Return 1 if the (?J) or (?-J) option setting is used in the pattern,
- otherwise 0. The fourth argument should point to an int variable. (?J)
+ Return 1 if the (?J) or (?-J) option setting is used in the pattern,
+ otherwise 0. The fourth argument should point to an int variable. (?J)
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
PCRE_INFO_LASTLITERAL
- Return the value of the rightmost literal byte that must exist in any
- matched string, other than at its start, if such a byte has been
+ Return the value of the rightmost literal byte that must exist in any
+ matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
- is no such byte, -1 is returned. For anchored patterns, a last literal
- byte is recorded only if it follows something of variable length. For
+ is no such byte, -1 is returned. For anchored patterns, a last literal
+ byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.
@@ -1677,34 +1688,34 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE
- PCRE supports the use of named as well as numbered capturing parenthe-
- ses. The names are just an additional way of identifying the parenthe-
+ PCRE supports the use of named as well as numbered capturing parenthe-
+ ses. The names are just an additional way of identifying the parenthe-
ses, which still acquire numbers. Several convenience functions such as
- pcre_get_named_substring() are provided for extracting captured sub-
- strings by name. It is also possible to extract the data directly, by
- first converting the name to a number in order to access the correct
+ pcre_get_named_substring() are provided for extracting captured sub-
+ strings by name. It is also possible to extract the data directly, by
+ first converting the name to a number in order to access the correct
pointers in the output vector (described with pcre_exec() below). To do
- the conversion, you need to use the name-to-number map, which is
+ the conversion, you need to use the name-to-number map, which is
described by these three values.
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
- of each entry; both of these return an int value. The entry size
- depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
- a pointer to the first entry of the table (a pointer to char). The
+ of each entry; both of these return an int value. The entry size
+ depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
+ a pointer to the first entry of the table (a pointer to char). The
first two bytes of each entry are the number of the capturing parenthe-
- sis, most significant byte first. The rest of the entry is the corre-
- sponding name, zero terminated. The names are in alphabetical order.
+ sis, most significant byte first. The rest of the entry is the corre-
+ sponding name, zero terminated. The names are in alphabetical order.
When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
- theses numbers. For example, consider the following pattern (assume
- PCRE_EXTENDED is set, so white space - including newlines - is
+ theses numbers. For example, consider the following pattern (assume
+ PCRE_EXTENDED is set, so white space - including newlines - is
ignored):
(?<date> (?<year>(\d\d)?\d\d) -
(?<month>\d\d) - (?<day>\d\d) )
- There are four named subpatterns, so the table has four entries, and
- each entry in the table is eight bytes long. The table is as follows,
+ There are four named subpatterns, so the table has four entries, and
+ each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shows in hexadecimal, and undefined bytes shown
as ??:
@@ -1713,17 +1724,18 @@ INFORMATION ABOUT A PATTERN
00 04 m o n t h 00
00 02 y e a r 00 ??
- When writing code to extract data from named subpatterns using the
- name-to-number map, remember that the length of the entries is likely
+ When writing code to extract data from named subpatterns using the
+ name-to-number map, remember that the length of the entries is likely
to be different for each compiled pattern.
PCRE_INFO_OKPARTIAL
- Return 1 if the pattern can be used for partial matching, otherwise 0.
- The fourth argument should point to an int variable. From release 8.00,
- this always returns 1, because the restrictions that previously applied
- to partial matching have been lifted. The pcrepartial documentation
- gives details of partial matching.
+ Return 1 if the pattern can be used for partial matching with
+ pcre_exec(), otherwise 0. The fourth argument should point to an int
+ variable. From release 8.00, this always returns 1, because the
+ restrictions that previously applied to partial matching have been
+ lifted. The pcrepartial documentation gives details of partial match-
+ ing.
PCRE_INFO_OPTIONS
@@ -1909,8 +1921,8 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
- The pcre_callout field is used in conjunction with the "callout" fea-
- ture, which is described in the pcrecallout documentation.
+ The callout_data field is used in conjunction with the "callout" fea-
+ ture, and is described in the pcrecallout documentation.
The tables field is used to pass a character tables pointer to
pcre_exec(); this overrides the value that is stored with the compiled
@@ -1927,22 +1939,23 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
The unused bits of the options argument for pcre_exec() must be zero.
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
- PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE,
- PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.
+ PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
+ PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and
+ PCRE_PARTIAL_HARD.
PCRE_ANCHORED
- The PCRE_ANCHORED option limits pcre_exec() to matching at the first
- matching position. If a pattern was compiled with PCRE_ANCHORED, or
- turned out to be anchored by virtue of its contents, it cannot be made
+ The PCRE_ANCHORED option limits pcre_exec() to matching at the first
+ matching position. If a pattern was compiled with PCRE_ANCHORED, or
+ turned out to be anchored by virtue of its contents, it cannot be made
unachored at matching time.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R escape
- sequence matches. The choice is either to match only CR, LF, or CRLF,
- or to match any Unicode newline sequence. These options override the
+ sequence matches. The choice is either to match only CR, LF, or CRLF,
+ or to match any Unicode newline sequence. These options override the
choice that was made or defaulted when the pattern was compiled.
PCRE_NEWLINE_CR
@@ -1951,76 +1964,83 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
- These options override the newline definition that was chosen or
- defaulted when the pattern was compiled. For details, see the descrip-
- tion of pcre_compile() above. During matching, the newline choice
- affects the behaviour of the dot, circumflex, and dollar metacharac-
- ters. It may also alter the way the match position is advanced after a
+ These options override the newline definition that was chosen or
+ defaulted when the pattern was compiled. For details, see the descrip-
+ tion of pcre_compile() above. During matching, the newline choice
+ affects the behaviour of the dot, circumflex, and dollar metacharac-
+ ters. It may also alter the way the match position is advanced after a
match failure for an unanchored pattern.
- When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
- set, and a match attempt for an unanchored pattern fails when the cur-
- rent position is at a CRLF sequence, and the pattern contains no
- explicit matches for CR or LF characters, the match position is
+ When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
+ set, and a match attempt for an unanchored pattern fails when the cur-
+ rent position is at a CRLF sequence, and the pattern contains no
+ explicit matches for CR or LF characters, the match position is
advanced by two characters instead of one, in other words, to after the
CRLF.
The above rule is a compromise that makes the most common cases work as
- expected. For example, if the pattern is .+A (and the PCRE_DOTALL
+ expected. For example, if the pattern is .+A (and the PCRE_DOTALL
option is not set), it does not match the string "\r\nA" because, after
- failing at the start, it skips both the CR and the LF before retrying.
- However, the pattern [\r\n]A does match that string, because it con-
+ failing at the start, it skips both the CR and the LF before retrying.
+ However, the pattern [\r\n]A does match that string, because it con-
tains an explicit CR or LF reference, and so advances only by one char-
acter after the first failure.
An explicit match for CR of LF is either a literal appearance of one of
- those characters, or one of the \r or \n escape sequences. Implicit
- matches such as [^X] do not count, nor does \s (which includes CR and
+ those characters, or one of the \r or \n escape sequences. Implicit
+ matches such as [^X] do not count, nor does \s (which includes CR and
LF in the characters that it matches).
- Notwithstanding the above, anomalous effects may still occur when CRLF
+ Notwithstanding the above, anomalous effects may still occur when CRLF
is a valid newline sequence and explicit \r or \n escapes appear in the
pattern.
PCRE_NOTBOL
This option specifies that first character of the subject string is not
- the beginning of a line, so the circumflex metacharacter should not
- match before it. Setting this without PCRE_MULTILINE (at compile time)
- causes circumflex never to match. This option affects only the behav-
+ the beginning of a line, so the circumflex metacharacter should not
+ match before it. Setting this without PCRE_MULTILINE (at compile time)
+ causes circumflex never to match. This option affects only the behav-
iour of the circumflex metacharacter. It does not affect \A.
PCRE_NOTEOL
This option specifies that the end of the subject string is not the end
- of a line, so the dollar metacharacter should not match it nor (except
- in multiline mode) a newline immediately before it. Setting this with-
+ of a line, so the dollar metacharacter should not match it nor (except
+ in multiline mode) a newline immediately before it. Setting this with-
out PCRE_MULTILINE (at compile time) causes dollar never to match. This
- option affects only the behaviour of the dollar metacharacter. It does
+ option affects only the behaviour of the dollar metacharacter. It does
not affect \Z or \z.
PCRE_NOTEMPTY
An empty string is not considered to be a valid match if this option is
- set. If there are alternatives in the pattern, they are tried. If all
- the alternatives match the empty string, the entire match fails. For
+ set. If there are alternatives in the pattern, they are tried. If all
+ the alternatives match the empty string, the entire match fails. For
example, if the pattern
a?b?
- is applied to a string not beginning with "a" or "b", it matches the
- empty string at the start of the subject. With PCRE_NOTEMPTY set, this
+ is applied to a string not beginning with "a" or "b", it matches an
+ empty string at the start of the subject. With PCRE_NOTEMPTY set, this
match is not valid, so PCRE searches further into the string for occur-
rences of "a" or "b".
- Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
- cial case of a pattern match of the empty string within its split()
- function, and when using the /g modifier. It is possible to emulate
- Perl's behaviour after matching a null string by first trying the match
- again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
- if that fails by advancing the starting offset (see below) and trying
- an ordinary match again. There is some code that demonstrates how to do
- this in the pcredemo sample program.
+ PCRE_NOTEMPTY_ATSTART
+
+ This is like PCRE_NOTEMPTY, except that an empty string match that is
+ not at the start of the subject is permitted. If the pattern is
+ anchored, such a match can occur only if the pattern contains \K.
+
+ Perl has no direct equivalent of PCRE_NOTEMPTY or
+ PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern
+ match of the empty string within its split() function, and when using
+ the /g modifier. It is possible to emulate Perl's behaviour after
+ matching a null string by first trying the match again at the same off-
+ set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that
+ fails, by advancing the starting offset (see below) and trying an ordi-
+ nary match again. There is some code that demonstrates how to do this
+ in the pcredemo sample program.
PCRE_NO_START_OPTIMIZE
@@ -2066,9 +2086,9 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set,
matching continues by testing any other alternatives. Only if they all
fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH).
- The portion of the string that provided the partial match is set as the
- first matching string. There is a more detailed discussion in the
- pcrepartial documentation.
+ The portion of the string that was inspected when the partial match was
+ found is set as the first matching string. There is a more detailed
+ discussion in the pcrepartial documentation.
The string to be matched by pcre_exec()
@@ -2484,19 +2504,20 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
characteristics to the normal algorithm, and is not compatible with
Perl. Some of the features of PCRE patterns are not supported. Never-
theless, there are times when this kind of matching can be useful. For
- a discussion of the two matching algorithms, see the pcrematching docu-
- mentation.
+ a discussion of the two matching algorithms, and a list of features
+ that pcre_dfa_exec() does not support, see the pcrematching documenta-
+ tion.
- The arguments for the pcre_dfa_exec() function are the same as for
+ The arguments for the pcre_dfa_exec() function are the same as for
pcre_exec(), plus two extras. The ovector argument is used in a differ-
- ent way, and this is described below. The other common arguments are
- used in the same way as for pcre_exec(), so their description is not
+ ent way, and this is described below. The other common arguments are
+ used in the same way as for pcre_exec(), so their description is not
repeated here.
- The two additional arguments provide workspace for the function. The
- workspace vector should contain at least 20 elements. It is used for
+ The two additional arguments provide workspace for the function. The
+ workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More
- workspace will be needed for patterns and subjects where there are a
+ workspace will be needed for patterns and subjects where there are a
lot of potential matches.
Here is an example of a simple call to pcre_dfa_exec():
@@ -2518,12 +2539,13 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre_dfa_exec()
- The unused bits of the options argument for pcre_dfa_exec() must be
- zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
- LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
- PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and
- PCRE_DFA_RESTART. All but the last four of these are exactly the same
- as for pcre_exec(), so their description is not repeated here.
+ The unused bits of the options argument for pcre_dfa_exec() must be
+ zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
+ LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
+ PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR-
+ TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
+ four of these are exactly the same as for pcre_exec(), so their
+ description is not repeated here.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
@@ -2537,8 +2559,8 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
of the subject is reached, there have been no complete matches, but
there is still at least one matching possibility. The portion of the
- string that provided the longest partial match is set as the first
- matching string in both cases.
+ string that was inspected when the longest partial match was found is
+ set as the first matching string in both cases.
PCRE_DFA_SHORTEST
@@ -2644,7 +2666,7 @@ AUTHOR
REVISION
- Last updated: 01 September 2009
+ Last updated: 11 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
@@ -2869,7 +2891,11 @@ DIFFERENCES BETWEEN PCRE AND PERL
is built with Unicode character property support. The properties that
can be tested with \p and \P are limited to the general category prop-
erties such as Lu and Nd, script names such as Greek or Han, and the
- derived properties Any and L&.
+ derived properties Any and L&. PCRE does support the Cs (surrogate)
+ property, which Perl does not; the Perl documentation says "Because
+ Perl hides the need for the user to understand the internal representa-
+ tion of Unicode characters, there is no need to implement the somewhat
+ messy concept of surrogates."
7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
ters in between are treated as literals. This is slightly different
@@ -2889,13 +2915,15 @@ DIFFERENCES BETWEEN PCRE AND PERL
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This
- is not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE
+ is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
"callout" feature allows an external function to be called during pat-
tern matching. See the pcrecallout documentation for details.
9. Subpatterns that are called recursively or as "subroutines" are
always treated as atomic groups in PCRE. This is like Python, but
- unlike Perl.
+ unlike Perl. There is a discussion of an example that explains this in
+ more detail in the section on recursion differences from Perl in the
+ pcrecompat page.
10. There are some differences that are concerned with the settings of
captured strings when part of a pattern is repeated. For example,
@@ -2904,9 +2932,7 @@ DIFFERENCES BETWEEN PCRE AND PERL
11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT),
(*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
- the forms without an argument. PCRE does not support (*MARK). If
- (*ACCEPT) is within capturing parentheses, PCRE does not set that cap-
- ture group; this is different to Perl.
+ the forms without an argument. PCRE does not support (*MARK).
12. PCRE provides some extensions to the Perl regular expression facil-
ities. Perl 5.10 will include new features that are not in earlier
@@ -2931,10 +2957,11 @@ DIFFERENCES BETWEEN PCRE AND PERL
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be
tried only at the first matching position in the subject string.
- (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
- TURE options for pcre_exec() have no Perl equivalents.
+ (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
+ and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva-
+ lents.
- (g) The \R escape sequence can be restricted to match only CR, LF, or
+ (g) The \R escape sequence can be restricted to match only CR, LF, or
CRLF by the PCRE_BSR_ANYCRLF option.
(h) The callout facility is PCRE-specific.
@@ -2944,10 +2971,10 @@ DIFFERENCES BETWEEN PCRE AND PERL
(j) Patterns compiled by PCRE can be saved and re-used at a later time,
even on different hosts that have the other endianness.
- (k) The alternative matching function (pcre_dfa_exec()) matches in a
+ (k) The alternative matching function (pcre_dfa_exec()) matches in a
different way and is not Perl-compatible.
- (l) PCRE recognizes some special sequences such as (*CR) at the start
+ (l) PCRE recognizes some special sequences such as (*CR) at the start
of a pattern that set overall options that cannot be changed within the
pattern.
@@ -2961,7 +2988,7 @@ AUTHOR
REVISION
- Last updated: 25 August 2009
+ Last updated: 18 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
@@ -3480,9 +3507,9 @@ BACKSLASH
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
- the pcreapi page).
+ the pcreapi page). Perl does not support the Cs property.
- The long synonyms for these properties that Perl supports (such as
+ The long synonyms for property names that Perl supports (such as
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".
@@ -4707,8 +4734,8 @@ RECURSIVE PATTERNS
Obviously, PCRE cannot support the interpolation of Perl code. Instead,
it supports special syntax for recursion of the entire pattern, and
also for individual subpattern recursion. After its introduction in
- PCRE and Python, this kind of recursion was introduced into Perl at
- release 5.10.
+ PCRE and Python, this kind of recursion was subsequently introduced
+ into Perl at release 5.10.
A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive call of the subpattern of
@@ -4717,105 +4744,166 @@ RECURSIVE PATTERNS
tion.) The special item (?R) or (?0) is a recursive call of the entire
regular expression.
- In PCRE (like Python, but unlike Perl), a recursive subpattern call is
- always treated as an atomic group. That is, once it has matched some of
- the subject string, it is never re-entered, even if it contains untried
- alternatives and there is a subsequent matching failure.
-
- This PCRE pattern solves the nested parentheses problem (assume the
+ This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
\( ( (?>[^()]+) | (?R) )* \)
- First it matches an opening parenthesis. Then it matches any number of
- substrings which can either be a sequence of non-parentheses, or a
- recursive match of the pattern itself (that is, a correctly parenthe-
+ First it matches an opening parenthesis. Then it matches any number of
+ substrings which can either be a sequence of non-parentheses, or a
+ recursive match of the pattern itself (that is, a correctly parenthe-
sized substring). Finally there is a closing parenthesis.
- If this were part of a larger pattern, you would not want to recurse
+ If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:
( \( ( (?>[^()]+) | (?1) )* \) )
- We have put the pattern into parentheses, and caused the recursion to
+ We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.
- In a larger pattern, keeping track of parenthesis numbers can be
- tricky. This is made easier by the use of relative references. (A Perl
- 5.10 feature.) Instead of (?1) in the pattern above you can write
+ In a larger pattern, keeping track of parenthesis numbers can be
+ tricky. This is made easier by the use of relative references. (A Perl
+ 5.10 feature.) Instead of (?1) in the pattern above you can write
(?-2) to refer to the second most recently opened parentheses preceding
- the recursion. In other words, a negative number counts capturing
+ the recursion. In other words, a negative number counts capturing
parentheses leftwards from the point at which it is encountered.
- It is also possible to refer to subsequently opened parentheses, by
- writing references such as (?+2). However, these cannot be recursive
- because the reference is not inside the parentheses that are refer-
- enced. They are always "subroutine" calls, as described in the next
+ It is also possible to refer to subsequently opened parentheses, by
+ writing references such as (?+2). However, these cannot be recursive
+ because the reference is not inside the parentheses that are refer-
+ enced. They are always "subroutine" calls, as described in the next
section.
- An alternative approach is to use named parentheses instead. The Perl
- syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
+ An alternative approach is to use named parentheses instead. The Perl
+ syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
- If there is more than one subpattern with the same name, the earliest
+ If there is more than one subpattern with the same name, the earliest
one is used.
- This particular example pattern that we have been looking at contains
- nested unlimited repeats, and so the use of atomic grouping for match-
- ing strings of non-parentheses is important when applying the pattern
+ This particular example pattern that we have been looking at contains
+ nested unlimited repeats, and so the use of atomic grouping for match-
+ ing strings of non-parentheses is important when applying the pattern
to strings that do not match. For example, when this pattern is applied
to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
- it yields "no match" quickly. However, if atomic grouping is not used,
- the match runs for a very long time indeed because there are so many
- different ways the + and * repeats can carve up the subject, and all
+ it yields "no match" quickly. However, if atomic grouping is not used,
+ the match runs for a very long time indeed because there are so many
+ different ways the + and * repeats can carve up the subject, and all
have to be tested before failure can be reported.
At the end of a match, the values set for any capturing subpatterns are
those from the outermost level of the recursion at which the subpattern
- value is set. If you want to obtain intermediate values, a callout
- function can be used (see below and the pcrecallout documentation). If
+ value is set. If you want to obtain intermediate values, a callout
+ function can be used (see below and the pcrecallout documentation). If
the pattern above is matched against
(ab(cd)ef)
- the value for the capturing parentheses is "ef", which is the last
- value taken on at the top level. If additional parentheses are added,
+ the value for the capturing parentheses is "ef", which is the last
+ value taken on at the top level. If additional parentheses are added,
giving
\( ( ( (?>[^()]+) | (?R) )* ) \)
^ ^
^ ^
- the string they capture is "ab(cd)ef", the contents of the top level
- parentheses. If there are more than 15 capturing parentheses in a pat-
+ the string they capture is "ab(cd)ef", the contents of the top level
+ parentheses. If there are more than 15 capturing parentheses in a pat-
tern, PCRE has to obtain extra memory to store data during a recursion,
- which it does by using pcre_malloc, freeing it via pcre_free after-
- wards. If no memory can be obtained, the match fails with the
+ which it does by using pcre_malloc, freeing it via pcre_free after-
+ wards. If no memory can be obtained, the match fails with the
PCRE_ERROR_NOMEMORY error.
- Do not confuse the (?R) item with the condition (R), which tests for
- recursion. Consider this pattern, which matches text in angle brack-
- ets, allowing for arbitrary nesting. Only digits are allowed in nested
- brackets (that is, when recursing), whereas any characters are permit-
+ Do not confuse the (?R) item with the condition (R), which tests for
+ recursion. Consider this pattern, which matches text in angle brack-
+ ets, allowing for arbitrary nesting. Only digits are allowed in nested
+ brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
- In this pattern, (?(R) is the start of a conditional subpattern, with
- two different alternatives for the recursive and non-recursive cases.
+ In this pattern, (?(R) is the start of a conditional subpattern, with
+ two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
+ Recursion difference from Perl
+
+ In PCRE (like Python, but unlike Perl), a recursive subpattern call is
+ always treated as an atomic group. That is, once it has matched some of
+ the subject string, it is never re-entered, even if it contains untried
+ alternatives and there is a subsequent matching failure. This can be
+ illustrated by the following pattern, which purports to match a palin-
+ dromic string that contains an odd number of characters (for example,
+ "a", "aba", "abcba", "abcdcba"):
+
+ ^(.|(.)(?1)\2)$
+
+ The idea is that it either matches a single character, or two identical
+ characters surrounding a sub-palindrome. In Perl, this pattern works;
+ in PCRE it does not if the pattern is longer than three characters.
+ Consider the subject string "abcba":
+
+ At the top level, the first character is matched, but as it is not at
+ the end of the string, the first alternative fails; the second alterna-
+ tive is taken and the recursion kicks in. The recursive call to subpat-
+ tern 1 successfully matches the next character ("b"). (Note that the
+ beginning and end of line tests are not part of the recursion).
+
+ Back at the top level, the next character ("c") is compared with what
+ subpattern 2 matched, which was "a". This fails. Because the recursion
+ is treated as an atomic group, there are now no backtracking points,
+ and so the entire match fails. (Perl is able, at this point, to re-
+ enter the recursion and try the second alternative.) However, if the
+ pattern is written with the alternatives in the other order, things are
+ different:
+
+ ^((.)(?1)\2|.)$
+
+ This time, the recursing alternative is tried first, and continues to
+ recurse until it runs out of characters, at which point the recursion
+ fails. But this time we do have another alternative to try at the
+ higher level. That is the big difference: in the previous case the
+ remaining alternative is at a deeper recursion level, which PCRE cannot
+ use.
+
+ To change the pattern so that matches all palindromic strings, not just
+ those with an odd number of characters, it is tempting to change the
+ pattern to this:
+
+ ^((.)(?1)\2|.?)$
+
+ Again, this works in Perl, but not in PCRE, and for the same reason.
+ When a deeper recursion has matched a single character, it cannot be
+ entered again in order to match an empty string. The solution is to
+ separate the two cases, and write out the odd and even cases as alter-
+ natives at the higher level:
+
+ ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
+
+ If you want to match typical palindromic phrases, the pattern has to
+ ignore all non-word characters, which can be done like this:
+
+ ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+4|\W*+.\W*+))\W*+$
+
+ If run with the PCRE_CASELESS option, this pattern matches phrases such
+ as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
+ Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
+ ing into sequences of non-word characters. Without this, PCRE takes a
+ great deal longer (ten times or more) to match typical phrases, and
+ Perl takes so long that you think it has gone into a loop.
+
SUBPATTERNS AS SUBROUTINES
If the syntax for a recursive subpattern reference (either by number or
- by name) is used outside the parentheses to which it refers, it oper-
- ates like a subroutine in a programming language. The "called" subpat-
+ by name) is used outside the parentheses to which it refers, it oper-
+ ates like a subroutine in a programming language. The "called" subpat-
tern may be defined before or after the reference. A numbered reference
can be absolute or relative, as in these examples:
@@ -4827,101 +4915,106 @@ SUBPATTERNS AS SUBROUTINES
(sens|respons)e and \1ibility
- matches "sense and sensibility" and "response and responsibility", but
+ matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
- is used, it does match "sense and responsibility" as well as the other
- two strings. Another example is given in the discussion of DEFINE
+ is used, it does match "sense and responsibility" as well as the other
+ two strings. Another example is given in the discussion of DEFINE
above.
Like recursive subpatterns, a "subroutine" call is always treated as an
- atomic group. That is, once it has matched some of the subject string,
- it is never re-entered, even if it contains untried alternatives and
+ atomic group. That is, once it has matched some of the subject string,
+ it is never re-entered, even if it contains untried alternatives and
there is a subsequent matching failure.
- When a subpattern is used as a subroutine, processing options such as
+ When a subpattern is used as a subroutine, processing options such as
case-independence are fixed when the subpattern is defined. They cannot
be changed for different calls. For example, consider this pattern:
(abc)(?i:(?-1))
- It matches "abcabc". It does not match "abcABC" because the change of
+ It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
ONIGURUMA SUBROUTINE SYNTAX
- For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
- an alternative syntax for referencing a subpattern as a subroutine,
- possibly recursively. Here are two of the examples used above, rewrit-
+ an alternative syntax for referencing a subpattern as a subroutine,
+ possibly recursively. Here are two of the examples used above, rewrit-
ten using this syntax:
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility
- PCRE supports an extension to Oniguruma: if a number is preceded by a
+ PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\g<-1>)
- Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
- synonymous. The former is a back reference; the latter is a subroutine
+ Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
+ synonymous. The former is a back reference; the latter is a subroutine
call.
CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
- Perl code to be obeyed in the middle of matching a regular expression.
+ Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different sub-
strings that match the same pair of parentheses when there is a repeti-
tion.
PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
- an external function by putting its entry point in the global variable
- pcre_callout. By default, this variable contains NULL, which disables
+ an external function by putting its entry point in the global variable
+ pcre_callout. By default, this variable contains NULL, which disables
all calling out.
- Within a regular expression, (?C) indicates the points at which the
- external function is to be called. If you want to identify different
- callout points, you can put a number less than 256 after the letter C.
- The default value is zero. For example, this pattern has two callout
+ Within a regular expression, (?C) indicates the points at which the
+ external function is to be called. If you want to identify different
+ callout points, you can put a number less than 256 after the letter C.
+ The default value is zero. For example, this pattern has two callout
points:
(?C1)abc(?C2)def
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
- automatically installed before each item in the pattern. They are all
+ automatically installed before each item in the pattern. They are all
numbered 255.
During matching, when PCRE reaches a callout point (and pcre_callout is
- set), the external function is called. It is provided with the number
- of the callout, the position in the pattern, and, optionally, one item
- of data originally supplied by the caller of pcre_exec(). The callout
- function may cause matching to proceed, to backtrack, or to fail alto-
+ set), the external function is called. It is provided with the number
+ of the callout, the position in the pattern, and, optionally, one item
+ of data originally supplied by the caller of pcre_exec(). The callout
+ function may cause matching to proceed, to backtrack, or to fail alto-
gether. A complete description of the interface to the callout function
is given in the pcrecallout documentation.
BACKTRACKING CONTROL
- Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
+ Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are described in the Perl documentation as "experimental and sub-
- ject to change or removal in a future version of Perl". It goes on to
- say: "Their usage in production code should be noted to avoid problems
+ ject to change or removal in a future version of Perl". It goes on to
+ say: "Their usage in production code should be noted to avoid problems
during upgrades." The same remarks apply to the PCRE features described
in this section.
- Since these verbs are specifically related to backtracking, most of
- them can be used only when the pattern is to be matched using
+ Since these verbs are specifically related to backtracking, most of
+ them can be used only when the pattern is to be matched using
pcre_exec(), which uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, they cause an
error if encountered by pcre_dfa_exec().
+ If any of these verbs are used in an assertion subpattern, their effect
+ is confined to that subpattern; it does not extend to the surrounding
+ pattern. Note that assertion subpatterns are processed as anchored at
+ the point where they are tested.
+
The new verbs make use of what was previously invalid syntax: an open-
ing parenthesis followed by an asterisk. In Perl, they are generally of
the form (*VERB:ARG) but PCRE does not support the use of arguments, so
@@ -4936,14 +5029,14 @@ BACKTRACKING CONTROL
This verb causes the match to end successfully, skipping the remainder
of the pattern. When inside a recursion, only the innermost pattern is
- ended immediately. PCRE differs from Perl in what happens if the
- (*ACCEPT) is inside capturing parentheses. In Perl, the data so far is
- captured: in PCRE no data is captured. For example:
+ ended immediately. If the (*ACCEPT) is inside capturing parentheses,
+ the data so far is captured. (This feature was added to PCRE at release
+ 8.00.) For example:
- A(A|B(*ACCEPT)|C)D
+ A((?:A|B(*ACCEPT)|C)D)
- This matches "AB", "AAD", or "ACD", but when it matches "AB", no data
- is captured.
+ This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
+ tured by the outer parentheses.
(*FAIL) or (*F)
@@ -5039,7 +5132,7 @@ AUTHOR
REVISION
- Last updated: 11 April 2009
+ Last updated: 18 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
@@ -5453,10 +5546,26 @@ PARTIAL MATCHING USING pcre_exec()
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but
matching continues as normal, and other alternatives in the pattern are
tried. If no complete match can be found, pcre_exec() returns
- PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH, and if there are at
- least two slots in the offsets vector, they are filled in with the off-
- sets of the longest string that partially matched. Consider this pat-
- tern:
+ PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
+ two slots in the offsets vector, the first of them is set to the offset
+ of the earliest character that was inspected when the partial match was
+ found. For convenience, the second offset points to the end of the
+ string so that a substring can easily be extracted.
+
+ For the majority of patterns, the first offset identifies the start of
+ the partially matched string. However, for patterns that contain look-
+ behind assertions, or \K, or begin with \b or \B, earlier characters
+ have been inspected while carrying out the match. For example:
+
+ /(?<=abc)123/
+
+ This pattern matches "123", but only if it is preceded by "abc". If the
+ subject string is "xyzabc12", the offsets after a partial match are for
+ the substring "abc12", because all these characters are needed if
+ another match is tried with extra characters added.
+
+ If there is more than one partial match, the first one that was found
+ provides the data that is returned. Consider this pattern:
/123\w+X|dogY/
@@ -5464,7 +5573,7 @@ PARTIAL MATCHING USING pcre_exec()
natives fail to match, but the end of the subject is reached during
matching, so PCRE_ERROR_PARTIAL is returned instead of
PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying
- "123dog" as the longest partial match that was found. (In this example,
+ "123dog" as the first partial match that was found. (In this example,
there are two partial matches, because "dog" on its own partially
matches the second alternative.)
@@ -5508,64 +5617,65 @@ PARTIAL MATCHING USING pcre_dfa_exec()
there have been no complete matches. Otherwise, the complete matches
are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
takes precedence over any complete matches. The portion of the string
- that provided the longest partial match is set as the first matching
- string, provided there are at least two slots in the offsets vector.
+ that was inspected when the longest partial match was found is set as
+ the first matching string, provided there are at least two slots in the
+ offsets vector.
- Because pcre_dfa_exec() always searches for all possible matches, and
- there is no difference between greedy and ungreedy repetition, its be-
+ Because pcre_dfa_exec() always searches for all possible matches, and
+ there is no difference between greedy and ungreedy repetition, its be-
haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con-
- sider the string "dog" matched against the ungreedy pattern shown
+ sider the string "dog" matched against the ungreedy pattern shown
above:
/dog(sbody)??/
- Whereas pcre_exec() stops as soon as it finds the complete match for
+ Whereas pcre_exec() stops as soon as it finds the complete match for
"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.
PARTIAL MATCHING AND WORD BOUNDARIES
- If a pattern ends with one of sequences \w or \W, which test for word
- boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-
+ If a pattern ends with one of sequences \w or \W, which test for word
+ boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-
intuitive results. Consider this pattern:
/\bcat\b/
This matches "cat", provided there is a word boundary at either end. If
the subject string is "the cat", the comparison of the final "t" with a
- following character cannot take place, so a partial match is found.
- However, pcre_exec() carries on with normal matching, which matches \b
- at the end of the subject when the last character is a letter, thus
+ following character cannot take place, so a partial match is found.
+ However, pcre_exec() carries on with normal matching, which matches \b
+ at the end of the subject when the last character is a letter, thus
finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
- TIAL. The same thing happens with pcre_dfa_exec(), because it also
+ TIAL. The same thing happens with pcre_dfa_exec(), because it also
finds the complete match.
- Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
+ Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
because then the partial match takes precedence.
FORMERLY RESTRICTED PATTERNS
For releases of PCRE prior to 8.00, because of the way certain internal
- optimizations were implemented in the pcre_exec() function, the
- PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
- used with all patterns. From release 8.00 onwards, the restrictions no
- longer apply, and partial matching with pcre_exec() can be requested
+ optimizations were implemented in the pcre_exec() function, the
+ PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
+ used with all patterns. From release 8.00 onwards, the restrictions no
+ longer apply, and partial matching with pcre_exec() can be requested
for any pattern.
Items that were formerly restricted were repeated single characters and
- repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
- not conform to the restrictions, pcre_exec() returned the error code
- PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
- PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
+ repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
+ not conform to the restrictions, pcre_exec() returned the error code
+ PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
+ PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
pattern can be used for partial matching now always returns 1.
EXAMPLE OF PARTIAL MATCHING USING PCRETEST
- If the escape sequence \P is present in a pcretest data line, the
- PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
+ If the escape sequence \P is present in a pcretest data line, the
+ PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
pcretest that uses the date example quoted above:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -5581,24 +5691,24 @@ EXAMPLE OF PARTIAL MATCHING USING PCRETEST
data> j\P
No match
- The first data string is matched completely, so pcretest shows the
- matched substrings. The remaining four strings do not match the com-
+ The first data string is matched completely, so pcretest shows the
+ matched substrings. The remaining four strings do not match the com-
plete pattern, but the first two are partial matches. Similar output is
obtained when pcre_dfa_exec() is used.
- If the escape sequence \P is present more than once in a pcretest data
+ If the escape sequence \P is present more than once in a pcretest data
line, the PCRE_PARTIAL_HARD option is set for the match.
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
When a partial match has been found using pcre_dfa_exec(), it is possi-
- ble to continue the match by providing additional subject data and
- calling pcre_dfa_exec() again with the same compiled regular expres-
- sion, this time setting the PCRE_DFA_RESTART option. You must pass the
+ ble to continue the match by providing additional subject data and
+ calling pcre_dfa_exec() again with the same compiled regular expres-
+ sion, this time setting the PCRE_DFA_RESTART option. You must pass the
same working space as before, because this is where details of the pre-
- vious partial match are stored. Here is an example using pcretest,
- using the \R escape sequence to set the PCRE_DFA_RESTART option (\D
+ vious partial match are stored. Here is an example using pcretest,
+ using the \R escape sequence to set the PCRE_DFA_RESTART option (\D
specifies the use of pcre_dfa_exec()):
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -5607,26 +5717,26 @@ MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
data> n05\R\D
0: n05
- The first call has "23ja" as the subject, and requests partial match-
- ing; the second call has "n05" as the subject for the continued
- (restarted) match. Notice that when the match is complete, only the
- last part is shown; PCRE does not retain the previously partially-
- matched string. It is up to the calling program to do that if it needs
+ The first call has "23ja" as the subject, and requests partial match-
+ ing; the second call has "n05" as the subject for the continued
+ (restarted) match. Notice that when the match is complete, only the
+ last part is shown; PCRE does not retain the previously partially-
+ matched string. It is up to the calling program to do that if it needs
to.
- You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
- PCRE_DFA_RESTART to continue partial matching over multiple segments.
- This facility can be used to pass very long subject strings to
+ You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
+ PCRE_DFA_RESTART to continue partial matching over multiple segments.
+ This facility can be used to pass very long subject strings to
pcre_dfa_exec().
MULTI-SEGMENT MATCHING WITH pcre_exec()
- From release 8.00, pcre_exec() can also be used to do multi-segment
- matching. Unlike pcre_dfa_exec(), it is not possible to restart the
- previous match with a new segment of data. Instead, new data must be
- added to the previous subject string, and the entire match re-run,
- starting from the point where the partial match occurred. Earlier data
+ From release 8.00, pcre_exec() can also be used to do multi-segment
+ matching. Unlike pcre_dfa_exec(), it is not possible to restart the
+ previous match with a new segment of data. Instead, new data must be
+ added to the previous subject string, and the entire match re-run,
+ starting from the point where the partial match occurred. Earlier data
can be discarded. Consider an unanchored pattern that matches dates:
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
@@ -5634,39 +5744,45 @@ MULTI-SEGMENT MATCHING WITH pcre_exec()
Partial match: 23ja
The this stage, an application could discard the text preceding "23ja",
- add on text from the next segment, and call pcre_exec() again. Unlike
- pcre_dfa_exec(), the entire matching string must always be available,
- and the complete matching process occurs for each call, so more memory
+ add on text from the next segment, and call pcre_exec() again. Unlike
+ pcre_dfa_exec(), the entire matching string must always be available,
+ and the complete matching process occurs for each call, so more memory
and more processing time is needed.
+ Note: If the pattern contains lookbehind assertions, or \K, or starts
+ with \b or \B, the string that is returned for a partial match will
+ include characters that precede the partially matched string itself,
+ because these must be retained when adding on more characters for a
+ subsequent matching attempt.
+
ISSUES WITH MULTI-SEGMENT MATCHING
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
- 1. If the pattern contains tests for the beginning or end of a line,
- you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
- ate, when the subject string for any call does not contain the begin-
+ 1. If the pattern contains tests for the beginning or end of a line,
+ you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
+ ate, when the subject string for any call does not contain the begin-
ning or end of a line.
- 2. If the pattern contains backward assertions (including \b or \B),
- you need to arrange for some overlap in the subject strings to allow
- for them to be correctly tested at the start of each substring. For
- example, using pcre_dfa_exec(), you could pass the subject in chunks
- that are 500 bytes long, but in a buffer of 700 bytes, with the start-
- ing offset set to 200 and the previous 200 bytes at the start of the
- buffer.
-
- 3. Matching a subject string that is split into multiple segments may
- not always produce exactly the same result as matching over one single
- long string, especially when PCRE_PARTIAL_SOFT is used. The section
- "Partial Matching and Word Boundaries" above describes an issue that
- arises if the pattern ends with \b or \B. Another kind of difference
- may occur when there are multiple matching possibilities, because a
+ 2. Lookbehind assertions at the start of a pattern are catered for in
+ the offsets that are returned for a partial match. However, in theory,
+ a lookbehind assertion later in the pattern could require even earlier
+ characters to be inspected, and it might not have been reached when a
+ partial match occurs. This is probably an extremely unlikely case; you
+ could guard against it to a certain extent by always including extra
+ characters at the start.
+
+ 3. Matching a subject string that is split into multiple segments may
+ not always produce exactly the same result as matching over one single
+ long string, especially when PCRE_PARTIAL_SOFT is used. The section
+ "Partial Matching and Word Boundaries" above describes an issue that
+ arises if the pattern ends with \b or \B. Another kind of difference
+ may occur when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches.
This means that as soon as the shortest match has been found, continua-
- tion to a new subject segment is no longer possible. Consider again
+ tion to a new subject segment is no longer possible. Consider again
this pcretest example:
re> /dog(sbody)?/
@@ -5680,17 +5796,17 @@ ISSUES WITH MULTI-SEGMENT MATCHING
0: dogsbody
1: dog
- The first data line passes the string "dogsb" to pcre_exec(), setting
- the PCRE_PARTIAL_SOFT option. Although the string is a partial match
- for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
- shorter string "dog" is a complete match. Similarly, when the subject
- is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
+ The first data line passes the string "dogsb" to pcre_exec(), setting
+ the PCRE_PARTIAL_SOFT option. Although the string is a partial match
+ for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
+ shorter string "dog" is a complete match. Similarly, when the subject
+ is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
the first two) the match stops when "dog" has been found, and it is not
- possible to continue. On the other hand, if "dogsbody" is presented as
+ possible to continue. On the other hand, if "dogsbody" is presented as
a single string, pcre_dfa_exec() finds both matches.
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
- when matching multi-segment data. The example above then behaves dif-
+ when matching multi-segment data. The example above then behaves dif-
ferently:
re> /dog(sbody)?/
@@ -5703,25 +5819,25 @@ ISSUES WITH MULTI-SEGMENT MATCHING
4. Patterns that contain alternatives at the top level which do not all
- start with the same pattern item may not work as expected when
+ start with the same pattern item may not work as expected when
pcre_dfa_exec() is used. For example, consider this pattern:
1234|3789
- If the first part of the subject is "ABC123", a partial match of the
- first alternative is found at offset 3. There is no partial match for
+ If the first part of the subject is "ABC123", a partial match of the
+ first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
- point in the subject string. Attempting to continue with the string
- "7890" does not yield a match because only those alternatives that
- match at one point in the subject are remembered. The problem arises
- because the start of the second alternative matches within the first
- alternative. There is no problem with anchored patterns or patterns
+ point in the subject string. Attempting to continue with the string
+ "7890" does not yield a match because only those alternatives that
+ match at one point in the subject are remembered. The problem arises
+ because the start of the second alternative matches within the first
+ alternative. There is no problem with anchored patterns or patterns
such as:
1234|ABCD
- where no string can be a partial match for both alternatives. This is
- not a problem if pcre_exec() is used, because the entire match has to
+ where no string can be a partial match for both alternatives. This is
+ not a problem if pcre_exec() is used, because the entire match has to
be rerun each time:
re> /1234|3789/
@@ -5740,7 +5856,7 @@ AUTHOR
REVISION
- Last updated: 31 August 2009
+ Last updated: 05 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
@@ -6062,6 +6178,10 @@ DESCRIPTION
easier to slot in PCRE as a replacement library. Other POSIX options
are not even defined.
+ There are also some other options that are not defined by POSIX. These
+ have been added at the request of users who want to make use of certain
+ PCRE-specific features via the POSIX calling interface.
+
When PCRE is called via these functions, it is only the API that is
POSIX-like in style. The syntax and semantics of the regular expres-
sions themselves are still those of Perl, subject to the setting of
@@ -6116,6 +6236,12 @@ COMPILING A PATTERN
ing, the nmatch and pmatch arguments are ignored, and no captured
strings are returned.
+ REG_UNGREEDY
+
+ The PCRE_UNGREEDY option is set when the regular expression is passed
+ for compilation to the native function. Note that REG_UNGREEDY is not
+ part of the POSIX standard.
+
REG_UTF8
The PCRE_UTF8 option is set when the regular expression is passed for
@@ -6128,7 +6254,7 @@ COMPILING A PATTERN
semantics. In particular, the way it handles newline characters in the
subject string is the Perl way, not the POSIX way. Note that setting
PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
- It does not affect the way newlines are matched by . (they aren't) or
+ It does not affect the way newlines are matched by . (they are not) or
by a negative class such as [^a] (they are).
The yield of regcomp() is zero on success, and non-zero otherwise. The
@@ -6215,36 +6341,39 @@ MATCHING A PATTERN
matched strings is returned. The nmatch and pmatch arguments of
regexec() are ignored.
+ If the value of nmatch is zero, or if the value pmatch is NULL, no data
+ about any matched strings is returned.
+
Otherwise,the portion of the string that was matched, and also any cap-
tured substrings, are returned via the pmatch argument, which points to
- an array of nmatch structures of type regmatch_t, containing the mem-
- bers rm_so and rm_eo. These contain the offset to the first character
- of each substring and the offset to the first character after the end
- of each substring, respectively. The 0th element of the vector relates
- to the entire portion of string that was matched; subsequent elements
- relate to the capturing subpatterns of the regular expression. Unused
+ an array of nmatch structures of type regmatch_t, containing the mem-
+ bers rm_so and rm_eo. These contain the offset to the first character
+ of each substring and the offset to the first character after the end
+ of each substring, respectively. The 0th element of the vector relates
+ to the entire portion of string that was matched; subsequent elements
+ relate to the capturing subpatterns of the regular expression. Unused
entries in the array have both structure members set to -1.
- A successful match yields a zero return; various error codes are
- defined in the header file, of which REG_NOMATCH is the "expected"
+ A successful match yields a zero return; various error codes are
+ defined in the header file, of which REG_NOMATCH is the "expected"
failure code.
ERROR MESSAGES
The regerror() function maps a non-zero errorcode from either regcomp()
- or regexec() to a printable message. If preg is not NULL, the error
+ or regexec() to a printable message. If preg is not NULL, the error
should have arisen from the use of that structure. A message terminated
- by a binary zero is placed in errbuf. The length of the message,
- including the zero, is limited to errbuf_size. The yield of the func-
+ by a binary zero is placed in errbuf. The length of the message,
+ including the zero, is limited to errbuf_size. The yield of the func-
tion is the size of buffer needed to hold the whole message.
MEMORY USAGE
- Compiling a regular expression causes memory to be allocated and asso-
- ciated with the preg structure. The function regfree() frees all such
- memory, after which preg may no longer be used as a compiled expres-
+ Compiling a regular expression causes memory to be allocated and asso-
+ ciated with the preg structure. The function regfree() frees all such
+ memory, after which preg may no longer be used as a compiled expres-
sion.
@@ -6257,7 +6386,7 @@ AUTHOR
REVISION
- Last updated: 15 August 2009
+ Last updated: 02 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/doc/pcrecompat.3 b/doc/pcrecompat.3
index af0aa92..f32b071 100644
--- a/doc/pcrecompat.3
+++ b/doc/pcrecompat.3
@@ -69,7 +69,7 @@ The \eQ...\eE sequence is recognized both inside and outside character classes.
.P
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
constructions. However, there is support for recursive patterns. This is not
-available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE "callout"
+available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE "callout"
feature allows an external function to be called during pattern matching. See
the
.\" HREF
@@ -78,7 +78,17 @@ the
documentation for details.
.P
9. Subpatterns that are called recursively or as "subroutines" are always
-treated as atomic groups in PCRE. This is like Python, but unlike Perl.
+treated as atomic groups in PCRE. This is like Python, but unlike Perl. There
+is a discussion of an example that explains this in more detail in the
+.\" HTML <a href="pcrepattern.html#recursiondifference">
+.\" </a>
+section on recursion differences from Perl
+.\"
+in the
+.\" HREF
+\fBpcrecompat\fP
+.\"
+page.
.P
10. There are some differences that are concerned with the settings of captured
strings when part of a pattern is repeated. For example, matching "aba" against
@@ -145,6 +155,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 16 September 2009
+Last updated: 18 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcredemo.3 b/doc/pcredemo.3
index 611ed0c..5d2926d 100644
--- a/doc/pcredemo.3
+++ b/doc/pcredemo.3
@@ -240,12 +240,12 @@ if (namecount <= 0) printf("No named substrings\en"); else
* *
* If the previous match WAS for an empty string, we can't do that, as it *
* would lead to an infinite loop. Instead, a special call of pcre_exec() *
-* is made with the PCRE_NOTEMPTY and PCRE_ANCHORED flags set. The first *
-* of these tells PCRE that an empty string is not a valid match; other *
-* possibilities must be tried. The second flag restricts PCRE to one *
-* match attempt at the initial string position. If this match succeeds, *
-* an alternative to the empty string match has been found, and we can *
-* proceed round the loop. *
+* is made with the PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED flags set. *
+* The first of these tells PCRE that an empty string at the start of the *
+* subject is not a valid match; other possibilities must be tried. The *
+* second flag restricts PCRE to one match attempt at the initial string *
+* position. If this match succeeds, an alternative to the empty string *
+* match has been found, and we can proceed round the loop. *
*************************************************************************/
if (!find_all)
@@ -268,7 +268,7 @@ for (;;)
if (ovector[0] == ovector[1])
{
if (ovector[0] == subject_length) break;
- options = PCRE_NOTEMPTY | PCRE_ANCHORED;
+ options = PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED;
}
/* Run the next matching operation */
diff --git a/doc/pcregrep.txt b/doc/pcregrep.txt
index e876b01..47937a9 100644
--- a/doc/pcregrep.txt
+++ b/doc/pcregrep.txt
@@ -70,203 +70,204 @@ DESCRIPTION
of the above options is used.
Patterns that can match an empty string are accepted, but empty string
- matches are not recognized. An example is the pattern "(super)?(man)?",
- in which all components are optional. This pattern finds all occur-
- rences of both "super" and "man"; the output differs from matching with
- "super|man" when only the matching substrings are being shown.
-
- If the LC_ALL or LC_CTYPE environment variable is set, pcregrep uses
- the value to set a locale when calling the PCRE library. The --locale
+ matches are never recognized. An example is the pattern
+ "(super)?(man)?", in which all components are optional. This pattern
+ finds all occurrences of both "super" and "man"; the output differs
+ from matching with "super|man" when only the matching substrings are
+ being shown.
+
+ If the LC_ALL or LC_CTYPE environment variable is set, pcregrep uses
+ the value to set a locale when calling the PCRE library. The --locale
option can be used to override this.
SUPPORT FOR COMPRESSED FILES
- It is possible to compile pcregrep so that it uses libz or libbz2 to
- read files whose names end in .gz or .bz2, respectively. You can find
+ It is possible to compile pcregrep so that it uses libz or libbz2 to
+ read files whose names end in .gz or .bz2, respectively. You can find
out whether your binary has support for one or both of these file types
by running it with the --help option. If the appropriate support is not
- present, files are treated as plain text. The standard input is always
+ present, files are treated as plain text. The standard input is always
so treated.
OPTIONS
- The order in which some of the options appear can affect the output.
- For example, both the -h and -l options affect the printing of file
- names. Whichever comes later in the command line will be the one that
+ The order in which some of the options appear can affect the output.
+ For example, both the -h and -l options affect the printing of file
+ names. Whichever comes later in the command line will be the one that
takes effect.
- -- This terminate the list of options. It is useful if the next
- item on the command line starts with a hyphen but is not an
- option. This allows for the processing of patterns and file-
+ -- This terminate the list of options. It is useful if the next
+ item on the command line starts with a hyphen but is not an
+ option. This allows for the processing of patterns and file-
names that start with hyphens.
-A number, --after-context=number
- Output number lines of context after each matching line. If
+ Output number lines of context after each matching line. If
filenames and/or line numbers are being output, a hyphen sep-
- arator is used instead of a colon for the context lines. A
- line containing "--" is output between each group of lines,
- unless they are in fact contiguous in the input file. The
- value of number is expected to be relatively small. However,
+ arator is used instead of a colon for the context lines. A
+ line containing "--" is output between each group of lines,
+ unless they are in fact contiguous in the input file. The
+ value of number is expected to be relatively small. However,
pcregrep guarantees to have up to 8K of following text avail-
able for context output.
-B number, --before-context=number
- Output number lines of context before each matching line. If
+ Output number lines of context before each matching line. If
filenames and/or line numbers are being output, a hyphen sep-
- arator is used instead of a colon for the context lines. A
- line containing "--" is output between each group of lines,
- unless they are in fact contiguous in the input file. The
- value of number is expected to be relatively small. However,
+ arator is used instead of a colon for the context lines. A
+ line containing "--" is output between each group of lines,
+ unless they are in fact contiguous in the input file. The
+ value of number is expected to be relatively small. However,
pcregrep guarantees to have up to 8K of preceding text avail-
able for context output.
-C number, --context=number
- Output number lines of context both before and after each
- matching line. This is equivalent to setting both -A and -B
+ Output number lines of context both before and after each
+ matching line. This is equivalent to setting both -A and -B
to the same value.
-c, --count
- Do not output individual lines from the files that are being
+ Do not output individual lines from the files that are being
scanned; instead output the number of lines that would other-
- wise have been shown. If no lines are selected, the number
- zero is output. If several files are are being scanned, a
- count is output for each of them. However, if the --files-
- with-matches option is also used, only those files whose
+ wise have been shown. If no lines are selected, the number
+ zero is output. If several files are are being scanned, a
+ count is output for each of them. However, if the --files-
+ with-matches option is also used, only those files whose
counts are greater than zero are listed. When -c is used, the
-A, -B, and -C options are ignored.
--colour, --color
If this option is given without any data, it is equivalent to
- "--colour=auto". If data is required, it must be given in
+ "--colour=auto". If data is required, it must be given in
the same shell item, separated by an equals sign.
--colour=value, --color=value
This option specifies under what circumstances the parts of a
line that matched a pattern should be coloured in the output.
- By default, the output is not coloured. The value (which is
- optional, see above) may be "never", "always", or "auto". In
- the latter case, colouring happens only if the standard out-
- put is connected to a terminal. More resources are used when
- colouring is enabled, because pcregrep has to search for all
- possible matches in a line, not just one, in order to colour
+ By default, the output is not coloured. The value (which is
+ optional, see above) may be "never", "always", or "auto". In
+ the latter case, colouring happens only if the standard out-
+ put is connected to a terminal. More resources are used when
+ colouring is enabled, because pcregrep has to search for all
+ possible matches in a line, not just one, in order to colour
them all.
The colour that is used can be specified by setting the envi-
ronment variable PCREGREP_COLOUR or PCREGREP_COLOR. The value
of this variable should be a string of two numbers, separated
- by a semicolon. They are copied directly into the control
- string for setting colour on a terminal, so it is your
- responsibility to ensure that they make sense. If neither of
- the environment variables is set, the default is "1;31",
+ by a semicolon. They are copied directly into the control
+ string for setting colour on a terminal, so it is your
+ responsibility to ensure that they make sense. If neither of
+ the environment variables is set, the default is "1;31",
which gives red.
-D action, --devices=action
- If an input path is not a regular file or a directory,
- "action" specifies how it is to be processed. Valid values
+ If an input path is not a regular file or a directory,
+ "action" specifies how it is to be processed. Valid values
are "read" (the default) or "skip" (silently skip the path).
-d action, --directories=action
If an input path is a directory, "action" specifies how it is
- to be processed. Valid values are "read" (the default),
- "recurse" (equivalent to the -r option), or "skip" (silently
- skip the path). In the default case, directories are read as
- if they were ordinary files. In some operating systems the
- effect of reading a directory like this is an immediate end-
+ to be processed. Valid values are "read" (the default),
+ "recurse" (equivalent to the -r option), or "skip" (silently
+ skip the path). In the default case, directories are read as
+ if they were ordinary files. In some operating systems the
+ effect of reading a directory like this is an immediate end-
of-file.
-e pattern, --regex=pattern, --regexp=pattern
Specify a pattern to be matched. This option can be used mul-
tiple times in order to specify several patterns. It can also
- be used as a way of specifying a single pattern that starts
- with a hyphen. When -e is used, no argument pattern is taken
- from the command line; all arguments are treated as file
- names. There is an overall maximum of 100 patterns. They are
- applied to each line in the order in which they are defined
+ be used as a way of specifying a single pattern that starts
+ with a hyphen. When -e is used, no argument pattern is taken
+ from the command line; all arguments are treated as file
+ names. There is an overall maximum of 100 patterns. They are
+ applied to each line in the order in which they are defined
until one matches (or fails to match if -v is used). If -f is
- used with -e, the command line patterns are matched first,
- followed by the patterns from the file, independent of the
- order in which these options are specified. Note that multi-
+ used with -e, the command line patterns are matched first,
+ followed by the patterns from the file, independent of the
+ order in which these options are specified. Note that multi-
ple use of -e is not the same as a single pattern with alter-
natives. For example, X|Y finds the first character in a line
- that is X or Y, whereas if the two patterns are given sepa-
+ that is X or Y, whereas if the two patterns are given sepa-
rately, pcregrep finds X if it is present, even if it follows
- Y in the line. It finds Y only if there is no X in the line.
- This really matters only if you are using -o to show the
+ Y in the line. It finds Y only if there is no X in the line.
+ This really matters only if you are using -o to show the
part(s) of the line that matched.
--exclude=pattern
When pcregrep is searching the files in a directory as a con-
- sequence of the -r (recursive search) option, any regular
+ sequence of the -r (recursive search) option, any regular
files whose names match the pattern are excluded. Subdirecto-
- ries are not excluded by this option; they are searched
- recursively, subject to the --exclude_dir and --include_dir
- options. The pattern is a PCRE regular expression, and is
+ ries are not excluded by this option; they are searched
+ recursively, subject to the --exclude_dir and --include_dir
+ options. The pattern is a PCRE regular expression, and is
matched against the final component of the file name (not the
- entire path). If a file name matches both --include and
- --exclude, it is excluded. There is no short form for this
+ entire path). If a file name matches both --include and
+ --exclude, it is excluded. There is no short form for this
option.
--exclude_dir=pattern
- When pcregrep is searching the contents of a directory as a
- consequence of the -r (recursive search) option, any subdi-
- rectories whose names match the pattern are excluded. (Note
- that the --exclude option does not affect subdirectories.)
- The pattern is a PCRE regular expression, and is matched
- against the final component of the name (not the entire
- path). If a subdirectory name matches both --include_dir and
- --exclude_dir, it is excluded. There is no short form for
+ When pcregrep is searching the contents of a directory as a
+ consequence of the -r (recursive search) option, any subdi-
+ rectories whose names match the pattern are excluded. (Note
+ that the --exclude option does not affect subdirectories.)
+ The pattern is a PCRE regular expression, and is matched
+ against the final component of the name (not the entire
+ path). If a subdirectory name matches both --include_dir and
+ --exclude_dir, it is excluded. There is no short form for
this option.
-F, --fixed-strings
- Interpret each pattern as a list of fixed strings, separated
- by newlines, instead of as a regular expression. The -w
- (match as a word) and -x (match whole line) options can be
+ Interpret each pattern as a list of fixed strings, separated
+ by newlines, instead of as a regular expression. The -w
+ (match as a word) and -x (match whole line) options can be
used with -F. They apply to each of the fixed strings. A line
is selected if any of the fixed strings are found in it (sub-
ject to -w or -x, if present).
-f filename, --file=filename
- Read a number of patterns from the file, one per line, and
- match them against each line of input. A data line is output
+ Read a number of patterns from the file, one per line, and
+ match them against each line of input. A data line is output
if any of the patterns match it. The filename can be given as
"-" to refer to the standard input. When -f is used, patterns
- specified on the command line using -e may also be present;
+ specified on the command line using -e may also be present;
they are tested before the file's patterns. However, no other
- pattern is taken from the command line; all arguments are
- treated as file names. There is an overall maximum of 100
+ pattern is taken from the command line; all arguments are
+ treated as file names. There is an overall maximum of 100
patterns. Trailing white space is removed from each line, and
- blank lines are ignored. An empty file contains no patterns
- and therefore matches nothing. See also the comments about
- multiple patterns versus a single pattern with alternatives
+ blank lines are ignored. An empty file contains no patterns
+ and therefore matches nothing. See also the comments about
+ multiple patterns versus a single pattern with alternatives
in the description of -e above.
--file-offsets
- Instead of showing lines or parts of lines that match, show
- each match as an offset from the start of the file and a
- length, separated by a comma. In this mode, no context is
- shown. That is, the -A, -B, and -C options are ignored. If
+ Instead of showing lines or parts of lines that match, show
+ each match as an offset from the start of the file and a
+ length, separated by a comma. In this mode, no context is
+ shown. That is, the -A, -B, and -C options are ignored. If
there is more than one match in a line, each of them is shown
- separately. This option is mutually exclusive with --line-
+ separately. This option is mutually exclusive with --line-
offsets and --only-matching.
-H, --with-filename
- Force the inclusion of the filename at the start of output
- lines when searching a single file. By default, the filename
- is not shown in this case. For matching lines, the filename
+ Force the inclusion of the filename at the start of output
+ lines when searching a single file. By default, the filename
+ is not shown in this case. For matching lines, the filename
is followed by a colon; for context lines, a hyphen separator
- is used. If a line number is also being output, it follows
+ is used. If a line number is also being output, it follows
the file name.
-h, --no-filename
- Suppress the output filenames when searching multiple files.
- By default, filenames are shown when multiple files are
- searched. For matching lines, the filename is followed by a
- colon; for context lines, a hyphen separator is used. If a
+ Suppress the output filenames when searching multiple files.
+ By default, filenames are shown when multiple files are
+ searched. For matching lines, the filename is followed by a
+ colon; for context lines, a hyphen separator is used. If a
line number is also being output, it follows the file name.
- --help Output a help message, giving brief details of the command
+ --help Output a help message, giving brief details of the command
options and file type support, and then exit.
-i, --ignore-case
@@ -276,38 +277,38 @@ OPTIONS
When pcregrep is searching the files in a directory as a con-
sequence of the -r (recursive search) option, only those reg-
ular files whose names match the pattern are included. Subdi-
- rectories are always included and searched recursively, sub-
+ rectories are always included and searched recursively, sub-
ject to the --include_dir and --exclude_dir options. The pat-
tern is a PCRE regular expression, and is matched against the
- final component of the file name (not the entire path). If a
+ final component of the file name (not the entire path). If a
file name matches both --include and --exclude, it is
excluded. There is no short form for this option.
--include_dir=pattern
- When pcregrep is searching the contents of a directory as a
- consequence of the -r (recursive search) option, only those
- subdirectories whose names match the pattern are included.
- (Note that the --include option does not affect subdirecto-
- ries.) The pattern is a PCRE regular expression, and is
- matched against the final component of the name (not the
- entire path). If a subdirectory name matches both
- --include_dir and --exclude_dir, it is excluded. There is no
+ When pcregrep is searching the contents of a directory as a
+ consequence of the -r (recursive search) option, only those
+ subdirectories whose names match the pattern are included.
+ (Note that the --include option does not affect subdirecto-
+ ries.) The pattern is a PCRE regular expression, and is
+ matched against the final component of the name (not the
+ entire path). If a subdirectory name matches both
+ --include_dir and --exclude_dir, it is excluded. There is no
short form for this option.
-L, --files-without-match
- Instead of outputting lines from the files, just output the
- names of the files that do not contain any lines that would
- have been output. Each file name is output once, on a sepa-
+ Instead of outputting lines from the files, just output the
+ names of the files that do not contain any lines that would
+ have been output. Each file name is output once, on a sepa-
rate line.
-l, --files-with-matches
- Instead of outputting lines from the files, just output the
+ Instead of outputting lines from the files, just output the
names of the files containing lines that would have been out-
- put. Each file name is output once, on a separate line.
- Searching normally stops as soon as a matching line is found
- in a file. However, if the -c (count) option is also used,
- matching continues in order to obtain the correct count, and
- those files that have at least one match are listed along
+ put. Each file name is output once, on a separate line.
+ Searching normally stops as soon as a matching line is found
+ in a file. However, if the -c (count) option is also used,
+ matching continues in order to obtain the correct count, and
+ those files that have at least one match are listed along
with their counts. Using this option with -c is a way of sup-
pressing the listing of files with no matches.
@@ -317,106 +318,106 @@ OPTIONS
input)" is used. There is no short form for this option.
--line-offsets
- Instead of showing lines or parts of lines that match, show
+ Instead of showing lines or parts of lines that match, show
each match as a line number, the offset from the start of the
- line, and a length. The line number is terminated by a colon
- (as usual; see the -n option), and the offset and length are
- separated by a comma. In this mode, no context is shown.
- That is, the -A, -B, and -C options are ignored. If there is
- more than one match in a line, each of them is shown sepa-
+ line, and a length. The line number is terminated by a colon
+ (as usual; see the -n option), and the offset and length are
+ separated by a comma. In this mode, no context is shown.
+ That is, the -A, -B, and -C options are ignored. If there is
+ more than one match in a line, each of them is shown sepa-
rately. This option is mutually exclusive with --file-offsets
and --only-matching.
--locale=locale-name
- This option specifies a locale to be used for pattern match-
- ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
- ronment variables. If no locale is specified, the PCRE
- library's default (usually the "C" locale) is used. There is
+ This option specifies a locale to be used for pattern match-
+ ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
+ ronment variables. If no locale is specified, the PCRE
+ library's default (usually the "C" locale) is used. There is
no short form for this option.
-M, --multiline
- Allow patterns to match more than one line. When this option
+ Allow patterns to match more than one line. When this option
is given, patterns may usefully contain literal newline char-
- acters and internal occurrences of ^ and $ characters. The
- output for any one match may consist of more than one line.
- When this option is set, the PCRE library is called in "mul-
- tiline" mode. There is a limit to the number of lines that
- can be matched, imposed by the way that pcregrep buffers the
- input file as it scans it. However, pcregrep ensures that at
+ acters and internal occurrences of ^ and $ characters. The
+ output for any one match may consist of more than one line.
+ When this option is set, the PCRE library is called in "mul-
+ tiline" mode. There is a limit to the number of lines that
+ can be matched, imposed by the way that pcregrep buffers the
+ input file as it scans it. However, pcregrep ensures that at
least 8K characters or the rest of the document (whichever is
- the shorter) are available for forward matching, and simi-
+ the shorter) are available for forward matching, and simi-
larly the previous 8K characters (or all the previous charac-
- ters, if fewer than 8K) are guaranteed to be available for
+ ters, if fewer than 8K) are guaranteed to be available for
lookbehind assertions.
-N newline-type, --newline=newline-type
- The PCRE library supports five different conventions for
- indicating the ends of lines. They are the single-character
- sequences CR (carriage return) and LF (linefeed), the two-
- character sequence CRLF, an "anycrlf" convention, which rec-
- ognizes any of the preceding three types, and an "any" con-
+ The PCRE library supports five different conventions for
+ indicating the ends of lines. They are the single-character
+ sequences CR (carriage return) and LF (linefeed), the two-
+ character sequence CRLF, an "anycrlf" convention, which rec-
+ ognizes any of the preceding three types, and an "any" con-
vention, in which any Unicode line ending sequence is assumed
- to end a line. The Unicode sequences are the three just men-
- tioned, plus VT (vertical tab, U+000B), FF (formfeed,
- U+000C), NEL (next line, U+0085), LS (line separator,
+ to end a line. The Unicode sequences are the three just men-
+ tioned, plus VT (vertical tab, U+000B), FF (formfeed,
+ U+000C), NEL (next line, U+0085), LS (line separator,
U+2028), and PS (paragraph separator, U+2029).
When the PCRE library is built, a default line-ending
- sequence is specified. This is normally the standard
+ sequence is specified. This is normally the standard
sequence for the operating system. Unless otherwise specified
- by this option, pcregrep uses the library's default. The
+ by this option, pcregrep uses the library's default. The
possible values for this option are CR, LF, CRLF, ANYCRLF, or
- ANY. This makes it possible to use pcregrep on files that
- have come from other environments without having to modify
- their line endings. If the data that is being scanned does
- not agree with the convention set by this option, pcregrep
+ ANY. This makes it possible to use pcregrep on files that
+ have come from other environments without having to modify
+ their line endings. If the data that is being scanned does
+ not agree with the convention set by this option, pcregrep
may behave in strange ways.
-n, --line-number
Precede each output line by its line number in the file, fol-
- lowed by a colon for matching lines or a hyphen for context
- lines. If the filename is also being output, it precedes the
+ lowed by a colon for matching lines or a hyphen for context
+ lines. If the filename is also being output, it precedes the
line number. This option is forced if --line-offsets is used.
-o, --only-matching
- Show only the part of the line that matched a pattern. In
- this mode, no context is shown. That is, the -A, -B, and -C
- options are ignored. If there is more than one match in a
- line, each of them is shown separately. If -o is combined
- with -v (invert the sense of the match to find non-matching
- lines), no output is generated, but the return code is set
+ Show only the part of the line that matched a pattern. In
+ this mode, no context is shown. That is, the -A, -B, and -C
+ options are ignored. If there is more than one match in a
+ line, each of them is shown separately. If -o is combined
+ with -v (invert the sense of the match to find non-matching
+ lines), no output is generated, but the return code is set
appropriately. This option is mutually exclusive with --file-
offsets and --line-offsets.
-q, --quiet
Work quietly, that is, display nothing except error messages.
- The exit status indicates whether or not any matches were
+ The exit status indicates whether or not any matches were
found.
-r, --recursive
- If any given path is a directory, recursively scan the files
- it contains, taking note of any --include and --exclude set-
- tings. By default, a directory is read as a normal file; in
- some operating systems this gives an immediate end-of-file.
- This option is a shorthand for setting the -d option to
+ If any given path is a directory, recursively scan the files
+ it contains, taking note of any --include and --exclude set-
+ tings. By default, a directory is read as a normal file; in
+ some operating systems this gives an immediate end-of-file.
+ This option is a shorthand for setting the -d option to
"recurse".
-s, --no-messages
- Suppress error messages about non-existent or unreadable
- files. Such files are quietly skipped. However, the return
+ Suppress error messages about non-existent or unreadable
+ files. Such files are quietly skipped. However, the return
code is still 2, even if matches were found in other files.
-u, --utf-8
- Operate in UTF-8 mode. This option is available only if PCRE
- has been compiled with UTF-8 support. Both patterns and sub-
+ Operate in UTF-8 mode. This option is available only if PCRE
+ has been compiled with UTF-8 support. Both patterns and sub-
ject lines must be valid strings of UTF-8 characters.
-V, --version
- Write the version numbers of pcregrep and the PCRE library
+ Write the version numbers of pcregrep and the PCRE library
that is being used to the standard error stream.
-v, --invert-match
- Invert the sense of the match, so that lines which do not
+ Invert the sense of the match, so that lines which do not
match any of the patterns are the ones that are found.
-w, --word-regex, --word-regexp
@@ -424,38 +425,38 @@ OPTIONS
lent to having \b at the start and end of the pattern.
-x, --line-regex, --line-regexp
- Force the patterns to be anchored (each must start matching
- at the beginning of a line) and in addition, require them to
- match entire lines. This is equivalent to having ^ and $
+ Force the patterns to be anchored (each must start matching
+ at the beginning of a line) and in addition, require them to
+ match entire lines. This is equivalent to having ^ and $
characters at the start and end of each alternative branch in
every pattern.
ENVIRONMENT VARIABLES
- The environment variables LC_ALL and LC_CTYPE are examined, in that
- order, for a locale. The first one that is set is used. This can be
- overridden by the --locale option. If no locale is set, the PCRE
+ The environment variables LC_ALL and LC_CTYPE are examined, in that
+ order, for a locale. The first one that is set is used. This can be
+ overridden by the --locale option. If no locale is set, the PCRE
library's default (usually the "C" locale) is used.
NEWLINES
- The -N (--newline) option allows pcregrep to scan files with different
- newline conventions from the default. However, the setting of this
- option does not affect the way in which pcregrep writes information to
- the standard error and output streams. It uses the string "\n" in C
- printf() calls to indicate newlines, relying on the C I/O library to
- convert this to an appropriate sequence if the output is sent to a
+ The -N (--newline) option allows pcregrep to scan files with different
+ newline conventions from the default. However, the setting of this
+ option does not affect the way in which pcregrep writes information to
+ the standard error and output streams. It uses the string "\n" in C
+ printf() calls to indicate newlines, relying on the C I/O library to
+ convert this to an appropriate sequence if the output is sent to a
file.
OPTIONS COMPATIBILITY
The majority of short and long forms of pcregrep's options are the same
- as in the GNU grep program. Any long option of the form --xxx-regexp
- (GNU terminology) is also available as --xxx-regex (PCRE terminology).
- However, the --locale, -M, --multiline, -u, and --utf-8 options are
+ as in the GNU grep program. Any long option of the form --xxx-regexp
+ (GNU terminology) is also available as --xxx-regex (PCRE terminology).
+ However, the --locale, -M, --multiline, -u, and --utf-8 options are
specific to pcregrep. If both the -c and -l options are given, GNU grep
lists only file names, without counts, but pcregrep gives the counts.
@@ -463,48 +464,48 @@ OPTIONS COMPATIBILITY
OPTIONS WITH DATA
There are four different ways in which an option with data can be spec-
- ified. If a short form option is used, the data may follow immedi-
+ ified. If a short form option is used, the data may follow immedi-
ately, or in the next command line item. For example:
-f/some/file
-f /some/file
- If a long form option is used, the data may appear in the same command
+ If a long form option is used, the data may appear in the same command
line item, separated by an equals character, or (with one exception) it
may appear in the next command line item. For example:
--file=/some/file
--file /some/file
- Note, however, that if you want to supply a file name beginning with ~
- as data in a shell command, and have the shell expand ~ to a home
+ Note, however, that if you want to supply a file name beginning with ~
+ as data in a shell command, and have the shell expand ~ to a home
directory, you must separate the file name from the option, because the
shell does not treat ~ specially unless it is at the start of an item.
- The exception to the above is the --colour (or --color) option, for
- which the data is optional. If this option does have data, it must be
- given in the first form, using an equals character. Otherwise it will
+ The exception to the above is the --colour (or --color) option, for
+ which the data is optional. If this option does have data, it must be
+ given in the first form, using an equals character. Otherwise it will
be assumed that it has no data.
MATCHING ERRORS
- It is possible to supply a regular expression that takes a very long
- time to fail to match certain lines. Such patterns normally involve
- nested indefinite repeats, for example: (a+)*\d when matched against a
- line of a's with no final digit. The PCRE matching function has a
- resource limit that causes it to abort in these circumstances. If this
+ It is possible to supply a regular expression that takes a very long
+ time to fail to match certain lines. Such patterns normally involve
+ nested indefinite repeats, for example: (a+)*\d when matched against a
+ line of a's with no final digit. The PCRE matching function has a
+ resource limit that causes it to abort in these circumstances. If this
happens, pcregrep outputs an error message and the line that caused the
- problem to the standard error stream. If there are more than 20 such
+ problem to the standard error stream. If there are more than 20 such
errors, pcregrep gives up.
DIAGNOSTICS
Exit status is 0 if any matches were found, 1 if no matches were found,
- and 2 for syntax errors and non-existent or inacessible files (even if
- matches were found in other files) or too many matching errors. Using
- the -s option to suppress error messages about inaccessble files does
+ and 2 for syntax errors and non-existent or inacessible files (even if
+ matches were found in other files) or too many matching errors. Using
+ the -s option to suppress error messages about inaccessble files does
not affect the return code.
@@ -522,5 +523,5 @@ AUTHOR
REVISION
- Last updated: 12 August 2009
+ Last updated: 13 September 2009
Copyright (c) 1997-2009 University of Cambridge.
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 87d8eb3..21e3f71 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -1922,7 +1922,7 @@ recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
supports special syntax for recursion of the entire pattern, and also for
individual subpattern recursion. After its introduction in PCRE and Python,
-this kind of recursion was introduced into Perl at release 5.10.
+this kind of recursion was subsequently introduced into Perl at release 5.10.
.P
A special item that consists of (? followed by a number greater than zero and a
closing parenthesis is a recursive call of the subpattern of the given number,
@@ -1930,11 +1930,6 @@ provided that it occurs inside that subpattern. (If not, it is a "subroutine"
call, which is described in the next section.) The special item (?R) or (?0) is
a recursive call of the entire regular expression.
.P
-In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
-treated as an atomic group. That is, once it has matched some of the subject
-string, it is never re-entered, even if it contains untried alternatives and
-there is a subsequent matching failure.
-.P
This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
.sp
@@ -2022,6 +2017,70 @@ different alternatives for the recursive and non-recursive cases. The (?R) item
is the actual recursive call.
.
.
+.\" HTML <a name="recursiondifference"></a>
+.SS "Recursion difference from Perl"
+.rs
+.sp
+In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
+treated as an atomic group. That is, once it has matched some of the subject
+string, it is never re-entered, even if it contains untried alternatives and
+there is a subsequent matching failure. This can be illustrated by the
+following pattern, which purports to match a palindromic string that contains
+an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
+.sp
+ ^(.|(.)(?1)\e2)$
+.sp
+The idea is that it either matches a single character, or two identical
+characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
+it does not if the pattern is longer than three characters. Consider the
+subject string "abcba":
+.P
+At the top level, the first character is matched, but as it is not at the end
+of the string, the first alternative fails; the second alternative is taken
+and the recursion kicks in. The recursive call to subpattern 1 successfully
+matches the next character ("b"). (Note that the beginning and end of line
+tests are not part of the recursion).
+.P
+Back at the top level, the next character ("c") is compared with what
+subpattern 2 matched, which was "a". This fails. Because the recursion is
+treated as an atomic group, there are now no backtracking points, and so the
+entire match fails. (Perl is able, at this point, to re-enter the recursion and
+try the second alternative.) However, if the pattern is written with the
+alternatives in the other order, things are different:
+.sp
+ ^((.)(?1)\e2|.)$
+.sp
+This time, the recursing alternative is tried first, and continues to recurse
+until it runs out of characters, at which point the recursion fails. But this
+time we do have another alternative to try at the higher level. That is the big
+difference: in the previous case the remaining alternative is at a deeper
+recursion level, which PCRE cannot use.
+.P
+To change the pattern so that matches all palindromic strings, not just those
+with an odd number of characters, it is tempting to change the pattern to this:
+.sp
+ ^((.)(?1)\e2|.?)$
+.sp
+Again, this works in Perl, but not in PCRE, and for the same reason. When a
+deeper recursion has matched a single character, it cannot be entered again in
+order to match an empty string. The solution is to separate the two cases, and
+write out the odd and even cases as alternatives at the higher level:
+.sp
+ ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
+.sp
+If you want to match typical palindromic phrases, the pattern has to ignore all
+non-word characters, which can be done like this:
+.sp
+ ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\4|\eW*+.\eW*+))\eW*+$
+.sp
+If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
+man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
+the use of the possessive quantifier *+ to avoid backtracking into sequences of
+non-word characters. Without this, PCRE takes a great deal longer (ten times or
+more) to match typical phrases, and Perl takes so long that you think it has
+gone into a loop.
+.
+.
.\" HTML <a name="subpatternsassubroutines"></a>
.SH "SUBPATTERNS AS SUBROUTINES"
.rs
@@ -2258,6 +2317,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 16 September 2009
+Last updated: 18 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcretest.txt b/doc/pcretest.txt
index f1e8777..46df32a 100644
--- a/doc/pcretest.txt
+++ b/doc/pcretest.txt
@@ -196,87 +196,88 @@ PATTERN MODIFIERS
or \B).
If any call to pcre_exec() in a /g or /G sequence matches an empty
- string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
- flags set in order to search for another, non-empty, match at the same
- point. If this second match fails, the start offset is advanced by
- one, and the normal match is retried. This imitates the way Perl han-
- dles such cases when using the /g modifier or the split() function.
+ string, the next call is done with the PCRE_NOTEMPTY_ATSTART and
+ PCRE_ANCHORED flags set in order to search for another, non-empty,
+ match at the same point. If this second match fails, the start offset
+ is advanced by one character, and the normal match is retried. This
+ imitates the way Perl handles such cases when using the /g modifier or
+ the split() function.
Other modifiers
There are yet more modifiers for controlling the way pcretest operates.
- The /+ modifier requests that as well as outputting the substring that
- matched the entire pattern, pcretest should in addition output the
- remainder of the subject string. This is useful for tests where the
+ The /+ modifier requests that as well as outputting the substring that
+ matched the entire pattern, pcretest should in addition output the
+ remainder of the subject string. This is useful for tests where the
subject contains multiple copies of the same substring.
- The /B modifier is a debugging feature. It requests that pcretest out-
- put a representation of the compiled byte code after compilation. Nor-
- mally this information contains length and offset values; however, if
- /Z is also present, this data is replaced by spaces. This is a special
+ The /B modifier is a debugging feature. It requests that pcretest out-
+ put a representation of the compiled byte code after compilation. Nor-
+ mally this information contains length and offset values; however, if
+ /Z is also present, this data is replaced by spaces. This is a special
feature for use in the automatic test scripts; it ensures that the same
output is generated for different internal link sizes.
- The /L modifier must be followed directly by the name of a locale, for
+ The /L modifier must be followed directly by the name of a locale, for
example,
/pattern/Lfr_FR
For this reason, it must be the last modifier. The given locale is set,
- pcre_maketables() is called to build a set of character tables for the
- locale, and this is then passed to pcre_compile() when compiling the
- regular expression. Without an /L modifier, NULL is passed as the
- tables pointer; that is, /L applies only to the expression on which it
+ pcre_maketables() is called to build a set of character tables for the
+ locale, and this is then passed to pcre_compile() when compiling the
+ regular expression. Without an /L modifier, NULL is passed as the
+ tables pointer; that is, /L applies only to the expression on which it
appears.
- The /I modifier requests that pcretest output information about the
- compiled pattern (whether it is anchored, has a fixed first character,
- and so on). It does this by calling pcre_fullinfo() after compiling a
- pattern. If the pattern is studied, the results of that are also out-
+ The /I modifier requests that pcretest output information about the
+ compiled pattern (whether it is anchored, has a fixed first character,
+ and so on). It does this by calling pcre_fullinfo() after compiling a
+ pattern. If the pattern is studied, the results of that are also out-
put.
- The /D modifier is a PCRE debugging feature, and is equivalent to /BI,
+ The /D modifier is a PCRE debugging feature, and is equivalent to /BI,
that is, both the /B and the /I modifiers.
The /F modifier causes pcretest to flip the byte order of the fields in
- the compiled pattern that contain 2-byte and 4-byte numbers. This
- facility is for testing the feature in PCRE that allows it to execute
+ the compiled pattern that contain 2-byte and 4-byte numbers. This
+ facility is for testing the feature in PCRE that allows it to execute
patterns that were compiled on a host with a different endianness. This
- feature is not available when the POSIX interface to PCRE is being
- used, that is, when the /P pattern modifier is specified. See also the
+ feature is not available when the POSIX interface to PCRE is being
+ used, that is, when the /P pattern modifier is specified. See also the
section about saving and reloading compiled patterns below.
- The /S modifier causes pcre_study() to be called after the expression
+ The /S modifier causes pcre_study() to be called after the expression
has been compiled, and the results used when the expression is matched.
- The /M modifier causes the size of memory block used to hold the com-
+ The /M modifier causes the size of memory block used to hold the com-
piled pattern to be output.
- The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
- rather than its native API. When this is done, all other modifiers
- except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present,
- and REG_NEWLINE is set if /m is present. The wrapper functions force
+ The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
+ rather than its native API. When this is done, all other modifiers
+ except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present,
+ and REG_NEWLINE is set if /m is present. The wrapper functions force
PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.
- The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option
- set. This turns on support for UTF-8 character handling in PCRE, pro-
- vided that it was compiled with this support enabled. This modifier
+ The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option
+ set. This turns on support for UTF-8 character handling in PCRE, pro-
+ vided that it was compiled with this support enabled. This modifier
also causes any non-printing characters in output strings to be printed
using the \x{hh...} notation if they are valid UTF-8 sequences.
- If the /? modifier is used with /8, it causes pcretest to call
- pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the
+ If the /? modifier is used with /8, it causes pcretest to call
+ pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the
checking of the string for UTF-8 validity.
DATA LINES
- Before each data line is passed to pcre_exec(), leading and trailing
- whitespace is removed, and it is then scanned for \ escapes. Some of
- these are pretty esoteric features, intended for checking out some of
- the more complicated features of PCRE. If you are just testing "ordi-
- nary" regular expressions, you probably don't need any of these. The
+ Before each data line is passed to pcre_exec(), leading and trailing
+ whitespace is removed, and it is then scanned for \ escapes. Some of
+ these are pretty esoteric features, intended for checking out some of
+ the more complicated features of PCRE. If you are just testing "ordi-
+ nary" regular expressions, you probably don't need any of these. The
following escapes are recognized:
\a alarm (BEL, \x07)
@@ -323,7 +324,8 @@ DATA LINES
\M discover the minimum MATCH_LIMIT and
MATCH_LIMIT_RECURSION settings
\N pass the PCRE_NOTEMPTY option to pcre_exec()
- or pcre_dfa_exec()
+ or pcre_dfa_exec(); if used twice, pass the
+ PCRE_NOTEMPTY_ATSTART option
\Odd set the size of the output vector passed to
pcre_exec() to dd (any number of digits)
\P pass the PCRE_PARTIAL_SOFT option to pcre_exec()
@@ -351,73 +353,73 @@ DATA LINES
\<any> pass the PCRE_NEWLINE_ANY option to pcre_exec()
or pcre_dfa_exec()
- The escapes that specify line ending sequences are literal strings,
+ The escapes that specify line ending sequences are literal strings,
exactly as shown. No more than one newline setting should be present in
any data line.
- A backslash followed by anything else just escapes the anything else.
- If the very last character is a backslash, it is ignored. This gives a
- way of passing an empty line as data, since a real empty line termi-
+ A backslash followed by anything else just escapes the anything else.
+ If the very last character is a backslash, it is ignored. This gives a
+ way of passing an empty line as data, since a real empty line termi-
nates the data input.
- If \M is present, pcretest calls pcre_exec() several times, with dif-
- ferent values in the match_limit and match_limit_recursion fields of
- the pcre_extra data structure, until it finds the minimum numbers for
+ If \M is present, pcretest calls pcre_exec() several times, with dif-
+ ferent values in the match_limit and match_limit_recursion fields of
+ the pcre_extra data structure, until it finds the minimum numbers for
each parameter that allow pcre_exec() to complete. The match_limit num-
- ber is a measure of the amount of backtracking that takes place, and
+ ber is a measure of the amount of backtracking that takes place, and
checking it out can be instructive. For most simple matches, the number
- is quite small, but for patterns with very large numbers of matching
- possibilities, it can become large very quickly with increasing length
+ is quite small, but for patterns with very large numbers of matching
+ possibilities, it can become large very quickly with increasing length
of subject string. The match_limit_recursion number is a measure of how
- much stack (or, if PCRE is compiled with NO_RECURSE, how much heap)
+ much stack (or, if PCRE is compiled with NO_RECURSE, how much heap)
memory is needed to complete the match attempt.
- When \O is used, the value specified may be higher or lower than the
+ When \O is used, the value specified may be higher or lower than the
size set by the -O command line option (or defaulted to 45); \O applies
only to the call of pcre_exec() for the line in which it appears.
- If the /P modifier was present on the pattern, causing the POSIX wrap-
- per API to be used, the only option-setting sequences that have any
- effect are \B and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively,
+ If the /P modifier was present on the pattern, causing the POSIX wrap-
+ per API to be used, the only option-setting sequences that have any
+ effect are \B and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively,
to be passed to regexec().
- The use of \x{hh...} to represent UTF-8 characters is not dependent on
- the use of the /8 modifier on the pattern. It is recognized always.
- There may be any number of hexadecimal digits inside the braces. The
- result is from one to six bytes, encoded according to the original
- UTF-8 rules of RFC 2279. This allows for values in the range 0 to
- 0x7FFFFFFF. Note that not all of those are valid Unicode code points,
- or indeed valid UTF-8 characters according to the later rules in RFC
+ The use of \x{hh...} to represent UTF-8 characters is not dependent on
+ the use of the /8 modifier on the pattern. It is recognized always.
+ There may be any number of hexadecimal digits inside the braces. The
+ result is from one to six bytes, encoded according to the original
+ UTF-8 rules of RFC 2279. This allows for values in the range 0 to
+ 0x7FFFFFFF. Note that not all of those are valid Unicode code points,
+ or indeed valid UTF-8 characters according to the later rules in RFC
3629.
THE ALTERNATIVE MATCHING FUNCTION
- By default, pcretest uses the standard PCRE matching function,
+ By default, pcretest uses the standard PCRE matching function,
pcre_exec() to match each data line. From release 6.0, PCRE supports an
- alternative matching function, pcre_dfa_test(), which operates in a
- different way, and has some restrictions. The differences between the
+ alternative matching function, pcre_dfa_test(), which operates in a
+ different way, and has some restrictions. The differences between the
two functions are described in the pcrematching documentation.
- If a data line contains the \D escape sequence, or if the command line
- contains the -dfa option, the alternative matching function is called.
+ If a data line contains the \D escape sequence, or if the command line
+ contains the -dfa option, the alternative matching function is called.
This function finds all possible matches at a given point. If, however,
- the \F escape sequence is present in the data line, it stops after the
+ the \F escape sequence is present in the data line, it stops after the
first match is found. This is always the shortest possible match.
DEFAULT OUTPUT FROM PCRETEST
- This section describes the output when the normal matching function,
+ This section describes the output when the normal matching function,
pcre_exec(), is being used.
When a match succeeds, pcretest outputs the list of captured substrings
- that pcre_exec() returns, starting with number 0 for the string that
- matched the whole pattern. Otherwise, it outputs "No match" or "Partial
- match:" followed by the partially matching substring when pcre_exec()
- returns PCRE_ERROR_NOMATCH or PCRE_ERROR_PARTIAL, respectively, and
- otherwise the PCRE negative error number. Here is an example of an
- interactive pcretest run.
+ that pcre_exec() returns, starting with number 0 for the string that
+ matched the whole pattern. Otherwise, it outputs "No match" when the
+ return is PCRE_ERROR_NOMATCH, and "Partial match:" followed by the par-
+ tially matching substring when pcre_exec() returns PCRE_ERROR_PARTIAL.
+ For any other returns, it outputs the PCRE negative error number. Here
+ is an example of an interactive pcretest run.
$ pcretest
PCRE version 7.0 30-Nov-2006
@@ -429,11 +431,11 @@ DEFAULT OUTPUT FROM PCRETEST
data> xyz
No match
- Note that unset capturing substrings that are not followed by one that
- is set are not returned by pcre_exec(), and are not shown by pcretest.
- In the following example, there are two capturing substrings, but when
- the first data line is matched, the second, unset substring is not
- shown. An "internal" unset substring is shown as "<unset>", as for the
+ Note that unset capturing substrings that are not followed by one that
+ is set are not returned by pcre_exec(), and are not shown by pcretest.
+ In the following example, there are two capturing substrings, but when
+ the first data line is matched, the second, unset substring is not
+ shown. An "internal" unset substring is shown as "<unset>", as for the
second data line.
re> /(a)|(b)/
@@ -445,11 +447,11 @@ DEFAULT OUTPUT FROM PCRETEST
1: <unset>
2: b
- If the strings contain any non-printing characters, they are output as
- \0x escapes, or as \x{...} escapes if the /8 modifier was present on
- the pattern. See below for the definition of non-printing characters.
- If the pattern has the /+ modifier, the output for substring 0 is fol-
- lowed by the the rest of the subject string, identified by "0+" like
+ If the strings contain any non-printing characters, they are output as
+ \0x escapes, or as \x{...} escapes if the /8 modifier was present on
+ the pattern. See below for the definition of non-printing characters.
+ If the pattern has the /+ modifier, the output for substring 0 is fol-
+ lowed by the the rest of the subject string, identified by "0+" like
this:
re> /cat/+
@@ -457,7 +459,7 @@ DEFAULT OUTPUT FROM PCRETEST
0: cat
0+ aract
- If the pattern has the /g or /G modifier, the results of successive
+ If the pattern has the /g or /G modifier, the results of successive
matching attempts are output in sequence, like this:
re> /\Bi(\w\w)/g
@@ -471,24 +473,24 @@ DEFAULT OUTPUT FROM PCRETEST
"No match" is output only if the first match attempt fails.
- If any of the sequences \C, \G, or \L are present in a data line that
- is successfully matched, the substrings extracted by the convenience
+ If any of the sequences \C, \G, or \L are present in a data line that
+ is successfully matched, the substrings extracted by the convenience
functions are output with C, G, or L after the string number instead of
a colon. This is in addition to the normal full list. The string length
- (that is, the return from the extraction function) is given in paren-
+ (that is, the return from the extraction function) is given in paren-
theses after each string for \C and \G.
Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), data lines may not. However new-
- lines can be included in data by means of the \n escape (or \r, \r\n,
+ lines can be included in data by means of the \n escape (or \r, \r\n,
etc., depending on the newline sequence setting).
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
- When the alternative matching function, pcre_dfa_exec(), is used (by
- means of the \D escape sequence or the -dfa command line option), the
- output consists of a list of all the matches that start at the first
+ When the alternative matching function, pcre_dfa_exec(), is used (by
+ means of the \D escape sequence or the -dfa command line option), the
+ output consists of a list of all the matches that start at the first
point in the subject where there is at least one match. For example:
re> /(tang|tangerine|tan)/
@@ -497,8 +499,8 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang
2: tan
- (Using the normal matching function on this data finds only "tang".)
- The longest matching string is always given first (and numbered zero).
+ (Using the normal matching function on this data finds only "tang".)
+ The longest matching string is always given first (and numbered zero).
After a PCRE_ERROR_PARTIAL return, the output is "Partial match:", fol-
lowed by the partially matching substring.
@@ -514,16 +516,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan
0: tan
- Since the matching function does not support substring capture, the
- escape sequences that are concerned with captured substrings are not
+ Since the matching function does not support substring capture, the
+ escape sequences that are concerned with captured substrings are not
relevant.
RESTARTING AFTER A PARTIAL MATCH
When the alternative matching function has given the PCRE_ERROR_PARTIAL
- return, indicating that the subject partially matched the pattern, you
- can restart the match with additional subject data by means of the \R
+ return, indicating that the subject partially matched the pattern, you
+ can restart the match with additional subject data by means of the \R
escape sequence. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -532,30 +534,30 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\R\D
0: n05
- For further information about partial matching, see the pcrepartial
+ For further information about partial matching, see the pcrepartial
documentation.
CALLOUTS
- If the pattern contains any callout requests, pcretest's callout func-
- tion is called during matching. This works with both matching func-
+ If the pattern contains any callout requests, pcretest's callout func-
+ tion is called during matching. This works with both matching func-
tions. By default, the called function displays the callout number, the
- start and current positions in the text at the callout time, and the
+ start and current positions in the text at the callout time, and the
next pattern item to be tested. For example, the output
--->pqrabcdef
0 ^ ^ \d
- indicates that callout number 0 occurred for a match attempt starting
- at the fourth character of the subject string, when the pointer was at
- the seventh character of the data, and when the next pattern item was
- \d. Just one circumflex is output if the start and current positions
+ indicates that callout number 0 occurred for a match attempt starting
+ at the fourth character of the subject string, when the pointer was at
+ the seventh character of the data, and when the next pattern item was
+ \d. Just one circumflex is output if the start and current positions
are the same.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
- a result of the /C pattern modifier. In this case, instead of showing
- the callout number, the offset in the pattern, preceded by a plus, is
+ a result of the /C pattern modifier. In this case, instead of showing
+ the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
re> /\d?[A-E]\*/C
@@ -567,86 +569,86 @@ CALLOUTS
+10 ^ ^
0: E*
- The callout function in pcretest returns zero (carry on matching) by
- default, but you can use a \C item in a data line (as described above)
+ The callout function in pcretest returns zero (carry on matching) by
+ default, but you can use a \C item in a data line (as described above)
to change this.
- Inserting callouts can be helpful when using pcretest to check compli-
- cated regular expressions. For further information about callouts, see
+ Inserting callouts can be helpful when using pcretest to check compli-
+ cated regular expressions. For further information about callouts, see
the pcrecallout documentation.
NON-PRINTING CHARACTERS
- When pcretest is outputting text in the compiled version of a pattern,
- bytes other than 32-126 are always treated as non-printing characters
+ When pcretest is outputting text in the compiled version of a pattern,
+ bytes other than 32-126 are always treated as non-printing characters
are are therefore shown as hex escapes.
- When pcretest is outputting text that is a matched part of a subject
- string, it behaves in the same way, unless a different locale has been
- set for the pattern (using the /L modifier). In this case, the
+ When pcretest is outputting text that is a matched part of a subject
+ string, it behaves in the same way, unless a different locale has been
+ set for the pattern (using the /L modifier). In this case, the
isprint() function to distinguish printing and non-printing characters.
SAVING AND RELOADING COMPILED PATTERNS
- The facilities described in this section are not available when the
+ The facilities described in this section are not available when the
POSIX inteface to PCRE is being used, that is, when the /P pattern mod-
ifier is specified.
When the POSIX interface is not in use, you can cause pcretest to write
- a compiled pattern to a file, by following the modifiers with > and a
+ a compiled pattern to a file, by following the modifiers with > and a
file name. For example:
/pattern/im >/some/file
- See the pcreprecompile documentation for a discussion about saving and
+ See the pcreprecompile documentation for a discussion about saving and
re-using compiled patterns.
- The data that is written is binary. The first eight bytes are the
- length of the compiled pattern data followed by the length of the
- optional study data, each written as four bytes in big-endian order
- (most significant byte first). If there is no study data (either the
+ The data that is written is binary. The first eight bytes are the
+ length of the compiled pattern data followed by the length of the
+ optional study data, each written as four bytes in big-endian order
+ (most significant byte first). If there is no study data (either the
pattern was not studied, or studying did not return any data), the sec-
- ond length is zero. The lengths are followed by an exact copy of the
+ ond length is zero. The lengths are followed by an exact copy of the
compiled pattern. If there is additional study data, this follows imme-
- diately after the compiled pattern. After writing the file, pcretest
+ diately after the compiled pattern. After writing the file, pcretest
expects to read a new pattern.
A saved pattern can be reloaded into pcretest by specifing < and a file
- name instead of a pattern. The name of the file must not contain a <
- character, as otherwise pcretest will interpret the line as a pattern
+ name instead of a pattern. The name of the file must not contain a <
+ character, as otherwise pcretest will interpret the line as a pattern
delimited by < characters. For example:
re> </some/file
Compiled regex loaded from /some/file
No study data
- When the pattern has been loaded, pcretest proceeds to read data lines
+ When the pattern has been loaded, pcretest proceeds to read data lines
in the usual way.
- You can copy a file written by pcretest to a different host and reload
- it there, even if the new host has opposite endianness to the one on
- which the pattern was compiled. For example, you can compile on an i86
+ You can copy a file written by pcretest to a different host and reload
+ it there, even if the new host has opposite endianness to the one on
+ which the pattern was compiled. For example, you can compile on an i86
machine and run on a SPARC machine.
- File names for saving and reloading can be absolute or relative, but
- note that the shell facility of expanding a file name that starts with
+ File names for saving and reloading can be absolute or relative, but
+ note that the shell facility of expanding a file name that starts with
a tilde (~) is not available.
- The ability to save and reload files in pcretest is intended for test-
- ing and experimentation. It is not intended for production use because
- only a single pattern can be written to a file. Furthermore, there is
- no facility for supplying custom character tables for use with a
- reloaded pattern. If the original pattern was compiled with custom
- tables, an attempt to match a subject string using a reloaded pattern
- is likely to cause pcretest to crash. Finally, if you attempt to load
+ The ability to save and reload files in pcretest is intended for test-
+ ing and experimentation. It is not intended for production use because
+ only a single pattern can be written to a file. Furthermore, there is
+ no facility for supplying custom character tables for use with a
+ reloaded pattern. If the original pattern was compiled with custom
+ tables, an attempt to match a subject string using a reloaded pattern
+ is likely to cause pcretest to crash. Finally, if you attempt to load
a file that is not in the correct format, the result is undefined.
SEE ALSO
- pcre(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcrepartial(d),
+ pcre(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcrepartial(d),
pcrepattern(3), pcreprecompile(3).
@@ -659,5 +661,5 @@ AUTHOR
REVISION
- Last updated: 29 August 2009
+ Last updated: 11 September 2009
Copyright (c) 1997-2009 University of Cambridge.
diff --git a/testdata/testinput11 b/testdata/testinput11
index 1a08cd1..6f1203b 100644
--- a/testdata/testinput11
+++ b/testdata/testinput11
@@ -252,4 +252,22 @@
** Failers
AD
+/^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$/i
+ 1221
+ Satan, oscillate my metallic sonatas!
+ A man, a plan, a canal: Panama!
+ Able was I ere I saw Elba.
+ *** Failers
+ The quick brown fox
+
+/^((.)(?1)\2|.)$/
+ a
+ aba
+ aabaa
+ abcdcba
+ pqaabaaqp
+ ablewasiereisawelba
+ rhubarb
+ the quick brown fox
+
/-- End of testinput11 --/
diff --git a/testdata/testinput2 b/testdata/testinput2
index ec5d621..fbc7671 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -1133,14 +1133,6 @@
/(a(?1)+b)/DZ
-/^\W*(?:((.)\W*(?1)\W*\2|)|((.)\W*(?3)\W*\4|\W*.\W*))\W*$/Ii
- 1221
- Satan, oscillate my metallic sonatas!
- A man, a plan, a canal: Panama!
- Able was I ere I saw Elba.
- *** Failers
- The quick brown fox
-
/^(\d+|\((?1)([+*-])(?1)\)|-(?1))$/I
12
(((2+2)*-3)-7)
diff --git a/testdata/testoutput11 b/testdata/testoutput11
index f7b54ac..9a421aa 100644
--- a/testdata/testoutput11
+++ b/testdata/testoutput11
@@ -520,4 +520,61 @@ No match
AD
No match
+/^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$/i
+ 1221
+ 0: 1221
+ 1: 1221
+ 2: 1
+ Satan, oscillate my metallic sonatas!
+ 0: Satan, oscillate my metallic sonatas!
+ 1: <unset>
+ 2: <unset>
+ 3: Satan, oscillate my metallic sonatas
+ 4: S
+ A man, a plan, a canal: Panama!
+ 0: A man, a plan, a canal: Panama!
+ 1: <unset>
+ 2: <unset>
+ 3: A man, a plan, a canal: Panama
+ 4: A
+ Able was I ere I saw Elba.
+ 0: Able was I ere I saw Elba.
+ 1: <unset>
+ 2: <unset>
+ 3: Able was I ere I saw Elba
+ 4: A
+ *** Failers
+No match
+ The quick brown fox
+No match
+
+/^((.)(?1)\2|.)$/
+ a
+ 0: a
+ 1: a
+ aba
+ 0: aba
+ 1: aba
+ 2: a
+ aabaa
+ 0: aabaa
+ 1: aabaa
+ 2: a
+ abcdcba
+ 0: abcdcba
+ 1: abcdcba
+ 2: a
+ pqaabaaqp
+ 0: pqaabaaqp
+ 1: pqaabaaqp
+ 2: p
+ ablewasiereisawelba
+ 0: ablewasiereisawelba
+ 1: ablewasiereisawelba
+ 2: a
+ rhubarb
+No match
+ the quick brown fox
+No match
+
/-- End of testinput11 --/
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index 67301ea..fc83759 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -3949,39 +3949,6 @@ No options
First char = 'a'
Need char = 'b'
-/^\W*(?:((.)\W*(?1)\W*\2|)|((.)\W*(?3)\W*\4|\W*.\W*))\W*$/Ii
-Capturing subpattern count = 4
-Max back reference = 4
-Options: anchored caseless
-No first char
-No need char
- 1221
- 0: 1221
- 1: 1221
- 2: 1
- Satan, oscillate my metallic sonatas!
- 0: Satan, oscillate my metallic sonatas!
- 1: <unset>
- 2: <unset>
- 3: Satan, oscillate my metallic sonatas
- 4: S
- A man, a plan, a canal: Panama!
- 0: A man, a plan, a canal: Panama!
- 1: <unset>
- 2: <unset>
- 3: A man, a plan, a canal: Panama
- 4: A
- Able was I ere I saw Elba.
- 0: Able was I ere I saw Elba.
- 1: <unset>
- 2: <unset>
- 3: Able was I ere I saw Elba
- 4: A
- *** Failers
-No match
- The quick brown fox
-No match
-
/^(\d+|\((?1)([+*-])(?1)\)|-(?1))$/I
Capturing subpattern count = 2
Options: anchored