author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-09-01 16:10:16 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-09-01 16:10:16 +0000
commit dbbbb5f823318e1665b0ae6fb2f7c12d71f66e84
tree 031067dfea13ed6f72701de1b45c97fec9e0c455
parent 78dd3d00385e8c5f868d07b3fe21694299a183fe
Add pcredemo man page, containing a listing of pcredemo.c.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@429 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rwxr-xr-x  132html                        17
-rw-r--r--  ChangeLog                       4
-rwxr-xr-x  PrepareRelease                 44
-rw-r--r--  README                          4
-rw-r--r--  doc/html/index.html            21
-rw-r--r--  doc/html/pcre.html             13
-rw-r--r--  doc/html/pcre_dfa_exec.html     9
-rw-r--r--  doc/html/pcre_exec.html        13
-rw-r--r--  doc/html/pcre_fullinfo.html     1
-rw-r--r--  doc/html/pcreapi.html          95
-rw-r--r--  doc/html/pcrecompat.html        6
-rw-r--r--  doc/html/pcredemo.html        354
-rw-r--r--  doc/html/pcregrep.html         30
-rw-r--r--  doc/html/pcrematching.html     12
-rw-r--r--  doc/html/pcrepartial.html     322
-rw-r--r--  doc/html/pcreposix.html         7
-rw-r--r--  doc/html/pcresample.html       22
-rw-r--r--  doc/html/pcretest.html         17
-rw-r--r--  doc/index.html.src              5
-rw-r--r--  doc/pcre.3                     13
-rw-r--r--  doc/pcre.txt                 1516
-rw-r--r--  doc/pcreapi.3                  16
-rw-r--r--  doc/pcredemo.3                352
-rw-r--r--  doc/pcregrep.txt              362
-rw-r--r--  doc/pcresample.3               28
-rw-r--r--  doc/pcretest.txt              152
26 files changed, 2275 insertions, 1160 deletions
diff --git a/132html b/132html
index 43d1358..062babc 100755
--- a/132html
+++ b/132html
@@ -231,6 +231,23 @@ while (<STDIN>)
$_ = "$one $two";
redo; # Process the joined lines
}
+
+ # .EX/.EE are used in the pcredemo page to bracket the entire program,
+ # which is unmodified except for turning backslash into "\e".
+
+ elsif (/^\.EX\s*$/)
+ {
+ print TEMP "<PRE>\n";
+ while (<STDIN>)
+ {
+ last if /^\.EE\s*$/;
+ s/\\e/\\/g;
+ s/&/&amp;/g;
+ s/</&lt;/g;
+ s/>/&gt;/g;
+ print TEMP;
+ }
+ }
# Ignore anything not recognized
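
The .EX/.EE branch added to 132html above can be read as a per-line transformation: undo the troff "\e" escape, then make the line safe for a <PRE> block. The following is a minimal C sketch of that transformation, not code from this commit. The function name ex_line_to_html and its buffer-based signature are assumptions made for illustration. It works in a single pass, which should give the same result as the four Perl substitutions applied in order, because the later substitutions never touch a backslash.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert one line found between .EX and .EE into HTML-safe text.
   "\e" becomes "\", and "&", "<", ">" become entities.
   Returns the output length, or -1 if the buffer is too small.
   The name and signature are illustrative; they are not part of PCRE. */
int ex_line_to_html(const char *in, char *out, size_t outsize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);

    while (*in != '\0') {
        const char *rep;
        char single[2];
        size_t len;

        if (in[0] == '\\' && in[1] == 'e') {
            rep = "\\";              /* troff "\e" back to one backslash */
            in += 2;
        } else {
            switch (*in) {
            case '&': rep = "&amp;"; break;
            case '<': rep = "&lt;";  break;
            case '>': rep = "&gt;";  break;
            default:
                single[0] = *in;     /* ordinary character: copy as is */
                single[1] = '\0';
                rep = single;
                break;
            }
            in++;
        }

        len = strlen(rep);
        if (n + len + 1 > outsize) return -1;   /* keep room for the NUL */
        memcpy(out + n, rep, len);
        n += len;
    }

    if (n >= outsize) return -1;
    out[n] = '\0';
    return (int)n;
}
```

Called on the line `#include <stdio.h>`, this sketch produces `#include &lt;stdio.h&gt;`, which matches the form the listing takes in doc/html/pcredemo.html.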
diff --git a/ChangeLog b/ChangeLog
index 850b081..cd16d76 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -72,6 +72,10 @@ Version 8.00 ??-???-??
"g". If the first part-match was for the string "dog", restarting with
"sbody" failed.
+13. Added a pcredemo man page, created automatically from the pcredemo.c file,
+ so that the demonstration program is easily available in environments where
+ PCRE has not been installed from source.
+
Version 7.9 11-Apr-09
---------------------
diff --git a/PrepareRelease b/PrepareRelease
index a4f3485..de63e78 100755
--- a/PrepareRelease
+++ b/PrepareRelease
@@ -8,8 +8,8 @@
# following files:
# 132html A Perl script that converts a .1 or .3 man page into HTML. It
-# is called from MakeRelease. It "knows" the relevant troff
-# constructs that are used in the PCRE man pages.
+# "knows" the relevant troff constructs that are used in the PCRE
+# man pages.
# CleanTxt A Perl script that cleans up the output of "nroff -man" by
# removing backspaces and other redundant text so as to produce
@@ -37,8 +37,9 @@ cat <<End >pcre.txt
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
-synopses of each function in the library have not been included. There are
-separate text files for the pcregrep and pcretest commands.
+synopses of each function in the library have not been included. Neither has
+the pcredemo program. There are separate text files for the pcregrep and
+pcretest commands.
-----------------------------------------------------------------------------
@@ -68,6 +69,41 @@ for file in pcretest pcregrep pcre-config ; do
done
+# Make pcredemo.3 from the pcredemo.c source file
+
+echo "Making pcredemo.3"
+perl <<"END" >pcredemo.3
+ open(IN, "../pcredemo.c") || die "Failed to open pcredemo.c\n";
+ open(OUT, ">pcredemo.3") || die "Failed to open pcredemo.3\n";
+ print OUT ".\\\" Start example.\n" .
+ ".de EX\n" .
+ ". nr mE \\\\n(.f\n" .
+ ". nf\n" .
+ ". nh\n" .
+ ". ft CW\n" .
+ "..\n" .
+ ".\n" .
+ ".\n" .
+ ".\\\" End example.\n" .
+ ".de EE\n" .
+ ". ft \\\\n(mE\n" .
+ ". fi\n" .
+ ". hy \\\\n(HY\n" .
+ "..\n" .
+ ".\n" .
+ ".EX\n" ;
+ while (<IN>)
+ {
+ s/\\/\\e/g;
+ print OUT;
+ }
+ print OUT ".EE\n";
+ close(IN);
+ close(OUT);
+END
+if [ $? != 0 ] ; then exit 1; fi
+
+
# Make HTML form of the documentation.
echo "Making HTML documentation"
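
The embedded Perl script above copies pcredemo.c into pcredemo.3 and rewrites every backslash as "\e", so that troff prints a literal backslash instead of reading an escape sequence. The following C sketch shows the same substitution on a single line. The function name troff_escape_backslashes and its signature are hypothetical, chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Copy src to dst, writing "\e" for every backslash, as the
   s/\\/\\e/g substitution in PrepareRelease does.
   Returns the output length, or -1 if dst is too small.
   The name and signature are illustrative only. */
int troff_escape_backslashes(const char *src, char *dst, size_t dstsize)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL);

    for (; *src != '\0'; src++) {
        size_t need = (*src == '\\') ? 2 : 1;

        if (n + need + 1 > dstsize) return -1;  /* keep room for the NUL */
        if (*src == '\\') {
            dst[n++] = '\\';
            dst[n++] = 'e';
        } else {
            dst[n++] = *src;
        }
    }

    if (n >= dstsize) return -1;
    dst[n] = '\0';
    return (int)n;
}
```

Under this sketch the C source line `printf("\n");` becomes `printf("\en");` in the man page. The .EX/.EE handling in 132html reverses that step when it generates the HTML.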
diff --git a/README b/README
index 2a7411d..4936352 100644
--- a/README
+++ b/README
@@ -712,7 +712,7 @@ The distribution should contain the following files:
) "configure" and config.h
depcomp ) script to find program dependencies, generated by
) automake
- doc/*.3 man page sources for the PCRE functions
+ doc/*.3 man page sources for PCRE
doc/*.1 man page sources for pcregrep and pcretest
doc/index.html.src the base HTML page
doc/html/* HTML documentation
@@ -765,4 +765,4 @@ The distribution should contain the following files:
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 15 August 2009
+Last updated: 01 September 2009
diff --git a/doc/html/index.html b/doc/html/index.html
index 8a7174e..58dfe45 100644
--- a/doc/html/index.html
+++ b/doc/html/index.html
@@ -1,10 +1,10 @@
<html>
-<!-- This is a manually maintained file that is the root of the HTML version of
- the PCRE documentation. When the HTML documents are built from the man
- page versions, the entire doc/html directory is emptied, this file is then
- copied into doc/html/index.html, and the remaining files therein are
+<!-- This is a manually maintained file that is the root of the HTML version of
+ the PCRE documentation. When the HTML documents are built from the man
+ page versions, the entire doc/html directory is emptied, this file is then
+ copied into doc/html/index.html, and the remaining files therein are
created by the 132html script.
--->
+-->
<head>
<title>PCRE specification</title>
</head>
@@ -36,6 +36,9 @@ The HTML documentation for PCRE comprises the following pages:
<tr><td><a href="pcrecpp.html">pcrecpp</a></td>
<td>&nbsp;&nbsp;The C++ wrapper for the PCRE library</td></tr>
+<tr><td><a href="pcredemo.html">pcredemo</a></td>
+ <td>&nbsp;&nbsp;A demonstration C program that uses the PCRE library</td></tr>
+
<tr><td><a href="pcregrep.html">pcregrep</a></td>
<td>&nbsp;&nbsp;The <b>pcregrep</b> command</td></tr>
@@ -58,7 +61,7 @@ The HTML documentation for PCRE comprises the following pages:
<td>&nbsp;&nbsp;How to save and re-use compiled patterns</td></tr>
<tr><td><a href="pcresample.html">pcresample</a></td>
- <td>&nbsp;&nbsp;Description of the sample program</td></tr>
+ <td>&nbsp;&nbsp;Discussion of the pcredemo program</td></tr>
<tr><td><a href="pcrestack.html">pcrestack</a></td>
<td>&nbsp;&nbsp;Discussion of PCRE's stack usage</td></tr>
@@ -71,11 +74,11 @@ The HTML documentation for PCRE comprises the following pages:
</table>
<p>
-There are also individual pages that summarize the interface for each function
+There are also individual pages that summarize the interface for each function
in the library:
</p>
-<table>
+<table>
<tr><td><a href="pcre_compile.html">pcre_compile</a></td>
<td>&nbsp;&nbsp;Compile a regular expression</td></tr>
@@ -126,7 +129,7 @@ in the library:
<tr><td><a href="pcre_maketables.html">pcre_maketables</a></td>
<td>&nbsp;&nbsp;Build character tables in current locale</td></tr>
-
+
<tr><td><a href="pcre_refcount.html">pcre_refcount</a></td>
<td>&nbsp;&nbsp;Maintain reference count in compiled pattern</td></tr>
diff --git a/doc/html/pcre.html b/doc/html/pcre.html
index 5e2a036..bfb4e97 100644
--- a/doc/html/pcre.html
+++ b/doc/html/pcre.html
@@ -30,8 +30,8 @@ support for certain .NET and Oniguruma syntax items, and there is an option for
requesting some minor changes that give better JavaScript compatibility.
</P>
<P>
-The current implementation of PCRE (release 7.x) corresponds approximately with
-Perl 5.10, including support for UTF-8 encoded strings and Unicode general
+The current implementation of PCRE (release 8.xx) corresponds approximately
+with Perl 5.10, including support for UTF-8 encoded strings and Unicode general
category properties. However, UTF-8 and Unicode support has to be explicitly
enabled; it is not the default. The Unicode tables correspond to Unicode
release 5.1.
@@ -88,8 +88,8 @@ not exported.
The user documentation for PCRE comprises a number of different sections. In
the "man" format, each of these is a separate "man page". In the HTML format,
each is a separate page, linked from the index page. In the plain text format,
-all the sections are concatenated, for ease of searching. The sections are as
-follows:
+all the sections, except the <b>pcredemo</b> section, are concatenated, for ease
+of searching. The sections are as follows:
<pre>
pcre this document
pcre-config show PCRE installation configuration information
@@ -98,6 +98,7 @@ follows:
pcrecallout details of the callout feature
pcrecompat discussion of Perl compatibility
pcrecpp details of the C++ wrapper
+ pcredemo a demonstration C program that uses PCRE
pcregrep description of the <b>pcregrep</b> command
pcrematching discussion of the two matching algorithms
pcrepartial details of the partial matching facility
@@ -106,7 +107,7 @@ follows:
pcreperform discussion of performance issues
pcreposix the POSIX-compatible C API
pcreprecompile details of saving and re-using precompiled patterns
- pcresample discussion of the sample program
+ pcresample discussion of the pcredemo program
pcrestack discussion of stack usage
pcretest description of the <b>pcretest</b> testing command
</pre>
@@ -297,7 +298,7 @@ two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 April 2009
+Last updated: 01 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcre_dfa_exec.html b/doc/html/pcre_dfa_exec.html
index a243ee8..a242f75 100644
--- a/doc/html/pcre_dfa_exec.html
+++ b/doc/html/pcre_dfa_exec.html
@@ -63,14 +63,19 @@ The options are:
PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
validity (only relevant if PCRE_UTF8
was set at compile time)
- PCRE_PARTIAL Return PCRE_ERROR_PARTIAL for a partial match
+ PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial match
+ PCRE_PARTIAL_SOFT ) if no full matches are found
+ PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
+ even if there is a full match as well
PCRE_DFA_SHORTEST Return only the shortest match
PCRE_DFA_RESTART This is a restart after a partial match
</pre>
There are restrictions on what may appear in a pattern when using this matching
function. Details are given in the
<a href="pcrematching.html"><b>pcrematching</b></a>
-documentation.
+documentation. For details of partial matching, see the
+<a href="pcrepartial.html"><b>pcrepartial</b></a>
+page.
</P>
<P>
A <b>pcre_extra</b> structure contains the following fields:
diff --git a/doc/html/pcre_exec.html b/doc/html/pcre_exec.html
index ef43830..ccc3db1 100644
--- a/doc/html/pcre_exec.html
+++ b/doc/html/pcre_exec.html
@@ -59,15 +59,14 @@ The options are:
PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
validity (only relevant if PCRE_UTF8
was set at compile time)
- PCRE_PARTIAL Return PCRE_ERROR_PARTIAL for a partial match
+ PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial match
+ PCRE_PARTIAL_SOFT ) if no full matches are found
+ PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
+ even if there is a full match as well
</pre>
-There are restrictions on what may appear in a pattern when partial matching is
-requested. For details, see the
+For details of partial matching, see the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
-page.
-</P>
-<P>
-A <b>pcre_extra</b> structure contains the following fields:
+page. A <b>pcre_extra</b> structure contains the following fields:
<pre>
<i>flags</i> Bits indicating which fields are set
<i>study_data</i> Opaque data from <b>pcre_study()</b>
diff --git a/doc/html/pcre_fullinfo.html b/doc/html/pcre_fullinfo.html
index 48fddf5..3ec75f9 100644
--- a/doc/html/pcre_fullinfo.html
+++ b/doc/html/pcre_fullinfo.html
@@ -49,6 +49,7 @@ The following information is available:
PCRE_INFO_NAMEENTRYSIZE Size of name table entry
PCRE_INFO_NAMETABLE Pointer to name table
PCRE_INFO_OKPARTIAL Return 1 if partial matching can be tried
+ (always returns 1 after release 8.00)
PCRE_INFO_OPTIONS Option bits used for compilation
PCRE_INFO_SIZE Size of compiled pattern
PCRE_INFO_STUDYSIZE Size of study data
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html
index 91273de..fe08e74 100644
--- a/doc/html/pcreapi.html
+++ b/doc/html/pcreapi.html
@@ -164,8 +164,10 @@ Applications can use these to include support for different releases of PCRE.
The functions <b>pcre_compile()</b>, <b>pcre_compile2()</b>, <b>pcre_study()</b>,
and <b>pcre_exec()</b> are used for compiling and matching regular expressions
in a Perl-compatible manner. A sample program that demonstrates the simplest
-way of using them is provided in the file called <i>pcredemo.c</i> in the source
-distribution. The
+way of using them is provided in the file called <i>pcredemo.c</i> in the PCRE
+source distribution. A listing of this program is given in the
+<a href="pcredemo.html"><b>pcredemo</b></a>
+documentation, and the
<a href="pcresample.html"><b>pcresample</b></a>
documentation describes how to compile and run it.
</P>
@@ -1016,10 +1018,11 @@ different for each compiled pattern.
PCRE_INFO_OKPARTIAL
</pre>
Return 1 if the pattern can be used for partial matching, otherwise 0. The
-fourth argument should point to an <b>int</b> variable. The
+fourth argument should point to an <b>int</b> variable. From release 8.00, this
+always returns 1, because the restrictions that previously applied to partial
+matching have been lifted. The
<a href="pcrepartial.html"><b>pcrepartial</b></a>
-documentation lists the restrictions that apply to patterns when partial
-matching is used.
+documentation gives details of partial matching.
<pre>
PCRE_INFO_OPTIONS
</pre>
@@ -1246,7 +1249,7 @@ Option bits for <b>pcre_exec()</b>
The unused bits of the <i>options</i> argument for <b>pcre_exec()</b> must be
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_<i>xxx</i>,
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE,
-PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.
+PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.
<pre>
PCRE_ANCHORED
</pre>
@@ -1336,7 +1339,9 @@ when using the /g modifier. It is possible to emulate Perl's behaviour after
matching a null string by first trying the match again at the same offset with
PCRE_NOTEMPTY and PCRE_ANCHORED, and then if that fails by advancing the
starting offset (see below) and trying an ordinary match again. There is some
-code that demonstrates how to do this in the <i>pcredemo.c</i> sample program.
+code that demonstrates how to do this in the
+<a href="pcredemo.html"><b>pcredemo</b></a>
+sample program.
<pre>
PCRE_NO_START_OPTIMIZE
</pre>
@@ -1373,15 +1378,19 @@ PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 string as a
subject, or a value of <i>startoffset</i> that does not point to the start of a
UTF-8 character, is undefined. Your program may crash.
<pre>
- PCRE_PARTIAL
-</pre>
-This option turns on the partial matching feature. If the subject string fails
-to match the pattern, but at some point during the matching process the end of
-the subject was reached (that is, the subject partially matches the pattern and
-the failure to match occurred only because there were not enough subject
-characters), <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL instead of
-PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is used, there are restrictions on what
-may appear in the pattern. These are discussed in the
+ PCRE_PARTIAL_HARD
+ PCRE_PARTIAL_SOFT
+</pre>
+These options turn on the partial matching feature. For backwards
+compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match
+occurs if the end of the subject string is reached successfully, but there are
+not enough subject characters to complete the match. If this happens when
+PCRE_PARTIAL_HARD is set, <b>pcre_exec()</b> immediately returns
+PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, matching continues
+by testing any other alternatives. Only if they all fail is PCRE_ERROR_PARTIAL
+returned (instead of PCRE_ERROR_NOMATCH). The portion of the string that
+provided the partial match is set as the first matching string. There is a more
+detailed discussion in the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
documentation.
</P>
@@ -1582,10 +1591,10 @@ documentation for details of partial matching.
<pre>
PCRE_ERROR_BADPARTIAL (-13)
</pre>
-The PCRE_PARTIAL option was used with a compiled pattern containing items that
-are not supported for partial matching. See the
-<a href="pcrepartial.html"><b>pcrepartial</b></a>
-documentation for details of partial matching.
+This code is no longer in use. It was formerly returned when the PCRE_PARTIAL
+option was used with a compiled pattern containing items that were not
+supported for partial matching. From release 8.00 onwards, there are no
+restrictions on partial matching.
<pre>
PCRE_ERROR_INTERNAL (-14)
</pre>
@@ -1871,19 +1880,24 @@ Option bits for <b>pcre_dfa_exec()</b>
<P>
The unused bits of the <i>options</i> argument for <b>pcre_dfa_exec()</b> must be
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_<i>xxx</i>,
-PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL,
-PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last three of these are
-the same as for <b>pcre_exec()</b>, so their description is not repeated here.
-<pre>
- PCRE_PARTIAL
-</pre>
-This has the same general effect as it does for <b>pcre_exec()</b>, but the
-details are slightly different. When PCRE_PARTIAL is set for
-<b>pcre_dfa_exec()</b>, the return code PCRE_ERROR_NOMATCH is converted into
-PCRE_ERROR_PARTIAL if the end of the subject is reached, there have been no
-complete matches, but there is still at least one matching possibility. The
-portion of the string that provided the partial match is set as the first
-matching string.
+PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD,
+PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
+four of these are exactly the same as for <b>pcre_exec()</b>, so their
+description is not repeated here.
+<pre>
+ PCRE_PARTIAL_HARD
+ PCRE_PARTIAL_SOFT
+</pre>
+These have the same general effect as they do for <b>pcre_exec()</b>, but the
+details are slightly different. When PCRE_PARTIAL_HARD is set for
+<b>pcre_dfa_exec()</b>, it returns PCRE_ERROR_PARTIAL if the end of the subject
+is reached and there is still at least one matching possibility that requires
+additional characters. This happens even if some complete matches have also
+been found. When PCRE_PARTIAL_SOFT is set, the return code PCRE_ERROR_NOMATCH
+is converted into PCRE_ERROR_PARTIAL if the end of the subject is reached,
+there have been no complete matches, but there is still at least one matching
+possibility. The portion of the string that provided the longest partial match
+is set as the first matching string in both cases.
<pre>
PCRE_DFA_SHORTEST
</pre>
@@ -1894,13 +1908,12 @@ matching point in the subject string.
<pre>
PCRE_DFA_RESTART
</pre>
-When <b>pcre_dfa_exec()</b> is called with the PCRE_PARTIAL option, and returns
-a partial match, it is possible to call it again, with additional subject
-characters, and have it continue with the same match. The PCRE_DFA_RESTART
-option requests this action; when it is set, the <i>workspace</i> and
-<i>wscount</i> options must reference the same vector as before because data
-about the match so far is left in them after a partial match. There is more
-discussion of this facility in the
+When <b>pcre_dfa_exec()</b> returns a partial match, it is possible to call it
+again, with additional subject characters, and have it continue with the same
+match. The PCRE_DFA_RESTART option requests this action; when it is set, the
+<i>workspace</i> and <i>wscount</i> options must reference the same vector as
+before because data about the match so far is left in them after a partial
+match. There is more discussion of this facility in the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
documentation.
</P>
@@ -1996,7 +2009,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 April 2009
+Last updated: 01 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrecompat.html b/doc/html/pcrecompat.html
index d1b93d0..5567c27 100644
--- a/doc/html/pcrecompat.html
+++ b/doc/html/pcrecompat.html
@@ -19,7 +19,7 @@ DIFFERENCES BETWEEN PCRE AND PERL
This document describes the differences in the ways that PCRE and Perl handle
regular expressions. The differences described here are mainly with respect to
Perl 5.8, though PCRE versions 7.0 and later contain some features that are
-expected to be in the forthcoming Perl 5.10.
+in Perl 5.10.
</P>
<P>
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details of what
@@ -170,9 +170,9 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 11 September 2007
+Last updated: 25 August 2009
<br>
-Copyright &copy; 1997-2007 University of Cambridge.
+Copyright &copy; 1997-2009 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/doc/html/pcredemo.html b/doc/html/pcredemo.html
new file mode 100644
index 0000000..57b4d1d
--- /dev/null
+++ b/doc/html/pcredemo.html
@@ -0,0 +1,354 @@
+<html>
+<head>
+<title>pcredemo specification</title>
+</head>
+<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
+<h1>pcredemo man page</h1>
+<p>
+Return to the <a href="index.html">PCRE index page</a>.
+</p>
+<p>
+This page is part of the PCRE HTML documentation. It was generated automatically
+from the original man page. If there is any nonsense in it, please consult the
+man page, in case the conversion went wrong.
+<br>
+<ul>
+</ul>
+<PRE>
+/*************************************************
+* PCRE DEMONSTRATION PROGRAM *
+*************************************************/
+
+/* This is a demonstration program to illustrate the most straightforward ways
+of calling the PCRE regular expression library from a C program. See the
+pcresample documentation for a short discussion ("man pcresample" if you have
+the PCRE man pages installed).
+
+In Unix-like environments, compile this program thuswise:
+
+ gcc -Wall pcredemo.c -I/usr/local/include -L/usr/local/lib \
+ -R/usr/local/lib -lpcre
+
+Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and
+library files for PCRE are installed on your system. You don't need -I and -L
+if PCRE is installed in the standard system libraries. Only some operating
+systems (e.g. Solaris) use the -R option.
+
+Building under Windows:
+
+If you want to statically link this program against a non-dll .a file, you must
+define PCRE_STATIC before including pcre.h, otherwise the pcre_malloc() and
+pcre_free() exported functions will be declared __declspec(dllimport), with
+unwanted results. So in this environment, uncomment the following line. */
+
+/* #define PCRE_STATIC */
+
+#include &lt;stdio.h&gt;
+#include &lt;string.h&gt;
+#include &lt;pcre.h&gt;
+
+#define OVECCOUNT 30 /* should be a multiple of 3 */
+
+
+int main(int argc, char **argv)
+{
+pcre *re;
+const char *error;
+char *pattern;
+char *subject;
+unsigned char *name_table;
+int erroffset;
+int find_all;
+int namecount;
+int name_entry_size;
+int ovector[OVECCOUNT];
+int subject_length;
+int rc, i;
+
+
+/**************************************************************************
+* First, sort out the command line. There is only one possible option at *
+* the moment, "-g" to request repeated matching to find all occurrences, *
+* like Perl's /g option. We set the variable find_all to a non-zero value *
+* if the -g option is present. Apart from that, there must be exactly two *
+* arguments. *
+**************************************************************************/
+
+find_all = 0;
+for (i = 1; i &lt; argc; i++)
+ {
+ if (strcmp(argv[i], "-g") == 0) find_all = 1;
+ else break;
+ }
+
+/* After the options, we require exactly two arguments, which are the pattern,
+and the subject string. */
+
+if (argc - i != 2)
+ {
+ printf("Two arguments required: a regex and a subject string\n");
+ return 1;
+ }
+
+pattern = argv[i];
+subject = argv[i+1];
+subject_length = (int)strlen(subject);
+
+
+/*************************************************************************
+* Now we are going to compile the regular expression pattern, and handle *
+* any errors that are detected. *
+*************************************************************************/
+
+re = pcre_compile(
+ pattern, /* the pattern */
+ 0, /* default options */
+ &amp;error, /* for error message */
+ &amp;erroffset, /* for error offset */
+ NULL); /* use default character tables */
+
+/* Compilation failed: print the error message and exit */
+
+if (re == NULL)
+ {
+ printf("PCRE compilation failed at offset %d: %s\n", erroffset, error);
+ return 1;
+ }
+
+
+/*************************************************************************
+* If the compilation succeeded, we call PCRE again, in order to do a *
+* pattern match against the subject string. This does just ONE match. If *
+* further matching is needed, it will be done below. *
+*************************************************************************/
+
+rc = pcre_exec(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ subject, /* the subject string */
+ subject_length, /* the length of the subject */
+ 0, /* start at offset 0 in the subject */
+ 0, /* default options */
+ ovector, /* output vector for substring information */
+ OVECCOUNT); /* number of elements in the output vector */
+
+/* Matching failed: handle error cases */
+
+if (rc &lt; 0)
+ {
+ switch(rc)
+ {
+ case PCRE_ERROR_NOMATCH: printf("No match\n"); break;
+ /*
+ Handle other special cases if you like
+ */
+ default: printf("Matching error %d\n", rc); break;
+ }
+ pcre_free(re); /* Release memory used for the compiled pattern */
+ return 1;
+ }
+
+/* Match succeeded */
+
+printf("\nMatch succeeded at offset %d\n", ovector[0]);
+
+
+/*************************************************************************
+* We have found the first match within the subject string. If the output *
+* vector wasn't big enough, say so. Then output any substrings that were *
+* captured. *
+*************************************************************************/
+
+/* The output vector wasn't big enough */
+
+if (rc == 0)
+ {
+ rc = OVECCOUNT/3;
+ printf("ovector only has room for %d captured substrings\n", rc - 1);
+ }
+
+/* Show substrings stored in the output vector by number. Obviously, in a real
+application you might want to do things other than print them. */
+
+for (i = 0; i &lt; rc; i++)
+ {
+ char *substring_start = subject + ovector[2*i];
+ int substring_length = ovector[2*i+1] - ovector[2*i];
+ printf("%2d: %.*s\n", i, substring_length, substring_start);
+ }
+
+
+/**************************************************************************
+* That concludes the basic part of this demonstration program. We have *
+* compiled a pattern, and performed a single match. The code that follows *
+* shows first how to access named substrings, and then how to code for *
+* repeated matches on the same subject. *
+**************************************************************************/
+
+/* See if there are any named substrings, and if so, show them by name. First
+we have to extract the count of named parentheses from the pattern. */
+
+(void)pcre_fullinfo(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ PCRE_INFO_NAMECOUNT, /* number of named substrings */
+ &amp;namecount); /* where to put the answer */
+
+if (namecount &lt;= 0) printf("No named substrings\n"); else
+ {
+ unsigned char *tabptr;
+ printf("Named substrings\n");
+
+ /* Before we can access the substrings, we must extract the table for
+ translating names to numbers, and the size of each entry in the table. */
+
+ (void)pcre_fullinfo(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ PCRE_INFO_NAMETABLE, /* address of the table */
+ &amp;name_table); /* where to put the answer */
+
+ (void)pcre_fullinfo(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ PCRE_INFO_NAMEENTRYSIZE, /* size of each entry in the table */
+ &amp;name_entry_size); /* where to put the answer */
+
+ /* Now we can scan the table and, for each entry, print the number, the name,
+ and the substring itself. */
+
+ tabptr = name_table;
+ for (i = 0; i &lt; namecount; i++)
+ {
+ int n = (tabptr[0] &lt;&lt; 8) | tabptr[1];
+ printf("(%d) %*s: %.*s\n", n, name_entry_size - 3, tabptr + 2,
+ ovector[2*n+1] - ovector[2*n], subject + ovector[2*n]);
+ tabptr += name_entry_size;
+ }
+ }
+
+
+/*************************************************************************
+* If the "-g" option was given on the command line, we want to continue *
+* to search for additional matches in the subject string, in a similar *
+* way to the /g option in Perl. This turns out to be trickier than you *
+* might think because of the possibility of matching an empty string. *
+* What happens is as follows: *
+* *
+* If the previous match was NOT for an empty string, we can just start *
+* the next match at the end of the previous one. *
+* *
+* If the previous match WAS for an empty string, we can't do that, as it *
+* would lead to an infinite loop. Instead, a special call of pcre_exec() *
+* is made with the PCRE_NOTEMPTY and PCRE_ANCHORED flags set. The first *
+* of these tells PCRE that an empty string is not a valid match; other *
+* possibilities must be tried. The second flag restricts PCRE to one *
+* match attempt at the initial string position. If this match succeeds, *
+* an alternative to the empty string match has been found, and we can *
+* proceed round the loop. *
+*************************************************************************/
+
+if (!find_all)
+ {
+ pcre_free(re); /* Release the memory used for the compiled pattern */
+ return 0; /* Finish unless -g was given */
+ }
+
+/* Loop for second and subsequent matches */
+
+for (;;)
+ {
+ int options = 0; /* Normally no options */
+ int start_offset = ovector[1]; /* Start at end of previous match */
+
+ /* If the previous match was for an empty string, we are finished if we are
+ at the end of the subject. Otherwise, arrange to run another match at the
+ same point to see if a non-empty match can be found. */
+
+ if (ovector[0] == ovector[1])
+ {
+ if (ovector[0] == subject_length) break;
+ options = PCRE_NOTEMPTY | PCRE_ANCHORED;
+ }
+
+ /* Run the next matching operation */
+
+ rc = pcre_exec(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ subject, /* the subject string */
+ subject_length, /* the length of the subject */
+ start_offset, /* starting offset in the subject */
+ options, /* options */
+ ovector, /* output vector for substring information */
+ OVECCOUNT); /* number of elements in the output vector */
+
+ /* This time, a result of NOMATCH isn't an error. If the value in "options"
+ is zero, it just means we have found all possible matches, so the loop ends.
+ Otherwise, it means we have failed to find a non-empty-string match at a
+ point where there was a previous empty-string match. In this case, we do what
+ Perl does: advance the matching position by one, and continue. We do this by
+ setting the "end of previous match" offset, because that is picked up at the
+ top of the loop as the point at which to start again. */
+
+ if (rc == PCRE_ERROR_NOMATCH)
+ {
+ if (options == 0) break;
+ ovector[1] = start_offset + 1;
+ continue; /* Go round the loop again */
+ }
+
+ /* Other matching errors are not recoverable. */
+
+ if (rc &lt; 0)
+ {
+ printf("Matching error %d\n", rc);
+ pcre_free(re); /* Release memory used for the compiled pattern */
+ return 1;
+ }
+
+ /* Match succeeded */
+
+ printf("\nMatch succeeded again at offset %d\n", ovector[0]);
+
+ /* The match succeeded, but the output vector wasn't big enough. */
+
+ if (rc == 0)
+ {
+ rc = OVECCOUNT/3;
+ printf("ovector only has room for %d captured substrings\n", rc - 1);
+ }
+
+ /* As before, show substrings stored in the output vector by number, and then
+ also any named substrings. */
+
+ for (i = 0; i &lt; rc; i++)
+ {
+ char *substring_start = subject + ovector[2*i];
+ int substring_length = ovector[2*i+1] - ovector[2*i];
+ printf("%2d: %.*s\n", i, substring_length, substring_start);
+ }
+
+ if (namecount &lt;= 0) printf("No named substrings\n"); else
+ {
+ unsigned char *tabptr = name_table;
+ printf("Named substrings\n");
+ for (i = 0; i &lt; namecount; i++)
+ {
+ int n = (tabptr[0] &lt;&lt; 8) | tabptr[1];
+ printf("(%d) %*s: %.*s\n", n, name_entry_size - 3, tabptr + 2,
+ ovector[2*n+1] - ovector[2*n], subject + ovector[2*n]);
+ tabptr += name_entry_size;
+ }
+ }
+ } /* End of loop to find second and subsequent matches */
+
+printf("\n");
+pcre_free(re); /* Release memory used for the compiled pattern */
+return 0;
+}
+
+/* End of pcredemo.c */
+<p>
+Return to the <a href="index.html">PCRE index page</a>.
+</p>
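The named-substring loop in the listing above relies on the layout of the name table: each entry is name_entry_size bytes long, beginning with a two-byte group number (most significant byte first) followed by the zero-terminated name. As a minimal sketch of that decoding, assuming a hand-built table instead of one obtained from PCRE, the lookup might be written as below. The helper names lookup_group_number and demo_lookup are hypothetical and are not part of the PCRE API.

```c
#include <string.h>

/* Hypothetical helper, not part of PCRE: find the capture group number for
   a name in a table laid out as in the listing above. Each entry occupies
   entry_size bytes: two bytes of group number (most significant first),
   then the zero-terminated name. Returns -1 if the name is not present. */
static int lookup_group_number(const unsigned char *table, int count,
                               int entry_size, const char *name)
{
  int i;
  for (i = 0; i < count; i++)
    {
    const unsigned char *entry = table + i * entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];
    }
  return -1;
}

/* Assumption for illustration: a two-entry table built by hand, with an
   entry size of 8 bytes (2 for the number, up to 5 for the name, 1 for
   the terminating zero). A real program would use the table and sizes
   that it obtained from the compiled pattern. */
static int demo_lookup(const char *name)
{
  static const unsigned char table[] = {
    0, 2, 'd', 'a', 'y', 0, 0, 0,
    0, 1, 'm', 'o', 'n', 't', 'h', 0
  };
  return lookup_group_number(table, 2, 8, name);
}
```

With the number n in hand, the matched text is found exactly as in the listing, from ovector[2*n] and ovector[2*n+1].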
diff --git a/doc/html/pcregrep.html b/doc/html/pcregrep.html
index 13e45d9..5256a67 100644
--- a/doc/html/pcregrep.html
+++ b/doc/html/pcregrep.html
@@ -119,6 +119,12 @@ standard input is always so treated.
</P>
<br><a name="SEC4" href="#TOC1">OPTIONS</a><br>
<P>
+The order in which some of the options appear can affect the output. For
+example, both the <b>-h</b> and <b>-l</b> options affect the printing of file
+names. Whichever comes later in the command line will be the one that takes
+effect.
+</P>
+<P>
<b>--</b>
 This terminates the list of options. It is useful if the next item on the
command line starts with a hyphen but is not an option. This allows for the
@@ -149,10 +155,13 @@ This is equivalent to setting both <b>-A</b> and <b>-B</b> to the same value.
</P>
<P>
<b>-c</b>, <b>--count</b>
-Do not output individual lines; instead just output a count of the number of
-lines that would otherwise have been output. If several files are given, a
-count is output for each of them. In this mode, the <b>-A</b>, <b>-B</b>, and
-<b>-C</b> options are ignored.
+Do not output individual lines from the files that are being scanned; instead
+output the number of lines that would otherwise have been shown. If no lines
+are selected, the number zero is output. If several files are being
+scanned, a count is output for each of them. However, if the
+<b>--files-with-matches</b> option is also used, only those files whose counts
+are greater than zero are listed. When <b>-c</b> is used, the <b>-A</b>,
+<b>-B</b>, and <b>-C</b> options are ignored.
</P>
<P>
<b>--colour</b>, <b>--color</b>
@@ -316,8 +325,11 @@ output once, on a separate line.
<b>-l</b>, <b>--files-with-matches</b>
Instead of outputting lines from the files, just output the names of the files
containing lines that would have been output. Each file name is output
-once, on a separate line. Searching stops as soon as a matching line is found
-in a file.
+once, on a separate line. Searching normally stops as soon as a matching line
+is found in a file. However, if the <b>-c</b> (count) option is also used,
+matching continues in order to obtain the correct count, and those files that
+have at least one match are listed along with their counts. Using this option
+with <b>-c</b> is a way of suppressing the listing of files with no matches.
</P>
<P>
<b>--label</b>=<i>name</i>
@@ -462,7 +474,9 @@ The majority of short and long forms of <b>pcregrep</b>'s options are the same
as in the GNU <b>grep</b> program. Any long option of the form
<b>--xxx-regexp</b> (GNU terminology) is also available as <b>--xxx-regex</b>
(PCRE terminology). However, the <b>--locale</b>, <b>-M</b>, <b>--multiline</b>,
-<b>-u</b>, and <b>--utf-8</b> options are specific to <b>pcregrep</b>.
+<b>-u</b>, and <b>--utf-8</b> options are specific to <b>pcregrep</b>. If both the
+<b>-c</b> and <b>-l</b> options are given, GNU grep lists only file names,
+without counts, but <b>pcregrep</b> gives the counts.
</P>
<br><a name="SEC8" href="#TOC1">OPTIONS WITH DATA</a><br>
<P>
@@ -524,7 +538,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC13" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 01 March 2009
+Last updated: 12 August 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrematching.html b/doc/html/pcrematching.html
index 2cad88b..2a60581 100644
--- a/doc/html/pcrematching.html
+++ b/doc/html/pcrematching.html
@@ -177,13 +177,7 @@ match using the standard algorithm, you have to do kludgy things with
callouts.
</P>
<P>
-2. There is much better support for partial matching. The restrictions on the
-content of the pattern that apply when using the standard algorithm for partial
-matching do not apply to the alternative algorithm. For non-anchored patterns,
-the starting position of a partial match is available.
-</P>
-<P>
-3. Because the alternative algorithm scans the subject string just once, and
+2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack, it is possible to pass very long subject strings to
the matching function in several pieces, checking for partial matching each
time.
@@ -215,9 +209,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 19 April 2008
+Last updated: 25 August 2009
<br>
-Copyright &copy; 1997-2008 University of Cambridge.
+Copyright &copy; 1997-2009 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/doc/html/pcrepartial.html b/doc/html/pcrepartial.html
index 1fab23c..3801320 100644
--- a/doc/html/pcrepartial.html
+++ b/doc/html/pcrepartial.html
@@ -14,11 +14,16 @@ man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
-<li><a name="TOC2" href="#SEC2">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a>
-<li><a name="TOC3" href="#SEC3">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
-<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
-<li><a name="TOC5" href="#SEC5">AUTHOR</a>
-<li><a name="TOC6" href="#SEC6">REVISION</a>
+<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec()</a>
+<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec()</a>
+<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
+<li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a>
+<li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
+<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
+<li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec()</a>
+<li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a>
+<li><a name="TOC10" href="#SEC10">AUTHOR</a>
+<li><a name="TOC11" href="#SEC11">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
<P>
@@ -37,78 +42,155 @@ in the form <i>ddmmmyy</i>, defined by this pattern:
</pre>
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
-as soon as a mistake is made, possibly beeping and not reflecting the
-character that has been typed. This immediate feedback is likely to be a better
+as soon as a mistake is made, by beeping and not reflecting the character that
+has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
-entered.
+entered. Partial matching can also sometimes be useful when the subject string
+is very long and is not all available at once.
</P>
<P>
-PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
-option, which can be set when calling <b>pcre_exec()</b> or
-<b>pcre_dfa_exec()</b>. When this flag is set for <b>pcre_exec()</b>, the return
-code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
-during the matching process the last part of the subject string matched part of
-the pattern. Unfortunately, for non-anchored matching, it is not possible to
-obtain the position of the start of the partial match. No captured data is set
-when PCRE_ERROR_PARTIAL is returned.
+PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
+PCRE_PARTIAL_HARD options, which can be set when calling <b>pcre_exec()</b> or
+<b>pcre_dfa_exec()</b>. For backwards compatibility, PCRE_PARTIAL is a synonym
+for PCRE_PARTIAL_SOFT. The essential difference between the two options is
+whether or not a partial match is preferred to an alternative complete match,
+though the details differ between the two matching functions. If both options
+are set, PCRE_PARTIAL_HARD takes precedence.
</P>
<P>
-When PCRE_PARTIAL is set for <b>pcre_dfa_exec()</b>, the return code
-PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
-subject is reached, there have been no complete matches, but there is still at
-least one matching possibility. The portion of the string that provided the
-partial match is set as the first matching string.
+Setting a partial matching option disables one of PCRE's optimizations. PCRE
+remembers the last literal byte in a pattern, and abandons matching immediately
+if such a byte is not present in the subject string. This optimization cannot
+be used for a subject string that might match only partially.
</P>
+<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br>
<P>
-Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
-last literal byte in a pattern, and abandons matching immediately if such a
-byte is not present in the subject string. This optimization cannot be used
-for a subject string that might match only partially.
+A partial match occurs during a call to <b>pcre_exec()</b> whenever the end of
+the subject string is reached successfully, but matching cannot continue
+because more characters are needed. However, at least one character must have
+been matched. (In other words, a partial match can never be an empty string.)
</P>
-<br><a name="SEC2" href="#TOC1">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a><br>
<P>
-Because of the way certain internal optimizations are implemented in the
-<b>pcre_exec()</b> function, the PCRE_PARTIAL option cannot be used with all
-patterns. These restrictions do not apply when <b>pcre_dfa_exec()</b> is used.
-For <b>pcre_exec()</b>, repeated single characters such as
+If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
+continues as normal, and other alternatives in the pattern are tried. If no
+complete match can be found, <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL
+instead of PCRE_ERROR_NOMATCH, and if there are at least two slots in the
+offsets vector, they are filled in with the offsets of the longest string that
+partially matched. Consider this pattern:
<pre>
- a{2,4}
+ /123\w+X|dogY/
</pre>
-and repeated single metasequences such as
+If this is matched against the subject string "abc123dog", both
+alternatives fail to match, but the end of the subject is reached during
+matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
+offsets are set to 3 and 9, identifying "123dog" as the longest partial match
+that was found. (In this example, there are two partial matches, because "dog"
+on its own partially matches the second alternative.)
+</P>
+<P>
+If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns
+PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
+search for possible complete matches. The difference between the two options
+can be illustrated by a pattern such as:
+<pre>
+ /dog(sbody)?/
+</pre>
+This matches either "dog" or "dogsbody", greedily (that is, it prefers the
+longer string if possible). If it is matched against the string "dog" with
+PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
+PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
+if the pattern is made ungreedy the result is different:
+<pre>
+ /dog(sbody)??/
+</pre>
+In this case the result is always a complete match because <b>pcre_exec()</b>
+finds that first, and it never continues after finding a match. It might be
+easier to follow this explanation by thinking of the two patterns like this:
+<pre>
+ /dog(sbody)?/ is the same as /dogsbody|dog/
+ /dog(sbody)??/ is the same as /dog|dogsbody/
+</pre>
+The second pattern will never match "dogsbody" when <b>pcre_exec()</b> is
+used, because it will always find the shorter match first.
+</P>
+<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec()</a><br>
+<P>
+The <b>pcre_dfa_exec()</b> function moves along the subject string character by
+character, without backtracking, searching for all possible matches
+simultaneously. If the end of the subject is reached before the end of the
+pattern, there is the possibility of a partial match, again provided that at
+least one character has matched.
+</P>
+<P>
+When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
+have been no complete matches. Otherwise, the complete matches are returned.
+However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
+complete matches. The portion of the string that provided the longest partial
+match is set as the first matching string, provided there are at least two
+slots in the offsets vector.
+</P>
+<P>
+Because <b>pcre_dfa_exec()</b> always searches for all possible matches, and
+there is no difference between greedy and ungreedy repetition, its behaviour is
+different from <b>pcre_exec()</b> when PCRE_PARTIAL_HARD is set. Consider the
+string "dog" matched against the ungreedy pattern shown above:
<pre>
- \d+
+ /dog(sbody)??/
</pre>
-are not permitted if the maximum number of occurrences is greater than one.
-Optional items such as \d? (where the maximum is one) are permitted.
-Quantifiers with any values are permitted after parentheses, so the invalid
-examples above can be coded thus:
+Whereas <b>pcre_exec()</b> stops as soon as it finds the complete match for
+"dog", <b>pcre_dfa_exec()</b> also finds the partial match for "dogsbody", and
+so returns that when PCRE_PARTIAL_HARD is set.
+</P>
+<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
+<P>
+If a pattern ends with one of the sequences \b or \B, which test for word
+boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
+results. Consider this pattern:
<pre>
- (a){2,4}
- (\d)+
+ /\bcat\b/
</pre>
-These constructions run more slowly, but for the kinds of application that are
-envisaged for this facility, this is not felt to be a major restriction.
+This matches "cat", provided there is a word boundary at either end. If the
+subject string is "the cat", the comparison of the final "t" with a following
+character cannot take place, so a partial match is found. However,
+<b>pcre_exec()</b> carries on with normal matching, which matches \b at the end
+of the subject when the last character is a letter, thus finding a complete
+match. The result, therefore, is <i>not</i> PCRE_ERROR_PARTIAL. The same thing
+happens with <b>pcre_dfa_exec()</b>, because it also finds the complete match.
</P>
<P>
-If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
-<b>pcre_exec()</b> returns the error code PCRE_ERROR_BADPARTIAL (-13).
-You can use the PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b> to find out
-if a compiled pattern can be used for partial matching.
+Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
+then the partial match takes precedence.
</P>
-<br><a name="SEC3" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
+<br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br>
+<P>
+For releases of PCRE prior to 8.00, because of the way certain internal
+optimizations were implemented in the <b>pcre_exec()</b> function, the
+PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
+all patterns. From release 8.00 onwards, the restrictions no longer apply, and
+partial matching with <b>pcre_exec()</b> can be requested for any pattern.
+</P>
+<P>
+Items that were formerly restricted were repeated single characters and
+repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
+conform to the restrictions, <b>pcre_exec()</b> returned the error code
+PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
+PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b>, which reports whether a
+compiled pattern can be used for partial matching, now always returns 1.
+</P>
+<br><a name="SEC6" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
<P>
If the escape sequence \P is present in a <b>pcretest</b> data line, the
-PCRE_PARTIAL flag is used for the match. Here is a run of <b>pcretest</b> that
-uses the date example quoted above:
+PCRE_PARTIAL_SOFT option is used for the match. Here is a run of <b>pcretest</b>
+that uses the date example quoted above:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25jun04\P
0: 25jun04
1: jun
data&#62; 25dec3\P
- Partial match
+ Partial match: 25dec3
data&#62; 3ju\P
- Partial match
+ Partial match: 3ju
data&#62; 3juj\P
No match
data&#62; j\P
@@ -116,34 +198,23 @@ uses the date example quoted above:
</pre>
The first data string is matched completely, so <b>pcretest</b> shows the
matched substrings. The remaining four strings do not match the complete
-pattern, but the first two are partial matches. The same test, using
-<b>pcre_dfa_exec()</b> matching (by means of the \D escape sequence), produces
-the following output:
-<pre>
- re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
- data&#62; 25jun04\P\D
- 0: 25jun04
- data&#62; 23dec3\P\D
- Partial match: 23dec3
- data&#62; 3ju\P\D
- Partial match: 3ju
- data&#62; 3juj\P\D
- No match
- data&#62; j\P\D
- No match
-</pre>
-Notice that in this case the portion of the string that was matched is made
-available.
+pattern, but the first two are partial matches. Similar output is obtained
+when <b>pcre_dfa_exec()</b> is used.
</P>
-<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
+<P>
+If the escape sequence \P is present more than once in a <b>pcretest</b> data
+line, the PCRE_PARTIAL_HARD option is set for the match.
+</P>
+<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
<P>
When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
to continue the match by providing additional subject data and calling
<b>pcre_dfa_exec()</b> again with the same compiled regular expression, this
-time setting the PCRE_DFA_RESTART option. You must also pass the same working
+time setting the PCRE_DFA_RESTART option. You must pass the same working
space as before, because this is where details of the previous partial match
are stored. Here is an example using <b>pcretest</b>, using the \R escape
-sequence to set the PCRE_DFA_RESTART option (\P and \D are as above):
+sequence to set the PCRE_DFA_RESTART option (\D specifies the use of
+<b>pcre_dfa_exec()</b>):
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 23ja\P\D
@@ -158,10 +229,34 @@ not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
</P>
<P>
-You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
-over multiple segments. This facility can be used to pass very long subject
-strings to <b>pcre_dfa_exec()</b>. However, some care is needed for certain
-types of pattern.
+You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
+PCRE_DFA_RESTART to continue partial matching over multiple segments. This
+facility can be used to pass very long subject strings to
+<b>pcre_dfa_exec()</b>.
+</P>
+<br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec()</a><br>
+<P>
+From release 8.00, <b>pcre_exec()</b> can also be used to do multi-segment
+matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the
+previous match with a new segment of data. Instead, new data must be added to
+the previous subject string, and the entire match re-run, starting from the
+point where the partial match occurred. Earlier data can be discarded.
+Consider an unanchored pattern that matches dates:
+<pre>
+ re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
+ data&#62; The date is 23ja\P
+ Partial match: 23ja
+</pre>
+At this stage, an application could discard the text preceding "23ja", add on
+text from the next segment, and call <b>pcre_exec()</b> again. Unlike
+<b>pcre_dfa_exec()</b>, the entire matching string must always be available, and
+the complete matching process occurs for each call, so more memory and more
+processing time are needed.
+</P>
+<br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
+<P>
+Certain types of pattern may give problems with multi-segment matching,
+whichever matching function is used.
</P>
<P>
1. If the pattern contains tests for the beginning or end of a line, you need
@@ -170,21 +265,26 @@ subject string for any call does not contain the beginning or end of a line.
</P>
<P>
2. If the pattern contains backward assertions (including \b or \B), you need
-to arrange for some overlap in the subject strings to allow for this. For
-example, you could pass the subject in chunks that are 500 bytes long, but in
-a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
-bytes at the start of the buffer.
+to arrange for some overlap in the subject strings to allow for them to be
+correctly tested at the start of each substring. For example, using
+<b>pcre_dfa_exec()</b>, you could pass the subject in chunks that are 500 bytes
+long, but in a buffer of 700 bytes, with the starting offset set to 200 and the
+previous 200 bytes at the start of the buffer.
</P>
<P>
-3. Matching a subject string that is split into multiple segments does not
-always produce exactly the same result as matching over one single long string.
-The difference arises when there are multiple matching possibilities, because a
-partial match result is given only when there are no completed matches in a
-call to <b>pcre_dfa_exec()</b>. This means that as soon as the shortest match has
+3. Matching a subject string that is split into multiple segments may not
+always produce exactly the same result as matching over one single long string,
+especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
+Word Boundaries" above describes an issue that arises if the pattern ends with
+\b or \B. Another kind of difference may occur when there are multiple
+matching possibilities, because a partial match result is given only when there
+are no completed matches. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possible.
-Consider this <b>pcretest</b> example:
+Consider again this <b>pcretest</b> example:
<pre>
re&#62; /dog(sbody)?/
+ data&#62; dogsb\P
+ 0: dog
data&#62; do\P\D
Partial match: do
data&#62; gsb\R\P\D
@@ -193,26 +293,40 @@ Consider this <b>pcretest</b> example:
0: dogsbody
1: dog
</pre>
-The pattern matches the words "dog" or "dogsbody". When the subject is
-presented in several parts ("do" and "gsb" being the first two) the match stops
-when "dog" has been found, and it is not possible to continue. On the other
-hand, if "dogsbody" is presented as a single string, both matches are found.
+The first data line passes the string "dogsb" to <b>pcre_exec()</b>, setting the
+PCRE_PARTIAL_SOFT option. Although the string is a partial match for
+"dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter string
+"dog" is a complete match. Similarly, when the subject is presented to
+<b>pcre_dfa_exec()</b> in several parts ("do" and "gsb" being the first two) the
+match stops when "dog" has been found, and it is not possible to continue. On
+the other hand, if "dogsbody" is presented as a single string,
+<b>pcre_dfa_exec()</b> finds both matches.
</P>
<P>
-Because of this phenomenon, it does not usually make sense to end a pattern
-that is going to be matched in this way with a variable repeat.
+Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when
+matching multi-segment data. The example above then behaves differently:
+<pre>
+ re&#62; /dog(sbody)?/
+ data&#62; dogsb\P\P
+ Partial match: dogsb
+ data&#62; do\P\D
+ Partial match: do
+ data&#62; gsb\R\P\P\D
+ Partial match: gsb
+
+</PRE>
</P>
<P>
4. Patterns that contain alternatives at the top level which do not all
-start with the same pattern item may not work as expected. For example,
-consider this pattern:
+start with the same pattern item may not work as expected when
+<b>pcre_dfa_exec()</b> is used. For example, consider this pattern:
<pre>
1234|3789
</pre>
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
-subject string. Attempting to continue with the string "789" does not yield a
+subject string. Attempting to continue with the string "7890" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
@@ -220,9 +334,19 @@ patterns or patterns such as:
<pre>
1234|ABCD
</pre>
-where no string can be a partial match for both alternatives.
+where no string can be a partial match for both alternatives. This is not a
+problem if <b>pcre_exec()</b> is used, because the entire match has to be rerun
+each time:
+<pre>
+ re&#62; /1234|3789/
+ data&#62; ABC123\P
+ Partial match: 123
+ data&#62; 1237890
+ 0: 3789
+
+</PRE>
</P>
-<br><a name="SEC5" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@@ -231,11 +355,11 @@ University Computing Service
Cambridge CB2 3QH, England.
<br>
</P>
-<br><a name="SEC6" href="#TOC1">REVISION</a><br>
+<br><a name="SEC11" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 04 June 2007
+Last updated: 31 August 2009
<br>
-Copyright &copy; 1997-2007 University of Cambridge.
+Copyright &copy; 1997-2009 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
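Item 2 under ISSUES WITH MULTI-SEGMENT MATCHING in the pcrepartial page above suggests passing 500-byte chunks in a 700-byte buffer, with the previous 200 bytes at the start of the buffer and the starting offset set to 200. The buffer handling alone might be sketched as follows. The helper fill_segment and the constants CHUNK_SIZE and OVERLAP are hypothetical names chosen for illustration; the sketch makes no PCRE calls and assumes the whole subject is addressable as one array, which a real streaming program would not have.

```c
#include <string.h>

#define CHUNK_SIZE 500   /* new bytes examined per matching call */
#define OVERLAP    200   /* earlier bytes kept so backward assertions work */

/* Hypothetical helper: copy the next chunk of data (total length len,
   beginning at position pos, with pos <= len) into buffer, preceded by up
   to OVERLAP bytes of the data that comes just before it. The number of
   bytes placed in buffer is stored in *buf_len. The return value is the
   offset within buffer at which the new data begins, which is the value
   to use as the starting offset for the match. The buffer must have room
   for CHUNK_SIZE + OVERLAP bytes. */
static size_t fill_segment(const char *data, size_t len, size_t pos,
                           char *buffer, size_t *buf_len)
{
  size_t back = pos < OVERLAP ? pos : OVERLAP;
  size_t fresh = len - pos < CHUNK_SIZE ? len - pos : CHUNK_SIZE;
  memcpy(buffer, data + pos - back, back + fresh);
  *buf_len = back + fresh;
  return back;
}
```

For the first chunk there is no earlier data, so the returned starting offset is 0; for later chunks it is 200, matching the figures in the text.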
diff --git a/doc/html/pcreposix.html b/doc/html/pcreposix.html
index 3a35664..2479d41 100644
--- a/doc/html/pcreposix.html
+++ b/doc/html/pcreposix.html
@@ -143,6 +143,11 @@ The yield of <b>regcomp()</b> is zero on success, and non-zero otherwise. The
is public: <i>re_nsub</i> contains the number of capturing subpatterns in
the regular expression. Various error codes are defined in the header file.
</P>
+<P>
+NOTE: If the yield of <b>regcomp()</b> is non-zero, you must not attempt to
+use the contents of the <i>preg</i> structure. If, for example, you pass it to
+<b>regexec()</b>, the result is undefined and your program is likely to crash.
+</P>
<br><a name="SEC4" href="#TOC1">MATCHING NEWLINE CHARACTERS</a><br>
<P>
This area is not simple, because POSIX and Perl take different views of things.
@@ -257,7 +262,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC9" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 March 2009
+Last updated: 15 August 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcresample.html b/doc/html/pcresample.html
index 6243be6..37c6f79 100644
--- a/doc/html/pcresample.html
+++ b/doc/html/pcresample.html
@@ -17,7 +17,11 @@ PCRE SAMPLE PROGRAM
</b><br>
<P>
A simple, complete demonstration program, to get you started with using PCRE,
-is supplied in the file <i>pcredemo.c</i> in the PCRE distribution.
+is supplied in the file <i>pcredemo.c</i> in the PCRE distribution. A listing of
+this program is given in the
+<a href="pcredemo.html"><b>pcredemo</b></a>
+documentation. If you do not have a copy of the PCRE distribution, you can save
+this listing to re-create <i>pcredemo.c</i>.
</P>
<P>
The program compiles the regular expression that is its first argument, and
@@ -55,13 +59,15 @@ this:
Note that there is a much more comprehensive test program, called
<a href="pcretest.html"><b>pcretest</b>,</a>
which supports many more facilities for testing regular expressions and the
-PCRE library. The <b>pcredemo</b> program is provided as a simple coding
-example.
+PCRE library. The
+<a href="pcredemo.html"><b>pcredemo</b></a>
+program is provided as a simple coding example.
</P>
<P>
-On some operating systems (e.g. Solaris), when PCRE is not installed in the
-standard library directory, you may get an error like this when you try to run
-<b>pcredemo</b>:
+If you try to run
+<a href="pcredemo.html"><b>pcredemo</b></a>
+when PCRE is not installed in the standard library directory, you may get an
+error like this on some operating systems (e.g. Solaris):
<pre>
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
</pre>
@@ -87,9 +93,9 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 23 January 2008
+Last updated: 01 September 2009
<br>
-Copyright &copy; 1997-2008 University of Cambridge.
+Copyright &copy; 1997-2009 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/doc/html/pcretest.html b/doc/html/pcretest.html
index 0358958..88728ca 100644
--- a/doc/html/pcretest.html
+++ b/doc/html/pcretest.html
@@ -372,7 +372,8 @@ recognized:
\M discover the minimum MATCH_LIMIT and MATCH_LIMIT_RECURSION settings
\N pass the PCRE_NOTEMPTY option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
\Odd set the size of the output vector passed to <b>pcre_exec()</b> to dd (any number of digits)
- \P pass the PCRE_PARTIAL option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
+ \P pass the PCRE_PARTIAL_SOFT option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>; if used twice, pass the
+ PCRE_PARTIAL_HARD option
\Qdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd (any number of digits)
\R pass the PCRE_DFA_RESTART option to <b>pcre_dfa_exec()</b>
\S output details of memory get/free calls during matching
@@ -453,10 +454,10 @@ This section describes the output when the normal matching function,
<P>
When a match succeeds, pcretest outputs the list of captured substrings that
<b>pcre_exec()</b> returns, starting with number 0 for the string that matched
-the whole pattern. Otherwise, it outputs "No match" or "Partial match"
-when <b>pcre_exec()</b> returns PCRE_ERROR_NOMATCH or PCRE_ERROR_PARTIAL,
-respectively, and otherwise the PCRE negative error number. Here is an example
-of an interactive <b>pcretest</b> run.
+the whole pattern. Otherwise, it outputs "No match" or "Partial match:"
+followed by the partially matching substring when <b>pcre_exec()</b> returns
+PCRE_ERROR_NOMATCH or PCRE_ERROR_PARTIAL, respectively, and otherwise the PCRE
+negative error number. Here is an example of an interactive <b>pcretest</b> run.
<pre>
$ pcretest
PCRE version 7.0 30-Nov-2006
@@ -536,7 +537,9 @@ the subject where there is at least one match. For example:
2: tan
</pre>
(Using the normal matching function on this data finds only "tang".) The
-longest matching string is always given first (and numbered zero).
+longest matching string is always given first (and numbered zero). After a
+PCRE_ERROR_PARTIAL return, the output is "Partial match:", followed by the
+partially matching substring.
</P>
<P>
If <b>/g</b> is present on the pattern, the search for further matches resumes
@@ -703,7 +706,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 10 March 2009
+Last updated: 29 August 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/index.html.src b/doc/index.html.src
index 888471f..58dfe45 100644
--- a/doc/index.html.src
+++ b/doc/index.html.src
@@ -36,6 +36,9 @@ The HTML documentation for PCRE comprises the following pages:
<tr><td><a href="pcrecpp.html">pcrecpp</a></td>
<td>&nbsp;&nbsp;The C++ wrapper for the PCRE library</td></tr>
+<tr><td><a href="pcredemo.html">pcredemo</a></td>
+ <td>&nbsp;&nbsp;A demonstration C program that uses the PCRE library</td></tr>
+
<tr><td><a href="pcregrep.html">pcregrep</a></td>
<td>&nbsp;&nbsp;The <b>pcregrep</b> command</td></tr>
@@ -58,7 +61,7 @@ The HTML documentation for PCRE comprises the following pages:
<td>&nbsp;&nbsp;How to save and re-use compiled patterns</td></tr>
<tr><td><a href="pcresample.html">pcresample</a></td>
- <td>&nbsp;&nbsp;Description of the sample program</td></tr>
+ <td>&nbsp;&nbsp;Discussion of the pcredemo program</td></tr>
<tr><td><a href="pcrestack.html">pcrestack</a></td>
<td>&nbsp;&nbsp;Discussion of PCRE's stack usage</td></tr>
diff --git a/doc/pcre.3 b/doc/pcre.3
index 6f174ad..430fbd5 100644
--- a/doc/pcre.3
+++ b/doc/pcre.3
@@ -11,8 +11,8 @@ appeared in Perl are also available using the Python syntax. There is also some
support for certain .NET and Oniguruma syntax items, and there is an option for
requesting some minor changes that give better JavaScript compatibility.
.P
-The current implementation of PCRE (release 7.x) corresponds approximately with
-Perl 5.10, including support for UTF-8 encoded strings and Unicode general
+The current implementation of PCRE (release 8.xx) corresponds approximately
+with Perl 5.10, including support for UTF-8 encoded strings and Unicode general
category properties. However, UTF-8 and Unicode support has to be explicitly
enabled; it is not the default. The Unicode tables correspond to Unicode
release 5.1.
@@ -83,8 +83,8 @@ not exported.
The user documentation for PCRE comprises a number of different sections. In
the "man" format, each of these is a separate "man page". In the HTML format,
each is a separate page, linked from the index page. In the plain text format,
-all the sections are concatenated, for ease of searching. The sections are as
-follows:
+all the sections, except the \fBpcredemo\fP section, are concatenated, for ease
+of searching. The sections are as follows:
.sp
pcre this document
pcre-config show PCRE installation configuration information
@@ -93,6 +93,7 @@ follows:
pcrecallout details of the callout feature
pcrecompat discussion of Perl compatibility
pcrecpp details of the C++ wrapper
+ pcredemo a demonstration C program that uses PCRE
pcregrep description of the \fBpcregrep\fP command
pcrematching discussion of the two matching algorithms
pcrepartial details of the partial matching facility
@@ -103,7 +104,7 @@ follows:
pcreperform discussion of performance issues
pcreposix the POSIX-compatible C API
pcreprecompile details of saving and re-using precompiled patterns
- pcresample discussion of the sample program
+ pcresample discussion of the pcredemo program
pcrestack discussion of stack usage
pcretest description of the \fBpcretest\fP testing command
.sp
@@ -291,6 +292,6 @@ two digits 10, at the domain cam.ac.uk.
.rs
.sp
.nf
-Last updated: 11 April 2009
+Last updated: 01 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 9a2ce31..5bbeb9a 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -2,8 +2,9 @@
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
-synopses of each function in the library have not been included. There are
-separate text files for the pcregrep and pcretest commands.
+synopses of each function in the library have not been included. Neither has
+the pcredemo program. There are separate text files for the pcregrep and
+pcretest commands.
-----------------------------------------------------------------------------
@@ -24,7 +25,7 @@ INTRODUCTION
tax items, and there is an option for requesting some minor changes
that give better JavaScript compatibility.
- The current implementation of PCRE (release 7.x) corresponds approxi-
+ The current implementation of PCRE (release 8.xx) corresponds approxi-
mately with Perl 5.10, including support for UTF-8 encoded strings and
Unicode general category properties. However, UTF-8 and Unicode support
has to be explicitly enabled; it is not the default. The Unicode tables
@@ -71,8 +72,9 @@ USER DOCUMENTATION
The user documentation for PCRE comprises a number of different sec-
tions. In the "man" format, each of these is a separate "man page". In
the HTML format, each is a separate page, linked from the index page.
- In the plain text format, all the sections are concatenated, for ease
- of searching. The sections are as follows:
+ In the plain text format, all the sections, except the pcredemo sec-
+ tion, are concatenated, for ease of searching. The sections are as fol-
+ lows:
pcre this document
pcre-config show PCRE installation configuration information
@@ -81,6 +83,7 @@ USER DOCUMENTATION
pcrecallout details of the callout feature
pcrecompat discussion of Perl compatibility
pcrecpp details of the C++ wrapper
+ pcredemo a demonstration C program that uses PCRE
pcregrep description of the pcregrep command
pcrematching discussion of the two matching algorithms
pcrepartial details of the partial matching facility
@@ -90,25 +93,25 @@ USER DOCUMENTATION
pcreperform discussion of performance issues
pcreposix the POSIX-compatible C API
pcreprecompile details of saving and re-using precompiled patterns
- pcresample discussion of the sample program
+ pcresample discussion of the pcredemo program
pcrestack discussion of stack usage
pcretest description of the pcretest testing command
- In addition, in the "man" and HTML formats, there is a short page for
+ In addition, in the "man" and HTML formats, there is a short page for
each C library function, listing its arguments and results.
LIMITATIONS
- There are some size limitations in PCRE but it is hoped that they will
+ There are some size limitations in PCRE but it is hoped that they will
never in practice be relevant.
- The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
+ The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
is compiled with the default internal linkage size of 2. If you want to
- process regular expressions that are truly enormous, you can compile
- PCRE with an internal linkage size of 3 or 4 (see the README file in
- the source distribution and the pcrebuild documentation for details).
- In these cases the limit is substantially larger. However, the speed
+ process regular expressions that are truly enormous, you can compile
+ PCRE with an internal linkage size of 3 or 4 (see the README file in
+ the source distribution and the pcrebuild documentation for details).
+ In these cases the limit is substantially larger. However, the speed
of execution is slower.
All values in repeating quantifiers must be less than 65536.
@@ -119,131 +122,131 @@ LIMITATIONS
The maximum length of name for a named subpattern is 32 characters, and
the maximum number of named subpatterns is 10000.
- The maximum length of a subject string is the largest positive number
- that an integer variable can hold. However, when using the traditional
+ The maximum length of a subject string is the largest positive number
+ that an integer variable can hold. However, when using the traditional
matching function, PCRE uses recursion to handle subpatterns and indef-
- inite repetition. This means that the available stack space may limit
+ inite repetition. This means that the available stack space may limit
the size of a subject string that can be processed by certain patterns.
For a discussion of stack issues, see the pcrestack documentation.
UTF-8 AND UNICODE PROPERTY SUPPORT
- From release 3.3, PCRE has had some support for character strings
- encoded in the UTF-8 format. For release 4.0 this was greatly extended
- to cover most common requirements, and in release 5.0 additional sup-
+ From release 3.3, PCRE has had some support for character strings
+ encoded in the UTF-8 format. For release 4.0 this was greatly extended
+ to cover most common requirements, and in release 5.0 additional sup-
port for Unicode general category properties was added.
- In order process UTF-8 strings, you must build PCRE to include UTF-8
- support in the code, and, in addition, you must call pcre_compile()
- with the PCRE_UTF8 option flag, or the pattern must start with the
- sequence (*UTF8). When either of these is the case, both the pattern
- and any subject strings that are matched against it are treated as
+ In order process UTF-8 strings, you must build PCRE to include UTF-8
+ support in the code, and, in addition, you must call pcre_compile()
+ with the PCRE_UTF8 option flag, or the pattern must start with the
+ sequence (*UTF8). When either of these is the case, both the pattern
+ and any subject strings that are matched against it are treated as
UTF-8 strings instead of just strings of bytes.
- If you compile PCRE with UTF-8 support, but do not use it at run time,
- the library will be a bit bigger, but the additional run time overhead
+ If you compile PCRE with UTF-8 support, but do not use it at run time,
+ the library will be a bit bigger, but the additional run time overhead
is limited to testing the PCRE_UTF8 flag occasionally, so should not be
very big.
If PCRE is built with Unicode character property support (which implies
- UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup-
+ UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup-
ported. The available properties that can be tested are limited to the
- general category properties such as Lu for an upper case letter or Nd
- for a decimal number, the Unicode script names such as Arabic or Han,
- and the derived properties Any and L&. A full list is given in the
+ general category properties such as Lu for an upper case letter or Nd
+ for a decimal number, the Unicode script names such as Arabic or Han,
+ and the derived properties Any and L&. A full list is given in the
pcrepattern documentation. Only the short names for properties are sup-
- ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
- ter}, is not supported. Furthermore, in Perl, many properties may
- optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE
+ ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
+ ter}, is not supported. Furthermore, in Perl, many properties may
+ optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE
does not support this.
Validity of UTF-8 strings
- When you set the PCRE_UTF8 flag, the strings passed as patterns and
+ When you set the PCRE_UTF8 flag, the strings passed as patterns and
subjects are (by default) checked for validity on entry to the relevant
- functions. From release 7.3 of PCRE, the check is according the rules
- of RFC 3629, which are themselves derived from the Unicode specifica-
- tion. Earlier releases of PCRE followed the rules of RFC 2279, which
- allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current
+ functions. From release 7.3 of PCRE, the check is according the rules
+ of RFC 3629, which are themselves derived from the Unicode specifica-
+ tion. Earlier releases of PCRE followed the rules of RFC 2279, which
+ allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current
check allows only values in the range U+0 to U+10FFFF, excluding U+D800
to U+DFFF.
- The excluded code points are the "Low Surrogate Area" of Unicode, of
- which the Unicode Standard says this: "The Low Surrogate Area does not
- contain any character assignments, consequently no character code
+ The excluded code points are the "Low Surrogate Area" of Unicode, of
+ which the Unicode Standard says this: "The Low Surrogate Area does not
+ contain any character assignments, consequently no character code
charts or namelists are provided for this area. Surrogates are reserved
- for use with UTF-16 and then must be used in pairs." The code points
- that are encoded by UTF-16 pairs are available as independent code
- points in the UTF-8 encoding. (In other words, the whole surrogate
+ for use with UTF-16 and then must be used in pairs." The code points
+ that are encoded by UTF-16 pairs are available as independent code
+ points in the UTF-8 encoding. (In other words, the whole surrogate
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
- If an invalid UTF-8 string is passed to PCRE, an error return
+ If an invalid UTF-8 string is passed to PCRE, an error return
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
that your strings are valid, and therefore want to skip these checks in
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
- compile time or at run time, PCRE assumes that the pattern or subject
- it is given (respectively) contains only valid UTF-8 codes. In this
+ compile time or at run time, PCRE assumes that the pattern or subject
+ it is given (respectively) contains only valid UTF-8 codes. In this
case, it does not diagnose an invalid UTF-8 string.
- If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
- what happens depends on why the string is invalid. If the string con-
+ If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
+ what happens depends on why the string is invalid. If the string con-
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
- string of characters in the range 0 to 0x7FFFFFFF. In other words,
+ string of characters in the range 0 to 0x7FFFFFFF. In other words,
apart from the initial validity test, PCRE (when in UTF-8 mode) handles
- strings according to the more liberal rules of RFC 2279. However, if
- the string does not even conform to RFC 2279, the result is undefined.
+ strings according to the more liberal rules of RFC 2279. However, if
+ the string does not even conform to RFC 2279, the result is undefined.
Your program may crash.
- If you want to process strings of values in the full range 0 to
- 0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can
+ If you want to process strings of values in the full range 0 to
+ 0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
this situation, you will have to apply your own validity check.
General comments about UTF-8 mode
- 1. An unbraced hexadecimal escape sequence (such as \xb3) matches a
+ 1. An unbraced hexadecimal escape sequence (such as \xb3) matches a
two-byte UTF-8 character if the value is greater than 127.
- 2. Octal numbers up to \777 are recognized, and match two-byte UTF-8
+ 2. Octal numbers up to \777 are recognized, and match two-byte UTF-8
characters for values greater than \177.
- 3. Repeat quantifiers apply to complete UTF-8 characters, not to indi-
+ 3. Repeat quantifiers apply to complete UTF-8 characters, not to indi-
vidual bytes, for example: \x{100}{3}.
- 4. The dot metacharacter matches one UTF-8 character instead of a sin-
+ 4. The dot metacharacter matches one UTF-8 character instead of a sin-
gle byte.
- 5. The escape sequence \C can be used to match a single byte in UTF-8
- mode, but its use can lead to some strange effects. This facility is
+ 5. The escape sequence \C can be used to match a single byte in UTF-8
+ mode, but its use can lead to some strange effects. This facility is
not available in the alternative matching function, pcre_dfa_exec().
- 6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
- test characters of any code value, but the characters that PCRE recog-
- nizes as digits, spaces, or word characters remain the same set as
+ 6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
+ test characters of any code value, but the characters that PCRE recog-
+ nizes as digits, spaces, or word characters remain the same set as
before, all with values less than 256. This remains true even when PCRE
- includes Unicode property support, because to do otherwise would slow
- down PCRE in many common cases. If you really want to test for a wider
- sense of, say, "digit", you must use Unicode property tests such as
- \p{Nd}. Note that this also applies to \b, because it is defined in
+ includes Unicode property support, because to do otherwise would slow
+ down PCRE in many common cases. If you really want to test for a wider
+ sense of, say, "digit", you must use Unicode property tests such as
+ \p{Nd}. Note that this also applies to \b, because it is defined in
terms of \w and \W.
- 7. Similarly, characters that match the POSIX named character classes
+ 7. Similarly, characters that match the POSIX named character classes
are all low-valued characters.
- 8. However, the Perl 5.10 horizontal and vertical whitespace matching
+ 8. However, the Perl 5.10 horizontal and vertical whitespace matching
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
acters.
- 9. Case-insensitive matching applies only to characters whose values
- are less than 128, unless PCRE is built with Unicode property support.
- Even when Unicode property support is available, PCRE still uses its
- own character tables when checking the case of low-valued characters,
- so as not to degrade performance. The Unicode property information is
+ 9. Case-insensitive matching applies only to characters whose values
+ are less than 128, unless PCRE is built with Unicode property support.
+ Even when Unicode property support is available, PCRE still uses its
+ own character tables when checking the case of low-valued characters,
+ so as not to degrade performance. The Unicode property information is
used only for characters with higher values. Even when Unicode property
support is available, PCRE supports case-insensitive matching only when
- there is a one-to-one mapping between a letter's cases. There are a
- small number of many-to-one mappings in Unicode; these are not sup-
+ there is a one-to-one mapping between a letter's cases. There are a
+ small number of many-to-one mappings in Unicode; these are not sup-
ported by PCRE.
@@ -253,18 +256,18 @@ AUTHOR
University Computing Service
Cambridge CB2 3QH, England.
- Putting an actual email address here seems to have been a spam magnet,
- so I've taken it away. If you want to email me, use my two initials,
+ Putting an actual email address here seems to have been a spam magnet,
+ so I've taken it away. If you want to email me, use my two initials,
followed by the two digits 10, at the domain cam.ac.uk.
REVISION
- Last updated: 11 April 2009
+ Last updated: 01 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREBUILD(3) PCREBUILD(3)
@@ -590,8 +593,8 @@ REVISION
Last updated: 17 March 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREMATCHING(3) PCREMATCHING(3)
@@ -751,13 +754,7 @@ ADVANTAGES OF THE ALTERNATIVE ALGORITHM
more than one match using the standard algorithm, you have to do kludgy
things with callouts.
- 2. There is much better support for partial matching. The restrictions
- on the content of the pattern that apply when using the standard algo-
- rithm for partial matching do not apply to the alternative algorithm.
- For non-anchored patterns, the starting position of a partial match is
- available.
-
- 3. Because the alternative algorithm scans the subject string just
+ 2. Because the alternative algorithm scans the subject string just
once, and never needs to backtrack, it is possible to pass very long
subject strings to the matching function in several pieces, checking
for partial matching each time.
@@ -786,11 +783,11 @@ AUTHOR
REVISION
- Last updated: 19 April 2008
- Copyright (c) 1997-2008 University of Cambridge.
+ Last updated: 25 August 2009
+ Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREAPI(3) PCREAPI(3)
@@ -898,18 +895,19 @@ PCRE API OVERVIEW
pcre_exec() are used for compiling and matching regular expressions in
a Perl-compatible manner. A sample program that demonstrates the sim-
plest way of using them is provided in the file called pcredemo.c in
- the source distribution. The pcresample documentation describes how to
- compile and run it.
+ the PCRE source distribution. A listing of this program is given in the
+ pcredemo documentation, and the pcresample documentation describes how
+ to compile and run it.
A second matching function, pcre_dfa_exec(), which is not Perl-compati-
- ble, is also provided. This uses a different algorithm for the match-
- ing. The alternative algorithm finds all possible matches (at a given
- point in the subject), and scans the subject just once. However, this
+ ble, is also provided. This uses a different algorithm for the match-
+ ing. The alternative algorithm finds all possible matches (at a given
+ point in the subject), and scans the subject just once. However, this
algorithm does not return captured substrings. A description of the two
- matching algorithms and their advantages and disadvantages is given in
+ matching algorithms and their advantages and disadvantages is given in
the pcrematching documentation.
- In addition to the main compiling and matching functions, there are
+ In addition to the main compiling and matching functions, there are
convenience functions for extracting captured substrings from a subject
string that is matched by pcre_exec(). They are:
@@ -924,91 +922,91 @@ PCRE API OVERVIEW
pcre_free_substring() and pcre_free_substring_list() are also provided,
to free the memory used for extracted strings.
- The function pcre_maketables() is used to build a set of character
- tables in the current locale for passing to pcre_compile(),
- pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
- provided for specialist use. Most commonly, no special tables are
- passed, in which case internal tables that are generated when PCRE is
+ The function pcre_maketables() is used to build a set of character
+ tables in the current locale for passing to pcre_compile(),
+ pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
+ provided for specialist use. Most commonly, no special tables are
+ passed, in which case internal tables that are generated when PCRE is
built are used.
- The function pcre_fullinfo() is used to find out information about a
- compiled pattern; pcre_info() is an obsolete version that returns only
- some of the available information, but is retained for backwards com-
- patibility. The function pcre_version() returns a pointer to a string
+ The function pcre_fullinfo() is used to find out information about a
+ compiled pattern; pcre_info() is an obsolete version that returns only
+ some of the available information, but is retained for backwards com-
+ patibility. The function pcre_version() returns a pointer to a string
containing the version of PCRE and its date of release.
- The function pcre_refcount() maintains a reference count in a data
- block containing a compiled pattern. This is provided for the benefit
+ The function pcre_refcount() maintains a reference count in a data
+ block containing a compiled pattern. This is provided for the benefit
of object-oriented applications.
- The global variables pcre_malloc and pcre_free initially contain the
- entry points of the standard malloc() and free() functions, respec-
+ The global variables pcre_malloc and pcre_free initially contain the
+ entry points of the standard malloc() and free() functions, respec-
tively. PCRE calls the memory management functions via these variables,
- so a calling program can replace them if it wishes to intercept the
+ so a calling program can replace them if it wishes to intercept the
calls. This should be done before calling any PCRE functions.
- The global variables pcre_stack_malloc and pcre_stack_free are also
- indirections to memory management functions. These special functions
- are used only when PCRE is compiled to use the heap for remembering
+ The global variables pcre_stack_malloc and pcre_stack_free are also
+ indirections to memory management functions. These special functions
+ are used only when PCRE is compiled to use the heap for remembering
data, instead of recursive function calls, when running the pcre_exec()
- function. See the pcrebuild documentation for details of how to do
- this. It is a non-standard way of building PCRE, for use in environ-
- ments that have limited stacks. Because of the greater use of memory
- management, it runs more slowly. Separate functions are provided so
- that special-purpose external code can be used for this case. When
- used, these functions are always called in a stack-like manner (last
- obtained, first freed), and always for memory blocks of the same size.
- There is a discussion about PCRE's stack usage in the pcrestack docu-
+ function. See the pcrebuild documentation for details of how to do
+ this. It is a non-standard way of building PCRE, for use in environ-
+ ments that have limited stacks. Because of the greater use of memory
+ management, it runs more slowly. Separate functions are provided so
+ that special-purpose external code can be used for this case. When
+ used, these functions are always called in a stack-like manner (last
+ obtained, first freed), and always for memory blocks of the same size.
+ There is a discussion about PCRE's stack usage in the pcrestack docu-
mentation.
The global variable pcre_callout initially contains NULL. It can be set
- by the caller to a "callout" function, which PCRE will then call at
- specified points during a matching operation. Details are given in the
+ by the caller to a "callout" function, which PCRE will then call at
+ specified points during a matching operation. Details are given in the
pcrecallout documentation.
NEWLINES
- PCRE supports five different conventions for indicating line breaks in
- strings: a single CR (carriage return) character, a single LF (line-
+ PCRE supports five different conventions for indicating line breaks in
+ strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
- ceding, or any Unicode newline sequence. The Unicode newline sequences
- are the three just mentioned, plus the single characters VT (vertical
- tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
+ ceding, or any Unicode newline sequence. The Unicode newline sequences
+ are the three just mentioned, plus the single characters VT (vertical
+ tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
separator, U+2028), and PS (paragraph separator, U+2029).
- Each of the first three conventions is used by at least one operating
- system as its standard newline sequence. When PCRE is built, a default
- can be specified. The default default is LF, which is the Unix stan-
- dard. When PCRE is run, the default can be overridden, either when a
+ Each of the first three conventions is used by at least one operating
+ system as its standard newline sequence. When PCRE is built, a default
+ can be specified. The default default is LF, which is the Unix stan-
+ dard. When PCRE is run, the default can be overridden, either when a
pattern is compiled, or when it is matched.
At compile time, the newline convention can be specified by the options
- argument of pcre_compile(), or it can be specified by special text at
+ argument of pcre_compile(), or it can be specified by special text at
the start of the pattern itself; this overrides any other settings. See
the pcrepattern page for details of the special character sequences.
In the PCRE documentation the word "newline" is used to mean "the char-
- acter or pair of characters that indicate a line break". The choice of
- newline convention affects the handling of the dot, circumflex, and
+ acter or pair of characters that indicate a line break". The choice of
+ newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
- CRLF is a recognized line ending sequence, the match position advance-
+ CRLF is a recognized line ending sequence, the match position advance-
ment for a non-anchored pattern. There is more detail about this in the
section on pcre_exec() options below.
- The choice of newline convention does not affect the interpretation of
- the \n or \r escape sequences, nor does it affect what \R matches,
+ The choice of newline convention does not affect the interpretation of
+ the \n or \r escape sequences, nor does it affect what \R matches,
which is controlled in a similar way, but by separate options.
MULTITHREADING
- The PCRE functions can be used in multi-threading applications, with
+ The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout function pointed to by pcre_callout, are shared by all threads.
- The compiled form of a regular expression is not altered during match-
+ The compiled form of a regular expression is not altered during match-
ing, so the same compiled pattern can safely be used by several threads
at once.
@@ -1016,10 +1014,10 @@ MULTITHREADING
SAVING PRECOMPILED PATTERNS FOR LATER USE
The compiled form of a regular expression can be saved and re-used at a
- later time, possibly by a different program, and even on a host other
- than the one on which it was compiled. Details are given in the
- pcreprecompile documentation. However, compiling a regular expression
- with one version of PCRE for use with a different version is not guar-
+ later time, possibly by a different program, and even on a host other
+ than the one on which it was compiled. Details are given in the
+ pcreprecompile documentation. However, compiling a regular expression
+ with one version of PCRE for use with a different version is not guar-
anteed to work and may cause crashes.
@@ -1027,79 +1025,79 @@ CHECKING BUILD-TIME OPTIONS
int pcre_config(int what, void *where);
- The function pcre_config() makes it possible for a PCRE client to dis-
+ The function pcre_config() makes it possible for a PCRE client to dis-
cover which optional features have been compiled into the PCRE library.
- The pcrebuild documentation has more details about these optional fea-
+ The pcrebuild documentation has more details about these optional fea-
tures.
- The first argument for pcre_config() is an integer, specifying which
+ The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
- into which the information is placed. The following information is
+ into which the information is placed. The following information is
available:
PCRE_CONFIG_UTF8
- The output is an integer that is set to one if UTF-8 support is avail-
+ The output is an integer that is set to one if UTF-8 support is avail-
able; otherwise it is set to zero.
PCRE_CONFIG_UNICODE_PROPERTIES
- The output is an integer that is set to one if support for Unicode
+ The output is an integer that is set to one if support for Unicode
character properties is available; otherwise it is set to zero.
PCRE_CONFIG_NEWLINE
- The output is an integer whose value specifies the default character
- sequence that is recognized as meaning "newline". The five values that
+ The output is an integer whose value specifies the default character
+ sequence that is recognized as meaning "newline". The five values that
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
- and -1 for ANY. Though they are derived from ASCII, the same values
+ and -1 for ANY. Though they are derived from ASCII, the same values
are returned in EBCDIC environments. The default should normally corre-
spond to the standard sequence for your operating system.
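The values listed for PCRE_CONFIG_NEWLINE can be turned into names with a small lookup. This is a minimal sketch, not part of PCRE: newline_name() is a hypothetical helper, and the reading of 3338 as CR (13) in the high byte and LF (10) in the low byte is an inference from the arithmetic (13 * 256 + 10), not something the text states.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): names the value that a call to
   pcre_config() with PCRE_CONFIG_NEWLINE stores in its output integer. */
static const char *newline_name(int value)
{
    switch (value) {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";     /* 3338, assuming CR high, LF low */
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return NULL;       /* not one of the listed values */
    }
}
```

A caller would pass the integer filled in by pcre_config() and treat a NULL result as a value this sketch does not know about.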
PCRE_CONFIG_BSR
The output is an integer whose value indicates what character sequences
- the \R escape sequence matches by default. A value of 0 means that \R
- matches any Unicode line ending sequence; a value of 1 means that \R
+ the \R escape sequence matches by default. A value of 0 means that \R
+ matches any Unicode line ending sequence; a value of 1 means that \R
matches only CR, LF, or CRLF. The default can be overridden when a pat-
tern is compiled or matched.
PCRE_CONFIG_LINK_SIZE
- The output is an integer that contains the number of bytes used for
+ The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. The value is 2, 3, or
- 4. Larger values allow larger regular expressions to be compiled, at
- the expense of slower matching. The default value of 2 is sufficient
- for all but the most massive patterns, since it allows the compiled
+ 4. Larger values allow larger regular expressions to be compiled, at
+ the expense of slower matching. The default value of 2 is sufficient
+ for all but the most massive patterns, since it allows the compiled
pattern to be up to 64K in size.
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
- The output is an integer that contains the threshold above which the
- POSIX interface uses malloc() for output vectors. Further details are
+ The output is an integer that contains the threshold above which the
+ POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.
PCRE_CONFIG_MATCH_LIMIT
- The output is a long integer that gives the default limit for the num-
- ber of internal matching function calls in a pcre_exec() execution.
+ The output is a long integer that gives the default limit for the num-
+ ber of internal matching function calls in a pcre_exec() execution.
Further details are given with pcre_exec() below.
PCRE_CONFIG_MATCH_LIMIT_RECURSION
The output is a long integer that gives the default limit for the depth
- of recursion when calling the internal matching function in a
- pcre_exec() execution. Further details are given with pcre_exec()
+ of recursion when calling the internal matching function in a
+ pcre_exec() execution. Further details are given with pcre_exec()
below.
PCRE_CONFIG_STACKRECURSE
- The output is an integer that is set to one if internal recursion when
+ The output is an integer that is set to one if internal recursion when
running pcre_exec() is implemented by recursive function calls that use
- the stack to remember their state. This is the usual way that PCRE is
+ the stack to remember their state. This is the usual way that PCRE is
compiled. The output is zero if PCRE was compiled to use blocks of data
- on the heap instead of recursive function calls. In this case,
- pcre_stack_malloc and pcre_stack_free are called to manage memory
+ on the heap instead of recursive function calls. In this case,
+ pcre_stack_malloc and pcre_stack_free are called to manage memory
blocks on the heap, thus avoiding the use of the stack.
@@ -1116,56 +1114,56 @@ COMPILING A PATTERN
Either of the functions pcre_compile() or pcre_compile2() can be called
to compile a pattern into an internal form. The only difference between
- the two interfaces is that pcre_compile2() has an additional argument,
+ the two interfaces is that pcre_compile2() has an additional argument,
errorcodeptr, via which a numerical error code can be returned.
The pattern is a C string terminated by a binary zero, and is passed in
- the pattern argument. A pointer to a single block of memory that is
- obtained via pcre_malloc is returned. This contains the compiled code
+ the pattern argument. A pointer to a single block of memory that is
+ obtained via pcre_malloc is returned. This contains the compiled code
and related data. The pcre type is defined for the returned block; this
is a typedef for a structure whose contents are not externally defined.
It is up to the caller to free the memory (via pcre_free) when it is no
longer required.
- Although the compiled code of a PCRE regex is relocatable, that is, it
+ Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not
- fully relocatable, because it may contain a copy of the tableptr argu-
+ fully relocatable, because it may contain a copy of the tableptr argu-
ment, which is an address (see below).
The options argument contains various bit settings that affect the com-
- pilation. It should be zero if no options are required. The available
- options are described below. Some of them (in particular, those that
- are compatible with Perl, but also some others) can also be set and
- unset from within the pattern (see the detailed description in the
- pcrepattern documentation). For those options that can be different in
- different parts of the pattern, the contents of the options argument
+ pilation. It should be zero if no options are required. The available
+ options are described below. Some of them (in particular, those that
+ are compatible with Perl, but also some others) can also be set and
+ unset from within the pattern (see the detailed description in the
+ pcrepattern documentation). For those options that can be different in
+ different parts of the pattern, the contents of the options argument
specifies their initial settings at the start of compilation and execu-
- tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
+ tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
time of matching as well as at compile time.
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
- if compilation of a pattern fails, pcre_compile() returns NULL, and
+ if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error mes-
sage. This is a static string that is part of the library. You must not
try to free it. The offset from the start of the pattern to the charac-
ter where the error was discovered is placed in the variable pointed to
- by erroffset, which must not be NULL. If it is, an immediate error is
+ by erroffset, which must not be NULL. If it is, an immediate error is
given.
- If pcre_compile2() is used instead of pcre_compile(), and the error-
- codeptr argument is not NULL, a non-zero error code number is returned
- via this argument in the event of an error. This is in addition to the
+ If pcre_compile2() is used instead of pcre_compile(), and the error-
+ codeptr argument is not NULL, a non-zero error code number is returned
+ via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below.
- If the final argument, tableptr, is NULL, PCRE uses a default set of
- character tables that are built when PCRE is compiled, using the
- default C locale. Otherwise, tableptr must be an address that is the
- result of a call to pcre_maketables(). This value is stored with the
- compiled pattern, and used again by pcre_exec(), unless another table
+ If the final argument, tableptr, is NULL, PCRE uses a default set of
+ character tables that are built when PCRE is compiled, using the
+ default C locale. Otherwise, tableptr must be an address that is the
+ result of a call to pcre_maketables(). This value is stored with the
+ compiled pattern, and used again by pcre_exec(), unless another table
pointer is passed to it. For more discussion, see the section on locale
support below.
- This code fragment shows a typical straightforward call to pcre_com-
+ This code fragment shows a typical straightforward call to pcre_com-
pile():
pcre *re;
@@ -1178,137 +1176,137 @@ COMPILING A PATTERN
&erroffset, /* for error offset */
NULL); /* use default character tables */
- The following names for option bits are defined in the pcre.h header
+ The following names for option bits are defined in the pcre.h header
file:
PCRE_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it
- is constrained to match only at the first matching point in the string
- that is being searched (the "subject string"). This effect can also be
- achieved by appropriate constructs in the pattern itself, which is the
+ is constrained to match only at the first matching point in the string
+ that is being searched (the "subject string"). This effect can also be
+ achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.
PCRE_AUTO_CALLOUT
If this bit is set, pcre_compile() automatically inserts callout items,
- all with number 255, before each pattern item. For discussion of the
+ all with number 255, before each pattern item. For discussion of the
callout facility, see the pcrecallout documentation.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R escape
- sequence matches. The choice is either to match only CR, LF, or CRLF,
+ sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. The default is specified when
PCRE is built. It can be overridden from within the pattern, or by set-
ting an option when a compiled pattern is matched.
PCRE_CASELESS
- If this bit is set, letters in the pattern match both upper and lower
- case letters. It is equivalent to Perl's /i option, and it can be
- changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
- always understands the concept of case for characters whose values are
- less than 128, so caseless matching is always possible. For characters
- with higher values, the concept of case is supported if PCRE is com-
- piled with Unicode property support, but not otherwise. If you want to
- use caseless matching for characters 128 and above, you must ensure
- that PCRE is compiled with Unicode property support as well as with
+ If this bit is set, letters in the pattern match both upper and lower
+ case letters. It is equivalent to Perl's /i option, and it can be
+ changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
+ always understands the concept of case for characters whose values are
+ less than 128, so caseless matching is always possible. For characters
+ with higher values, the concept of case is supported if PCRE is com-
+ piled with Unicode property support, but not otherwise. If you want to
+ use caseless matching for characters 128 and above, you must ensure
+ that PCRE is compiled with Unicode property support as well as with
UTF-8 support.
PCRE_DOLLAR_ENDONLY
- If this bit is set, a dollar metacharacter in the pattern matches only
- at the end of the subject string. Without this option, a dollar also
- matches immediately before a newline at the end of the string (but not
- before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
- if PCRE_MULTILINE is set. There is no equivalent to this option in
+ If this bit is set, a dollar metacharacter in the pattern matches only
+ at the end of the subject string. Without this option, a dollar also
+ matches immediately before a newline at the end of the string (but not
+ before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
+ if PCRE_MULTILINE is set. There is no equivalent to this option in
Perl, and no way to set it within a pattern.
PCRE_DOTALL
If this bit is set, a dot metacharacter in the pattern matches all char-
- acters, including those that indicate newline. Without it, a dot does
- not match when the current position is at a newline. This option is
- equivalent to Perl's /s option, and it can be changed within a pattern
- by a (?s) option setting. A negative class such as [^a] always matches
+ acters, including those that indicate newline. Without it, a dot does
+ not match when the current position is at a newline. This option is
+ equivalent to Perl's /s option, and it can be changed within a pattern
+ by a (?s) option setting. A negative class such as [^a] always matches
newline characters, independent of the setting of this option.
PCRE_DUPNAMES
- If this bit is set, names used to identify capturing subpatterns need
+ If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
- is known that only one instance of the named subpattern can ever be
- matched. There are more details of named subpatterns below; see also
+ is known that only one instance of the named subpattern can ever be
+ matched. There are more details of named subpatterns below; see also
the pcrepattern documentation.
PCRE_EXTENDED
- If this bit is set, whitespace data characters in the pattern are
+ If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class. White-
space does not include the VT character (code 11). In addition, charac-
ters between an unescaped # outside a character class and the next new-
- line, inclusive, are also ignored. This is equivalent to Perl's /x
- option, and it can be changed within a pattern by a (?x) option set-
+ line, inclusive, are also ignored. This is equivalent to Perl's /x
+ option, and it can be changed within a pattern by a (?x) option set-
ting.
- This option makes it possible to include comments inside complicated
- patterns. Note, however, that this applies only to data characters.
- Whitespace characters may never appear within special character
- sequences in a pattern, for example within the sequence (?( which
+ This option makes it possible to include comments inside complicated
+ patterns. Note, however, that this applies only to data characters.
+ Whitespace characters may never appear within special character
+ sequences in a pattern, for example within the sequence (?( which
introduces a conditional subpattern.
PCRE_EXTRA
- This option was invented in order to turn on additional functionality
- of PCRE that is incompatible with Perl, but it is currently of very
- little use. When set, any backslash in a pattern that is followed by a
- letter that has no special meaning causes an error, thus reserving
- these combinations for future expansion. By default, as in Perl, a
- backslash followed by a letter with no special meaning is treated as a
- literal. (Perl can, however, be persuaded to give a warning for this.)
- There are at present no other features controlled by this option. It
+ This option was invented in order to turn on additional functionality
+ of PCRE that is incompatible with Perl, but it is currently of very
+ little use. When set, any backslash in a pattern that is followed by a
+ letter that has no special meaning causes an error, thus reserving
+ these combinations for future expansion. By default, as in Perl, a
+ backslash followed by a letter with no special meaning is treated as a
+ literal. (Perl can, however, be persuaded to give a warning for this.)
+ There are at present no other features controlled by this option. It
can also be set by a (?X) option setting within a pattern.
PCRE_FIRSTLINE
- If this option is set, an unanchored pattern is required to match
- before or at the first newline in the subject string, though the
+ If this option is set, an unanchored pattern is required to match
+ before or at the first newline in the subject string, though the
matched text may continue over the newline.
PCRE_JAVASCRIPT_COMPAT
If this option is set, PCRE's behaviour is changed in some ways so that
- it is compatible with JavaScript rather than Perl. The changes are as
+ it is compatible with JavaScript rather than Perl. The changes are as
follows:
- (1) A lone closing square bracket in a pattern causes a compile-time
- error, because this is illegal in JavaScript (by default it is treated
+ (1) A lone closing square bracket in a pattern causes a compile-time
+ error, because this is illegal in JavaScript (by default it is treated
as a data character). Thus, the pattern AB]CD becomes illegal when this
option is set.
- (2) At run time, a back reference to an unset subpattern group matches
- an empty string (by default this causes the current matching alterna-
- tive to fail). A pattern such as (\1)(a) succeeds when this option is
- set (assuming it can find an "a" in the subject), whereas it fails by
+ (2) At run time, a back reference to an unset subpattern group matches
+ an empty string (by default this causes the current matching alterna-
+ tive to fail). A pattern such as (\1)(a) succeeds when this option is
+ set (assuming it can find an "a" in the subject), whereas it fails by
default, for Perl compatibility.
PCRE_MULTILINE
- By default, PCRE treats the subject string as consisting of a single
- line of characters (even if it actually contains newlines). The "start
- of line" metacharacter (^) matches only at the start of the string,
- while the "end of line" metacharacter ($) matches only at the end of
+ By default, PCRE treats the subject string as consisting of a single
+ line of characters (even if it actually contains newlines). The "start
+ of line" metacharacter (^) matches only at the start of the string,
+ while the "end of line" metacharacter ($) matches only at the end of
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
is set). This is the same as Perl.
- When PCRE_MULTILINE is set, the "start of line" and "end of line"
- constructs match immediately following or immediately before internal
- newlines in the subject string, respectively, as well as at the very
- start and end. This is equivalent to Perl's /m option, and it can be
+ When PCRE_MULTILINE is set, the "start of line" and "end of line"
+ constructs match immediately following or immediately before internal
+ newlines in the subject string, respectively, as well as at the very
+ start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no new-
- lines in a subject string, or no occurrences of ^ or $ in a pattern,
+ lines in a subject string, or no occurrences of ^ or $ in a pattern,
setting PCRE_MULTILINE has no effect.
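The interaction between PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY described above can be summarised as two position tests. The following is a toy model only, not PCRE's matching code: the function names and the int flags are hypothetical, newline is assumed to be a single LF, and pos is assumed to be no greater than len.

```c
#include <stddef.h>

/* Toy model, not PCRE code: newline is assumed to be a single LF. */

/* Can ^ match at offset pos in subject[0..len)? */
static int caret_can_match(const char *subject, size_t len, size_t pos,
                           int multiline)
{
    if (pos == 0) return 1;                    /* very start of the subject */
    /* Multiline: also just after an internal newline (not a final one). */
    return multiline && pos < len && subject[pos - 1] == '\n';
}

/* Can $ match at offset pos in subject[0..len)? */
static int dollar_can_match(const char *subject, size_t len, size_t pos,
                            int multiline, int dollar_endonly)
{
    if (pos == len) return 1;                  /* very end of the subject */
    if (subject[pos] != '\n') return 0;
    if (multiline) return 1;                   /* before any newline; ENDONLY ignored */
    return !dollar_endonly && pos == len - 1;  /* before a terminating newline only */
}
```

With the subject "ab\ncd\n", for instance, the model lets $ match at offset 5 by default, refuses it there when only the end-only flag is set, and lets it match at offset 2 only in multiline mode.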
PCRE_NEWLINE_CR
@@ -1317,32 +1315,32 @@ COMPILING A PATTERN
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
- These options override the default newline definition that was chosen
- when PCRE was built. Setting the first or the second specifies that a
- newline is indicated by a single character (CR or LF, respectively).
- Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
- two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
+ These options override the default newline definition that was chosen
+ when PCRE was built. Setting the first or the second specifies that a
+ newline is indicated by a single character (CR or LF, respectively).
+ Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
+ two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
that any of the three preceding sequences should be recognized. Setting
- PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
+ PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
recognized. The Unicode newline sequences are the three just mentioned,
- plus the single characters VT (vertical tab, U+000B), FF (formfeed,
- U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
- (paragraph separator, U+2029). The last two are recognized only in
+ plus the single characters VT (vertical tab, U+000B), FF (formfeed,
+ U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
+ (paragraph separator, U+2029). The last two are recognized only in
UTF-8 mode.
- The newline setting in the options word uses three bits that are
+ The newline setting in the options word uses three bits that are
treated as a number, giving eight possibilities. Currently only six are
- used (default plus the five values above). This means that if you set
- more than one newline option, the combination may or may not be sensi-
+ used (default plus the five values above). This means that if you set
+ more than one newline option, the combination may or may not be sensi-
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
- PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
+ PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
cause an error.
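The three-bit encoding can be sketched as follows. The NL_ constants are stand-ins whose values are written from memory of pcre.h and should be treated as assumptions to check against your own header; newline_number() is a hypothetical helper, not a PCRE function. The text itself guarantees only that CR combined with LF equals CRLF and that two of the eight numbers are unused.

```c
/* Assumed values, written from memory of pcre.h; verify against your header. */
#define NL_CR       0x00100000
#define NL_LF       0x00200000
#define NL_CRLF     0x00300000
#define NL_ANY      0x00400000
#define NL_ANYCRLF  0x00500000
#define NL_MASK     0x00700000   /* the three newline bits */

/* Returns the 3-bit newline number (0 = build-time default), or -1 if the
   combination lands on one of the two unused numbers. */
static int newline_number(int options)
{
    int n = (options & NL_MASK) >> 20;
    return (n <= 5) ? n : -1;
}
```

Under these assumed values, NL_CR | NL_LF gives the same number as NL_CRLF, NL_CR | NL_ANY happens to give the number for NL_ANYCRLF, and NL_LF | NL_ANY gives an unused number, which is the kind of combination the text warns about.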
- The only time that a line break is specially recognized when compiling
- a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
- character class is encountered. This indicates a comment that lasts
- until after the next line break sequence. In other circumstances, line
- break sequences are treated as literal data, except that in
+ The only time that a line break is specially recognized when compiling
+ a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
+ character class is encountered. This indicates a comment that lasts
+ until after the next line break sequence. In other circumstances, line
+ break sequences are treated as literal data, except that in
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
and are therefore ignored.
@@ -1352,46 +1350,46 @@ COMPILING A PATTERN
PCRE_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing paren-
- theses in the pattern. Any opening parenthesis that is not followed by
- ? behaves as if it were followed by ?: but named parentheses can still
- be used for capturing (and they acquire numbers in the usual way).
+ theses in the pattern. Any opening parenthesis that is not followed by
+ ? behaves as if it were followed by ?: but named parentheses can still
+ be used for capturing (and they acquire numbers in the usual way).
There is no equivalent of this option in Perl.
PCRE_UNGREEDY
- This option inverts the "greediness" of the quantifiers so that they
- are not greedy by default, but become greedy if followed by "?". It is
- not compatible with Perl. It can also be set by a (?U) option setting
+ This option inverts the "greediness" of the quantifiers so that they
+ are not greedy by default, but become greedy if followed by "?". It is
+ not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.
PCRE_UTF8
- This option causes PCRE to regard both the pattern and the subject as
- strings of UTF-8 characters instead of single-byte character strings.
- However, it is available only when PCRE is built to include UTF-8 sup-
- port. If not, the use of this option provokes an error. Details of how
- this option changes the behaviour of PCRE are given in the section on
+ This option causes PCRE to regard both the pattern and the subject as
+ strings of UTF-8 characters instead of single-byte character strings.
+ However, it is available only when PCRE is built to include UTF-8 sup-
+ port. If not, the use of this option provokes an error. Details of how
+ this option changes the behaviour of PCRE are given in the section on
UTF-8 support in the main pcre page.
PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
- automatically checked. There is a discussion about the validity of
- UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of
- bytes is found, pcre_compile() returns an error. If you already know
+ automatically checked. There is a discussion about the validity of
+ UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of
+ bytes is found, pcre_compile() returns an error. If you already know
that your pattern is valid, and you want to skip this check for perfor-
- mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is
- set, the effect of passing an invalid UTF-8 string as a pattern is
- undefined. It may cause your program to crash. Note that this option
- can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
+ mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is
+ set, the effect of passing an invalid UTF-8 string as a pattern is
+ undefined. It may cause your program to crash. Note that this option
+ can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
UTF-8 validity checking of subject strings.
COMPILATION ERROR CODES
- The following table lists the error codes that may be returned by
- pcre_compile2(), along with the error messages that may be returned by
- both compiling functions. As PCRE has developed, some error codes have
+ The following table lists the error codes that may be returned by
+ pcre_compile2(), along with the error messages that may be returned by
+ both compiling functions. As PCRE has developed, some error codes have
fallen out of use. To avoid confusion, they have not been re-used.
0 no error
@@ -1447,7 +1445,7 @@ COMPILATION ERROR CODES
50 [this code is not in use]
51 octal value is greater than \377 (not in UTF-8 mode)
52 internal error: overran compiling workspace
- 53 internal error: previously-checked referenced subpattern not
+ 53 internal error: previously-checked referenced subpattern not
found
54 DEFINE group contains more than one branch
55 repeating a DEFINE group is not allowed
@@ -1462,7 +1460,7 @@ COMPILATION ERROR CODES
63 digit expected after (?+
64 ] is an invalid data character in JavaScript compatibility mode
- The numbers 32 and 10000 in errors 48 and 49 are defaults; different
+ The numbers 32 and 10000 in errors 48 and 49 are defaults; different
values may be used if the limits were changed when PCRE was built.
@@ -1471,32 +1469,32 @@ STUDYING A PATTERN
pcre_extra *pcre_study(const pcre *code, int options,
const char **errptr);
- If a compiled pattern is going to be used several times, it is worth
+ If a compiled pattern is going to be used several times, it is worth
spending more time analyzing it in order to speed up the time taken for
- matching. The function pcre_study() takes a pointer to a compiled pat-
+ matching. The function pcre_study() takes a pointer to a compiled pat-
tern as its first argument. If studying the pattern produces additional
- information that will help speed up matching, pcre_study() returns a
- pointer to a pcre_extra block, in which the study_data field points to
+ information that will help speed up matching, pcre_study() returns a
+ pointer to a pcre_extra block, in which the study_data field points to
the results of the study.
The returned value from pcre_study() can be passed directly to
- pcre_exec(). However, a pcre_extra block also contains other fields
- that can be set by the caller before the block is passed; these are
+ pcre_exec(). However, a pcre_extra block also contains other fields
+ that can be set by the caller before the block is passed; these are
described below in the section on matching a pattern.
- If studying the pattern does not produce any additional information,
+ If studying the pattern does not produce any additional information,
pcre_study() returns NULL. In that circumstance, if the calling program
- wants to pass any of the other fields to pcre_exec(), it must set up
+ wants to pass any of the other fields to pcre_exec(), it must set up
its own pcre_extra block.
- The second argument of pcre_study() contains option bits. At present,
+ The second argument of pcre_study() contains option bits. At present,
no options are defined, and this argument should always be zero.
- The third argument for pcre_study() is a pointer for an error message.
- If studying succeeds (even if no data is returned), the variable it
- points to is set to NULL. Otherwise it is set to point to a textual
+ The third argument for pcre_study() is a pointer for an error message.
+ If studying succeeds (even if no data is returned), the variable it
+ points to is set to NULL. Otherwise it is set to point to a textual
error message. This is a static string that is part of the library. You
- must not try to free it. You should test the error pointer for NULL
+ must not try to free it. You should test the error pointer for NULL
after calling pcre_study(), to be sure that it has run successfully.
This is a typical call to pcre_study():
@@ -1508,62 +1506,62 @@ STUDYING A PATTERN
&error); /* set to NULL or points to a message */
At present, studying a pattern is useful only for non-anchored patterns
- that do not have a single fixed starting character. A bitmap of possi-
+ that do not have a single fixed starting character. A bitmap of possi-
ble starting bytes is created.
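The idea of a starting-byte bitmap can be illustrated without reference to PCRE's internals. This is a sketch under stated assumptions: the start_bitmap type and all four functions are hypothetical, and the layout (256 bits in 32 bytes) is chosen for the example rather than taken from the library.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative only: a 256-bit set of bytes at which a match could start. */
typedef struct { unsigned char bits[32]; } start_bitmap;

static void bitmap_clear(start_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

static void bitmap_add(start_bitmap *bm, unsigned char c)
{
    bm->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

static int bitmap_has(const start_bitmap *bm, unsigned char c)
{
    return (bm->bits[c >> 3] >> (c & 7)) & 1;
}

/* Offset of the first byte that could begin a match, or len if none. */
static size_t first_candidate(const start_bitmap *bm,
                              const char *subject, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++)
        if (bitmap_has(bm, (unsigned char)subject[i])) break;
    return i;
}
```

For a pattern such as cat|dog the set would hold 'c' and 'd', so a matcher could skip any subject position whose byte is not in the set before attempting a full match.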
LOCALE SUPPORT
- PCRE handles caseless matching, and determines whether characters are
- letters, digits, or whatever, by reference to a set of tables, indexed
- by character value. When running in UTF-8 mode, this applies only to
- characters with codes less than 128. Higher-valued codes never match
- escapes such as \w or \d, but can be tested with \p if PCRE is built
- with Unicode character property support. The use of locales with Uni-
- code is discouraged. If you are handling characters with codes greater
- than 128, you should either use UTF-8 and Unicode, or use locales, but
+ PCRE handles caseless matching, and determines whether characters are
+ letters, digits, or whatever, by reference to a set of tables, indexed
+ by character value. When running in UTF-8 mode, this applies only to
+ characters with codes less than 128. Higher-valued codes never match
+ escapes such as \w or \d, but can be tested with \p if PCRE is built
+ with Unicode character property support. The use of locales with Uni-
+ code is discouraged. If you are handling characters with codes greater
+ than 128, you should either use UTF-8 and Unicode, or use locales, but
not try to mix the two.
- PCRE contains an internal set of tables that are used when the final
- argument of pcre_compile() is NULL. These are sufficient for many
+ PCRE contains an internal set of tables that are used when the final
+ argument of pcre_compile() is NULL. These are sufficient for many
applications. Normally, the internal tables recognize only ASCII char-
acters. However, when PCRE is built, it is possible to cause the inter-
nal tables to be rebuilt in the default "C" locale of the local system,
which may cause them to be different.
- The internal tables can always be overridden by tables supplied by the
+ The internal tables can always be overridden by tables supplied by the
application that calls PCRE. These may be created in a different locale
- from the default. As more and more applications change to using Uni-
+ from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away.
- External tables are built by calling the pcre_maketables() function,
- which has no arguments, in the relevant locale. The result can then be
- passed to pcre_compile() or pcre_exec() as often as necessary. For
- example, to build and use tables that are appropriate for the French
- locale (where accented characters with values greater than 128 are
+ External tables are built by calling the pcre_maketables() function,
+ which has no arguments, in the relevant locale. The result can then be
+ passed to pcre_compile() or pcre_exec() as often as necessary. For
+ example, to build and use tables that are appropriate for the French
+ locale (where accented characters with values greater than 128 are
treated as letters), the following code could be used:
setlocale(LC_CTYPE, "fr_FR");
tables = pcre_maketables();
re = pcre_compile(..., tables);
- The locale name "fr_FR" is used on Linux and other Unix-like systems;
+ The locale name "fr_FR" is used on Linux and other Unix-like systems;
if you are using Windows, the name for the French locale is "french".
- When pcre_maketables() runs, the tables are built in memory that is
- obtained via pcre_malloc. It is the caller's responsibility to ensure
- that the memory containing the tables remains available for as long as
+ When pcre_maketables() runs, the tables are built in memory that is
+ obtained via pcre_malloc. It is the caller's responsibility to ensure
+ that the memory containing the tables remains available for as long as
it is needed.
The pointer that is passed to pcre_compile() is saved with the compiled
- pattern, and the same tables are used via this pointer by pcre_study()
+ pattern, and the same tables are used via this pointer by pcre_study()
and normally also by pcre_exec(). Thus, by default, for any single pat-
tern, compilation, studying and matching all happen in the same locale,
but different patterns can be compiled in different locales.
- It is possible to pass a table pointer or NULL (indicating the use of
- the internal tables) to pcre_exec(). Although not intended for this
- purpose, this facility could be used to match a pattern in a different
+ It is possible to pass a table pointer or NULL (indicating the use of
+ the internal tables) to pcre_exec(). Although not intended for this
+ purpose, this facility could be used to match a pattern in a different
locale from the one in which it was compiled. Passing table pointers at
run time is discussed below in the section on matching a pattern.
@@ -1573,15 +1571,15 @@ INFORMATION ABOUT A PATTERN
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
int what, void *where);
- The pcre_fullinfo() function returns information about a compiled pat-
+ The pcre_fullinfo() function returns information about a compiled pat-
tern. It replaces the obsolete pcre_info() function, which is neverthe-
less retained for backwards compatibility (and is documented below).
- The first argument for pcre_fullinfo() is a pointer to the compiled
- pattern. The second argument is the result of pcre_study(), or NULL if
- the pattern was not studied. The third argument specifies which piece
- of information is required, and the fourth argument is a pointer to a
- variable to receive the data. The yield of the function is zero for
+ The first argument for pcre_fullinfo() is a pointer to the compiled
+ pattern. The second argument is the result of pcre_study(), or NULL if
+ the pattern was not studied. The third argument specifies which piece
+ of information is required, and the fourth argument is a pointer to a
+ variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:
PCRE_ERROR_NULL the argument code was NULL
@@ -1589,9 +1587,9 @@ INFORMATION ABOUT A PATTERN
PCRE_ERROR_BADMAGIC the "magic number" was not found
PCRE_ERROR_BADOPTION the value of what was invalid
- The "magic number" is placed at the start of each compiled pattern as
- a simple check against passing an arbitrary memory pointer. Here is a
- typical call of pcre_fullinfo(), to obtain the length of the compiled
+ The "magic number" is placed at the start of each compiled pattern as
+ a simple check against passing an arbitrary memory pointer. Here is a
+ typical call of pcre_fullinfo(), to obtain the length of the compiled
pattern:
int rc;
@@ -1602,76 +1600,76 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
- The possible values for the third argument are defined in pcre.h, and
+ The possible values for the third argument are defined in pcre.h, and
are as follows:
PCRE_INFO_BACKREFMAX
- Return the number of the highest back reference in the pattern. The
- fourth argument should point to an int variable. Zero is returned if
+ Return the number of the highest back reference in the pattern. The
+ fourth argument should point to an int variable. Zero is returned if
there are no back references.
PCRE_INFO_CAPTURECOUNT
- Return the number of capturing subpatterns in the pattern. The fourth
+ Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.
PCRE_INFO_DEFAULT_TABLES
- Return a pointer to the internal default character tables within PCRE.
- The fourth argument should point to an unsigned char * variable. This
+ Return a pointer to the internal default character tables within PCRE.
+ The fourth argument should point to an unsigned char * variable. This
information call is provided for internal use by the pcre_study() func-
- tion. External callers can cause PCRE to use its internal tables by
+ tion. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer.
PCRE_INFO_FIRSTBYTE
- Return information about the first byte of any matched string, for a
- non-anchored pattern. The fourth argument should point to an int vari-
- able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
+ Return information about the first byte of any matched string, for a
+ non-anchored pattern. The fourth argument should point to an int vari-
+ able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
is still recognized for backwards compatibility.)
- If there is a fixed first byte, for example, from a pattern such as
+ If there is a fixed first byte, for example, from a pattern such as
(cat|cow|coyote), its value is returned. Otherwise, if either
- (a) the pattern was compiled with the PCRE_MULTILINE option, and every
+ (a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),
- -1 is returned, indicating that the pattern matches only at the start
- of a subject string or after any newline within the string. Otherwise
+ -1 is returned, indicating that the pattern matches only at the start
+ of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.
PCRE_INFO_FIRSTTABLE
- If the pattern was studied, and this resulted in the construction of a
+ If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
- matching string, a pointer to the table is returned. Otherwise NULL is
- returned. The fourth argument should point to an unsigned char * vari-
+ matching string, a pointer to the table is returned. Otherwise NULL is
+ returned. The fourth argument should point to an unsigned char * vari-
able.
PCRE_INFO_HASCRORLF
- Return 1 if the pattern contains any explicit matches for CR or LF
- characters, otherwise 0. The fourth argument should point to an int
- variable. An explicit match is either a literal CR or LF character, or
+ Return 1 if the pattern contains any explicit matches for CR or LF
+ characters, otherwise 0. The fourth argument should point to an int
+ variable. An explicit match is either a literal CR or LF character, or
\r or \n.
PCRE_INFO_JCHANGED
- Return 1 if the (?J) or (?-J) option setting is used in the pattern,
- otherwise 0. The fourth argument should point to an int variable. (?J)
+ Return 1 if the (?J) or (?-J) option setting is used in the pattern,
+ otherwise 0. The fourth argument should point to an int variable. (?J)
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
PCRE_INFO_LASTLITERAL
- Return the value of the rightmost literal byte that must exist in any
- matched string, other than at its start, if such a byte has been
+ Return the value of the rightmost literal byte that must exist in any
+ matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
- is no such byte, -1 is returned. For anchored patterns, a last literal
- byte is recorded only if it follows something of variable length. For
+ is no such byte, -1 is returned. For anchored patterns, a last literal
+ byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.
@@ -1679,34 +1677,34 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE
- PCRE supports the use of named as well as numbered capturing parenthe-
- ses. The names are just an additional way of identifying the parenthe-
+ PCRE supports the use of named as well as numbered capturing parenthe-
+ ses. The names are just an additional way of identifying the parenthe-
ses, which still acquire numbers. Several convenience functions such as
- pcre_get_named_substring() are provided for extracting captured sub-
- strings by name. It is also possible to extract the data directly, by
- first converting the name to a number in order to access the correct
+ pcre_get_named_substring() are provided for extracting captured sub-
+ strings by name. It is also possible to extract the data directly, by
+ first converting the name to a number in order to access the correct
pointers in the output vector (described with pcre_exec() below). To do
- the conversion, you need to use the name-to-number map, which is
+ the conversion, you need to use the name-to-number map, which is
described by these three values.
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
- of each entry; both of these return an int value. The entry size
- depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
- a pointer to the first entry of the table (a pointer to char). The
+ of each entry; both of these return an int value. The entry size
+ depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
+ a pointer to the first entry of the table (a pointer to char). The
first two bytes of each entry are the number of the capturing parenthe-
- sis, most significant byte first. The rest of the entry is the corre-
- sponding name, zero terminated. The names are in alphabetical order.
+ sis, most significant byte first. The rest of the entry is the corre-
+ sponding name, zero terminated. The names are in alphabetical order.
When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
- theses numbers. For example, consider the following pattern (assume
- PCRE_EXTENDED is set, so white space - including newlines - is
+ theses numbers. For example, consider the following pattern (assume
+ PCRE_EXTENDED is set, so white space - including newlines - is
ignored):
(?<date> (?<year>(\d\d)?\d\d) -
(?<month>\d\d) - (?<day>\d\d) )
- There are four named subpatterns, so the table has four entries, and
- each entry in the table is eight bytes long. The table is as follows,
+ There are four named subpatterns, so the table has four entries, and
+ each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hexadecimal, and undefined bytes shown
as ??:
@@ -1715,16 +1713,17 @@ INFORMATION ABOUT A PATTERN
00 04 m o n t h 00
00 02 y e a r 00 ??
- When writing code to extract data from named subpatterns using the
- name-to-number map, remember that the length of the entries is likely
+ When writing code to extract data from named subpatterns using the
+ name-to-number map, remember that the length of the entries is likely
to be different for each compiled pattern.
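The name-to-number map described above can be walked with a few lines of C. The following is a minimal sketch, not code from PCRE itself: the function lookup_group_number() and the array demo_table are hypothetical names used only for illustration. It assumes the caller has already obtained the table pointer, the entry count and the entry size from pcre_fullinfo() (PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE), and it treats the table as unsigned bytes so that the two-byte group number can be assembled safely. The demo table mirrors the four-entry example for the date pattern, with the undefined (??) bytes filled with zeros.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find the number of a named capturing group.
 *   table      - first entry of the name table (PCRE_INFO_NAMETABLE)
 *   name_count - number of entries            (PCRE_INFO_NAMECOUNT)
 *   entry_size - size of each entry in bytes  (PCRE_INFO_NAMEENTRYSIZE)
 * Returns the group number, or -1 if the name is not in the table. */
static int lookup_group_number(const unsigned char *table, int name_count,
                               int entry_size, const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* First two bytes: group number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];
        /* Remaining bytes: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return -1;
}

/* Illustrative copy of the example table: four entries of eight bytes.
 * Bytes shown as ?? in the documentation are set to zero here. */
static const unsigned char demo_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

A linear scan is used for simplicity. Because the names are stored in alphabetical order, a binary search would also work. The entry size is always passed in rather than hard-coded, since it differs from one compiled pattern to another.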
PCRE_INFO_OKPARTIAL
- Return 1 if the pattern can be used for partial matching, otherwise 0.
- The fourth argument should point to an int variable. The pcrepartial
- documentation lists the restrictions that apply to patterns when par-
- tial matching is used.
+ Return 1 if the pattern can be used for partial matching, otherwise 0.
+ The fourth argument should point to an int variable. From release 8.00,
+ this always returns 1, because the restrictions that previously applied
+ to partial matching have been lifted. The pcrepartial documentation
+ gives details of partial matching.
PCRE_INFO_OPTIONS
@@ -1929,7 +1928,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
The unused bits of the options argument for pcre_exec() must be zero.
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE,
- PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.
+ PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD.
PCRE_ANCHORED
@@ -2021,7 +2020,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
if that fails by advancing the starting offset (see below) and trying
an ordinary match again. There is some code that demonstrates how to do
- this in the pcredemo.c sample program.
+ this in the pcredemo sample program.
PCRE_NO_START_OPTIMIZE
@@ -2056,128 +2055,132 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
value of startoffset that does not point to the start of a UTF-8 char-
acter, is undefined. Your program may crash.
- PCRE_PARTIAL
-
- This option turns on the partial matching feature. If the subject
- string fails to match the pattern, but at some point during the match-
- ing process the end of the subject was reached (that is, the subject
- partially matches the pattern and the failure to match occurred only
- because there were not enough subject characters), pcre_exec() returns
- PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
- used, there are restrictions on what may appear in the pattern. These
- are discussed in the pcrepartial documentation.
+ PCRE_PARTIAL_HARD
+ PCRE_PARTIAL_SOFT
+
+ These options turn on the partial matching feature. For backwards com-
+ patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
+ match occurs if the end of the subject string is reached successfully,
+ but there are not enough subject characters to complete the match. If
+ this happens when PCRE_PARTIAL_HARD is set, pcre_exec() immediately
+ returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set,
+ matching continues by testing any other alternatives. Only if they all
+ fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH).
+ The portion of the string that provided the partial match is set as the
+ first matching string. There is a more detailed discussion in the
+ pcrepartial documentation.
The string to be matched by pcre_exec()
- The subject string is passed to pcre_exec() as a pointer in subject, a
+ The subject string is passed to pcre_exec() as a pointer in subject, a
length (in bytes) in length, and a starting byte offset in startoffset.
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
- acter. Unlike the pattern string, the subject may contain binary zero
- bytes. When the starting offset is zero, the search for a match starts
- at the beginning of the subject, and this is by far the most common
+ acter. Unlike the pattern string, the subject may contain binary zero
+ bytes. When the starting offset is zero, the search for a match starts
+ at the beginning of the subject, and this is by far the most common
case.
- A non-zero starting offset is useful when searching for another match
- in the same subject by calling pcre_exec() again after a previous suc-
- cess. Setting startoffset differs from just passing over a shortened
- string and setting PCRE_NOTBOL in the case of a pattern that begins
+ A non-zero starting offset is useful when searching for another match
+ in the same subject by calling pcre_exec() again after a previous suc-
+ cess. Setting startoffset differs from just passing over a shortened
+ string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern
\Biss\B
- which finds occurrences of "iss" in the middle of words. (\B matches
- only if the current position in the subject is not a word boundary.)
- When applied to the string "Mississipi" the first call to pcre_exec()
- finds the first occurrence. If pcre_exec() is called again with just
- the remainder of the subject, namely "issipi", it does not match,
+ which finds occurrences of "iss" in the middle of words. (\B matches
+ only if the current position in the subject is not a word boundary.)
+ When applied to the string "Mississipi" the first call to pcre_exec()
+ finds the first occurrence. If pcre_exec() is called again with just
+ the remainder of the subject, namely "issipi", it does not match,
because \B is always false at the start of the subject, which is deemed
- to be a word boundary. However, if pcre_exec() is passed the entire
+ to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second occur-
- rence of "iss" because it is able to look behind the starting point to
+ rence of "iss" because it is able to look behind the starting point to
discover that it is preceded by a letter.
- If a non-zero starting offset is passed when the pattern is anchored,
+ If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed
- if the pattern does not require the match to be at the start of the
+ if the pattern does not require the match to be at the start of the
subject.
How pcre_exec() returns captured substrings
- In general, a pattern matches a certain portion of the subject, and in
- addition, further substrings from the subject may be picked out by
- parts of the pattern. Following the usage in Jeffrey Friedl's book,
- this is called "capturing" in what follows, and the phrase "capturing
- subpattern" is used for a fragment of a pattern that picks out a sub-
- string. PCRE supports several other kinds of parenthesized subpattern
+ In general, a pattern matches a certain portion of the subject, and in
+ addition, further substrings from the subject may be picked out by
+ parts of the pattern. Following the usage in Jeffrey Friedl's book,
+ this is called "capturing" in what follows, and the phrase "capturing
+ subpattern" is used for a fragment of a pattern that picks out a sub-
+ string. PCRE supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured.
Captured substrings are returned to the caller via a vector of integers
- whose address is passed in ovector. The number of elements in the vec-
- tor is passed in ovecsize, which must be a non-negative number. Note:
+ whose address is passed in ovector. The number of elements in the vec-
+ tor is passed in ovecsize, which must be a non-negative number. Note:
this argument is NOT the size of ovector in bytes.
- The first two-thirds of the vector is used to pass back captured sub-
- strings, each substring using a pair of integers. The remaining third
- of the vector is used as workspace by pcre_exec() while matching cap-
- turing subpatterns, and is not available for passing back information.
- The number passed in ovecsize should always be a multiple of three. If
+ The first two-thirds of the vector is used to pass back captured sub-
+ strings, each substring using a pair of integers. The remaining third
+ of the vector is used as workspace by pcre_exec() while matching cap-
+ turing subpatterns, and is not available for passing back information.
+ The number passed in ovecsize should always be a multiple of three. If
it is not, it is rounded down.
- When a match is successful, information about captured substrings is
- returned in pairs of integers, starting at the beginning of ovector,
- and continuing up to two-thirds of its length at the most. The first
- element of each pair is set to the byte offset of the first character
- in a substring, and the second is set to the byte offset of the first
- character after the end of a substring. Note: these values are always
+ When a match is successful, information about captured substrings is
+ returned in pairs of integers, starting at the beginning of ovector,
+ and continuing up to two-thirds of its length at the most. The first
+ element of each pair is set to the byte offset of the first character
+ in a substring, and the second is set to the byte offset of the first
+ character after the end of a substring. Note: these values are always
byte offsets, even in UTF-8 mode. They are not character counts.
- The first pair of integers, ovector[0] and ovector[1], identify the
- portion of the subject string matched by the entire pattern. The next
- pair is used for the first capturing subpattern, and so on. The value
+ The first pair of integers, ovector[0] and ovector[1], identify the
+ portion of the subject string matched by the entire pattern. The next
+ pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
- has been set. For example, if two substrings have been captured, the
- returned value is 3. If there are no capturing subpatterns, the return
+ has been set. For example, if two substrings have been captured, the
+ returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.
If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.
- If the vector is too small to hold all the captured substring offsets,
+ If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
- function returns a value of zero. If the substring offsets are not of
- interest, pcre_exec() may be called with ovector passed as NULL and
- ovecsize as zero. However, if the pattern contains back references and
- the ovector is not big enough to remember the related substrings, PCRE
- has to get additional memory for use during matching. Thus it is usu-
+ function returns a value of zero. If the substring offsets are not of
+ interest, pcre_exec() may be called with ovector passed as NULL and
+ ovecsize as zero. However, if the pattern contains back references and
+ the ovector is not big enough to remember the related substrings, PCRE
+ has to get additional memory for use during matching. Thus it is usu-
ally advisable to supply an ovector.
- The pcre_info() function can be used to find out how many capturing
- subpatterns there are in a compiled pattern. The smallest size for
- ovector that will allow for n captured substrings, in addition to the
+ The pcre_info() function can be used to find out how many capturing
+ subpatterns there are in a compiled pattern. The smallest size for
+ ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
- It is possible for capturing subpattern number n+1 to match some part
+ It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
- if the string "abc" is matched against the pattern (a|(z))(bc) the
+ if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
- 2 is not. When this happens, both values in the offset pairs corre-
+ 2 is not. When this happens, both values in the offset pairs corre-
sponding to unused subpatterns are set to -1.
- Offset values that correspond to unused subpatterns at the end of the
- expression are also set to -1. For example, if the string "abc" is
- matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
- matched. The return from the function is 2, because the highest used
+ Offset values that correspond to unused subpatterns at the end of the
+ expression are also set to -1. For example, if the string "abc" is
+ matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
+ matched. The return from the function is 2, because the highest used
capturing subpattern number is 1. However, you can refer to the offsets
- for the second and third capturing subpatterns if you wish (assuming
+ for the second and third capturing subpatterns if you wish (assuming
the vector is large enough, of course).
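The layout of the output vector described above can be illustrated without calling PCRE at all. The following is a minimal sketch under stated assumptions: ovector_size_for() and copy_capture() are hypothetical helpers, not part of the PCRE API (the library's own convenience functions are described later in this document). The sketch assumes the vector has already been filled in by a successful pcre_exec() call, and that pair_count is the number of offset pairs the vector can hold, that is, ovecsize divided by three.

```c
#include <stddef.h>
#include <string.h>

/* Smallest ovecsize that allows for n captured substrings plus the
 * offsets of the whole match: (n+1)*3. */
static int ovector_size_for(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Hypothetical helper: copy captured substring number `group` (0 is the
 * whole match) from subject into buffer as a zero-terminated string.
 * Returns the substring length, -1 if the group is out of range or was
 * not used in the match (offsets set to -1), or -2 if the buffer is too
 * small. */
static int copy_capture(const char *subject, const int *ovector,
                        int pair_count, int group,
                        char *buffer, size_t buffer_size)
{
    int start, end;
    size_t length;

    if (group < 0 || group >= pair_count)
        return -1;

    start = ovector[2 * group];      /* byte offset of first character  */
    end = ovector[2 * group + 1];    /* byte offset just past the end   */
    if (start < 0 || end < start)
        return -1;                   /* unused subpattern               */

    length = (size_t)(end - start);
    if (length + 1 > buffer_size)
        return -2;

    memcpy(buffer, subject + start, length);
    buffer[length] = '\0';
    return (int)length;
}
```

For the example of "abc" matched against (a|(z))(bc), the first two-thirds of a 12-element vector would hold the pairs (0,3), (0,1), (-1,-1) and (1,3). Calling copy_capture() for groups 0, 1 and 3 then yields "abc", "a" and "bc", while group 2 is reported as unused.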
- Some convenience functions are provided for extracting the captured
+ Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.
Error return values from pcre_exec()
- If pcre_exec() fails, it returns a negative number. The following are
+ If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:
PCRE_ERROR_NOMATCH (-1)
@@ -2186,7 +2189,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_NULL (-2)
- Either code or subject was passed as NULL, or ovector was NULL and
+ Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.
PCRE_ERROR_BADOPTION (-3)
@@ -2195,65 +2198,66 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_BADMAGIC (-4)
- PCRE stores a 4-byte "magic number" at the start of the compiled code,
+ PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in
- an environment with the other endianness. This is the error that PCRE
+ an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present.
PCRE_ERROR_UNKNOWN_OPCODE (-5)
While running the pattern match, an unknown item was encountered in the
- compiled pattern. This error could be caused by a bug in PCRE or by
+ compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.
PCRE_ERROR_NOMEMORY (-6)
- If a pattern contains back references, but the ovector that is passed
+ If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
- PCRE gets a block of memory at the start of matching to use for this
- purpose. If the call via pcre_malloc() fails, this error is given. The
+ PCRE gets a block of memory at the start of matching to use for this
+ purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching.
PCRE_ERROR_NOSUBSTRING (-7)
- This error is used by the pcre_copy_substring(), pcre_get_substring(),
+ This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().
PCRE_ERROR_MATCHLIMIT (-8)
- The backtracking limit, as specified by the match_limit field in a
- pcre_extra structure (or defaulted) was reached. See the description
+ The backtracking limit, as specified by the match_limit field in a
+ pcre_extra structure (or defaulted) was reached. See the description
above.
PCRE_ERROR_CALLOUT (-9)
This error is never generated by pcre_exec() itself. It is provided for
- use by callout functions that want to yield a distinctive error code.
+ use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.
PCRE_ERROR_BADUTF8 (-10)
- A string that contains an invalid UTF-8 byte sequence was passed as a
+ A string that contains an invalid UTF-8 byte sequence was passed as a
subject.
PCRE_ERROR_BADUTF8_OFFSET (-11)
The UTF-8 byte sequence that was passed as a subject was valid, but the
- value of startoffset did not point to the beginning of a UTF-8 charac-
+ value of startoffset did not point to the beginning of a UTF-8 charac-
ter.
PCRE_ERROR_PARTIAL (-12)
- The subject string did not match, but it did match partially. See the
+ The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.
PCRE_ERROR_BADPARTIAL (-13)
- The PCRE_PARTIAL option was used with a compiled pattern containing
- items that are not supported for partial matching. See the pcrepartial
- documentation for details of partial matching.
+ This code is no longer in use. It was formerly returned when the
+ PCRE_PARTIAL option was used with a compiled pattern containing items
+ that were not supported for partial matching. From release 8.00
+ onwards, there are no restrictions on partial matching.
PCRE_ERROR_INTERNAL (-14)
@@ -2517,19 +2521,24 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
The unused bits of the options argument for pcre_dfa_exec() must be
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
- PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
- three of these are the same as for pcre_exec(), so their description is
- not repeated here.
-
- PCRE_PARTIAL
-
- This has the same general effect as it does for pcre_exec(), but the
- details are slightly different. When PCRE_PARTIAL is set for
- pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into
- PCRE_ERROR_PARTIAL if the end of the subject is reached, there have
- been no complete matches, but there is still at least one matching pos-
- sibility. The portion of the string that provided the partial match is
- set as the first matching string.
+ PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and
+ PCRE_DFA_RESTART. All but the last four of these are exactly the same
+ as for pcre_exec(), so their description is not repeated here.
+
+ PCRE_PARTIAL_HARD
+ PCRE_PARTIAL_SOFT
+
+ These have the same general effect as they do for pcre_exec(), but the
+ details are slightly different. When PCRE_PARTIAL_HARD is set for
+ pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
+ ject is reached and there is still at least one matching possibility
+ that requires additional characters. This happens even if some complete
+ matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
+ code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
+ of the subject is reached, there have been no complete matches, but
+ there is still at least one matching possibility. The portion of the
+ string that provided the longest partial match is set as the first
+ matching string in both cases.
PCRE_DFA_SHORTEST
@@ -2540,21 +2549,20 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
PCRE_DFA_RESTART
- When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and
- returns a partial match, it is possible to call it again, with addi-
- tional subject characters, and have it continue with the same match.
- The PCRE_DFA_RESTART option requests this action; when it is set, the
- workspace and wscount options must reference the same vector as before
- because data about the match so far is left in them after a partial
- match. There is more discussion of this facility in the pcrepartial
- documentation.
+ When pcre_dfa_exec() returns a partial match, it is possible to call it
+ again, with additional subject characters, and have it continue with
+ the same match. The PCRE_DFA_RESTART option requests this action; when
+ it is set, the workspace and wscount options must reference the same
+ vector as before because data about the match so far is left in them
+ after a partial match. There is more discussion of this facility in the
+ pcrepartial documentation.
Successful returns from pcre_dfa_exec()
- When pcre_dfa_exec() succeeds, it may have matched more than one sub-
+ When pcre_dfa_exec() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run
- of the function start at the same point in the subject. The shorter
- matches are all initial substrings of the longer matches. For example,
+ of the function start at the same point in the subject. The shorter
+ matches are all initial substrings of the longer matches. For example,
if the pattern
<.*>
@@ -2569,61 +2577,61 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
<something> <something else>
<something> <something else> <something further>
- On success, the yield of the function is a number greater than zero,
- which is the number of matched substrings. The substrings themselves
- are returned in ovector. Each string uses two elements; the first is
- the offset to the start, and the second is the offset to the end. In
- fact, all the strings have the same start offset. (Space could have
- been saved by giving this only once, but it was decided to retain some
- compatibility with the way pcre_exec() returns data, even though the
+ On success, the yield of the function is a number greater than zero,
+ which is the number of matched substrings. The substrings themselves
+ are returned in ovector. Each string uses two elements; the first is
+ the offset to the start, and the second is the offset to the end. In
+ fact, all the strings have the same start offset. (Space could have
+ been saved by giving this only once, but it was decided to retain some
+ compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.)
The strings are returned in reverse order of length; that is, the long-
- est matching string is given first. If there were too many matches to
- fit into ovector, the yield of the function is zero, and the vector is
+ est matching string is given first. If there were too many matches to
+ fit into ovector, the yield of the function is zero, and the vector is
filled with the longest matches.
Error returns from pcre_dfa_exec()
- The pcre_dfa_exec() function returns a negative number when it fails.
- Many of the errors are the same as for pcre_exec(), and these are
- described above. There are in addition the following errors that are
+ The pcre_dfa_exec() function returns a negative number when it fails.
+ Many of the errors are the same as for pcre_exec(), and these are
+ described above. There are in addition the following errors that are
specific to pcre_dfa_exec():
PCRE_ERROR_DFA_UITEM (-16)
- This return is given if pcre_dfa_exec() encounters an item in the pat-
- tern that it does not support, for instance, the use of \C or a back
+ This return is given if pcre_dfa_exec() encounters an item in the pat-
+ tern that it does not support, for instance, the use of \C or a back
reference.
PCRE_ERROR_DFA_UCOND (-17)
- This return is given if pcre_dfa_exec() encounters a condition item
- that uses a back reference for the condition, or a test for recursion
+ This return is given if pcre_dfa_exec() encounters a condition item
+ that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.
PCRE_ERROR_DFA_UMLIMIT (-18)
- This return is given if pcre_dfa_exec() is called with an extra block
+ This return is given if pcre_dfa_exec() is called with an extra block
that contains a setting of the match_limit field. This is not supported
(it is meaningless).
PCRE_ERROR_DFA_WSSIZE (-19)
- This return is given if pcre_dfa_exec() runs out of space in the
+ This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.
PCRE_ERROR_DFA_RECURSE (-20)
- When a recursive subpattern is processed, the matching function calls
- itself recursively, using private vectors for ovector and workspace.
- This error is given if the output vector is not large enough. This
+ When a recursive subpattern is processed, the matching function calls
+ itself recursively, using private vectors for ovector and workspace.
+ This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.
SEE ALSO
- pcrebuild(3), pcrecallout(3), pcrecpp(3)(3), pcrematching(3), pcrepar-
+ pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar-
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
@@ -2636,11 +2644,11 @@ AUTHOR
REVISION
- Last updated: 11 April 2009
+ Last updated: 01 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRECALLOUT(3) PCRECALLOUT(3)
@@ -2815,8 +2823,8 @@ REVISION
Last updated: 15 March 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRECOMPAT(3) PCRECOMPAT(3)
@@ -2829,7 +2837,7 @@ DIFFERENCES BETWEEN PCRE AND PERL
This document describes the differences in the ways that PCRE and Perl
handle regular expressions. The differences described here are mainly
with respect to Perl 5.8, though PCRE versions 7.0 and later contain
- some features that are expected to be in the forthcoming Perl 5.10.
+ some features that are in Perl 5.10.
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
of what it does have are given in the section on UTF-8 support in the
@@ -2953,11 +2961,11 @@ AUTHOR
REVISION
- Last updated: 11 September 2007
- Copyright (c) 1997-2007 University of Cambridge.
+ Last updated: 25 August 2009
+ Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPATTERN(3) PCREPATTERN(3)
@@ -5034,8 +5042,8 @@ REVISION
Last updated: 11 April 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRESYNTAX(3) PCRESYNTAX(3)
@@ -5387,8 +5395,8 @@ REVISION
Last updated: 11 April 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPARTIAL(3) PCREPARTIAL(3)
@@ -5412,77 +5420,162 @@ PARTIAL MATCHING IN PCRE
If the application sees the user's keystrokes one by one, and can check
that what has been typed so far is potentially valid, it is able to
- raise an error as soon as a mistake is made, possibly beeping and not
- reflecting the character that has been typed. This immediate feedback
- is likely to be a better user interface than a check that is delayed
- until the entire string has been entered.
+ raise an error as soon as a mistake is made, by beeping and not
+ reflecting the character that has been typed, for example. This immedi-
+ ate feedback is likely to be a better user interface than a check that
+ is delayed until the entire string has been entered. Partial matching
+ can also sometimes be useful when the subject string is very long and
+ is not all available at once.
+
+ PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
+ PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
+ pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
+ for PCRE_PARTIAL_SOFT. The essential difference between the two options
+ is whether or not a partial match is preferred to an alternative com-
+ plete match, though the details differ between the two matching func-
+ tions. If both options are set, PCRE_PARTIAL_HARD takes precedence.
+
+ Setting a partial matching option disables one of PCRE's optimizations.
+ PCRE remembers the last literal byte in a pattern, and abandons match-
+ ing immediately if such a byte is not present in the subject string.
+ This optimization cannot be used for a subject string that might match
+ only partially.
+
+
+PARTIAL MATCHING USING pcre_exec()
+
+ A partial match occurs during a call to pcre_exec() whenever the end of
+ the subject string is reached successfully, but matching cannot con-
+ tinue because more characters are needed. However, at least one charac-
+ ter must have been matched. (In other words, a partial match can never
+ be an empty string.)
+
+ If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but
+ matching continues as normal, and other alternatives in the pattern are
+ tried. If no complete match can be found, pcre_exec() returns
+ PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH, and if there are at
+ least two slots in the offsets vector, they are filled in with the off-
+ sets of the longest string that partially matched. Consider this pat-
+ tern:
+
+ /123\w+X|dogY/
+
+ If this is matched against the subject string "abc123dog", both alter-
+ natives fail to match, but the end of the subject is reached during
+ matching, so PCRE_ERROR_PARTIAL is returned instead of
+ PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying
+ "123dog" as the longest partial match that was found. (In this example,
+ there are two partial matches, because "dog" on its own partially
+ matches the second alternative.)
+
+ If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR-
+ TIAL as soon as a partial match is found, without continuing to search
+ for possible complete matches. The difference between the two options
+ can be illustrated by a pattern such as:
+
+ /dog(sbody)?/
+
+ This matches either "dog" or "dogsbody", greedily (that is, it prefers
+ the longer string if possible). If it is matched against the string
+ "dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
+ However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
+ On the other hand, if the pattern is made ungreedy the result is dif-
+ ferent:
+
+ /dog(sbody)??/
+
+ In this case the result is always a complete match because pcre_exec()
+ finds that first, and it never continues after finding a match. It
+ might be easier to follow this explanation by thinking of the two pat-
+ terns like this:
+
+ /dog(sbody)?/ is the same as /dogsbody|dog/
+ /dog(sbody)??/ is the same as /dog|dogsbody/
+
+ The second pattern will never match "dogsbody" when pcre_exec() is
+ used, because it will always find the shorter match first.
+
+
+PARTIAL MATCHING USING pcre_dfa_exec()
+
+ The pcre_dfa_exec() function moves along the subject string character
+ by character, without backtracking, searching for all possible matches
+ simultaneously. If the end of the subject is reached before the end of
+ the pattern, there is the possibility of a partial match, again pro-
+ vided that at least one character has matched.
+
+ When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
+ there have been no complete matches. Otherwise, the complete matches
+ are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
+ takes precedence over any complete matches. The portion of the string
+ that provided the longest partial match is set as the first matching
+ string, provided there are at least two slots in the offsets vector.
- PCRE supports the concept of partial matching by means of the PCRE_PAR-
- TIAL option, which can be set when calling pcre_exec() or
- pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code
- PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
- during the matching process the last part of the subject string matched
- part of the pattern. Unfortunately, for non-anchored matching, it is
- not possible to obtain the position of the start of the partial match.
- No captured data is set when PCRE_ERROR_PARTIAL is returned.
+ Because pcre_dfa_exec() always searches for all possible matches, and
+ there is no difference between greedy and ungreedy repetition, its
+ behaviour is different from pcre_exec() when PCRE_PARTIAL_HARD is set.
+ Consider the string "dog" matched against the ungreedy pattern shown
+ above:
- When PCRE_PARTIAL is set for pcre_dfa_exec(), the return code
- PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of
- the subject is reached, there have been no complete matches, but there
- is still at least one matching possibility. The portion of the string
- that provided the partial match is set as the first matching string.
+ /dog(sbody)??/
- Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers
- the last literal byte in a pattern, and abandons matching immediately
- if such a byte is not present in the subject string. This optimization
- cannot be used for a subject string that might match only partially.
+ Whereas pcre_exec() stops as soon as it finds the complete match for
+ "dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
+ so returns that when PCRE_PARTIAL_HARD is set.
-RESTRICTED PATTERNS FOR PCRE_PARTIAL
+PARTIAL MATCHING AND WORD BOUNDARIES
- Because of the way certain internal optimizations are implemented in
- the pcre_exec() function, the PCRE_PARTIAL option cannot be used with
- all patterns. These restrictions do not apply when pcre_dfa_exec() is
- used. For pcre_exec(), repeated single characters such as
+ If a pattern ends with one of the sequences \b or \B, which test for
+ word boundaries, partial matching with PCRE_PARTIAL_SOFT can give
+ counter-intuitive results. Consider this pattern:
- a{2,4}
+ /\bcat\b/
- and repeated single metasequences such as
+ This matches "cat", provided there is a word boundary at either end. If
+ the subject string is "the cat", the comparison of the final "t" with a
+ following character cannot take place, so a partial match is found.
+ However, pcre_exec() carries on with normal matching, which matches \b
+ at the end of the subject when the last character is a letter, thus
+ finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
+ TIAL. The same thing happens with pcre_dfa_exec(), because it also
+ finds the complete match.
- \d+
+ Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
+ because then the partial match takes precedence.
- are not permitted if the maximum number of occurrences is greater than
- one. Optional items such as \d? (where the maximum is one) are permit-
- ted. Quantifiers with any values are permitted after parentheses, so
- the invalid examples above can be coded thus:
- (a){2,4}
- (\d)+
+FORMERLY RESTRICTED PATTERNS
- These constructions run more slowly, but for the kinds of application
- that are envisaged for this facility, this is not felt to be a major
- restriction.
+ For releases of PCRE prior to 8.00, because of the way certain internal
+ optimizations were implemented in the pcre_exec() function, the
+ PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
+ used with all patterns. From release 8.00 onwards, the restrictions no
+ longer apply, and partial matching with pcre_exec() can be requested
+ for any pattern.
- If PCRE_PARTIAL is set for a pattern that does not conform to the
- restrictions, pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
- (-13). You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to
- find out if a compiled pattern can be used for partial matching.
+ Items that were formerly restricted were repeated single characters and
+ repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
+ not conform to the restrictions, pcre_exec() returned the error code
+ PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
+ PCRE_INFO_OKPARTIAL call to pcre_fullinfo(), which reports whether a
+ compiled pattern can be used for partial matching, now always returns 1.
EXAMPLE OF PARTIAL MATCHING USING PCRETEST
If the escape sequence \P is present in a pcretest data line, the
- PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
- uses the date example quoted above:
+ PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
+ pcretest that uses the date example quoted above:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data> 25jun04\P
0: 25jun04
1: jun
data> 25dec3\P
- Partial match
+ Partial match: 25dec3
data> 3ju\P
- Partial match
+ Partial match: 3ju
data> 3juj\P
No match
data> j\P
@@ -5490,36 +5583,23 @@ EXAMPLE OF PARTIAL MATCHING USING PCRETEST
The first data string is matched completely, so pcretest shows the
matched substrings. The remaining four strings do not match the com-
- plete pattern, but the first two are partial matches. The same test,
- using pcre_dfa_exec() matching (by means of the \D escape sequence),
- produces the following output:
+ plete pattern, but the first two are partial matches. Similar output is
+ obtained when pcre_dfa_exec() is used.
- re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
- data> 25jun04\P\D
- 0: 25jun04
- data> 23dec3\P\D
- Partial match: 23dec3
- data> 3ju\P\D
- Partial match: 3ju
- data> 3juj\P\D
- No match
- data> j\P\D
- No match
-
- Notice that in this case the portion of the string that was matched is
- made available.
+ If the escape sequence \P is present more than once in a pcretest data
+ line, the PCRE_PARTIAL_HARD option is set for the match.
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
When a partial match has been found using pcre_dfa_exec(), it is possi-
- ble to continue the match by providing additional subject data and
- calling pcre_dfa_exec() again with the same compiled regular expres-
- sion, this time setting the PCRE_DFA_RESTART option. You must also pass
- the same working space as before, because this is where details of the
- previous partial match are stored. Here is an example using pcretest,
- using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and
- \D are as above):
+ ble to continue the match by providing additional subject data and
+ calling pcre_dfa_exec() again with the same compiled regular expres-
+ sion, this time setting the PCRE_DFA_RESTART option. You must pass the
+ same working space as before, because this is where details of the pre-
+ vious partial match are stored. Here is an example using pcretest,
+ using the \R escape sequence to set the PCRE_DFA_RESTART option (\D
+ specifies the use of pcre_dfa_exec()):
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data> 23ja\P\D
@@ -5527,38 +5607,71 @@ MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
data> n05\R\D
0: n05
- The first call has "23ja" as the subject, and requests partial match-
- ing; the second call has "n05" as the subject for the continued
- (restarted) match. Notice that when the match is complete, only the
- last part is shown; PCRE does not retain the previously partially-
- matched string. It is up to the calling program to do that if it needs
+ The first call has "23ja" as the subject, and requests partial match-
+ ing; the second call has "n05" as the subject for the continued
+ (restarted) match. Notice that when the match is complete, only the
+ last part is shown; PCRE does not retain the previously partially-
+ matched string. It is up to the calling program to do that if it needs
to.
- You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial
- matching over multiple segments. This facility can be used to pass very
- long subject strings to pcre_dfa_exec(). However, some care is needed
- for certain types of pattern.
+ You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
+ PCRE_DFA_RESTART to continue partial matching over multiple segments.
+ This facility can be used to pass very long subject strings to
+ pcre_dfa_exec().
- 1. If the pattern contains tests for the beginning or end of a line,
- you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
- ate, when the subject string for any call does not contain the begin-
- ning or end of a line.
- 2. If the pattern contains backward assertions (including \b or \B),
- you need to arrange for some overlap in the subject strings to allow
- for this. For example, you could pass the subject in chunks that are
- 500 bytes long, but in a buffer of 700 bytes, with the starting offset
- set to 200 and the previous 200 bytes at the start of the buffer.
+MULTI-SEGMENT MATCHING WITH pcre_exec()
+
+ From release 8.00, pcre_exec() can also be used to do multi-segment
+ matching. Unlike pcre_dfa_exec(), it is not possible to restart the
+ previous match with a new segment of data. Instead, new data must be
+ added to the previous subject string, and the entire match re-run,
+ starting from the point where the partial match occurred. Earlier data
+ can be discarded. Consider an unanchored pattern that matches dates:
+
+ re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
+ data> The date is 23ja\P
+ Partial match: 23ja
+
+ At this stage, an application could discard the text preceding "23ja",
+ add on text from the next segment, and call pcre_exec() again. Unlike
+ pcre_dfa_exec(), the entire matching string must always be available,
+ and the complete matching process occurs for each call, so more memory
+ and more processing time are needed.
- 3. Matching a subject string that is split into multiple segments does
- not always produce exactly the same result as matching over one single
- long string. The difference arises when there are multiple matching
- possibilities, because a partial match result is given only when there
- are no completed matches in a call to pcre_dfa_exec(). This means that
- as soon as the shortest match has been found, continuation to a new
- subject segment is no longer possible. Consider this pcretest example:
+
+ISSUES WITH MULTI-SEGMENT MATCHING
+
+ Certain types of pattern may give problems with multi-segment matching,
+ whichever matching function is used.
+
+ 1. If the pattern contains tests for the beginning or end of a line,
+ you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
+ ate, when the subject string for any call does not contain the begin-
+ ning or end of a line.
+
+ 2. If the pattern contains backward assertions (including \b or \B),
+ you need to arrange for some overlap in the subject strings to allow
+ for them to be correctly tested at the start of each substring. For
+ example, using pcre_dfa_exec(), you could pass the subject in chunks
+ that are 500 bytes long, but in a buffer of 700 bytes, with the start-
+ ing offset set to 200 and the previous 200 bytes at the start of the
+ buffer.
+
+ 3. Matching a subject string that is split into multiple segments may
+ not always produce exactly the same result as matching over one single
+ long string, especially when PCRE_PARTIAL_SOFT is used. The section
+ "Partial Matching and Word Boundaries" above describes an issue that
+ arises if the pattern ends with \b or \B. Another kind of difference
+ may occur when there are multiple matching possibilities, because a
+ partial match result is given only when there are no completed matches.
+ This means that as soon as the shortest match has been found, continua-
+ tion to a new subject segment is no longer possible. Consider again
+ this pcretest example:
re> /dog(sbody)?/
+ data> dogsb\P
+ 0: dog
data> do\P\D
Partial match: do
data> gsb\R\P\D
@@ -5567,18 +5680,31 @@ MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
0: dogsbody
1: dog
- The pattern matches the words "dog" or "dogsbody". When the subject is
- presented in several parts ("do" and "gsb" being the first two) the
- match stops when "dog" has been found, and it is not possible to con-
- tinue. On the other hand, if "dogsbody" is presented as a single
- string, both matches are found.
+ The first data line passes the string "dogsb" to pcre_exec(), setting
+ the PCRE_PARTIAL_SOFT option. Although the string is a partial match
+ for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
+ shorter string "dog" is a complete match. Similarly, when the subject
+ is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
+ the first two) the match stops when "dog" has been found, and it is not
+ possible to continue. On the other hand, if "dogsbody" is presented as
+ a single string, pcre_dfa_exec() finds both matches.
+
+ Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
+ when matching multi-segment data. The example above then behaves dif-
+ ferently:
+
+ re> /dog(sbody)?/
+ data> dogsb\P\P
+ Partial match: dogsb
+ data> do\P\D
+ Partial match: do
+ data> gsb\R\P\P\D
+ Partial match: gsb
- Because of this phenomenon, it does not usually make sense to end a
- pattern that is going to be matched in this way with a variable repeat.
4. Patterns that contain alternatives at the top level which do not all
- start with the same pattern item may not work as expected. For example,
- consider this pattern:
+ start with the same pattern item may not work as expected when
+ pcre_dfa_exec() is used. For example, consider this pattern:
1234|3789
@@ -5586,14 +5712,23 @@ MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
point in the subject string. Attempting to continue with the string
- "789" does not yield a match because only those alternatives that match
- at one point in the subject are remembered. The problem arises because
- the start of the second alternative matches within the first alterna-
- tive. There is no problem with anchored patterns or patterns such as:
+ "7890" does not yield a match because only those alternatives that
+ match at one point in the subject are remembered. The problem arises
+ because the start of the second alternative matches within the first
+ alternative. There is no problem with anchored patterns or patterns
+ such as:
1234|ABCD
- where no string can be a partial match for both alternatives.
+ where no string can be a partial match for both alternatives. This is
+ not a problem if pcre_exec() is used, because the entire match has to
+ be rerun each time:
+
+ re> /1234|3789/
+ data> ABC123\P
+ Partial match: 123
+ data> 1237890
+ 0: 3789
AUTHOR
@@ -5605,11 +5740,11 @@ AUTHOR
REVISION
- Last updated: 04 June 2007
- Copyright (c) 1997-2007 University of Cambridge.
+ Last updated: 31 August 2009
+ Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPRECOMPILE(3) PCREPRECOMPILE(3)
@@ -5732,8 +5867,8 @@ REVISION
Last updated: 13 June 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPERFORM(3) PCREPERFORM(3)
@@ -5882,8 +6017,8 @@ REVISION
Last updated: 06 March 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPOSIX(3) PCREPOSIX(3)
@@ -6001,6 +6136,10 @@ COMPILING A PATTERN
is public: re_nsub contains the number of capturing subpatterns in the
regular expression. Various error codes are defined in the header file.
+ NOTE: If the yield of regcomp() is non-zero, you must not attempt to
+ use the contents of the preg structure. If, for example, you pass it to
+ regexec(), the result is undefined and your program is likely to crash.
+
MATCHING NEWLINE CHARACTERS
@@ -6118,11 +6257,11 @@ AUTHOR
REVISION
- Last updated: 11 March 2009
+ Last updated: 15 August 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRECPP(3) PCRECPP(3)
@@ -6462,8 +6601,8 @@ REVISION
Last updated: 17 March 2009
------------------------------------------------------------------------------
-
-
+
+
PCRESAMPLE(3) PCRESAMPLE(3)
@@ -6474,53 +6613,56 @@ NAME
PCRE SAMPLE PROGRAM
A simple, complete demonstration program, to get you started with using
- PCRE, is supplied in the file pcredemo.c in the PCRE distribution.
+ PCRE, is supplied in the file pcredemo.c in the PCRE distribution. A
+ listing of this program is given in the pcredemo documentation. If you
+ do not have a copy of the PCRE distribution, you can save this listing
+ to re-create pcredemo.c.
The program compiles the regular expression that is its first argument,
- and matches it against the subject string in its second argument. No
- PCRE options are set, and default character tables are used. If match-
- ing succeeds, the program outputs the portion of the subject that
+ and matches it against the subject string in its second argument. No
+ PCRE options are set, and default character tables are used. If match-
+ ing succeeds, the program outputs the portion of the subject that
matched, together with the contents of any captured substrings.
If the -g option is given on the command line, the program then goes on
to check for further matches of the same regular expression in the same
- subject string. The logic is a little bit tricky because of the possi-
- bility of matching an empty string. Comments in the code explain what
+ subject string. The logic is a little bit tricky because of the possi-
+ bility of matching an empty string. Comments in the code explain what
is going on.
- If PCRE is installed in the standard include and library directories
- for your system, you should be able to compile the demonstration pro-
+ If PCRE is installed in the standard include and library directories
+ for your system, you should be able to compile the demonstration pro-
gram using this command:
gcc -o pcredemo pcredemo.c -lpcre
- If PCRE is installed elsewhere, you may need to add additional options
- to the command line. For example, on a Unix-like system that has PCRE
- installed in /usr/local, you can compile the demonstration program
+ If PCRE is installed elsewhere, you may need to add additional options
+ to the command line. For example, on a Unix-like system that has PCRE
+ installed in /usr/local, you can compile the demonstration program
using a command like this:
gcc -o pcredemo -I/usr/local/include pcredemo.c \
-L/usr/local/lib -lpcre
- Once you have compiled the demonstration program, you can run simple
+ Once you have compiled the demonstration program, you can run simple
tests like this:
./pcredemo 'cat|dog' 'the cat sat on the mat'
./pcredemo -g 'cat|dog' 'the dog sat on the cat'
- Note that there is a much more comprehensive test program, called
- pcretest, which supports many more facilities for testing regular
+ Note that there is a much more comprehensive test program, called
+ pcretest, which supports many more facilities for testing regular
expressions and the PCRE library. The pcredemo program is provided as a
simple coding example.
- On some operating systems (e.g. Solaris), when PCRE is not installed in
- the standard library directory, you may get an error like this when you
- try to run pcredemo:
+ If PCRE is not installed in the standard library directory, you may get
+ an error like this on some operating systems (e.g. Solaris) when you
+ try to run pcredemo:
- ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
+ ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
directory
- This is caused by the way shared library support works on those sys-
+ This is caused by the way shared library support works on those sys-
tems. You need to add
-R/usr/local/lib
@@ -6537,8 +6679,8 @@ AUTHOR
REVISION
- Last updated: 23 January 2008
- Copyright (c) 1997-2008 University of Cambridge.
+ Last updated: 01 September 2009
+ Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCRESTACK(3) PCRESTACK(3)
@@ -6676,5 +6818,5 @@ REVISION
Last updated: 09 July 2008
Copyright (c) 1997-2008 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index c67bd10..8c4b45d 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -135,8 +135,12 @@ Applications can use these to include support for different releases of PCRE.
The functions \fBpcre_compile()\fP, \fBpcre_compile2()\fP, \fBpcre_study()\fP,
and \fBpcre_exec()\fP are used for compiling and matching regular expressions
in a Perl-compatible manner. A sample program that demonstrates the simplest
-way of using them is provided in the file called \fIpcredemo.c\fP in the source
-distribution. The
+way of using them is provided in the file called \fIpcredemo.c\fP in the PCRE
+source distribution. A listing of this program is given in the
+.\" HREF
+\fBpcredemo\fP
+.\"
+documentation, and the
.\" HREF
\fBpcresample\fP
.\"
@@ -1327,7 +1331,11 @@ when using the /g modifier. It is possible to emulate Perl's behaviour after
matching a null string by first trying the match again at the same offset with
PCRE_NOTEMPTY and PCRE_ANCHORED, and then if that fails by advancing the
starting offset (see below) and trying an ordinary match again. There is some
-code that demonstrates how to do this in the \fIpcredemo.c\fP sample program.
+code that demonstrates how to do this in the
+.\" HREF
+\fBpcredemo\fP
+.\"
+sample program.
.sp
PCRE_NO_START_OPTIMIZE
.sp
@@ -2003,6 +2011,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 29 August 2009
+Last updated: 01 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcredemo.3 b/doc/pcredemo.3
new file mode 100644
index 0000000..611ed0c
--- /dev/null
+++ b/doc/pcredemo.3
@@ -0,0 +1,352 @@
+.\" Start example.
+.de EX
+. nr mE \\n(.f
+. nf
+. nh
+. ft CW
+..
+.
+.
+.\" End example.
+.de EE
+. ft \\n(mE
+. fi
+. hy \\n(HY
+..
+.
+.EX
+/*************************************************
+* PCRE DEMONSTRATION PROGRAM *
+*************************************************/
+
+/* This is a demonstration program to illustrate the most straightforward ways
+of calling the PCRE regular expression library from a C program. See the
+pcresample documentation for a short discussion ("man pcresample" if you have
+the PCRE man pages installed).
+
+In Unix-like environments, compile this program thuswise:
+
+ gcc -Wall pcredemo.c -I/usr/local/include -L/usr/local/lib \e
+ -R/usr/local/lib -lpcre
+
+Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and
+library files for PCRE are installed on your system. You don't need -I and -L
+if PCRE is installed in the standard system libraries. Only some operating
+systems (e.g. Solaris) use the -R option.
+
+Building under Windows:
+
+If you want to statically link this program against a non-dll .a file, you must
+define PCRE_STATIC before including pcre.h, otherwise the pcre_malloc() and
+pcre_free() exported functions will be declared __declspec(dllimport), with
+unwanted results. So in this environment, uncomment the following line. */
+
+/* #define PCRE_STATIC */
+
+#include <stdio.h>
+#include <string.h>
+#include <pcre.h>
+
+#define OVECCOUNT 30 /* should be a multiple of 3 */
+
+
+int main(int argc, char **argv)
+{
+pcre *re;
+const char *error;
+char *pattern;
+char *subject;
+unsigned char *name_table;
+int erroffset;
+int find_all;
+int namecount;
+int name_entry_size;
+int ovector[OVECCOUNT];
+int subject_length;
+int rc, i;
+
+
+/**************************************************************************
+* First, sort out the command line. There is only one possible option at *
+* the moment, "-g" to request repeated matching to find all occurrences, *
+* like Perl's /g option. We set the variable find_all to a non-zero value *
+* if the -g option is present. Apart from that, there must be exactly two *
+* arguments. *
+**************************************************************************/
+
+find_all = 0;
+for (i = 1; i < argc; i++)
+ {
+ if (strcmp(argv[i], "-g") == 0) find_all = 1;
+ else break;
+ }
+
+/* After the options, we require exactly two arguments, which are the pattern,
+and the subject string. */
+
+if (argc - i != 2)
+ {
+ printf("Two arguments required: a regex and a subject string\en");
+ return 1;
+ }
+
+pattern = argv[i];
+subject = argv[i+1];
+subject_length = (int)strlen(subject);
+
+
+/*************************************************************************
+* Now we are going to compile the regular expression pattern, and handle *
+* any errors that are detected. *
+*************************************************************************/
+
+re = pcre_compile(
+ pattern, /* the pattern */
+ 0, /* default options */
+ &error, /* for error message */
+ &erroffset, /* for error offset */
+ NULL); /* use default character tables */
+
+/* Compilation failed: print the error message and exit */
+
+if (re == NULL)
+ {
+ printf("PCRE compilation failed at offset %d: %s\en", erroffset, error);
+ return 1;
+ }
+
+
+/*************************************************************************
+* If the compilation succeeded, we call PCRE again, in order to do a *
+* pattern match against the subject string. This does just ONE match. If *
+* further matching is needed, it will be done below. *
+*************************************************************************/
+
+rc = pcre_exec(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ subject, /* the subject string */
+ subject_length, /* the length of the subject */
+ 0, /* start at offset 0 in the subject */
+ 0, /* default options */
+ ovector, /* output vector for substring information */
+ OVECCOUNT); /* number of elements in the output vector */
+
+/* Matching failed: handle error cases */
+
+if (rc < 0)
+ {
+ switch(rc)
+ {
+ case PCRE_ERROR_NOMATCH: printf("No match\en"); break;
+ /*
+ Handle other special cases if you like
+ */
+ default: printf("Matching error %d\en", rc); break;
+ }
+ pcre_free(re); /* Release memory used for the compiled pattern */
+ return 1;
+ }
+
+/* Match succeeded */
+
+printf("\enMatch succeeded at offset %d\en", ovector[0]);
+
+
+/*************************************************************************
+* We have found the first match within the subject string. If the output *
+* vector wasn't big enough, say so. Then output any substrings that were *
+* captured. *
+*************************************************************************/
+
+/* The output vector wasn't big enough */
+
+if (rc == 0)
+ {
+ rc = OVECCOUNT/3;
+ printf("ovector only has room for %d captured substrings\en", rc - 1);
+ }
+
+/* Show substrings stored in the output vector by number. Obviously, in a real
+application you might want to do things other than print them. */
+
+for (i = 0; i < rc; i++)
+ {
+ char *substring_start = subject + ovector[2*i];
+ int substring_length = ovector[2*i+1] - ovector[2*i];
+ printf("%2d: %.*s\en", i, substring_length, substring_start);
+ }
+
+
+/**************************************************************************
+* That concludes the basic part of this demonstration program. We have *
+* compiled a pattern, and performed a single match. The code that follows *
+* shows first how to access named substrings, and then how to code for *
+* repeated matches on the same subject. *
+**************************************************************************/
+
+/* See if there are any named substrings, and if so, show them by name. First
+we have to extract the count of named parentheses from the pattern. */
+
+(void)pcre_fullinfo(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ PCRE_INFO_NAMECOUNT, /* number of named substrings */
+ &namecount); /* where to put the answer */
+
+if (namecount <= 0) printf("No named substrings\en"); else
+ {
+ unsigned char *tabptr;
+ printf("Named substrings\en");
+
+ /* Before we can access the substrings, we must extract the table for
+ translating names to numbers, and the size of each entry in the table. */
+
+ (void)pcre_fullinfo(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ PCRE_INFO_NAMETABLE, /* address of the table */
+ &name_table); /* where to put the answer */
+
+ (void)pcre_fullinfo(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ PCRE_INFO_NAMEENTRYSIZE, /* size of each entry in the table */
+ &name_entry_size); /* where to put the answer */
+
+ /* Now we can scan the table and, for each entry, print the number, the name,
+ and the substring itself. */
+
+ tabptr = name_table;
+ for (i = 0; i < namecount; i++)
+ {
+ int n = (tabptr[0] << 8) | tabptr[1];
+ printf("(%d) %*s: %.*s\en", n, name_entry_size - 3, tabptr + 2,
+ ovector[2*n+1] - ovector[2*n], subject + ovector[2*n]);
+ tabptr += name_entry_size;
+ }
+ }
+
+
+/*************************************************************************
+* If the "-g" option was given on the command line, we want to continue *
+* to search for additional matches in the subject string, in a similar *
+* way to the /g option in Perl. This turns out to be trickier than you *
+* might think because of the possibility of matching an empty string. *
+* What happens is as follows: *
+* *
+* If the previous match was NOT for an empty string, we can just start *
+* the next match at the end of the previous one. *
+* *
+* If the previous match WAS for an empty string, we can't do that, as it *
+* would lead to an infinite loop. Instead, a special call of pcre_exec() *
+* is made with the PCRE_NOTEMPTY and PCRE_ANCHORED flags set. The first *
+* of these tells PCRE that an empty string is not a valid match; other *
+* possibilities must be tried. The second flag restricts PCRE to one *
+* match attempt at the initial string position. If this match succeeds, *
+* an alternative to the empty string match has been found, and we can *
+* proceed round the loop. *
+*************************************************************************/
+
+if (!find_all)
+ {
+ pcre_free(re); /* Release the memory used for the compiled pattern */
+ return 0; /* Finish unless -g was given */
+ }
+
+/* Loop for second and subsequent matches */
+
+for (;;)
+ {
+ int options = 0; /* Normally no options */
+ int start_offset = ovector[1]; /* Start at end of previous match */
+
+ /* If the previous match was for an empty string, we are finished if we are
+ at the end of the subject. Otherwise, arrange to run another match at the
+ same point to see if a non-empty match can be found. */
+
+ if (ovector[0] == ovector[1])
+ {
+ if (ovector[0] == subject_length) break;
+ options = PCRE_NOTEMPTY | PCRE_ANCHORED;
+ }
+
+ /* Run the next matching operation */
+
+ rc = pcre_exec(
+ re, /* the compiled pattern */
+ NULL, /* no extra data - we didn't study the pattern */
+ subject, /* the subject string */
+ subject_length, /* the length of the subject */
+ start_offset, /* starting offset in the subject */
+ options, /* options */
+ ovector, /* output vector for substring information */
+ OVECCOUNT); /* number of elements in the output vector */
+
+ /* This time, a result of NOMATCH isn't an error. If the value in "options"
+ is zero, it just means we have found all possible matches, so the loop ends.
+ Otherwise, it means we have failed to find a non-empty-string match at a
+ point where there was a previous empty-string match. In this case, we do what
+ Perl does: advance the matching position by one, and continue. We do this by
+ setting the "end of previous match" offset, because that is picked up at the
+ top of the loop as the point at which to start again. */
+
+ if (rc == PCRE_ERROR_NOMATCH)
+ {
+ if (options == 0) break;
+ ovector[1] = start_offset + 1;
+ continue; /* Go round the loop again */
+ }
+
+ /* Other matching errors are not recoverable. */
+
+ if (rc < 0)
+ {
+ printf("Matching error %d\en", rc);
+ pcre_free(re); /* Release memory used for the compiled pattern */
+ return 1;
+ }
+
+ /* Match succeeded */
+
+ printf("\enMatch succeeded again at offset %d\en", ovector[0]);
+
+ /* The match succeeded, but the output vector wasn't big enough. */
+
+ if (rc == 0)
+ {
+ rc = OVECCOUNT/3;
+ printf("ovector only has room for %d captured substrings\en", rc - 1);
+ }
+
+ /* As before, show substrings stored in the output vector by number, and then
+ also any named substrings. */
+
+ for (i = 0; i < rc; i++)
+ {
+ char *substring_start = subject + ovector[2*i];
+ int substring_length = ovector[2*i+1] - ovector[2*i];
+ printf("%2d: %.*s\en", i, substring_length, substring_start);
+ }
+
+ if (namecount <= 0) printf("No named substrings\en"); else
+ {
+ unsigned char *tabptr = name_table;
+ printf("Named substrings\en");
+ for (i = 0; i < namecount; i++)
+ {
+ int n = (tabptr[0] << 8) | tabptr[1];
+ printf("(%d) %*s: %.*s\en", n, name_entry_size - 3, tabptr + 2,
+ ovector[2*n+1] - ovector[2*n], subject + ovector[2*n]);
+ tabptr += name_entry_size;
+ }
+ }
+ } /* End of loop to find second and subsequent matches */
+
+printf("\en");
+pcre_free(re); /* Release memory used for the compiled pattern */
+return 0;
+}
+
+/* End of pcredemo.c */
+.EE
diff --git a/doc/pcregrep.txt b/doc/pcregrep.txt
index 0163d58..e876b01 100644
--- a/doc/pcregrep.txt
+++ b/doc/pcregrep.txt
@@ -92,172 +92,181 @@ SUPPORT FOR COMPRESSED FILES
OPTIONS
- -- This terminates the list of options. It is useful if the next
- item on the command line starts with a hyphen but is not an
- option. This allows for the processing of patterns and file-
+ The order in which some of the options appear can affect the output.
+ For example, both the -h and -l options affect the printing of file
+ names. Whichever comes later in the command line will be the one that
+ takes effect.
+
+ -- This terminates the list of options. It is useful if the next
+ item on the command line starts with a hyphen but is not an
+ option. This allows for the processing of patterns and file-
names that start with hyphens.
-A number, --after-context=number
- Output number lines of context after each matching line. If
+ Output number lines of context after each matching line. If
filenames and/or line numbers are being output, a hyphen sep-
- arator is used instead of a colon for the context lines. A
- line containing "--" is output between each group of lines,
- unless they are in fact contiguous in the input file. The
- value of number is expected to be relatively small. However,
+ arator is used instead of a colon for the context lines. A
+ line containing "--" is output between each group of lines,
+ unless they are in fact contiguous in the input file. The
+ value of number is expected to be relatively small. However,
pcregrep guarantees to have up to 8K of following text avail-
able for context output.
-B number, --before-context=number
- Output number lines of context before each matching line. If
+ Output number lines of context before each matching line. If
filenames and/or line numbers are being output, a hyphen sep-
- arator is used instead of a colon for the context lines. A
- line containing "--" is output between each group of lines,
- unless they are in fact contiguous in the input file. The
- value of number is expected to be relatively small. However,
+ arator is used instead of a colon for the context lines. A
+ line containing "--" is output between each group of lines,
+ unless they are in fact contiguous in the input file. The
+ value of number is expected to be relatively small. However,
pcregrep guarantees to have up to 8K of preceding text avail-
able for context output.
-C number, --context=number
- Output number lines of context both before and after each
- matching line. This is equivalent to setting both -A and -B
+ Output number lines of context both before and after each
+ matching line. This is equivalent to setting both -A and -B
to the same value.
-c, --count
- Do not output individual lines; instead just output a count
- of the number of lines that would otherwise have been output.
- If several files are given, a count is output for each of
- them. In this mode, the -A, -B, and -C options are ignored.
+ Do not output individual lines from the files that are being
+ scanned; instead output the number of lines that would other-
+ wise have been shown. If no lines are selected, the number
+ zero is output. If several files are being scanned, a
+ count is output for each of them. However, if the --files-
+ with-matches option is also used, only those files whose
+ counts are greater than zero are listed. When -c is used, the
+ -A, -B, and -C options are ignored.
--colour, --color
If this option is given without any data, it is equivalent to
- "--colour=auto". If data is required, it must be given in
+ "--colour=auto". If data is required, it must be given in
the same shell item, separated by an equals sign.
--colour=value, --color=value
This option specifies under what circumstances the parts of a
line that matched a pattern should be coloured in the output.
- By default, the output is not coloured. The value (which is
- optional, see above) may be "never", "always", or "auto". In
- the latter case, colouring happens only if the standard out-
- put is connected to a terminal. More resources are used when
- colouring is enabled, because pcregrep has to search for all
- possible matches in a line, not just one, in order to colour
+ By default, the output is not coloured. The value (which is
+ optional, see above) may be "never", "always", or "auto". In
+ the latter case, colouring happens only if the standard out-
+ put is connected to a terminal. More resources are used when
+ colouring is enabled, because pcregrep has to search for all
+ possible matches in a line, not just one, in order to colour
them all.
The colour that is used can be specified by setting the envi-
ronment variable PCREGREP_COLOUR or PCREGREP_COLOR. The value
of this variable should be a string of two numbers, separated
- by a semicolon. They are copied directly into the control
- string for setting colour on a terminal, so it is your
- responsibility to ensure that they make sense. If neither of
- the environment variables is set, the default is "1;31",
+ by a semicolon. They are copied directly into the control
+ string for setting colour on a terminal, so it is your
+ responsibility to ensure that they make sense. If neither of
+ the environment variables is set, the default is "1;31",
which gives red.
-D action, --devices=action
- If an input path is not a regular file or a directory,
- "action" specifies how it is to be processed. Valid values
+ If an input path is not a regular file or a directory,
+ "action" specifies how it is to be processed. Valid values
are "read" (the default) or "skip" (silently skip the path).
-d action, --directories=action
If an input path is a directory, "action" specifies how it is
- to be processed. Valid values are "read" (the default),
- "recurse" (equivalent to the -r option), or "skip" (silently
- skip the path). In the default case, directories are read as
- if they were ordinary files. In some operating systems the
- effect of reading a directory like this is an immediate end-
+ to be processed. Valid values are "read" (the default),
+ "recurse" (equivalent to the -r option), or "skip" (silently
+ skip the path). In the default case, directories are read as
+ if they were ordinary files. In some operating systems the
+ effect of reading a directory like this is an immediate end-
of-file.
-e pattern, --regex=pattern, --regexp=pattern
Specify a pattern to be matched. This option can be used mul-
tiple times in order to specify several patterns. It can also
- be used as a way of specifying a single pattern that starts
- with a hyphen. When -e is used, no argument pattern is taken
- from the command line; all arguments are treated as file
- names. There is an overall maximum of 100 patterns. They are
- applied to each line in the order in which they are defined
+ be used as a way of specifying a single pattern that starts
+ with a hyphen. When -e is used, no argument pattern is taken
+ from the command line; all arguments are treated as file
+ names. There is an overall maximum of 100 patterns. They are
+ applied to each line in the order in which they are defined
until one matches (or fails to match if -v is used). If -f is
- used with -e, the command line patterns are matched first,
- followed by the patterns from the file, independent of the
- order in which these options are specified. Note that multi-
+ used with -e, the command line patterns are matched first,
+ followed by the patterns from the file, independent of the
+ order in which these options are specified. Note that multi-
ple use of -e is not the same as a single pattern with alter-
natives. For example, X|Y finds the first character in a line
- that is X or Y, whereas if the two patterns are given sepa-
+ that is X or Y, whereas if the two patterns are given sepa-
rately, pcregrep finds X if it is present, even if it follows
- Y in the line. It finds Y only if there is no X in the line.
- This really matters only if you are using -o to show the
+ Y in the line. It finds Y only if there is no X in the line.
+ This really matters only if you are using -o to show the
part(s) of the line that matched.
--exclude=pattern
When pcregrep is searching the files in a directory as a con-
- sequence of the -r (recursive search) option, any regular
+ sequence of the -r (recursive search) option, any regular
files whose names match the pattern are excluded. Subdirecto-
- ries are not excluded by this option; they are searched
- recursively, subject to the --exclude_dir and --include_dir
- options. The pattern is a PCRE regular expression, and is
+ ries are not excluded by this option; they are searched
+ recursively, subject to the --exclude_dir and --include_dir
+ options. The pattern is a PCRE regular expression, and is
matched against the final component of the file name (not the
- entire path). If a file name matches both --include and
- --exclude, it is excluded. There is no short form for this
+ entire path). If a file name matches both --include and
+ --exclude, it is excluded. There is no short form for this
option.
--exclude_dir=pattern
- When pcregrep is searching the contents of a directory as a
- consequence of the -r (recursive search) option, any subdi-
- rectories whose names match the pattern are excluded. (Note
- that the --exclude option does not affect subdirectories.)
- The pattern is a PCRE regular expression, and is matched
- against the final component of the name (not the entire
- path). If a subdirectory name matches both --include_dir and
- --exclude_dir, it is excluded. There is no short form for
+ When pcregrep is searching the contents of a directory as a
+ consequence of the -r (recursive search) option, any subdi-
+ rectories whose names match the pattern are excluded. (Note
+ that the --exclude option does not affect subdirectories.)
+ The pattern is a PCRE regular expression, and is matched
+ against the final component of the name (not the entire
+ path). If a subdirectory name matches both --include_dir and
+ --exclude_dir, it is excluded. There is no short form for
this option.
-F, --fixed-strings
- Interpret each pattern as a list of fixed strings, separated
- by newlines, instead of as a regular expression. The -w
- (match as a word) and -x (match whole line) options can be
+ Interpret each pattern as a list of fixed strings, separated
+ by newlines, instead of as a regular expression. The -w
+ (match as a word) and -x (match whole line) options can be
used with -F. They apply to each of the fixed strings. A line
is selected if any of the fixed strings are found in it (sub-
ject to -w or -x, if present).
-f filename, --file=filename
- Read a number of patterns from the file, one per line, and
- match them against each line of input. A data line is output
+ Read a number of patterns from the file, one per line, and
+ match them against each line of input. A data line is output
if any of the patterns match it. The filename can be given as
"-" to refer to the standard input. When -f is used, patterns
- specified on the command line using -e may also be present;
+ specified on the command line using -e may also be present;
they are tested before the file's patterns. However, no other
- pattern is taken from the command line; all arguments are
- treated as file names. There is an overall maximum of 100
+ pattern is taken from the command line; all arguments are
+ treated as file names. There is an overall maximum of 100
patterns. Trailing white space is removed from each line, and
- blank lines are ignored. An empty file contains no patterns
- and therefore matches nothing. See also the comments about
- multiple patterns versus a single pattern with alternatives
+ blank lines are ignored. An empty file contains no patterns
+ and therefore matches nothing. See also the comments about
+ multiple patterns versus a single pattern with alternatives
in the description of -e above.
--file-offsets
- Instead of showing lines or parts of lines that match, show
- each match as an offset from the start of the file and a
- length, separated by a comma. In this mode, no context is
- shown. That is, the -A, -B, and -C options are ignored. If
+ Instead of showing lines or parts of lines that match, show
+ each match as an offset from the start of the file and a
+ length, separated by a comma. In this mode, no context is
+ shown. That is, the -A, -B, and -C options are ignored. If
there is more than one match in a line, each of them is shown
- separately. This option is mutually exclusive with --line-
+ separately. This option is mutually exclusive with --line-
offsets and --only-matching.
-H, --with-filename
- Force the inclusion of the filename at the start of output
- lines when searching a single file. By default, the filename
- is not shown in this case. For matching lines, the filename
+ Force the inclusion of the filename at the start of output
+ lines when searching a single file. By default, the filename
+ is not shown in this case. For matching lines, the filename
is followed by a colon; for context lines, a hyphen separator
- is used. If a line number is also being output, it follows
+ is used. If a line number is also being output, it follows
the file name.
-h, --no-filename
- Suppress the output filenames when searching multiple files.
- By default, filenames are shown when multiple files are
- searched. For matching lines, the filename is followed by a
- colon; for context lines, a hyphen separator is used. If a
+ Suppress the output filenames when searching multiple files.
+ By default, filenames are shown when multiple files are
+ searched. For matching lines, the filename is followed by a
+ colon; for context lines, a hyphen separator is used. If a
line number is also being output, it follows the file name.
- --help Output a help message, giving brief details of the command
+ --help Output a help message, giving brief details of the command
options and file type support, and then exit.
-i, --ignore-case
@@ -267,36 +276,40 @@ OPTIONS
When pcregrep is searching the files in a directory as a con-
sequence of the -r (recursive search) option, only those reg-
ular files whose names match the pattern are included. Subdi-
- rectories are always included and searched recursively, sub-
+ rectories are always included and searched recursively, sub-
ject to the --include_dir and --exclude_dir options. The pat-
tern is a PCRE regular expression, and is matched against the
- final component of the file name (not the entire path). If a
+ final component of the file name (not the entire path). If a
file name matches both --include and --exclude, it is
excluded. There is no short form for this option.
--include_dir=pattern
- When pcregrep is searching the contents of a directory as a
- consequence of the -r (recursive search) option, only those
- subdirectories whose names match the pattern are included.
- (Note that the --include option does not affect subdirecto-
- ries.) The pattern is a PCRE regular expression, and is
- matched against the final component of the name (not the
- entire path). If a subdirectory name matches both
- --include_dir and --exclude_dir, it is excluded. There is no
+ When pcregrep is searching the contents of a directory as a
+ consequence of the -r (recursive search) option, only those
+ subdirectories whose names match the pattern are included.
+ (Note that the --include option does not affect subdirecto-
+ ries.) The pattern is a PCRE regular expression, and is
+ matched against the final component of the name (not the
+ entire path). If a subdirectory name matches both
+ --include_dir and --exclude_dir, it is excluded. There is no
short form for this option.
-L, --files-without-match
- Instead of outputting lines from the files, just output the
- names of the files that do not contain any lines that would
- have been output. Each file name is output once, on a sepa-
+ Instead of outputting lines from the files, just output the
+ names of the files that do not contain any lines that would
+ have been output. Each file name is output once, on a sepa-
rate line.
-l, --files-with-matches
- Instead of outputting lines from the files, just output the
+ Instead of outputting lines from the files, just output the
names of the files containing lines that would have been out-
- put. Each file name is output once, on a separate line.
- Searching stops as soon as a matching line is found in a
- file.
+ put. Each file name is output once, on a separate line.
+ Searching normally stops as soon as a matching line is found
+ in a file. However, if the -c (count) option is also used,
+ matching continues in order to obtain the correct count, and
+ those files that have at least one match are listed along
+ with their counts. Using this option with -c is a way of sup-
+ pressing the listing of files with no matches.
--label=name
This option supplies a name to be used for the standard input
@@ -304,106 +317,106 @@ OPTIONS
input)" is used. There is no short form for this option.
--line-offsets
- Instead of showing lines or parts of lines that match, show
+ Instead of showing lines or parts of lines that match, show
each match as a line number, the offset from the start of the
- line, and a length. The line number is terminated by a colon
- (as usual; see the -n option), and the offset and length are
- separated by a comma. In this mode, no context is shown.
- That is, the -A, -B, and -C options are ignored. If there is
- more than one match in a line, each of them is shown sepa-
+ line, and a length. The line number is terminated by a colon
+ (as usual; see the -n option), and the offset and length are
+ separated by a comma. In this mode, no context is shown.
+ That is, the -A, -B, and -C options are ignored. If there is
+ more than one match in a line, each of them is shown sepa-
rately. This option is mutually exclusive with --file-offsets
and --only-matching.
--locale=locale-name
- This option specifies a locale to be used for pattern match-
- ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
- ronment variables. If no locale is specified, the PCRE
- library's default (usually the "C" locale) is used. There is
+ This option specifies a locale to be used for pattern match-
+ ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
+ ronment variables. If no locale is specified, the PCRE
+ library's default (usually the "C" locale) is used. There is
no short form for this option.
-M, --multiline
- Allow patterns to match more than one line. When this option
+ Allow patterns to match more than one line. When this option
is given, patterns may usefully contain literal newline char-
- acters and internal occurrences of ^ and $ characters. The
- output for any one match may consist of more than one line.
- When this option is set, the PCRE library is called in "mul-
- tiline" mode. There is a limit to the number of lines that
- can be matched, imposed by the way that pcregrep buffers the
- input file as it scans it. However, pcregrep ensures that at
+ acters and internal occurrences of ^ and $ characters. The
+ output for any one match may consist of more than one line.
+ When this option is set, the PCRE library is called in "mul-
+ tiline" mode. There is a limit to the number of lines that
+ can be matched, imposed by the way that pcregrep buffers the
+ input file as it scans it. However, pcregrep ensures that at
least 8K characters or the rest of the document (whichever is
- the shorter) are available for forward matching, and simi-
+ the shorter) are available for forward matching, and simi-
larly the previous 8K characters (or all the previous charac-
- ters, if fewer than 8K) are guaranteed to be available for
+ ters, if fewer than 8K) are guaranteed to be available for
lookbehind assertions.
-N newline-type, --newline=newline-type
- The PCRE library supports five different conventions for
- indicating the ends of lines. They are the single-character
- sequences CR (carriage return) and LF (linefeed), the two-
- character sequence CRLF, an "anycrlf" convention, which rec-
- ognizes any of the preceding three types, and an "any" con-
+ The PCRE library supports five different conventions for
+ indicating the ends of lines. They are the single-character
+ sequences CR (carriage return) and LF (linefeed), the two-
+ character sequence CRLF, an "anycrlf" convention, which rec-
+ ognizes any of the preceding three types, and an "any" con-
vention, in which any Unicode line ending sequence is assumed
- to end a line. The Unicode sequences are the three just men-
- tioned, plus VT (vertical tab, U+000B), FF (formfeed,
- U+000C), NEL (next line, U+0085), LS (line separator,
+ to end a line. The Unicode sequences are the three just men-
+ tioned, plus VT (vertical tab, U+000B), FF (formfeed,
+ U+000C), NEL (next line, U+0085), LS (line separator,
U+2028), and PS (paragraph separator, U+2029).
When the PCRE library is built, a default line-ending
- sequence is specified. This is normally the standard
+ sequence is specified. This is normally the standard
sequence for the operating system. Unless otherwise specified
- by this option, pcregrep uses the library's default. The
+ by this option, pcregrep uses the library's default. The
possible values for this option are CR, LF, CRLF, ANYCRLF, or
- ANY. This makes it possible to use pcregrep on files that
- have come from other environments without having to modify
- their line endings. If the data that is being scanned does
- not agree with the convention set by this option, pcregrep
+ ANY. This makes it possible to use pcregrep on files that
+ have come from other environments without having to modify
+ their line endings. If the data that is being scanned does
+ not agree with the convention set by this option, pcregrep
may behave in strange ways.
-n, --line-number
Precede each output line by its line number in the file, fol-
- lowed by a colon for matching lines or a hyphen for context
- lines. If the filename is also being output, it precedes the
+ lowed by a colon for matching lines or a hyphen for context
+ lines. If the filename is also being output, it precedes the
line number. This option is forced if --line-offsets is used.
-o, --only-matching
- Show only the part of the line that matched a pattern. In
- this mode, no context is shown. That is, the -A, -B, and -C
- options are ignored. If there is more than one match in a
- line, each of them is shown separately. If -o is combined
- with -v (invert the sense of the match to find non-matching
- lines), no output is generated, but the return code is set
+ Show only the part of the line that matched a pattern. In
+ this mode, no context is shown. That is, the -A, -B, and -C
+ options are ignored. If there is more than one match in a
+ line, each of them is shown separately. If -o is combined
+ with -v (invert the sense of the match to find non-matching
+ lines), no output is generated, but the return code is set
appropriately. This option is mutually exclusive with --file-
offsets and --line-offsets.
-q, --quiet
Work quietly, that is, display nothing except error messages.
- The exit status indicates whether or not any matches were
+ The exit status indicates whether or not any matches were
found.
-r, --recursive
- If any given path is a directory, recursively scan the files
- it contains, taking note of any --include and --exclude set-
- tings. By default, a directory is read as a normal file; in
- some operating systems this gives an immediate end-of-file.
- This option is a shorthand for setting the -d option to
+ If any given path is a directory, recursively scan the files
+ it contains, taking note of any --include and --exclude set-
+ tings. By default, a directory is read as a normal file; in
+ some operating systems this gives an immediate end-of-file.
+ This option is a shorthand for setting the -d option to
"recurse".
-s, --no-messages
- Suppress error messages about non-existent or unreadable
- files. Such files are quietly skipped. However, the return
+ Suppress error messages about non-existent or unreadable
+ files. Such files are quietly skipped. However, the return
code is still 2, even if matches were found in other files.
-u, --utf-8
- Operate in UTF-8 mode. This option is available only if PCRE
- has been compiled with UTF-8 support. Both patterns and sub-
+ Operate in UTF-8 mode. This option is available only if PCRE
+ has been compiled with UTF-8 support. Both patterns and sub-
ject lines must be valid strings of UTF-8 characters.
-V, --version
- Write the version numbers of pcregrep and the PCRE library
+ Write the version numbers of pcregrep and the PCRE library
that is being used to the standard error stream.
-v, --invert-match
- Invert the sense of the match, so that lines which do not
+ Invert the sense of the match, so that lines which do not
match any of the patterns are the ones that are found.
-w, --word-regex, --word-regexp
@@ -411,39 +424,40 @@ OPTIONS
lent to having \b at the start and end of the pattern.
-x, --line-regex, --line-regexp
- Force the patterns to be anchored (each must start matching
- at the beginning of a line) and in addition, require them to
- match entire lines. This is equivalent to having ^ and $
+ Force the patterns to be anchored (each must start matching
+ at the beginning of a line) and in addition, require them to
+ match entire lines. This is equivalent to having ^ and $
characters at the start and end of each alternative branch in
every pattern.
ENVIRONMENT VARIABLES
- The environment variables LC_ALL and LC_CTYPE are examined, in that
- order, for a locale. The first one that is set is used. This can be
- overridden by the --locale option. If no locale is set, the PCRE
+ The environment variables LC_ALL and LC_CTYPE are examined, in that
+ order, for a locale. The first one that is set is used. This can be
+ overridden by the --locale option. If no locale is set, the PCRE
library's default (usually the "C" locale) is used.
NEWLINES
- The -N (--newline) option allows pcregrep to scan files with different
- newline conventions from the default. However, the setting of this
- option does not affect the way in which pcregrep writes information to
- the standard error and output streams. It uses the string "\n" in C
- printf() calls to indicate newlines, relying on the C I/O library to
- convert this to an appropriate sequence if the output is sent to a
+ The -N (--newline) option allows pcregrep to scan files with different
+ newline conventions from the default. However, the setting of this
+ option does not affect the way in which pcregrep writes information to
+ the standard error and output streams. It uses the string "\n" in C
+ printf() calls to indicate newlines, relying on the C I/O library to
+ convert this to an appropriate sequence if the output is sent to a
file.
OPTIONS COMPATIBILITY
The majority of short and long forms of pcregrep's options are the same
- as in the GNU grep program. Any long option of the form --xxx-regexp
- (GNU terminology) is also available as --xxx-regex (PCRE terminology).
- However, the --locale, -M, --multiline, -u, and --utf-8 options are
- specific to pcregrep.
+ as in the GNU grep program. Any long option of the form --xxx-regexp
+ (GNU terminology) is also available as --xxx-regex (PCRE terminology).
+ However, the --locale, -M, --multiline, -u, and --utf-8 options are
+ specific to pcregrep. If both the -c and -l options are given, GNU grep
+ lists only file names, without counts, but pcregrep gives the counts.
OPTIONS WITH DATA
@@ -508,5 +522,5 @@ AUTHOR
REVISION
- Last updated: 01 March 2009
+ Last updated: 12 August 2009
Copyright (c) 1997-2009 University of Cambridge.
diff --git a/doc/pcresample.3 b/doc/pcresample.3
index d27690a..48941c5 100644
--- a/doc/pcresample.3
+++ b/doc/pcresample.3
@@ -5,7 +5,13 @@ PCRE - Perl-compatible regular expressions
.rs
.sp
A simple, complete demonstration program, to get you started with using PCRE,
-is supplied in the file \fIpcredemo.c\fP in the PCRE distribution.
+is supplied in the file \fIpcredemo.c\fP in the PCRE distribution. A listing of
+this program is given in the
+.\" HREF
+\fBpcredemo\fP
+.\"
+documentation. If you do not have a copy of the PCRE distribution, you can save
+this listing to re-create \fIpcredemo.c\fP.
.P
The program compiles the regular expression that is its first argument, and
matches it against the subject string in its second argument. No PCRE options
@@ -44,12 +50,18 @@ Note that there is a much more comprehensive test program, called
\fBpcretest\fP,
.\"
which supports many more facilities for testing regular expressions and the
-PCRE library. The \fBpcredemo\fP program is provided as a simple coding
-example.
+PCRE library. The
+.\" HREF
+\fBpcredemo\fP
+.\"
+program is provided as a simple coding example.
.P
-On some operating systems (e.g. Solaris), when PCRE is not installed in the
-standard library directory, you may get an error like this when you try to run
-\fBpcredemo\fP:
+When you try to run
+.\" HREF
+\fBpcredemo\fP
+.\"
+when PCRE is not installed in the standard library directory, you may get an
+error like this on some operating systems (e.g. Solaris):
.sp
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
.sp
@@ -75,6 +87,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 23 January 2008
-Copyright (c) 1997-2008 University of Cambridge.
+Last updated: 01 September 2009
+Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcretest.txt b/doc/pcretest.txt
index aa862eb..f1e8777 100644
--- a/doc/pcretest.txt
+++ b/doc/pcretest.txt
@@ -326,8 +326,9 @@ DATA LINES
or pcre_dfa_exec()
\Odd set the size of the output vector passed to
pcre_exec() to dd (any number of digits)
- \P pass the PCRE_PARTIAL option to pcre_exec()
- or pcre_dfa_exec()
+ \P pass the PCRE_PARTIAL_SOFT option to pcre_exec()
+ or pcre_dfa_exec(); if used twice, pass the
+ PCRE_PARTIAL_HARD option
\Qdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd
(any number of digits)
\R pass the PCRE_DFA_RESTART option to pcre_dfa_exec()
@@ -413,9 +414,10 @@ DEFAULT OUTPUT FROM PCRETEST
When a match succeeds, pcretest outputs the list of captured substrings
that pcre_exec() returns, starting with number 0 for the string that
matched the whole pattern. Otherwise, it outputs "No match" or "Partial
- match" when pcre_exec() returns PCRE_ERROR_NOMATCH or PCRE_ERROR_PAR-
- TIAL, respectively, and otherwise the PCRE negative error number. Here
- is an example of an interactive pcretest run.
+ match:" followed by the partially matching substring when pcre_exec()
+ returns PCRE_ERROR_NOMATCH or PCRE_ERROR_PARTIAL, respectively, and
+ otherwise the PCRE negative error number. Here is an example of an
+ interactive pcretest run.
$ pcretest
PCRE version 7.0 30-Nov-2006
@@ -427,11 +429,11 @@ DEFAULT OUTPUT FROM PCRETEST
data> xyz
No match
- Note that unset capturing substrings that are not followed by one that
- is set are not returned by pcre_exec(), and are not shown by pcretest.
- In the following example, there are two capturing substrings, but when
- the first data line is matched, the second, unset substring is not
- shown. An "internal" unset substring is shown as "<unset>", as for the
+ Note that unset capturing substrings that are not followed by one that
+ is set are not returned by pcre_exec(), and are not shown by pcretest.
+ In the following example, there are two capturing substrings, but when
+ the first data line is matched, the second, unset substring is not
+ shown. An "internal" unset substring is shown as "<unset>", as for the
second data line.
re> /(a)|(b)/
@@ -443,11 +445,11 @@ DEFAULT OUTPUT FROM PCRETEST
1: <unset>
2: b
- If the strings contain any non-printing characters, they are output as
- \0x escapes, or as \x{...} escapes if the /8 modifier was present on
- the pattern. See below for the definition of non-printing characters.
- If the pattern has the /+ modifier, the output for substring 0 is
- followed by the rest of the subject string, identified by "0+" like
+ If the strings contain any non-printing characters, they are output as
+ \0x escapes, or as \x{...} escapes if the /8 modifier was present on
+ the pattern. See below for the definition of non-printing characters.
+ If the pattern has the /+ modifier, the output for substring 0 is
+ followed by the rest of the subject string, identified by "0+" like
this:
re> /cat/+
@@ -455,7 +457,7 @@ DEFAULT OUTPUT FROM PCRETEST
0: cat
0+ aract
- If the pattern has the /g or /G modifier, the results of successive
+ If the pattern has the /g or /G modifier, the results of successive
matching attempts are output in sequence, like this:
re> /\Bi(\w\w)/g
@@ -469,24 +471,24 @@ DEFAULT OUTPUT FROM PCRETEST
"No match" is output only if the first match attempt fails.
- If any of the sequences \C, \G, or \L are present in a data line that
- is successfully matched, the substrings extracted by the convenience
+ If any of the sequences \C, \G, or \L are present in a data line that
+ is successfully matched, the substrings extracted by the convenience
functions are output with C, G, or L after the string number instead of
a colon. This is in addition to the normal full list. The string length
- (that is, the return from the extraction function) is given in paren-
+ (that is, the return from the extraction function) is given in paren-
theses after each string for \C and \G.
Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), data lines may not. However new-
- lines can be included in data by means of the \n escape (or \r, \r\n,
+ lines can be included in data by means of the \n escape (or \r, \r\n,
etc., depending on the newline sequence setting).
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
- When the alternative matching function, pcre_dfa_exec(), is used (by
- means of the \D escape sequence or the -dfa command line option), the
- output consists of a list of all the matches that start at the first
+ When the alternative matching function, pcre_dfa_exec(), is used (by
+ means of the \D escape sequence or the -dfa command line option), the
+ output consists of a list of all the matches that start at the first
point in the subject where there is at least one match. For example:
re> /(tang|tangerine|tan)/
@@ -495,8 +497,10 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang
2: tan
- (Using the normal matching function on this data finds only "tang".)
- The longest matching string is always given first (and numbered zero).
+ (Using the normal matching function on this data finds only "tang".)
+ The longest matching string is always given first (and numbered zero).
+ After a PCRE_ERROR_PARTIAL return, the output is "Partial match:", fol-
+ lowed by the partially matching substring.
If /g is present on the pattern, the search for further matches resumes
at the end of the longest match. For example:
@@ -510,16 +514,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan
0: tan
- Since the matching function does not support substring capture, the
- escape sequences that are concerned with captured substrings are not
+ Since the matching function does not support substring capture, the
+ escape sequences that are concerned with captured substrings are not
relevant.
RESTARTING AFTER A PARTIAL MATCH
When the alternative matching function has given the PCRE_ERROR_PARTIAL
- return, indicating that the subject partially matched the pattern, you
- can restart the match with additional subject data by means of the \R
+ return, indicating that the subject partially matched the pattern, you
+ can restart the match with additional subject data by means of the \R
escape sequence. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -528,30 +532,30 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\R\D
0: n05
- For further information about partial matching, see the pcrepartial
+ For further information about partial matching, see the pcrepartial
documentation.
CALLOUTS
- If the pattern contains any callout requests, pcretest's callout func-
- tion is called during matching. This works with both matching func-
+ If the pattern contains any callout requests, pcretest's callout func-
+ tion is called during matching. This works with both matching func-
tions. By default, the called function displays the callout number, the
- start and current positions in the text at the callout time, and the
+ start and current positions in the text at the callout time, and the
next pattern item to be tested. For example, the output
--->pqrabcdef
0 ^ ^ \d
- indicates that callout number 0 occurred for a match attempt starting
- at the fourth character of the subject string, when the pointer was at
- the seventh character of the data, and when the next pattern item was
- \d. Just one circumflex is output if the start and current positions
+ indicates that callout number 0 occurred for a match attempt starting
+ at the fourth character of the subject string, when the pointer was at
+ the seventh character of the data, and when the next pattern item was
+ \d. Just one circumflex is output if the start and current positions
are the same.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
- a result of the /C pattern modifier. In this case, instead of showing
- the callout number, the offset in the pattern, preceded by a plus, is
+ a result of the /C pattern modifier. In this case, instead of showing
+ the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
re> /\d?[A-E]\*/C
@@ -563,86 +567,86 @@ CALLOUTS
+10 ^ ^
0: E*
- The callout function in pcretest returns zero (carry on matching) by
- default, but you can use a \C item in a data line (as described above)
+ The callout function in pcretest returns zero (carry on matching) by
+ default, but you can use a \C item in a data line (as described above)
to change this.
- Inserting callouts can be helpful when using pcretest to check compli-
- cated regular expressions. For further information about callouts, see
+ Inserting callouts can be helpful when using pcretest to check compli-
+ cated regular expressions. For further information about callouts, see
the pcrecallout documentation.
NON-PRINTING CHARACTERS
- When pcretest is outputting text in the compiled version of a pattern,
- bytes other than 32-126 are always treated as non-printing characters
+ When pcretest is outputting text in the compiled version of a pattern,
+ bytes other than 32-126 are always treated as non-printing characters
and are therefore shown as hex escapes.
- When pcretest is outputting text that is a matched part of a subject
- string, it behaves in the same way, unless a different locale has been
- set for the pattern (using the /L modifier). In this case, the
+ When pcretest is outputting text that is a matched part of a subject
+ string, it behaves in the same way, unless a different locale has been
+ set for the pattern (using the /L modifier). In this case, the
isprint() function is used to distinguish printing and non-printing characters.
SAVING AND RELOADING COMPILED PATTERNS
- The facilities described in this section are not available when the
+ The facilities described in this section are not available when the
POSIX interface to PCRE is being used, that is, when the /P pattern
modifier is specified.
When the POSIX interface is not in use, you can cause pcretest to write
- a compiled pattern to a file, by following the modifiers with > and a
+ a compiled pattern to a file, by following the modifiers with > and a
file name. For example:
/pattern/im >/some/file
- See the pcreprecompile documentation for a discussion about saving and
+ See the pcreprecompile documentation for a discussion about saving and
re-using compiled patterns.
- The data that is written is binary. The first eight bytes are the
- length of the compiled pattern data followed by the length of the
- optional study data, each written as four bytes in big-endian order
- (most significant byte first). If there is no study data (either the
+ The data that is written is binary. The first eight bytes are the
+ length of the compiled pattern data followed by the length of the
+ optional study data, each written as four bytes in big-endian order
+ (most significant byte first). If there is no study data (either the
pattern was not studied, or studying did not return any data), the sec-
- ond length is zero. The lengths are followed by an exact copy of the
+ ond length is zero. The lengths are followed by an exact copy of the
compiled pattern. If there is additional study data, this follows imme-
- diately after the compiled pattern. After writing the file, pcretest
+ diately after the compiled pattern. After writing the file, pcretest
expects to read a new pattern.
A saved pattern can be reloaded into pcretest by specifying < and a file
- name instead of a pattern. The name of the file must not contain a <
- character, as otherwise pcretest will interpret the line as a pattern
+ name instead of a pattern. The name of the file must not contain a <
+ character, as otherwise pcretest will interpret the line as a pattern
delimited by < characters. For example:
re> </some/file
Compiled regex loaded from /some/file
No study data
- When the pattern has been loaded, pcretest proceeds to read data lines
+ When the pattern has been loaded, pcretest proceeds to read data lines
in the usual way.
- You can copy a file written by pcretest to a different host and reload
- it there, even if the new host has opposite endianness to the one on
- which the pattern was compiled. For example, you can compile on an i86
+ You can copy a file written by pcretest to a different host and reload
+ it there, even if the new host has opposite endianness to the one on
+ which the pattern was compiled. For example, you can compile on an i86
machine and run on a SPARC machine.
- File names for saving and reloading can be absolute or relative, but
- note that the shell facility of expanding a file name that starts with
+ File names for saving and reloading can be absolute or relative, but
+ note that the shell facility of expanding a file name that starts with
a tilde (~) is not available.
- The ability to save and reload files in pcretest is intended for test-
- ing and experimentation. It is not intended for production use because
- only a single pattern can be written to a file. Furthermore, there is
- no facility for supplying custom character tables for use with a
- reloaded pattern. If the original pattern was compiled with custom
- tables, an attempt to match a subject string using a reloaded pattern
- is likely to cause pcretest to crash. Finally, if you attempt to load
+ The ability to save and reload files in pcretest is intended for test-
+ ing and experimentation. It is not intended for production use because
+ only a single pattern can be written to a file. Furthermore, there is
+ no facility for supplying custom character tables for use with a
+ reloaded pattern. If the original pattern was compiled with custom
+ tables, an attempt to match a subject string using a reloaded pattern
+ is likely to cause pcretest to crash. Finally, if you attempt to load
a file that is not in the correct format, the result is undefined.
SEE ALSO
- pcre(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcrepartial(3),
+ pcre(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcrepartial(3),
pcrepattern(3), pcreprecompile(3).
@@ -655,5 +659,5 @@ AUTHOR
REVISION
- Last updated: 10 March 2009
+ Last updated: 29 August 2009
Copyright (c) 1997-2009 University of Cambridge.