author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-10-02 08:53:31 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-10-02 08:53:31 +0000
commit 80d5ddd534157a15013c2314a4236bdd8ac0b72f
tree e9cc83556582cde2b0adf4e2a4d98ae75c35aecf
parent 5c3493df2827cb70ddc42899df2f9ee30f5e7a7b
Documentation update
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@456 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r--  HACKING              29
-rw-r--r--  doc/pcre.3           35
-rw-r--r--  doc/pcre_compile.3    4
-rw-r--r--  doc/pcre_compile2.3  50
-rw-r--r--  doc/pcreapi.3        24
-rw-r--r--  doc/pcrebuild.3      30
-rw-r--r--  doc/pcrecallout.3    16
-rw-r--r--  doc/pcrecompat.3     39
-rw-r--r--  doc/pcrematching.3   24
-rw-r--r--  doc/pcrepartial.3    11
-rw-r--r--  doc/pcrepattern.3   173
-rw-r--r--  doc/pcresample.3      6
-rw-r--r--  doc/perltest.txt     24
13 files changed, 274 insertions, 191 deletions
diff --git a/HACKING b/HACKING
index 1f30d4c..623fe5b 100644
--- a/HACKING
+++ b/HACKING
@@ -67,22 +67,22 @@ many tests of the mode that might slow it down. So I re-factored the compiling
functions to work this way. This got rid of about 600 lines of source. It
should make future maintenance and development easier. As this was such a major
change, I never released 6.8, instead upping the number to 7.0 (other quite
-major changes are also present in the 7.0 release).
+major changes were also present in the 7.0 release).
-A side effect of this work is that the previous limit of 200 on the nesting
+A side effect of this work was that the previous limit of 200 on the nesting
depth of parentheses was removed. However, there is a downside: pcre_compile()
runs more slowly than before (30% or more, depending on the pattern) because it
-is doing a full analysis of the pattern. My hope is that this is not a big
-issue.
+is doing a full analysis of the pattern. My hope was that this would not be a
+big issue, and in the event, nobody has commented on it.
Traditional matching function
-----------------------------
The "traditional", and original, matching function is called pcre_exec(), and
it implements an NFA algorithm, similar to the original Henry Spencer algorithm
-and the way that Perl works. Not surprising, since it is intended to be as
-compatible with Perl as possible. This is the function most users of PCRE will
-use most of the time.
+and the way that Perl works. This is not surprising, since it is intended to be
+as compatible with Perl as possible. This is the function most users of PCRE
+will use most of the time.
Supplementary matching function
-------------------------------
@@ -119,6 +119,7 @@ quantifiers) are always just two bytes long.
A list of the opcodes follows:
+
Opcodes with no following data
------------------------------
@@ -150,12 +151,12 @@ These items are all just one byte long
OP_EXTUNI match an extended Unicode character
OP_ANYNL match any Unicode newline sequence
- OP_ACCEPT )
- OP_COMMIT )
- OP_FAIL ) These are Perl 5.10's "backtracking
- OP_PRUNE ) control verbs".
- OP_SKIP )
- OP_THEN )
+ OP_ACCEPT ) These are Perl 5.10's "backtracking
+ OP_COMMIT ) control verbs". If OP_ACCEPT is inside
+ OP_FAIL ) capturing parentheses, it may be preceded
+ OP_PRUNE ) by one or more OP_CLOSE, followed by a 2-byte
+ OP_SKIP ) number, indicating which parentheses must be
+ OP_THEN ) closed.
Repeating single characters
@@ -415,4 +416,4 @@ at compile time, and so does not cause anything to be put into the compiled
data.
Philip Hazel
-April 2008
+October 2009
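The OP_CLOSE note added in the HACKING hunk above can be illustrated with a minimal sketch. It assumes that each OP_CLOSE is one opcode byte followed by its own 2-byte number, stored most significant byte first. The opcode values and the names read_2byte and closes_before_accept are hypothetical; this is not PCRE's internal code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcode values, chosen only for this sketch; the real
   numbers are defined inside PCRE and may differ between releases. */
enum { OP_CLOSE = 1, OP_ACCEPT = 2 };

/* Read a 2-byte number, most significant byte first (assumed order). */
static unsigned int read_2byte(const unsigned char *p)
{
    return ((unsigned int)p[0] << 8) | p[1];
}

/* Walk the OP_CLOSE items that precede OP_ACCEPT, storing the group
   numbers they carry in 'groups' (at most 'max' of them are stored).
   Returns the number of OP_CLOSE items seen, or -1 if the sequence
   does not end in OP_ACCEPT. */
static int closes_before_accept(const unsigned char *code, size_t len,
                                unsigned int *groups, size_t max)
{
    size_t i = 0;
    int count = 0;

    assert(code != NULL && groups != NULL);
    while (i + 2 < len && code[i] == OP_CLOSE) {
        if ((size_t)count < max)
            groups[count] = read_2byte(code + i + 1);
        count++;
        i += 3;                     /* opcode byte plus 2-byte number */
    }
    return (i < len && code[i] == OP_ACCEPT) ? count : -1;
}
```

Given the bytes OP_CLOSE 0 2 OP_CLOSE 0 1 OP_ACCEPT, the function reports two groups to close, numbered 2 and 1.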
diff --git a/doc/pcre.3 b/doc/pcre.3
index 430fbd5..3d6409a 100644
--- a/doc/pcre.3
+++ b/doc/pcre.3
@@ -6,21 +6,20 @@ PCRE - Perl-compatible regular expressions
.sp
The PCRE library is a set of functions that implement regular expression
pattern matching using the same syntax and semantics as Perl, with just a few
-differences. Certain features that appeared in Python and PCRE before they
-appeared in Perl are also available using the Python syntax. There is also some
-support for certain .NET and Oniguruma syntax items, and there is an option for
-requesting some minor changes that give better JavaScript compatibility.
+differences. Some features that appeared in Python and PCRE before they
+appeared in Perl are also available using the Python syntax, there is some
+support for one or two .NET and Oniguruma syntax items, and there is an option
+for requesting some minor changes that give better JavaScript compatibility.
.P
-The current implementation of PCRE (release 8.xx) corresponds approximately
-with Perl 5.10, including support for UTF-8 encoded strings and Unicode general
-category properties. However, UTF-8 and Unicode support has to be explicitly
-enabled; it is not the default. The Unicode tables correspond to Unicode
-release 5.1.
+The current implementation of PCRE corresponds approximately with Perl 5.10,
+including support for UTF-8 encoded strings and Unicode general category
+properties. However, UTF-8 and Unicode support has to be explicitly enabled; it
+is not the default. The Unicode tables correspond to Unicode release 5.1.
.P
In addition to the Perl-compatible matching function, PCRE contains an
-alternative matching function that matches the same compiled patterns in a
-different way. In certain circumstances, the alternative function has some
-advantages. For a discussion of the two matching algorithms, see the
+alternative function that matches the same compiled patterns in a different
+way. In certain circumstances, the alternative function has some advantages.
+For a discussion of the two matching algorithms, see the
.\" HREF
\fBpcrematching\fP
.\"
@@ -66,7 +65,8 @@ available. The features themselves are described in the
\fBpcrebuild\fP
.\"
page. Documentation about building PCRE for various operating systems can be
-found in the \fBREADME\fP file in the source distribution.
+found in the \fBREADME\fP and \fBNON-UNIX-USE\fP files in the source
+distribution.
.P
The library contains a number of undocumented internal functions and data
tables that are used by more than one of the exported external functions, but
@@ -100,12 +100,12 @@ of searching. The sections are as follows:
.\" JOIN
pcrepattern syntax and semantics of supported
regular expressions
- pcresyntax quick syntax reference
pcreperform discussion of performance issues
pcreposix the POSIX-compatible C API
pcreprecompile details of saving and re-using precompiled patterns
pcresample discussion of the pcredemo program
pcrestack discussion of stack usage
+ pcresyntax quick syntax reference
pcretest description of the \fBpcretest\fP testing command
.sp
In addition, in the "man" and HTML formats, there is a short page for each
@@ -148,8 +148,8 @@ issues, see the
.\"
documentation.
.
-.\" HTML <a name="utf8support"></a>
.
+.\" HTML <a name="utf8support"></a>
.
.SH "UTF-8 AND UNICODE PROPERTY SUPPORT"
.rs
@@ -167,7 +167,7 @@ the code, and, in addition, you must call
with the PCRE_UTF8 option flag, or the pattern must start with the sequence
(*UTF8). When either of these is the case, both the pattern and any subject
strings that are matched against it are treated as UTF-8 strings instead of
-just strings of bytes.
+strings of 1-byte characters.
.P
If you compile PCRE with UTF-8 support, but do not use it at run time, the
library will be a bit bigger, but the additional run time overhead is limited
@@ -187,6 +187,7 @@ documentation. Only the short names for properties are supported. For example,
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for
compatibility with Perl 5.6. PCRE does not support this.
.
+.
.\" HTML <a name="utf8strings"></a>
.
.SS "Validity of UTF-8 strings"
@@ -292,6 +293,6 @@ two digits 10, at the domain cam.ac.uk.
.rs
.sp
.nf
-Last updated: 01 September 2009
+Last updated: 28 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcre_compile.3 b/doc/pcre_compile.3
index 48f92f7..e64288a 100644
--- a/doc/pcre_compile.3
+++ b/doc/pcre_compile.3
@@ -52,11 +52,11 @@ The option bits are:
PCRE_NEWLINE_LF Set LF as the newline sequence
PCRE_NO_AUTO_CAPTURE Disable numbered capturing paren-
theses (named ones available)
- PCRE_UNGREEDY Invert greediness of quantifiers
- PCRE_UTF8 Run in UTF-8 mode
PCRE_NO_UTF8_CHECK Do not check the pattern for UTF-8
validity (only relevant if
PCRE_UTF8 is set)
+ PCRE_UNGREEDY Invert greediness of quantifiers
+ PCRE_UTF8 Run in UTF-8 mode
.sp
PCRE must be built with UTF-8 support in order to use PCRE_UTF8 and
PCRE_NO_UTF8_CHECK.
diff --git a/doc/pcre_compile2.3 b/doc/pcre_compile2.3
index 1e71aff..84dbf19 100644
--- a/doc/pcre_compile2.3
+++ b/doc/pcre_compile2.3
@@ -34,29 +34,33 @@ argument. The arguments are:
.sp
The option bits are:
.sp
- PCRE_ANCHORED Force pattern anchoring
- PCRE_AUTO_CALLOUT Compile automatic callouts
- PCRE_CASELESS Do caseless matching
- PCRE_DOLLAR_ENDONLY $ not to match newline at end
- PCRE_DOTALL . matches anything including NL
- PCRE_DUPNAMES Allow duplicate names for subpatterns
- PCRE_EXTENDED Ignore whitespace and # comments
- PCRE_EXTRA PCRE extra features
- (not much use currently)
- PCRE_FIRSTLINE Force matching to be before newline
- PCRE_MULTILINE ^ and $ match newlines within data
- PCRE_NEWLINE_ANY Recognize any Unicode newline sequence
- PCRE_NEWLINE_ANYCRLF Recognize CR, LF, and CRLF as newline sequences
- PCRE_NEWLINE_CR Set CR as the newline sequence
- PCRE_NEWLINE_CRLF Set CRLF as the newline sequence
- PCRE_NEWLINE_LF Set LF as the newline sequence
- PCRE_NO_AUTO_CAPTURE Disable numbered capturing paren-
- theses (named ones available)
- PCRE_UNGREEDY Invert greediness of quantifiers
- PCRE_UTF8 Run in UTF-8 mode
- PCRE_NO_UTF8_CHECK Do not check the pattern for UTF-8
- validity (only relevant if
- PCRE_UTF8 is set)
+ PCRE_ANCHORED Force pattern anchoring
+ PCRE_AUTO_CALLOUT Compile automatic callouts
+ PCRE_BSR_ANYCRLF \eR matches only CR, LF, or CRLF
+ PCRE_BSR_UNICODE \eR matches all Unicode line endings
+ PCRE_CASELESS Do caseless matching
+ PCRE_DOLLAR_ENDONLY $ not to match newline at end
+ PCRE_DOTALL . matches anything including NL
+ PCRE_DUPNAMES Allow duplicate names for subpatterns
+ PCRE_EXTENDED Ignore whitespace and # comments
+ PCRE_EXTRA PCRE extra features
+ (not much use currently)
+ PCRE_FIRSTLINE Force matching to be before newline
+ PCRE_JAVASCRIPT_COMPAT JavaScript compatibility
+ PCRE_MULTILINE ^ and $ match newlines within data
+ PCRE_NEWLINE_ANY Recognize any Unicode newline sequence
+ PCRE_NEWLINE_ANYCRLF Recognize CR, LF, and CRLF as newline
+ sequences
+ PCRE_NEWLINE_CR Set CR as the newline sequence
+ PCRE_NEWLINE_CRLF Set CRLF as the newline sequence
+ PCRE_NEWLINE_LF Set LF as the newline sequence
+ PCRE_NO_AUTO_CAPTURE Disable numbered capturing paren-
+ theses (named ones available)
+ PCRE_NO_UTF8_CHECK Do not check the pattern for UTF-8
+ validity (only relevant if
+ PCRE_UTF8 is set)
+ PCRE_UNGREEDY Invert greediness of quantifiers
+ PCRE_UTF8 Run in UTF-8 mode
.sp
PCRE must be built with UTF-8 support in order to use PCRE_UTF8 and
PCRE_NO_UTF8_CHECK.
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index c406f60..6b586e8 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -395,7 +395,9 @@ avoiding the use of the stack.
Either of the functions \fBpcre_compile()\fP or \fBpcre_compile2()\fP can be
called to compile a pattern into an internal form. The only difference between
the two interfaces is that \fBpcre_compile2()\fP has an additional argument,
-\fIerrorcodeptr\fP, via which a numerical error code can be returned.
+\fIerrorcodeptr\fP, via which a numerical error code can be returned. To avoid
+too much repetition, we refer just to \fBpcre_compile()\fP below, but the
+information applies equally to \fBpcre_compile2()\fP.
.P
The pattern is a C string terminated by a binary zero, and is passed in the
\fIpattern\fP argument. A pointer to a single block of memory that is obtained
@@ -412,23 +414,23 @@ argument, which is an address (see below).
The \fIoptions\fP argument contains various bit settings that affect the
compilation. It should be zero if no options are required. The available
options are described below. Some of them (in particular, those that are
-compatible with Perl, but also some others) can also be set and unset from
+compatible with Perl, but some others as well) can also be set and unset from
within the pattern (see the detailed description in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation). For those options that can be different in different parts of
-the pattern, the contents of the \fIoptions\fP argument specifies their initial
-settings at the start of compilation and execution. The PCRE_ANCHORED and
-PCRE_NEWLINE_\fIxxx\fP options can be set at the time of matching as well as at
-compile time.
+the pattern, the contents of the \fIoptions\fP argument specifies their
+settings at the start of compilation and execution. The PCRE_ANCHORED,
+PCRE_BSR_\fIxxx\fP, and PCRE_NEWLINE_\fIxxx\fP options can be set at the time
+of matching as well as at compile time.
.P
If \fIerrptr\fP is NULL, \fBpcre_compile()\fP returns NULL immediately.
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fP returns
NULL, and sets the variable pointed to by \fIerrptr\fP to point to a textual
error message. This is a static string that is part of the library. You must
not try to free it. The byte offset from the start of the pattern to the
-character that was being processes when the error was discovered is placed in
+character that was being processed when the error was discovered is placed in
the variable pointed to by \fIerroffset\fP, which must not be NULL. If it is,
an immediate error is given. Some errors are not detected until checks are
carried out when the whole pattern has been scanned; in this case the offset is
@@ -984,7 +986,7 @@ is -1.
.sp
If the pattern was studied and a minimum length for matching subject strings
was computed, its value is returned. Otherwise the returned value is -1. The
-value is a number of characters, not bytes (there may be a difference in UTF-8
+value is a number of characters, not bytes (this may be relevant in UTF-8
mode). The fourth argument should point to an \fBint\fP variable. A
non-negative value is a lower bound to the length of any matching string. There
may not be any strings of that length that do actually match, but every string
@@ -1209,7 +1211,7 @@ the block by setting the other fields and their corresponding flag bits.
The \fImatch_limit\fP field provides a means of preventing PCRE from using up a
vast amount of resources when running patterns that are not going to match,
but which have a very large number of possibilities in their search trees. The
-classic example is the use of nested unlimited repeats.
+classic example is a pattern that uses nested unlimited repeats.
.P
Internally, PCRE uses a function called \fBmatch()\fP which it calls repeatedly
(sometimes recursively). The limit set by \fImatch_limit\fP is imposed on the
@@ -1508,7 +1510,7 @@ the \fIovector\fP is not big enough to remember the related substrings, PCRE
has to get additional memory for use during matching. Thus it is usually
advisable to supply an \fIovector\fP.
.P
-The \fBpcre_info()\fP function can be used to find out how many capturing
+The \fBpcre_fullinfo()\fP function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
\fIovector\fP that will allow for \fIn\fP captured substrings, in addition to
the offsets of the substring matched by the whole pattern, is (\fIn\fP+1)*3.
@@ -2043,6 +2045,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 26 September 2009
+Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
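The ovector sizing rule quoted in the pcre_fullinfo() hunk above, (n+1)*3, can be written as a small helper. This is only a sketch of the arithmetic; the name min_ovector_size is hypothetical and not part of the PCRE API.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest ovector length (counted in ints) that leaves room for
   'capture_count' captured substrings plus the substring matched by
   the whole pattern, following the (n+1)*3 rule. */
static size_t min_ovector_size(int capture_count)
{
    assert(capture_count >= 0);
    return ((size_t)capture_count + 1) * 3;
}
```

In a real program the count would normally be obtained from pcre_fullinfo() with the PCRE_INFO_CAPTURECOUNT request, and the result used to allocate the int array passed as ovector.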
diff --git a/doc/pcrebuild.3 b/doc/pcrebuild.3
index 4801263..dd970dc 100644
--- a/doc/pcrebuild.3
+++ b/doc/pcrebuild.3
@@ -1,6 +1,8 @@
.TH PCREBUILD 3
.SH NAME
PCRE - Perl-compatible regular expressions
+.
+.
.SH "PCRE BUILD-TIME OPTIONS"
.rs
.sp
@@ -29,6 +31,7 @@ The following sections include descriptions of options whose names begin with
--enable and --disable always come in pairs, so the complementary option always
exists as well, but as it specifies the default, it is not described.
.
+.
.SH "C++ SUPPORT"
.rs
.sp
@@ -40,6 +43,7 @@ for PCRE. You can disable this by adding
.sp
to the \fBconfigure\fP command.
.
+.
.SH "UTF-8 SUPPORT"
.rs
.sp
@@ -50,7 +54,7 @@ To build PCRE with support for UTF-8 Unicode character strings, add
to the \fBconfigure\fP command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also have
to set the PCRE_UTF8 option when you call the \fBpcre_compile()\fP
-function.
+or \fBpcre_compile2()\fP functions.
.P
If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE expects
its input to be either ASCII or UTF-8 (depending on the runtime option). It is
@@ -58,6 +62,7 @@ not possible to support both EBCDIC and UTF-8 codes in the same version of the
library. Consequently, --enable-utf8 and --enable-ebcdic are mutually
exclusive.
.
+.
.SH "UNICODE CHARACTER PROPERTY SUPPORT"
.rs
.sp
@@ -80,6 +85,7 @@ supported. Details are given in the
.\"
documentation.
.
+.
.SH "CODE VALUE OF NEWLINE"
.rs
.sp
@@ -112,6 +118,7 @@ Whatever line ending convention is selected when PCRE is built can be
overridden when the library functions are called. At build time it is
conventional to use the standard for your operating system.
.
+.
.SH "WHAT \eR MATCHES"
.rs
.sp
@@ -124,6 +131,7 @@ the default is changed so that \eR matches only CR, LF, or CRLF. Whatever is
selected when PCRE is built can be overridden when the library functions are
called.
.
+.
.SH "BUILDING SHARED AND STATIC LIBRARIES"
.rs
.sp
@@ -135,6 +143,7 @@ Unix libraries by default. You can suppress one of these by adding one of
.sp
to the \fBconfigure\fP command, as required.
.
+.
.SH "POSIX MALLOC USAGE"
.rs
.sp
@@ -154,6 +163,7 @@ such as
.sp
to the \fBconfigure\fP command.
.
+.
.SH "HANDLING VERY LARGE PATTERNS"
.rs
.sp
@@ -162,8 +172,8 @@ another (for example, from an opening parenthesis to an alternation
metacharacter). By default, two-byte values are used for these offsets, leading
to a maximum size for a compiled pattern of around 64K. This is sufficient to
handle all but the most gigantic patterns. Nevertheless, some people do want to
-process enormous patterns, so it is possible to compile PCRE to use three-byte
-or four-byte offsets by adding a setting such as
+process truly enormous patterns, so it is possible to compile PCRE to use
+three-byte or four-byte offsets by adding a setting such as
.sp
--with-link-size=3
.sp
@@ -171,6 +181,7 @@ to the \fBconfigure\fP command. The value given must be 2, 3, or 4. Using
longer offsets slows down the operation of PCRE because it has to load
additional bytes when handling them.
.
+.
.SH "AVOIDING EXCESSIVE STACK USAGE"
.rs
.sp
@@ -194,7 +205,7 @@ to the \fBconfigure\fP command. With this configuration, PCRE will use the
\fBpcre_stack_malloc\fP and \fBpcre_stack_free\fP variables to call memory
management functions. By default these point to \fBmalloc()\fP and
\fBfree()\fP, but you can replace the pointers so that your own functions are
-used.
+used instead.
.P
Separate functions are provided rather than using \fBpcre_malloc\fP and
\fBpcre_free\fP because the usage is very predictable: the block sizes
@@ -202,7 +213,8 @@ requested are always the same, and the blocks are always freed in reverse
order. A calling program might be able to implement optimized functions that
perform better than \fBmalloc()\fP and \fBfree()\fP. PCRE runs noticeably more
slowly when built in this way. This option affects only the \fBpcre_exec()\fP
-function; it is not relevant for the the \fBpcre_dfa_exec()\fP function.
+function; it is not relevant for \fBpcre_dfa_exec()\fP.
+.
.
.SH "LIMITING PCRE RESOURCE USAGE"
.rs
@@ -235,6 +247,7 @@ constraints. However, you can set a lower limit by adding, for example,
.sp
to the \fBconfigure\fP command. This value can also be overridden at run time.
.
+.
.SH "CREATING CHARACTER TABLES AT BUILD TIME"
.rs
.sp
@@ -253,6 +266,7 @@ compiling, because \fBdftables\fP is run on the local host. If you need to
create alternative tables when cross compiling, you will have to do so "by
hand".)
.
+.
.SH "USING EBCDIC CODE"
.rs
.sp
@@ -268,6 +282,7 @@ to the \fBconfigure\fP command. This setting implies
an EBCDIC environment (for example, an IBM mainframe operating system). The
--enable-ebcdic option is incompatible with --enable-utf8.
.
+.
.SH "PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT"
.rs
.sp
@@ -282,6 +297,7 @@ to the \fBconfigure\fP command. These options naturally require that the
relevant libraries are installed on your system. Configuration will fail if
they are not.
.
+.
.SH "PCRETEST OPTION FOR LIBREADLINE SUPPORT"
.rs
.sp
@@ -292,7 +308,7 @@ If you add
to the \fBconfigure\fP command, \fBpcretest\fP is linked with the
\fBlibreadline\fP library, and when its input is from a terminal, it reads it
using the \fBreadline()\fP function. This provides line-editing and history
-facilities. Note that \fBlibreadline\fP is GPL-licenced, so if you distribute a
+facilities. Note that \fBlibreadline\fP is GPL-licensed, so if you distribute a
binary of \fBpcretest\fP linked in this way, there may be licensing issues.
.P
Setting this option causes the \fB-lreadline\fP option to be added to the
@@ -334,6 +350,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 06 September 2009
+Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
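The "around 64K" figure in the link-size hunk above follows from how much a two-byte offset can hold. The sketch below computes only that arithmetic ceiling for each permitted --with-link-size value. The name max_link_offset is hypothetical, and the limit PCRE actually enforces for larger link sizes may be lower.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value that an offset of 'link_size' bytes can represent.
   This is the representable range only, not necessarily the pattern
   size limit that PCRE itself applies. */
static uint64_t max_link_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);   /* values accepted by configure */
    return (UINT64_C(1) << (8 * link_size)) - 1;
}
```

With a link size of 2 the ceiling is 65535, which matches the 64K figure; a link size of 3 raises it to 16777215.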
diff --git a/doc/pcrecallout.3 b/doc/pcrecallout.3
index abdbaed..ad8a211 100644
--- a/doc/pcrecallout.3
+++ b/doc/pcrecallout.3
@@ -19,9 +19,10 @@ For example, this pattern has two callout points:
.sp
(?C1)abc(?C2)def
.sp
-If the PCRE_AUTO_CALLOUT option bit is set when \fBpcre_compile()\fP is called,
-PCRE automatically inserts callouts, all with number 255, before each item in
-the pattern. For example, if PCRE_AUTO_CALLOUT is used with the pattern
+If the PCRE_AUTO_CALLOUT option bit is set when \fBpcre_compile()\fP or
+\fBpcre_compile2()\fP is called, PCRE automatically inserts callouts, all with
+number 255, before each item in the pattern. For example, if PCRE_AUTO_CALLOUT
+is used with the pattern
.sp
A(\ed{2}|--)
.sp
@@ -54,6 +55,11 @@ string is "abyz", the lack of "d" means that matching doesn't ever start, and
the callout is never reached. However, with "abyd", though the result is still
no match, the callout is obeyed.
.P
+If the pattern is studied, PCRE knows the minimum length of a matching string,
+and will immediately give a "no match" return without actually running a match
+if the subject is not long enough, or, for unanchored patterns, if it has
+been scanned far enough.
+.P
You can disable these optimizations by passing the PCRE_NO_START_OPTIMIZE
option to \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. This slows down the
matching process, but does ensure that callouts such as the example above are
@@ -155,7 +161,7 @@ The external callout function returns an integer to PCRE. If the value is zero,
matching proceeds as normal. If the value is greater than zero, matching fails
at the current point, but the testing of other matching possibilities goes
ahead, just as if a lookahead assertion had failed. If the value is less than
-zero, the match is abandoned, and \fBpcre_exec()\fP (or \fBpcre_dfa_exec()\fP)
+zero, the match is abandoned, and \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP
returns the negative value.
.P
Negative values should normally be chosen from the set of PCRE_ERROR_xxx
@@ -178,6 +184,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 15 March 2009
+Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
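The three callout return cases described in the hunk above (zero, positive, negative) can be summarised in a small classifier. The enum and function names are hypothetical and exist only for this sketch; a real callout function simply returns the integer.

```c
#include <assert.h>

/* Hypothetical labels for the three outcomes the documentation describes. */
typedef enum {
    CALLOUT_CONTINUE,   /* zero: matching proceeds as normal */
    CALLOUT_FAIL_HERE,  /* positive: fail at this point, try other possibilities */
    CALLOUT_ABANDON     /* negative: abandon the match, value is returned */
} callout_action;

/* Map a callout return value onto the action the matcher takes. */
static callout_action classify_callout_return(int rc)
{
    if (rc == 0)
        return CALLOUT_CONTINUE;
    if (rc > 0)
        return CALLOUT_FAIL_HERE;
    return CALLOUT_ABANDON;
}
```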
diff --git a/doc/pcrecompat.3 b/doc/pcrecompat.3
index f32b071..2028c52 100644
--- a/doc/pcrecompat.3
+++ b/doc/pcrecompat.3
@@ -5,9 +5,8 @@ PCRE - Perl-compatible regular expressions
.rs
.sp
This document describes the differences in the ways that PCRE and Perl handle
-regular expressions. The differences described here are mainly with respect to
-Perl 5.8, though PCRE versions 7.0 and later contain some features that are
-in Perl 5.10.
+regular expressions. The differences described here are with respect to Perl
+5.10.
.P
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details of what
it does have are given in the
@@ -86,7 +85,7 @@ section on recursion differences from Perl
.\"
in the
.\" HREF
-\fBpcrecompat\fP
+\fBpcrepattern\fP
.\"
page.
.P
@@ -98,14 +97,30 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
(*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in the forms without an
argument. PCRE does not support (*MARK).
.P
-12. PCRE provides some extensions to the Perl regular expression facilities.
-Perl 5.10 will include new features that are not in earlier versions, some of
-which (such as named parentheses) have been in PCRE for some time. This list is
-with respect to Perl 5.10:
+12. PCRE's handling of duplicate subpattern numbers and duplicate subpattern
+names is not as general as Perl's. This is a consequence of the fact that PCRE
+works internally just with numbers, using an external table to translate
+between numbers and names. The following are some specific differences:
+.sp
+(a) After matching a pattern such as (?|(?<a>A)|(?<b>B)) where the two capturing
+parentheses have the same number but different names, it is not possible to
+distinguish which parentheses matched, because both names map to capturing
+subpattern number 1.
+.sp
+(b) A condition test for a subpattern with a name that is duplicated gives
+unpredictable results. For example, when the pattern
+(?:(?<a>A)|(?<a>B))(?('a')...|...) is compiled (the PCRE_DUPNAMES option is
+required), the condition test (?('a') is set to test whether subpattern 1 has
+matched, ignoring subpattern 2, even though it has the same name.
+.P
+13. PCRE provides some extensions to the Perl regular expression facilities.
+Perl 5.10 includes new features that are not in earlier versions of Perl, some
+of which (such as named parentheses) have been in PCRE for some time. This list
+is with respect to Perl 5.10:
.sp
-(a) Although lookbehind assertions must match fixed length strings, each
-alternative branch of a lookbehind assertion can match a different length of
-string. Perl requires them all to have the same length.
+(a) Although lookbehind assertions in PCRE must match fixed length strings,
+each alternative branch of a lookbehind assertion can match a different length
+of string. Perl requires them all to have the same length.
.sp
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
@@ -155,6 +170,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 18 September 2009
+Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcrematching.3 b/doc/pcrematching.3
index a3b8363..2e2abd9 100644
--- a/doc/pcrematching.3
+++ b/doc/pcrematching.3
@@ -74,13 +74,17 @@ this is a kind of "DFA algorithm", though it is not implemented as a
traditional finite state machine (it keeps multiple states active
simultaneously).
.P
+Although the general principle of this matching algorithm is that it scans the
+subject string only once, without backtracking, there is one exception: when a
+lookaround assertion is encountered, the characters following or preceding the
+current point have to be independently inspected.
+.P
The scan continues until either the end of the subject is reached, or there are
no more unterminated paths. At this point, terminated paths represent the
different matching possibilities (if there are none, the match has failed).
Thus, if there is more than one possible match, this algorithm finds all of
-them, and in particular, it finds the longest. In PCRE, there is an option to
-stop the algorithm after the first match (which is necessarily the shortest)
-has been found.
+them, and in particular, it finds the longest. There is an option to stop the
+algorithm after the first match (which is necessarily the shortest) is found.
.P
Note that all the matches that are found start at the same point in the
subject. If the pattern
@@ -92,11 +96,6 @@ the three strings "cat", "cater", and "caterpillar" that start at the fourth
character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
.P
-Although the general principle of this matching algorithm is that it scans the
-subject string only once, without backtracking, there is one exception: when a
-lookbehind assertion is encountered, the preceding characters have to be
-re-inspected.
-.P
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
.P
@@ -152,7 +151,12 @@ callouts.
2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack, it is possible to pass very long subject strings to
the matching function in several pieces, checking for partial matching each
-time.
+time. The
+.\" HREF
+\fBpcrepartial\fP
+.\"
+documentation gives details of partial matching.
+.
.
.SH "DISADVANTAGES OF THE ALTERNATIVE ALGORITHM"
.rs
@@ -183,6 +187,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 05 September 2009
+Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcrepartial.3 b/doc/pcrepartial.3
index 0e9cc47..05487e1 100644
--- a/doc/pcrepartial.3
+++ b/doc/pcrepartial.3
@@ -32,10 +32,13 @@ whether or not a partial match is preferred to an alternative complete match,
though the details differ between the two matching functions. If both options
are set, PCRE_PARTIAL_HARD takes precedence.
.P
-Setting a partial matching option disables one of PCRE's optimizations. PCRE
+Setting a partial matching option disables two of PCRE's optimizations. PCRE
remembers the last literal byte in a pattern, and abandons matching immediately
if such a byte is not present in the subject string. This optimization cannot
-be used for a subject string that might match only partially.
+be used for a subject string that might match only partially. If the pattern
+was studied, PCRE knows the minimum length of a matching string, and does not
+bother to run the matching function on shorter strings. This optimization is
+also disabled for partial matching.
.
.
.SH "PARTIAL MATCHING USING pcre_exec()"
@@ -53,7 +56,7 @@ instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
vector, the first of them is set to the offset of the earliest character that
was inspected when the partial match was found. For convenience, the second
offset points to the end of the string so that a substring can easily be
-extracted.
+identified.
.P
For the majority of patterns, the first offset identifies the start of the
partially matched string. However, for patterns that contain lookbehind
@@ -358,6 +361,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 05 September 2009
+Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
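The minimum-length optimization added to the first pcrepartial hunk above can be sketched as a single decision. The function name and parameters are hypothetical. The convention that -1 means "no minimum length was computed" is taken from the pcreapi hunk earlier in this diff; the rest is an illustration of the described behaviour, not PCRE's code.

```c
#include <assert.h>
#include <stdbool.h>

/* Decide whether a match attempt can be skipped outright.
   min_length:     minimum subject length found by studying, in
                   characters, or -1 if none was computed
   subject_length: length of the subject, in characters
   partial:        true when a partial-matching option is set */
static bool can_skip_match(int min_length, int subject_length, bool partial)
{
    assert(subject_length >= 0);
    if (partial)
        return false;               /* optimization disabled for partial matching */
    if (min_length < 0)
        return false;               /* pattern not studied, nothing known */
    return subject_length < min_length;
}
```

A subject of 3 characters against a studied minimum of 5 is rejected without running the matcher, unless a partial match was requested, because a short subject may still be the start of a longer match.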
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 0b26453..34a5373 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -21,10 +21,10 @@ published by O'Reilly, covers regular expressions in great detail. This
description of PCRE's regular expressions is intended as reference material.
.P
The original operation of PCRE was on strings of one-byte characters. However,
-there is now also support for UTF-8 character strings. To use this, you must
-build PCRE to include UTF-8 support, and then call \fBpcre_compile()\fP with
-the PCRE_UTF8 option. There is also a special sequence that can be given at the
-start of a pattern:
+there is now also support for UTF-8 character strings. To use this,
+PCRE must be built to include UTF-8 support, and you must call
+\fBpcre_compile()\fP or \fBpcre_compile2()\fP with the PCRE_UTF8 option. There
+is also a special sequence that can be given at the start of a pattern:
.sp
(*UTF8)
.sp
@@ -83,8 +83,9 @@ string with one of the following five sequences:
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
.sp
-These override the default and the options given to \fBpcre_compile()\fP. For
-example, on a Unix system where LF is the default newline sequence, the pattern
+These override the default and the options given to \fBpcre_compile()\fP or
+\fBpcre_compile2()\fP. For example, on a Unix system where LF is the default
+newline sequence, the pattern
.sp
(*CR)a.b
.sp
@@ -206,9 +207,8 @@ The \eQ...\eE sequence is recognized both inside and outside character classes.
A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters, apart from the binary zero that terminates a pattern,
-but when a pattern is being prepared by text editing, it is usually easier to
-use one of the following escape sequences than the binary character it
-represents:
+but when a pattern is being prepared by text editing, it is often easier to use
+one of the following escape sequences than the binary character it represents:
.sp
\ea alarm, that is, the BEL character (hex 07)
\ecx "control-x", where x is any character
@@ -468,12 +468,13 @@ one of the following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
.sp
-These override the default and the options given to \fBpcre_compile()\fP, but
-they can be overridden by options given to \fBpcre_exec()\fP. Note that these
-special settings, which are not Perl-compatible, are recognized only at the
-very start of a pattern, and that they must be in upper case. If more than one
-of them is present, the last one is used. They can be combined with a change of
-newline convention, for example, a pattern can start with:
+These override the default and the options given to \fBpcre_compile()\fP or
+\fBpcre_compile2()\fP, but they can be overridden by options given to
+\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. Note that these special settings,
+which are not Perl-compatible, are recognized only at the very start of a
+pattern, and that they must be in upper case. If more than one of them is
+present, the last one is used. They can be combined with a change of newline
+convention, for example, a pattern can start with:
.sp
(*ANY)(*BSR_ANYCRLF)
.sp
@@ -740,7 +741,10 @@ different meaning, namely the backspace character, inside a character class).
A word boundary is a position in the subject string where the current character
and the previous character do not both match \ew or \eW (i.e. one matches
\ew and the other matches \eW), or the start or end of the string if the
-first or last character matches \ew, respectively.
+first or last character matches \ew, respectively. Neither PCRE nor Perl has a
+separate "start of word" or "end of word" metasequence. However, whatever
+follows \eb normally determines which it is. For example, the fragment
+\eba matches "a" at the start of a word.
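The point about \eb can be tried out with a short sketch. It uses Python's re module rather than PCRE, on the assumption (true for plain ASCII text like this) that \b behaves the same way in both; the subject string is made up for illustration.

```python
import re

subject = "a banana, alas"

# \b followed by "a": the boundary can only be a start of word.
starts = [m.start() for m in re.finditer(r"\ba", subject)]

# "a" followed by \b: the boundary can only be an end of word.
ends = [m.start() for m in re.finditer(r"a\b", subject)]

print(starts)  # offsets of "a" at the start of a word
print(ends)    # offsets of "a" at the end of a word
```

What follows or precedes \b is what turns it into a start-of-word or end-of-word test.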
.P
The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
dollar (described in the next section) in that they only ever match at the very
@@ -872,14 +876,15 @@ the lookbehind.
.rs
.sp
An opening square bracket introduces a character class, terminated by a closing
-square bracket. A closing square bracket on its own is not special. If a
-closing square bracket is required as a member of the class, it should be the
-first data character in the class (after an initial circumflex, if present) or
-escaped with a backslash.
+square bracket. A closing square bracket on its own is not special by default.
+However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
+bracket causes a compile-time error. If a closing square bracket is required as
+a member of the class, it should be the first data character in the class
+(after an initial circumflex, if present) or escaped with a backslash.
.P
A character class matches a single character in the subject. In UTF-8 mode, the
-character may occupy more than one byte. A matched character must be in the set
-of characters defined by the class, unless the first character in the class
+character may be more than one byte long. A matched character must be in the
+set of characters defined by the class, unless the first character in the class
definition is a circumflex, in which case the subject character must not be in
the set defined by the class. If a circumflex is actually required as a member
of the class, ensure it is not the first character, or escape it with a
@@ -889,7 +894,7 @@ For example, the character class [aeiou] matches any lower case vowel, while
[^aeiou] matches any character that is not a lower case vowel. Note that a
circumflex is just a convenient notation for specifying the characters that
are in the class by enumerating those that are not. A class that starts with a
-circumflex is not an assertion: it still consumes a character from the subject
+circumflex is not an assertion; it still consumes a character from the subject
string, and therefore it fails if the current pointer is at the end of the
string.
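A small sketch of the difference between a negated class and an assertion, again using Python's re module as a stand-in for PCRE (an assumption; both engines treat these two constructs alike). The patterns and subjects are illustrative only.

```python
import re

negated_class = re.compile(r"t[^aeiou]")   # must consume a non-vowel
lookahead = re.compile(r"t(?![aeiou])")    # only asserts "no vowel next"

# In "cat" the "t" is the last character, so the class has nothing to
# consume and fails, while the assertion succeeds at the end of string.
class_at_end = negated_class.search("cat")
assert_at_end = lookahead.search("cat")

# In "cats" both succeed, but the class also consumes the "s".
class_mid = negated_class.search("cats").group()
assert_mid = lookahead.search("cats").group()
```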
.P
@@ -903,9 +908,9 @@ caseful version would. In UTF-8 mode, PCRE always understands the concept of
case for characters whose values are less than 128, so caseless matching is
always possible. For characters with higher values, the concept of case is
supported if PCRE is compiled with Unicode property support, but not otherwise.
-If you want to use caseless matching for characters 128 and above, you must
-ensure that PCRE is compiled with Unicode property support as well as with
-UTF-8 support.
+If you want to use caseless matching in UTF-8 mode for characters 128 and above,
+you must ensure that PCRE is compiled with Unicode property support as well as
+with UTF-8 support.
.P
Characters that might indicate line breaks are never treated in any special way
when matching character classes, whatever line-ending sequence is in use, and
@@ -1132,6 +1137,7 @@ is reached, an option setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
.
.
+.\" HTML <a name="dupsubpatternnumber"></a>
.SH "DUPLICATE SUBPATTERN NUMBERS"
.rs
.sp
@@ -1157,10 +1163,20 @@ stored.
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
.sp
-A backreference or a recursive call to a numbered subpattern always refers to
-the first one in the pattern with the given number.
+A backreference to a numbered subpattern uses the most recent value that is set
+for that number by any subpattern. The following pattern matches "abcabc" or
+"defdef":
+.sp
+ /(?|(abc)|(def))\e1/
+.sp
+In contrast, a recursive or "subroutine" call to a numbered subpattern always
+refers to the first one in the pattern with the given number. The following
+pattern matches "abcabc" or "defabc":
+.sp
+ /(?|(abc)|(def))(?1)/
+.sp
.P
-An alternative approach to using this "branch reset" feature is to use
+An alternative approach to using the "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
.
.
@@ -1247,6 +1263,7 @@ items:
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion)
+ a recursive or "subroutine" call to a subpattern
.sp
The general repetition quantifier specifies a minimum and maximum number of
permitted matches, by giving the two numbers in curly brackets (braces),
@@ -1568,16 +1585,19 @@ after the reference.
.P
There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
-references to it always fail. For example, the pattern
+references to it always fail by default. For example, the pattern
.sp
(a|(bc))\e2
.sp
-always fails if it starts to match "a" rather than "bc". Because there may be
-many capturing parentheses in a pattern, all digits following the backslash are
-taken as part of a potential back reference number. If the pattern continues
-with a digit character, some delimiter must be used to terminate the back
-reference. If the PCRE_EXTENDED option is set, this can be whitespace.
-Otherwise an empty comment (see
+always fails if it starts to match "a" rather than "bc". However, if the
+PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
+unset value matches an empty string.
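The default behaviour described here can be checked with a short sketch. It uses Python's re module, which (like PCRE without PCRE_JAVASCRIPT_COMPAT) makes a back reference to an unset group fail; this is an assumption about equivalence for this one pattern, and the JavaScript-compatible behaviour is not shown because re has no such option.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 took part in the match, so \2 can repeat "bc".
used = pattern.fullmatch("bcbc")

# The first alternative matched "a", so group 2 is unset and \2 fails,
# whatever follows in the subject.
unset_short = pattern.search("a")
unset_long = pattern.search("aa")
```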
+.P
+Because there may be many capturing parentheses in a pattern, all digits
+following a backslash are taken as part of a potential back reference number.
+If the pattern continues with a digit character, some delimiter must be used to
+terminate the back reference. If the PCRE_EXTENDED option is set, this can be
+whitespace. Otherwise, the \eg{ syntax or an empty comment (see
.\" HTML <a href="#comments">
.\" </a>
"Comments"
@@ -1650,6 +1670,8 @@ lookbehind assertion is needed to achieve the other effect.
If you want to force a matching failure at some point in a pattern, the most
convenient way to do it is with (?!) because an empty string always matches, so
an assertion that requires there not to be an empty string must always fail.
+The Perl 5.10 backtracking control verb (*FAIL) or (*F) is essentially a
+synonym for (?!).
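A minimal sketch of (?!) as a forced failure, using Python's re module on the assumption that it treats an empty negative lookahead as PCRE does. The (*FAIL) verb itself is not available in re, so only the (?!) spelling is shown.

```python
import re

# An empty negative lookahead can never succeed.
never = re.compile(r"(?!)")
no_match_text = never.search("anything")
no_match_empty = never.search("")

# Placed at the end of one alternative, it forces that branch to fail,
# so only the other branch can produce a match.
branches = re.compile(r"cat(?!)|dog")
found = branches.search("cat dog")
```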
.
.
.\" HTML <a name="lookbehind"></a>
@@ -1716,8 +1738,8 @@ Recursion,
however, is not supported.
.P
Possessive quantifiers can be used in conjunction with lookbehind assertions to
-specify efficient matching at the end of the subject string. Consider a simple
-pattern such as
+specify efficient matching of fixed-length strings at the end of subject
+strings. Consider a simple pattern such as
.sp
abcd$
.sp
@@ -1781,8 +1803,8 @@ characters that are not "999".
.sp
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns, depending on
-the result of an assertion, or whether a previous capturing subpattern matched
-or not. The two possible forms of conditional subpattern are
+the result of an assertion, or whether a specific capturing subpattern has
+already been matched. The two possible forms of conditional subpattern are:
.sp
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
@@ -1798,12 +1820,20 @@ recursion, a pseudo-condition called DEFINE, and assertions.
.rs
.sp
If the text between the parentheses consists of a sequence of digits, the
-condition is true if the capturing subpattern of that number has previously
-matched. An alternative notation is to precede the digits with a plus or minus
-sign. In this case, the subpattern number is relative rather than absolute.
-The most recently opened parentheses can be referenced by (?(-1), the next most
-recent by (?(-2), and so on. In looping constructs it can also make sense to
-refer to subsequent groups with constructs such as (?(+2).
+condition is true if a capturing subpattern of that number has previously
+matched. If there is more than one capturing subpattern with the same number
+(see the earlier
+.\"
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+section about duplicate subpattern numbers),
+.\"
+the condition is true if any of them have been set. An alternative notation is
+to precede the digits with a plus or minus sign. In this case, the subpattern
+number is relative rather than absolute. The most recently opened parentheses
+can be referenced by (?(-1), the next most recent by (?(-2), and so on. In
+looping constructs it can also make sense to refer to subsequent groups with
+constructs such as (?(+2).
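A short sketch of a condition on an absolute group number, using Python's re module, which supports the (?(1)...) form but not the relative (?(-1) or (?(+2) forms. The pattern is an illustrative choice made here, not taken from this hunk: a closing parenthesis is required only if an opening one was matched by group 1.

```python
import re

# Group 1 optionally matches "("; the condition (?(1)\)) then demands
# a ")" only when group 1 has been set.
pattern = re.compile(r"(\()?[^()]+(?(1)\))")

with_parens = pattern.fullmatch("(abc)")
without_parens = pattern.fullmatch("abc")
missing_close = pattern.fullmatch("(abc")
stray_close = pattern.fullmatch("abc)")
```

The first two subjects match in full; the last two do not, because the parentheses are unbalanced.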
.P
Consider the following pattern, which contains non-significant white space to
make it more readable (assume the PCRE_EXTENDED option) and to divide it into
@@ -1855,7 +1885,7 @@ letter R, for example:
.sp
(?(R3)...) or (?(R&name)...)
.sp
-the condition is true if the most recent recursion is into the subpattern whose
+the condition is true if the most recent recursion is into a subpattern whose
number or name is given. This condition does not check the entire recursion
stack.
.P
@@ -1887,11 +1917,9 @@ written like this (ignore whitespace and line breaks):
The first part of the pattern is a DEFINE group inside which another group
named "byte" is defined. This matches an individual component of an IPv4
address (a number less than 256). When matching takes place, this part of the
-pattern is skipped because DEFINE acts like a false condition.
-.P
-The rest of the pattern uses references to the named group to match the four
-dot-separated components of an IPv4 address, insisting on a word boundary at
-each end.
+pattern is skipped because DEFINE acts like a false condition. The rest of the
+pattern uses references to the named group to match the four dot-separated
+components of an IPv4 address, insisting on a word boundary at each end.
.
.SS "Assertion conditions"
.rs
@@ -1963,23 +1991,24 @@ a recursive call of the entire regular expression.
This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
.sp
- \e( ( (?>[^()]+) | (?R) )* \e)
+ \e( ( [^()]++ | (?R) )* \e)
.sp
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a recursive
match of the pattern itself (that is, a correctly parenthesized substring).
-Finally there is a closing parenthesis.
+Finally there is a closing parenthesis. Note the use of a possessive quantifier
+to avoid backtracking into sequences of non-parentheses.
.P
If this were part of a larger pattern, you would not want to recurse the entire
pattern, so instead you could use this:
.sp
- ( \e( ( (?>[^()]+) | (?1) )* \e) )
+ ( \e( ( [^()]++ | (?1) )* \e) )
.sp
We have put the pattern into parentheses, and caused the recursion to refer to
them instead of the whole pattern.
.P
In a larger pattern, keeping track of parenthesis numbers can be tricky. This
-is made easier by the use of relative references. (A Perl 5.10 feature.)
+is made easier by the use of relative references (a Perl 5.10 feature).
Instead of (?1) in the pattern above you can write (?-2) to refer to the second
most recently opened parentheses preceding the recursion. In other words, a
negative number counts capturing parentheses leftwards from the point at which
@@ -1998,19 +2027,19 @@ An alternative approach is to use named parentheses instead. The Perl syntax
for this is (?&name); PCRE's earlier syntax (?P>name) is also supported. We
could rewrite the above example as follows:
.sp
- (?<pn> \e( ( (?>[^()]+) | (?&pn) )* \e) )
+ (?<pn> \e( ( [^()]++ | (?&pn) )* \e) )
.sp
If there is more than one subpattern with the same name, the earliest one is
used.
.P
This particular example pattern that we have been looking at contains nested
-unlimited repeats, and so the use of atomic grouping for matching strings of
-non-parentheses is important when applying the pattern to strings that do not
-match. For example, when this pattern is applied to
+unlimited repeats, and so the use of a possessive quantifier for matching
+strings of non-parentheses is important when applying the pattern to strings
+that do not match. For example, when this pattern is applied to
.sp
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
.sp
-it yields "no match" quickly. However, if atomic grouping is not used,
+it yields "no match" quickly. However, if a possessive quantifier is not used,
the match runs for a very long time indeed because there are so many different
ways the + and * repeats can carve up the subject, and all have to be tested
before failure can be reported.
@@ -2029,7 +2058,7 @@ documentation). If the pattern above is matched against
the value for the capturing parentheses is "ef", which is the last value taken
on at the top level. If additional parentheses are added, giving
.sp
- \e( ( ( (?>[^()]+) | (?R) )* ) \e)
+ \e( ( ( [^()]++ | (?R) )* ) \e)
^ ^
^ ^
.sp
@@ -2113,6 +2142,13 @@ the use of the possessive quantifier *+ to avoid backtracking into sequences of
non-word characters. Without this, PCRE takes a great deal longer (ten times or
more) to match typical phrases, and Perl takes so long that you think it has
gone into a loop.
+.P
+\fBWARNING\fP: The palindrome-matching patterns above work only if the subject
+string does not start with a palindrome that is shorter than the entire string.
+For example, although "abcba" is correctly matched, if the subject is "ababa",
+PCRE finds the palindrome "aba" at the start, then fails at top level because
+the end of the string does not follow. Once again, it cannot jump back into the
+recursion to try other alternatives, so the entire match fails.
.
.
.\" HTML <a name="subpatternsassubroutines"></a>
@@ -2248,8 +2284,8 @@ The following verbs act as soon as they are encountered:
.sp
This verb causes the match to end successfully, skipping the remainder of the
pattern. When inside a recursion, only the innermost pattern is ended
-immediately. If the (*ACCEPT) is inside capturing parentheses, the data so far
-is captured. (This feature was added to PCRE at release 8.00.) For example:
+immediately. If (*ACCEPT) is inside capturing parentheses, the data so far is
+captured. (This feature was added to PCRE at release 8.00.) For example:
.sp
A((?:A|B(*ACCEPT)|C)D)
.sp
@@ -2280,7 +2316,7 @@ The verbs differ in exactly what kind of failure occurs.
.sp
This verb causes the whole match to fail outright if the rest of the pattern
does not match. Even if the pattern is unanchored, no further attempts to find
-a match by advancing the start point take place. Once (*COMMIT) has been
+a match by advancing the starting point take place. Once (*COMMIT) has been
passed, \fBpcre_exec()\fP is committed to finding a match at the current
starting point, or not at all. For example:
.sp
@@ -2312,7 +2348,7 @@ was matched leading up to it cannot be part of a successful match. Consider:
If the subject is "aaaac...", after the first match attempt fails (starting at
the first character in the string), the starting point skips on to start the
next attempt at "c". Note that a possessive quantifier does not have the same
-effect in this example; although it would suppress backtracking during the
+effect as this example; although it would suppress backtracking during the
first match attempt, the second attempt would start at the second character
instead of skipping on to "c".
.sp
@@ -2334,7 +2370,8 @@ is used outside of any alternation, it acts exactly like (*PRUNE).
.SH "SEE ALSO"
.rs
.sp
-\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3), \fBpcre\fP(3).
+\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
+\fBpcresyntax\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
@@ -2351,6 +2388,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 22 September 2009
+Last updated: 30 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcresample.3 b/doc/pcresample.3
index 48941c5..f7eefda 100644
--- a/doc/pcresample.3
+++ b/doc/pcresample.3
@@ -25,8 +25,8 @@ string. The logic is a little bit tricky because of the possibility of matching
an empty string. Comments in the code explain what is going on.
.P
If PCRE is installed in the standard include and library directories for your
-system, you should be able to compile the demonstration program using this
-command:
+operating system, you should be able to compile the demonstration program using
+this command:
.sp
gcc -o pcredemo pcredemo.c -lpcre
.sp
@@ -87,6 +87,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 01 September 2009
+Last updated: 30 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/perltest.txt b/doc/perltest.txt
index ca02690..fbbc10e 100644
--- a/doc/perltest.txt
+++ b/doc/perltest.txt
@@ -1,7 +1,7 @@
The perltest program
--------------------
-The perltest program tests Perl's regular expressions; it has the same
+The perltest.pl script tests Perl's regular expressions; it has the same
specification as pcretest, and so can be given identical input, except that
input patterns can be followed only by Perl's lower case modifiers and /+ (as
used by pcretest), which is recognized and handled by the program.
@@ -14,20 +14,14 @@ modifiers such as /A that pcretest recognizes, and its special data line
escapes, are not used in these files. The output should be identical, apart
from the initial identifying banner.
-The perltest script can also test UTF-8 features. It works as is for Perl 5.8
-or higher. It recognizes the special modifier /8 that pcretest uses to invoke
-UTF-8 functionality. The testinput4 file can be fed to perltest to run
-compatible UTF-8 tests.
+The perltest.pl script can also test UTF-8 features. It recognizes the special
+modifier /8 that pcretest uses to invoke UTF-8 functionality. The testinput4
+file can be fed to perltest to run compatible UTF-8 tests.
-For Perl 5.6, perltest won't work unmodified for the UTF-8 tests. You need to
-uncomment the "use utf8" lines that it contains. It is best to do this on a
-copy of the script, because for non-UTF-8 tests, these lines should remain
-commented out.
-
-The other testinput files are not suitable for feeding to perltest, since they
-make use of the special upper case modifiers and escapes that pcretest uses to
-test some features of PCRE. Some of these files also contains malformed regular
-expressions, in order to check that PCRE diagnoses them correctly.
+The other testinput files are not suitable for feeding to perltest.pl, since
+they make use of the special upper case modifiers and escapes that pcretest
+uses to test some features of PCRE. Some of these files also contain malformed
+regular expressions, in order to check that PCRE diagnoses them correctly.
Philip Hazel
-September 2004
+September 2009