author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-10-05 10:59:35 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-10-05 10:59:35 +0000
commit f66c8de115b662c90e2a0af9a4357f69df2b3106 (patch)
tree fec1a80cdf7c366cb1868339fce075f556c95feb
parent 7f1b753dfecb0db660812f00e667abaca6252e28 (diff)
Tidy up, remove trailing spaces, etc. for 8.00-RC1.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@461 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rwxr-xr-x 132html 14
-rw-r--r-- ChangeLog 172
-rw-r--r-- LICENCE 2
-rw-r--r-- NEWS 15
-rw-r--r-- NON-UNIX-USE 20
-rw-r--r-- README 36
-rwxr-xr-x RunGrepTest 4
-rwxr-xr-x RunTest 4
-rw-r--r-- configure.ac 2
-rw-r--r-- doc/html/index.html 16
-rw-r--r-- doc/html/pcre.html 32
-rw-r--r-- doc/html/pcre_compile.html 4
-rw-r--r-- doc/html/pcre_compile2.html 50
-rw-r--r-- doc/html/pcre_dfa_exec.html 4
-rw-r--r-- doc/html/pcre_exec.html 4
-rw-r--r-- doc/html/pcre_fullinfo.html 1
-rw-r--r-- doc/html/pcreapi.html 120
-rw-r--r-- doc/html/pcrebuild.html 22
-rw-r--r-- doc/html/pcrecallout.html 17
-rw-r--r-- doc/html/pcrecompat.html 36
-rw-r--r-- doc/html/pcregrep.html 16
-rw-r--r-- doc/html/pcrematching.html 23
-rw-r--r-- doc/html/pcrepartial.html 11
-rw-r--r-- doc/html/pcrepattern.html 280
-rw-r--r-- doc/html/pcreposix.html 4
-rw-r--r-- doc/html/pcresample.html 10
-rw-r--r-- doc/html/pcretest.html 11
-rw-r--r-- doc/pcre.txt 2103
-rw-r--r-- doc/pcre_compile2.3 2
-rw-r--r-- doc/pcre_dfa_exec.3 4
-rw-r--r-- doc/pcre_exec.3 4
-rw-r--r-- doc/pcre_fullinfo.3 2
-rw-r--r-- doc/pcreapi.3 40
-rw-r--r-- doc/pcrebuild.3 8
-rw-r--r-- doc/pcrecallout.3 2
-rw-r--r-- doc/pcrecompat.3 4
-rw-r--r-- doc/pcregrep.1 16
-rw-r--r-- doc/pcrematching.3 10
-rw-r--r-- doc/pcrepartial.3 6
-rw-r--r-- doc/pcrepattern.3 110
-rw-r--r-- doc/pcreposix.3 4
-rw-r--r-- doc/pcresample.3 4
-rw-r--r-- doc/pcretest.1 8
-rw-r--r-- doc/pcretest.txt 4
-rw-r--r-- doc/perltest.txt 6
-rw-r--r-- pcre_compile.c 164
-rw-r--r-- pcre_dfa_exec.c 72
-rw-r--r-- pcre_exec.c 160
-rw-r--r-- pcre_fullinfo.c 6
-rw-r--r-- pcre_internal.h 6
-rw-r--r-- pcre_printint.src 6
-rw-r--r-- pcre_study.c 122
-rw-r--r-- pcre_try_flipped.c 2
-rw-r--r-- pcregrep.c 36
-rw-r--r-- pcreposix.c 34
-rw-r--r-- pcretest.c 16
-rwxr-xr-x perltest.pl 6
-rw-r--r-- testdata/testinput2 4
-rw-r--r-- testdata/testoutput2 110
59 files changed, 2140 insertions, 1871 deletions
diff --git a/132html b/132html
index 062babc..ccfbfd9 100755
--- a/132html
+++ b/132html
@@ -231,23 +231,23 @@ while (<STDIN>)
$_ = "$one $two";
redo; # Process the joined lines
}
-
+
# .EX/.EE are used in the pcredemo page to bracket the entire program,
# which is unmodified except for turning backslash into "\e".
-
+
elsif (/^\.EX\s*$/)
{
print TEMP "<PRE>\n";
while (<STDIN>)
{
- last if /^\.EE\s*$/;
+ last if /^\.EE\s*$/;
s/\\e/\\/g;
- s/&/&amp;/g;
+ s/&/&amp;/g;
s/</&lt;/g;
s/>/&gt;/g;
- print TEMP;
- }
- }
+ print TEMP;
+ }
+ }
# Ignore anything not recognized
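
Reviewer note on the 132html hunk above: the .EX/.EE branch changes only whitespace, but the order of its substitutions is easy to misread. The sketch below restates the per-line transformation in C as a single pass (the script itself is Perl and applies four separate substitutions). The names escape_ex_line and append_piece are hypothetical and exist only for this illustration; they are not part of PCRE.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: append s to out at *pos, leaving room for the
   terminating NUL. Returns 0 on success, -1 if out is too small. */
static int append_piece(char *out, size_t out_size, size_t *pos, const char *s)
{
    size_t n = strlen(s);
    if (*pos + n >= out_size) return -1;
    memcpy(out + *pos, s, n);
    *pos += n;
    return 0;
}

/* Hypothetical sketch of what the .EX/.EE loop does to one input line:
   "\e" becomes a single backslash, and &, <, > become HTML entities.
   Returns 0 on success, -1 if the output buffer is too small. */
int escape_ex_line(const char *in, char *out, size_t out_size)
{
    size_t pos = 0;
    if (out_size == 0) return -1;
    for (size_t i = 0; in[i] != '\0'; i++) {
        char single[2] = { in[i], '\0' };
        const char *piece = single;
        if (in[i] == '\\' && in[i + 1] == 'e') {
            piece = "\\";          /* man-page "\e" is a literal backslash */
            i++;                   /* skip the 'e' */
        } else if (in[i] == '&') {
            piece = "&amp;";
        } else if (in[i] == '<') {
            piece = "&lt;";
        } else if (in[i] == '>') {
            piece = "&gt;";
        }
        if (append_piece(out, out_size, &pos, piece) != 0) return -1;
    }
    out[pos] = '\0';
    return 0;
}
```

In the Perl original the ampersand substitution has to run before the ones for < and >, otherwise the entities it has just produced would be escaped a second time. The single pass here sidesteps that ordering concern because each input character is examined exactly once.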
diff --git a/ChangeLog b/ChangeLog
index a1d2af4..2885ebb 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,170 +1,170 @@
ChangeLog for PCRE
------------------
-Version 8.00 ??-???-??
+Version 8.00 05-Oct-09
----------------------
1. The table for translating pcre_compile() error codes into POSIX error codes
- was out-of-date, and there was no check on the pcre_compile() error code
- being within the table. This could lead to an OK return being given in
+ was out-of-date, and there was no check on the pcre_compile() error code
+ being within the table. This could lead to an OK return being given in
error.
-
-2. Changed the call to open a subject file in pcregrep from fopen(pathname,
- "r") to fopen(pathname, "rb"), which fixed a problem with some of the tests
- in a Windows environment.
-
+
+2. Changed the call to open a subject file in pcregrep from fopen(pathname,
+ "r") to fopen(pathname, "rb"), which fixed a problem with some of the tests
+ in a Windows environment.
+
3. The pcregrep --count option prints the count for each file even when it is
zero, as does GNU grep. However, pcregrep was also printing all files when
--files-with-matches was added. Now, when both options are given, it prints
counts only for those files that have at least one match. (GNU grep just
- prints the file name in this circumstance, but including the count seems
- more useful - otherwise, why use --count?) Also ensured that the
+ prints the file name in this circumstance, but including the count seems
+ more useful - otherwise, why use --count?) Also ensured that the
combination -clh just lists non-zero counts, with no names.
-
-4. The long form of the pcregrep -F option was incorrectly implemented as
- --fixed_strings instead of --fixed-strings. This is an incompatible change,
- but it seems right to fix it, and I didn't think it was worth preserving
- the old behaviour.
-
-5. The command line items --regex=pattern and --regexp=pattern were not
+
+4. The long form of the pcregrep -F option was incorrectly implemented as
+ --fixed_strings instead of --fixed-strings. This is an incompatible change,
+ but it seems right to fix it, and I didn't think it was worth preserving
+ the old behaviour.
+
+5. The command line items --regex=pattern and --regexp=pattern were not
recognized by pcregrep, which required --regex pattern or --regexp pattern
- (with a space rather than an '='). The man page documented the '=' forms,
+ (with a space rather than an '='). The man page documented the '=' forms,
which are compatible with GNU grep; these now work.
-
-6. No libpcreposix.pc file was created for pkg-config; there was just
+
+6. No libpcreposix.pc file was created for pkg-config; there was just
libpcre.pc and libpcrecpp.pc. The omission has been rectified.
-
+
7. Added #ifndef SUPPORT_UCP into the pcre_ucd.c module, to reduce its size
- when UCP support is not needed, by modifying the Python script that
+ when UCP support is not needed, by modifying the Python script that
generates it from Unicode data files. This should not matter if the module
is correctly used as a library, but I received one complaint about 50K of
unwanted data. My guess is that the person linked everything into his
program rather than using a library. Anyway, it does no harm.
-
+
8. A pattern such as /\x{123}{2,2}+/8 was incorrectly compiled; the trigger
- was a minimum greater than 1 for a wide character in a possessive
+ was a minimum greater than 1 for a wide character in a possessive
repetition. The same bug could also affect patterns like /(\x{ff}{0,2})*/8
which had an unlimited repeat of a nested, fixed maximum repeat of a wide
character. Chaos in the form of incorrect output or a compiling loop could
result.
-
+
9. The restrictions on what a pattern can contain when partial matching is
- requested for pcre_exec() have been removed. All patterns can now be
+ requested for pcre_exec() have been removed. All patterns can now be
partially matched by this function. In addition, if there are at least two
slots in the offset vector, the offset of the earliest inspected character
for the match and the offset of the end of the subject are set in them when
- PCRE_ERROR_PARTIAL is returned.
-
+ PCRE_ERROR_PARTIAL is returned.
+
10. Partial matching has been split into two forms: PCRE_PARTIAL_SOFT, which is
synonymous with PCRE_PARTIAL, for backwards compatibility, and
PCRE_PARTIAL_HARD, which causes a partial match to supersede a full match,
and may be more useful for multi-segment matching, especially with
pcre_exec().
-
-11. Partial matching with pcre_exec() is now more intuitive. A partial match
- used to be given if ever the end of the subject was reached; now it is
- given only if matching could not proceed because another character was
- needed. This makes a difference in some odd cases such as Z(*FAIL) with the
- string "Z", which now yields "no match" instead of "partial match". In the
- case of pcre_dfa_exec(), "no match" is given if every matching path for the
- final character ended with (*FAIL).
-
+
+11. Partial matching with pcre_exec() is now more intuitive. A partial match
+ used to be given if ever the end of the subject was reached; now it is
+ given only if matching could not proceed because another character was
+ needed. This makes a difference in some odd cases such as Z(*FAIL) with the
+ string "Z", which now yields "no match" instead of "partial match". In the
+ case of pcre_dfa_exec(), "no match" is given if every matching path for the
+ final character ended with (*FAIL).
+
12. Restarting a match using pcre_dfa_exec() after a partial match did not work
- if the pattern had a "must contain" character that was already found in the
+ if the pattern had a "must contain" character that was already found in the
earlier partial match, unless partial matching was again requested. For
example, with the pattern /dog.(body)?/, the "must contain" character is
"g". If the first part-match was for the string "dog", restarting with
"sbody" failed. This bug has been fixed.
-
-13. The string returned by pcre_dfa_exec() after a partial match has been
- changed so that it starts at the first inspected character rather than the
- first character of the match. This makes a difference only if the pattern
- starts with a lookbehind assertion or \b or \B (\K is not supported by
- pcre_dfa_exec()). It's an incompatible change, but it makes the two
+
+13. The string returned by pcre_dfa_exec() after a partial match has been
+ changed so that it starts at the first inspected character rather than the
+ first character of the match. This makes a difference only if the pattern
+ starts with a lookbehind assertion or \b or \B (\K is not supported by
+ pcre_dfa_exec()). It's an incompatible change, but it makes the two
matching functions compatible, and I think it's the right thing to do.
-
+
14. Added a pcredemo man page, created automatically from the pcredemo.c file,
- so that the demonstration program is easily available in environments where
- PCRE has not been installed from source.
-
+ so that the demonstration program is easily available in environments where
+ PCRE has not been installed from source.
+
15. Arranged to add -DPCRE_STATIC to cflags in libpcre.pc, libpcreposix.cp,
libpcrecpp.pc and pcre-config when PCRE is not compiled as a shared
library.
-
+
16. Added REG_UNGREEDY to the pcreposix interface, at the request of a user.
It maps to PCRE_UNGREEDY. It is not, of course, POSIX-compatible, but it
- is not the first non-POSIX option to be added. Clearly some people find
+ is not the first non-POSIX option to be added. Clearly some people find
these options useful.
-
-17. If a caller to the POSIX matching function regexec() passes a non-zero
+
+17. If a caller to the POSIX matching function regexec() passes a non-zero
value for nmatch with a NULL value for pmatch, the value of
- nmatch is forced to zero.
-
+ nmatch is forced to zero.
+
18. RunGrepTest did not have a test for the availability of the -u option of
- the diff command, as RunTest does. It now checks in the same way as
+ the diff command, as RunTest does. It now checks in the same way as
RunTest, and also checks for the -b option.
-
+
19. If an odd number of negated classes containing just a single character
interposed, within parentheses, between a forward reference to a named
- subpattern and the definition of the subpattern, compilation crashed with
- an internal error, complaining that it could not find the referenced
+ subpattern and the definition of the subpattern, compilation crashed with
+ an internal error, complaining that it could not find the referenced
subpattern. An example of a crashing pattern is /(?&A)(([^m])(?<A>))/.
- [The bug was that it was starting one character too far in when skipping
- over the character class, thus treating the ] as data rather than
- terminating the class. This meant it could skip too much.]
-
+ [The bug was that it was starting one character too far in when skipping
+ over the character class, thus treating the ] as data rather than
+ terminating the class. This meant it could skip too much.]
+
20. Added PCRE_NOTEMPTY_ATSTART in order to be able to correctly implement the
- /g option in pcretest when the pattern contains \K, which makes it possible
+ /g option in pcretest when the pattern contains \K, which makes it possible
to have an empty string match not at the start, even when the pattern is
- anchored. Updated pcretest and pcredemo to use this option.
-
+ anchored. Updated pcretest and pcredemo to use this option.
+
21. If the maximum number of capturing subpatterns in a recursion was greater
- than the maximum at the outer level, the higher number was returned, but
- with unset values at the outer level. The correct (outer level) value is
+ than the maximum at the outer level, the higher number was returned, but
+ with unset values at the outer level. The correct (outer level) value is
now given.
-
+
22. If (*ACCEPT) appeared inside capturing parentheses, previous releases of
PCRE did not set those parentheses (unlike Perl). I have now found a way to
make it do so. The string so far is captured, making this feature
compatible with Perl.
-
-23. The tests have been re-organized, adding tests 11 and 12, to make it
+
+23. The tests have been re-organized, adding tests 11 and 12, to make it
possible to check the Perl 5.10 features against Perl 5.10.
-
+
24. Perl 5.10 allows subroutine calls in lookbehinds, as long as the subroutine
- pattern matches a fixed length string. PCRE did not allow this; now it
- does. Neither allows recursion.
-
-25. I finally figured out how to implement a request to provide the minimum
- length of subject string that was needed in order to match a given pattern.
- (It was back references and recursion that I had previously got hung up
- on.) This code has now been added to pcre_study(); it finds a lower bound
+ pattern matches a fixed length string. PCRE did not allow this; now it
+ does. Neither allows recursion.
+
+25. I finally figured out how to implement a request to provide the minimum
+ length of subject string that was needed in order to match a given pattern.
+ (It was back references and recursion that I had previously got hung up
+ on.) This code has now been added to pcre_study(); it finds a lower bound
to the length of subject needed. It is not necessarily the greatest lower
bound, but using it to avoid searching strings that are too short does give
some useful speed-ups. The value is available to calling programs via
pcre_fullinfo().
-
+
26. While implementing 25, I discovered to my embarrassment that pcretest had
not been passing the result of pcre_study() to pcre_dfa_exec(), so the
study optimizations had never been tested with that matching function.
Oops. What is worse, even when it was passed study data, there was a bug in
pcre_dfa_exec() that meant it never actually used it. Double oops. There
were also very few tests of studied patterns with pcre_dfa_exec().
-
+
27. If (?| is used to create subpatterns with duplicate numbers, they are now
allowed to have the same name, even if PCRE_DUPNAMES is not set. However,
on the other side of the coin, they are no longer allowed to have different
names, because these cannot be distinguished in PCRE, and this has caused
confusion. (This is a difference from Perl.)
-
-28. When duplicate subpattern names are present (necessarily with different
- numbers, as required by 27 above), and a test is made by name in a
- conditional pattern, either for a subpattern having been matched, or for
- recursion in such a pattern, all the associated numbered subpatterns are
+
+28. When duplicate subpattern names are present (necessarily with different
+ numbers, as required by 27 above), and a test is made by name in a
+ conditional pattern, either for a subpattern having been matched, or for
+ recursion in such a pattern, all the associated numbered subpatterns are
tested, and the overall condition is true if the condition is true for any
one of them. This is the way Perl works, and is also more like the way
testing by number works.
-
+
Version 7.9 11-Apr-09
---------------------
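
Reviewer note on ChangeLog entry 17 above: the rule it describes (a NULL pmatch forces nmatch to zero) can be sketched as a thin wrapper. This is an illustration only, written against the system POSIX <regex.h> rather than pcreposix.h, and regexec_checked is a hypothetical name; it is not the code that this commit touches in pcreposix.c.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical wrapper illustrating entry 17: if the caller supplies no
   match vector, treat the request as "no offsets wanted" by forcing
   nmatch to zero before calling the real matching function. */
int regexec_checked(const regex_t *preg, const char *subject,
                    size_t nmatch, regmatch_t pmatch[], int eflags)
{
    if (pmatch == NULL) nmatch = 0;   /* nothing can be stored, so ask for nothing */
    return regexec(preg, subject, nmatch, pmatch, eflags);
}
```

With this guard, a call such as regexec_checked(&re, "dogsbody", 5, NULL, 0) only reports whether the subject matched, instead of asking the matcher to fill five slots of a vector that does not exist.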
diff --git a/LICENCE b/LICENCE
index ff443a9..73f8cde 100644
--- a/LICENCE
+++ b/LICENCE
@@ -4,7 +4,7 @@ PCRE LICENCE
PCRE is a library of functions to support regular expressions whose syntax
and semantics are as close as possible to those of the Perl 5 language.
-Release 7 of PCRE is distributed under the terms of the "BSD" licence, as
+Release 8 of PCRE is distributed under the terms of the "BSD" licence, as
specified below. The documentation for PCRE, supplied in the "doc"
directory, is distributed under the same terms as the software itself.
diff --git a/NEWS b/NEWS
index 2b26fcc..08c1e03 100644
--- a/NEWS
+++ b/NEWS
@@ -1,6 +1,21 @@
News about PCRE releases
------------------------
+Release 8.00 05-Oct-09
+----------------------
+
+Bugs have been fixed in the library and in pcregrep. There are also some
+enhancements. Restrictions on patterns used for partial matching have been
+removed, extra information is given for partial matches, the partial matching
+process has been improved, and an option to make a partial match override a
+full match is available. The "study" process has been enhanced by finding a
+lower bound matching length. Groups with duplicate numbers may now have
+duplicated names without the use of PCRE_DUPNAMES. However, they may not have
+different names. The documentation has been revised to reflect these changes.
+The version number has been expanded to 3 digits as it is clear that the rate
+of change is not slowing down.
+
+
Release 7.9 11-Apr-09
---------------------
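
Reviewer note on the NEWS entry above: the "lower bound matching length" found by the study process is useful because a subject shorter than the bound cannot match, so the matcher need not run at all. The sketch below shows only that short-circuit idea, using the system POSIX <regex.h> and a bound supplied by the caller. The type studied_pattern and the function studied_match are hypothetical; in PCRE itself the bound is computed by pcre_study() and read back via pcre_fullinfo(), as ChangeLog entry 25 describes.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pairing of a compiled pattern with a lower bound on the
   length of any subject that could match it. The bound is supplied by the
   caller here; this sketch does not compute it. */
typedef struct {
    regex_t re;
    size_t min_length;
} studied_pattern;

/* Returns 0 on a match and REG_NOMATCH otherwise. Subjects shorter than
   the lower bound are rejected without running the matcher. */
int studied_match(const studied_pattern *p, const char *subject)
{
    if (strlen(subject) < p->min_length) return REG_NOMATCH;
    return regexec(&p->re, subject, 0, NULL, 0);
}
```

The bound does not have to be the greatest lower bound to be safe. Any value that is not larger than the true minimum only makes the shortcut fire less often; it never causes a real match to be missed.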
diff --git a/NON-UNIX-USE b/NON-UNIX-USE
index 9374130..aca81bd 100644
--- a/NON-UNIX-USE
+++ b/NON-UNIX-USE
@@ -12,10 +12,10 @@ This document contains the following sections:
Comments about Win32 builds
Building PCRE on Windows with CMake
Use of relative paths with CMake on Windows
- Testing with runtest.bat
+ Testing with RunTest.bat
Building under Windows with BCC5.5
Building PCRE on OpenVMS
- Building PCRE on Stratus OpenVOS
+ Building PCRE on Stratus OpenVOS
GENERAL
@@ -37,10 +37,10 @@ wrapper functions are a separate issue (see below).
The PCRE distribution includes a "configure" file for use by the Configure/Make
build system, as found in many Unix-like environments. There is also support
-support for CMake, which some users prefer, in particular in Windows
-environments. There are some instructions for CMake under Windows in the
-section entitled "Building PCRE with CMake" below. CMake can also be used to
-build PCRE in Unix-like systems.
+support for CMake, which some users prefer, especially in Windows environments.
+There are some instructions for CMake under Windows in the section entitled
+"Building PCRE with CMake" below. CMake can also be used to build PCRE in
+Unix-like systems.
GENERIC INSTRUCTIONS FOR THE PCRE C LIBRARY
@@ -304,10 +304,10 @@ were contributed by a PCRE user.
7. Select the particular IDE / build tool that you are using (Visual
Studio, MSYS makefiles, MinGW makefiles, etc.)
-8. The GUI will then list several configuration options. This is where
+8. The GUI will then list several configuration options. This is where
you can enable UTF-8 support or other PCRE optional features.
-9. Hit "Configure" again. The adjacent "Generate" button should now be
+9. Hit "Configure" again. The adjacent "Generate" button should now be
active.
10. Hit "Generate".
@@ -460,7 +460,7 @@ I built pcre on OpenVOS Release 17.0.1at using GNU Tools 3.4a without any
problems. I used the following packages to build PCRE:
ftp://ftp.stratus.com/pub/vos/posix/ga/posix.save.evf.gz
-
+
Please read and follow the instructions that come with these packages. To start
the build of pcre, from the root of the package type:
@@ -494,5 +494,5 @@ build.log file in the root of the package also.
=========================
-Last Updated: 09 September 2009
+Last Updated: 05 October 2009
****
diff --git a/README b/README
index 46acf6b..c219676 100644
--- a/README
+++ b/README
@@ -24,7 +24,7 @@ The contents of this README file are:
Shared libraries on Unix-like systems
Cross-compiling on Unix-like systems
Using HP's ANSI C++ compiler (aCC)
- Using PCRE from MySQL
+ Using PCRE from MySQL
Making new tarballs
Testing PCRE
Character tables
@@ -477,16 +477,16 @@ use the workaround of specifying the following environment variable prior to
running the "configure" script:
CXXLDFLAGS="-lstd_v2 -lCsup_v2"
-
+
Using PCRE from MySQL
---------------------
-On systems where both PCRE and MySQL are installed, it is possible to make use
-of PCRE from within MySQL, as an alternative to the built-in pattern matching.
+On systems where both PCRE and MySQL are installed, it is possible to make use
+of PCRE from within MySQL, as an alternative to the built-in pattern matching.
There is a web page that tells you how to do this:
- http://www.mysqludf.org/lib_mysqludf_preg/index.php
+ http://www.mysqludf.org/lib_mysqludf_preg/index.php
Making new tarballs
@@ -564,22 +564,32 @@ document entitled NON-UNIX-USE.]
The fourth test checks the UTF-8 support. It is not run automatically unless
PCRE is built with UTF-8 support. To do this you must set --enable-utf8 when
-running "configure". This file can be also fed directly to the perltest script,
-provided you are running Perl 5.8 or higher. (For Perl 5.6, a small patch,
-commented in the script, can be be used.)
+running "configure". This file can be also fed directly to the perltest.pl
+script, provided you are running Perl 5.8 or higher.
The fifth test checks error handling with UTF-8 encoding, and internal UTF-8
features of PCRE that are not relevant to Perl.
-The sixth test checks the support for Unicode character properties. It it not
-run automatically unless PCRE is built with Unicode property support. To to
-this you must set --enable-unicode-properties when running "configure".
+The sixth test (which is Perl-5.10 compatible) checks the support for Unicode
+character properties. It it not run automatically unless PCRE is built with
+Unicode property support. To to this you must set --enable-unicode-properties
+when running "configure".
The seventh, eighth, and ninth tests check the pcre_dfa_exec() alternative
matching function, in non-UTF-8 mode, UTF-8 mode, and UTF-8 mode with Unicode
property support, respectively. The eighth and ninth tests are not run
automatically unless PCRE is build with the relevant support.
+The tenth test checks some internal offsets and code size features; it is run
+only when the default "link size" of 2 is set (in other cases the sizes
+change).
+
+The eleventh test checks out features that are new in Perl 5.10, and the
+twelfth test checks a number internals and non-Perl features concerned with
+Unicode property support. It it not run automatically unless PCRE is built with
+Unicode property support. To to this you must set --enable-unicode-properties
+when running "configure".
+
Character tables
----------------
@@ -732,7 +742,7 @@ The distribution should contain the following files:
doc/perltest.txt plain text documentation of Perl test program
install-sh a shell script for installing files
libpcre.pc.in template for libpcre.pc for pkg-config
- libpcreposix.pc.in template for libpcreposix.pc for pkg-config
+ libpcreposix.pc.in template for libpcreposix.pc for pkg-config
libpcrecpp.pc.in template for libpcrecpp.pc for pkg-config
ltmain.sh file used to build a libtool script
missing ) common stub for a few missing GNU programs while
@@ -776,4 +786,4 @@ The distribution should contain the following files:
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 16 September 2009
+Last updated: 05 October 2009
diff --git a/RunGrepTest b/RunGrepTest
index 6cbd2cc..13119d0 100755
--- a/RunGrepTest
+++ b/RunGrepTest
@@ -29,9 +29,9 @@ cf="diff -ub"
# that lacks a -u option. Try to deal with this; better do the test for the -b
# option as well.
-if diff -u /dev/null /dev/null; then
+if diff -u /dev/null /dev/null; then
if diff -ub /dev/null /dev/null; then cf="diff -ub"; else cf="diff -u"; fi
-else
+else
if diff -b /dev/null /dev/null; then cf="diff -b"; else cf="diff"; fi
fi
diff --git a/RunTest b/RunTest
index 6bdc73e..787f9cf 100755
--- a/RunTest
+++ b/RunTest
@@ -60,7 +60,7 @@ while [ $# -gt 0 ] ; do
9) do9=yes;;
10) do10=yes;;
11) do11=yes;;
- 12) do12=yes;;
+ 12) do12=yes;;
valgrind) valgrind="valgrind -q";;
*) echo "Unknown test number $1"; exit 1;;
esac
@@ -124,7 +124,7 @@ if [ $do1 = no -a $do2 = no -a $do3 = no -a $do4 = no -a \
if [ $utf8 -ne 0 -a $ucp -ne 0 ] ; then do9=yes; fi
if [ $link_size -eq 2 -a $ucp -ne 0 ] ; then do10=yes; fi
do11=yes
- if [ $utf8 -ne 0 -a $ucp -ne 0 ] ; then do12=yes; fi
+ if [ $utf8 -ne 0 -a $ucp -ne 0 ] ; then do12=yes; fi
fi
# Show which release
diff --git a/configure.ac b/configure.ac
index 7f567f8..aed0d4c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -9,7 +9,7 @@ dnl empty.
m4_define(pcre_major, [8])
m4_define(pcre_minor, [00])
m4_define(pcre_prerelease, [-RC1])
-m4_define(pcre_date, [2009-09-05])
+m4_define(pcre_date, [2009-10-05])
# Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [0:1:0])
diff --git a/doc/html/index.html b/doc/html/index.html
index 58dfe45..d9af7e1 100644
--- a/doc/html/index.html
+++ b/doc/html/index.html
@@ -1,10 +1,10 @@
<html>
-<!-- This is a manually maintained file that is the root of the HTML version of
- the PCRE documentation. When the HTML documents are built from the man
- page versions, the entire doc/html directory is emptied, this file is then
- copied into doc/html/index.html, and the remaining files therein are
+<!-- This is a manually maintained file that is the root of the HTML version of
+ the PCRE documentation. When the HTML documents are built from the man
+ page versions, the entire doc/html directory is emptied, this file is then
+ copied into doc/html/index.html, and the remaining files therein are
created by the 132html script.
--->
+-->
<head>
<title>PCRE specification</title>
</head>
@@ -74,11 +74,11 @@ The HTML documentation for PCRE comprises the following pages:
</table>
<p>
-There are also individual pages that summarize the interface for each function
+There are also individual pages that summarize the interface for each function
in the library:
</p>
-<table>
+<table>
<tr><td><a href="pcre_compile.html">pcre_compile</a></td>
<td>&nbsp;&nbsp;Compile a regular expression</td></tr>
@@ -129,7 +129,7 @@ in the library:
<tr><td><a href="pcre_maketables.html">pcre_maketables</a></td>
<td>&nbsp;&nbsp;Build character tables in current locale</td></tr>
-
+
<tr><td><a href="pcre_refcount.html">pcre_refcount</a></td>
<td>&nbsp;&nbsp;Maintain reference count in compiled pattern</td></tr>
diff --git a/doc/html/pcre.html b/doc/html/pcre.html
index bfb4e97..8ea03a1 100644
--- a/doc/html/pcre.html
+++ b/doc/html/pcre.html
@@ -24,23 +24,22 @@ man page, in case the conversion went wrong.
<P>
The PCRE library is a set of functions that implement regular expression
pattern matching using the same syntax and semantics as Perl, with just a few
-differences. Certain features that appeared in Python and PCRE before they
-appeared in Perl are also available using the Python syntax. There is also some
-support for certain .NET and Oniguruma syntax items, and there is an option for
-requesting some minor changes that give better JavaScript compatibility.
+differences. Some features that appeared in Python and PCRE before they
+appeared in Perl are also available using the Python syntax, there is some
+support for one or two .NET and Oniguruma syntax items, and there is an option
+for requesting some minor changes that give better JavaScript compatibility.
</P>
<P>
-The current implementation of PCRE (release 8.xx) corresponds approximately
-with Perl 5.10, including support for UTF-8 encoded strings and Unicode general
-category properties. However, UTF-8 and Unicode support has to be explicitly
-enabled; it is not the default. The Unicode tables correspond to Unicode
-release 5.1.
+The current implementation of PCRE corresponds approximately with Perl 5.10,
+including support for UTF-8 encoded strings and Unicode general category
+properties. However, UTF-8 and Unicode support has to be explicitly enabled; it
+is not the default. The Unicode tables correspond to Unicode release 5.1.
</P>
<P>
In addition to the Perl-compatible matching function, PCRE contains an
-alternative matching function that matches the same compiled patterns in a
-different way. In certain circumstances, the alternative function has some
-advantages. For a discussion of the two matching algorithms, see the
+alternative function that matches the same compiled patterns in a different
+way. In certain circumstances, the alternative function has some advantages.
+For a discussion of the two matching algorithms, see the
<a href="pcrematching.html"><b>pcrematching</b></a>
page.
</P>
@@ -72,7 +71,8 @@ function makes it possible for a client to discover which features are
available. The features themselves are described in the
<a href="pcrebuild.html"><b>pcrebuild</b></a>
page. Documentation about building PCRE for various operating systems can be
-found in the <b>README</b> file in the source distribution.
+found in the <b>README</b> and <b>NON-UNIX-USE</b> files in the source
+distribution.
</P>
<P>
The library contains a number of undocumented internal functions and data
@@ -103,12 +103,12 @@ of searching. The sections are as follows:
pcrematching discussion of the two matching algorithms
pcrepartial details of the partial matching facility
pcrepattern syntax and semantics of supported regular expressions
- pcresyntax quick syntax reference
pcreperform discussion of performance issues
pcreposix the POSIX-compatible C API
pcreprecompile details of saving and re-using precompiled patterns
pcresample discussion of the pcredemo program
pcrestack discussion of stack usage
+ pcresyntax quick syntax reference
pcretest description of the <b>pcretest</b> testing command
</pre>
In addition, in the "man" and HTML formats, there is a short page for each
@@ -164,7 +164,7 @@ the code, and, in addition, you must call
with the PCRE_UTF8 option flag, or the pattern must start with the sequence
(*UTF8). When either of these is the case, both the pattern and any subject
strings that are matched against it are treated as UTF-8 strings instead of
-just strings of bytes.
+strings of 1-byte characters.
</P>
<P>
If you compile PCRE with UTF-8 support, but do not use it at run time, the
@@ -298,7 +298,7 @@ two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 01 September 2009
+Last updated: 28 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcre_compile.html b/doc/html/pcre_compile.html
index 396a5fb..773594f 100644
--- a/doc/html/pcre_compile.html
+++ b/doc/html/pcre_compile.html
@@ -63,11 +63,11 @@ The option bits are:
PCRE_NEWLINE_LF Set LF as the newline sequence
PCRE_NO_AUTO_CAPTURE Disable numbered capturing paren-
theses (named ones available)
- PCRE_UNGREEDY Invert greediness of quantifiers
- PCRE_UTF8 Run in UTF-8 mode
PCRE_NO_UTF8_CHECK Do not check the pattern for UTF-8
validity (only relevant if
PCRE_UTF8 is set)
+ PCRE_UNGREEDY Invert greediness of quantifiers
+ PCRE_UTF8 Run in UTF-8 mode
</pre>
PCRE must be built with UTF-8 support in order to use PCRE_UTF8 and
PCRE_NO_UTF8_CHECK.
diff --git a/doc/html/pcre_compile2.html b/doc/html/pcre_compile2.html
index 8d743c1..4e64921 100644
--- a/doc/html/pcre_compile2.html
+++ b/doc/html/pcre_compile2.html
@@ -45,29 +45,33 @@ argument. The arguments are:
</pre>
The option bits are:
<pre>
- PCRE_ANCHORED Force pattern anchoring
- PCRE_AUTO_CALLOUT Compile automatic callouts
- PCRE_CASELESS Do caseless matching
- PCRE_DOLLAR_ENDONLY $ not to match newline at end
- PCRE_DOTALL . matches anything including NL
- PCRE_DUPNAMES Allow duplicate names for subpatterns
- PCRE_EXTENDED Ignore whitespace and # comments
- PCRE_EXTRA PCRE extra features
- (not much use currently)
- PCRE_FIRSTLINE Force matching to be before newline
- PCRE_MULTILINE ^ and $ match newlines within data
- PCRE_NEWLINE_ANY Recognize any Unicode newline sequence
- PCRE_NEWLINE_ANYCRLF Recognize CR, LF, and CRLF as newline sequences
- PCRE_NEWLINE_CR Set CR as the newline sequence
- PCRE_NEWLINE_CRLF Set CRLF as the newline sequence
- PCRE_NEWLINE_LF Set LF as the newline sequence
- PCRE_NO_AUTO_CAPTURE Disable numbered capturing paren-
- theses (named ones available)
- PCRE_UNGREEDY Invert greediness of quantifiers
- PCRE_UTF8 Run in UTF-8 mode
- PCRE_NO_UTF8_CHECK Do not check the pattern for UTF-8
- validity (only relevant if
- PCRE_UTF8 is set)
+ PCRE_ANCHORED Force pattern anchoring
+ PCRE_AUTO_CALLOUT Compile automatic callouts
+ PCRE_BSR_ANYCRLF \R matches only CR, LF, or CRLF
+ PCRE_BSR_UNICODE \R matches all Unicode line endings
+ PCRE_CASELESS Do caseless matching
+ PCRE_DOLLAR_ENDONLY $ not to match newline at end
+ PCRE_DOTALL . matches anything including NL
+ PCRE_DUPNAMES Allow duplicate names for subpatterns
+ PCRE_EXTENDED Ignore whitespace and # comments
+ PCRE_EXTRA PCRE extra features
+ (not much use currently)
+ PCRE_FIRSTLINE Force matching to be before newline
+ PCRE_JAVASCRIPT_COMPAT JavaScript compatibility
+ PCRE_MULTILINE ^ and $ match newlines within data
+ PCRE_NEWLINE_ANY Recognize any Unicode newline sequence
+ PCRE_NEWLINE_ANYCRLF Recognize CR, LF, and CRLF as newline
+ sequences
+ PCRE_NEWLINE_CR Set CR as the newline sequence
+ PCRE_NEWLINE_CRLF Set CRLF as the newline sequence
+ PCRE_NEWLINE_LF Set LF as the newline sequence
+ PCRE_NO_AUTO_CAPTURE Disable numbered capturing paren-
+ theses (named ones available)
+ PCRE_NO_UTF8_CHECK Do not check the pattern for UTF-8
+ validity (only relevant if
+ PCRE_UTF8 is set)
+ PCRE_UNGREEDY Invert greediness of quantifiers
+ PCRE_UTF8 Run in UTF-8 mode
</pre>
PCRE must be built with UTF-8 support in order to use PCRE_UTF8 and
PCRE_NO_UTF8_CHECK.
diff --git a/doc/html/pcre_dfa_exec.html b/doc/html/pcre_dfa_exec.html
index 28bcf0d..663c247 100644
--- a/doc/html/pcre_dfa_exec.html
+++ b/doc/html/pcre_dfa_exec.html
@@ -67,8 +67,8 @@ The options are:
was set at compile time)
PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial
PCRE_PARTIAL_SOFT ) match if no full matches are found
- PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
- even if there is a full match as well
+ PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
+ even if there is a full match as well
PCRE_DFA_SHORTEST Return only the shortest match
PCRE_DFA_RESTART Restart after a partial match
</pre>
diff --git a/doc/html/pcre_exec.html b/doc/html/pcre_exec.html
index fc0938a..cee8bf4 100644
--- a/doc/html/pcre_exec.html
+++ b/doc/html/pcre_exec.html
@@ -63,8 +63,8 @@ The options are:
was set at compile time)
PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial
PCRE_PARTIAL_SOFT ) match if no full matches are found
- PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
- even if there is a full match as well
+ PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
+ even if there is a full match as well
</pre>
For details of partial matching, see the
<a href="pcrepartial.html"><b>pcrepartial</b></a>
diff --git a/doc/html/pcre_fullinfo.html b/doc/html/pcre_fullinfo.html
index 3ec75f9..36487fa 100644
--- a/doc/html/pcre_fullinfo.html
+++ b/doc/html/pcre_fullinfo.html
@@ -45,6 +45,7 @@ The following information is available:
PCRE_INFO_FIRSTTABLE Table of first bytes (after studying)
PCRE_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
PCRE_INFO_LASTLITERAL Literal last byte required
+ PCRE_INFO_MINLENGTH Lower bound length of matching strings
PCRE_INFO_NAMECOUNT Number of named subpatterns
PCRE_INFO_NAMEENTRYSIZE Size of name table entry
PCRE_INFO_NAMETABLE Pointer to name table
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html
index cac98d9..126a5d2 100644
--- a/doc/html/pcreapi.html
+++ b/doc/html/pcreapi.html
@@ -400,7 +400,9 @@ avoiding the use of the stack.
Either of the functions <b>pcre_compile()</b> or <b>pcre_compile2()</b> can be
called to compile a pattern into an internal form. The only difference between
the two interfaces is that <b>pcre_compile2()</b> has an additional argument,
-<i>errorcodeptr</i>, via which a numerical error code can be returned.
+<i>errorcodeptr</i>, via which a numerical error code can be returned. To avoid
+too much repetition, we refer just to <b>pcre_compile()</b> below, but the
+information applies equally to <b>pcre_compile2()</b>.
</P>
<P>
The pattern is a C string terminated by a binary zero, and is passed in the
@@ -420,14 +422,14 @@ argument, which is an address (see below).
The <i>options</i> argument contains various bit settings that affect the
compilation. It should be zero if no options are required. The available
options are described below. Some of them (in particular, those that are
-compatible with Perl, but also some others) can also be set and unset from
+compatible with Perl, but some others as well) can also be set and unset from
within the pattern (see the detailed description in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation). For those options that can be different in different parts of
-the pattern, the contents of the <i>options</i> argument specifies their initial
-settings at the start of compilation and execution. The PCRE_ANCHORED and
-PCRE_NEWLINE_<i>xxx</i> options can be set at the time of matching as well as at
-compile time.
+the pattern, the contents of the <i>options</i> argument specifies their
+settings at the start of compilation and execution. The PCRE_ANCHORED,
+PCRE_BSR_<i>xxx</i>, and PCRE_NEWLINE_<i>xxx</i> options can be set at the time
+of matching as well as at compile time.
</P>
<P>
If <i>errptr</i> is NULL, <b>pcre_compile()</b> returns NULL immediately.
@@ -435,7 +437,7 @@ Otherwise, if compilation of a pattern fails, <b>pcre_compile()</b> returns
NULL, and sets the variable pointed to by <i>errptr</i> to point to a textual
error message. This is a static string that is part of the library. You must
not try to free it. The byte offset from the start of the pattern to the
-character that was being processes when the error was discovered is placed in
+character that was being processed when the error was discovered is placed in
the variable pointed to by <i>erroffset</i>, which must not be NULL. If it is,
an immediate error is given. Some errors are not detected until checks are
carried out when the whole pattern has been scanned; in this case the offset is
@@ -772,17 +774,17 @@ results of the study.
</P>
<P>
The returned value from <b>pcre_study()</b> can be passed directly to
-<b>pcre_exec()</b>. However, a <b>pcre_extra</b> block also contains other
-fields that can be set by the caller before the block is passed; these are
-described
+<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. However, a <b>pcre_extra</b> block
+also contains other fields that can be set by the caller before the block is
+passed; these are described
<a href="#extradata">below</a>
in the section on matching a pattern.
</P>
<P>
-If studying the pattern does not produce any additional information
+If studying the pattern does not produce any useful information,
<b>pcre_study()</b> returns NULL. In that circumstance, if the calling program
-wants to pass any of the other fields to <b>pcre_exec()</b>, it must set up its
-own <b>pcre_extra</b> block.
+wants to pass any of the other fields to <b>pcre_exec()</b> or
+<b>pcre_dfa_exec()</b>, it must set up its own <b>pcre_extra</b> block.
</P>
<P>
The second argument of <b>pcre_study()</b> contains option bits. At present, no
@@ -805,9 +807,19 @@ This is a typical call to <b>pcre_study</b>():
0, /* no options exist */
&error); /* set to NULL or points to a message */
</pre>
-At present, studying a pattern is useful only for non-anchored patterns that do
-not have a single fixed starting character. A bitmap of possible starting
-bytes is created.
+Studying a pattern does two things: first, a lower bound for the length of
+subject string that is needed to match the pattern is computed. This does not
+mean that there are any strings of that length that match, but it does
+guarantee that no shorter strings match. The value is used by
+<b>pcre_exec()</b> and <b>pcre_dfa_exec()</b> to avoid wasting time by trying to
+match strings that are shorter than the lower bound. You can find out the value
+in a calling program via the <b>pcre_fullinfo()</b> function.
+</P>
+<P>
+Studying a pattern is also useful for non-anchored patterns that do not have a
+single fixed starting character. A bitmap of possible starting bytes is
+created. This speeds up finding a position in the subject at which to start
+matching.
<a name="localesupport"></a></P>
<br><a name="SEC10" href="#TOC1">LOCALE SUPPORT</a><br>
<P>
@@ -978,6 +990,16 @@ follows something of variable length. For example, for the pattern
/^a\d+z\d+/ the returned value is "z", but for /^a\dz\d/ the returned value
is -1.
<pre>
+ PCRE_INFO_MINLENGTH
+</pre>
+If the pattern was studied and a minimum length for matching subject strings
+was computed, its value is returned. Otherwise the returned value is -1. The
+value is a number of characters, not bytes (this may be relevant in UTF-8
+mode). The fourth argument should point to an <b>int</b> variable. A
+non-negative value is a lower bound to the length of any matching string. There
+may not be any strings of that length that do actually match, but every string
+that does match is at least that long.
+<pre>
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE
@@ -999,10 +1021,24 @@ entry; both of these return an <b>int</b> value. The entry size depends on the
length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first
entry of the table (a pointer to <b>char</b>). The first two bytes of each entry
are the number of the capturing parenthesis, most significant byte first. The
-rest of the entry is the corresponding name, zero terminated. The names are in
-alphabetical order. When PCRE_DUPNAMES is set, duplicate names are in order of
-their parentheses numbers. For example, consider the following pattern (assume
-PCRE_EXTENDED is set, so white space - including newlines - is ignored):
+rest of the entry is the corresponding name, zero terminated.
+</P>
+<P>
+The names are in alphabetical order. Duplicate names may appear if (?| is used
+to create multiple groups with the same number, as described in the
+<a href="pcrepattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
+in the
+<a href="pcrepattern.html"><b>pcrepattern</b></a>
+page. Duplicate names for subpatterns with different numbers are permitted only
+if PCRE_DUPNAMES is set. In all cases of duplicate names, they appear in the
+table in the order in which they were found in the pattern. In the absence of
+(?| this is the order of increasing number; when (?| is used this is not
+necessarily the case because later subpatterns may have lower numbers.
+</P>
+<P>
+As a simple example of the name/number table, consider the following pattern
+(assume PCRE_EXTENDED is set, so white space - including newlines - is
+ignored):
<pre>
(?&#60;date&#62; (?&#60;year&#62;(\d\d)?\d\d) - (?&#60;month&#62;\d\d) - (?&#60;day&#62;\d\d) )
</pre>
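The name table layout described in this hunk (two bytes of group number, most significant byte first, followed by a zero-terminated name, with every entry the same size) can be walked with a few lines of C. The following is only a sketch under that description: the helper names are hypothetical and are not part of the PCRE API, and the table pointer and entry size are assumed to have been obtained already via PCRE_INFO_NAMETABLE and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <stddef.h>

/* Hypothetical helpers, not part of PCRE. They only decode the entry
   layout described in the text above. */

/* Address of entry i in a table whose entries are entry_size bytes each. */
const unsigned char *name_entry_at(const unsigned char *table,
                                   int entry_size, int i)
{
    return table + (size_t)entry_size * (size_t)i;
}

/* Capturing group number: first two bytes, most significant byte first. */
int name_entry_number(const unsigned char *entry)
{
    return (entry[0] << 8) | entry[1];
}

/* Zero-terminated group name that follows the two number bytes. */
const char *name_entry_name(const unsigned char *entry)
{
    return (const char *)(entry + 2);
}
```

A caller would loop from 0 to the PCRE_INFO_NAMECOUNT value minus one, calling name_entry_at() for each index. Because duplicate names may appear in the table, a lookup by name should not assume that the first hit is the only one.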
@@ -1062,7 +1098,8 @@ variable.
Return the size of the data block pointed to by the <i>study_data</i> field in
a <b>pcre_extra</b> block. That is, it is the value that was passed to
<b>pcre_malloc()</b> when PCRE was getting memory into which to place the data
-created by <b>pcre_study()</b>. The fourth argument should point to a
+created by <b>pcre_study()</b>. If <b>pcre_extra</b> is NULL, or there is no
+study data, zero is returned. The fourth argument should point to a
<b>size_t</b> variable.
</P>
<br><a name="SEC12" href="#TOC1">OBSOLETE INFO FUNCTION</a><br>
@@ -1122,7 +1159,7 @@ is different. (This seems a highly unlikely scenario.)
<P>
The function <b>pcre_exec()</b> is called to match a subject string against a
compiled pattern, which is passed in the <i>code</i> argument. If the
-pattern has been studied, the result of the study should be passed in the
+pattern was studied, the result of the study should be passed in the
<i>extra</i> argument. This function is the main matching facility of the
library, and it operates in a Perl-like manner. For specialist use there is
also an alternative matching function, which is described
@@ -1189,7 +1226,7 @@ the block by setting the other fields and their corresponding flag bits.
The <i>match_limit</i> field provides a means of preventing PCRE from using up a
vast amount of resources when running patterns that are not going to match,
but which have a very large number of possibilities in their search trees. The
-classic example is the use of nested unlimited repeats.
+classic example is a pattern that uses nested unlimited repeats.
</P>
<P>
Internally, PCRE uses a function called <b>match()</b> which it calls repeatedly
@@ -1339,7 +1376,7 @@ valid, so PCRE searches further into the string for occurrences of "a" or "b".
<pre>
PCRE_NOTEMPTY_ATSTART
</pre>
-This is like PCRE_NOTEMPTY, except that an empty string match that is not at
+This is like PCRE_NOTEMPTY, except that an empty string match that is not at
the start of the subject is permitted. If the pattern is anchored, such a match
can occur only if the pattern contains \K.
</P>
@@ -1390,7 +1427,7 @@ PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 string as a
subject, or a value of <i>startoffset</i> that does not point to the start of a
UTF-8 character, is undefined. Your program may crash.
<pre>
- PCRE_PARTIAL_HARD
+ PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
</pre>
These options turn on the partial matching feature. For backwards
@@ -1499,7 +1536,7 @@ has to get additional memory for use during matching. Thus it is usually
advisable to supply an <i>ovector</i>.
</P>
<P>
-The <b>pcre_info()</b> function can be used to find out how many capturing
+The <b>pcre_fullinfo()</b> function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
<i>ovector</i> that will allow for <i>n</i> captured substrings, in addition to
the offsets of the substring matched by the whole pattern, is (<i>n</i>+1)*3.
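As a small illustration of the sizing rule just stated, the helper below computes (<i>n</i>+1)*3. It is a sketch only: the function name is hypothetical, and the count is assumed to be the value that <b>pcre_fullinfo()</b> reports for the number of capturing subpatterns.

```c
/* Hypothetical helper, not part of PCRE.
   Smallest ovector length, counted in ints, that holds the offsets of the
   whole match plus capture_count captured substrings: (n+1)*3. */
int min_ovector_size(int capture_count)
{
    return (capture_count + 1) * 3;
}
```

For a pattern with four capturing subpatterns this gives 15, so an array of 15 ints is the smallest that can report every captured substring.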
@@ -1605,7 +1642,7 @@ documentation for details of partial matching.
</pre>
This code is no longer in use. It was formerly returned when the PCRE_PARTIAL
option was used with a compiled pattern containing items that were not
-supported for partial matching. From release 8.00 onwards, there are no
+supported for partial matching. From release 8.00 onwards, there are no
restrictions on partial matching.
<pre>
PCRE_ERROR_INTERNAL (-14)
@@ -1779,10 +1816,15 @@ appropriate. <b>NOTE:</b> If PCRE_DUPNAMES is set and there are duplicate names,
the behaviour may not be what you want (see the next section).
</P>
<P>
-<b>Warning:</b> If the pattern uses the "(?|" feature to set up multiple
-subpatterns with the same number, you cannot use names to distinguish them,
-because names are not included in the compiled code. The matching process uses
-only numbers.
+<b>Warning:</b> If the pattern uses the (?| feature to set up multiple
+subpatterns with the same number, as described in the
+<a href="pcrepattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
+in the
+<a href="pcrepattern.html"><b>pcrepattern</b></a>
+page, you cannot use names to distinguish the different subpatterns, because
+names are not included in the compiled code. The matching process uses only
+numbers. For this reason, the use of different names for subpatterns of the
+same number causes an error at compile time.
</P>
<br><a name="SEC17" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
<P>
@@ -1791,9 +1833,13 @@ only numbers.
</P>
<P>
When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns
-are not required to be unique. Normally, patterns with duplicate names are such
-that in any one match, only one of the named subpatterns participates. An
-example is shown in the
+are not required to be unique. (Duplicate names are always allowed for
+subpatterns with the same number, created by using the (?| feature. Indeed, if
+such subpatterns are named, they are required to use the same names.)
+</P>
+<P>
+Normally, patterns with duplicate names are such that in any one match, only
+one of the named subpatterns participates. An example is shown in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation.
</P>
@@ -1849,7 +1895,7 @@ a compiled pattern, using a matching algorithm that scans the subject string
just once, and does not backtrack. This has different characteristics to the
normal algorithm, and is not compatible with Perl. Some of the features of PCRE
patterns are not supported. Nevertheless, there are times when this kind of
-matching can be useful. For a discussion of the two matching algorithms, and a
+matching can be useful. For a discussion of the two matching algorithms, and a
list of features that <b>pcre_dfa_exec()</b> does not support, see the
<a href="pcrematching.html"><b>pcrematching</b></a>
documentation.
@@ -1898,7 +1944,7 @@ and PCRE_DFA_RESTART. All but the last four of these are exactly the same as
for <b>pcre_exec()</b>, so their description is not repeated here.
<pre>
PCRE_PARTIAL_HARD
- PCRE_PARTIAL_SOFT
+ PCRE_PARTIAL_SOFT
</pre>
These have the same general effect as they do for <b>pcre_exec()</b>, but the
details are slightly different. When PCRE_PARTIAL_HARD is set for
@@ -2021,7 +2067,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC22" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 22 September 2009
+Last updated: 03 October 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrebuild.html b/doc/html/pcrebuild.html
index adff0e9..22f83c6 100644
--- a/doc/html/pcrebuild.html
+++ b/doc/html/pcrebuild.html
@@ -40,12 +40,12 @@ the optional features are selected or deselected by providing options to
<b>configure</b> before running the <b>make</b> command. However, the same
options can be selected in both Unix-like and non-Unix-like environments using
the GUI facility of <b>cmake-gui</b> if you are using <b>CMake</b> instead of
-<b>configure</b> to build PCRE.
+<b>configure</b> to build PCRE.
</P>
<P>
-There is a lot more information about building PCRE in non-Unix-like
-environments in the file called <i>NON_UNIX_USE</i>, which is part of the PCRE
-distribution. You should consult this file as well as the <i>README</i> file if
+There is a lot more information about building PCRE in non-Unix-like
+environments in the file called <i>NON_UNIX_USE</i>, which is part of the PCRE
+distribution. You should consult this file as well as the <i>README</i> file if
you are building in a non-Unix-like environment.
</P>
<P>
@@ -80,7 +80,7 @@ To build PCRE with support for UTF-8 Unicode character strings, add
to the <b>configure</b> command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8 option when you call the <b>pcre_compile()</b>
-function.
+or <b>pcre_compile2()</b> functions.
</P>
<P>
If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE expects
@@ -186,8 +186,8 @@ another (for example, from an opening parenthesis to an alternation
metacharacter). By default, two-byte values are used for these offsets, leading
to a maximum size for a compiled pattern of around 64K. This is sufficient to
handle all but the most gigantic patterns. Nevertheless, some people do want to
-process enormous patterns, so it is possible to compile PCRE to use three-byte
-or four-byte offsets by adding a setting such as
+process truly enormous patterns, so it is possible to compile PCRE to use
+three-byte or four-byte offsets by adding a setting such as
<pre>
--with-link-size=3
</pre>
@@ -215,7 +215,7 @@ to the <b>configure</b> command. With this configuration, PCRE will use the
<b>pcre_stack_malloc</b> and <b>pcre_stack_free</b> variables to call memory
management functions. By default these point to <b>malloc()</b> and
<b>free()</b>, but you can replace the pointers so that your own functions are
-used.
+used instead.
</P>
<P>
Separate functions are provided rather than using <b>pcre_malloc</b> and
@@ -224,7 +224,7 @@ requested are always the same, and the blocks are always freed in reverse
order. A calling program might be able to implement optimized functions that
perform better than <b>malloc()</b> and <b>free()</b>. PCRE runs noticeably more
slowly when built in this way. This option affects only the <b>pcre_exec()</b>
-function; it is not relevant for the the <b>pcre_dfa_exec()</b> function.
+function; it is not relevant for <b>pcre_dfa_exec()</b>.
</P>
<br><a name="SEC11" href="#TOC1">LIMITING PCRE RESOURCE USAGE</a><br>
<P>
@@ -308,7 +308,7 @@ If you add
to the <b>configure</b> command, <b>pcretest</b> is linked with the
<b>libreadline</b> library, and when its input is from a terminal, it reads it
using the <b>readline()</b> function. This provides line-editing and history
-facilities. Note that <b>libreadline</b> is GPL-licenced, so if you distribute a
+facilities. Note that <b>libreadline</b> is GPL-licensed, so if you distribute a
binary of <b>pcretest</b> linked in this way, there may be licensing issues.
</P>
<P>
@@ -345,7 +345,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC18" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 06 September 2009
+Last updated: 29 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
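The link-size hunk in the pcrebuild diff above ties the width of the internal offsets to the maximum size of a compiled pattern. The arithmetic can be sketched as follows. This is an illustration only: the function name is hypothetical, and the result is the largest value that fits in the given number of bytes, which is an upper ceiling rather than the exact limit PCRE enforces.

```c
/* Hypothetical helper, not part of PCRE.
   Largest offset representable in link_size bytes (valid for 1..7). */
unsigned long long max_link_offset(int link_size)
{
    return (1ULL << (8 * link_size)) - 1ULL;
}
```

With the default of two bytes the ceiling is 65535, which matches the "around 64K" figure in the text; three bytes raise it to 16777215.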
diff --git a/doc/html/pcrecallout.html b/doc/html/pcrecallout.html
index 217764e..ab2e2d5 100644
--- a/doc/html/pcrecallout.html
+++ b/doc/html/pcrecallout.html
@@ -39,9 +39,10 @@ For example, this pattern has two callout points:
<pre>
(?C1)abc(?C2)def
</pre>
-If the PCRE_AUTO_CALLOUT option bit is set when <b>pcre_compile()</b> is called,
-PCRE automatically inserts callouts, all with number 255, before each item in
-the pattern. For example, if PCRE_AUTO_CALLOUT is used with the pattern
+If the PCRE_AUTO_CALLOUT option bit is set when <b>pcre_compile()</b> or
+<b>pcre_compile2()</b> is called, PCRE automatically inserts callouts, all with
+number 255, before each item in the pattern. For example, if PCRE_AUTO_CALLOUT
+is used with the pattern
<pre>
A(\d{2}|--)
</pre>
@@ -73,6 +74,12 @@ the callout is never reached. However, with "abyd", though the result is still
no match, the callout is obeyed.
</P>
<P>
+If the pattern is studied, PCRE knows the minimum length of a matching string,
+and will immediately give a "no match" return without actually running a match
+if the subject is not long enough, or, for unanchored patterns, if it has
+been scanned far enough.
+</P>
+<P>
You can disable these optimizations by passing the PCRE_NO_START_OPTIMIZE
option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. This slows down the
matching process, but does ensure that callouts such as the example above are
@@ -179,7 +186,7 @@ The external callout function returns an integer to PCRE. If the value is zero,
matching proceeds as normal. If the value is greater than zero, matching fails
at the current point, but the testing of other matching possibilities goes
ahead, just as if a lookahead assertion had failed. If the value is less than
-zero, the match is abandoned, and <b>pcre_exec()</b> (or <b>pcre_dfa_exec()</b>)
+zero, the match is abandoned, and <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
returns the negative value.
</P>
<P>
@@ -199,7 +206,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 15 March 2009
+Last updated: 29 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrecompat.html b/doc/html/pcrecompat.html
index 5ad6036..22f34df 100644
--- a/doc/html/pcrecompat.html
+++ b/doc/html/pcrecompat.html
@@ -17,9 +17,8 @@ DIFFERENCES BETWEEN PCRE AND PERL
</b><br>
<P>
This document describes the differences in the ways that PCRE and Perl handle
-regular expressions. The differences described here are mainly with respect to
-Perl 5.8, though PCRE versions 7.0 and later contain some features that are
-in Perl 5.10.
+regular expressions. The differences described here are with respect to Perl
+5.10.
</P>
<P>
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details of what
@@ -90,11 +89,11 @@ documentation for details.
</P>
<P>
9. Subpatterns that are called recursively or as "subroutines" are always
-treated as atomic groups in PCRE. This is like Python, but unlike Perl. There
+treated as atomic groups in PCRE. This is like Python, but unlike Perl. There
is a discussion of an example that explains this in more detail in the
<a href="pcrepattern.html#recursiondifference">section on recursion differences from Perl</a>
in the
-<a href="pcrecompat.html"><b>pcrecompat</b></a>
+<a href="pcrepattern.html"><b>pcrepattern</b></a>
page.
</P>
<P>
@@ -108,15 +107,26 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
argument. PCRE does not support (*MARK).
</P>
<P>
-12. PCRE provides some extensions to the Perl regular expression facilities.
-Perl 5.10 will include new features that are not in earlier versions, some of
-which (such as named parentheses) have been in PCRE for some time. This list is
-with respect to Perl 5.10:
+12. PCRE's handling of duplicate subpattern numbers and duplicate subpattern
+names is not as general as Perl's. This is a consequence of the fact that PCRE
+works internally just with numbers, using an external table to translate
+between numbers and names. In particular, a pattern such as (?|(?&#60;a&#62;A)|(?&#60;b&#62;B)),
+where the two capturing parentheses have the same number but different names,
+is not supported, and causes an error at compile time. If it were allowed, it
+would not be possible to distinguish which parentheses matched, because both
+names map to capturing subpattern number 1. To avoid this confusing situation,
+an error is given at compile time.
+</P>
+<P>
+13. PCRE provides some extensions to the Perl regular expression facilities.
+Perl 5.10 includes new features that are not in earlier versions of Perl, some
+of which (such as named parentheses) have been in PCRE for some time. This list
+is with respect to Perl 5.10:
<br>
<br>
-(a) Although lookbehind assertions must match fixed length strings, each
-alternative branch of a lookbehind assertion can match a different length of
-string. Perl requires them all to have the same length.
+(a) Although lookbehind assertions in PCRE must match fixed length strings,
+each alternative branch of a lookbehind assertion can match a different length
+of string. Perl requires them all to have the same length.
<br>
<br>
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
@@ -177,7 +187,7 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 18 September 2009
+Last updated: 04 October 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcregrep.html b/doc/html/pcregrep.html
index 8ab07fe..58911d9 100644
--- a/doc/html/pcregrep.html
+++ b/doc/html/pcregrep.html
@@ -119,9 +119,9 @@ standard input is always so treated.
</P>
<br><a name="SEC4" href="#TOC1">OPTIONS</a><br>
<P>
-The order in which some of the options appear can affect the output. For
-example, both the <b>-h</b> and <b>-l</b> options affect the printing of file
-names. Whichever comes later in the command line will be the one that takes
+The order in which some of the options appear can affect the output. For
+example, both the <b>-h</b> and <b>-l</b> options affect the printing of file
+names. Whichever comes later in the command line will be the one that takes
effect.
</P>
<P>
@@ -326,9 +326,9 @@ output once, on a separate line.
Instead of outputting lines from the files, just output the names of the files
containing lines that would have been output. Each file name is output
once, on a separate line. Searching normally stops as soon as a matching line
-is found in a file. However, if the <b>-c</b> (count) option is also used,
-matching continues in order to obtain the correct count, and those files that
-have at least one match are listed along with their counts. Using this option
+is found in a file. However, if the <b>-c</b> (count) option is also used,
+matching continues in order to obtain the correct count, and those files that
+have at least one match are listed along with their counts. Using this option
with <b>-c</b> is a way of suppressing the listing of files with no matches.
</P>
<P>
@@ -474,8 +474,8 @@ The majority of short and long forms of <b>pcregrep</b>'s options are the same
as in the GNU <b>grep</b> program. Any long option of the form
<b>--xxx-regexp</b> (GNU terminology) is also available as <b>--xxx-regex</b>
(PCRE terminology). However, the <b>--locale</b>, <b>-M</b>, <b>--multiline</b>,
-<b>-u</b>, and <b>--utf-8</b> options are specific to <b>pcregrep</b>. If both the
-<b>-c</b> and <b>-l</b> options are given, GNU grep lists only file names,
+<b>-u</b>, and <b>--utf-8</b> options are specific to <b>pcregrep</b>. If both the
+<b>-c</b> and <b>-l</b> options are given, GNU grep lists only file names,
without counts, but <b>pcregrep</b> gives the counts.
</P>
<br><a name="SEC8" href="#TOC1">OPTIONS WITH DATA</a><br>
diff --git a/doc/html/pcrematching.html b/doc/html/pcrematching.html
index 9c2d86c..6ab3dfb 100644
--- a/doc/html/pcrematching.html
+++ b/doc/html/pcrematching.html
@@ -96,13 +96,18 @@ traditional finite state machine (it keeps multiple states active
simultaneously).
</P>
<P>
+Although the general principle of this matching algorithm is that it scans the
+subject string only once, without backtracking, there is one exception: when a
+lookaround assertion is encountered, the characters following or preceding the
+current point have to be independently inspected.
+</P>
+<P>
The scan continues until either the end of the subject is reached, or there are
no more unterminated paths. At this point, terminated paths represent the
different matching possibilities (if there are none, the match has failed).
Thus, if there is more than one possible match, this algorithm finds all of
-them, and in particular, it finds the longest. In PCRE, there is an option to
-stop the algorithm after the first match (which is necessarily the shortest)
-has been found.
+them, and in particular, it finds the longest. There is an option to stop the
+algorithm after the first match (which is necessarily the shortest) is found.
</P>
<P>
Note that all the matches that are found start at the same point in the
@@ -116,12 +121,6 @@ character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
</P>
<P>
-Although the general principle of this matching algorithm is that it scans the
-subject string only once, without backtracking, there is one exception: when a
-lookbehind assertion is encountered, the preceding characters have to be
-re-inspected.
-</P>
-<P>
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
</P>
@@ -186,7 +185,9 @@ callouts.
2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack, it is possible to pass very long subject strings to
the matching function in several pieces, checking for partial matching each
-time.
+time. The
+<a href="pcrepartial.html"><b>pcrepartial</b></a>
+documentation gives details of partial matching.
</P>
<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
<P>
@@ -215,7 +216,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 05 September 2009
+Last updated: 29 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrepartial.html b/doc/html/pcrepartial.html
index 80eb3bb..459464f 100644
--- a/doc/html/pcrepartial.html
+++ b/doc/html/pcrepartial.html
@@ -58,10 +58,13 @@ though the details differ between the two matching functions. If both options
are set, PCRE_PARTIAL_HARD takes precedence.
</P>
<P>
-Setting a partial matching option disables one of PCRE's optimizations. PCRE
+Setting a partial matching option disables two of PCRE's optimizations. PCRE
remembers the last literal byte in a pattern, and abandons matching immediately
if such a byte is not present in the subject string. This optimization cannot
-be used for a subject string that might match only partially.
+be used for a subject string that might match only partially. If the pattern
+was studied, PCRE knows the minimum length of a matching string, and does not
+bother to run the matching function on shorter strings. This optimization is
+also disabled for partial matching.
</P>
<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br>
<P>
@@ -78,7 +81,7 @@ instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
vector, the first of them is set to the offset of the earliest character that
was inspected when the partial match was found. For convenience, the second
offset points to the end of the string so that a substring can easily be
-extracted.
+identified.
</P>
<P>
For the majority of patterns, the first offset identifies the start of the
@@ -382,7 +385,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC11" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 05 September 2009
+Last updated: 29 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index b2e0dd5..619024a 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -61,10 +61,10 @@ description of PCRE's regular expressions is intended as reference material.
</P>
<P>
The original operation of PCRE was on strings of one-byte characters. However,
-there is now also support for UTF-8 character strings. To use this, you must
-build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with
-the PCRE_UTF8 option. There is also a special sequence that can be given at the
-start of a pattern:
+there is now also support for UTF-8 character strings. To use this,
+PCRE must be built to include UTF-8 support, and you must call
+<b>pcre_compile()</b> or <b>pcre_compile2()</b> with the PCRE_UTF8 option. There
+is also a special sequence that can be given at the start of a pattern:
<pre>
(*UTF8)
</pre>
@@ -111,8 +111,9 @@ string with one of the following five sequences:
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
</pre>
-These override the default and the options given to <b>pcre_compile()</b>. For
-example, on a Unix system where LF is the default newline sequence, the pattern
+These override the default and the options given to <b>pcre_compile()</b> or
+<b>pcre_compile2()</b>. For example, on a Unix system where LF is the default
+newline sequence, the pattern
<pre>
(*CR)a.b
</pre>
@@ -228,9 +229,8 @@ Non-printing characters
A second use of backslash provides a way of encoding non-printing characters
in patterns in a visible manner. There is no restriction on the appearance of
non-printing characters, apart from the binary zero that terminates a pattern,
-but when a pattern is being prepared by text editing, it is usually easier to
-use one of the following escape sequences than the binary character it
-represents:
+but when a pattern is being prepared by text editing, it is often easier to use
+one of the following escape sequences than the binary character it represents:
<pre>
\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any character
@@ -334,7 +334,7 @@ a number enclosed either in angle brackets or single quotes, is an alternative
syntax for referencing a subpattern as a "subroutine". Details are discussed
<a href="#onigurumasubroutines">later.</a>
Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
-synonymous. The former is a back reference; the latter is a
+synonymous. The former is a back reference; the latter is a
<a href="#subpatternsassubroutines">subroutine</a>
call.
</P>
@@ -465,12 +465,13 @@ one of the following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
</pre>
-These override the default and the options given to <b>pcre_compile()</b>, but
-they can be overridden by options given to <b>pcre_exec()</b>. Note that these
-special settings, which are not Perl-compatible, are recognized only at the
-very start of a pattern, and that they must be in upper case. If more than one
-of them is present, the last one is used. They can be combined with a change of
-newline convention, for example, a pattern can start with:
+These override the default and the options given to <b>pcre_compile()</b> or
+<b>pcre_compile2()</b>, but they can be overridden by options given to
+<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. Note that these special settings,
+which are not Perl-compatible, are recognized only at the very start of a
+pattern, and that they must be in upper case. If more than one of them is
+present, the last one is used. They can be combined with a change of newline
+convention, for example, a pattern can start with:
<pre>
(*ANY)(*BSR_ANYCRLF)
</pre>
@@ -731,7 +732,10 @@ different meaning, namely the backspace character, inside a character class).
A word boundary is a position in the subject string where the current character
and the previous character do not both match \w or \W (i.e. one matches
\w and the other matches \W), or the start or end of the string if the
-first or last character matches \w, respectively.
+first or last character matches \w, respectively. Neither PCRE nor Perl has a
+separate "start of word" or "end of word" metasequence. However, whatever
+follows \b normally determines which it is. For example, the fragment
+\ba matches "a" at the start of a word.
</P>
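The sentence added above about \b can be checked with a small sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption made only so that the example runs without libpcre; for ASCII subjects like these, \b and \w behave the same way in both engines. The helper names are hypothetical.

```python
import re

# Python's re module is a stand-in for PCRE here (assumption).
START_OF_WORD_A = re.compile(r"\ba")  # "a" with a word boundary before it
END_OF_WORD_A = re.compile(r"a\b")    # "a" with a word boundary after it


def starts(subject):
    """Return the offsets at which an 'a' begins a word."""
    return [m.start() for m in START_OF_WORD_A.finditer(subject)]


def ends(subject):
    """Return the offsets at which an 'a' ends a word."""
    return [m.start() for m in END_OF_WORD_A.finditer(subject)]
```

What follows or precedes \b decides whether it acts as a "start of word" or an "end of word" test, which is the point the new text makes.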
<P>
The \A, \Z, and \z assertions differ from the traditional circumflex and
@@ -862,15 +866,16 @@ the lookbehind.
<br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
<P>
An opening square bracket introduces a character class, terminated by a closing
-square bracket. A closing square bracket on its own is not special. If a
-closing square bracket is required as a member of the class, it should be the
-first data character in the class (after an initial circumflex, if present) or
-escaped with a backslash.
+square bracket. A closing square bracket on its own is not special by default.
+However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
+bracket causes a compile-time error. If a closing square bracket is required as
+a member of the class, it should be the first data character in the class
+(after an initial circumflex, if present) or escaped with a backslash.
</P>
<P>
A character class matches a single character in the subject. In UTF-8 mode, the
-character may occupy more than one byte. A matched character must be in the set
-of characters defined by the class, unless the first character in the class
+character may be more than one byte long. A matched character must be in the
+set of characters defined by the class, unless the first character in the class
definition is a circumflex, in which case the subject character must not be in
the set defined by the class. If a circumflex is actually required as a member
of the class, ensure it is not the first character, or escape it with a
@@ -881,7 +886,7 @@ For example, the character class [aeiou] matches any lower case vowel, while
[^aeiou] matches any character that is not a lower case vowel. Note that a
circumflex is just a convenient notation for specifying the characters that
are in the class by enumerating those that are not. A class that starts with a
-circumflex is not an assertion: it still consumes a character from the subject
+circumflex is not an assertion; it still consumes a character from the subject
string, and therefore it fails if the current pointer is at the end of the
string.
</P>
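The point that a negated class is not an assertion can be illustrated with a short sketch. Python's re module is used as a stand-in for PCRE (an assumption, so that the example needs no external library); the behaviour of [^aeiou] shown here is the same in both. The function name is hypothetical.

```python
import re

# Python's re module is a stand-in for PCRE here (assumption).
B_THEN_NON_VOWEL = re.compile(r"b([^aeiou])")


def char_after_b(subject):
    """Return the character consumed by [^aeiou] right after a leading 'b',
    or None if the class cannot match."""
    m = B_THEN_NON_VOWEL.match(subject)
    return m.group(1) if m else None
```

With the subject "b" the class has no character left to consume, so the match fails even though "no vowel follows" is true in the everyday sense.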
@@ -897,9 +902,9 @@ caseful version would. In UTF-8 mode, PCRE always understands the concept of
case for characters whose values are less than 128, so caseless matching is
always possible. For characters with higher values, the concept of case is
supported if PCRE is compiled with Unicode property support, but not otherwise.
-If you want to use caseless matching for characters 128 and above, you must
-ensure that PCRE is compiled with Unicode property support as well as with
-UTF-8 support.
+If you want to use caseless matching in UTF-8 mode for characters 128 and above,
+you must ensure that PCRE is compiled with Unicode property support as well as
+with UTF-8 support.
</P>
<P>
Characters that might indicate line breaks are never treated in any special way
@@ -1127,7 +1132,7 @@ match exactly the same set of strings. Because alternative branches are tried
from left to right, and options are not reset until the end of the subpattern
is reached, an option setting in one branch does affect subsequent branches, so
the above patterns match "SUNDAY" as well as "Saturday".
-</P>
+<a name="dupsubpatternnumber"></a></P>
<br><a name="SEC13" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
<P>
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
@@ -1152,8 +1157,22 @@ stored.
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
</pre>
-A backreference or a recursive call to a numbered subpattern always refers to
-the first one in the pattern with the given number.
+A backreference to a numbered subpattern uses the most recent value that is set
+for that number by any subpattern. The following pattern matches "abcabc" or
+"defdef":
+<pre>
+ /(?|(abc)|(def))\1/
+</pre>
+In contrast, a recursive or "subroutine" call to a numbered subpattern always
+refers to the first one in the pattern with the given number. The following
+pattern matches "abcabc" or "defabc":
+<pre>
+ /(?|(abc)|(def))(?1)/
+</pre>
+If a
+<a href="#conditions">condition test</a>
+for a subpattern's having matched refers to a non-unique number, the test is
+true if any of the subpatterns of that number have matched.
</P>
<P>
An alternative approach to using this "branch reset" feature is to use
@@ -1167,7 +1186,8 @@ if an expression is modified, the numbers may change. To help with this
difficulty, PCRE supports the naming of subpatterns. This feature was not
added to Perl until release 5.10. Python had the feature earlier, and PCRE
introduced it at release 4.0, using the Python syntax. PCRE now supports both
-the Perl and the Python syntax.
+the Perl and the Python syntax. Perl allows identically numbered subpatterns to
+have different names, but PCRE does not.
</P>
<P>
In PCRE, a subpattern can be named in one of three ways: (?&#60;name&#62;...) or
@@ -1188,11 +1208,13 @@ is also a convenience function for extracting a captured substring by name.
</P>
<P>
By default, a name must be unique within a pattern, but it is possible to relax
-this constraint by setting the PCRE_DUPNAMES option at compile time. This can
-be useful for patterns where only one instance of the named parentheses can
-match. Suppose you want to match the name of a weekday, either as a 3-letter
-abbreviation or as the full name, and in both cases you want to extract the
-abbreviation. This pattern (ignoring the line breaks) does the job:
+this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
+names are also always permitted for subpatterns with the same number, set up as
+described in the previous section.) Duplicate names can be useful for patterns
+where only one instance of the named parentheses can match. Suppose you want to
+match the name of a weekday, either as a 3-letter abbreviation or as the full
+name, and in both cases you want to extract the abbreviation. This pattern
+(ignoring the line breaks) does the job:
<pre>
(?&#60;DN&#62;Mon|Fri|Sun)(?:day)?|
(?&#60;DN&#62;Tue)(?:sday)?|
@@ -1207,17 +1229,29 @@ subpattern, as described in the previous section.)
<P>
The convenience function for extracting the data by name returns the substring
for the first (and in this example, the only) subpattern of that name that
-matched. This saves searching to find which numbered subpattern it was. If you
-make a reference to a non-unique named subpattern from elsewhere in the
-pattern, the one that corresponds to the lowest number is used. For further
-details of the interfaces for handling named subpatterns, see the
+matched. This saves searching to find which numbered subpattern it was.
+</P>
+<P>
+If you make a backreference to a non-unique named subpattern from elsewhere in
+the pattern, the one that corresponds to the first occurrence of the name is
+used. In the absence of duplicate numbers (see the previous section) this is
+the one with the lowest number. If you use a named reference in a condition
+test (see the
+<a href="#conditions">section about conditions</a>
+below), either to check whether a subpattern has matched, or to check for
+recursion, all subpatterns with the same name are tested. If the condition is
+true for any one of them, the overall condition is true. This is the same
+behaviour as testing by number. For further details of the interfaces for
+handling named subpatterns, see the
<a href="pcreapi.html"><b>pcreapi</b></a>
documentation.
</P>
<P>
<b>Warning:</b> You cannot use different names to distinguish between two
-subpatterns with the same number (see the previous section) because PCRE uses
-only the numbers when matching.
+subpatterns with the same number because PCRE uses only the numbers when
+matching. For this reason, an error is given at compile time if different names
+are given to subpatterns with the same number. However, you can give the same
+name to subpatterns with the same number, even when PCRE_DUPNAMES is not set.
</P>
<br><a name="SEC15" href="#TOC1">REPETITION</a><br>
<P>
@@ -1233,6 +1267,7 @@ items:
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion)
+ a recursive or "subroutine" call to a subpattern
</pre>
The general repetition quantifier specifies a minimum and maximum number of
permitted matches, by giving the two numbers in curly brackets (braces),
@@ -1564,16 +1599,20 @@ after the reference.
<P>
There may be more than one back reference to the same subpattern. If a
subpattern has not actually been used in a particular match, any back
-references to it always fail. For example, the pattern
+references to it always fail by default. For example, the pattern
<pre>
(a|(bc))\2
</pre>
-always fails if it starts to match "a" rather than "bc". Because there may be
-many capturing parentheses in a pattern, all digits following the backslash are
-taken as part of a potential back reference number. If the pattern continues
-with a digit character, some delimiter must be used to terminate the back
-reference. If the PCRE_EXTENDED option is set, this can be whitespace.
-Otherwise an empty comment (see
+always fails if it starts to match "a" rather than "bc". However, if the
+PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
+unset value matches an empty string.
+</P>
+<P>
+Because there may be many capturing parentheses in a pattern, all digits
+following a backslash are taken as part of a potential back reference number.
+If the pattern continues with a digit character, some delimiter must be used to
+terminate the back reference. If the PCRE_EXTENDED option is set, this can be
+whitespace. Otherwise, the \g{ syntax or an empty comment (see
<a href="#comments">"Comments"</a>
below) can be used.
</P>
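The default behaviour described in this hunk, and the need for a delimiter after a back reference, can be sketched as follows. Python's re module is used as a stand-in for PCRE (an assumption); it shares the default rule that a reference to an unset group fails, it accepts the empty comment (?#), and its re.VERBOSE flag plays the role of PCRE_EXTENDED. The \g{ syntax and PCRE_JAVASCRIPT_COMPAT have no equivalent in re and are not shown. All names are hypothetical.

```python
import re

# Python's re module is a stand-in for PCRE here (assumption).
UNSET_REF = re.compile(r"(a|(bc))\2")

# Back reference 1 followed by a literal "0": the reference must be
# terminated so that the digit is not read as part of the number.
REF_THEN_ZERO_COMMENT = re.compile(r"(a)\1(?#)0")
REF_THEN_ZERO_SPACED = re.compile(r"(a)\1 0", re.VERBOSE)


def fully_matches(pattern, subject):
    """True if the compiled pattern matches the whole subject."""
    return pattern.fullmatch(subject) is not None
```

When the first alternative matches "a", group 2 is never set, so \2 cannot match and the overall match fails.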
@@ -1641,6 +1680,8 @@ lookbehind assertion is needed to achieve the other effect.
If you want to force a matching failure at some point in a pattern, the most
convenient way to do it is with (?!) because an empty string always matches, so
an assertion that requires there not to be an empty string must always fail.
+The Perl 5.10 backtracking control verb (*FAIL) or (*F) is essentially a
+synonym for (?!).
<a name="lookbehind"></a></P>
<br><b>
Lookbehind assertions
@@ -1677,7 +1718,7 @@ branches:
</pre>
In some cases, the Perl 5.10 escape sequence \K
<a href="#resetmatchstart">(see above)</a>
-can be used instead of a lookbehind assertion to get round the fixed-length
+can be used instead of a lookbehind assertion to get round the fixed-length
restriction.
</P>
<P>
@@ -1695,14 +1736,14 @@ different numbers of bytes, are also not permitted.
<P>
<a href="#subpatternsassubroutines">"Subroutine"</a>
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
-as the subpattern matches a fixed-length string.
+as the subpattern matches a fixed-length string.
<a href="#recursion">Recursion,</a>
however, is not supported.
</P>
<P>
Possessive quantifiers can be used in conjunction with lookbehind assertions to
-specify efficient matching at the end of the subject string. Consider a simple
-pattern such as
+specify efficient matching of fixed-length strings at the end of subject
+strings. Consider a simple pattern such as
<pre>
abcd$
</pre>
@@ -1764,8 +1805,8 @@ characters that are not "999".
<P>
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns, depending on
-the result of an assertion, or whether a previous capturing subpattern matched
-or not. The two possible forms of conditional subpattern are
+the result of an assertion, or whether a specific capturing subpattern has
+already been matched. The two possible forms of conditional subpattern are:
<pre>
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
@@ -1783,12 +1824,16 @@ Checking for a used subpattern by number
</b><br>
<P>
If the text between the parentheses consists of a sequence of digits, the
-condition is true if the capturing subpattern of that number has previously
-matched. An alternative notation is to precede the digits with a plus or minus
-sign. In this case, the subpattern number is relative rather than absolute.
-The most recently opened parentheses can be referenced by (?(-1), the next most
-recent by (?(-2), and so on. In looping constructs it can also make sense to
-refer to subsequent groups with constructs such as (?(+2).
+condition is true if a capturing subpattern of that number has previously
+matched. If there is more than one capturing subpattern with the same number
+(see the earlier
+<a href="#dupsubpatternnumber">section about duplicate subpattern numbers),</a>
+the condition is true if any of them have been set. An alternative notation is
+to precede the digits with a plus or minus sign. In this case, the subpattern
+number is relative rather than absolute. The most recently opened parentheses
+can be referenced by (?(-1), the next most recent by (?(-2), and so on. In
+looping constructs it can also make sense to refer to subsequent groups with
+constructs such as (?(+2).
</P>
<P>
Consider the following pattern, which contains non-significant white space to
@@ -1832,8 +1877,10 @@ names that consist entirely of digits is not recommended.
Rewriting the above example to use a named subpattern gives this:
<pre>
(?&#60;OPEN&#62; \( )? [^()]+ (?(&#60;OPEN&#62;) \) )
-
-</PRE>
+</pre>
+If the name used in a condition of this kind is a duplicate, the test is
+applied to all subpatterns of the same name, and is true if any one of them has
+matched.
</P>
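A conditional subpattern that tests a named group can be tried out with the sketch below. It uses Python's re module as a stand-in for PCRE (an assumption); note that re spells the group as (?P&lt;OPEN&gt;...) and the condition as (?(OPEN)...), whereas the PCRE text above writes (?(&lt;OPEN&gt;)...). Duplicate names are not supported by re, so only the unique-name case is shown. The function name is hypothetical.

```python
import re

# Python's re module is a stand-in for PCRE here (assumption).
# An optional "(" is captured as OPEN; a ")" is required only if OPEN matched.
OPTIONAL_PARENS = re.compile(r"(?P<OPEN>\()?[^()]+(?(OPEN)\))")


def balanced_or_bare(subject):
    """True if the subject is a run of non-parentheses, either bare or
    wrapped in exactly one pair of parentheses."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

The condition is evaluated at the point where it appears in the pattern, so it reflects whether OPEN has matched so far in the current attempt.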
<br><b>
Checking for pattern recursion
@@ -1846,14 +1893,16 @@ letter R, for example:
<pre>
(?(R3)...) or (?(R&name)...)
</pre>
-the condition is true if the most recent recursion is into the subpattern whose
+the condition is true if the most recent recursion is into a subpattern whose
number or name is given. This condition does not check the entire recursion
-stack.
+stack. If the name used in a condition of this kind is a duplicate, the test is
+applied to all subpatterns of the same name, and is true if any one of them is
+the most recent recursion.
</P>
<P>
-At "top level", all these recursion test conditions are false.
-<a href="#recursion">Recursive patterns</a>
-are described below.
+At "top level", all these recursion test conditions are false.
+<a href="#recursion">The syntax for recursive patterns</a>
+is described below.
</P>
<br><b>
Defining subpatterns for use by reference only
@@ -1863,7 +1912,7 @@ If the condition is the string (DEFINE), and there is no subpattern with the
name DEFINE, the condition is always false. In this case, there may be only one
alternative in the subpattern. It is always skipped if control reaches this
point in the pattern; the idea of DEFINE is that it can be used to define
-"subroutines" that can be referenced from elsewhere. (The use of
+"subroutines" that can be referenced from elsewhere. (The use of
<a href="#subpatternsassubroutines">"subroutines"</a>
is described below.) For example, a pattern to match an IPv4 address could be
written like this (ignore whitespace and line breaks):
@@ -1874,12 +1923,9 @@ written like this (ignore whitespace and line breaks):
 The first part of the pattern is a DEFINE group inside which another group
named "byte" is defined. This matches an individual component of an IPv4
address (a number less than 256). When matching takes place, this part of the
-pattern is skipped because DEFINE acts like a false condition.
-</P>
-<P>
-The rest of the pattern uses references to the named group to match the four
-dot-separated components of an IPv4 address, insisting on a word boundary at
-each end.
+pattern is skipped because DEFINE acts like a false condition. The rest of the
+pattern uses references to the named group to match the four dot-separated
+components of an IPv4 address, insisting on a word boundary at each end.
</P>
<br><b>
Assertion conditions
@@ -1939,7 +1985,7 @@ this kind of recursion was subsequently introduced into Perl at release 5.10.
<P>
A special item that consists of (? followed by a number greater than zero and a
closing parenthesis is a recursive call of the subpattern of the given number,
-provided that it occurs inside that subpattern. (If not, it is a
+provided that it occurs inside that subpattern. (If not, it is a
<a href="#subpatternsassubroutines">"subroutine"</a>
call, which is described in the next section.) The special item (?R) or (?0) is
a recursive call of the entire regular expression.
@@ -1948,25 +1994,26 @@ a recursive call of the entire regular expression.
This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
<pre>
- \( ( (?&#62;[^()]+) | (?R) )* \)
+ \( ( [^()]++ | (?R) )* \)
</pre>
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a recursive
match of the pattern itself (that is, a correctly parenthesized substring).
-Finally there is a closing parenthesis.
+Finally there is a closing parenthesis. Note the use of a possessive quantifier
+to avoid backtracking into sequences of non-parentheses.
</P>
<P>
If this were part of a larger pattern, you would not want to recurse the entire
pattern, so instead you could use this:
<pre>
- ( \( ( (?&#62;[^()]+) | (?1) )* \) )
+ ( \( ( [^()]++ | (?1) )* \) )
</pre>
We have put the pattern into parentheses, and caused the recursion to refer to
them instead of the whole pattern.
</P>
<P>
In a larger pattern, keeping track of parenthesis numbers can be tricky. This
-is made easier by the use of relative references. (A Perl 5.10 feature.)
+is made easier by the use of relative references (a Perl 5.10 feature).
Instead of (?1) in the pattern above you can write (?-2) to refer to the second
most recently opened parentheses preceding the recursion. In other words, a
negative number counts capturing parentheses leftwards from the point at which
@@ -1984,20 +2031,20 @@ An alternative approach is to use named parentheses instead. The Perl syntax
for this is (?&name); PCRE's earlier syntax (?P&#62;name) is also supported. We
could rewrite the above example as follows:
<pre>
- (?&#60;pn&#62; \( ( (?&#62;[^()]+) | (?&pn) )* \) )
+ (?&#60;pn&#62; \( ( [^()]++ | (?&pn) )* \) )
</pre>
If there is more than one subpattern with the same name, the earliest one is
used.
</P>
<P>
This particular example pattern that we have been looking at contains nested
-unlimited repeats, and so the use of atomic grouping for matching strings of
-non-parentheses is important when applying the pattern to strings that do not
-match. For example, when this pattern is applied to
+unlimited repeats, and so the use of a possessive quantifier for matching
+strings of non-parentheses is important when applying the pattern to strings
+that do not match. For example, when this pattern is applied to
<pre>
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
</pre>
-it yields "no match" quickly. However, if atomic grouping is not used,
+it yields "no match" quickly. However, if a possessive quantifier is not used,
the match runs for a very long time indeed because there are so many different
ways the + and * repeats can carve up the subject, and all have to be tested
before failure can be reported.
@@ -2015,7 +2062,7 @@ documentation). If the pattern above is matched against
the value for the capturing parentheses is "ef", which is the last value taken
on at the top level. If additional parentheses are added, giving
<pre>
- \( ( ( (?&#62;[^()]+) | (?R) )* ) \)
+ \( ( ( [^()]++ | (?R) )* ) \)
^ ^
^ ^
</pre>
@@ -2044,19 +2091,19 @@ Recursion difference from Perl
In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
treated as an atomic group. That is, once it has matched some of the subject
string, it is never re-entered, even if it contains untried alternatives and
-there is a subsequent matching failure. This can be illustrated by the
-following pattern, which purports to match a palindromic string that contains
+there is a subsequent matching failure. This can be illustrated by the
+following pattern, which purports to match a palindromic string that contains
an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
<pre>
^(.|(.)(?1)\2)$
</pre>
-The idea is that it either matches a single character, or two identical
-characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
+The idea is that it either matches a single character, or two identical
+characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
 it does not if the subject string is longer than three characters. Consider the
subject string "abcba":
</P>
<P>
-At the top level, the first character is matched, but as it is not at the end
+At the top level, the first character is matched, but as it is not at the end
of the string, the first alternative fails; the second alternative is taken
and the recursion kicks in. The recursive call to subpattern 1 successfully
matches the next character ("b"). (Note that the beginning and end of line
@@ -2064,7 +2111,7 @@ tests are not part of the recursion).
</P>
<P>
Back at the top level, the next character ("c") is compared with what
-subpattern 2 matched, which was "a". This fails. Because the recursion is
+subpattern 2 matched, which was "a". This fails. Because the recursion is
treated as an atomic group, there are now no backtracking points, and so the
entire match fails. (Perl is able, at this point, to re-enter the recursion and
try the second alternative.) However, if the pattern is written with the
@@ -2072,36 +2119,44 @@ alternatives in the other order, things are different:
<pre>
^((.)(?1)\2|.)$
</pre>
-This time, the recursing alternative is tried first, and continues to recurse
-until it runs out of characters, at which point the recursion fails. But this
-time we do have another alternative to try at the higher level. That is the big
+This time, the recursing alternative is tried first, and continues to recurse
+until it runs out of characters, at which point the recursion fails. But this
+time we do have another alternative to try at the higher level. That is the big
difference: in the previous case the remaining alternative is at a deeper
recursion level, which PCRE cannot use.
</P>
<P>
-To change the pattern so that matches all palindromic strings, not just those
+To change the pattern so that it matches all palindromic strings, not just those
with an odd number of characters, it is tempting to change the pattern to this:
<pre>
^((.)(?1)\2|.?)$
</pre>
-Again, this works in Perl, but not in PCRE, and for the same reason. When a
-deeper recursion has matched a single character, it cannot be entered again in
-order to match an empty string. The solution is to separate the two cases, and
+Again, this works in Perl, but not in PCRE, and for the same reason. When a
+deeper recursion has matched a single character, it cannot be entered again in
+order to match an empty string. The solution is to separate the two cases, and
write out the odd and even cases as alternatives at the higher level:
<pre>
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
</pre>
-If you want to match typical palindromic phrases, the pattern has to ignore all
+If you want to match typical palindromic phrases, the pattern has to ignore all
non-word characters, which can be done like this:
<pre>
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
</pre>
-If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
-man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
-the use of the possessive quantifier *+ to avoid backtracking into sequences of
+If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
+man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
+the use of the possessive quantifier *+ to avoid backtracking into sequences of
non-word characters. Without this, PCRE takes a great deal longer (ten times or
more) to match typical phrases, and Perl takes so long that you think it has
gone into a loop.
+</P>
+<P>
+<b>WARNING</b>: The palindrome-matching patterns above work only if the subject
+string does not start with a palindrome that is shorter than the entire string.
+For example, although "abcba" is correctly matched, if the subject is "ababa",
+PCRE finds the palindrome "aba" at the start, then fails at top level because
+the end of the string does not follow. Once again, it cannot jump back into the
+recursion to try other alternatives, so the entire match fails.
<a name="subpatternsassubroutines"></a></P>
<br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
<P>
@@ -2212,9 +2267,9 @@ failing negative assertion, they cause an error if encountered by
<b>pcre_dfa_exec()</b>.
</P>
<P>
-If any of these verbs are used in an assertion subpattern, their effect is
+If any of these verbs are used in an assertion subpattern, their effect is
confined to that subpattern; it does not extend to the surrounding pattern.
-Note that assertion subpatterns are processed as anchored at the point where
+Note that assertion subpatterns are processed as anchored at the point where
they are tested.
</P>
<P>
@@ -2234,12 +2289,12 @@ The following verbs act as soon as they are encountered:
</pre>
This verb causes the match to end successfully, skipping the remainder of the
pattern. When inside a recursion, only the innermost pattern is ended
-immediately. If the (*ACCEPT) is inside capturing parentheses, the data so far
-is captured. (This feature was added to PCRE at release 8.00.) For example:
+immediately. If (*ACCEPT) is inside capturing parentheses, the data so far is
+captured. (This feature was added to PCRE at release 8.00.) For example:
<pre>
A((?:A|B(*ACCEPT)|C)D)
</pre>
-This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
+This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
the outer parentheses.
<pre>
(*FAIL) or (*F)
@@ -2267,7 +2322,7 @@ The verbs differ in exactly what kind of failure occurs.
</pre>
This verb causes the whole match to fail outright if the rest of the pattern
does not match. Even if the pattern is unanchored, no further attempts to find
-a match by advancing the start point take place. Once (*COMMIT) has been
+a match by advancing the starting point take place. Once (*COMMIT) has been
passed, <b>pcre_exec()</b> is committed to finding a match at the current
starting point, or not at all. For example:
<pre>
@@ -2299,7 +2354,7 @@ was matched leading up to it cannot be part of a successful match. Consider:
If the subject is "aaaac...", after the first match attempt fails (starting at
the first character in the string), the starting point skips on to start the
next attempt at "c". Note that a possessive quantifier does not have the same
-effect in this example; although it would suppress backtracking during the
+effect as this example; although it would suppress backtracking during the
first match attempt, the second attempt would start at the second character
instead of skipping on to "c".
<pre>
@@ -2319,7 +2374,8 @@ is used outside of any alternation, it acts exactly like (*PRUNE).
</P>
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
<P>
-<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3).
+<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3),
+<b>pcresyntax</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
<P>
@@ -2332,7 +2388,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 22 September 2009
+Last updated: 04 October 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcreposix.html b/doc/html/pcreposix.html
index b0c00bf..c4ad088 100644
--- a/doc/html/pcreposix.html
+++ b/doc/html/pcreposix.html
@@ -128,9 +128,9 @@ are returned.
<pre>
REG_UNGREEDY
</pre>
-The PCRE_UNGREEDY option is set when the regular expression is passed for
+The PCRE_UNGREEDY option is set when the regular expression is passed for
compilation to the native function. Note that REG_UNGREEDY is not part of the
-POSIX standard.
+POSIX standard.
<pre>
REG_UTF8
</pre>
diff --git a/doc/html/pcresample.html b/doc/html/pcresample.html
index 37c6f79..3f53de2 100644
--- a/doc/html/pcresample.html
+++ b/doc/html/pcresample.html
@@ -20,7 +20,7 @@ A simple, complete demonstration program, to get you started with using PCRE,
is supplied in the file <i>pcredemo.c</i> in the PCRE distribution. A listing of
this program is given in the
<a href="pcredemo.html"><b>pcredemo</b></a>
-documentation. If you do not have a copy of the PCRE distribution, you can save
+documentation. If you do not have a copy of the PCRE distribution, you can save
this listing to re-create <i>pcredemo.c</i>.
</P>
<P>
@@ -38,8 +38,8 @@ an empty string. Comments in the code explain what is going on.
</P>
<P>
If PCRE is installed in the standard include and library directories for your
-system, you should be able to compile the demonstration program using this
-command:
+operating system, you should be able to compile the demonstration program using
+this command:
<pre>
gcc -o pcredemo pcredemo.c -lpcre
</pre>
@@ -59,7 +59,7 @@ this:
Note that there is a much more comprehensive test program, called
<a href="pcretest.html"><b>pcretest</b>,</a>
which supports many more facilities for testing regular expressions and the
-PCRE library. The
+PCRE library. The
<a href="pcredemo.html"><b>pcredemo</b></a>
program is provided as a simple coding example.
</P>
@@ -93,7 +93,7 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 01 September 2009
+Last updated: 30 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/html/pcretest.html b/doc/html/pcretest.html
index 1a99d60..ba1d369 100644
--- a/doc/html/pcretest.html
+++ b/doc/html/pcretest.html
@@ -248,7 +248,7 @@ begins with a lookbehind assertion (including \b or \B).
If any call to <b>pcre_exec()</b> in a <b>/g</b> or <b>/G</b> sequence matches an
empty string, the next call is done with the PCRE_NOTEMPTY_ATSTART and
PCRE_ANCHORED flags set in order to search for another, non-empty, match at the
-same point. If this second match fails, the start offset is advanced by one
+same point. If this second match fails, the start offset is advanced by one
character, and the normal match is retried. This imitates the way Perl handles
such cases when using the <b>/g</b> modifier or the <b>split()</b> function.
</P>
@@ -371,13 +371,14 @@ recognized:
\L call pcre_get_substringlist() after a successful match
\M discover the minimum MATCH_LIMIT and MATCH_LIMIT_RECURSION settings
\N pass the PCRE_NOTEMPTY option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>; if used twice, pass the
- PCRE_NOTEMPTY_ATSTART option
+ PCRE_NOTEMPTY_ATSTART option
\Odd set the size of the output vector passed to <b>pcre_exec()</b> to dd (any number of digits)
\P pass the PCRE_PARTIAL_SOFT option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>; if used twice, pass the
- PCRE_PARTIAL_HARD option
+ PCRE_PARTIAL_HARD option
\Qdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd (any number of digits)
\R pass the PCRE_DFA_RESTART option to <b>pcre_dfa_exec()</b>
\S output details of memory get/free calls during matching
+ \Y pass the PCRE_NO_START_OPTIMIZE option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
\Z pass the PCRE_NOTEOL option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
\? pass the PCRE_NO_UTF8_CHECK option to <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>
\&#62;dd start the match at offset dd (any number of digits);
@@ -540,7 +541,7 @@ the subject where there is at least one match. For example:
</pre>
(Using the normal matching function on this data finds only "tang".) The
longest matching string is always given first (and numbered zero). After a
-PCRE_ERROR_PARTIAL return, the output is "Partial match:", followed by the
+PCRE_ERROR_PARTIAL return, the output is "Partial match:", followed by the
partially matching substring.
</P>
<P>
@@ -708,7 +709,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 September 2009
+Last updated: 26 September 2009
<br>
Copyright &copy; 1997-2009 University of Cambridge.
<br>
diff --git a/doc/pcre.txt b/doc/pcre.txt
index f6140e6..6a96b67 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -2,7 +2,7 @@
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
-synopses of each function in the library have not been included. Neither has
+synopses of each function in the library have not been included. Neither has
the pcredemo program. There are separate text files for the pcregrep and
pcretest commands.
-----------------------------------------------------------------------------
@@ -19,23 +19,23 @@ INTRODUCTION
The PCRE library is a set of functions that implement regular expres-
sion pattern matching using the same syntax and semantics as Perl, with
- just a few differences. Certain features that appeared in Python and
- PCRE before they appeared in Perl are also available using the Python
- syntax. There is also some support for certain .NET and Oniguruma syn-
- tax items, and there is an option for requesting some minor changes
- that give better JavaScript compatibility.
-
- The current implementation of PCRE (release 8.xx) corresponds approxi-
- mately with Perl 5.10, including support for UTF-8 encoded strings and
- Unicode general category properties. However, UTF-8 and Unicode support
- has to be explicitly enabled; it is not the default. The Unicode tables
- correspond to Unicode release 5.1.
+ just a few differences. Some features that appeared in Python and PCRE
+ before they appeared in Perl are also available using the Python syn-
+ tax, there is some support for one or two .NET and Oniguruma syntax
+ items, and there is an option for requesting some minor changes that
+ give better JavaScript compatibility.
+
+ The current implementation of PCRE corresponds approximately with Perl
+ 5.10, including support for UTF-8 encoded strings and Unicode general
+ category properties. However, UTF-8 and Unicode support has to be
+ explicitly enabled; it is not the default. The Unicode tables corre-
+ spond to Unicode release 5.1.
In addition to the Perl-compatible matching function, PCRE contains an
- alternative matching function that matches the same compiled patterns
- in a different way. In certain circumstances, the alternative function
- has some advantages. For a discussion of the two matching algorithms,
- see the pcrematching page.
+ alternative function that matches the same compiled patterns in a dif-
+ ferent way. In certain circumstances, the alternative function has some
+ advantages. For a discussion of the two matching algorithms, see the
+ pcrematching page.
PCRE is written in C and released as a C library. A number of people
have written wrappers and interfaces of various kinds. In particular,
@@ -55,8 +55,8 @@ INTRODUCTION
library is built. The pcre_config() function makes it possible for a
client to discover which features are available. The features them-
selves are described in the pcrebuild page. Documentation about build-
- ing PCRE for various operating systems can be found in the README file
- in the source distribution.
+ ing PCRE for various operating systems can be found in the README and
+ NON-UNIX-USE files in the source distribution.
The library contains a number of undocumented internal functions and
data tables that are used by more than one of the exported external
@@ -89,12 +89,12 @@ USER DOCUMENTATION
pcrepartial details of the partial matching facility
pcrepattern syntax and semantics of supported
regular expressions
- pcresyntax quick syntax reference
pcreperform discussion of performance issues
pcreposix the POSIX-compatible C API
pcreprecompile details of saving and re-using precompiled patterns
pcresample discussion of the pcredemo program
pcrestack discussion of stack usage
+ pcresyntax quick syntax reference
pcretest description of the pcretest testing command
In addition, in the "man" and HTML formats, there is a short page for
@@ -142,7 +142,7 @@ UTF-8 AND UNICODE PROPERTY SUPPORT
with the PCRE_UTF8 option flag, or the pattern must start with the
sequence (*UTF8). When either of these is the case, both the pattern
and any subject strings that are matched against it are treated as
- UTF-8 strings instead of just strings of bytes.
+ UTF-8 strings instead of strings of 1-byte characters.
If you compile PCRE with UTF-8 support, but do not use it at run time,
the library will be a bit bigger, but the additional run time overhead
@@ -263,11 +263,11 @@ AUTHOR
REVISION
- Last updated: 01 September 2009
+ Last updated: 28 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREBUILD(3) PCREBUILD(3)
@@ -324,7 +324,7 @@ UTF-8 SUPPORT
to the configure command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8 option when you call the pcre_compile()
- function.
+ or pcre_compile2() functions.
If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE
expects its input to be either ASCII or UTF-8 (depending on the runtime
@@ -432,9 +432,9 @@ HANDLING VERY LARGE PATTERNS
nation metacharacter). By default, two-byte values are used for these
offsets, leading to a maximum size for a compiled pattern of around
64K. This is sufficient to handle all but the most gigantic patterns.
- Nevertheless, some people do want to process enormous patterns, so it
- is possible to compile PCRE to use three-byte or four-byte offsets by
- adding a setting such as
+ Nevertheless, some people do want to process truly enormous patterns,
+ so it is possible to compile PCRE to use three-byte or four-byte off-
+ sets by adding a setting such as
--with-link-size=3
@@ -461,7 +461,7 @@ AVOIDING EXCESSIVE STACK USAGE
to the configure command. With this configuration, PCRE will use the
pcre_stack_malloc and pcre_stack_free variables to call memory manage-
ment functions. By default these point to malloc() and free(), but you
- can replace the pointers so that your own functions are used.
+ can replace the pointers so that your own functions are used instead.
Separate functions are provided rather than using pcre_malloc and
pcre_free because the usage is very predictable: the block sizes
@@ -469,70 +469,69 @@ AVOIDING EXCESSIVE STACK USAGE
reverse order. A calling program might be able to implement optimized
functions that perform better than malloc() and free(). PCRE runs
noticeably more slowly when built in this way. This option affects only
- the pcre_exec() function; it is not relevant for the the
- pcre_dfa_exec() function.
+ the pcre_exec() function; it is not relevant for pcre_dfa_exec().
LIMITING PCRE RESOURCE USAGE
- Internally, PCRE has a function called match(), which it calls repeat-
- edly (sometimes recursively) when matching a pattern with the
- pcre_exec() function. By controlling the maximum number of times this
- function may be called during a single matching operation, a limit can
- be placed on the resources used by a single call to pcre_exec(). The
- limit can be changed at run time, as described in the pcreapi documen-
- tation. The default is 10 million, but this can be changed by adding a
+ Internally, PCRE has a function called match(), which it calls repeat-
+ edly (sometimes recursively) when matching a pattern with the
+ pcre_exec() function. By controlling the maximum number of times this
+ function may be called during a single matching operation, a limit can
+ be placed on the resources used by a single call to pcre_exec(). The
+ limit can be changed at run time, as described in the pcreapi documen-
+ tation. The default is 10 million, but this can be changed by adding a
setting such as
--with-match-limit=500000
- to the configure command. This setting has no effect on the
+ to the configure command. This setting has no effect on the
pcre_dfa_exec() matching function.
- In some environments it is desirable to limit the depth of recursive
+ In some environments it is desirable to limit the depth of recursive
calls of match() more strictly than the total number of calls, in order
- to restrict the maximum amount of stack (or heap, if --disable-stack-
+ to restrict the maximum amount of stack (or heap, if --disable-stack-
for-recursion is specified) that is used. A second limit controls this;
- it defaults to the value that is set for --with-match-limit, which
- imposes no additional constraints. However, you can set a lower limit
+ it defaults to the value that is set for --with-match-limit, which
+ imposes no additional constraints. However, you can set a lower limit
by adding, for example,
--with-match-limit-recursion=10000
- to the configure command. This value can also be overridden at run
+ to the configure command. This value can also be overridden at run
time.
CREATING CHARACTER TABLES AT BUILD TIME
- PCRE uses fixed tables for processing characters whose code values are
- less than 256. By default, PCRE is built with a set of tables that are
- distributed in the file pcre_chartables.c.dist. These tables are for
+ PCRE uses fixed tables for processing characters whose code values are
+ less than 256. By default, PCRE is built with a set of tables that are
+ distributed in the file pcre_chartables.c.dist. These tables are for
ASCII codes only. If you add
--enable-rebuild-chartables
- to the configure command, the distributed tables are no longer used.
- Instead, a program called dftables is compiled and run. This outputs
+ to the configure command, the distributed tables are no longer used.
+ Instead, a program called dftables is compiled and run. This outputs
the source for a new set of tables, created in the default locale of your
C runtime system. (This method of replacing the tables does not work if
- you are cross compiling, because dftables is run on the local host. If
- you need to create alternative tables when cross compiling, you will
+ you are cross compiling, because dftables is run on the local host. If
+ you need to create alternative tables when cross compiling, you will
have to do so "by hand".)
USING EBCDIC CODE
- PCRE assumes by default that it will run in an environment where the
- character code is ASCII (or Unicode, which is a superset of ASCII).
- This is the case for most computer operating systems. PCRE can, how-
+ PCRE assumes by default that it will run in an environment where the
+ character code is ASCII (or Unicode, which is a superset of ASCII).
+ This is the case for most computer operating systems. PCRE can, how-
ever, be compiled to run in an EBCDIC environment by adding
--enable-ebcdic
to the configure command. This setting implies --enable-rebuild-charta-
- bles. You should only use it if you know that you are in an EBCDIC
- environment (for example, an IBM mainframe operating system). The
+ bles. You should only use it if you know that you are in an EBCDIC
+ environment (for example, an IBM mainframe operating system). The
--enable-ebcdic option is incompatible with --enable-utf8.
@@ -546,7 +545,7 @@ PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
--enable-pcregrep-libbz2
to the configure command. These options naturally require that the rel-
- evant libraries are installed on your system. Configuration will fail
+ evant libraries are installed on your system. Configuration will fail
if they are not.
@@ -556,24 +555,24 @@ PCRETEST OPTION FOR LIBREADLINE SUPPORT
--enable-pcretest-libreadline
- to the configure command, pcretest is linked with the libreadline
- library, and when its input is from a terminal, it reads it using the
+ to the configure command, pcretest is linked with the libreadline
+ library, and when its input is from a terminal, it reads it using the
readline() function. This provides line-editing and history facilities.
- Note that libreadline is GPL-licenced, so if you distribute a binary of
+ Note that libreadline is GPL-licensed, so if you distribute a binary of
pcretest linked in this way, there may be licensing issues.
- Setting this option causes the -lreadline option to be added to the
- pcretest build. In many operating environments with a sytem-installed
+ Setting this option causes the -lreadline option to be added to the
+ pcretest build. In many operating environments with a system-installed
libreadline this is sufficient. However, in some environments (e.g. if
- an unmodified distribution version of readline is in use), some extra
- configuration may be necessary. The INSTALL file for libreadline says
+ an unmodified distribution version of readline is in use), some extra
+ configuration may be necessary. The INSTALL file for libreadline says
this:
"Readline uses the termcap functions, but does not link with the
termcap or curses library itself, allowing applications which link
with readline the to choose an appropriate library."
- If your environment has not been set up so that an appropriate library
+ If your environment has not been set up so that an appropriate library
is automatically included, you may need to add something like
LIBS="-lncurses"
@@ -595,11 +594,11 @@ AUTHOR
REVISION
- Last updated: 06 September 2009
+ Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREMATCHING(3) PCREMATCHING(3)
@@ -683,13 +682,19 @@ THE ALTERNATIVE MATCHING ALGORITHM
though it is not implemented as a traditional finite state machine (it
keeps multiple states active simultaneously).
+ Although the general principle of this matching algorithm is that it
+ scans the subject string only once, without backtracking, there is one
+ exception: when a lookaround assertion is encountered, the characters
+ following or preceding the current point have to be independently
+ inspected.
+
The scan continues until either the end of the subject is reached, or
there are no more unterminated paths. At this point, terminated paths
represent the different matching possibilities (if there are none, the
match has failed). Thus, if there is more than one possible match,
this algorithm finds all of them, and in particular, it finds the long-
- est. In PCRE, there is an option to stop the algorithm after the first
- match (which is necessarily the shortest) has been found.
+ est. There is an option to stop the algorithm after the first match
+ (which is necessarily the shortest) is found.
Note that all the matches that are found start at the same point in the
subject. If the pattern
@@ -701,73 +706,69 @@ THE ALTERNATIVE MATCHING ALGORITHM
at the fourth character of the subject. The algorithm does not automat-
ically move on to find matches that start at later positions.
- Although the general principle of this matching algorithm is that it
- scans the subject string only once, without backtracking, there is one
- exception: when a lookbehind assertion is encountered, the preceding
- characters have to be re-inspected.
-
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
- 1. Because the algorithm finds all possible matches, the greedy or
- ungreedy nature of repetition quantifiers is not relevant. Greedy and
+ 1. Because the algorithm finds all possible matches, the greedy or
+ ungreedy nature of repetition quantifiers is not relevant. Greedy and
ungreedy quantifiers are treated in exactly the same way. However, pos-
- sessive quantifiers can make a difference when what follows could also
+ sessive quantifiers can make a difference when what follows could also
match what is quantified, for example in a pattern like this:
^a++\w!
- This pattern matches "aaab!" but not "aaa!", which would be matched by
- a non-possessive quantifier. Similarly, if an atomic group is present,
- it is matched as if it were a standalone pattern at the current point,
- and the longest match is then "locked in" for the rest of the overall
+ This pattern matches "aaab!" but not "aaa!", which would be matched by
+ a non-possessive quantifier. Similarly, if an atomic group is present,
+ it is matched as if it were a standalone pattern at the current point,
+ and the longest match is then "locked in" for the rest of the overall
pattern.
2. When dealing with multiple paths through the tree simultaneously, it
- is not straightforward to keep track of captured substrings for the
- different matching possibilities, and PCRE's implementation of this
+ is not straightforward to keep track of captured substrings for the
+ different matching possibilities, and PCRE's implementation of this
algorithm does not attempt to do this. This means that no captured sub-
strings are available.
- 3. Because no substrings are captured, back references within the pat-
+ 3. Because no substrings are captured, back references within the pat-
tern are not supported, and cause errors if encountered.
- 4. For the same reason, conditional expressions that use a backrefer-
- ence as the condition or test for a specific group recursion are not
+ 4. For the same reason, conditional expressions that use a backrefer-
+ ence as the condition or test for a specific group recursion are not
supported.
- 5. Because many paths through the tree may be active, the \K escape
+ 5. Because many paths through the tree may be active, the \K escape
sequence, which resets the start of the match when encountered (but may
- be on some paths and not on others), is not supported. It causes an
+ be on some paths and not on others), is not supported. It causes an
error if encountered.
- 6. Callouts are supported, but the value of the capture_top field is
+ 6. Callouts are supported, but the value of the capture_top field is
always 1, and the value of the capture_last field is always -1.
- 7. The \C escape sequence, which (in the standard algorithm) matches a
- single byte, even in UTF-8 mode, is not supported because the alterna-
- tive algorithm moves through the subject string one character at a
+ 7. The \C escape sequence, which (in the standard algorithm) matches a
+ single byte, even in UTF-8 mode, is not supported because the alterna-
+ tive algorithm moves through the subject string one character at a
time, for all active paths through the tree.
- 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
- are not supported. (*FAIL) is supported, and behaves like a failing
+ 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
+ are not supported. (*FAIL) is supported, and behaves like a failing
negative assertion.
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
- Using the alternative matching algorithm provides the following advan-
+ Using the alternative matching algorithm provides the following advan-
tages:
1. All possible matches (at a single point in the subject) are automat-
- ically found, and in particular, the longest match is found. To find
+ ically found, and in particular, the longest match is found. To find
more than one match using the standard algorithm, you have to do kludgy
things with callouts.
- 2. Because the alternative algorithm scans the subject string just
- once, and never needs to backtrack, it is possible to pass very long
- subject strings to the matching function in several pieces, checking
- for partial matching each time.
+ 2. Because the alternative algorithm scans the subject string just
+ once, and never needs to backtrack, it is possible to pass very long
+ subject strings to the matching function in several pieces, checking
+ for partial matching each time. The pcrepartial documentation gives
+ details of partial matching.
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
@@ -793,11 +794,11 @@ AUTHOR
REVISION
- Last updated: 05 September 2009
+ Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREAPI(3) PCREAPI(3)
@@ -1126,7 +1127,9 @@ COMPILING A PATTERN
Either of the functions pcre_compile() or pcre_compile2() can be called
to compile a pattern into an internal form. The only difference between
the two interfaces is that pcre_compile2() has an additional argument,
- errorcodeptr, via which a numerical error code can be returned.
+ errorcodeptr, via which a numerical error code can be returned. To
+ avoid too much repetition, we refer just to pcre_compile() below, but
+ the information applies equally to pcre_compile2().
The pattern is a C string terminated by a binary zero, and is passed in
the pattern argument. A pointer to a single block of memory that is
@@ -1144,20 +1147,20 @@ COMPILING A PATTERN
The options argument contains various bit settings that affect the com-
pilation. It should be zero if no options are required. The available
options are described below. Some of them (in particular, those that
- are compatible with Perl, but also some others) can also be set and
+ are compatible with Perl, but some others as well) can also be set and
unset from within the pattern (see the detailed description in the
pcrepattern documentation). For those options that can be different in
different parts of the pattern, the contents of the options argument
- specifies their initial settings at the start of compilation and execu-
- tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
- time of matching as well as at compile time.
+ specifies their settings at the start of compilation and execution. The
+ PCRE_ANCHORED, PCRE_BSR_xxx, and PCRE_NEWLINE_xxx options can be set at
+ the time of matching as well as at compile time.
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error mes-
sage. This is a static string that is part of the library. You must not
try to free it. The byte offset from the start of the pattern to the
- character that was being processes when the error was discovered is
+ character that was being processed when the error was discovered is
placed in the variable pointed to by erroffset, which must not be NULL.
If it is, an immediate error is given. Some errors are not detected
until checks are carried out when the whole pattern has been scanned;
@@ -1491,14 +1494,14 @@ STUDYING A PATTERN
the results of the study.
The returned value from pcre_study() can be passed directly to
- pcre_exec(). However, a pcre_extra block also contains other fields
- that can be set by the caller before the block is passed; these are
- described below in the section on matching a pattern.
+ pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con-
+ tains other fields that can be set by the caller before the block is
+ passed; these are described below in the section on matching a pattern.
- If studying the pattern does not produce any additional information
+ If studying the pattern does not produce any useful information,
pcre_study() returns NULL. In that circumstance, if the calling program
- wants to pass any of the other fields to pcre_exec(), it must set up
- its own pcre_extra block.
+ wants to pass any of the other fields to pcre_exec() or
+ pcre_dfa_exec(), it must set up its own pcre_extra block.
The second argument of pcre_study() contains option bits. At present,
no options are defined, and this argument should always be zero.
@@ -1518,63 +1521,72 @@ STUDYING A PATTERN
0, /* no options exist */
&error); /* set to NULL or points to a message */
- At present, studying a pattern is useful only for non-anchored patterns
- that do not have a single fixed starting character. A bitmap of possi-
- ble starting bytes is created.
+ Studying a pattern does two things: first, a lower bound for the length
+ of subject string that is needed to match the pattern is computed. This
+ does not mean that there are any strings of that length that match, but
+ it does guarantee that no shorter strings match. The value is used by
+ pcre_exec() and pcre_dfa_exec() to avoid wasting time by trying to
+ match strings that are shorter than the lower bound. You can find out
+ the value in a calling program via the pcre_fullinfo() function.
+
+ Studying a pattern is also useful for non-anchored patterns that do not
+ have a single fixed starting character. A bitmap of possible starting
+ bytes is created. This speeds up finding a position in the subject at
+ which to start matching.
LOCALE SUPPORT
- PCRE handles caseless matching, and determines whether characters are
- letters, digits, or whatever, by reference to a set of tables, indexed
- by character value. When running in UTF-8 mode, this applies only to
- characters with codes less than 128. Higher-valued codes never match
- escapes such as \w or \d, but can be tested with \p if PCRE is built
- with Unicode character property support. The use of locales with Uni-
- code is discouraged. If you are handling characters with codes greater
- than 128, you should either use UTF-8 and Unicode, or use locales, but
+ PCRE handles caseless matching, and determines whether characters are
+ letters, digits, or whatever, by reference to a set of tables, indexed
+ by character value. When running in UTF-8 mode, this applies only to
+ characters with codes less than 128. Higher-valued codes never match
+ escapes such as \w or \d, but can be tested with \p if PCRE is built
+ with Unicode character property support. The use of locales with Uni-
+ code is discouraged. If you are handling characters with codes greater
+ than 128, you should either use UTF-8 and Unicode, or use locales, but
not try to mix the two.
- PCRE contains an internal set of tables that are used when the final
- argument of pcre_compile() is NULL. These are sufficient for many
+ PCRE contains an internal set of tables that are used when the final
+ argument of pcre_compile() is NULL. These are sufficient for many
applications. Normally, the internal tables recognize only ASCII char-
acters. However, when PCRE is built, it is possible to cause the inter-
nal tables to be rebuilt in the default "C" locale of the local system,
which may cause them to be different.
- The internal tables can always be overridden by tables supplied by the
+ The internal tables can always be overridden by tables supplied by the
application that calls PCRE. These may be created in a different locale
- from the default. As more and more applications change to using Uni-
+ from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away.
- External tables are built by calling the pcre_maketables() function,
- which has no arguments, in the relevant locale. The result can then be
- passed to pcre_compile() or pcre_exec() as often as necessary. For
- example, to build and use tables that are appropriate for the French
- locale (where accented characters with values greater than 128 are
+ External tables are built by calling the pcre_maketables() function,
+ which has no arguments, in the relevant locale. The result can then be
+ passed to pcre_compile() or pcre_exec() as often as necessary. For
+ example, to build and use tables that are appropriate for the French
+ locale (where accented characters with values greater than 128 are
treated as letters), the following code could be used:
setlocale(LC_CTYPE, "fr_FR");
tables = pcre_maketables();
re = pcre_compile(..., tables);
- The locale name "fr_FR" is used on Linux and other Unix-like systems;
+ The locale name "fr_FR" is used on Linux and other Unix-like systems;
if you are using Windows, the name for the French locale is "french".
- When pcre_maketables() runs, the tables are built in memory that is
- obtained via pcre_malloc. It is the caller's responsibility to ensure
- that the memory containing the tables remains available for as long as
+ When pcre_maketables() runs, the tables are built in memory that is
+ obtained via pcre_malloc. It is the caller's responsibility to ensure
+ that the memory containing the tables remains available for as long as
it is needed.
The pointer that is passed to pcre_compile() is saved with the compiled
- pattern, and the same tables are used via this pointer by pcre_study()
+ pattern, and the same tables are used via this pointer by pcre_study()
and normally also by pcre_exec(). Thus, by default, for any single pat-
tern, compilation, studying and matching all happen in the same locale,
but different patterns can be compiled in different locales.
- It is possible to pass a table pointer or NULL (indicating the use of
- the internal tables) to pcre_exec(). Although not intended for this
- purpose, this facility could be used to match a pattern in a different
+ It is possible to pass a table pointer or NULL (indicating the use of
+ the internal tables) to pcre_exec(). Although not intended for this
+ purpose, this facility could be used to match a pattern in a different
locale from the one in which it was compiled. Passing table pointers at
run time is discussed below in the section on matching a pattern.
@@ -1584,15 +1596,15 @@ INFORMATION ABOUT A PATTERN
int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
int what, void *where);
- The pcre_fullinfo() function returns information about a compiled pat-
+ The pcre_fullinfo() function returns information about a compiled pat-
tern. It replaces the obsolete pcre_info() function, which is neverthe-
less retained for backwards compatibility (and is documented below).
- The first argument for pcre_fullinfo() is a pointer to the compiled
- pattern. The second argument is the result of pcre_study(), or NULL if
- the pattern was not studied. The third argument specifies which piece
- of information is required, and the fourth argument is a pointer to a
- variable to receive the data. The yield of the function is zero for
+ The first argument for pcre_fullinfo() is a pointer to the compiled
+ pattern. The second argument is the result of pcre_study(), or NULL if
+ the pattern was not studied. The third argument specifies which piece
+ of information is required, and the fourth argument is a pointer to a
+ variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:
PCRE_ERROR_NULL the argument code was NULL
@@ -1600,9 +1612,9 @@ INFORMATION ABOUT A PATTERN
PCRE_ERROR_BADMAGIC the "magic number" was not found
PCRE_ERROR_BADOPTION the value of what was invalid
- The "magic number" is placed at the start of each compiled pattern as
- a simple check against passing an arbitrary memory pointer. Here is a
- typical call of pcre_fullinfo(), to obtain the length of the compiled
+ The "magic number" is placed at the start of each compiled pattern as
+ a simple check against passing an arbitrary memory pointer. Here is a
+ typical call of pcre_fullinfo(), to obtain the length of the compiled
pattern:
int rc;
@@ -1613,111 +1625,131 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
- The possible values for the third argument are defined in pcre.h, and
+ The possible values for the third argument are defined in pcre.h, and
are as follows:
PCRE_INFO_BACKREFMAX
- Return the number of the highest back reference in the pattern. The
- fourth argument should point to an int variable. Zero is returned if
+ Return the number of the highest back reference in the pattern. The
+ fourth argument should point to an int variable. Zero is returned if
there are no back references.
PCRE_INFO_CAPTURECOUNT
- Return the number of capturing subpatterns in the pattern. The fourth
+ Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.
PCRE_INFO_DEFAULT_TABLES
- Return a pointer to the internal default character tables within PCRE.
- The fourth argument should point to an unsigned char * variable. This
+ Return a pointer to the internal default character tables within PCRE.
+ The fourth argument should point to an unsigned char * variable. This
information call is provided for internal use by the pcre_study() func-
- tion. External callers can cause PCRE to use its internal tables by
+ tion. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer.
PCRE_INFO_FIRSTBYTE
- Return information about the first byte of any matched string, for a
- non-anchored pattern. The fourth argument should point to an int vari-
- able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
+ Return information about the first byte of any matched string, for a
+ non-anchored pattern. The fourth argument should point to an int vari-
+ able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
is still recognized for backwards compatibility.)
- If there is a fixed first byte, for example, from a pattern such as
+ If there is a fixed first byte, for example, from a pattern such as
(cat|cow|coyote), its value is returned. Otherwise, if either
- (a) the pattern was compiled with the PCRE_MULTILINE option, and every
+ (a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),
- -1 is returned, indicating that the pattern matches only at the start
- of a subject string or after any newline within the string. Otherwise
+ -1 is returned, indicating that the pattern matches only at the start
+ of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.
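The three kinds of PCRE_INFO_FIRSTBYTE result described above can be told apart with a small helper. This is only a sketch: the enum and the function name classify_firstbyte are invented for illustration and are not part of the PCRE API. The value passed in is assumed to be the int that pcre_fullinfo() stored for PCRE_INFO_FIRSTBYTE.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum firstbyte_kind {
    FIRSTBYTE_FIXED,       /* value >= 0: every match starts with this byte */
    FIRSTBYTE_LINE_START,  /* -1: matches only at the start of the subject
                              or after a newline within it */
    FIRSTBYTE_NONE         /* -2: no fixed first byte is known; also the
                              result for anchored patterns */
};

/* Map the returned int onto one of the three documented cases. */
enum firstbyte_kind classify_firstbyte(int value)
{
    if (value >= 0)
        return FIRSTBYTE_FIXED;
    if (value == -1)
        return FIRSTBYTE_LINE_START;
    return FIRSTBYTE_NONE;
}
```

For the pattern (cat|cow|coyote) mentioned above, the stored value would be the byte 'c', which this helper reports as FIRSTBYTE_FIXED.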
PCRE_INFO_FIRSTTABLE
- If the pattern was studied, and this resulted in the construction of a
+ If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
- matching string, a pointer to the table is returned. Otherwise NULL is
- returned. The fourth argument should point to an unsigned char * vari-
+ matching string, a pointer to the table is returned. Otherwise NULL is
+ returned. The fourth argument should point to an unsigned char * vari-
able.
PCRE_INFO_HASCRORLF
- Return 1 if the pattern contains any explicit matches for CR or LF
- characters, otherwise 0. The fourth argument should point to an int
- variable. An explicit match is either a literal CR or LF character, or
+ Return 1 if the pattern contains any explicit matches for CR or LF
+ characters, otherwise 0. The fourth argument should point to an int
+ variable. An explicit match is either a literal CR or LF character, or
\r or \n.
PCRE_INFO_JCHANGED
- Return 1 if the (?J) or (?-J) option setting is used in the pattern,
- otherwise 0. The fourth argument should point to an int variable. (?J)
+ Return 1 if the (?J) or (?-J) option setting is used in the pattern,
+ otherwise 0. The fourth argument should point to an int variable. (?J)
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
PCRE_INFO_LASTLITERAL
- Return the value of the rightmost literal byte that must exist in any
- matched string, other than at its start, if such a byte has been
+ Return the value of the rightmost literal byte that must exist in any
+ matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
- is no such byte, -1 is returned. For anchored patterns, a last literal
- byte is recorded only if it follows something of variable length. For
+ is no such byte, -1 is returned. For anchored patterns, a last literal
+ byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.
+ PCRE_INFO_MINLENGTH
+
+ If the pattern was studied and a minimum length for matching subject
+ strings was computed, its value is returned. Otherwise the returned
+ value is -1. The value is a number of characters, not bytes (this may
+ be relevant in UTF-8 mode). The fourth argument should point to an int
+ variable. A non-negative value is a lower bound to the length of any
+ matching string. There may not be any strings of that length that do
+ actually match, but every string that does match is at least that long.
+
PCRE_INFO_NAMECOUNT
PCRE_INFO_NAMEENTRYSIZE
PCRE_INFO_NAMETABLE
- PCRE supports the use of named as well as numbered capturing parenthe-
- ses. The names are just an additional way of identifying the parenthe-
+ PCRE supports the use of named as well as numbered capturing parenthe-
+ ses. The names are just an additional way of identifying the parenthe-
ses, which still acquire numbers. Several convenience functions such as
- pcre_get_named_substring() are provided for extracting captured sub-
- strings by name. It is also possible to extract the data directly, by
- first converting the name to a number in order to access the correct
+ pcre_get_named_substring() are provided for extracting captured sub-
+ strings by name. It is also possible to extract the data directly, by
+ first converting the name to a number in order to access the correct
pointers in the output vector (described with pcre_exec() below). To do
- the conversion, you need to use the name-to-number map, which is
+ the conversion, you need to use the name-to-number map, which is
described by these three values.
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
- of each entry; both of these return an int value. The entry size
- depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
- a pointer to the first entry of the table (a pointer to char). The
+ of each entry; both of these return an int value. The entry size
+ depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
+ a pointer to the first entry of the table (a pointer to char). The
first two bytes of each entry are the number of the capturing parenthe-
- sis, most significant byte first. The rest of the entry is the corre-
- sponding name, zero terminated. The names are in alphabetical order.
- When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
- theses numbers. For example, consider the following pattern (assume
- PCRE_EXTENDED is set, so white space - including newlines - is
- ignored):
+ sis, most significant byte first. The rest of the entry is the corre-
+ sponding name, zero terminated.
+
+ The names are in alphabetical order. Duplicate names may appear if (?|
+ is used to create multiple groups with the same number, as described in
+ the section on duplicate subpattern numbers in the pcrepattern page.
+ Duplicate names for subpatterns with different numbers are permitted
+ only if PCRE_DUPNAMES is set. In all cases of duplicate names, they
+ appear in the table in the order in which they were found in the pat-
+ tern. In the absence of (?| this is the order of increasing number;
+ when (?| is used this is not necessarily the case because later subpat-
+ terns may have lower numbers.
+
+ As a simple example of the name/number table, consider the following
+ pattern (assume PCRE_EXTENDED is set, so white space - including new-
+ lines - is ignored):
(?<date> (?<year>(\d\d)?\d\d) -
(?<month>\d\d) - (?<day>\d\d) )
- There are four named subpatterns, so the table has four entries, and
- each entry in the table is eight bytes long. The table is as follows,
+ There are four named subpatterns, so the table has four entries, and
+ each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hexadecimal, and undefined bytes shown
as ??:
@@ -1726,31 +1758,31 @@ INFORMATION ABOUT A PATTERN
00 04 m o n t h 00
00 02 y e a r 00 ??
- When writing code to extract data from named subpatterns using the
- name-to-number map, remember that the length of the entries is likely
+ When writing code to extract data from named subpatterns using the
+ name-to-number map, remember that the length of the entries is likely
to be different for each compiled pattern.
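As a rough illustration of walking the name-to-number map, the sketch below searches a hand-built copy of the example table. The helper name lookup_group_number is invented here. The rows for "date" and "day" are not shown in this hunk, so they are derived from the group numbering of the example pattern. In a real program the table pointer, entry count and entry size would come from pcre_fullinfo() with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hand-built copy of the example table: four entries of eight bytes each.
   Bytes shown as ?? in the text are undefined; zero is used here. */
const unsigned char example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};

/* Hypothetical helper: return the group number for name, or -1 if the
   name is not in the table. The first two bytes of each entry hold the
   number, most significant byte first; the zero-terminated name follows. */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    int i;
    assert(entry_size > 2);   /* two number bytes plus at least a terminator */
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

The entry size is passed in rather than hard-coded because, as noted above, it is likely to differ from one compiled pattern to the next.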
PCRE_INFO_OKPARTIAL
- Return 1 if the pattern can be used for partial matching with
- pcre_exec(), otherwise 0. The fourth argument should point to an int
- variable. From release 8.00, this always returns 1, because the
- restrictions that previously applied to partial matching have been
- lifted. The pcrepartial documentation gives details of partial match-
+ Return 1 if the pattern can be used for partial matching with
+ pcre_exec(), otherwise 0. The fourth argument should point to an int
+ variable. From release 8.00, this always returns 1, because the
+ restrictions that previously applied to partial matching have been
+ lifted. The pcrepartial documentation gives details of partial match-
ing.
PCRE_INFO_OPTIONS
- Return a copy of the options with which the pattern was compiled. The
- fourth argument should point to an unsigned long int variable. These
+ Return a copy of the options with which the pattern was compiled. The
+ fourth argument should point to an unsigned long int variable. These
option bits are those specified in the call to pcre_compile(), modified
by any top-level option settings at the start of the pattern itself. In
- other words, they are the options that will be in force when matching
- starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
- the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
+ other words, they are the options that will be in force when matching
+ starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
+ the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
and PCRE_EXTENDED.
- A pattern is automatically anchored by PCRE if all of its top-level
+ A pattern is automatically anchored by PCRE if all of its top-level
alternatives begin with one of the following:
^ unless PCRE_MULTILINE is set
@@ -1764,7 +1796,7 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_SIZE
- Return the size of the compiled pattern, that is, the value that was
+ Return the size of the compiled pattern, that is, the value that was
passed as the argument to pcre_malloc() when PCRE was getting memory in
which to place the compiled data. The fourth argument should point to a
size_t variable.
@@ -1772,9 +1804,10 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_STUDYSIZE
Return the size of the data block pointed to by the study_data field in
- a pcre_extra block. That is, it is the value that was passed to
+ a pcre_extra block. That is, it is the value that was passed to
pcre_malloc() when PCRE was getting memory into which to place the data
- created by pcre_study(). The fourth argument should point to a size_t
+ created by pcre_study(). If pcre_extra is NULL, or there is no study
+ data, zero is returned. The fourth argument should point to a size_t
variable.
@@ -1830,7 +1863,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
The function pcre_exec() is called to match a subject string against a
compiled pattern, which is passed in the code argument. If the pattern
- has been studied, the result of the study should be passed in the extra
+ was studied, the result of the study should be passed in the extra
argument. This function is the main matching facility of the library,
and it operates in a Perl-like manner. For specialist use there is also
an alternative matching function, which is described below in the sec-
@@ -1889,8 +1922,8 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
The match_limit field provides a means of preventing PCRE from using up
a vast amount of resources when running patterns that are not going to
match, but which have a very large number of possibilities in their
- search trees. The classic example is the use of nested unlimited
- repeats.
+ search trees. The classic example is a pattern that uses nested unlim-
+ ited repeats.
Internally, PCRE uses a function called match() which it calls repeat-
edly (sometimes recursively). The limit set by match_limit is imposed
@@ -2177,7 +2210,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
has to get additional memory for use during matching. Thus it is usu-
ally advisable to supply an ovector.
- The pcre_info() function can be used to find out how many capturing
+ The pcre_fullinfo() function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
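The (n+1)*3 rule can be written down as a short sketch. The function names ovector_elements and alloc_ovector are invented for illustration. The capture count is assumed to have been obtained from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.

```c
#include <stdlib.h>

/* Smallest number of ovector elements that allows for capture_count
   captured substrings plus the substring matched by the whole pattern. */
int ovector_elements(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Hypothetical helper: allocate an ovector of that size. The element
   count is stored in *ovecsize; the caller frees the result. Returns
   NULL if the allocation fails. */
int *alloc_ovector(int capture_count, int *ovecsize)
{
    *ovecsize = ovector_elements(capture_count);
    return malloc((size_t)*ovecsize * sizeof(int));
}
```

A pattern with four capturing subpatterns therefore needs at least 15 elements, and a pattern with none still needs 3.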
@@ -2438,10 +2471,13 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
behaviour may not be what you want (see the next section).
- Warning: If the pattern uses the "(?|" feature to set up multiple sub-
- patterns with the same number, you cannot use names to distinguish
- them, because names are not included in the compiled code. The matching
- process uses only numbers.
+ Warning: If the pattern uses the (?| feature to set up multiple subpat-
+ terns with the same number, as described in the section on duplicate
+ subpattern numbers in the pcrepattern page, you cannot use names to
+ distinguish the different subpatterns, because names are not included
+ in the compiled code. The matching process uses only numbers. For this
+ reason, the use of different names for subpatterns of the same number
+ causes an error at compile time.
DUPLICATE SUBPATTERN NAMES
@@ -2449,47 +2485,51 @@ DUPLICATE SUBPATTERN NAMES
int pcre_get_stringtable_entries(const pcre *code,
const char *name, char **first, char **last);
- When a pattern is compiled with the PCRE_DUPNAMES option, names for
- subpatterns are not required to be unique. Normally, patterns with
- duplicate names are such that in any one match, only one of the named
- subpatterns participates. An example is shown in the pcrepattern docu-
- mentation.
+ When a pattern is compiled with the PCRE_DUPNAMES option, names for
+ subpatterns are not required to be unique. (Duplicate names are always
+ allowed for subpatterns with the same number, created by using the (?|
+ feature. Indeed, if such subpatterns are named, they are required to
+ use the same names.)
- When duplicates are present, pcre_copy_named_substring() and
- pcre_get_named_substring() return the first substring corresponding to
- the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
- (-7) is returned; no data is returned. The pcre_get_stringnumber()
- function returns one of the numbers that are associated with the name,
+ Normally, patterns with duplicate names are such that in any one match,
+ only one of the named subpatterns participates. An example is shown in
+ the pcrepattern documentation.
+
+ When duplicates are present, pcre_copy_named_substring() and
+ pcre_get_named_substring() return the first substring corresponding to
+ the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
+ (-7) is returned; no data is returned. The pcre_get_stringnumber()
+ function returns one of the numbers that are associated with the name,
but it is not defined which it is.
- If you want to get full details of all captured substrings for a given
- name, you must use the pcre_get_stringtable_entries() function. The
+ If you want to get full details of all captured substrings for a given
+ name, you must use the pcre_get_stringtable_entries() function. The
first argument is the compiled pattern, and the second is the name. The
- third and fourth are pointers to variables which are updated by the
+ third and fourth are pointers to variables which are updated by the
function. After it has run, they point to the first and last entries in
- the name-to-number table for the given name. The function itself
- returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
- there are none. The format of the table is described above in the sec-
- tion entitled Information about a pattern. Given all the relevant
- entries for the name, you can extract each of their numbers, and hence
+ the name-to-number table for the given name. The function itself
+ returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
+ there are none. The format of the table is described above in the sec-
+ tion entitled Information about a pattern. Given all the relevant
+ entries for the name, you can extract each of their numbers, and hence
the captured data, if any.
FINDING ALL POSSIBLE MATCHES
- The traditional matching function uses a similar algorithm to Perl,
+ The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in
- the subject. If you want to find all possible matches, or the longest
- possible match, consider using the alternative matching function (see
- below) instead. If you cannot use the alternative function, but still
- need to find all possible matches, you can kludge it up by making use
+ the subject. If you want to find all possible matches, or the longest
+ possible match, consider using the alternative matching function (see
+ below) instead. If you cannot use the alternative function, but still
+ need to find all possible matches, you can kludge it up by making use
of the callout facility, which is described in the pcrecallout documen-
tation.
What you have to do is to insert a callout right at the end of the pat-
- tern. When your callout function is called, extract and save the cur-
- rent matched substring. Then return 1, which forces pcre_exec() to
- backtrack and try other alternatives. Ultimately, when it runs out of
+ tern. When your callout function is called, extract and save the cur-
+ rent matched substring. Then return 1, which forces pcre_exec() to
+ backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
@@ -2500,26 +2540,26 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
int options, int *ovector, int ovecsize,
int *workspace, int wscount);
- The function pcre_dfa_exec() is called to match a subject string
- against a compiled pattern, using a matching algorithm that scans the
- subject string just once, and does not backtrack. This has different
- characteristics to the normal algorithm, and is not compatible with
- Perl. Some of the features of PCRE patterns are not supported. Never-
- theless, there are times when this kind of matching can be useful. For
- a discussion of the two matching algorithms, and a list of features
- that pcre_dfa_exec() does not support, see the pcrematching documenta-
+ The function pcre_dfa_exec() is called to match a subject string
+ against a compiled pattern, using a matching algorithm that scans the
+ subject string just once, and does not backtrack. This has different
+ characteristics to the normal algorithm, and is not compatible with
+ Perl. Some of the features of PCRE patterns are not supported. Never-
+ theless, there are times when this kind of matching can be useful. For
+ a discussion of the two matching algorithms, and a list of features
+ that pcre_dfa_exec() does not support, see the pcrematching documenta-
tion.
- The arguments for the pcre_dfa_exec() function are the same as for
+ The arguments for the pcre_dfa_exec() function are the same as for
pcre_exec(), plus two extras. The ovector argument is used in a differ-
- ent way, and this is described below. The other common arguments are
- used in the same way as for pcre_exec(), so their description is not
+ ent way, and this is described below. The other common arguments are
+ used in the same way as for pcre_exec(), so their description is not
repeated here.
- The two additional arguments provide workspace for the function. The
- workspace vector should contain at least 20 elements. It is used for
+ The two additional arguments provide workspace for the function. The
+ workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More
- workspace will be needed for patterns and subjects where there are a
+ workspace will be needed for patterns and subjects where there are a
lot of potential matches.
Here is an example of a simple call to pcre_dfa_exec():
@@ -2541,52 +2581,52 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre_dfa_exec()
- The unused bits of the options argument for pcre_dfa_exec() must be
- zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
+ The unused bits of the options argument for pcre_dfa_exec() must be
+ zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR-
- TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
- four of these are exactly the same as for pcre_exec(), so their
+ TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
+ four of these are exactly the same as for pcre_exec(), so their
description is not repeated here.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
- These have the same general effect as they do for pcre_exec(), but the
- details are slightly different. When PCRE_PARTIAL_HARD is set for
- pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
- ject is reached and there is still at least one matching possibility
+ These have the same general effect as they do for pcre_exec(), but the
+ details are slightly different. When PCRE_PARTIAL_HARD is set for
+ pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
+ ject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete
matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
- of the subject is reached, there have been no complete matches, but
- there is still at least one matching possibility. The portion of the
- string that was inspected when the longest partial match was found is
+ of the subject is reached, there have been no complete matches, but
+ there is still at least one matching possibility. The portion of the
+ string that was inspected when the longest partial match was found is
set as the first matching string in both cases.
PCRE_DFA_SHORTEST
- Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
+ Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna-
- tive algorithm works, this is necessarily the shortest possible match
+ tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string.
PCRE_DFA_RESTART
When pcre_dfa_exec() returns a partial match, it is possible to call it
- again, with additional subject characters, and have it continue with
- the same match. The PCRE_DFA_RESTART option requests this action; when
- it is set, the workspace and wscount options must reference the same
- vector as before because data about the match so far is left in them
+ again, with additional subject characters, and have it continue with
+ the same match. The PCRE_DFA_RESTART option requests this action; when
+ it is set, the workspace and wscount options must reference the same
+ vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the
pcrepartial documentation.
Successful returns from pcre_dfa_exec()
- When pcre_dfa_exec() succeeds, it may have matched more than one sub-
+ When pcre_dfa_exec() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run
- of the function start at the same point in the subject. The shorter
- matches are all initial substrings of the longer matches. For example,
+ of the function start at the same point in the subject. The shorter
+ matches are all initial substrings of the longer matches. For example,
if the pattern
<.*>
@@ -2601,61 +2641,61 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
<something> <something else>
<something> <something else> <something further>
- On success, the yield of the function is a number greater than zero,
- which is the number of matched substrings. The substrings themselves
- are returned in ovector. Each string uses two elements; the first is
- the offset to the start, and the second is the offset to the end. In
- fact, all the strings have the same start offset. (Space could have
- been saved by giving this only once, but it was decided to retain some
- compatibility with the way pcre_exec() returns data, even though the
+ On success, the yield of the function is a number greater than zero,
+ which is the number of matched substrings. The substrings themselves
+ are returned in ovector. Each string uses two elements; the first is
+ the offset to the start, and the second is the offset to the end. In
+ fact, all the strings have the same start offset. (Space could have
+ been saved by giving this only once, but it was decided to retain some
+ compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.)
The strings are returned in reverse order of length; that is, the long-
- est matching string is given first. If there were too many matches to
- fit into ovector, the yield of the function is zero, and the vector is
+ est matching string is given first. If there were too many matches to
+ fit into ovector, the yield of the function is zero, and the vector is
filled with the longest matches.
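
The ovector layout described above can be sketched as a small helper. This is only an illustration: the name dfa_match_copy and its signature are hypothetical, and the code does not call PCRE itself. It assumes rc and ovector are exactly what pcre_dfa_exec() left behind after a successful call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the n-th match (0 = longest) reported by
 * pcre_dfa_exec() into buf. rc is the function's yield and ovector is the
 * offsets vector it filled. Returns the match length, or -1 if n is out of
 * range or buf is too small. */
int dfa_match_copy(const char *subject, const int *ovector, int rc,
                   int n, char *buf, size_t bufsize)
{
    if (rc <= 0 || n < 0 || n >= rc) return -1;
    int start = ovector[2 * n];       /* first element of the pair */
    int end = ovector[2 * n + 1];     /* second element of the pair */
    assert(start == ovector[0]);      /* all matches share one start offset */
    size_t len = (size_t)(end - start);
    if (len + 1 > bufsize) return -1;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

When the yield is zero (ovector was too small), a caller would pass the number of pairs that fit in the vector instead of rc, since the vector then holds only the longest matches.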
Error returns from pcre_dfa_exec()
- The pcre_dfa_exec() function returns a negative number when it fails.
- Many of the errors are the same as for pcre_exec(), and these are
- described above. There are in addition the following errors that are
+ The pcre_dfa_exec() function returns a negative number when it fails.
+ Many of the errors are the same as for pcre_exec(), and these are
+ described above. There are in addition the following errors that are
specific to pcre_dfa_exec():
PCRE_ERROR_DFA_UITEM (-16)
- This return is given if pcre_dfa_exec() encounters an item in the pat-
- tern that it does not support, for instance, the use of \C or a back
+ This return is given if pcre_dfa_exec() encounters an item in the pat-
+ tern that it does not support, for instance, the use of \C or a back
reference.
PCRE_ERROR_DFA_UCOND (-17)
- This return is given if pcre_dfa_exec() encounters a condition item
- that uses a back reference for the condition, or a test for recursion
+ This return is given if pcre_dfa_exec() encounters a condition item
+ that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.
PCRE_ERROR_DFA_UMLIMIT (-18)
- This return is given if pcre_dfa_exec() is called with an extra block
+ This return is given if pcre_dfa_exec() is called with an extra block
that contains a setting of the match_limit field. This is not supported
(it is meaningless).
PCRE_ERROR_DFA_WSSIZE (-19)
- This return is given if pcre_dfa_exec() runs out of space in the
+ This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.
PCRE_ERROR_DFA_RECURSE (-20)
- When a recursive subpattern is processed, the matching function calls
- itself recursively, using private vectors for ovector and workspace.
- This error is given if the output vector is not large enough. This
+ When a recursive subpattern is processed, the matching function calls
+ itself recursively, using private vectors for ovector and workspace.
+ This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.
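
For diagnostics, the five codes listed above can be mapped to their names. The function name dfa_error_name is hypothetical, and the numeric values are taken from the list above; code built against the real header would use the PCRE_ERROR_DFA_xxx macros from pcre.h rather than literals.

```c
#include <stddef.h>

/* Hypothetical helper: name of a pcre_dfa_exec()-specific error code,
 * or NULL if rc is not one of the five codes documented above. */
const char *dfa_error_name(int rc)
{
    switch (rc) {
    case -16: return "PCRE_ERROR_DFA_UITEM";   /* unsupported pattern item */
    case -17: return "PCRE_ERROR_DFA_UCOND";   /* unsupported condition */
    case -18: return "PCRE_ERROR_DFA_UMLIMIT"; /* match_limit set in extra block */
    case -19: return "PCRE_ERROR_DFA_WSSIZE";  /* workspace vector too small */
    case -20: return "PCRE_ERROR_DFA_RECURSE"; /* private ovector too small */
    default:  return NULL;
    }
}
```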
SEE ALSO
- pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar-
+ pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar-
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
@@ -2668,11 +2708,11 @@ AUTHOR
REVISION
- Last updated: 22 September 2009
+ Last updated: 03 October 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRECALLOUT(3) PCRECALLOUT(3)
@@ -2698,10 +2738,10 @@ PCRE CALLOUTS
(?C1)abc(?C2)def
- If the PCRE_AUTO_CALLOUT option bit is set when pcre_compile() is
- called, PCRE automatically inserts callouts, all with number 255,
- before each item in the pattern. For example, if PCRE_AUTO_CALLOUT is
- used with the pattern
+ If the PCRE_AUTO_CALLOUT option bit is set when pcre_compile() or
+ pcre_compile2() is called, PCRE automatically inserts callouts, all
+ with number 255, before each item in the pattern. For example, if
+ PCRE_AUTO_CALLOUT is used with the pattern
A(\d{2}|--)
@@ -2730,18 +2770,23 @@ MISSING CALLOUTS
ever start, and the callout is never reached. However, with "abyd",
though the result is still no match, the callout is obeyed.
- You can disable these optimizations by passing the PCRE_NO_START_OPTI-
- MIZE option to pcre_exec() or pcre_dfa_exec(). This slows down the
- matching process, but does ensure that callouts such as the example
+ If the pattern is studied, PCRE knows the minimum length of a matching
+ string, and will immediately give a "no match" return without actually
+ running a match if the subject is not long enough, or, for unanchored
+ patterns, if it has been scanned far enough.
+
+ You can disable these optimizations by passing the PCRE_NO_START_OPTI-
+ MIZE option to pcre_exec() or pcre_dfa_exec(). This slows down the
+ matching process, but does ensure that callouts such as the example
above are obeyed.
THE CALLOUT INTERFACE
- During matching, when PCRE reaches a callout point, the external func-
- tion defined by pcre_callout is called (if it is set). This applies to
- both the pcre_exec() and the pcre_dfa_exec() matching functions. The
- only argument to the callout function is a pointer to a pcre_callout
+ During matching, when PCRE reaches a callout point, the external func-
+ tion defined by pcre_callout is called (if it is set). This applies to
+ both the pcre_exec() and the pcre_dfa_exec() matching functions. The
+ only argument to the callout function is a pointer to a pcre_callout
block. This structure contains the following fields:
int version;
@@ -2757,81 +2802,81 @@ THE CALLOUT INTERFACE
int pattern_position;
int next_item_length;
- The version field is an integer containing the version number of the
- block format. The initial version was 0; the current version is 1. The
- version number will change again in future if additional fields are
+ The version field is an integer containing the version number of the
+ block format. The initial version was 0; the current version is 1. The
+ version number will change again in future if additional fields are
added, but the intention is never to remove any of the existing fields.
- The callout_number field contains the number of the callout, as com-
- piled into the pattern (that is, the number after ?C for manual call-
+ The callout_number field contains the number of the callout, as com-
+ piled into the pattern (that is, the number after ?C for manual call-
outs, and 255 for automatically generated callouts).
- The offset_vector field is a pointer to the vector of offsets that was
- passed by the caller to pcre_exec() or pcre_dfa_exec(). When
- pcre_exec() is used, the contents can be inspected in order to extract
- substrings that have been matched so far, in the same way as for
- extracting substrings after a match has completed. For pcre_dfa_exec()
+ The offset_vector field is a pointer to the vector of offsets that was
+ passed by the caller to pcre_exec() or pcre_dfa_exec(). When
+ pcre_exec() is used, the contents can be inspected in order to extract
+ substrings that have been matched so far, in the same way as for
+ extracting substrings after a match has completed. For pcre_dfa_exec()
this field is not useful.
The subject and subject_length fields contain copies of the values that
were passed to pcre_exec().
- The start_match field normally contains the offset within the subject
- at which the current match attempt started. However, if the escape
- sequence \K has been encountered, this value is changed to reflect the
- modified starting point. If the pattern is not anchored, the callout
+ The start_match field normally contains the offset within the subject
+ at which the current match attempt started. However, if the escape
+ sequence \K has been encountered, this value is changed to reflect the
+ modified starting point. If the pattern is not anchored, the callout
function may be called several times from the same point in the pattern
for different starting points in the subject.
- The current_position field contains the offset within the subject of
+ The current_position field contains the offset within the subject of
the current match pointer.
- When the pcre_exec() function is used, the capture_top field contains
- one more than the number of the highest numbered captured substring so
- far. If no substrings have been captured, the value of capture_top is
- one. This is always the case when pcre_dfa_exec() is used, because it
+ When the pcre_exec() function is used, the capture_top field contains
+ one more than the number of the highest numbered captured substring so
+ far. If no substrings have been captured, the value of capture_top is
+ one. This is always the case when pcre_dfa_exec() is used, because it
does not support captured substrings.
- The capture_last field contains the number of the most recently cap-
- tured substring. If no substrings have been captured, its value is -1.
+ The capture_last field contains the number of the most recently cap-
+ tured substring. If no substrings have been captured, its value is -1.
This is always the case when pcre_dfa_exec() is used.
- The callout_data field contains a value that is passed to pcre_exec()
- or pcre_dfa_exec() specifically so that it can be passed back in call-
- outs. It is passed in the callout_data field of the pcre_extra data
- structure. If no such data was passed, the value of callout_data in a
- pcre_callout block is NULL. There is a description of the pcre_extra
+ The callout_data field contains a value that is passed to pcre_exec()
+ or pcre_dfa_exec() specifically so that it can be passed back in call-
+ outs. It is passed in the callout_data field of the pcre_extra data
+ structure. If no such data was passed, the value of callout_data in a
+ pcre_callout block is NULL. There is a description of the pcre_extra
structure in the pcreapi documentation.
- The pattern_position field is present from version 1 of the pcre_call-
+ The pattern_position field is present from version 1 of the pcre_call-
out structure. It contains the offset to the next item to be matched in
the pattern string.
- The next_item_length field is present from version 1 of the pcre_call-
+ The next_item_length field is present from version 1 of the pcre_call-
out structure. It contains the length of the next item to be matched in
- the pattern string. When the callout immediately precedes an alterna-
- tion bar, a closing parenthesis, or the end of the pattern, the length
- is zero. When the callout precedes an opening parenthesis, the length
+ the pattern string. When the callout immediately precedes an alterna-
+ tion bar, a closing parenthesis, or the end of the pattern, the length
+ is zero. When the callout precedes an opening parenthesis, the length
is that of the entire subpattern.
- The pattern_position and next_item_length fields are intended to help
- in distinguishing between different automatic callouts, which all have
+ The pattern_position and next_item_length fields are intended to help
+ in distinguishing between different automatic callouts, which all have
the same callout number. However, they are set for all callouts.
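
As a sketch of how pattern_position and next_item_length can tell automatic callouts apart, the helper below extracts the text of the next pattern item. The name callout_next_item is hypothetical; it takes the two field values as plain integers so that it does not depend on pcre.h, and it assumes the caller still has the original pattern string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the pattern item that a callout precedes into
 * buf, given the pattern_position and next_item_length values from the
 * callout block. Returns the item length, or -1 on bad arguments. */
int callout_next_item(const char *pattern, int pattern_position,
                      int next_item_length, char *buf, size_t bufsize)
{
    size_t patlen = strlen(pattern);
    if (pattern_position < 0 || next_item_length < 0) return -1;
    if ((size_t)pattern_position + (size_t)next_item_length > patlen) return -1;
    if ((size_t)next_item_length + 1 > bufsize) return -1;
    memcpy(buf, pattern + pattern_position, (size_t)next_item_length);
    buf[next_item_length] = '\0';   /* empty string when the length is zero */
    return next_item_length;
}
```

For the pattern A(\d{2}|--), a callout at position 1 with length 10 yields the whole parenthesized subpattern, and a callout at the end of the pattern yields an empty string.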
RETURN VALUES
- The external callout function returns an integer to PCRE. If the value
- is zero, matching proceeds as normal. If the value is greater than
- zero, matching fails at the current point, but the testing of other
+ The external callout function returns an integer to PCRE. If the value
+ is zero, matching proceeds as normal. If the value is greater than
+ zero, matching fails at the current point, but the testing of other
matching possibilities goes ahead, just as if a lookahead assertion had
- failed. If the value is less than zero, the match is abandoned, and
- pcre_exec() (or pcre_dfa_exec()) returns the negative value.
+ failed. If the value is less than zero, the match is abandoned, and
+ pcre_exec() or pcre_dfa_exec() returns the negative value.
- Negative values should normally be chosen from the set of
+ Negative values should normally be chosen from the set of
PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
- dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
- reserved for use by callout functions; it will never be used by PCRE
+ dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
+ reserved for use by callout functions; it will never be used by PCRE
itself.
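
The three kinds of return value can be illustrated with a small callout-style function. This is a sketch under stated assumptions: callout_view is a hypothetical stand-in holding just two of the fields described above (real code would include pcre.h and receive a pointer to the library's callout block), SKETCH_ERROR_CALLOUT is a local constant assumed to mirror PCRE_ERROR_CALLOUT, and the policy attached to callout numbers 1 and 2 is invented for the example.

```c
/* Assumed to mirror PCRE_ERROR_CALLOUT from pcre.h; use the macro there. */
#define SKETCH_ERROR_CALLOUT (-9)

/* Hypothetical stand-in for the callout block fields used below. */
typedef struct {
    int callout_number;    /* number after ?C, or 255 for automatic callouts */
    int current_position;  /* offset of the current match pointer */
} callout_view;

int sample_callout(const callout_view *cb)
{
    switch (cb->callout_number) {
    case 1:
        /* Positive: fail at this point, but let other alternatives be tried. */
        return (cb->current_position % 2 != 0) ? 1 : 0;
    case 2:
        /* Negative: abandon the match; the matching function returns this. */
        return SKETCH_ERROR_CALLOUT;
    default:
        /* Zero: matching proceeds as normal (includes automatic callouts). */
        return 0;
    }
}
```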
@@ -2844,11 +2889,11 @@ AUTHOR
REVISION
- Last updated: 15 March 2009
+ Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRECOMPAT(3) PCRECOMPAT(3)
@@ -2859,50 +2904,49 @@ NAME
DIFFERENCES BETWEEN PCRE AND PERL
This document describes the differences in the ways that PCRE and Perl
- handle regular expressions. The differences described here are mainly
- with respect to Perl 5.8, though PCRE versions 7.0 and later contain
- some features that are in Perl 5.10.
+ handle regular expressions. The differences described here are with
+ respect to Perl 5.10.
- 1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
- of what it does have are given in the section on UTF-8 support in the
+ 1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
+ of what it does have are given in the section on UTF-8 support in the
main pcre page.
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
- permits them, but they do not mean what you might think. For example,
+ permits them, but they do not mean what you might think. For example,
(?!a){3} does not assert that the next three characters are not "a". It
just asserts that the next character is not "a" three times.
- 3. Capturing subpatterns that occur inside negative lookahead asser-
- tions are counted, but their entries in the offsets vector are never
- set. Perl sets its numerical variables from any such patterns that are
+ 3. Capturing subpatterns that occur inside negative lookahead asser-
+ tions are counted, but their entries in the offsets vector are never
+ set. Perl sets its numerical variables from any such patterns that are
matched before the assertion fails to match something (thereby succeed-
- ing), but only if the negative lookahead assertion contains just one
+ ing), but only if the negative lookahead assertion contains just one
branch.
- 4. Though binary zero characters are supported in the subject string,
+ 4. Though binary zero characters are supported in the subject string,
they are not allowed in a pattern string because it is passed as a nor-
mal C string, terminated by zero. The escape sequence \0 can be used in
the pattern to represent a binary zero.
- 5. The following Perl escape sequences are not supported: \l, \u, \L,
+ 5. The following Perl escape sequences are not supported: \l, \u, \L,
\U, and \N. In fact these are implemented by Perl's general string-han-
- dling and are not part of its pattern matching engine. If any of these
+ dling and are not part of its pattern matching engine. If any of these
are encountered by PCRE, an error is generated.
- 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
- is built with Unicode character property support. The properties that
- can be tested with \p and \P are limited to the general category prop-
- erties such as Lu and Nd, script names such as Greek or Han, and the
- derived properties Any and L&. PCRE does support the Cs (surrogate)
- property, which Perl does not; the Perl documentation says "Because
+ 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
+ is built with Unicode character property support. The properties that
+ can be tested with \p and \P are limited to the general category prop-
+ erties such as Lu and Nd, script names such as Greek or Han, and the
+ derived properties Any and L&. PCRE does support the Cs (surrogate)
+ property, which Perl does not; the Perl documentation says "Because
Perl hides the need for the user to understand the internal representa-
- tion of Unicode characters, there is no need to implement the somewhat
+ tion of Unicode characters, there is no need to implement the somewhat
messy concept of surrogates."
7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
- ters in between are treated as literals. This is slightly different
- from Perl in that $ and @ are also handled as literals inside the
- quotes. In Perl, they cause variable interpolation (but of course PCRE
+ ters in between are treated as literals. This is slightly different
+ from Perl in that $ and @ are also handled as literals inside the
+ quotes. In Perl, they cause variable interpolation (but of course PCRE
does not have variables). Note the following examples:
Pattern PCRE matches Perl matches
@@ -2912,47 +2956,59 @@ DIFFERENCES BETWEEN PCRE AND PERL
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
- The \Q...\E sequence is recognized both inside and outside character
+ The \Q...\E sequence is recognized both inside and outside character
classes.
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
- constructions. However, there is support for recursive patterns. This
- is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
- "callout" feature allows an external function to be called during pat-
+ constructions. However, there is support for recursive patterns. This
+ is not available in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
+ "callout" feature allows an external function to be called during pat-
tern matching. See the pcrecallout documentation for details.
- 9. Subpatterns that are called recursively or as "subroutines" are
- always treated as atomic groups in PCRE. This is like Python, but
- unlike Perl. There is a discussion of an example that explains this in
- more detail in the section on recursion differences from Perl in the
- pcrecompat page.
+ 9. Subpatterns that are called recursively or as "subroutines" are
+ always treated as atomic groups in PCRE. This is like Python, but
+ unlike Perl. There is a discussion of an example that explains this in
+ more detail in the section on recursion differences from Perl in the
+ pcrepattern page.
- 10. There are some differences that are concerned with the settings of
- captured strings when part of a pattern is repeated. For example,
- matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
+ 10. There are some differences that are concerned with the settings of
+ captured strings when part of a pattern is repeated. For example,
+ matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
unset, but in PCRE it is set to "b".
11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT),
- (*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
+ (*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
the forms without an argument. PCRE does not support (*MARK).
- 12. PCRE provides some extensions to the Perl regular expression facil-
- ities. Perl 5.10 will include new features that are not in earlier
- versions, some of which (such as named parentheses) have been in PCRE
- for some time. This list is with respect to Perl 5.10:
-
- (a) Although lookbehind assertions must match fixed length strings,
- each alternative branch of a lookbehind assertion can match a different
- length of string. Perl requires them all to have the same length.
-
- (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
+ 12. PCRE's handling of duplicate subpattern numbers and duplicate sub-
+ pattern names is not as general as Perl's. This is a consequence of the
+ fact that PCRE works internally just with numbers, using an external ta-
+ ble to translate between numbers and names. In particular, a pattern
+ such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
+ the same number but different names, is not supported, and causes an
+ error at compile time. If it were allowed, it would not be possible to
+ distinguish which parentheses matched, because both names map to cap-
+ turing subpattern number 1. To avoid this confusing situation, an error
+ is given at compile time.
+
+ 13. PCRE provides some extensions to the Perl regular expression facil-
+ ities. Perl 5.10 includes new features that are not in earlier ver-
+ sions of Perl, some of which (such as named parentheses) have been in
+ PCRE for some time. This list is with respect to Perl 5.10:
+
+ (a) Although lookbehind assertions in PCRE must match fixed length
+ strings, each alternative branch of a lookbehind assertion can match a
+ different length of string. Perl requires them all to have the same
+ length.
+
+ (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
ignored. (Perl can be made to issue a warning.)
- (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
+ (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
fiers is inverted, that is, by default they are not greedy, but if fol-
lowed by a question mark they are.
@@ -2960,10 +3016,10 @@ DIFFERENCES BETWEEN PCRE AND PERL
tried only at the first matching position in the subject string.
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
- and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva-
+ and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva-
lents.
- (g) The \R escape sequence can be restricted to match only CR, LF, or
+ (g) The \R escape sequence can be restricted to match only CR, LF, or
CRLF by the PCRE_BSR_ANYCRLF option.
(h) The callout facility is PCRE-specific.
@@ -2973,10 +3029,10 @@ DIFFERENCES BETWEEN PCRE AND PERL
(j) Patterns compiled by PCRE can be saved and re-used at a later time,
even on different hosts that have the other endianness.
- (k) The alternative matching function (pcre_dfa_exec()) matches in a
+ (k) The alternative matching function (pcre_dfa_exec()) matches in a
different way and is not Perl-compatible.
- (l) PCRE recognizes some special sequences such as (*CR) at the start
+ (l) PCRE recognizes some special sequences such as (*CR) at the start
of a pattern that set overall options that cannot be changed within the
pattern.
@@ -2990,11 +3046,11 @@ AUTHOR
REVISION
- Last updated: 18 September 2009
+ Last updated: 04 October 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPATTERN(3) PCREPATTERN(3)
@@ -3021,9 +3077,9 @@ PCRE REGULAR EXPRESSION DETAILS
The original operation of PCRE was on strings of one-byte characters.
However, there is now also support for UTF-8 character strings. To use
- this, you must build PCRE to include UTF-8 support, and then call
- pcre_compile() with the PCRE_UTF8 option. There is also a special
- sequence that can be given at the start of a pattern:
+ this, PCRE must be built to include UTF-8 support, and you must call
+ pcre_compile() or pcre_compile2() with the PCRE_UTF8 option. There is
+ also a special sequence that can be given at the start of a pattern:
(*UTF8)
@@ -3061,9 +3117,9 @@ NEWLINE CONVENTIONS
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
- These override the default and the options given to pcre_compile(). For
- example, on a Unix system where LF is the default newline sequence, the
- pattern
+ These override the default and the options given to pcre_compile() or
+ pcre_compile2(). For example, on a Unix system where LF is the default
+ newline sequence, the pattern
(*CR)a.b
@@ -3180,7 +3236,7 @@ BACKSLASH
acters in patterns in a visible manner. There is no restriction on the
appearance of non-printing characters, apart from the binary zero that
terminates a pattern, but when a pattern is being prepared by text
- editing, it is usually easier to use one of the following escape
+ editing, it is often easier to use one of the following escape
sequences than the binary character it represents:
\a alarm, that is, the BEL character (hex 07)
@@ -3392,13 +3448,13 @@ BACKSLASH
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
- These override the default and the options given to pcre_compile(), but
- they can be overridden by options given to pcre_exec(). Note that these
- special settings, which are not Perl-compatible, are recognized only at
- the very start of a pattern, and that they must be in upper case. If
- more than one of them is present, the last one is used. They can be
- combined with a change of newline convention, for example, a pattern
- can start with:
+ These override the default and the options given to pcre_compile() or
+ pcre_compile2(), but they can be overridden by options given to
+ pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
+ are not Perl-compatible, are recognized only at the very start of a
+ pattern, and that they must be in upper case. If more than one of them
+ is present, the last one is used. They can be combined with a change of
+ newline convention, for example, a pattern can start with:
(*ANY)(*BSR_ANYCRLF)
@@ -3581,34 +3637,37 @@ BACKSLASH
A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
- string if the first or last character matches \w, respectively.
+ string if the first or last character matches \w, respectively. Neither
+ PCRE nor Perl has a separate "start of word" or "end of word" metase-
+ quence. However, whatever follows \b normally determines which it is.
+ For example, the fragment \ba matches "a" at the start of a word.
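
The word boundary definition above can be written out directly. This is a sketch, not PCRE's implementation: the function names are hypothetical, and \w is assumed to mean ASCII letters, digits, and underscore, ignoring locale tables and UTF-8.

```c
#include <ctype.h>
#include <stddef.h>

/* Assumed meaning of \w for this sketch: ASCII alphanumerics and '_'. */
static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Non-zero if position pos (0..length) in subject is a word boundary:
 * exactly one of the previous and current characters matches \w, where a
 * missing character at either end of the string counts as not matching. */
int is_word_boundary(const char *subject, size_t length, size_t pos)
{
    int prev = pos > 0 && is_word_char(subject[pos - 1]);
    int cur = pos < length && is_word_char(subject[pos]);
    return prev != cur;
}
```

With the subject "a cat", positions 0, 1, 2, and 5 are boundaries and position 3 (between "c" and "a") is not.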
- The \A, \Z, and \z assertions differ from the traditional circumflex
+ The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
- at the very start and end of the subject string, whatever options are
- set. Thus, they are independent of multiline mode. These three asser-
+ at the very start and end of the subject string, whatever options are
+ set. Thus, they are independent of multiline mode. These three asser-
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
- affect only the behaviour of the circumflex and dollar metacharacters.
- However, if the startoffset argument of pcre_exec() is non-zero, indi-
+ affect only the behaviour of the circumflex and dollar metacharacters.
+ However, if the startoffset argument of pcre_exec() is non-zero, indi-
cating that matching is to start at a point other than the beginning of
- the subject, \A can never match. The difference between \Z and \z is
+ the subject, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at
the very end, whereas \z matches only at the end.
- The \G assertion is true only when the current matching position is at
- the start point of the match, as specified by the startoffset argument
- of pcre_exec(). It differs from \A when the value of startoffset is
- non-zero. By calling pcre_exec() multiple times with appropriate argu-
+ The \G assertion is true only when the current matching position is at
+ the start point of the match, as specified by the startoffset argument
+ of pcre_exec(). It differs from \A when the value of startoffset is
+ non-zero. By calling pcre_exec() multiple times with appropriate argu-
ments, you can mimic Perl's /g option, and it is in this kind of imple-
mentation where \G can be useful.
- Note, however, that PCRE's interpretation of \G, as the start of the
+ Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
- end of the previous match. In Perl, these can be different when the
- previously matched string was empty. Because PCRE does just one match
+ end of the previous match. In Perl, these can be different when the
+ previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.
- If all the alternatives of a pattern begin with \G, the expression is
+ If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.
@@ -3616,90 +3675,90 @@ BACKSLASH
CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the circumflex
- character is an assertion that is true only if the current matching
- point is at the start of the subject string. If the startoffset argu-
- ment of pcre_exec() is non-zero, circumflex can never match if the
- PCRE_MULTILINE option is unset. Inside a character class, circumflex
+ character is an assertion that is true only if the current matching
+ point is at the start of the subject string. If the startoffset argu-
+ ment of pcre_exec() is non-zero, circumflex can never match if the
+ PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).
- Circumflex need not be the first character of the pattern if a number
- of alternatives are involved, but it should be the first thing in each
- alternative in which it appears if the pattern is ever to match that
- branch. If all possible alternatives start with a circumflex, that is,
- if the pattern is constrained to match only at the start of the sub-
- ject, it is said to be an "anchored" pattern. (There are also other
+ Circumflex need not be the first character of the pattern if a number
+ of alternatives are involved, but it should be the first thing in each
+ alternative in which it appears if the pattern is ever to match that
+ branch. If all possible alternatives start with a circumflex, that is,
+ if the pattern is constrained to match only at the start of the sub-
+ ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
- A dollar character is an assertion that is true only if the current
- matching point is at the end of the subject string, or immediately
+ A dollar character is an assertion that is true only if the current
+ matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
- be the last character of the pattern if a number of alternatives are
- involved, but it should be the last item in any branch in which it
+ be the last character of the pattern if a number of alternatives are
+ involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.
- The meaning of dollar can be changed so that it matches only at the
- very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
+ The meaning of dollar can be changed so that it matches only at the
+ very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are changed if the
- PCRE_MULTILINE option is set. When this is the case, a circumflex
- matches immediately after internal newlines as well as at the start of
- the subject string. It does not match after a newline that ends the
- string. A dollar matches before any newlines in the string, as well as
- at the very end, when PCRE_MULTILINE is set. When newline is specified
- as the two-character sequence CRLF, isolated CR and LF characters do
+ PCRE_MULTILINE option is set. When this is the case, a circumflex
+ matches immediately after internal newlines as well as at the start of
+ the subject string. It does not match after a newline that ends the
+ string. A dollar matches before any newlines in the string, as well as
+ at the very end, when PCRE_MULTILINE is set. When newline is specified
+ as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.
- For example, the pattern /^abc$/ matches the subject string "def\nabc"
- (where \n represents a newline) in multiline mode, but not otherwise.
- Consequently, patterns that are anchored in single line mode because
- all branches start with ^ are not anchored in multiline mode, and a
- match for circumflex is possible when the startoffset argument of
- pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
+ For example, the pattern /^abc$/ matches the subject string "def\nabc"
+ (where \n represents a newline) in multiline mode, but not otherwise.
+ Consequently, patterns that are anchored in single line mode because
+ all branches start with ^ are not anchored in multiline mode, and a
+ match for circumflex is possible when the startoffset argument of
+ pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.
- Note that the sequences \A, \Z, and \z can be used to match the start
- and end of the subject in both modes, and if all branches of a pattern
- start with \A it is always anchored, whether or not PCRE_MULTILINE is
+ Note that the sequences \A, \Z, and \z can be used to match the start
+ and end of the subject in both modes, and if all branches of a pattern
+ start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.
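
The circumflex and dollar rules above can be checked with a short sketch. It uses Python's re module rather than PCRE itself (an assumption made so the example is self-contained); for these particular cases the two engines should agree, with re.MULTILINE standing in for PCRE_MULTILINE.

```python
import re

subject = "def\nabc"

# Without multiline mode, ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# With multiline mode, ^ also matches immediately after an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

# \A anchors at the start of the subject whether or not multiline mode is set.
anchored = re.search(r"\Aabc", subject, re.MULTILINE)

print(single, multi, before_final_newline, anchored)
```

Only the multiline search and the trailing-newline search are expected to succeed here.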
FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any one charac-
- ter in the subject string except (by default) a character that signi-
- fies the end of a line. In UTF-8 mode, the matched character may be
+ ter in the subject string except (by default) a character that signi-
+ fies the end of a line. In UTF-8 mode, the matched character may be
more than one byte long.
- When a line ending is defined as a single character, dot never matches
- that character; when the two-character sequence CRLF is used, dot does
- not match CR if it is immediately followed by LF, but otherwise it
- matches all characters (including isolated CRs and LFs). When any Uni-
- code line endings are being recognized, dot does not match CR or LF or
+ When a line ending is defined as a single character, dot never matches
+ that character; when the two-character sequence CRLF is used, dot does
+ not match CR if it is immediately followed by LF, but otherwise it
+ matches all characters (including isolated CRs and LFs). When any Uni-
+ code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters.
- The behaviour of dot with regard to newlines can be changed. If the
- PCRE_DOTALL option is set, a dot matches any one character, without
+ The behaviour of dot with regard to newlines can be changed. If the
+ PCRE_DOTALL option is set, a dot matches any one character, without
exception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it.
- The handling of dot is entirely independent of the handling of circum-
- flex and dollar, the only relationship being that they both involve
+ The handling of dot is entirely independent of the handling of circum-
+ flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.
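
As a rough illustration of the dot rules, the sketch below again uses Python's re module as a stand-in for PCRE. This is an assumption: Python always treats LF as the only line ending for dot, whereas PCRE's newline convention is configurable. re.DOTALL plays the role of PCRE_DOTALL.

```python
import re

# By default, dot does not match the newline character.
no_newline = re.fullmatch(r"a.c", "a\nc")

# With the "dot matches all" option set, dot matches any one character.
with_dotall = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so even with "dot matches all" it takes two dots.
one_dot = re.fullmatch(r".", "\r\n", re.DOTALL)
two_dots = re.fullmatch(r"..", "\r\n", re.DOTALL)

print(no_newline, with_dotall, one_dot, two_dots)
```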
MATCHING A SINGLE BYTE
Outside a character class, the escape sequence \C matches any one byte,
- both in and out of UTF-8 mode. Unlike a dot, it always matches any
- line-ending characters. The feature is provided in Perl in order to
- match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
- acters into individual bytes, what remains in the string may be a mal-
- formed UTF-8 string. For this reason, the \C escape sequence is best
+ both in and out of UTF-8 mode. Unlike a dot, it always matches any
+ line-ending characters. The feature is provided in Perl in order to
+ match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
+ acters into individual bytes, what remains in the string may be a mal-
+ formed UTF-8 string. For this reason, the \C escape sequence is best
avoided.
- PCRE does not allow \C to appear in lookbehind assertions (described
- below), because in UTF-8 mode this would make it impossible to calcu-
+ PCRE does not allow \C to appear in lookbehind assertions (described
+ below), because in UTF-8 mode this would make it impossible to calcu-
late the length of the lookbehind.
@@ -3707,97 +3766,99 @@ SQUARE BRACKETS AND CHARACTER CLASSES
An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not spe-
- cial. If a closing square bracket is required as a member of the class,
- it should be the first data character in the class (after an initial
- circumflex, if present) or escaped with a backslash.
-
- A character class matches a single character in the subject. In UTF-8
- mode, the character may occupy more than one byte. A matched character
+ cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
+ a lone closing square bracket causes a compile-time error. If a closing
+ square bracket is required as a member of the class, it should be the
+ first data character in the class (after an initial circumflex, if
+ present) or escaped with a backslash.
+
+ A character class matches a single character in the subject. In UTF-8
+ mode, the character may be more than one byte long. A matched character
must be in the set of characters defined by the class, unless the first
- character in the class definition is a circumflex, in which case the
- subject character must not be in the set defined by the class. If a
- circumflex is actually required as a member of the class, ensure it is
+ character in the class definition is a circumflex, in which case the
+ subject character must not be in the set defined by the class. If a
+ circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.
- For example, the character class [aeiou] matches any lower case vowel,
- while [^aeiou] matches any character that is not a lower case vowel.
+ For example, the character class [aeiou] matches any lower case vowel,
+ while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
- characters that are in the class by enumerating those that are not. A
- class that starts with a circumflex is not an assertion: it still con-
- sumes a character from the subject string, and therefore it fails if
+ characters that are in the class by enumerating those that are not. A
+ class that starts with a circumflex is not an assertion; it still con-
+ sumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.
- In UTF-8 mode, characters with values greater than 255 can be included
- in a class as a literal string of bytes, or by using the \x{ escaping
+ In UTF-8 mode, characters with values greater than 255 can be included
+ in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.
- When caseless matching is set, any letters in a class represent both
- their upper case and lower case versions, so for example, a caseless
- [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
- match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
- understands the concept of case for characters whose values are less
- than 128, so caseless matching is always possible. For characters with
- higher values, the concept of case is supported if PCRE is compiled
- with Unicode property support, but not otherwise. If you want to use
- caseless matching for characters 128 and above, you must ensure that
- PCRE is compiled with Unicode property support as well as with UTF-8
- support.
-
- Characters that might indicate line breaks are never treated in any
- special way when matching character classes, whatever line-ending
- sequence is in use, and whatever setting of the PCRE_DOTALL and
+ When caseless matching is set, any letters in a class represent both
+ their upper case and lower case versions, so for example, a caseless
+ [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
+ match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
+ understands the concept of case for characters whose values are less
+ than 128, so caseless matching is always possible. For characters with
+ higher values, the concept of case is supported if PCRE is compiled
+ with Unicode property support, but not otherwise. If you want to use
+ caseless matching in UTF-8 mode for characters 128 and above, you must
+ ensure that PCRE is compiled with Unicode property support as well as
+ with UTF-8 support.
+
+ Characters that might indicate line breaks are never treated in any
+ special way when matching character classes, whatever line-ending
+ sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.
- The minus (hyphen) character can be used to specify a range of charac-
- ters in a character class. For example, [d-m] matches any letter
- between d and m, inclusive. If a minus character is required in a
- class, it must be escaped with a backslash or appear in a position
- where it cannot be interpreted as indicating a range, typically as the
+ The minus (hyphen) character can be used to specify a range of charac-
+ ters in a character class. For example, [d-m] matches any letter
+ between d and m, inclusive. If a minus character is required in a
+ class, it must be escaped with a backslash or appear in a position
+ where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.
It is not possible to have the literal character "]" as the end charac-
- ter of a range. A pattern such as [W-]46] is interpreted as a class of
- two characters ("W" and "-") followed by a literal string "46]", so it
- would match "W46]" or "-46]". However, if the "]" is escaped with a
- backslash it is interpreted as the end of range, so [W-\]46] is inter-
- preted as a class containing a range followed by two other characters.
- The octal or hexadecimal representation of "]" can also be used to end
+ ter of a range. A pattern such as [W-]46] is interpreted as a class of
+ two characters ("W" and "-") followed by a literal string "46]", so it
+ would match "W46]" or "-46]". However, if the "]" is escaped with a
+ backslash it is interpreted as the end of range, so [W-\]46] is inter-
+ preted as a class containing a range followed by two other characters.
+ The octal or hexadecimal representation of "]" can also be used to end
a range.
- Ranges operate in the collating sequence of character values. They can
- also be used for characters specified numerically, for example
- [\000-\037]. In UTF-8 mode, ranges can include characters whose values
+ Ranges operate in the collating sequence of character values. They can
+ also be used for characters specified numerically, for example
+ [\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].
If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
- to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
- character tables for a French locale are in use, [\xc8-\xcb] matches
- accented E characters in both cases. In UTF-8 mode, PCRE supports the
- concept of case for characters with values greater than 128 only when
+ to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
+ character tables for a French locale are in use, [\xc8-\xcb] matches
+ accented E characters in both cases. In UTF-8 mode, PCRE supports the
+ concept of case for characters with values greater than 128 only when
it is compiled with Unicode property support.
- The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
- in a character class, and add the characters that they match to the
+ The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
+ in a character class, and add the characters that they match to the
class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
- flex can conveniently be used with the upper case character types to
- specify a more restricted set of characters than the matching lower
- case type. For example, the class [^\W_] matches any letter or digit,
+ flex can conveniently be used with the upper case character types to
+ specify a more restricted set of characters than the matching lower
+ case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.
- The only metacharacters that are recognized in character classes are
- backslash, hyphen (only where it can be interpreted as specifying a
- range), circumflex (only at the start), opening square bracket (only
- when it can be interpreted as introducing a POSIX class name - see the
- next section), and the terminating closing square bracket. However,
+ The only metacharacters that are recognized in character classes are
+ backslash, hyphen (only where it can be interpreted as specifying a
+ range), circumflex (only at the start), opening square bracket (only
+ when it can be interpreted as introducing a POSIX class name - see the
+ next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.
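
Several of the character class examples in this section can be exercised directly. The sketch below uses Python's re module as an approximation of PCRE (an assumption; the helper name `accepts` is hypothetical). It only covers ASCII subjects, where the two engines should behave alike for these patterns.

```python
import re

def accepts(pattern, subject, flags=0):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject, flags) is not None

# Plain and negated classes.
vowel = accepts(r"[aeiou]", "e")
not_vowel = accepts(r"[^aeiou]", "x")

# A negated class still consumes a character, so it fails at end of string.
at_end = re.search(r"b[^aeiou]", "ab")

# Caseless matching applies to both plain and negated classes.
caseless_plain = accepts(r"[aeiou]", "A", re.IGNORECASE)
caseless_negated = accepts(r"[^aeiou]", "A", re.IGNORECASE)
caseful_negated = accepts(r"[^aeiou]", "A")

# [W-]46] is the class [W-] followed by the literal string "46]".
literal_tail = accepts(r"[W-]46]", "W46]") and accepts(r"[W-]46]", "-46]")

# [W-\]46] is a single class: the range W..] plus "4" and "6".
escaped_range = accepts(r"[W-\]46]", "X") and accepts(r"[W-\]46]", "4")
outside_range = accepts(r"[W-\]46]", "5")

# Character types inside a class.
hex_digit = accepts(r"[\dABCDEF]", "C")
word_no_underscore = accepts(r"[^\W_]", "7") and not accepts(r"[^\W_]", "_")
```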
POSIX CHARACTER CLASSES
Perl supports the POSIX notation for character classes. This uses names
- enclosed by [: and :] within the enclosing square brackets. PCRE also
+ enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,
[01[:alpha:]%]
@@ -3820,18 +3881,18 @@ POSIX CHARACTER CLASSES
word "word" characters (same as \w)
xdigit hexadecimal digits
- The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
- and space (32). Notice that this list includes the VT character (code
+ The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
+ and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).
- The name "word" is a Perl extension, and "blank" is a GNU extension
- from Perl 5.8. Another Perl extension is negation, which is indicated
+ The name "word" is a Perl extension, and "blank" is a GNU extension
+ from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,
[12[:^digit:]]
- matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
+ matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.
@@ -3841,24 +3902,24 @@ POSIX CHARACTER CLASSES
VERTICAL BAR
- Vertical bar characters are used to separate alternative patterns. For
+ Vertical bar characters are used to separate alternative patterns. For
example, the pattern
gilbert|sullivan
- matches either "gilbert" or "sullivan". Any number of alternatives may
- appear, and an empty alternative is permitted (matching the empty
+ matches either "gilbert" or "sullivan". Any number of alternatives may
+ appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
- to right, and the first one that succeeds is used. If the alternatives
- are within a subpattern (defined below), "succeeds" means matching the
+ to right, and the first one that succeeds is used. If the alternatives
+ are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.
INTERNAL OPTION SETTING
- The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
- PCRE_EXTENDED options (which are Perl-compatible) can be changed from
- within the pattern by a sequence of Perl option letters enclosed
+ The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
+ PCRE_EXTENDED options (which are Perl-compatible) can be changed from
+ within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are
i for PCRE_CASELESS
@@ -3868,46 +3929,46 @@ INTERNAL OPTION SETTING
For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a
- combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
- LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
- is also permitted. If a letter appears both before and after the
+ combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
+ LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
+ is also permitted. If a letter appears both before and after the
hyphen, the option is unset.
- The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
- can be changed in the same way as the Perl-compatible options by using
+ The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
+ can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.
- When one of these option changes occurs at top level (that is, not
- inside subpattern parentheses), the change applies to the remainder of
+ When one of these option changes occurs at top level (that is, not
+ inside subpattern parentheses), the change applies to the remainder of
the pattern that follows. If the change is placed right at the start of
a pattern, PCRE extracts it into the global options (and it will there-
fore show up in data extracted by the pcre_fullinfo() function).
- An option change within a subpattern (see below for a description of
+ An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the current pattern that follows
it, so
(a(?i)b)c
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
- used). By this means, options can be made to have different settings
- in different parts of the pattern. Any changes made in one alternative
- do carry on into subsequent branches within the same subpattern. For
+ used). By this means, options can be made to have different settings
+ in different parts of the pattern. Any changes made in one alternative
+ do carry on into subsequent branches within the same subpattern. For
example,
(a(?i)b|c)
- matches "ab", "aB", "c", and "C", even though when matching "C" the
- first branch is abandoned before the option setting. This is because
- the effects of option settings happen at compile time. There would be
+ matches "ab", "aB", "c", and "C", even though when matching "C" the
+ first branch is abandoned before the option setting. This is because
+ the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.
- Note: There are other PCRE-specific options that can be set by the
- application when the compile or match functions are called. In some
+ Note: There are other PCRE-specific options that can be set by the
+ application when the compile or match functions are called. In some
cases the pattern can contain special leading sequences such as (*CRLF)
- to override what the application has set or what has been defaulted.
- Details are given in the section entitled "Newline sequences" above.
- There is also the (*UTF8) leading sequence that can be used to set
+ to override what the application has set or what has been defaulted.
+ Details are given in the section entitled "Newline sequences" above.
+ There is also the (*UTF8) leading sequence that can be used to set
UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option.
@@ -3920,18 +3981,18 @@ SUBPATTERNS
cat(aract|erpillar|)
- matches one of the words "cat", "cataract", or "caterpillar". Without
- the parentheses, it would match "cataract", "erpillar" or an empty
+ matches one of the words "cat", "cataract", or "caterpillar". Without
+ the parentheses, it would match "cataract", "erpillar" or an empty
string.
- 2. It sets up the subpattern as a capturing subpattern. This means
- that, when the whole pattern matches, that portion of the subject
+ 2. It sets up the subpattern as a capturing subpattern. This means
+ that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
- ovector argument of pcre_exec(). Opening parentheses are counted from
- left to right (starting from 1) to obtain numbers for the capturing
+ ovector argument of pcre_exec(). Opening parentheses are counted from
+ left to right (starting from 1) to obtain numbers for the capturing
subpatterns.
- For example, if the string "the red king" is matched against the pat-
+ For example, if the string "the red king" is matched against the pat-
tern
the ((red|white) (king|queen))
@@ -3939,12 +4000,12 @@ SUBPATTERNS
the captured substrings are "red king", "red", and "king", and are num-
bered 1, 2, and 3, respectively.
- The fact that plain parentheses fulfil two functions is not always
- helpful. There are often times when a grouping subpattern is required
- without a capturing requirement. If an opening parenthesis is followed
- by a question mark and a colon, the subpattern does not do any captur-
- ing, and is not counted when computing the number of any subsequent
- capturing subpatterns. For example, if the string "the white queen" is
+ The fact that plain parentheses fulfil two functions is not always
+ helpful. There are often times when a grouping subpattern is required
+ without a capturing requirement. If an opening parenthesis is followed
+ by a question mark and a colon, the subpattern does not do any captur-
+ ing, and is not counted when computing the number of any subsequent
+ capturing subpatterns. For example, if the string "the white queen" is
matched against the pattern
the ((?:red|white) (king|queen))
@@ -3952,46 +4013,59 @@ SUBPATTERNS
the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
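
The numbering of capturing and non-capturing subpatterns described above can be sketched as follows. Python's re module is used here in place of pcre_exec() and its ovector (an assumption made for brevity); group numbering by opening parenthesis works the same way in both.

```python
import re

# Localizing alternatives: the empty third alternative lets "cat" match.
cat = re.fullmatch(r"cat(aract|erpillar|)", "cat")

# Capturing groups are numbered by opening parenthesis, left to right.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is skipped in the numbering.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")

print(cat.group(1) == "")
print(capturing.groups())
print(non_capturing.groups())
```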
- As a convenient shorthand, if any option settings are required at the
- start of a non-capturing subpattern, the option letters may appear
+ As a convenient shorthand, if any option settings are required at the
+ start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are
- tried from left to right, and options are not reset until the end of
- the subpattern is reached, an option setting in one branch does affect
- subsequent branches, so the above patterns match "SUNDAY" as well as
+ tried from left to right, and options are not reset until the end of
+ the subpattern is reached, an option setting in one branch does affect
+ subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".
DUPLICATE SUBPATTERN NUMBERS
Perl 5.10 introduced a feature whereby each alternative in a subpattern
- uses the same numbers for its capturing parentheses. Such a subpattern
- starts with (?| and is itself a non-capturing subpattern. For example,
+ uses the same numbers for its capturing parentheses. Such a subpattern
+ starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:
(?|(Sat)ur|(Sun))day
- Because the two alternatives are inside a (?| group, both sets of cap-
- turing parentheses are numbered one. Thus, when the pattern matches,
- you can look at captured substring number one, whichever alternative
- matched. This construct is useful when you want to capture part, but
+ Because the two alternatives are inside a (?| group, both sets of cap-
+ turing parentheses are numbered one. Thus, when the pattern matches,
+ you can look at captured substring number one, whichever alternative
+ matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group, paren-
- theses are numbered as usual, but the number is reset at the start of
- each branch. The numbers of any capturing buffers that follow the sub-
- pattern start after the highest number used in any branch. The follow-
- ing example is taken from the Perl documentation. The numbers under-
+ theses are numbered as usual, but the number is reset at the start of
+ each branch. The numbers of any capturing buffers that follow the sub-
+ pattern start after the highest number used in any branch. The follow-
+ ing example is taken from the Perl documentation. The numbers under-
neath show in which buffer the captured content will be stored.
# before ---------------branch-reset----------- after
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
- A backreference or a recursive call to a numbered subpattern always
- refers to the first one in the pattern with the given number.
+ A backreference to a numbered subpattern uses the most recent value
+ that is set for that number by any subpattern. The following pattern
+ matches "abcabc" or "defdef":
+
+ /(?|(abc)|(def))\1/
+
+ In contrast, a recursive or "subroutine" call to a numbered subpattern
+ always refers to the first one in the pattern with the given number.
+ The following pattern matches "abcabc" or "defabc":
+
+ /(?|(abc)|(def))(?1)/
+
+ If a condition test for a subpattern's having matched refers to a non-
+ unique number, the test is true if any of the subpatterns of that num-
+ ber have matched.
An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
@@ -4006,26 +4080,29 @@ NAMED SUBPATTERNS
patterns. This feature was not added to Perl until release 5.10. Python
had the feature earlier, and PCRE introduced it at release 4.0, using
the Python syntax. PCRE now supports both the Perl and the Python syn-
- tax.
+ tax. Perl allows identically numbered subpatterns to have different
+ names, but PCRE does not.
- In PCRE, a subpattern can be named in one of three ways: (?<name>...)
- or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
+ In PCRE, a subpattern can be named in one of three ways: (?<name>...)
+ or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
to capturing parentheses from other parts of the pattern, such as back-
- references, recursion, and conditions, can be made by name as well as
+ references, recursion, and conditions, can be made by name as well as
by number.
- Names consist of up to 32 alphanumeric characters and underscores.
- Named capturing parentheses are still allocated numbers as well as
- names, exactly as if the names were not present. The PCRE API provides
+ Names consist of up to 32 alphanumeric characters and underscores.
+ Named capturing parentheses are still allocated numbers as well as
+ names, exactly as if the names were not present. The PCRE API provides
function calls for extracting the name-to-number translation table from
a compiled pattern. There is also a convenience function for extracting
a captured substring by name.
- By default, a name must be unique within a pattern, but it is possible
+ By default, a name must be unique within a pattern, but it is possible
to relax this constraint by setting the PCRE_DUPNAMES option at compile
- time. This can be useful for patterns where only one instance of the
- named parentheses can match. Suppose you want to match the name of a
- weekday, either as a 3-letter abbreviation or as the full name, and in
+ time. (Duplicate names are also always permitted for subpatterns with
+ the same number, set up as described in the previous section.) Dupli-
+ cate names can be useful for patterns where only one instance of the
+ named parentheses can match. Suppose you want to match the name of a
+ weekday, either as a 3-letter abbreviation or as the full name, and in
both cases you want to extract the abbreviation. This pattern (ignoring
the line breaks) does the job:
@@ -4035,26 +4112,38 @@ NAMED SUBPATTERNS
(?<DN>Thu)(?:rsday)?|
(?<DN>Sat)(?:urday)?
- There are five capturing substrings, but only one is ever set after a
+ There are five capturing substrings, but only one is ever set after a
match. (An alternative way of solving this problem is to use a "branch
reset" subpattern, as described in the previous section.)
- The convenience function for extracting the data by name returns the
- substring for the first (and in this example, the only) subpattern of
- that name that matched. This saves searching to find which numbered
- subpattern it was. If you make a reference to a non-unique named sub-
- pattern from elsewhere in the pattern, the one that corresponds to the
- lowest number is used. For further details of the interfaces for han-
- dling named subpatterns, see the pcreapi documentation.
+ The convenience function for extracting the data by name returns the
+ substring for the first (and in this example, the only) subpattern of
+ that name that matched. This saves searching to find which numbered
+ subpattern it was.
+
+ If you make a backreference to a non-unique named subpattern from else-
+ where in the pattern, the one that corresponds to the first occurrence
+ of the name is used. In the absence of duplicate numbers (see the pre-
+ vious section) this is the one with the lowest number. If you use a
+ named reference in a condition test (see the section about conditions
+ below), either to check whether a subpattern has matched, or to check
+ for recursion, all subpatterns with the same name are tested. If the
+ condition is true for any one of them, the overall condition is true.
+ This is the same behaviour as testing by number. For further details of
+ the interfaces for handling named subpatterns, see the pcreapi documen-
+ tation.
Warning: You cannot use different names to distinguish between two sub-
- patterns with the same number (see the previous section) because PCRE
- uses only the numbers when matching.
+ patterns with the same number because PCRE uses only the numbers when
+ matching. For this reason, an error is given at compile time if differ-
+ ent names are given to subpatterns with the same number. However, you
+ can give the same name to subpatterns with the same number, even when
+ PCRE_DUPNAMES is not set.
REPETITION
- Repetition is specified by quantifiers, which can follow any of the
+ Repetition is specified by quantifiers, which can follow any of the
following items:
a literal data character
@@ -4066,18 +4155,19 @@ REPETITION
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion)
+ a recursive or "subroutine" call to a subpattern
- The general repetition quantifier specifies a minimum and maximum num-
- ber of permitted matches, by giving the two numbers in curly brackets
- (braces), separated by a comma. The numbers must be less than 65536,
+ The general repetition quantifier specifies a minimum and maximum num-
+ ber of permitted matches, by giving the two numbers in curly brackets
+ (braces), separated by a comma. The numbers must be less than 65536,
and the first must be less than or equal to the second. For example:
z{2,4}
- matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
- special character. If the second number is omitted, but the comma is
- present, there is no upper limit; if the second number and the comma
- are both omitted, the quantifier specifies an exact number of required
+ matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
+ special character. If the second number is omitted, but the comma is
+ present, there is no upper limit; if the second number and the comma
+ are both omitted, the quantifier specifies an exact number of required
matches. Thus
[aeiou]{3,}
@@ -4086,49 +4176,49 @@ REPETITION
\d{8}
- matches exactly 8 digits. An opening curly bracket that appears in a
- position where a quantifier is not allowed, or one that does not match
- the syntax of a quantifier, is taken as a literal character. For exam-
+ matches exactly 8 digits. An opening curly bracket that appears in a
+ position where a quantifier is not allowed, or one that does not match
+ the syntax of a quantifier, is taken as a literal character. For exam-
ple, {,6} is not a quantifier, but a literal string of four characters.
- In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
+ In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
acters, each of which is represented by a two-byte sequence. Similarly,
when Unicode property support is available, \X{3} matches three Unicode
- extended sequences, each of which may be several bytes long (and they
+ extended sequences, each of which may be several bytes long (and they
may be of different lengths).
The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present. This may be use-
- ful for subpatterns that are referenced as subroutines from elsewhere
+ ful for subpatterns that are referenced as subroutines from elsewhere
in the pattern. Items other than subpatterns that have a {0} quantifier
are omitted from the compiled pattern.
- For convenience, the three most common quantifiers have single-charac-
+ For convenience, the three most common quantifiers have single-charac-
ter abbreviations:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
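The equivalences listed above can be checked with a short sketch. It uses Python's re module rather than PCRE itself (an assumption made purely for illustration; the three shorthand quantifiers and the {min,max} form behave the same way there). The names EQUIVALENTS, SUBJECTS and same_behaviour are hypothetical.

```python
import re

# Each shorthand quantifier paired with its brace form, as listed above.
EQUIVALENTS = [(r"z*", r"z{0,}"), (r"z+", r"z{1,}"), (r"z?", r"z{0,1}")]
SUBJECTS = ["", "z", "zz", "zzzz", "a"]


def same_behaviour(short, brace):
    """True if both patterns accept or reject every subject identically."""
    return all(
        bool(re.fullmatch(short, s)) == bool(re.fullmatch(brace, s))
        for s in SUBJECTS
    )


all_equivalent = all(same_behaviour(a, b) for a, b in EQUIVALENTS)

# z{2,4} accepts exactly two, three, or four repetitions.
bounded = [s for s in ["z", "zz", "zzz", "zzzz", "zzzzz"]
           if re.fullmatch(r"z{2,4}", s)]
```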
- It is possible to construct infinite loops by following a subpattern
+ It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit,
for example:
(a?)*
Earlier versions of Perl and PCRE used to give an error at compile time
- for such patterns. However, because there are cases where this can be
- useful, such patterns are now accepted, but if any repetition of the
- subpattern does in fact match no characters, the loop is forcibly bro-
+ for such patterns. However, because there are cases where this can be
+ useful, such patterns are now accepted, but if any repetition of the
+ subpattern does in fact match no characters, the loop is forcibly bro-
ken.
- By default, the quantifiers are "greedy", that is, they match as much
- as possible (up to the maximum number of permitted times), without
- causing the rest of the pattern to fail. The classic example of where
+ By default, the quantifiers are "greedy", that is, they match as much
+ as possible (up to the maximum number of permitted times), without
+ causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs. These
- appear between /* and */ and within the comment, individual * and /
- characters may appear. An attempt to match C comments by applying the
+ appear between /* and */ and within the comment, individual * and /
+ characters may appear. An attempt to match C comments by applying the
pattern
/\*.*\*/
@@ -4137,19 +4227,19 @@ REPETITION
/* first comment */ not comment /* second comment */
- fails, because it matches the entire string owing to the greediness of
+ fails, because it matches the entire string owing to the greediness of
the .* item.
- However, if a quantifier is followed by a question mark, it ceases to
+ However, if a quantifier is followed by a question mark, it ceases to
be greedy, and instead matches the minimum number of times possible, so
the pattern
/\*.*?\*/
- does the right thing with the C comments. The meaning of the various
- quantifiers is not otherwise changed, just the preferred number of
- matches. Do not confuse this use of question mark with its use as a
- quantifier in its own right. Because it has two uses, it can sometimes
+ does the right thing with the C comments. The meaning of the various
+ quantifiers is not otherwise changed, just the preferred number of
+ matches. Do not confuse this use of question mark with its use as a
+ quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in
\d??\d
@@ -4157,36 +4247,36 @@ REPETITION
which matches one digit by preference, but can match two if that is the
only way the rest of the pattern matches.
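The difference between the greedy and minimizing forms discussed above can be seen in a small sketch. It is written against Python's re module, not PCRE (an assumption for illustration; both treat these quantifiers alike), using the C-comment subject from the text.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs on to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing: .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```

Here re.fullmatch stands in for "the rest of the pattern" by forcing the match to reach the end of the subject.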
- If the PCRE_UNGREEDY option is set (an option that is not available in
- Perl), the quantifiers are not greedy by default, but individual ones
- can be made greedy by following them with a question mark. In other
+ If the PCRE_UNGREEDY option is set (an option that is not available in
+ Perl), the quantifiers are not greedy by default, but individual ones
+ can be made greedy by following them with a question mark. In other
words, it inverts the default behaviour.
- When a parenthesized subpattern is quantified with a minimum repeat
- count that is greater than 1 or with a limited maximum, more memory is
- required for the compiled pattern, in proportion to the size of the
+ When a parenthesized subpattern is quantified with a minimum repeat
+ count that is greater than 1 or with a limited maximum, more memory is
+ required for the compiled pattern, in proportion to the size of the
minimum or maximum.
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
- alent to Perl's /s) is set, thus allowing the dot to match newlines,
- the pattern is implicitly anchored, because whatever follows will be
- tried against every character position in the subject string, so there
- is no point in retrying the overall match at any position after the
- first. PCRE normally treats such a pattern as though it were preceded
+ alent to Perl's /s) is set, thus allowing the dot to match newlines,
+ the pattern is implicitly anchored, because whatever follows will be
+ tried against every character position in the subject string, so there
+ is no point in retrying the overall match at any position after the
+ first. PCRE normally treats such a pattern as though it were preceded
by \A.
- In cases where it is known that the subject string contains no new-
- lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
+ In cases where it is known that the subject string contains no new-
+ lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
mization, or alternatively using ^ to indicate anchoring explicitly.
- However, there is one situation where the optimization cannot be used.
- When .* is inside capturing parentheses that are the subject of a
- backreference elsewhere in the pattern, a match at the start may fail
+ However, there is one situation where the optimization cannot be used.
+ When .* is inside capturing parentheses that are the subject of a
+ backreference elsewhere in the pattern, a match at the start may fail
where a later one succeeds. Consider, for example:
(.*)abc\1
- If the subject is "xyz123abc123" the match point is the fourth charac-
+ If the subject is "xyz123abc123" the match point is the fourth charac-
ter. For this reason, such a pattern is not implicitly anchored.
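A minimal sketch of why this pattern cannot be anchored, using Python's re module as a stand-in for PCRE (an assumption for illustration): no attempt starting at the first three characters can succeed, so the match starts at the fourth.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Offsets are zero-based, so 3 is the fourth character of the subject.
match_offset = m.start()
captured = m.group(1)
```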
When a capturing subpattern is repeated, the value captured is the sub-
@@ -4195,8 +4285,8 @@ REPETITION
(tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the captured substring
- is "tweedledee". However, if there are nested capturing subpatterns,
- the corresponding captured values may have been set in previous itera-
+ is "tweedledee". However, if there are nested capturing subpatterns,
+ the corresponding captured values may have been set in previous itera-
tions. For example, after
/(a|(b))+/
@@ -4206,53 +4296,53 @@ REPETITION
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
- With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
- repetition, failure of what follows normally causes the repeated item
- to be re-evaluated to see if a different number of repeats allows the
- rest of the pattern to match. Sometimes it is useful to prevent this,
- either to change the nature of the match, or to cause it to fail earlier
- than it otherwise might, when the author of the pattern knows there is
+ With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
+ repetition, failure of what follows normally causes the repeated item
+ to be re-evaluated to see if a different number of repeats allows the
+ rest of the pattern to match. Sometimes it is useful to prevent this,
+ either to change the nature of the match, or to cause it to fail earlier
+ than it otherwise might, when the author of the pattern knows there is
no point in carrying on.
- Consider, for example, the pattern \d+foo when applied to the subject
+ Consider, for example, the pattern \d+foo when applied to the subject
line
123456bar
After matching all 6 digits and then failing to match "foo", the normal
- action of the matcher is to try again with only 5 digits matching the
- \d+ item, and then with 4, and so on, before ultimately failing.
- "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
- the means for specifying that once a subpattern has matched, it is not
+ action of the matcher is to try again with only 5 digits matching the
+ \d+ item, and then with 4, and so on, before ultimately failing.
+ "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
+ the means for specifying that once a subpattern has matched, it is not
to be re-evaluated in this way.
- If we use atomic grouping for the previous example, the matcher gives
- up immediately on failing to match "foo" the first time. The notation
+ If we use atomic grouping for the previous example, the matcher gives
+ up immediately on failing to match "foo" the first time. The notation
is a kind of special parenthesis, starting with (?> as in this example:
(?>\d+)foo
- This kind of parenthesis "locks up" the part of the pattern it con-
- tains once it has matched, and a failure further into the pattern is
- prevented from backtracking into it. Backtracking past it to previous
+ This kind of parenthesis "locks up" the part of the pattern it con-
+ tains once it has matched, and a failure further into the pattern is
+ prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.
- An alternative description is that a subpattern of this type matches
- the string of characters that an identical standalone pattern would
+ An alternative description is that a subpattern of this type matches
+ the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.
Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that
- must swallow everything it can. So, while both \d+ and \d+? are pre-
- pared to adjust the number of digits they match in order to make the
+ must swallow everything it can. So, while both \d+ and \d+? are pre-
+ pared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.
- Atomic groups in general can of course contain arbitrarily complicated
- subpatterns, and can be nested. However, when the subpattern for an
+ Atomic groups in general can of course contain arbitrarily complicated
+ subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a
- simpler notation, called a "possessive quantifier" can be used. This
- consists of an additional + character following a quantifier. Using
+ simpler notation, called a "possessive quantifier" can be used. This
+ consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as
\d++foo
@@ -4262,45 +4352,45 @@ ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
(abc|xyz){2,3}+
- Possessive quantifiers are always greedy; the setting of the
+ Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation for the
- simpler forms of atomic group. However, there is no difference in the
- meaning of a possessive quantifier and the equivalent atomic group,
- though there may be a performance difference; possessive quantifiers
+ simpler forms of atomic group. However, there is no difference in the
+ meaning of a possessive quantifier and the equivalent atomic group,
+ though there may be a performance difference; possessive quantifiers
should be slightly faster.
- The possessive quantifier syntax is an extension to the Perl 5.8 syn-
- tax. Jeffrey Friedl originated the idea (and the name) in the first
+ The possessive quantifier syntax is an extension to the Perl 5.8 syn-
+ tax. Jeffrey Friedl originated the idea (and the name) in the first
edition of his book. Mike McCloskey liked it, so implemented it when he
- built Sun's Java package, and PCRE copied it from there. It ultimately
+ built Sun's Java package, and PCRE copied it from there. It ultimately
found its way into Perl at release 5.10.
PCRE has an optimization that automatically "possessifies" certain sim-
- ple pattern constructs. For example, the sequence A+B is treated as
- A++B because there is no point in backtracking into a sequence of A's
+ ple pattern constructs. For example, the sequence A+B is treated as
+ A++B because there is no point in backtracking into a sequence of A's
when B must follow.
- When a pattern contains an unlimited repeat inside a subpattern that
- can itself be repeated an unlimited number of times, the use of an
- atomic group is the only way to avoid some failing matches taking a
+ When a pattern contains an unlimited repeat inside a subpattern that
+ can itself be repeated an unlimited number of times, the use of an
+ atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern
(\D+|<\d+>)*[!?]
- matches an unlimited number of substrings that either consist of non-
- digits, or digits enclosed in <>, followed by either ! or ?. When it
+ matches an unlimited number of substrings that either consist of non-
+ digits, or digits enclosed in <>, followed by either ! or ?. When it
matches, it runs quickly. However, if it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
- it takes a long time before reporting failure. This is because the
- string can be divided between the internal \D+ repeat and the external
- * repeat in a large number of ways, and all have to be tried. (The
- example uses [!?] rather than a single character at the end, because
- both PCRE and Perl have an optimization that allows for fast failure
- when a single character is used. They remember the last single charac-
- ter that is required for a match, and fail early if it is not present
- in the string.) If the pattern is changed so that it uses an atomic
+ it takes a long time before reporting failure. This is because the
+ string can be divided between the internal \D+ repeat and the external
+ * repeat in a large number of ways, and all have to be tried. (The
+ example uses [!?] rather than a single character at the end, because
+ both PCRE and Perl have an optimization that allows for fast failure
+ when a single character is used. They remember the last single charac-
+ ter that is required for a match, and fail early if it is not present
+ in the string.) If the pattern is changed so that it uses an atomic
group, like this:
((?>\D+)|<\d+>)*[!?]
@@ -4312,37 +4402,37 @@ BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing sub-
- pattern earlier (that is, to its left) in the pattern, provided there
+ pattern earlier (that is, to its left) in the pattern, provided there
have been that many previous capturing left parentheses.
However, if the decimal number following the backslash is less than 10,
- it is always taken as a back reference, and causes an error only if
- there are not that many capturing left parentheses in the entire pat-
- tern. In other words, the parentheses that are referenced need not be
- to the left of the reference for numbers less than 10. A "forward back
- reference" of this type can make sense when a repetition is involved
- and the subpattern to the right has participated in an earlier itera-
+ it is always taken as a back reference, and causes an error only if
+ there are not that many capturing left parentheses in the entire pat-
+ tern. In other words, the parentheses that are referenced need not be
+ to the left of the reference for numbers less than 10. A "forward back
+ reference" of this type can make sense when a repetition is involved
+ and the subpattern to the right has participated in an earlier itera-
tion.
- It is not possible to have a numerical "forward back reference" to a
- subpattern whose number is 10 or more using this syntax because a
- sequence such as \50 is interpreted as a character defined in octal.
+ It is not possible to have a numerical "forward back reference" to a
+ subpattern whose number is 10 or more using this syntax because a
+ sequence such as \50 is interpreted as a character defined in octal.
See the subsection entitled "Non-printing characters" above for further
- details of the handling of digits following a backslash. There is no
- such problem when named parentheses are used. A back reference to any
+ details of the handling of digits following a backslash. There is no
+ such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).
- Another way of avoiding the ambiguity inherent in the use of digits
+ Another way of avoiding the ambiguity inherent in the use of digits
following a backslash is to use the \g escape sequence, which is a fea-
- ture introduced in Perl 5.10. This escape must be followed by an
- unsigned number or a negative number, optionally enclosed in braces.
+ ture introduced in Perl 5.10. This escape must be followed by an
+ unsigned number or a negative number, optionally enclosed in braces.
These examples are all identical:
(ring), \1
(ring), \g1
(ring), \g{1}
- An unsigned number specifies an absolute reference without the ambigu-
+ An unsigned number specifies an absolute reference without the ambigu-
ity that is present in the older syntax. It is also useful when literal
digits follow the reference. A negative number is a relative reference.
Consider this example:
@@ -4350,33 +4440,33 @@ BACK REFERENCES
(abc(def)ghi)\g{-1}
The sequence \g{-1} is a reference to the most recently started captur-
- ing subpattern before \g, that is, it is equivalent to \2. Similarly,
+ ing subpattern before \g, that is, it is equivalent to \2. Similarly,
\g{-2} would be equivalent to \1. The use of relative references can be
- helpful in long patterns, and also in patterns that are created by
+ helpful in long patterns, and also in patterns that are created by
joining together fragments that contain references within themselves.
- A back reference matches whatever actually matched the capturing sub-
- pattern in the current subject string, rather than anything matching
+ A back reference matches whatever actually matched the capturing sub-
+ pattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern
(sens|respons)e and \1ibility
- matches "sense and sensibility" and "response and responsibility", but
- not "sense and responsibility". If caseful matching is in force at the
- time of the back reference, the case of letters is relevant. For exam-
+ matches "sense and sensibility" and "response and responsibility", but
+ not "sense and responsibility". If caseful matching is in force at the
+ time of the back reference, the case of letters is relevant. For exam-
ple,
((?i)rah)\s+\1
- matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
+ matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
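The point that a back reference matches the text actually captured, not the subpattern, can be sketched with the "sense and sensibility" pattern from above. The example uses Python's re module rather than PCRE (an assumption for illustration; numbered back references work the same way in both).

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat whatever group 1 captured in this particular match.
results = {s: bool(pattern.fullmatch(s)) for s in subjects}
```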
- There are several different ways of writing back references to named
- subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
- \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
+ There are several different ways of writing back references to named
+ subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
+ \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
unified back reference syntax, in which \g can be used for both numeric
- and named references, is also supported. We could rewrite the above
+ and named references, is also supported. We could rewrite the above
example in any of the following ways:
(?<p1>(?i)rah)\s+\k<p1>
@@ -4384,22 +4474,25 @@ BACK REFERENCES
(?P<p1>(?i)rah)\s+(?P=p1)
(?<p1>(?i)rah)\s+\g{p1}
- A subpattern that is referenced by name may appear in the pattern
+ A subpattern that is referenced by name may appear in the pattern
before or after the reference.
- There may be more than one back reference to the same subpattern. If a
- subpattern has not actually been used in a particular match, any back
- references to it always fail. For example, the pattern
+ There may be more than one back reference to the same subpattern. If a
+ subpattern has not actually been used in a particular match, any back
+ references to it always fail by default. For example, the pattern
(a|(bc))\2
- always fails if it starts to match "a" rather than "bc". Because there
- may be many capturing parentheses in a pattern, all digits following
- the backslash are taken as part of a potential back reference number.
- If the pattern continues with a digit character, some delimiter must be
- used to terminate the back reference. If the PCRE_EXTENDED option is
- set, this can be whitespace. Otherwise an empty comment (see "Com-
- ments" below) can be used.
+ always fails if it starts to match "a" rather than "bc". However, if
+ the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
+ ence to an unset value matches an empty string.
+
+ Because there may be many capturing parentheses in a pattern, all dig-
+ its following a backslash are taken as part of a potential back refer-
+ ence number. If the pattern continues with a digit character, some
+ delimiter must be used to terminate the back reference. If the
+ PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
+ syntax or an empty comment (see "Comments" below) can be used.
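The default behaviour for an unset group can be illustrated as follows. This sketch uses Python's re module, which (like PCRE without PCRE_JAVASCRIPT_COMPAT) makes a back reference to an unset group fail; it does not demonstrate the JavaScript-compatible behaviour.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is unset when the first alternative matches "a", so \2 fails.
unset_fails = pattern.search("aa") is None

# When "bc" is taken, group 2 is set and \2 must match "bc" again.
set_match = pattern.fullmatch("bcbc")
```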
A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never
@@ -4462,19 +4555,20 @@ ASSERTIONS
If you want to force a matching failure at some point in a pattern, the
most convenient way to do it is with (?!) because an empty string
always matches, so an assertion that requires there not to be an empty
- string must always fail.
+ string must always fail. The Perl 5.10 backtracking control verb
+ (*FAIL) or (*F) is essentially a synonym for (?!).
Lookbehind assertions
- Lookbehind assertions start with (?<= for positive assertions and (?<!
+ Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,
(?<!foo)bar
- does find an occurrence of "bar" that is not preceded by "foo". The
- contents of a lookbehind assertion are restricted such that all the
+ does find an occurrence of "bar" that is not preceded by "foo". The
+ contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are sev-
- eral top-level alternatives, they do not all have to have the same
+ eral top-level alternatives, they do not all have to have the same
fixed length. Thus
(?<=bullock|donkey)
@@ -4483,62 +4577,62 @@ ASSERTIONS
(?<!dogs?|cats?)
- causes an error at compile time. Branches that match different length
- strings are permitted only at the top level of a lookbehind assertion.
- This is an extension compared with Perl (5.8 and 5.10), which requires
+ causes an error at compile time. Branches that match different length
+ strings are permitted only at the top level of a lookbehind assertion.
+ This is an extension compared with Perl (5.8 and 5.10), which requires
all branches to match the same length of string. An assertion such as
(?<=ab(c|de))
- is not permitted, because its single top-level branch can match two
+ is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable to PCRE if rewritten to use two
top-level branches:
(?<=abc|abde)
In some cases, the Perl 5.10 escape sequence \K (see above) can be used
- instead of a lookbehind assertion to get round the fixed-length
+ instead of a lookbehind assertion to get round the fixed-length
restriction.
- The implementation of lookbehind assertions is, for each alternative,
- to temporarily move the current position back by the fixed length and
+ The implementation of lookbehind assertions is, for each alternative,
+ to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the cur-
rent position, the assertion fails.
PCRE does not allow the \C escape (which matches a single byte in UTF-8
- mode) to appear in lookbehind assertions, because it makes it impossi-
- ble to calculate the length of the lookbehind. The \X and \R escapes,
+ mode) to appear in lookbehind assertions, because it makes it impossi-
+ ble to calculate the length of the lookbehind. The \X and \R escapes,
which can match different numbers of bytes, are also not permitted.
- "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
- lookbehinds, as long as the subpattern matches a fixed-length string.
+ "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
+ lookbehinds, as long as the subpattern matches a fixed-length string.
Recursion, however, is not supported.
- Possessive quantifiers can be used in conjunction with lookbehind
- assertions to specify efficient matching at the end of the subject
- string. Consider a simple pattern such as
+ Possessive quantifiers can be used in conjunction with lookbehind
+ assertions to specify efficient matching of fixed-length strings at the
+ end of subject strings. Consider a simple pattern such as
abcd$
- when applied to a long string that does not match. Because matching
+ when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
- and then see if what follows matches the rest of the pattern. If the
+ and then see if what follows matches the rest of the pattern. If the
pattern is specified as
^.*abcd$
- the initial .* matches the entire string at first, but when this fails
+ the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
- last character, then all but the last two characters, and so on. Once
- again the search for "a" covers the entire string, from right to left,
+ last character, then all but the last two characters, and so on. Once
+ again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as
^.*+(?<=abcd)
- there can be no backtracking for the .*+ item; it can match only the
- entire string. The subsequent lookbehind assertion does a single test
- on the last four characters. If it fails, the match fails immediately.
- For long strings, this approach makes a significant difference to the
+ there can be no backtracking for the .*+ item; it can match only the
+ entire string. The subsequent lookbehind assertion does a single test
+ on the last four characters. If it fails, the match fails immediately.
+ For long strings, this approach makes a significant difference to the
processing time.
Using multiple assertions
@@ -4547,18 +4641,18 @@ ASSERTIONS
(?<=\d{3})(?<!999)foo
- matches "foo" preceded by three digits that are not "999". Notice that
- each of the assertions is applied independently at the same point in
- the subject string. First there is a check that the previous three
- characters are all digits, and then there is a check that the same
+ matches "foo" preceded by three digits that are not "999". Notice that
+ each of the assertions is applied independently at the same point in
+ the subject string. First there is a check that the previous three
+ characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo" pre-
- ceded by six characters, the first three of which are digits and the last
- three of which are not "999". For example, it doesn't match "123abc-
+ ceded by six characters, the first three of which are digits and the last
+ three of which are not "999". For example, it doesn't match "123abc-
foo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo
- This time the first assertion looks at the preceding six characters,
+ This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
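The two patterns above can be compared side by side in a short sketch. It uses Python's re module instead of PCRE (an assumption for illustration; both lookbehinds here are fixed-length, which both engines accept). The variable names are hypothetical.

```python
import re

# Both assertions test the same three characters immediately before "foo".
three_before = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now looks back six characters; the second still three.
six_before = re.compile(r"(?<=\d{3}...)(?<!999)foo")

three_results = {s: bool(three_before.search(s))
                 for s in ["123foo", "999foo", "123abcfoo"]}
six_results = {s: bool(six_before.search(s))
               for s in ["123abcfoo", "123999foo", "123foo"]}
```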
@@ -4566,43 +4660,46 @@ ASSERTIONS
(?<=(?<!foo)bar)baz
- matches an occurrence of "baz" that is preceded by "bar" which in turn
+ matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while
(?<=\d{3}(?!999)...)foo
- is another pattern that matches "foo" preceded by three digits and any
+ is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".
CONDITIONAL SUBPATTERNS
- It is possible to cause the matching process to obey a subpattern con-
- ditionally or to choose between two alternative subpatterns, depending
- on the result of an assertion, or whether a previous capturing subpat-
- tern matched or not. The two possible forms of conditional subpattern
- are
+ It is possible to cause the matching process to obey a subpattern con-
+ ditionally or to choose between two alternative subpatterns, depending
+ on the result of an assertion, or whether a specific capturing subpat-
+ tern has already been matched. The two possible forms of conditional
+ subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
- If the condition is satisfied, the yes-pattern is used; otherwise the
- no-pattern (if present) is used. If there are more than two alterna-
+ If the condition is satisfied, the yes-pattern is used; otherwise the
+ no-pattern (if present) is used. If there are more than two alterna-
tives in the subpattern, a compile-time error occurs.
- There are four kinds of condition: references to subpatterns, refer-
+ There are four kinds of condition: references to subpatterns, refer-
ences to recursion, a pseudo-condition called DEFINE, and assertions.
Checking for a used subpattern by number
- If the text between the parentheses consists of a sequence of digits,
- the condition is true if the capturing subpattern of that number has
- previously matched. An alternative notation is to precede the digits
- with a plus or minus sign. In this case, the subpattern number is rela-
- tive rather than absolute. The most recently opened parentheses can be
- referenced by (?(-1), the next most recent by (?(-2), and so on. In
- looping constructs it can also make sense to refer to subsequent groups
- with constructs such as (?(+2).
+ If the text between the parentheses consists of a sequence of digits,
+ the condition is true if a capturing subpattern of that number has pre-
+ viously matched. If there is more than one capturing subpattern with
+ the same number (see the earlier section about duplicate subpattern
+ numbers), the condition is true if any of them have been set. An alter-
+ native notation is to precede the digits with a plus or minus sign. In
+ this case, the subpattern number is relative rather than absolute. The
+ most recently opened parentheses can be referenced by (?(-1), the next
+ most recent by (?(-2), and so on. In looping constructs it can also
+ make sense to refer to subsequent groups with constructs such as
+ (?(+2).
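
As a rough illustration of the relative notation described above, the arithmetic can be sketched in C. The function name and the idea of passing in a count of already-opened capturing parentheses are assumptions made for this sketch; nothing here is part of the PCRE API.

```c
/* Hypothetical helper, not a PCRE function: convert a relative group
   reference, as written in a condition such as (?(-1) or (?(+2), into
   an absolute group number. opened_so_far is assumed to be the number
   of capturing parentheses opened before the condition is reached. */
int resolve_relative_group(int opened_so_far, int relative)
{
    if (relative < 0) {
        /* -1 is the most recently opened group, -2 the one before it. */
        int absolute = opened_so_far + relative + 1;
        return absolute >= 1 ? absolute : -1;   /* -1: no such group */
    }
    if (relative > 0) {
        /* +1 is the next group to be opened, +2 the one after that. */
        return opened_so_far + relative;
    }
    return -1;   /* zero is not a relative reference */
}
```

With three groups already opened, (?(-1) would test group 3, (?(-2) group 2, and (?(+2) group 5.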
Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
@@ -4645,6 +4742,9 @@ CONDITIONAL SUBPATTERNS
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
+ If the name used in a condition of this kind is a duplicate, the test
+ is applied to all subpatterns of the same name, and is true if any one
+ of them has matched.
Checking for pattern recursion
@@ -4655,12 +4755,14 @@ CONDITIONAL SUBPATTERNS
(?(R3)...) or (?(R&name)...)
- the condition is true if the most recent recursion is into the subpat-
- tern whose number or name is given. This condition does not check the
- entire recursion stack.
+ the condition is true if the most recent recursion is into a subpattern
+ whose number or name is given. This condition does not check the entire
+ recursion stack. If the name used in a condition of this kind is a
+ duplicate, the test is applied to all subpatterns of the same name, and
+ is true if any one of them is the most recent recursion.
- At "top level", all these recursion test conditions are false. Recur-
- sive patterns are described below.
+ At "top level", all these recursion test conditions are false. The
+ syntax for recursive patterns is described below.
Defining subpatterns for use by reference only
@@ -4680,11 +4782,9 @@ CONDITIONAL SUBPATTERNS
group named "byte" is defined. This matches an individual component of
an IPv4 address (a number less than 256). When matching takes place,
this part of the pattern is skipped because DEFINE acts like a false
- condition.
-
- The rest of the pattern uses references to the named group to match the
- four dot-separated components of an IPv4 address, insisting on a word
- boundary at each end.
+ condition. The rest of the pattern uses references to the named group
+ to match the four dot-separated components of an IPv4 address, insist-
+ ing on a word boundary at each end.
Assertion conditions
@@ -4752,24 +4852,26 @@ RECURSIVE PATTERNS
This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
- \( ( (?>[^()]+) | (?R) )* \)
+ \( ( [^()]++ | (?R) )* \)
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (that is, a correctly parenthe-
- sized substring). Finally there is a closing parenthesis.
+ sized substring). Finally there is a closing parenthesis. Note the use
+ of a possessive quantifier to avoid backtracking into sequences of non-
+ parentheses.
If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:
- ( \( ( (?>[^()]+) | (?1) )* \) )
+ ( \( ( [^()]++ | (?1) )* \) )
We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.
In a larger pattern, keeping track of parenthesis numbers can be
- tricky. This is made easier by the use of relative references. (A Perl
- 5.10 feature.) Instead of (?1) in the pattern above you can write
+ tricky. This is made easier by the use of relative references (a Perl
+ 5.10 feature). Instead of (?1) in the pattern above you can write
(?-2) to refer to the second most recently opened parentheses preceding
the recursion. In other words, a negative number counts capturing
parentheses leftwards from the point at which it is encountered.
@@ -4784,23 +4886,23 @@ RECURSIVE PATTERNS
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:
- (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
+ (?<pn> \( ( [^()]++ | (?&pn) )* \) )
If there is more than one subpattern with the same name, the earliest
one is used.
This particular example pattern that we have been looking at contains
- nested unlimited repeats, and so the use of atomic grouping for match-
- ing strings of non-parentheses is important when applying the pattern
- to strings that do not match. For example, when this pattern is applied
- to
+ nested unlimited repeats, and so the use of a possessive quantifier for
+ matching strings of non-parentheses is important when applying the pat-
+ tern to strings that do not match. For example, when this pattern is
+ applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
- it yields "no match" quickly. However, if atomic grouping is not used,
- the match runs for a very long time indeed because there are so many
- different ways the + and * repeats can carve up the subject, and all
- have to be tested before failure can be reported.
+ it yields "no match" quickly. However, if a possessive quantifier is
+ not used, the match runs for a very long time indeed because there are
+ so many different ways the + and * repeats can carve up the subject,
+ and all have to be tested before failure can be reported.
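
To see why the possessive form fails quickly, it may help to compare it with a hand-written matcher that never gives back characters once a run of non-parentheses has been consumed. The following is only a sketch of that behaviour for a match attempted at the start of a string; it is not how PCRE executes the pattern, and the function name is invented for illustration.

```c
#include <stddef.h>

/* Hypothetical hand-written counterpart of
       \( ( [^()]++ | (?R) )* \)
   Returns the number of characters matched at the start of s, or -1
   if there is no match at that position. */
long match_parens(const char *s)
{
    size_t i = 0;

    if (s[i] != '(')
        return -1;
    i++;

    for (;;) {
        if (s[i] == '(') {
            /* Recursive match of a parenthesized substring. */
            long inner = match_parens(s + i);
            if (inner < 0)
                return -1;
            i += (size_t)inner;
        } else if (s[i] != ')' && s[i] != '\0') {
            /* Possessive run: consume all non-parentheses, never retry
               with a shorter run. */
            while (s[i] != '(' && s[i] != ')' && s[i] != '\0')
                i++;
        } else {
            break;
        }
    }

    return s[i] == ')' ? (long)(i + 1) : -1;
}
```

Each character is examined once, so a subject such as an opening parenthesis followed by a long run of letters and "()" is rejected in a single pass. The recursion depth grows with the nesting depth of the subject, so very deeply nested input could exhaust the stack.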
At the end of a match, the values set for any capturing subpatterns are
those from the outermost level of the recursion at which the subpattern
@@ -4814,7 +4916,7 @@ RECURSIVE PATTERNS
value taken on at the top level. If additional parentheses are added,
giving
- \( ( ( (?>[^()]+) | (?R) )* ) \)
+ \( ( ( [^()]++ | (?R) )* ) \)
^ ^
^ ^
@@ -4894,7 +4996,7 @@ RECURSIVE PATTERNS
If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:
- ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+4|\W*+.\W*+))\W*+$
+ ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
@@ -4903,6 +5005,14 @@ RECURSIVE PATTERNS
great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
+ WARNING: The palindrome-matching patterns above work only if the sub-
+ ject string does not start with a palindrome that is shorter than the
+ entire string. For example, although "abcba" is correctly matched, if
+ the subject is "ababa", PCRE finds the palindrome "aba" at the start,
+ then fails at top level because the end of the string does not follow.
+ Once again, it cannot jump back into the recursion to try other alter-
+ natives, so the entire match fails.
+
SUBPATTERNS AS SUBROUTINES
@@ -5034,8 +5144,8 @@ BACKTRACKING CONTROL
This verb causes the match to end successfully, skipping the remainder
of the pattern. When inside a recursion, only the innermost pattern is
- ended immediately. If the (*ACCEPT) is inside capturing parentheses,
- the data so far is captured. (This feature was added to PCRE at release
+ ended immediately. If (*ACCEPT) is inside capturing parentheses, the
+ data so far is captured. (This feature was added to PCRE at release
8.00.) For example:
A((?:A|B(*ACCEPT)|C)D)
@@ -5068,9 +5178,9 @@ BACKTRACKING CONTROL
This verb causes the whole match to fail outright if the rest of the
pattern does not match. Even if the pattern is unanchored, no further
- attempts to find a match by advancing the start point take place. Once
- (*COMMIT) has been passed, pcre_exec() is committed to finding a match
- at the current starting point, or not at all. For example:
+ attempts to find a match by advancing the starting point take place.
+ Once (*COMMIT) has been passed, pcre_exec() is committed to finding a
+ match at the current starting point, or not at all. For example:
a+(*COMMIT)b
@@ -5102,7 +5212,7 @@ BACKTRACKING CONTROL
If the subject is "aaaac...", after the first match attempt fails
(starting at the first character in the string), the starting point
skips on to start the next attempt at "c". Note that a possessive quan-
- tifer does not have the same effect in this example; although it would
+ tifier does not have the same effect as this example; although it would
suppress backtracking during the first match attempt, the second
attempt would start at the second character instead of skipping on to
"c".
@@ -5125,7 +5235,7 @@ BACKTRACKING CONTROL
SEE ALSO
- pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
+ pcreapi(3), pcrecallout(3), pcrematching(3), pcresyntax(3), pcre(3).
AUTHOR
@@ -5137,11 +5247,11 @@ AUTHOR
REVISION
- Last updated: 22 September 2009
+ Last updated: 04 October 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRESYNTAX(3) PCRESYNTAX(3)
@@ -5493,8 +5603,8 @@ REVISION
Last updated: 11 April 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPARTIAL(3) PCREPARTIAL(3)
@@ -5533,154 +5643,157 @@ PARTIAL MATCHING IN PCRE
plete match, though the details differ between the two matching func-
tions. If both options are set, PCRE_PARTIAL_HARD takes precedence.
- Setting a partial matching option disables one of PCRE's optimizations.
+ Setting a partial matching option disables two of PCRE's optimizations.
PCRE remembers the last literal byte in a pattern, and abandons match-
ing immediately if such a byte is not present in the subject string.
This optimization cannot be used for a subject string that might match
- only partially.
+ only partially. If the pattern was studied, PCRE knows the minimum
+ length of a matching string, and does not bother to run the matching
+ function on shorter strings. This optimization is also disabled for
+ partial matching.
PARTIAL MATCHING USING pcre_exec()
A partial match occurs during a call to pcre_exec() whenever the end of
- the subject string is reached successfully, but matching cannot con-
+ the subject string is reached successfully, but matching cannot con-
tinue because more characters are needed. However, at least one charac-
- ter must have been matched. (In other words, a partial match can never
+ ter must have been matched. (In other words, a partial match can never
be an empty string.)
- If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but
+ If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but
matching continues as normal, and other alternatives in the pattern are
- tried. If no complete match can be found, pcre_exec() returns
+ tried. If no complete match can be found, pcre_exec() returns
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
two slots in the offsets vector, the first of them is set to the offset
of the earliest character that was inspected when the partial match was
- found. For convenience, the second offset points to the end of the
- string so that a substring can easily be extracted.
+ found. For convenience, the second offset points to the end of the
+ string so that a substring can easily be identified.
- For the majority of patterns, the first offset identifies the start of
- the partially matched string. However, for patterns that contain look-
- behind assertions, or \K, or begin with \b or \B, earlier characters
+ For the majority of patterns, the first offset identifies the start of
+ the partially matched string. However, for patterns that contain look-
+ behind assertions, or \K, or begin with \b or \B, earlier characters
have been inspected while carrying out the match. For example:
/(?<=abc)123/
This pattern matches "123", but only if it is preceded by "abc". If the
subject string is "xyzabc12", the offsets after a partial match are for
- the substring "abc12", because all these characters are needed if
+ the substring "abc12", because all these characters are needed if
another match is tried with extra characters added.
- If there is more than one partial match, the first one that was found
+ If there is more than one partial match, the first one that was found
provides the data that is returned. Consider this pattern:
/123\w+X|dogY/
- If this is matched against the subject string "abc123dog", both alter-
- natives fail to match, but the end of the subject is reached during
- matching, so PCRE_ERROR_PARTIAL is returned instead of
- PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying
- "123dog" as the first partial match that was found. (In this example,
- there are two partial matches, because "dog" on its own partially
+ If this is matched against the subject string "abc123dog", both alter-
+ natives fail to match, but the end of the subject is reached during
+ matching, so PCRE_ERROR_PARTIAL is returned instead of
+ PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying
+ "123dog" as the first partial match that was found. (In this example,
+ there are two partial matches, because "dog" on its own partially
matches the second alternative.)
If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR-
- TIAL as soon as a partial match is found, without continuing to search
- for possible complete matches. The difference between the two options
+ TIAL as soon as a partial match is found, without continuing to search
+ for possible complete matches. The difference between the two options
can be illustrated by a pattern such as:
/dog(sbody)?/
- This matches either "dog" or "dogsbody", greedily (that is, it prefers
- the longer string if possible). If it is matched against the string
- "dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
+ This matches either "dog" or "dogsbody", greedily (that is, it prefers
+ the longer string if possible). If it is matched against the string
+ "dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
- On the other hand, if the pattern is made ungreedy the result is dif-
+ On the other hand, if the pattern is made ungreedy the result is dif-
ferent:
/dog(sbody)??/
- In this case the result is always a complete match because pcre_exec()
- finds that first, and it never continues after finding a match. It
- might be easier to follow this explanation by thinking of the two pat-
+ In this case the result is always a complete match because pcre_exec()
+ finds that first, and it never continues after finding a match. It
+ might be easier to follow this explanation by thinking of the two pat-
terns like this:
/dog(sbody)?/ is the same as /dogsbody|dog/
/dog(sbody)??/ is the same as /dog|dogsbody/
- The second pattern will never match "dogsbody" when pcre_exec() is
+ The second pattern will never match "dogsbody" when pcre_exec() is
used, because it will always find the shorter match first.
PARTIAL MATCHING USING pcre_dfa_exec()
- The pcre_dfa_exec() function moves along the subject string character
- by character, without backtracking, searching for all possible matches
- simultaneously. If the end of the subject is reached before the end of
- the pattern, there is the possibility of a partial match, again pro-
+ The pcre_dfa_exec() function moves along the subject string character
+ by character, without backtracking, searching for all possible matches
+ simultaneously. If the end of the subject is reached before the end of
+ the pattern, there is the possibility of a partial match, again pro-
vided that at least one character has matched.
- When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
- there have been no complete matches. Otherwise, the complete matches
- are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
- takes precedence over any complete matches. The portion of the string
- that was inspected when the longest partial match was found is set as
+ When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
+ there have been no complete matches. Otherwise, the complete matches
+ are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
+ takes precedence over any complete matches. The portion of the string
+ that was inspected when the longest partial match was found is set as
the first matching string, provided there are at least two slots in the
offsets vector.
- Because pcre_dfa_exec() always searches for all possible matches, and
- there is no difference between greedy and ungreedy repetition, its be-
+ Because pcre_dfa_exec() always searches for all possible matches, and
+ there is no difference between greedy and ungreedy repetition, its be-
haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con-
- sider the string "dog" matched against the ungreedy pattern shown
+ sider the string "dog" matched against the ungreedy pattern shown
above:
/dog(sbody)??/
- Whereas pcre_exec() stops as soon as it finds the complete match for
+ Whereas pcre_exec() stops as soon as it finds the complete match for
"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.
PARTIAL MATCHING AND WORD BOUNDARIES
- If a pattern ends with one of sequences \w or \W, which test for word
- boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-
+ If a pattern ends with one of the sequences \b or \B, which test for word
+ boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-
intuitive results. Consider this pattern:
/\bcat\b/
This matches "cat", provided there is a word boundary at either end. If
the subject string is "the cat", the comparison of the final "t" with a
- following character cannot take place, so a partial match is found.
- However, pcre_exec() carries on with normal matching, which matches \b
- at the end of the subject when the last character is a letter, thus
+ following character cannot take place, so a partial match is found.
+ However, pcre_exec() carries on with normal matching, which matches \b
+ at the end of the subject when the last character is a letter, thus
finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
- TIAL. The same thing happens with pcre_dfa_exec(), because it also
+ TIAL. The same thing happens with pcre_dfa_exec(), because it also
finds the complete match.
- Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
+ Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
because then the partial match takes precedence.
FORMERLY RESTRICTED PATTERNS
For releases of PCRE prior to 8.00, because of the way certain internal
- optimizations were implemented in the pcre_exec() function, the
- PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
- used with all patterns. From release 8.00 onwards, the restrictions no
- longer apply, and partial matching with pcre_exec() can be requested
+ optimizations were implemented in the pcre_exec() function, the
+ PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
+ used with all patterns. From release 8.00 onwards, the restrictions no
+ longer apply, and partial matching with pcre_exec() can be requested
for any pattern.
Items that were formerly restricted were repeated single characters and
- repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
- not conform to the restrictions, pcre_exec() returned the error code
- PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
- PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
+ repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
+ not conform to the restrictions, pcre_exec() returned the error code
+ PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
+ PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
pattern can be used for partial matching now always returns 1.
EXAMPLE OF PARTIAL MATCHING USING PCRETEST
- If the escape sequence \P is present in a pcretest data line, the
- PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
+ If the escape sequence \P is present in a pcretest data line, the
+ PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
pcretest that uses the date example quoted above:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -5696,24 +5809,24 @@ EXAMPLE OF PARTIAL MATCHING USING PCRETEST
data> j\P
No match
- The first data string is matched completely, so pcretest shows the
- matched substrings. The remaining four strings do not match the com-
+ The first data string is matched completely, so pcretest shows the
+ matched substrings. The remaining four strings do not match the com-
plete pattern, but the first two are partial matches. Similar output is
obtained when pcre_dfa_exec() is used.
- If the escape sequence \P is present more than once in a pcretest data
+ If the escape sequence \P is present more than once in a pcretest data
line, the PCRE_PARTIAL_HARD option is set for the match.
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
When a partial match has been found using pcre_dfa_exec(), it is possi-
- ble to continue the match by providing additional subject data and
- calling pcre_dfa_exec() again with the same compiled regular expres-
- sion, this time setting the PCRE_DFA_RESTART option. You must pass the
+ ble to continue the match by providing additional subject data and
+ calling pcre_dfa_exec() again with the same compiled regular expres-
+ sion, this time setting the PCRE_DFA_RESTART option. You must pass the
same working space as before, because this is where details of the pre-
- vious partial match are stored. Here is an example using pcretest,
- using the \R escape sequence to set the PCRE_DFA_RESTART option (\D
+ vious partial match are stored. Here is an example using pcretest,
+ using the \R escape sequence to set the PCRE_DFA_RESTART option (\D
specifies the use of pcre_dfa_exec()):
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -5722,26 +5835,26 @@ MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
data> n05\R\D
0: n05
- The first call has "23ja" as the subject, and requests partial match-
- ing; the second call has "n05" as the subject for the continued
- (restarted) match. Notice that when the match is complete, only the
- last part is shown; PCRE does not retain the previously partially-
- matched string. It is up to the calling program to do that if it needs
+ The first call has "23ja" as the subject, and requests partial match-
+ ing; the second call has "n05" as the subject for the continued
+ (restarted) match. Notice that when the match is complete, only the
+ last part is shown; PCRE does not retain the previously partially-
+ matched string. It is up to the calling program to do that if it needs
to.
- You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
- PCRE_DFA_RESTART to continue partial matching over multiple segments.
- This facility can be used to pass very long subject strings to
+ You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
+ PCRE_DFA_RESTART to continue partial matching over multiple segments.
+ This facility can be used to pass very long subject strings to
pcre_dfa_exec().
MULTI-SEGMENT MATCHING WITH pcre_exec()
- From release 8.00, pcre_exec() can also be used to do multi-segment
- matching. Unlike pcre_dfa_exec(), it is not possible to restart the
- previous match with a new segment of data. Instead, new data must be
- added to the previous subject string, and the entire match re-run,
- starting from the point where the partial match occurred. Earlier data
+ From release 8.00, pcre_exec() can also be used to do multi-segment
+ matching. Unlike pcre_dfa_exec(), it is not possible to restart the
+ previous match with a new segment of data. Instead, new data must be
+ added to the previous subject string, and the entire match re-run,
+ starting from the point where the partial match occurred. Earlier data
can be discarded. Consider an unanchored pattern that matches dates:
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
@@ -5749,15 +5862,15 @@ MULTI-SEGMENT MATCHING WITH pcre_exec()
Partial match: 23ja
At this stage, an application could discard the text preceding "23ja",
- add on text from the next segment, and call pcre_exec() again. Unlike
- pcre_dfa_exec(), the entire matching string must always be available,
- and the complete matching process occurs for each call, so more memory
+ add on text from the next segment, and call pcre_exec() again. Unlike
+ pcre_dfa_exec(), the entire matching string must always be available,
+ and the complete matching process occurs for each call, so more memory
and more processing time are needed.
- Note: If the pattern contains lookbehind assertions, or \K, or starts
- with \b or \B, the string that is returned for a partial match will
- include characters that precede the partially matched string itself,
- because these must be retained when adding on more characters for a
+ Note: If the pattern contains lookbehind assertions, or \K, or starts
+ with \b or \B, the string that is returned for a partial match will
+ include characters that precede the partially matched string itself,
+ because these must be retained when adding on more characters for a
subsequent matching attempt.
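
The buffer handling described above might be sketched as follows. This is a minimal illustration, not code from PCRE: the function name, the fixed-capacity buffer, and the error convention are all assumptions. The keep_from argument is meant to be the first offset returned for the partial match, so that any characters inspected by a lookbehind, \K, \b or \B are retained.

```c
#include <string.h>

/* Hypothetical helper, not a PCRE function: discard buf[0..keep_from),
   move the retained tail to the front of the buffer, and append the
   next segment. Returns the new subject length, or (size_t)-1 if the
   arguments are inconsistent or the result would not fit in cap bytes. */
size_t carry_over(char *buf, size_t len, size_t cap,
                  size_t keep_from, const char *next, size_t next_len)
{
    if (keep_from > len)
        return (size_t)-1;

    size_t kept = len - keep_from;
    if (kept > cap || next_len > cap - kept)
        return (size_t)-1;

    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, next, next_len);
    return kept + next_len;
}
```

After a partial match, the caller would pass the first offset from the offsets vector as keep_from, then call pcre_exec() again on the returned length. For a buffer holding "xyz 23ja" with a partial match starting at offset 4, appending "n05" leaves "23jan05" as the new subject.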
@@ -5766,28 +5879,28 @@ ISSUES WITH MULTI-SEGMENT MATCHING
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
- 1. If the pattern contains tests for the beginning or end of a line,
- you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
- ate, when the subject string for any call does not contain the begin-
+ 1. If the pattern contains tests for the beginning or end of a line,
+ you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
+ ate, when the subject string for any call does not contain the begin-
ning or end of a line.
- 2. Lookbehind assertions at the start of a pattern are catered for in
- the offsets that are returned for a partial match. However, in theory,
- a lookbehind assertion later in the pattern could require even earlier
- characters to be inspected, and it might not have been reached when a
- partial match occurs. This is probably an extremely unlikely case; you
- could guard against it to a certain extent by always including extra
+ 2. Lookbehind assertions at the start of a pattern are catered for in
+ the offsets that are returned for a partial match. However, in theory,
+ a lookbehind assertion later in the pattern could require even earlier
+ characters to be inspected, and it might not have been reached when a
+ partial match occurs. This is probably an extremely unlikely case; you
+ could guard against it to a certain extent by always including extra
characters at the start.
- 3. Matching a subject string that is split into multiple segments may
- not always produce exactly the same result as matching over one single
- long string, especially when PCRE_PARTIAL_SOFT is used. The section
- "Partial Matching and Word Boundaries" above describes an issue that
- arises if the pattern ends with \b or \B. Another kind of difference
- may occur when there are multiple matching possibilities, because a
+ 3. Matching a subject string that is split into multiple segments may
+ not always produce exactly the same result as matching over one single
+ long string, especially when PCRE_PARTIAL_SOFT is used. The section
+ "Partial Matching and Word Boundaries" above describes an issue that
+ arises if the pattern ends with \b or \B. Another kind of difference
+ may occur when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches.
This means that as soon as the shortest match has been found, continua-
- tion to a new subject segment is no longer possible. Consider again
+ tion to a new subject segment is no longer possible. Consider again
this pcretest example:
re> /dog(sbody)?/
@@ -5801,17 +5914,17 @@ ISSUES WITH MULTI-SEGMENT MATCHING
0: dogsbody
1: dog
- The first data line passes the string "dogsb" to pcre_exec(), setting
- the PCRE_PARTIAL_SOFT option. Although the string is a partial match
- for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
- shorter string "dog" is a complete match. Similarly, when the subject
- is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
+ The first data line passes the string "dogsb" to pcre_exec(), setting
+ the PCRE_PARTIAL_SOFT option. Although the string is a partial match
+ for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
+ shorter string "dog" is a complete match. Similarly, when the subject
+ is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
the first two) the match stops when "dog" has been found, and it is not
- possible to continue. On the other hand, if "dogsbody" is presented as
+ possible to continue. On the other hand, if "dogsbody" is presented as
a single string, pcre_dfa_exec() finds both matches.
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
- when matching multi-segment data. The example above then behaves dif-
+ when matching multi-segment data. The example above then behaves dif-
ferently:
re> /dog(sbody)?/
@@ -5824,25 +5937,25 @@ ISSUES WITH MULTI-SEGMENT MATCHING
4. Patterns that contain alternatives at the top level which do not all
- start with the same pattern item may not work as expected when
+ start with the same pattern item may not work as expected when
pcre_dfa_exec() is used. For example, consider this pattern:
1234|3789
- If the first part of the subject is "ABC123", a partial match of the
- first alternative is found at offset 3. There is no partial match for
+ If the first part of the subject is "ABC123", a partial match of the
+ first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
- point in the subject string. Attempting to continue with the string
- "7890" does not yield a match because only those alternatives that
- match at one point in the subject are remembered. The problem arises
- because the start of the second alternative matches within the first
- alternative. There is no problem with anchored patterns or patterns
+ point in the subject string. Attempting to continue with the string
+ "7890" does not yield a match because only those alternatives that
+ match at one point in the subject are remembered. The problem arises
+ because the start of the second alternative matches within the first
+ alternative. There is no problem with anchored patterns or patterns
such as:
1234|ABCD
- where no string can be a partial match for both alternatives. This is
- not a problem if pcre_exec() is used, because the entire match has to
+ where no string can be a partial match for both alternatives. This is
+ not a problem if pcre_exec() is used, because the entire match has to
be rerun each time:
re> /1234|3789/
@@ -5861,11 +5974,11 @@ AUTHOR
REVISION
- Last updated: 05 September 2009
+ Last updated: 29 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPRECOMPILE(3) PCREPRECOMPILE(3)
@@ -5988,8 +6101,8 @@ REVISION
Last updated: 13 June 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPERFORM(3) PCREPERFORM(3)
@@ -6138,8 +6251,8 @@ REVISION
Last updated: 06 March 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPOSIX(3) PCREPOSIX(3)
@@ -6394,8 +6507,8 @@ REVISION
Last updated: 02 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRECPP(3) PCRECPP(3)
@@ -6735,8 +6848,8 @@ REVISION
Last updated: 17 March 2009
------------------------------------------------------------------------------
-
-
+
+
PCRESAMPLE(3) PCRESAMPLE(3)
@@ -6765,8 +6878,8 @@ PCRE SAMPLE PROGRAM
is going on.
If PCRE is installed in the standard include and library directories
- for your system, you should be able to compile the demonstration pro-
- gram using this command:
+ for your operating system, you should be able to compile the demonstra-
+ tion program using this command:
gcc -o pcredemo pcredemo.c -lpcre
@@ -6813,7 +6926,7 @@ AUTHOR
REVISION
- Last updated: 01 September 2009
+ Last updated: 30 September 2009
Copyright (c) 1997-2009 University of Cambridge.
------------------------------------------------------------------------------
PCRESTACK(3) PCRESTACK(3)
@@ -6952,5 +7065,5 @@ REVISION
Last updated: 09 July 2008
Copyright (c) 1997-2008 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
diff --git a/doc/pcre_compile2.3 b/doc/pcre_compile2.3
index 84dbf19..b358cd4 100644
--- a/doc/pcre_compile2.3
+++ b/doc/pcre_compile2.3
@@ -49,7 +49,7 @@ The option bits are:
PCRE_JAVASCRIPT_COMPAT JavaScript compatibility
PCRE_MULTILINE ^ and $ match newlines within data
PCRE_NEWLINE_ANY Recognize any Unicode newline sequence
- PCRE_NEWLINE_ANYCRLF Recognize CR, LF, and CRLF as newline
+ PCRE_NEWLINE_ANYCRLF Recognize CR, LF, and CRLF as newline
sequences
PCRE_NEWLINE_CR Set CR as the newline sequence
PCRE_NEWLINE_CRLF Set CRLF as the newline sequence
diff --git a/doc/pcre_dfa_exec.3 b/doc/pcre_dfa_exec.3
index 4f4bb91..c8ca381 100644
--- a/doc/pcre_dfa_exec.3
+++ b/doc/pcre_dfa_exec.3
@@ -57,8 +57,8 @@ The options are:
was set at compile time)
PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial
PCRE_PARTIAL_SOFT ) match if no full matches are found
- PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
- even if there is a full match as well
+ PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
+ even if there is a full match as well
PCRE_DFA_SHORTEST Return only the shortest match
PCRE_DFA_RESTART Restart after a partial match
.sp
diff --git a/doc/pcre_exec.3 b/doc/pcre_exec.3
index d5689eb..0a3399f 100644
--- a/doc/pcre_exec.3
+++ b/doc/pcre_exec.3
@@ -52,8 +52,8 @@ The options are:
was set at compile time)
PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial
PCRE_PARTIAL_SOFT ) match if no full matches are found
- PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
- even if there is a full match as well
+ PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
+ even if there is a full match as well
.sp
For details of partial matching, see the
.\" HREF
diff --git a/doc/pcre_fullinfo.3 b/doc/pcre_fullinfo.3
index 16a3c63..28aec67 100644
--- a/doc/pcre_fullinfo.3
+++ b/doc/pcre_fullinfo.3
@@ -33,7 +33,7 @@ The following information is available:
PCRE_INFO_FIRSTTABLE Table of first bytes (after studying)
PCRE_INFO_JCHANGED Return 1 if (?J) or (?-J) was used
PCRE_INFO_LASTLITERAL Literal last byte required
- PCRE_INFO_MINLENGTH Lower bound length of matching strings
+ PCRE_INFO_MINLENGTH Lower bound length of matching strings
PCRE_INFO_NAMECOUNT Number of named subpatterns
PCRE_INFO_NAMEENTRYSIZE Size of name table entry
PCRE_INFO_NAMETABLE Pointer to name table
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index c0ee78e..6341cdc 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -395,8 +395,8 @@ avoiding the use of the stack.
Either of the functions \fBpcre_compile()\fP or \fBpcre_compile2()\fP can be
called to compile a pattern into an internal form. The only difference between
the two interfaces is that \fBpcre_compile2()\fP has an additional argument,
-\fIerrorcodeptr\fP, via which a numerical error code can be returned. To avoid
-too much repetition, we refer just to \fBpcre_compile()\fP below, but the
+\fIerrorcodeptr\fP, via which a numerical error code can be returned. To avoid
+too much repetition, we refer just to \fBpcre_compile()\fP below, but the
information applies equally to \fBpcre_compile2()\fP.
.P
The pattern is a C string terminated by a binary zero, and is passed in the
@@ -421,7 +421,7 @@ within the pattern (see the detailed description in the
.\"
documentation). For those options that can be different in different parts of
the pattern, the contents of the \fIoptions\fP argument specifies their
-settings at the start of compilation and execution. The PCRE_ANCHORED,
+settings at the start of compilation and execution. The PCRE_ANCHORED,
PCRE_BSR_\fIxxx\fP, and PCRE_NEWLINE_\fIxxx\fP options can be set at the time
of matching as well as at compile time.
.P
@@ -785,7 +785,7 @@ in the section on matching a pattern.
.P
If studying the pattern does not produce any useful information,
\fBpcre_study()\fP returns NULL. In that circumstance, if the calling program
-wants to pass any of the other fields to \fBpcre_exec()\fP or
+wants to pass any of the other fields to \fBpcre_exec()\fP or
\fBpcre_dfa_exec()\fP, it must set up its own \fBpcre_extra\fP block.
.P
The second argument of \fBpcre_study()\fP contains option bits. At present, no
@@ -807,16 +807,16 @@ This is a typical call to \fBpcre_study\fP():
&error); /* set to NULL or points to a message */
.sp
Studying a pattern does two things: first, a lower bound for the length of
-subject string that is needed to match the pattern is computed. This does not
-mean that there are any strings of that length that match, but it does
-guarantee that no shorter strings match. The value is used by
-\fBpcre_exec()\fP and \fBpcre_dfa_exec()\fP to avoid wasting time by trying to
-match strings that are shorter than the lower bound. You can find out the value
+subject string that is needed to match the pattern is computed. This does not
+mean that there are any strings of that length that match, but it does
+guarantee that no shorter strings match. The value is used by
+\fBpcre_exec()\fP and \fBpcre_dfa_exec()\fP to avoid wasting time by trying to
+match strings that are shorter than the lower bound. You can find out the value
in a calling program via the \fBpcre_fullinfo()\fP function.
.P
Studying a pattern is also useful for non-anchored patterns that do not have a
single fixed starting character. A bitmap of possible starting bytes is
-created. This speeds up finding a position in the subject at which to start
+created. This speeds up finding a position in the subject at which to start
matching.
.
.
@@ -1012,7 +1012,7 @@ entry; both of these return an \fBint\fP value. The entry size depends on the
length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first
entry of the table (a pointer to \fBchar\fP). The first two bytes of each entry
are the number of the capturing parenthesis, most significant byte first. The
-rest of the entry is the corresponding name, zero terminated.
+rest of the entry is the corresponding name, zero terminated.
.P
The names are in alphabetical order. Duplicate names may appear if (?| is used
to create multiple groups with the same number, as described in the
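The entry layout described in this hunk (two bytes of group number, most significant byte first, followed by a zero-terminated name) can be sketched in plain C. This is only an illustration: the helper names and the sample table are hypothetical, and in a real program the table pointer, entry count, and entry size would come from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <stddef.h>
#include <string.h>

/* Group number: first two bytes of an entry, most significant byte first. */
int entry_group_number(const unsigned char *entry)
{
    return (entry[0] << 8) | entry[1];
}

/* The name starts at the third byte of the entry and is zero terminated. */
const char *entry_name(const unsigned char *entry)
{
    return (const char *)(entry + 2);
}

/* Linear search of the table for a name.
   Returns the group number, or -1 if the name is not present.
   table, count and entry_size stand in for the values that
   PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE
   would report (assumption: the caller has already fetched them). */
int find_group(const unsigned char *table, int count, int entry_size,
               const char *name)
{
    for (int i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp(entry_name(entry), name) == 0)
            return entry_group_number(entry);
    }
    return -1;
}

/* Hypothetical table: two entries of 8 bytes each, in alphabetical order.
   "day" is group 2 and "month" is group 1; unused bytes are zero padding. */
const unsigned char sample_table[] = {
    0x00, 0x02, 'd', 'a', 'y', 0, 0, 0,
    0x00, 0x01, 'm', 'o', 'n', 't', 'h', 0
};
```

Because the names are stored in alphabetical order, a binary search would also work; the linear scan is used here only to keep the sketch short.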
@@ -1024,10 +1024,10 @@ in the
.\" HREF
\fBpcrepattern\fP
.\"
-page. Duplicate names for subpatterns with different numbers are permitted only
-if PCRE_DUPNAMES is set. In all cases of duplicate names, they appear in the
-table in the order in which they were found in the pattern. In the absence of
-(?| this is the order of increasing number; when (?| is used this is not
+page. Duplicate names for subpatterns with different numbers are permitted only
+if PCRE_DUPNAMES is set. In all cases of duplicate names, they appear in the
+table in the order in which they were found in the pattern. In the absence of
+(?| this is the order of increasing number; when (?| is used this is not
necessarily the case because later subpatterns may have lower numbers.
.P
As a simple example of the name/number table, consider the following pattern
@@ -1371,7 +1371,7 @@ valid, so PCRE searches further into the string for occurrences of "a" or "b".
.sp
PCRE_NOTEMPTY_ATSTART
.sp
-This is like PCRE_NOTEMPTY, except that an empty string match that is not at
+This is like PCRE_NOTEMPTY, except that an empty string match that is not at
the start of the subject is permitted. If the pattern is anchored, such a match
can occur only if the pattern contains \eK.
.P
@@ -1427,7 +1427,7 @@ PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 string as a
subject, or a value of \fIstartoffset\fP that does not point to the start of a
UTF-8 character, is undefined. Your program may crash.
.sp
- PCRE_PARTIAL_HARD
+ PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
.sp
These options turn on the partial matching feature. For backwards
@@ -1634,7 +1634,7 @@ documentation for details of partial matching.
.sp
This code is no longer in use. It was formerly returned when the PCRE_PARTIAL
option was used with a compiled pattern containing items that were not
-supported for partial matching. From release 8.00 onwards, there are no
+supported for partial matching. From release 8.00 onwards, there are no
restrictions on partial matching.
.sp
PCRE_ERROR_INTERNAL (-14)
@@ -1898,7 +1898,7 @@ a compiled pattern, using a matching algorithm that scans the subject string
just once, and does not backtrack. This has different characteristics to the
normal algorithm, and is not compatible with Perl. Some of the features of PCRE
patterns are not supported. Nevertheless, there are times when this kind of
-matching can be useful. For a discussion of the two matching algorithms, and a
+matching can be useful. For a discussion of the two matching algorithms, and a
list of features that \fBpcre_dfa_exec()\fP does not support, see the
.\" HREF
\fBpcrematching\fP
@@ -1944,7 +1944,7 @@ and PCRE_DFA_RESTART. All but the last four of these are exactly the same as
for \fBpcre_exec()\fP, so their description is not repeated here.
.sp
PCRE_PARTIAL_HARD
- PCRE_PARTIAL_SOFT
+ PCRE_PARTIAL_SOFT
.sp
These have the same general effect as they do for \fBpcre_exec()\fP, but the
details are slightly different. When PCRE_PARTIAL_HARD is set for
diff --git a/doc/pcrebuild.3 b/doc/pcrebuild.3
index dd970dc..3f907b0 100644
--- a/doc/pcrebuild.3
+++ b/doc/pcrebuild.3
@@ -12,11 +12,11 @@ the optional features are selected or deselected by providing options to
\fBconfigure\fP before running the \fBmake\fP command. However, the same
options can be selected in both Unix-like and non-Unix-like environments using
the GUI facility of \fBcmake-gui\fP if you are using \fBCMake\fP instead of
-\fBconfigure\fP to build PCRE.
+\fBconfigure\fP to build PCRE.
.P
-There is a lot more information about building PCRE in non-Unix-like
-environments in the file called \fINON-UNIX-USE\fP, which is part of the PCRE
-distribution. You should consult this file as well as the \fIREADME\fP file if
+There is a lot more information about building PCRE in non-Unix-like
+environments in the file called \fINON-UNIX-USE\fP, which is part of the PCRE
+distribution. You should consult this file as well as the \fIREADME\fP file if
you are building in a non-Unix-like environment.
.P
The complete list of options for \fBconfigure\fP (which includes the standard
diff --git a/doc/pcrecallout.3 b/doc/pcrecallout.3
index ad8a211..b691a16 100644
--- a/doc/pcrecallout.3
+++ b/doc/pcrecallout.3
@@ -19,7 +19,7 @@ For example, this pattern has two callout points:
.sp
(?C1)abc(?C2)def
.sp
-If the PCRE_AUTO_CALLOUT option bit is set when \fBpcre_compile()\fP or
+If the PCRE_AUTO_CALLOUT option bit is set when \fBpcre_compile()\fP or
\fBpcre_compile2()\fP is called, PCRE automatically inserts callouts, all with
number 255, before each item in the pattern. For example, if PCRE_AUTO_CALLOUT
is used with the pattern
diff --git a/doc/pcrecompat.3 b/doc/pcrecompat.3
index 243e9bf..e5c683c 100644
--- a/doc/pcrecompat.3
+++ b/doc/pcrecompat.3
@@ -77,7 +77,7 @@ the
documentation for details.
.P
9. Subpatterns that are called recursively or as "subroutines" are always
-treated as atomic groups in PCRE. This is like Python, but unlike Perl. There
+treated as atomic groups in PCRE. This is like Python, but unlike Perl. There
is a discussion of an example that explains this in more detail in the
.\" HTML <a href="pcrepattern.html#recursiondifference">
.\" </a>
@@ -97,7 +97,7 @@ the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b".
(*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in the forms without an
argument. PCRE does not support (*MARK).
.P
-12. PCRE's handling of duplicate subpattern numbers and duplicate subpattern
+12. PCRE's handling of duplicate subpattern numbers and duplicate subpattern
names is not as general as Perl's. This is a consequence of the fact that PCRE
works internally just with numbers, using an external table to translate
between numbers and names. In particular, a pattern such as (?|(?<a>A)|(?<b>B)),
diff --git a/doc/pcregrep.1 b/doc/pcregrep.1
index a436c88..e171c26 100644
--- a/doc/pcregrep.1
+++ b/doc/pcregrep.1
@@ -89,9 +89,9 @@ standard input is always so treated.
.SH OPTIONS
.rs
.sp
-The order in which some of the options appear can affect the output. For
-example, both the \fB-h\fP and \fB-l\fP options affect the printing of file
-names. Whichever comes later in the command line will be the one that takes
+The order in which some of the options appear can affect the output. For
+example, both the \fB-h\fP and \fB-l\fP options affect the printing of file
+names. Whichever comes later in the command line will be the one that takes
effect.
.TP 10
\fB--\fP
@@ -272,9 +272,9 @@ output once, on a separate line.
Instead of outputting lines from the files, just output the names of the files
containing lines that would have been output. Each file name is output
once, on a separate line. Searching normally stops as soon as a matching line
-is found in a file. However, if the \fB-c\fP (count) option is also used,
-matching continues in order to obtain the correct count, and those files that
-have at least one match are listed along with their counts. Using this option
+is found in a file. However, if the \fB-c\fP (count) option is also used,
+matching continues in order to obtain the correct count, and those files that
+have at least one match are listed along with their counts. Using this option
with \fB-c\fP is a way of suppressing the listing of files with no matches.
.TP
\fB--label\fP=\fIname\fP
@@ -410,8 +410,8 @@ The majority of short and long forms of \fBpcregrep\fP's options are the same
as in the GNU \fBgrep\fP program. Any long option of the form
\fB--xxx-regexp\fP (GNU terminology) is also available as \fB--xxx-regex\fP
(PCRE terminology). However, the \fB--locale\fP, \fB-M\fP, \fB--multiline\fP,
-\fB-u\fP, and \fB--utf-8\fP options are specific to \fBpcregrep\fP. If both the
-\fB-c\fP and \fB-l\fP options are given, GNU grep lists only file names,
+\fB-u\fP, and \fB--utf-8\fP options are specific to \fBpcregrep\fP. If both the
+\fB-c\fP and \fB-l\fP options are given, GNU grep lists only file names,
without counts, but \fBpcregrep\fP gives the counts.
.
.
diff --git a/doc/pcrematching.3 b/doc/pcrematching.3
index 2e2abd9..490f914 100644
--- a/doc/pcrematching.3
+++ b/doc/pcrematching.3
@@ -74,9 +74,9 @@ this is a kind of "DFA algorithm", though it is not implemented as a
traditional finite state machine (it keeps multiple states active
simultaneously).
.P
-Although the general principle of this matching algorithm is that it scans the
-subject string only once, without backtracking, there is one exception: when a
-lookaround assertion is encountered, the characters following or preceding the
+Although the general principle of this matching algorithm is that it scans the
+subject string only once, without backtracking, there is one exception: when a
+lookaround assertion is encountered, the characters following or preceding the
current point have to be independently inspected.
.P
The scan continues until either the end of the subject is reached, or there are
@@ -152,9 +152,9 @@ callouts.
never needs to backtrack, it is possible to pass very long subject strings to
the matching function in several pieces, checking for partial matching each
time. The
-.\" HREF
+.\" HREF
\fBpcrepartial\fP
-.\"
+.\"
documentation gives details of partial matching.
.
.
diff --git a/doc/pcrepartial.3 b/doc/pcrepartial.3
index 05487e1..f0a9b3a 100644
--- a/doc/pcrepartial.3
+++ b/doc/pcrepartial.3
@@ -35,9 +35,9 @@ are set, PCRE_PARTIAL_HARD takes precedence.
Setting a partial matching option disables two of PCRE's optimizations. PCRE
remembers the last literal byte in a pattern, and abandons matching immediately
if such a byte is not present in the subject string. This optimization cannot
-be used for a subject string that might match only partially. If the pattern
-was studied, PCRE knows the minimum length of a matching string, and does not
-bother to run the matching function on shorter strings. This optimization is
+be used for a subject string that might match only partially. If the pattern
+was studied, PCRE knows the minimum length of a matching string, and does not
+bother to run the matching function on shorter strings. This optimization is
also disabled for partial matching.
.
.
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 6cb24f4..460a6f8 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -21,7 +21,7 @@ published by O'Reilly, covers regular expressions in great detail. This
description of PCRE's regular expressions is intended as reference material.
.P
The original operation of PCRE was on strings of one-byte characters. However,
-there is now also support for UTF-8 character strings. To use this,
+there is now also support for UTF-8 character strings. To use this,
PCRE must be built to include UTF-8 support, and you must call
\fBpcre_compile()\fP or \fBpcre_compile2()\fP with the PCRE_UTF8 option. There
is also a special sequence that can be given at the start of a pattern:
@@ -83,7 +83,7 @@ string with one of the following five sequences:
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
.sp
-These override the default and the options given to \fBpcre_compile()\fP or
+These override the default and the options given to \fBpcre_compile()\fP or
\fBpcre_compile2()\fP. For example, on a Unix system where LF is the default
newline sequence, the pattern
.sp
@@ -333,7 +333,7 @@ syntax for referencing a subpattern as a "subroutine". Details are discussed
later.
.\"
Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
-synonymous. The former is a back reference; the latter is a
+synonymous. The former is a back reference; the latter is a
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
subroutine
@@ -468,7 +468,7 @@ one of the following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
.sp
-These override the default and the options given to \fBpcre_compile()\fP or
+These override the default and the options given to \fBpcre_compile()\fP or
\fBpcre_compile2()\fP, but they can be overridden by options given to
\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. Note that these special settings,
which are not Perl-compatible, are recognized only at the very start of a
@@ -741,9 +741,9 @@ different meaning, namely the backspace character, inside a character class).
A word boundary is a position in the subject string where the current character
and the previous character do not both match \ew or \eW (i.e. one matches
\ew and the other matches \eW), or the start or end of the string if the
-first or last character matches \ew, respectively. Neither PCRE nor Perl has a
-separate "start of word" or "end of word" metasequence. However, whatever
-follows \eb normally determines which it is. For example, the fragment
+first or last character matches \ew, respectively. Neither PCRE nor Perl has a
+separate "start of word" or "end of word" metasequence. However, whatever
+follows \eb normally determines which it is. For example, the fragment
\eba matches "a" at the start of a word.
.P
The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
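The word-boundary rule quoted in this hunk can be written out as a small C predicate. This is a sketch under stated assumptions: it treats \ew as ASCII letters, digits, and underscore (the default, non-UTF-8 case), and the function names are hypothetical, not part of the PCRE API.

```c
#include <ctype.h>
#include <stddef.h>

/* Assumed definition of a \w character: ASCII letter, digit or underscore. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if there is a word boundary immediately before subject[pos]
   (0 <= pos <= len), 0 otherwise. Positions outside the string count as
   non-word, so the start or end of the string is a boundary exactly when
   the first or last character is a word character. */
int is_word_boundary(const char *subject, size_t len, size_t pos)
{
    int prev = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int next = pos < len && is_word_char((unsigned char)subject[pos]);
    return prev != next;
}
```

In the subject "a cat", positions 0, 1, 2, and 5 are boundaries under this definition, while position 3 (between "c" and "a") is not.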
@@ -876,8 +876,8 @@ the lookbehind.
.rs
.sp
An opening square bracket introduces a character class, terminated by a closing
-square bracket. A closing square bracket on its own is not special by default.
-However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
+square bracket. A closing square bracket on its own is not special by default.
+However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
bracket causes a compile-time error. If a closing square bracket is required as
a member of the class, it should be the first data character in the class
(after an initial circumflex, if present) or escaped with a backslash.
@@ -1163,14 +1163,14 @@ stored.
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
.sp
-A backreference to a numbered subpattern uses the most recent value that is set
+A backreference to a numbered subpattern uses the most recent value that is set
for that number by any subpattern. The following pattern matches "abcabc" or
"defdef":
.sp
- /(?|(abc)|(def))\1/
+ /(?|(abc)|(def))\e1/
.sp
In contrast, a recursive or "subroutine" call to a numbered subpattern always
-refers to the first one in the pattern with the given number. The following
+refers to the first one in the pattern with the given number. The following
pattern matches "abcabc" or "defabc":
.sp
/(?|(abc)|(def))(?1)/
@@ -1225,7 +1225,7 @@ is also a convenience function for extracting a captured substring by name.
.P
By default, a name must be unique within a pattern, but it is possible to relax
this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
-names are also always permitted for subpatterns with the same number, set up as
+names are also always permitted for subpatterns with the same number, set up as
described in the previous section.) Duplicate names can be useful for patterns
where only one instance of the named parentheses can match. Suppose you want to
match the name of a weekday, either as a 3-letter abbreviation or as the full
@@ -1244,7 +1244,7 @@ subpattern, as described in the previous section.)
.P
The convenience function for extracting the data by name returns the substring
for the first (and in this example, the only) subpattern of that name that
-matched. This saves searching to find which numbered subpattern it was.
+matched. This saves searching to find which numbered subpattern it was.
.P
If you make a backreference to a non-unique named subpattern from elsewhere in
the pattern, the one that corresponds to the first occurrence of the name is
@@ -1256,7 +1256,7 @@ test (see the
.\" </a>
section about conditions
.\"
-below), either to check whether a subpattern has matched, or to check for
+below), either to check whether a subpattern has matched, or to check for
recursion, all subpatterns with the same name are tested. If the condition is
true for any one of them, the overall condition is true. This is the same
behaviour as testing by number. For further details of the interfaces for
@@ -1288,7 +1288,7 @@ items:
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion)
- a recursive or "subroutine" call to a subpattern
+ a recursive or "subroutine" call to a subpattern
.sp
The general repetition quantifier specifies a minimum and maximum number of
permitted matches, by giving the two numbers in curly brackets (braces),
@@ -1614,8 +1614,8 @@ references to it always fail by default. For example, the pattern
.sp
(a|(bc))\e2
.sp
-always fails if it starts to match "a" rather than "bc". However, if the
-PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
+always fails if it starts to match "a" rather than "bc". However, if the
+PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
unset value matches an empty string.
.P
Because there may be many capturing parentheses in a pattern, all digits
@@ -1737,7 +1737,7 @@ In some cases, the Perl 5.10 escape sequence \eK
.\" </a>
(see above)
.\"
-can be used instead of a lookbehind assertion to get round the fixed-length
+can be used instead of a lookbehind assertion to get round the fixed-length
restriction.
.P
The implementation of lookbehind assertions is, for each alternative, to
@@ -1755,7 +1755,7 @@ different numbers of bytes, are also not permitted.
"Subroutine"
.\"
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
-as the subpattern matches a fixed-length string.
+as the subpattern matches a fixed-length string.
.\" HTML <a href="#recursion">
.\" </a>
Recursion,
@@ -1828,7 +1828,7 @@ characters that are not "999".
.sp
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns, depending on
-the result of an assertion, or whether a specific capturing subpattern has
+the result of an assertion, or whether a specific capturing subpattern has
already been matched. The two possible forms of conditional subpattern are:
.sp
(?(condition)yes-pattern)
@@ -1846,8 +1846,8 @@ recursion, a pseudo-condition called DEFINE, and assertions.
.sp
If the text between the parentheses consists of a sequence of digits, the
condition is true if a capturing subpattern of that number has previously
-matched. If there is more than one capturing subpattern with the same number
-(see the earlier
+matched. If there is more than one capturing subpattern with the same number
+(see the earlier
.\"
.\" HTML <a href="#recursion">
.\" </a>
@@ -1899,8 +1899,8 @@ Rewriting the above example to use a named subpattern gives this:
.sp
(?<OPEN> \e( )? [^()]+ (?(<OPEN>) \e) )
.sp
-If the name used in a condition of this kind is a duplicate, the test is
-applied to all subpatterns of the same name, and is true if any one of them has
+If the name used in a condition of this kind is a duplicate, the test is
+applied to all subpatterns of the same name, and is true if any one of them has
matched.
.
.SS "Checking for pattern recursion"
@@ -1915,11 +1915,11 @@ letter R, for example:
.sp
the condition is true if the most recent recursion is into a subpattern whose
number or name is given. This condition does not check the entire recursion
-stack. If the name used in a condition of this kind is a duplicate, the test is
-applied to all subpatterns of the same name, and is true if any one of them is
-the most recent recursion.
+stack. If the name used in a condition of this kind is a duplicate, the test is
+applied to all subpatterns of the same name, and is true if any one of them is
+the most recent recursion.
.P
-At "top level", all these recursion test conditions are false.
+At "top level", all these recursion test conditions are false.
.\" HTML <a href="#recursion">
.\" </a>
The syntax for recursive patterns
@@ -1933,7 +1933,7 @@ If the condition is the string (DEFINE), and there is no subpattern with the
name DEFINE, the condition is always false. In this case, there may be only one
alternative in the subpattern. It is always skipped if control reaches this
point in the pattern; the idea of DEFINE is that it can be used to define
-"subroutines" that can be referenced from elsewhere. (The use of
+"subroutines" that can be referenced from elsewhere. (The use of
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
"subroutines"
@@ -2010,7 +2010,7 @@ this kind of recursion was subsequently introduced into Perl at release 5.10.
.P
A special item that consists of (? followed by a number greater than zero and a
closing parenthesis is a recursive call of the subpattern of the given number,
-provided that it occurs inside that subpattern. (If not, it is a
+provided that it occurs inside that subpattern. (If not, it is a
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
"subroutine"
@@ -2026,7 +2026,7 @@ PCRE_EXTENDED option is set so that white space is ignored):
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a recursive
match of the pattern itself (that is, a correctly parenthesized substring).
-Finally there is a closing parenthesis. Note the use of a possessive quantifier
+Finally there is a closing parenthesis. Note the use of a possessive quantifier
to avoid backtracking into sequences of non-parentheses.
.P
If this were part of a larger pattern, you would not want to recurse the entire
@@ -2117,25 +2117,25 @@ is the actual recursive call.
In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
treated as an atomic group. That is, once it has matched some of the subject
string, it is never re-entered, even if it contains untried alternatives and
-there is a subsequent matching failure. This can be illustrated by the
-following pattern, which purports to match a palindromic string that contains
+there is a subsequent matching failure. This can be illustrated by the
+following pattern, which purports to match a palindromic string that contains
an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
.sp
^(.|(.)(?1)\e2)$
.sp
-The idea is that it either matches a single character, or two identical
-characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
+The idea is that it either matches a single character, or two identical
+characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
it does not if the subject string is longer than three characters. Consider the
subject string "abcba":
.P
-At the top level, the first character is matched, but as it is not at the end
+At the top level, the first character is matched, but as it is not at the end
of the string, the first alternative fails; the second alternative is taken
and the recursion kicks in. The recursive call to subpattern 1 successfully
matches the next character ("b"). (Note that the beginning and end of line
tests are not part of the recursion).
.P
Back at the top level, the next character ("c") is compared with what
-subpattern 2 matched, which was "a". This fails. Because the recursion is
+subpattern 2 matched, which was "a". This fails. Because the recursion is
treated as an atomic group, there are now no backtracking points, and so the
entire match fails. (Perl is able, at this point, to re-enter the recursion and
try the second alternative.) However, if the pattern is written with the
@@ -2143,32 +2143,32 @@ alternatives in the other order, things are different:
.sp
^((.)(?1)\e2|.)$
.sp
-This time, the recursing alternative is tried first, and continues to recurse
-until it runs out of characters, at which point the recursion fails. But this
-time we do have another alternative to try at the higher level. That is the big
+This time, the recursing alternative is tried first, and continues to recurse
+until it runs out of characters, at which point the recursion fails. But this
+time we do have another alternative to try at the higher level. That is the big
difference: in the previous case the remaining alternative is at a deeper
recursion level, which PCRE cannot use.
.P
-To change the pattern so that it matches all palindromic strings, not just those
+To change the pattern so that it matches all palindromic strings, not just those
with an odd number of characters, it is tempting to change the pattern to this:
.sp
^((.)(?1)\e2|.?)$
.sp
-Again, this works in Perl, but not in PCRE, and for the same reason. When a
-deeper recursion has matched a single character, it cannot be entered again in
-order to match an empty string. The solution is to separate the two cases, and
+Again, this works in Perl, but not in PCRE, and for the same reason. When a
+deeper recursion has matched a single character, it cannot be entered again in
+order to match an empty string. The solution is to separate the two cases, and
write out the odd and even cases as alternatives at the higher level:
.sp
^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
-.sp
-If you want to match typical palindromic phrases, the pattern has to ignore all
+.sp
+If you want to match typical palindromic phrases, the pattern has to ignore all
non-word characters, which can be done like this:
.sp
- ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\4|\eW*+.\eW*+))\eW*+$
+ ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
.sp
-If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
-man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
-the use of the possessive quantifier *+ to avoid backtracking into sequences of
+If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
+man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
+the use of the possessive quantifier *+ to avoid backtracking into sequences of
non-word characters. Without this, PCRE takes a great deal longer (ten times or
more) to match typical phrases, and Perl takes so long that you think it has
gone into a loop.
@@ -2294,9 +2294,9 @@ a backtracking algorithm. With the exception of (*FAIL), which behaves like a
failing negative assertion, they cause an error if encountered by
\fBpcre_dfa_exec()\fP.
.P
-If any of these verbs are used in an assertion subpattern, their effect is
+If any of these verbs are used in an assertion subpattern, their effect is
confined to that subpattern; it does not extend to the surrounding pattern.
-Note that assertion subpatterns are processed as anchored at the point where
+Note that assertion subpatterns are processed as anchored at the point where
they are tested.
.P
The new verbs make use of what was previously invalid syntax: an opening
@@ -2319,7 +2319,7 @@ captured. (This feature was added to PCRE at release 8.00.) For example:
.sp
A((?:A|B(*ACCEPT)|C)D)
.sp
-This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
+This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
the outer parentheses.
.sp
(*FAIL) or (*F)
@@ -2400,7 +2400,7 @@ is used outside of any alternation, it acts exactly like (*PRUNE).
.SH "SEE ALSO"
.rs
.sp
-\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
+\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
\fBpcresyntax\fP(3), \fBpcre\fP(3).
.
.
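Reviewer note on the pcrepattern.3 hunks above: as a cross-check on which phrases the final palindrome pattern is meant to accept, here is a minimal sketch in plain C. It is not PCRE code and does not use the regex engine. It assumes ASCII input, and the names `is_word_char` and `is_palindromic_phrase` are hypothetical. The sketch skips non-word characters, as the \W*+ items in the pattern do, and compares the remaining characters without regard to case, as PCRE_CASELESS would.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: a word character in the \w sense
   (ASCII letter, digit, or underscore). */
static int is_word_char(unsigned char c)
{
return isalnum(c) || c == '_';
}

/* Returns 1 if the word characters of s read the same in both directions,
   ignoring case; returns 0 otherwise. A string with no word characters
   counts as a palindrome. */
int is_palindromic_phrase(const char *s)
{
size_t i = 0;
size_t j;
assert(s != NULL);
j = strlen(s);                 /* j is one past the last unchecked char */
while (i < j)
  {
  if (!is_word_char((unsigned char)s[i])) { i++; continue; }
  if (!is_word_char((unsigned char)s[j - 1])) { j--; continue; }
  if (tolower((unsigned char)s[i]) != tolower((unsigned char)s[j - 1]))
    return 0;
  i++;
  j--;
  }
return 1;
}
```

Calling `is_palindromic_phrase("A man, a plan, a canal: Panama!")` gives a quick reference answer to compare against the regex when testing the documented pattern.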
diff --git a/doc/pcreposix.3 b/doc/pcreposix.3
index bdb9381..c6092ac 100644
--- a/doc/pcreposix.3
+++ b/doc/pcreposix.3
@@ -103,9 +103,9 @@ are returned.
.sp
REG_UNGREEDY
.sp
-The PCRE_UNGREEDY option is set when the regular expression is passed for
+The PCRE_UNGREEDY option is set when the regular expression is passed for
compilation to the native function. Note that REG_UNGREEDY is not part of the
-POSIX standard.
+POSIX standard.
.sp
REG_UTF8
.sp
diff --git a/doc/pcresample.3 b/doc/pcresample.3
index f7eefda..9f5067b 100644
--- a/doc/pcresample.3
+++ b/doc/pcresample.3
@@ -10,7 +10,7 @@ this program is given in the
.\" HREF
\fBpcredemo\fP
.\"
-documentation. If you do not have a copy of the PCRE distribution, you can save
+documentation. If you do not have a copy of the PCRE distribution, you can save
this listing to re-create \fIpcredemo.c\fP.
.P
The program compiles the regular expression that is its first argument, and
@@ -50,7 +50,7 @@ Note that there is a much more comprehensive test program, called
\fBpcretest\fP,
.\"
which supports many more facilities for testing regular expressions and the
-PCRE library. The
+PCRE library. The
.\" HREF
\fBpcredemo\fP
.\"
diff --git a/doc/pcretest.1 b/doc/pcretest.1
index 7a6f662..c07d42b 100644
--- a/doc/pcretest.1
+++ b/doc/pcretest.1
@@ -213,7 +213,7 @@ begins with a lookbehind assertion (including \eb or \eB).
If any call to \fBpcre_exec()\fP in a \fB/g\fP or \fB/G\fP sequence matches an
empty string, the next call is done with the PCRE_NOTEMPTY_ATSTART and
PCRE_ANCHORED flags set in order to search for another, non-empty, match at the
-same point. If this second match fails, the start offset is advanced by one
+same point. If this second match fails, the start offset is advanced by one
character, and the normal match is retried. This imitates the way Perl handles
such cases when using the \fB/g\fP modifier or the \fBsplit()\fP function.
.
@@ -357,14 +357,14 @@ recognized:
.\" JOIN
\eN pass the PCRE_NOTEMPTY option to \fBpcre_exec()\fP
or \fBpcre_dfa_exec()\fP; if used twice, pass the
- PCRE_NOTEMPTY_ATSTART option
+ PCRE_NOTEMPTY_ATSTART option
.\" JOIN
\eOdd set the size of the output vector passed to
\fBpcre_exec()\fP to dd (any number of digits)
.\" JOIN
\eP pass the PCRE_PARTIAL_SOFT option to \fBpcre_exec()\fP
or \fBpcre_dfa_exec()\fP; if used twice, pass the
- PCRE_PARTIAL_HARD option
+ PCRE_PARTIAL_HARD option
.\" JOIN
\eQdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd
(any number of digits)
@@ -551,7 +551,7 @@ the subject where there is at least one match. For example:
.sp
(Using the normal matching function on this data finds only "tang".) The
longest matching string is always given first (and numbered zero). After a
-PCRE_ERROR_PARTIAL return, the output is "Partial match:", followed by the
+PCRE_ERROR_PARTIAL return, the output is "Partial match:", followed by the
partially matching substring.
.P
If \fB/g\fP is present on the pattern, the search for further matches resumes
diff --git a/doc/pcretest.txt b/doc/pcretest.txt
index 46df32a..9cb6e8d 100644
--- a/doc/pcretest.txt
+++ b/doc/pcretest.txt
@@ -335,6 +335,8 @@ DATA LINES
(any number of digits)
\R pass the PCRE_DFA_RESTART option to pcre_dfa_exec()
\S output details of memory get/free calls during matching
+ \Y pass the PCRE_NO_START_OPTIMIZE option to pcre_exec()
+ or pcre_dfa_exec()
\Z pass the PCRE_NOTEOL option to pcre_exec()
or pcre_dfa_exec()
\? pass the PCRE_NO_UTF8_CHECK option to
@@ -661,5 +663,5 @@ AUTHOR
REVISION
- Last updated: 11 September 2009
+ Last updated: 26 September 2009
Copyright (c) 1997-2009 University of Cambridge.
diff --git a/doc/perltest.txt b/doc/perltest.txt
index b106390..3424b91 100644
--- a/doc/perltest.txt
+++ b/doc/perltest.txt
@@ -17,10 +17,10 @@ identical, apart from the initial identifying banner.
The perltest.pl script can also test UTF-8 features. It recognizes the special
modifier /8 that pcretest uses to invoke UTF-8 functionality. The testinput4
and testinput6 files can be fed to perltest to run compatible UTF-8 tests.
-However, it is necessary to add "use utf8;" to the script to make this work
-correctly.
+However, it is necessary to add "use utf8;" to the script to make this work
+correctly.
-The testinput11 file contains tests that use features of Perl 5.10, so does not
+The testinput11 file contains tests that use features of Perl 5.10, so does not
work with Perl 5.8.
The other testinput files are not suitable for feeding to perltest.pl, since
diff --git a/pcre_compile.c b/pcre_compile.c
index 5dd5cba..c360c43 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -343,7 +343,7 @@ static const char error_texts[] =
"digit expected after (?+\0"
"] is an invalid data character in JavaScript compatibility mode\0"
/* 65 */
- "different names for subpatterns of the same number are not allowed";
+ "different names for subpatterns of the same number are not allowed";
/* Table to identify digits and hex digits. This is used when compiling
@@ -1102,7 +1102,7 @@ if (ptr[0] == CHAR_LEFT_PARENTHESIS)
if (name != NULL && lorn == ptr - thisname &&
strncmp((const char *)name, (const char *)thisname, lorn) == 0)
return *count;
- term++;
+ term++;
}
}
}
@@ -1148,10 +1148,10 @@ for (; *ptr != 0; ptr++)
break;
}
else if (!negate_class && ptr[1] == CHAR_CIRCUMFLEX_ACCENT)
- {
+ {
negate_class = TRUE;
ptr++;
- }
+ }
else break;
}
@@ -1340,22 +1340,22 @@ for (;;)
/* Scan a branch and compute the fixed length of subject that will match it,
if the length is fixed. This is needed for dealing with backward assertions.
-In UTF8 mode, the result is in characters rather than bytes. The branch is
+In UTF8 mode, the result is in characters rather than bytes. The branch is
temporarily terminated with OP_END when this function is called.
-This function is called when a backward assertion is encountered, so that if it
-fails, the error message can point to the correct place in the pattern.
+This function is called when a backward assertion is encountered, so that if it
+fails, the error message can point to the correct place in the pattern.
However, we cannot do this when the assertion contains subroutine calls,
-because they can be forward references. We solve this by remembering this case
+because they can be forward references. We solve this by remembering this case
and doing the check at the end; a flag specifies which mode we are running in.
Arguments:
code points to the start of the pattern (the bracket)
options the compiling options
- atend TRUE if called when the pattern is complete
- cd the "compile data" structure
+ atend TRUE if called when the pattern is complete
+ cd the "compile data" structure
-Returns: the fixed length,
+Returns: the fixed length,
or -1 if there is no fixed length,
or -2 if \C was encountered
or -3 if an OP_RECURSE item was encountered and atend is FALSE
@@ -1405,21 +1405,21 @@ for (;;)
cc += 1 + LINK_SIZE;
branchlength = 0;
break;
-
+
/* A true recursion implies not fixed length, but a subroutine call may
be OK. If the subroutine is a forward reference, we can't deal with
it until the end of the pattern, so return -3. */
-
+
case OP_RECURSE:
if (!atend) return -3;
cs = ce = (uschar *)cd->start_code + GET(cc, 1); /* Start subpattern */
do ce += GET(ce, 1); while (*ce == OP_ALT); /* End subpattern */
if (cc > cs && cc < ce) return -1; /* Recursion */
d = find_fixedlength(cs + 2, options, atend, cd);
- if (d < 0) return d;
+ if (d < 0) return d;
branchlength += d;
cc += 1 + LINK_SIZE;
- break;
+ break;
/* Skip over assertive subpatterns */
@@ -1459,7 +1459,7 @@ for (;;)
branchlength++;
cc += 2;
#ifdef SUPPORT_UTF8
- if ((options & PCRE_UTF8) != 0 && cc[-1] >= 0xc0)
+ if ((options & PCRE_UTF8) != 0 && cc[-1] >= 0xc0)
cc += _pcre_utf8_table4[cc[-1] & 0x3f];
#endif
break;
@@ -1471,7 +1471,7 @@ for (;;)
branchlength += GET2(cc,1);
cc += 4;
#ifdef SUPPORT_UTF8
- if ((options & PCRE_UTF8) != 0 && cc[-1] >= 0xc0)
+ if ((options & PCRE_UTF8) != 0 && cc[-1] >= 0xc0)
cc += _pcre_utf8_table4[cc[-1] & 0x3f];
#endif
break;
@@ -1556,8 +1556,8 @@ for (;;)
/* This little function scans through a compiled pattern until it finds a
capturing bracket with the given number, or, if the number is negative, an
-instance of OP_REVERSE for a lookbehind. The function is global in the C sense
-so that it can be called from pcre_study() when finding the minimum matching
+instance of OP_REVERSE for a lookbehind. The function is global in the C sense
+so that it can be called from pcre_study() when finding the minimum matching
length.
Arguments:
@@ -1581,12 +1581,12 @@ for (;;)
the table is zero; the actual length is stored in the compiled code. */
if (c == OP_XCLASS) code += GET(code, 1);
-
+
/* Handle recursion */
-
+
else if (c == OP_REVERSE)
{
- if (number < 0) return (uschar *)code;
+ if (number < 0) return (uschar *)code;
code += _pcre_OP_lengths[c];
}
@@ -1957,7 +1957,7 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
case OP_POSQUERY:
if (utf8 && code[1] >= 0xc0) code += _pcre_utf8_table4[code[1] & 0x3f];
break;
-
+
case OP_UPTO:
case OP_MINUPTO:
case OP_POSUPTO:
@@ -3915,15 +3915,15 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
if (repeat_max == 0) goto END_REPEAT;
- /*--------------------------------------------------------------------*/
+ /*--------------------------------------------------------------------*/
/* This code is obsolete from release 8.00; the restriction was finally
removed: */
-
+
/* All real repeats make it impossible to handle partial matching (maybe
one day we will be able to remove this restriction). */
-
+
/* if (repeat_max != 1) cd->external_flags |= PCRE_NOPARTIAL; */
- /*--------------------------------------------------------------------*/
+ /*--------------------------------------------------------------------*/
/* Combine the op_type with the repeat_type */
@@ -4070,7 +4070,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
goto END_REPEAT;
}
- /*--------------------------------------------------------------------*/
+ /*--------------------------------------------------------------------*/
/* This code is obsolete from release 8.00; the restriction was finally
removed: */
@@ -4078,7 +4078,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
one day we will be able to remove this restriction). */
/* if (repeat_max != 1) cd->external_flags |= PCRE_NOPARTIAL; */
- /*--------------------------------------------------------------------*/
+ /*--------------------------------------------------------------------*/
if (repeat_min == 0 && repeat_max == -1)
*code++ = OP_CRSTAR + repeat_type;
@@ -4393,11 +4393,11 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
if (possessive_quantifier)
{
int len;
-
+
if (*tempcode == OP_TYPEEXACT)
tempcode += _pcre_OP_lengths[*tempcode] +
- ((tempcode[3] == OP_PROP || tempcode[3] == OP_NOTPROP)? 2 : 0);
-
+ ((tempcode[3] == OP_PROP || tempcode[3] == OP_NOTPROP)? 2 : 0);
+
else if (*tempcode == OP_EXACT || *tempcode == OP_NOTEXACT)
{
tempcode += _pcre_OP_lengths[*tempcode];
@@ -4405,8 +4405,8 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
if (utf8 && tempcode[-1] >= 0xc0)
tempcode += _pcre_utf8_table4[tempcode[-1] & 0x3f];
#endif
- }
-
+ }
+
len = code - tempcode;
if (len > 0) switch (*tempcode)
{
@@ -4485,17 +4485,17 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
strncmp((char *)name, vn, namelen) == 0)
{
/* Check for open captures before ACCEPT */
-
+
if (verbs[i].op == OP_ACCEPT)
{
- open_capitem *oc;
- cd->had_accept = TRUE;
+ open_capitem *oc;
+ cd->had_accept = TRUE;
for (oc = cd->open_caps; oc != NULL; oc = oc->next)
{
*code++ = OP_CLOSE;
- PUT2INC(code, 0, oc->number);
- }
- }
+ PUT2INC(code, 0, oc->number);
+ }
+ }
*code++ = verbs[i].op;
break;
}
@@ -4658,9 +4658,9 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
}
/* Otherwise (did not start with "+" or "-"), start by looking for the
- name. If we find a name, add one to the opcode to change OP_CREF or
- OP_RREF into OP_NCREF or OP_NRREF. These behave exactly the same,
- except they record that the reference was originally to a name. The
+ name. If we find a name, add one to the opcode to change OP_CREF or
+ OP_RREF into OP_NCREF or OP_NRREF. These behave exactly the same,
+ except they record that the reference was originally to a name. The
information is used to check duplicate names. */
slot = cd->name_table;
@@ -4887,7 +4887,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
is because the number of names, and hence the table size, is computed
in the pre-compile, and it affects various numbers and pointers which
would all have to be modified, and the compiled code moved down, if
- duplicates with the same number were omitted from the table. This
+ duplicates with the same number were omitted from the table. This
doesn't seem worth the hassle. However, *different* names for the
same number are not permitted. */
@@ -4895,7 +4895,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
{
BOOL dupname = FALSE;
slot = cd->name_table;
-
+
for (i = 0; i < cd->names_found; i++)
{
int crc = memcmp(name, slot+2, namelen);
@@ -4909,31 +4909,31 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
*errorcodeptr = ERR43;
goto FAILED;
}
- else dupname = TRUE;
+ else dupname = TRUE;
}
else crc = -1; /* Current name is a substring */
}
-
- /* Make space in the table and break the loop for an earlier
- name. For a duplicate or later name, carry on. We do this for
- duplicates so that in the simple case (when ?(| is not used) they
+
+ /* Make space in the table and break the loop for an earlier
+ name. For a duplicate or later name, carry on. We do this for
+ duplicates so that in the simple case (when ?(| is not used) they
are in order of their numbers. */
-
+
if (crc < 0)
{
memmove(slot + cd->name_entry_size, slot,
(cd->names_found - i) * cd->name_entry_size);
break;
}
-
+
/* Continue the loop for a later or duplicate name */
-
+
slot += cd->name_entry_size;
}
-
+
/* For non-duplicate names, check for a duplicate number before
adding the new name. */
-
+
if (!dupname)
{
uschar *cslot = cd->name_table;
@@ -4945,12 +4945,12 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
{
*errorcodeptr = ERR65;
goto FAILED;
- }
+ }
}
- else i--;
+ else i--;
cslot += cd->name_entry_size;
- }
- }
+ }
+ }
PUT2(slot, 0, cd->bracount + 1);
memcpy(slot + 2, name, namelen);
@@ -5131,7 +5131,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
if (lengthptr == NULL)
{
*code = OP_END;
- if (recno != 0)
+ if (recno != 0)
called = _pcre_find_bracket(cd->start_code, utf8, recno);
/* Forward reference */
@@ -5812,8 +5812,8 @@ if (*code == OP_CBRA)
capnumber = GET2(code, 1 + LINK_SIZE);
capitem.number = capnumber;
capitem.next = cd->open_caps;
- cd->open_caps = &capitem;
- }
+ cd->open_caps = &capitem;
+ }
/* Offset is set zero to mark that this bracket is still open */
@@ -5909,10 +5909,10 @@ for (;;)
/* If lookbehind, check that this branch matches a fixed-length string, and
put the length into the OP_REVERSE item. Temporarily mark the end of the
- branch with OP_END. If the branch contains OP_RECURSE, the result is -3
+ branch with OP_END. If the branch contains OP_RECURSE, the result is -3
because there may be forward references that we can't check here. Set a
- flag to cause another lookbehind check at the end. Why not do it all at the
- end? Because common, erroneous checks are picked up here and the offset of
+ flag to cause another lookbehind check at the end. Why not do it all at the
+ end? Because common, erroneous checks are picked up here and the offset of
the problem can be shown. */
if (lookbehind)
@@ -5923,8 +5923,8 @@ for (;;)
DPRINTF(("fixed length = %d\n", fixed_length));
if (fixed_length == -3)
{
- cd->check_lookbehind = TRUE;
- }
+ cd->check_lookbehind = TRUE;
+ }
else if (fixed_length < 0)
{
*errorcodeptr = (fixed_length == -2)? ERR36 : ERR25;
@@ -5958,9 +5958,9 @@ for (;;)
}
while (branch_length > 0);
}
-
+
/* If it was a capturing subpattern, remove it from the chain. */
-
+
if (capnumber > 0) cd->open_caps = cd->open_caps->next;
/* Fill in the ket */
@@ -6654,7 +6654,7 @@ subpattern. */
if (errorcode == 0 && re->top_backref > re->top_bracket) errorcode = ERR15;
-/* If there were any lookbehind assertions that contained OP_RECURSE
+/* If there were any lookbehind assertions that contained OP_RECURSE
(recursions or subroutine calls), a flag is set for them to be checked here,
because they may contain forward references. Actual recursions can't be fixed
length, but subroutine calls can. It is done like this so that those without
@@ -6665,21 +6665,21 @@ length, and set their lengths. */
if (cd->check_lookbehind)
{
uschar *cc = (uschar *)codestart;
-
- /* Loop, searching for OP_REVERSE items, and process those that do not have
- their length set. (Actually, it will also re-process any that have a length
- of zero, but that is a pathological case, and it does no harm.) When we find
+
+ /* Loop, searching for OP_REVERSE items, and process those that do not have
+ their length set. (Actually, it will also re-process any that have a length
+ of zero, but that is a pathological case, and it does no harm.) When we find
one, we temporarily terminate the branch it is in while we scan it. */
-
+
for (cc = (uschar *)_pcre_find_bracket(codestart, utf8, -1);
cc != NULL;
cc = (uschar *)_pcre_find_bracket(cc, utf8, -1))
- {
+ {
if (GET(cc, 1) == 0)
- {
- int fixed_length;
+ {
+ int fixed_length;
uschar *be = cc - 1 - LINK_SIZE + GET(cc, -LINK_SIZE);
- int end_op = *be;
+ int end_op = *be;
*be = OP_END;
fixed_length = find_fixedlength(cc, re->options, TRUE, cd);
*be = end_op;
@@ -6687,13 +6687,13 @@ if (cd->check_lookbehind)
if (fixed_length < 0)
{
errorcode = (fixed_length == -2)? ERR36 : ERR25;
- break;
+ break;
}
PUT(cc, 1, fixed_length);
}
cc += 1 + LINK_SIZE;
- }
- }
+ }
+ }
/* Failed to compile, or error while post-processing */
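Reviewer note on the name-table hunks in pcre_compile.c above: the code writes each entry as a 2-byte group number (via PUT2) followed by the name, with entries spaced `name_entry_size` bytes apart. The sketch below shows how such a table could be read back to collect every number recorded for one name. It is a simplified linear scan, not the library's own lookup; the names `get2` and `numbers_for_name` are hypothetical, and it assumes the number is stored most significant byte first and the name is zero-terminated.

```c
#include <assert.h>
#include <string.h>

/* Read a 2-byte value stored most significant byte first, in the manner
   of the GET2 macro (assumed layout). */
static int get2(const unsigned char *p)
{
return (p[0] << 8) | p[1];
}

/* Hypothetical helper: store up to max group numbers recorded for name
   in out[], and return how many table entries carry that name. */
int numbers_for_name(const unsigned char *table, int count, int entry_size,
  const char *name, int *out, int max)
{
int i;
int found = 0;
assert(table != NULL && name != NULL && out != NULL && entry_size > 2);
for (i = 0; i < count; i++)
  {
  const unsigned char *slot = table + i * entry_size;
  if (strcmp((const char *)slot + 2, name) == 0)
    {
    if (found < max) out[found] = get2(slot);
    found++;
    }
  }
return found;
}
```

A linear scan is used here because it does not depend on the ordering of entries; code that relies on duplicates being adjacent can instead stop at the first non-matching neighbour.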
diff --git a/pcre_dfa_exec.c b/pcre_dfa_exec.c
index ce1c456..93a885e 100644
--- a/pcre_dfa_exec.c
+++ b/pcre_dfa_exec.c
@@ -45,9 +45,9 @@ FSM). This is NOT Perl- compatible, but it has advantages in certain
applications. */
-/* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
-the performance of his patterns greatly. I could not use it as it stood, as it
-was not thread safe, and made assumptions about pattern sizes. Also, it caused
+/* NOTE ABOUT PERFORMANCE: A user of this function sent some code that improved
+the performance of his patterns greatly. I could not use it as it stood, as it
+was not thread safe, and made assumptions about pattern sizes. Also, it caused
test 7 to loop, and test 9 to crash with a segfault.
The issue is the check for duplicate states, which is done by a simple linear
@@ -68,7 +68,7 @@ was the extra time to initialize the index. This had to be done for each call
of internal_dfa_exec(). (The supplied patch used a static vector, initialized
only once - I suspect this was the cause of the problems with the tests.)
-Overall, I concluded that the gains in some cases did not outweigh the losses
+Overall, I concluded that the gains in some cases did not outweigh the losses
in others, so I abandoned this code. */
@@ -417,11 +417,11 @@ if (*first_op == OP_REVERSE)
current_subject - start_subject : max_back;
current_subject -= gone_back;
}
-
+
/* Save the earliest consulted character */
-
- if (current_subject < md->start_used_ptr)
- md->start_used_ptr = current_subject;
+
+ if (current_subject < md->start_used_ptr)
+ md->start_used_ptr = current_subject;
/* Now we can process the individual branches. */
@@ -488,7 +488,7 @@ for (;;)
int clen, dlen;
unsigned int c, d;
int forced_fail = 0;
- int reached_end = 0;
+ int reached_end = 0;
/* Make the new state list into the active state list and empty the
new state list. */
@@ -578,7 +578,7 @@ for (;;)
}
}
- /* Check for a duplicate state with the same count, and skip if found.
+ /* Check for a duplicate state with the same count, and skip if found.
See the note at the head of this module about the possibility of improving
performance here. */
@@ -647,7 +647,7 @@ for (;;)
/* ========================================================================== */
/* Reached a closing bracket. If not at the end of the pattern, carry
on with the next opcode. Otherwise, unless we have an empty string and
- PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
+ PCRE_NOTEMPTY is set, or PCRE_NOTEMPTY_ATSTART is set and we are at the
start of the subject, save the match data, shifting up all previous
matches so we always have the longest first. */
@@ -662,10 +662,10 @@ for (;;)
ADD_ACTIVE(state_offset - GET(code, 1), 0);
}
}
- else
+ else
{
- reached_end++; /* Count branches that reach the end */
- if (ptr > current_subject ||
+ reached_end++; /* Count branches that reach the end */
+ if (ptr > current_subject ||
((md->moptions & PCRE_NOTEMPTY) == 0 &&
((md->moptions & PCRE_NOTEMPTY_ATSTART) == 0 ||
current_subject > start_subject + md->start_offset)))
@@ -689,7 +689,7 @@ for (;;)
match_count, rlevel*2-2, SP));
return match_count;
}
- }
+ }
}
break;
@@ -839,7 +839,7 @@ for (;;)
if (ptr > start_subject)
{
const uschar *temp = ptr - 1;
- if (temp < md->start_used_ptr) md->start_used_ptr = temp;
+ if (temp < md->start_used_ptr) md->start_used_ptr = temp;
#ifdef SUPPORT_UTF8
if (utf8) BACKCHAR(temp);
#endif
@@ -848,13 +848,13 @@ for (;;)
}
else left_word = 0;
- if (clen > 0)
+ if (clen > 0)
right_word = c < 256 && (ctypes[c] & ctype_word) != 0;
else /* This is a fudge to ensure that if this is the */
{ /* last item in the pattern, we don't count it as */
reached_end--; /* reached, thus disabling a partial match. */
right_word = 0;
- }
+ }
if ((left_word == right_word) == (codevalue == OP_NOT_WORD_BOUNDARY))
{ ADD_ACTIVE(state_offset + 1, 0); }
@@ -2287,7 +2287,7 @@ for (;;)
/* Back reference conditions are not supported */
- if (condcode == OP_CREF || condcode == OP_NCREF)
+ if (condcode == OP_CREF || condcode == OP_NCREF)
return PCRE_ERROR_DFA_UCOND;
/* The DEFINE condition is always false */
@@ -2531,7 +2531,7 @@ for (;;)
if (new_count <= 0)
{
if (rlevel == 1 && /* Top level, and */
- reached_end != workspace[1] && /* Not all reached end */
+ reached_end != workspace[1] && /* Not all reached end */
forced_fail != workspace[1] && /* Not all forced fail & */
( /* either... */
(md->moptions & PCRE_PARTIAL_HARD) != 0 /* Hard partial */
@@ -2652,7 +2652,7 @@ if (extra_data != NULL)
if ((flags & PCRE_EXTRA_TABLES) != 0)
md->tables = extra_data->tables;
}
-
+
/* Check that the first field in the block is the magic number. If it is not,
test for a regex that was compiled on a host of opposite endianness. If this is
the case, flipped values are put in internal_re and internal_study if there was
@@ -2914,13 +2914,13 @@ for (;;)
end_subject = save_end_subject;
- /* The following two optimizations are disabled for partial matching or if
- disabling is explicitly requested (and of course, by the test above, this
+ /* The following two optimizations are disabled for partial matching or if
+ disabling is explicitly requested (and of course, by the test above, this
code is not obeyed when restarting after a partial match). */
-
+
if ((options & PCRE_NO_START_OPTIMIZE) == 0 &&
(options & (PCRE_PARTIAL_HARD|PCRE_PARTIAL_SOFT)) == 0)
- {
+ {
/* If the pattern was studied, a minimum subject length may be set. This
is a lower bound; no actual string of that length may actually match the
pattern. Although the value is, strictly, in characters, we treat it as
@@ -2929,7 +2929,7 @@ for (;;)
if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
end_subject - current_subject < study->minlength)
return PCRE_ERROR_NOMATCH;
-
+
/* If req_byte is set, we know that that character must appear in the
subject for the match to succeed. If the first character is set, req_byte
must be later in the subject; otherwise the test starts at the match
@@ -2937,19 +2937,19 @@ for (;;)
nested unlimited repeats that aren't going to match. Writing separate
code for cased/caseless versions makes it go faster, as does using an
autoincrement and backing off on a match.
-
+
HOWEVER: when the subject string is very, very long, searching to its end
can take a long time, and give bad performance on quite ordinary
patterns. This showed up when somebody was matching /^C/ on a 32-megabyte
string... so we don't do this when the string is sufficiently long. */
-
+
if (req_byte >= 0 && end_subject - current_subject < REQ_BYTE_MAX)
{
register const uschar *p = current_subject + ((first_byte >= 0)? 1 : 0);
-
+
/* We don't need to repeat the search if we haven't yet reached the
place we found it at last time. */
-
+
if (p > req_byte_ptr)
{
if (req_byte_caseless)
@@ -2967,26 +2967,26 @@ for (;;)
if (*p++ == req_byte) { p--; break; }
}
}
-
+
/* If we can't find the required character, break the matching loop,
which will cause a return or PCRE_ERROR_NOMATCH. */
-
+
if (p >= end_subject) break;
-
+
/* If we have found the required character, save the point where we
found it, so that we don't search again next time round the loop if
the start hasn't passed this character yet. */
-
+
req_byte_ptr = p;
}
- }
+ }
}
} /* End of optimizations that are done when not restarting */
/* OK, now we can do the business */
md->start_used_ptr = current_subject;
-
+
rc = internal_dfa_exec(
md, /* fixed match data */
md->start_code, /* this subexpression's code */
diff --git a/pcre_exec.c b/pcre_exec.c
index 607c57a..db1e926 100644
--- a/pcre_exec.c
+++ b/pcre_exec.c
@@ -843,34 +843,34 @@ for (;;)
{
if (md->recursive == NULL) /* Not recursing => FALSE */
{
- condition = FALSE;
- ecode += GET(ecode, 1);
- }
+ condition = FALSE;
+ ecode += GET(ecode, 1);
+ }
else
- {
+ {
int recno = GET2(ecode, LINK_SIZE + 2); /* Recursion group number*/
condition = (recno == RREF_ANY || recno == md->recursive->group_num);
-
+
/* If the test is for recursion into a specific subpattern, and it is
false, but the test was set up by name, scan the table to see if the
name refers to any other numbers, and test them. The condition is true
if any one is set. */
-
+
if (!condition && condcode == OP_NRREF && recno != RREF_ANY)
{
uschar *slotA = md->name_table;
for (i = 0; i < md->name_count; i++)
- {
- if (GET2(slotA, 0) == recno) break;
+ {
+ if (GET2(slotA, 0) == recno) break;
slotA += md->name_entry_size;
}
-
+
/* Found a name for the number - there can be only one; duplicate
names for different numbers are allowed, but not vice versa. First
scan down for duplicates. */
-
+
if (i < md->name_count)
- {
+ {
uschar *slotB = slotA;
while (slotB > md->name_table)
{
@@ -878,15 +878,15 @@ for (;;)
if (strcmp((char *)slotA + 2, (char *)slotB + 2) == 0)
{
condition = GET2(slotB, 0) == md->recursive->group_num;
- if (condition) break;
- }
+ if (condition) break;
+ }
else break;
- }
-
+ }
+
/* Scan up for duplicates */
-
+
if (!condition)
- {
+ {
slotB = slotA;
for (i++; i < md->name_count; i++)
{
@@ -895,46 +895,46 @@ for (;;)
{
condition = GET2(slotB, 0) == md->recursive->group_num;
if (condition) break;
- }
+ }
else break;
- }
- }
+ }
+ }
}
- }
-
+ }
+
/* Chose branch according to the condition */
-
+
ecode += condition? 3 : GET(ecode, 1);
}
- }
+ }
else if (condcode == OP_CREF || condcode == OP_NCREF) /* Group used test */
{
offset = GET2(ecode, LINK_SIZE+2) << 1; /* Doubled ref number */
condition = offset < offset_top && md->offset_vector[offset] >= 0;
-
+
/* If the numbered capture is unset, but the reference was by name,
- scan the table to see if the name refers to any other numbers, and test
- them. The condition is true if any one is set. This is tediously similar
- to the code above, but not close enough to try to amalgamate. */
-
+ scan the table to see if the name refers to any other numbers, and test
+ them. The condition is true if any one is set. This is tediously similar
+ to the code above, but not close enough to try to amalgamate. */
+
if (!condition && condcode == OP_NCREF)
{
- int refno = offset >> 1;
+ int refno = offset >> 1;
uschar *slotA = md->name_table;
-
+
for (i = 0; i < md->name_count; i++)
- {
- if (GET2(slotA, 0) == refno) break;
+ {
+ if (GET2(slotA, 0) == refno) break;
slotA += md->name_entry_size;
}
-
- /* Found a name for the number - there can be only one; duplicate names
- for different numbers are allowed, but not vice versa. First scan down
+
+ /* Found a name for the number - there can be only one; duplicate names
+ for different numbers are allowed, but not vice versa. First scan down
for duplicates. */
-
+
if (i < md->name_count)
- {
+ {
uschar *slotB = slotA;
while (slotB > md->name_table)
{
@@ -942,17 +942,17 @@ for (;;)
if (strcmp((char *)slotA + 2, (char *)slotB + 2) == 0)
{
offset = GET2(slotB, 0) << 1;
- condition = offset < offset_top &&
+ condition = offset < offset_top &&
md->offset_vector[offset] >= 0;
- if (condition) break;
- }
+ if (condition) break;
+ }
else break;
- }
-
+ }
+
/* Scan up for duplicates */
-
+
if (!condition)
- {
+ {
slotB = slotA;
for (i++; i < md->name_count; i++)
{
@@ -960,16 +960,16 @@ for (;;)
if (strcmp((char *)slotA + 2, (char *)slotB + 2) == 0)
{
offset = GET2(slotB, 0) << 1;
- condition = offset < offset_top &&
+ condition = offset < offset_top &&
md->offset_vector[offset] >= 0;
- if (condition) break;
- }
+ if (condition) break;
+ }
else break;
- }
- }
+ }
+ }
}
- }
-
+ }
+
/* Choose branch according to the condition */
ecode += condition? 3 : GET(ecode, 1);
@@ -1030,15 +1030,15 @@ for (;;)
ecode += 1 + LINK_SIZE;
}
break;
-
+
/* Before OP_ACCEPT there may be any number of OP_CLOSE opcodes,
to close any currently open capturing brackets. */
-
+
case OP_CLOSE:
- number = GET2(ecode, 1);
+ number = GET2(ecode, 1);
offset = number << 1;
-
+
#ifdef DEBUG
printf("end bracket %d at *ACCEPT", number);
printf("\n");
@@ -1053,7 +1053,7 @@ for (;;)
if (offset_top <= offset) offset_top = offset + 2;
}
ecode += 3;
- break;
+ break;
/* End of the pattern, either real or forced. If we are in a top-level
@@ -1069,7 +1069,7 @@ for (;;)
md->recursive = rec->prevrec;
memmove(md->offset_vector, rec->offset_save,
rec->saved_max * sizeof(int));
- offset_top = rec->offset_top;
+ offset_top = rec->save_offset_top;
mstart = rec->save_start;
ims = original_ims;
ecode = rec->after_call;
@@ -1261,7 +1261,7 @@ for (;;)
memcpy(new_recursive.offset_save, md->offset_vector,
new_recursive.saved_max * sizeof(int));
new_recursive.save_start = mstart;
- new_recursive.offset_top = offset_top;
+ new_recursive.save_offset_top = offset_top;
mstart = eptr;
/* OK, now we can do the recursion. For each top-level alternative we
@@ -1460,7 +1460,7 @@ for (;;)
{
number = GET2(prev, 1+LINK_SIZE);
offset = number << 1;
-
+
#ifdef DEBUG
printf("end bracket %d", number);
printf("\n");
@@ -1486,7 +1486,7 @@ for (;;)
mstart = rec->save_start;
memcpy(md->offset_vector, rec->offset_save,
rec->saved_max * sizeof(int));
- offset_top = rec->offset_top;
+ offset_top = rec->save_offset_top;
ecode = rec->after_call;
ims = original_ims;
break;
@@ -5010,7 +5010,7 @@ if (re == NULL || subject == NULL ||
(offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
-/* This information is for finding all the numbers associated with a given
+/* This information is for finding all the numbers associated with a given
name, for condition testing. */
md->name_table = (uschar *)re + re->name_table_offset;
@@ -5375,24 +5375,24 @@ for(;;)
/* Restore fudged end_subject */
end_subject = save_end_subject;
-
- /* The following two optimizations are disabled for partial matching or if
+
+ /* The following two optimizations are disabled for partial matching or if
disabling is explicitly requested. */
-
- if ((options & PCRE_NO_START_OPTIMIZE) == 0 && !md->partial)
- {
+
+ if ((options & PCRE_NO_START_OPTIMIZE) == 0 && !md->partial)
+ {
/* If the pattern was studied, a minimum subject length may be set. This is
a lower bound; no actual string of that length may actually match the
pattern. Although the value is, strictly, in characters, we treat it as
bytes to avoid spending too much time in this optimization. */
-
+
if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
end_subject - start_match < study->minlength)
{
rc = MATCH_NOMATCH;
- break;
+ break;
}
-
+
/* If req_byte is set, we know that that character must appear in the
subject for the match to succeed. If the first character is set, req_byte
must be later in the subject; otherwise the test starts at the match point.
@@ -5400,20 +5400,20 @@ for(;;)
nested unlimited repeats that aren't going to match. Writing separate code
for cased/caseless versions makes it go faster, as does using an
autoincrement and backing off on a match.
-
+
HOWEVER: when the subject string is very, very long, searching to its end
can take a long time, and give bad performance on quite ordinary patterns.
This showed up when somebody was matching something like /^\d+C/ on a
32-megabyte string... so we don't do this when the string is sufficiently
long. */
-
+
if (req_byte >= 0 && end_subject - start_match < REQ_BYTE_MAX)
{
register USPTR p = start_match + ((first_byte >= 0)? 1 : 0);
-
+
/* We don't need to repeat the search if we haven't yet reached the
place we found it at last time. */
-
+
if (p > req_byte_ptr)
{
if (req_byte_caseless)
@@ -5431,24 +5431,24 @@ for(;;)
if (*p++ == req_byte) { p--; break; }
}
}
-
+
/* If we can't find the required character, break the matching loop,
forcing a match failure. */
-
+
if (p >= end_subject)
{
rc = MATCH_NOMATCH;
break;
}
-
+
/* If we have found the required character, save the point where we
found it, so that we don't search again next time round the loop if
the start hasn't passed this character yet. */
-
+
req_byte_ptr = p;
}
}
- }
+ }
#ifdef DEBUG /* Sigh. Some compilers never learn. */
printf(">>>> Match against: ");
@@ -5575,7 +5575,7 @@ if (rc == MATCH_MATCH)
too many to fit into the vector. */
rc = md->offset_overflow? 0 : md->end_offset_top/2;
-
+
/* If there is space, set up the whole thing as substring 0. The value of
md->start_match_ptr might be modified if \K was encountered on the success
matching path. */
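
Note on the OP_NCREF hunks above: the comments describe a name table in which entries with the same name sit next to each other, so that once one entry for a group number is found, its duplicates can be reached by scanning down and then up. The following is a minimal, self-contained sketch of that scan. It is not the pcre_exec.c code: the fixed ENTRY_SIZE, the is_set array and the function names are assumptions made for illustration, and the entry layout (a two-byte, most-significant-byte-first number followed by a NUL-terminated name) is inferred from the GET2(slot, 0) and slot + 2 accesses in the hunk.

```c
#include <string.h>

/* Hypothetical simplified table: ENTRY_SIZE bytes per entry, a two-byte group
   number (high byte first) followed by a NUL-terminated name. Entries with
   the same name are assumed to be adjacent. */
#define ENTRY_SIZE 6

static int entry_number(const unsigned char *slot)
{
return (slot[0] << 8) | slot[1];
}

/* Returns 1 if group refno, or any other group that shares its name, is set.
   is_set[n] is non-zero when group n has captured something. */
static int any_same_name_set(const unsigned char *table, int count,
  int refno, const int *is_set)
{
const unsigned char *slotA = table;
const unsigned char *slotB;
int i;

if (is_set[refno]) return 1;            /* The numbered group itself */

for (i = 0; i < count; i++)             /* Find an entry for this number */
  {
  if (entry_number(slotA) == refno) break;
  slotA += ENTRY_SIZE;
  }
if (i >= count) return 0;               /* The number has no name */

/* Scan down over adjacent entries with the same name */
for (slotB = slotA; slotB > table; )
  {
  slotB -= ENTRY_SIZE;
  if (strcmp((const char *)slotA + 2, (const char *)slotB + 2) != 0) break;
  if (is_set[entry_number(slotB)]) return 1;
  }

/* Scan up over adjacent entries with the same name */
slotB = slotA;
for (i++; i < count; i++)
  {
  slotB += ENTRY_SIZE;
  if (strcmp((const char *)slotA + 2, (const char *)slotB + 2) != 0) break;
  if (is_set[entry_number(slotB)]) return 1;
  }
return 0;
}
```

With a table holding D=1, D=4 and E=2, and only group 4 set, a test on group 1 succeeds because the upward scan reaches the second D entry, while a test on group 2 fails.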
diff --git a/pcre_fullinfo.c b/pcre_fullinfo.c
index 120f1ef..6b8d789 100644
--- a/pcre_fullinfo.c
+++ b/pcre_fullinfo.c
@@ -122,12 +122,12 @@ switch (what)
(study != NULL && (study->flags & PCRE_STUDY_MAPPED) != 0)?
((const pcre_study_data *)extra_data->study_data)->start_bits : NULL;
break;
-
+
case PCRE_INFO_MINLENGTH:
*((int *)where) =
(study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0)?
study->minlength : -1;
- break;
+ break;
case PCRE_INFO_LASTLITERAL:
*((int *)where) =
@@ -152,7 +152,7 @@ switch (what)
/* From release 8.00 this will always return TRUE because NOPARTIAL is
no longer ever set (the restrictions have been removed). */
-
+
case PCRE_INFO_OKPARTIAL:
*((int *)where) = (re->flags & PCRE_NOPARTIAL) == 0;
break;
diff --git a/pcre_internal.h b/pcre_internal.h
index 9ba2c88..a892af9 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -1348,7 +1348,7 @@ enum {
OP_SCOND, /* 99 Conditional group, check empty */
/* The next two pairs must (respectively) be kept together. */
-
+
OP_CREF, /* 100 Used to hold a capture number as condition */
OP_NCREF, /* 101 Same, but generated by a name reference*/
OP_RREF, /* 102 Used to hold a recursion number as condition */
@@ -1588,7 +1588,7 @@ typedef struct recursion_info {
USPTR save_start; /* Old value of mstart */
int *offset_save; /* Pointer to start of saved offsets */
int saved_max; /* Number of saved offsets */
- int offset_top; /* Current value of offset_top */
+ int save_offset_top; /* Current value of offset_top */
} recursion_info;
/* Structure for building a chain of data for holding the values of the subject
@@ -1615,7 +1615,7 @@ typedef struct match_data {
int nllen; /* Newline string length */
int name_count; /* Number of names in name table */
int name_entry_size; /* Size of entry in names table */
- uschar *name_table; /* Table of names */
+ uschar *name_table; /* Table of names */
uschar nl[4]; /* Newline string when fixed */
const uschar *lcc; /* Points to lower casing table */
const uschar *ctypes; /* Points to table of type maps */
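
Note on the recursion_info hunk above: the renamed save_offset_top field holds the value of offset_top at the point of the recursive call, alongside the saved copy of the offset vector, and both are put back when the recursion ends. A minimal sketch of that save/restore pairing follows. The struct and function names are hypothetical, and a fixed-size array stands in for the offset_save pointer of the real structure.

```c
#include <string.h>

#define MAX_OFFSETS 8                   /* Assumed size, for illustration */

/* Hypothetical stand-in for the relevant fields of recursion_info */
typedef struct recursion_sketch {
  int offset_save[MAX_OFFSETS];         /* Saved copy of the offsets */
  int saved_max;                        /* Number of saved offsets */
  int save_offset_top;                  /* Value of offset_top at the call */
} recursion_sketch;

/* Record the caller's capture state before recursing */
static void save_state(recursion_sketch *rec, const int *offset_vector,
  int count, int offset_top)
{
if (count > MAX_OFFSETS) count = MAX_OFFSETS;
rec->saved_max = count;
memcpy(rec->offset_save, offset_vector, count * sizeof(int));
rec->save_offset_top = offset_top;
}

/* Put the capture state back after the recursion; returns the restored
   value of offset_top */
static int restore_state(const recursion_sketch *rec, int *offset_vector)
{
memcpy(offset_vector, rec->offset_save, rec->saved_max * sizeof(int));
return rec->save_offset_top;
}
```

Any captures set inside the recursion are discarded by restore_state(), which is the behaviour the memcpy/memmove calls in the pcre_exec.c hunks implement.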
diff --git a/pcre_printint.src b/pcre_printint.src
index 60d12f9..acfc4ca 100644
--- a/pcre_printint.src
+++ b/pcre_printint.src
@@ -245,13 +245,13 @@ for(;;)
else fprintf(f, " ");
fprintf(f, "%s", OP_names[*code]);
break;
-
+
case OP_CLOSE:
fprintf(f, " %s %d", OP_names[*code], GET2(code, 1));
- break;
+ break;
case OP_CREF:
- case OP_NCREF:
+ case OP_NCREF:
fprintf(f, "%3d %s", GET2(code,1), OP_names[*code]);
break;
diff --git a/pcre_study.c b/pcre_study.c
index 23f51a0..8e1cc6e 100644
--- a/pcre_study.c
+++ b/pcre_study.c
@@ -60,18 +60,18 @@ enum { SSB_FAIL, SSB_DONE, SSB_CONTINUE };
*************************************************/
/* Scan a parenthesized group and compute the minimum length of subject that
-is needed to match it. This is a lower bound; it does not mean there is a
+is needed to match it. This is a lower bound; it does not mean there is a
string of that length that matches. In UTF8 mode, the result is in characters
rather than bytes.
Arguments:
code pointer to start of group (the bracket)
startcode pointer to start of the whole pattern
- options the compiling options
+ options the compiling options
Returns: the minimum length
-1 if \C was encountered
- -2 internal error (missing capturing bracket)
+ -2 internal error (missing capturing bracket)
*/
static int
@@ -91,18 +91,18 @@ branch, check the length against that of the other branches. */
for (;;)
{
int d, min;
- uschar *cs, *ce;
+ uschar *cs, *ce;
register int op = *cc;
-
+
switch (op)
{
case OP_CBRA:
- case OP_SCBRA:
+ case OP_SCBRA:
case OP_BRA:
- case OP_SBRA:
+ case OP_SBRA:
case OP_ONCE:
case OP_COND:
- case OP_SCOND:
+ case OP_SCOND:
d = find_minlength(cc, startcode, options);
if (d < 0) return d;
branchlength += d;
@@ -119,12 +119,12 @@ for (;;)
case OP_KETRMAX:
case OP_KETRMIN:
case OP_END:
- if (length < 0 || (!had_recurse && branchlength < length))
+ if (length < 0 || (!had_recurse && branchlength < length))
length = branchlength;
if (*cc != OP_ALT) return length;
cc += 1 + LINK_SIZE;
branchlength = 0;
- had_recurse = FALSE;
+ had_recurse = FALSE;
break;
/* Skip over assertive subpatterns */
@@ -156,11 +156,11 @@ for (;;)
case OP_WORD_BOUNDARY:
cc += _pcre_OP_lengths[*cc];
break;
-
+
/* Skip over a subpattern that has a {0} or {0,x} quantifier */
case OP_BRAZERO:
- case OP_BRAMINZERO:
+ case OP_BRAMINZERO:
case OP_SKIPZERO:
cc += _pcre_OP_lengths[*cc];
do cc += GET(cc, 1); while (*cc == OP_ALT);
@@ -184,10 +184,10 @@ for (;;)
if (utf8 && cc[-1] >= 0xc0) cc += _pcre_utf8_table4[cc[-1] & 0x3f];
#endif
break;
-
+
case OP_TYPEPLUS:
case OP_TYPEMINPLUS:
- case OP_TYPEPOSPLUS:
+ case OP_TYPEPOSPLUS:
branchlength++;
cc += (cc[1] == OP_PROP || cc[1] == OP_NOTPROP)? 4 : 2;
break;
@@ -196,7 +196,7 @@ for (;;)
need to skip over a multibyte character in UTF8 mode. */
case OP_EXACT:
- case OP_NOTEXACT:
+ case OP_NOTEXACT:
branchlength += GET2(cc,1);
cc += 4;
#ifdef SUPPORT_UTF8
@@ -225,20 +225,20 @@ for (;;)
case OP_ANY:
case OP_ALLANY:
case OP_EXTUNI:
- case OP_HSPACE:
+ case OP_HSPACE:
case OP_NOT_HSPACE:
case OP_VSPACE:
- case OP_NOT_VSPACE:
+ case OP_NOT_VSPACE:
branchlength++;
cc++;
break;
-
+
/* "Any newline" might match two characters */
-
+
case OP_ANYNL:
branchlength += 2;
cc++;
- break;
+ break;
/* The single-byte matcher means we can't proceed in UTF-8 mode */
@@ -248,7 +248,7 @@ for (;;)
#endif
branchlength++;
cc++;
- break;
+ break;
/* For repeated character types, we have to test for \p and \P, which have
an extra two bytes of parameters. */
@@ -287,35 +287,35 @@ for (;;)
case OP_CRPLUS:
case OP_CRMINPLUS:
branchlength++;
- /* Fall through */
+ /* Fall through */
case OP_CRSTAR:
case OP_CRMINSTAR:
case OP_CRQUERY:
case OP_CRMINQUERY:
- cc++;
+ cc++;
break;
-
+
case OP_CRRANGE:
case OP_CRMINRANGE:
branchlength += GET2(cc,1);
cc += 5;
break;
-
+
default:
branchlength++;
- break;
+ break;
}
break;
-
- /* Backreferences and subroutine calls are treated in the same way: we find
- the minimum length for the subpattern. A recursion, however, causes an
+
+ /* Backreferences and subroutine calls are treated in the same way: we find
+ the minimum length for the subpattern. A recursion, however, causes
a flag to be set that causes the length of this branch to be ignored. The
logic is that a recursion can only make sense if there is another
alternation that stops the recursing. That will provide the minimum length
(when no recursion happens). A backreference within the group that it is
referencing behaves in the same way. */
-
+
case OP_REF:
ce = cs = (uschar *)_pcre_find_bracket(startcode, utf8, GET2(cc, 1));
if (cs == NULL) return -2;
@@ -323,13 +323,13 @@ for (;;)
if (cc > cs && cc < ce)
{
d = 0;
- had_recurse = TRUE;
- }
+ had_recurse = TRUE;
+ }
else d = find_minlength(cs, startcode, options);
- cc += 3;
+ cc += 3;
/* Handle repeated back references */
-
+
switch (*cc)
{
case OP_CRSTAR:
@@ -339,61 +339,61 @@ for (;;)
min = 0;
cc++;
break;
-
+
case OP_CRRANGE:
case OP_CRMINRANGE:
min = GET2(cc, 1);
cc += 5;
break;
-
+
default:
min = 1;
break;
}
branchlength += min * d;
- break;
+ break;
- case OP_RECURSE:
+ case OP_RECURSE:
cs = ce = (uschar *)startcode + GET(cc, 1);
if (cs == NULL) return -2;
do ce += GET(ce, 1); while (*ce == OP_ALT);
if (cc > cs && cc < ce)
- had_recurse = TRUE;
- else
+ had_recurse = TRUE;
+ else
branchlength += find_minlength(cs, startcode, options);
cc += 1 + LINK_SIZE;
break;
/* Anything else does not or need not match a character. We can get the
- item's length from the table, but for those that can match zero occurrences
- of a character, we must take special action for UTF-8 characters. */
-
+ item's length from the table, but for those that can match zero occurrences
+ of a character, we must take special action for UTF-8 characters. */
+
case OP_UPTO:
- case OP_NOTUPTO:
+ case OP_NOTUPTO:
case OP_MINUPTO:
- case OP_NOTMINUPTO:
+ case OP_NOTMINUPTO:
case OP_POSUPTO:
case OP_STAR:
case OP_MINSTAR:
- case OP_NOTMINSTAR:
+ case OP_NOTMINSTAR:
case OP_POSSTAR:
- case OP_NOTPOSSTAR:
+ case OP_NOTPOSSTAR:
case OP_QUERY:
case OP_MINQUERY:
case OP_NOTMINQUERY:
case OP_POSQUERY:
- case OP_NOTPOSQUERY:
+ case OP_NOTPOSQUERY:
cc += _pcre_OP_lengths[op];
-#ifdef SUPPORT_UTF8
+#ifdef SUPPORT_UTF8
if (utf8 && cc[-1] >= 0xc0) cc += _pcre_utf8_table4[cc[-1] & 0x3f];
-#endif
+#endif
break;
/* For the record, these are the opcodes that are matched by "default":
OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
OP_THEN. */
-
+
default:
cc += _pcre_OP_lengths[op];
break;
@@ -885,32 +885,32 @@ code = (uschar *)re + re->name_table_offset +
(re->name_count * re->name_entry_size);
/* For an anchored pattern, or an unanchored pattern that has a first char, or
-a multiline pattern that matches only at "line starts", there is no point in
+a multiline pattern that matches only at "line starts", there is no point in
seeking a list of starting bytes. */
if ((re->options & PCRE_ANCHORED) == 0 &&
(re->flags & (PCRE_FIRSTSET|PCRE_STARTLINE)) == 0)
{
/* Set the character tables in the block that is passed around */
-
+
tables = re->tables;
if (tables == NULL)
(void)pcre_fullinfo(external_re, NULL, PCRE_INFO_DEFAULT_TABLES,
(void *)(&tables));
-
+
compile_block.lcc = tables + lcc_offset;
compile_block.fcc = tables + fcc_offset;
compile_block.cbits = tables + cbits_offset;
compile_block.ctypes = tables + ctypes_offset;
-
+
/* See if we can find a fixed set of initial characters for the pattern. */
-
+
memset(start_bits, 0, 32 * sizeof(uschar));
- bits_set = set_start_bits(code, start_bits,
- (re->options & PCRE_CASELESS) != 0, (re->options & PCRE_UTF8) != 0,
+ bits_set = set_start_bits(code, start_bits,
+ (re->options & PCRE_CASELESS) != 0, (re->options & PCRE_UTF8) != 0,
&compile_block) == SSB_DONE;
}
-
+
/* Find the minimum length of subject string. */
min = find_minlength(code, code, re->options);
@@ -947,12 +947,12 @@ if (bits_set)
study->flags |= PCRE_STUDY_MAPPED;
memcpy(study->start_bits, start_bits, sizeof(start_bits));
}
-
+
if (min >= 0)
{
study->flags |= PCRE_STUDY_MINLEN;
study->minlength = min;
- }
+ }
return extra;
}
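
Note on the find_minlength() hunks above: the minimum for a group is taken over its alternatives, a branch that contained a recursion is ignored unless it is the first one seen, and the matcher later uses the result to reject subjects that are too short. The sketch below shows only that arithmetic. The branch_sketch type and both function names are hypothetical, and each branch length is supplied directly rather than computed from opcodes.

```c
/* Hypothetical summary of one alternative of a group */
typedef struct branch_sketch {
  int length;                           /* Minimum length of this branch */
  int had_recurse;                      /* Non-zero if it contained a recursion */
} branch_sketch;

/* Combine branch minima using the same test as the OP_ALT/OP_KET case:
   the first branch always sets the length; later branches lower it only if
   they are shorter and did not recurse. Returns -1 for an empty list. */
static int min_group_length(const branch_sketch *branches, int count)
{
int length = -1;
int i;
for (i = 0; i < count; i++)
  {
  if (length < 0 || (!branches[i].had_recurse && branches[i].length < length))
    length = branches[i].length;
  }
return length;
}

/* The start-of-match check: non-zero means no match is possible because the
   remaining subject is shorter than the lower bound. */
static int too_short(int subject_length, int start_offset, int minlength)
{
return subject_length - start_offset < minlength;
}
```

For branches with minima 3, 1 (recursive) and 2, the result is 2; a subject of length 5 can then be rejected at start offset 4 but not at offset 3.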
diff --git a/pcre_try_flipped.c b/pcre_try_flipped.c
index 49887a5..606504c 100644
--- a/pcre_try_flipped.c
+++ b/pcre_try_flipped.c
@@ -129,7 +129,7 @@ if (study != NULL)
*internal_study = *study; /* To copy other fields */
internal_study->size = byteflip(study->size, sizeof(study->size));
internal_study->flags = byteflip(study->flags, sizeof(study->flags));
- internal_study->minlength = byteflip(study->minlength,
+ internal_study->minlength = byteflip(study->minlength,
sizeof(study->minlength));
}
diff --git a/pcregrep.c b/pcregrep.c
index db6ae37..729ff63 100644
--- a/pcregrep.c
+++ b/pcregrep.c
@@ -1367,11 +1367,11 @@ if (filenames == FN_NOMATCH_ONLY)
if (count_only)
{
if (count > 0 || !omit_zero_count)
- {
- if (printname != NULL && filenames != FN_NONE)
+ {
+ if (printname != NULL && filenames != FN_NONE)
fprintf(stdout, "%s:", printname);
fprintf(stdout, "%d\n", count);
- }
+ }
}
return rc;
@@ -1936,9 +1936,9 @@ for (i = 1; i < argc; i++)
{
char *opbra = strchr(op->long_name, '(');
char *equals = strchr(op->long_name, '=');
-
+
/* Handle options with only one spelling of the name */
-
+
if (opbra == NULL) /* Does not contain '(' */
{
if (equals == NULL) /* Not thing=data case */
@@ -1961,36 +1961,36 @@ for (i = 1; i < argc; i++)
}
}
}
-
+
/* Handle options with an alternate spelling of the name */
-
- else
+
+ else
{
char buff1[24];
char buff2[24];
-
+
int baselen = opbra - op->long_name;
int fulllen = strchr(op->long_name, ')') - op->long_name + 1;
- int arglen = (argequals == NULL || equals == NULL)?
+ int arglen = (argequals == NULL || equals == NULL)?
(int)strlen(arg) : argequals - arg;
-
+
sprintf(buff1, "%.*s", baselen, op->long_name);
sprintf(buff2, "%s%.*s", buff1, fulllen - baselen - 2, opbra + 1);
-
- if (strncmp(arg, buff1, arglen) == 0 ||
+
+ if (strncmp(arg, buff1, arglen) == 0 ||
strncmp(arg, buff2, arglen) == 0)
{
if (equals != NULL && argequals != NULL)
{
- option_data = argequals;
+ option_data = argequals;
if (*option_data == '=')
{
- option_data++;
+ option_data++;
longopwasequals = TRUE;
- }
- }
+ }
+ }
break;
- }
+ }
}
}
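
Note on the pcregrep.c hunk above: an option name with an alternate spelling carries its optional part in parentheses, and the code builds both spellings (buff1 without the bracketed part, buff2 with it) before comparing them with the argument. The sketch below reproduces that construction in isolation. The function name is hypothetical, "regex(p)" is used only as an illustrative name, and snprintf() with an explicit length check replaces the sprintf() calls of the original.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if arg matches either spelling of long_name, where long_name has
   the form "base(extra)". As in the hunk, only the first strlen(arg)
   characters are compared, so an abbreviation of either spelling matches. */
static int matches_alt_spelling(const char *long_name, const char *arg)
{
char buff1[24];
char buff2[24];
const char *opbra = strchr(long_name, '(');
const char *clbra = strchr(long_name, ')');
size_t arglen = strlen(arg);
int baselen, fulllen;

if (opbra == NULL || clbra == NULL || clbra < opbra) return 0;
baselen = (int)(opbra - long_name);
fulllen = (int)(clbra - long_name) + 1;
if (fulllen >= (int)sizeof(buff1)) return 0;   /* Would not fit the buffers */

/* buff1 is the name up to '(' and buff2 adds the bracketed characters */
snprintf(buff1, sizeof(buff1), "%.*s", baselen, long_name);
snprintf(buff2, sizeof(buff2), "%s%.*s", buff1, fulllen - baselen - 2,
  opbra + 1);

return strncmp(arg, buff1, arglen) == 0 || strncmp(arg, buff2, arglen) == 0;
}
```

For "regex(p)" this builds "regex" and "regexp", so both of those arguments match, while "regexpp" and "regez" do not.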
diff --git a/pcreposix.c b/pcreposix.c
index e51be30..b30378c 100644
--- a/pcreposix.c
+++ b/pcreposix.c
@@ -70,80 +70,80 @@ static const int eint[] = {
REG_EESCAPE, /* \c at end of pattern */
REG_EESCAPE, /* unrecognized character follows \ */
REG_BADBR, /* numbers out of order in {} quantifier */
- /* 5 */
+ /* 5 */
REG_BADBR, /* number too big in {} quantifier */
REG_EBRACK, /* missing terminating ] for character class */
REG_ECTYPE, /* invalid escape sequence in character class */
REG_ERANGE, /* range out of order in character class */
REG_BADRPT, /* nothing to repeat */
- /* 10 */
+ /* 10 */
REG_BADRPT, /* operand of unlimited repeat could match the empty string */
REG_ASSERT, /* internal error: unexpected repeat */
REG_BADPAT, /* unrecognized character after (? */
REG_BADPAT, /* POSIX named classes are supported only within a class */
REG_EPAREN, /* missing ) */
- /* 15 */
+ /* 15 */
REG_ESUBREG, /* reference to non-existent subpattern */
REG_INVARG, /* erroffset passed as NULL */
REG_INVARG, /* unknown option bit(s) set */
REG_EPAREN, /* missing ) after comment */
REG_ESIZE, /* parentheses nested too deeply */
- /* 20 */
+ /* 20 */
REG_ESIZE, /* regular expression too large */
REG_ESPACE, /* failed to get memory */
REG_EPAREN, /* unmatched parentheses */
REG_ASSERT, /* internal error: code overflow */
REG_BADPAT, /* unrecognized character after (?< */
- /* 25 */
+ /* 25 */
REG_BADPAT, /* lookbehind assertion is not fixed length */
REG_BADPAT, /* malformed number or name after (?( */
REG_BADPAT, /* conditional group contains more than two branches */
REG_BADPAT, /* assertion expected after (?( */
REG_BADPAT, /* (?R or (?[+-]digits must be followed by ) */
- /* 30 */
+ /* 30 */
REG_ECTYPE, /* unknown POSIX class name */
REG_BADPAT, /* POSIX collating elements are not supported */
REG_INVARG, /* this version of PCRE is not compiled with PCRE_UTF8 support */
REG_BADPAT, /* spare error */
REG_BADPAT, /* character value in \x{...} sequence is too large */
- /* 35 */
+ /* 35 */
REG_BADPAT, /* invalid condition (?(0) */
REG_BADPAT, /* \C not allowed in lookbehind assertion */
REG_EESCAPE, /* PCRE does not support \L, \l, \N, \U, or \u */
REG_BADPAT, /* number after (?C is > 255 */
REG_BADPAT, /* closing ) for (?C expected */
- /* 40 */
+ /* 40 */
REG_BADPAT, /* recursive call could loop indefinitely */
REG_BADPAT, /* unrecognized character after (?P */
REG_BADPAT, /* syntax error in subpattern name (missing terminator) */
REG_BADPAT, /* two named subpatterns have the same name */
REG_BADPAT, /* invalid UTF-8 string */
- /* 45 */
+ /* 45 */
REG_BADPAT, /* support for \P, \p, and \X has not been compiled */
REG_BADPAT, /* malformed \P or \p sequence */
REG_BADPAT, /* unknown property name after \P or \p */
REG_BADPAT, /* subpattern name is too long (maximum 32 characters) */
REG_BADPAT, /* too many named subpatterns (maximum 10,000) */
- /* 50 */
+ /* 50 */
REG_BADPAT, /* repeated subpattern is too long */
REG_BADPAT, /* octal value is greater than \377 (not in UTF-8 mode) */
REG_BADPAT, /* internal error: overran compiling workspace */
REG_BADPAT, /* internal error: previously-checked referenced subpattern not found */
REG_BADPAT, /* DEFINE group contains more than one branch */
- /* 55 */
+ /* 55 */
REG_BADPAT, /* repeating a DEFINE group is not allowed */
REG_INVARG, /* inconsistent NEWLINE options */
REG_BADPAT, /* \g is not followed by an (optionally braced) non-zero number */
REG_BADPAT, /* a numbered reference must not be zero */
REG_BADPAT, /* (*VERB) with an argument is not supported */
/* 60 */
- REG_BADPAT, /* (*VERB) not recognized */
+ REG_BADPAT, /* (*VERB) not recognized */
REG_BADPAT, /* number is too big */
REG_BADPAT, /* subpattern name expected */
REG_BADPAT, /* digit expected after (?+ */
REG_BADPAT, /* ] is an invalid data character in JavaScript compatibility mode */
/* 65 */
- REG_BADPAT /* different names for subpatterns of the same number are not allowed */
+ REG_BADPAT /* different names for subpatterns of the same number are not allowed */
};
/* Table of texts corresponding to POSIX error codes */
@@ -253,14 +253,14 @@ preg->re_pcre = pcre_compile2(pattern, options, &errorcode, &errorptr,
&erroffset, NULL);
preg->re_erroffset = erroffset;
-/* Safety: if the error code is too big for the translation vector (which
+/* Safety: if the error code is too big for the translation vector (which
should not happen, but we all make mistakes), return REG_BADPAT. */
-if (preg->re_pcre == NULL)
+if (preg->re_pcre == NULL)
{
return (errorcode < sizeof(eint)/sizeof(const int))?
eint[errorcode] : REG_BADPAT;
- }
+ }
preg->re_nsub = pcre_info((const pcre *)preg->re_pcre, NULL, NULL);
return 0;
@@ -302,7 +302,7 @@ if ((eflags & REG_NOTEMPTY) != 0) options |= PCRE_NOTEMPTY;
((regex_t *)preg)->re_erroffset = (size_t)(-1); /* Only has meaning after compile */
-/* When no string data is being returned, or no vector has been passed in which
+/* When no string data is being returned, or no vector has been passed in which
to put it, ensure that nmatch is zero. Otherwise, ensure the vector for holding
the return data is large enough. */
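
Note on the pcreposix.c hunk above: the "Safety" comment describes a bounds-checked lookup in the error translation vector, with REG_BADPAT as the fallback when the code is out of range. A minimal sketch of that pattern follows. The SK_ constants, the shortened table and the function name are hypothetical stand-ins, and the explicit check for a negative code is an addition that the original leaves to the signed/unsigned comparison.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the POSIX error codes */
enum { SK_BADPAT = 1, SK_EESCAPE, SK_BADBR };

/* Shortened translation vector, indexed by the compile error number */
static const int sk_eint[] = {
  0,            /* no error */
  SK_EESCAPE,   /* \ at end of pattern */
  SK_EESCAPE,   /* \c at end of pattern */
  SK_EESCAPE,   /* unrecognized character follows \ */
  SK_BADBR      /* numbers out of order in {} quantifier */
};

/* Translate an error number, falling back to SK_BADPAT when the number lies
   outside the vector */
static int translate_error(int errorcode)
{
size_t count = sizeof(sk_eint)/sizeof(sk_eint[0]);
return (errorcode >= 0 && (size_t)errorcode < count)?
  sk_eint[errorcode] : SK_BADPAT;
}
```

Dividing by the size of the first element keeps the bound correct if the element type of the table ever changes.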
diff --git a/pcretest.c b/pcretest.c
index 25080b9..827c3b1 100644
--- a/pcretest.c
+++ b/pcretest.c
@@ -1305,7 +1305,7 @@ while (!done)
if ((options & PCRE_DOTALL) != 0) cflags |= REG_DOTALL;
if ((options & PCRE_NO_AUTO_CAPTURE) != 0) cflags |= REG_NOSUB;
if ((options & PCRE_UTF8) != 0) cflags |= REG_UTF8;
- if ((options & PCRE_UNGREEDY) != 0) cflags |= REG_UNGREEDY;
+ if ((options & PCRE_UNGREEDY) != 0) cflags |= REG_UNGREEDY;
rc = regcomp(&preg, (char *)p, cflags);
@@ -1630,10 +1630,10 @@ while (!done)
{
uschar *start_bits = NULL;
int minlength;
-
+
new_info(re, extra, PCRE_INFO_MINLENGTH, &minlength);
- fprintf(outfile, "Subject length lower bound = %d\n", minlength);
-
+ fprintf(outfile, "Subject length lower bound = %d\n", minlength);
+
new_info(re, extra, PCRE_INFO_FIRSTTABLE, &start_bits);
if (start_bits == NULL)
fprintf(outfile, "No set of starting bytes\n");
@@ -1977,7 +1977,7 @@ while (!done)
case 'N':
if ((options & PCRE_NOTEMPTY) != 0)
options = (options & ~PCRE_NOTEMPTY) | PCRE_NOTEMPTY_ATSTART;
- else
+ else
options |= PCRE_NOTEMPTY;
continue;
@@ -2001,7 +2001,7 @@ while (!done)
continue;
case 'P':
- options |= ((options & PCRE_PARTIAL_SOFT) == 0)?
+ options |= ((options & PCRE_PARTIAL_SOFT) == 0)?
PCRE_PARTIAL_SOFT : PCRE_PARTIAL_HARD;
continue;
@@ -2377,8 +2377,8 @@ while (!done)
{
fprintf(outfile, ": ");
pchars(bptr + use_offsets[0], use_offsets[1] - use_offsets[0],
- outfile);
- }
+ outfile);
+ }
fprintf(outfile, "\n");
break; /* Out of the /g loop */
}
diff --git a/perltest.pl b/perltest.pl
index c4f1c97..a16345d 100755
--- a/perltest.pl
+++ b/perltest.pl
@@ -90,10 +90,10 @@ for (;;)
# Remove /8 from a UTF-8 pattern.
$utf8 = $pattern =~ s/8(?=[a-z]*$)//;
-
+
# Remove /J from a pattern with duplicate names.
-
- $pattern =~ s/J(?=[a-z]*$)//;
+
+ $pattern =~ s/J(?=[a-z]*$)//;
# Check that the pattern is valid
diff --git a/testdata/testinput2 b/testdata/testinput2
index ac108c4..7f887c8 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -3113,14 +3113,14 @@ a random value. /Ix
b"11111
a"11111
-/^(?|(a)(b)(c)(?<D>d)|(?<D>e)) (?('D')X|Y)/JDx
+/^(?|(a)(b)(c)(?<D>d)|(?<D>e)) (?('D')X|Y)/JDZx
abcdX
eX
** Failers
abcdY
ey
-/(?<A>a) (b)(c) (?<A>d (?(R&A)$ | (?4)) )/JDx
+/(?<A>a) (b)(c) (?<A>d (?(R&A)$ | (?4)) )/JDZx
abcdd
** Failers
abcdde
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index f3afc0d..0d5b61b 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -10274,36 +10274,36 @@ No match
a"11111
No match
-/^(?|(a)(b)(c)(?<D>d)|(?<D>e)) (?('D')X|Y)/JDx
-------------------------------------------------------------------
- 0 79 Bra
- 3 ^
- 4 43 Bra
- 7 7 CBra 1
- 12 a
- 14 7 Ket
- 17 7 CBra 2
- 22 b
- 24 7 Ket
- 27 7 CBra 3
- 32 c
- 34 7 Ket
- 37 7 CBra 4
- 42 d
- 44 7 Ket
- 47 13 Alt
- 50 7 CBra 1
- 55 e
- 57 7 Ket
- 60 56 Ket
- 63 8 Cond
- 66 4 Cond nref
- 69 X
- 71 5 Alt
- 74 Y
- 76 13 Ket
- 79 79 Ket
- 82 End
+/^(?|(a)(b)(c)(?<D>d)|(?<D>e)) (?('D')X|Y)/JDZx
+------------------------------------------------------------------
+ Bra
+ ^
+ Bra
+ CBra 1
+ a
+ Ket
+ CBra 2
+ b
+ Ket
+ CBra 3
+ c
+ Ket
+ CBra 4
+ d
+ Ket
+ Alt
+ CBra 1
+ e
+ Ket
+ Ket
+ Cond
+ 4 Cond nref
+ X
+ Alt
+ Y
+ Ket
+ Ket
+ End
------------------------------------------------------------------
Capturing subpattern count = 4
Named capturing subpatterns:
@@ -10328,31 +10328,31 @@ No match
ey
No match
-/(?<A>a) (b)(c) (?<A>d (?(R&A)$ | (?4)) )/JDx
-------------------------------------------------------------------
- 0 65 Bra
- 3 7 CBra 1
- 8 a
- 10 7 Ket
- 13 7 CBra 2
- 18 b
- 20 7 Ket
- 23 7 CBra 3
- 28 c
- 30 7 Ket
- 33 29 CBra 4
- 38 d
- 40 7 Cond
- 43 Cond nrecurse 1
- 46 $
- 47 12 Alt
- 50 6 Once
- 53 33 Recurse
- 56 6 Ket
- 59 19 Ket
- 62 29 Ket
- 65 65 Ket
- 68 End
+/(?<A>a) (b)(c) (?<A>d (?(R&A)$ | (?4)) )/JDZx
+------------------------------------------------------------------
+ Bra
+ CBra 1
+ a
+ Ket
+ CBra 2
+ b
+ Ket
+ CBra 3
+ c
+ Ket
+ CBra 4
+ d
+ Cond
+ Cond nrecurse 1
+ $
+ Alt
+ Once
+ Recurse
+ Ket
+ Ket
+ Ket
+ Ket
+ End
------------------------------------------------------------------
Capturing subpattern count = 4
Named capturing subpatterns: