summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>2011-12-12 16:23:37 +0000
committerph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15>2011-12-12 16:23:37 +0000
commitdf3f8297fbe5e0a4c395e9021ecf176fdd6dab52 (patch)
treeed344a17ac5696586050c4ef8dca54ca34ba60d7
parent02b9094df724302cd24f71f6a28ec3df318cec71 (diff)
downloadpcre-df3f8297fbe5e0a4c395e9021ecf176fdd6dab52.tar.gz
Merge changes from trunk r755 to r800 into the 16-bit branch.
git-svn-id: svn://vcs.exim.org/pcre/code/branches/pcre16@801 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r--ChangeLog104
-rw-r--r--Makefile.am1
-rw-r--r--NEWS7
-rwxr-xr-xPrepareRelease12
-rwxr-xr-xRunTest2
-rw-r--r--RunTest.bat20
-rw-r--r--configure.ac4
-rw-r--r--doc/html/pcreapi.html36
-rw-r--r--doc/html/pcrecallout.html9
-rw-r--r--doc/html/pcrecompat.html5
-rw-r--r--doc/html/pcrejit.html141
-rw-r--r--doc/html/pcrelimits.html8
-rw-r--r--doc/html/pcrematching.html8
-rw-r--r--doc/html/pcrepattern.html148
-rw-r--r--doc/html/pcretest.html7
-rw-r--r--doc/pcre.txt1301
-rw-r--r--doc/pcreapi.335
-rw-r--r--doc/pcrecallout.39
-rw-r--r--doc/pcrecompat.32
-rw-r--r--doc/pcrejit.3112
-rw-r--r--doc/pcrelimits.37
-rw-r--r--doc/pcrepattern.3102
-rw-r--r--doc/pcretest.17
-rw-r--r--doc/pcretest.txt307
-rwxr-xr-xmaint/ManyConfigTests11
-rw-r--r--pcre-config.in2
-rw-r--r--pcre.h.in19
-rw-r--r--pcre_compile.c230
-rw-r--r--pcre_dfa_exec.c2
-rw-r--r--pcre_exec.c688
-rw-r--r--pcre_fullinfo.c13
-rw-r--r--pcre_internal.h8
-rw-r--r--pcre_jit_compile.c13
-rw-r--r--pcre_study.c9
-rw-r--r--pcre_valid_utf8.c46
-rw-r--r--pcregrep.c4
-rw-r--r--pcreposix.c6
-rw-r--r--pcretest.c41
-rwxr-xr-xperltest.pl4
-rw-r--r--testdata/testinput11153
-rw-r--r--testdata/testinput137
-rw-r--r--testdata/testinput2225
-rw-r--r--testdata/testinput614
-rw-r--r--testdata/testoutput11271
-rw-r--r--testdata/testoutput1311
-rw-r--r--testdata/testoutput144
-rw-r--r--testdata/testoutput151
-rw-r--r--testdata/testoutput2367
-rw-r--r--testdata/testoutput54
-rw-r--r--testdata/testoutput622
50 files changed, 2624 insertions, 1945 deletions
diff --git a/ChangeLog b/ChangeLog
index 1cc0d94..81eaf17 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,8 +1,8 @@
ChangeLog for PCRE
------------------
-Version 8.21
-------------
+Version 8.21 12-Dec-2011
+------------------------
1. Updating the JIT compiler.
@@ -13,7 +13,7 @@ Version 8.21
PCRE_EXTRA_TABLES is not suported by JIT, and should be checked before
calling _pcre_jit_exec. Some extra comments are added.
-4. Mark settings inside atomic groups that do not contain any capturing
+4. (*MARK) settings inside atomic groups that do not contain any capturing
parentheses, for example, (?>a(*:m)), were not being passed out. This bug
was introduced by change 18 for 8.20.
@@ -22,36 +22,100 @@ Version 8.21
6. Lookbehinds such as (?<=a{2}b) that contained a fixed repetition were
erroneously being rejected as "not fixed length" if PCRE_CASELESS was set.
- This bug was probably introduced by change 9 of 8.13.
-
+ This bug was probably introduced by change 9 of 8.13.
+
7. While fixing 6 above, I noticed that a number of other items were being
- incorrectly rejected as "not fixed length". This arose partly because newer
+ incorrectly rejected as "not fixed length". This arose partly because newer
opcodes had not been added to the fixed-length checking code. I have (a)
corrected the bug and added tests for these items, and (b) arranged for an
error to occur if an unknown opcode is encountered while checking for fixed
- length instead of just assuming "not fixed length". The items that were
- rejected were: (*ACCEPT), (*COMMIT), (*FAIL), (*MARK), (*PRUNE), (*SKIP),
- (*THEN), \h, \H, \v, \V, and single character negative classes with fixed
+ length instead of just assuming "not fixed length". The items that were
+ rejected were: (*ACCEPT), (*COMMIT), (*FAIL), (*MARK), (*PRUNE), (*SKIP),
+ (*THEN), \h, \H, \v, \V, and single character negative classes with fixed
repetitions, e.g. [^a]{3}, with and without PCRE_CASELESS.
-
+
8. A possessively repeated conditional subpattern such as (?(?=c)c|d)++ was
- being incorrectly compiled and would have given unpredicatble results.
-
-9. A possessively repeated subpattern with minimum repeat count greater than
+ being incorrectly compiled and would have given unpredicatble results.
+
+9. A possessively repeated subpattern with minimum repeat count greater than
one behaved incorrectly. For example, (A){2,}+ behaved as if it was
- (A)(A)++ which meant that, after a subsequent mismatch, backtracking into
- the first (A) could occur when it should not.
-
-10. Add a cast and remove a redundant test from the code.
+ (A)(A)++ which meant that, after a subsequent mismatch, backtracking into
+ the first (A) could occur when it should not.
+
+10. Add a cast and remove a redundant test from the code.
11. JIT should use pcre_malloc/pcre_free for allocation.
12. Updated pcre-config so that it no longer shows -L/usr/lib, which seems
- best practice nowadays, and helps with cross-compiling. (If the exec_prefix
- is anything other than /usr, -L is still shown).
-
+ best practice nowadays, and helps with cross-compiling. (If the exec_prefix
+ is anything other than /usr, -L is still shown).
+
13. In non-UTF-8 mode, \C is now supported in lookbehinds and DFA matching.
+14. Perl does not support \N without a following name in a [] class; PCRE now
+ also gives an error.
+
+15. If a forward reference was repeated with an upper limit of around 2000,
+ it caused the error "internal error: overran compiling workspace". The
+ maximum number of forward references (including repeats) was limited by the
+ internal workspace, and dependent on the LINK_SIZE. The code has been
+ rewritten so that the workspace expands (via pcre_malloc) if necessary, and
+ the default depends on LINK_SIZE. There is a new upper limit (for safety)
+ of around 200,000 forward references. While doing this, I also speeded up
+ the filling in of repeated forward references.
+
+16. A repeated forward reference in a pattern such as (a)(?2){2}(.) was
+ incorrectly expecting the subject to contain another "a" after the start.
+
+17. When (*SKIP:name) is activated without a corresponding (*MARK:name) earlier
+ in the match, the SKIP should be ignored. This was not happening; instead
+ the SKIP was being treated as NOMATCH. For patterns such as
+ /A(*MARK:A)A+(*SKIP:B)Z|AAC/ this meant that the AAC branch was never
+ tested.
+
+18. The behaviour of (*MARK), (*PRUNE), and (*THEN) has been reworked and is
+ now much more compatible with Perl, in particular in cases where the result
+ is a non-match for a non-anchored pattern. For example, if
+ /b(*:m)f|a(*:n)w/ is matched against "abc", the non-match returns the name
+ "m", where previously it did not return a name. A side effect of this
+ change is that for partial matches, the last encountered mark name is
+ returned, as for non matches. A number of tests that were previously not
+ Perl-compatible have been moved into the Perl-compatible test files. The
+ refactoring has had the pleasing side effect of removing one argument from
+ the match() function, thus reducing its stack requirements.
+
+19. If the /S+ option was used in pcretest to study a pattern using JIT,
+ subsequent uses of /S (without +) incorrectly behaved like /S+.
+
+21. Retrieve executable code size support for the JIT compiler and fixing
+ some warnings.
+
+22. A caseless match of a UTF-8 character whose other case uses fewer bytes did
+ not work when the shorter character appeared right at the end of the
+ subject string.
+
+23. Added some (int) casts to non-JIT modules to reduce warnings on 64-bit
+ systems.
+
+24. Added PCRE_INFO_JITSIZE to pass on the value from (21) above, and also
+ output it when the /M option is used in pcretest.
+
+25. The CheckMan script was not being included in the distribution. Also, added
+ an explicit "perl" to run Perl scripts from the PrepareRelease script
+ because this is reportedly needed in Windows.
+
+26. If study data was being save in a file and studying had not found a set of
+ "starts with" bytes for the pattern, the data written to the file (though
+ never used) was taken from uninitialized memory and so caused valgrind to
+ complain.
+
+27. Updated RunTest.bat as provided by Sheri Pierce.
+
+28. Fixed a possible uninitialized memory bug in pcre_jit_compile.c.
+
+29. Computation of memory usage for the table of capturing group names was
+ giving an unnecessarily large value.
+
Version 8.20 21-Oct-2011
------------------------
diff --git a/Makefile.am b/Makefile.am
index cbcb85d..d67d167 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -100,6 +100,7 @@ EXTRA_DIST += \
# These files are used in the preparation of a release
EXTRA_DIST += \
PrepareRelease \
+ CheckMan \
CleanTxt \
Detrail \
132html \
diff --git a/NEWS b/NEWS
index 1d76c86..28d676d 100644
--- a/NEWS
+++ b/NEWS
@@ -1,6 +1,13 @@
News about PCRE releases
------------------------
+Release 8.21 12-Dec-2011
+------------------------
+
+This is almost entirely a bug-fix release. The only new feature is the ability
+to obtain the size of the memory used by the JIT compiler.
+
+
Release 8.20 21-Oct-2011
------------------------
diff --git a/PrepareRelease b/PrepareRelease
index c079642..7123be6 100755
--- a/PrepareRelease
+++ b/PrepareRelease
@@ -37,7 +37,7 @@ echo Processing documentation
# Check the remaining man pages
-../CheckMan *.1 *.3
+perl ../CheckMan *.1 *.3
if [ $? != 0 ] ; then exit 1; fi
# Make Text form of the documentation. It needs some mangling to make it
@@ -64,7 +64,7 @@ for file in pcre pcrebuild pcrematching pcreapi pcrecallout pcrecompat \
pcrelimits pcrestack ; do
echo " Processing $file.3"
nroff -c -man $file.3 >$file.rawtxt
- ../CleanTxt <$file.rawtxt >>pcre.txt
+ perl ../CleanTxt <$file.rawtxt >>pcre.txt
/bin/rm $file.rawtxt
echo "------------------------------------------------------------------------------" >>pcre.txt
if [ "$file" != "pcresample" ] ; then
@@ -77,7 +77,7 @@ done
for file in pcretest pcregrep pcre-config ; do
echo Making $file.txt
nroff -c -man $file.1 >$file.rawtxt
- ../CleanTxt <$file.rawtxt >$file.txt
+ perl ../CleanTxt <$file.rawtxt >$file.txt
/bin/rm $file.rawtxt
done
@@ -126,7 +126,7 @@ cp index.html.src html/index.html
for file in *.1 ; do
base=`basename $file .1`
echo " Making $base.html"
- ../132html -toc $base <$file >html/$base.html
+ perl ../132html -toc $base <$file >html/$base.html
done
# Exclude table of contents for function summaries. It seems that expr
@@ -146,7 +146,7 @@ for file in *.3 ; do
toc=""
fi
echo " Making $base.html"
- ../132html $toc $base <$file >html/$base.html
+ perl ../132html $toc $base <$file >html/$base.html
if [ $? != 0 ] ; then exit 1; fi
done
@@ -235,7 +235,7 @@ files="\
libpcreposix.def"
echo Detrailing
-./Detrail $files doc/p* doc/html/*
+perl ./Detrail $files doc/p* doc/html/*
echo Doing basic configure to get default pcre.h and config.h
# This is in case the caller has set aliases (as I do - PH)
diff --git a/RunTest b/RunTest
index 3a5a9db..b98bb96 100755
--- a/RunTest
+++ b/RunTest
@@ -222,7 +222,7 @@ fi
# PCRE tests that are not JIT or Perl-compatible: API, errors, internals
if [ $do2 = yes ] ; then
- echo "Test 2: API, errors, internals, and non-Perl stuff"
+ echo "Test 2: API, errors, internals, and non-Perl stuff (not UTF-8)"
for opt in "" "-s" $jitopt; do
$sim $valgrind ./pcretest -q $opt $testdata/testinput2 testtry
if [ $? = 0 ] ; then
diff --git a/RunTest.bat b/RunTest.bat
index 5db979f..38f1eae 100644
--- a/RunTest.bat
+++ b/RunTest.bat
@@ -18,6 +18,8 @@
@rem 14 requires presense of jit support
@rem 15 requires absence of jit support
@rem Sheri P also added override tests for study and jit testing
+@rem JIT testing n/a for tests 7-10, removed JIT override test for them
+@rem removed override tests for 14-15
setlocal enabledelayedexpansion
if [%srcdir%]==[] (
@@ -27,7 +29,7 @@ if exist ..\testdata\ set srcdir=..)
if [%srcdir%]==[] (
if exist ..\..\testdata\ set srcdir=..\..)
if NOT exist "%srcdir%\testdata\" (
-Error: echo distribution testdata folder not found.
+Error: echo distribution testdata folder not found!
call :conferror
exit /b 1
goto :eof
@@ -41,14 +43,14 @@ echo pcretest=%pcretest%
echo pcregrep=%pcregrep%
if NOT exist "%pcregrep%" (
-echo Error: "%pcregrep%" not found.
+echo Error: "%pcregrep%" not found!
echo.
call :conferror
exit /b 1
)
if NOT exist "%pcretest%" (
-echo Error: "%pcretest%" not found.
+echo Error: "%pcretest%" not found!
echo.
call :conferror
exit /b 1
@@ -219,7 +221,7 @@ if %jit% EQU 1 call :runsub 1 testoutjit "Test with JIT Override" -q -s+
goto :eof
:do2
- call :runsub 2 testout "API, errors, internals, and non-Perl stuff" -q
+ call :runsub 2 testout "API, errors, internals, and non-Perl stuff (not UTF-8)" -q
call :runsub 2 testoutstudy "Test with Study Override" -q -s
if %jit% EQU 1 call :runsub 2 testoutjit "Test with JIT Override" -q -s+
goto :eof
@@ -263,7 +265,6 @@ goto :eof
:do7
call :runsub 7 testout "DFA matching" -q -dfa
call :runsub 7 testoutstudy "Test with Study Override" -q -dfa -s
- if %jit% EQU 1 call :runsub 7 testoutjit "Test with JIT Override" -q -dfa -s+
goto :eof
:do8
@@ -273,7 +274,6 @@ goto :eof
)
call :runsub 8 testout "DFA matching with UTF-8" -q -dfa
call :runsub 8 testoutstudy "Test with Study Override" -q -dfa -s
- if %jit% EQU 1 call :runsub 8 testoutjit "Test with JIT Override" -q -dfa -s+
goto :eof
:do9
@@ -283,7 +283,6 @@ goto :eof
)
call :runsub 9 testout "DFA matching with Unicode properties" -q -dfa
call :runsub 9 testoutstudy "Test with Study Override" -q -dfa -s
- if %jit% EQU 1 call :runsub 9 testoutjit "Test with JIT Override" -q -dfa -s+
goto :eof
:do10
@@ -293,7 +292,6 @@ goto :eof
)
call :runsub 10 testout "Internal offsets and code size tests" -q
call :runsub 10 testoutstudy "Test with Study Override" -q -s
- if %jit% EQU 1 call :runsub 10 testoutjit "Test with JIT Override" -q -s+
goto :eof
:do11
@@ -328,8 +326,6 @@ if %jit% EQU 0 (
goto :eof
)
call :runsub 14 testout "JIT-specific features - have JIT" -q
- call :runsub 14 testoutstudy "Test with Study Override" -q -s
- call :runsub 14 testoutjit "Test with JIT Override" -q -s+
goto :eof
:do15
@@ -338,11 +334,11 @@ goto :eof
goto :eof
)
call :runsub 15 testout "JIT-specific features - no JIT" -q
- call :runsub 15 testoutstudy "Test with Study Override" -q -s
goto :eof
:conferror
-@echo Configuration error.
+@echo.
+@echo Either your build is incomplete or you have a configuration error.
@echo.
@echo If configured with cmake and executed via "make test" or the MSVC "RUN_TESTS"
@echo project, pcre_test.bat defines variables and automatically calls RunTest.bat.
diff --git a/configure.ac b/configure.ac
index ef1524f..a7e71ce 100644
--- a/configure.ac
+++ b/configure.ac
@@ -10,8 +10,8 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre_major, [8])
m4_define(pcre_minor, [21])
-m4_define(pcre_prerelease, [-RC1])
-m4_define(pcre_date, [2011-11-14])
+m4_define(pcre_prerelease, [])
+m4_define(pcre_date, [2011-12-12])
# Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [0:1:0])
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html
index cd90766..3cbb6be 100644
--- a/doc/html/pcreapi.html
+++ b/doc/html/pcreapi.html
@@ -649,6 +649,23 @@ character). Thus, the pattern AB]CD becomes illegal when this option is set.
string (by default this causes the current matching alternative to fail). A
pattern such as (\1)(a) succeeds when this option is set (assuming it can find
an "a" in the subject), whereas it fails by default, for Perl compatibility.
+</P>
+<P>
+(3) \U matches an upper case "U" character; by default \U causes a compile
+time error (Perl uses \U to upper case subsequent characters).
+</P>
+<P>
+(4) \u matches a lower case "u" character unless it is followed by four
+hexadecimal digits, in which case the hexadecimal number defines the code point
+to match. By default, \u causes a compile time error (Perl uses it to upper
+case the following character).
+</P>
+<P>
+(5) \x matches a lower case "x" character unless it is followed by two
+hexadecimal digits, in which case the hexadecimal number defines the code point
+to match. By default, as in Perl, a hexadecimal number is always expected after
+\x, but it may have zero, one, or two digits (so, for example, \xz matches a
+binary zero character followed by z).
<pre>
PCRE_MULTILINE
</pre>
@@ -1127,6 +1144,12 @@ particular pattern. See the
<a href="pcrejit.html"><b>pcrejit</b></a>
documentation for details of what can and cannot be handled.
<pre>
+ PCRE_INFO_JITSIZE
+</pre>
+If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE option,
+return the size of the JIT compiled code, otherwise return zero. The fourth
+argument should point to a <b>size_t</b> variable.
+<pre>
PCRE_INFO_LASTLITERAL
</pre>
Return the value of the rightmost literal byte that must exist in any matched
@@ -1235,10 +1258,13 @@ For such patterns, the PCRE_ANCHORED bit is set in the options returned by
<pre>
PCRE_INFO_SIZE
</pre>
-Return the size of the compiled pattern, that is, the value that was passed as
-the argument to <b>pcre_malloc()</b> when PCRE was getting memory in which to
-place the compiled data. The fourth argument should point to a <b>size_t</b>
-variable.
+Return the size of the compiled pattern. The fourth argument should point to a
+<b>size_t</b> variable. This value does not include the size of the <b>pcre</b>
+structure that is returned by <b>pcre_compile()</b>. The value that is passed as
+the argument to <b>pcre_malloc()</b> when <b>pcre_compile()</b> is getting memory
+in which to place the compiled data is the value returned by this option plus
+the size of the <b>pcre</b> structure. Studying a compiled pattern, with or
+without JIT, does not alter the value returned by this option.
<pre>
PCRE_INFO_STUDYSIZE
</pre>
@@ -2486,7 +2512,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC24" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 23 September 2011
+Last updated: 02 December 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrecallout.html b/doc/html/pcrecallout.html
index 40d5fa2..e891fdf 100644
--- a/doc/html/pcrecallout.html
+++ b/doc/html/pcrecallout.html
@@ -189,9 +189,10 @@ same callout number. However, they are set for all callouts.
<P>
The <i>mark</i> field is present from version 2 of the <i>pcre_callout</i>
structure. In callouts from <b>pcre_exec()</b> it contains a pointer to the
-zero-terminated name of the most recently passed (*MARK) item in the match, or
-NULL if there are no (*MARK)s in the current matching path. In callouts from
-<b>pcre_dfa_exec()</b> this field always contains NULL.
+zero-terminated name of the most recently passed (*MARK), (*PRUNE), or (*THEN)
+item in the match, or NULL if no such items have been passed. Instances of
+(*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
+callouts from <b>pcre_dfa_exec()</b> this field always contains NULL.
</P>
<br><a name="SEC4" href="#TOC1">RETURN VALUES</a><br>
<P>
@@ -219,7 +220,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 26 August 2011
+Last updated: 30 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrecompat.html b/doc/html/pcrecompat.html
index 69d9d1d..4e5e18b 100644
--- a/doc/html/pcrecompat.html
+++ b/doc/html/pcrecompat.html
@@ -53,7 +53,8 @@ represent a binary zero.
own, matching a non-newline character, is supported.) In fact these are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE, an error is
-generated.
+generated by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
+\U and \u are interpreted as JavaScript interprets them.
</P>
<P>
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE is
@@ -202,7 +203,7 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 09 October 2011
+Last updated: 14 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrejit.html b/doc/html/pcrejit.html
index c257d0d..7411ecf 100644
--- a/doc/html/pcrejit.html
+++ b/doc/html/pcrejit.html
@@ -20,10 +20,11 @@ man page, in case the conversion went wrong.
<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM JIT EXECUTION</a>
<li><a name="TOC6" href="#SEC6">SAVING AND RESTORING COMPILED PATTERNS</a>
<li><a name="TOC7" href="#SEC7">CONTROLLING THE JIT STACK</a>
-<li><a name="TOC8" href="#SEC8">EXAMPLE CODE</a>
-<li><a name="TOC9" href="#SEC9">SEE ALSO</a>
-<li><a name="TOC10" href="#SEC10">AUTHOR</a>
-<li><a name="TOC11" href="#SEC11">REVISION</a>
+<li><a name="TOC8" href="#SEC8">JIT STACK FAQ</a>
+<li><a name="TOC9" href="#SEC9">EXAMPLE CODE</a>
+<li><a name="TOC10" href="#SEC10">SEE ALSO</a>
+<li><a name="TOC11" href="#SEC11">AUTHOR</a>
+<li><a name="TOC12" href="#SEC12">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
@@ -57,11 +58,17 @@ fully tested. If --enable-jit is set on an unsupported platform, compilation
fails.
</P>
<P>
-A program can tell if JIT support is available by calling <b>pcre_config()</b>
-with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available, and 0
-otherwise. However, a simple program does not need to check this in order to
-use JIT. The API is implemented in a way that falls back to the ordinary PCRE
-code if JIT is not available.
+A program that is linked with PCRE 8.20 or later can tell if JIT support is
+available by calling <b>pcre_config()</b> with the PCRE_CONFIG_JIT option. The
+result is 1 when JIT is available, and 0 otherwise. However, a simple program
+does not need to check this in order to use JIT. The API is implemented in a
+way that falls back to the ordinary PCRE code if JIT is not available.
+</P>
+<P>
+If your program may sometimes be linked with versions of PCRE that are older
+than 8.20, but you want to use JIT when it is available, you can test
+the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT macro such
+as PCRE_CONFIG_JIT, for compile-time control of your code.
</P>
<br><a name="SEC3" href="#TOC1">SIMPLE USE OF JIT</a><br>
<P>
@@ -75,6 +82,21 @@ You have to do two things to make use of the JIT support in the simplest way:
no longer needed instead of just freeing it yourself. This
ensures that any JIT data is also freed.
</pre>
+For a program that may be linked with pre-8.20 versions of PCRE, you can insert
+<pre>
+ #ifndef PCRE_STUDY_JIT_COMPILE
+ #define PCRE_STUDY_JIT_COMPILE 0
+ #endif
+</pre>
+so that no option is passed to <b>pcre_study()</b>, and then use something like
+this to free the study data:
+<pre>
+ #ifdef PCRE_CONFIG_JIT
+ pcre_free_study(study_ptr);
+ #else
+ pcre_free(study_ptr);
+ #endif
+</pre>
In some circumstances you may need to call additional functions. These are
described in the section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a>
@@ -116,12 +138,8 @@ supported.
<P>
The unsupported pattern items are:
<pre>
- \C match a single byte; not supported in UTF-8 mode
+ \C match a single byte; not supported in UTF-8 mode
(?Cn) callouts
- (?(&#60;name&#62;)... conditional test on setting of a named subpattern
- (?(R)... conditional test on whole pattern recursion
- (?(Rn)... conditional test on recursion, by number
- (?(R&name)... conditional test on recursion, by name
(*COMMIT) )
(*MARK) )
(*PRUNE) ) the backtracking control verbs
@@ -167,7 +185,10 @@ When the compiled JIT code runs, it needs a block of memory to use as a stack.
By default, it uses 32K on the machine stack. However, some large or
complicated patterns need more than this. The error PCRE_ERROR_JIT_STACKLIMIT
is given when there is not enough stack. Three functions are provided for
-managing blocks of memory for use as JIT stacks.
+managing blocks of memory for use as JIT stacks. There is further discussion
+about the use of JIT stacks in the section entitled
+<a href="#stackcontrol">"JIT stack FAQ"</a>
+below.
</P>
<P>
The <b>pcre_jit_stack_alloc()</b> function creates a JIT stack. Its arguments
@@ -234,8 +255,86 @@ All the functions described in this section do nothing if JIT is not available,
and <b>pcre_assign_jit_stack()</b> does nothing unless the <b>extra</b> argument
is non-NULL and points to a <b>pcre_extra</b> block that is the result of a
successful study with PCRE_STUDY_JIT_COMPILE.
+<a name="stackfaq"></a></P>
+<br><a name="SEC8" href="#TOC1">JIT STACK FAQ</a><br>
+<P>
+(1) Why do we need JIT stacks?
+<br>
+<br>
+PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack where
+the local data of the current node is pushed before checking its child nodes.
+Allocating real machine stack on some platforms is difficult. For example, the
+stack chain needs to be updated every time if we extend the stack on PowerPC.
+Although it is possible, its updating time overhead decreases performance. So
+we do the recursion in memory.
+</P>
+<P>
+(2) Why don't we simply allocate blocks of memory with <b>malloc()</b>?
+<br>
+<br>
+Modern operating systems have a nice feature: they can reserve an address space
+instead of allocating memory. We can safely allocate memory pages inside this
+address space, so the stack could grow without moving memory data (this is
+important because of pointers). Thus we can allocate 1M address space, and use
+only a single memory page (usually 4K) if that is enough. However, we can still
+grow up to 1M anytime if needed.
+</P>
+<P>
+(3) Who "owns" a JIT stack?
+<br>
+<br>
+The owner of the stack is the user program, not the JIT studied pattern or
+anything else. The user program must ensure that if a stack is used by
+<b>pcre_exec()</b>, (that is, it is assigned to the pattern currently running),
+that stack must not be used by any other threads (to avoid overwriting the same
+memory area). The best practice for multithreaded programs is to allocate a
+stack for each thread, and return this stack through the JIT callback function.
+</P>
+<P>
+(4) When should a JIT stack be freed?
+<br>
+<br>
+You can free a JIT stack at any time, as long as it will not be used by
+<b>pcre_exec()</b> again. When you assign the stack to a pattern, only a pointer
+is set. There is no reference counting or any other magic. You can free the
+patterns and stacks in any order, anytime. Just <i>do not</i> call
+<b>pcre_exec()</b> with a pattern pointing to an already freed stack, as that
+will cause SEGFAULT. (Also, do not free a stack currently used by
+<b>pcre_exec()</b> in another thread). You can also replace the stack for a
+pattern at any time. You can even free the previous stack before assigning a
+replacement.
+</P>
+<P>
+(5) Should I allocate/free a stack every time before/after calling
+<b>pcre_exec()</b>?
+<br>
+<br>
+No, because this is too costly in terms of resources. However, you could
+implement some clever idea which release the stack if it is not used in let's
+say two minutes. The JIT callback can help to achive this without keeping a
+list of the currently JIT studied patterns.
+</P>
+<P>
+(6) OK, the stack is for long term memory allocation. But what happens if a
+pattern causes stack overflow with a stack of 1M? Is that 1M kept until the
+stack is freed?
+<br>
+<br>
+Especially on embedded sytems, it might be a good idea to release
+memory sometimes without freeing the stack. There is no API for this at the
+moment. Probably a function call which returns with the currently allocated
+memory for any stack and another which allows releasing memory (shrinking the
+stack) would be a good idea if someone needs this.
+</P>
+<P>
+(7) This is too much of a headache. Isn't there any better solution for JIT
+stack handling?
+<br>
+<br>
+No, thanks to Windows. If POSIX threads were used everywhere, we could throw
+out this complicated API.
</P>
-<br><a name="SEC8" href="#TOC1">EXAMPLE CODE</a><br>
+<br><a name="SEC9" href="#TOC1">EXAMPLE CODE</a><br>
<P>
This is a single-threaded example that specifies a JIT stack without using a
callback.
@@ -260,22 +359,22 @@ callback.
</PRE>
</P>
-<br><a name="SEC9" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC10" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcreapi</b>(3)
</P>
-<br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
<P>
-Philip Hazel
+Philip Hazel (FAQ by Zoltan Herczeg)
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
-<br><a name="SEC11" href="#TOC1">REVISION</a><br>
+<br><a name="SEC12" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 19 October 2011
+Last updated: 26 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrelimits.html b/doc/html/pcrelimits.html
index 4dc28f7..2cab81f 100644
--- a/doc/html/pcrelimits.html
+++ b/doc/html/pcrelimits.html
@@ -37,6 +37,12 @@ There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns.
</P>
<P>
+There is a limit to the number of forward references to subsequent subpatterns
+of around 200,000. Repeated forward references with fixed upper limits, for
+example, (?2){0,100} when subpattern number 2 is to the right, are included in
+the count. There is no limit to the number of backward references.
+</P>
+<P>
The maximum length of name for a named subpattern is 32 characters, and the
maximum number of named subpatterns is 10000.
</P>
@@ -65,7 +71,7 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 24 August 2011
+Last updated: 30 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrematching.html b/doc/html/pcrematching.html
index 3d1acf6..ad17c98 100644
--- a/doc/html/pcrematching.html
+++ b/doc/html/pcrematching.html
@@ -164,9 +164,9 @@ always 1, and the value of the <i>capture_last</i> field is always -1.
</P>
<P>
7. The \C escape sequence, which (in the standard algorithm) matches a single
-byte, even in UTF-8 mode, is not supported because the alternative algorithm
-moves through the subject string one character at a time, for all active paths
-through the tree.
+byte, even in UTF-8 mode, is not supported in UTF-8 mode, because the
+alternative algorithm moves through the subject string one character at a time,
+for all active paths through the tree.
</P>
<P>
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
@@ -220,7 +220,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 17 November 2010
+Last updated: 19 November 2011
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index 349c98c..aa39d63 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -268,7 +268,8 @@ one of the following escape sequences than the binary character it represents:
\t tab (hex 09)
\ddd character with octal code ddd, or back reference
\xhh character with hex code hh
- \x{hhh..} character with hex code hhh..
+ \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
+ \uhhhh character with hex code hhhh (JavaScript mode only)
</pre>
The precise effect of \cx is as follows: if x is a lower case letter, it
is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
@@ -280,12 +281,12 @@ values are valid. A lower case letter is converted to upper case, and then the
0xc0 bits are flipped.)
</P>
<P>
-After \x, from zero to two hexadecimal digits are read (letters can be in
-upper or lower case). Any number of hexadecimal digits may appear between \x{
-and }, but the value of the character code must be less than 256 in non-UTF-8
-mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in
-hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code
-point, which is 10FFFF.
+By default, after \x, from zero to two hexadecimal digits are read (letters
+can be in upper or lower case). Any number of hexadecimal digits may appear
+between \x{ and }, but the value of the character code must be less than 256
+in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
+value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
+Unicode code point, which is 10FFFF.
</P>
<P>
If characters other than hexadecimal digits appear between \x{ and }, or if
@@ -294,9 +295,17 @@ initial \x will be interpreted as a basic hexadecimal escape, with no
following digits, giving a character whose value is zero.
</P>
<P>
+If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
+as just described only when it is followed by two hexadecimal digits.
+Otherwise, it matches a literal "x" character. In JavaScript mode, support for
+code points greater than 256 is provided by \u, which must be followed by
+four hexadecimal digits; otherwise it matches a literal "u" character.
+</P>
+<P>
Characters whose value is less than 256 can be defined by either of the two
-syntaxes for \x. There is no difference in the way they are handled. For
-example, \xdc is exactly the same as \x{dc}.
+syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
+way they are handled. For example, \xdc is exactly the same as \x{dc} (or
+\u00dc in JavaScript mode).
</P>
<P>
After \0 up to two further octal digits are read. If there are fewer than two
@@ -338,12 +347,25 @@ zero, because no more than three octal digits are ever read.
</P>
<P>
All the sequences that define a single character value can be used both inside
-and outside character classes. In addition, inside a character class, the
-sequence \b is interpreted as the backspace character (hex 08). The sequences
-\B, \N, \R, and \X are not special inside a character class. Like any other
-unrecognized escape sequences, they are treated as the literal characters "B",
-"N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is
-set. Outside a character class, these sequences have different meanings.
+and outside character classes. In addition, inside a character class, \b is
+interpreted as the backspace character (hex 08).
+</P>
+<P>
+\N is not allowed in a character class. \B, \R, and \X are not special
+inside a character class. Like other unrecognized escape sequences, they are
+treated as the literal characters "B", "R", and "X" by default, but cause an
+error if the PCRE_EXTRA option is set. Outside a character class, these
+sequences have different meanings.
+</P>
+<br><b>
+Unsupported escape sequences
+</b><br>
+<P>
+In Perl, the sequences \l, \L, \u, and \U are recognized by its string
+handler and used to modify the case of following characters. By default, PCRE
+does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
+option is set, \U matches a "U" character, and \u can be used to define a
+character by code point, as described in the previous section.
</P>
<br><b>
Absolute and relative back references
@@ -389,7 +411,8 @@ Another use of backslash is for specifying generic character types:
There is also the single sequence \N, which matches a non-newline character.
This is the same as
<a href="#fullstopdot">the "." metacharacter</a>
-when PCRE_DOTALL is not set.
+when PCRE_DOTALL is not set. Perl also uses \N to match characters by name;
+PCRE does not support this.
</P>
<P>
Each pair of lower and upper case escape sequences partitions the complete set
@@ -963,7 +986,8 @@ special meaning in a character class.
<P>
The escape sequence \N behaves like a dot, except that it is not affected by
the PCRE_DOTALL option. In other words, it matches any character except one
-that signifies the end of a line.
+that signifies the end of a line. Perl also uses \N to match characters by
+name; PCRE does not support this.
</P>
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
<P>
@@ -979,8 +1003,8 @@ processing unless the PCRE_NO_UTF8_CHECK option is used).
</P>
<P>
PCRE does not allow \C to appear in lookbehind assertions
-<a href="#lookbehind">(described below),</a>
-because in UTF-8 mode this would make it impossible to calculate the length of
+<a href="#lookbehind">(described below)</a>
+in UTF-8 mode, because this would make it impossible to calculate the length of
the lookbehind.
</P>
<P>
@@ -1926,10 +1950,10 @@ match. If there are insufficient characters before the current position, the
assertion fails.
</P>
<P>
-PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode)
-to appear in lookbehind assertions, because it makes it impossible to calculate
-the length of the lookbehind. The \X and \R escapes, which can match
-different numbers of bytes, are also not permitted.
+In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte,
+even in UTF-8 mode) to appear in lookbehind assertions, because it makes it
+impossible to calculate the length of the lookbehind. The \X and \R escapes,
+which can match different numbers of bytes, are also not permitted.
</P>
<P>
<a href="#subpatternsassubroutines">"Subroutine"</a>
@@ -2511,10 +2535,11 @@ failing negative assertion, they cause an error if encountered by
If any of these verbs are used in an assertion or in a subpattern that is
called as a subroutine (whether or not recursively), their effect is confined
to that subpattern; it does not extend to the surrounding pattern, with one
-exception: a *MARK that is encountered in a positive assertion <i>is</i> passed
-back (compare capturing parentheses in assertions). Note that such subpatterns
-are processed as anchored at the point where they are tested. Note also that
-Perl's treatment of subroutines is different in some cases.
+exception: the name from a *(MARK), (*PRUNE), or (*THEN) that is encountered in
+a successful positive assertion <i>is</i> passed back when a match succeeds
+(compare capturing parentheses in assertions). Note that such subpatterns are
+processed as anchored at the point where they are tested. Note also that Perl's
+treatment of subroutines is different in some cases.
</P>
<P>
The new verbs make use of what was previously invalid syntax: an opening
@@ -2536,6 +2561,10 @@ the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the
pattern with (*NO_START_OPT).
</P>
+<P>
+Experiments with Perl suggest that it too has similar optimizations, sometimes
+leading to anomalous results.
+</P>
<br><b>
Verbs that act immediately
</b><br>
@@ -2583,17 +2612,17 @@ A name is always required with this verb. There may be as many instances of
(*MARK) as you like in a pattern, and their names do not have to be unique.
</P>
<P>
-When a match succeeds, the name of the last-encountered (*MARK) is passed back
-to the caller via the <i>pcre_extra</i> data structure, as described in the
+When a match succeeds, the name of the last-encountered (*MARK) on the matching
+path is passed back to the caller via the <i>pcre_extra</i> data structure, as
+described in the
<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a>
in the
<a href="pcreapi.html"><b>pcreapi</b></a>
-documentation. No data is returned for a partial match. Here is an example of
-<b>pcretest</b> output, where the /K modifier requests the retrieval and
-outputting of (*MARK) data:
+documentation. Here is an example of <b>pcretest</b> output, where the /K
+modifier requests the retrieval and outputting of (*MARK) data:
<pre>
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XY
+ re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data&#62; XY
0: XY
MK: A
XZ
@@ -2611,32 +2640,17 @@ passed back if it is the last-encountered. This does not happen for negative
assertions.
</P>
<P>
-A name may also be returned after a failed match if the final path through the
-pattern involves (*MARK). However, unless (*MARK) used in conjunction with
-(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the
-starting point for matching is advanced, the final check is often with an empty
-string, causing a failure before (*MARK) is reached. For example:
+After a partial match or a failed match, the name of the last encountered
+(*MARK) in the entire match process is returned. For example:
<pre>
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XP
- No match
-</pre>
-There are three potential starting points for this match (starting with X,
-starting with P, and with an empty string). If the pattern is anchored, the
-result is different:
-<pre>
- /^X(*MARK:A)Y|^X(*MARK:B)Z/K
- XP
+ re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data&#62; XP
No match, mark = B
</pre>
-PCRE's start-of-match optimizations can also interfere with this. For example,
-if, as a result of a call to <b>pcre_study()</b>, it knows the minimum
-subject length for a match, a shorter subject will not be scanned at all.
-</P>
-<P>
-Note that similar anomalies (though different in detail) exist in Perl, no
-doubt for the same reasons. The use of (*MARK) data after a failed match of an
-unanchored pattern is not recommended, unless (*COMMIT) is involved.
+Note that in this unanchored example the mark is retained from the match
+attempt that started at the letter "X". Subsequent match attempts starting at
+"P" and then with an empty string do not get as far as the (*MARK) item, but
+nevertheless do not reset it.
</P>
<br><b>
Verbs that act after backtracking
@@ -2675,8 +2689,8 @@ Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE's start-of-match optimizations are turned off, as shown in this
<b>pcretest</b> example:
<pre>
- /(*COMMIT)abc/
- xyzabc
+ re&#62; /(*COMMIT)abc/
+ data&#62; xyzabc
0: abc
xyzabc\Y
No match
@@ -2697,10 +2711,8 @@ reached, or when matching to the right of (*PRUNE), but if there is no match to
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
(*PRUNE) is just an alternative to an atomic group or possessive quantifier,
but there are some uses of (*PRUNE) that cannot be expressed in any other way.
-The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
-match fails completely; the name is passed back if this is the final attempt.
-(*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored
-pattern (*PRUNE) has the same effect as (*COMMIT).
+The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
+anchored pattern (*PRUNE) has the same effect as (*COMMIT).
<pre>
(*SKIP)
</pre>
@@ -2726,8 +2738,7 @@ following pattern fails to match, the previous path through the pattern is
searched for the most recent (*MARK) that has the same name. If one is found,
the "bumpalong" advance is to the subject position that corresponds to that
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
-matching name is found, normal "bumpalong" of one character happens (that is,
-the (*SKIP) is ignored).
+matching name is found, the (*SKIP) is ignored.
<pre>
(*THEN) or (*THEN:NAME)
</pre>
@@ -2741,9 +2752,8 @@ be used for a pattern-based if-then-else block:
If the COND1 pattern matches, FOO is tried (and possibly further items after
the end of the group if FOO succeeds); on failure, the matcher skips to the
second alternative and tries COND2, without backtracking into COND1. The
-behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
-overall match fails. If (*THEN) is not inside an alternation, it acts like
-(*PRUNE).
+behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
+If (*THEN) is not inside an alternation, it acts like (*PRUNE).
</P>
<P>
Note that a subpattern that does not contain a | character is just a part of
@@ -2819,7 +2829,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 19 October 2011
+Last updated: 29 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcretest.html b/doc/html/pcretest.html
index 40b970d..c883064 100644
--- a/doc/html/pcretest.html
+++ b/doc/html/pcretest.html
@@ -364,7 +364,10 @@ which it appears.
</P>
<P>
The <b>/M</b> modifier causes the size of memory block used to hold the compiled
-pattern to be output.
+pattern to be output. This does not include the size of the <b>pcre</b> block;
+it is just the actual compiled data. If the pattern is successfully studied
+with the PCRE_STUDY_JIT_COMPILE option, the size of the JIT compiled code is
+also output.
</P>
<P>
If the <b>/S</b> modifier appears once, it causes <b>pcre_study()</b> to be
@@ -856,7 +859,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 26 August 2011
+Last updated: 02 December 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 52ba310..4750c59 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -633,9 +633,9 @@ THE ALTERNATIVE MATCHING ALGORITHM
always 1, and the value of the capture_last field is always -1.
7. The \C escape sequence, which (in the standard algorithm) matches a
- single byte, even in UTF-8 mode, is not supported because the alterna-
- tive algorithm moves through the subject string one character at a
- time, for all active paths through the tree.
+ single byte, even in UTF-8 mode, is not supported in UTF-8 mode,
+ because the alternative algorithm moves through the subject string one
+ character at a time, for all active paths through the tree.
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
are not supported. (*FAIL) is supported, and behaves like a failing
@@ -685,7 +685,7 @@ AUTHOR
REVISION
- Last updated: 17 November 2010
+ Last updated: 19 November 2011
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -1256,6 +1256,20 @@ COMPILING A PATTERN
set (assuming it can find an "a" in the subject), whereas it fails by
default, for Perl compatibility.
+ (3) \U matches an upper case "U" character; by default \U causes a com-
+ pile time error (Perl uses \U to upper case subsequent characters).
+
+ (4) \u matches a lower case "u" character unless it is followed by four
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, \u causes a compile time error (Perl
+ uses it to upper case the following character).
+
+ (5) \x matches a lower case "x" character unless it is followed by two
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, as in Perl, a hexadecimal number is
+ always expected after \x, but it may have zero, one, or two digits (so,
+ for example, \xz matches a binary zero character followed by z).
+
PCRE_MULTILINE
By default, PCRE treats the subject string as consisting of a single
@@ -1710,6 +1724,12 @@ INFORMATION ABOUT A PATTERN
compiler could not handle this particular pattern. See the pcrejit doc-
umentation for details of what can and cannot be handled.
+ PCRE_INFO_JITSIZE
+
+ If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE
+ option, return the size of the JIT compiled code, otherwise return
+ zero. The fourth argument should point to a size_t variable.
+
PCRE_INFO_LASTLITERAL
Return the value of the rightmost literal byte that must exist in any
@@ -1818,10 +1838,14 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_SIZE
- Return the size of the compiled pattern, that is, the value that was
- passed as the argument to pcre_malloc() when PCRE was getting memory in
- which to place the compiled data. The fourth argument should point to a
- size_t variable.
+ Return the size of the compiled pattern. The fourth argument should
+ point to a size_t variable. This value does not include the size of the
+ pcre structure that is returned by pcre_compile(). The value that is
+ passed as the argument to pcre_malloc() when pcre_compile() is getting
+ memory in which to place the compiled data is the value returned by
+ this option plus the size of the pcre structure. Studying a compiled
+ pattern, with or without JIT, does not alter the value returned by this
+ option.
PCRE_INFO_STUDYSIZE
@@ -2980,7 +3004,7 @@ AUTHOR
REVISION
- Last updated: 23 September 2011
+ Last updated: 02 December 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
@@ -3143,9 +3167,11 @@ THE CALLOUT INTERFACE
The mark field is present from version 2 of the pcre_callout structure.
In callouts from pcre_exec() it contains a pointer to the zero-termi-
- nated name of the most recently passed (*MARK) item in the match, or
- NULL if there are no (*MARK)s in the current matching path. In callouts
- from pcre_dfa_exec() this field always contains NULL.
+ nated name of the most recently passed (*MARK), (*PRUNE), or (*THEN)
+ item in the match, or NULL if no such items have been passed. Instances
+ of (*PRUNE) or (*THEN) without a name do not obliterate a previous
+ (*MARK). In callouts from pcre_dfa_exec() this field always contains
+ NULL.
RETURN VALUES
@@ -3173,7 +3199,7 @@ AUTHOR
REVISION
- Last updated: 26 August 2011
+ Last updated: 30 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
@@ -3218,7 +3244,9 @@ DIFFERENCES BETWEEN PCRE AND PERL
its own, matching a non-newline character, is supported.) In fact these
are implemented by Perl's general string-handling and are not part of
its pattern matching engine. If any of these are encountered by PCRE,
- an error is generated.
+ an error is generated by default. However, if the PCRE_JAVASCRIPT_COM-
+ PAT option is set, \U and \u are interpreted as JavaScript interprets
+ them.
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
is built with Unicode character property support. The properties that
@@ -3345,7 +3373,7 @@ AUTHOR
REVISION
- Last updated: 09 October 2011
+ Last updated: 14 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
@@ -3572,7 +3600,8 @@ BACKSLASH
\t tab (hex 09)
\ddd character with octal code ddd, or back reference
\xhh character with hex code hh
- \x{hhh..} character with hex code hhh..
+ \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
+ \uhhhh character with hex code hhhh (JavaScript mode only)
The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
@@ -3583,12 +3612,12 @@ BACKSLASH
is compiled in EBCDIC mode, all byte values are valid. A lower case
letter is converted to upper case, and then the 0xc0 bits are flipped.)
- After \x, from zero to two hexadecimal digits are read (letters can be
- in upper or lower case). Any number of hexadecimal digits may appear
- between \x{ and }, but the value of the character code must be less
- than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
- the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
- than the largest Unicode code point, which is 10FFFF.
+ By default, after \x, from zero to two hexadecimal digits are read
+ (letters can be in upper or lower case). Any number of hexadecimal dig-
+ its may appear between \x{ and }, but the value of the character code
+ must be less than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8
+ mode. That is, the maximum value in hexadecimal is 7FFFFFFF. Note that
+ this is bigger than the largest Unicode code point, which is 10FFFF.
If characters other than hexadecimal digits appear between \x{ and },
or if there is no terminating }, this form of escape is not recognized.
@@ -3596,9 +3625,17 @@ BACKSLASH
escape, with no following digits, giving a character whose value is
zero.
+ If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
+ is as just described only when it is followed by two hexadecimal dig-
+ its. Otherwise, it matches a literal "x" character. In JavaScript
+ mode, support for code points greater than 256 is provided by \u, which
+ must be followed by four hexadecimal digits; otherwise it matches a
+ literal "u" character.
+
Characters whose value is less than 256 can be defined by either of the
- two syntaxes for \x. There is no difference in the way they are han-
- dled. For example, \xdc is exactly the same as \x{dc}.
+ two syntaxes for \x (or by \u in JavaScript mode). There is no differ-
+ ence in the way they are handled. For example, \xdc is exactly the same
+ as \x{dc} (or \u00dc in JavaScript mode).
After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the
@@ -3642,12 +3679,22 @@ BACKSLASH
All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
- class, the sequence \b is interpreted as the backspace character (hex
- 08). The sequences \B, \N, \R, and \X are not special inside a charac-
- ter class. Like any other unrecognized escape sequences, they are
- treated as the literal characters "B", "N", "R", and "X" by default,
- but cause an error if the PCRE_EXTRA option is set. Outside a character
- class, these sequences have different meanings.
+ class, \b is interpreted as the backspace character (hex 08).
+
+ \N is not allowed in a character class. \B, \R, and \X are not special
+ inside a character class. Like other unrecognized escape sequences,
+ they are treated as the literal characters "B", "R", and "X" by
+ default, but cause an error if the PCRE_EXTRA option is set. Outside a
+ character class, these sequences have different meanings.
+
+ Unsupported escape sequences
+
+ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
+ handler and used to modify the case of following characters. By
+ default, PCRE does not support these escape sequences. However, if the
+ PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
+ \u can be used to define a character by code point, as described in the
+ previous section.
Absolute and relative back references
@@ -3682,53 +3729,54 @@ BACKSLASH
There is also the single sequence \N, which matches a non-newline char-
acter. This is the same as the "." metacharacter when PCRE_DOTALL is
- not set.
-
- Each pair of lower and upper case escape sequences partitions the com-
- plete set of characters into two disjoint sets. Any given character
- matches one, and only one, of each pair. The sequences can appear both
- inside and outside character classes. They each match one character of
- the appropriate type. If the current matching point is at the end of
- the subject string, all of them fail, because there is no character to
+ not set. Perl also uses \N to match characters by name; PCRE does not
+ support this.
+
+ Each pair of lower and upper case escape sequences partitions the com-
+ plete set of characters into two disjoint sets. Any given character
+ matches one, and only one, of each pair. The sequences can appear both
+ inside and outside character classes. They each match one character of
+ the appropriate type. If the current matching point is at the end of
+ the subject string, all of them fail, because there is no character to
match.
- For compatibility with Perl, \s does not match the VT character (code
- 11). This makes it different from the the POSIX "space" class. The \s
- characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
+ For compatibility with Perl, \s does not match the VT character (code
+ 11). This makes it different from the the POSIX "space" class. The \s
+ characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
"use locale;" is included in a Perl script, \s may match the VT charac-
ter. In PCRE, it never does.
- A "word" character is an underscore or any character that is a letter
- or digit. By default, the definition of letters and digits is con-
- trolled by PCRE's low-valued character tables, and may vary if locale-
- specific matching is taking place (see "Locale support" in the pcreapi
- page). For example, in a French locale such as "fr_FR" in Unix-like
- systems, or "french" in Windows, some character codes greater than 128
- are used for accented letters, and these are then matched by \w. The
+ A "word" character is an underscore or any character that is a letter
+ or digit. By default, the definition of letters and digits is con-
+ trolled by PCRE's low-valued character tables, and may vary if locale-
+ specific matching is taking place (see "Locale support" in the pcreapi
+ page). For example, in a French locale such as "fr_FR" in Unix-like
+ systems, or "french" in Windows, some character codes greater than 128
+ are used for accented letters, and these are then matched by \w. The
use of locales with Unicode is discouraged.
- By default, in UTF-8 mode, characters with values greater than 128
- never match \d, \s, or \w, and always match \D, \S, and \W. These
- sequences retain their original meanings from before UTF-8 support was
- available, mainly for efficiency reasons. However, if PCRE is compiled
- with Unicode property support, and the PCRE_UCP option is set, the be-
- haviour is changed so that Unicode properties are used to determine
+ By default, in UTF-8 mode, characters with values greater than 128
+ never match \d, \s, or \w, and always match \D, \S, and \W. These
+ sequences retain their original meanings from before UTF-8 support was
+ available, mainly for efficiency reasons. However, if PCRE is compiled
+ with Unicode property support, and the PCRE_UCP option is set, the be-
+ haviour is changed so that Unicode properties are used to determine
character types, as follows:
\d any character that \p{Nd} matches (decimal digit)
\s any character that \p{Z} matches, plus HT, LF, FF, CR
\w any character that \p{L} or \p{N} matches, plus underscore
- The upper case escapes match the inverse sets of characters. Note that
- \d matches only decimal digits, whereas \w matches any Unicode digit,
- as well as any Unicode letter, and underscore. Note also that PCRE_UCP
- affects \b, and \B because they are defined in terms of \w and \W.
+ The upper case escapes match the inverse sets of characters. Note that
+ \d matches only decimal digits, whereas \w matches any Unicode digit,
+ as well as any Unicode letter, and underscore. Note also that PCRE_UCP
+ affects \b, and \B because they are defined in terms of \w and \W.
Matching these sequences is noticeably slower when PCRE_UCP is set.
- The sequences \h, \H, \v, and \V are features that were added to Perl
- at release 5.10. In contrast to the other sequences, which match only
- ASCII characters by default, these always match certain high-valued
- codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon-
+ The sequences \h, \H, \v, and \V are features that were added to Perl
+ at release 5.10. In contrast to the other sequences, which match only
+ ASCII characters by default, these always match certain high-valued
+ codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon-
tal space characters are:
U+0009 Horizontal tab
@@ -3763,104 +3811,104 @@ BACKSLASH
Newline sequences
- Outside a character class, by default, the escape sequence \R matches
+ Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
- This is an example of an "atomic group", details of which are given
+ This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
- CR followed by LF, or one of the single characters LF (linefeed,
+ CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
return, U+000D), or NEL (next line, U+0085). The two-character sequence
is treated as a single unit that cannot be split.
- In UTF-8 mode, two additional characters whose codepoints are greater
+ In UTF-8 mode, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
- rator, U+2029). Unicode character property support is not needed for
+ rator, U+2029). Unicode character property support is not needed for
these characters to be recognized.
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
- the complete set of Unicode line endings) by setting the option
+ the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbrevation for "backslash R".) This can be made the default
- when PCRE is built; if this is the case, the other behaviour can be
- requested via the PCRE_BSR_UNICODE option. It is also possible to
- specify these settings by starting a pattern string with one of the
+ when PCRE is built; if this is the case, the other behaviour can be
+ requested via the PCRE_BSR_UNICODE option. It is also possible to
+ specify these settings by starting a pattern string with one of the
following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
- These override the default and the options given to pcre_compile() or
- pcre_compile2(), but they can be overridden by options given to
+ These override the default and the options given to pcre_compile() or
+ pcre_compile2(), but they can be overridden by options given to
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
- are not Perl-compatible, are recognized only at the very start of a
- pattern, and that they must be in upper case. If more than one of them
+ are not Perl-compatible, are recognized only at the very start of a
+ pattern, and that they must be in upper case. If more than one of them
is present, the last one is used. They can be combined with a change of
newline convention; for example, a pattern can start with:
(*ANY)(*BSR_ANYCRLF)
They can also be combined with the (*UTF8) or (*UCP) special sequences.
- Inside a character class, \R is treated as an unrecognized escape
+ Inside a character class, \R is treated as an unrecognized escape
sequence, and so matches the letter "R" by default, but causes an error
if PCRE_EXTRA is set.
Unicode character properties
When PCRE is built with Unicode character property support, three addi-
- tional escape sequences that match characters with specific properties
- are available. When not in UTF-8 mode, these sequences are of course
- limited to testing characters whose codepoints are less than 256, but
+ tional escape sequences that match characters with specific properties
+ are available. When not in UTF-8 mode, these sequences are of course
+ limited to testing characters whose codepoints are less than 256, but
they do work in this mode. The extra escape sequences are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X an extended Unicode sequence
- The property names represented by xx above are limited to the Unicode
+ The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any
- character (including newline), and some special PCRE properties
- (described in the next section). Other Perl properties such as "InMu-
- sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
+ character (including newline), and some special PCRE properties
+ (described in the next section). Other Perl properties such as "InMu-
+ sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
does not match any characters, so always causes a match failure.
Sets of Unicode characters are defined as belonging to certain scripts.
- A character from one of these sets can be matched using a script name.
+ A character from one of these sets can be matched using a script name.
For example:
\p{Greek}
\P{Han}
- Those that are not part of an identified script are lumped together as
+ Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
- Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
- Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp-
- tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
- Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe-
+ Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
+ Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp-
+ tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
+ Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe-
rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
- Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
+ Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam,
- Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
- Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,
- Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,
- Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
- Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
+ Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
+ Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,
+ Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,
+ Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
+ Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Vai, Yi.
Each character has exactly one Unicode general category property, spec-
- ified by a two-letter abbreviation. For compatibility with Perl, nega-
- tion can be specified by including a circumflex between the opening
- brace and the property name. For example, \p{^Lu} is the same as
+ ified by a two-letter abbreviation. For compatibility with Perl, nega-
+ tion can be specified by including a circumflex between the opening
+ brace and the property name. For example, \p{^Lu} is the same as
\P{Lu}.
If only one letter is specified with \p or \P, it includes all the gen-
- eral category properties that start with that letter. In this case, in
- the absence of negation, the curly brackets in the escape sequence are
+ eral category properties that start with that letter. In this case, in
+ the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:
\p{L}
@@ -3912,54 +3960,54 @@ BACKSLASH
Zp Paragraph separator
Zs Space separator
- The special property L& is also supported: it matches a character that
- has the Lu, Ll, or Lt property, in other words, a letter that is not
+ The special property L& is also supported: it matches a character that
+ has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".
- The Cs (Surrogate) property applies only to characters in the range
- U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
+ The Cs (Surrogate) property applies only to characters in the range
+ U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
- ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
+ ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
the pcreapi page). Perl does not support the Cs property.
- The long synonyms for property names that Perl supports (such as
- \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
+ The long synonyms for property names that Perl supports (such as
+ \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".
No character that is in the Unicode table has the Cn (unassigned) prop-
erty. Instead, this property is assumed for any code point that is not
in the Unicode table.
- Specifying caseless matching does not affect these escape sequences.
+ Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.
- The \X escape matches any number of Unicode characters that form an
+ The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to
(?>\PM\pM*)
- That is, it matches a character without the "mark" property, followed
- by zero or more characters with the "mark" property, and treats the
- sequence as an atomic group (see below). Characters with the "mark"
- property are typically accents that affect the preceding character.
- None of them have codepoints less than 256, so in non-UTF-8 mode \X
+ That is, it matches a character without the "mark" property, followed
+ by zero or more characters with the "mark" property, and treats the
+ sequence as an atomic group (see below). Characters with the "mark"
+ property are typically accents that affect the preceding character.
+ None of them have codepoints less than 256, so in non-UTF-8 mode \X
matches any one character.
Note that recent versions of Perl have changed \X to match what Unicode
calls an "extended grapheme cluster", which has a more complicated def-
inition.
- Matching characters by Unicode property is not fast, because PCRE has
- to search a structure that contains data for over fifteen thousand
+ Matching characters by Unicode property is not fast, because PCRE has
+ to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
- \w do not use Unicode properties in PCRE by default, though you can
+ \w do not use Unicode properties in PCRE by default, though you can
make them do so by setting the PCRE_UCP option for pcre_compile() or by
starting the pattern with (*UCP).
PCRE's additional properties
- As well as the standard Unicode properties described in the previous
- section, PCRE supports four more that make it possible to convert tra-
+ As well as the standard Unicode properties described in the previous
+ section, PCRE supports four more that make it possible to convert tra-
ditional escape sequences such as \w and \s and POSIX character classes
to use Unicode properties. PCRE uses these non-standard, non-Perl prop-
erties internally when PCRE_UCP is set. They are:
@@ -3969,40 +4017,40 @@ BACKSLASH
Xsp Any Perl space character
Xwd Any Perl "word" character
- Xan matches characters that have either the L (letter) or the N (num-
- ber) property. Xps matches the characters tab, linefeed, vertical tab,
- formfeed, or carriage return, and any other character that has the Z
+ Xan matches characters that have either the L (letter) or the N (num-
+ ber) property. Xps matches the characters tab, linefeed, vertical tab,
+ formfeed, or carriage return, and any other character that has the Z
(separator) property. Xsp is the same as Xps, except that vertical tab
is excluded. Xwd matches the same characters as Xan, plus underscore.
Resetting the match start
- The escape sequence \K causes any previously matched characters not to
+ The escape sequence \K causes any previously matched characters not to
be included in the final matched sequence. For example, the pattern:
foo\Kbar
- matches "foobar", but reports that it has matched "bar". This feature
- is similar to a lookbehind assertion (described below). However, in
- this case, the part of the subject before the real match does not have
- to be of fixed length, as lookbehind assertions do. The use of \K does
- not interfere with the setting of captured substrings. For example,
+ matches "foobar", but reports that it has matched "bar". This feature
+ is similar to a lookbehind assertion (described below). However, in
+ this case, the part of the subject before the real match does not have
+ to be of fixed length, as lookbehind assertions do. The use of \K does
+ not interfere with the setting of captured substrings. For example,
when the pattern
(foo)\Kbar
matches "foobar", the first substring is still set to "foo".
- Perl documents that the use of \K within assertions is "not well
- defined". In PCRE, \K is acted upon when it occurs inside positive
+ Perl documents that the use of \K within assertions is "not well
+ defined". In PCRE, \K is acted upon when it occurs inside positive
assertions, but is ignored in negative assertions.
Simple assertions
- The final use of backslash is for certain simple assertions. An asser-
- tion specifies a condition that has to be met at a particular point in
- a match, without consuming any characters from the subject string. The
- use of subpatterns for more complicated assertions is described below.
+ The final use of backslash is for certain simple assertions. An asser-
+ tion specifies a condition that has to be met at a particular point in
+ a match, without consuming any characters from the subject string. The
+ use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:
\b matches at a word boundary
@@ -4013,49 +4061,49 @@ BACKSLASH
\z matches only at the end of the subject
\G matches at the first matching position in the subject
- Inside a character class, \b has a different meaning; it matches the
- backspace character. If any other of these assertions appears in a
- character class, by default it matches the corresponding literal char-
+ Inside a character class, \b has a different meaning; it matches the
+ backspace character. If any other of these assertions appears in a
+ character class, by default it matches the corresponding literal char-
acter (for example, \B matches the letter B). However, if the
- PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
+ PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
ated instead.
- A word boundary is a position in the subject string where the current
- character and the previous character do not both match \w or \W (i.e.
- one matches \w and the other matches \W), or the start or end of the
- string if the first or last character matches \w, respectively. In
- UTF-8 mode, the meanings of \w and \W can be changed by setting the
- PCRE_UCP option. When this is done, it also affects \b and \B. Neither
- PCRE nor Perl has a separate "start of word" or "end of word" metase-
- quence. However, whatever follows \b normally determines which it is.
+ A word boundary is a position in the subject string where the current
+ character and the previous character do not both match \w or \W (i.e.
+ one matches \w and the other matches \W), or the start or end of the
+ string if the first or last character matches \w, respectively. In
+ UTF-8 mode, the meanings of \w and \W can be changed by setting the
+ PCRE_UCP option. When this is done, it also affects \b and \B. Neither
+ PCRE nor Perl has a separate "start of word" or "end of word" metase-
+ quence. However, whatever follows \b normally determines which it is.
For example, the fragment \ba matches "a" at the start of a word.
- The \A, \Z, and \z assertions differ from the traditional circumflex
+ The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
- at the very start and end of the subject string, whatever options are
- set. Thus, they are independent of multiline mode. These three asser-
+ at the very start and end of the subject string, whatever options are
+ set. Thus, they are independent of multiline mode. These three asser-
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
- affect only the behaviour of the circumflex and dollar metacharacters.
- However, if the startoffset argument of pcre_exec() is non-zero, indi-
+ affect only the behaviour of the circumflex and dollar metacharacters.
+ However, if the startoffset argument of pcre_exec() is non-zero, indi-
cating that matching is to start at a point other than the beginning of
- the subject, \A can never match. The difference between \Z and \z is
+ the subject, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at
the very end, whereas \z matches only at the end.
- The \G assertion is true only when the current matching position is at
- the start point of the match, as specified by the startoffset argument
- of pcre_exec(). It differs from \A when the value of startoffset is
- non-zero. By calling pcre_exec() multiple times with appropriate argu-
+ The \G assertion is true only when the current matching position is at
+ the start point of the match, as specified by the startoffset argument
+ of pcre_exec(). It differs from \A when the value of startoffset is
+ non-zero. By calling pcre_exec() multiple times with appropriate argu-
ments, you can mimic Perl's /g option, and it is in this kind of imple-
mentation where \G can be useful.
- Note, however, that PCRE's interpretation of \G, as the start of the
+ Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
- end of the previous match. In Perl, these can be different when the
- previously matched string was empty. Because PCRE does just one match
+ end of the previous match. In Perl, these can be different when the
+ previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.
- If all the alternatives of a pattern begin with \G, the expression is
+ If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.
@@ -4063,80 +4111,81 @@ BACKSLASH
CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the circumflex
- character is an assertion that is true only if the current matching
- point is at the start of the subject string. If the startoffset argu-
- ment of pcre_exec() is non-zero, circumflex can never match if the
- PCRE_MULTILINE option is unset. Inside a character class, circumflex
+ character is an assertion that is true only if the current matching
+ point is at the start of the subject string. If the startoffset argu-
+ ment of pcre_exec() is non-zero, circumflex can never match if the
+ PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).
- Circumflex need not be the first character of the pattern if a number
- of alternatives are involved, but it should be the first thing in each
- alternative in which it appears if the pattern is ever to match that
- branch. If all possible alternatives start with a circumflex, that is,
- if the pattern is constrained to match only at the start of the sub-
- ject, it is said to be an "anchored" pattern. (There are also other
+ Circumflex need not be the first character of the pattern if a number
+ of alternatives are involved, but it should be the first thing in each
+ alternative in which it appears if the pattern is ever to match that
+ branch. If all possible alternatives start with a circumflex, that is,
+ if the pattern is constrained to match only at the start of the sub-
+ ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
- A dollar character is an assertion that is true only if the current
- matching point is at the end of the subject string, or immediately
+ A dollar character is an assertion that is true only if the current
+ matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
- be the last character of the pattern if a number of alternatives are
- involved, but it should be the last item in any branch in which it
+ be the last character of the pattern if a number of alternatives are
+ involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.
- The meaning of dollar can be changed so that it matches only at the
- very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
+ The meaning of dollar can be changed so that it matches only at the
+ very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are changed if the
- PCRE_MULTILINE option is set. When this is the case, a circumflex
- matches immediately after internal newlines as well as at the start of
- the subject string. It does not match after a newline that ends the
- string. A dollar matches before any newlines in the string, as well as
- at the very end, when PCRE_MULTILINE is set. When newline is specified
- as the two-character sequence CRLF, isolated CR and LF characters do
+ PCRE_MULTILINE option is set. When this is the case, a circumflex
+ matches immediately after internal newlines as well as at the start of
+ the subject string. It does not match after a newline that ends the
+ string. A dollar matches before any newlines in the string, as well as
+ at the very end, when PCRE_MULTILINE is set. When newline is specified
+ as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.
- For example, the pattern /^abc$/ matches the subject string "def\nabc"
- (where \n represents a newline) in multiline mode, but not otherwise.
- Consequently, patterns that are anchored in single line mode because
- all branches start with ^ are not anchored in multiline mode, and a
- match for circumflex is possible when the startoffset argument of
- pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
+ For example, the pattern /^abc$/ matches the subject string "def\nabc"
+ (where \n represents a newline) in multiline mode, but not otherwise.
+ Consequently, patterns that are anchored in single line mode because
+ all branches start with ^ are not anchored in multiline mode, and a
+ match for circumflex is possible when the startoffset argument of
+ pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.
- Note that the sequences \A, \Z, and \z can be used to match the start
- and end of the subject in both modes, and if all branches of a pattern
- start with \A it is always anchored, whether or not PCRE_MULTILINE is
+ Note that the sequences \A, \Z, and \z can be used to match the start
+ and end of the subject in both modes, and if all branches of a pattern
+ start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.
FULL STOP (PERIOD, DOT) AND \N
Outside a character class, a dot in the pattern matches any one charac-
- ter in the subject string except (by default) a character that signi-
- fies the end of a line. In UTF-8 mode, the matched character may be
+ ter in the subject string except (by default) a character that signi-
+ fies the end of a line. In UTF-8 mode, the matched character may be
more than one byte long.
- When a line ending is defined as a single character, dot never matches
- that character; when the two-character sequence CRLF is used, dot does
- not match CR if it is immediately followed by LF, but otherwise it
- matches all characters (including isolated CRs and LFs). When any Uni-
- code line endings are being recognized, dot does not match CR or LF or
+ When a line ending is defined as a single character, dot never matches
+ that character; when the two-character sequence CRLF is used, dot does
+ not match CR if it is immediately followed by LF, but otherwise it
+ matches all characters (including isolated CRs and LFs). When any Uni-
+ code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters.
- The behaviour of dot with regard to newlines can be changed. If the
- PCRE_DOTALL option is set, a dot matches any one character, without
+ The behaviour of dot with regard to newlines can be changed. If the
+ PCRE_DOTALL option is set, a dot matches any one character, without
exception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it.
- The handling of dot is entirely independent of the handling of circum-
- flex and dollar, the only relationship being that they both involve
+ The handling of dot is entirely independent of the handling of circum-
+ flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.
- The escape sequence \N behaves like a dot, except that it is not
- affected by the PCRE_DOTALL option. In other words, it matches any
- character except one that signifies the end of a line.
+ The escape sequence \N behaves like a dot, except that it is not
+ affected by the PCRE_DOTALL option. In other words, it matches any
+ character except one that signifies the end of a line. Perl also uses
+ \N to match characters by name; PCRE does not support this.
MATCHING A SINGLE BYTE
@@ -4153,7 +4202,7 @@ MATCHING A SINGLE BYTE
PCRE_NO_UTF8_CHECK option is used).
PCRE does not allow \C to appear in lookbehind assertions (described
- below), because in UTF-8 mode this would make it impossible to calcu-
+ below) in UTF-8 mode, because this would make it impossible to calcu-
late the length of the lookbehind.
In general, the \C escape sequence is best avoided in UTF-8 mode. How-
@@ -5060,40 +5109,41 @@ ASSERTIONS
then try to match. If there are insufficient characters before the cur-
rent position, the assertion fails.
- PCRE does not allow the \C escape (which matches a single byte in UTF-8
- mode) to appear in lookbehind assertions, because it makes it impossi-
- ble to calculate the length of the lookbehind. The \X and \R escapes,
- which can match different numbers of bytes, are also not permitted.
+ In UTF-8 mode, PCRE does not allow the \C escape (which matches a sin-
+ gle byte, even in UTF-8 mode) to appear in lookbehind assertions,
+ because it makes it impossible to calculate the length of the lookbe-
+ hind. The \X and \R escapes, which can match different numbers of
+ bytes, are also not permitted.
- "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
- lookbehinds, as long as the subpattern matches a fixed-length string.
+ "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
+ lookbehinds, as long as the subpattern matches a fixed-length string.
Recursion, however, is not supported.
- Possessive quantifiers can be used in conjunction with lookbehind
+ Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching of fixed-length strings at the
end of subject strings. Consider a simple pattern such as
abcd$
- when applied to a long string that does not match. Because matching
+ when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
- and then see if what follows matches the rest of the pattern. If the
+ and then see if what follows matches the rest of the pattern. If the
pattern is specified as
^.*abcd$
- the initial .* matches the entire string at first, but when this fails
+ the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
- last character, then all but the last two characters, and so on. Once
- again the search for "a" covers the entire string, from right to left,
+ last character, then all but the last two characters, and so on. Once
+ again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as
^.*+(?<=abcd)
- there can be no backtracking for the .*+ item; it can match only the
- entire string. The subsequent lookbehind assertion does a single test
- on the last four characters. If it fails, the match fails immediately.
- For long strings, this approach makes a significant difference to the
+ there can be no backtracking for the .*+ item; it can match only the
+ entire string. The subsequent lookbehind assertion does a single test
+ on the last four characters. If it fails, the match fails immediately.
+ For long strings, this approach makes a significant difference to the
processing time.
Using multiple assertions
@@ -5102,18 +5152,18 @@ ASSERTIONS
(?<=\d{3})(?<!999)foo
- matches "foo" preceded by three digits that are not "999". Notice that
- each of the assertions is applied independently at the same point in
- the subject string. First there is a check that the previous three
- characters are all digits, and then there is a check that the same
+ matches "foo" preceded by three digits that are not "999". Notice that
+ each of the assertions is applied independently at the same point in
+ the subject string. First there is a check that the previous three
+ characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo" pre-
- ceded by six characters, the first of which are digits and the last
- three of which are not "999". For example, it doesn't match "123abc-
+ ceded by six characters, the first of which are digits and the last
+ three of which are not "999". For example, it doesn't match "123abc-
foo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo
- This time the first assertion looks at the preceding six characters,
+ This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
@@ -5121,29 +5171,29 @@ ASSERTIONS
(?<=(?<!foo)bar)baz
- matches an occurrence of "baz" that is preceded by "bar" which in turn
+ matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while
(?<=\d{3}(?!999)...)foo
- is another pattern that matches "foo" preceded by three digits and any
+ is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".
CONDITIONAL SUBPATTERNS
- It is possible to cause the matching process to obey a subpattern con-
- ditionally or to choose between two alternative subpatterns, depending
- on the result of an assertion, or whether a specific capturing subpat-
- tern has already been matched. The two possible forms of conditional
+ It is possible to cause the matching process to obey a subpattern con-
+ ditionally or to choose between two alternative subpatterns, depending
+ on the result of an assertion, or whether a specific capturing subpat-
+ tern has already been matched. The two possible forms of conditional
subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
- If the condition is satisfied, the yes-pattern is used; otherwise the
- no-pattern (if present) is used. If there are more than two alterna-
- tives in the subpattern, a compile-time error occurs. Each of the two
+ If the condition is satisfied, the yes-pattern is used; otherwise the
+ no-pattern (if present) is used. If there are more than two alterna-
+ tives in the subpattern, a compile-time error occurs. Each of the two
alternatives may itself contain nested subpatterns of any form, includ-
ing conditional subpatterns; the restriction to two alternatives
applies only at the level of the condition. This pattern fragment is an
@@ -5152,73 +5202,73 @@ CONDITIONAL SUBPATTERNS
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
- There are four kinds of condition: references to subpatterns, refer-
+ There are four kinds of condition: references to subpatterns, refer-
ences to recursion, a pseudo-condition called DEFINE, and assertions.
Checking for a used subpattern by number
- If the text between the parentheses consists of a sequence of digits,
+ If the text between the parentheses consists of a sequence of digits,
the condition is true if a capturing subpattern of that number has pre-
- viously matched. If there is more than one capturing subpattern with
- the same number (see the earlier section about duplicate subpattern
- numbers), the condition is true if any of them have matched. An alter-
- native notation is to precede the digits with a plus or minus sign. In
- this case, the subpattern number is relative rather than absolute. The
- most recently opened parentheses can be referenced by (?(-1), the next
- most recent by (?(-2), and so on. Inside loops it can also make sense
+ viously matched. If there is more than one capturing subpattern with
+ the same number (see the earlier section about duplicate subpattern
+ numbers), the condition is true if any of them have matched. An alter-
+ native notation is to precede the digits with a plus or minus sign. In
+ this case, the subpattern number is relative rather than absolute. The
+ most recently opened parentheses can be referenced by (?(-1), the next
+ most recent by (?(-2), and so on. Inside loops it can also make sense
to refer to subsequent groups. The next parentheses to be opened can be
- referenced as (?(+1), and so on. (The value zero in any of these forms
+ referenced as (?(+1), and so on. (The value zero in any of these forms
is not used; it provokes a compile-time error.)
- Consider the following pattern, which contains non-significant white
+ Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
- The first part matches an optional opening parenthesis, and if that
+ The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The sec-
- ond part matches one or more characters that are not parentheses. The
- third part is a conditional subpattern that tests whether or not the
- first set of parentheses matched. If they did, that is, if subject
- started with an opening parenthesis, the condition is true, and so the
- yes-pattern is executed and a closing parenthesis is required. Other-
- wise, since no-pattern is not present, the subpattern matches nothing.
- In other words, this pattern matches a sequence of non-parentheses,
+ ond part matches one or more characters that are not parentheses. The
+ third part is a conditional subpattern that tests whether or not the
+ first set of parentheses matched. If they did, that is, if subject
+ started with an opening parenthesis, the condition is true, and so the
+ yes-pattern is executed and a closing parenthesis is required. Other-
+ wise, since no-pattern is not present, the subpattern matches nothing.
+ In other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
- If you were embedding this pattern in a larger one, you could use a
+ If you were embedding this pattern in a larger one, you could use a
relative reference:
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
- This makes the fragment independent of the parentheses in the larger
+ This makes the fragment independent of the parentheses in the larger
pattern.
Checking for a used subpattern by name
- Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
- used subpattern by name. For compatibility with earlier versions of
- PCRE, which had this facility before Perl, the syntax (?(name)...) is
- also recognized. However, there is a possible ambiguity with this syn-
- tax, because subpattern names may consist entirely of digits. PCRE
- looks first for a named subpattern; if it cannot find one and the name
- consists entirely of digits, PCRE looks for a subpattern of that num-
- ber, which must be greater than zero. Using subpattern names that con-
+ Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
+ used subpattern by name. For compatibility with earlier versions of
+ PCRE, which had this facility before Perl, the syntax (?(name)...) is
+ also recognized. However, there is a possible ambiguity with this syn-
+ tax, because subpattern names may consist entirely of digits. PCRE
+ looks first for a named subpattern; if it cannot find one and the name
+ consists entirely of digits, PCRE looks for a subpattern of that num-
+ ber, which must be greater than zero. Using subpattern names that con-
sist entirely of digits is not recommended.
Rewriting the above example to use a named subpattern gives this:
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
- If the name used in a condition of this kind is a duplicate, the test
- is applied to all subpatterns of the same name, and is true if any one
+ If the name used in a condition of this kind is a duplicate, the test
+ is applied to all subpatterns of the same name, and is true if any one
of them has matched.
Checking for pattern recursion
If the condition is the string (R), and there is no subpattern with the
- name R, the condition is true if a recursive call to the whole pattern
+ name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by amper-
sand follow the letter R, for example:
@@ -5226,51 +5276,51 @@ CONDITIONAL SUBPATTERNS
the condition is true if the most recent recursion is into a subpattern
whose number or name is given. This condition does not check the entire
- recursion stack. If the name used in a condition of this kind is a
+ recursion stack. If the name used in a condition of this kind is a
duplicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion.
- At "top level", all these recursion test conditions are false. The
+ At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below.
Defining subpatterns for use by reference only
- If the condition is the string (DEFINE), and there is no subpattern
- with the name DEFINE, the condition is always false. In this case,
- there may be only one alternative in the subpattern. It is always
- skipped if control reaches this point in the pattern; the idea of
- DEFINE is that it can be used to define subroutines that can be refer-
- enced from elsewhere. (The use of subroutines is described below.) For
- example, a pattern to match an IPv4 address such as "192.168.23.245"
+ If the condition is the string (DEFINE), and there is no subpattern
+ with the name DEFINE, the condition is always false. In this case,
+ there may be only one alternative in the subpattern. It is always
+ skipped if control reaches this point in the pattern; the idea of
+ DEFINE is that it can be used to define subroutines that can be refer-
+ enced from elsewhere. (The use of subroutines is described below.) For
+ example, a pattern to match an IPv4 address such as "192.168.23.245"
could be written like this (ignore whitespace and line breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
- The first part of the pattern is a DEFINE group inside which a another
- group named "byte" is defined. This matches an individual component of
- an IPv4 address (a number less than 256). When matching takes place,
- this part of the pattern is skipped because DEFINE acts like a false
- condition. The rest of the pattern uses references to the named group
- to match the four dot-separated components of an IPv4 address, insist-
+ The first part of the pattern is a DEFINE group inside which a another
+ group named "byte" is defined. This matches an individual component of
+ an IPv4 address (a number less than 256). When matching takes place,
+ this part of the pattern is skipped because DEFINE acts like a false
+ condition. The rest of the pattern uses references to the named group
+ to match the four dot-separated components of an IPv4 address, insist-
ing on a word boundary at each end.
Assertion conditions
- If the condition is not in any of the above formats, it must be an
- assertion. This may be a positive or negative lookahead or lookbehind
- assertion. Consider this pattern, again containing non-significant
+ If the condition is not in any of the above formats, it must be an
+ assertion. This may be a positive or negative lookahead or lookbehind
+ assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
- The condition is a positive lookahead assertion that matches an
- optional sequence of non-letters followed by a letter. In other words,
- it tests for the presence of at least one letter in the subject. If a
- letter is found, the subject is matched against the first alternative;
- otherwise it is matched against the second. This pattern matches
- strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
+ The condition is a positive lookahead assertion that matches an
+ optional sequence of non-letters followed by a letter. In other words,
+ it tests for the presence of at least one letter in the subject. If a
+ letter is found, the subject is matched against the first alternative;
+ otherwise it is matched against the second. This pattern matches
+ strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
@@ -5279,41 +5329,41 @@ COMMENTS
There are two ways of including comments in patterns that are processed
by PCRE. In both cases, the start of the comment must not be in a char-
acter class, nor in the middle of any other sequence of related charac-
- ters such as (?: or a subpattern name or number. The characters that
+ ters such as (?: or a subpattern name or number. The characters that
make up a comment play no part in the pattern matching.
- The sequence (?# marks the start of a comment that continues up to the
- next closing parenthesis. Nested parentheses are not permitted. If the
+ The sequence (?# marks the start of a comment that continues up to the
+ next closing parenthesis. Nested parentheses are not permitted. If the
PCRE_EXTENDED option is set, an unescaped # character also introduces a
- comment, which in this case continues to immediately after the next
- newline character or character sequence in the pattern. Which charac-
+ comment, which in this case continues to immediately after the next
+ newline character or character sequence in the pattern. Which charac-
ters are interpreted as newlines is controlled by the options passed to
pcre_compile() or by a special sequence at the start of the pattern, as
- described in the section entitled "Newline conventions" above. Note
- that the end of this type of comment is a literal newline sequence in
+ described in the section entitled "Newline conventions" above. Note
+ that the end of this type of comment is a literal newline sequence in
the pattern; escape sequences that happen to represent a newline do not
- count. For example, consider this pattern when PCRE_EXTENDED is set,
+ count. For example, consider this pattern when PCRE_EXTENDED is set,
and the default newline convention is in force:
abc #comment \n still comment
- On encountering the # character, pcre_compile() skips along, looking
- for a newline in the pattern. The sequence \n is still literal at this
- stage, so it does not terminate the comment. Only an actual character
+ On encountering the # character, pcre_compile() skips along, looking
+ for a newline in the pattern. The sequence \n is still literal at this
+ stage, so it does not terminate the comment. Only an actual character
with the code value 0x0a (the default newline) does so.
RECURSIVE PATTERNS
- Consider the problem of matching a string in parentheses, allowing for
- unlimited nested parentheses. Without the use of recursion, the best
- that can be done is to use a pattern that matches up to some fixed
- depth of nesting. It is not possible to handle an arbitrary nesting
+ Consider the problem of matching a string in parentheses, allowing for
+ unlimited nested parentheses. Without the use of recursion, the best
+ that can be done is to use a pattern that matches up to some fixed
+ depth of nesting. It is not possible to handle an arbitrary nesting
depth.
For some time, Perl has provided a facility that allows regular expres-
- sions to recurse (amongst other things). It does this by interpolating
- Perl code in the expression at run time, and the code can refer to the
+ sions to recurse (amongst other things). It does this by interpolating
+ Perl code in the expression at run time, and the code can refer to the
expression itself. A Perl pattern using code interpolation to solve the
parentheses problem can be created like this:
@@ -5323,201 +5373,201 @@ RECURSIVE PATTERNS
refers recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code. Instead,
- it supports special syntax for recursion of the entire pattern, and
- also for individual subpattern recursion. After its introduction in
- PCRE and Python, this kind of recursion was subsequently introduced
+ it supports special syntax for recursion of the entire pattern, and
+ also for individual subpattern recursion. After its introduction in
+ PCRE and Python, this kind of recursion was subsequently introduced
into Perl at release 5.10.
- A special item that consists of (? followed by a number greater than
- zero and a closing parenthesis is a recursive subroutine call of the
- subpattern of the given number, provided that it occurs inside that
- subpattern. (If not, it is a non-recursive subroutine call, which is
- described in the next section.) The special item (?R) or (?0) is a
+ A special item that consists of (? followed by a number greater than
+ zero and a closing parenthesis is a recursive subroutine call of the
+ subpattern of the given number, provided that it occurs inside that
+ subpattern. (If not, it is a non-recursive subroutine call, which is
+ described in the next section.) The special item (?R) or (?0) is a
recursive call of the entire regular expression.
- This PCRE pattern solves the nested parentheses problem (assume the
+ This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
\( ( [^()]++ | (?R) )* \)
- First it matches an opening parenthesis. Then it matches any number of
- substrings which can either be a sequence of non-parentheses, or a
- recursive match of the pattern itself (that is, a correctly parenthe-
+ First it matches an opening parenthesis. Then it matches any number of
+ substrings which can either be a sequence of non-parentheses, or a
+ recursive match of the pattern itself (that is, a correctly parenthe-
sized substring). Finally there is a closing parenthesis. Note the use
of a possessive quantifier to avoid backtracking into sequences of non-
parentheses.
- If this were part of a larger pattern, you would not want to recurse
+ If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:
( \( ( [^()]++ | (?1) )* \) )
- We have put the pattern into parentheses, and caused the recursion to
+ We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.
- In a larger pattern, keeping track of parenthesis numbers can be
- tricky. This is made easier by the use of relative references. Instead
+ In a larger pattern, keeping track of parenthesis numbers can be
+ tricky. This is made easier by the use of relative references. Instead
of (?1) in the pattern above you can write (?-2) to refer to the second
- most recently opened parentheses preceding the recursion. In other
- words, a negative number counts capturing parentheses leftwards from
+ most recently opened parentheses preceding the recursion. In other
+ words, a negative number counts capturing parentheses leftwards from
the point at which it is encountered.
- It is also possible to refer to subsequently opened parentheses, by
- writing references such as (?+2). However, these cannot be recursive
- because the reference is not inside the parentheses that are refer-
- enced. They are always non-recursive subroutine calls, as described in
+ It is also possible to refer to subsequently opened parentheses, by
+ writing references such as (?+2). However, these cannot be recursive
+ because the reference is not inside the parentheses that are refer-
+ enced. They are always non-recursive subroutine calls, as described in
the next section.
- An alternative approach is to use named parentheses instead. The Perl
- syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
+ An alternative approach is to use named parentheses instead. The Perl
+ syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
- If there is more than one subpattern with the same name, the earliest
+ If there is more than one subpattern with the same name, the earliest
one is used.
- This particular example pattern that we have been looking at contains
+ This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of a possessive quantifier for
matching strings of non-parentheses is important when applying the pat-
- tern to strings that do not match. For example, when this pattern is
+ tern to strings that do not match. For example, when this pattern is
applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
- it yields "no match" quickly. However, if a possessive quantifier is
- not used, the match runs for a very long time indeed because there are
- so many different ways the + and * repeats can carve up the subject,
+ it yields "no match" quickly. However, if a possessive quantifier is
+ not used, the match runs for a very long time indeed because there are
+ so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported.
- At the end of a match, the values of capturing parentheses are those
- from the outermost level. If you want to obtain intermediate values, a
- callout function can be used (see below and the pcrecallout documenta-
+ At the end of a match, the values of capturing parentheses are those
+ from the outermost level. If you want to obtain intermediate values, a
+ callout function can be used (see below and the pcrecallout documenta-
tion). If the pattern above is matched against
(ab(cd)ef)
- the value for the inner capturing parentheses (numbered 2) is "ef",
- which is the last value taken on at the top level. If a capturing sub-
- pattern is not matched at the top level, its final captured value is
- unset, even if it was (temporarily) set at a deeper level during the
+ the value for the inner capturing parentheses (numbered 2) is "ef",
+ which is the last value taken on at the top level. If a capturing sub-
+ pattern is not matched at the top level, its final captured value is
+ unset, even if it was (temporarily) set at a deeper level during the
matching process.
- If there are more than 15 capturing parentheses in a pattern, PCRE has
- to obtain extra memory to store data during a recursion, which it does
+ If there are more than 15 capturing parentheses in a pattern, PCRE has
+ to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
- Do not confuse the (?R) item with the condition (R), which tests for
- recursion. Consider this pattern, which matches text in angle brack-
- ets, allowing for arbitrary nesting. Only digits are allowed in nested
- brackets (that is, when recursing), whereas any characters are permit-
+ Do not confuse the (?R) item with the condition (R), which tests for
+ recursion. Consider this pattern, which matches text in angle brack-
+ ets, allowing for arbitrary nesting. Only digits are allowed in nested
+ brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
- In this pattern, (?(R) is the start of a conditional subpattern, with
- two different alternatives for the recursive and non-recursive cases.
+ In this pattern, (?(R) is the start of a conditional subpattern, with
+ two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
Differences in recursion processing between PCRE and Perl
- Recursion processing in PCRE differs from Perl in two important ways.
- In PCRE (like Python, but unlike Perl), a recursive subpattern call is
+ Recursion processing in PCRE differs from Perl in two important ways.
+ In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
- alternatives and there is a subsequent matching failure. This can be
- illustrated by the following pattern, which purports to match a palin-
- dromic string that contains an odd number of characters (for example,
+ alternatives and there is a subsequent matching failure. This can be
+ illustrated by the following pattern, which purports to match a palin-
+ dromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"):
^(.|(.)(?1)\2)$
The idea is that it either matches a single character, or two identical
- characters surrounding a sub-palindrome. In Perl, this pattern works;
- in PCRE it does not if the pattern is longer than three characters.
+ characters surrounding a sub-palindrome. In Perl, this pattern works;
+ in PCRE it does not if the pattern is longer than three characters.
Consider the subject string "abcba":
- At the top level, the first character is matched, but as it is not at
+ At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alterna-
tive is taken and the recursion kicks in. The recursive call to subpat-
- tern 1 successfully matches the next character ("b"). (Note that the
+ tern 1 successfully matches the next character ("b"). (Note that the
beginning and end of line tests are not part of the recursion).
- Back at the top level, the next character ("c") is compared with what
- subpattern 2 matched, which was "a". This fails. Because the recursion
- is treated as an atomic group, there are now no backtracking points,
- and so the entire match fails. (Perl is able, at this point, to re-
- enter the recursion and try the second alternative.) However, if the
+ Back at the top level, the next character ("c") is compared with what
+ subpattern 2 matched, which was "a". This fails. Because the recursion
+ is treated as an atomic group, there are now no backtracking points,
+ and so the entire match fails. (Perl is able, at this point, to re-
+ enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:
^((.)(?1)\2|.)$
- This time, the recursing alternative is tried first, and continues to
- recurse until it runs out of characters, at which point the recursion
- fails. But this time we do have another alternative to try at the
- higher level. That is the big difference: in the previous case the
+ This time, the recursing alternative is tried first, and continues to
+ recurse until it runs out of characters, at which point the recursion
+ fails. But this time we do have another alternative to try at the
+ higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot
use.
- To change the pattern so that it matches all palindromic strings, not
- just those with an odd number of characters, it is tempting to change
+ To change the pattern so that it matches all palindromic strings, not
+ just those with an odd number of characters, it is tempting to change
the pattern to this:
^((.)(?1)\2|.?)$
- Again, this works in Perl, but not in PCRE, and for the same reason.
- When a deeper recursion has matched a single character, it cannot be
- entered again in order to match an empty string. The solution is to
- separate the two cases, and write out the odd and even cases as alter-
+ Again, this works in Perl, but not in PCRE, and for the same reason.
+ When a deeper recursion has matched a single character, it cannot be
+ entered again in order to match an empty string. The solution is to
+ separate the two cases, and write out the odd and even cases as alter-
natives at the higher level:
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
- If you want to match typical palindromic phrases, the pattern has to
+ If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
- Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
- ing into sequences of non-word characters. Without this, PCRE takes a
- great deal longer (ten times or more) to match typical phrases, and
+ Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
+ ing into sequences of non-word characters. Without this, PCRE takes a
+ great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
- WARNING: The palindrome-matching patterns above work only if the sub-
- ject string does not start with a palindrome that is shorter than the
- entire string. For example, although "abcba" is correctly matched, if
- the subject is "ababa", PCRE finds the palindrome "aba" at the start,
- then fails at top level because the end of the string does not follow.
- Once again, it cannot jump back into the recursion to try other alter-
+ WARNING: The palindrome-matching patterns above work only if the sub-
+ ject string does not start with a palindrome that is shorter than the
+ entire string. For example, although "abcba" is correctly matched, if
+ the subject is "ababa", PCRE finds the palindrome "aba" at the start,
+ then fails at top level because the end of the string does not follow.
+ Once again, it cannot jump back into the recursion to try other alter-
natives, so the entire match fails.
- The second way in which PCRE and Perl differ in their recursion pro-
- cessing is in the handling of captured values. In Perl, when a subpat-
- tern is called recursively or as a subpattern (see the next section),
- it has no access to any values that were captured outside the recur-
- sion, whereas in PCRE these values can be referenced. Consider this
+ The second way in which PCRE and Perl differ in their recursion pro-
+ cessing is in the handling of captured values. In Perl, when a subpat-
+ tern is called recursively or as a subpattern (see the next section),
+ it has no access to any values that were captured outside the recur-
+ sion, whereas in PCRE these values can be referenced. Consider this
pattern:
^(.)(\1|a(?2))
- In PCRE, this pattern matches "bab". The first capturing parentheses
- match "b", then in the second group, when the back reference \1 fails
- to match "b", the second alternative matches "a" and then recurses. In
- the recursion, \1 does now match "b" and so the whole match succeeds.
- In Perl, the pattern fails to match because inside the recursive call
+ In PCRE, this pattern matches "bab". The first capturing parentheses
+ match "b", then in the second group, when the back reference \1 fails
+ to match "b", the second alternative matches "a" and then recurses. In
+ the recursion, \1 does now match "b" and so the whole match succeeds.
+ In Perl, the pattern fails to match because inside the recursive call
\1 cannot access the externally set value.
SUBPATTERNS AS SUBROUTINES
- If the syntax for a recursive subpattern call (either by number or by
- name) is used outside the parentheses to which it refers, it operates
- like a subroutine in a programming language. The called subpattern may
- be defined before or after the reference. A numbered reference can be
+ If the syntax for a recursive subpattern call (either by number or by
+ name) is used outside the parentheses to which it refers, it operates
+ like a subroutine in a programming language. The called subpattern may
+ be defined before or after the reference. A numbered reference can be
absolute or relative, as in these examples:
(...(absolute)...)...(?2)...
@@ -5528,108 +5578,109 @@ SUBPATTERNS AS SUBROUTINES
(sens|respons)e and \1ibility
- matches "sense and sensibility" and "response and responsibility", but
+ matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
- is used, it does match "sense and responsibility" as well as the other
- two strings. Another example is given in the discussion of DEFINE
+ is used, it does match "sense and responsibility" as well as the other
+ two strings. Another example is given in the discussion of DEFINE
above.
- All subroutine calls, whether recursive or not, are always treated as
- atomic groups. That is, once a subroutine has matched some of the sub-
+ All subroutine calls, whether recursive or not, are always treated as
+ atomic groups. That is, once a subroutine has matched some of the sub-
ject string, it is never re-entered, even if it contains untried alter-
- natives and there is a subsequent matching failure. Any capturing
- parentheses that are set during the subroutine call revert to their
+ natives and there is a subsequent matching failure. Any capturing
+ parentheses that are set during the subroutine call revert to their
previous values afterwards.
- Processing options such as case-independence are fixed when a subpat-
- tern is defined, so if it is used as a subroutine, such options cannot
+ Processing options such as case-independence are fixed when a subpat-
+ tern is defined, so if it is used as a subroutine, such options cannot
be changed for different calls. For example, consider this pattern:
(abc)(?i:(?-1))
- It matches "abcabc". It does not match "abcABC" because the change of
+ It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
ONIGURUMA SUBROUTINE SYNTAX
- For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
- an alternative syntax for referencing a subpattern as a subroutine,
- possibly recursively. Here are two of the examples used above, rewrit-
+ an alternative syntax for referencing a subpattern as a subroutine,
+ possibly recursively. Here are two of the examples used above, rewrit-
ten using this syntax:
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility
- PCRE supports an extension to Oniguruma: if a number is preceded by a
+ PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\g<-1>)
- Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
- synonymous. The former is a back reference; the latter is a subroutine
+ Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
+ synonymous. The former is a back reference; the latter is a subroutine
call.
CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
- Perl code to be obeyed in the middle of matching a regular expression.
+ Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different sub-
strings that match the same pair of parentheses when there is a repeti-
tion.
PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
- an external function by putting its entry point in the global variable
- pcre_callout. By default, this variable contains NULL, which disables
+ an external function by putting its entry point in the global variable
+ pcre_callout. By default, this variable contains NULL, which disables
all calling out.
- Within a regular expression, (?C) indicates the points at which the
- external function is to be called. If you want to identify different
- callout points, you can put a number less than 256 after the letter C.
- The default value is zero. For example, this pattern has two callout
+ Within a regular expression, (?C) indicates the points at which the
+ external function is to be called. If you want to identify different
+ callout points, you can put a number less than 256 after the letter C.
+ The default value is zero. For example, this pattern has two callout
points:
(?C1)abc(?C2)def
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
- automatically installed before each item in the pattern. They are all
+ automatically installed before each item in the pattern. They are all
numbered 255.
During matching, when PCRE reaches a callout point (and pcre_callout is
- set), the external function is called. It is provided with the number
- of the callout, the position in the pattern, and, optionally, one item
- of data originally supplied by the caller of pcre_exec(). The callout
- function may cause matching to proceed, to backtrack, or to fail alto-
+ set), the external function is called. It is provided with the number
+ of the callout, the position in the pattern, and, optionally, one item
+ of data originally supplied by the caller of pcre_exec(). The callout
+ function may cause matching to proceed, to backtrack, or to fail alto-
gether. A complete description of the interface to the callout function
is given in the pcrecallout documentation.
BACKTRACKING CONTROL
- Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
+ Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are described in the Perl documentation as "experimental and sub-
- ject to change or removal in a future version of Perl". It goes on to
- say: "Their usage in production code should be noted to avoid problems
+ ject to change or removal in a future version of Perl". It goes on to
+ say: "Their usage in production code should be noted to avoid problems
during upgrades." The same remarks apply to the PCRE features described
in this section.
- Since these verbs are specifically related to backtracking, most of
- them can be used only when the pattern is to be matched using
+ Since these verbs are specifically related to backtracking, most of
+ them can be used only when the pattern is to be matched using
pcre_exec(), which uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, they cause an
error if encountered by pcre_dfa_exec().
- If any of these verbs are used in an assertion or in a subpattern that
+ If any of these verbs are used in an assertion or in a subpattern that
is called as a subroutine (whether or not recursively), their effect is
confined to that subpattern; it does not extend to the surrounding pat-
- tern, with one exception: a *MARK that is encountered in a positive
- assertion is passed back (compare capturing parentheses in assertions).
+ tern, with one exception: the name from a *(MARK), (*PRUNE), or (*THEN)
+ that is encountered in a successful positive assertion is passed back
+ when a match succeeds (compare capturing parentheses in assertions).
Note that such subpatterns are processed as anchored at the point where
they are tested. Note also that Perl's treatment of subroutines is dif-
ferent in some cases.
@@ -5652,59 +5703,61 @@ BACKTRACKING CONTROL
by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com-
pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
+ Experiments with Perl suggest that it too has similar optimizations,
+ sometimes leading to anomalous results.
+
Verbs that act immediately
- The following verbs act as soon as they are encountered. They may not
+ The following verbs act as soon as they are encountered. They may not
be followed by a name.
(*ACCEPT)
- This verb causes the match to end successfully, skipping the remainder
- of the pattern. However, when it is inside a subpattern that is called
- as a subroutine, only that subpattern is ended successfully. Matching
- then continues at the outer level. If (*ACCEPT) is inside capturing
+ This verb causes the match to end successfully, skipping the remainder
+ of the pattern. However, when it is inside a subpattern that is called
+ as a subroutine, only that subpattern is ended successfully. Matching
+ then continues at the outer level. If (*ACCEPT) is inside capturing
parentheses, the data so far is captured. For example:
A((?:A|B(*ACCEPT)|C)D)
- This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
+ This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
tured by the outer parentheses.
(*FAIL) or (*F)
- This verb causes a matching failure, forcing backtracking to occur. It
- is equivalent to (?!) but easier to read. The Perl documentation notes
- that it is probably useful only when combined with (?{}) or (??{}).
- Those are, of course, Perl features that are not present in PCRE. The
- nearest equivalent is the callout feature, as for example in this pat-
+ This verb causes a matching failure, forcing backtracking to occur. It
+ is equivalent to (?!) but easier to read. The Perl documentation notes
+ that it is probably useful only when combined with (?{}) or (??{}).
+ Those are, of course, Perl features that are not present in PCRE. The
+ nearest equivalent is the callout feature, as for example in this pat-
tern:
a+(?C)(*FAIL)
- A match with the string "aaaa" always fails, but the callout is taken
+ A match with the string "aaaa" always fails, but the callout is taken
before each backtrack happens (in this example, 10 times).
Recording which path was taken
- There is one verb whose main purpose is to track how a match was
- arrived at, though it also has a secondary use in conjunction with
+ There is one verb whose main purpose is to track how a match was
+ arrived at, though it also has a secondary use in conjunction with
advancing the match starting point (see (*SKIP) below).
(*MARK:NAME) or (*:NAME)
- A name is always required with this verb. There may be as many
- instances of (*MARK) as you like in a pattern, and their names do not
+ A name is always required with this verb. There may be as many
+ instances of (*MARK) as you like in a pattern, and their names do not
have to be unique.
- When a match succeeds, the name of the last-encountered (*MARK) is
- passed back to the caller via the pcre_extra data structure, as
- described in the section on pcre_extra in the pcreapi documentation. No
- data is returned for a partial match. Here is an example of pcretest
- output, where the /K modifier requests the retrieval and outputting of
- (*MARK) data:
+ When a match succeeds, the name of the last-encountered (*MARK) on the
+ matching path is passed back to the caller via the pcre_extra data
+ structure, as described in the section on pcre_extra in the pcreapi
+ documentation. Here is an example of pcretest output, where the /K mod-
+ ifier requests the retrieval and outputting of (*MARK) data:
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XY
+ re> /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data> XY
0: XY
MK: A
XZ
@@ -5720,98 +5773,78 @@ BACKTRACKING CONTROL
and passed back if it is the last-encountered. This does not happen for
negative assertions.
- A name may also be returned after a failed match if the final path
- through the pattern involves (*MARK). However, unless (*MARK) used in
- conjunction with (*COMMIT), this is unlikely to happen for an unan-
- chored pattern because, as the starting point for matching is advanced,
- the final check is often with an empty string, causing a failure before
- (*MARK) is reached. For example:
-
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XP
- No match
-
- There are three potential starting points for this match (starting with
- X, starting with P, and with an empty string). If the pattern is
- anchored, the result is different:
+ After a partial match or a failed match, the name of the last encoun-
+ tered (*MARK) in the entire match process is returned. For example:
- /^X(*MARK:A)Y|^X(*MARK:B)Z/K
- XP
+ re> /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data> XP
No match, mark = B
- PCRE's start-of-match optimizations can also interfere with this. For
- example, if, as a result of a call to pcre_study(), it knows the mini-
- mum subject length for a match, a shorter subject will not be scanned
- at all.
-
- Note that similar anomalies (though different in detail) exist in Perl,
- no doubt for the same reasons. The use of (*MARK) data after a failed
- match of an unanchored pattern is not recommended, unless (*COMMIT) is
- involved.
+ Note that in this unanchored example the mark is retained from the
+ match attempt that started at the letter "X". Subsequent match attempts
+ starting at "P" and then with an empty string do not get as far as the
+ (*MARK) item, but nevertheless do not reset it.
Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching con-
- tinues with what follows, but if there is no subsequent match, causing
- a backtrack to the verb, a failure is forced. That is, backtracking
- cannot pass to the left of the verb. However, when one of these verbs
- appears inside an atomic group, its effect is confined to that group,
- because once the group has been matched, there is never any backtrack-
- ing into it. In this situation, backtracking can "jump back" to the
- left of the entire atomic group. (Remember also, as stated above, that
+ tinues with what follows, but if there is no subsequent match, causing
+ a backtrack to the verb, a failure is forced. That is, backtracking
+ cannot pass to the left of the verb. However, when one of these verbs
+ appears inside an atomic group, its effect is confined to that group,
+ because once the group has been matched, there is never any backtrack-
+ ing into it. In this situation, backtracking can "jump back" to the
+ left of the entire atomic group. (Remember also, as stated above, that
this localization also applies in subroutine calls and assertions.)
- These verbs differ in exactly what kind of failure occurs when back-
+ These verbs differ in exactly what kind of failure occurs when back-
tracking reaches them.
(*COMMIT)
- This verb, which may not be followed by a name, causes the whole match
+ This verb, which may not be followed by a name, causes the whole match
to fail outright if the rest of the pattern does not match. Even if the
pattern is unanchored, no further attempts to find a match by advancing
the starting point take place. Once (*COMMIT) has been passed,
- pcre_exec() is committed to finding a match at the current starting
+ pcre_exec() is committed to finding a match at the current starting
point, or not at all. For example:
a+(*COMMIT)b
- This matches "xxaab" but not "aacaab". It can be thought of as a kind
+ This matches "xxaab" but not "aacaab". It can be thought of as a kind
of dynamic anchor, or "I've started, so I must finish." The name of the
- most recently passed (*MARK) in the path is passed back when (*COMMIT)
+ most recently passed (*MARK) in the path is passed back when (*COMMIT)
forces a match failure.
- Note that (*COMMIT) at the start of a pattern is not the same as an
- anchor, unless PCRE's start-of-match optimizations are turned off, as
+ Note that (*COMMIT) at the start of a pattern is not the same as an
+ anchor, unless PCRE's start-of-match optimizations are turned off, as
shown in this pcretest example:
- /(*COMMIT)abc/
- xyzabc
+ re> /(*COMMIT)abc/
+ data> xyzabc
0: abc
xyzabc\Y
No match
- PCRE knows that any match must start with "a", so the optimization
- skips along the subject to "a" before running the first match attempt,
- which succeeds. When the optimization is disabled by the \Y escape in
+ PCRE knows that any match must start with "a", so the optimization
+ skips along the subject to "a" before running the first match attempt,
+ which succeeds. When the optimization is disabled by the \Y escape in
the second subject, the match starts at "x" and so the (*COMMIT) causes
it to fail without trying any other starting points.
(*PRUNE) or (*PRUNE:NAME)
- This verb causes the match to fail at the current starting position in
- the subject if the rest of the pattern does not match. If the pattern
- is unanchored, the normal "bumpalong" advance to the next starting
- character then happens. Backtracking can occur as usual to the left of
- (*PRUNE), before it is reached, or when matching to the right of
- (*PRUNE), but if there is no match to the right, backtracking cannot
- cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
- native to an atomic group or possessive quantifier, but there are some
+ This verb causes the match to fail at the current starting position in
+ the subject if the rest of the pattern does not match. If the pattern
+ is unanchored, the normal "bumpalong" advance to the next starting
+ character then happens. Backtracking can occur as usual to the left of
+ (*PRUNE), before it is reached, or when matching to the right of
+ (*PRUNE), but if there is no match to the right, backtracking cannot
+ cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
+ native to an atomic group or possessive quantifier, but there are some
uses of (*PRUNE) that cannot be expressed in any other way. The behav-
- iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
- match fails completely; the name is passed back if this is the final
- attempt. (*PRUNE:NAME) does not pass back a name if the match suc-
- ceeds. In an anchored pattern (*PRUNE) has the same effect as (*COM-
- MIT).
+ iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
+ anchored pattern (*PRUNE) has the same effect as (*COMMIT).
(*SKIP)
@@ -5838,67 +5871,66 @@ BACKTRACKING CONTROL
is searched for the most recent (*MARK) that has the same name. If one
is found, the "bumpalong" advance is to the subject position that cor-
responds to that (*MARK) instead of to where (*SKIP) was encountered.
- If no (*MARK) with a matching name is found, normal "bumpalong" of one
- character happens (that is, the (*SKIP) is ignored).
+ If no (*MARK) with a matching name is found, the (*SKIP) is ignored.
(*THEN) or (*THEN:NAME)
- This verb causes a skip to the next innermost alternative if the rest
- of the pattern does not match. That is, it cancels pending backtrack-
- ing, but only within the current alternative. Its name comes from the
+ This verb causes a skip to the next innermost alternative if the rest
+ of the pattern does not match. That is, it cancels pending backtrack-
+ ing, but only within the current alternative. Its name comes from the
observation that it can be used for a pattern-based if-then-else block:
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
- If the COND1 pattern matches, FOO is tried (and possibly further items
- after the end of the group if FOO succeeds); on failure, the matcher
- skips to the second alternative and tries COND2, without backtracking
- into COND1. The behaviour of (*THEN:NAME) is exactly the same as
- (*MARK:NAME)(*THEN) if the overall match fails. If (*THEN) is not
- inside an alternation, it acts like (*PRUNE).
-
- Note that a subpattern that does not contain a | character is just a
- part of the enclosing alternative; it is not a nested alternation with
- only one alternative. The effect of (*THEN) extends beyond such a sub-
- pattern to the enclosing alternative. Consider this pattern, where A,
+ If the COND1 pattern matches, FOO is tried (and possibly further items
+ after the end of the group if FOO succeeds); on failure, the matcher
+ skips to the second alternative and tries COND2, without backtracking
+ into COND1. The behaviour of (*THEN:NAME) is exactly the same as
+ (*MARK:NAME)(*THEN). If (*THEN) is not inside an alternation, it acts
+ like (*PRUNE).
+
+ Note that a subpattern that does not contain a | character is just a
+ part of the enclosing alternative; it is not a nested alternation with
+ only one alternative. The effect of (*THEN) extends beyond such a sub-
+ pattern to the enclosing alternative. Consider this pattern, where A,
B, etc. are complex pattern fragments that do not contain any | charac-
ters at this level:
A (B(*THEN)C) | D
- If A and B are matched, but there is a failure in C, matching does not
+ If A and B are matched, but there is a failure in C, matching does not
backtrack into A; instead it moves to the next alternative, that is, D.
- However, if the subpattern containing (*THEN) is given an alternative,
+ However, if the subpattern containing (*THEN) is given an alternative,
it behaves differently:
A (B(*THEN)C | (*FAIL)) | D
- The effect of (*THEN) is now confined to the inner subpattern. After a
+ The effect of (*THEN) is now confined to the inner subpattern. After a
failure in C, matching moves to (*FAIL), which causes the whole subpat-
- tern to fail because there are no more alternatives to try. In this
+ tern to fail because there are no more alternatives to try. In this
case, matching does now backtrack into A.
Note also that a conditional subpattern is not considered as having two
- alternatives, because only one is ever used. In other words, the |
+ alternatives, because only one is ever used. In other words, the |
character in a conditional subpattern has a different meaning. Ignoring
white space, consider:
^.*? (?(?=a) a | b(*THEN)c )
- If the subject is "ba", this pattern does not match. Because .*? is
- ungreedy, it initially matches zero characters. The condition (?=a)
- then fails, the character "b" is matched, but "c" is not. At this
- point, matching does not backtrack to .*? as might perhaps be expected
- from the presence of the | character. The conditional subpattern is
+ If the subject is "ba", this pattern does not match. Because .*? is
+ ungreedy, it initially matches zero characters. The condition (?=a)
+ then fails, the character "b" is matched, but "c" is not. At this
+ point, matching does not backtrack to .*? as might perhaps be expected
+ from the presence of the | character. The conditional subpattern is
part of the single alternative that comprises the whole pattern, and so
- the match fails. (If there was a backtrack into .*?, allowing it to
+ the match fails. (If there was a backtrack into .*?, allowing it to
match "b", the match would succeed.)
- The verbs just described provide four different "strengths" of control
+ The verbs just described provide four different "strengths" of control
when subsequent matching fails. (*THEN) is the weakest, carrying on the
- match at the next alternative. (*PRUNE) comes next, failing the match
- at the current starting position, but allowing an advance to the next
- character (for an unanchored pattern). (*SKIP) is similar, except that
+ match at the next alternative. (*PRUNE) comes next, failing the match
+ at the current starting position, but allowing an advance to the next
+ character (for an unanchored pattern). (*SKIP) is similar, except that
the advance may be more than one character. (*COMMIT) is the strongest,
causing the entire match to fail.
@@ -5908,8 +5940,8 @@ BACKTRACKING CONTROL
(A(*COMMIT)B(*THEN)C|D)
- Once A has matched, PCRE is committed to this match, at the current
- starting position. If subsequently B matches, but C does not, the nor-
+ Once A has matched, PCRE is committed to this match, at the current
+ starting position. If subsequently B matches, but C does not, the nor-
mal (*THEN) action of trying the next alternative (that is, D) does not
happen because (*COMMIT) overrides.
@@ -5928,7 +5960,7 @@ AUTHOR
REVISION
- Last updated: 19 October 2011
+ Last updated: 29 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
@@ -6497,11 +6529,17 @@ AVAILABILITY OF JIT SUPPORT
been fully tested. If --enable-jit is set on an unsupported platform,
compilation fails.
- A program can tell if JIT support is available by calling pcre_config()
- with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available,
- and 0 otherwise. However, a simple program does not need to check this
- in order to use JIT. The API is implemented in a way that falls back to
- the ordinary PCRE code if JIT is not available.
+ A program that is linked with PCRE 8.20 or later can tell if JIT sup-
+ port is available by calling pcre_config() with the PCRE_CONFIG_JIT
+ option. The result is 1 when JIT is available, and 0 otherwise. How-
+ ever, a simple program does not need to check this in order to use JIT.
+ The API is implemented in a way that falls back to the ordinary PCRE
+ code if JIT is not available.
+
+ If your program may sometimes be linked with versions of PCRE that are
+ older than 8.20, but you want to use JIT when it is available, you can
+ test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
+ macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
SIMPLE USE OF JIT
@@ -6517,6 +6555,22 @@ SIMPLE USE OF JIT
no longer needed instead of just freeing it yourself. This
ensures that any JIT data is also freed.
+ For a program that may be linked with pre-8.20 versions of PCRE, you
+ can insert
+
+ #ifndef PCRE_STUDY_JIT_COMPILE
+ #define PCRE_STUDY_JIT_COMPILE 0
+ #endif
+
+ so that no option is passed to pcre_study(), and then use something
+ like this to free the study data:
+
+ #ifdef PCRE_CONFIG_JIT
+ pcre_free_study(study_ptr);
+ #else
+ pcre_free(study_ptr);
+ #endif
+
In some circumstances you may need to call additional functions. These
are described in the section entitled "Controlling the JIT stack"
below.
@@ -6555,12 +6609,8 @@ UNSUPPORTED OPTIONS AND PATTERN ITEMS
The unsupported pattern items are:
- \C match a single byte; not supported in UTF-8 mode
+ \C match a single byte; not supported in UTF-8 mode
(?Cn) callouts
- (?(<name>)... conditional test on setting of a named subpattern
- (?(R)... conditional test on whole pattern recursion
- (?(Rn)... conditional test on recursion, by number
- (?(R&name)... conditional test on recursion, by name
(*COMMIT) )
(*MARK) )
(*PRUNE) ) the backtracking control verbs
@@ -6609,28 +6659,29 @@ CONTROLLING THE JIT STACK
large or complicated patterns need more than this. The error
PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
Three functions are provided for managing blocks of memory for use as
- JIT stacks.
-
- The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
- are a starting size and a maximum size, and it returns a pointer to an
- opaque structure of type pcre_jit_stack, or NULL if there is an error.
- The pcre_jit_stack_free() function can be used to free a stack that is
- no longer needed. (For the technically minded: the address space is
+ JIT stacks. There is further discussion about the use of JIT stacks in
+ the section entitled "JIT stack FAQ" below.
+
+ The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
+ are a starting size and a maximum size, and it returns a pointer to an
+ opaque structure of type pcre_jit_stack, or NULL if there is an error.
+ The pcre_jit_stack_free() function can be used to free a stack that is
+ no longer needed. (For the technically minded: the address space is
allocated by mmap or VirtualAlloc.)
- JIT uses far less memory for recursion than the interpretive code, and
- a maximum stack size of 512K to 1M should be more than enough for any
+ JIT uses far less memory for recursion than the interpretive code, and
+ a maximum stack size of 512K to 1M should be more than enough for any
pattern.
- The pcre_assign_jit_stack() function specifies which stack JIT code
+ The pcre_assign_jit_stack() function specifies which stack JIT code
should use. Its arguments are as follows:
pcre_extra *extra
pcre_jit_callback callback
void *data
- The extra argument must be the result of studying a pattern with
- PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the
+ The extra argument must be the result of studying a pattern with
+ PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the
other two options:
(1) If callback is NULL and data is NULL, an internal 32K block
@@ -6645,18 +6696,18 @@ CONTROLLING THE JIT STACK
is used; otherwise the return value must be a valid JIT stack,
the result of calling pcre_jit_stack_alloc().
- You may safely assign the same JIT stack to more than one pattern, as
+ You may safely assign the same JIT stack to more than one pattern, as
long as they are all matched sequentially in the same thread. In a mul-
tithread application, each thread must use its own JIT stack.
- Strictly speaking, even more is allowed. You can assign the same stack
- to any number of patterns as long as they are not used for matching by
+ Strictly speaking, even more is allowed. You can assign the same stack
+ to any number of patterns as long as they are not used for matching by
multiple threads at the same time. For example, you can assign the same
- stack to all compiled patterns, and use a global mutex in the callback
+ stack to all compiled patterns, and use a global mutex in the callback
to wait until the stack is available for use. However, this is an inef-
ficient solution, and not recommended.
- This is a suggestion for how a typical multithreaded program might
+ This is a suggestion for how a typical multithreaded program might
operate:
During thread initalization
@@ -6668,12 +6719,80 @@ CONTROLLING THE JIT STACK
Use a one-line callback function
return thread_local_var
- All the functions described in this section do nothing if JIT is not
- available, and pcre_assign_jit_stack() does nothing unless the extra
- argument is non-NULL and points to a pcre_extra block that is the
+ All the functions described in this section do nothing if JIT is not
+ available, and pcre_assign_jit_stack() does nothing unless the extra
+ argument is non-NULL and points to a pcre_extra block that is the
result of a successful study with PCRE_STUDY_JIT_COMPILE.
+JIT STACK FAQ
+
+ (1) Why do we need JIT stacks?
+
+ PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack
+ where the local data of the current node is pushed before checking its
+ child nodes. Allocating real machine stack on some platforms is diffi-
+ cult. For example, the stack chain needs to be updated every time if we
+ extend the stack on PowerPC. Although it is possible, its updating
+ time overhead decreases performance. So we do the recursion in memory.
+
+ (2) Why don't we simply allocate blocks of memory with malloc()?
+
+ Modern operating systems have a nice feature: they can reserve an
+ address space instead of allocating memory. We can safely allocate mem-
+ ory pages inside this address space, so the stack could grow without
+ moving memory data (this is important because of pointers). Thus we can
+ allocate 1M address space, and use only a single memory page (usually
+ 4K) if that is enough. However, we can still grow up to 1M anytime if
+ needed.
+
+ (3) Who "owns" a JIT stack?
+
+ The owner of the stack is the user program, not the JIT studied pattern
+ or anything else. The user program must ensure that if a stack is used
+ by pcre_exec(), (that is, it is assigned to the pattern currently run-
+ ning), that stack must not be used by any other threads (to avoid over-
+ writing the same memory area). The best practice for multithreaded pro-
+ grams is to allocate a stack for each thread, and return this stack
+ through the JIT callback function.
+
+ (4) When should a JIT stack be freed?
+
+ You can free a JIT stack at any time, as long as it will not be used by
+ pcre_exec() again. When you assign the stack to a pattern, only a
+ pointer is set. There is no reference counting or any other magic. You
+ can free the patterns and stacks in any order, anytime. Just do not
+ call pcre_exec() with a pattern pointing to an already freed stack, as
+ that will cause SEGFAULT. (Also, do not free a stack currently used by
+ pcre_exec() in another thread). You can also replace the stack for a
+ pattern at any time. You can even free the previous stack before
+ assigning a replacement.
+
+ (5) Should I allocate/free a stack every time before/after calling
+ pcre_exec()?
+
+ No, because this is too costly in terms of resources. However, you
+ could implement some clever idea which release the stack if it is not
+ used in let's say two minutes. The JIT callback can help to achive this
+ without keeping a list of the currently JIT studied patterns.
+
+ (6) OK, the stack is for long term memory allocation. But what happens
+ if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
+ until the stack is freed?
+
+ Especially on embedded sytems, it might be a good idea to release mem-
+ ory sometimes without freeing the stack. There is no API for this at
+ the moment. Probably a function call which returns with the currently
+ allocated memory for any stack and another which allows releasing mem-
+ ory (shrinking the stack) would be a good idea if someone needs this.
+
+ (7) This is too much of a headache. Isn't there any better solution for
+ JIT stack handling?
+
+ No, thanks to Windows. If POSIX threads were used everywhere, we could
+ throw out this complicated API.
+
+
EXAMPLE CODE
This is a single-threaded example that specifies a JIT stack without
@@ -6705,14 +6824,14 @@ SEE ALSO
AUTHOR
- Philip Hazel
+ Philip Hazel (FAQ by Zoltan Herczeg)
University Computing Service
Cambridge CB2 3QH, England.
REVISION
- Last updated: 19 October 2011
+ Last updated: 26 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
@@ -8153,6 +8272,12 @@ SIZE AND OTHER LIMITATIONS
There is no limit to the number of parenthesized subpatterns, but there
can be no more than 65535 capturing subpatterns.
+ There is a limit to the number of forward references to subsequent sub-
+ patterns of around 200,000. Repeated forward references with fixed
+ upper limits, for example, (?2){0,100} when subpattern number 2 is to
+ the right, are included in the count. There is no limit to the number
+ of backward references.
+
The maximum length of name for a named subpattern is 32 characters, and
the maximum number of named subpatterns is 10000.
@@ -8173,7 +8298,7 @@ AUTHOR
REVISION
- Last updated: 24 August 2011
+ Last updated: 30 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index da05d05..a59cd05 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -644,18 +644,18 @@ string (by default this causes the current matching alternative to fail). A
pattern such as (\e1)(a) succeeds when this option is set (assuming it can find
an "a" in the subject), whereas it fails by default, for Perl compatibility.
.P
-(3) \eU matches an upper case "U" character; by default \eU causes a compile
+(3) \eU matches an upper case "U" character; by default \eU causes a compile
time error (Perl uses \eU to upper case subsequent characters).
.P
-(4) \eu matches a lower case "u" character unless it is followed by four
-hexadecimal digits, in which case the hexadecimal number defines the code point
-to match. By default, \eu causes a compile time error (Perl uses it to upper
+(4) \eu matches a lower case "u" character unless it is followed by four
+hexadecimal digits, in which case the hexadecimal number defines the code point
+to match. By default, \eu causes a compile time error (Perl uses it to upper
case the following character).
.P
-(5) \ex matches a lower case "x" character unless it is followed by two
-hexadecimal digits, in which case the hexadecimal number defines the code point
-to match. By default, as in Perl, a hexadecimal number is always expected after
-\ex, but it may have zero, one, or two digits (so, for example, \exz matches a
+(5) \ex matches a lower case "x" character unless it is followed by two
+hexadecimal digits, in which case the hexadecimal number defines the code point
+to match. By default, as in Perl, a hexadecimal number is always expected after
+\ex, but it may have zero, one, or two digits (so, for example, \exz matches a
binary zero character followed by z).
.sp
PCRE_MULTILINE
@@ -1147,6 +1147,12 @@ particular pattern. See the
.\"
documentation for details of what can and cannot be handled.
.sp
+ PCRE_INFO_JITSIZE
+.sp
+If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE option,
+return the size of the JIT compiled code, otherwise return zero. The fourth
+argument should point to a \fBsize_t\fP variable.
+.sp
PCRE_INFO_LASTLITERAL
.sp
Return the value of the rightmost literal byte that must exist in any matched
@@ -1262,10 +1268,13 @@ For such patterns, the PCRE_ANCHORED bit is set in the options returned by
.sp
PCRE_INFO_SIZE
.sp
-Return the size of the compiled pattern, that is, the value that was passed as
-the argument to \fBpcre_malloc()\fP when PCRE was getting memory in which to
-place the compiled data. The fourth argument should point to a \fBsize_t\fP
-variable.
+Return the size of the compiled pattern. The fourth argument should point to a
+\fBsize_t\fP variable. This value does not include the size of the \fBpcre\fP
+structure that is returned by \fBpcre_compile()\fP. The value that is passed as
+the argument to \fBpcre_malloc()\fP when \fBpcre_compile()\fP is getting memory
+in which to place the compiled data is the value returned by this option plus
+the size of the \fBpcre\fP structure. Studying a compiled pattern, with or
+without JIT, does not alter the value returned by this option.
.sp
PCRE_INFO_STUDYSIZE
.sp
@@ -2544,6 +2553,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 14 November 2011
+Last updated: 02 December 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcrecallout.3 b/doc/pcrecallout.3
index d458653..ca14277 100644
--- a/doc/pcrecallout.3
+++ b/doc/pcrecallout.3
@@ -160,9 +160,10 @@ same callout number. However, they are set for all callouts.
.P
The \fImark\fP field is present from version 2 of the \fIpcre_callout\fP
structure. In callouts from \fBpcre_exec()\fP it contains a pointer to the
-zero-terminated name of the most recently passed (*MARK) item in the match, or
-NULL if there are no (*MARK)s in the current matching path. In callouts from
-\fBpcre_dfa_exec()\fP this field always contains NULL.
+zero-terminated name of the most recently passed (*MARK), (*PRUNE), or (*THEN)
+item in the match, or NULL if no such items have been passed. Instances of
+(*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
+callouts from \fBpcre_dfa_exec()\fP this field always contains NULL.
.
.
.SH "RETURN VALUES"
@@ -195,6 +196,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 26 August 2011
+Last updated: 30 November 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcrecompat.3 b/doc/pcrecompat.3
index 69bcf90..64d3dbc 100644
--- a/doc/pcrecompat.3
+++ b/doc/pcrecompat.3
@@ -38,7 +38,7 @@ represent a binary zero.
own, matching a non-newline character, is supported.) In fact these are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE, an error is
-generated by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
+generated by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
\eU and \eu are interpreted as JavaScript interprets them.
.P
6. The Perl escape sequences \ep, \eP, and \eX are supported only if PCRE is
diff --git a/doc/pcrejit.3 b/doc/pcrejit.3
index 391b837..ba68301 100644
--- a/doc/pcrejit.3
+++ b/doc/pcrejit.3
@@ -34,11 +34,16 @@ The Power PC support is designated as experimental because it has not been
fully tested. If --enable-jit is set on an unsupported platform, compilation
fails.
.P
-A program can tell if JIT support is available by calling \fBpcre_config()\fP
-with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available, and 0
-otherwise. However, a simple program does not need to check this in order to
-use JIT. The API is implemented in a way that falls back to the ordinary PCRE
-code if JIT is not available.
+A program that is linked with PCRE 8.20 or later can tell if JIT support is
+available by calling \fBpcre_config()\fP with the PCRE_CONFIG_JIT option. The
+result is 1 when JIT is available, and 0 otherwise. However, a simple program
+does not need to check this in order to use JIT. The API is implemented in a
+way that falls back to the ordinary PCRE code if JIT is not available.
+.P
+If your program may sometimes be linked with versions of PCRE that are older
+than 8.20, but you want to use JIT when it is available, you can test
+the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT macro such
+as PCRE_CONFIG_JIT, for compile-time control of your code.
.
.
.SH "SIMPLE USE OF JIT"
@@ -54,6 +59,21 @@ You have to do two things to make use of the JIT support in the simplest way:
no longer needed instead of just freeing it yourself. This
ensures that any JIT data is also freed.
.sp
+For a program that may be linked with pre-8.20 versions of PCRE, you can insert
+.sp
+ #ifndef PCRE_STUDY_JIT_COMPILE
+ #define PCRE_STUDY_JIT_COMPILE 0
+ #endif
+.sp
+so that no option is passed to \fBpcre_study()\fP, and then use something like
+this to free the study data:
+.sp
+ #ifdef PCRE_CONFIG_JIT
+ pcre_free_study(study_ptr);
+ #else
+ pcre_free(study_ptr);
+ #endif
+.sp
In some circumstances you may need to call additional functions. These are
described in the section entitled
.\" HTML <a href="#stackcontrol">
@@ -95,7 +115,7 @@ supported.
.P
The unsupported pattern items are:
.sp
- \eC match a single byte; not supported in UTF-8 mode
+ \eC match a single byte; not supported in UTF-8 mode
(?Cn) callouts
(*COMMIT) )
(*MARK) )
@@ -153,7 +173,13 @@ When the compiled JIT code runs, it needs a block of memory to use as a stack.
By default, it uses 32K on the machine stack. However, some large or
complicated patterns need more than this. The error PCRE_ERROR_JIT_STACKLIMIT
is given when there is not enough stack. Three functions are provided for
-managing blocks of memory for use as JIT stacks.
+managing blocks of memory for use as JIT stacks. There is further discussion
+about the use of JIT stacks in the section entitled
+.\" HTML <a href="#stackcontrol">
+.\" </a>
+"JIT stack FAQ"
+.\"
+below.
.P
The \fBpcre_jit_stack_alloc()\fP function creates a JIT stack. Its arguments
are a starting size and a maximum size, and it returns a pointer to an opaque
@@ -217,6 +243,74 @@ is non-NULL and points to a \fBpcre_extra\fP block that is the result of a
successful study with PCRE_STUDY_JIT_COMPILE.
.
.
+.\" HTML <a name="stackfaq"></a>
+.SH "JIT STACK FAQ"
+.rs
+.sp
+(1) Why do we need JIT stacks?
+.sp
+PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack where
+the local data of the current node is pushed before checking its child nodes.
+Allocating real machine stack on some platforms is difficult. For example, the
+stack chain needs to be updated every time if we extend the stack on PowerPC.
+Although it is possible, its updating time overhead decreases performance. So
+we do the recursion in memory.
+.P
+(2) Why don't we simply allocate blocks of memory with \fBmalloc()\fP?
+.sp
+Modern operating systems have a nice feature: they can reserve an address space
+instead of allocating memory. We can safely allocate memory pages inside this
+address space, so the stack could grow without moving memory data (this is
+important because of pointers). Thus we can allocate 1M address space, and use
+only a single memory page (usually 4K) if that is enough. However, we can still
+grow up to 1M anytime if needed.
+.P
+(3) Who "owns" a JIT stack?
+.sp
+The owner of the stack is the user program, not the JIT studied pattern or
+anything else. The user program must ensure that if a stack is used by
+\fBpcre_exec()\fP, (that is, it is assigned to the pattern currently running),
+that stack must not be used by any other threads (to avoid overwriting the same
+memory area). The best practice for multithreaded programs is to allocate a
+stack for each thread, and return this stack through the JIT callback function.
+.P
+(4) When should a JIT stack be freed?
+.sp
+You can free a JIT stack at any time, as long as it will not be used by
+\fBpcre_exec()\fP again. When you assign the stack to a pattern, only a pointer
+is set. There is no reference counting or any other magic. You can free the
+patterns and stacks in any order, anytime. Just \fIdo not\fP call
+\fBpcre_exec()\fP with a pattern pointing to an already freed stack, as that
+will cause SEGFAULT. (Also, do not free a stack currently used by
+\fBpcre_exec()\fP in another thread). You can also replace the stack for a
+pattern at any time. You can even free the previous stack before assigning a
+replacement.
+.P
+(5) Should I allocate/free a stack every time before/after calling
+\fBpcre_exec()\fP?
+.sp
+No, because this is too costly in terms of resources. However, you could
+implement some clever idea which release the stack if it is not used in let's
+say two minutes. The JIT callback can help to achive this without keeping a
+list of the currently JIT studied patterns.
+.P
+(6) OK, the stack is for long term memory allocation. But what happens if a
+pattern causes stack overflow with a stack of 1M? Is that 1M kept until the
+stack is freed?
+.sp
+Especially on embedded sytems, it might be a good idea to release
+memory sometimes without freeing the stack. There is no API for this at the
+moment. Probably a function call which returns with the currently allocated
+memory for any stack and another which allows releasing memory (shrinking the
+stack) would be a good idea if someone needs this.
+.P
+(7) This is too much of a headache. Isn't there any better solution for JIT
+stack handling?
+.sp
+No, thanks to Windows. If POSIX threads were used everywhere, we could throw
+out this complicated API.
+.
+.
.SH "EXAMPLE CODE"
.rs
.sp
@@ -253,7 +347,7 @@ callback.
.rs
.sp
.nf
-Philip Hazel
+Philip Hazel (FAQ by Zoltan Herczeg)
University Computing Service
Cambridge CB2 3QH, England.
.fi
@@ -263,6 +357,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 15 November 2011
+Last updated: 26 November 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcrelimits.3 b/doc/pcrelimits.3
index 1fceec9..028cf55 100644
--- a/doc/pcrelimits.3
+++ b/doc/pcrelimits.3
@@ -23,6 +23,11 @@ All values in repeating quantifiers must be less than 65536.
There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns.
.P
+There is a limit to the number of forward references to subsequent subpatterns
+of around 200,000. Repeated forward references with fixed upper limits, for
+example, (?2){0,100} when subpattern number 2 is to the right, are included in
+the count. There is no limit to the number of backward references.
+.P
The maximum length of name for a named subpattern is 32 characters, and the
maximum number of named subpatterns is 10000.
.P
@@ -52,6 +57,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 24 August 2011
+Last updated: 30 November 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 18def50..bb8a4a0 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -242,7 +242,7 @@ one of the following escape sequences than the binary character it represents:
\eddd character with octal code ddd, or back reference
\exhh character with hex code hh
\ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
- \euhhhh character with hex code hhhh (JavaScript mode only)
+ \euhhhh character with hex code hhhh (JavaScript mode only)
.sp
The precise effect of \ecx is as follows: if x is a lower case letter, it
is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
@@ -265,15 +265,15 @@ there is no terminating }, this form of escape is not recognized. Instead, the
initial \ex will be interpreted as a basic hexadecimal escape, with no
following digits, giving a character whose value is zero.
.P
-If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
-as just described only when it is followed by two hexadecimal digits.
+If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
+as just described only when it is followed by two hexadecimal digits.
Otherwise, it matches a literal "x" character. In JavaScript mode, support for
-code points greater than 256 is provided by \eu, which must be followed by
+code points greater than 256 is provided by \eu, which must be followed by
four hexadecimal digits; otherwise it matches a literal "u" character.
.P
Characters whose value is less than 256 can be defined by either of the two
syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
-way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
+way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
\eu00dc in JavaScript mode).
.P
After \e0 up to two further octal digits are read. If there are fewer than two
@@ -328,12 +328,14 @@ Note that octal values of 100 or greater must not be introduced by a leading
zero, because no more than three octal digits are ever read.
.P
All the sequences that define a single character value can be used both inside
-and outside character classes. In addition, inside a character class, the
-sequence \eb is interpreted as the backspace character (hex 08). The sequences
-\eB, \eN, \eR, and \eX are not special inside a character class. Like any other
-unrecognized escape sequences, they are treated as the literal characters "B",
-"N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is
-set. Outside a character class, these sequences have different meanings.
+and outside character classes. In addition, inside a character class, \eb is
+interpreted as the backspace character (hex 08).
+.P
+\eN is not allowed in a character class. \eB, \eR, and \eX are not special
+inside a character class. Like other unrecognized escape sequences, they are
+treated as the literal characters "B", "R", and "X" by default, but cause an
+error if the PCRE_EXTRA option is set. Outside a character class, these
+sequences have different meanings.
.
.
.SS "Unsupported escape sequences"
@@ -405,7 +407,7 @@ This is the same as
.\" </a>
the "." metacharacter
.\"
-when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
+when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
PCRE does not support this.
.P
Each pair of lower and upper case escape sequences partitions the complete set
@@ -2567,10 +2569,11 @@ failing negative assertion, they cause an error if encountered by
If any of these verbs are used in an assertion or in a subpattern that is
called as a subroutine (whether or not recursively), their effect is confined
to that subpattern; it does not extend to the surrounding pattern, with one
-exception: a *MARK that is encountered in a positive assertion \fIis\fP passed
-back (compare capturing parentheses in assertions). Note that such subpatterns
-are processed as anchored at the point where they are tested. Note also that
-Perl's treatment of subroutines is different in some cases.
+exception: the name from a *(MARK), (*PRUNE), or (*THEN) that is encountered in
+a successful positive assertion \fIis\fP passed back when a match succeeds
+(compare capturing parentheses in assertions). Note that such subpatterns are
+processed as anchored at the point where they are tested. Note also that Perl's
+treatment of subroutines is different in some cases.
.P
The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form
@@ -2589,6 +2592,9 @@ included backtracking verbs will not, of course, be processed. You can suppress
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the
pattern with (*NO_START_OPT).
+.P
+Experiments with Perl suggest that it too has similar optimizations, sometimes
+leading to anomalous results.
.
.
.SS "Verbs that act immediately"
@@ -2636,8 +2642,9 @@ starting point (see (*SKIP) below).
A name is always required with this verb. There may be as many instances of
(*MARK) as you like in a pattern, and their names do not have to be unique.
.P
-When a match succeeds, the name of the last-encountered (*MARK) is passed back
-to the caller via the \fIpcre_extra\fP data structure, as described in the
+When a match succeeds, the name of the last-encountered (*MARK) on the matching
+path is passed back to the caller via the \fIpcre_extra\fP data structure, as
+described in the
.\" HTML <a href="pcreapi.html#extradata">
.\" </a>
section on \fIpcre_extra\fP
@@ -2646,12 +2653,11 @@ in the
.\" HREF
\fBpcreapi\fP
.\"
-documentation. No data is returned for a partial match. Here is an example of
-\fBpcretest\fP output, where the /K modifier requests the retrieval and
-outputting of (*MARK) data:
+documentation. Here is an example of \fBpcretest\fP output, where the /K
+modifier requests the retrieval and outputting of (*MARK) data:
.sp
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XY
+ re> /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data> XY
0: XY
MK: A
XZ
@@ -2667,31 +2673,17 @@ If (*MARK) is encountered in a positive assertion, its name is recorded and
passed back if it is the last-encountered. This does not happen for negative
assertions.
.P
-A name may also be returned after a failed match if the final path through the
-pattern involves (*MARK). However, unless (*MARK) used in conjunction with
-(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the
-starting point for matching is advanced, the final check is often with an empty
-string, causing a failure before (*MARK) is reached. For example:
-.sp
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XP
- No match
-.sp
-There are three potential starting points for this match (starting with X,
-starting with P, and with an empty string). If the pattern is anchored, the
-result is different:
+After a partial match or a failed match, the name of the last encountered
+(*MARK) in the entire match process is returned. For example:
.sp
- /^X(*MARK:A)Y|^X(*MARK:B)Z/K
- XP
+ re> /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data> XP
No match, mark = B
.sp
-PCRE's start-of-match optimizations can also interfere with this. For example,
-if, as a result of a call to \fBpcre_study()\fP, it knows the minimum
-subject length for a match, a shorter subject will not be scanned at all.
-.P
-Note that similar anomalies (though different in detail) exist in Perl, no
-doubt for the same reasons. The use of (*MARK) data after a failed match of an
-unanchored pattern is not recommended, unless (*COMMIT) is involved.
+Note that in this unanchored example the mark is retained from the match
+attempt that started at the letter "X". Subsequent match attempts starting at
+"P" and then with an empty string do not get as far as the (*MARK) item, but
+nevertheless do not reset it.
.
.
.SS "Verbs that act after backtracking"
@@ -2728,8 +2720,8 @@ Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE's start-of-match optimizations are turned off, as shown in this
\fBpcretest\fP example:
.sp
- /(*COMMIT)abc/
- xyzabc
+ re> /(*COMMIT)abc/
+ data> xyzabc
0: abc
xyzabc\eY
No match
@@ -2750,10 +2742,8 @@ reached, or when matching to the right of (*PRUNE), but if there is no match to
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
(*PRUNE) is just an alternative to an atomic group or possessive quantifier,
but there are some uses of (*PRUNE) that cannot be expressed in any other way.
-The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
-match fails completely; the name is passed back if this is the final attempt.
-(*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored
-pattern (*PRUNE) has the same effect as (*COMMIT).
+The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
+anchored pattern (*PRUNE) has the same effect as (*COMMIT).
.sp
(*SKIP)
.sp
@@ -2779,8 +2769,7 @@ following pattern fails to match, the previous path through the pattern is
searched for the most recent (*MARK) that has the same name. If one is found,
the "bumpalong" advance is to the subject position that corresponds to that
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
-matching name is found, normal "bumpalong" of one character happens (that is,
-the (*SKIP) is ignored).
+matching name is found, the (*SKIP) is ignored.
.sp
(*THEN) or (*THEN:NAME)
.sp
@@ -2794,9 +2783,8 @@ be used for a pattern-based if-then-else block:
If the COND1 pattern matches, FOO is tried (and possibly further items after
the end of the group if FOO succeeds); on failure, the matcher skips to the
second alternative and tries COND2, without backtracking into COND1. The
-behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
-overall match fails. If (*THEN) is not inside an alternation, it acts like
-(*PRUNE).
+behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
+If (*THEN) is not inside an alternation, it acts like (*PRUNE).
.P
Note that a subpattern that does not contain a | character is just a part of
the enclosing alternative; it is not a nested alternation with only one
@@ -2874,6 +2862,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 19 November 2011
+Last updated: 29 November 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcretest.1 b/doc/pcretest.1
index 2689bfa..968a232 100644
--- a/doc/pcretest.1
+++ b/doc/pcretest.1
@@ -319,7 +319,10 @@ as the tables pointer; that is, \fB/L\fP applies only to the expression on
which it appears.
.P
The \fB/M\fP modifier causes the size of memory block used to hold the compiled
-pattern to be output.
+pattern to be output. This does not include the size of the \fBpcre\fP block;
+it is just the actual compiled data. If the pattern is successfully studied
+with the PCRE_STUDY_JIT_COMPILE option, the size of the JIT compiled code is
+also output.
.P
If the \fB/S\fP modifier appears once, it causes \fBpcre_study()\fP to be
called after the expression has been compiled, and the results used when the
@@ -875,6 +878,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 26 August 2011
+Last updated: 02 December 2011
Copyright (c) 1997-2011 University of Cambridge.
.fi
diff --git a/doc/pcretest.txt b/doc/pcretest.txt
index 999ee0c..3835f48 100644
--- a/doc/pcretest.txt
+++ b/doc/pcretest.txt
@@ -305,30 +305,33 @@ PATTERN MODIFIERS
it appears.
The /M modifier causes the size of memory block used to hold the com-
- piled pattern to be output.
-
- If the /S modifier appears once, it causes pcre_study() to be called
- after the expression has been compiled, and the results used when the
- expression is matched. If /S appears twice, it suppresses studying,
+ piled pattern to be output. This does not include the size of the pcre
+ block; it is just the actual compiled data. If the pattern is success-
+ fully studied with the PCRE_STUDY_JIT_COMPILE option, the size of the
+ JIT compiled code is also output.
+
+ If the /S modifier appears once, it causes pcre_study() to be called
+ after the expression has been compiled, and the results used when the
+ expression is matched. If /S appears twice, it suppresses studying,
even if it was requested externally by the -s command line option. This
- makes it possible to specify that certain patterns are always studied,
+ makes it possible to specify that certain patterns are always studied,
and others are never studied, independently of -s. This feature is used
in the test files in a few cases where the output is different when the
pattern is studied.
- If the /S modifier is immediately followed by a + character, the call
- to pcre_study() is made with the PCRE_STUDY_JIT_COMPILE option,
- requesting just-in-time optimization support if it is available. Note
- that there is also a /+ modifier; it must not be given immediately
- after /S because this will be misinterpreted. If JIT studying is suc-
- cessful, it will automatically be used when pcre_exec() is run, except
- when incompatible run-time options are specified. These include the
+ If the /S modifier is immediately followed by a + character, the call
+ to pcre_study() is made with the PCRE_STUDY_JIT_COMPILE option,
+ requesting just-in-time optimization support if it is available. Note
+ that there is also a /+ modifier; it must not be given immediately
+ after /S because this will be misinterpreted. If JIT studying is suc-
+ cessful, it will automatically be used when pcre_exec() is run, except
+ when incompatible run-time options are specified. These include the
partial matching options; a complete list is given in the pcrejit docu-
- mentation. See also the \J escape sequence below for a way of setting
+ mentation. See also the \J escape sequence below for a way of setting
the size of the JIT stack.
- The /T modifier must be followed by a single digit. It causes a spe-
- cific set of built-in character tables to be passed to pcre_compile().
+ The /T modifier must be followed by a single digit. It causes a spe-
+ cific set of built-in character tables to be passed to pcre_compile().
It is used in the standard PCRE tests to check behaviour with different
character tables. The digit specifies the tables as follows:
@@ -336,12 +339,12 @@ PATTERN MODIFIERS
pcre_chartables.c.dist
1 a set of tables defining ISO 8859 characters
- In table 1, some characters whose codes are greater than 128 are iden-
+ In table 1, some characters whose codes are greater than 128 are iden-
tified as letters, digits, spaces, etc.
Using the POSIX wrapper API
- The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
+ The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
rather than its native API. When /P is set, the following modifiers set
options for the regcomp() function:
@@ -353,17 +356,17 @@ PATTERN MODIFIERS
/W REG_UCP ) the POSIX standard
/8 REG_UTF8 )
- The /+ modifier works as described above. All other modifiers are
+ The /+ modifier works as described above. All other modifiers are
ignored.
DATA LINES
- Before each data line is passed to pcre_exec(), leading and trailing
- white space is removed, and it is then scanned for \ escapes. Some of
- these are pretty esoteric features, intended for checking out some of
- the more complicated features of PCRE. If you are just testing "ordi-
- nary" regular expressions, you probably don't need any of these. The
+ Before each data line is passed to pcre_exec(), leading and trailing
+ white space is removed, and it is then scanned for \ escapes. Some of
+ these are pretty esoteric features, intended for checking out some of
+ the more complicated features of PCRE. If you are just testing "ordi-
+ nary" regular expressions, you probably don't need any of these. The
following escapes are recognized:
\a alarm (BEL, \x07)
@@ -444,95 +447,95 @@ DATA LINES
\<any> pass the PCRE_NEWLINE_ANY option to pcre_exec()
or pcre_dfa_exec()
- Note that \xhh always specifies one byte, even in UTF-8 mode; this
+ Note that \xhh always specifies one byte, even in UTF-8 mode; this
makes it possible to construct invalid UTF-8 sequences for testing pur-
poses. On the other hand, \x{hh} is interpreted as a UTF-8 character in
- UTF-8 mode, generating more than one byte if the value is greater than
+ UTF-8 mode, generating more than one byte if the value is greater than
127. When not in UTF-8 mode, it generates one byte for values less than
256, and causes an error for greater values.
- The escapes that specify line ending sequences are literal strings,
+ The escapes that specify line ending sequences are literal strings,
exactly as shown. No more than one newline setting should be present in
any data line.
- A backslash followed by anything else just escapes the anything else.
- If the very last character is a backslash, it is ignored. This gives a
- way of passing an empty line as data, since a real empty line termi-
+ A backslash followed by anything else just escapes the anything else.
+ If the very last character is a backslash, it is ignored. This gives a
+ way of passing an empty line as data, since a real empty line termi-
nates the data input.
- The \J escape provides a way of setting the maximum stack size that is
- used by the just-in-time optimization code. It is ignored if JIT opti-
- mization is not being used. Providing a stack that is larger than the
+ The \J escape provides a way of setting the maximum stack size that is
+ used by the just-in-time optimization code. It is ignored if JIT opti-
+ mization is not being used. Providing a stack that is larger than the
default 32K is necessary only for very complicated patterns.
- If \M is present, pcretest calls pcre_exec() several times, with dif-
- ferent values in the match_limit and match_limit_recursion fields of
- the pcre_extra data structure, until it finds the minimum numbers for
- each parameter that allow pcre_exec() to complete without error.
- Because this is testing a specific feature of the normal interpretive
- pcre_exec() execution, the use of any JIT optimization that might have
+ If \M is present, pcretest calls pcre_exec() several times, with dif-
+ ferent values in the match_limit and match_limit_recursion fields of
+ the pcre_extra data structure, until it finds the minimum numbers for
+ each parameter that allow pcre_exec() to complete without error.
+ Because this is testing a specific feature of the normal interpretive
+ pcre_exec() execution, the use of any JIT optimization that might have
been set up by the /S+ qualifier of -s+ option is disabled.
- The match_limit number is a measure of the amount of backtracking that
- takes place, and checking it out can be instructive. For most simple
- matches, the number is quite small, but for patterns with very large
- numbers of matching possibilities, it can become large very quickly
- with increasing length of subject string. The match_limit_recursion
- number is a measure of how much stack (or, if PCRE is compiled with
- NO_RECURSE, how much heap) memory is needed to complete the match
+ The match_limit number is a measure of the amount of backtracking that
+ takes place, and checking it out can be instructive. For most simple
+ matches, the number is quite small, but for patterns with very large
+ numbers of matching possibilities, it can become large very quickly
+ with increasing length of subject string. The match_limit_recursion
+ number is a measure of how much stack (or, if PCRE is compiled with
+ NO_RECURSE, how much heap) memory is needed to complete the match
attempt.
- When \O is used, the value specified may be higher or lower than the
+ When \O is used, the value specified may be higher or lower than the
size set by the -O command line option (or defaulted to 45); \O applies
only to the call of pcre_exec() for the line in which it appears.
- If the /P modifier was present on the pattern, causing the POSIX wrap-
- per API to be used, the only option-setting sequences that have any
- effect are \B, \N, and \Z, causing REG_NOTBOL, REG_NOTEMPTY, and
+ If the /P modifier was present on the pattern, causing the POSIX wrap-
+ per API to be used, the only option-setting sequences that have any
+ effect are \B, \N, and \Z, causing REG_NOTBOL, REG_NOTEMPTY, and
REG_NOTEOL, respectively, to be passed to regexec().
- The use of \x{hh...} to represent UTF-8 characters is not dependent on
- the use of the /8 modifier on the pattern. It is recognized always.
- There may be any number of hexadecimal digits inside the braces. The
- result is from one to six bytes, encoded according to the original
- UTF-8 rules of RFC 2279. This allows for values in the range 0 to
- 0x7FFFFFFF. Note that not all of those are valid Unicode code points,
- or indeed valid UTF-8 characters according to the later rules in RFC
+ The use of \x{hh...} to represent UTF-8 characters is not dependent on
+ the use of the /8 modifier on the pattern. It is recognized always.
+ There may be any number of hexadecimal digits inside the braces. The
+ result is from one to six bytes, encoded according to the original
+ UTF-8 rules of RFC 2279. This allows for values in the range 0 to
+ 0x7FFFFFFF. Note that not all of those are valid Unicode code points,
+ or indeed valid UTF-8 characters according to the later rules in RFC
3629.
THE ALTERNATIVE MATCHING FUNCTION
- By default, pcretest uses the standard PCRE matching function,
+ By default, pcretest uses the standard PCRE matching function,
pcre_exec() to match each data line. From release 6.0, PCRE supports an
- alternative matching function, pcre_dfa_test(), which operates in a
- different way, and has some restrictions. The differences between the
+ alternative matching function, pcre_dfa_test(), which operates in a
+ different way, and has some restrictions. The differences between the
two functions are described in the pcrematching documentation.
- If a data line contains the \D escape sequence, or if the command line
- contains the -dfa option, the alternative matching function is called.
+ If a data line contains the \D escape sequence, or if the command line
+ contains the -dfa option, the alternative matching function is called.
This function finds all possible matches at a given point. If, however,
- the \F escape sequence is present in the data line, it stops after the
+ the \F escape sequence is present in the data line, it stops after the
first match is found. This is always the shortest possible match.
DEFAULT OUTPUT FROM PCRETEST
- This section describes the output when the normal matching function,
+ This section describes the output when the normal matching function,
pcre_exec(), is being used.
When a match succeeds, pcretest outputs the list of captured substrings
- that pcre_exec() returns, starting with number 0 for the string that
- matched the whole pattern. Otherwise, it outputs "No match" when the
+ that pcre_exec() returns, starting with number 0 for the string that
+ matched the whole pattern. Otherwise, it outputs "No match" when the
return is PCRE_ERROR_NOMATCH, and "Partial match:" followed by the par-
- tially matching substring when pcre_exec() returns PCRE_ERROR_PARTIAL.
- (Note that this is the entire substring that was inspected during the
- partial match; it may include characters before the actual match start
- if a lookbehind assertion, \K, \b, or \B was involved.) For any other
- return, pcretest outputs the PCRE negative error number and a short
- descriptive phrase. If the error is a failed UTF-8 string check, the
- byte offset of the start of the failing character and the reason code
- are also output, provided that the size of the output vector is at
+ tially matching substring when pcre_exec() returns PCRE_ERROR_PARTIAL.
+ (Note that this is the entire substring that was inspected during the
+ partial match; it may include characters before the actual match start
+ if a lookbehind assertion, \K, \b, or \B was involved.) For any other
+ return, pcretest outputs the PCRE negative error number and a short
+ descriptive phrase. If the error is a failed UTF-8 string check, the
+ byte offset of the start of the failing character and the reason code
+ are also output, provided that the size of the output vector is at
least two. Here is an example of an interactive pcretest run.
$ pcretest
@@ -547,9 +550,9 @@ DEFAULT OUTPUT FROM PCRETEST
Unset capturing substrings that are not followed by one that is set are
not returned by pcre_exec(), and are not shown by pcretest. In the fol-
- lowing example, there are two capturing substrings, but when the first
- data line is matched, the second, unset substring is not shown. An
- "internal" unset substring is shown as "<unset>", as for the second
+ lowing example, there are two capturing substrings, but when the first
+ data line is matched, the second, unset substring is not shown. An
+ "internal" unset substring is shown as "<unset>", as for the second
data line.
re> /(a)|(b)/
@@ -561,11 +564,11 @@ DEFAULT OUTPUT FROM PCRETEST
1: <unset>
2: b
- If the strings contain any non-printing characters, they are output as
- \0x escapes, or as \x{...} escapes if the /8 modifier was present on
- the pattern. See below for the definition of non-printing characters.
- If the pattern has the /+ modifier, the output for substring 0 is fol-
- lowed by the the rest of the subject string, identified by "0+" like
+ If the strings contain any non-printing characters, they are output as
+ \0x escapes, or as \x{...} escapes if the /8 modifier was present on
+ the pattern. See below for the definition of non-printing characters.
+ If the pattern has the /+ modifier, the output for substring 0 is fol-
+ lowed by the the rest of the subject string, identified by "0+" like
this:
re> /cat/+
@@ -573,7 +576,7 @@ DEFAULT OUTPUT FROM PCRETEST
0: cat
0+ aract
- If the pattern has the /g or /G modifier, the results of successive
+ If the pattern has the /g or /G modifier, the results of successive
matching attempts are output in sequence, like this:
re> /\Bi(\w\w)/g
@@ -585,32 +588,32 @@ DEFAULT OUTPUT FROM PCRETEST
0: ipp
1: pp
- "No match" is output only if the first match attempt fails. Here is an
- example of a failure message (the offset 4 that is specified by \>4 is
+ "No match" is output only if the first match attempt fails. Here is an
+ example of a failure message (the offset 4 that is specified by \>4 is
past the end of the subject string):
re> /xyz/
data> xyz\>4
Error -24 (bad offset value)
- If any of the sequences \C, \G, or \L are present in a data line that
- is successfully matched, the substrings extracted by the convenience
+ If any of the sequences \C, \G, or \L are present in a data line that
+ is successfully matched, the substrings extracted by the convenience
functions are output with C, G, or L after the string number instead of
a colon. This is in addition to the normal full list. The string length
- (that is, the return from the extraction function) is given in paren-
+ (that is, the return from the extraction function) is given in paren-
theses after each string for \C and \G.
Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), data lines may not. However new-
- lines can be included in data by means of the \n escape (or \r, \r\n,
+ lines can be included in data by means of the \n escape (or \r, \r\n,
etc., depending on the newline sequence setting).
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
- When the alternative matching function, pcre_dfa_exec(), is used (by
- means of the \D escape sequence or the -dfa command line option), the
- output consists of a list of all the matches that start at the first
+ When the alternative matching function, pcre_dfa_exec(), is used (by
+ means of the \D escape sequence or the -dfa command line option), the
+ output consists of a list of all the matches that start at the first
point in the subject where there is at least one match. For example:
re> /(tang|tangerine|tan)/
@@ -619,11 +622,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang
2: tan
- (Using the normal matching function on this data finds only "tang".)
- The longest matching string is always given first (and numbered zero).
+ (Using the normal matching function on this data finds only "tang".)
+ The longest matching string is always given first (and numbered zero).
After a PCRE_ERROR_PARTIAL return, the output is "Partial match:", fol-
- lowed by the partially matching substring. (Note that this is the
- entire substring that was inspected during the partial match; it may
+ lowed by the partially matching substring. (Note that this is the
+ entire substring that was inspected during the partial match; it may
include characters before the actual match start if a lookbehind asser-
tion, \K, \b, or \B was involved.)
@@ -639,16 +642,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan
0: tan
- Since the matching function does not support substring capture, the
- escape sequences that are concerned with captured substrings are not
+ Since the matching function does not support substring capture, the
+ escape sequences that are concerned with captured substrings are not
relevant.
RESTARTING AFTER A PARTIAL MATCH
When the alternative matching function has given the PCRE_ERROR_PARTIAL
- return, indicating that the subject partially matched the pattern, you
- can restart the match with additional subject data by means of the \R
+ return, indicating that the subject partially matched the pattern, you
+ can restart the match with additional subject data by means of the \R
escape sequence. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -657,30 +660,30 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\R\D
0: n05
- For further information about partial matching, see the pcrepartial
+ For further information about partial matching, see the pcrepartial
documentation.
CALLOUTS
- If the pattern contains any callout requests, pcretest's callout func-
- tion is called during matching. This works with both matching func-
+ If the pattern contains any callout requests, pcretest's callout func-
+ tion is called during matching. This works with both matching func-
tions. By default, the called function displays the callout number, the
- start and current positions in the text at the callout time, and the
+ start and current positions in the text at the callout time, and the
next pattern item to be tested. For example, the output
--->pqrabcdef
0 ^ ^ \d
- indicates that callout number 0 occurred for a match attempt starting
- at the fourth character of the subject string, when the pointer was at
- the seventh character of the data, and when the next pattern item was
- \d. Just one circumflex is output if the start and current positions
+ indicates that callout number 0 occurred for a match attempt starting
+ at the fourth character of the subject string, when the pointer was at
+ the seventh character of the data, and when the next pattern item was
+ \d. Just one circumflex is output if the start and current positions
are the same.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
- a result of the /C pattern modifier. In this case, instead of showing
- the callout number, the offset in the pattern, preceded by a plus, is
+ a result of the /C pattern modifier. In this case, instead of showing
+ the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
re> /\d?[A-E]\*/C
@@ -693,7 +696,7 @@ CALLOUTS
0: E*
If a pattern contains (*MARK) items, an additional line is output when-
- ever a change of latest mark is passed to the callout function. For
+ ever a change of latest mark is passed to the callout function. For
example:
re> /a(*MARK:X)bc/C
@@ -707,59 +710,59 @@ CALLOUTS
+12 ^ ^
0: abc
- The mark changes between matching "a" and "b", but stays the same for
- the rest of the match, so nothing more is output. If, as a result of
- backtracking, the mark reverts to being unset, the text "<unset>" is
+ The mark changes between matching "a" and "b", but stays the same for
+ the rest of the match, so nothing more is output. If, as a result of
+ backtracking, the mark reverts to being unset, the text "<unset>" is
output.
- The callout function in pcretest returns zero (carry on matching) by
- default, but you can use a \C item in a data line (as described above)
+ The callout function in pcretest returns zero (carry on matching) by
+ default, but you can use a \C item in a data line (as described above)
to change this and other parameters of the callout.
- Inserting callouts can be helpful when using pcretest to check compli-
- cated regular expressions. For further information about callouts, see
+ Inserting callouts can be helpful when using pcretest to check compli-
+ cated regular expressions. For further information about callouts, see
the pcrecallout documentation.
NON-PRINTING CHARACTERS
- When pcretest is outputting text in the compiled version of a pattern,
- bytes other than 32-126 are always treated as non-printing characters
+ When pcretest is outputting text in the compiled version of a pattern,
+ bytes other than 32-126 are always treated as non-printing characters
are are therefore shown as hex escapes.
- When pcretest is outputting text that is a matched part of a subject
- string, it behaves in the same way, unless a different locale has been
- set for the pattern (using the /L modifier). In this case, the
+ When pcretest is outputting text that is a matched part of a subject
+ string, it behaves in the same way, unless a different locale has been
+ set for the pattern (using the /L modifier). In this case, the
isprint() function to distinguish printing and non-printing characters.
SAVING AND RELOADING COMPILED PATTERNS
- The facilities described in this section are not available when the
- POSIX interface to PCRE is being used, that is, when the /P pattern
+ The facilities described in this section are not available when the
+ POSIX interface to PCRE is being used, that is, when the /P pattern
modifier is specified.
When the POSIX interface is not in use, you can cause pcretest to write
- a compiled pattern to a file, by following the modifiers with > and a
+ a compiled pattern to a file, by following the modifiers with > and a
file name. For example:
/pattern/im >/some/file
- See the pcreprecompile documentation for a discussion about saving and
- re-using compiled patterns. Note that if the pattern was successfully
+ See the pcreprecompile documentation for a discussion about saving and
+ re-using compiled patterns. Note that if the pattern was successfully
studied with JIT optimization, the JIT data cannot be saved.
- The data that is written is binary. The first eight bytes are the
- length of the compiled pattern data followed by the length of the
- optional study data, each written as four bytes in big-endian order
- (most significant byte first). If there is no study data (either the
+ The data that is written is binary. The first eight bytes are the
+ length of the compiled pattern data followed by the length of the
+ optional study data, each written as four bytes in big-endian order
+ (most significant byte first). If there is no study data (either the
pattern was not studied, or studying did not return any data), the sec-
- ond length is zero. The lengths are followed by an exact copy of the
- compiled pattern. If there is additional study data, this (excluding
- any JIT data) follows immediately after the compiled pattern. After
+ ond length is zero. The lengths are followed by an exact copy of the
+ compiled pattern. If there is additional study data, this (excluding
+ any JIT data) follows immediately after the compiled pattern. After
writing the file, pcretest expects to read a new pattern.
- A saved pattern can be reloaded into pcretest by specifying < and a
+ A saved pattern can be reloaded into pcretest by specifying < and a
file name instead of a pattern. The name of the file must not contain a
< character, as otherwise pcretest will interpret the line as a pattern
delimited by < characters. For example:
@@ -768,27 +771,27 @@ SAVING AND RELOADING COMPILED PATTERNS
Compiled pattern loaded from /some/file
No study data
- If the pattern was previously studied with the JIT optimization, the
- JIT information cannot be saved and restored, and so is lost. When the
- pattern has been loaded, pcretest proceeds to read data lines in the
+ If the pattern was previously studied with the JIT optimization, the
+ JIT information cannot be saved and restored, and so is lost. When the
+ pattern has been loaded, pcretest proceeds to read data lines in the
usual way.
- You can copy a file written by pcretest to a different host and reload
- it there, even if the new host has opposite endianness to the one on
- which the pattern was compiled. For example, you can compile on an i86
+ You can copy a file written by pcretest to a different host and reload
+ it there, even if the new host has opposite endianness to the one on
+ which the pattern was compiled. For example, you can compile on an i86
machine and run on a SPARC machine.
- File names for saving and reloading can be absolute or relative, but
- note that the shell facility of expanding a file name that starts with
+ File names for saving and reloading can be absolute or relative, but
+ note that the shell facility of expanding a file name that starts with
a tilde (~) is not available.
- The ability to save and reload files in pcretest is intended for test-
- ing and experimentation. It is not intended for production use because
- only a single pattern can be written to a file. Furthermore, there is
- no facility for supplying custom character tables for use with a
- reloaded pattern. If the original pattern was compiled with custom
- tables, an attempt to match a subject string using a reloaded pattern
- is likely to cause pcretest to crash. Finally, if you attempt to load
+ The ability to save and reload files in pcretest is intended for test-
+ ing and experimentation. It is not intended for production use because
+ only a single pattern can be written to a file. Furthermore, there is
+ no facility for supplying custom character tables for use with a
+ reloaded pattern. If the original pattern was compiled with custom
+ tables, an attempt to match a subject string using a reloaded pattern
+ is likely to cause pcretest to crash. Finally, if you attempt to load
a file that is not in the correct format, the result is undefined.
@@ -807,5 +810,5 @@ AUTHOR
REVISION
- Last updated: 26 August 2011
+ Last updated: 02 December 2011
Copyright (c) 1997-2011 University of Cambridge.
diff --git a/maint/ManyConfigTests b/maint/ManyConfigTests
index fc5e4e8..6a61f84 100755
--- a/maint/ManyConfigTests
+++ b/maint/ManyConfigTests
@@ -141,6 +141,10 @@ function runtest()
# builds both) to save a bit of time by building only one version of the
# library for the subsequent tests.
+valgrind=
+cvalgrind=
+withvalgrind=
+
echo "Tests in the current directory"
srcdir=.
for opts in \
@@ -179,10 +183,13 @@ do
runtest
done
+valgrind=
+cvalgrind=
+withvalgrind=
+
# Clean up the distribution and then do at least one build and test in a
# directory other than the source directory. It doesn't work unless the
-# source directory is cleaned up first - and anyway, it's best to leave it
-# in a clean state after all this reconfiguring.
+# source directory is cleaned up first.
if [ -f Makefile ]; then
echo "Running 'make distclean'"
diff --git a/pcre-config.in b/pcre-config.in
index ccbf210..fbdf612 100644
--- a/pcre-config.in
+++ b/pcre-config.in
@@ -70,7 +70,7 @@ while test $# -gt 0; do
;;
--libs-cpp)
if test @enable_cpp@ = yes ; then
- echo -L@libdir@$libR -lpcrecpp -lpcre
+ echo $libS$libR -lpcrecpp -lpcre
else
echo "${usage}" 1>&2
fi
diff --git a/pcre.h.in b/pcre.h.in
index 5ea21d2..0cf9438 100644
--- a/pcre.h.in
+++ b/pcre.h.in
@@ -98,19 +98,25 @@ extern "C" {
/* Options. Some are compile-time only, some are run-time only, and some are
both, so we keep them all distinct. However, almost all the bits in the options
word are now used. In the long run, we may have to re-use some of the
-compile-time only bits for runtime options, or vice versa. */
+compile-time only bits for runtime options, or vice versa. In the comments
+below, "compile", "exec", and "DFA exec" mean that the option is permitted to
+be set for those functions; "used in" means that an option may be set only for
+compile, but is subsequently referenced in exec and/or DFA exec. Any of the
+compile-time options may be inspected during studying (and therefore JIT
+compiling). */
#define PCRE_CASELESS 0x00000001 /* Compile */
#define PCRE_MULTILINE 0x00000002 /* Compile */
#define PCRE_DOTALL 0x00000004 /* Compile */
#define PCRE_EXTENDED 0x00000008 /* Compile */
#define PCRE_ANCHORED 0x00000010 /* Compile, exec, DFA exec */
-#define PCRE_DOLLAR_ENDONLY 0x00000020 /* Compile */
+#define PCRE_DOLLAR_ENDONLY 0x00000020 /* Compile, used in exec, DFA exec */
#define PCRE_EXTRA 0x00000040 /* Compile */
#define PCRE_NOTBOL 0x00000080 /* Exec, DFA exec */
#define PCRE_NOTEOL 0x00000100 /* Exec, DFA exec */
#define PCRE_UNGREEDY 0x00000200 /* Compile */
#define PCRE_NOTEMPTY 0x00000400 /* Exec, DFA exec */
+/* The next two are also used in exec and DFA exec */
#define PCRE_UTF8 0x00000800 /* Compile (Same as PCRE_UTF16) */
#define PCRE_UTF16 0x00000800 /* Compile (Same as PCRE_UTF8) */
#define PCRE_NO_AUTO_CAPTURE 0x00001000 /* Compile */
@@ -120,7 +126,7 @@ compile-time only bits for runtime options, or vice versa. */
#define PCRE_PARTIAL 0x00008000 /* Backwards compatible synonym */
#define PCRE_DFA_SHORTEST 0x00010000 /* DFA exec */
#define PCRE_DFA_RESTART 0x00020000 /* DFA exec */
-#define PCRE_FIRSTLINE 0x00040000 /* Compile */
+#define PCRE_FIRSTLINE 0x00040000 /* Compile, used in exec, DFA exec */
#define PCRE_DUPNAMES 0x00080000 /* Compile */
#define PCRE_NEWLINE_CR 0x00100000 /* Compile, exec, DFA exec */
#define PCRE_NEWLINE_LF 0x00200000 /* Compile, exec, DFA exec */
@@ -129,12 +135,12 @@ compile-time only bits for runtime options, or vice versa. */
#define PCRE_NEWLINE_ANYCRLF 0x00500000 /* Compile, exec, DFA exec */
#define PCRE_BSR_ANYCRLF 0x00800000 /* Compile, exec, DFA exec */
#define PCRE_BSR_UNICODE 0x01000000 /* Compile, exec, DFA exec */
-#define PCRE_JAVASCRIPT_COMPAT 0x02000000 /* Compile */
+#define PCRE_JAVASCRIPT_COMPAT 0x02000000 /* Compile, used in exec */
#define PCRE_NO_START_OPTIMIZE 0x04000000 /* Compile, exec, DFA exec */
#define PCRE_NO_START_OPTIMISE 0x04000000 /* Synonym */
#define PCRE_PARTIAL_HARD 0x08000000 /* Exec, DFA exec */
#define PCRE_NOTEMPTY_ATSTART 0x10000000 /* Exec, DFA exec */
-#define PCRE_UCP 0x20000000 /* Compile */
+#define PCRE_UCP 0x20000000 /* Compile, used in exec, DFA exec */
/* Exec-time and get/set-time error codes */
@@ -221,6 +227,7 @@ compile-time only bits for runtime options, or vice versa. */
#define PCRE_INFO_HASCRORLF 14
#define PCRE_INFO_MINLENGTH 15
#define PCRE_INFO_JIT 16
+#define PCRE_INFO_JITSIZE 17
/* Request types for pcre_config(). Do not re-arrange, in order to remain
compatible. */
@@ -316,7 +323,7 @@ typedef struct pcre_callout_block {
int pattern_position; /* Offset to next item in the pattern */
int next_item_length; /* Length of next item in the pattern */
/* ------------------- Added for Version 2 -------------------------- */
- const unsigned char *mark; /* Pointer to current mark or NULL */
+ const void *mark; /* Pointer to current mark or NULL */
/* ------------------------------------------------------------------ */
} pcre_callout_block;
diff --git a/pcre_compile.c b/pcre_compile.c
index 2be0936..8dee2fb 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -88,14 +88,21 @@ so this number is very generous.
The same workspace is used during the second, actual compile phase for
remembering forward references to groups so that they can be filled in at the
end. Each entry in this list occupies LINK_SIZE bytes, so even when LINK_SIZE
-is 4 there is plenty of room. */
+is 4 there is plenty of room for most patterns. However, the memory can get
+filled up by repetitions of forward references, for example patterns like
+/(?1){0,1999}(b)/, and one user did hit the limit. The code has been changed so
+that the workspace is expanded using malloc() in this situation. The value
+below is therefore a minimum, and we put a maximum on it for safety. The
+minimum is now also defined in terms of LINK_SIZE so that the use of malloc()
+kicks in at the same number of forward references in all cases. */
-#define COMPILE_WORK_SIZE (4096)
+#define COMPILE_WORK_SIZE (2048*LINK_SIZE)
+#define COMPILE_WORK_SIZE_MAX (100*COMPILE_WORK_SIZE)
/* The overrun tests check for a slightly smaller size so that they detect the
overrun before it actually does run off the end of the data block. */
-#define WORK_SIZE_CHECK (COMPILE_WORK_SIZE - 100)
+#define WORK_SIZE_SAFETY_MARGIN (100)
/* Private flags added to firstchar and reqchar. */
@@ -474,7 +481,9 @@ static const char error_texts[] =
"\\k is not followed by a braced, angle-bracketed, or quoted name\0"
/* 70 */
"internal error: unknown opcode in find_fixedlength()\0"
- "Not allowed UTF-8 / UTF-16 code point (>= 0xd800 && <= 0xdfff)\0"
+ "\\N is not supported in a class\0"
+ "too many forward references\0"
+ "disallowed UTF-8/16 code point (>= 0xd800 && <= 0xdfff)\0"
;
/* Table to identify digits and hex digits. This is used when compiling
@@ -649,6 +658,44 @@ return s;
/*************************************************
+* Expand the workspace *
+*************************************************/
+
+/* This function is called during the second compiling phase, if the number of
+forward references fills the existing workspace, which is originally a block on
+the stack. A larger block is obtained from malloc() unless the ultimate limit
+has been reached or the increase will be rather small.
+
+Argument: pointer to the compile data block
+Returns: 0 if all went well, else an error number
+*/
+
+static int
+expand_workspace(compile_data *cd)
+{
+pcre_uchar *newspace;
+int newsize = cd->workspace_size * 2;
+
+if (newsize > COMPILE_WORK_SIZE_MAX) newsize = COMPILE_WORK_SIZE_MAX;
+if (cd->workspace_size >= COMPILE_WORK_SIZE_MAX ||
+ newsize - cd->workspace_size < WORK_SIZE_SAFETY_MARGIN)
+ return ERR72;
+
+newspace = (pcre_malloc)(newsize);
+if (newspace == NULL) return ERR21;
+
+memcpy(newspace, cd->start_workspace, cd->workspace_size);
+cd->hwm = (pcre_uchar *)newspace + (cd->hwm - cd->start_workspace);
+if (cd->workspace_size > COMPILE_WORK_SIZE)
+ (pcre_free)((void *)cd->start_workspace);
+cd->start_workspace = newspace;
+cd->workspace_size = newsize;
+return 0;
+}
+
+
+
+/*************************************************
* Check for counted repeat *
*************************************************/
@@ -1013,7 +1060,7 @@ else
if (*pt == CHAR_RIGHT_CURLY_BRACKET)
{
- if (utf && c >= 0xd800 && c <= 0xdfff) *errorcodeptr = ERR71;
+ if (utf && c >= 0xd800 && c <= 0xdfff) *errorcodeptr = ERR73;
ptr = pt;
break;
}
@@ -1811,7 +1858,7 @@ for (;;)
cc++;
break;
- /* The single-byte matcher isn't allowed. This only happens in UTF-8 mode;
+ /* The single-byte matcher isn't allowed. This only happens in UTF-8 mode;
otherwise \C is coded as OP_ALLANY. */
case OP_ANYBYTE:
@@ -3449,7 +3496,8 @@ for (;; ptr++)
#ifdef PCRE_DEBUG
if (code > cd->hwm) cd->hwm = code; /* High water info */
#endif
- if (code > cd->start_workspace + WORK_SIZE_CHECK) /* Check for overrun */
+ if (code > cd->start_workspace + cd->workspace_size -
+ WORK_SIZE_SAFETY_MARGIN) /* Check for overrun */
{
*errorcodeptr = ERR52;
goto FAILED;
@@ -3474,7 +3522,7 @@ for (;; ptr++)
*lengthptr += (int)(code - last_code);
DPRINTF(("length=%d added %d c=%c (0x%x)\n", *lengthptr,
(int)(code - last_code), c, c));
-
+
/* If "previous" is set and it is not at the start of the work space, move
it back to there, in order to avoid filling up the work space. Otherwise,
if "previous" is NULL, reset the current code pointer to the start. */
@@ -3499,7 +3547,8 @@ for (;; ptr++)
/* In the real compile phase, just check the workspace used by the forward
reference list. */
- else if (cd->hwm > cd->start_workspace + WORK_SIZE_CHECK)
+ else if (cd->hwm > cd->start_workspace + cd->workspace_size -
+ WORK_SIZE_SAFETY_MARGIN)
{
*errorcodeptr = ERR52;
goto FAILED;
@@ -3895,6 +3944,11 @@ for (;; ptr++)
if (*errorcodeptr != 0) goto FAILED;
if (-c == ESC_b) c = CHAR_BS; /* \b is backspace in a class */
+ else if (-c == ESC_N) /* \N is not supported in a class */
+ {
+ *errorcodeptr = ERR71;
+ goto FAILED;
+ }
else if (-c == ESC_Q) /* Handle start of quoted string */
{
if (ptr[1] == CHAR_BACKSLASH && ptr[2] == CHAR_E)
@@ -4404,6 +4458,19 @@ for (;; ptr++)
{
*class_uchardata++ = XCL_SINGLE;
class_uchardata += PRIV(ord2utf)(othercase, class_uchardata);
+
+ /* In the first pass, we must accumulate the space used here for
+ the following reason: If this ends up as the only character in the
+ class, it will later be optimized down to a single character.
+ However, that uses less memory, and so if this happens to be at the
+ end of the regex, there will not be enough memory in the real
+ compile for this temporary storage. */
+
+ if (lengthptr != NULL)
+ {
+ *lengthptr += class_uchardata - class_uchardata_base;
+ class_uchardata = class_uchardata_base;
+ }
}
}
#endif /* SUPPORT_UCP */
@@ -4441,7 +4508,8 @@ for (;; ptr++)
goto FAILED;
}
- /* If class_charcount is 1, we saw precisely one character whose value is
+ /* COMMENT NEEDS FIXING - no longer true.
+ If class_charcount is 1, we saw precisely one character whose value is
less than 256. As long as there were no characters >= 128 and there was no
use of \p or \P, in other words, no use of any XCLASS features, we can
optimize.
@@ -4457,7 +4525,7 @@ for (;; ptr++)
case, it can cause firstchar to be set. Otherwise, there can be no first
char if this item is first, whatever repeat count may follow. In the case
of reqchar, save the previous value for reinstating. */
-
+
#ifdef SUPPORT_UTF
if (class_single_char == 1 && (!utf || !negate_class
|| class_lastchar < (MAX_VALUE_FOR_SINGLE_CHAR + 1)))
@@ -4538,7 +4606,7 @@ for (;; ptr++)
/* Now fill in the complete length of the item */
- PUT(previous, 1, code - previous);
+ PUT(previous, 1, (int)(code - previous));
break; /* End of class handling */
}
#endif
@@ -4633,7 +4701,7 @@ for (;; ptr++)
past, but it no longer happens for non-repeated recursions. In fact, the
repeated ones could be re-implemented independently so as not to need this,
but for the moment we rely on the code for repeating groups. */
-
+
if (*previous == OP_RECURSE)
{
memmove(previous + 1 + LINK_SIZE, previous, IN_UCHARS(1 + LINK_SIZE));
@@ -4677,7 +4745,7 @@ for (;; ptr++)
{
pcre_uchar *lastchar = code - 1;
BACKCHAR(lastchar);
- c = code - lastchar; /* Length of UTF-8 character */
+ c = (int)(code - lastchar); /* Length of UTF-8 character */
memcpy(utf_chars, lastchar, IN_UCHARS(c)); /* Save the char */
c |= UTF_LENGTH; /* Flag c as a length */
}
@@ -5083,16 +5151,32 @@ for (;; ptr++)
*lengthptr += delta;
}
- /* This is compiling for real */
+ /* This is compiling for real. If there is a set first byte for
+ the group, and we have not yet set a "required byte", set it. Make
+ sure there is enough workspace for copying forward references before
+ doing the copy. */
else
{
if (groupsetfirstchar && reqchar < 0) reqchar = firstchar;
+
for (i = 1; i < repeat_min; i++)
{
pcre_uchar *hc;
pcre_uchar *this_hwm = cd->hwm;
memcpy(code, previous, IN_UCHARS(len));
+
+ while (cd->hwm > cd->start_workspace + cd->workspace_size -
+ WORK_SIZE_SAFETY_MARGIN - (this_hwm - save_hwm))
+ {
+ int save_offset = save_hwm - cd->start_workspace;
+ int this_offset = this_hwm - cd->start_workspace;
+ *errorcodeptr = expand_workspace(cd);
+ if (*errorcodeptr != 0) goto FAILED;
+ save_hwm = (pcre_uchar *)cd->start_workspace + save_offset;
+ this_hwm = (pcre_uchar *)cd->start_workspace + this_offset;
+ }
+
for (hc = save_hwm; hc < this_hwm; hc += LINK_SIZE)
{
PUT(cd->hwm, 0, GET(hc, 0) + len);
@@ -5160,6 +5244,21 @@ for (;; ptr++)
}
memcpy(code, previous, IN_UCHARS(len));
+
+ /* Ensure there is enough workspace for forward references before
+ copying them. */
+
+ while (cd->hwm > cd->start_workspace + cd->workspace_size -
+ WORK_SIZE_SAFETY_MARGIN - (this_hwm - save_hwm))
+ {
+ int save_offset = save_hwm - cd->start_workspace;
+ int this_offset = this_hwm - cd->start_workspace;
+ *errorcodeptr = expand_workspace(cd);
+ if (*errorcodeptr != 0) goto FAILED;
+ save_hwm = (pcre_uchar *)cd->start_workspace + save_offset;
+ this_hwm = (pcre_uchar *)cd->start_workspace + this_offset;
+ }
+
for (hc = save_hwm; hc < this_hwm; hc += LINK_SIZE)
{
PUT(cd->hwm, 0, GET(hc, 0) + len + ((i != 0)? 2+LINK_SIZE : 1));
@@ -5190,24 +5289,24 @@ for (;; ptr++)
ONCE brackets can be converted into non-capturing brackets, as the
behaviour of (?:xx)++ is the same as (?>xx)++ and this saves having to
deal with possessive ONCEs specially.
-
+
Otherwise, when we are doing the actual compile phase, check to see
whether this group is one that could match an empty string. If so,
convert the initial operator to the S form (e.g. OP_BRA -> OP_SBRA) so
that runtime checking can be done. [This check is also applied to ONCE
groups at runtime, but in a different way.]
- Then, if the quantifier was possessive and the bracket is not a
+ Then, if the quantifier was possessive and the bracket is not a
conditional, we convert the BRA code to the POS form, and the KET code to
KETRPOS. (It turns out to be convenient at runtime to detect this kind of
subpattern at both the start and at the end.) The use of special opcodes
makes it possible to reduce greatly the stack usage in pcre_exec(). If
- the group is preceded by OP_BRAZERO, convert this to OP_BRAPOSZERO.
-
+ the group is preceded by OP_BRAZERO, convert this to OP_BRAPOSZERO.
+
Then, if the minimum number of matches is 1 or 0, cancel the possessive
flag so that the default action below, of wrapping everything inside
atomic brackets, does not happen. When the minimum is greater than 1,
- there will be earlier copies of the group, and so we still have to wrap
+ there will be earlier copies of the group, and so we still have to wrap
the whole thing. */
else
@@ -5216,23 +5315,23 @@ for (;; ptr++)
pcre_uchar *bracode = ketcode - GET(ketcode, 1);
/* Convert possessive ONCE brackets to non-capturing */
-
+
if ((*bracode == OP_ONCE || *bracode == OP_ONCE_NC) &&
possessive_quantifier) *bracode = OP_BRA;
/* For non-possessive ONCE brackets, all we need to do is to
set the KET. */
-
+
if (*bracode == OP_ONCE || *bracode == OP_ONCE_NC)
*ketcode = OP_KETRMAX + repeat_type;
-
+
/* Handle non-ONCE brackets and possessive ONCEs (which have been
- converted to non-capturing above). */
-
+ converted to non-capturing above). */
+
else
{
/* In the compile phase, check for empty string matching. */
-
+
if (lengthptr == NULL)
{
pcre_uchar *scode = bracode;
@@ -5247,7 +5346,7 @@ for (;; ptr++)
}
while (*scode == OP_ALT);
}
-
+
/* Handle possessive quantifiers. */
if (possessive_quantifier)
@@ -5256,7 +5355,7 @@ for (;; ptr++)
repeated non-capturing bracket, because we have not invented POS
versions of the COND opcodes. Because we are moving code along, we
must ensure that any pending recursive references are updated. */
-
+
if (*bracode == OP_COND || *bracode == OP_SCOND)
{
int nlen = (int)(code - bracode);
@@ -5269,25 +5368,25 @@ for (;; ptr++)
*code++ = OP_KETRPOS;
PUTINC(code, 0, nlen);
PUT(bracode, 1, nlen);
- }
-
+ }
+
/* For non-COND brackets, we modify the BRA code and use KETRPOS. */
-
- else
+
+ else
{
*bracode += 1; /* Switch to xxxPOS opcodes */
*ketcode = OP_KETRPOS;
}
-
- /* If the minimum is zero, mark it as possessive, then unset the
+
+ /* If the minimum is zero, mark it as possessive, then unset the
possessive flag when the minimum is 0 or 1. */
-
+
if (brazeroptr != NULL) *brazeroptr = OP_BRAPOSZERO;
if (repeat_min < 2) possessive_quantifier = FALSE;
}
-
+
/* Non-possessive quantifier */
-
+
else *ketcode = OP_KETRMAX + repeat_type;
}
}
@@ -6180,6 +6279,12 @@ for (;; ptr++)
of the group. Then remember the forward reference. */
called = cd->start_code + recno;
+ if (cd->hwm >= cd->start_workspace + cd->workspace_size -
+ WORK_SIZE_SAFETY_MARGIN)
+ {
+ *errorcodeptr = expand_workspace(cd);
+ if (*errorcodeptr != 0) goto FAILED;
+ }
PUTINC(cd->hwm, 0, (int)(code + 1 - cd->start_code));
}
@@ -6200,11 +6305,14 @@ for (;; ptr++)
}
}
- /* Insert the recursion/subroutine item. */
+ /* Insert the recursion/subroutine item. It does not have a set first
+ character (relevant if it is repeated, because it will then be
+ wrapped with ONCE brackets). */
*code = OP_RECURSE;
PUT(code, 1, (int)(called - cd->start_code));
code += 1 + LINK_SIZE;
+ groupsetfirstchar = FALSE;
}
/* Can't determine a first byte now */
@@ -6688,8 +6796,8 @@ for (;; ptr++)
#endif
/* In non-UTF-8 mode, we turn \C into OP_ALLANY instead of OP_ANYBYTE
so that it works in DFA mode and in lookbehinds. */
-
- {
+
+ {
previous = (-c > ESC_b && -c < ESC_Z)? code : NULL;
*code++ = (!utf && c == -ESC_C)? OP_ALLANY : -c;
}
@@ -7442,7 +7550,8 @@ compile_data *cd = &compile_block;
computing the amount of memory that is needed. Compiled items are thrown away
as soon as possible, so that a fairly large buffer should be sufficient for
this purpose. The same space is used in the second phase for remembering where
-to fill in forward references to subpatterns. */
+to fill in forward references to subpatterns. That may overflow, in which case
+new memory is obtained from malloc(). */
pcre_uchar cworkspace[COMPILE_WORK_SIZE];
@@ -7642,9 +7751,10 @@ cd->bracount = cd->final_bracount = 0;
cd->names_found = 0;
cd->name_entry_size = 0;
cd->name_table = NULL;
-cd->start_workspace = cworkspace;
cd->start_code = cworkspace;
cd->hwm = cworkspace;
+cd->start_workspace = cworkspace;
+cd->workspace_size = COMPILE_WORK_SIZE;
cd->start_pattern = (const pcre_uchar *)pattern;
cd->end_pattern = (const pcre_uchar *)(pattern + STRLEN_UC((const pcre_uchar *)pattern));
cd->req_varyopt = 0;
@@ -7722,7 +7832,7 @@ cd->names_found = 0;
cd->name_table = (pcre_uchar *)re + re->name_table_offset;
codestart = cd->name_table + re->name_entry_size * re->name_count;
cd->start_code = codestart;
-cd->hwm = cworkspace;
+cd->hwm = (pcre_uchar *)(cd->start_workspace);
cd->req_varyopt = 0;
cd->had_accept = FALSE;
cd->check_lookbehind = FALSE;
@@ -7756,20 +7866,34 @@ if debugging, leave the test till after things are printed out. */
if (code - codestart > length) errorcode = ERR23;
#endif
-/* Fill in any forward references that are required. */
+/* Fill in any forward references that are required. There may be repeated
+references; optimize for them, as searching a large regex takes time. */
-while (errorcode == 0 && cd->hwm > cworkspace)
+if (cd->hwm > cd->start_workspace)
{
- int offset, recno;
- const pcre_uchar *groupptr;
- cd->hwm -= LINK_SIZE;
- offset = GET(cd->hwm, 0);
- recno = GET(codestart, offset);
- groupptr = PRIV(find_bracket)(codestart, utf, recno);
- if (groupptr == NULL) errorcode = ERR53;
- else PUT(((pcre_uchar *)codestart), offset, (int)(groupptr - codestart));
+ int prev_recno = -1;
+ const pcre_uchar *groupptr = NULL;
+ while (errorcode == 0 && cd->hwm > cd->start_workspace)
+ {
+ int offset, recno;
+ cd->hwm -= LINK_SIZE;
+ offset = GET(cd->hwm, 0);
+ recno = GET(codestart, offset);
+ if (recno != prev_recno)
+ {
+ groupptr = PRIV(find_bracket)(codestart, utf, recno);
+ prev_recno = recno;
+ }
+ if (groupptr == NULL) errorcode = ERR53;
+ else PUT(((pcre_uchar *)codestart), offset, (int)(groupptr - codestart));
+ }
}
+/* If the workspace had to be expanded, free the new memory. */
+
+if (cd->workspace_size > COMPILE_WORK_SIZE)
+ (pcre_free)((void *)cd->start_workspace);
+
/* Give an error if there's back reference to a non-existent capturing
subpattern. */
diff --git a/pcre_dfa_exec.c b/pcre_dfa_exec.c
index 2b48eda..5df9cce 100644
--- a/pcre_dfa_exec.c
+++ b/pcre_dfa_exec.c
@@ -2784,7 +2784,7 @@ for (;;)
{
const pcre_uchar *p = ptr;
const pcre_uchar *pp = local_ptr;
- charcount = pp - p;
+ charcount = (int)(pp - p);
while (p < pp) if (NOT_FIRSTCHAR(*p++)) charcount--;
ADD_NEW_DATA(-next_state_offset, 0, (charcount - 1));
}
diff --git a/pcre_exec.c b/pcre_exec.c
index 5d85e4b..1d17e86 100644
--- a/pcre_exec.c
+++ b/pcre_exec.c
@@ -82,14 +82,6 @@ negative to avoid the external error codes. */
#define MATCH_SKIP_ARG (-993)
#define MATCH_THEN (-992)
-/* This is a convenience macro for code that occurs many times. */
-
-#define MRRETURN(ra) \
- { \
- md->mark = markptr; \
- RRETURN(ra); \
- }
-
/* Maximum number of ints of offset to save on the stack for recursive calls.
If the offset vector is bigger, malloc is used. This should be a multiple of 3,
because the offset vector is always a multiple of 3 long. */
@@ -225,7 +217,7 @@ else
while (length-- > 0) if (*p++ != *eptr++) return -1;
}
-return eptr - eptr_start;
+return (int)(eptr - eptr_start);
}
@@ -290,7 +282,7 @@ actually used in this definition. */
#define RMATCH(ra,rb,rc,rd,re,rw) \
{ \
printf("match() called in line %d\n", __LINE__); \
- rrc = match(ra,rb,mstart,markptr,rc,rd,re,rdepth+1); \
+ rrc = match(ra,rb,mstart,rc,rd,re,rdepth+1); \
printf("to line %d\n", __LINE__); \
}
#define RRETURN(ra) \
@@ -300,7 +292,7 @@ actually used in this definition. */
}
#else
#define RMATCH(ra,rb,rc,rd,re,rw) \
- rrc = match(ra,rb,mstart,markptr,rc,rd,re,rdepth+1)
+ rrc = match(ra,rb,mstart,rc,rd,re,rdepth+1)
#define RRETURN(ra) return ra
#endif
@@ -321,7 +313,6 @@ argument of match(), which never changes. */
newframe->Xeptr = ra;\
newframe->Xecode = rb;\
newframe->Xmstart = mstart;\
- newframe->Xmarkptr = markptr;\
newframe->Xoffset_top = rc;\
newframe->Xeptrb = re;\
newframe->Xrdepth = frame->Xrdepth + 1;\
@@ -357,7 +348,6 @@ typedef struct heapframe {
PCRE_PUCHAR Xeptr;
const pcre_uchar *Xecode;
PCRE_PUCHAR Xmstart;
- PCRE_PUCHAR Xmarkptr;
int Xoffset_top;
eptrblock *Xeptrb;
unsigned int Xrdepth;
@@ -427,7 +417,7 @@ returns a negative (error) response, the outer incarnation must also return the
same response. */
/* These macros pack up tests that are used for partial matching, and which
-appears several times in the code. We set the "hit end" flag if the pointer is
+appear several times in the code. We set the "hit end" flag if the pointer is
at the end of the subject and also past the start of the subject (i.e.
something has been matched). For hard partial matching, we then return
immediately. The second one is used when we already know we are past the end of
@@ -438,14 +428,14 @@ the subject. */
eptr > md->start_used_ptr) \
{ \
md->hitend = TRUE; \
- if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
+ if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL); \
}
#define SCHECK_PARTIAL()\
if (md->partial != 0 && eptr > md->start_used_ptr) \
{ \
md->hitend = TRUE; \
- if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
+ if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL); \
}
@@ -459,7 +449,6 @@ Arguments:
ecode pointer to current position in compiled code
mstart pointer to the current match start position (can be modified
by encountering \K)
- markptr pointer to the most recent MARK name, or NULL
offset_top current top pointer
md pointer to "static" info for the match
eptrb pointer to chain of blocks containing eptr at start of
@@ -475,8 +464,8 @@ Returns: MATCH_MATCH if matched ) these values are >= 0
static int
match(REGISTER PCRE_PUCHAR eptr, REGISTER const pcre_uchar *ecode,
- PCRE_PUCHAR mstart, const pcre_uchar *markptr, int offset_top,
- match_data *md, eptrblock *eptrb, unsigned int rdepth)
+ PCRE_PUCHAR mstart, int offset_top, match_data *md, eptrblock *eptrb,
+ unsigned int rdepth)
{
/* These variables do not need to be preserved over recursion in this function,
so they can be ordinary variables in all cases. Mark some of them with
@@ -506,7 +495,6 @@ frame->Xprevframe = NULL; /* Marks the top level */
frame->Xeptr = eptr;
frame->Xecode = ecode;
frame->Xmstart = mstart;
-frame->Xmarkptr = markptr;
frame->Xoffset_top = offset_top;
frame->Xeptrb = eptrb;
frame->Xrdepth = rdepth;
@@ -520,7 +508,6 @@ HEAP_RECURSE:
#define eptr frame->Xeptr
#define ecode frame->Xecode
#define mstart frame->Xmstart
-#define markptr frame->Xmarkptr
#define offset_top frame->Xoffset_top
#define eptrb frame->Xeptrb
#define rdepth frame->Xrdepth
@@ -702,9 +689,12 @@ for (;;)
switch(op)
{
case OP_MARK:
- markptr = ecode + 2;
+ md->nomatch_mark = ecode + 2;
+ md->mark = NULL; /* In case previously set by assertion */
RMATCH(eptr, ecode + PRIV(OP_lengths)[*ecode] + ecode[1], offset_top, md,
eptrb, RM55);
+ if ((rrc == MATCH_MATCH || rrc == MATCH_ACCEPT) &&
+ md->mark == NULL) md->mark = ecode + 2;
/* A return of MATCH_SKIP_ARG means that matching failed at SKIP with an
argument, and we must check whether that argument matches this MARK's
@@ -713,18 +703,16 @@ for (;;)
position and return MATCH_SKIP. Otherwise, pass back the return code
unaltered. */
- if (rrc == MATCH_SKIP_ARG &&
- STRCMP_UC_UC(markptr, md->start_match_ptr) == 0)
+ else if (rrc == MATCH_SKIP_ARG &&
+ STRCMP_UC_UC(ecode + 2, md->start_match_ptr) == 0)
{
md->start_match_ptr = eptr;
RRETURN(MATCH_SKIP);
}
-
- if (md->mark == NULL) md->mark = markptr;
RRETURN(rrc);
case OP_FAIL:
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
/* COMMIT overrides PRUNE, SKIP, and THEN */
@@ -735,7 +723,7 @@ for (;;)
rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
rrc != MATCH_THEN)
RRETURN(rrc);
- MRRETURN(MATCH_COMMIT);
+ RRETURN(MATCH_COMMIT);
/* PRUNE overrides THEN */
@@ -743,13 +731,16 @@ for (;;)
RMATCH(eptr, ecode + PRIV(OP_lengths)[*ecode], offset_top, md,
eptrb, RM51);
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
- MRRETURN(MATCH_PRUNE);
+ RRETURN(MATCH_PRUNE);
case OP_PRUNE_ARG:
+ md->nomatch_mark = ecode + 2;
+ md->mark = NULL; /* In case previously set by assertion */
RMATCH(eptr, ecode + PRIV(OP_lengths)[*ecode] + ecode[1], offset_top, md,
eptrb, RM56);
+ if ((rrc == MATCH_MATCH || rrc == MATCH_ACCEPT) &&
+ md->mark == NULL) md->mark = ecode + 2;
if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
- md->mark = ecode + 2;
RRETURN(MATCH_PRUNE);
/* SKIP overrides PRUNE and THEN */
@@ -760,9 +751,18 @@ for (;;)
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
RRETURN(rrc);
md->start_match_ptr = eptr; /* Pass back current position */
- MRRETURN(MATCH_SKIP);
+ RRETURN(MATCH_SKIP);
+
+ /* Note that, for Perl compatibility, SKIP with an argument does NOT set
+ nomatch_mark. There is a flag that disables this opcode when re-matching a
+ pattern that ended with a SKIP for which there was not a matching MARK. */
case OP_SKIP_ARG:
+ if (md->ignore_skip_arg)
+ {
+ ecode += PRIV(OP_lengths)[*ecode] + ecode[1];
+ break;
+ }
RMATCH(eptr, ecode + PRIV(OP_lengths)[*ecode] + ecode[1], offset_top, md,
eptrb, RM57);
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
@@ -770,8 +770,8 @@ for (;;)
/* Pass back the current skip name by overloading md->start_match_ptr and
returning the special MATCH_SKIP_ARG return code. This will either be
- caught by a matching MARK, or get to the top, where it is treated the same
- as PRUNE. */
+ caught by a matching MARK, or get to the top, where it causes a rematch
+ with the md->ignore_skip_arg flag set. */
md->start_match_ptr = ecode + 2;
RRETURN(MATCH_SKIP_ARG);
@@ -785,14 +785,17 @@ for (;;)
eptrb, RM54);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
md->start_match_ptr = ecode;
- MRRETURN(MATCH_THEN);
+ RRETURN(MATCH_THEN);
case OP_THEN_ARG:
+ md->nomatch_mark = ecode + 2;
+ md->mark = NULL; /* In case previously set by assertion */
RMATCH(eptr, ecode + PRIV(OP_lengths)[*ecode] + ecode[1], offset_top,
md, eptrb, RM58);
+ if ((rrc == MATCH_MATCH || rrc == MATCH_ACCEPT) &&
+ md->mark == NULL) md->mark = ecode + 2;
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
md->start_match_ptr = ecode;
- md->mark = ecode + 2;
RRETURN(MATCH_THEN);
/* Handle an atomic group that does not contain any capturing parentheses.
@@ -817,7 +820,6 @@ for (;;)
if (rrc == MATCH_MATCH) /* Note: _not_ MATCH_ACCEPT */
{
mstart = md->start_match_ptr;
- markptr = md->mark;
break;
}
if (rrc == MATCH_THEN)
@@ -955,7 +957,6 @@ for (;;)
/* At this point, rrc will be one of MATCH_ONCE or MATCH_NOMATCH. */
- if (md->mark == NULL) md->mark = markptr;
RRETURN(rrc);
}
@@ -1043,7 +1044,6 @@ for (;;)
if (*ecode != OP_ALT) break;
}
- if (md->mark == NULL) md->mark = markptr;
RRETURN(MATCH_NOMATCH);
/* Handle possessive capturing brackets with an unlimited repeat. We come
@@ -1072,7 +1072,7 @@ for (;;)
if (offset < md->offset_max)
{
matched_once = FALSE;
- code_offset = ecode - md->start_code;
+ code_offset = (int)(ecode - md->start_code);
save_offset1 = md->offset_vector[offset];
save_offset2 = md->offset_vector[offset+1];
@@ -1131,7 +1131,6 @@ for (;;)
md->offset_vector[md->offset_end - number] = save_offset3;
}
- if (md->mark == NULL) md->mark = markptr;
if (allow_zero || matched_once)
{
ecode += 1 + LINK_SIZE;
@@ -1163,7 +1162,7 @@ for (;;)
POSSESSIVE_NON_CAPTURE:
matched_once = FALSE;
- code_offset = ecode - md->start_code;
+ code_offset = (int)(ecode - md->start_code);
for (;;)
{
@@ -1233,8 +1232,8 @@ for (;;)
cb.capture_top = offset_top/2;
cb.capture_last = md->capture_last;
cb.callout_data = md->callout_data;
- cb.mark = (unsigned char *)markptr;
- if ((rrc = (*pcre_callout)(&cb)) > 0) MRRETURN(MATCH_NOMATCH);
+ cb.mark = md->nomatch_mark;
+ if ((rrc = (*pcre_callout)(&cb)) > 0) RRETURN(MATCH_NOMATCH);
if (rrc < 0) RRETURN(rrc);
}
ecode += PRIV(OP_lengths)[OP_CALLOUT];
@@ -1489,7 +1488,7 @@ for (;;)
(md->notempty ||
(md->notempty_atstart &&
mstart == md->start_subject + md->start_offset)))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
/* Otherwise, we have a match. */
@@ -1498,10 +1497,10 @@ for (;;)
md->start_match_ptr = mstart; /* and the start (\K can modify) */
/* For some reason, the macros don't work properly if an expression is
- given as the argument to MRRETURN when the heap is in use. */
+ given as the argument to RRETURN when the heap is in use. */
rrc = (op == OP_END)? MATCH_MATCH : MATCH_ACCEPT;
- MRRETURN(rrc);
+ RRETURN(rrc);
/* Assertion brackets. Check the alternative branches in turn - the
matching won't pass the KET for an assertion. If any one branch matches,
@@ -1529,7 +1528,6 @@ for (;;)
if (rrc == MATCH_MATCH || rrc == MATCH_ACCEPT)
{
mstart = md->start_match_ptr; /* In case \K reset it */
- markptr = md->mark;
break;
}
@@ -1541,7 +1539,7 @@ for (;;)
}
while (*ecode == OP_ALT);
- if (*ecode == OP_KET) MRRETURN(MATCH_NOMATCH);
+ if (*ecode == OP_KET) RRETURN(MATCH_NOMATCH);
/* If checking an assertion for a condition, return MATCH_MATCH. */
@@ -1571,7 +1569,7 @@ for (;;)
do
{
RMATCH(eptr, ecode + 1 + LINK_SIZE, offset_top, md, NULL, RM5);
- if (rrc == MATCH_MATCH || rrc == MATCH_ACCEPT) MRRETURN(MATCH_NOMATCH);
+ if (rrc == MATCH_MATCH || rrc == MATCH_ACCEPT) RRETURN(MATCH_NOMATCH);
if (rrc == MATCH_SKIP || rrc == MATCH_PRUNE || rrc == MATCH_COMMIT)
{
do ecode += GET(ecode,1); while (*ecode == OP_ALT);
@@ -1604,7 +1602,7 @@ for (;;)
while (i-- > 0)
{
eptr--;
- if (eptr < md->start_subject) MRRETURN(MATCH_NOMATCH);
+ if (eptr < md->start_subject) RRETURN(MATCH_NOMATCH);
BACKCHAR(eptr);
}
}
@@ -1615,7 +1613,7 @@ for (;;)
{
eptr -= GET(ecode, 1);
- if (eptr < md->start_subject) MRRETURN(MATCH_NOMATCH);
+ if (eptr < md->start_subject) RRETURN(MATCH_NOMATCH);
}
/* Save the earliest consulted character, then skip to next op code */
@@ -1644,8 +1642,8 @@ for (;;)
cb.capture_top = offset_top/2;
cb.capture_last = md->capture_last;
cb.callout_data = md->callout_data;
- cb.mark = (unsigned char *)markptr;
- if ((rrc = (*pcre_callout)(&cb)) > 0) MRRETURN(MATCH_NOMATCH);
+ cb.mark = md->nomatch_mark;
+ if ((rrc = (*pcre_callout)(&cb)) > 0) RRETURN(MATCH_NOMATCH);
if (rrc < 0) RRETURN(rrc);
}
ecode += 2 + 2*LINK_SIZE;
@@ -1759,7 +1757,7 @@ for (;;)
md->recursive = new_recursive.prevrec;
if (new_recursive.offset_save != stacksave)
(pcre_free)(new_recursive.offset_save);
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
RECURSION_MATCHED:
@@ -1839,7 +1837,7 @@ for (;;)
md->end_match_ptr = eptr; /* For ONCE_NC */
md->end_offset_top = offset_top;
md->start_match_ptr = mstart;
- MRRETURN(MATCH_MATCH); /* Sets md->mark */
+ RRETURN(MATCH_MATCH); /* Sets md->mark */
}
/* For capturing groups we have to check the group number back at the start
@@ -1981,29 +1979,29 @@ for (;;)
/* Not multiline mode: start of subject assertion, unless notbol. */
case OP_CIRC:
- if (md->notbol && eptr == md->start_subject) MRRETURN(MATCH_NOMATCH);
+ if (md->notbol && eptr == md->start_subject) RRETURN(MATCH_NOMATCH);
/* Start of subject assertion */
case OP_SOD:
- if (eptr != md->start_subject) MRRETURN(MATCH_NOMATCH);
+ if (eptr != md->start_subject) RRETURN(MATCH_NOMATCH);
ecode++;
break;
/* Multiline mode: start of subject unless notbol, or after any newline. */
case OP_CIRCM:
- if (md->notbol && eptr == md->start_subject) MRRETURN(MATCH_NOMATCH);
+ if (md->notbol && eptr == md->start_subject) RRETURN(MATCH_NOMATCH);
if (eptr != md->start_subject &&
(eptr == md->end_subject || !WAS_NEWLINE(eptr)))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
ecode++;
break;
/* Start of match assertion */
case OP_SOM:
- if (eptr != md->start_subject + md->start_offset) MRRETURN(MATCH_NOMATCH);
+ if (eptr != md->start_subject + md->start_offset) RRETURN(MATCH_NOMATCH);
ecode++;
break;
@@ -2019,10 +2017,10 @@ for (;;)
case OP_DOLLM:
if (eptr < md->end_subject)
- { if (!IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH); }
+ { if (!IS_NEWLINE(eptr)) RRETURN(MATCH_NOMATCH); }
else
{
- if (md->noteol) MRRETURN(MATCH_NOMATCH);
+ if (md->noteol) RRETURN(MATCH_NOMATCH);
SCHECK_PARTIAL();
}
ecode++;
@@ -2032,7 +2030,7 @@ for (;;)
subject unless noteol is set. */
case OP_DOLL:
- if (md->noteol) MRRETURN(MATCH_NOMATCH);
+ if (md->noteol) RRETURN(MATCH_NOMATCH);
if (!md->endonly) goto ASSERT_NL_OR_EOS;
/* ... else fall through for endonly */
@@ -2040,7 +2038,7 @@ for (;;)
/* End of subject assertion (\z) */
case OP_EOD:
- if (eptr < md->end_subject) MRRETURN(MATCH_NOMATCH);
+ if (eptr < md->end_subject) RRETURN(MATCH_NOMATCH);
SCHECK_PARTIAL();
ecode++;
break;
@@ -2051,7 +2049,7 @@ for (;;)
ASSERT_NL_OR_EOS:
if (eptr < md->end_subject &&
(!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
/* Either at end of string or \n before end. */
@@ -2173,21 +2171,21 @@ for (;;)
if ((*ecode++ == OP_WORD_BOUNDARY)?
cur_is_word == prev_is_word : cur_is_word != prev_is_word)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
/* Match a single character type; inline for speed */
case OP_ANY:
- if (IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH);
+ if (IS_NEWLINE(eptr)) RRETURN(MATCH_NOMATCH);
/* Fall through */
case OP_ALLANY:
if (eptr >= md->end_subject) /* DO NOT merge the eptr++ here; it must */
{ /* not be updated before SCHECK_PARTIAL. */
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
eptr++;
#ifdef SUPPORT_UTF
@@ -2203,7 +2201,7 @@ for (;;)
if (eptr >= md->end_subject) /* DO NOT merge the eptr++ here; it must */
{ /* not be updated before SCHECK_PARTIAL. */
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
eptr++;
ecode++;
@@ -2213,7 +2211,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if (
@@ -2222,7 +2220,7 @@ for (;;)
#endif
(md->ctypes[c] & ctype_digit) != 0
)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
ecode++;
break;
@@ -2230,7 +2228,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if (
@@ -2239,7 +2237,7 @@ for (;;)
#endif
(md->ctypes[c] & ctype_digit) == 0
)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
ecode++;
break;
@@ -2247,7 +2245,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if (
@@ -2256,7 +2254,7 @@ for (;;)
#endif
(md->ctypes[c] & ctype_space) != 0
)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
ecode++;
break;
@@ -2264,7 +2262,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if (
@@ -2273,7 +2271,7 @@ for (;;)
#endif
(md->ctypes[c] & ctype_space) == 0
)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
ecode++;
break;
@@ -2281,7 +2279,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if (
@@ -2290,7 +2288,7 @@ for (;;)
#endif
(md->ctypes[c] & ctype_word) != 0
)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
ecode++;
break;
@@ -2298,7 +2296,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if (
@@ -2307,7 +2305,7 @@ for (;;)
#endif
(md->ctypes[c] & ctype_word) == 0
)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
ecode++;
break;
@@ -2315,12 +2313,12 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x000d:
if (eptr < md->end_subject && *eptr == 0x0a) eptr++;
@@ -2334,7 +2332,7 @@ for (;;)
case 0x0085:
case 0x2028:
case 0x2029:
- if (md->bsr_anycrlf) MRRETURN(MATCH_NOMATCH);
+ if (md->bsr_anycrlf) RRETURN(MATCH_NOMATCH);
break;
}
ecode++;
@@ -2344,7 +2342,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
switch(c)
@@ -2369,7 +2367,7 @@ for (;;)
case 0x202f: /* NARROW NO-BREAK SPACE */
case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
case 0x3000: /* IDEOGRAPHIC SPACE */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
ecode++;
break;
@@ -2378,12 +2376,12 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x09: /* HT */
case 0x20: /* SPACE */
case 0xa0: /* NBSP */
@@ -2412,7 +2410,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
switch(c)
@@ -2425,7 +2423,7 @@ for (;;)
case 0x85: /* NEL */
case 0x2028: /* LINE SEPARATOR */
case 0x2029: /* PARAGRAPH SEPARATOR */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
ecode++;
break;
@@ -2434,12 +2432,12 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x0a: /* LF */
case 0x0b: /* VT */
case 0x0c: /* FF */
@@ -2461,7 +2459,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
{
@@ -2470,29 +2468,29 @@ for (;;)
switch(ecode[1])
{
case PT_ANY:
- if (op == OP_NOTPROP) MRRETURN(MATCH_NOMATCH);
+ if (op == OP_NOTPROP) RRETURN(MATCH_NOMATCH);
break;
case PT_LAMP:
if ((prop->chartype == ucp_Lu ||
prop->chartype == ucp_Ll ||
prop->chartype == ucp_Lt) == (op == OP_NOTPROP))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case PT_GC:
if ((ecode[2] != PRIV(ucp_gentype)[prop->chartype]) == (op == OP_PROP))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case PT_PC:
if ((ecode[2] != prop->chartype) == (op == OP_PROP))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case PT_SC:
if ((ecode[2] != prop->script) == (op == OP_PROP))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
/* These are specials */
@@ -2500,14 +2498,14 @@ for (;;)
case PT_ALNUM:
if ((PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
PRIV(ucp_gentype)[prop->chartype] == ucp_N) == (op == OP_NOTPROP))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case PT_SPACE: /* Perl space */
if ((PRIV(ucp_gentype)[prop->chartype] == ucp_Z ||
c == CHAR_HT || c == CHAR_NL || c == CHAR_FF || c == CHAR_CR)
== (op == OP_NOTPROP))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case PT_PXSPACE: /* POSIX space */
@@ -2515,14 +2513,14 @@ for (;;)
c == CHAR_HT || c == CHAR_NL || c == CHAR_VT ||
c == CHAR_FF || c == CHAR_CR)
== (op == OP_NOTPROP))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case PT_WORD:
if ((PRIV(ucp_gentype)[prop->chartype] == ucp_L ||
PRIV(ucp_gentype)[prop->chartype] == ucp_N ||
c == CHAR_UNDERSCORE) == (op == OP_NOTPROP))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
/* This should never occur */
@@ -2542,10 +2540,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
- if (UCD_CATEGORY(c) == ucp_M) MRRETURN(MATCH_NOMATCH);
+ if (UCD_CATEGORY(c) == ucp_M) RRETURN(MATCH_NOMATCH);
while (eptr < md->end_subject)
{
int len = 1;
@@ -2619,7 +2617,7 @@ for (;;)
if ((length = match_ref(offset, eptr, length, md, caseless)) < 0)
{
CHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
eptr += length;
continue; /* With the main loop */
@@ -2640,7 +2638,7 @@ for (;;)
if ((slength = match_ref(offset, eptr, length, md, caseless)) < 0)
{
CHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
eptr += slength;
}
@@ -2659,11 +2657,11 @@ for (;;)
int slength;
RMATCH(eptr, ecode, offset_top, md, eptrb, RM14);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if ((slength = match_ref(offset, eptr, length, md, caseless)) < 0)
{
CHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
eptr += slength;
}
@@ -2691,7 +2689,7 @@ for (;;)
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
eptr -= length;
}
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -2754,15 +2752,15 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(c, eptr);
if (c > 255)
{
- if (op == OP_CLASS) MRRETURN(MATCH_NOMATCH);
+ if (op == OP_CLASS) RRETURN(MATCH_NOMATCH);
}
else
- if ((BYTE_MAP[c/8] & (1 << (c&7))) == 0) MRRETURN(MATCH_NOMATCH);
+ if ((BYTE_MAP[c/8] & (1 << (c&7))) == 0) RRETURN(MATCH_NOMATCH);
}
}
else
@@ -2774,17 +2772,17 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
c = *eptr++;
#ifndef COMPILE_PCRE8
if (c > 255)
{
- if (op == OP_CLASS) MRRETURN(MATCH_NOMATCH);
+ if (op == OP_CLASS) RRETURN(MATCH_NOMATCH);
}
else
#endif
- if ((BYTE_MAP[c/8] & (1 << (c&7))) == 0) MRRETURN(MATCH_NOMATCH);
+ if ((BYTE_MAP[c/8] & (1 << (c&7))) == 0) RRETURN(MATCH_NOMATCH);
}
}
@@ -2805,19 +2803,19 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM16);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(c, eptr);
if (c > 255)
{
- if (op == OP_CLASS) MRRETURN(MATCH_NOMATCH);
+ if (op == OP_CLASS) RRETURN(MATCH_NOMATCH);
}
else
- if ((BYTE_MAP[c/8] & (1 << (c&7))) == 0) MRRETURN(MATCH_NOMATCH);
+ if ((BYTE_MAP[c/8] & (1 << (c&7))) == 0) RRETURN(MATCH_NOMATCH);
}
}
else
@@ -2828,21 +2826,21 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM17);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
c = *eptr++;
#ifndef COMPILE_PCRE8
if (c > 255)
{
- if (op == OP_CLASS) MRRETURN(MATCH_NOMATCH);
+ if (op == OP_CLASS) RRETURN(MATCH_NOMATCH);
}
else
#endif
- if ((BYTE_MAP[c/8] & (1 << (c&7))) == 0) MRRETURN(MATCH_NOMATCH);
+ if ((BYTE_MAP[c/8] & (1 << (c&7))) == 0) RRETURN(MATCH_NOMATCH);
}
}
/* Control never gets here */
@@ -2912,7 +2910,7 @@ for (;;)
}
}
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
#undef BYTE_MAP
}
@@ -2965,10 +2963,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
- if (!PRIV(xclass)(c, data, utf)) MRRETURN(MATCH_NOMATCH);
+ if (!PRIV(xclass)(c, data, utf)) RRETURN(MATCH_NOMATCH);
}
/* If max == min we can continue with the main loop without the
@@ -2985,14 +2983,14 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM20);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
- if (!PRIV(xclass)(c, data, utf)) MRRETURN(MATCH_NOMATCH);
+ if (!PRIV(xclass)(c, data, utf)) RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
}
@@ -3027,7 +3025,7 @@ for (;;)
if (utf) BACKCHAR(eptr);
#endif
}
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -3046,9 +3044,9 @@ for (;;)
if (length > md->end_subject - eptr)
{
CHECK_PARTIAL(); /* Not SCHECK_PARTIAL() */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- while (length-- > 0) if (*ecode++ != *eptr++) MRRETURN(MATCH_NOMATCH);
+ while (length-- > 0) if (*ecode++ != *eptr++) RRETURN(MATCH_NOMATCH);
}
else
#endif
@@ -3057,16 +3055,23 @@ for (;;)
if (md->end_subject - eptr < 1)
{
SCHECK_PARTIAL(); /* This one can use SCHECK_PARTIAL() */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (ecode[1] != *eptr++) MRRETURN(MATCH_NOMATCH);
+ if (ecode[1] != *eptr++) RRETURN(MATCH_NOMATCH);
ecode += 2;
}
break;
- /* Match a single character, caselessly */
+ /* Match a single character, caselessly. If we are at the end of the
+ subject, give up immediately. */
case OP_CHARI:
+ if (eptr >= md->end_subject)
+ {
+ SCHECK_PARTIAL();
+ RRETURN(MATCH_NOMATCH);
+ }
+
#ifdef SUPPORT_UTF
if (utf)
{
@@ -3074,24 +3079,22 @@ for (;;)
ecode++;
GETCHARLEN(fc, ecode, length);
- if (length > md->end_subject - eptr)
- {
- CHECK_PARTIAL(); /* Not SCHECK_PARTIAL() */
- MRRETURN(MATCH_NOMATCH);
- }
-
/* If the pattern character's value is < 128, we have only one byte, and
- can use the fast lookup table. */
+ we know that its other case must also be one byte long, so we can use the
+ fast lookup table. We know that there is at least one byte left in the
+ subject. */
if (fc < 128)
{
if (md->lcc[fc]
- != TABLE_GET(*eptr, md->lcc, *eptr)) MRRETURN(MATCH_NOMATCH);
+ != TABLE_GET(*eptr, md->lcc, *eptr)) RRETURN(MATCH_NOMATCH);
ecode++;
eptr++;
}
- /* Otherwise we must pick up the subject character */
+ /* Otherwise we must pick up the subject character. Note that we cannot
+ use the value of "length" to check for sufficient bytes left, because the
+ other case of the character may have more or fewer bytes. */
else
{
@@ -3107,7 +3110,7 @@ for (;;)
#ifdef SUPPORT_UCP
if (dc != UCD_OTHERCASE(fc))
#endif
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
}
}
@@ -3116,13 +3119,8 @@ for (;;)
/* Not UTF mode */
{
- if (md->end_subject - eptr < 1)
- {
- SCHECK_PARTIAL(); /* This one can use SCHECK_PARTIAL() */
- MRRETURN(MATCH_NOMATCH);
- }
if (TABLE_GET(ecode[1], md->lcc, ecode[1])
- != TABLE_GET(*eptr, md->lcc, *eptr)) MRRETURN(MATCH_NOMATCH);
+ != TABLE_GET(*eptr, md->lcc, *eptr)) RRETURN(MATCH_NOMATCH);
eptr++;
ecode += 2;
}
@@ -3229,7 +3227,7 @@ for (;;)
else
{
CHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
}
@@ -3241,7 +3239,7 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM22);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr <= md->end_subject - length &&
memcmp(eptr, charptr, IN_UCHARS(length)) == 0) eptr += length;
#ifdef SUPPORT_UCP
@@ -3252,7 +3250,7 @@ for (;;)
else
{
CHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
}
/* Control never gets here */
@@ -3283,7 +3281,7 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM23);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (eptr == pp) { MRRETURN(MATCH_NOMATCH); }
+ if (eptr == pp) { RRETURN(MATCH_NOMATCH); }
#ifdef SUPPORT_UCP
eptr--;
BACKCHAR(eptr);
@@ -3340,9 +3338,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (fc != *eptr && foc != *eptr) MRRETURN(MATCH_NOMATCH);
+ if (fc != *eptr && foc != *eptr) RRETURN(MATCH_NOMATCH);
eptr++;
}
if (min == max) continue;
@@ -3352,13 +3350,13 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM24);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (fc != *eptr && foc != *eptr) MRRETURN(MATCH_NOMATCH);
+ if (fc != *eptr && foc != *eptr) RRETURN(MATCH_NOMATCH);
eptr++;
}
/* Control never gets here */
@@ -3385,7 +3383,7 @@ for (;;)
eptr--;
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
}
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
}
@@ -3399,9 +3397,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (fc != *eptr++) MRRETURN(MATCH_NOMATCH);
+ if (fc != *eptr++) RRETURN(MATCH_NOMATCH);
}
if (min == max) continue;
@@ -3412,13 +3410,13 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM26);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (fc != *eptr++) MRRETURN(MATCH_NOMATCH);
+ if (fc != *eptr++) RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
}
@@ -3443,7 +3441,7 @@ for (;;)
eptr--;
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
}
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
}
/* Control never gets here */
@@ -3456,7 +3454,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
ecode++;
GETCHARINCTEST(c, eptr);
@@ -3480,11 +3478,11 @@ for (;;)
#endif /* SUPPORT_UTF */
och = TABLE_GET(ch, md->fcc, ch);
#endif /* COMPILE_PCRE8 */
- if (ch == c || och == c) MRRETURN(MATCH_NOMATCH);
+ if (ch == c || och == c) RRETURN(MATCH_NOMATCH);
}
else /* Caseful */
{
- if (*ecode++ == c) MRRETURN(MATCH_NOMATCH);
+ if (*ecode++ == c) RRETURN(MATCH_NOMATCH);
}
break;
@@ -3605,10 +3603,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(d, eptr);
- if (fc == d || foc == d) MRRETURN(MATCH_NOMATCH);
+ if (fc == d || foc == d) RRETURN(MATCH_NOMATCH);
}
}
else
@@ -3620,9 +3618,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (fc == *eptr || foc == *eptr) MRRETURN(MATCH_NOMATCH);
+ if (fc == *eptr || foc == *eptr) RRETURN(MATCH_NOMATCH);
eptr++;
}
}
@@ -3639,14 +3637,14 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM28);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(d, eptr);
- if (fc == d || foc == d) MRRETURN(MATCH_NOMATCH);
+ if (fc == d || foc == d) RRETURN(MATCH_NOMATCH);
}
}
else
@@ -3657,13 +3655,13 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM29);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (fc == *eptr || foc == *eptr) MRRETURN(MATCH_NOMATCH);
+ if (fc == *eptr || foc == *eptr) RRETURN(MATCH_NOMATCH);
eptr++;
}
}
@@ -3724,7 +3722,7 @@ for (;;)
}
}
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
}
@@ -3742,10 +3740,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(d, eptr);
- if (fc == d) MRRETURN(MATCH_NOMATCH);
+ if (fc == d) RRETURN(MATCH_NOMATCH);
}
}
else
@@ -3757,9 +3755,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (fc == *eptr++) MRRETURN(MATCH_NOMATCH);
+ if (fc == *eptr++) RRETURN(MATCH_NOMATCH);
}
}
@@ -3775,14 +3773,14 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM32);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(d, eptr);
- if (fc == d) MRRETURN(MATCH_NOMATCH);
+ if (fc == d) RRETURN(MATCH_NOMATCH);
}
}
else
@@ -3793,13 +3791,13 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM33);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (fc == *eptr++) MRRETURN(MATCH_NOMATCH);
+ if (fc == *eptr++) RRETURN(MATCH_NOMATCH);
}
}
/* Control never gets here */
@@ -3859,7 +3857,7 @@ for (;;)
}
}
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
}
/* Control never gets here */
@@ -3953,13 +3951,13 @@ for (;;)
switch(prop_type)
{
case PT_ANY:
- if (prop_fail_result) MRRETURN(MATCH_NOMATCH);
+ if (prop_fail_result) RRETURN(MATCH_NOMATCH);
for (i = 1; i <= min; i++)
{
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
}
@@ -3972,14 +3970,14 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
chartype = UCD_CHARTYPE(c);
if ((chartype == ucp_Lu ||
chartype == ucp_Ll ||
chartype == ucp_Lt) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -3989,11 +3987,11 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_CATEGORY(c) == prop_value) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4003,11 +4001,11 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_CHARTYPE(c) == prop_value) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4017,11 +4015,11 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_SCRIPT(c) == prop_value) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4032,12 +4030,12 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
category = UCD_CATEGORY(c);
if ((category == ucp_L || category == ucp_N) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4047,13 +4045,13 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_CATEGORY(c) == ucp_Z || c == CHAR_HT || c == CHAR_NL ||
c == CHAR_FF || c == CHAR_CR)
== prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4063,13 +4061,13 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_CATEGORY(c) == ucp_Z || c == CHAR_HT || c == CHAR_NL ||
c == CHAR_VT || c == CHAR_FF || c == CHAR_CR)
== prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4080,13 +4078,13 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
category = UCD_CATEGORY(c);
if ((category == ucp_L || category == ucp_N || c == CHAR_UNDERSCORE)
== prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4107,10 +4105,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
- if (UCD_CATEGORY(c) == ucp_M) MRRETURN(MATCH_NOMATCH);
+ if (UCD_CATEGORY(c) == ucp_M) RRETURN(MATCH_NOMATCH);
while (eptr < md->end_subject)
{
int len = 1;
@@ -4135,9 +4133,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH);
+ if (IS_NEWLINE(eptr)) RRETURN(MATCH_NOMATCH);
eptr++;
ACROSSCHAR(eptr < md->end_subject, *eptr, eptr++);
}
@@ -4149,7 +4147,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
eptr++;
ACROSSCHAR(eptr < md->end_subject, *eptr, eptr++);
@@ -4157,7 +4155,7 @@ for (;;)
break;
case OP_ANYBYTE:
- if (eptr > md->end_subject - min) MRRETURN(MATCH_NOMATCH);
+ if (eptr > md->end_subject - min) RRETURN(MATCH_NOMATCH);
eptr += min;
break;
@@ -4167,12 +4165,12 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(c, eptr);
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x000d:
if (eptr < md->end_subject && *eptr == 0x0a) eptr++;
@@ -4186,7 +4184,7 @@ for (;;)
case 0x0085:
case 0x2028:
case 0x2029:
- if (md->bsr_anycrlf) MRRETURN(MATCH_NOMATCH);
+ if (md->bsr_anycrlf) RRETURN(MATCH_NOMATCH);
break;
}
}
@@ -4198,7 +4196,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(c, eptr);
switch(c)
@@ -4223,7 +4221,7 @@ for (;;)
case 0x202f: /* NARROW NO-BREAK SPACE */
case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
case 0x3000: /* IDEOGRAPHIC SPACE */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
}
break;
@@ -4234,12 +4232,12 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(c, eptr);
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x09: /* HT */
case 0x20: /* SPACE */
case 0xa0: /* NBSP */
@@ -4270,7 +4268,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(c, eptr);
switch(c)
@@ -4283,7 +4281,7 @@ for (;;)
case 0x85: /* NEL */
case 0x2028: /* LINE SEPARATOR */
case 0x2029: /* PARAGRAPH SEPARATOR */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
}
break;
@@ -4294,12 +4292,12 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(c, eptr);
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x0a: /* LF */
case 0x0b: /* VT */
case 0x0c: /* FF */
@@ -4318,11 +4316,11 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINC(c, eptr);
if (c < 128 && (md->ctypes[c] & ctype_digit) != 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4332,10 +4330,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
if (*eptr >= 128 || (md->ctypes[*eptr++] & ctype_digit) == 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
/* No need to skip more bytes - we know it's a 1-byte character */
}
break;
@@ -4346,10 +4344,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
if (*eptr < 128 && (md->ctypes[*eptr] & ctype_space) != 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
eptr++;
ACROSSCHAR(eptr < md->end_subject, *eptr, eptr++);
}
@@ -4361,10 +4359,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
if (*eptr >= 128 || (md->ctypes[*eptr++] & ctype_space) == 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
/* No need to skip more bytes - we know it's a 1-byte character */
}
break;
@@ -4375,10 +4373,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
if (*eptr < 128 && (md->ctypes[*eptr] & ctype_word) != 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
eptr++;
ACROSSCHAR(eptr < md->end_subject, *eptr, eptr++);
}
@@ -4390,10 +4388,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
if (*eptr >= 128 || (md->ctypes[*eptr++] & ctype_word) == 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
/* No need to skip more bytes - we know it's a 1-byte character */
}
break;
@@ -4416,9 +4414,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if (IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH);
+ if (IS_NEWLINE(eptr)) RRETURN(MATCH_NOMATCH);
eptr++;
}
break;
@@ -4427,7 +4425,7 @@ for (;;)
if (eptr > md->end_subject - min)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
eptr += min;
break;
@@ -4436,7 +4434,7 @@ for (;;)
if (eptr > md->end_subject - min)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
eptr += min;
break;
@@ -4447,11 +4445,11 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
switch(*eptr++)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x000d:
if (eptr < md->end_subject && *eptr == 0x0a) eptr++;
@@ -4463,7 +4461,7 @@ for (;;)
case 0x000b:
case 0x000c:
case 0x0085:
- if (md->bsr_anycrlf) MRRETURN(MATCH_NOMATCH);
+ if (md->bsr_anycrlf) RRETURN(MATCH_NOMATCH);
break;
}
}
@@ -4475,7 +4473,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
switch(*eptr++)
{
@@ -4483,7 +4481,7 @@ for (;;)
case 0x09: /* HT */
case 0x20: /* SPACE */
case 0xa0: /* NBSP */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
}
break;
@@ -4494,11 +4492,11 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
switch(*eptr++)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x09: /* HT */
case 0x20: /* SPACE */
case 0xa0: /* NBSP */
@@ -4513,7 +4511,7 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
switch(*eptr++)
{
@@ -4523,7 +4521,7 @@ for (;;)
case 0x0c: /* FF */
case 0x0d: /* CR */
case 0x85: /* NEL */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
}
break;
@@ -4534,11 +4532,11 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
switch(*eptr++)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x0a: /* LF */
case 0x0b: /* VT */
case 0x0c: /* FF */
@@ -4555,9 +4553,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if ((md->ctypes[*eptr++] & ctype_digit) != 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[*eptr++] & ctype_digit) != 0) RRETURN(MATCH_NOMATCH);
}
break;
@@ -4567,9 +4565,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if ((md->ctypes[*eptr++] & ctype_digit) == 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[*eptr++] & ctype_digit) == 0) RRETURN(MATCH_NOMATCH);
}
break;
@@ -4579,9 +4577,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if ((md->ctypes[*eptr++] & ctype_space) != 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[*eptr++] & ctype_space) != 0) RRETURN(MATCH_NOMATCH);
}
break;
@@ -4591,9 +4589,9 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
- if ((md->ctypes[*eptr++] & ctype_space) == 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[*eptr++] & ctype_space) == 0) RRETURN(MATCH_NOMATCH);
}
break;
@@ -4603,10 +4601,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
if ((md->ctypes[*eptr++] & ctype_word) != 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4616,10 +4614,10 @@ for (;;)
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
if ((md->ctypes[*eptr++] & ctype_word) == 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
@@ -4648,14 +4646,14 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM36);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
- if (prop_fail_result) MRRETURN(MATCH_NOMATCH);
+ if (prop_fail_result) RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -4665,18 +4663,18 @@ for (;;)
int chartype;
RMATCH(eptr, ecode, offset_top, md, eptrb, RM37);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
chartype = UCD_CHARTYPE(c);
if ((chartype == ucp_Lu ||
chartype == ucp_Ll ||
chartype == ucp_Lt) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -4685,15 +4683,15 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM38);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_CATEGORY(c) == prop_value) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -4702,15 +4700,15 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM39);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_CHARTYPE(c) == prop_value) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -4719,15 +4717,15 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM40);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_SCRIPT(c) == prop_value) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -4737,16 +4735,16 @@ for (;;)
int category;
RMATCH(eptr, ecode, offset_top, md, eptrb, RM59);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
category = UCD_CATEGORY(c);
if ((category == ucp_L || category == ucp_N) == prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -4755,17 +4753,17 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM60);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_CATEGORY(c) == ucp_Z || c == CHAR_HT || c == CHAR_NL ||
c == CHAR_FF || c == CHAR_CR)
== prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -4774,17 +4772,17 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM61);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
if ((UCD_CATEGORY(c) == ucp_Z || c == CHAR_HT || c == CHAR_NL ||
c == CHAR_VT || c == CHAR_FF || c == CHAR_CR)
== prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -4794,11 +4792,11 @@ for (;;)
int category;
RMATCH(eptr, ecode, offset_top, md, eptrb, RM62);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
category = UCD_CATEGORY(c);
@@ -4806,7 +4804,7 @@ for (;;)
category == ucp_N ||
c == CHAR_UNDERSCORE)
== prop_fail_result)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -4826,14 +4824,14 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM41);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
GETCHARINCTEST(c, eptr);
- if (UCD_CATEGORY(c) == ucp_M) MRRETURN(MATCH_NOMATCH);
+ if (UCD_CATEGORY(c) == ucp_M) RRETURN(MATCH_NOMATCH);
while (eptr < md->end_subject)
{
int len = 1;
@@ -4853,14 +4851,14 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM42);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
if (ctype == OP_ANY && IS_NEWLINE(eptr))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
GETCHARINC(c, eptr);
switch(ctype)
{
@@ -4872,7 +4870,7 @@ for (;;)
case OP_ANYNL:
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x000d:
if (eptr < md->end_subject && *eptr == 0x0a) eptr++;
break;
@@ -4884,7 +4882,7 @@ for (;;)
case 0x0085:
case 0x2028:
case 0x2029:
- if (md->bsr_anycrlf) MRRETURN(MATCH_NOMATCH);
+ if (md->bsr_anycrlf) RRETURN(MATCH_NOMATCH);
break;
}
break;
@@ -4912,14 +4910,14 @@ for (;;)
case 0x202f: /* NARROW NO-BREAK SPACE */
case 0x205f: /* MEDIUM MATHEMATICAL SPACE */
case 0x3000: /* IDEOGRAPHIC SPACE */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
case OP_HSPACE:
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x09: /* HT */
case 0x20: /* SPACE */
case 0xa0: /* NBSP */
@@ -4954,14 +4952,14 @@ for (;;)
case 0x85: /* NEL */
case 0x2028: /* LINE SEPARATOR */
case 0x2029: /* PARAGRAPH SEPARATOR */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
case OP_VSPACE:
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x0a: /* LF */
case 0x0b: /* VT */
case 0x0c: /* FF */
@@ -4975,32 +4973,32 @@ for (;;)
case OP_NOT_DIGIT:
if (c < 256 && (md->ctypes[c] & ctype_digit) != 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case OP_DIGIT:
if (c >= 256 || (md->ctypes[c] & ctype_digit) == 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case OP_NOT_WHITESPACE:
if (c < 256 && (md->ctypes[c] & ctype_space) != 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case OP_WHITESPACE:
if (c >= 256 || (md->ctypes[c] & ctype_space) == 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case OP_NOT_WORDCHAR:
if (c < 256 && (md->ctypes[c] & ctype_word) != 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
case OP_WORDCHAR:
if (c >= 256 || (md->ctypes[c] & ctype_word) == 0)
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
break;
default:
@@ -5016,14 +5014,14 @@ for (;;)
{
RMATCH(eptr, ecode, offset_top, md, eptrb, RM43);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- if (fi >= max) MRRETURN(MATCH_NOMATCH);
+ if (fi >= max) RRETURN(MATCH_NOMATCH);
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
if (ctype == OP_ANY && IS_NEWLINE(eptr))
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
c = *eptr++;
switch(ctype)
{
@@ -5035,7 +5033,7 @@ for (;;)
case OP_ANYNL:
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x000d:
if (eptr < md->end_subject && *eptr == 0x0a) eptr++;
break;
@@ -5046,7 +5044,7 @@ for (;;)
case 0x000b:
case 0x000c:
case 0x0085:
- if (md->bsr_anycrlf) MRRETURN(MATCH_NOMATCH);
+ if (md->bsr_anycrlf) RRETURN(MATCH_NOMATCH);
break;
}
break;
@@ -5058,14 +5056,14 @@ for (;;)
case 0x09: /* HT */
case 0x20: /* SPACE */
case 0xa0: /* NBSP */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
case OP_HSPACE:
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x09: /* HT */
case 0x20: /* SPACE */
case 0xa0: /* NBSP */
@@ -5082,14 +5080,14 @@ for (;;)
case 0x0c: /* FF */
case 0x0d: /* CR */
case 0x85: /* NEL */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
break;
case OP_VSPACE:
switch(c)
{
- default: MRRETURN(MATCH_NOMATCH);
+ default: RRETURN(MATCH_NOMATCH);
case 0x0a: /* LF */
case 0x0b: /* VT */
case 0x0c: /* FF */
@@ -5100,27 +5098,27 @@ for (;;)
break;
case OP_NOT_DIGIT:
- if ((md->ctypes[c] & ctype_digit) != 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[c] & ctype_digit) != 0) RRETURN(MATCH_NOMATCH);
break;
case OP_DIGIT:
- if ((md->ctypes[c] & ctype_digit) == 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[c] & ctype_digit) == 0) RRETURN(MATCH_NOMATCH);
break;
case OP_NOT_WHITESPACE:
- if ((md->ctypes[c] & ctype_space) != 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[c] & ctype_space) != 0) RRETURN(MATCH_NOMATCH);
break;
case OP_WHITESPACE:
- if ((md->ctypes[c] & ctype_space) == 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[c] & ctype_space) == 0) RRETURN(MATCH_NOMATCH);
break;
case OP_NOT_WORDCHAR:
- if ((md->ctypes[c] & ctype_word) != 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[c] & ctype_word) != 0) RRETURN(MATCH_NOMATCH);
break;
case OP_WORDCHAR:
- if ((md->ctypes[c] & ctype_word) == 0) MRRETURN(MATCH_NOMATCH);
+ if ((md->ctypes[c] & ctype_word) == 0) RRETURN(MATCH_NOMATCH);
break;
default:
@@ -5859,7 +5857,7 @@ for (;;)
/* Get here if we can't make it match with any permitted repetitions */
- MRRETURN(MATCH_NOMATCH);
+ RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
@@ -6165,6 +6163,7 @@ end_subject = md->end_subject;
md->endonly = (re->options & PCRE_DOLLAR_ENDONLY) != 0;
md->use_ucp = (re->options & PCRE_UCP) != 0;
md->jscript_compat = (re->options & PCRE_JAVASCRIPT_COMPAT) != 0;
+md->ignore_skip_arg = FALSE;
/* Some options are unpacked into BOOL variables in the hope that testing
them will be faster than individual option bits. */
@@ -6175,7 +6174,7 @@ md->notempty = (options & PCRE_NOTEMPTY) != 0;
md->notempty_atstart = (options & PCRE_NOTEMPTY_ATSTART) != 0;
md->hitend = FALSE;
-md->mark = NULL; /* In case never set */
+md->mark = md->nomatch_mark = NULL; /* In case never set */
md->recursive = NULL; /* No recursion at top level */
md->hasthen = (re->flags & PCRE_HASTHEN) != 0;
@@ -6546,11 +6545,23 @@ for(;;)
md->match_call_count = 0;
md->match_function_type = 0;
md->end_offset_top = 0;
- rc = match(start_match, md->start_code, start_match, NULL, 2, md, NULL, 0);
+ rc = match(start_match, md->start_code, start_match, 2, md, NULL, 0);
if (md->hitend && start_partial == NULL) start_partial = md->start_used_ptr;
switch(rc)
{
+ /* If MATCH_SKIP_ARG reaches this level it means that a MARK that matched
+ the SKIP's arg was not found. In this circumstance, Perl ignores the SKIP
+ entirely. The only way we can do that is to re-do the match at the same
+ point, with a flag to force SKIP with an argument to be ignored. Just
+ treating this case as NOMATCH does not work because it does not check other
+ alternatives in patterns such as A(*SKIP:A)B|AC when the subject is AC. */
+
+ case MATCH_SKIP_ARG:
+ new_start_match = start_match;
+ md->ignore_skip_arg = TRUE;
+ break;
+
/* SKIP passes back the next starting point explicitly, but if it is the
same as the match we have just done, treat it as NOMATCH. */
@@ -6562,18 +6573,13 @@ for(;;)
}
/* Fall through */
- /* If MATCH_SKIP_ARG reaches this level it means that a MARK that matched
- the SKIP's arg was not found. We also treat this as NOMATCH. */
-
- case MATCH_SKIP_ARG:
- /* Fall through */
-
/* NOMATCH and PRUNE advance by one character. THEN at this level acts
- exactly like PRUNE. */
+ exactly like PRUNE. Unset the ignore SKIP-with-argument flag. */
case MATCH_NOMATCH:
case MATCH_PRUNE:
case MATCH_THEN:
+ md->ignore_skip_arg = FALSE;
new_start_match = start_match + 1;
#ifdef SUPPORT_UTF
if (utf)
@@ -6702,8 +6708,12 @@ if (rc == MATCH_MATCH || rc == MATCH_ACCEPT)
offsets[1] = (int)(md->end_match_ptr - md->start_subject);
}
+ /* Return MARK data if requested */
+
+ if (extra_data != NULL && (extra_data->flags & PCRE_EXTRA_MARK) != 0)
+ *(extra_data->mark) = (unsigned char *)(md->mark);
DPRINTF((">>>> returning %d\n", rc));
- goto RETURN_MARK;
+ return rc;
}
/* Control gets here if there has been an error, or if the overall match
@@ -6747,10 +6757,8 @@ else
/* Return the MARK data if it has been requested. */
-RETURN_MARK:
-
if (extra_data != NULL && (extra_data->flags & PCRE_EXTRA_MARK) != 0)
- *(extra_data->mark) = (unsigned char *)(md->mark);
+ *(extra_data->mark) = (unsigned char *)(md->nomatch_mark);
return rc;
}
diff --git a/pcre_fullinfo.c b/pcre_fullinfo.c
index 078f5fd..39ee028 100644
--- a/pcre_fullinfo.c
+++ b/pcre_fullinfo.c
@@ -107,6 +107,19 @@ switch (what)
*((size_t *)where) = (study == NULL)? 0 : study->size;
break;
+ case PCRE_INFO_JITSIZE:
+#ifdef SUPPORT_JIT
+ *((size_t *)where) =
+ (extra_data != NULL &&
+ (extra_data->flags & PCRE_EXTRA_EXECUTABLE_JIT) != 0 &&
+ extra_data->executable_jit != NULL)?
+ PRIV(jit_get_size)(extra_data->executable_jit) : 0;
+#else
+ *((size_t *)where) = 0;
+#endif
+
+ break;
+
case PCRE_INFO_CAPTURECOUNT:
*((int *)where) = re->top_bracket;
break;
diff --git a/pcre_internal.h b/pcre_internal.h
index fa0fb8b..da0a826 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -1935,7 +1935,7 @@ enum { ERR0, ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9,
ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69,
- ERR70, ERR71, ERRCOUNT };
+ ERR70, ERR71, ERR72, ERR73, ERRCOUNT };
/* The real format of the start of the pcre block; the index of names and the
code vector run on as long as necessary after the end. We store an explicit
@@ -2011,6 +2011,7 @@ typedef struct compile_data {
pcre_uchar *name_table; /* The name/number table */
int names_found; /* Number of entries so far */
int name_entry_size; /* Size of each entry */
+ int workspace_size; /* Size of workspace */
int bracount; /* Count of capturing parens as we compile */
int final_bracount; /* Saved value after first pass */
int top_backref; /* Maximum back reference */
@@ -2095,6 +2096,7 @@ typedef struct match_data {
BOOL hitend; /* Hit the end of the subject at some point */
BOOL bsr_anycrlf; /* \R is just any CRLF, not full Unicode */
BOOL hasthen; /* Pattern contains (*THEN) */
+ BOOL ignore_skip_arg; /* For re-run when SKIP name not found */
const pcre_uchar *start_code; /* For use when recursing */
PCRE_PUCHAR start_subject; /* Start of the subject string */
PCRE_PUCHAR end_subject; /* End of the subject string */
@@ -2110,7 +2112,8 @@ typedef struct match_data {
int eptrn; /* Next free eptrblock */
recursion_info *recursive; /* Linked list of recursion data */
void *callout_data; /* To pass back to callouts */
- const pcre_uchar *mark; /* Mark pointer to pass back */
+ const pcre_uchar *mark; /* Mark pointer to pass back on success */
+ const pcre_uchar *nomatch_mark;/* Mark pointer to pass back on failure */
const pcre_uchar *once_target; /* Where to back up to for atomic groups */
} match_data;
@@ -2271,6 +2274,7 @@ extern void PRIV(jit_compile)(const real_pcre *, pcre_extra *);
extern int PRIV(jit_exec)(const real_pcre *, void *,
const pcre_uchar *, int, int, int, int, int *, int);
extern void PRIV(jit_free)(void *);
+extern int PRIV(jit_get_size)(void *);
#endif
/* Unicode character database (UCD) */
diff --git a/pcre_jit_compile.c b/pcre_jit_compile.c
index 50376b9..a1b66c6 100644
--- a/pcre_jit_compile.c
+++ b/pcre_jit_compile.c
@@ -166,6 +166,7 @@ typedef struct executable_function {
void *executable_func;
pcre_jit_callback callback;
void *userdata;
+ sljit_uw executable_size;
} executable_function;
typedef struct jump_list {
@@ -5947,7 +5948,8 @@ if (has_alternatives)
{
SLJIT_ASSERT(opcode == OP_COND || opcode == OP_SCOND);
assert = CURRENT_AS(bracket_fallback)->u.assert;
- if (assert->framesize >= 0 && (ccbegin[1 + LINK_SIZE] == OP_ASSERT_NOT || ccbegin[1 + LINK_SIZE] == OP_ASSERTBACK_NOT))
+ if ((ccbegin[1 + LINK_SIZE] == OP_ASSERT_NOT || ccbegin[1 + LINK_SIZE] == OP_ASSERTBACK_NOT) && assert->framesize >= 0)
+
{
OP1(SLJIT_MOV, STACK_TOP, 0, SLJIT_MEM1(SLJIT_LOCALS_REG), assert->localptr);
add_jump(compiler, &common->revertframes, JUMP(SLJIT_FAST_CALL));
@@ -6351,6 +6353,7 @@ pcre_study_data *study;
pcre_uchar *ccend;
executable_function *function;
void *executable_func;
+sljit_uw executable_size;
struct sljit_label *leave;
struct sljit_label *mainloop = NULL;
struct sljit_label *empty_match_found;
@@ -6685,6 +6688,7 @@ if (common->getucd != NULL)
SLJIT_FREE(common->localptrs);
executable_func = sljit_generate_code(compiler);
+executable_size = sljit_get_generated_code_size(compiler);
sljit_free_compiler(compiler);
if (executable_func == NULL)
return;
@@ -6699,6 +6703,7 @@ if (function == NULL)
}
function->executable_func = executable_func;
+function->executable_size = executable_size;
function->callback = NULL;
function->userdata = NULL;
extra->executable_jit = function;
@@ -6787,6 +6792,12 @@ sljit_free_code(function->executable_func);
SLJIT_FREE(function);
}
+int
+PRIV(jit_get_size)(void *executable_func)
+{
+return ((executable_function*)executable_func)->executable_size;
+}
+
#ifdef COMPILE_PCRE8
PCRE_EXP_DECL pcre_jit_stack *
pcre_jit_stack_alloc(int startsize, int maxsize)
diff --git a/pcre_study.c b/pcre_study.c
index 493108e..a2b1c06 100644
--- a/pcre_study.c
+++ b/pcre_study.c
@@ -288,8 +288,8 @@ for (;;)
cc++;
break;
- /* The single-byte matcher means we can't proceed in UTF-8 mode. (In
- non-UTF-8 mode \C will actually be turned into OP_ALLANY, so won't ever
+ /* The single-byte matcher means we can't proceed in UTF-8 mode. (In
+ non-UTF-8 mode \C will actually be turned into OP_ALLANY, so won't ever
appear, but leave the code, just in case.) */
case OP_ANYBYTE:
@@ -1424,11 +1424,16 @@ if (bits_set || min > 0
study->size = sizeof(pcre_study_data);
study->flags = 0;
+ /* Set the start bits always, to avoid unset memory errors if the
+ study data is written to a file, but set the flag only if any of the bits
+ are set, to save time looking when none are. */
+
if (bits_set)
{
study->flags |= PCRE_STUDY_MAPPED;
memcpy(study->start_bits, start_bits, sizeof(start_bits));
}
+ else memset(study->start_bits, 0, 32 * sizeof(pcre_uchar));
#ifdef PCRE_DEBUG
if (bits_set)
diff --git a/pcre_valid_utf8.c b/pcre_valid_utf8.c
index 05d82f9..21a3ad1 100644
--- a/pcre_valid_utf8.c
+++ b/pcre_valid_utf8.c
@@ -111,7 +111,7 @@ register PCRE_PUCHAR p;
if (length < 0)
{
for (p = string; *p != 0; p++);
- length = p - string;
+ length = (int)(p - string);
}
for (p = string; length-- > 0; p++)
@@ -123,20 +123,20 @@ for (p = string; length-- > 0; p++)
if (c < 0xc0) /* Isolated 10xx xxxx byte */
{
- *erroroffset = p - string;
+ *erroroffset = (int)(p - string);
return PCRE_UTF8_ERR20;
}
if (c >= 0xfe) /* Invalid 0xfe or 0xff bytes */
{
- *erroroffset = p - string;
+ *erroroffset = (int)(p - string);
return PCRE_UTF8_ERR21;
}
ab = PRIV(utf8_table4)[c & 0x3f]; /* Number of additional bytes */
if (length < ab)
{
- *erroroffset = p - string; /* Missing bytes */
+ *erroroffset = (int)(p - string); /* Missing bytes */
return ab - length; /* Codes ERR1 to ERR5 */
}
length -= ab; /* Length remaining */
@@ -145,7 +145,7 @@ for (p = string; length-- > 0; p++)
if (((d = *(++p)) & 0xc0) != 0x80)
{
- *erroroffset = p - string - 1;
+ *erroroffset = (int)(p - string) - 1;
return PCRE_UTF8_ERR6;
}
@@ -160,7 +160,7 @@ for (p = string; length-- > 0; p++)
case 1: if ((c & 0x3e) == 0)
{
- *erroroffset = p - string - 1;
+ *erroroffset = (int)(p - string) - 1;
return PCRE_UTF8_ERR15;
}
break;
@@ -172,17 +172,17 @@ for (p = string; length-- > 0; p++)
case 2:
if ((*(++p) & 0xc0) != 0x80) /* Third byte */
{
- *erroroffset = p - string - 2;
+ *erroroffset = (int)(p - string) - 2;
return PCRE_UTF8_ERR7;
}
if (c == 0xe0 && (d & 0x20) == 0)
{
- *erroroffset = p - string - 2;
+ *erroroffset = (int)(p - string) - 2;
return PCRE_UTF8_ERR16;
}
if (c == 0xed && d >= 0xa0)
{
- *erroroffset = p - string - 2;
+ *erroroffset = (int)(p - string) - 2;
return PCRE_UTF8_ERR14;
}
break;
@@ -194,22 +194,22 @@ for (p = string; length-- > 0; p++)
case 3:
if ((*(++p) & 0xc0) != 0x80) /* Third byte */
{
- *erroroffset = p - string - 2;
+ *erroroffset = (int)(p - string) - 2;
return PCRE_UTF8_ERR7;
}
if ((*(++p) & 0xc0) != 0x80) /* Fourth byte */
{
- *erroroffset = p - string - 3;
+ *erroroffset = (int)(p - string) - 3;
return PCRE_UTF8_ERR8;
}
if (c == 0xf0 && (d & 0x30) == 0)
{
- *erroroffset = p - string - 3;
+ *erroroffset = (int)(p - string) - 3;
return PCRE_UTF8_ERR17;
}
if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
{
- *erroroffset = p - string - 3;
+ *erroroffset = (int)(p - string) - 3;
return PCRE_UTF8_ERR13;
}
break;
@@ -225,22 +225,22 @@ for (p = string; length-- > 0; p++)
case 4:
if ((*(++p) & 0xc0) != 0x80) /* Third byte */
{
- *erroroffset = p - string - 2;
+ *erroroffset = (int)(p - string) - 2;
return PCRE_UTF8_ERR7;
}
if ((*(++p) & 0xc0) != 0x80) /* Fourth byte */
{
- *erroroffset = p - string - 3;
+ *erroroffset = (int)(p - string) - 3;
return PCRE_UTF8_ERR8;
}
if ((*(++p) & 0xc0) != 0x80) /* Fifth byte */
{
- *erroroffset = p - string - 4;
+ *erroroffset = (int)(p - string) - 4;
return PCRE_UTF8_ERR9;
}
if (c == 0xf8 && (d & 0x38) == 0)
{
- *erroroffset = p - string - 4;
+ *erroroffset = (int)(p - string) - 4;
return PCRE_UTF8_ERR18;
}
break;
@@ -251,27 +251,27 @@ for (p = string; length-- > 0; p++)
case 5:
if ((*(++p) & 0xc0) != 0x80) /* Third byte */
{
- *erroroffset = p - string - 2;
+ *erroroffset = (int)(p - string) - 2;
return PCRE_UTF8_ERR7;
}
if ((*(++p) & 0xc0) != 0x80) /* Fourth byte */
{
- *erroroffset = p - string - 3;
+ *erroroffset = (int)(p - string) - 3;
return PCRE_UTF8_ERR8;
}
if ((*(++p) & 0xc0) != 0x80) /* Fifth byte */
{
- *erroroffset = p - string - 4;
+ *erroroffset = (int)(p - string) - 4;
return PCRE_UTF8_ERR9;
}
if ((*(++p) & 0xc0) != 0x80) /* Sixth byte */
{
- *erroroffset = p - string - 5;
+ *erroroffset = (int)(p - string) - 5;
return PCRE_UTF8_ERR10;
}
if (c == 0xfc && (d & 0x3c) == 0)
{
- *erroroffset = p - string - 5;
+ *erroroffset = (int)(p - string) - 5;
return PCRE_UTF8_ERR19;
}
break;
@@ -283,7 +283,7 @@ for (p = string; length-- > 0; p++)
if (ab > 3)
{
- *erroroffset = p - string - ab;
+ *erroroffset = (int)(p - string) - ab;
return (ab == 4)? PCRE_UTF8_ERR11 : PCRE_UTF8_ERR12;
}
}
diff --git a/pcregrep.c b/pcregrep.c
index e7e8e47..74a37c2 100644
--- a/pcregrep.c
+++ b/pcregrep.c
@@ -1410,7 +1410,7 @@ while (ptr < endptr)
and its line-ending characters (if they matched the pattern), so there
may be no more to print. */
- plength = (linelength + endlinelength) - startoffset;
+ plength = (int)((linelength + endlinelength) - startoffset);
if (plength > 0) FWRITE(ptr + startoffset, 1, plength, stdout);
}
@@ -1462,7 +1462,7 @@ while (ptr < endptr)
if (input_line_buffered && bufflength < (size_t)bufsize)
{
- int add = read_one_line(ptr, bufsize - (ptr - main_buffer), in);
+ int add = read_one_line(ptr, bufsize - (int)(ptr - main_buffer), in);
bufflength += add;
endptr += add;
}
diff --git a/pcreposix.c b/pcreposix.c
index 0426e2e..8e82ba6 100644
--- a/pcreposix.c
+++ b/pcreposix.c
@@ -154,8 +154,10 @@ static const int eint[] = {
REG_BADPAT, /* \c must be followed by an ASCII character */
REG_BADPAT, /* \k is not followed by a braced, angle-bracketed, or quoted name */
/* 70 */
- REG_BADPAT, /* internal error: unknown opcode in find_fixedlength() */
- REG_BADPAT, /* Not allowed UTF-8 / UTF-16 code point (>= 0xd800 && <= 0xdfff) */
+ REG_BADPAT, /* internal error: unknown opcode in find_fixedlength() */
+ REG_BADPAT, /* \N is not supported in a class */
+ REG_BADPAT, /* too many forward references */
+ REG_BADPAT, /* disallowed UTF-8/16 code point (>= 0xd800 && <= 0xdfff) */
};
/* Table of texts corresponding to POSIX error codes */
diff --git a/pcretest.c b/pcretest.c
index 7826fcd..d27e11e 100644
--- a/pcretest.c
+++ b/pcretest.c
@@ -190,6 +190,7 @@ static int locale_set = 0;
static int show_malloc;
static int use_utf8;
static size_t gotten_store;
+static size_t first_gotten_store = 0;
static const unsigned char *last_callout_mark = NULL;
/* The buffers grow automatically if very long input lines are encountered. */
@@ -998,12 +999,14 @@ return (cb->callout_number != callout_fail_id)? 0 :
*************************************************/
/* Alternative malloc function, to test functionality and save the size of a
-compiled re. The show_malloc variable is set only during matching. */
+compiled re, which is the first store request that pcre_compile() makes. The
+show_malloc variable is set only during matching. */
static void *new_malloc(size_t size)
{
void *block = malloc(size);
gotten_store = size;
+if (first_gotten_store == 0) first_gotten_store = size;
if (show_malloc)
fprintf(outfile, "malloc %3d %p\n", (int)size, block);
return block;
@@ -1519,7 +1522,7 @@ while (!done)
(sbuf[4] << 24) | (sbuf[5] << 16) | (sbuf[6] << 8) | sbuf[7];
re = (real_pcre *)new_malloc(true_size);
- regex_gotten_store = gotten_store;
+ regex_gotten_store = first_gotten_store;
if (fread(re, 1, true_size, f) != true_size) goto FAIL_READ;
@@ -1628,6 +1631,7 @@ while (!done)
/* Look for options after final delimiter */
options = 0;
+ study_options = 0;
log_store = showstore; /* default from command line */
while (*pp != 0)
@@ -1776,6 +1780,7 @@ while (!done)
if ((options & PCRE_UCP) != 0) cflags |= REG_UCP;
if ((options & PCRE_UNGREEDY) != 0) cflags |= REG_UNGREEDY;
+ first_gotten_store = 0;
rc = regcomp(&preg, (char *)p, cflags);
/* Compilation failed; go back for another re, skipping to blank line
@@ -1813,6 +1818,7 @@ while (!done)
(double)CLOCKS_PER_SEC);
}
+ first_gotten_store = 0;
re = pcre_compile((char *)p, options, &error, &erroroffset, tables);
/* Compilation failed; go back for another re, skipping to blank line
@@ -1847,22 +1853,20 @@ while (!done)
new_info(re, NULL, PCRE_INFO_OPTIONS, &get_options);
if ((get_options & PCRE_UTF8) != 0) use_utf8 = 1;
- /* Print information if required. There are now two info-returning
- functions. The old one has a limited interface and returns only limited
- data. Check that it agrees with the newer one. */
+ /* Extract the size for possible writing before possibly flipping it,
+ and remember the store that was got. */
+
+ true_size = ((real_pcre *)re)->size;
+ regex_gotten_store = first_gotten_store;
+
+ /* Output code size information if requested */
if (log_store)
fprintf(outfile, "Memory allocation (code space): %d\n",
- (int)(gotten_store -
+ (int)(first_gotten_store -
sizeof(real_pcre) -
((real_pcre *)re)->name_count * ((real_pcre *)re)->name_entry_size));
- /* Extract the size for possible writing before possibly flipping it,
- and remember the store that was got. */
-
- true_size = ((real_pcre *)re)->size;
- regex_gotten_store = gotten_store;
-
/* If -s or /S was present, study the regex to generate additional info to
help with the matching, unless the pattern has the SS option, which
suppresses the effect of /S (used for a few test patterns where studying is
@@ -1887,7 +1891,16 @@ while (!done)
if (error != NULL)
fprintf(outfile, "Failed to study: %s\n", error);
else if (extra != NULL)
+ {
true_study_size = ((pcre_study_data *)(extra->study_data))->size;
+ if (log_store)
+ {
+ size_t jitsize;
+ new_info(re, extra, PCRE_INFO_JITSIZE, &jitsize);
+ if (jitsize != 0)
+ fprintf(outfile, "Memory allocation (JIT code): %d\n", jitsize);
+ }
+ }
}
/* If /K was present, we set up for handling MARK data. */
@@ -1940,7 +1953,9 @@ while (!done)
}
}
- /* Extract information from the compiled data if required */
+ /* Extract information from the compiled data if required. There are now
+ two info-returning functions. The old one has a limited interface and
+ returns only limited data. Check that it agrees with the newer one. */
SHOW_INFO:
diff --git a/perltest.pl b/perltest.pl
index ccddd64..d81a6c4 100755
--- a/perltest.pl
+++ b/perltest.pl
@@ -111,6 +111,10 @@ for (;;)
$pattern =~ s/S(?=[a-zA-Z]*$)//g;
+ # Remove /Y from a pattern (asks pcretest to disable PCRE optimization)
+
+ $pattern =~ s/Y(?=[a-zA-Z]*$)//;
+
# Check that the pattern is valid
eval "\$_ =~ ${pattern}";
diff --git a/testdata/testinput11 b/testdata/testinput11
index a9d1cfd..0f33cb1 100644
--- a/testdata/testinput11
+++ b/testdata/testinput11
@@ -452,20 +452,6 @@ with the handling of backtracking verbs. ---/
/A(*MARK:A)A+(*SKIP:B)(B|Z) | AC(*:B)/xK
AAAC
-/--- We use something more complicated than individual letters here, because
-that causes different behaviour in Perl. Perhaps it disables some optimization;
-anyway, the result now matches PCRE in that no tag is passed back for the
-failures. ---/
-
-/(A|P)(*:A)(B|P) | (X|P)(X|P)(*:B)(Y|P)/xK
- AABC
- XXYZ
- ** Failers
- XAQQ
- XAQQXZZ
- AXQQQ
- AXXQQQ
-
/--- COMMIT at the start of a pattern should act like an anchor. Again,
however, we need the complication for Perl. ---/
@@ -800,4 +786,143 @@ name)/K
/(?<=a(*THEN)b)c/
xabcd
+/(a)(?2){2}(.)/
+ abcd
+
+/(*MARK:A)(*PRUNE:B)(C|X)/KS
+ C
+ D
+
+/(*MARK:A)(*PRUNE:B)(C|X)/KSS
+ C
+ D
+
+/(*MARK:A)(*THEN:B)(C|X)/KS
+ C
+ D
+
+/(*MARK:A)(*THEN:B)(C|X)/KSY
+ C
+ D
+
+/(*MARK:A)(*THEN:B)(C|X)/KSS
+ C
+ D
+
+/--- This should fail, as the skip causes a bump to offset 3 (the skip) ---/
+
+/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xK
+ AAAC
+
+/--- Same --/
+
+/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xK
+ AAAC
+
+/A(*:A)A+(*SKIP)(B|Z) | AC/xK
+ AAAC
+
+/--- This should fail, as a null name is the same as no name ---/
+
+/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xK
+ AAAC
+
+/--- A check on what happens after hitting a mark and them bumping along to
+something that does not even start. Perl reports tags after the failures here,
+though it does not when the individual letters are made into something
+more complicated. ---/
+
+/A(*:A)B|XX(*:B)Y/K
+ AABC
+ XXYZ
+ ** Failers
+ XAQQ
+ XAQQXZZ
+ AXQQQ
+ AXXQQQ
+
+/^(A(*THEN:A)B|C(*THEN:B)D)/K
+ AB
+ CD
+ ** Failers
+ AC
+ CB
+
+/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/K
+ AB
+ CD
+ ** Failers
+ AC
+ CB
+
+/--- An empty name does not pass back an empty string. It is the same as if no
+name were given. ---/
+
+/^(A(*PRUNE:)B|C(*PRUNE:B)D)/K
+ AB
+ CD
+
+/--- PRUNE goes to next bumpalong; COMMIT does not. ---/
+
+/A(*PRUNE:A)B/K
+ ACAB
+
+/--- Mark names can be duplicated ---/
+
+/A(*:A)B|X(*:A)Y/K
+ AABC
+ XXYZ
+
+/b(*:m)f|a(*:n)w/K
+ aw
+ ** Failers
+ abc
+
+/b(*:m)f|aw/K
+ abaw
+ ** Failers
+ abc
+ abax
+
+/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xK
+ AAAC
+
+/a(*PRUNE:X)bc|qq/KY
+ ** Failers
+ axy
+
+/a(*THEN:X)bc|qq/KY
+ ** Failers
+ axy
+
+/(?=a(*MARK:A)b)..x/K
+ abxy
+ ** Failers
+ abpq
+
+/(?=a(*MARK:A)b)..(*:Y)x/K
+ abxy
+ ** Failers
+ abpq
+
+/(?=a(*PRUNE:A)b)..x/K
+ abxy
+ ** Failers
+ abpq
+
+/(?=a(*PRUNE:A)b)..(*:Y)x/K
+ abxy
+ ** Failers
+ abpq
+
+/(?=a(*THEN:A)b)..x/K
+ abxy
+ ** Failers
+ abpq
+
+/(?=a(*THEN:A)b)..(*:Y)x/K
+ abxy
+ ** Failers
+ abpq
+
/-- End of testinput11 --/
diff --git a/testdata/testinput13 b/testdata/testinput13
index 141bf65..e91c24e 100644
--- a/testdata/testinput13
+++ b/testdata/testinput13
@@ -573,4 +573,11 @@ of case for anything other than the ASCII letters. --/
/^\X/8
́réo
+/^a\X41z/<JS>
+ aX41z
+ *** Failers
+ aAz
+
+/(?<=ab\Cde)X/8
+
/-- End of testinput13 --/
diff --git a/testdata/testinput2 b/testdata/testinput2
index b673fef..19801ef 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -3,7 +3,10 @@
It also checks the non-Perl syntax the PCRE supports (Python, .NET,
Oniguruma). Finally, there are some tests where PCRE and Perl differ,
either because PCRE can't be compatible, or there is a possible Perl
- bug. --/
+ bug.
+
+ NOTE: This is a non-UTF-8 set of tests. When UTF-8 is needed, use test
+ 5, and if Unicode Property Support is needed, use test 13. --/
/-- Originally, the Perl >= 5.10 things were in here too, but now I have
separated many (most?) of them out into test 11. However, there may still
@@ -3322,116 +3325,19 @@ a random value. /Ix
/A(*PRUNE)B|A(*PRUNE)C/K
AC
-/--- A whole lot of tests of verbs with arguments are here rather than in test
- 11 because Perl doesn't seem to follow its specification entirely
- correctly. ---/
-
-/--- Perl 5.11 sets $REGERROR on the AC failure case here; PCRE does not. It is
- not clear how Perl defines "involved in the failure of the match". ---/
-
-/^(A(*THEN:A)B|C(*THEN:B)D)/K
- AB
- CD
- ** Failers
- AC
- CB
-
-/--- Check the use of names for success and failure. PCRE doesn't show these
-names for success, though Perl does, contrary to its spec. ---/
-
-/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/K
- AB
- CD
- ** Failers
- AC
- CB
-
-/--- An empty name does not pass back an empty string. It is the same as if no
-name were given. ---/
-
-/^(A(*PRUNE:)B|C(*PRUNE:B)D)/K
- AB
- CD
-
-/--- PRUNE goes to next bumpalong; COMMIT does not. ---/
-
-/A(*PRUNE:A)B/K
- ACAB
-
-/(*MARK:A)(*PRUNE:B)(C|X)/KS
- C
- D
-
-/(*MARK:A)(*PRUNE:B)(C|X)/KSS
- C
- D
-
-/(*MARK:A)(*THEN:B)(C|X)/KS
- C
- D
-
-/(*MARK:A)(*THEN:B)(C|X)/KSY
- C
- D
-
-/(*MARK:A)(*THEN:B)(C|X)/KSS
- C
- D
-
-/--- This should fail, as the skip causes a bump to offset 3 (the skip) ---/
-
-/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xK
- AAAC
-
-/--- Same --/
-
-/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xK
- AAAC
-
/--- This should fail; the SKIP advances by one, but when we get to AC, the
- PRUNE kills it. ---/
+ PRUNE kills it. Perl behaves differently. ---/
/A(*PRUNE:A)A+(*SKIP:A)(B|Z) | AC/xK
AAAC
-/A(*:A)A+(*SKIP)(B|Z) | AC/xK
- AAAC
-
-/--- This should fail, as a null name is the same as no name ---/
-
-/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xK
- AAAC
-
-/--- This fails in PCRE, and I think that is in accordance with Perl's
- documentation, though in Perl it succeeds. ---/
-
-/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xK
- AAAC
-
-/--- Mark names can be duplicated ---/
+/--- Mark names can be duplicated. Perl doesn't give a mark for this one,
+though PCRE does. ---/
-/A(*:A)B|X(*:A)Y/K
- AABC
- XXYZ
-
/^A(*:A)B|^X(*:A)Y/K
** Failers
XAQQ
-/--- A check on what happens after hitting a mark and them bumping along to
-something that does not even start. Perl reports tags after the failures here,
-though it does not when the individual letters are made into something
-more complicated. ---/
-
-/A(*:A)B|XX(*:B)Y/K
- AABC
- XXYZ
- ** Failers
- XAQQ
- XAQQXZZ
- AXQQQ
- AXXQQQ
-
/--- COMMIT at the start of a pattern should be the same as an anchor. Perl
optimizations defeat this. So does the PCRE optimization unless we disable it
with \Y. ---/
@@ -3441,78 +3347,6 @@ with \Y. ---/
** Failers
DEFGABC\Y
-/--- Repeat some tests with added studying. ---/
-
-/A(*COMMIT)B/+KS
- ACABX
-
-/A(*THEN)B|A(*THEN)C/KS
- AC
-
-/A(*PRUNE)B|A(*PRUNE)C/KS
- AC
-
-/^(A(*THEN:A)B|C(*THEN:B)D)/KS
- AB
- CD
- ** Failers
- AC
- CB
-
-/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/KS
- AB
- CD
- ** Failers
- AC
- CB
-
-/^(A(*PRUNE:)B|C(*PRUNE:B)D)/KS
- AB
- CD
-
-/A(*PRUNE:A)B/KS
- ACAB
-
-/(*MARK:A)(*PRUNE:B)(C|X)/KS
- C
- D
-
-/(*MARK:A)(*THEN:B)(C|X)/KS
- C
- D
-
-/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xKS
- AAAC
-
-/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xKS
- AAAC
-
-/A(*PRUNE:A)A+(*SKIP:A)(B|Z) | AC/xKS
- AAAC
-
-/A(*:A)A+(*SKIP)(B|Z) | AC/xKS
- AAAC
-
-/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xKS
- AAAC
-
-/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xKS
- AAAC
-
-/A(*:A)B|XX(*:B)Y/KS
- AABC
- XXYZ
- ** Failers
- XAQQ
- XAQQXZZ
- AXQQQ
- AXXQQQ
-
-/(*COMMIT)ABC/
- ABCDEFG
- ** Failers
- DEFGABC\Y
-
/^(ab (c+(*THEN)cd) | xyz)/x
abcccd
@@ -3980,11 +3814,6 @@ AbcdCBefgBhiBqz
/^a\x1z/<JS>
ax1z
-/^a\X41z/<JS>
- aX41z
- *** Failers
- aAz
-
/^a\u0041z/<JS>
aAz
*** Failers
@@ -4007,6 +3836,44 @@ AbcdCBefgBhiBqz
/(?(?=c)c|d)*+Y/BZ
-/(?<=ab\Cde)X/8
+/a[\NB]c/
+ aNc
+
+/a[B-\Nc]/
+
+/(a)(?2){0,1999}?(b)/
+
+/(a)(?(DEFINE)(b))(?2){0,1999}?(?2)/
+
+/--- This test, with something more complicated than individual letters, causes
+different behaviour in Perl. Perhaps it disables some optimization; no tag is
+passed back for the failures, whereas in PCRE there is a tag. ---/
+
+/(A|P)(*:A)(B|P) | (X|P)(X|P)(*:B)(Y|P)/xK
+ AABC
+ XXYZ
+ ** Failers
+ XAQQ
+ XAQQXZZ
+ AXQQQ
+ AXXQQQ
+
+/-- Perl doesn't give marks for these, though it does if the alternatives are
+replaced by single letters. --/
+
+/(b|q)(*:m)f|a(*:n)w/K
+ aw
+ ** Failers
+ abc
+
+/(q|b)(*:m)f|a(*:n)w/K
+ aw
+ ** Failers
+ abc
+
+/-- After a partial match, the behaviour is as for a failure. --/
+
+/^a(*:X)bcde/K
+ abc\P
/-- End of testinput2 --/
diff --git a/testdata/testinput6 b/testdata/testinput6
index e5fc0e9..6b0d2f7 100644
--- a/testdata/testinput6
+++ b/testdata/testinput6
@@ -802,4 +802,18 @@
** Failers
a\xFCb
+/ⱥ/8i
+ ⱥ
+ Ⱥx
+ Ⱥ
+
+/[ⱥ]/8i
+ ⱥ
+ Ⱥx
+ Ⱥ
+
+/Ⱥ/8i
+ Ⱥ
+ ⱥ
+
/-- End of testinput6 --/
diff --git a/testdata/testoutput11 b/testdata/testoutput11
index 7fc086f..fd4ae01 100644
--- a/testdata/testoutput11
+++ b/testdata/testoutput11
@@ -888,36 +888,6 @@ No match, mark = A
0: AC
MK: B
-/--- We use something more complicated than individual letters here, because
-that causes different behaviour in Perl. Perhaps it disables some optimization;
-anyway, the result now matches PCRE in that no tag is passed back for the
-failures. ---/
-
-/(A|P)(*:A)(B|P) | (X|P)(X|P)(*:B)(Y|P)/xK
- AABC
- 0: AB
- 1: A
- 2: B
-MK: A
- XXYZ
- 0: XXY
- 1: <unset>
- 2: <unset>
- 3: X
- 4: X
- 5: Y
-MK: B
- ** Failers
-No match
- XAQQ
-No match
- XAQQXZZ
-No match
- AXQQQ
-No match
- AXXQQQ
-No match
-
/--- COMMIT at the start of a pattern should act like an anchor. Again,
however, we need the complication for Perl. ---/
@@ -1441,4 +1411,245 @@ MK: N
xabcd
0: c
+/(a)(?2){2}(.)/
+ abcd
+ 0: abcd
+ 1: a
+ 2: d
+
+/(*MARK:A)(*PRUNE:B)(C|X)/KS
+ C
+ 0: C
+ 1: C
+MK: B
+ D
+No match, mark = B
+
+/(*MARK:A)(*PRUNE:B)(C|X)/KSS
+ C
+ 0: C
+ 1: C
+MK: B
+ D
+No match, mark = B
+
+/(*MARK:A)(*THEN:B)(C|X)/KS
+ C
+ 0: C
+ 1: C
+MK: B
+ D
+No match, mark = B
+
+/(*MARK:A)(*THEN:B)(C|X)/KSY
+ C
+ 0: C
+ 1: C
+MK: B
+ D
+No match, mark = B
+
+/(*MARK:A)(*THEN:B)(C|X)/KSS
+ C
+ 0: C
+ 1: C
+MK: B
+ D
+No match, mark = B
+
+/--- This should fail, as the skip causes a bump to offset 3 (the skip) ---/
+
+/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xK
+ AAAC
+No match, mark = A
+
+/--- Same --/
+
+/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xK
+ AAAC
+No match, mark = B
+
+/A(*:A)A+(*SKIP)(B|Z) | AC/xK
+ AAAC
+No match, mark = A
+
+/--- This should fail, as a null name is the same as no name ---/
+
+/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xK
+ AAAC
+No match, mark = A
+
+/--- A check on what happens after hitting a mark and them bumping along to
+something that does not even start. Perl reports tags after the failures here,
+though it does not when the individual letters are made into something
+more complicated. ---/
+
+/A(*:A)B|XX(*:B)Y/K
+ AABC
+ 0: AB
+MK: A
+ XXYZ
+ 0: XXY
+MK: B
+ ** Failers
+No match
+ XAQQ
+No match, mark = A
+ XAQQXZZ
+No match, mark = A
+ AXQQQ
+No match, mark = A
+ AXXQQQ
+No match, mark = B
+
+/^(A(*THEN:A)B|C(*THEN:B)D)/K
+ AB
+ 0: AB
+ 1: AB
+MK: A
+ CD
+ 0: CD
+ 1: CD
+MK: B
+ ** Failers
+No match
+ AC
+No match, mark = A
+ CB
+No match, mark = B
+
+/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/K
+ AB
+ 0: AB
+ 1: AB
+MK: A
+ CD
+ 0: CD
+ 1: CD
+MK: B
+ ** Failers
+No match
+ AC
+No match, mark = A
+ CB
+No match, mark = B
+
+/--- An empty name does not pass back an empty string. It is the same as if no
+name were given. ---/
+
+/^(A(*PRUNE:)B|C(*PRUNE:B)D)/K
+ AB
+ 0: AB
+ 1: AB
+ CD
+ 0: CD
+ 1: CD
+MK: B
+
+/--- PRUNE goes to next bumpalong; COMMIT does not. ---/
+
+/A(*PRUNE:A)B/K
+ ACAB
+ 0: AB
+MK: A
+
+/--- Mark names can be duplicated ---/
+
+/A(*:A)B|X(*:A)Y/K
+ AABC
+ 0: AB
+MK: A
+ XXYZ
+ 0: XY
+MK: A
+
+/b(*:m)f|a(*:n)w/K
+ aw
+ 0: aw
+MK: n
+ ** Failers
+No match, mark = n
+ abc
+No match, mark = m
+
+/b(*:m)f|aw/K
+ abaw
+ 0: aw
+ ** Failers
+No match
+ abc
+No match, mark = m
+ abax
+No match, mark = m
+
+/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xK
+ AAAC
+ 0: AAC
+
+/a(*PRUNE:X)bc|qq/KY
+ ** Failers
+No match, mark = X
+ axy
+No match, mark = X
+
+/a(*THEN:X)bc|qq/KY
+ ** Failers
+No match, mark = X
+ axy
+No match, mark = X
+
+/(?=a(*MARK:A)b)..x/K
+ abxy
+ 0: abx
+MK: A
+ ** Failers
+No match
+ abpq
+No match
+
+/(?=a(*MARK:A)b)..(*:Y)x/K
+ abxy
+ 0: abx
+MK: Y
+ ** Failers
+No match
+ abpq
+No match
+
+/(?=a(*PRUNE:A)b)..x/K
+ abxy
+ 0: abx
+MK: A
+ ** Failers
+No match
+ abpq
+No match
+
+/(?=a(*PRUNE:A)b)..(*:Y)x/K
+ abxy
+ 0: abx
+MK: Y
+ ** Failers
+No match
+ abpq
+No match
+
+/(?=a(*THEN:A)b)..x/K
+ abxy
+ 0: abx
+MK: A
+ ** Failers
+No match
+ abpq
+No match
+
+/(?=a(*THEN:A)b)..(*:Y)x/K
+ abxy
+ 0: abx
+MK: Y
+ ** Failers
+No match
+ abpq
+No match
+
/-- End of testinput11 --/
diff --git a/testdata/testoutput13 b/testdata/testoutput13
index 5dab51d..13cc0d0 100644
--- a/testdata/testoutput13
+++ b/testdata/testoutput13
@@ -1278,4 +1278,15 @@ No match
́réo
No match
+/^a\X41z/<JS>
+ aX41z
+ 0: aX41z
+ *** Failers
+No match
+ aAz
+No match
+
+/(?<=ab\Cde)X/8
+Failed: \C not allowed in lookbehind assertion at offset 10
+
/-- End of testinput13 --/
diff --git a/testdata/testoutput14 b/testdata/testoutput14
index 7df45cc..fa31d30 100644
--- a/testdata/testoutput14
+++ b/testdata/testoutput14
@@ -42,9 +42,7 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Subject length lower bound = -1
-No set of starting bytes
-JIT study was successful
+Study returned NULL
/(?(R)a*(?1)|((?R))b)/S+
aaaabcde
diff --git a/testdata/testoutput15 b/testdata/testoutput15
index 944659b..10fbc54 100644
--- a/testdata/testoutput15
+++ b/testdata/testoutput15
@@ -17,6 +17,5 @@ No options
No first char
No need char
Study returned NULL
-JIT support is not available in this version of PCRE
/-- End of testinput15 --/
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index b35e6a7..22a0725 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -3,7 +3,10 @@
It also checks the non-Perl syntax the PCRE supports (Python, .NET,
Oniguruma). Finally, there are some tests where PCRE and Perl differ,
either because PCRE can't be compatible, or there is a possible Perl
- bug. --/
+ bug.
+
+ NOTE: This is a non-UTF-8 set of tests. When UTF-8 is needed, use test
+ 5, and if Unicode Property Support is needed, use test 13. --/
/-- Originally, the Perl >= 5.10 things were in here too, but now I have
separated many (most?) of them out into test 11. However, there may still
@@ -10989,176 +10992,22 @@ No match
AC
No match
-/--- A whole lot of tests of verbs with arguments are here rather than in test
- 11 because Perl doesn't seem to follow its specification entirely
- correctly. ---/
-
-/--- Perl 5.11 sets $REGERROR on the AC failure case here; PCRE does not. It is
- not clear how Perl defines "involved in the failure of the match". ---/
-
-/^(A(*THEN:A)B|C(*THEN:B)D)/K
- AB
- 0: AB
- 1: AB
- CD
- 0: CD
- 1: CD
- ** Failers
-No match
- AC
-No match
- CB
-No match, mark = B
-
-/--- Check the use of names for success and failure. PCRE doesn't show these
-names for success, though Perl does, contrary to its spec. ---/
-
-/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/K
- AB
- 0: AB
- 1: AB
- CD
- 0: CD
- 1: CD
- ** Failers
-No match
- AC
-No match, mark = A
- CB
-No match, mark = B
-
-/--- An empty name does not pass back an empty string. It is the same as if no
-name were given. ---/
-
-/^(A(*PRUNE:)B|C(*PRUNE:B)D)/K
- AB
- 0: AB
- 1: AB
- CD
- 0: CD
- 1: CD
-
-/--- PRUNE goes to next bumpalong; COMMIT does not. ---/
-
-/A(*PRUNE:A)B/K
- ACAB
- 0: AB
-
-/(*MARK:A)(*PRUNE:B)(C|X)/KS
- C
- 0: C
- 1: C
-MK: A
- D
-No match
-
-/(*MARK:A)(*PRUNE:B)(C|X)/KSS
- C
- 0: C
- 1: C
-MK: A
- D
-No match, mark = B
-
-/(*MARK:A)(*THEN:B)(C|X)/KS
- C
- 0: C
- 1: C
-MK: A
- D
-No match
-
-/(*MARK:A)(*THEN:B)(C|X)/KSY
- C
- 0: C
- 1: C
-MK: A
- D
-No match, mark = B
-
-/(*MARK:A)(*THEN:B)(C|X)/KSS
- C
- 0: C
- 1: C
-MK: A
- D
-No match, mark = B
-
-/--- This should fail, as the skip causes a bump to offset 3 (the skip) ---/
-
-/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xK
- AAAC
-No match
-
-/--- Same --/
-
-/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xK
- AAAC
-No match
-
/--- This should fail; the SKIP advances by one, but when we get to AC, the
- PRUNE kills it. ---/
+ PRUNE kills it. Perl behaves differently. ---/
/A(*PRUNE:A)A+(*SKIP:A)(B|Z) | AC/xK
AAAC
-No match
-
-/A(*:A)A+(*SKIP)(B|Z) | AC/xK
- AAAC
-No match
-
-/--- This should fail, as a null name is the same as no name ---/
-
-/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xK
- AAAC
-No match
-
-/--- This fails in PCRE, and I think that is in accordance with Perl's
- documentation, though in Perl it succeeds. ---/
-
-/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xK
- AAAC
-No match
+No match, mark = A
-/--- Mark names can be duplicated ---/
+/--- Mark names can be duplicated. Perl doesn't give a mark for this one,
+though PCRE does. ---/
-/A(*:A)B|X(*:A)Y/K
- AABC
- 0: AB
-MK: A
- XXYZ
- 0: XY
-MK: A
-
/^A(*:A)B|^X(*:A)Y/K
** Failers
No match
XAQQ
No match, mark = A
-/--- A check on what happens after hitting a mark and them bumping along to
-something that does not even start. Perl reports tags after the failures here,
-though it does not when the individual letters are made into something
-more complicated. ---/
-
-/A(*:A)B|XX(*:B)Y/K
- AABC
- 0: AB
-MK: A
- XXYZ
- 0: XXY
-MK: B
- ** Failers
-No match
- XAQQ
-No match
- XAQQXZZ
-No match
- AXQQQ
-No match
- AXXQQQ
-No match
-
/--- COMMIT at the start of a pattern should be the same as an anchor. Perl
optimizations defeat this. So does the PCRE optimization unless we disable it
with \Y. ---/
@@ -11171,126 +11020,6 @@ No match
DEFGABC\Y
No match
-/--- Repeat some tests with added studying. ---/
-
-/A(*COMMIT)B/+KS
- ACABX
-No match
-
-/A(*THEN)B|A(*THEN)C/KS
- AC
- 0: AC
-
-/A(*PRUNE)B|A(*PRUNE)C/KS
- AC
-No match
-
-/^(A(*THEN:A)B|C(*THEN:B)D)/KS
- AB
- 0: AB
- 1: AB
- CD
- 0: CD
- 1: CD
- ** Failers
-No match
- AC
-No match
- CB
-No match, mark = B
-
-/^(A(*PRUNE:A)B|C(*PRUNE:B)D)/KS
- AB
- 0: AB
- 1: AB
- CD
- 0: CD
- 1: CD
- ** Failers
-No match
- AC
-No match, mark = A
- CB
-No match, mark = B
-
-/^(A(*PRUNE:)B|C(*PRUNE:B)D)/KS
- AB
- 0: AB
- 1: AB
- CD
- 0: CD
- 1: CD
-
-/A(*PRUNE:A)B/KS
- ACAB
- 0: AB
-
-/(*MARK:A)(*PRUNE:B)(C|X)/KS
- C
- 0: C
- 1: C
-MK: A
- D
-No match
-
-/(*MARK:A)(*THEN:B)(C|X)/KS
- C
- 0: C
- 1: C
-MK: A
- D
-No match
-
-/A(*MARK:A)A+(*SKIP)(B|Z) | AC/xKS
- AAAC
-No match
-
-/A(*MARK:A)A+(*MARK:B)(*SKIP:B)(B|Z) | AC/xKS
- AAAC
-No match
-
-/A(*PRUNE:A)A+(*SKIP:A)(B|Z) | AC/xKS
- AAAC
-No match
-
-/A(*:A)A+(*SKIP)(B|Z) | AC/xKS
- AAAC
-No match
-
-/A(*MARK:A)A+(*SKIP:)(B|Z) | AC/xKS
- AAAC
-No match
-
-/A(*MARK:A)A+(*SKIP:B)(B|Z) | AAC/xKS
- AAAC
-No match
-
-/A(*:A)B|XX(*:B)Y/KS
- AABC
- 0: AB
-MK: A
- XXYZ
- 0: XXY
-MK: B
- ** Failers
-No match
- XAQQ
-No match
- XAQQXZZ
-No match
- AXQQQ
-No match
- AXXQQQ
-No match
-
-/(*COMMIT)ABC/
- ABCDEFG
- 0: ABC
- ** Failers
-No match
- DEFGABC\Y
-No match
-
/^(ab (c+(*THEN)cd) | xyz)/x
abcccd
No match
@@ -11875,11 +11604,11 @@ No match
1: C
MK: A
D
-No match
+No match, mark = A
/(*:A)A+(*SKIP:A)(B|Z)/KS
AAAC
-No match
+No match, mark = A
/-- --/
@@ -12257,7 +11986,6 @@ Latest Mark: A
Latest Mark: B
+18 ^ ^ z
+20 ^ a
-Latest Mark: <unset>
+21 ^^ e
+22 ^ ^ q
+23 ^ ^ )
@@ -12518,14 +12246,6 @@ No match
ax1z
0: ax1z
-/^a\X41z/<JS>
- aX41z
- 0: aX41z
- *** Failers
-No match
- aAz
-No match
-
/^a\u0041z/<JS>
aAz
0: aAz
@@ -12591,7 +12311,70 @@ No match
End
------------------------------------------------------------------
-/(?<=ab\Cde)X/8
-Failed: \C not allowed in lookbehind assertion at offset 10
+/a[\NB]c/
+Failed: \N is not supported in a class at offset 3
+
+/a[B-\Nc]/
+Failed: \N is not supported in a class at offset 5
+
+/(a)(?2){0,1999}?(b)/
+
+/(a)(?(DEFINE)(b))(?2){0,1999}?(?2)/
+
+/--- This test, with something more complicated than individual letters, causes
+different behaviour in Perl. Perhaps it disables some optimization; no tag is
+passed back for the failures, whereas in PCRE there is a tag. ---/
+
+/(A|P)(*:A)(B|P) | (X|P)(X|P)(*:B)(Y|P)/xK
+ AABC
+ 0: AB
+ 1: A
+ 2: B
+MK: A
+ XXYZ
+ 0: XXY
+ 1: <unset>
+ 2: <unset>
+ 3: X
+ 4: X
+ 5: Y
+MK: B
+ ** Failers
+No match
+ XAQQ
+No match, mark = A
+ XAQQXZZ
+No match, mark = A
+ AXQQQ
+No match, mark = A
+ AXXQQQ
+No match, mark = B
+
+/-- Perl doesn't give marks for these, though it does if the alternatives are
+replaced by single letters. --/
+
+/(b|q)(*:m)f|a(*:n)w/K
+ aw
+ 0: aw
+MK: n
+ ** Failers
+No match, mark = n
+ abc
+No match, mark = m
+
+/(q|b)(*:m)f|a(*:n)w/K
+ aw
+ 0: aw
+MK: n
+ ** Failers
+No match, mark = n
+ abc
+No match, mark = m
+
+/-- After a partial match, the behaviour is as for a failure. --/
+
+/^a(*:X)bcde/K
+ abc\P
+Partial match, mark=X: abc
/-- End of testinput2 --/
diff --git a/testdata/testoutput5 b/testdata/testoutput5
index 8de96cf..603e55f 100644
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@@ -95,10 +95,10 @@ Failed: character value in \x{...} sequence is too large at offset 11
Failed: character value in \x{...} sequence is too large at offset 12
/\x{d800}/8
-Failed: Not allowed UTF-8 / UTF-16 code point (>= 0xd800 && <= 0xdfff) at offset 7
+Failed: disallowed UTF-8/16 code point (>= 0xd800 && <= 0xdfff) at offset 7
/\x{dfff}/8
-Failed: Not allowed UTF-8 / UTF-16 code point (>= 0xd800 && <= 0xdfff) at offset 7
+Failed: disallowed UTF-8/16 code point (>= 0xd800 && <= 0xdfff) at offset 7
/\x{d7ff}/8
diff --git a/testdata/testoutput6 b/testdata/testoutput6
index 1acaa23..68c0a46 100644
--- a/testdata/testoutput6
+++ b/testdata/testoutput6
@@ -1353,4 +1353,26 @@ No match
a\xFCb
No match
+/ⱥ/8i
+ ⱥ
+ 0: \x{2c65}
+ Ⱥx
+ 0: \x{23a}
+ Ⱥ
+ 0: \x{23a}
+
+/[ⱥ]/8i
+ ⱥ
+ 0: \x{2c65}
+ Ⱥx
+ 0: \x{23a}
+ Ⱥ
+ 0: \x{23a}
+
+/Ⱥ/8i
+ Ⱥ
+ 0: \x{23a}
+ ⱥ
+ 0: \x{2c65}
+
/-- End of testinput6 --/