summaryrefslogtreecommitdiff
path: root/pcre
diff options
context:
space:
mode:
authorSergei Golubchik <sergii@pisem.net>2014-06-05 16:05:08 +0200
committerSergei Golubchik <sergii@pisem.net>2014-06-05 16:05:08 +0200
commit2a5905141a3c509a7c34c3d370fb146dbc1c965f (patch)
tree1d03e5cfa31fd4df30f95b43c7bb3d7d815c4cb7 /pcre
parentc9c9f5133f1373fe9496f3c363891149de55b2f2 (diff)
parent8cc5973f1a18654e2e09076adeece2897a768411 (diff)
downloadmariadb-git-2a5905141a3c509a7c34c3d370fb146dbc1c965f.tar.gz
pcre-8.35
Diffstat (limited to 'pcre')
-rw-r--r--pcre/AUTHORS6
-rw-r--r--pcre/ChangeLog120
-rw-r--r--pcre/INSTALL4
-rw-r--r--pcre/LICENCE6
-rw-r--r--pcre/NEWS11
-rw-r--r--pcre/README21
-rwxr-xr-xpcre/RunGrepTest818
-rwxr-xr-xpcre/RunTest42
-rw-r--r--pcre/configure.ac32
-rw-r--r--pcre/doc/html/README.txt21
-rw-r--r--pcre/doc/html/pcre.html15
-rw-r--r--pcre/doc/html/pcreapi.html114
-rw-r--r--pcre/doc/html/pcregrep.html8
-rw-r--r--pcre/doc/html/pcrepattern.html25
-rw-r--r--pcre/doc/html/pcresyntax.html79
-rw-r--r--pcre/doc/html/pcretest.html21
-rw-r--r--pcre/doc/pcre.317
-rw-r--r--pcre/doc/pcre.txt1323
-rw-r--r--pcre/doc/pcreapi.3104
-rw-r--r--pcre/doc/pcregrep.112
-rw-r--r--pcre/doc/pcregrep.txt709
-rw-r--r--pcre/doc/pcrepattern.327
-rw-r--r--pcre/doc/pcresyntax.360
-rw-r--r--pcre/doc/pcretest.124
-rw-r--r--pcre/doc/pcretest.txt375
-rw-r--r--pcre/maria-patches/pcre_stack_guard.diff57
-rw-r--r--pcre/pcre.h.in6
-rw-r--r--pcre/pcre_byte_order.c6
-rw-r--r--pcre/pcre_compile.c151
-rw-r--r--pcre/pcre_dfa_exec.c69
-rw-r--r--pcre/pcre_exec.c102
-rw-r--r--pcre/pcre_globals.c2
-rw-r--r--pcre/pcre_internal.h118
-rw-r--r--pcre/pcre_jit_compile.c1921
-rw-r--r--pcre/pcre_jit_test.c72
-rw-r--r--pcre/pcre_printint.c15
-rw-r--r--pcre/pcre_string_utils.c8
-rw-r--r--pcre/pcre_study.c62
-rw-r--r--pcre/pcre_xclass.c5
-rw-r--r--pcre/pcregrep.c4
-rw-r--r--pcre/pcreposix.c6
-rw-r--r--pcre/pcretest.c92
-rw-r--r--pcre/testdata/saved16BE-1bin410 -> 418 bytes
-rw-r--r--pcre/testdata/saved16LE-1bin410 -> 418 bytes
-rw-r--r--pcre/testdata/saved32BE-1bin552 -> 560 bytes
-rw-r--r--pcre/testdata/saved32LE-1bin552 -> 560 bytes
-rw-r--r--pcre/testdata/testinput182
-rw-r--r--pcre/testdata/testinput217
-rw-r--r--pcre/testdata/testinput252
-rw-r--r--pcre/testdata/testinput311
-rw-r--r--pcre/testdata/testinput46
-rw-r--r--pcre/testdata/testinput52
-rw-r--r--pcre/testdata/testinput69
-rw-r--r--pcre/testdata/testinput76
-rw-r--r--pcre/testdata/testoutput128
-rw-r--r--pcre/testdata/testoutput132
-rw-r--r--pcre/testdata/testoutput1412
-rw-r--r--pcre/testdata/testoutput1536
-rw-r--r--pcre/testdata/testoutput168
-rw-r--r--pcre/testdata/testoutput1728
-rw-r--r--pcre/testdata/testoutput18-1642
-rw-r--r--pcre/testdata/testoutput18-3242
-rw-r--r--pcre/testdata/testoutput192
-rw-r--r--pcre/testdata/testoutput2192
-rw-r--r--pcre/testdata/testoutput21-164
-rw-r--r--pcre/testdata/testoutput21-324
-rw-r--r--pcre/testdata/testoutput22-164
-rw-r--r--pcre/testdata/testoutput22-324
-rw-r--r--pcre/testdata/testoutput2334
-rw-r--r--pcre/testdata/testoutput2536
-rw-r--r--pcre/testdata/testoutput315
-rw-r--r--pcre/testdata/testoutput3A174
-rw-r--r--pcre/testdata/testoutput3B174
-rw-r--r--pcre/testdata/testoutput48
-rw-r--r--pcre/testdata/testoutput533
-rw-r--r--pcre/testdata/testoutput612
-rw-r--r--pcre/testdata/testoutput728
-rw-r--r--pcre/testdata/testoutput82
-rw-r--r--pcre/testdata/wintestoutput34
79 files changed, 4679 insertions, 2974 deletions
diff --git a/pcre/AUTHORS b/pcre/AUTHORS
index 97d8c71dd67..5eee1af4c6f 100644
--- a/pcre/AUTHORS
+++ b/pcre/AUTHORS
@@ -8,7 +8,7 @@ Email domain: cam.ac.uk
University of Cambridge Computing Service,
Cambridge, England.
-Copyright (c) 1997-2013 University of Cambridge
+Copyright (c) 1997-2014 University of Cambridge
All rights reserved
@@ -19,7 +19,7 @@ Written by: Zoltan Herczeg
Email local part: hzmester
Emain domain: freemail.hu
-Copyright(c) 2010-2013 Zoltan Herczeg
+Copyright(c) 2010-2014 Zoltan Herczeg
All rights reserved.
@@ -30,7 +30,7 @@ Written by: Zoltan Herczeg
Email local part: hzmester
Emain domain: freemail.hu
-Copyright(c) 2009-2013 Zoltan Herczeg
+Copyright(c) 2009-2014 Zoltan Herczeg
All rights reserved.
diff --git a/pcre/ChangeLog b/pcre/ChangeLog
index 1f1e8600356..7801ef84117 100644
--- a/pcre/ChangeLog
+++ b/pcre/ChangeLog
@@ -1,6 +1,126 @@
ChangeLog for PCRE
------------------
+Version 8.35 04-April-2014
+--------------------------
+
+1. A new flag is set, when property checks are present in an XCLASS.
+ When this flag is not set, PCRE can perform certain optimizations
+ such as studying these XCLASS-es.
+
+2. The auto-possessification of character sets were improved: a normal
+ and an extended character set can be compared now. Furthermore
+ the JIT compiler optimizes more character set checks.
+
+3. Got rid of some compiler warnings for potentially uninitialized variables
+ that show up only when compiled with -O2.
+
+4. A pattern such as (?=ab\K) that uses \K in an assertion can set the start
+ of a match later then the end of the match. The pcretest program was not
+ handling the case sensibly - it was outputting from the start to the next
+ binary zero. It now reports this situation in a message, and outputs the
+ text from the end to the start.
+
+5. Fast forward search is improved in JIT. Instead of the first three
+ characters, any three characters with fixed position can be searched.
+ Search order: first, last, middle.
+
+6. Improve character range checks in JIT. Characters are read by an inprecise
+ function now, which returns with an unknown value if the character code is
+ above a certain treshold (e.g: 256). The only limitation is that the value
+ must be bigger than the treshold as well. This function is useful, when
+ the characters above the treshold are handled in the same way.
+
+7. The macros whose names start with RAWUCHAR are placeholders for a future
+ mode in which only the bottom 21 bits of 32-bit data items are used. To
+ make this more memorable for those maintaining the code, the names have
+ been changed to start with UCHAR21, and an extensive comment has been added
+ to their definition.
+
+8. Add missing (new) files sljitNativeTILEGX.c and sljitNativeTILEGX-encoder.c
+ to the export list in Makefile.am (they were accidentally omitted from the
+ 8.34 tarball).
+
+9. The informational output from pcretest used the phrase "starting byte set"
+ which is inappropriate for the 16-bit and 32-bit libraries. As the output
+ for "first char" and "need char" really means "non-UTF-char", I've changed
+ "byte" to "char", and slightly reworded the output. The documentation about
+ these values has also been (I hope) clarified.
+
+10. Another JIT related optimization: use table jumps for selecting the correct
+ backtracking path, when more than four alternatives are present inside a
+ bracket.
+
+11. Empty match is not possible, when the minimum length is greater than zero,
+ and there is no \K in the pattern. JIT should avoid empty match checks in
+ such cases.
+
+12. In a caseless character class with UCP support, when a character with more
+ than one alternative case was not the first character of a range, not all
+ the alternative cases were added to the class. For example, s and \x{17f}
+ are both alternative cases for S: the class [RST] was handled correctly,
+ but [R-T] was not.
+
+13. The configure.ac file always checked for pthread support when JIT was
+ enabled. This is not used in Windows, so I have put this test inside a
+ check for the presence of windows.h (which was already tested for).
+
+14. Improve pattern prefix search by a simplified Boyer-Moore algorithm in JIT.
+ The algorithm provides a way to skip certain starting offsets, and usually
+ faster than linear prefix searches.
+
+15. Change 13 for 8.20 updated RunTest to check for the 'fr' locale as well
+ as for 'fr_FR' and 'french'. For some reason, however, it then used the
+ Windows-specific input and output files, which have 'french' screwed in.
+ So this could never have worked. One of the problems with locales is that
+ they aren't always the same. I have now updated RunTest so that it checks
+ the output of the locale test (test 3) against three different output
+ files, and it allows the test to pass if any one of them matches. With luck
+ this should make the test pass on some versions of Solaris where it was
+ failing. Because of the uncertainty, the script did not used to stop if
+ test 3 failed; it now does. If further versions of a French locale ever
+ come to light, they can now easily be added.
+
+16. If --with-pcregrep-bufsize was given a non-integer value such as "50K",
+ there was a message during ./configure, but it did not stop. This now
+ provokes an error. The invalid example in README has been corrected.
+ If a value less than the minimum is given, the minimum value has always
+ been used, but now a warning is given.
+
+17. If --enable-bsr-anycrlf was set, the special 16/32-bit test failed. This
+ was a bug in the test system, which is now fixed. Also, the list of various
+ configurations that are tested for each release did not have one with both
+ 16/32 bits and --enable-bar-anycrlf. It now does.
+
+18. pcretest was missing "-C bsr" for displaying the \R default setting.
+
+19. Little endian PowerPC systems are supported now by the JIT compiler.
+
+20. The fast forward newline mechanism could enter to an infinite loop on
+ certain invalid UTF-8 input. Although we don't support these cases
+ this issue can be fixed by a performance optimization.
+
+21. Change 33 of 8.34 is not sufficient to ensure stack safety because it does
+ not take account if existing stack usage. There is now a new global
+ variable called pcre_stack_guard that can be set to point to an external
+ function to check stack availability. It is called at the start of
+ processing every parenthesized group.
+
+22. A typo in the code meant that in ungreedy mode the max/min qualifier
+ behaved like a min-possessive qualifier, and, for example, /a{1,3}b/U did
+ not match "ab".
+
+23. When UTF was disabled, the JIT program reported some incorrect compile
+ errors. These messages are silenced now.
+
+24. Experimental support for ARM-64 and MIPS-64 has been added to the JIT
+ compiler.
+
+25. Change all the temporary files used in RunGrepTest to be different to those
+ used by RunTest so that the tests can be run simultaneously, for example by
+ "make -j check".
+
+
Version 8.34 15-December-2013
-----------------------------
diff --git a/pcre/INSTALL b/pcre/INSTALL
index 007e9396d0a..2099840756e 100644
--- a/pcre/INSTALL
+++ b/pcre/INSTALL
@@ -12,8 +12,8 @@ without warranty of any kind.
Basic Installation
==================
- Briefly, the shell commands `./configure; make; make install' should
-configure, build, and install this package. The following
+ Briefly, the shell command `./configure && make && make install'
+should configure, build, and install this package. The following
more-detailed instructions are generic; see the `README' file for
instructions specific to this package. Some packages provide this
`INSTALL' file but do not implement all of the features documented
diff --git a/pcre/LICENCE b/pcre/LICENCE
index 3aff6a62c00..602e4ae6804 100644
--- a/pcre/LICENCE
+++ b/pcre/LICENCE
@@ -24,7 +24,7 @@ Email domain: cam.ac.uk
University of Cambridge Computing Service,
Cambridge, England.
-Copyright (c) 1997-2013 University of Cambridge
+Copyright (c) 1997-2014 University of Cambridge
All rights reserved.
@@ -35,7 +35,7 @@ Written by: Zoltan Herczeg
Email local part: hzmester
Emain domain: freemail.hu
-Copyright(c) 2010-2013 Zoltan Herczeg
+Copyright(c) 2010-2014 Zoltan Herczeg
All rights reserved.
@@ -46,7 +46,7 @@ Written by: Zoltan Herczeg
Email local part: hzmester
Emain domain: freemail.hu
-Copyright(c) 2009-2013 Zoltan Herczeg
+Copyright(c) 2009-2014 Zoltan Herczeg
All rights reserved.
diff --git a/pcre/NEWS b/pcre/NEWS
index 5f52f153460..6331e9908d1 100644
--- a/pcre/NEWS
+++ b/pcre/NEWS
@@ -1,6 +1,17 @@
News about PCRE releases
------------------------
+Release 8.35 04-April-2014
+--------------------------
+
+There have been performance improvements for classes containing non-ASCII
+characters and the "auto-possessification" feature has been extended. Other
+minor improvements have been implemented and bugs fixed. There is a new callout
+feature to enable applications to do detailed stack checks at compile time, to
+avoid running out of stack for deeply nested parentheses. The JIT compiler has
+been extended with experimental support for ARM-64, MIPS-64, and PPC-LE.
+
+
Release 8.34 15-December-2013
-----------------------------
diff --git a/pcre/README b/pcre/README
index 51197df7213..88f2dfd4efd 100644
--- a/pcre/README
+++ b/pcre/README
@@ -85,11 +85,12 @@ documentation is supplied in two other forms:
1. There are files called doc/pcre.txt, doc/pcregrep.txt, and
doc/pcretest.txt in the source distribution. The first of these is a
concatenation of the text forms of all the section 3 man pages except
- those that summarize individual functions. The other two are the text
- forms of the section 1 man pages for the pcregrep and pcretest commands.
- These text forms are provided for ease of scanning with text editors or
- similar tools. They are installed in <prefix>/share/doc/pcre, where
- <prefix> is the installation prefix (defaulting to /usr/local).
+ the listing of pcredemo.c and those that summarize individual functions.
+ The other two are the text forms of the section 1 man pages for the
+ pcregrep and pcretest commands. These text forms are provided for ease of
+ scanning with text editors or similar tools. They are installed in
+ <prefix>/share/doc/pcre, where <prefix> is the installation prefix
+ (defaulting to /usr/local).
2. A set of files containing all the documentation in HTML form, hyperlinked
in various ways, and rooted in a file called index.html, is distributed in
@@ -372,12 +373,12 @@ library. They are also documented in the pcrebuild man page.
Of course, the relevant libraries must be installed on your system.
-. The default size of internal buffer used by pcregrep can be set by, for
- example:
+. The default size (in bytes) of the internal buffer used by pcregrep can be
+ set by, for example:
- --with-pcregrep-bufsize=50K
+ --with-pcregrep-bufsize=51200
- The default value is 20K.
+ The value must be a plain integer. The default is 20480.
. It is possible to compile pcretest so that it links with the libreadline
or libedit libraries, by specifying, respectively,
@@ -987,4 +988,4 @@ pcre_xxx, one with the name pcre16_xx, and a third with the name pcre32_xxx.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 05 November 2013
+Last updated: 17 January 2014
diff --git a/pcre/RunGrepTest b/pcre/RunGrepTest
index e192ed77f7c..f1b03484067 100755
--- a/pcre/RunGrepTest
+++ b/pcre/RunGrepTest
@@ -69,447 +69,447 @@ utf8=$?
echo "Testing pcregrep main features"
-echo "---------------------------- Test 1 ------------------------------" >testtry
-(cd $srcdir; $valgrind $pcregrep PATTERN ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 2 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep '^PATTERN' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 3 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -in PATTERN ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 4 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -ic PATTERN ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 5 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -in PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 6 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -inh PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 7 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -il PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 8 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -l PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 9 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -q PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 10 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -q NEVER-PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 11 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -vn pattern ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 12 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -ix pattern ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 13 -----------------------------" >>testtry
-echo seventeen >testtemp1
-(cd $srcdir; $valgrind $pcregrep -f./testdata/greplist -f $builddir/testtemp1 ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 14 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -w pat ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 15 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep 'abc^*' ./testdata/grepinput) 2>>testtry >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 16 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep abc ./testdata/grepinput ./testdata/nonexistfile) 2>>testtry >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 17 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -M 'the\noutput' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 18 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -Mn '(the\noutput|dog\.\n--)' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 19 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -Mix 'Pattern' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 20 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -Mixn 'complete pair\nof lines' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 21 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -nA3 'four' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 22 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -nB3 'four' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 23 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -C3 'four' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 24 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -A9 'four' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 25 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -nB9 'four' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 26 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -A9 -B9 'four' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 27 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -A10 'four' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 28 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -nB10 'four' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 29 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -C12 -B10 'four' ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 30 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -inB3 'pattern' ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 31 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -inA3 'pattern' ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 32 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -L 'fox' ./testdata/grepinput ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 33 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep 'fox' ./testdata/grepnonexist) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 34 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -s 'fox' ./testdata/grepnonexist) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 35 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -L -r --include=grepinputx --include grepinput8 --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 36 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -L -r --include=grepinput --exclude 'grepinput$' --exclude=grepinput8 --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 37 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep '^(a+)*\d' ./testdata/grepinput) >>testtry 2>teststderr
-echo "RC=$?" >>testtry
-echo "======== STDERR ========" >>testtry
-cat teststderr >>testtry
+echo "---------------------------- Test 1 ------------------------------" >testtrygrep
+(cd $srcdir; $valgrind $pcregrep PATTERN ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 2 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep '^PATTERN' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 3 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -in PATTERN ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 4 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -ic PATTERN ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 5 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -in PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 6 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -inh PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 7 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -il PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 8 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -l PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 9 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -q PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 10 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -q NEVER-PATTERN ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 11 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -vn pattern ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 12 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -ix pattern ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 13 -----------------------------" >>testtrygrep
+echo seventeen >testtemp1grep
+(cd $srcdir; $valgrind $pcregrep -f./testdata/greplist -f $builddir/testtemp1grep ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 14 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -w pat ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 15 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep 'abc^*' ./testdata/grepinput) 2>>testtrygrep >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 16 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep abc ./testdata/grepinput ./testdata/nonexistfile) 2>>testtrygrep >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 17 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -M 'the\noutput' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 18 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -Mn '(the\noutput|dog\.\n--)' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 19 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -Mix 'Pattern' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 20 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -Mixn 'complete pair\nof lines' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 21 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -nA3 'four' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 22 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -nB3 'four' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 23 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -C3 'four' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 24 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -A9 'four' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 25 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -nB9 'four' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 26 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -A9 -B9 'four' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 27 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -A10 'four' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 28 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -nB10 'four' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 29 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -C12 -B10 'four' ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 30 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -inB3 'pattern' ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 31 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -inA3 'pattern' ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 32 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -L 'fox' ./testdata/grepinput ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 33 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep 'fox' ./testdata/grepnonexist) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 34 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -s 'fox' ./testdata/grepnonexist) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 35 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -L -r --include=grepinputx --include grepinput8 --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 36 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -L -r --include=grepinput --exclude 'grepinput$' --exclude=grepinput8 --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 37 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep '^(a+)*\d' ./testdata/grepinput) >>testtrygrep 2>teststderrgrep
+echo "RC=$?" >>testtrygrep
+echo "======== STDERR ========" >>testtrygrep
+cat teststderrgrep >>testtrygrep
-echo "---------------------------- Test 38 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep '>\x00<' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 38 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep '>\x00<' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 39 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -A1 'before the binary zero' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 39 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -A1 'before the binary zero' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 40 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -B1 'after the binary zero' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 40 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -B1 'after the binary zero' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 41 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -B1 -o '\w+ the binary zero' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 41 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -B1 -o '\w+ the binary zero' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 42 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -B1 -onH '\w+ the binary zero' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 42 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -B1 -onH '\w+ the binary zero' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 43 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -on 'before|zero|after' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 43 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -on 'before|zero|after' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 44 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -on -e before -ezero -e after ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 44 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -on -e before -ezero -e after ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 45 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -on -f ./testdata/greplist -e binary ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 45 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -on -f ./testdata/greplist -e binary ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 46 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -eabc -e '(unclosed' ./testdata/grepinput) 2>>testtry >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 46 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -eabc -e '(unclosed' ./testdata/grepinput) 2>>testtrygrep >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 47 ------------------------------" >>testtry
+echo "---------------------------- Test 47 ------------------------------" >>testtrygrep
(cd $srcdir; $valgrind $pcregrep -Fx "AB.VE
-elephant" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+elephant" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 48 ------------------------------" >>testtry
+echo "---------------------------- Test 48 ------------------------------" >>testtrygrep
(cd $srcdir; $valgrind $pcregrep -F "AB.VE
-elephant" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+elephant" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 49 ------------------------------" >>testtry
+echo "---------------------------- Test 49 ------------------------------" >>testtrygrep
(cd $srcdir; $valgrind $pcregrep -F -e DATA -e "AB.VE
-elephant" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+elephant" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 50 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep "^(abc|def|ghi|jkl)" ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 50 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep "^(abc|def|ghi|jkl)" ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 51 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -Mv "brown\sfox" ./testdata/grepinputv) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 51 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -Mv "brown\sfox" ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 52 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --colour=always jumps ./testdata/grepinputv) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 52 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --colour=always jumps ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 53 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --file-offsets 'before|zero|after' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 53 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --file-offsets 'before|zero|after' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 54 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --line-offsets 'before|zero|after' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 54 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --line-offsets 'before|zero|after' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 55 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -f./testdata/greplist --color=always ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 55 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -f./testdata/greplist --color=always ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 56 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -c lazy ./testdata/grepinput*) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 56 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -c lazy ./testdata/grepinput*) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 57 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -c -l lazy ./testdata/grepinput*) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 57 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -c -l lazy ./testdata/grepinput*) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 58 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --regex=PATTERN ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 58 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --regex=PATTERN ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 59 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --regexp=PATTERN ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 59 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --regexp=PATTERN ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 60 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --regex PATTERN ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 60 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --regex PATTERN ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 61 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --regexp PATTERN ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 61 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --regexp PATTERN ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 62 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --match-limit=1000 --no-jit -M 'This is a file(.|\R)*file.' ./testdata/grepinput) >>testtry 2>&1
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 62 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --match-limit=1000 --no-jit -M 'This is a file(.|\R)*file.' ./testdata/grepinput) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 63 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --recursion-limit=1000 --no-jit -M 'This is a file(.|\R)*file.' ./testdata/grepinput) >>testtry 2>&1
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 63 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --recursion-limit=1000 --no-jit -M 'This is a file(.|\R)*file.' ./testdata/grepinput) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 64 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o1 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 64 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o1 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 65 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o2 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 65 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o2 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 66 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o3 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 66 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o3 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 67 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o12 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 67 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o12 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 68 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --only-matching=2 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 68 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --only-matching=2 '(?<=PAT)TERN (ap(pear)s)' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 69 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -vn --colour=always pattern ./testdata/grepinputx) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 69 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -vn --colour=always pattern ./testdata/grepinputx) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 70 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --color=always -M "triple:\t.*\n\n" ./testdata/grepinput3) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 70 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --color=always -M "triple:\t.*\n\n" ./testdata/grepinput3) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 71 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o "^01|^02|^03" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 71 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o "^01|^02|^03" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 72 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --color=always "^01|^02|^03" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 72 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --color=always "^01|^02|^03" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 73 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o --colour=always "^01|^02|^03" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 73 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o --colour=always "^01|^02|^03" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 74 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o "^01|02|^03" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 75 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --color=always "^01|02|^03" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 76 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o --colour=always "^01|02|^03" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 77 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o "^01|^02|03" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 78 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --color=always "^01|^02|03" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 79 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o --colour=always "^01|^02|03" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 80 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o "\b01|\b02" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 81 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --color=always "\\b01|\\b02" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 82 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o --colour=always "\\b01|\\b02" ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 83 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --buffer-size=100 "^a" ./testdata/grepinput3) >>testtry 2>&1
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 74 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o "^01|02|^03" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 75 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --color=always "^01|02|^03" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 76 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o --colour=always "^01|02|^03" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 77 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o "^01|^02|03" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 78 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --color=always "^01|^02|03" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 79 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o --colour=always "^01|^02|03" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 80 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o "\b01|\b02" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 81 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --color=always "\\b01|\\b02" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 82 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o --colour=always "\\b01|\\b02" ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 83 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --buffer-size=100 "^a" ./testdata/grepinput3) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 84 -----------------------------" >>testtry
-echo testdata/grepinput3 >testtemp1
-(cd $srcdir; $valgrind $pcregrep --file-list ./testdata/grepfilelist --file-list $builddir/testtemp1 "fox|complete|t7") >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 85 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --file-list=./testdata/grepfilelist "dolor" ./testdata/grepinput3) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 86 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep "dog" ./testdata/grepbinary) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 87 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep "cat" ./testdata/grepbinary) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 88 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -v "cat" ./testdata/grepbinary) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 89 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -I "dog" ./testdata/grepbinary) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 90 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --binary-files=without-match "dog" ./testdata/grepbinary) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 91 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -a "dog" ./testdata/grepbinary) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 92 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --binary-files=text "dog" ./testdata/grepbinary) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 93 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --text "dog" ./testdata/grepbinary) >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 94 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -L -r --include=grepinputx --include grepinput8 'fox' ./testdata/grepinput* | sort) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 95 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --file-list ./testdata/grepfilelist --exclude grepinputv "fox|complete") >>testtry 2>&1
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 96 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -L -r --include-dir=testdata --exclude '^(?!grepinput)' 'fox' ./test* | sort) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 97 -----------------------------" >>testtry
-echo "grepinput$" >testtemp1
-echo "grepinput8" >>testtemp1
-(cd $srcdir; $valgrind $pcregrep -L -r --include=grepinput --exclude-from $builddir/testtemp1 --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 84 -----------------------------" >>testtrygrep
+echo testdata/grepinput3 >testtemp1grep
+(cd $srcdir; $valgrind $pcregrep --file-list ./testdata/grepfilelist --file-list $builddir/testtemp1grep "fox|complete|t7") >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 85 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --file-list=./testdata/grepfilelist "dolor" ./testdata/grepinput3) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 86 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep "dog" ./testdata/grepbinary) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 87 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep "cat" ./testdata/grepbinary) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 88 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -v "cat" ./testdata/grepbinary) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 89 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -I "dog" ./testdata/grepbinary) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 90 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --binary-files=without-match "dog" ./testdata/grepbinary) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 91 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -a "dog" ./testdata/grepbinary) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 92 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --binary-files=text "dog" ./testdata/grepbinary) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 93 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --text "dog" ./testdata/grepbinary) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 94 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -L -r --include=grepinputx --include grepinput8 'fox' ./testdata/grepinput* | sort) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 95 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --file-list ./testdata/grepfilelist --exclude grepinputv "fox|complete") >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 96 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -L -r --include-dir=testdata --exclude '^(?!grepinput)' 'fox' ./test* | sort) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 97 -----------------------------" >>testtrygrep
+echo "grepinput$" >testtemp1grep
+echo "grepinput8" >>testtemp1grep
+(cd $srcdir; $valgrind $pcregrep -L -r --include=grepinput --exclude-from $builddir/testtemp1grep --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 98 -----------------------------" >>testtry
-echo "grepinput$" >testtemp1
-echo "grepinput8" >>testtemp1
-(cd $srcdir; $valgrind $pcregrep -L -r --exclude=grepinput3 --include=grepinput --exclude-from $builddir/testtemp1 --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtry
-echo "RC=$?" >>testtry
-
-echo "---------------------------- Test 99 -----------------------------" >>testtry
-echo "grepinput$" >testtemp1
-echo "grepinput8" >testtemp2
-(cd $srcdir; $valgrind $pcregrep -L -r --include grepinput --exclude-from $builddir/testtemp1 --exclude-from=$builddir/testtemp2 --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 98 -----------------------------" >>testtrygrep
+echo "grepinput$" >testtemp1grep
+echo "grepinput8" >>testtemp1grep
+(cd $srcdir; $valgrind $pcregrep -L -r --exclude=grepinput3 --include=grepinput --exclude-from $builddir/testtemp1grep --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 99 -----------------------------" >>testtrygrep
+echo "grepinput$" >testtemp1grep
+echo "grepinput8" >testtemp2grep
+(cd $srcdir; $valgrind $pcregrep -L -r --include grepinput --exclude-from $builddir/testtemp1grep --exclude-from=$builddir/testtemp2grep --exclude-dir='^\.' 'fox' ./testdata | sort) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 100 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -Ho2 --only-matching=1 -o3 '(\w+) binary (\w+)(\.)?' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 100 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -Ho2 --only-matching=1 -o3 '(\w+) binary (\w+)(\.)?' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 101 ------------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -o3 -Ho2 -o12 --only-matching=1 -o3 --colour=always --om-separator='|' '(\w+) binary (\w+)(\.)?' ./testdata/grepinput) >>testtry
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 101 ------------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -o3 -Ho2 -o12 --only-matching=1 -o3 --colour=always --om-separator='|' '(\w+) binary (\w+)(\.)?' ./testdata/grepinput) >>testtrygrep
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 102 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -n "^$" ./testdata/grepinput3) >>testtry 2>&1
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 102 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -n "^$" ./testdata/grepinput3) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 103 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --only-matching "^$" ./testdata/grepinput3) >>testtry 2>&1
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 103 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --only-matching "^$" ./testdata/grepinput3) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 104 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep -n --only-matching "^$" ./testdata/grepinput3) >>testtry 2>&1
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 104 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep -n --only-matching "^$" ./testdata/grepinput3) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 105 -----------------------------" >>testtry
-(cd $srcdir; $valgrind $pcregrep --colour=always "ipsum|" ./testdata/grepinput3) >>testtry 2>&1
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 105 -----------------------------" >>testtrygrep
+(cd $srcdir; $valgrind $pcregrep --colour=always "ipsum|" ./testdata/grepinput3) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
-echo "---------------------------- Test 106 -----------------------------" >>testtry
-(cd $srcdir; echo "a" | $valgrind $pcregrep -M "|a" ) >>testtry 2>&1
-echo "RC=$?" >>testtry
+echo "---------------------------- Test 106 -----------------------------" >>testtrygrep
+(cd $srcdir; echo "a" | $valgrind $pcregrep -M "|a" ) >>testtrygrep 2>&1
+echo "RC=$?" >>testtrygrep
# Now compare the results.
-$cf $srcdir/testdata/grepoutput testtry
+$cf $srcdir/testdata/grepoutput testtrygrep
if [ $? != 0 ] ; then exit 1; fi
@@ -518,15 +518,15 @@ if [ $? != 0 ] ; then exit 1; fi
if [ $utf8 -ne 0 ] ; then
echo "Testing pcregrep UTF-8 features"
- echo "---------------------------- Test U1 ------------------------------" >testtry
- (cd $srcdir; $valgrind $pcregrep -n -u --newline=any "^X" ./testdata/grepinput8) >>testtry
- echo "RC=$?" >>testtry
+ echo "---------------------------- Test U1 ------------------------------" >testtrygrep
+ (cd $srcdir; $valgrind $pcregrep -n -u --newline=any "^X" ./testdata/grepinput8) >>testtrygrep
+ echo "RC=$?" >>testtrygrep
- echo "---------------------------- Test U2 ------------------------------" >>testtry
- (cd $srcdir; $valgrind $pcregrep -n -u -C 3 --newline=any "Match" ./testdata/grepinput8) >>testtry
- echo "RC=$?" >>testtry
+ echo "---------------------------- Test U2 ------------------------------" >>testtrygrep
+ (cd $srcdir; $valgrind $pcregrep -n -u -C 3 --newline=any "Match" ./testdata/grepinput8) >>testtrygrep
+ echo "RC=$?" >>testtrygrep
- $cf $srcdir/testdata/grepoutput8 testtry
+ $cf $srcdir/testdata/grepoutput8 testtrygrep
if [ $? != 0 ] ; then exit 1; fi
else
@@ -542,28 +542,28 @@ fi
# starts with a hyphen. These tests are run in the build directory.
echo "Testing pcregrep newline settings"
-printf "abc\rdef\r\nghi\njkl" >testNinput
+printf "abc\rdef\r\nghi\njkl" >testNinputgrep
-printf "%c--------------------------- Test N1 ------------------------------\r\n" - >testtry
-$valgrind $pcregrep -n -N CR "^(abc|def|ghi|jkl)" testNinput >>testtry
+printf "%c--------------------------- Test N1 ------------------------------\r\n" - >testtrygrep
+$valgrind $pcregrep -n -N CR "^(abc|def|ghi|jkl)" testNinputgrep >>testtrygrep
-printf "%c--------------------------- Test N2 ------------------------------\r\n" - >>testtry
-$valgrind $pcregrep -n --newline=crlf "^(abc|def|ghi|jkl)" testNinput >>testtry
+printf "%c--------------------------- Test N2 ------------------------------\r\n" - >>testtrygrep
+$valgrind $pcregrep -n --newline=crlf "^(abc|def|ghi|jkl)" testNinputgrep >>testtrygrep
-printf "%c--------------------------- Test N3 ------------------------------\r\n" - >>testtry
+printf "%c--------------------------- Test N3 ------------------------------\r\n" - >>testtrygrep
pattern=`printf 'def\rjkl'`
-$valgrind $pcregrep -n --newline=cr -F "$pattern" testNinput >>testtry
+$valgrind $pcregrep -n --newline=cr -F "$pattern" testNinputgrep >>testtrygrep
-printf "%c--------------------------- Test N4 ------------------------------\r\n" - >>testtry
-$valgrind $pcregrep -n --newline=crlf -F -f $srcdir/testdata/greppatN4 testNinput >>testtry
+printf "%c--------------------------- Test N4 ------------------------------\r\n" - >>testtrygrep
+$valgrind $pcregrep -n --newline=crlf -F -f $srcdir/testdata/greppatN4 testNinputgrep >>testtrygrep
-printf "%c--------------------------- Test N5 ------------------------------\r\n" - >>testtry
-$valgrind $pcregrep -n --newline=any "^(abc|def|ghi|jkl)" testNinput >>testtry
+printf "%c--------------------------- Test N5 ------------------------------\r\n" - >>testtrygrep
+$valgrind $pcregrep -n --newline=any "^(abc|def|ghi|jkl)" testNinputgrep >>testtrygrep
-printf "%c--------------------------- Test N6 ------------------------------\r\n" - >>testtry
-$valgrind $pcregrep -n --newline=anycrlf "^(abc|def|ghi|jkl)" testNinput >>testtry
+printf "%c--------------------------- Test N6 ------------------------------\r\n" - >>testtrygrep
+$valgrind $pcregrep -n --newline=anycrlf "^(abc|def|ghi|jkl)" testNinputgrep >>testtrygrep
-$cf $srcdir/testdata/grepoutputN testtry
+$cf $srcdir/testdata/grepoutputN testtrygrep
if [ $? != 0 ] ; then exit 1; fi
exit 0
diff --git a/pcre/RunTest b/pcre/RunTest
index 7caa51d6772..67cfbf07cf9 100755
--- a/pcre/RunTest
+++ b/pcre/RunTest
@@ -31,6 +31,11 @@
# except test 10. Whatever order the arguments are in, the tests are always run
# in numerical order.
#
+# The special argument "3S" runs test 3, stopping if it fails. Test 3 is the
+# locale test, and failure usually means there's an issue with the locale
+# rather than a bug in PCRE, so normally subsequent tests are run. "3S" is
+# useful when you want to debug or update the test.
+#
# Inappropriate tests are automatically skipped (with a comment to say so): for
# example, if JIT support is not compiled, test 12 is skipped, whereas if JIT
# support is compiled, test 13 is skipped.
@@ -458,8 +463,9 @@ fi
# Locale-specific tests, provided that either the "fr_FR" or the "french"
# locale is available. The former is the Unix-like standard; the latter is
-# for Windows. Another possibility is "fr", which needs to be run against
-# the Windows-specific input and output files.
+# for Windows. Another possibility is "fr". Unfortunately, different versions
+# of the French locale give different outputs for some items. This test passes
+# if the output matches any one of the alternative output files.
if [ $do3 = yes ] ; then
locale -a | grep '^fr_FR$' >/dev/null
@@ -467,20 +473,28 @@ if [ $do3 = yes ] ; then
locale=fr_FR
infile=$testdata/testinput3
outfile=$testdata/testoutput3
+ outfile2=$testdata/testoutput3A
+ outfile3=$testdata/testoutput3B
else
infile=test3input
outfile=test3output
+ outfile2=test3outputA
+ outfile3=test3outputB
locale -a | grep '^french$' >/dev/null
if [ $? -eq 0 ] ; then
locale=french
sed 's/fr_FR/french/' $testdata/testinput3 >test3input
sed 's/fr_FR/french/' $testdata/testoutput3 >test3output
+ sed 's/fr_FR/french/' $testdata/testoutput3A >test3outputA
+ sed 's/fr_FR/french/' $testdata/testoutput3B >test3outputB
else
locale -a | grep '^fr$' >/dev/null
if [ $? -eq 0 ] ; then
locale=fr
- sed 's/fr_FR/fr/' $testdata/wintestinput3 >test3input
- sed 's/fr_FR/fr/' $testdata/wintestoutput3 >test3output
+ sed 's/fr_FR/fr/' $testdata/intestinput3 >test3input
+ sed 's/fr_FR/fr/' $testdata/intestoutput3 >test3output
+ sed 's/fr_FR/fr/' $testdata/intestoutput3A >test3outputA
+ sed 's/fr_FR/fr/' $testdata/intestoutput3B >test3outputB
else
locale=
fi
@@ -492,18 +506,20 @@ if [ $do3 = yes ] ; then
for opt in "" "-s" $jitopt; do
$sim $valgrind ./pcretest -q $bmode $opt $infile testtry
if [ $? = 0 ] ; then
- $cf $outfile testtry
- if [ $? != 0 ] ; then
- echo " "
- echo "Locale test did not run entirely successfully."
- echo "This usually means that there is a problem with the locale"
- echo "settings rather than a bug in PCRE."
- break;
- else
+ if $cf $outfile testtry >teststdout || \
+ $cf $outfile2 testtry >teststdout || \
+ $cf $outfile3 testtry >teststdout
+ then
if [ "$opt" = "-s" ] ; then echo " OK with study"
elif [ "$opt" = "-s+" ] ; then echo " OK with JIT study"
else echo " OK"
fi
+ else
+ echo "** Locale test did not run successfully. The output did not match"
+ echo " $outfile, $outfile2 or $outfile3."
+ echo " This may mean that there is a problem with the locale settings rather"
+ echo " than a bug in PCRE."
+ exit 1
fi
else exit 1
fi
@@ -989,6 +1005,6 @@ fi
done
# Clean up local working files
-rm -f test3input test3output testNinput testsaved* teststderr teststdout testtry
+rm -f test3input test3output test3outputA testNinput testsaved* teststderr teststdout testtry
# End
diff --git a/pcre/configure.ac b/pcre/configure.ac
index 5ce6c62c0d3..aab2f56c218 100644
--- a/pcre/configure.ac
+++ b/pcre/configure.ac
@@ -9,17 +9,17 @@ dnl The PCRE_PRERELEASE feature is for identifying release candidates. It might
dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre_major, [8])
-m4_define(pcre_minor, [34])
+m4_define(pcre_minor, [35])
m4_define(pcre_prerelease, [])
-m4_define(pcre_date, [2013-12-15])
+m4_define(pcre_date, [2014-04-04])
# NOTE: The CMakeLists.txt file searches for the above variables in the first
# 50 lines of this file. Please update that if the variables above are moved.
# Libtool shared library interface versions (current:revision:age)
-m4_define(libpcre_version, [3:2:2])
-m4_define(libpcre16_version, [2:2:2])
-m4_define(libpcre32_version, [0:2:0])
+m4_define(libpcre_version, [3:3:2])
+m4_define(libpcre16_version, [2:3:2])
+m4_define(libpcre32_version, [0:3:0])
m4_define(libpcreposix_version, [0:2:0])
m4_define(libpcrecpp_version, [0:0:0])
@@ -248,7 +248,7 @@ AC_ARG_ENABLE(pcregrep-libbz2,
# Handle --with-pcregrep-bufsize=N
AC_ARG_WITH(pcregrep-bufsize,
AS_HELP_STRING([--with-pcregrep-bufsize=N],
- [pcregrep buffer size (default=20480)]),
+ [pcregrep buffer size (default=20480, minimum=8192)]),
, with_pcregrep_bufsize=20480)
# Handle --enable-pcretest-libedit
@@ -461,7 +461,8 @@ sure both macros are undefined; an emulation function will then be used. */])
# Checks for header files.
AC_HEADER_STDC
-AC_CHECK_HEADERS(limits.h sys/types.h sys/stat.h dirent.h windows.h)
+AC_CHECK_HEADERS(limits.h sys/types.h sys/stat.h dirent.h)
+AC_CHECK_HEADERS([windows.h], [HAVE_WINDOWS_H=1])
# The files below are C++ header files.
pcre_have_type_traits="0"
@@ -686,11 +687,15 @@ if test "$enable_pcre32" = "yes"; then
Define to any value to enable the 32 bit PCRE library.])
fi
+# Unless running under Windows, JIT support requires pthreads.
+
if test "$enable_jit" = "yes"; then
- AX_PTHREAD([], [AC_MSG_ERROR([JIT support requires pthreads])])
- CC="$PTHREAD_CC"
- CFLAGS="$PTHREAD_CFLAGS $CFLAGS"
- LIBS="$PTHREAD_LIBS $LIBS"
+ if test "$HAVE_WINDOWS_H" != "1"; then
+ AX_PTHREAD([], [AC_MSG_ERROR([JIT support requires pthreads])])
+ CC="$PTHREAD_CC"
+ CFLAGS="$PTHREAD_CFLAGS $CFLAGS"
+ LIBS="$PTHREAD_LIBS $LIBS"
+ fi
AC_DEFINE([SUPPORT_JIT], [], [
Define to any value to enable support for Just-In-Time compiling.])
else
@@ -739,7 +744,12 @@ if test "$enable_pcregrep_libbz2" = "yes"; then
fi
if test $with_pcregrep_bufsize -lt 8192 ; then
+ AC_MSG_WARN([$with_pcregrep_bufsize is too small for --with-pcregrep-bufsize; using 8192])
with_pcregrep_bufsize="8192"
+else
+ if test $? -gt 1 ; then
+ AC_MSG_ERROR([Bad value for --with-pcregrep-bufsize])
+ fi
fi
AC_DEFINE_UNQUOTED([PCREGREP_BUFSIZE], [$with_pcregrep_bufsize], [
diff --git a/pcre/doc/html/README.txt b/pcre/doc/html/README.txt
index 51197df7213..88f2dfd4efd 100644
--- a/pcre/doc/html/README.txt
+++ b/pcre/doc/html/README.txt
@@ -85,11 +85,12 @@ documentation is supplied in two other forms:
1. There are files called doc/pcre.txt, doc/pcregrep.txt, and
doc/pcretest.txt in the source distribution. The first of these is a
concatenation of the text forms of all the section 3 man pages except
- those that summarize individual functions. The other two are the text
- forms of the section 1 man pages for the pcregrep and pcretest commands.
- These text forms are provided for ease of scanning with text editors or
- similar tools. They are installed in <prefix>/share/doc/pcre, where
- <prefix> is the installation prefix (defaulting to /usr/local).
+ the listing of pcredemo.c and those that summarize individual functions.
+ The other two are the text forms of the section 1 man pages for the
+ pcregrep and pcretest commands. These text forms are provided for ease of
+ scanning with text editors or similar tools. They are installed in
+ <prefix>/share/doc/pcre, where <prefix> is the installation prefix
+ (defaulting to /usr/local).
2. A set of files containing all the documentation in HTML form, hyperlinked
in various ways, and rooted in a file called index.html, is distributed in
@@ -372,12 +373,12 @@ library. They are also documented in the pcrebuild man page.
Of course, the relevant libraries must be installed on your system.
-. The default size of internal buffer used by pcregrep can be set by, for
- example:
+. The default size (in bytes) of the internal buffer used by pcregrep can be
+ set by, for example:
- --with-pcregrep-bufsize=50K
+ --with-pcregrep-bufsize=51200
- The default value is 20K.
+ The value must be a plain integer. The default is 20480.
. It is possible to compile pcretest so that it links with the libreadline
or libedit libraries, by specifying, respectively,
@@ -987,4 +988,4 @@ pcre_xxx, one with the name pcre16_xx, and a third with the name pcre32_xxx.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 05 November 2013
+Last updated: 17 January 2014
diff --git a/pcre/doc/html/pcre.html b/pcre/doc/html/pcre.html
index 93b129ecd83..c2b29aa8121 100644
--- a/pcre/doc/html/pcre.html
+++ b/pcre/doc/html/pcre.html
@@ -154,8 +154,11 @@ page.
The user documentation for PCRE comprises a number of different sections. In
the "man" format, each of these is a separate "man page". In the HTML format,
each is a separate page, linked from the index page. In the plain text format,
-all the sections, except the <b>pcredemo</b> section, are concatenated, for ease
-of searching. The sections are as follows:
+the descriptions of the <b>pcregrep</b> and <b>pcretest</b> programs are in files
+called <b>pcregrep.txt</b> and <b>pcretest.txt</b>, respectively. The remaining
+sections, except for the <b>pcredemo</b> section (which is a program listing),
+are concatenated in <b>pcre.txt</b>, for ease of searching. The sections are as
+follows:
<pre>
pcre this document
pcre-config show PCRE installation configuration information
@@ -182,8 +185,8 @@ of searching. The sections are as follows:
pcretest description of the <b>pcretest</b> testing command
pcreunicode discussion of Unicode and UTF-8/16/32 support
</pre>
-In addition, in the "man" and HTML formats, there is a short page for each
-C library function, listing its arguments and results.
+In the "man" and HTML formats, there is also a short page for each C library
+function, listing its arguments and results.
</P>
<br><a name="SEC4" href="#TOC1">AUTHOR</a><br>
<P>
@@ -201,9 +204,9 @@ two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC5" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 13 May 2013
+Last updated: 08 January 2014
<br>
-Copyright &copy; 1997-2013 University of Cambridge.
+Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/pcre/doc/html/pcreapi.html b/pcre/doc/html/pcreapi.html
index abc3d2663fc..b401ecc76df 100644
--- a/pcre/doc/html/pcreapi.html
+++ b/pcre/doc/html/pcreapi.html
@@ -166,6 +166,9 @@ man page, in case the conversion went wrong.
<br>
<br>
<b>int (*pcre_callout)(pcre_callout_block *);</b>
+<br>
+<br>
+<b>int (*pcre_stack_guard)(void);</b>
</P>
<br><a name="SEC5" href="#TOC1">PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a><br>
<P>
@@ -324,6 +327,15 @@ by the caller to a "callout" function, which PCRE will then call at specified
points during a matching operation. Details are given in the
<a href="pcrecallout.html"><b>pcrecallout</b></a>
documentation.
+</P>
+<P>
+The global variable <b>pcre_stack_guard</b> initially contains NULL. It can be
+set by the caller to a function that is called by PCRE whenever it starts
+to compile a parenthesized part of a pattern. When parentheses are nested, PCRE
+uses recursive function calls, which use up the system stack. This function is
+provided so that applications with restricted stacks can force a compilation
+error if the stack runs out. The function should return zero if all is well, or
+non-zero to force an error.
<a name="newlines"></a></P>
<br><a name="SEC7" href="#TOC1">NEWLINES</a><br>
<P>
@@ -369,7 +381,8 @@ controlled in a similar way, but by separate options.
The PCRE functions can be used in multi-threading applications, with the
proviso that the memory management functions pointed to by <b>pcre_malloc</b>,
<b>pcre_free</b>, <b>pcre_stack_malloc</b>, and <b>pcre_stack_free</b>, and the
-callout function pointed to by <b>pcre_callout</b>, are shared by all threads.
+callout and stack-checking functions pointed to by <b>pcre_callout</b> and
+<b>pcre_stack_guard</b>, are shared by all threads.
</P>
<P>
The compiled form of a regular expression is not altered during matching, so
@@ -489,7 +502,10 @@ documentation.
The output is a long integer that gives the maximum depth of nesting of
parentheses (of any kind) in a pattern. This limit is imposed to cap the amount
of system stack used when a pattern is compiled. It is specified when PCRE is
-built; the default is 250.
+built; the default is 250. This limit does not take into account the stack that
+may already be used by the calling application. For finer control over
+compilation stack usage, you can set a pointer to an external checking function
+in <b>pcre_stack_guard</b>.
<pre>
PCRE_CONFIG_MATCH_LIMIT
</pre>
@@ -1008,6 +1024,8 @@ have fallen out of use. To avoid confusion, they have not been re-used.
81 missing opening brace after \o
82 parentheses are too deeply nested
83 invalid range in character class
+ 84 group name must start with a non-digit
+ 85 parentheses are too deeply nested (stack check)
</pre>
The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may
be used if the limits were changed when PCRE was built.
@@ -1265,12 +1283,15 @@ information call is provided for internal use by the <b>pcre_study()</b>
function. External callers can cause PCRE to use its internal tables by passing
a NULL table pointer.
<pre>
- PCRE_INFO_FIRSTBYTE
+ PCRE_INFO_FIRSTBYTE (deprecated)
</pre>
Return information about the first data unit of any matched string, for a
-non-anchored pattern. (The name of this option refers to the 8-bit library,
-where data units are bytes.) The fourth argument should point to an <b>int</b>
-variable.
+non-anchored pattern. The name of this option refers to the 8-bit library,
+where data units are bytes. The fourth argument should point to an <b>int</b>
+variable. Negative values are used for special cases. However, this means that
+when the 32-bit library is in non-UTF-32 mode, the full 32-bit range of
+characters cannot be returned. For this reason, this value is deprecated; use
+PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER instead.
</P>
<P>
If there is a fixed first value, for example, the letter "c" from a pattern
@@ -1293,12 +1314,43 @@ starts with "^", or
-1 is returned, indicating that the pattern matches only at the start of a
subject string or after any newline within the string. Otherwise -2 is
returned. For anchored patterns, -2 is returned.
+<pre>
+ PCRE_INFO_FIRSTCHARACTER
+</pre>
+Return the value of the first data unit (non-UTF character) of any matched
+string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS returns 1;
+otherwise return 0. The fourth argument should point to an <b>uint_t</b>
+variable.
</P>
<P>
-Since for the 32-bit library using the non-UTF-32 mode, this function is unable
-to return the full 32-bit range of the character, this value is deprecated;
-instead the PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER values
-should be used.
+In the 8-bit library, the value is always less than 256. In the 16-bit library
+the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value
+can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 mode.
+<pre>
+ PCRE_INFO_FIRSTCHARACTERFLAGS
+</pre>
+Return information about the first data unit of any matched string, for a
+non-anchored pattern. The fourth argument should point to an <b>int</b>
+variable.
+</P>
+<P>
+If there is a fixed first value, for example, the letter "c" from a pattern
+such as (cat|cow|coyote), 1 is returned, and the character value can be
+retrieved using PCRE_INFO_FIRSTCHARACTER. If there is no fixed first value, and
+if either
+<br>
+<br>
+(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
+starts with "^", or
+<br>
+<br>
+(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
+(if it were set, the pattern would be anchored),
+<br>
+<br>
+2 is returned, indicating that the pattern matches only at the start of a
+subject string or after any newline within the string. Otherwise 0 is
+returned. For anchored patterns, 0 is returned.
<pre>
PCRE_INFO_FIRSTTABLE
</pre>
@@ -1509,44 +1561,6 @@ is made available via this option so that it can be saved and restored (see the
<a href="pcreprecompile.html"><b>pcreprecompile</b></a>
documentation for details).
<pre>
- PCRE_INFO_FIRSTCHARACTERFLAGS
-</pre>
-Return information about the first data unit of any matched string, for a
-non-anchored pattern. The fourth argument should point to an <b>int</b>
-variable.
-</P>
-<P>
-If there is a fixed first value, for example, the letter "c" from a pattern
-such as (cat|cow|coyote), 1 is returned, and the character value can be
-retrieved using PCRE_INFO_FIRSTCHARACTER.
-</P>
-<P>
-If there is no fixed first value, and if either
-<br>
-<br>
-(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
-starts with "^", or
-<br>
-<br>
-(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
-(if it were set, the pattern would be anchored),
-<br>
-<br>
-2 is returned, indicating that the pattern matches only at the start of a
-subject string or after any newline within the string. Otherwise 0 is
-returned. For anchored patterns, 0 is returned.
-<pre>
- PCRE_INFO_FIRSTCHARACTER
-</pre>
-Return the fixed first character value in the situation where
-PCRE_INFO_FIRSTCHARACTERFLAGS returns 1; otherwise return 0. The fourth
-argument should point to an <b>uint_t</b> variable.
-</P>
-<P>
-In the 8-bit library, the value is always less than 256. In the 16-bit library
-the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value
-can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 mode.
-<pre>
PCRE_INFO_REQUIREDCHARFLAGS
</pre>
Returns 1 if there is a rightmost literal data unit that must exist in any
@@ -2899,9 +2913,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC26" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 12 November 2013
+Last updated: 09 February 2014
<br>
-Copyright &copy; 1997-2013 University of Cambridge.
+Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/pcre/doc/html/pcregrep.html b/pcre/doc/html/pcregrep.html
index bac8f9a43fd..dacbb4998f8 100644
--- a/pcre/doc/html/pcregrep.html
+++ b/pcre/doc/html/pcregrep.html
@@ -37,8 +37,10 @@ man page, in case the conversion went wrong.
<b>pcregrep</b> searches files for character patterns, in the same way as other
grep commands do, but it uses the PCRE regular expression library to support
patterns that are compatible with the regular expressions of Perl 5. See
+<a href="pcresyntax.html"><b>pcresyntax</b>(3)</a>
+for a quick-reference summary of pattern syntax, or
<a href="pcrepattern.html"><b>pcrepattern</b>(3)</a>
-for a full description of syntax and semantics of the regular expressions
+for a full description of the syntax and semantics of the regular expressions
that PCRE supports.
</P>
<P>
@@ -748,9 +750,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC14" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 13 September 2012
+Last updated: 03 April 2014
<br>
-Copyright &copy; 1997-2012 University of Cambridge.
+Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/pcre/doc/html/pcrepattern.html b/pcre/doc/html/pcrepattern.html
index 624cb447238..c06d1e03f11 100644
--- a/pcre/doc/html/pcrepattern.html
+++ b/pcre/doc/html/pcrepattern.html
@@ -1003,7 +1003,9 @@ matches "foobar", the first substring is still set to "foo".
<P>
Perl documents that the use of \K within assertions is "not well defined". In
PCRE, \K is acted upon when it occurs inside positive assertions, but is
-ignored in negative assertions.
+ignored in negative assertions. Note that when a pattern such as (?=ab\K)
+matches, the reported start of the match can be greater than the end of the
+match.
<a name="smallassertions"></a></P>
<br><b>
Simple assertions
@@ -2990,19 +2992,22 @@ match does not always guarantee that a match must be at this starting point.
<P>
Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE's start-of-match optimizations are turned off, as shown in this
-<b>pcretest</b> example:
+output from <b>pcretest</b>:
<pre>
re&#62; /(*COMMIT)abc/
data&#62; xyzabc
0: abc
- xyzabc\Y
+ data&#62; xyzabc\Y
No match
</pre>
-PCRE knows that any match must start with "a", so the optimization skips along
-the subject to "a" before running the first match attempt, which succeeds. When
-the optimization is disabled by the \Y escape in the second subject, the match
-starts at "x" and so the (*COMMIT) causes it to fail without trying any other
-starting points.
+For this pattern, PCRE knows that any match must start with "a", so the
+optimization skips along the subject to "a" before applying the pattern to the
+first set of data. The match attempt then succeeds. In the second set of data,
+the escape sequence \Y is interpreted by the <b>pcretest</b> program. It causes
+the PCRE_NO_START_OPTIMIZE option to be set when <b>pcre_exec()</b> is called.
+This disables the optimization that skips along to the first character. The
+pattern is now applied starting at "x", and so the (*COMMIT) causes the match
+to fail without trying any other starting points.
<pre>
(*PRUNE) or (*PRUNE:NAME)
</pre>
@@ -3221,9 +3226,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 03 December 2013
+Last updated: 08 January 2014
<br>
-Copyright &copy; 1997-2013 University of Cambridge.
+Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/pcre/doc/html/pcresyntax.html b/pcre/doc/html/pcresyntax.html
index 0764a33a376..89f35737b4f 100644
--- a/pcre/doc/html/pcresyntax.html
+++ b/pcre/doc/html/pcresyntax.html
@@ -29,13 +29,13 @@ man page, in case the conversion went wrong.
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
-<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
-<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
-<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
-<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
-<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
-<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
-<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
+<li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
+<li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
+<li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
+<li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
+<li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
+<li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
+<li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
@@ -339,7 +339,8 @@ but some of them use Unicode properties if PCRE_UCP is set. You can use
<P>
<pre>
\K reset start of match
-</PRE>
+</pre>
+\K is honoured in positive assertions, but ignored in negative ones.
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
@@ -382,11 +383,13 @@ but some of them use Unicode properties if PCRE_UCP is set. You can use
(?x) extended (ignore white space)
(?-...) unset option(s)
</pre>
-The following are recognized only at the start of a pattern or after one of the
-newline-setting options with similar syntax:
+The following are recognized only at the very start of a pattern or after one
+of the newline or \R options with similar syntax. More than one of them may
+appear.
<pre>
(*LIMIT_MATCH=d) set the match limit to d (decimal number)
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
+ (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
(*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
(*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
@@ -397,7 +400,28 @@ newline-setting options with similar syntax:
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre_exec(), not increase them.
</P>
-<br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
+<br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
+<P>
+These are recognized only at the very start of the pattern or after option
+settings with a similar syntax.
+<pre>
+ (*CR) carriage return only
+ (*LF) linefeed only
+ (*CRLF) carriage return followed by linefeed
+ (*ANYCRLF) all three of the above
+ (*ANY) any Unicode newline sequence
+</PRE>
+</P>
+<br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
+<P>
+These are recognized only at the very start of the pattern or after option
+setting with a similar syntax.
+<pre>
+ (*BSR_ANYCRLF) CR, LF, or CRLF
+ (*BSR_UNICODE) any Unicode newline sequence
+</PRE>
+</P>
+<br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
(?=...) positive look ahead
@@ -407,7 +431,7 @@ limits set by the caller of pcre_exec(), not increase them.
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
-<br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
+<br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
\n reference by number (can be ambiguous)
@@ -421,7 +445,7 @@ Each top-level branch of a look behind must be of a fixed length.
(?P=name) reference by name (Python)
</PRE>
</P>
-<br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
+<br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
(?R) recurse whole pattern
@@ -440,7 +464,7 @@ Each top-level branch of a look behind must be of a fixed length.
\g'-n' call subpattern by relative number (PCRE extension)
</PRE>
</P>
-<br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
+<br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
(?(condition)yes-pattern)
@@ -459,7 +483,7 @@ Each top-level branch of a look behind must be of a fixed length.
(?(assert)... assertion condition
</PRE>
</P>
-<br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
+<br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act immediately they are reached:
<pre>
@@ -482,27 +506,6 @@ pattern is not anchored.
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
-<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
-<P>
-These are recognized only at the very start of the pattern or after a
-(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
-<pre>
- (*CR) carriage return only
- (*LF) linefeed only
- (*CRLF) carriage return followed by linefeed
- (*ANYCRLF) all three of the above
- (*ANY) any Unicode newline sequence
-</PRE>
-</P>
-<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
-<P>
-These are recognized only at the very start of the pattern or after a
-(*...) option that sets the newline convention or a UTF or UCP mode.
-<pre>
- (*BSR_ANYCRLF) CR, LF, or CRLF
- (*BSR_UNICODE) any Unicode newline sequence
-</PRE>
-</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
@@ -526,9 +529,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 12 November 2013
+Last updated: 08 January 2014
<br>
-Copyright &copy; 1997-2013 University of Cambridge.
+Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/pcre/doc/html/pcretest.html b/pcre/doc/html/pcretest.html
index 4ed1dfd0c74..839fabf189b 100644
--- a/pcre/doc/html/pcretest.html
+++ b/pcre/doc/html/pcretest.html
@@ -138,6 +138,9 @@ following options output the value and set the exit code as indicated:
newline the default newline setting:
CR, LF, CRLF, ANYCRLF, or ANY
exit code is always 0
+ bsr the default setting for what \R matches:
+ ANYCRLF or ANY
+ exit code is always 0
</pre>
The following options output 1 for true or 0 for false, and set the exit code
to the same value:
@@ -373,6 +376,7 @@ sections.
<b>/N</b> set PCRE_NO_AUTO_CAPTURE
<b>/O</b> set PCRE_NO_AUTO_POSSESS
<b>/P</b> use the POSIX wrapper
+ <b>/Q</b> test external stack check function
<b>/S</b> study the pattern after compilation
<b>/s</b> set PCRE_DOTALL
<b>/T</b> select character tables
@@ -534,7 +538,10 @@ below.
The <b>/I</b> modifier requests that <b>pcretest</b> output information about the
compiled pattern (whether it is anchored, has a fixed first character, and
so on). It does this by calling <b>pcre[16|32]_fullinfo()</b> after compiling a
-pattern. If the pattern is studied, the results of that are also output.
+pattern. If the pattern is studied, the results of that are also output. In
+this output, the word "char" means a non-UTF character, that is, the value of a
+single data item (8-bit, 16-bit, or 32-bit, depending on the library that is
+being tested).
</P>
<P>
The <b>/K</b> modifier requests <b>pcretest</b> to show names from backtracking
@@ -568,6 +575,14 @@ successfully studied with the PCRE_STUDY_JIT_COMPILE option, the size of the
JIT compiled code is also output.
</P>
<P>
+The <b>/Q</b> modifier is used to test the use of <b>pcre_stack_guard</b>. It
+must be followed by '0' or '1', specifying the return code to be given from an
+external function that is passed to PCRE and used for stack checking during
+compilation (see the
+<a href="pcreapi.html"><b>pcreapi</b></a>
+documentation for details).
+</P>
+<P>
The <b>/S</b> modifier causes <b>pcre[16|32]_study()</b> to be called after the
expression has been compiled, and the results used when the expression is
matched. There are a number of qualifying characters that may follow <b>/S</b>.
@@ -1134,9 +1149,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC17" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 12 November 2013
+Last updated: 09 February 2014
<br>
-Copyright &copy; 1997-2013 University of Cambridge.
+Copyright &copy; 1997-2014 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/pcre/doc/pcre.3 b/pcre/doc/pcre.3
index d92f9ef08de..4eda404ccff 100644
--- a/pcre/doc/pcre.3
+++ b/pcre/doc/pcre.3
@@ -1,4 +1,4 @@
-.TH PCRE 3 "01 Oct 2013" "PCRE 8.33"
+.TH PCRE 3 "08 January 2014" "PCRE 8.35"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH INTRODUCTION
@@ -158,8 +158,11 @@ page.
The user documentation for PCRE comprises a number of different sections. In
the "man" format, each of these is a separate "man page". In the HTML format,
each is a separate page, linked from the index page. In the plain text format,
-all the sections, except the \fBpcredemo\fP section, are concatenated, for ease
-of searching. The sections are as follows:
+the descriptions of the \fBpcregrep\fP and \fBpcretest\fP programs are in files
+called \fBpcregrep.txt\fP and \fBpcretest.txt\fP, respectively. The remaining
+sections, except for the \fBpcredemo\fP section (which is a program listing),
+are concatenated in \fBpcre.txt\fP, for ease of searching. The sections are as
+follows:
.sp
pcre this document
pcre-config show PCRE installation configuration information
@@ -188,8 +191,8 @@ of searching. The sections are as follows:
pcretest description of the \fBpcretest\fP testing command
pcreunicode discussion of Unicode and UTF-8/16/32 support
.sp
-In addition, in the "man" and HTML formats, there is a short page for each
-C library function, listing its arguments and results.
+In the "man" and HTML formats, there is also a short page for each C library
+function, listing its arguments and results.
.
.
.SH AUTHOR
@@ -210,6 +213,6 @@ two digits 10, at the domain cam.ac.uk.
.rs
.sp
.nf
-Last updated: 13 May 2013
-Copyright (c) 1997-2013 University of Cambridge.
+Last updated: 08 January 2014
+Copyright (c) 1997-2014 University of Cambridge.
.fi
diff --git a/pcre/doc/pcre.txt b/pcre/doc/pcre.txt
index 9d69515c3b8..14cbb8bf2be 100644
--- a/pcre/doc/pcre.txt
+++ b/pcre/doc/pcre.txt
@@ -130,9 +130,11 @@ USER DOCUMENTATION
The user documentation for PCRE comprises a number of different sec-
tions. In the "man" format, each of these is a separate "man page". In
the HTML format, each is a separate page, linked from the index page.
- In the plain text format, all the sections, except the pcredemo sec-
- tion, are concatenated, for ease of searching. The sections are as fol-
- lows:
+ In the plain text format, the descriptions of the pcregrep and pcretest
+ programs are in files called pcregrep.txt and pcretest.txt, respec-
+ tively. The remaining sections, except for the pcredemo section (which
+ is a program listing), are concatenated in pcre.txt, for ease of
+ searching. The sections are as follows:
pcre this document
pcre-config show PCRE installation configuration information
@@ -160,8 +162,8 @@ USER DOCUMENTATION
pcretest description of the pcretest testing command
pcreunicode discussion of Unicode and UTF-8/16/32 support
- In addition, in the "man" and HTML formats, there is a short page for
- each C library function, listing its arguments and results.
+ In the "man" and HTML formats, there is also a short page for each C
+ library function, listing its arguments and results.
AUTHOR
@@ -177,8 +179,8 @@ AUTHOR
REVISION
- Last updated: 13 May 2013
- Copyright (c) 1997-2013 University of Cambridge.
+ Last updated: 08 January 2014
+ Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
@@ -1674,6 +1676,8 @@ PCRE NATIVE API INDIRECTED FUNCTIONS
int (*pcre_callout)(pcre_callout_block *);
+ int (*pcre_stack_guard)(void);
+
PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
@@ -1809,6 +1813,14 @@ PCRE API OVERVIEW
specified points during a matching operation. Details are given in the
pcrecallout documentation.
+ The global variable pcre_stack_guard initially contains NULL. It can be
+ set by the caller to a function that is called by PCRE whenever it
+ starts to compile a parenthesized part of a pattern. When parentheses
+ are nested, PCRE uses recursive function calls, which use up the system
+ stack. This function is provided so that applications with restricted
+ stacks can force a compilation error if the stack runs out. The func-
+ tion should return zero if all is well, or non-zero to force an error.
+
NEWLINES
@@ -1849,25 +1861,26 @@ MULTITHREADING
The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
- callout function pointed to by pcre_callout, are shared by all threads.
+ callout and stack-checking functions pointed to by pcre_callout and
+ pcre_stack_guard, are shared by all threads.
- The compiled form of a regular expression is not altered during match-
+ The compiled form of a regular expression is not altered during match-
ing, so the same compiled pattern can safely be used by several threads
at once.
- If the just-in-time optimization feature is being used, it needs sepa-
- rate memory stack areas for each thread. See the pcrejit documentation
+ If the just-in-time optimization feature is being used, it needs sepa-
+ rate memory stack areas for each thread. See the pcrejit documentation
for more details.
SAVING PRECOMPILED PATTERNS FOR LATER USE
The compiled form of a regular expression can be saved and re-used at a
- later time, possibly by a different program, and even on a host other
- than the one on which it was compiled. Details are given in the
- pcreprecompile documentation, which includes a description of the
- pcre_pattern_to_host_byte_order() function. However, compiling a regu-
- lar expression with one version of PCRE for use with a different ver-
+ later time, possibly by a different program, and even on a host other
+ than the one on which it was compiled. Details are given in the
+ pcreprecompile documentation, which includes a description of the
+ pcre_pattern_to_host_byte_order() function. However, compiling a regu-
+ lar expression with one version of PCRE for use with a different ver-
sion is not guaranteed to work and may cause crashes.
@@ -1875,45 +1888,45 @@ CHECKING BUILD-TIME OPTIONS
int pcre_config(int what, void *where);
- The function pcre_config() makes it possible for a PCRE client to dis-
+ The function pcre_config() makes it possible for a PCRE client to dis-
cover which optional features have been compiled into the PCRE library.
- The pcrebuild documentation has more details about these optional fea-
+ The pcrebuild documentation has more details about these optional fea-
tures.
- The first argument for pcre_config() is an integer, specifying which
+ The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
- into which the information is placed. The returned value is zero on
- success, or the negative error code PCRE_ERROR_BADOPTION if the value
- in the first argument is not recognized. The following information is
+ into which the information is placed. The returned value is zero on
+ success, or the negative error code PCRE_ERROR_BADOPTION if the value
+ in the first argument is not recognized. The following information is
available:
PCRE_CONFIG_UTF8
- The output is an integer that is set to one if UTF-8 support is avail-
- able; otherwise it is set to zero. This value should normally be given
+ The output is an integer that is set to one if UTF-8 support is avail-
+ able; otherwise it is set to zero. This value should normally be given
to the 8-bit version of this function, pcre_config(). If it is given to
- the 16-bit or 32-bit version of this function, the result is
+ the 16-bit or 32-bit version of this function, the result is
PCRE_ERROR_BADOPTION.
PCRE_CONFIG_UTF16
The output is an integer that is set to one if UTF-16 support is avail-
- able; otherwise it is set to zero. This value should normally be given
+ able; otherwise it is set to zero. This value should normally be given
to the 16-bit version of this function, pcre16_config(). If it is given
- to the 8-bit or 32-bit version of this function, the result is
+ to the 8-bit or 32-bit version of this function, the result is
PCRE_ERROR_BADOPTION.
PCRE_CONFIG_UTF32
The output is an integer that is set to one if UTF-32 support is avail-
- able; otherwise it is set to zero. This value should normally be given
+ able; otherwise it is set to zero. This value should normally be given
to the 32-bit version of this function, pcre32_config(). If it is given
- to the 8-bit or 16-bit version of this function, the result is
+ to the 8-bit or 16-bit version of this function, the result is
PCRE_ERROR_BADOPTION.
PCRE_CONFIG_UNICODE_PROPERTIES
- The output is an integer that is set to one if support for Unicode
+ The output is an integer that is set to one if support for Unicode
character properties is available; otherwise it is set to zero.
PCRE_CONFIG_JIT
@@ -1923,55 +1936,58 @@ CHECKING BUILD-TIME OPTIONS
PCRE_CONFIG_JITTARGET
- The output is a pointer to a zero-terminated "const char *" string. If
+ The output is a pointer to a zero-terminated "const char *" string. If
JIT support is available, the string contains the name of the architec-
- ture for which the JIT compiler is configured, for example "x86 32bit
- (little endian + unaligned)". If JIT support is not available, the
+ ture for which the JIT compiler is configured, for example "x86 32bit
+ (little endian + unaligned)". If JIT support is not available, the
result is NULL.
PCRE_CONFIG_NEWLINE
- The output is an integer whose value specifies the default character
- sequence that is recognized as meaning "newline". The values that are
+ The output is an integer whose value specifies the default character
+ sequence that is recognized as meaning "newline". The values that are
supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
- for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
- ANYCRLF, and ANY yield the same values. However, the value for LF is
- normally 21, though some EBCDIC environments use 37. The corresponding
- values for CRLF are 3349 and 3365. The default should normally corre-
+ for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
+ ANYCRLF, and ANY yield the same values. However, the value for LF is
+ normally 21, though some EBCDIC environments use 37. The corresponding
+ values for CRLF are 3349 and 3365. The default should normally corre-
spond to the standard sequence for your operating system.
PCRE_CONFIG_BSR
The output is an integer whose value indicates what character sequences
- the \R escape sequence matches by default. A value of 0 means that \R
- matches any Unicode line ending sequence; a value of 1 means that \R
+ the \R escape sequence matches by default. A value of 0 means that \R
+ matches any Unicode line ending sequence; a value of 1 means that \R
matches only CR, LF, or CRLF. The default can be overridden when a pat-
tern is compiled or matched.
PCRE_CONFIG_LINK_SIZE
- The output is an integer that contains the number of bytes used for
+ The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. For the 8-bit
library, the value can be 2, 3, or 4. For the 16-bit library, the value
- is either 2 or 4 and is still a number of bytes. For the 32-bit
+ is either 2 or 4 and is still a number of bytes. For the 32-bit
library, the value is either 2 or 4 and is still a number of bytes. The
default value of 2 is sufficient for all but the most massive patterns,
- since it allows the compiled pattern to be up to 64K in size. Larger
- values allow larger regular expressions to be compiled, at the expense
+ since it allows the compiled pattern to be up to 64K in size. Larger
+ values allow larger regular expressions to be compiled, at the expense
of slower matching.
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
- The output is an integer that contains the threshold above which the
- POSIX interface uses malloc() for output vectors. Further details are
+ The output is an integer that contains the threshold above which the
+ POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.
PCRE_CONFIG_PARENS_LIMIT
The output is a long integer that gives the maximum depth of nesting of
- parentheses (of any kind) in a pattern. This limit is imposed to cap
+ parentheses (of any kind) in a pattern. This limit is imposed to cap
the amount of system stack used when a pattern is compiled. It is spec-
- ified when PCRE is built; the default is 250.
+ ified when PCRE is built; the default is 250. This limit does not take
+ into account the stack that may already be used by the calling applica-
+ tion. For finer control over compilation stack usage, you can set a
+ pointer to an external checking function in pcre_stack_guard.
PCRE_CONFIG_MATCH_LIMIT
@@ -2474,6 +2490,8 @@ COMPILATION ERROR CODES
81 missing opening brace after \o
82 parentheses are too deeply nested
83 invalid range in character class
+ 84 group name must start with a non-digit
+ 85 parentheses are too deeply nested (stack check)
The numbers 32 and 10000 in errors 48 and 49 are defaults; different
values may be used if the limits were changed when PCRE was built.
@@ -2714,12 +2732,16 @@ INFORMATION ABOUT A PATTERN
tion. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer.
- PCRE_INFO_FIRSTBYTE
+ PCRE_INFO_FIRSTBYTE (deprecated)
Return information about the first data unit of any matched string, for
- a non-anchored pattern. (The name of this option refers to the 8-bit
- library, where data units are bytes.) The fourth argument should point
- to an int variable.
+ a non-anchored pattern. The name of this option refers to the 8-bit
+ library, where data units are bytes. The fourth argument should point
+ to an int variable. Negative values are used for special cases. How-
+ ever, this means that when the 32-bit library is in non-UTF-32 mode,
+ the full 32-bit range of characters cannot be returned. For this rea-
+ son, this value is deprecated; use PCRE_INFO_FIRSTCHARACTERFLAGS and
+ PCRE_INFO_FIRSTCHARACTER instead.
If there is a fixed first value, for example, the letter "c" from a
pattern such as (cat|cow|coyote), its value is returned. In the 8-bit
@@ -2739,10 +2761,38 @@ INFORMATION ABOUT A PATTERN
of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.
- Since for the 32-bit library using the non-UTF-32 mode, this function
- is unable to return the full 32-bit range of the character, this value
- is deprecated; instead the PCRE_INFO_FIRSTCHARACTERFLAGS and
- PCRE_INFO_FIRSTCHARACTER values should be used.
+ PCRE_INFO_FIRSTCHARACTER
+
+ Return the value of the first data unit (non-UTF character) of any
+ matched string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS
+ returns 1; otherwise return 0. The fourth argument should point to an
+ uint_t variable.
+
+ In the 8-bit library, the value is always less than 256. In the 16-bit
+ library the value can be up to 0xffff. In the 32-bit library in UTF-32
+ mode the value can be up to 0x10ffff, and up to 0xffffffff when not
+ using UTF-32 mode.
+
+ PCRE_INFO_FIRSTCHARACTERFLAGS
+
+ Return information about the first data unit of any matched string, for
+ a non-anchored pattern. The fourth argument should point to an int
+ variable.
+
+ If there is a fixed first value, for example, the letter "c" from a
+ pattern such as (cat|cow|coyote), 1 is returned, and the character
+ value can be retrieved using PCRE_INFO_FIRSTCHARACTER. If there is no
+ fixed first value, and if either
+
+ (a) the pattern was compiled with the PCRE_MULTILINE option, and every
+ branch starts with "^", or
+
+ (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
+ set (if it were set, the pattern would be anchored),
+
+ 2 is returned, indicating that the pattern matches only at the start of
+ a subject string or after any newline within the string. Otherwise 0 is
+ returned. For anchored patterns, 0 is returned.
PCRE_INFO_FIRSTTABLE
@@ -2954,57 +3004,24 @@ INFORMATION ABOUT A PATTERN
option so that it can be saved and restored (see the pcreprecompile
documentation for details).
- PCRE_INFO_FIRSTCHARACTERFLAGS
-
- Return information about the first data unit of any matched string, for
- a non-anchored pattern. The fourth argument should point to an int
- variable.
-
- If there is a fixed first value, for example, the letter "c" from a
- pattern such as (cat|cow|coyote), 1 is returned, and the character
- value can be retrieved using PCRE_INFO_FIRSTCHARACTER.
-
- If there is no fixed first value, and if either
-
- (a) the pattern was compiled with the PCRE_MULTILINE option, and every
- branch starts with "^", or
-
- (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
- set (if it were set, the pattern would be anchored),
-
- 2 is returned, indicating that the pattern matches only at the start of
- a subject string or after any newline within the string. Otherwise 0 is
- returned. For anchored patterns, 0 is returned.
-
- PCRE_INFO_FIRSTCHARACTER
-
- Return the fixed first character value in the situation where
- PCRE_INFO_FIRSTCHARACTERFLAGS returns 1; otherwise return 0. The fourth
- argument should point to an uint_t variable.
-
- In the 8-bit library, the value is always less than 256. In the 16-bit
- library the value can be up to 0xffff. In the 32-bit library in UTF-32
- mode the value can be up to 0x10ffff, and up to 0xffffffff when not
- using UTF-32 mode.
-
PCRE_INFO_REQUIREDCHARFLAGS
- Returns 1 if there is a rightmost literal data unit that must exist in
+ Returns 1 if there is a rightmost literal data unit that must exist in
any matched string, other than at its start. The fourth argument should
- point to an int variable. If there is no such value, 0 is returned. If
+ point to an int variable. If there is no such value, 0 is returned. If
returning 1, the character value itself can be retrieved using
PCRE_INFO_REQUIREDCHAR.
For anchored patterns, a last literal value is recorded only if it fol-
- lows something of variable length. For example, for the pattern
- /^a\d+z\d+/ the returned value 1 (with "z" returned from
+ lows something of variable length. For example, for the pattern
+ /^a\d+z\d+/ the returned value 1 (with "z" returned from
PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
PCRE_INFO_REQUIREDCHAR
- Return the value of the rightmost literal data unit that must exist in
- any matched string, other than at its start, if such a value has been
- recorded. The fourth argument should point to an uint32_t variable. If
+ Return the value of the rightmost literal data unit that must exist in
+ any matched string, other than at its start, if such a value has been
+ recorded. The fourth argument should point to an uint32_t variable. If
there is no such value, 0 is returned.
@@ -3012,21 +3029,21 @@ REFERENCE COUNTS
int pcre_refcount(pcre *code, int adjust);
- The pcre_refcount() function is used to maintain a reference count in
+ The pcre_refcount() function is used to maintain a reference count in
the data block that contains a compiled pattern. It is provided for the
- benefit of applications that operate in an object-oriented manner,
+ benefit of applications that operate in an object-oriented manner,
where different parts of the application may be using the same compiled
pattern, but you want to free the block when they are all done.
When a pattern is compiled, the reference count field is initialized to
- zero. It is changed only by calling this function, whose action is to
- add the adjust value (which may be positive or negative) to it. The
+ zero. It is changed only by calling this function, whose action is to
+ add the adjust value (which may be positive or negative) to it. The
yield of the function is the new value. However, the value of the count
- is constrained to lie between 0 and 65535, inclusive. If the new value
+ is constrained to lie between 0 and 65535, inclusive. If the new value
is outside these limits, it is forced to the appropriate limit value.
- Except when it is zero, the reference count is not correctly preserved
- if a pattern is compiled on one host and then transferred to a host
+ Except when it is zero, the reference count is not correctly preserved
+ if a pattern is compiled on one host and then transferred to a host
whose byte-order is different. (This seems a highly unlikely scenario.)
@@ -3036,22 +3053,22 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
const char *subject, int length, int startoffset,
int options, int *ovector, int ovecsize);
- The function pcre_exec() is called to match a subject string against a
- compiled pattern, which is passed in the code argument. If the pattern
- was studied, the result of the study should be passed in the extra
- argument. You can call pcre_exec() with the same code and extra argu-
- ments as many times as you like, in order to match different subject
+ The function pcre_exec() is called to match a subject string against a
+ compiled pattern, which is passed in the code argument. If the pattern
+ was studied, the result of the study should be passed in the extra
+ argument. You can call pcre_exec() with the same code and extra argu-
+ ments as many times as you like, in order to match different subject
strings with the same pattern.
- This function is the main matching facility of the library, and it
- operates in a Perl-like manner. For specialist use there is also an
- alternative matching function, which is described below in the section
+ This function is the main matching facility of the library, and it
+ operates in a Perl-like manner. For specialist use there is also an
+ alternative matching function, which is described below in the section
about the pcre_dfa_exec() function.
- In most applications, the pattern will have been compiled (and option-
- ally studied) in the same process that calls pcre_exec(). However, it
+ In most applications, the pattern will have been compiled (and option-
+ ally studied) in the same process that calls pcre_exec(). However, it
is possible to save compiled patterns and study data, and then use them
- later in different processes, possibly even on different hosts. For a
+ later in different processes, possibly even on different hosts. For a
discussion about this, see the pcreprecompile documentation.
Here is an example of a simple call to pcre_exec():
@@ -3070,10 +3087,10 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
Extra data for pcre_exec()
- If the extra argument is not NULL, it must point to a pcre_extra data
- block. The pcre_study() function returns such a block (when it doesn't
- return NULL), but you can also create one for yourself, and pass addi-
- tional information in it. The pcre_extra block contains the following
+ If the extra argument is not NULL, it must point to a pcre_extra data
+ block. The pcre_study() function returns such a block (when it doesn't
+ return NULL), but you can also create one for yourself, and pass addi-
+ tional information in it. The pcre_extra block contains the following
fields (not necessarily in this order):
unsigned long int flags;
@@ -3085,13 +3102,13 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
const unsigned char *tables;
unsigned char **mark;
- In the 16-bit version of this structure, the mark field has type
+ In the 16-bit version of this structure, the mark field has type
"PCRE_UCHAR16 **".
- In the 32-bit version of this structure, the mark field has type
+ In the 32-bit version of this structure, the mark field has type
"PCRE_UCHAR32 **".
- The flags field is used to specify which of the other fields are set.
+ The flags field is used to specify which of the other fields are set.
The flag bits are:
PCRE_EXTRA_CALLOUT_DATA
@@ -3102,134 +3119,134 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_EXTRA_STUDY_DATA
PCRE_EXTRA_TABLES
- Other flag bits should be set to zero. The study_data field and some-
- times the executable_jit field are set in the pcre_extra block that is
- returned by pcre_study(), together with the appropriate flag bits. You
- should not set these yourself, but you may add to the block by setting
+ Other flag bits should be set to zero. The study_data field and some-
+ times the executable_jit field are set in the pcre_extra block that is
+ returned by pcre_study(), together with the appropriate flag bits. You
+ should not set these yourself, but you may add to the block by setting
other fields and their corresponding flag bits.
The match_limit field provides a means of preventing PCRE from using up
- a vast amount of resources when running patterns that are not going to
- match, but which have a very large number of possibilities in their
- search trees. The classic example is a pattern that uses nested unlim-
+ a vast amount of resources when running patterns that are not going to
+ match, but which have a very large number of possibilities in their
+ search trees. The classic example is a pattern that uses nested unlim-
ited repeats.
- Internally, pcre_exec() uses a function called match(), which it calls
- repeatedly (sometimes recursively). The limit set by match_limit is
- imposed on the number of times this function is called during a match,
- which has the effect of limiting the amount of backtracking that can
+ Internally, pcre_exec() uses a function called match(), which it calls
+ repeatedly (sometimes recursively). The limit set by match_limit is
+ imposed on the number of times this function is called during a match,
+ which has the effect of limiting the amount of backtracking that can
take place. For patterns that are not anchored, the count restarts from
zero for each position in the subject string.
When pcre_exec() is called with a pattern that was successfully studied
- with a JIT option, the way that the matching is executed is entirely
+ with a JIT option, the way that the matching is executed is entirely
different. However, there is still the possibility of runaway matching
that goes on for a very long time, and so the match_limit value is also
used in this case (but in a different way) to limit how long the match-
ing can continue.
- The default value for the limit can be set when PCRE is built; the
- default default is 10 million, which handles all but the most extreme
- cases. You can override the default by suppling pcre_exec() with a
- pcre_extra block in which match_limit is set, and
- PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
+ The default value for the limit can be set when PCRE is built; the
+ default default is 10 million, which handles all but the most extreme
+ cases. You can override the default by suppling pcre_exec() with a
+ pcre_extra block in which match_limit is set, and
+ PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
- A value for the match limit may also be supplied by an item at the
+ A value for the match limit may also be supplied by an item at the
start of a pattern of the form
(*LIMIT_MATCH=d)
- where d is a decimal number. However, such a setting is ignored unless
- d is less than the limit set by the caller of pcre_exec() or, if no
+ where d is a decimal number. However, such a setting is ignored unless
+ d is less than the limit set by the caller of pcre_exec() or, if no
such limit is set, less than the default.
- The match_limit_recursion field is similar to match_limit, but instead
+ The match_limit_recursion field is similar to match_limit, but instead
of limiting the total number of times that match() is called, it limits
- the depth of recursion. The recursion depth is a smaller number than
- the total number of calls, because not all calls to match() are recur-
+ the depth of recursion. The recursion depth is a smaller number than
+ the total number of calls, because not all calls to match() are recur-
sive. This limit is of use only if it is set smaller than match_limit.
- Limiting the recursion depth limits the amount of machine stack that
- can be used, or, when PCRE has been compiled to use memory on the heap
- instead of the stack, the amount of heap memory that can be used. This
- limit is not relevant, and is ignored, when matching is done using JIT
+ Limiting the recursion depth limits the amount of machine stack that
+ can be used, or, when PCRE has been compiled to use memory on the heap
+ instead of the stack, the amount of heap memory that can be used. This
+ limit is not relevant, and is ignored, when matching is done using JIT
compiled code.
- The default value for match_limit_recursion can be set when PCRE is
- built; the default default is the same value as the default for
- match_limit. You can override the default by suppling pcre_exec() with
- a pcre_extra block in which match_limit_recursion is set, and
- PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
+ The default value for match_limit_recursion can be set when PCRE is
+ built; the default default is the same value as the default for
+ match_limit. You can override the default by suppling pcre_exec() with
+ a pcre_extra block in which match_limit_recursion is set, and
+ PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
- A value for the recursion limit may also be supplied by an item at the
+ A value for the recursion limit may also be supplied by an item at the
start of a pattern of the form
(*LIMIT_RECURSION=d)
- where d is a decimal number. However, such a setting is ignored unless
- d is less than the limit set by the caller of pcre_exec() or, if no
+ where d is a decimal number. However, such a setting is ignored unless
+ d is less than the limit set by the caller of pcre_exec() or, if no
such limit is set, less than the default.
- The callout_data field is used in conjunction with the "callout" fea-
+ The callout_data field is used in conjunction with the "callout" fea-
ture, and is described in the pcrecallout documentation.
- The tables field is provided for use with patterns that have been pre-
+ The tables field is provided for use with patterns that have been pre-
compiled using custom character tables, saved to disc or elsewhere, and
- then reloaded, because the tables that were used to compile a pattern
- are not saved with it. See the pcreprecompile documentation for a dis-
- cussion of saving compiled patterns for later use. If NULL is passed
+ then reloaded, because the tables that were used to compile a pattern
+ are not saved with it. See the pcreprecompile documentation for a dis-
+ cussion of saving compiled patterns for later use. If NULL is passed
using this mechanism, it forces PCRE's internal tables to be used.
- Warning: The tables that pcre_exec() uses must be the same as those
- that were used when the pattern was compiled. If this is not the case,
+ Warning: The tables that pcre_exec() uses must be the same as those
+ that were used when the pattern was compiled. If this is not the case,
the behaviour of pcre_exec() is undefined. Therefore, when a pattern is
- compiled and matched in the same process, this field should never be
+ compiled and matched in the same process, this field should never be
set. In this (the most common) case, the correct table pointer is auto-
- matically passed with the compiled pattern from pcre_compile() to
+ matically passed with the compiled pattern from pcre_compile() to
pcre_exec().
- If PCRE_EXTRA_MARK is set in the flags field, the mark field must be
- set to point to a suitable variable. If the pattern contains any back-
- tracking control verbs such as (*MARK:NAME), and the execution ends up
- with a name to pass back, a pointer to the name string (zero termi-
- nated) is placed in the variable pointed to by the mark field. The
- names are within the compiled pattern; if you wish to retain such a
- name you must copy it before freeing the memory of a compiled pattern.
- If there is no name to pass back, the variable pointed to by the mark
- field is set to NULL. For details of the backtracking control verbs,
+ If PCRE_EXTRA_MARK is set in the flags field, the mark field must be
+ set to point to a suitable variable. If the pattern contains any back-
+ tracking control verbs such as (*MARK:NAME), and the execution ends up
+ with a name to pass back, a pointer to the name string (zero termi-
+ nated) is placed in the variable pointed to by the mark field. The
+ names are within the compiled pattern; if you wish to retain such a
+ name you must copy it before freeing the memory of a compiled pattern.
+ If there is no name to pass back, the variable pointed to by the mark
+ field is set to NULL. For details of the backtracking control verbs,
see the section entitled "Backtracking control" in the pcrepattern doc-
umentation.
Option bits for pcre_exec()
- The unused bits of the options argument for pcre_exec() must be zero.
- The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
- PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
- PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and
+ The unused bits of the options argument for pcre_exec() must be zero.
+ The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
+ PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
+ PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and
PCRE_PARTIAL_SOFT.
- If the pattern was successfully studied with one of the just-in-time
+ If the pattern was successfully studied with one of the just-in-time
(JIT) compile options, the only supported options for JIT execution are
- PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
- PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an
- unsupported option is used, JIT execution is disabled and the normal
+ PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
+ PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an
+ unsupported option is used, JIT execution is disabled and the normal
interpretive code in pcre_exec() is run.
PCRE_ANCHORED
- The PCRE_ANCHORED option limits pcre_exec() to matching at the first
- matching position. If a pattern was compiled with PCRE_ANCHORED, or
- turned out to be anchored by virtue of its contents, it cannot be made
+ The PCRE_ANCHORED option limits pcre_exec() to matching at the first
+ matching position. If a pattern was compiled with PCRE_ANCHORED, or
+ turned out to be anchored by virtue of its contents, it cannot be made
unachored at matching time.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R escape
- sequence matches. The choice is either to match only CR, LF, or CRLF,
- or to match any Unicode newline sequence. These options override the
+ sequence matches. The choice is either to match only CR, LF, or CRLF,
+ or to match any Unicode newline sequence. These options override the
choice that was made or defaulted when the pattern was compiled.
PCRE_NEWLINE_CR
@@ -3238,345 +3255,345 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY
- These options override the newline definition that was chosen or
- defaulted when the pattern was compiled. For details, see the descrip-
- tion of pcre_compile() above. During matching, the newline choice
- affects the behaviour of the dot, circumflex, and dollar metacharac-
- ters. It may also alter the way the match position is advanced after a
+ These options override the newline definition that was chosen or
+ defaulted when the pattern was compiled. For details, see the descrip-
+ tion of pcre_compile() above. During matching, the newline choice
+ affects the behaviour of the dot, circumflex, and dollar metacharac-
+ ters. It may also alter the way the match position is advanced after a
match failure for an unanchored pattern.
- When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
- set, and a match attempt for an unanchored pattern fails when the cur-
- rent position is at a CRLF sequence, and the pattern contains no
- explicit matches for CR or LF characters, the match position is
+ When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
+ set, and a match attempt for an unanchored pattern fails when the cur-
+ rent position is at a CRLF sequence, and the pattern contains no
+ explicit matches for CR or LF characters, the match position is
advanced by two characters instead of one, in other words, to after the
CRLF.
The above rule is a compromise that makes the most common cases work as
- expected. For example, if the pattern is .+A (and the PCRE_DOTALL
+ expected. For example, if the pattern is .+A (and the PCRE_DOTALL
option is not set), it does not match the string "\r\nA" because, after
- failing at the start, it skips both the CR and the LF before retrying.
- However, the pattern [\r\n]A does match that string, because it con-
+ failing at the start, it skips both the CR and the LF before retrying.
+ However, the pattern [\r\n]A does match that string, because it con-
tains an explicit CR or LF reference, and so advances only by one char-
acter after the first failure.
An explicit match for CR of LF is either a literal appearance of one of
- those characters, or one of the \r or \n escape sequences. Implicit
- matches such as [^X] do not count, nor does \s (which includes CR and
+ those characters, or one of the \r or \n escape sequences. Implicit
+ matches such as [^X] do not count, nor does \s (which includes CR and
LF in the characters that it matches).
- Notwithstanding the above, anomalous effects may still occur when CRLF
+ Notwithstanding the above, anomalous effects may still occur when CRLF
is a valid newline sequence and explicit \r or \n escapes appear in the
pattern.
PCRE_NOTBOL
This option specifies that first character of the subject string is not
- the beginning of a line, so the circumflex metacharacter should not
- match before it. Setting this without PCRE_MULTILINE (at compile time)
- causes circumflex never to match. This option affects only the behav-
+ the beginning of a line, so the circumflex metacharacter should not
+ match before it. Setting this without PCRE_MULTILINE (at compile time)
+ causes circumflex never to match. This option affects only the behav-
iour of the circumflex metacharacter. It does not affect \A.
PCRE_NOTEOL
This option specifies that the end of the subject string is not the end
- of a line, so the dollar metacharacter should not match it nor (except
- in multiline mode) a newline immediately before it. Setting this with-
+ of a line, so the dollar metacharacter should not match it nor (except
+ in multiline mode) a newline immediately before it. Setting this with-
out PCRE_MULTILINE (at compile time) causes dollar never to match. This
- option affects only the behaviour of the dollar metacharacter. It does
+ option affects only the behaviour of the dollar metacharacter. It does
not affect \Z or \z.
PCRE_NOTEMPTY
An empty string is not considered to be a valid match if this option is
- set. If there are alternatives in the pattern, they are tried. If all
- the alternatives match the empty string, the entire match fails. For
+ set. If there are alternatives in the pattern, they are tried. If all
+ the alternatives match the empty string, the entire match fails. For
example, if the pattern
a?b?
- is applied to a string not beginning with "a" or "b", it matches an
- empty string at the start of the subject. With PCRE_NOTEMPTY set, this
+ is applied to a string not beginning with "a" or "b", it matches an
+ empty string at the start of the subject. With PCRE_NOTEMPTY set, this
match is not valid, so PCRE searches further into the string for occur-
rences of "a" or "b".
PCRE_NOTEMPTY_ATSTART
- This is like PCRE_NOTEMPTY, except that an empty string match that is
- not at the start of the subject is permitted. If the pattern is
+ This is like PCRE_NOTEMPTY, except that an empty string match that is
+ not at the start of the subject is permitted. If the pattern is
anchored, such a match can occur only if the pattern contains \K.
- Perl has no direct equivalent of PCRE_NOTEMPTY or
- PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern
- match of the empty string within its split() function, and when using
- the /g modifier. It is possible to emulate Perl's behaviour after
+ Perl has no direct equivalent of PCRE_NOTEMPTY or
+ PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern
+ match of the empty string within its split() function, and when using
+ the /g modifier. It is possible to emulate Perl's behaviour after
matching a null string by first trying the match again at the same off-
- set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that
+ set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that
fails, by advancing the starting offset (see below) and trying an ordi-
- nary match again. There is some code that demonstrates how to do this
- in the pcredemo sample program. In the most general case, you have to
- check to see if the newline convention recognizes CRLF as a newline,
- and if so, and the current character is CR followed by LF, advance the
+ nary match again. There is some code that demonstrates how to do this
+ in the pcredemo sample program. In the most general case, you have to
+ check to see if the newline convention recognizes CRLF as a newline,
+ and if so, and the current character is CR followed by LF, advance the
starting offset by two characters instead of one.
PCRE_NO_START_OPTIMIZE
- There are a number of optimizations that pcre_exec() uses at the start
- of a match, in order to speed up the process. For example, if it is
+ There are a number of optimizations that pcre_exec() uses at the start
+ of a match, in order to speed up the process. For example, if it is
known that an unanchored match must start with a specific character, it
- searches the subject for that character, and fails immediately if it
- cannot find it, without actually running the main matching function.
+ searches the subject for that character, and fails immediately if it
+ cannot find it, without actually running the main matching function.
This means that a special item such as (*COMMIT) at the start of a pat-
- tern is not considered until after a suitable starting point for the
- match has been found. Also, when callouts or (*MARK) items are in use,
+ tern is not considered until after a suitable starting point for the
+ match has been found. Also, when callouts or (*MARK) items are in use,
these "start-up" optimizations can cause them to be skipped if the pat-
tern is never actually used. The start-up optimizations are in effect a
pre-scan of the subject that takes place before the pattern is run.
- The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
- possibly causing performance to suffer, but ensuring that in cases
- where the result is "no match", the callouts do occur, and that items
+ The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
+ possibly causing performance to suffer, but ensuring that in cases
+ where the result is "no match", the callouts do occur, and that items
such as (*COMMIT) and (*MARK) are considered at every possible starting
- position in the subject string. If PCRE_NO_START_OPTIMIZE is set at
- compile time, it cannot be unset at matching time. The use of
+ position in the subject string. If PCRE_NO_START_OPTIMIZE is set at
+ compile time, it cannot be unset at matching time. The use of
PCRE_NO_START_OPTIMIZE at matching time (that is, passing it to
- pcre_exec()) disables JIT execution; in this situation, matching is
+ pcre_exec()) disables JIT execution; in this situation, matching is
always done using interpretively.
- Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching
+ Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching
operation. Consider the pattern
(*COMMIT)ABC
- When this is compiled, PCRE records the fact that a match must start
- with the character "A". Suppose the subject string is "DEFABC". The
- start-up optimization scans along the subject, finds "A" and runs the
- first match attempt from there. The (*COMMIT) item means that the pat-
- tern must match the current starting position, which in this case, it
- does. However, if the same match is run with PCRE_NO_START_OPTIMIZE
- set, the initial scan along the subject string does not happen. The
- first match attempt is run starting from "D" and when this fails,
- (*COMMIT) prevents any further matches being tried, so the overall
- result is "no match". If the pattern is studied, more start-up opti-
- mizations may be used. For example, a minimum length for the subject
+ When this is compiled, PCRE records the fact that a match must start
+ with the character "A". Suppose the subject string is "DEFABC". The
+ start-up optimization scans along the subject, finds "A" and runs the
+ first match attempt from there. The (*COMMIT) item means that the pat-
+ tern must match the current starting position, which in this case, it
+ does. However, if the same match is run with PCRE_NO_START_OPTIMIZE
+ set, the initial scan along the subject string does not happen. The
+ first match attempt is run starting from "D" and when this fails,
+ (*COMMIT) prevents any further matches being tried, so the overall
+ result is "no match". If the pattern is studied, more start-up opti-
+ mizations may be used. For example, a minimum length for the subject
may be recorded. Consider the pattern
(*MARK:A)(X|Y)
- The minimum length for a match is one character. If the subject is
- "ABC", there will be attempts to match "ABC", "BC", "C", and then
- finally an empty string. If the pattern is studied, the final attempt
- does not take place, because PCRE knows that the subject is too short,
- and so the (*MARK) is never encountered. In this case, studying the
- pattern does not affect the overall match result, which is still "no
+ The minimum length for a match is one character. If the subject is
+ "ABC", there will be attempts to match "ABC", "BC", "C", and then
+ finally an empty string. If the pattern is studied, the final attempt
+ does not take place, because PCRE knows that the subject is too short,
+ and so the (*MARK) is never encountered. In this case, studying the
+ pattern does not affect the overall match result, which is still "no
match", but it does affect the auxiliary information that is returned.
PCRE_NO_UTF8_CHECK
When PCRE_UTF8 is set at compile time, the validity of the subject as a
- UTF-8 string is automatically checked when pcre_exec() is subsequently
+ UTF-8 string is automatically checked when pcre_exec() is subsequently
called. The entire string is checked before any other processing takes
- place. The value of startoffset is also checked to ensure that it
- points to the start of a UTF-8 character. There is a discussion about
- the validity of UTF-8 strings in the pcreunicode page. If an invalid
- sequence of bytes is found, pcre_exec() returns the error
+ place. The value of startoffset is also checked to ensure that it
+ points to the start of a UTF-8 character. There is a discussion about
+ the validity of UTF-8 strings in the pcreunicode page. If an invalid
+ sequence of bytes is found, pcre_exec() returns the error
PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
- both cases, information about the precise nature of the error may also
- be returned (see the descriptions of these errors in the section enti-
- tled Error return values from pcre_exec() below). If startoffset con-
+ both cases, information about the precise nature of the error may also
+ be returned (see the descriptions of these errors in the section enti-
+ tled Error return values from pcre_exec() below). If startoffset con-
tains a value that does not point to the start of a UTF-8 character (or
to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
- If you already know that your subject is valid, and you want to skip
- these checks for performance reasons, you can set the
- PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
- do this for the second and subsequent calls to pcre_exec() if you are
- making repeated calls to find all the matches in a single subject
- string. However, you should be sure that the value of startoffset
- points to the start of a character (or the end of the subject). When
+ If you already know that your subject is valid, and you want to skip
+ these checks for performance reasons, you can set the
+ PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
+ do this for the second and subsequent calls to pcre_exec() if you are
+ making repeated calls to find all the matches in a single subject
+ string. However, you should be sure that the value of startoffset
+ points to the start of a character (or the end of the subject). When
PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
- subject or an invalid value of startoffset is undefined. Your program
+ subject or an invalid value of startoffset is undefined. Your program
may crash or loop.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
- These options turn on the partial matching feature. For backwards com-
- patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
- match occurs if the end of the subject string is reached successfully,
- but there are not enough subject characters to complete the match. If
+ These options turn on the partial matching feature. For backwards com-
+ patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
+ match occurs if the end of the subject string is reached successfully,
+ but there are not enough subject characters to complete the match. If
this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
- matching continues by testing any remaining alternatives. Only if no
- complete match can be found is PCRE_ERROR_PARTIAL returned instead of
- PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
- caller is prepared to handle a partial match, but only if no complete
+ matching continues by testing any remaining alternatives. Only if no
+ complete match can be found is PCRE_ERROR_PARTIAL returned instead of
+ PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
+ caller is prepared to handle a partial match, but only if no complete
match can be found.
- If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
- case, if a partial match is found, pcre_exec() immediately returns
- PCRE_ERROR_PARTIAL, without considering any other alternatives. In
- other words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
+ If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
+ case, if a partial match is found, pcre_exec() immediately returns
+ PCRE_ERROR_PARTIAL, without considering any other alternatives. In
+ other words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
ered to be more important that an alternative complete match.
- In both cases, the portion of the string that was inspected when the
+ In both cases, the portion of the string that was inspected when the
partial match was found is set as the first matching string. There is a
- more detailed discussion of partial and multi-segment matching, with
+ more detailed discussion of partial and multi-segment matching, with
examples, in the pcrepartial documentation.
The string to be matched by pcre_exec()
- The subject string is passed to pcre_exec() as a pointer in subject, a
- length in length, and a starting offset in startoffset. The units for
- length and startoffset are bytes for the 8-bit library, 16-bit data
- items for the 16-bit library, and 32-bit data items for the 32-bit
+ The subject string is passed to pcre_exec() as a pointer in subject, a
+ length in length, and a starting offset in startoffset. The units for
+ length and startoffset are bytes for the 8-bit library, 16-bit data
+ items for the 16-bit library, and 32-bit data items for the 32-bit
library.
- If startoffset is negative or greater than the length of the subject,
- pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
- zero, the search for a match starts at the beginning of the subject,
- and this is by far the most common case. In UTF-8 or UTF-16 mode, the
- offset must point to the start of a character, or the end of the sub-
- ject (in UTF-32 mode, one data unit equals one character, so all off-
- sets are valid). Unlike the pattern string, the subject may contain
+ If startoffset is negative or greater than the length of the subject,
+ pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
+ zero, the search for a match starts at the beginning of the subject,
+ and this is by far the most common case. In UTF-8 or UTF-16 mode, the
+ offset must point to the start of a character, or the end of the sub-
+ ject (in UTF-32 mode, one data unit equals one character, so all off-
+ sets are valid). Unlike the pattern string, the subject may contain
binary zeroes.
- A non-zero starting offset is useful when searching for another match
- in the same subject by calling pcre_exec() again after a previous suc-
- cess. Setting startoffset differs from just passing over a shortened
- string and setting PCRE_NOTBOL in the case of a pattern that begins
+ A non-zero starting offset is useful when searching for another match
+ in the same subject by calling pcre_exec() again after a previous suc-
+ cess. Setting startoffset differs from just passing over a shortened
+ string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern
\Biss\B
- which finds occurrences of "iss" in the middle of words. (\B matches
- only if the current position in the subject is not a word boundary.)
- When applied to the string "Mississipi" the first call to pcre_exec()
- finds the first occurrence. If pcre_exec() is called again with just
- the remainder of the subject, namely "issipi", it does not match,
+ which finds occurrences of "iss" in the middle of words. (\B matches
+ only if the current position in the subject is not a word boundary.)
+ When applied to the string "Mississipi" the first call to pcre_exec()
+ finds the first occurrence. If pcre_exec() is called again with just
+ the remainder of the subject, namely "issipi", it does not match,
because \B is always false at the start of the subject, which is deemed
- to be a word boundary. However, if pcre_exec() is passed the entire
+ to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second occur-
- rence of "iss" because it is able to look behind the starting point to
+ rence of "iss" because it is able to look behind the starting point to
discover that it is preceded by a letter.
- Finding all the matches in a subject is tricky when the pattern can
+ Finding all the matches in a subject is tricky when the pattern can
match an empty string. It is possible to emulate Perl's /g behaviour by
- first trying the match again at the same offset, with the
- PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
- fails, advancing the starting offset and trying an ordinary match
+ first trying the match again at the same offset, with the
+ PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
+ fails, advancing the starting offset and trying an ordinary match
again. There is some code that demonstrates how to do this in the pcre-
demo sample program. In the most general case, you have to check to see
- if the newline convention recognizes CRLF as a newline, and if so, and
+ if the newline convention recognizes CRLF as a newline, and if so, and
the current character is CR followed by LF, advance the starting offset
by two characters instead of one.
- If a non-zero starting offset is passed when the pattern is anchored,
+ If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed
- if the pattern does not require the match to be at the start of the
+ if the pattern does not require the match to be at the start of the
subject.
How pcre_exec() returns captured substrings
- In general, a pattern matches a certain portion of the subject, and in
- addition, further substrings from the subject may be picked out by
- parts of the pattern. Following the usage in Jeffrey Friedl's book,
- this is called "capturing" in what follows, and the phrase "capturing
- subpattern" is used for a fragment of a pattern that picks out a sub-
- string. PCRE supports several other kinds of parenthesized subpattern
+ In general, a pattern matches a certain portion of the subject, and in
+ addition, further substrings from the subject may be picked out by
+ parts of the pattern. Following the usage in Jeffrey Friedl's book,
+ this is called "capturing" in what follows, and the phrase "capturing
+ subpattern" is used for a fragment of a pattern that picks out a sub-
+ string. PCRE supports several other kinds of parenthesized subpattern
that do not cause substrings to be captured.
Captured substrings are returned to the caller via a vector of integers
- whose address is passed in ovector. The number of elements in the vec-
- tor is passed in ovecsize, which must be a non-negative number. Note:
+ whose address is passed in ovector. The number of elements in the vec-
+ tor is passed in ovecsize, which must be a non-negative number. Note:
this argument is NOT the size of ovector in bytes.
- The first two-thirds of the vector is used to pass back captured sub-
- strings, each substring using a pair of integers. The remaining third
- of the vector is used as workspace by pcre_exec() while matching cap-
- turing subpatterns, and is not available for passing back information.
- The number passed in ovecsize should always be a multiple of three. If
+ The first two-thirds of the vector is used to pass back captured sub-
+ strings, each substring using a pair of integers. The remaining third
+ of the vector is used as workspace by pcre_exec() while matching cap-
+ turing subpatterns, and is not available for passing back information.
+ The number passed in ovecsize should always be a multiple of three. If
it is not, it is rounded down.
- When a match is successful, information about captured substrings is
- returned in pairs of integers, starting at the beginning of ovector,
- and continuing up to two-thirds of its length at the most. The first
- element of each pair is set to the offset of the first character in a
- substring, and the second is set to the offset of the first character
- after the end of a substring. These values are always data unit off-
- sets, even in UTF mode. They are byte offsets in the 8-bit library,
- 16-bit data item offsets in the 16-bit library, and 32-bit data item
+ When a match is successful, information about captured substrings is
+ returned in pairs of integers, starting at the beginning of ovector,
+ and continuing up to two-thirds of its length at the most. The first
+ element of each pair is set to the offset of the first character in a
+ substring, and the second is set to the offset of the first character
+ after the end of a substring. These values are always data unit off-
+ sets, even in UTF mode. They are byte offsets in the 8-bit library,
+ 16-bit data item offsets in the 16-bit library, and 32-bit data item
offsets in the 32-bit library. Note: they are not character counts.
- The first pair of integers, ovector[0] and ovector[1], identify the
- portion of the subject string matched by the entire pattern. The next
- pair is used for the first capturing subpattern, and so on. The value
+ The first pair of integers, ovector[0] and ovector[1], identify the
+ portion of the subject string matched by the entire pattern. The next
+ pair is used for the first capturing subpattern, and so on. The value
returned by pcre_exec() is one more than the highest numbered pair that
- has been set. For example, if two substrings have been captured, the
- returned value is 3. If there are no capturing subpatterns, the return
+ has been set. For example, if two substrings have been captured, the
+ returned value is 3. If there are no capturing subpatterns, the return
value from a successful match is 1, indicating that just the first pair
of offsets has been set.
If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.
- If the vector is too small to hold all the captured substring offsets,
+ If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
- function returns a value of zero. If neither the actual string matched
- nor any captured substrings are of interest, pcre_exec() may be called
- with ovector passed as NULL and ovecsize as zero. However, if the pat-
- tern contains back references and the ovector is not big enough to
- remember the related substrings, PCRE has to get additional memory for
- use during matching. Thus it is usually advisable to supply an ovector
+ function returns a value of zero. If neither the actual string matched
+ nor any captured substrings are of interest, pcre_exec() may be called
+ with ovector passed as NULL and ovecsize as zero. However, if the pat-
+ tern contains back references and the ovector is not big enough to
+ remember the related substrings, PCRE has to get additional memory for
+ use during matching. Thus it is usually advisable to supply an ovector
of reasonable size.
- There are some cases where zero is returned (indicating vector over-
- flow) when in fact the vector is exactly the right size for the final
+ There are some cases where zero is returned (indicating vector over-
+ flow) when in fact the vector is exactly the right size for the final
match. For example, consider the pattern
(a)(?:(b)c|bd)
- If a vector of 6 elements (allowing for only 1 captured substring) is
+ If a vector of 6 elements (allowing for only 1 captured substring) is
given with subject string "abd", pcre_exec() will try to set the second
captured string, thereby recording a vector overflow, before failing to
- match "c" and backing up to try the second alternative. The zero
- return, however, does correctly indicate that the maximum number of
+ match "c" and backing up to try the second alternative. The zero
+ return, however, does correctly indicate that the maximum number of
slots (namely 2) have been filled. In similar cases where there is tem-
- porary overflow, but the final number of used slots is actually less
+ porary overflow, but the final number of used slots is actually less
than the maximum, a non-zero value is returned.
The pcre_fullinfo() function can be used to find out how many capturing
- subpatterns there are in a compiled pattern. The smallest size for
- ovector that will allow for n captured substrings, in addition to the
+ subpatterns there are in a compiled pattern. The smallest size for
+ ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.
- It is possible for capturing subpattern number n+1 to match some part
+ It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
- if the string "abc" is matched against the pattern (a|(z))(bc) the
+ if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
- 2 is not. When this happens, both values in the offset pairs corre-
+ 2 is not. When this happens, both values in the offset pairs corre-
sponding to unused subpatterns are set to -1.
- Offset values that correspond to unused subpatterns at the end of the
- expression are also set to -1. For example, if the string "abc" is
- matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
- matched. The return from the function is 2, because the highest used
- capturing subpattern number is 1, and the offsets for for the second
- and third capturing subpatterns (assuming the vector is large enough,
+ Offset values that correspond to unused subpatterns at the end of the
+ expression are also set to -1. For example, if the string "abc" is
+ matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
+ matched. The return from the function is 2, because the highest used
+ capturing subpattern number is 1, and the offsets for for the second
+ and third capturing subpatterns (assuming the vector is large enough,
of course) are set to -1.
- Note: Elements in the first two-thirds of ovector that do not corre-
- spond to capturing parentheses in the pattern are never changed. That
- is, if a pattern contains n capturing parentheses, no more than ovec-
- tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in
+ Note: Elements in the first two-thirds of ovector that do not corre-
+ spond to capturing parentheses in the pattern are never changed. That
+ is, if a pattern contains n capturing parentheses, no more than ovec-
+ tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements (in
the first two-thirds) retain whatever values they previously had.
- Some convenience functions are provided for extracting the captured
+ Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.
Error return values from pcre_exec()
- If pcre_exec() fails, it returns a negative number. The following are
+ If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:
PCRE_ERROR_NOMATCH (-1)
@@ -3585,7 +3602,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_NULL (-2)
- Either code or subject was passed as NULL, or ovector was NULL and
+ Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.
PCRE_ERROR_BADOPTION (-3)
@@ -3594,82 +3611,82 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_BADMAGIC (-4)
- PCRE stores a 4-byte "magic number" at the start of the compiled code,
+ PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in
- an environment with the other endianness. This is the error that PCRE
+ an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present.
PCRE_ERROR_UNKNOWN_OPCODE (-5)
While running the pattern match, an unknown item was encountered in the
- compiled pattern. This error could be caused by a bug in PCRE or by
+ compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.
PCRE_ERROR_NOMEMORY (-6)
- If a pattern contains back references, but the ovector that is passed
+ If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
- PCRE gets a block of memory at the start of matching to use for this
- purpose. If the call via pcre_malloc() fails, this error is given. The
+ PCRE gets a block of memory at the start of matching to use for this
+ purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching.
- This error is also given if pcre_stack_malloc() fails in pcre_exec().
- This can happen only when PCRE has been compiled with --disable-stack-
+ This error is also given if pcre_stack_malloc() fails in pcre_exec().
+ This can happen only when PCRE has been compiled with --disable-stack-
for-recursion.
PCRE_ERROR_NOSUBSTRING (-7)
- This error is used by the pcre_copy_substring(), pcre_get_substring(),
+ This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().
PCRE_ERROR_MATCHLIMIT (-8)
- The backtracking limit, as specified by the match_limit field in a
- pcre_extra structure (or defaulted) was reached. See the description
+ The backtracking limit, as specified by the match_limit field in a
+ pcre_extra structure (or defaulted) was reached. See the description
above.
PCRE_ERROR_CALLOUT (-9)
This error is never generated by pcre_exec() itself. It is provided for
- use by callout functions that want to yield a distinctive error code.
+ use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.
PCRE_ERROR_BADUTF8 (-10)
- A string that contains an invalid UTF-8 byte sequence was passed as a
- subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
- the output vector (ovecsize) is at least 2, the byte offset to the
- start of the the invalid UTF-8 character is placed in the first ele-
- ment, and a reason code is placed in the second element. The reason
+ A string that contains an invalid UTF-8 byte sequence was passed as a
+ subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size of
+ the output vector (ovecsize) is at least 2, the byte offset to the
+ start of the the invalid UTF-8 character is placed in the first ele-
+ ment, and a reason code is placed in the second element. The reason
codes are listed in the following section. For backward compatibility,
- if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
- acter at the end of the subject (reason codes 1 to 5),
+ if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
+ acter at the end of the subject (reason codes 1 to 5),
PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
PCRE_ERROR_BADUTF8_OFFSET (-11)
- The UTF-8 byte sequence that was passed as a subject was checked and
- found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
- value of startoffset did not point to the beginning of a UTF-8 charac-
+ The UTF-8 byte sequence that was passed as a subject was checked and
+ found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
+ value of startoffset did not point to the beginning of a UTF-8 charac-
ter or the end of the subject.
PCRE_ERROR_PARTIAL (-12)
- The subject string did not match, but it did match partially. See the
+ The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.
PCRE_ERROR_BADPARTIAL (-13)
- This code is no longer in use. It was formerly returned when the
- PCRE_PARTIAL option was used with a compiled pattern containing items
- that were not supported for partial matching. From release 8.00
+ This code is no longer in use. It was formerly returned when the
+ PCRE_PARTIAL option was used with a compiled pattern containing items
+ that were not supported for partial matching. From release 8.00
onwards, there are no restrictions on partial matching.
PCRE_ERROR_INTERNAL (-14)
- An unexpected internal error has occurred. This error could be caused
+ An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.
PCRE_ERROR_BADCOUNT (-15)
@@ -3679,7 +3696,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_RECURSIONLIMIT (-21)
The internal recursion limit, as specified by the match_limit_recursion
- field in a pcre_extra structure (or defaulted) was reached. See the
+ field in a pcre_extra structure (or defaulted) was reached. See the
description above.
PCRE_ERROR_BADNEWLINE (-23)
@@ -3693,29 +3710,29 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_SHORTUTF8 (-25)
- This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
- string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
- option is set. Information about the failure is returned as for
- PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but
- this special error code for PCRE_PARTIAL_HARD precedes the implementa-
- tion of returned information; it is retained for backwards compatibil-
+ This error is returned instead of PCRE_ERROR_BADUTF8 when the subject
+ string ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
+ option is set. Information about the failure is returned as for
+ PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this case, but
+ this special error code for PCRE_PARTIAL_HARD precedes the implementa-
+ tion of returned information; it is retained for backwards compatibil-
ity.
PCRE_ERROR_RECURSELOOP (-26)
This error is returned when pcre_exec() detects a recursion loop within
- the pattern. Specifically, it means that either the whole pattern or a
- subpattern has been called recursively for the second time at the same
+ the pattern. Specifically, it means that either the whole pattern or a
+ subpattern has been called recursively for the second time at the same
position in the subject string. Some simple patterns that might do this
- are detected and faulted at compile time, but more complicated cases,
+ are detected and faulted at compile time, but more complicated cases,
in particular mutual recursions between two different subpatterns, can-
not be detected until run time.
PCRE_ERROR_JIT_STACKLIMIT (-27)
- This error is returned when a pattern that was successfully studied
- using a JIT compile option is being matched, but the memory available
- for the just-in-time processing stack is not large enough. See the
+ This error is returned when a pattern that was successfully studied
+ using a JIT compile option is being matched, but the memory available
+ for the just-in-time processing stack is not large enough. See the
pcrejit documentation for more details.
PCRE_ERROR_BADMODE (-28)
@@ -3725,38 +3742,38 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_BADENDIANNESS (-29)
- This error is given if a pattern that was compiled and saved is
- reloaded on a host with different endianness. The utility function
+ This error is given if a pattern that was compiled and saved is
+ reloaded on a host with different endianness. The utility function
pcre_pattern_to_host_byte_order() can be used to convert such a pattern
so that it runs on the new host.
PCRE_ERROR_JIT_BADOPTION
- This error is returned when a pattern that was successfully studied
- using a JIT compile option is being matched, but the matching mode
- (partial or complete match) does not correspond to any JIT compilation
- mode. When the JIT fast path function is used, this error may be also
- given for invalid options. See the pcrejit documentation for more
+ This error is returned when a pattern that was successfully studied
+ using a JIT compile option is being matched, but the matching mode
+ (partial or complete match) does not correspond to any JIT compilation
+ mode. When the JIT fast path function is used, this error may be also
+ given for invalid options. See the pcrejit documentation for more
details.
PCRE_ERROR_BADLENGTH (-32)
- This error is given if pcre_exec() is called with a negative value for
+ This error is given if pcre_exec() is called with a negative value for
the length argument.
Error numbers -16 to -20, -22, and 30 are not used by pcre_exec().
Reason codes for invalid UTF-8 strings
- This section applies only to the 8-bit library. The corresponding
- information for the 16-bit and 32-bit libraries is given in the pcre16
+ This section applies only to the 8-bit library. The corresponding
+ information for the 16-bit and 32-bit libraries is given in the pcre16
and pcre32 pages.
When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
- UTF8, and the size of the output vector (ovecsize) is at least 2, the
- offset of the start of the invalid UTF-8 character is placed in the
+ UTF8, and the size of the output vector (ovecsize) is at least 2, the
+ offset of the start of the invalid UTF-8 character is placed in the
first output vector element (ovector[0]) and a reason code is placed in
- the second element (ovector[1]). The reason codes are given names in
+ the second element (ovector[1]). The reason codes are given names in
the pcre.h header file:
PCRE_UTF8_ERR1
@@ -3765,10 +3782,10 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_UTF8_ERR4
PCRE_UTF8_ERR5
- The string ends with a truncated UTF-8 character; the code specifies
- how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
- characters to be no longer than 4 bytes, the encoding scheme (origi-
- nally defined by RFC 2279) allows for up to 6 bytes, and this is
+ The string ends with a truncated UTF-8 character; the code specifies
+ how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
+ characters to be no longer than 4 bytes, the encoding scheme (origi-
+ nally defined by RFC 2279) allows for up to 6 bytes, and this is
checked first; hence the possibility of 4 or 5 missing bytes.
PCRE_UTF8_ERR6
@@ -3778,24 +3795,24 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_UTF8_ERR10
The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
- the character do not have the binary value 0b10 (that is, either the
+ the character do not have the binary value 0b10 (that is, either the
most significant bit is 0, or the next bit is 1).
PCRE_UTF8_ERR11
PCRE_UTF8_ERR12
- A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
+ A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
long; these code points are excluded by RFC 3629.
PCRE_UTF8_ERR13
- A 4-byte character has a value greater than 0x10fff; these code points
+ A 4-byte character has a value greater than 0x10fff; these code points
are excluded by RFC 3629.
PCRE_UTF8_ERR14
- A 3-byte character has a value in the range 0xd800 to 0xdfff; this
- range of code points are reserved by RFC 3629 for use with UTF-16, and
+ A 3-byte character has a value in the range 0xd800 to 0xdfff; this
+ range of code points are reserved by RFC 3629 for use with UTF-16, and
so are excluded from UTF-8.
PCRE_UTF8_ERR15
@@ -3804,28 +3821,28 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_UTF8_ERR18
PCRE_UTF8_ERR19
- A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
- for a value that can be represented by fewer bytes, which is invalid.
- For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
+ A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
+ for a value that can be represented by fewer bytes, which is invalid.
+ For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
rect coding uses just one byte.
PCRE_UTF8_ERR20
The two most significant bits of the first byte of a character have the
- binary value 0b10 (that is, the most significant bit is 1 and the sec-
- ond is 0). Such a byte can only validly occur as the second or subse-
+ binary value 0b10 (that is, the most significant bit is 1 and the sec-
+ ond is 0). Such a byte can only validly occur as the second or subse-
quent byte of a multi-byte character.
PCRE_UTF8_ERR21
- The first byte of a character has the value 0xfe or 0xff. These values
+ The first byte of a character has the value 0xfe or 0xff. These values
can never occur in a valid UTF-8 string.
PCRE_UTF8_ERR22
- This error code was formerly used when the presence of a so-called
- "non-character" caused an error. Unicode corrigendum #9 makes it clear
- that such characters should not cause a string to be rejected, and so
+ This error code was formerly used when the presence of a so-called
+ "non-character" caused an error. Unicode corrigendum #9 makes it clear
+ that such characters should not cause a string to be rejected, and so
this code is no longer in use and is never returned.
@@ -3842,78 +3859,78 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);
- Captured substrings can be accessed directly by using the offsets
- returned by pcre_exec() in ovector. For convenience, the functions
+ Captured substrings can be accessed directly by using the offsets
+ returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
- string_list() are provided for extracting captured substrings as new,
- separate, zero-terminated strings. These functions identify substrings
- by number. The next section describes functions for extracting named
+ string_list() are provided for extracting captured substrings as new,
+ separate, zero-terminated strings. These functions identify substrings
+ by number. The next section describes functions for extracting named
substrings.
- A substring that contains a binary zero is correctly extracted and has
- a further zero added on the end, but the result is not, of course, a C
- string. However, you can process such a string by referring to the
- length that is returned by pcre_copy_substring() and pcre_get_sub-
+ A substring that contains a binary zero is correctly extracted and has
+ a further zero added on the end, but the result is not, of course, a C
+ string. However, you can process such a string by referring to the
+ length that is returned by pcre_copy_substring() and pcre_get_sub-
string(). Unfortunately, the interface to pcre_get_substring_list() is
- not adequate for handling strings containing binary zeros, because the
+ not adequate for handling strings containing binary zeros, because the
end of the final string is not independently indicated.
- The first three arguments are the same for all three of these func-
- tions: subject is the subject string that has just been successfully
+ The first three arguments are the same for all three of these func-
+ tions: subject is the subject string that has just been successfully
matched, ovector is a pointer to the vector of integer offsets that was
passed to pcre_exec(), and stringcount is the number of substrings that
- were captured by the match, including the substring that matched the
+ were captured by the match, including the substring that matched the
entire regular expression. This is the value returned by pcre_exec() if
- it is greater than zero. If pcre_exec() returned zero, indicating that
- it ran out of space in ovector, the value passed as stringcount should
+ it is greater than zero. If pcre_exec() returned zero, indicating that
+ it ran out of space in ovector, the value passed as stringcount should
be the number of elements in the vector divided by three.
- The functions pcre_copy_substring() and pcre_get_substring() extract a
- single substring, whose number is given as stringnumber. A value of
- zero extracts the substring that matched the entire pattern, whereas
- higher values extract the captured substrings. For pcre_copy_sub-
- string(), the string is placed in buffer, whose length is given by
- buffersize, while for pcre_get_substring() a new block of memory is
- obtained via pcre_malloc, and its address is returned via stringptr.
- The yield of the function is the length of the string, not including
+ The functions pcre_copy_substring() and pcre_get_substring() extract a
+ single substring, whose number is given as stringnumber. A value of
+ zero extracts the substring that matched the entire pattern, whereas
+ higher values extract the captured substrings. For pcre_copy_sub-
+ string(), the string is placed in buffer, whose length is given by
+ buffersize, while for pcre_get_substring() a new block of memory is
+ obtained via pcre_malloc, and its address is returned via stringptr.
+ The yield of the function is the length of the string, not including
the terminating zero, or one of these error codes:
PCRE_ERROR_NOMEMORY (-6)
- The buffer was too small for pcre_copy_substring(), or the attempt to
+ The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().
PCRE_ERROR_NOSUBSTRING (-7)
There is no substring whose number is stringnumber.
- The pcre_get_substring_list() function extracts all available sub-
- strings and builds a list of pointers to them. All this is done in a
+ The pcre_get_substring_list() function extracts all available sub-
+ strings and builds a list of pointers to them. All this is done in a
single block of memory that is obtained via pcre_malloc. The address of
- the memory block is returned via listptr, which is also the start of
- the list of string pointers. The end of the list is marked by a NULL
- pointer. The yield of the function is zero if all went well, or the
+ the memory block is returned via listptr, which is also the start of
+ the list of string pointers. The end of the list is marked by a NULL
+ pointer. The yield of the function is zero if all went well, or the
error code
PCRE_ERROR_NOMEMORY (-6)
if the attempt to get the memory block failed.
- When any of these functions encounter a substring that is unset, which
- can happen when capturing subpattern number n+1 matches some part of
- the subject, but subpattern n has not been used at all, they return an
+ When any of these functions encounter a substring that is unset, which
+ can happen when capturing subpattern number n+1 matches some part of
+ the subject, but subpattern n has not been used at all, they return an
empty string. This can be distinguished from a genuine zero-length sub-
- string by inspecting the appropriate offset in ovector, which is nega-
+ string by inspecting the appropriate offset in ovector, which is nega-
tive for unset substrings.
- The two convenience functions pcre_free_substring() and pcre_free_sub-
- string_list() can be used to free the memory returned by a previous
+ The two convenience functions pcre_free_substring() and pcre_free_sub-
+ string_list() can be used to free the memory returned by a previous
call of pcre_get_substring() or pcre_get_substring_list(), respec-
- tively. They do nothing more than call the function pointed to by
- pcre_free, which of course could be called directly from a C program.
- However, PCRE is used in some situations where it is linked via a spe-
- cial interface to another programming language that cannot use
- pcre_free directly; it is for these cases that the functions are pro-
+ tively. They do nothing more than call the function pointed to by
+ pcre_free, which of course could be called directly from a C program.
+ However, PCRE is used in some situations where it is linked via a spe-
+ cial interface to another programming language that cannot use
+ pcre_free directly; it is for these cases that the functions are pro-
vided.
@@ -3932,7 +3949,7 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
int stringcount, const char *stringname,
const char **stringptr);
- To extract a substring by name, you first have to find associated num-
+ To extract a substring by name, you first have to find associated num-
ber. For example, for this pattern
(a+)b(?<xxx>\d+)...
@@ -3941,35 +3958,35 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
be unique (PCRE_DUPNAMES was not set), you can find the number from the
name by calling pcre_get_stringnumber(). The first argument is the com-
piled pattern, and the second is the name. The yield of the function is
- the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
+ the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
subpattern of that name.
Given the number, you can extract the substring directly, or use one of
the functions described in the previous section. For convenience, there
are also two functions that do the whole job.
- Most of the arguments of pcre_copy_named_substring() and
- pcre_get_named_substring() are the same as those for the similarly
- named functions that extract by number. As these are described in the
- previous section, they are not re-described here. There are just two
+ Most of the arguments of pcre_copy_named_substring() and
+ pcre_get_named_substring() are the same as those for the similarly
+ named functions that extract by number. As these are described in the
+ previous section, they are not re-described here. There are just two
differences:
- First, instead of a substring number, a substring name is given. Sec-
+ First, instead of a substring number, a substring name is given. Sec-
ond, there is an extra argument, given at the start, which is a pointer
- to the compiled pattern. This is needed in order to gain access to the
+ to the compiled pattern. This is needed in order to gain access to the
name-to-number translation table.
- These functions call pcre_get_stringnumber(), and if it succeeds, they
- then call pcre_copy_substring() or pcre_get_substring(), as appropri-
- ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
+ These functions call pcre_get_stringnumber(), and if it succeeds, they
+ then call pcre_copy_substring() or pcre_get_substring(), as appropri-
+ ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
behaviour may not be what you want (see the next section).
Warning: If the pattern uses the (?| feature to set up multiple subpat-
- terns with the same number, as described in the section on duplicate
- subpattern numbers in the pcrepattern page, you cannot use names to
- distinguish the different subpatterns, because names are not included
- in the compiled code. The matching process uses only numbers. For this
- reason, the use of different names for subpatterns of the same number
+ terns with the same number, as described in the section on duplicate
+ subpattern numbers in the pcrepattern page, you cannot use names to
+ distinguish the different subpatterns, because names are not included
+ in the compiled code. The matching process uses only numbers. For this
+ reason, the use of different names for subpatterns of the same number
causes an error at compile time.
@@ -3978,76 +3995,76 @@ DUPLICATE SUBPATTERN NAMES
int pcre_get_stringtable_entries(const pcre *code,
const char *name, char **first, char **last);
- When a pattern is compiled with the PCRE_DUPNAMES option, names for
- subpatterns are not required to be unique. (Duplicate names are always
- allowed for subpatterns with the same number, created by using the (?|
- feature. Indeed, if such subpatterns are named, they are required to
+ When a pattern is compiled with the PCRE_DUPNAMES option, names for
+ subpatterns are not required to be unique. (Duplicate names are always
+ allowed for subpatterns with the same number, created by using the (?|
+ feature. Indeed, if such subpatterns are named, they are required to
use the same names.)
Normally, patterns with duplicate names are such that in any one match,
- only one of the named subpatterns participates. An example is shown in
+ only one of the named subpatterns participates. An example is shown in
the pcrepattern documentation.
- When duplicates are present, pcre_copy_named_substring() and
- pcre_get_named_substring() return the first substring corresponding to
- the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
- (-7) is returned; no data is returned. The pcre_get_stringnumber()
- function returns one of the numbers that are associated with the name,
+ When duplicates are present, pcre_copy_named_substring() and
+ pcre_get_named_substring() return the first substring corresponding to
+ the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
+ (-7) is returned; no data is returned. The pcre_get_stringnumber()
+ function returns one of the numbers that are associated with the name,
but it is not defined which it is.
- If you want to get full details of all captured substrings for a given
- name, you must use the pcre_get_stringtable_entries() function. The
+ If you want to get full details of all captured substrings for a given
+ name, you must use the pcre_get_stringtable_entries() function. The
first argument is the compiled pattern, and the second is the name. The
- third and fourth are pointers to variables which are updated by the
+ third and fourth are pointers to variables which are updated by the
function. After it has run, they point to the first and last entries in
- the name-to-number table for the given name. The function itself
- returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
- there are none. The format of the table is described above in the sec-
- tion entitled Information about a pattern above. Given all the rele-
- vant entries for the name, you can extract each of their numbers, and
+ the name-to-number table for the given name. The function itself
+ returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
+ there are none. The format of the table is described above in the sec-
+ tion entitled Information about a pattern above. Given all the rele-
+ vant entries for the name, you can extract each of their numbers, and
hence the captured data, if any.
FINDING ALL POSSIBLE MATCHES
- The traditional matching function uses a similar algorithm to Perl,
+ The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in
- the subject. If you want to find all possible matches, or the longest
- possible match, consider using the alternative matching function (see
- below) instead. If you cannot use the alternative function, but still
- need to find all possible matches, you can kludge it up by making use
+ the subject. If you want to find all possible matches, or the longest
+ possible match, consider using the alternative matching function (see
+ below) instead. If you cannot use the alternative function, but still
+ need to find all possible matches, you can kludge it up by making use
of the callout facility, which is described in the pcrecallout documen-
tation.
What you have to do is to insert a callout right at the end of the pat-
- tern. When your callout function is called, extract and save the cur-
- rent matched substring. Then return 1, which forces pcre_exec() to
- backtrack and try other alternatives. Ultimately, when it runs out of
+ tern. When your callout function is called, extract and save the cur-
+ rent matched substring. Then return 1, which forces pcre_exec() to
+ backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
OBTAINING AN ESTIMATE OF STACK USAGE
- Matching certain patterns using pcre_exec() can use a lot of process
- stack, which in certain environments can be rather limited in size.
- Some users find it helpful to have an estimate of the amount of stack
- that is used by pcre_exec(), to help them set recursion limits, as
- described in the pcrestack documentation. The estimate that is output
+ Matching certain patterns using pcre_exec() can use a lot of process
+ stack, which in certain environments can be rather limited in size.
+ Some users find it helpful to have an estimate of the amount of stack
+ that is used by pcre_exec(), to help them set recursion limits, as
+ described in the pcrestack documentation. The estimate that is output
by pcretest when called with the -m and -C options is obtained by call-
- ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
+ ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
first five arguments.
- Normally, if its first argument is NULL, pcre_exec() immediately
- returns the negative error code PCRE_ERROR_NULL, but with this special
- combination of arguments, it returns instead a negative number whose
- absolute value is the approximate stack frame size in bytes. (A nega-
- tive number is used so that it is clear that no match has happened.)
- The value is approximate because in some cases, recursive calls to
+ Normally, if its first argument is NULL, pcre_exec() immediately
+ returns the negative error code PCRE_ERROR_NULL, but with this special
+ combination of arguments, it returns instead a negative number whose
+ absolute value is the approximate stack frame size in bytes. (A nega-
+ tive number is used so that it is clear that no match has happened.)
+ The value is approximate because in some cases, recursive calls to
pcre_exec() occur when there are one or two additional variables on the
stack.
- If PCRE has been compiled to use the heap instead of the stack for
- recursion, the value returned is the size of each block that is
+ If PCRE has been compiled to use the heap instead of the stack for
+ recursion, the value returned is the size of each block that is
obtained from the heap.
@@ -4058,26 +4075,26 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
int options, int *ovector, int ovecsize,
int *workspace, int wscount);
- The function pcre_dfa_exec() is called to match a subject string
- against a compiled pattern, using a matching algorithm that scans the
- subject string just once, and does not backtrack. This has different
- characteristics to the normal algorithm, and is not compatible with
- Perl. Some of the features of PCRE patterns are not supported. Never-
- theless, there are times when this kind of matching can be useful. For
- a discussion of the two matching algorithms, and a list of features
- that pcre_dfa_exec() does not support, see the pcrematching documenta-
+ The function pcre_dfa_exec() is called to match a subject string
+ against a compiled pattern, using a matching algorithm that scans the
+ subject string just once, and does not backtrack. This has different
+ characteristics to the normal algorithm, and is not compatible with
+ Perl. Some of the features of PCRE patterns are not supported. Never-
+ theless, there are times when this kind of matching can be useful. For
+ a discussion of the two matching algorithms, and a list of features
+ that pcre_dfa_exec() does not support, see the pcrematching documenta-
tion.
- The arguments for the pcre_dfa_exec() function are the same as for
+ The arguments for the pcre_dfa_exec() function are the same as for
pcre_exec(), plus two extras. The ovector argument is used in a differ-
- ent way, and this is described below. The other common arguments are
- used in the same way as for pcre_exec(), so their description is not
+ ent way, and this is described below. The other common arguments are
+ used in the same way as for pcre_exec(), so their description is not
repeated here.
- The two additional arguments provide workspace for the function. The
- workspace vector should contain at least 20 elements. It is used for
+ The two additional arguments provide workspace for the function. The
+ workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More
- workspace will be needed for patterns and subjects where there are a
+ workspace will be needed for patterns and subjects where there are a
lot of potential matches.
Here is an example of a simple call to pcre_dfa_exec():
@@ -4099,55 +4116,55 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre_dfa_exec()
- The unused bits of the options argument for pcre_dfa_exec() must be
- zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
+ The unused bits of the options argument for pcre_dfa_exec() must be
+ zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
- PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
- PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
- TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
- four of these are exactly the same as for pcre_exec(), so their
+ PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
+ PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
+ TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
+ four of these are exactly the same as for pcre_exec(), so their
description is not repeated here.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
- These have the same general effect as they do for pcre_exec(), but the
- details are slightly different. When PCRE_PARTIAL_HARD is set for
- pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
- ject is reached and there is still at least one matching possibility
+ These have the same general effect as they do for pcre_exec(), but the
+ details are slightly different. When PCRE_PARTIAL_HARD is set for
+ pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
+ ject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete
matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
- of the subject is reached, there have been no complete matches, but
- there is still at least one matching possibility. The portion of the
- string that was inspected when the longest partial match was found is
- set as the first matching string in both cases. There is a more
- detailed discussion of partial and multi-segment matching, with exam-
+ of the subject is reached, there have been no complete matches, but
+ there is still at least one matching possibility. The portion of the
+ string that was inspected when the longest partial match was found is
+ set as the first matching string in both cases. There is a more
+ detailed discussion of partial and multi-segment matching, with exam-
ples, in the pcrepartial documentation.
PCRE_DFA_SHORTEST
- Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
+ Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna-
- tive algorithm works, this is necessarily the shortest possible match
+ tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string.
PCRE_DFA_RESTART
When pcre_dfa_exec() returns a partial match, it is possible to call it
- again, with additional subject characters, and have it continue with
- the same match. The PCRE_DFA_RESTART option requests this action; when
- it is set, the workspace and wscount options must reference the same
- vector as before because data about the match so far is left in them
+ again, with additional subject characters, and have it continue with
+ the same match. The PCRE_DFA_RESTART option requests this action; when
+ it is set, the workspace and wscount options must reference the same
+ vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the
pcrepartial documentation.
Successful returns from pcre_dfa_exec()
- When pcre_dfa_exec() succeeds, it may have matched more than one sub-
+ When pcre_dfa_exec() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run
- of the function start at the same point in the subject. The shorter
- matches are all initial substrings of the longer matches. For example,
+ of the function start at the same point in the subject. The shorter
+ matches are all initial substrings of the longer matches. For example,
if the pattern
<.*>
@@ -4162,79 +4179,79 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
<something> <something else>
<something> <something else> <something further>
- On success, the yield of the function is a number greater than zero,
- which is the number of matched substrings. The substrings themselves
- are returned in ovector. Each string uses two elements; the first is
- the offset to the start, and the second is the offset to the end. In
- fact, all the strings have the same start offset. (Space could have
- been saved by giving this only once, but it was decided to retain some
- compatibility with the way pcre_exec() returns data, even though the
+ On success, the yield of the function is a number greater than zero,
+ which is the number of matched substrings. The substrings themselves
+ are returned in ovector. Each string uses two elements; the first is
+ the offset to the start, and the second is the offset to the end. In
+ fact, all the strings have the same start offset. (Space could have
+ been saved by giving this only once, but it was decided to retain some
+ compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.)
The strings are returned in reverse order of length; that is, the long-
- est matching string is given first. If there were too many matches to
- fit into ovector, the yield of the function is zero, and the vector is
- filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
+ est matching string is given first. If there were too many matches to
+ fit into ovector, the yield of the function is zero, and the vector is
+ filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec()
can use the entire ovector for returning matched strings.
- NOTE: PCRE's "auto-possessification" optimization usually applies to
- character repeats at the end of a pattern (as well as internally). For
- example, the pattern "a\d+" is compiled as if it were "a\d++" because
+ NOTE: PCRE's "auto-possessification" optimization usually applies to
+ character repeats at the end of a pattern (as well as internally). For
+ example, the pattern "a\d+" is compiled as if it were "a\d++" because
there is no point even considering the possibility of backtracking into
- the repeated digits. For DFA matching, this means that only one possi-
- ble match is found. If you really do want multiple matches in such
- cases, either use an ungreedy repeat ("a\d+?") or set the
+ the repeated digits. For DFA matching, this means that only one possi-
+ ble match is found. If you really do want multiple matches in such
+ cases, either use an ungreedy repeat ("a\d+?") or set the
PCRE_NO_AUTO_POSSESS option when compiling.
Error returns from pcre_dfa_exec()
- The pcre_dfa_exec() function returns a negative number when it fails.
- Many of the errors are the same as for pcre_exec(), and these are
- described above. There are in addition the following errors that are
+ The pcre_dfa_exec() function returns a negative number when it fails.
+ Many of the errors are the same as for pcre_exec(), and these are
+ described above. There are in addition the following errors that are
specific to pcre_dfa_exec():
PCRE_ERROR_DFA_UITEM (-16)
- This return is given if pcre_dfa_exec() encounters an item in the pat-
- tern that it does not support, for instance, the use of \C or a back
+ This return is given if pcre_dfa_exec() encounters an item in the pat-
+ tern that it does not support, for instance, the use of \C or a back
reference.
PCRE_ERROR_DFA_UCOND (-17)
- This return is given if pcre_dfa_exec() encounters a condition item
- that uses a back reference for the condition, or a test for recursion
+ This return is given if pcre_dfa_exec() encounters a condition item
+ that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.
PCRE_ERROR_DFA_UMLIMIT (-18)
- This return is given if pcre_dfa_exec() is called with an extra block
- that contains a setting of the match_limit or match_limit_recursion
- fields. This is not supported (these fields are meaningless for DFA
+ This return is given if pcre_dfa_exec() is called with an extra block
+ that contains a setting of the match_limit or match_limit_recursion
+ fields. This is not supported (these fields are meaningless for DFA
matching).
PCRE_ERROR_DFA_WSSIZE (-19)
- This return is given if pcre_dfa_exec() runs out of space in the
+ This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.
PCRE_ERROR_DFA_RECURSE (-20)
- When a recursive subpattern is processed, the matching function calls
- itself recursively, using private vectors for ovector and workspace.
- This error is given if the output vector is not large enough. This
+ When a recursive subpattern is processed, the matching function calls
+ itself recursively, using private vectors for ovector and workspace.
+ This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.
PCRE_ERROR_DFA_BADRESTART (-30)
- When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
- plausibility checks are made on the contents of the workspace, which
- should contain data about the previous partial match. If any of these
+ When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some
+ plausibility checks are made on the contents of the workspace, which
+ should contain data about the previous partial match. If any of these
checks fail, this error is given.
SEE ALSO
- pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3)(3),
+ pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3)(3),
pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
sample(3), pcrestack(3).
@@ -4248,8 +4265,8 @@ AUTHOR
REVISION
- Last updated: 12 November 2013
- Copyright (c) 1997-2013 University of Cambridge.
+ Last updated: 09 February 2014
+ Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
@@ -5510,7 +5527,9 @@ BACKSLASH
Perl documents that the use of \K within assertions is "not well
defined". In PCRE, \K is acted upon when it occurs inside positive
- assertions, but is ignored in negative assertions.
+ assertions, but is ignored in negative assertions. Note that when a
+ pattern such as (?=ab\K) matches, the reported start of the match can
+ be greater than the end of the match.
Simple assertions
@@ -7399,19 +7418,23 @@ BACKTRACKING CONTROL
Note that (*COMMIT) at the start of a pattern is not the same as an
anchor, unless PCRE's start-of-match optimizations are turned off, as
- shown in this pcretest example:
+ shown in this output from pcretest:
re> /(*COMMIT)abc/
data> xyzabc
0: abc
- xyzabc\Y
+ data> xyzabc\Y
No match
- PCRE knows that any match must start with "a", so the optimization
- skips along the subject to "a" before running the first match attempt,
- which succeeds. When the optimization is disabled by the \Y escape in
- the second subject, the match starts at "x" and so the (*COMMIT) causes
- it to fail without trying any other starting points.
+ For this pattern, PCRE knows that any match must start with "a", so the
+ optimization skips along the subject to "a" before applying the pattern
+ to the first set of data. The match attempt then succeeds. In the sec-
+ ond set of data, the escape sequence \Y is interpreted by the pcretest
+ program. It causes the PCRE_NO_START_OPTIMIZE option to be set when
+ pcre_exec() is called. This disables the optimization that skips along
+ to the first character. The pattern is now applied starting at "x", and
+ so the (*COMMIT) causes the match to fail without trying any other
+ starting points.
(*PRUNE) or (*PRUNE:NAME)
@@ -7618,8 +7641,8 @@ AUTHOR
REVISION
- Last updated: 03 December 2013
- Copyright (c) 1997-2013 University of Cambridge.
+ Last updated: 08 January 2014
+ Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
@@ -7840,6 +7863,8 @@ MATCH POINT RESET
\K reset start of match
+ \K is honoured in positive assertions, but ignored in negative ones.
+
ALTERNATION
@@ -7877,11 +7902,13 @@ OPTION SETTING
(?x) extended (ignore white space)
(?-...) unset option(s)
- The following are recognized only at the start of a pattern or after
- one of the newline-setting options with similar syntax:
+ The following are recognized only at the very start of a pattern or
+ after one of the newline or \R options with similar syntax. More than
+ one of them may appear.
(*LIMIT_MATCH=d) set the match limit to d (decimal number)
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
+ (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
(*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
(*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
@@ -7889,10 +7916,31 @@ OPTION SETTING
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE_UCP (use Unicode properties for \d etc)
- Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
+ Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of
the limits set by the caller of pcre_exec(), not increase them.
+NEWLINE CONVENTION
+
+ These are recognized only at the very start of the pattern or after
+ option settings with a similar syntax.
+
+ (*CR) carriage return only
+ (*LF) linefeed only
+ (*CRLF) carriage return followed by linefeed
+ (*ANYCRLF) all three of the above
+ (*ANY) any Unicode newline sequence
+
+
+WHAT \R MATCHES
+
+ These are recognized only at the very start of the pattern or after
+ option setting with a similar syntax.
+
+ (*BSR_ANYCRLF) CR, LF, or CRLF
+ (*BSR_UNICODE) any Unicode newline sequence
+
+
LOOKAHEAD AND LOOKBEHIND ASSERTIONS
(?=...) positive look ahead
@@ -7960,7 +8008,7 @@ BACKTRACKING CONTROL
(*FAIL) force backtrack; synonym (*F)
(*MARK:NAME) set name to be passed back; synonym (*:NAME)
- The following act only when a subsequent match failure causes a back-
+ The following act only when a subsequent match failure causes a back-
track to reach them. They all force a match failure, but they differ in
what happens afterwards. Those that advance the start-of-match point do
so only if the pattern is not anchored.
@@ -7975,27 +8023,6 @@ BACKTRACKING CONTROL
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
-NEWLINE CONVENTIONS
-
- These are recognized only at the very start of the pattern or after a
- (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
-
- (*CR) carriage return only
- (*LF) linefeed only
- (*CRLF) carriage return followed by linefeed
- (*ANYCRLF) all three of the above
- (*ANY) any Unicode newline sequence
-
-
-WHAT \R MATCHES
-
- These are recognized only at the very start of the pattern or after a
- (*...) option that sets the newline convention or a UTF or UCP mode.
-
- (*BSR_ANYCRLF) CR, LF, or CRLF
- (*BSR_UNICODE) any Unicode newline sequence
-
-
CALLOUTS
(?C) callout
@@ -8016,8 +8043,8 @@ AUTHOR
REVISION
- Last updated: 12 November 2013
- Copyright (c) 1997-2013 University of Cambridge.
+ Last updated: 08 January 2014
+ Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/pcre/doc/pcreapi.3 b/pcre/doc/pcreapi.3
index ebbd20fc4d5..ab3eaa0b521 100644
--- a/pcre/doc/pcreapi.3
+++ b/pcre/doc/pcreapi.3
@@ -1,4 +1,4 @@
-.TH PCREAPI 3 "12 November 2013" "PCRE 8.34"
+.TH PCREAPI 3 "09 February 2014" "PCRE 8.35"
.SH NAME
PCRE - Perl-compatible regular expressions
.sp
@@ -116,6 +116,8 @@ PCRE - Perl-compatible regular expressions
.B void (*pcre_stack_free)(void *);
.sp
.B int (*pcre_callout)(pcre_callout_block *);
+.sp
+.B int (*pcre_stack_guard)(void);
.fi
.
.
@@ -286,6 +288,14 @@ points during a matching operation. Details are given in the
\fBpcrecallout\fP
.\"
documentation.
+.P
+The global variable \fBpcre_stack_guard\fP initially contains NULL. It can be
+set by the caller to a function that is called by PCRE whenever it starts
+to compile a parenthesized part of a pattern. When parentheses are nested, PCRE
+uses recursive function calls, which use up the system stack. This function is
+provided so that applications with restricted stacks can force a compilation
+error if the stack runs out. The function should return zero if all is well, or
+non-zero to force an error.
.
.
.\" HTML <a name="newlines"></a>
@@ -337,7 +347,8 @@ controlled in a similar way, but by separate options.
The PCRE functions can be used in multi-threading applications, with the
proviso that the memory management functions pointed to by \fBpcre_malloc\fP,
\fBpcre_free\fP, \fBpcre_stack_malloc\fP, and \fBpcre_stack_free\fP, and the
-callout function pointed to by \fBpcre_callout\fP, are shared by all threads.
+callout and stack-checking functions pointed to by \fBpcre_callout\fP and
+\fBpcre_stack_guard\fP, are shared by all threads.
.P
The compiled form of a regular expression is not altered during matching, so
the same compiled pattern can safely be used by several threads at once.
@@ -465,7 +476,10 @@ documentation.
The output is a long integer that gives the maximum depth of nesting of
parentheses (of any kind) in a pattern. This limit is imposed to cap the amount
of system stack used when a pattern is compiled. It is specified when PCRE is
-built; the default is 250.
+built; the default is 250. This limit does not take into account the stack that
+may already be used by the calling application. For finer control over
+compilation stack usage, you can set a pointer to an external checking function
+in \fBpcre_stack_guard\fP.
.sp
PCRE_CONFIG_MATCH_LIMIT
.sp
@@ -991,6 +1005,8 @@ have fallen out of use. To avoid confusion, they have not been re-used.
81 missing opening brace after \eo
82 parentheses are too deeply nested
83 invalid range in character class
+ 84 group name must start with a non-digit
+ 85 parentheses are too deeply nested (stack check)
.sp
The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may
be used if the limits were changed when PCRE was built.
@@ -1248,12 +1264,15 @@ information call is provided for internal use by the \fBpcre_study()\fP
function. External callers can cause PCRE to use its internal tables by passing
a NULL table pointer.
.sp
- PCRE_INFO_FIRSTBYTE
+ PCRE_INFO_FIRSTBYTE (deprecated)
.sp
Return information about the first data unit of any matched string, for a
-non-anchored pattern. (The name of this option refers to the 8-bit library,
-where data units are bytes.) The fourth argument should point to an \fBint\fP
-variable.
+non-anchored pattern. The name of this option refers to the 8-bit library,
+where data units are bytes. The fourth argument should point to an \fBint\fP
+variable. Negative values are used for special cases. However, this means that
+when the 32-bit library is in non-UTF-32 mode, the full 32-bit range of
+characters cannot be returned. For this reason, this value is deprecated; use
+PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER instead.
.P
If there is a fixed first value, for example, the letter "c" from a pattern
such as (cat|cow|coyote), its value is returned. In the 8-bit library, the
@@ -1271,11 +1290,38 @@ starts with "^", or
-1 is returned, indicating that the pattern matches only at the start of a
subject string or after any newline within the string. Otherwise -2 is
returned. For anchored patterns, -2 is returned.
+.sp
+ PCRE_INFO_FIRSTCHARACTER
+.sp
+Return the value of the first data unit (non-UTF character) of any matched
+string in the situation where PCRE_INFO_FIRSTCHARACTERFLAGS returns 1;
+otherwise return 0. The fourth argument should point to an \fBuint_t\fP
+variable.
.P
-Since for the 32-bit library using the non-UTF-32 mode, this function is unable
-to return the full 32-bit range of the character, this value is deprecated;
-instead the PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER values
-should be used.
+In the 8-bit library, the value is always less than 256. In the 16-bit library
+the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value
+can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 mode.
+.sp
+ PCRE_INFO_FIRSTCHARACTERFLAGS
+.sp
+Return information about the first data unit of any matched string, for a
+non-anchored pattern. The fourth argument should point to an \fBint\fP
+variable.
+.P
+If there is a fixed first value, for example, the letter "c" from a pattern
+such as (cat|cow|coyote), 1 is returned, and the character value can be
+retrieved using PCRE_INFO_FIRSTCHARACTER. If there is no fixed first value, and
+if either
+.sp
+(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
+starts with "^", or
+.sp
+(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
+(if it were set, the pattern would be anchored),
+.sp
+2 is returned, indicating that the pattern matches only at the start of a
+subject string or after any newline within the string. Otherwise 0 is
+returned. For anchored patterns, 0 is returned.
.sp
PCRE_INFO_FIRSTTABLE
.sp
@@ -1499,38 +1545,6 @@ is made available via this option so that it can be saved and restored (see the
.\"
documentation for details).
.sp
- PCRE_INFO_FIRSTCHARACTERFLAGS
-.sp
-Return information about the first data unit of any matched string, for a
-non-anchored pattern. The fourth argument should point to an \fBint\fP
-variable.
-.P
-If there is a fixed first value, for example, the letter "c" from a pattern
-such as (cat|cow|coyote), 1 is returned, and the character value can be
-retrieved using PCRE_INFO_FIRSTCHARACTER.
-.P
-If there is no fixed first value, and if either
-.sp
-(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
-starts with "^", or
-.sp
-(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
-(if it were set, the pattern would be anchored),
-.sp
-2 is returned, indicating that the pattern matches only at the start of a
-subject string or after any newline within the string. Otherwise 0 is
-returned. For anchored patterns, 0 is returned.
-.sp
- PCRE_INFO_FIRSTCHARACTER
-.sp
-Return the fixed first character value in the situation where
-PCRE_INFO_FIRSTCHARACTERFLAGS returns 1; otherwise return 0. The fourth
-argument should point to an \fBuint_t\fP variable.
-.P
-In the 8-bit library, the value is always less than 256. In the 16-bit library
-the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value
-can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 mode.
-.sp
PCRE_INFO_REQUIREDCHARFLAGS
.sp
Returns 1 if there is a rightmost literal data unit that must exist in any
@@ -2900,6 +2914,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 12 November 2013
-Copyright (c) 1997-2013 University of Cambridge.
+Last updated: 09 February 2014
+Copyright (c) 1997-2014 University of Cambridge.
.fi
diff --git a/pcre/doc/pcregrep.1 b/pcre/doc/pcregrep.1
index 7fa5b65e030..988667542f2 100644
--- a/pcre/doc/pcregrep.1
+++ b/pcre/doc/pcregrep.1
@@ -1,4 +1,4 @@
-.TH PCREGREP 1 "13 September 2012" "PCRE 8.32"
+.TH PCREGREP 1 "03 April 2014" "PCRE 8.35"
.SH NAME
pcregrep - a grep with Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -11,9 +11,13 @@ pcregrep - a grep with Perl-compatible regular expressions.
grep commands do, but it uses the PCRE regular expression library to support
patterns that are compatible with the regular expressions of Perl 5. See
.\" HREF
+\fBpcresyntax\fP(3)
+.\"
+for a quick-reference summary of pattern syntax, or
+.\" HREF
\fBpcrepattern\fP(3)
.\"
-for a full description of syntax and semantics of the regular expressions
+for a full description of the syntax and semantics of the regular expressions
that PCRE supports.
.P
Patterns, whether supplied on the command line or in a separate file, are given
@@ -674,6 +678,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 13 September 2012
-Copyright (c) 1997-2012 University of Cambridge.
+Last updated: 03 April 2014
+Copyright (c) 1997-2014 University of Cambridge.
.fi
diff --git a/pcre/doc/pcregrep.txt b/pcre/doc/pcregrep.txt
index b31eb77c304..97d9a7bd379 100644
--- a/pcre/doc/pcregrep.txt
+++ b/pcre/doc/pcregrep.txt
@@ -14,394 +14,395 @@ DESCRIPTION
pcregrep searches files for character patterns, in the same way as
other grep commands do, but it uses the PCRE regular expression library
to support patterns that are compatible with the regular expressions of
- Perl 5. See pcrepattern(3) for a full description of syntax and seman-
+ Perl 5. See pcresyntax(3) for a quick-reference summary of pattern syn-
+ tax, or pcrepattern(3) for a full description of the syntax and seman-
tics of the regular expressions that PCRE supports.
- Patterns, whether supplied on the command line or in a separate file,
+ Patterns, whether supplied on the command line or in a separate file,
are given without delimiters. For example:
pcregrep Thursday /etc/motd
If you attempt to use delimiters (for example, by surrounding a pattern
- with slashes, as is common in Perl scripts), they are interpreted as
- part of the pattern. Quotes can of course be used to delimit patterns
- on the command line because they are interpreted by the shell, and
- indeed quotes are required if a pattern contains white space or shell
+ with slashes, as is common in Perl scripts), they are interpreted as
+ part of the pattern. Quotes can of course be used to delimit patterns
+ on the command line because they are interpreted by the shell, and
+ indeed quotes are required if a pattern contains white space or shell
metacharacters.
- The first argument that follows any option settings is treated as the
- single pattern to be matched when neither -e nor -f is present. Con-
- versely, when one or both of these options are used to specify pat-
+ The first argument that follows any option settings is treated as the
+ single pattern to be matched when neither -e nor -f is present. Con-
+ versely, when one or both of these options are used to specify pat-
terns, all arguments are treated as path names. At least one of -e, -f,
or an argument pattern must be provided.
If no files are specified, pcregrep reads the standard input. The stan-
- dard input can also be referenced by a name consisting of a single
+ dard input can also be referenced by a name consisting of a single
hyphen. For example:
pcregrep some-pattern /file1 - /file3
- By default, each line that matches a pattern is copied to the standard
- output, and if there is more than one file, the file name is output at
+ By default, each line that matches a pattern is copied to the standard
+ output, and if there is more than one file, the file name is output at
the start of each line, followed by a colon. However, there are options
- that can change how pcregrep behaves. In particular, the -M option
- makes it possible to search for patterns that span line boundaries.
- What defines a line boundary is controlled by the -N (--newline)
+ that can change how pcregrep behaves. In particular, the -M option
+ makes it possible to search for patterns that span line boundaries.
+ What defines a line boundary is controlled by the -N (--newline)
option.
The amount of memory used for buffering files that are being scanned is
- controlled by a parameter that can be set by the --buffer-size option.
- The default value for this parameter is specified when pcregrep is
- built, with the default default being 20K. A block of memory three
- times this size is used (to allow for buffering "before" and "after"
+ controlled by a parameter that can be set by the --buffer-size option.
+ The default value for this parameter is specified when pcregrep is
+ built, with the default default being 20K. A block of memory three
+ times this size is used (to allow for buffering "before" and "after"
lines). An error occurs if a line overflows the buffer.
- Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the
- greater. BUFSIZ is defined in <stdio.h>. When there is more than one
+ Patterns can be no longer than 8K or BUFSIZ bytes, whichever is the
+ greater. BUFSIZ is defined in <stdio.h>. When there is more than one
pattern (specified by the use of -e and/or -f), each pattern is applied
- to each line in the order in which they are defined, except that all
+ to each line in the order in which they are defined, except that all
the -e patterns are tried before the -f patterns.
- By default, as soon as one pattern matches a line, no further patterns
+ By default, as soon as one pattern matches a line, no further patterns
are considered. However, if --colour (or --color) is used to colour the
- matching substrings, or if --only-matching, --file-offsets, or --line-
- offsets is used to output only the part of the line that matched
+ matching substrings, or if --only-matching, --file-offsets, or --line-
+ offsets is used to output only the part of the line that matched
(either shown literally, or as an offset), scanning resumes immediately
- following the match, so that further matches on the same line can be
- found. If there are multiple patterns, they are all tried on the
- remainder of the line, but patterns that follow the one that matched
+ following the match, so that further matches on the same line can be
+ found. If there are multiple patterns, they are all tried on the
+ remainder of the line, but patterns that follow the one that matched
are not tried on the earlier part of the line.
- This behaviour means that the order in which multiple patterns are
- specified can affect the output when one of the above options is used.
- This is no longer the same behaviour as GNU grep, which now manages to
- display earlier matches for later patterns (as long as there is no
+ This behaviour means that the order in which multiple patterns are
+ specified can affect the output when one of the above options is used.
+ This is no longer the same behaviour as GNU grep, which now manages to
+ display earlier matches for later patterns (as long as there is no
overlap).
- Patterns that can match an empty string are accepted, but empty string
+ Patterns that can match an empty string are accepted, but empty string
matches are never recognized. An example is the pattern
- "(super)?(man)?", in which all components are optional. This pattern
- finds all occurrences of both "super" and "man"; the output differs
- from matching with "super|man" when only the matching substrings are
+ "(super)?(man)?", in which all components are optional. This pattern
+ finds all occurrences of both "super" and "man"; the output differs
+ from matching with "super|man" when only the matching substrings are
being shown.
- If the LC_ALL or LC_CTYPE environment variable is set, pcregrep uses
- the value to set a locale when calling the PCRE library. The --locale
+ If the LC_ALL or LC_CTYPE environment variable is set, pcregrep uses
+ the value to set a locale when calling the PCRE library. The --locale
option can be used to override this.
SUPPORT FOR COMPRESSED FILES
- It is possible to compile pcregrep so that it uses libz or libbz2 to
- read files whose names end in .gz or .bz2, respectively. You can find
+ It is possible to compile pcregrep so that it uses libz or libbz2 to
+ read files whose names end in .gz or .bz2, respectively. You can find
out whether your binary has support for one or both of these file types
by running it with the --help option. If the appropriate support is not
- present, files are treated as plain text. The standard input is always
+ present, files are treated as plain text. The standard input is always
so treated.
BINARY FILES
- By default, a file that contains a binary zero byte within the first
- 1024 bytes is identified as a binary file, and is processed specially.
- (GNU grep also identifies binary files in this manner.) See the
- --binary-files option for a means of changing the way binary files are
+ By default, a file that contains a binary zero byte within the first
+ 1024 bytes is identified as a binary file, and is processed specially.
+ (GNU grep also identifies binary files in this manner.) See the
+ --binary-files option for a means of changing the way binary files are
handled.
OPTIONS
- The order in which some of the options appear can affect the output.
- For example, both the -h and -l options affect the printing of file
- names. Whichever comes later in the command line will be the one that
- takes effect. Similarly, except where noted below, if an option is
- given twice, the later setting is used. Numerical values for options
- may be followed by K or M, to signify multiplication by 1024 or
+ The order in which some of the options appear can affect the output.
+ For example, both the -h and -l options affect the printing of file
+ names. Whichever comes later in the command line will be the one that
+ takes effect. Similarly, except where noted below, if an option is
+ given twice, the later setting is used. Numerical values for options
+ may be followed by K or M, to signify multiplication by 1024 or
1024*1024 respectively.
-- This terminates the list of options. It is useful if the next
- item on the command line starts with a hyphen but is not an
- option. This allows for the processing of patterns and file-
+ item on the command line starts with a hyphen but is not an
+ option. This allows for the processing of patterns and file-
names that start with hyphens.
-A number, --after-context=number
- Output number lines of context after each matching line. If
+ Output number lines of context after each matching line. If
filenames and/or line numbers are being output, a hyphen sep-
- arator is used instead of a colon for the context lines. A
- line containing "--" is output between each group of lines,
- unless they are in fact contiguous in the input file. The
- value of number is expected to be relatively small. However,
+ arator is used instead of a colon for the context lines. A
+ line containing "--" is output between each group of lines,
+ unless they are in fact contiguous in the input file. The
+ value of number is expected to be relatively small. However,
pcregrep guarantees to have up to 8K of following text avail-
able for context output.
-a, --text
- Treat binary files as text. This is equivalent to --binary-
+ Treat binary files as text. This is equivalent to --binary-
files=text.
-B number, --before-context=number
- Output number lines of context before each matching line. If
+ Output number lines of context before each matching line. If
filenames and/or line numbers are being output, a hyphen sep-
- arator is used instead of a colon for the context lines. A
- line containing "--" is output between each group of lines,
- unless they are in fact contiguous in the input file. The
- value of number is expected to be relatively small. However,
+ arator is used instead of a colon for the context lines. A
+ line containing "--" is output between each group of lines,
+ unless they are in fact contiguous in the input file. The
+ value of number is expected to be relatively small. However,
pcregrep guarantees to have up to 8K of preceding text avail-
able for context output.
--binary-files=word
- Specify how binary files are to be processed. If the word is
- "binary" (the default), pattern matching is performed on
- binary files, but the only output is "Binary file <name>
- matches" when a match succeeds. If the word is "text", which
- is equivalent to the -a or --text option, binary files are
- processed in the same way as any other file. In this case,
- when a match succeeds, the output may be binary garbage,
- which can have nasty effects if sent to a terminal. If the
- word is "without-match", which is equivalent to the -I
- option, binary files are not processed at all; they are
+ Specify how binary files are to be processed. If the word is
+ "binary" (the default), pattern matching is performed on
+ binary files, but the only output is "Binary file <name>
+ matches" when a match succeeds. If the word is "text", which
+ is equivalent to the -a or --text option, binary files are
+ processed in the same way as any other file. In this case,
+ when a match succeeds, the output may be binary garbage,
+ which can have nasty effects if sent to a terminal. If the
+ word is "without-match", which is equivalent to the -I
+ option, binary files are not processed at all; they are
assumed not to be of interest.
--buffer-size=number
- Set the parameter that controls how much memory is used for
+ Set the parameter that controls how much memory is used for
buffering files that are being scanned.
-C number, --context=number
- Output number lines of context both before and after each
- matching line. This is equivalent to setting both -A and -B
+ Output number lines of context both before and after each
+ matching line. This is equivalent to setting both -A and -B
to the same value.
-c, --count
- Do not output individual lines from the files that are being
+ Do not output individual lines from the files that are being
scanned; instead output the number of lines that would other-
- wise have been shown. If no lines are selected, the number
- zero is output. If several files are are being scanned, a
- count is output for each of them. However, if the --files-
- with-matches option is also used, only those files whose
+ wise have been shown. If no lines are selected, the number
+ zero is output. If several files are are being scanned, a
+ count is output for each of them. However, if the --files-
+ with-matches option is also used, only those files whose
counts are greater than zero are listed. When -c is used, the
-A, -B, and -C options are ignored.
--colour, --color
If this option is given without any data, it is equivalent to
- "--colour=auto". If data is required, it must be given in
+ "--colour=auto". If data is required, it must be given in
the same shell item, separated by an equals sign.
--colour=value, --color=value
This option specifies under what circumstances the parts of a
line that matched a pattern should be coloured in the output.
- By default, the output is not coloured. The value (which is
- optional, see above) may be "never", "always", or "auto". In
- the latter case, colouring happens only if the standard out-
- put is connected to a terminal. More resources are used when
- colouring is enabled, because pcregrep has to search for all
- possible matches in a line, not just one, in order to colour
+ By default, the output is not coloured. The value (which is
+ optional, see above) may be "never", "always", or "auto". In
+ the latter case, colouring happens only if the standard out-
+ put is connected to a terminal. More resources are used when
+ colouring is enabled, because pcregrep has to search for all
+ possible matches in a line, not just one, in order to colour
them all.
The colour that is used can be specified by setting the envi-
ronment variable PCREGREP_COLOUR or PCREGREP_COLOR. The value
of this variable should be a string of two numbers, separated
- by a semicolon. They are copied directly into the control
- string for setting colour on a terminal, so it is your
- responsibility to ensure that they make sense. If neither of
- the environment variables is set, the default is "1;31",
+ by a semicolon. They are copied directly into the control
+ string for setting colour on a terminal, so it is your
+ responsibility to ensure that they make sense. If neither of
+ the environment variables is set, the default is "1;31",
which gives red.
-D action, --devices=action
- If an input path is not a regular file or a directory,
- "action" specifies how it is to be processed. Valid values
+ If an input path is not a regular file or a directory,
+ "action" specifies how it is to be processed. Valid values
are "read" (the default) or "skip" (silently skip the path).
-d action, --directories=action
If an input path is a directory, "action" specifies how it is
- to be processed. Valid values are "read" (the default in
- non-Windows environments, for compatibility with GNU grep),
- "recurse" (equivalent to the -r option), or "skip" (silently
- skip the path, the default in Windows environments). In the
- "read" case, directories are read as if they were ordinary
- files. In some operating systems the effect of reading a
+ to be processed. Valid values are "read" (the default in
+ non-Windows environments, for compatibility with GNU grep),
+ "recurse" (equivalent to the -r option), or "skip" (silently
+ skip the path, the default in Windows environments). In the
+ "read" case, directories are read as if they were ordinary
+ files. In some operating systems the effect of reading a
directory like this is an immediate end-of-file; in others it
may provoke an error.
-e pattern, --regex=pattern, --regexp=pattern
Specify a pattern to be matched. This option can be used mul-
tiple times in order to specify several patterns. It can also
- be used as a way of specifying a single pattern that starts
- with a hyphen. When -e is used, no argument pattern is taken
- from the command line; all arguments are treated as file
- names. There is no limit to the number of patterns. They are
- applied to each line in the order in which they are defined
+ be used as a way of specifying a single pattern that starts
+ with a hyphen. When -e is used, no argument pattern is taken
+ from the command line; all arguments are treated as file
+ names. There is no limit to the number of patterns. They are
+ applied to each line in the order in which they are defined
until one matches.
- If -f is used with -e, the command line patterns are matched
+ If -f is used with -e, the command line patterns are matched
first, followed by the patterns from the file(s), independent
- of the order in which these options are specified. Note that
- multiple use of -e is not the same as a single pattern with
+ of the order in which these options are specified. Note that
+ multiple use of -e is not the same as a single pattern with
alternatives. For example, X|Y finds the first character in a
- line that is X or Y, whereas if the two patterns are given
- separately, with X first, pcregrep finds X if it is present,
+ line that is X or Y, whereas if the two patterns are given
+ separately, with X first, pcregrep finds X if it is present,
even if it follows Y in the line. It finds Y only if there is
- no X in the line. This matters only if you are using -o or
+ no X in the line. This matters only if you are using -o or
--colo(u)r to show the part(s) of the line that matched.
--exclude=pattern
Files (but not directories) whose names match the pattern are
- skipped without being processed. This applies to all files,
- whether listed on the command line, obtained from --file-
+ skipped without being processed. This applies to all files,
+ whether listed on the command line, obtained from --file-
list, or by scanning a directory. The pattern is a PCRE regu-
lar expression, and is matched against the final component of
- the file name, not the entire path. The -F, -w, and -x
+ the file name, not the entire path. The -F, -w, and -x
options do not apply to this pattern. The option may be given
any number of times in order to specify multiple patterns. If
- a file name matches both an --include and an --exclude pat-
+ a file name matches both an --include and an --exclude pat-
tern, it is excluded. There is no short form for this option.
--exclude-from=filename
- Treat each non-empty line of the file as the data for an
+ Treat each non-empty line of the file as the data for an
--exclude option. What constitutes a newline when reading the
- file is the operating system's default. The --newline option
- has no effect on this option. This option may be given more
+ file is the operating system's default. The --newline option
+ has no effect on this option. This option may be given more
than once in order to specify a number of files to read.
--exclude-dir=pattern
Directories whose names match the pattern are skipped without
- being processed, whatever the setting of the --recursive
- option. This applies to all directories, whether listed on
+ being processed, whatever the setting of the --recursive
+ option. This applies to all directories, whether listed on
the command line, obtained from --file-list, or by scanning a
- parent directory. The pattern is a PCRE regular expression,
- and is matched against the final component of the directory
- name, not the entire path. The -F, -w, and -x options do not
- apply to this pattern. The option may be given any number of
- times in order to specify more than one pattern. If a direc-
- tory matches both --include-dir and --exclude-dir, it is
+ parent directory. The pattern is a PCRE regular expression,
+ and is matched against the final component of the directory
+ name, not the entire path. The -F, -w, and -x options do not
+ apply to this pattern. The option may be given any number of
+ times in order to specify more than one pattern. If a direc-
+ tory matches both --include-dir and --exclude-dir, it is
excluded. There is no short form for this option.
-F, --fixed-strings
- Interpret each data-matching pattern as a list of fixed
- strings, separated by newlines, instead of as a regular
- expression. What constitutes a newline for this purpose is
- controlled by the --newline option. The -w (match as a word)
- and -x (match whole line) options can be used with -F. They
+ Interpret each data-matching pattern as a list of fixed
+ strings, separated by newlines, instead of as a regular
+ expression. What constitutes a newline for this purpose is
+ controlled by the --newline option. The -w (match as a word)
+ and -x (match whole line) options can be used with -F. They
apply to each of the fixed strings. A line is selected if any
of the fixed strings are found in it (subject to -w or -x, if
- present). This option applies only to the patterns that are
- matched against the contents of files; it does not apply to
- patterns specified by any of the --include or --exclude
+ present). This option applies only to the patterns that are
+ matched against the contents of files; it does not apply to
+ patterns specified by any of the --include or --exclude
options.
-f filename, --file=filename
- Read patterns from the file, one per line, and match them
- against each line of input. What constitutes a newline when
- reading the file is the operating system's default. The
+ Read patterns from the file, one per line, and match them
+ against each line of input. What constitutes a newline when
+ reading the file is the operating system's default. The
--newline option has no effect on this option. Trailing white
space is removed from each line, and blank lines are ignored.
- An empty file contains no patterns and therefore matches
+ An empty file contains no patterns and therefore matches
nothing. See also the comments about multiple patterns versus
- a single pattern with alternatives in the description of -e
+ a single pattern with alternatives in the description of -e
above.
- If this option is given more than once, all the specified
- files are read. A data line is output if any of the patterns
- match it. A filename can be given as "-" to refer to the
- standard input. When -f is used, patterns specified on the
- command line using -e may also be present; they are tested
- before the file's patterns. However, no other pattern is
+ If this option is given more than once, all the specified
+ files are read. A data line is output if any of the patterns
+ match it. A filename can be given as "-" to refer to the
+ standard input. When -f is used, patterns specified on the
+ command line using -e may also be present; they are tested
+ before the file's patterns. However, no other pattern is
taken from the command line; all arguments are treated as the
names of paths to be searched.
--file-list=filename
- Read a list of files and/or directories that are to be
- scanned from the given file, one per line. Trailing white
+ Read a list of files and/or directories that are to be
+ scanned from the given file, one per line. Trailing white
space is removed from each line, and blank lines are ignored.
- These paths are processed before any that are listed on the
- command line. The filename can be given as "-" to refer to
+ These paths are processed before any that are listed on the
+ command line. The filename can be given as "-" to refer to
the standard input. If --file and --file-list are both spec-
- ified as "-", patterns are read first. This is useful only
- when the standard input is a terminal, from which further
- lines (the list of files) can be read after an end-of-file
- indication. If this option is given more than once, all the
+ ified as "-", patterns are read first. This is useful only
+ when the standard input is a terminal, from which further
+ lines (the list of files) can be read after an end-of-file
+ indication. If this option is given more than once, all the
specified files are read.
--file-offsets
- Instead of showing lines or parts of lines that match, show
- each match as an offset from the start of the file and a
- length, separated by a comma. In this mode, no context is
- shown. That is, the -A, -B, and -C options are ignored. If
+ Instead of showing lines or parts of lines that match, show
+ each match as an offset from the start of the file and a
+ length, separated by a comma. In this mode, no context is
+ shown. That is, the -A, -B, and -C options are ignored. If
there is more than one match in a line, each of them is shown
- separately. This option is mutually exclusive with --line-
+ separately. This option is mutually exclusive with --line-
offsets and --only-matching.
-H, --with-filename
- Force the inclusion of the filename at the start of output
- lines when searching a single file. By default, the filename
- is not shown in this case. For matching lines, the filename
+ Force the inclusion of the filename at the start of output
+ lines when searching a single file. By default, the filename
+ is not shown in this case. For matching lines, the filename
is followed by a colon; for context lines, a hyphen separator
- is used. If a line number is also being output, it follows
+ is used. If a line number is also being output, it follows
the file name.
-h, --no-filename
- Suppress the output filenames when searching multiple files.
- By default, filenames are shown when multiple files are
- searched. For matching lines, the filename is followed by a
- colon; for context lines, a hyphen separator is used. If a
+ Suppress the output filenames when searching multiple files.
+ By default, filenames are shown when multiple files are
+ searched. For matching lines, the filename is followed by a
+ colon; for context lines, a hyphen separator is used. If a
line number is also being output, it follows the file name.
- --help Output a help message, giving brief details of the command
- options and file type support, and then exit. Anything else
+ --help Output a help message, giving brief details of the command
+ options and file type support, and then exit. Anything else
on the command line is ignored.
- -I Treat binary files as never matching. This is equivalent to
+ -I Treat binary files as never matching. This is equivalent to
--binary-files=without-match.
-i, --ignore-case
Ignore upper/lower case distinctions during comparisons.
--include=pattern
- If any --include patterns are specified, the only files that
- are processed are those that match one of the patterns (and
- do not match an --exclude pattern). This option does not
- affect directories, but it applies to all files, whether
- listed on the command line, obtained from --file-list, or by
- scanning a directory. The pattern is a PCRE regular expres-
- sion, and is matched against the final component of the file
- name, not the entire path. The -F, -w, and -x options do not
- apply to this pattern. The option may be given any number of
- times. If a file name matches both an --include and an
- --exclude pattern, it is excluded. There is no short form
+ If any --include patterns are specified, the only files that
+ are processed are those that match one of the patterns (and
+ do not match an --exclude pattern). This option does not
+ affect directories, but it applies to all files, whether
+ listed on the command line, obtained from --file-list, or by
+ scanning a directory. The pattern is a PCRE regular expres-
+ sion, and is matched against the final component of the file
+ name, not the entire path. The -F, -w, and -x options do not
+ apply to this pattern. The option may be given any number of
+ times. If a file name matches both an --include and an
+ --exclude pattern, it is excluded. There is no short form
for this option.
--include-from=filename
- Treat each non-empty line of the file as the data for an
+ Treat each non-empty line of the file as the data for an
--include option. What constitutes a newline for this purpose
- is the operating system's default. The --newline option has
+ is the operating system's default. The --newline option has
no effect on this option. This option may be given any number
of times; all the files are read.
--include-dir=pattern
- If any --include-dir patterns are specified, the only direc-
- tories that are processed are those that match one of the
- patterns (and do not match an --exclude-dir pattern). This
- applies to all directories, whether listed on the command
- line, obtained from --file-list, or by scanning a parent
- directory. The pattern is a PCRE regular expression, and is
- matched against the final component of the directory name,
- not the entire path. The -F, -w, and -x options do not apply
+ If any --include-dir patterns are specified, the only direc-
+ tories that are processed are those that match one of the
+ patterns (and do not match an --exclude-dir pattern). This
+ applies to all directories, whether listed on the command
+ line, obtained from --file-list, or by scanning a parent
+ directory. The pattern is a PCRE regular expression, and is
+ matched against the final component of the directory name,
+ not the entire path. The -F, -w, and -x options do not apply
to this pattern. The option may be given any number of times.
- If a directory matches both --include-dir and --exclude-dir,
+ If a directory matches both --include-dir and --exclude-dir,
it is excluded. There is no short form for this option.
-L, --files-without-match
- Instead of outputting lines from the files, just output the
- names of the files that do not contain any lines that would
- have been output. Each file name is output once, on a sepa-
+ Instead of outputting lines from the files, just output the
+ names of the files that do not contain any lines that would
+ have been output. Each file name is output once, on a sepa-
rate line.
-l, --files-with-matches
- Instead of outputting lines from the files, just output the
+ Instead of outputting lines from the files, just output the
names of the files containing lines that would have been out-
- put. Each file name is output once, on a separate line.
- Searching normally stops as soon as a matching line is found
- in a file. However, if the -c (count) option is also used,
- matching continues in order to obtain the correct count, and
- those files that have at least one match are listed along
+ put. Each file name is output once, on a separate line.
+ Searching normally stops as soon as a matching line is found
+ in a file. However, if the -c (count) option is also used,
+ matching continues in order to obtain the correct count, and
+ those files that have at least one match are listed along
with their counts. Using this option with -c is a way of sup-
pressing the listing of files with no matches.
@@ -411,313 +412,313 @@ OPTIONS
input)" is used. There is no short form for this option.
--line-buffered
- When this option is given, input is read and processed line
- by line, and the output is flushed after each write. By
- default, input is read in large chunks, unless pcregrep can
- determine that it is reading from a terminal (which is cur-
- rently possible only in Unix-like environments). Output to
- terminal is normally automatically flushed by the operating
+ When this option is given, input is read and processed line
+ by line, and the output is flushed after each write. By
+ default, input is read in large chunks, unless pcregrep can
+ determine that it is reading from a terminal (which is cur-
+ rently possible only in Unix-like environments). Output to
+ terminal is normally automatically flushed by the operating
system. This option can be useful when the input or output is
- attached to a pipe and you do not want pcregrep to buffer up
- large amounts of data. However, its use will affect perfor-
+ attached to a pipe and you do not want pcregrep to buffer up
+ large amounts of data. However, its use will affect perfor-
mance, and the -M (multiline) option ceases to work.
--line-offsets
- Instead of showing lines or parts of lines that match, show
+ Instead of showing lines or parts of lines that match, show
each match as a line number, the offset from the start of the
- line, and a length. The line number is terminated by a colon
- (as usual; see the -n option), and the offset and length are
- separated by a comma. In this mode, no context is shown.
- That is, the -A, -B, and -C options are ignored. If there is
- more than one match in a line, each of them is shown sepa-
+ line, and a length. The line number is terminated by a colon
+ (as usual; see the -n option), and the offset and length are
+ separated by a comma. In this mode, no context is shown.
+ That is, the -A, -B, and -C options are ignored. If there is
+ more than one match in a line, each of them is shown sepa-
rately. This option is mutually exclusive with --file-offsets
and --only-matching.
--locale=locale-name
- This option specifies a locale to be used for pattern match-
- ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
- ronment variables. If no locale is specified, the PCRE
- library's default (usually the "C" locale) is used. There is
+ This option specifies a locale to be used for pattern match-
+ ing. It overrides the value in the LC_ALL or LC_CTYPE envi-
+ ronment variables. If no locale is specified, the PCRE
+ library's default (usually the "C" locale) is used. There is
no short form for this option.
--match-limit=number
- Processing some regular expression patterns can require a
- very large amount of memory, leading in some cases to a pro-
- gram crash if not enough is available. Other patterns may
- take a very long time to search for all possible matching
- strings. The pcre_exec() function that is called by pcregrep
- to do the matching has two parameters that can limit the
+ Processing some regular expression patterns can require a
+ very large amount of memory, leading in some cases to a pro-
+ gram crash if not enough is available. Other patterns may
+ take a very long time to search for all possible matching
+ strings. The pcre_exec() function that is called by pcregrep
+ to do the matching has two parameters that can limit the
resources that it uses.
- The --match-limit option provides a means of limiting
+ The --match-limit option provides a means of limiting
resource usage when processing patterns that are not going to
match, but which have a very large number of possibilities in
- their search trees. The classic example is a pattern that
- uses nested unlimited repeats. Internally, PCRE uses a func-
- tion called match() which it calls repeatedly (sometimes
- recursively). The limit set by --match-limit is imposed on
- the number of times this function is called during a match,
- which has the effect of limiting the amount of backtracking
+ their search trees. The classic example is a pattern that
+ uses nested unlimited repeats. Internally, PCRE uses a func-
+ tion called match() which it calls repeatedly (sometimes
+ recursively). The limit set by --match-limit is imposed on
+ the number of times this function is called during a match,
+ which has the effect of limiting the amount of backtracking
that can take place.
The --recursion-limit option is similar to --match-limit, but
instead of limiting the total number of times that match() is
called, it limits the depth of recursive calls, which in turn
- limits the amount of memory that can be used. The recursion
- depth is a smaller number than the total number of calls,
+ limits the amount of memory that can be used. The recursion
+ depth is a smaller number than the total number of calls,
because not all calls to match() are recursive. This limit is
of use only if it is set smaller than --match-limit.
- There are no short forms for these options. The default set-
- tings are specified when the PCRE library is compiled, with
+ There are no short forms for these options. The default set-
+ tings are specified when the PCRE library is compiled, with
the default default being 10 million.
-M, --multiline
- Allow patterns to match more than one line. When this option
+ Allow patterns to match more than one line. When this option
is given, patterns may usefully contain literal newline char-
- acters and internal occurrences of ^ and $ characters. The
- output for a successful match may consist of more than one
- line, the last of which is the one in which the match ended.
+ acters and internal occurrences of ^ and $ characters. The
+ output for a successful match may consist of more than one
+ line, the last of which is the one in which the match ended.
If the matched string ends with a newline sequence the output
ends at the end of that line.
- When this option is set, the PCRE library is called in "mul-
- tiline" mode. There is a limit to the number of lines that
- can be matched, imposed by the way that pcregrep buffers the
- input file as it scans it. However, pcregrep ensures that at
+ When this option is set, the PCRE library is called in "mul-
+ tiline" mode. There is a limit to the number of lines that
+ can be matched, imposed by the way that pcregrep buffers the
+ input file as it scans it. However, pcregrep ensures that at
least 8K characters or the rest of the document (whichever is
- the shorter) are available for forward matching, and simi-
+ the shorter) are available for forward matching, and simi-
larly the previous 8K characters (or all the previous charac-
- ters, if fewer than 8K) are guaranteed to be available for
- lookbehind assertions. This option does not work when input
+ ters, if fewer than 8K) are guaranteed to be available for
+ lookbehind assertions. This option does not work when input
is read line by line (see --line-buffered.)
-N newline-type, --newline=newline-type
- The PCRE library supports five different conventions for
- indicating the ends of lines. They are the single-character
- sequences CR (carriage return) and LF (linefeed), the two-
- character sequence CRLF, an "anycrlf" convention, which rec-
- ognizes any of the preceding three types, and an "any" con-
+ The PCRE library supports five different conventions for
+ indicating the ends of lines. They are the single-character
+ sequences CR (carriage return) and LF (linefeed), the two-
+ character sequence CRLF, an "anycrlf" convention, which rec-
+ ognizes any of the preceding three types, and an "any" con-
vention, in which any Unicode line ending sequence is assumed
- to end a line. The Unicode sequences are the three just men-
- tioned, plus VT (vertical tab, U+000B), FF (form feed,
- U+000C), NEL (next line, U+0085), LS (line separator,
+ to end a line. The Unicode sequences are the three just men-
+ tioned, plus VT (vertical tab, U+000B), FF (form feed,
+ U+000C), NEL (next line, U+0085), LS (line separator,
U+2028), and PS (paragraph separator, U+2029).
When the PCRE library is built, a default line-ending
- sequence is specified. This is normally the standard
+ sequence is specified. This is normally the standard
sequence for the operating system. Unless otherwise specified
- by this option, pcregrep uses the library's default. The
+ by this option, pcregrep uses the library's default. The
possible values for this option are CR, LF, CRLF, ANYCRLF, or
- ANY. This makes it possible to use pcregrep to scan files
+ ANY. This makes it possible to use pcregrep to scan files
that have come from other environments without having to mod-
- ify their line endings. If the data that is being scanned
- does not agree with the convention set by this option, pcre-
- grep may behave in strange ways. Note that this option does
- not apply to files specified by the -f, --exclude-from, or
+ ify their line endings. If the data that is being scanned
+ does not agree with the convention set by this option, pcre-
+ grep may behave in strange ways. Note that this option does
+ not apply to files specified by the -f, --exclude-from, or
--include-from options, which are expected to use the operat-
ing system's standard newline sequence.
-n, --line-number
Precede each output line by its line number in the file, fol-
- lowed by a colon for matching lines or a hyphen for context
- lines. If the filename is also being output, it precedes the
+ lowed by a colon for matching lines or a hyphen for context
+ lines. If the filename is also being output, it precedes the
line number. This option is forced if --line-offsets is used.
- --no-jit If the PCRE library is built with support for just-in-time
- compiling (which speeds up matching), pcregrep automatically
+ --no-jit If the PCRE library is built with support for just-in-time
+ compiling (which speeds up matching), pcregrep automatically
makes use of this, unless it was explicitly disabled at build
- time. This option can be used to disable the use of JIT at
- run time. It is provided for testing and working round prob-
+ time. This option can be used to disable the use of JIT at
+ run time. It is provided for testing and working round prob-
lems. It should never be needed in normal use.
-o, --only-matching
Show only the part of the line that matched a pattern instead
- of the whole line. In this mode, no context is shown. That
- is, the -A, -B, and -C options are ignored. If there is more
- than one match in a line, each of them is shown separately.
- If -o is combined with -v (invert the sense of the match to
- find non-matching lines), no output is generated, but the
- return code is set appropriately. If the matched portion of
- the line is empty, nothing is output unless the file name or
- line number are being printed, in which case they are shown
+ of the whole line. In this mode, no context is shown. That
+ is, the -A, -B, and -C options are ignored. If there is more
+ than one match in a line, each of them is shown separately.
+ If -o is combined with -v (invert the sense of the match to
+ find non-matching lines), no output is generated, but the
+ return code is set appropriately. If the matched portion of
+ the line is empty, nothing is output unless the file name or
+ line number are being printed, in which case they are shown
on an otherwise empty line. This option is mutually exclusive
with --file-offsets and --line-offsets.
-onumber, --only-matching=number
- Show only the part of the line that matched the capturing
+ Show only the part of the line that matched the capturing
parentheses of the given number. Up to 32 capturing parenthe-
ses are supported, and -o0 is equivalent to -o without a num-
- ber. Because these options can be given without an argument
- (see above), if an argument is present, it must be given in
- the same shell item, for example, -o3 or --only-matching=2.
+ ber. Because these options can be given without an argument
+ (see above), if an argument is present, it must be given in
+ the same shell item, for example, -o3 or --only-matching=2.
The comments given for the non-argument case above also apply
- to this case. If the specified capturing parentheses do not
- exist in the pattern, or were not set in the match, nothing
- is output unless the file name or line number are being
+ to this case. If the specified capturing parentheses do not
+ exist in the pattern, or were not set in the match, nothing
+ is output unless the file name or line number are being
printed.
- If this option is given multiple times, multiple substrings
- are output, in the order the options are given. For example,
+ If this option is given multiple times, multiple substrings
+ are output, in the order the options are given. For example,
-o3 -o1 -o3 causes the substrings matched by capturing paren-
- theses 3 and 1 and then 3 again to be output. By default,
+ theses 3 and 1 and then 3 again to be output. By default,
there is no separator (but see the next option).
--om-separator=text
- Specify a separating string for multiple occurrences of -o.
- The default is an empty string. Separating strings are never
+ Specify a separating string for multiple occurrences of -o.
+ The default is an empty string. Separating strings are never
coloured.
-q, --quiet
Work quietly, that is, display nothing except error messages.
- The exit status indicates whether or not any matches were
+ The exit status indicates whether or not any matches were
found.
-r, --recursive
- If any given path is a directory, recursively scan the files
- it contains, taking note of any --include and --exclude set-
- tings. By default, a directory is read as a normal file; in
- some operating systems this gives an immediate end-of-file.
- This option is a shorthand for setting the -d option to
+ If any given path is a directory, recursively scan the files
+ it contains, taking note of any --include and --exclude set-
+ tings. By default, a directory is read as a normal file; in
+ some operating systems this gives an immediate end-of-file.
+ This option is a shorthand for setting the -d option to
"recurse".
--recursion-limit=number
See --match-limit above.
-s, --no-messages
- Suppress error messages about non-existent or unreadable
- files. Such files are quietly skipped. However, the return
+ Suppress error messages about non-existent or unreadable
+ files. Such files are quietly skipped. However, the return
code is still 2, even if matches were found in other files.
-u, --utf-8
- Operate in UTF-8 mode. This option is available only if PCRE
+ Operate in UTF-8 mode. This option is available only if PCRE
has been compiled with UTF-8 support. All patterns (including
- those for any --exclude and --include options) and all sub-
- ject lines that are scanned must be valid strings of UTF-8
+ those for any --exclude and --include options) and all sub-
+ ject lines that are scanned must be valid strings of UTF-8
characters.
-V, --version
Write the version numbers of pcregrep and the PCRE library to
- the standard output and then exit. Anything else on the com-
+ the standard output and then exit. Anything else on the com-
mand line is ignored.
-v, --invert-match
- Invert the sense of the match, so that lines which do not
+ Invert the sense of the match, so that lines which do not
match any of the patterns are the ones that are found.
-w, --word-regex, --word-regexp
Force the patterns to match only whole words. This is equiva-
- lent to having \b at the start and end of the pattern. This
- option applies only to the patterns that are matched against
- the contents of files; it does not apply to patterns speci-
+ lent to having \b at the start and end of the pattern. This
+ option applies only to the patterns that are matched against
+ the contents of files; it does not apply to patterns speci-
fied by any of the --include or --exclude options.
-x, --line-regex, --line-regexp
- Force the patterns to be anchored (each must start matching
- at the beginning of a line) and in addition, require them to
- match entire lines. This is equivalent to having ^ and $
+ Force the patterns to be anchored (each must start matching
+ at the beginning of a line) and in addition, require them to
+ match entire lines. This is equivalent to having ^ and $
characters at the start and end of each alternative branch in
- every pattern. This option applies only to the patterns that
- are matched against the contents of files; it does not apply
- to patterns specified by any of the --include or --exclude
+ every pattern. This option applies only to the patterns that
+ are matched against the contents of files; it does not apply
+ to patterns specified by any of the --include or --exclude
options.
ENVIRONMENT VARIABLES
- The environment variables LC_ALL and LC_CTYPE are examined, in that
- order, for a locale. The first one that is set is used. This can be
- overridden by the --locale option. If no locale is set, the PCRE
+ The environment variables LC_ALL and LC_CTYPE are examined, in that
+ order, for a locale. The first one that is set is used. This can be
+ overridden by the --locale option. If no locale is set, the PCRE
library's default (usually the "C" locale) is used.
NEWLINES
- The -N (--newline) option allows pcregrep to scan files with different
+ The -N (--newline) option allows pcregrep to scan files with different
newline conventions from the default. Any parts of the input files that
- are written to the standard output are copied identically, with what-
- ever newline sequences they have in the input. However, the setting of
- this option does not affect the interpretation of files specified by
+ are written to the standard output are copied identically, with what-
+ ever newline sequences they have in the input. However, the setting of
+ this option does not affect the interpretation of files specified by
the -f, --exclude-from, or --include-from options, which are assumed to
- use the operating system's standard newline sequence, nor does it
- affect the way in which pcregrep writes informational messages to the
+ use the operating system's standard newline sequence, nor does it
+ affect the way in which pcregrep writes informational messages to the
standard error and output streams. For these it uses the string "\n" to
- indicate newlines, relying on the C I/O library to convert this to an
+ indicate newlines, relying on the C I/O library to convert this to an
appropriate sequence.
OPTIONS COMPATIBILITY
- Many of the short and long forms of pcregrep's options are the same as
- in the GNU grep program. Any long option of the form --xxx-regexp (GNU
- terminology) is also available as --xxx-regex (PCRE terminology). How-
- ever, the --file-list, --file-offsets, --include-dir, --line-offsets,
- --locale, --match-limit, -M, --multiline, -N, --newline, --om-separa-
- tor, --recursion-limit, -u, and --utf-8 options are specific to pcre-
- grep, as is the use of the --only-matching option with a capturing
+ Many of the short and long forms of pcregrep's options are the same as
+ in the GNU grep program. Any long option of the form --xxx-regexp (GNU
+ terminology) is also available as --xxx-regex (PCRE terminology). How-
+ ever, the --file-list, --file-offsets, --include-dir, --line-offsets,
+ --locale, --match-limit, -M, --multiline, -N, --newline, --om-separa-
+ tor, --recursion-limit, -u, and --utf-8 options are specific to pcre-
+ grep, as is the use of the --only-matching option with a capturing
parentheses number.
- Although most of the common options work the same way, a few are dif-
- ferent in pcregrep. For example, the --include option's argument is a
- glob for GNU grep, but a regular expression for pcregrep. If both the
- -c and -l options are given, GNU grep lists only file names, without
+ Although most of the common options work the same way, a few are dif-
+ ferent in pcregrep. For example, the --include option's argument is a
+ glob for GNU grep, but a regular expression for pcregrep. If both the
+ -c and -l options are given, GNU grep lists only file names, without
counts, but pcregrep gives the counts.
OPTIONS WITH DATA
There are four different ways in which an option with data can be spec-
- ified. If a short form option is used, the data may follow immedi-
+ ified. If a short form option is used, the data may follow immedi-
ately, or (with one exception) in the next command line item. For exam-
ple:
-f/some/file
-f /some/file
- The exception is the -o option, which may appear with or without data.
- Because of this, if data is present, it must follow immediately in the
+ The exception is the -o option, which may appear with or without data.
+ Because of this, if data is present, it must follow immediately in the
same item, for example -o3.
- If a long form option is used, the data may appear in the same command
- line item, separated by an equals character, or (with two exceptions)
+ If a long form option is used, the data may appear in the same command
+ line item, separated by an equals character, or (with two exceptions)
it may appear in the next command line item. For example:
--file=/some/file
--file /some/file
- Note, however, that if you want to supply a file name beginning with ~
- as data in a shell command, and have the shell expand ~ to a home
+ Note, however, that if you want to supply a file name beginning with ~
+ as data in a shell command, and have the shell expand ~ to a home
directory, you must separate the file name from the option, because the
shell does not treat ~ specially unless it is at the start of an item.
- The exceptions to the above are the --colour (or --color) and --only-
- matching options, for which the data is optional. If one of these
- options does have data, it must be given in the first form, using an
+ The exceptions to the above are the --colour (or --color) and --only-
+ matching options, for which the data is optional. If one of these
+ options does have data, it must be given in the first form, using an
equals character. Otherwise pcregrep will assume that it has no data.
MATCHING ERRORS
- It is possible to supply a regular expression that takes a very long
- time to fail to match certain lines. Such patterns normally involve
- nested indefinite repeats, for example: (a+)*\d when matched against a
- line of a's with no final digit. The PCRE matching function has a
- resource limit that causes it to abort in these circumstances. If this
+ It is possible to supply a regular expression that takes a very long
+ time to fail to match certain lines. Such patterns normally involve
+ nested indefinite repeats, for example: (a+)*\d when matched against a
+ line of a's with no final digit. The PCRE matching function has a
+ resource limit that causes it to abort in these circumstances. If this
happens, pcregrep outputs an error message and the line that caused the
- problem to the standard error stream. If there are more than 20 such
+ problem to the standard error stream. If there are more than 20 such
errors, pcregrep gives up.
- The --match-limit option of pcregrep can be used to set the overall
- resource limit; there is a second option called --recursion-limit that
- sets a limit on the amount of memory (usually stack) that is used (see
+ The --match-limit option of pcregrep can be used to set the overall
+ resource limit; there is a second option called --recursion-limit that
+ sets a limit on the amount of memory (usually stack) that is used (see
the discussion of these options above).
DIAGNOSTICS
Exit status is 0 if any matches were found, 1 if no matches were found,
- and 2 for syntax errors, overlong lines, non-existent or inaccessible
- files (even if matches were found in other files) or too many matching
+ and 2 for syntax errors, overlong lines, non-existent or inaccessible
+ files (even if matches were found in other files) or too many matching
errors. Using the -s option to suppress error messages about inaccessi-
ble files does not affect the return code.
@@ -736,5 +737,5 @@ AUTHOR
REVISION
- Last updated: 13 September 2012
- Copyright (c) 1997-2012 University of Cambridge.
+ Last updated: 03 April 2014
+ Copyright (c) 1997-2014 University of Cambridge.
diff --git a/pcre/doc/pcrepattern.3 b/pcre/doc/pcrepattern.3
index 4c515f83adc..f1c45cda5d2 100644
--- a/pcre/doc/pcrepattern.3
+++ b/pcre/doc/pcrepattern.3
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "03 December 2013" "PCRE 8.34"
+.TH PCREPATTERN 3 "08 January 2014" "PCRE 8.35"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -1004,7 +1004,9 @@ matches "foobar", the first substring is still set to "foo".
.P
Perl documents that the use of \eK within assertions is "not well defined". In
PCRE, \eK is acted upon when it occurs inside positive assertions, but is
-ignored in negative assertions.
+ignored in negative assertions. Note that when a pattern such as (?=ab\eK)
+matches, the reported start of the match can be greater than the end of the
+match.
.
.
.\" HTML <a name="smallassertions"></a>
@@ -3028,19 +3030,22 @@ match does not always guarantee that a match must be at this starting point.
.P
Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE's start-of-match optimizations are turned off, as shown in this
-\fBpcretest\fP example:
+output from \fBpcretest\fP:
.sp
re> /(*COMMIT)abc/
data> xyzabc
0: abc
- xyzabc\eY
+ data> xyzabc\eY
No match
.sp
-PCRE knows that any match must start with "a", so the optimization skips along
-the subject to "a" before running the first match attempt, which succeeds. When
-the optimization is disabled by the \eY escape in the second subject, the match
-starts at "x" and so the (*COMMIT) causes it to fail without trying any other
-starting points.
+For this pattern, PCRE knows that any match must start with "a", so the
+optimization skips along the subject to "a" before applying the pattern to the
+first set of data. The match attempt then succeeds. In the second set of data,
+the escape sequence \eY is interpreted by the \fBpcretest\fP program. It causes
+the PCRE_NO_START_OPTIMIZE option to be set when \fBpcre_exec()\fP is called.
+This disables the optimization that skips along to the first character. The
+pattern is now applied starting at "x", and so the (*COMMIT) causes the match
+to fail without trying any other starting points.
.sp
(*PRUNE) or (*PRUNE:NAME)
.sp
@@ -3255,6 +3260,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 03 December 2013
-Copyright (c) 1997-2013 University of Cambridge.
+Last updated: 08 January 2014
+Copyright (c) 1997-2014 University of Cambridge.
.fi
diff --git a/pcre/doc/pcresyntax.3 b/pcre/doc/pcresyntax.3
index 87f0cead743..fd878da4f99 100644
--- a/pcre/doc/pcresyntax.3
+++ b/pcre/doc/pcresyntax.3
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "12 November 2013" "PCRE 8.34"
+.TH PCRESYNTAX 3 "08 January 2014" "PCRE 8.35"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -309,6 +309,8 @@ but some of them use Unicode properties if PCRE_UCP is set. You can use
.rs
.sp
\eK reset start of match
+.sp
+\eK is honoured in positive assertions, but ignored in negative ones.
.
.
.SH "ALTERNATION"
@@ -354,11 +356,13 @@ but some of them use Unicode properties if PCRE_UCP is set. You can use
(?x) extended (ignore white space)
(?-...) unset option(s)
.sp
-The following are recognized only at the start of a pattern or after one of the
-newline-setting options with similar syntax:
+The following are recognized only at the very start of a pattern or after one
+of the newline or \eR options with similar syntax. More than one of them may
+appear.
.sp
(*LIMIT_MATCH=d) set the match limit to d (decimal number)
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
+ (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
(*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
(*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
@@ -370,6 +374,29 @@ Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre_exec(), not increase them.
.
.
+.SH "NEWLINE CONVENTION"
+.rs
+.sp
+These are recognized only at the very start of the pattern or after option
+settings with a similar syntax.
+.sp
+ (*CR) carriage return only
+ (*LF) linefeed only
+ (*CRLF) carriage return followed by linefeed
+ (*ANYCRLF) all three of the above
+ (*ANY) any Unicode newline sequence
+.
+.
+.SH "WHAT \eR MATCHES"
+.rs
+.sp
+These are recognized only at the very start of the pattern or after option
+setting with a similar syntax.
+.sp
+ (*BSR_ANYCRLF) CR, LF, or CRLF
+ (*BSR_UNICODE) any Unicode newline sequence
+.
+.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
@@ -457,29 +484,6 @@ pattern is not anchored.
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
.
.
-.SH "NEWLINE CONVENTIONS"
-.rs
-.sp
-These are recognized only at the very start of the pattern or after a
-(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
-.sp
- (*CR) carriage return only
- (*LF) linefeed only
- (*CRLF) carriage return followed by linefeed
- (*ANYCRLF) all three of the above
- (*ANY) any Unicode newline sequence
-.
-.
-.SH "WHAT \eR MATCHES"
-.rs
-.sp
-These are recognized only at the very start of the pattern or after a
-(*...) option that sets the newline convention or a UTF or UCP mode.
-.sp
- (*BSR_ANYCRLF) CR, LF, or CRLF
- (*BSR_UNICODE) any Unicode newline sequence
-.
-.
.SH "CALLOUTS"
.rs
.sp
@@ -508,6 +512,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 12 November 2013
-Copyright (c) 1997-2013 University of Cambridge.
+Last updated: 08 January 2014
+Copyright (c) 1997-2014 University of Cambridge.
.fi
diff --git a/pcre/doc/pcretest.1 b/pcre/doc/pcretest.1
index f17c6f24088..92640da8e1b 100644
--- a/pcre/doc/pcretest.1
+++ b/pcre/doc/pcretest.1
@@ -1,4 +1,4 @@
-.TH PCRETEST 1 "12 November 2013" "PCRE 8.34"
+.TH PCRETEST 1 "09 February 2014" "PCRE 8.35"
.SH NAME
pcretest - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
@@ -113,6 +113,9 @@ following options output the value and set the exit code as indicated:
newline the default newline setting:
CR, LF, CRLF, ANYCRLF, or ANY
exit code is always 0
+ bsr the default setting for what \eR matches:
+ ANYCRLF or ANY
+ exit code is always 0
.sp
The following options output 1 for true or 0 for false, and set the exit code
to the same value:
@@ -330,6 +333,7 @@ sections.
\fB/N\fP set PCRE_NO_AUTO_CAPTURE
\fB/O\fP set PCRE_NO_AUTO_POSSESS
\fB/P\fP use the POSIX wrapper
+ \fB/Q\fP test external stack check function
\fB/S\fP study the pattern after compilation
\fB/s\fP set PCRE_DOTALL
\fB/T\fP select character tables
@@ -483,7 +487,10 @@ below.
The \fB/I\fP modifier requests that \fBpcretest\fP output information about the
compiled pattern (whether it is anchored, has a fixed first character, and
so on). It does this by calling \fBpcre[16|32]_fullinfo()\fP after compiling a
-pattern. If the pattern is studied, the results of that are also output.
+pattern. If the pattern is studied, the results of that are also output. In
+this output, the word "char" means a non-UTF character, that is, the value of a
+single data item (8-bit, 16-bit, or 32-bit, depending on the library that is
+being tested).
.P
The \fB/K\fP modifier requests \fBpcretest\fP to show names from backtracking
control verbs that are returned from calls to \fBpcre[16|32]_exec()\fP. It causes
@@ -513,6 +520,15 @@ the compiled pattern to be output. This does not include the size of the
successfully studied with the PCRE_STUDY_JIT_COMPILE option, the size of the
JIT compiled code is also output.
.P
+The \fB/Q\fP modifier is used to test the use of \fBpcre_stack_guard\fP. It
+must be followed by '0' or '1', specifying the return code to be given from an
+external function that is passed to PCRE and used for stack checking during
+compilation (see the
+.\" HREF
+\fBpcreapi\fP
+.\"
+documentation for details).
+.P
The \fB/S\fP modifier causes \fBpcre[16|32]_study()\fP to be called after the
expression has been compiled, and the results used when the expression is
matched. There are a number of qualifying characters that may follow \fB/S\fP.
@@ -1135,6 +1151,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 12 November 2013
-Copyright (c) 1997-2013 University of Cambridge.
+Last updated: 09 February 2014
+Copyright (c) 1997-2014 University of Cambridge.
.fi
diff --git a/pcre/doc/pcretest.txt b/pcre/doc/pcretest.txt
index f0609939f5c..55de5024431 100644
--- a/pcre/doc/pcretest.txt
+++ b/pcre/doc/pcretest.txt
@@ -99,6 +99,9 @@ COMMAND LINE OPTIONS
newline the default newline setting:
CR, LF, CRLF, ANYCRLF, or ANY
exit code is always 0
+ bsr the default setting for what \R matches:
+ ANYCRLF or ANY
+ exit code is always 0
The following options output 1 for true or 0 for false, and
set the exit code to the same value:
@@ -316,6 +319,7 @@ PATTERN MODIFIERS
/N set PCRE_NO_AUTO_CAPTURE
/O set PCRE_NO_AUTO_POSSESS
/P use the POSIX wrapper
+ /Q test external stack check function
/S study the pattern after compilation
/s set PCRE_DOTALL
/T select character tables
@@ -462,7 +466,9 @@ PATTERN MODIFIERS
compiled pattern (whether it is anchored, has a fixed first character,
and so on). It does this by calling pcre[16|32]_fullinfo() after com-
piling a pattern. If the pattern is studied, the results of that are
- also output.
+ also output. In this output, the word "char" means a non-UTF character,
+ that is, the value of a single data item (8-bit, 16-bit, or 32-bit,
+ depending on the library that is being tested).
The /K modifier requests pcretest to show names from backtracking con-
trol verbs that are returned from calls to pcre[16|32]_exec(). It
@@ -493,26 +499,31 @@ PATTERN MODIFIERS
pattern is successfully studied with the PCRE_STUDY_JIT_COMPILE option,
the size of the JIT compiled code is also output.
- The /S modifier causes pcre[16|32]_study() to be called after the
- expression has been compiled, and the results used when the expression
+ The /Q modifier is used to test the use of pcre_stack_guard. It must be
+ followed by '0' or '1', specifying the return code to be given from an
+ external function that is passed to PCRE and used for stack checking
+ during compilation (see the pcreapi documentation for details).
+
+ The /S modifier causes pcre[16|32]_study() to be called after the
+ expression has been compiled, and the results used when the expression
is matched. There are a number of qualifying characters that may follow
/S. They may appear in any order.
If /S is followed by an exclamation mark, pcre[16|32]_study() is called
- with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a
+ with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a
pcre_extra block, even when studying discovers no useful information.
If /S is followed by a second S character, it suppresses studying, even
- if it was requested externally by the -s command line option. This
- makes it possible to specify that certain patterns are always studied,
+ if it was requested externally by the -s command line option. This
+ makes it possible to specify that certain patterns are always studied,
and others are never studied, independently of -s. This feature is used
in the test files in a few cases where the output is different when the
pattern is studied.
- If the /S modifier is followed by a + character, the call to
- pcre[16|32]_study() is made with all the JIT study options, requesting
- just-in-time optimization support if it is available, for both normal
- and partial matching. If you want to restrict the JIT compiling modes,
+ If the /S modifier is followed by a + character, the call to
+ pcre[16|32]_study() is made with all the JIT study options, requesting
+ just-in-time optimization support if it is available, for both normal
+ and partial matching. If you want to restrict the JIT compiling modes,
you can follow /S+ with a digit in the range 1 to 7:
1 normal match only
@@ -523,40 +534,40 @@ PATTERN MODIFIERS
7 all three modes (default)
If /S++ is used instead of /S+ (with or without a following digit), the
- text "(JIT)" is added to the first output line after a match or no
+ text "(JIT)" is added to the first output line after a match or no
match when JIT-compiled code was actually used.
- Note that there is also an independent /+ modifier; it must not be
+ Note that there is also an independent /+ modifier; it must not be
given immediately after /S or /S+ because this will be misinterpreted.
If JIT studying is successful, the compiled JIT code will automatically
- be used when pcre[16|32]_exec() is run, except when incompatible run-
- time options are specified. For more details, see the pcrejit documen-
- tation. See also the \J escape sequence below for a way of setting the
+ be used when pcre[16|32]_exec() is run, except when incompatible run-
+ time options are specified. For more details, see the pcrejit documen-
+ tation. See also the \J escape sequence below for a way of setting the
size of the JIT stack.
- Finally, if /S is followed by a minus character, JIT compilation is
- suppressed, even if it was requested externally by the -s command line
- option. This makes it possible to specify that JIT is never to be used
+ Finally, if /S is followed by a minus character, JIT compilation is
+ suppressed, even if it was requested externally by the -s command line
+ option. This makes it possible to specify that JIT is never to be used
for certain patterns.
- The /T modifier must be followed by a single digit. It causes a spe-
+ The /T modifier must be followed by a single digit. It causes a spe-
cific set of built-in character tables to be passed to pcre[16|32]_com-
- pile(). It is used in the standard PCRE tests to check behaviour with
+ pile(). It is used in the standard PCRE tests to check behaviour with
different character tables. The digit specifies the tables as follows:
0 the default ASCII tables, as distributed in
pcre_chartables.c.dist
1 a set of tables defining ISO 8859 characters
- In table 1, some characters whose codes are greater than 128 are iden-
+ In table 1, some characters whose codes are greater than 128 are iden-
tified as letters, digits, spaces, etc.
Using the POSIX wrapper API
- The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
- rather than its native API. This supports only the 8-bit library. When
- /P is set, the following modifiers set options for the regcomp() func-
+ The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
+ rather than its native API. This supports only the 8-bit library. When
+ /P is set, the following modifiers set options for the regcomp() func-
tion:
/i REG_ICASE
@@ -567,48 +578,48 @@ PATTERN MODIFIERS
/W REG_UCP ) the POSIX standard
/8 REG_UTF8 )
- The /+ modifier works as described above. All other modifiers are
+ The /+ modifier works as described above. All other modifiers are
ignored.
Locking out certain modifiers
- PCRE can be compiled with or without support for certain features such
- as UTF-8/16/32 or Unicode properties. Accordingly, the standard tests
- are split up into a number of different files that are selected for
- running depending on which features are available. When updating the
+ PCRE can be compiled with or without support for certain features such
+ as UTF-8/16/32 or Unicode properties. Accordingly, the standard tests
+ are split up into a number of different files that are selected for
+ running depending on which features are available. When updating the
tests, it is all too easy to put a new test into the wrong file by mis-
- take; for example, to put a test that requires UTF support into a file
- that is used when it is not available. To help detect such mistakes as
- early as possible, there is a facility for locking out specific modi-
+ take; for example, to put a test that requires UTF support into a file
+ that is used when it is not available. To help detect such mistakes as
+ early as possible, there is a facility for locking out specific modi-
fiers. If an input line for pcretest starts with the string "< forbid "
- the following sequence of characters is taken as a list of forbidden
+ the following sequence of characters is taken as a list of forbidden
modifiers. For example, in the test files that must not use UTF or Uni-
code property support, this line appears:
< forbid 8W
- This locks out the /8 and /W modifiers. An immediate error is given if
- they are subsequently encountered. If the character string contains <
- but not >, all the multi-character modifiers that begin with < are
- locked out. Otherwise, such modifiers must be explicitly listed, for
+ This locks out the /8 and /W modifiers. An immediate error is given if
+ they are subsequently encountered. If the character string contains <
+ but not >, all the multi-character modifiers that begin with < are
+ locked out. Otherwise, such modifiers must be explicitly listed, for
example:
< forbid <JS><cr>
There must be a single space between < and "forbid" for this feature to
- be recognised. If there is not, the line is interpreted either as a
- request to re-load a pre-compiled pattern (see "SAVING AND RELOADING
- COMPILED PATTERNS" below) or, if there is a another < character, as a
+ be recognised. If there is not, the line is interpreted either as a
+ request to re-load a pre-compiled pattern (see "SAVING AND RELOADING
+ COMPILED PATTERNS" below) or, if there is a another < character, as a
pattern that uses < as its delimiter.
DATA LINES
- Before each data line is passed to pcre[16|32]_exec(), leading and
- trailing white space is removed, and it is then scanned for \ escapes.
- Some of these are pretty esoteric features, intended for checking out
- some of the more complicated features of PCRE. If you are just testing
- "ordinary" regular expressions, you probably don't need any of these.
+ Before each data line is passed to pcre[16|32]_exec(), leading and
+ trailing white space is removed, and it is then scanned for \ escapes.
+ Some of these are pretty esoteric features, intended for checking out
+ some of the more complicated features of PCRE. If you are just testing
+ "ordinary" regular expressions, you probably don't need any of these.
The following escapes are recognized:
\a alarm (BEL, \x07)
@@ -669,7 +680,7 @@ DATA LINES
(any number of digits)
\R pass the PCRE_DFA_RESTART option to pcre[16|32]_dfa_exec()
\S output details of memory get/free calls during matching
- \Y pass the PCRE_NO_START_OPTIMIZE option to
+ \Y pass the PCRE_NO_START_OPTIMIZE option to
pcre[16|32]_exec()
or pcre[16|32]_dfa_exec()
\Z pass the PCRE_NOTEOL option to pcre[16|32]_exec()
@@ -678,7 +689,7 @@ DATA LINES
pcre[16|32]_exec() or pcre[16|32]_dfa_exec()
\>dd start the match at offset dd (optional "-"; then
any number of digits); this sets the startoffset
- argument for pcre[16|32]_exec() or
+ argument for pcre[16|32]_exec() or
pcre[16|32]_dfa_exec()
\<cr> pass the PCRE_NEWLINE_CR option to pcre[16|32]_exec()
or pcre[16|32]_dfa_exec()
@@ -691,102 +702,102 @@ DATA LINES
\<any> pass the PCRE_NEWLINE_ANY option to pcre[16|32]_exec()
or pcre[16|32]_dfa_exec()
- The use of \x{hh...} is not dependent on the use of the /8 modifier on
- the pattern. It is recognized always. There may be any number of hexa-
- decimal digits inside the braces; invalid values provoke error mes-
+ The use of \x{hh...} is not dependent on the use of the /8 modifier on
+ the pattern. It is recognized always. There may be any number of hexa-
+ decimal digits inside the braces; invalid values provoke error mes-
sages.
- Note that \xhh specifies one byte rather than one character in UTF-8
- mode; this makes it possible to construct invalid UTF-8 sequences for
- testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
- character in UTF-8 mode, generating more than one byte if the value is
- greater than 127. When testing the 8-bit library not in UTF-8 mode,
+ Note that \xhh specifies one byte rather than one character in UTF-8
+ mode; this makes it possible to construct invalid UTF-8 sequences for
+ testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
+ character in UTF-8 mode, generating more than one byte if the value is
+ greater than 127. When testing the 8-bit library not in UTF-8 mode,
\x{hh} generates one byte for values less than 256, and causes an error
for greater values.
In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
possible to construct invalid UTF-16 sequences for testing purposes.
- In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
- makes it possible to construct invalid UTF-32 sequences for testing
+ In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
+ makes it possible to construct invalid UTF-32 sequences for testing
purposes.
- The escapes that specify line ending sequences are literal strings,
+ The escapes that specify line ending sequences are literal strings,
exactly as shown. No more than one newline setting should be present in
any data line.
- A backslash followed by anything else just escapes the anything else.
- If the very last character is a backslash, it is ignored. This gives a
- way of passing an empty line as data, since a real empty line termi-
+ A backslash followed by anything else just escapes the anything else.
+ If the very last character is a backslash, it is ignored. This gives a
+ way of passing an empty line as data, since a real empty line termi-
nates the data input.
- The \J escape provides a way of setting the maximum stack size that is
- used by the just-in-time optimization code. It is ignored if JIT opti-
- mization is not being used. Providing a stack that is larger than the
+ The \J escape provides a way of setting the maximum stack size that is
+ used by the just-in-time optimization code. It is ignored if JIT opti-
+ mization is not being used. Providing a stack that is larger than the
default 32K is necessary only for very complicated patterns.
If \M is present, pcretest calls pcre[16|32]_exec() several times, with
different values in the match_limit and match_limit_recursion fields of
- the pcre[16|32]_extra data structure, until it finds the minimum num-
+ the pcre[16|32]_extra data structure, until it finds the minimum num-
bers for each parameter that allow pcre[16|32]_exec() to complete with-
- out error. Because this is testing a specific feature of the normal
+ out error. Because this is testing a specific feature of the normal
interpretive pcre[16|32]_exec() execution, the use of any JIT optimiza-
- tion that might have been set up by the /S+ qualifier of -s+ option is
+ tion that might have been set up by the /S+ qualifier of -s+ option is
disabled.
- The match_limit number is a measure of the amount of backtracking that
- takes place, and checking it out can be instructive. For most simple
- matches, the number is quite small, but for patterns with very large
- numbers of matching possibilities, it can become large very quickly
- with increasing length of subject string. The match_limit_recursion
- number is a measure of how much stack (or, if PCRE is compiled with
- NO_RECURSE, how much heap) memory is needed to complete the match
+ The match_limit number is a measure of the amount of backtracking that
+ takes place, and checking it out can be instructive. For most simple
+ matches, the number is quite small, but for patterns with very large
+ numbers of matching possibilities, it can become large very quickly
+ with increasing length of subject string. The match_limit_recursion
+ number is a measure of how much stack (or, if PCRE is compiled with
+ NO_RECURSE, how much heap) memory is needed to complete the match
attempt.
- When \O is used, the value specified may be higher or lower than the
+ When \O is used, the value specified may be higher or lower than the
size set by the -O command line option (or defaulted to 45); \O applies
- only to the call of pcre[16|32]_exec() for the line in which it
+ only to the call of pcre[16|32]_exec() for the line in which it
appears.
- If the /P modifier was present on the pattern, causing the POSIX wrap-
- per API to be used, the only option-setting sequences that have any
- effect are \B, \N, and \Z, causing REG_NOTBOL, REG_NOTEMPTY, and
+ If the /P modifier was present on the pattern, causing the POSIX wrap-
+ per API to be used, the only option-setting sequences that have any
+ effect are \B, \N, and \Z, causing REG_NOTBOL, REG_NOTEMPTY, and
REG_NOTEOL, respectively, to be passed to regexec().
THE ALTERNATIVE MATCHING FUNCTION
- By default, pcretest uses the standard PCRE matching function,
- pcre[16|32]_exec() to match each data line. PCRE also supports an
- alternative matching function, pcre[16|32]_dfa_test(), which operates
- in a different way, and has some restrictions. The differences between
+ By default, pcretest uses the standard PCRE matching function,
+ pcre[16|32]_exec() to match each data line. PCRE also supports an
+ alternative matching function, pcre[16|32]_dfa_test(), which operates
+ in a different way, and has some restrictions. The differences between
the two functions are described in the pcrematching documentation.
- If a data line contains the \D escape sequence, or if the command line
- contains the -dfa option, the alternative matching function is used.
+ If a data line contains the \D escape sequence, or if the command line
+ contains the -dfa option, the alternative matching function is used.
This function finds all possible matches at a given point. If, however,
- the \F escape sequence is present in the data line, it stops after the
+ the \F escape sequence is present in the data line, it stops after the
first match is found. This is always the shortest possible match.
DEFAULT OUTPUT FROM PCRETEST
- This section describes the output when the normal matching function,
+ This section describes the output when the normal matching function,
pcre[16|32]_exec(), is being used.
When a match succeeds, pcretest outputs the list of captured substrings
- that pcre[16|32]_exec() returns, starting with number 0 for the string
- that matched the whole pattern. Otherwise, it outputs "No match" when
- the return is PCRE_ERROR_NOMATCH, and "Partial match:" followed by the
- partially matching substring when pcre[16|32]_exec() returns
- PCRE_ERROR_PARTIAL. (Note that this is the entire substring that was
- inspected during the partial match; it may include characters before
- the actual match start if a lookbehind assertion, \K, \b, or \B was
- involved.) For any other return, pcretest outputs the PCRE negative
- error number and a short descriptive phrase. If the error is a failed
- UTF string check, the offset of the start of the failing character and
- the reason code are also output, provided that the size of the output
- vector is at least two. Here is an example of an interactive pcretest
+ that pcre[16|32]_exec() returns, starting with number 0 for the string
+ that matched the whole pattern. Otherwise, it outputs "No match" when
+ the return is PCRE_ERROR_NOMATCH, and "Partial match:" followed by the
+ partially matching substring when pcre[16|32]_exec() returns
+ PCRE_ERROR_PARTIAL. (Note that this is the entire substring that was
+ inspected during the partial match; it may include characters before
+ the actual match start if a lookbehind assertion, \K, \b, or \B was
+ involved.) For any other return, pcretest outputs the PCRE negative
+ error number and a short descriptive phrase. If the error is a failed
+ UTF string check, the offset of the start of the failing character and
+ the reason code are also output, provided that the size of the output
+ vector is at least two. Here is an example of an interactive pcretest
run.
$ pcretest
@@ -800,10 +811,10 @@ DEFAULT OUTPUT FROM PCRETEST
No match
Unset capturing substrings that are not followed by one that is set are
- not returned by pcre[16|32]_exec(), and are not shown by pcretest. In
+ not returned by pcre[16|32]_exec(), and are not shown by pcretest. In
the following example, there are two capturing substrings, but when the
- first data line is matched, the second, unset substring is not shown.
- An "internal" unset substring is shown as "<unset>", as for the second
+ first data line is matched, the second, unset substring is not shown.
+ An "internal" unset substring is shown as "<unset>", as for the second
data line.
re> /(a)|(b)/
@@ -815,11 +826,11 @@ DEFAULT OUTPUT FROM PCRETEST
1: <unset>
2: b
- If the strings contain any non-printing characters, they are output as
- \xhh escapes if the value is less than 256 and UTF mode is not set.
+ If the strings contain any non-printing characters, they are output as
+ \xhh escapes if the value is less than 256 and UTF mode is not set.
Otherwise they are output as \x{hh...} escapes. See below for the defi-
- nition of non-printing characters. If the pattern has the /+ modifier,
- the output for substring 0 is followed by the the rest of the subject
+ nition of non-printing characters. If the pattern has the /+ modifier,
+ the output for substring 0 is followed by the the rest of the subject
string, identified by "0+" like this:
re> /cat/+
@@ -827,7 +838,7 @@ DEFAULT OUTPUT FROM PCRETEST
0: cat
0+ aract
- If the pattern has the /g or /G modifier, the results of successive
+ If the pattern has the /g or /G modifier, the results of successive
matching attempts are output in sequence, like this:
re> /\Bi(\w\w)/g
@@ -839,32 +850,32 @@ DEFAULT OUTPUT FROM PCRETEST
0: ipp
1: pp
- "No match" is output only if the first match attempt fails. Here is an
- example of a failure message (the offset 4 that is specified by \>4 is
+ "No match" is output only if the first match attempt fails. Here is an
+ example of a failure message (the offset 4 that is specified by \>4 is
past the end of the subject string):
re> /xyz/
data> xyz\>4
Error -24 (bad offset value)
- If any of the sequences \C, \G, or \L are present in a data line that
- is successfully matched, the substrings extracted by the convenience
+ If any of the sequences \C, \G, or \L are present in a data line that
+ is successfully matched, the substrings extracted by the convenience
functions are output with C, G, or L after the string number instead of
a colon. This is in addition to the normal full list. The string length
- (that is, the return from the extraction function) is given in paren-
+ (that is, the return from the extraction function) is given in paren-
theses after each string for \C and \G.
Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), data lines may not. However new-
- lines can be included in data by means of the \n escape (or \r, \r\n,
+ lines can be included in data by means of the \n escape (or \r, \r\n,
etc., depending on the newline sequence setting).
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
When the alternative matching function, pcre[16|32]_dfa_exec(), is used
- (by means of the \D escape sequence or the -dfa command line option),
- the output consists of a list of all the matches that start at the
+ (by means of the \D escape sequence or the -dfa command line option),
+ the output consists of a list of all the matches that start at the
first point in the subject where there is at least one match. For exam-
ple:
@@ -874,11 +885,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang
2: tan
- (Using the normal matching function on this data finds only "tang".)
- The longest matching string is always given first (and numbered zero).
+ (Using the normal matching function on this data finds only "tang".)
+ The longest matching string is always given first (and numbered zero).
After a PCRE_ERROR_PARTIAL return, the output is "Partial match:", fol-
- lowed by the partially matching substring. (Note that this is the
- entire substring that was inspected during the partial match; it may
+ lowed by the partially matching substring. (Note that this is the
+ entire substring that was inspected during the partial match; it may
include characters before the actual match start if a lookbehind asser-
tion, \K, \b, or \B was involved.)
@@ -894,16 +905,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan
0: tan
- Since the matching function does not support substring capture, the
- escape sequences that are concerned with captured substrings are not
+ Since the matching function does not support substring capture, the
+ escape sequences that are concerned with captured substrings are not
relevant.
RESTARTING AFTER A PARTIAL MATCH
When the alternative matching function has given the PCRE_ERROR_PARTIAL
- return, indicating that the subject partially matched the pattern, you
- can restart the match with additional subject data by means of the \R
+ return, indicating that the subject partially matched the pattern, you
+ can restart the match with additional subject data by means of the \R
escape sequence. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -912,30 +923,30 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\R\D
0: n05
- For further information about partial matching, see the pcrepartial
+ For further information about partial matching, see the pcrepartial
documentation.
CALLOUTS
- If the pattern contains any callout requests, pcretest's callout func-
- tion is called during matching. This works with both matching func-
+ If the pattern contains any callout requests, pcretest's callout func-
+ tion is called during matching. This works with both matching func-
tions. By default, the called function displays the callout number, the
- start and current positions in the text at the callout time, and the
+ start and current positions in the text at the callout time, and the
next pattern item to be tested. For example:
--->pqrabcdef
0 ^ ^ \d
- This output indicates that callout number 0 occurred for a match
- attempt starting at the fourth character of the subject string, when
+ This output indicates that callout number 0 occurred for a match
+ attempt starting at the fourth character of the subject string, when
the pointer was at the seventh character of the data, and when the next
- pattern item was \d. Just one circumflex is output if the start and
+ pattern item was \d. Just one circumflex is output if the start and
current positions are the same.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
- a result of the /C pattern modifier. In this case, instead of showing
- the callout number, the offset in the pattern, preceded by a plus, is
+ a result of the /C pattern modifier. In this case, instead of showing
+ the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
re> /\d?[A-E]\*/C
@@ -948,7 +959,7 @@ CALLOUTS
0: E*
If a pattern contains (*MARK) items, an additional line is output when-
- ever a change of latest mark is passed to the callout function. For
+ ever a change of latest mark is passed to the callout function. For
example:
re> /a(*MARK:X)bc/C
@@ -962,104 +973,104 @@ CALLOUTS
+12 ^ ^
0: abc
- The mark changes between matching "a" and "b", but stays the same for
- the rest of the match, so nothing more is output. If, as a result of
- backtracking, the mark reverts to being unset, the text "<unset>" is
+ The mark changes between matching "a" and "b", but stays the same for
+ the rest of the match, so nothing more is output. If, as a result of
+ backtracking, the mark reverts to being unset, the text "<unset>" is
output.
- The callout function in pcretest returns zero (carry on matching) by
- default, but you can use a \C item in a data line (as described above)
+ The callout function in pcretest returns zero (carry on matching) by
+ default, but you can use a \C item in a data line (as described above)
to change this and other parameters of the callout.
- Inserting callouts can be helpful when using pcretest to check compli-
- cated regular expressions. For further information about callouts, see
+ Inserting callouts can be helpful when using pcretest to check compli-
+ cated regular expressions. For further information about callouts, see
the pcrecallout documentation.
NON-PRINTING CHARACTERS
- When pcretest is outputting text in the compiled version of a pattern,
- bytes other than 32-126 are always treated as non-printing characters
+ When pcretest is outputting text in the compiled version of a pattern,
+ bytes other than 32-126 are always treated as non-printing characters
are are therefore shown as hex escapes.
- When pcretest is outputting text that is a matched part of a subject
- string, it behaves in the same way, unless a different locale has been
- set for the pattern (using the /L modifier). In this case, the
+ When pcretest is outputting text that is a matched part of a subject
+ string, it behaves in the same way, unless a different locale has been
+ set for the pattern (using the /L modifier). In this case, the
isprint() function to distinguish printing and non-printing characters.
SAVING AND RELOADING COMPILED PATTERNS
- The facilities described in this section are not available when the
- POSIX interface to PCRE is being used, that is, when the /P pattern
+ The facilities described in this section are not available when the
+ POSIX interface to PCRE is being used, that is, when the /P pattern
modifier is specified.
When the POSIX interface is not in use, you can cause pcretest to write
- a compiled pattern to a file, by following the modifiers with > and a
+ a compiled pattern to a file, by following the modifiers with > and a
file name. For example:
/pattern/im >/some/file
- See the pcreprecompile documentation for a discussion about saving and
- re-using compiled patterns. Note that if the pattern was successfully
+ See the pcreprecompile documentation for a discussion about saving and
+ re-using compiled patterns. Note that if the pattern was successfully
studied with JIT optimization, the JIT data cannot be saved.
- The data that is written is binary. The first eight bytes are the
- length of the compiled pattern data followed by the length of the
- optional study data, each written as four bytes in big-endian order
- (most significant byte first). If there is no study data (either the
+ The data that is written is binary. The first eight bytes are the
+ length of the compiled pattern data followed by the length of the
+ optional study data, each written as four bytes in big-endian order
+ (most significant byte first). If there is no study data (either the
pattern was not studied, or studying did not return any data), the sec-
- ond length is zero. The lengths are followed by an exact copy of the
- compiled pattern. If there is additional study data, this (excluding
- any JIT data) follows immediately after the compiled pattern. After
+ ond length is zero. The lengths are followed by an exact copy of the
+ compiled pattern. If there is additional study data, this (excluding
+ any JIT data) follows immediately after the compiled pattern. After
writing the file, pcretest expects to read a new pattern.
- A saved pattern can be reloaded into pcretest by specifying < and a
- file name instead of a pattern. There must be no space between < and
- the file name, which must not contain a < character, as otherwise
- pcretest will interpret the line as a pattern delimited by < charac-
+ A saved pattern can be reloaded into pcretest by specifying < and a
+ file name instead of a pattern. There must be no space between < and
+ the file name, which must not contain a < character, as otherwise
+ pcretest will interpret the line as a pattern delimited by < charac-
ters. For example:
re> </some/file
Compiled pattern loaded from /some/file
No study data
- If the pattern was previously studied with the JIT optimization, the
- JIT information cannot be saved and restored, and so is lost. When the
- pattern has been loaded, pcretest proceeds to read data lines in the
+ If the pattern was previously studied with the JIT optimization, the
+ JIT information cannot be saved and restored, and so is lost. When the
+ pattern has been loaded, pcretest proceeds to read data lines in the
usual way.
- You can copy a file written by pcretest to a different host and reload
- it there, even if the new host has opposite endianness to the one on
- which the pattern was compiled. For example, you can compile on an i86
- machine and run on a SPARC machine. When a pattern is reloaded on a
+ You can copy a file written by pcretest to a different host and reload
+ it there, even if the new host has opposite endianness to the one on
+ which the pattern was compiled. For example, you can compile on an i86
+ machine and run on a SPARC machine. When a pattern is reloaded on a
host with different endianness, the confirmation message is changed to:
Compiled pattern (byte-inverted) loaded from /some/file
The test suite contains some saved pre-compiled patterns with different
- endianness. These are reloaded using "<!" instead of just "<". This
+ endianness. These are reloaded using "<!" instead of just "<". This
suppresses the "(byte-inverted)" text so that the output is the same on
- all hosts. It also forces debugging output once the pattern has been
+ all hosts. It also forces debugging output once the pattern has been
reloaded.
- File names for saving and reloading can be absolute or relative, but
- note that the shell facility of expanding a file name that starts with
+ File names for saving and reloading can be absolute or relative, but
+ note that the shell facility of expanding a file name that starts with
a tilde (~) is not available.
- The ability to save and reload files in pcretest is intended for test-
- ing and experimentation. It is not intended for production use because
- only a single pattern can be written to a file. Furthermore, there is
- no facility for supplying custom character tables for use with a
- reloaded pattern. If the original pattern was compiled with custom
- tables, an attempt to match a subject string using a reloaded pattern
- is likely to cause pcretest to crash. Finally, if you attempt to load
+ The ability to save and reload files in pcretest is intended for test-
+ ing and experimentation. It is not intended for production use because
+ only a single pattern can be written to a file. Furthermore, there is
+ no facility for supplying custom character tables for use with a
+ reloaded pattern. If the original pattern was compiled with custom
+ tables, an attempt to match a subject string using a reloaded pattern
+ is likely to cause pcretest to crash. Finally, if you attempt to load
a file that is not in the correct format, the result is undefined.
SEE ALSO
- pcre(3), pcre16(3), pcre32(3), pcreapi(3), pcrecallout(3), pcrejit,
+ pcre(3), pcre16(3), pcre32(3), pcreapi(3), pcrecallout(3), pcrejit,
pcrematching(3), pcrepartial(d), pcrepattern(3), pcreprecompile(3).
@@ -1072,5 +1083,5 @@ AUTHOR
REVISION
- Last updated: 12 November 2013
- Copyright (c) 1997-2013 University of Cambridge.
+ Last updated: 09 February 2014
+ Copyright (c) 1997-2014 University of Cambridge.
diff --git a/pcre/maria-patches/pcre_stack_guard.diff b/pcre/maria-patches/pcre_stack_guard.diff
deleted file mode 100644
index 8cf4b7dbb34..00000000000
--- a/pcre/maria-patches/pcre_stack_guard.diff
+++ /dev/null
@@ -1,57 +0,0 @@
-=== modified file 'pcre/pcre.h.in'
---- pcre/pcre.h.in 2013-09-26 14:02:17 +0000
-+++ pcre/pcre.h.in 2013-10-02 07:58:29 +0000
-@@ -486,6 +486,7 @@ PCRE_EXP_DECL void (*pcre_free)(void *)
- PCRE_EXP_DECL void *(*pcre_stack_malloc)(size_t);
- PCRE_EXP_DECL void (*pcre_stack_free)(void *);
- PCRE_EXP_DECL int (*pcre_callout)(pcre_callout_block *);
-+PCRE_EXP_DECL int (*pcre_stack_guard)(void);
-
- PCRE_EXP_DECL void *(*pcre16_malloc)(size_t);
- PCRE_EXP_DECL void (*pcre16_free)(void *);
-@@ -504,6 +505,7 @@ PCRE_EXP_DECL void pcre_free(void *);
- PCRE_EXP_DECL void *pcre_stack_malloc(size_t);
- PCRE_EXP_DECL void pcre_stack_free(void *);
- PCRE_EXP_DECL int pcre_callout(pcre_callout_block *);
-+PCRE_EXP_DECL int pcre_stack_guard(void);
-
- PCRE_EXP_DECL void *pcre16_malloc(size_t);
- PCRE_EXP_DECL void pcre16_free(void *);
-
-=== modified file 'pcre/pcre_compile.c'
---- pcre/pcre_compile.c 2013-09-26 14:02:17 +0000
-+++ pcre/pcre_compile.c 2013-10-02 07:58:29 +0000
-@@ -7107,6 +7107,12 @@ unsigned int orig_bracount;
- unsigned int max_bracount;
- branch_chain bc;
-
-+if (pcre_stack_guard && pcre_stack_guard())
-+{
-+ *errorcodeptr= ERR23;
-+ return FALSE;
-+}
-+
- bc.outer = bcptr;
- bc.current_branch = code;
-
-
-=== modified file 'pcre/pcre_globals.c'
---- pcre/pcre_globals.c 2013-09-26 14:02:17 +0000
-+++ pcre/pcre_globals.c 2013-10-02 07:58:29 +0000
-@@ -72,6 +72,7 @@ PCRE_EXP_DATA_DEFN void (*PUBL(free))(v
- PCRE_EXP_DATA_DEFN void *(*PUBL(stack_malloc))(size_t) = LocalPcreMalloc;
- PCRE_EXP_DATA_DEFN void (*PUBL(stack_free))(void *) = LocalPcreFree;
- PCRE_EXP_DATA_DEFN int (*PUBL(callout))(PUBL(callout_block) *) = NULL;
-+PCRE_EXP_DATA_DEFN int (*PUBL(stack_guard))(void) = NULL;
-
- #elif !defined VPCOMPAT
- PCRE_EXP_DATA_DEFN void *(*PUBL(malloc))(size_t) = malloc;
-@@ -79,6 +80,7 @@ PCRE_EXP_DATA_DEFN void (*PUBL(free))(v
- PCRE_EXP_DATA_DEFN void *(*PUBL(stack_malloc))(size_t) = malloc;
- PCRE_EXP_DATA_DEFN void (*PUBL(stack_free))(void *) = free;
- PCRE_EXP_DATA_DEFN int (*PUBL(callout))(PUBL(callout_block) *) = NULL;
-+PCRE_EXP_DATA_DEFN int (*PUBL(stack_guard))(void) = NULL;
- #endif
-
- /* End of pcre_globals.c */
-
diff --git a/pcre/pcre.h.in b/pcre/pcre.h.in
index 45cd875dd56..667a45ed575 100644
--- a/pcre/pcre.h.in
+++ b/pcre/pcre.h.in
@@ -5,7 +5,7 @@
/* This is the public header file for the PCRE library, to be #included by
applications that call the PCRE functions.
- Copyright (c) 1997-2013 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -498,12 +498,14 @@ PCRE_EXP_DECL void (*pcre16_free)(void *);
PCRE_EXP_DECL void *(*pcre16_stack_malloc)(size_t);
PCRE_EXP_DECL void (*pcre16_stack_free)(void *);
PCRE_EXP_DECL int (*pcre16_callout)(pcre16_callout_block *);
+PCRE_EXP_DECL int (*pcre16_stack_guard)(void);
PCRE_EXP_DECL void *(*pcre32_malloc)(size_t);
PCRE_EXP_DECL void (*pcre32_free)(void *);
PCRE_EXP_DECL void *(*pcre32_stack_malloc)(size_t);
PCRE_EXP_DECL void (*pcre32_stack_free)(void *);
PCRE_EXP_DECL int (*pcre32_callout)(pcre32_callout_block *);
+PCRE_EXP_DECL int (*pcre32_stack_guard)(void);
#else /* VPCOMPAT */
PCRE_EXP_DECL void *pcre_malloc(size_t);
PCRE_EXP_DECL void pcre_free(void *);
@@ -517,12 +519,14 @@ PCRE_EXP_DECL void pcre16_free(void *);
PCRE_EXP_DECL void *pcre16_stack_malloc(size_t);
PCRE_EXP_DECL void pcre16_stack_free(void *);
PCRE_EXP_DECL int pcre16_callout(pcre16_callout_block *);
+PCRE_EXP_DECL int pcre16_stack_guard(void);
PCRE_EXP_DECL void *pcre32_malloc(size_t);
PCRE_EXP_DECL void pcre32_free(void *);
PCRE_EXP_DECL void *pcre32_stack_malloc(size_t);
PCRE_EXP_DECL void pcre32_stack_free(void *);
PCRE_EXP_DECL int pcre32_callout(pcre32_callout_block *);
+PCRE_EXP_DECL int pcre32_stack_guard(void);
#endif /* VPCOMPAT */
/* User defined callback which provides a stack just before the match starts. */
diff --git a/pcre/pcre_byte_order.c b/pcre/pcre_byte_order.c
index 01cbca36ffe..cf5f12b04ea 100644
--- a/pcre/pcre_byte_order.c
+++ b/pcre/pcre_byte_order.c
@@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
- Copyright (c) 1997-2013 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -311,9 +311,9 @@ while(TRUE)
ptr++;
}
/* Control should never reach here in 16/32 bit mode. */
-#endif /* !COMPILE_PCRE8 */
-
+#else /* In 8-bit mode, the pattern does not need to be processed. */
return 0;
+#endif /* !COMPILE_PCRE8 */
}
/* End of pcre_byte_order.c */
diff --git a/pcre/pcre_compile.c b/pcre/pcre_compile.c
index cdc7f95304a..8a5b7233479 100644
--- a/pcre/pcre_compile.c
+++ b/pcre/pcre_compile.c
@@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
- Copyright (c) 1997-2013 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -547,6 +547,8 @@ static const char error_texts[] =
"parentheses are too deeply nested\0"
"invalid range in character class\0"
"group name must start with a non-digit\0"
+ /* 85 */
+ "parentheses are too deeply nested (stack check)\0"
;
/* Table to identify digits and hex digits. This is used when compiling
@@ -3070,8 +3072,11 @@ const pcre_uint32 *chr_ptr;
const pcre_uint32 *ochr_ptr;
const pcre_uint32 *list_ptr;
const pcre_uchar *next_code;
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+const pcre_uchar *xclass_flags;
+#endif
const pcre_uint8 *class_bitset;
-const pcre_uint32 *set1, *set2, *set_end;
+const pcre_uint8 *set1, *set2, *set_end;
pcre_uint32 chr;
BOOL accepted, invert_bits;
@@ -3202,12 +3207,12 @@ for(;;)
if (base_list[0] == OP_CLASS)
#endif
{
- set1 = (pcre_uint32 *)(base_end - base_list[2]);
+ set1 = (pcre_uint8 *)(base_end - base_list[2]);
list_ptr = list;
}
else
{
- set1 = (pcre_uint32 *)(code - list[2]);
+ set1 = (pcre_uint8 *)(code - list[2]);
list_ptr = base_list;
}
@@ -3216,41 +3221,53 @@ for(;;)
{
case OP_CLASS:
case OP_NCLASS:
- set2 = (pcre_uint32 *)
+ set2 = (pcre_uint8 *)
((list_ptr == list ? code : base_end) - list_ptr[2]);
break;
- /* OP_XCLASS cannot be supported here, because its bitset
- is not necessarily complete. E.g: [a-\0x{200}] is stored
- as a character range, and the appropriate bits are not set. */
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+ case OP_XCLASS:
+ xclass_flags = (list_ptr == list ? code : base_end) - list_ptr[2] + LINK_SIZE;
+ if ((*xclass_flags & XCL_HASPROP) != 0) return FALSE;
+ if ((*xclass_flags & XCL_MAP) == 0)
+ {
+ /* No bits are set for characters < 256. */
+ if (list[1] == 0) return TRUE;
+ /* Might be an empty repeat. */
+ continue;
+ }
+ set2 = (pcre_uint8 *)(xclass_flags + 1);
+ break;
+#endif
case OP_NOT_DIGIT:
- invert_bits = TRUE;
- /* Fall through */
+ invert_bits = TRUE;
+ /* Fall through */
case OP_DIGIT:
- set2 = (pcre_uint32 *)(cd->cbits + cbit_digit);
- break;
+ set2 = (pcre_uint8 *)(cd->cbits + cbit_digit);
+ break;
case OP_NOT_WHITESPACE:
- invert_bits = TRUE;
- /* Fall through */
+ invert_bits = TRUE;
+ /* Fall through */
case OP_WHITESPACE:
- set2 = (pcre_uint32 *)(cd->cbits + cbit_space);
- break;
+ set2 = (pcre_uint8 *)(cd->cbits + cbit_space);
+ break;
case OP_NOT_WORDCHAR:
- invert_bits = TRUE;
- /* Fall through */
+ invert_bits = TRUE;
+ /* Fall through */
case OP_WORDCHAR:
- set2 = (pcre_uint32 *)(cd->cbits + cbit_word);
- break;
+ set2 = (pcre_uint8 *)(cd->cbits + cbit_word);
+ break;
default:
return FALSE;
}
- /* Compare 4 bytes to improve speed. */
- set_end = set1 + (32 / 4);
+ /* Because the sets are unaligned, we need
+ to perform byte comparison here. */
+ set_end = set1 + 32;
if (invert_bits)
{
do
@@ -3551,7 +3568,9 @@ for(;;)
if (list[1] == 0) return TRUE;
}
-return FALSE;
+/* Control never reaches here. There used to be a fail-save return FALSE; here,
+but some compilers complain about an unreachable statement. */
+
}
@@ -3623,7 +3642,7 @@ for (;;)
break;
case OP_MINUPTO:
- *code += OP_MINUPTO - OP_UPTO;
+ *code += OP_POSUPTO - OP_MINUPTO;
break;
}
}
@@ -4062,12 +4081,16 @@ for (c = *cptr; c <= d; c++)
if (c > d) return -1; /* Reached end of range */
+/* Found a character that has a single other case. Search for the end of the
+range, which is either the end of the input range, or a character that has zero
+or more than one other cases. */
+
*ocptr = othercase;
next = othercase + 1;
for (++c; c <= d; c++)
{
- if (UCD_OTHERCASE(c) != next) break;
+ if ((co = UCD_CASESET(c)) != 0 || UCD_OTHERCASE(c) != next) break;
next++;
}
@@ -4105,6 +4128,7 @@ add_to_class(pcre_uint8 *classbits, pcre_uchar **uchardptr, int options,
compile_data *cd, pcre_uint32 start, pcre_uint32 end)
{
pcre_uint32 c;
+pcre_uint32 classbits_end = (end <= 0xff ? end : 0xff);
int n8 = 0;
/* If caseless matching is required, scan the range and process alternate
@@ -4148,7 +4172,7 @@ if ((options & PCRE_CASELESS) != 0)
/* Not UTF-mode, or no UCP */
- for (c = start; c <= end && c < 256; c++)
+ for (c = start; c <= classbits_end; c++)
{
SETBIT(classbits, cd->fcc[c]);
n8++;
@@ -4173,22 +4197,21 @@ in all cases. */
#endif /* COMPILE_PCRE[8|16] */
-/* If all characters are less than 256, use the bit map. Otherwise use extra
-data. */
+/* Use the bitmap for characters < 256. Otherwise use extra data.*/
-if (end < 0x100)
+for (c = start; c <= classbits_end; c++)
{
- for (c = start; c <= end; c++)
- {
- n8++;
- SETBIT(classbits, c);
- }
+ /* Regardless of start, c will always be <= 255. */
+ SETBIT(classbits, c);
+ n8++;
}
-else
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+if (start <= 0xff) start = 0xff + 1;
+
+if (end >= start)
{
pcre_uchar *uchardata = *uchardptr;
-
#ifdef SUPPORT_UTF
if ((options & PCRE_UTF8) != 0) /* All UTFs use the same flag bit */
{
@@ -4228,6 +4251,7 @@ else
*uchardptr = uchardata; /* Updata extra data pointer */
}
+#endif /* SUPPORT_UTF || !COMPILE_PCRE8 */
return n8; /* Number of 8-bit characters */
}
@@ -4449,6 +4473,9 @@ for (;; ptr++)
BOOL reset_bracount;
int class_has_8bitchar;
int class_one_char;
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+ BOOL xclass_has_prop;
+#endif
int newoptions;
int recno;
int refsign;
@@ -4703,12 +4730,6 @@ for (;; ptr++)
goto FAILED;
}
goto NORMAL_CHAR;
-
- /* In another (POSIX) regex library, the ugly syntax [[:<:]] and [[:>:]] is
- used for "start of word" and "end of word". As these are otherwise illegal
- sequences, we don't break anything by recognizing them. They are replaced
- by \b(?=\w) and \b(?<=\w) respectively. Sequences like [a[:<:]] are
- erroneous and are handled by the normal code below. */
/* In another (POSIX) regex library, the ugly syntax [[:<:]] and [[:>:]] is
used for "start of word" and "end of word". As these are otherwise illegal
@@ -4789,13 +4810,26 @@ for (;; ptr++)
should_flip_negation = FALSE;
+ /* Extended class (xclass) will be used when characters > 255
+ might match. */
+
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+ xclass = FALSE;
+ class_uchardata = code + LINK_SIZE + 2; /* For XCLASS items */
+ class_uchardata_base = class_uchardata; /* Save the start */
+#endif
+
/* For optimization purposes, we track some properties of the class:
class_has_8bitchar will be non-zero if the class contains at least one <
256 character; class_one_char will be 1 if the class contains just one
- character. */
+ character; xclass_has_prop will be TRUE if unicode property checks
+ are present in the class. */
class_has_8bitchar = 0;
class_one_char = 0;
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+ xclass_has_prop = FALSE;
+#endif
/* Initialize the 32-char bit map to all zeros. We build the map in a
temporary bit of memory, in case the class contains fewer than two
@@ -4804,12 +4838,6 @@ for (;; ptr++)
memset(classbits, 0, 32 * sizeof(pcre_uint8));
-#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
- xclass = FALSE;
- class_uchardata = code + LINK_SIZE + 2; /* For XCLASS items */
- class_uchardata_base = class_uchardata; /* Save the start */
-#endif
-
/* Process characters until ] is reached. By writing this as a "do" it
means that an initial ] is taken as a data character. At the start of the
loop, c contains the first byte of the character. */
@@ -4933,6 +4961,7 @@ for (;; ptr++)
*class_uchardata++ = local_negate? XCL_NOTPROP : XCL_PROP;
*class_uchardata++ = ptype;
*class_uchardata++ = 0;
+ xclass_has_prop = TRUE;
ptr = tempptr + 1;
continue;
@@ -5115,6 +5144,7 @@ for (;; ptr++)
XCL_PROP : XCL_NOTPROP;
*class_uchardata++ = ptype;
*class_uchardata++ = pdata;
+ xclass_has_prop = TRUE;
class_has_8bitchar--; /* Undo! */
continue;
}
@@ -5409,6 +5439,7 @@ for (;; ptr++)
*code++ = OP_XCLASS;
code += LINK_SIZE;
*code = negate_class? XCL_NOT:0;
+ if (xclass_has_prop) *code |= XCL_HASPROP;
/* If the map is required, move up the extra data to make room for it;
otherwise just move the code pointer to the end of the extra data. */
@@ -5418,6 +5449,8 @@ for (;; ptr++)
*code++ |= XCL_MAP;
memmove(code + (32 / sizeof(pcre_uchar)), code,
IN_UCHARS(class_uchardata - code));
+ if (negate_class && !xclass_has_prop)
+ for (c = 0; c < 32; c++) classbits[c] = ~classbits[c];
memcpy(code, classbits, 32);
code = class_uchardata + (32 / sizeof(pcre_uchar));
}
@@ -6586,7 +6619,10 @@ for (;; ptr++)
code[1+LINK_SIZE] = OP_CREF;
skipbytes = 1+IMM2_SIZE;
- refsign = -1;
+ refsign = -1; /* => not a number */
+ namelen = -1; /* => not a name; must set to avoid warning */
+ name = NULL; /* Always set to avoid warning */
+ recno = 0; /* Always set to avoid warning */
/* Check for a test for recursion in a named group. */
@@ -6623,7 +6659,6 @@ for (;; ptr++)
if (refsign >= 0)
{
- recno = 0;
while (IS_DIGIT(*ptr))
{
recno = recno * 10 + (int)(*ptr - CHAR_0);
@@ -8000,12 +8035,16 @@ unsigned int orig_bracount;
unsigned int max_bracount;
branch_chain bc;
-if (pcre_stack_guard && pcre_stack_guard())
-{
- *errorcodeptr= ERR23;
+/* If set, call the external function that checks for stack availability. */
+
+if (PUBL(stack_guard) != NULL && PUBL(stack_guard)())
+ {
+ *errorcodeptr= ERR85;
return FALSE;
-}
-
+ }
+
+/* Miscellaneous initialization */
+
bc.outer = bcptr;
bc.current_branch = code;
diff --git a/pcre/pcre_dfa_exec.c b/pcre/pcre_dfa_exec.c
index 4cbcf9106c4..fb0c7e805dc 100644
--- a/pcre/pcre_dfa_exec.c
+++ b/pcre/pcre_dfa_exec.c
@@ -7,7 +7,7 @@ and semantics are as close as possible to those of the Perl 5 language (but see
below for why this module is different).
Written by Philip Hazel
- Copyright (c) 1997-2013 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -1473,7 +1473,7 @@ for (;;)
goto ANYNL01;
case CHAR_CR:
- if (ptr + 1 < end_subject && RAWUCHARTEST(ptr + 1) == CHAR_LF) ncount = 1;
+ if (ptr + 1 < end_subject && UCHAR21TEST(ptr + 1) == CHAR_LF) ncount = 1;
/* Fall through */
ANYNL01:
@@ -1742,7 +1742,7 @@ for (;;)
goto ANYNL02;
case CHAR_CR:
- if (ptr + 1 < end_subject && RAWUCHARTEST(ptr + 1) == CHAR_LF) ncount = 1;
+ if (ptr + 1 < end_subject && UCHAR21TEST(ptr + 1) == CHAR_LF) ncount = 1;
/* Fall through */
ANYNL02:
@@ -2012,7 +2012,7 @@ for (;;)
goto ANYNL03;
case CHAR_CR:
- if (ptr + 1 < end_subject && RAWUCHARTEST(ptr + 1) == CHAR_LF) ncount = 1;
+ if (ptr + 1 < end_subject && UCHAR21TEST(ptr + 1) == CHAR_LF) ncount = 1;
/* Fall through */
ANYNL03:
@@ -2210,7 +2210,7 @@ for (;;)
if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
reset_could_continue = TRUE;
}
- else if (RAWUCHARTEST(ptr + 1) == CHAR_LF)
+ else if (UCHAR21TEST(ptr + 1) == CHAR_LF)
{
ADD_NEW_DATA(-(state_offset + 1), 0, 1);
}
@@ -3466,7 +3466,7 @@ for (;;)
if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
{
- /* Advance to a known first char. */
+ /* Advance to a known first pcre_uchar (i.e. data item) */
if (has_first_char)
{
@@ -3474,12 +3474,12 @@ for (;;)
{
pcre_uchar csc;
while (current_subject < end_subject &&
- (csc = RAWUCHARTEST(current_subject)) != first_char && csc != first_char2)
+ (csc = UCHAR21TEST(current_subject)) != first_char && csc != first_char2)
current_subject++;
}
else
while (current_subject < end_subject &&
- RAWUCHARTEST(current_subject) != first_char)
+ UCHAR21TEST(current_subject) != first_char)
current_subject++;
}
@@ -3509,36 +3509,26 @@ for (;;)
ANYCRLF, and we are now at a LF, advance the match position by one
more character. */
- if (RAWUCHARTEST(current_subject - 1) == CHAR_CR &&
+ if (UCHAR21TEST(current_subject - 1) == CHAR_CR &&
(md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
current_subject < end_subject &&
- RAWUCHARTEST(current_subject) == CHAR_NL)
+ UCHAR21TEST(current_subject) == CHAR_NL)
current_subject++;
}
}
- /* Or to a non-unique first char after study */
+ /* Advance to a non-unique first pcre_uchar after study */
else if (start_bits != NULL)
{
while (current_subject < end_subject)
{
- register pcre_uint32 c = RAWUCHARTEST(current_subject);
+ register pcre_uint32 c = UCHAR21TEST(current_subject);
#ifndef COMPILE_PCRE8
if (c > 255) c = 255;
#endif
- if ((start_bits[c/8] & (1 << (c&7))) == 0)
- {
- current_subject++;
-#if defined SUPPORT_UTF && defined COMPILE_PCRE8
- /* In non 8-bit mode, the iteration will stop for
- characters > 255 at the beginning or not stop at all. */
- if (utf)
- ACROSSCHAR(current_subject < end_subject, *current_subject,
- current_subject++);
-#endif
- }
- else break;
+ if ((start_bits[c/8] & (1 << (c&7))) != 0) break;
+ current_subject++;
}
}
}
@@ -3557,19 +3547,20 @@ for (;;)
/* If the pattern was studied, a minimum subject length may be set. This
is a lower bound; no actual string of that length may actually match the
pattern. Although the value is, strictly, in characters, we treat it as
- bytes to avoid spending too much time in this optimization. */
+ in pcre_uchar units to avoid spending too much time in this optimization.
+ */
if (study != NULL && (study->flags & PCRE_STUDY_MINLEN) != 0 &&
(pcre_uint32)(end_subject - current_subject) < study->minlength)
return PCRE_ERROR_NOMATCH;
- /* If req_char is set, we know that that character must appear in the
- subject for the match to succeed. If the first character is set, req_char
- must be later in the subject; otherwise the test starts at the match
- point. This optimization can save a huge amount of work in patterns with
- nested unlimited repeats that aren't going to match. Writing separate
- code for cased/caseless versions makes it go faster, as does using an
- autoincrement and backing off on a match.
+ /* If req_char is set, we know that that pcre_uchar must appear in the
+ subject for the match to succeed. If the first pcre_uchar is set,
+ req_char must be later in the subject; otherwise the test starts at the
+ match point. This optimization can save a huge amount of work in patterns
+ with nested unlimited repeats that aren't going to match. Writing
+ separate code for cased/caseless versions makes it go faster, as does
+ using an autoincrement and backing off on a match.
HOWEVER: when the subject string is very, very long, searching to its end
can take a long time, and give bad performance on quite ordinary
@@ -3589,7 +3580,7 @@ for (;;)
{
while (p < end_subject)
{
- register pcre_uint32 pp = RAWUCHARINCTEST(p);
+ register pcre_uint32 pp = UCHAR21INCTEST(p);
if (pp == req_char || pp == req_char2) { p--; break; }
}
}
@@ -3597,18 +3588,18 @@ for (;;)
{
while (p < end_subject)
{
- if (RAWUCHARINCTEST(p) == req_char) { p--; break; }
+ if (UCHAR21INCTEST(p) == req_char) { p--; break; }
}
}
- /* If we can't find the required character, break the matching loop,
+ /* If we can't find the required pcre_uchar, break the matching loop,
which will cause a return or PCRE_ERROR_NOMATCH. */
if (p >= end_subject) break;
- /* If we have found the required character, save the point where we
+ /* If we have found the required pcre_uchar, save the point where we
found it, so that we don't search again next time round the loop if
- the start hasn't passed this character yet. */
+ the start hasn't passed this point yet. */
req_char_ptr = p;
}
@@ -3665,9 +3656,9 @@ for (;;)
not contain any explicit matches for \r or \n, and the newline option is CRLF
or ANY or ANYCRLF, advance the match position by one more character. */
- if (RAWUCHARTEST(current_subject - 1) == CHAR_CR &&
+ if (UCHAR21TEST(current_subject - 1) == CHAR_CR &&
current_subject < end_subject &&
- RAWUCHARTEST(current_subject) == CHAR_NL &&
+ UCHAR21TEST(current_subject) == CHAR_NL &&
(re->flags & PCRE_HASCRORLF) == 0 &&
(md->nltype == NLTYPE_ANY ||
md->nltype == NLTYPE_ANYCRLF ||
diff --git a/pcre/pcre_exec.c b/pcre/pcre_exec.c
index a3f0c1923f2..5dec99234a9 100644
--- a/pcre/pcre_exec.c
+++ b/pcre/pcre_exec.c
@@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
- Copyright (c) 1997-2013 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -134,7 +134,7 @@ pcre_uint32 c;
BOOL utf = md->utf;
if (is_subject && length > md->end_subject - p) length = md->end_subject - p;
while (length-- > 0)
- if (isprint(c = RAWUCHARINCTEST(p))) printf("%c", (char)c); else printf("\\x{%02x}", c);
+ if (isprint(c = UCHAR21INCTEST(p))) printf("%c", (char)c); else printf("\\x{%02x}", c);
}
#endif
@@ -237,8 +237,8 @@ if (caseless)
{
pcre_uint32 cc, cp;
if (eptr >= md->end_subject) return -2; /* Partial match */
- cc = RAWUCHARTEST(eptr);
- cp = RAWUCHARTEST(p);
+ cc = UCHAR21TEST(eptr);
+ cp = UCHAR21TEST(p);
if (TABLE_GET(cp, md->lcc, cp) != TABLE_GET(cc, md->lcc, cc)) return -1;
p++;
eptr++;
@@ -254,7 +254,7 @@ else
while (length-- > 0)
{
if (eptr >= md->end_subject) return -2; /* Partial match */
- if (RAWUCHARINCTEST(p) != RAWUCHARINCTEST(eptr)) return -1;
+ if (UCHAR21INCTEST(p) != UCHAR21INCTEST(eptr)) return -1;
}
}
@@ -2103,7 +2103,7 @@ for (;;)
eptr + 1 >= md->end_subject &&
NLBLOCK->nltype == NLTYPE_FIXED &&
NLBLOCK->nllen == 2 &&
- RAWUCHARTEST(eptr) == NLBLOCK->nl[0])
+ UCHAR21TEST(eptr) == NLBLOCK->nl[0])
{
md->hitend = TRUE;
if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL);
@@ -2147,7 +2147,7 @@ for (;;)
eptr + 1 >= md->end_subject &&
NLBLOCK->nltype == NLTYPE_FIXED &&
NLBLOCK->nllen == 2 &&
- RAWUCHARTEST(eptr) == NLBLOCK->nl[0])
+ UCHAR21TEST(eptr) == NLBLOCK->nl[0])
{
md->hitend = TRUE;
if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL);
@@ -2290,7 +2290,7 @@ for (;;)
eptr + 1 >= md->end_subject &&
NLBLOCK->nltype == NLTYPE_FIXED &&
NLBLOCK->nllen == 2 &&
- RAWUCHARTEST(eptr) == NLBLOCK->nl[0])
+ UCHAR21TEST(eptr) == NLBLOCK->nl[0])
{
md->hitend = TRUE;
if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL);
@@ -2444,7 +2444,7 @@ for (;;)
{
SCHECK_PARTIAL();
}
- else if (RAWUCHARTEST(eptr) == CHAR_LF) eptr++;
+ else if (UCHAR21TEST(eptr) == CHAR_LF) eptr++;
break;
case CHAR_LF:
@@ -2691,16 +2691,22 @@ for (;;)
pcre_uchar *slot = md->name_table + GET2(ecode, 1) * md->name_entry_size;
ecode += 1 + 2*IMM2_SIZE;
+ /* Setting the default length first and initializing 'offset' avoids
+ compiler warnings in the REF_REPEAT code. */
+
+ length = (md->jscript_compat)? 0 : -1;
+ offset = 0;
+
while (count-- > 0)
{
offset = GET2(slot, 0) << 1;
- if (offset < offset_top && md->offset_vector[offset] >= 0) break;
+ if (offset < offset_top && md->offset_vector[offset] >= 0)
+ {
+ length = md->offset_vector[offset+1] - md->offset_vector[offset];
+ break;
+ }
slot += md->name_entry_size;
}
- if (count < 0)
- length = (md->jscript_compat)? 0 : -1;
- else
- length = md->offset_vector[offset+1] - md->offset_vector[offset];
}
goto REF_REPEAT;
@@ -3212,7 +3218,7 @@ for (;;)
CHECK_PARTIAL(); /* Not SCHECK_PARTIAL() */
RRETURN(MATCH_NOMATCH);
}
- while (length-- > 0) if (*ecode++ != RAWUCHARINC(eptr)) RRETURN(MATCH_NOMATCH);
+ while (length-- > 0) if (*ecode++ != UCHAR21INC(eptr)) RRETURN(MATCH_NOMATCH);
}
else
#endif
@@ -3252,7 +3258,7 @@ for (;;)
if (fc < 128)
{
- pcre_uint32 cc = RAWUCHAR(eptr);
+ pcre_uint32 cc = UCHAR21(eptr);
if (md->lcc[fc] != TABLE_GET(cc, md->lcc, cc)) RRETURN(MATCH_NOMATCH);
ecode++;
eptr++;
@@ -3521,7 +3527,7 @@ for (;;)
SCHECK_PARTIAL();
RRETURN(MATCH_NOMATCH);
}
- cc = RAWUCHARTEST(eptr);
+ cc = UCHAR21TEST(eptr);
if (fc != cc && foc != cc) RRETURN(MATCH_NOMATCH);
eptr++;
}
@@ -3539,7 +3545,7 @@ for (;;)
SCHECK_PARTIAL();
RRETURN(MATCH_NOMATCH);
}
- cc = RAWUCHARTEST(eptr);
+ cc = UCHAR21TEST(eptr);
if (fc != cc && foc != cc) RRETURN(MATCH_NOMATCH);
eptr++;
}
@@ -3556,7 +3562,7 @@ for (;;)
SCHECK_PARTIAL();
break;
}
- cc = RAWUCHARTEST(eptr);
+ cc = UCHAR21TEST(eptr);
if (fc != cc && foc != cc) break;
eptr++;
}
@@ -3583,7 +3589,7 @@ for (;;)
SCHECK_PARTIAL();
RRETURN(MATCH_NOMATCH);
}
- if (fc != RAWUCHARINCTEST(eptr)) RRETURN(MATCH_NOMATCH);
+ if (fc != UCHAR21INCTEST(eptr)) RRETURN(MATCH_NOMATCH);
}
if (min == max) continue;
@@ -3600,7 +3606,7 @@ for (;;)
SCHECK_PARTIAL();
RRETURN(MATCH_NOMATCH);
}
- if (fc != RAWUCHARINCTEST(eptr)) RRETURN(MATCH_NOMATCH);
+ if (fc != UCHAR21INCTEST(eptr)) RRETURN(MATCH_NOMATCH);
}
/* Control never gets here */
}
@@ -3614,7 +3620,7 @@ for (;;)
SCHECK_PARTIAL();
break;
}
- if (fc != RAWUCHARTEST(eptr)) break;
+ if (fc != UCHAR21TEST(eptr)) break;
eptr++;
}
if (possessive) continue; /* No backtracking */
@@ -4369,7 +4375,7 @@ for (;;)
eptr + 1 >= md->end_subject &&
NLBLOCK->nltype == NLTYPE_FIXED &&
NLBLOCK->nllen == 2 &&
- RAWUCHAR(eptr) == NLBLOCK->nl[0])
+ UCHAR21(eptr) == NLBLOCK->nl[0])
{
md->hitend = TRUE;
if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL);
@@ -4411,7 +4417,7 @@ for (;;)
default: RRETURN(MATCH_NOMATCH);
case CHAR_CR:
- if (eptr < md->end_subject && RAWUCHAR(eptr) == CHAR_LF) eptr++;
+ if (eptr < md->end_subject && UCHAR21(eptr) == CHAR_LF) eptr++;
break;
case CHAR_LF:
@@ -4521,7 +4527,7 @@ for (;;)
SCHECK_PARTIAL();
RRETURN(MATCH_NOMATCH);
}
- cc = RAWUCHAR(eptr);
+ cc = UCHAR21(eptr);
if (cc >= 128 || (md->ctypes[cc] & ctype_digit) == 0)
RRETURN(MATCH_NOMATCH);
eptr++;
@@ -4538,7 +4544,7 @@ for (;;)
SCHECK_PARTIAL();
RRETURN(MATCH_NOMATCH);
}
- cc = RAWUCHAR(eptr);
+ cc = UCHAR21(eptr);
if (cc < 128 && (md->ctypes[cc] & ctype_space) != 0)
RRETURN(MATCH_NOMATCH);
eptr++;
@@ -4555,7 +4561,7 @@ for (;;)
SCHECK_PARTIAL();
RRETURN(MATCH_NOMATCH);
}
- cc = RAWUCHAR(eptr);
+ cc = UCHAR21(eptr);
if (cc >= 128 || (md->ctypes[cc] & ctype_space) == 0)
RRETURN(MATCH_NOMATCH);
eptr++;
@@ -4572,7 +4578,7 @@ for (;;)
SCHECK_PARTIAL();
RRETURN(MATCH_NOMATCH);
}
- cc = RAWUCHAR(eptr);
+ cc = UCHAR21(eptr);
if (cc < 128 && (md->ctypes[cc] & ctype_word) != 0)
RRETURN(MATCH_NOMATCH);
eptr++;
@@ -4589,7 +4595,7 @@ for (;;)
SCHECK_PARTIAL();
RRETURN(MATCH_NOMATCH);
}
- cc = RAWUCHAR(eptr);
+ cc = UCHAR21(eptr);
if (cc >= 128 || (md->ctypes[cc] & ctype_word) == 0)
RRETURN(MATCH_NOMATCH);
eptr++;
@@ -5150,7 +5156,7 @@ for (;;)
{
default: RRETURN(MATCH_NOMATCH);
case CHAR_CR:
- if (eptr < md->end_subject && RAWUCHAR(eptr) == CHAR_LF) eptr++;
+ if (eptr < md->end_subject && UCHAR21(eptr) == CHAR_LF) eptr++;
break;
case CHAR_LF:
@@ -5689,7 +5695,7 @@ for (;;)
eptr + 1 >= md->end_subject &&
NLBLOCK->nltype == NLTYPE_FIXED &&
NLBLOCK->nllen == 2 &&
- RAWUCHAR(eptr) == NLBLOCK->nl[0])
+ UCHAR21(eptr) == NLBLOCK->nl[0])
{
md->hitend = TRUE;
if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL);
@@ -5715,7 +5721,7 @@ for (;;)
eptr + 1 >= md->end_subject &&
NLBLOCK->nltype == NLTYPE_FIXED &&
NLBLOCK->nllen == 2 &&
- RAWUCHAR(eptr) == NLBLOCK->nl[0])
+ UCHAR21(eptr) == NLBLOCK->nl[0])
{
md->hitend = TRUE;
if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL);
@@ -5772,7 +5778,7 @@ for (;;)
if (c == CHAR_CR)
{
if (++eptr >= md->end_subject) break;
- if (RAWUCHAR(eptr) == CHAR_LF) eptr++;
+ if (UCHAR21(eptr) == CHAR_LF) eptr++;
}
else
{
@@ -5935,8 +5941,8 @@ for (;;)
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
eptr--;
BACKCHAR(eptr);
- if (ctype == OP_ANYNL && eptr > pp && RAWUCHAR(eptr) == CHAR_NL &&
- RAWUCHAR(eptr - 1) == CHAR_CR) eptr--;
+ if (ctype == OP_ANYNL && eptr > pp && UCHAR21(eptr) == CHAR_NL &&
+ UCHAR21(eptr - 1) == CHAR_CR) eptr--;
}
}
else
@@ -6783,10 +6789,10 @@ for(;;)
if (first_char != first_char2)
while (start_match < end_subject &&
- (smc = RAWUCHARTEST(start_match)) != first_char && smc != first_char2)
+ (smc = UCHAR21TEST(start_match)) != first_char && smc != first_char2)
start_match++;
else
- while (start_match < end_subject && RAWUCHARTEST(start_match) != first_char)
+ while (start_match < end_subject && UCHAR21TEST(start_match) != first_char)
start_match++;
}
@@ -6818,7 +6824,7 @@ for(;;)
if (start_match[-1] == CHAR_CR &&
(md->nltype == NLTYPE_ANY || md->nltype == NLTYPE_ANYCRLF) &&
start_match < end_subject &&
- RAWUCHARTEST(start_match) == CHAR_NL)
+ UCHAR21TEST(start_match) == CHAR_NL)
start_match++;
}
}
@@ -6829,22 +6835,12 @@ for(;;)
{
while (start_match < end_subject)
{
- register pcre_uint32 c = RAWUCHARTEST(start_match);
+ register pcre_uint32 c = UCHAR21TEST(start_match);
#ifndef COMPILE_PCRE8
if (c > 255) c = 255;
#endif
- if ((start_bits[c/8] & (1 << (c&7))) == 0)
- {
- start_match++;
-#if defined SUPPORT_UTF && defined COMPILE_PCRE8
- /* In non 8-bit mode, the iteration will stop for
- characters > 255 at the beginning or not stop at all. */
- if (utf)
- ACROSSCHAR(start_match < end_subject, *start_match,
- start_match++);
-#endif
- }
- else break;
+ if ((start_bits[c/8] & (1 << (c&7))) != 0) break;
+ start_match++;
}
}
} /* Starting optimizations */
@@ -6897,7 +6893,7 @@ for(;;)
{
while (p < end_subject)
{
- register pcre_uint32 pp = RAWUCHARINCTEST(p);
+ register pcre_uint32 pp = UCHAR21INCTEST(p);
if (pp == req_char || pp == req_char2) { p--; break; }
}
}
@@ -6905,7 +6901,7 @@ for(;;)
{
while (p < end_subject)
{
- if (RAWUCHARINCTEST(p) == req_char) { p--; break; }
+ if (UCHAR21INCTEST(p) == req_char) { p--; break; }
}
}
diff --git a/pcre/pcre_globals.c b/pcre/pcre_globals.c
index 3f878144a27..0f106aa9013 100644
--- a/pcre/pcre_globals.c
+++ b/pcre/pcre_globals.c
@@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
- Copyright (c) 1997-2012 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
diff --git a/pcre/pcre_internal.h b/pcre/pcre_internal.h
index 0b9798c5541..6e915a0e453 100644
--- a/pcre/pcre_internal.h
+++ b/pcre/pcre_internal.h
@@ -7,7 +7,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
- Copyright (c) 1997-2013 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -316,8 +316,8 @@ start/end of string field names are. */
&(NLBLOCK->nllen), utf)) \
: \
((p) <= NLBLOCK->PSEND - NLBLOCK->nllen && \
- RAWUCHARTEST(p) == NLBLOCK->nl[0] && \
- (NLBLOCK->nllen == 1 || RAWUCHARTEST(p+1) == NLBLOCK->nl[1]) \
+ UCHAR21TEST(p) == NLBLOCK->nl[0] && \
+ (NLBLOCK->nllen == 1 || UCHAR21TEST(p+1) == NLBLOCK->nl[1]) \
) \
)
@@ -330,8 +330,8 @@ start/end of string field names are. */
&(NLBLOCK->nllen), utf)) \
: \
((p) >= NLBLOCK->PSSTART + NLBLOCK->nllen && \
- RAWUCHARTEST(p - NLBLOCK->nllen) == NLBLOCK->nl[0] && \
- (NLBLOCK->nllen == 1 || RAWUCHARTEST(p - NLBLOCK->nllen + 1) == NLBLOCK->nl[1]) \
+ UCHAR21TEST(p - NLBLOCK->nllen) == NLBLOCK->nl[0] && \
+ (NLBLOCK->nllen == 1 || UCHAR21TEST(p - NLBLOCK->nllen + 1) == NLBLOCK->nl[1]) \
) \
)
@@ -582,12 +582,27 @@ changed in future to be a fixed number of bytes or to depend on LINK_SIZE. */
#define MAX_MARK ((1u << 8) - 1)
#endif
+/* There is a proposed future special "UTF-21" mode, in which only the lowest
+21 bits of a 32-bit character are interpreted as UTF, with the remaining 11
+high-order bits available to the application for other uses. In preparation for
+the future implementation of this mode, there are macros that load a data item
+and, if in this special mode, mask it to 21 bits. These macros all have names
+starting with UCHAR21. In all other modes, including the normal 32-bit
+library, the macros all have the same simple definitions. When the new mode is
+implemented, it is expected that these definitions will be varied appropriately
+using #ifdef when compiling the library that supports the special mode. */
+
+#define UCHAR21(eptr) (*(eptr))
+#define UCHAR21TEST(eptr) (*(eptr))
+#define UCHAR21INC(eptr) (*(eptr)++)
+#define UCHAR21INCTEST(eptr) (*(eptr)++)
+
/* When UTF encoding is being used, a character is no longer just a single
-byte. The macros for character handling generate simple sequences when used in
-character-mode, and more complicated ones for UTF characters. GETCHARLENTEST
-and other macros are not used when UTF is not supported, so they are not
-defined. To make sure they can never even appear when UTF support is omitted,
-we don't even define them. */
+byte in 8-bit mode or a single short in 16-bit mode. The macros for character
+handling generate simple sequences when used in the basic mode, and more
+complicated ones for UTF characters. GETCHARLENTEST and other macros are not
+used when UTF is not supported. To make sure they can never even appear when
+UTF support is omitted, we don't even define them. */
#ifndef SUPPORT_UTF
@@ -600,10 +615,6 @@ we don't even define them. */
#define GETCHARINC(c, eptr) c = *eptr++;
#define GETCHARINCTEST(c, eptr) c = *eptr++;
#define GETCHARLEN(c, eptr, len) c = *eptr;
-#define RAWUCHAR(eptr) (*(eptr))
-#define RAWUCHARINC(eptr) (*(eptr)++)
-#define RAWUCHARTEST(eptr) (*(eptr))
-#define RAWUCHARINCTEST(eptr) (*(eptr)++)
/* #define GETCHARLENTEST(c, eptr, len) */
/* #define BACKCHAR(eptr) */
/* #define FORWARDCHAR(eptr) */
@@ -776,30 +787,6 @@ do not know if we are in UTF-8 mode. */
c = *eptr; \
if (utf && c >= 0xc0) GETUTF8LEN(c, eptr, len);
-/* Returns the next uchar, not advancing the pointer. This is called when
-we know we are in UTF mode. */
-
-#define RAWUCHAR(eptr) \
- (*(eptr))
-
-/* Returns the next uchar, advancing the pointer. This is called when
-we know we are in UTF mode. */
-
-#define RAWUCHARINC(eptr) \
- (*((eptr)++))
-
-/* Returns the next uchar, testing for UTF mode, and not advancing the
-pointer. */
-
-#define RAWUCHARTEST(eptr) \
- (*(eptr))
-
-/* Returns the next uchar, testing for UTF mode, advancing the
-pointer. */
-
-#define RAWUCHARINCTEST(eptr) \
- (*((eptr)++))
-
/* If the pointer is not at the start of a character, move it back until
it is. This is called only in UTF-8 mode - we don't put a test within the macro
because almost all calls are already within a block of UTF-8 only code. */
@@ -895,30 +882,6 @@ we do not know if we are in UTF-16 mode. */
c = *eptr; \
if (utf && (c & 0xfc00) == 0xd800) GETUTF16LEN(c, eptr, len);
-/* Returns the next uchar, not advancing the pointer. This is called when
-we know we are in UTF mode. */
-
-#define RAWUCHAR(eptr) \
- (*(eptr))
-
-/* Returns the next uchar, advancing the pointer. This is called when
-we know we are in UTF mode. */
-
-#define RAWUCHARINC(eptr) \
- (*((eptr)++))
-
-/* Returns the next uchar, testing for UTF mode, and not advancing the
-pointer. */
-
-#define RAWUCHARTEST(eptr) \
- (*(eptr))
-
-/* Returns the next uchar, testing for UTF mode, advancing the
-pointer. */
-
-#define RAWUCHARINCTEST(eptr) \
- (*((eptr)++))
-
/* If the pointer is not at the start of a character, move it back until
it is. This is called only in UTF-16 mode - we don't put a test within the
macro because almost all calls are already within a block of UTF-16 only
@@ -980,30 +943,6 @@ This is called when we do not know if we are in UTF-32 mode. */
#define GETCHARLENTEST(c, eptr, len) \
GETCHARTEST(c, eptr)
-/* Returns the next uchar, not advancing the pointer. This is called when
-we know we are in UTF mode. */
-
-#define RAWUCHAR(eptr) \
- (*(eptr))
-
-/* Returns the next uchar, advancing the pointer. This is called when
-we know we are in UTF mode. */
-
-#define RAWUCHARINC(eptr) \
- (*((eptr)++))
-
-/* Returns the next uchar, testing for UTF mode, and not advancing the
-pointer. */
-
-#define RAWUCHARTEST(eptr) \
- (*(eptr))
-
-/* Returns the next uchar, testing for UTF mode, advancing the
-pointer. */
-
-#define RAWUCHARINCTEST(eptr) \
- (*((eptr)++))
-
/* If the pointer is not at the start of a character, move it back until
it is. This is called only in UTF-32 mode - we don't put a test within the
macro because almost all calls are already within a block of UTF-32 only
@@ -1874,8 +1813,9 @@ table. */
/* Flag bits and data types for the extended class (OP_XCLASS) for classes that
contain characters with values greater than 255. */
-#define XCL_NOT 0x01 /* Flag: this is a negative class */
-#define XCL_MAP 0x02 /* Flag: a 32-byte map is present */
+#define XCL_NOT 0x01 /* Flag: this is a negative class */
+#define XCL_MAP 0x02 /* Flag: a 32-byte map is present */
+#define XCL_HASPROP 0x04 /* Flag: property checks are present. */
#define XCL_END 0 /* Marks end of individual items */
#define XCL_SINGLE 1 /* Single item (one multibyte char) follows */
@@ -2341,7 +2281,7 @@ enum { ERR0, ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9,
ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68, ERR69,
ERR70, ERR71, ERR72, ERR73, ERR74, ERR75, ERR76, ERR77, ERR78, ERR79,
- ERR80, ERR81, ERR82, ERR83, ERR84, ERRCOUNT };
+ ERR80, ERR81, ERR82, ERR83, ERR84, ERR85, ERRCOUNT };
/* JIT compiling modes. The function list is indexed by them. */
diff --git a/pcre/pcre_jit_compile.c b/pcre/pcre_jit_compile.c
index 63b32c51670..e67071ef791 100644
--- a/pcre/pcre_jit_compile.c
+++ b/pcre/pcre_jit_compile.c
@@ -179,11 +179,12 @@ typedef struct jit_arguments {
typedef struct executable_functions {
void *executable_funcs[JIT_NUMBER_OF_COMPILE_MODES];
+ sljit_uw *read_only_data[JIT_NUMBER_OF_COMPILE_MODES];
+ sljit_uw executable_sizes[JIT_NUMBER_OF_COMPILE_MODES];
PUBL(jit_callback) callback;
void *userdata;
pcre_uint32 top_bracket;
pcre_uint32 limit_match;
- sljit_uw executable_sizes[JIT_NUMBER_OF_COMPILE_MODES];
} executable_functions;
typedef struct jump_list {
@@ -197,6 +198,12 @@ typedef struct stub_list {
struct stub_list *next;
} stub_list;
+typedef struct label_addr_list {
+ struct sljit_label *label;
+ sljit_uw *addr;
+ struct label_addr_list *next;
+} label_addr_list;
+
enum frame_types {
no_frame = -1,
no_stack = -2
@@ -306,7 +313,7 @@ typedef struct then_trap_backtrack {
int framesize;
} then_trap_backtrack;
-#define MAX_RANGE_SIZE 6
+#define MAX_RANGE_SIZE 4
typedef struct compiler_common {
/* The sljit ceneric compiler. */
@@ -315,6 +322,12 @@ typedef struct compiler_common {
pcre_uchar *start;
/* Maps private data offset to each opcode. */
sljit_si *private_data_ptrs;
+ /* This read-only data is available during runtime. */
+ sljit_uw *read_only_data;
+ /* The total size of the read-only data. */
+ sljit_uw read_only_data_size;
+ /* The next free entry of the read_only_data. */
+ sljit_uw *read_only_data_ptr;
/* Tells whether the capturing bracket is optimized. */
pcre_uint8 *optimized_cbracket;
/* Tells whether the starting offset is a target of then. */
@@ -349,6 +362,8 @@ typedef struct compiler_common {
sljit_sw lcc;
/* Mode can be PCRE_STUDY_JIT_COMPILE and others. */
int mode;
+ /* TRUE, when minlength is greater than 0. */
+ BOOL might_be_empty;
/* \K is found in the pattern. */
BOOL has_set_som;
/* (*SKIP:arg) is found in the pattern. */
@@ -363,13 +378,16 @@ typedef struct compiler_common {
BOOL positive_assert;
/* Newline control. */
int nltype;
+ pcre_uint32 nlmax;
+ pcre_uint32 nlmin;
int newline;
int bsr_nltype;
+ pcre_uint32 bsr_nlmax;
+ pcre_uint32 bsr_nlmin;
/* Dollar endonly. */
int endonly;
/* Tables. */
sljit_sw ctypes;
- int digits[2 + MAX_RANGE_SIZE];
/* Named capturing brackets. */
pcre_uchar *name_table;
sljit_sw name_count;
@@ -380,7 +398,9 @@ typedef struct compiler_common {
struct sljit_label *quit_label;
struct sljit_label *forced_quit_label;
struct sljit_label *accept_label;
+ struct sljit_label *ff_newline_shortcut;
stub_list *stubs;
+ label_addr_list *label_addrs;
recurse_entry *entries;
recurse_entry *currententry;
jump_list *partialmatch;
@@ -404,10 +424,9 @@ typedef struct compiler_common {
#ifdef SUPPORT_UCP
BOOL use_ucp;
#endif
-#ifndef COMPILE_PCRE32
- jump_list *utfreadchar;
-#endif
#ifdef COMPILE_PCRE8
+ jump_list *utfreadchar;
+ jump_list *utfreadchar16;
jump_list *utfreadtype8;
#endif
#endif /* SUPPORT_UTF */
@@ -524,6 +543,8 @@ the start pointers when the end of the capturing group has not yet reached. */
#define GET_LOCAL_BASE(dst, dstw, offset) \
sljit_get_local_base(compiler, (dst), (dstw), (offset))
+#define READ_CHAR_MAX 0x7fffffff
+
static pcre_uchar* bracketend(pcre_uchar* cc)
{
SLJIT_ASSERT((*cc >= OP_ASSERT && *cc <= OP_ASSERTBACK_NOT) || (*cc >= OP_ONCE && *cc <= OP_SCOND));
@@ -533,6 +554,25 @@ cc += 1 + LINK_SIZE;
return cc;
}
+static int no_alternatives(pcre_uchar* cc)
+{
+int count = 0;
+SLJIT_ASSERT((*cc >= OP_ASSERT && *cc <= OP_ASSERTBACK_NOT) || (*cc >= OP_ONCE && *cc <= OP_SCOND));
+do
+ {
+ cc += GET(cc, 1);
+ count++;
+ }
+while (*cc == OP_ALT);
+SLJIT_ASSERT(*cc >= OP_KET && *cc <= OP_KETRPOS);
+return count;
+}
+
+static int ones_in_half_byte[16] = {
+ /* 0 */ 0, 1, 1, 2, /* 4 */ 1, 2, 2, 3,
+ /* 8 */ 1, 2, 2, 3, /* 12 */ 2, 3, 3, 4
+};
+
/* Functions whose might need modification for all new supported opcodes:
next_opcode
check_opcode_types
@@ -752,6 +792,7 @@ while (cc < ccend)
{
case OP_SET_SOM:
common->has_set_som = TRUE;
+ common->might_be_empty = TRUE;
cc += 1;
break;
@@ -761,6 +802,16 @@ while (cc < ccend)
cc += 1 + IMM2_SIZE;
break;
+ case OP_BRA:
+ case OP_CBRA:
+ case OP_SBRA:
+ case OP_SCBRA:
+ count = no_alternatives(cc);
+ if (count > 4)
+ common->read_only_data_size += count * sizeof(sljit_uw);
+ cc += 1 + LINK_SIZE + (*cc == OP_CBRA || *cc == OP_SCBRA ? IMM2_SIZE : 0);
+ break;
+
case OP_CBRAPOS:
case OP_SCBRAPOS:
common->optimized_cbracket[GET2(cc, 1 + LINK_SIZE)] = 0;
@@ -2019,6 +2070,21 @@ while (list_item)
common->stubs = NULL;
}
+static void add_label_addr(compiler_common *common)
+{
+DEFINE_COMPILER;
+label_addr_list *label_addr;
+
+label_addr = sljit_alloc_memory(compiler, sizeof(label_addr_list));
+if (label_addr == NULL)
+ return;
+label_addr->label = LABEL();
+label_addr->addr = common->read_only_data_ptr;
+label_addr->next = common->label_addrs;
+common->label_addrs = label_addr;
+common->read_only_data_ptr++;
+}
+
static SLJIT_INLINE void count_match(compiler_common *common)
{
DEFINE_COMPILER;
@@ -2457,107 +2523,290 @@ else
JUMPHERE(jump);
}
-static void read_char(compiler_common *common)
+static void peek_char(compiler_common *common, pcre_uint32 max)
{
-/* Reads the character into TMP1, updates STR_PTR.
+/* Reads the character into TMP1, keeps STR_PTR.
Does not check STR_END. TMP2 Destroyed. */
DEFINE_COMPILER;
#if defined SUPPORT_UTF && !defined COMPILE_PCRE32
struct sljit_jump *jump;
#endif
+SLJIT_UNUSED_ARG(max);
+
OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), 0);
-#if defined SUPPORT_UTF && !defined COMPILE_PCRE32
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
if (common->utf)
{
-#if defined COMPILE_PCRE8
+ if (max < 128) return;
+
jump = CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, 0xc0);
-#elif defined COMPILE_PCRE16
- jump = CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, 0xd800);
-#endif /* COMPILE_PCRE[8|16] */
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
add_jump(compiler, &common->utfreadchar, JUMP(SLJIT_FAST_CALL));
+ OP2(SLJIT_SUB, STR_PTR, 0, STR_PTR, 0, TMP2, 0);
JUMPHERE(jump);
}
#endif /* SUPPORT_UTF && !COMPILE_PCRE32 */
+
+#if defined SUPPORT_UTF && defined COMPILE_PCRE16
+if (common->utf)
+ {
+ if (max < 0xd800) return;
+
+ OP2(SLJIT_SUB, TMP2, 0, TMP1, 0, SLJIT_IMM, 0xd800);
+ jump = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 0xdc00 - 0xd800 - 1);
+ /* TMP2 contains the high surrogate. */
+ OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
+ OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x40);
+ OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 10);
+ OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3ff);
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
+ JUMPHERE(jump);
+ }
+#endif
+}
+
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+
+static BOOL is_char7_bitset(const pcre_uint8 *bitset, BOOL nclass)
+{
+/* Tells whether the character codes below 128 are enough
+to determine a match. */
+const pcre_uint8 value = nclass ? 0xff : 0;
+const pcre_uint8* end = bitset + 32;
+
+bitset += 16;
+do
+ {
+ if (*bitset++ != value)
+ return FALSE;
+ }
+while (bitset < end);
+return TRUE;
+}
+
+static void read_char7_type(compiler_common *common, BOOL full_read)
+{
+/* Reads the precise character type of a character into TMP1, if the character
+is less than 128. Otherwise it returns with zero. Does not check STR_END. The
+full_read argument tells whether characters above max are accepted or not. */
+DEFINE_COMPILER;
+struct sljit_jump *jump;
+
+SLJIT_ASSERT(common->utf);
+
+OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), 0);
OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+
+OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP2), common->ctypes);
+
+if (full_read)
+ {
+ jump = CMP(SLJIT_C_LESS, TMP2, 0, SLJIT_IMM, 0xc0);
+ OP1(SLJIT_MOV_UB, TMP2, 0, SLJIT_MEM1(TMP2), (sljit_sw)PRIV(utf8_table4) - 0xc0);
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, TMP2, 0);
+ JUMPHERE(jump);
+ }
}
-static void peek_char(compiler_common *common)
+#endif /* SUPPORT_UTF && COMPILE_PCRE8 */
+
+static void read_char_range(compiler_common *common, pcre_uint32 min, pcre_uint32 max, BOOL update_str_ptr)
{
-/* Reads the character into TMP1, keeps STR_PTR.
-Does not check STR_END. TMP2 Destroyed. */
+/* Reads the precise value of a character into TMP1, if the character is
+between min and max (c >= min && c <= max). Otherwise it returns with a value
+outside the range. Does not check STR_END. */
DEFINE_COMPILER;
#if defined SUPPORT_UTF && !defined COMPILE_PCRE32
struct sljit_jump *jump;
#endif
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+struct sljit_jump *jump2;
+#endif
-OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), 0);
-#if defined SUPPORT_UTF && !defined COMPILE_PCRE32
+SLJIT_UNUSED_ARG(update_str_ptr);
+SLJIT_UNUSED_ARG(min);
+SLJIT_UNUSED_ARG(max);
+SLJIT_ASSERT(min <= max);
+
+OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
+OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
if (common->utf)
{
-#if defined COMPILE_PCRE8
+ if (max < 128 && !update_str_ptr) return;
+
jump = CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, 0xc0);
-#elif defined COMPILE_PCRE16
- jump = CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, 0xd800);
-#endif /* COMPILE_PCRE[8|16] */
- add_jump(compiler, &common->utfreadchar, JUMP(SLJIT_FAST_CALL));
- OP2(SLJIT_SUB, STR_PTR, 0, STR_PTR, 0, TMP2, 0);
+ if (min >= 0x10000)
+ {
+ OP2(SLJIT_SUB, TMP2, 0, TMP1, 0, SLJIT_IMM, 0xf0);
+ if (update_str_ptr)
+ OP1(SLJIT_MOV_UB, RETURN_ADDR, 0, SLJIT_MEM1(TMP1), (sljit_sw)PRIV(utf8_table4) - 0xc0);
+ OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
+ jump2 = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 0x7);
+ OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 6);
+ OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3f);
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
+ OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
+ OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 6);
+ OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
+ OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(2));
+ if (!update_str_ptr)
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(3));
+ OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 6);
+ OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
+ JUMPHERE(jump2);
+ if (update_str_ptr)
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, RETURN_ADDR, 0);
+ }
+ else if (min >= 0x800 && max <= 0xffff)
+ {
+ OP2(SLJIT_SUB, TMP2, 0, TMP1, 0, SLJIT_IMM, 0xe0);
+ if (update_str_ptr)
+ OP1(SLJIT_MOV_UB, RETURN_ADDR, 0, SLJIT_MEM1(TMP1), (sljit_sw)PRIV(utf8_table4) - 0xc0);
+ OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
+ jump2 = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 0xf);
+ OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 6);
+ OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3f);
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
+ OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
+ if (!update_str_ptr)
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(2));
+ OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 6);
+ OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
+ JUMPHERE(jump2);
+ if (update_str_ptr)
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, RETURN_ADDR, 0);
+ }
+ else if (max >= 0x800)
+ add_jump(compiler, (max < 0x10000) ? &common->utfreadchar16 : &common->utfreadchar, JUMP(SLJIT_FAST_CALL));
+ else if (max < 128)
+ {
+ OP1(SLJIT_MOV_UB, TMP2, 0, SLJIT_MEM1(TMP1), (sljit_sw)PRIV(utf8_table4) - 0xc0);
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, TMP2, 0);
+ }
+ else
+ {
+ OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
+ if (!update_str_ptr)
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+ else
+ OP1(SLJIT_MOV_UB, RETURN_ADDR, 0, SLJIT_MEM1(TMP1), (sljit_sw)PRIV(utf8_table4) - 0xc0);
+ OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3f);
+ OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 6);
+ OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
+ if (update_str_ptr)
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, RETURN_ADDR, 0);
+ }
JUMPHERE(jump);
}
-#endif /* SUPPORT_UTF && !COMPILE_PCRE32 */
+#endif
+
+#if defined SUPPORT_UTF && defined COMPILE_PCRE16
+if (common->utf)
+ {
+ if (max >= 0x10000)
+ {
+ OP2(SLJIT_SUB, TMP2, 0, TMP1, 0, SLJIT_IMM, 0xd800);
+ jump = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 0xdc00 - 0xd800 - 1);
+ /* TMP2 contains the high surrogate. */
+ OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
+ OP2(SLJIT_ADD, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x40);
+ OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 10);
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+ OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3ff);
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
+ JUMPHERE(jump);
+ return;
+ }
+
+ if (max < 0xd800 && !update_str_ptr) return;
+
+ /* Skip low surrogate if necessary. */
+ OP2(SLJIT_SUB, TMP2, 0, TMP1, 0, SLJIT_IMM, 0xd800);
+ jump = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 0xdc00 - 0xd800 - 1);
+ if (update_str_ptr)
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+ if (max >= 0xd800)
+ OP1(SLJIT_MOV, TMP1, 0, SLJIT_IMM, 0x10000);
+ JUMPHERE(jump);
+ }
+#endif
+}
+
+static SLJIT_INLINE void read_char(compiler_common *common)
+{
+read_char_range(common, 0, READ_CHAR_MAX, TRUE);
}
-static void read_char8_type(compiler_common *common)
+static void read_char8_type(compiler_common *common, BOOL update_str_ptr)
{
/* Reads the character type into TMP1, updates STR_PTR. Does not check STR_END. */
DEFINE_COMPILER;
-#if defined SUPPORT_UTF || defined COMPILE_PCRE16 || defined COMPILE_PCRE32
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
struct sljit_jump *jump;
#endif
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+struct sljit_jump *jump2;
+#endif
-#ifdef SUPPORT_UTF
+SLJIT_UNUSED_ARG(update_str_ptr);
+
+OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), 0);
+OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
if (common->utf)
{
- OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), 0);
- OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
-#if defined COMPILE_PCRE8
/* This can be an extra read in some situations, but hopefully
it is needed in most cases. */
OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP2), common->ctypes);
jump = CMP(SLJIT_C_LESS, TMP2, 0, SLJIT_IMM, 0xc0);
- add_jump(compiler, &common->utfreadtype8, JUMP(SLJIT_FAST_CALL));
- JUMPHERE(jump);
-#elif defined COMPILE_PCRE16
- OP1(SLJIT_MOV, TMP1, 0, SLJIT_IMM, 0);
- jump = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 255);
- OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP2), common->ctypes);
- JUMPHERE(jump);
- /* Skip low surrogate if necessary. */
- OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0xfc00);
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP2, 0, SLJIT_IMM, 0xd800);
- OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_UNUSED, 0, SLJIT_C_EQUAL);
- OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 1);
- OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, TMP2, 0);
-#elif defined COMPILE_PCRE32
- OP1(SLJIT_MOV, TMP1, 0, SLJIT_IMM, 0);
- jump = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 255);
- OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP2), common->ctypes);
+ if (!update_str_ptr)
+ {
+ OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+ OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
+ OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 6);
+ OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3f);
+ OP2(SLJIT_OR, TMP2, 0, TMP2, 0, TMP1, 0);
+ OP1(SLJIT_MOV, TMP1, 0, SLJIT_IMM, 0);
+ jump2 = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 255);
+ OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP2), common->ctypes);
+ JUMPHERE(jump2);
+ }
+ else
+ add_jump(compiler, &common->utfreadtype8, JUMP(SLJIT_FAST_CALL));
JUMPHERE(jump);
-#endif /* COMPILE_PCRE[8|16|32] */
return;
}
-#endif /* SUPPORT_UTF */
-OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), 0);
-OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
-#if defined COMPILE_PCRE16 || defined COMPILE_PCRE32
+#endif /* SUPPORT_UTF && COMPILE_PCRE8 */
+
+#if !defined COMPILE_PCRE8
/* The ctypes array contains only 256 values. */
OP1(SLJIT_MOV, TMP1, 0, SLJIT_IMM, 0);
jump = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 255);
#endif
OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP2), common->ctypes);
-#if defined COMPILE_PCRE16 || defined COMPILE_PCRE32
+#if !defined COMPILE_PCRE8
JUMPHERE(jump);
#endif
+
+#if defined SUPPORT_UTF && defined COMPILE_PCRE16
+if (common->utf && update_str_ptr)
+ {
+ /* Skip low surrogate if necessary. */
+ OP2(SLJIT_SUB, TMP2, 0, TMP2, 0, SLJIT_IMM, 0xd800);
+ jump = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 0xdc00 - 0xd800 - 1);
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+ JUMPHERE(jump);
+ }
+#endif /* SUPPORT_UTF && COMPILE_PCRE16 */
}
static void skip_char_back(compiler_common *common)
@@ -2595,28 +2844,35 @@ if (common->utf)
OP2(SLJIT_SUB, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
}
-static void check_newlinechar(compiler_common *common, int nltype, jump_list **backtracks, BOOL jumpiftrue)
+static void check_newlinechar(compiler_common *common, int nltype, jump_list **backtracks, BOOL jumpifmatch)
{
/* Character comes in TMP1. Checks if it is a newline. TMP2 may be destroyed. */
DEFINE_COMPILER;
+struct sljit_jump *jump;
if (nltype == NLTYPE_ANY)
{
add_jump(compiler, &common->anynewline, JUMP(SLJIT_FAST_CALL));
- add_jump(compiler, backtracks, JUMP(jumpiftrue ? SLJIT_C_NOT_ZERO : SLJIT_C_ZERO));
+ add_jump(compiler, backtracks, JUMP(jumpifmatch ? SLJIT_C_NOT_ZERO : SLJIT_C_ZERO));
}
else if (nltype == NLTYPE_ANYCRLF)
{
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, CHAR_CR);
- OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_UNUSED, 0, SLJIT_C_EQUAL);
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, CHAR_NL);
- OP_FLAGS(SLJIT_OR | SLJIT_SET_E, TMP2, 0, TMP2, 0, SLJIT_C_EQUAL);
- add_jump(compiler, backtracks, JUMP(jumpiftrue ? SLJIT_C_NOT_ZERO : SLJIT_C_ZERO));
+ if (jumpifmatch)
+ {
+ add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, CHAR_CR));
+ add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, CHAR_NL));
+ }
+ else
+ {
+ jump = CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, CHAR_CR);
+ add_jump(compiler, backtracks, CMP(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, CHAR_NL));
+ JUMPHERE(jump);
+ }
}
else
{
SLJIT_ASSERT(nltype == NLTYPE_FIXED && common->newline < 256);
- add_jump(compiler, backtracks, CMP(jumpiftrue ? SLJIT_C_EQUAL : SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, common->newline));
+ add_jump(compiler, backtracks, CMP(jumpifmatch ? SLJIT_C_EQUAL : SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, common->newline));
}
}
@@ -2626,58 +2882,84 @@ else
static void do_utfreadchar(compiler_common *common)
{
/* Fast decoding a UTF-8 character. TMP1 contains the first byte
-of the character (>= 0xc0). Return char value in TMP1, length - 1 in TMP2. */
+of the character (>= 0xc0). Return char value in TMP1, length in TMP2. */
DEFINE_COMPILER;
struct sljit_jump *jump;
sljit_emit_fast_enter(compiler, RETURN_ADDR, 0);
+OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
+OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3f);
+OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 6);
+OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
+OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
+
/* Searching for the first zero. */
-OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, 0x20);
+OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, 0x800);
jump = JUMP(SLJIT_C_NOT_ZERO);
/* Two byte sequence. */
-OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
-OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x1f);
+OP1(SLJIT_MOV, TMP2, 0, SLJIT_IMM, IN_UCHARS(2));
+sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
+
+JUMPHERE(jump);
+OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
+OP2(SLJIT_XOR, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x800);
OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 6);
OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
-OP1(SLJIT_MOV, TMP2, 0, SLJIT_IMM, IN_UCHARS(1));
-sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
-JUMPHERE(jump);
-OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, 0x10);
+OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, 0x10000);
jump = JUMP(SLJIT_C_NOT_ZERO);
/* Three byte sequence. */
-OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
-OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x0f);
-OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 12);
-OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
-OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 6);
-OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
-OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(2));
OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(2));
-OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
-OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
-OP1(SLJIT_MOV, TMP2, 0, SLJIT_IMM, IN_UCHARS(2));
+OP1(SLJIT_MOV, TMP2, 0, SLJIT_IMM, IN_UCHARS(3));
sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
-JUMPHERE(jump);
/* Four byte sequence. */
-OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
-OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x07);
-OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 18);
+JUMPHERE(jump);
+OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(2));
+OP2(SLJIT_XOR, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x10000);
+OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 6);
+OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(3));
OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
-OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 12);
OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
-OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(2));
+OP1(SLJIT_MOV, TMP2, 0, SLJIT_IMM, IN_UCHARS(4));
+sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
+}
+
+static void do_utfreadchar16(compiler_common *common)
+{
+/* Fast decoding a UTF-8 character. TMP1 contains the first byte
+of the character (>= 0xc0). Return value in TMP1. */
+DEFINE_COMPILER;
+struct sljit_jump *jump;
+
+sljit_emit_fast_enter(compiler, RETURN_ADDR, 0);
+OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
+OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3f);
+OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 6);
OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
-OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 6);
OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
-OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(3));
-OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(3));
+
+/* Searching for the first zero. */
+OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, 0x800);
+jump = JUMP(SLJIT_C_NOT_ZERO);
+/* Two byte sequence. */
+OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
+
+JUMPHERE(jump);
+OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, 0x400);
+OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_UNUSED, 0, SLJIT_C_NOT_ZERO);
+/* This code runs only in 8 bit mode. No need to shift the value. */
+OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, TMP2, 0);
+OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
+OP2(SLJIT_XOR, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x800);
+OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 6);
OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3f);
OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
-OP1(SLJIT_MOV, TMP2, 0, SLJIT_IMM, IN_UCHARS(3));
+/* Three byte sequence. */
+OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(2));
sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
}
@@ -2697,53 +2979,27 @@ jump = JUMP(SLJIT_C_NOT_ZERO);
OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x1f);
+/* The upper 5 bits are known at this point. */
+compare = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 0x3);
OP2(SLJIT_SHL, TMP2, 0, TMP2, 0, SLJIT_IMM, 6);
OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3f);
OP2(SLJIT_OR, TMP2, 0, TMP2, 0, TMP1, 0);
-compare = CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, 255);
OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP2), common->ctypes);
sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
JUMPHERE(compare);
OP1(SLJIT_MOV, TMP1, 0, SLJIT_IMM, 0);
sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
-JUMPHERE(jump);
/* We only have types for characters less than 256. */
-OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP2), (sljit_sw)PRIV(utf8_table4) - 0xc0);
-OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, TMP1, 0);
-OP1(SLJIT_MOV, TMP1, 0, SLJIT_IMM, 0);
-sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
-}
-
-#elif defined COMPILE_PCRE16
-
-static void do_utfreadchar(compiler_common *common)
-{
-/* Fast decoding a UTF-16 character. TMP1 contains the first 16 bit char
-of the character (>= 0xd800). Return char value in TMP1, length - 1 in TMP2. */
-DEFINE_COMPILER;
-struct sljit_jump *jump;
-
-sljit_emit_fast_enter(compiler, RETURN_ADDR, 0);
-jump = CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, 0xdc00);
-/* Do nothing, only return. */
-sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
-
JUMPHERE(jump);
-/* Combine two 16 bit characters. */
-OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
-OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
-OP2(SLJIT_AND, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x3ff);
-OP2(SLJIT_SHL, TMP1, 0, TMP1, 0, SLJIT_IMM, 10);
-OP2(SLJIT_AND, TMP2, 0, TMP2, 0, SLJIT_IMM, 0x3ff);
-OP2(SLJIT_OR, TMP1, 0, TMP1, 0, TMP2, 0);
-OP1(SLJIT_MOV, TMP2, 0, SLJIT_IMM, IN_UCHARS(1));
-OP2(SLJIT_ADD, TMP1, 0, TMP1, 0, SLJIT_IMM, 0x10000);
+OP1(SLJIT_MOV_UB, TMP2, 0, SLJIT_MEM1(TMP2), (sljit_sw)PRIV(utf8_table4) - 0xc0);
+OP1(SLJIT_MOV, TMP1, 0, SLJIT_IMM, 0);
+OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, TMP2, 0);
sljit_emit_fast_return(compiler, RETURN_ADDR, 0);
}
-#endif /* COMPILE_PCRE[8|16] */
+#endif /* COMPILE_PCRE8 */
#endif /* SUPPORT_UTF */
@@ -2818,7 +3074,7 @@ if (firstline)
mainloop = LABEL();
/* Continual stores does not cause data dependency. */
OP1(SLJIT_MOV, SLJIT_MEM1(SLJIT_LOCALS_REG), common->first_line_end, STR_PTR, 0);
- read_char(common);
+ read_char_range(common, common->nlmin, common->nlmax, TRUE);
check_newlinechar(common, common->nltype, &newline, TRUE);
CMPTO(SLJIT_C_LESS, STR_PTR, 0, STR_END, 0, mainloop);
JUMPHERE(end);
@@ -2894,37 +3150,66 @@ if (newlinecheck)
return mainloop;
}
-#define MAX_N_CHARS 3
+#define MAX_N_CHARS 16
+#define MAX_N_BYTES 8
-static SLJIT_INLINE BOOL fast_forward_first_n_chars(compiler_common *common, BOOL firstline)
+static SLJIT_INLINE void add_prefix_byte(pcre_uint8 byte, pcre_uint8 *bytes)
{
-DEFINE_COMPILER;
-struct sljit_label *start;
-struct sljit_jump *quit;
-pcre_uint32 chars[MAX_N_CHARS * 2];
-pcre_uchar *cc = common->start + 1 + LINK_SIZE;
-int location = 0;
-pcre_int32 len, c, bit, caseless;
-int must_stop;
+pcre_uint8 len = bytes[0];
+int i;
-/* We do not support alternatives now. */
-if (*(common->start + GET(common->start, 1)) == OP_ALT)
- return FALSE;
+if (len == 255)
+ return;
+
+if (len == 0)
+ {
+ bytes[0] = 1;
+ bytes[1] = byte;
+ return;
+ }
+
+for (i = len; i > 0; i--)
+ if (bytes[i] == byte)
+ return;
+if (len >= MAX_N_BYTES - 1)
+ {
+ bytes[0] = 255;
+ return;
+ }
+
+len++;
+bytes[len] = byte;
+bytes[0] = len;
+}
+
+static int scan_prefix(compiler_common *common, pcre_uchar *cc, pcre_uint32 *chars, pcre_uint8 *bytes, int max_chars)
+{
+/* Recursive function, which scans prefix literals. */
+BOOL last, any, caseless;
+int len, repeat, len_save, consumed = 0;
+pcre_uint32 chr, mask;
+pcre_uchar *alternative, *cc_save, *oc;
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+pcre_uchar othercase[8];
+#elif defined SUPPORT_UTF && defined COMPILE_PCRE16
+pcre_uchar othercase[2];
+#else
+pcre_uchar othercase[1];
+#endif
+
+repeat = 1;
while (TRUE)
{
- caseless = 0;
- must_stop = 1;
- switch(*cc)
+ last = TRUE;
+ any = FALSE;
+ caseless = FALSE;
+ switch (*cc)
{
- case OP_CHAR:
- must_stop = 0;
- cc++;
- break;
-
case OP_CHARI:
- caseless = 1;
- must_stop = 0;
+ caseless = TRUE;
+ case OP_CHAR:
+ last = FALSE;
cc++;
break;
@@ -2943,125 +3228,546 @@ while (TRUE)
cc++;
continue;
+ case OP_ASSERT:
+ case OP_ASSERT_NOT:
+ case OP_ASSERTBACK:
+ case OP_ASSERTBACK_NOT:
+ cc = bracketend(cc);
+ continue;
+
+ case OP_PLUSI:
+ case OP_MINPLUSI:
+ case OP_POSPLUSI:
+ caseless = TRUE;
case OP_PLUS:
case OP_MINPLUS:
case OP_POSPLUS:
cc++;
break;
+ case OP_EXACTI:
+ caseless = TRUE;
case OP_EXACT:
+ repeat = GET2(cc, 1);
+ last = FALSE;
cc += 1 + IMM2_SIZE;
break;
- case OP_PLUSI:
- case OP_MINPLUSI:
- case OP_POSPLUSI:
- caseless = 1;
+ case OP_QUERYI:
+ case OP_MINQUERYI:
+ case OP_POSQUERYI:
+ caseless = TRUE;
+ case OP_QUERY:
+ case OP_MINQUERY:
+ case OP_POSQUERY:
+ len = 1;
cc++;
+#ifdef SUPPORT_UTF
+ if (common->utf && HAS_EXTRALEN(*cc)) len += GET_EXTRALEN(*cc);
+#endif
+ max_chars = scan_prefix(common, cc + len, chars, bytes, max_chars);
+ if (max_chars == 0)
+ return consumed;
+ last = FALSE;
break;
- case OP_EXACTI:
- caseless = 1;
+ case OP_KET:
+ cc += 1 + LINK_SIZE;
+ continue;
+
+ case OP_ALT:
+ cc += GET(cc, 1);
+ continue;
+
+ case OP_ONCE:
+ case OP_ONCE_NC:
+ case OP_BRA:
+ case OP_BRAPOS:
+ case OP_CBRA:
+ case OP_CBRAPOS:
+ alternative = cc + GET(cc, 1);
+ while (*alternative == OP_ALT)
+ {
+ max_chars = scan_prefix(common, alternative + 1 + LINK_SIZE, chars, bytes, max_chars);
+ if (max_chars == 0)
+ return consumed;
+ alternative += GET(alternative, 1);
+ }
+
+ if (*cc == OP_CBRA || *cc == OP_CBRAPOS)
+ cc += IMM2_SIZE;
+ cc += 1 + LINK_SIZE;
+ continue;
+
+ case OP_CLASS:
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+ if (common->utf && !is_char7_bitset((const pcre_uint8 *)(cc + 1), FALSE)) return consumed;
+#endif
+ any = TRUE;
+ cc += 1 + 32 / sizeof(pcre_uchar);
+ break;
+
+ case OP_NCLASS:
+#if defined SUPPORT_UTF && !defined COMPILE_PCRE32
+ if (common->utf) return consumed;
+#endif
+ any = TRUE;
+ cc += 1 + 32 / sizeof(pcre_uchar);
+ break;
+
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+ case OP_XCLASS:
+#if defined SUPPORT_UTF && !defined COMPILE_PCRE32
+ if (common->utf) return consumed;
+#endif
+ any = TRUE;
+ cc += GET(cc, 1);
+ break;
+#endif
+
+ case OP_DIGIT:
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+ if (common->utf && !is_char7_bitset((const pcre_uint8 *)common->ctypes - cbit_length + cbit_digit, FALSE))
+ return consumed;
+#endif
+ any = TRUE;
+ cc++;
+ break;
+
+ case OP_WHITESPACE:
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+ if (common->utf && !is_char7_bitset((const pcre_uint8 *)common->ctypes - cbit_length + cbit_space, FALSE))
+ return consumed;
+#endif
+ any = TRUE;
+ cc++;
+ break;
+
+ case OP_WORDCHAR:
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+ if (common->utf && !is_char7_bitset((const pcre_uint8 *)common->ctypes - cbit_length + cbit_word, FALSE))
+ return consumed;
+#endif
+ any = TRUE;
+ cc++;
+ break;
+
+ case OP_NOT:
+ case OP_NOTI:
+ cc++;
+ /* Fall through. */
+ case OP_NOT_DIGIT:
+ case OP_NOT_WHITESPACE:
+ case OP_NOT_WORDCHAR:
+ case OP_ANY:
+ case OP_ALLANY:
+#if defined SUPPORT_UTF && !defined COMPILE_PCRE32
+ if (common->utf) return consumed;
+#endif
+ any = TRUE;
+ cc++;
+ break;
+
+#ifdef SUPPORT_UCP
+ case OP_NOTPROP:
+ case OP_PROP:
+#if defined SUPPORT_UTF && !defined COMPILE_PCRE32
+ if (common->utf) return consumed;
+#endif
+ any = TRUE;
+ cc += 1 + 2;
+ break;
+#endif
+
+ case OP_TYPEEXACT:
+ repeat = GET2(cc, 1);
cc += 1 + IMM2_SIZE;
+ continue;
+
+ case OP_NOTEXACT:
+ case OP_NOTEXACTI:
+#if defined SUPPORT_UTF && !defined COMPILE_PCRE32
+ if (common->utf) return consumed;
+#endif
+ any = TRUE;
+ repeat = GET2(cc, 1);
+ cc += 1 + IMM2_SIZE + 1;
break;
default:
- must_stop = 2;
- break;
+ return consumed;
}
- if (must_stop == 2)
- break;
+ if (any)
+ {
+#if defined COMPILE_PCRE8
+ mask = 0xff;
+#elif defined COMPILE_PCRE16
+ mask = 0xffff;
+#elif defined COMPILE_PCRE32
+ mask = 0xffffffff;
+#else
+ SLJIT_ASSERT_STOP();
+#endif
+
+ do
+ {
+ chars[0] = mask;
+ chars[1] = mask;
+ bytes[0] = 255;
+
+ consumed++;
+ if (--max_chars == 0)
+ return consumed;
+ chars += 2;
+ bytes += MAX_N_BYTES;
+ }
+ while (--repeat > 0);
+
+ repeat = 1;
+ continue;
+ }
len = 1;
#ifdef SUPPORT_UTF
- if (common->utf && HAS_EXTRALEN(cc[0])) len += GET_EXTRALEN(cc[0]);
+ if (common->utf && HAS_EXTRALEN(*cc)) len += GET_EXTRALEN(*cc);
#endif
if (caseless && char_has_othercase(common, cc))
{
- caseless = char_get_othercase_bit(common, cc);
- if (caseless == 0)
- return FALSE;
-#ifdef COMPILE_PCRE8
- caseless = ((caseless & 0xff) << 8) | (len - (caseless >> 8));
-#else
- if ((caseless & 0x100) != 0)
- caseless = ((caseless & 0xff) << 16) | (len - (caseless >> 9));
+#ifdef SUPPORT_UTF
+ if (common->utf)
+ {
+ GETCHAR(chr, cc);
+ if ((int)PRIV(ord2utf)(char_othercase(common, chr), othercase) != len)
+ return consumed;
+ }
else
- caseless = ((caseless & 0xff) << 8) | (len - (caseless >> 9));
#endif
+ {
+ chr = *cc;
+ othercase[0] = TABLE_GET(chr, common->fcc, chr);
+ }
}
else
- caseless = 0;
+ caseless = FALSE;
- while (len > 0 && location < MAX_N_CHARS * 2)
+ len_save = len;
+ cc_save = cc;
+ while (TRUE)
{
- c = *cc;
- bit = 0;
- if (len == (caseless & 0xff))
+ oc = othercase;
+ do
{
- bit = caseless >> 8;
- c |= bit;
+ chr = *cc;
+#ifdef COMPILE_PCRE32
+ if (SLJIT_UNLIKELY(chr == NOTACHAR))
+ return consumed;
+#endif
+ add_prefix_byte((pcre_uint8)chr, bytes);
+
+ mask = 0;
+ if (caseless)
+ {
+ add_prefix_byte((pcre_uint8)*oc, bytes);
+ mask = *cc ^ *oc;
+ chr |= mask;
+ }
+
+#ifdef COMPILE_PCRE32
+ if (chars[0] == NOTACHAR && chars[1] == 0)
+#else
+ if (chars[0] == NOTACHAR)
+#endif
+ {
+ chars[0] = chr;
+ chars[1] = mask;
+ }
+ else
+ {
+ mask |= chars[0] ^ chr;
+ chr |= mask;
+ chars[0] = chr;
+ chars[1] |= mask;
+ }
+
+ len--;
+ consumed++;
+ if (--max_chars == 0)
+ return consumed;
+ chars += 2;
+ bytes += MAX_N_BYTES;
+ cc++;
+ oc++;
}
+ while (len > 0);
- chars[location] = c;
- chars[location + 1] = bit;
+ if (--repeat == 0)
+ break;
- len--;
- location += 2;
- cc++;
+ len = len_save;
+ cc = cc_save;
+ }
+
+ repeat = 1;
+ if (last)
+ return consumed;
+ }
+}
+
+static SLJIT_INLINE BOOL fast_forward_first_n_chars(compiler_common *common, BOOL firstline)
+{
+DEFINE_COMPILER;
+struct sljit_label *start;
+struct sljit_jump *quit;
+pcre_uint32 chars[MAX_N_CHARS * 2];
+pcre_uint8 bytes[MAX_N_CHARS * MAX_N_BYTES];
+pcre_uint8 ones[MAX_N_CHARS];
+int offsets[3];
+pcre_uint32 mask;
+pcre_uint8 *byte_set, *byte_set_end;
+int i, max, from;
+int range_right = -1, range_len = 3 - 1;
+sljit_ub *update_table = NULL;
+BOOL in_range;
+
+/* This is even TRUE, if both are NULL. */
+SLJIT_ASSERT(common->read_only_data_ptr == common->read_only_data);
+
+for (i = 0; i < MAX_N_CHARS; i++)
+ {
+ chars[i << 1] = NOTACHAR;
+ chars[(i << 1) + 1] = 0;
+ bytes[i * MAX_N_BYTES] = 0;
+ }
+
+max = scan_prefix(common, common->start, chars, bytes, MAX_N_CHARS);
+
+if (max <= 1)
+ return FALSE;
+
+for (i = 0; i < max; i++)
+ {
+ mask = chars[(i << 1) + 1];
+ ones[i] = ones_in_half_byte[mask & 0xf];
+ mask >>= 4;
+ while (mask != 0)
+ {
+ ones[i] += ones_in_half_byte[mask & 0xf];
+ mask >>= 4;
+ }
+ }
+
+in_range = FALSE;
+from = 0; /* Prevent compiler "uninitialized" warning */
+for (i = 0; i <= max; i++)
+ {
+ if (in_range && (i - from) > range_len && (bytes[(i - 1) * MAX_N_BYTES] <= 4))
+ {
+ range_len = i - from;
+ range_right = i - 1;
+ }
+
+ if (i < max && bytes[i * MAX_N_BYTES] < 255)
+ {
+ if (!in_range)
+ {
+ in_range = TRUE;
+ from = i;
+ }
+ }
+ else if (in_range)
+ in_range = FALSE;
+ }
+
+if (range_right >= 0)
+ {
+ /* Since no data is consumed (see the assert in the beginning
+ of this function), this space can be reallocated. */
+ if (common->read_only_data)
+ SLJIT_FREE(common->read_only_data);
+
+ common->read_only_data_size += 256;
+ common->read_only_data = (sljit_uw *)SLJIT_MALLOC(common->read_only_data_size);
+ if (common->read_only_data == NULL)
+ return TRUE;
+
+ update_table = (sljit_ub *)common->read_only_data;
+ common->read_only_data_ptr = (sljit_uw *)(update_table + 256);
+ memset(update_table, IN_UCHARS(range_len), 256);
+
+ for (i = 0; i < range_len; i++)
+ {
+ byte_set = bytes + ((range_right - i) * MAX_N_BYTES);
+ SLJIT_ASSERT(byte_set[0] > 0 && byte_set[0] < 255);
+ byte_set_end = byte_set + byte_set[0];
+ byte_set++;
+ while (byte_set <= byte_set_end)
+ {
+ if (update_table[*byte_set] > IN_UCHARS(i))
+ update_table[*byte_set] = IN_UCHARS(i);
+ byte_set++;
+ }
}
+ }
- if (location >= MAX_N_CHARS * 2 || must_stop != 0)
+offsets[0] = -1;
+/* Scan forward. */
+for (i = 0; i < max; i++)
+ if (ones[i] <= 2) {
+ offsets[0] = i;
break;
}
-/* At least two characters are required. */
-if (location < 2 * 2)
+if (offsets[0] < 0 && range_right < 0)
+ return FALSE;
+
+if (offsets[0] >= 0)
+ {
+ /* Scan backward. */
+ offsets[1] = -1;
+ for (i = max - 1; i > offsets[0]; i--)
+ if (ones[i] <= 2 && i != range_right)
+ {
+ offsets[1] = i;
+ break;
+ }
+
+ /* This case is handled better by fast_forward_first_char. */
+ if (offsets[1] == -1 && offsets[0] == 0 && range_right < 0)
return FALSE;
+ offsets[2] = -1;
+ /* We only search for a middle character if there is no range check. */
+ if (offsets[1] >= 0 && range_right == -1)
+ {
+ /* Scan from middle. */
+ for (i = (offsets[0] + offsets[1]) / 2 + 1; i < offsets[1]; i++)
+ if (ones[i] <= 2)
+ {
+ offsets[2] = i;
+ break;
+ }
+
+ if (offsets[2] == -1)
+ {
+ for (i = (offsets[0] + offsets[1]) / 2; i > offsets[0]; i--)
+ if (ones[i] <= 2)
+ {
+ offsets[2] = i;
+ break;
+ }
+ }
+ }
+
+ SLJIT_ASSERT(offsets[1] == -1 || (offsets[0] < offsets[1]));
+ SLJIT_ASSERT(offsets[2] == -1 || (offsets[0] < offsets[2] && offsets[1] > offsets[2]));
+
+ chars[0] = chars[offsets[0] << 1];
+ chars[1] = chars[(offsets[0] << 1) + 1];
+ if (offsets[2] >= 0)
+ {
+ chars[2] = chars[offsets[2] << 1];
+ chars[3] = chars[(offsets[2] << 1) + 1];
+ }
+ if (offsets[1] >= 0)
+ {
+ chars[4] = chars[offsets[1] << 1];
+ chars[5] = chars[(offsets[1] << 1) + 1];
+ }
+ }
+
+max -= 1;
if (firstline)
{
SLJIT_ASSERT(common->first_line_end != 0);
+ OP1(SLJIT_MOV, TMP1, 0, SLJIT_MEM1(SLJIT_LOCALS_REG), common->first_line_end);
OP1(SLJIT_MOV, TMP3, 0, STR_END, 0);
- OP2(SLJIT_SUB, STR_END, 0, SLJIT_MEM1(SLJIT_LOCALS_REG), common->first_line_end, SLJIT_IMM, IN_UCHARS((location >> 1) - 1));
+ OP2(SLJIT_SUB, STR_END, 0, STR_END, 0, SLJIT_IMM, IN_UCHARS(max));
+ quit = CMP(SLJIT_C_LESS_EQUAL, STR_END, 0, TMP1, 0);
+ OP1(SLJIT_MOV, STR_END, 0, TMP1, 0);
+ JUMPHERE(quit);
}
else
- OP2(SLJIT_SUB, STR_END, 0, STR_END, 0, SLJIT_IMM, IN_UCHARS((location >> 1) - 1));
+ OP2(SLJIT_SUB, STR_END, 0, STR_END, 0, SLJIT_IMM, IN_UCHARS(max));
+
+#if !(defined SLJIT_CONFIG_X86_32 && SLJIT_CONFIG_X86_32)
+if (range_right >= 0)
+ OP1(SLJIT_MOV, RETURN_ADDR, 0, SLJIT_IMM, (sljit_sw)update_table);
+#endif
start = LABEL();
quit = CMP(SLJIT_C_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
-OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(0));
-OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
-OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
-if (chars[1] != 0)
- OP2(SLJIT_OR, TMP1, 0, TMP1, 0, SLJIT_IMM, chars[1]);
-CMPTO(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, chars[0], start);
-if (location > 2 * 2)
- OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(1));
-if (chars[3] != 0)
- OP2(SLJIT_OR, TMP2, 0, TMP2, 0, SLJIT_IMM, chars[3]);
-CMPTO(SLJIT_C_NOT_EQUAL, TMP2, 0, SLJIT_IMM, chars[2], start);
-if (location > 2 * 2)
- {
- if (chars[5] != 0)
- OP2(SLJIT_OR, TMP1, 0, TMP1, 0, SLJIT_IMM, chars[5]);
- CMPTO(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, chars[4], start);
+SLJIT_ASSERT(range_right >= 0 || offsets[0] >= 0);
+
+if (range_right >= 0)
+ {
+#if defined COMPILE_PCRE8 || (defined SLJIT_LITTLE_ENDIAN && SLJIT_LITTLE_ENDIAN)
+ OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(range_right));
+#else
+ OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(range_right + 1) - 1);
+#endif
+
+#if !(defined SLJIT_CONFIG_X86_32 && SLJIT_CONFIG_X86_32)
+ OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM2(RETURN_ADDR, TMP1), 0);
+#else
+ OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP1), (sljit_sw)update_table);
+#endif
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, TMP1, 0);
+ CMPTO(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, 0, start);
+ }
+
+if (offsets[0] >= 0)
+ {
+ OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(offsets[0]));
+ if (offsets[1] >= 0)
+ OP1(MOV_UCHAR, TMP2, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(offsets[1]));
+ OP2(SLJIT_ADD, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
+
+ if (chars[1] != 0)
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, SLJIT_IMM, chars[1]);
+ CMPTO(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, chars[0], start);
+ if (offsets[2] >= 0)
+ OP1(MOV_UCHAR, TMP1, 0, SLJIT_MEM1(STR_PTR), IN_UCHARS(offsets[2] - 1));
+
+ if (offsets[1] >= 0)
+ {
+ if (chars[5] != 0)
+ OP2(SLJIT_OR, TMP2, 0, TMP2, 0, SLJIT_IMM, chars[5]);
+ CMPTO(SLJIT_C_NOT_EQUAL, TMP2, 0, SLJIT_IMM, chars[4], start);
+ }
+
+ if (offsets[2] >= 0)
+ {
+ if (chars[3] != 0)
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, SLJIT_IMM, chars[3]);
+ CMPTO(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, chars[2], start);
+ }
+ OP2(SLJIT_SUB, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
}
-OP2(SLJIT_SUB, STR_PTR, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(1));
JUMPHERE(quit);
if (firstline)
+ {
+ if (range_right >= 0)
+ OP1(SLJIT_MOV, TMP1, 0, SLJIT_MEM1(SLJIT_LOCALS_REG), common->first_line_end);
OP1(SLJIT_MOV, STR_END, 0, TMP3, 0);
+ if (range_right >= 0)
+ {
+ quit = CMP(SLJIT_C_LESS_EQUAL, STR_PTR, 0, TMP1, 0);
+ OP1(SLJIT_MOV, STR_PTR, 0, TMP1, 0);
+ JUMPHERE(quit);
+ }
+ }
else
- OP2(SLJIT_ADD, STR_END, 0, STR_END, 0, SLJIT_IMM, IN_UCHARS((location >> 1) - 1));
+ OP2(SLJIT_ADD, STR_END, 0, STR_END, 0, SLJIT_IMM, IN_UCHARS(max));
return TRUE;
}
#undef MAX_N_CHARS
+#undef MAX_N_BYTES
static SLJIT_INLINE void fast_forward_first_char(compiler_common *common, pcre_uchar first_char, BOOL caseless, BOOL firstline)
{
@@ -3167,7 +3873,7 @@ if (common->nltype == NLTYPE_FIXED && common->newline > 255)
JUMPHERE(lastchar);
if (firstline)
- OP1(SLJIT_MOV, STR_END, 0, SLJIT_MEM1(SLJIT_LOCALS_REG), POSSESSIVE0);
+ OP1(SLJIT_MOV, STR_END, 0, TMP3, 0);
return;
}
@@ -3177,7 +3883,9 @@ firstchar = CMP(SLJIT_C_LESS_EQUAL, STR_PTR, 0, TMP2, 0);
skip_char_back(common);
loop = LABEL();
-read_char(common);
+common->ff_newline_shortcut = loop;
+
+read_char_range(common, common->nlmin, common->nlmax, TRUE);
lastchar = CMP(SLJIT_C_GREATER_EQUAL, STR_PTR, 0, STR_END, 0);
if (common->nltype == NLTYPE_ANY || common->nltype == NLTYPE_ANYCRLF)
foundcr = CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, CHAR_CR);
@@ -3206,24 +3914,19 @@ if (firstline)
OP1(SLJIT_MOV, STR_END, 0, TMP3, 0);
}
-static BOOL check_class_ranges(compiler_common *common, const pcre_uint8 *bits, BOOL nclass, jump_list **backtracks);
+static BOOL check_class_ranges(compiler_common *common, const pcre_uint8 *bits, BOOL nclass, BOOL invert, jump_list **backtracks);
-static SLJIT_INLINE void fast_forward_start_bits(compiler_common *common, sljit_uw start_bits, BOOL firstline)
+static SLJIT_INLINE void fast_forward_start_bits(compiler_common *common, pcre_uint8 *start_bits, BOOL firstline)
{
DEFINE_COMPILER;
struct sljit_label *start;
struct sljit_jump *quit;
struct sljit_jump *found = NULL;
jump_list *matches = NULL;
-pcre_uint8 inverted_start_bits[32];
-int i;
#ifndef COMPILE_PCRE8
struct sljit_jump *jump;
#endif
-for (i = 0; i < 32; ++i)
- inverted_start_bits[i] = ~(((pcre_uint8*)start_bits)[i]);
-
if (firstline)
{
SLJIT_ASSERT(common->first_line_end != 0);
@@ -3239,7 +3942,7 @@ if (common->utf)
OP1(SLJIT_MOV, TMP3, 0, TMP1, 0);
#endif
-if (!check_class_ranges(common, inverted_start_bits, (inverted_start_bits[31] & 0x80) != 0, &matches))
+if (!check_class_ranges(common, start_bits, (start_bits[31] & 0x80) != 0, TRUE, &matches))
{
#ifndef COMPILE_PCRE8
jump = CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, 255);
@@ -3248,7 +3951,7 @@ if (!check_class_ranges(common, inverted_start_bits, (inverted_start_bits[31] &
#endif
OP2(SLJIT_AND, TMP2, 0, TMP1, 0, SLJIT_IMM, 0x7);
OP2(SLJIT_LSHR, TMP1, 0, TMP1, 0, SLJIT_IMM, 3);
- OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP1), start_bits);
+ OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP1), (sljit_sw)start_bits);
OP2(SLJIT_SHL, TMP2, 0, SLJIT_IMM, 1, TMP2, 0);
OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, TMP2, 0);
found = JUMP(SLJIT_C_NOT_ZERO);
@@ -3451,7 +4154,7 @@ JUMPHERE(skipread);
OP1(SLJIT_MOV, TMP2, 0, SLJIT_IMM, 0);
check_str_end(common, &skipread_list);
-peek_char(common);
+peek_char(common, READ_CHAR_MAX);
/* Testing char type. This is a code duplication. */
#ifdef SUPPORT_UCP
@@ -3497,110 +4200,15 @@ OP2(SLJIT_XOR | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP2, 0, SLJIT_MEM1(SLJIT_LOCALS_R
sljit_emit_fast_return(compiler, SLJIT_MEM1(SLJIT_LOCALS_REG), LOCALS0);
}
-/*
- range format:
-
- ranges[0] = length of the range (max MAX_RANGE_SIZE, -1 means invalid range).
- ranges[1] = first bit (0 or 1)
- ranges[2-length] = position of the bit change (when the current bit is not equal to the previous)
-*/
-
-static BOOL check_ranges(compiler_common *common, int *ranges, jump_list **backtracks, BOOL readch)
+static BOOL check_class_ranges(compiler_common *common, const pcre_uint8 *bits, BOOL nclass, BOOL invert, jump_list **backtracks)
{
DEFINE_COMPILER;
-struct sljit_jump *jump;
-
-if (ranges[0] < 0)
- return FALSE;
-
-switch(ranges[0])
- {
- case 1:
- if (readch)
- read_char(common);
- add_jump(compiler, backtracks, CMP(ranges[1] == 0 ? SLJIT_C_LESS : SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, ranges[2]));
- return TRUE;
-
- case 2:
- if (readch)
- read_char(common);
- OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[2]);
- add_jump(compiler, backtracks, CMP(ranges[1] != 0 ? SLJIT_C_LESS : SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, ranges[3] - ranges[2]));
- return TRUE;
-
- case 4:
- if (ranges[2] + 1 == ranges[3] && ranges[4] + 1 == ranges[5])
- {
- if (readch)
- read_char(common);
- if (ranges[1] != 0)
- {
- add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, ranges[2]));
- add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, ranges[4]));
- }
- else
- {
- jump = CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, ranges[2]);
- add_jump(compiler, backtracks, CMP(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, ranges[4]));
- JUMPHERE(jump);
- }
- return TRUE;
- }
- if ((ranges[3] - ranges[2]) == (ranges[5] - ranges[4]) && is_powerof2(ranges[4] - ranges[2]))
- {
- if (readch)
- read_char(common);
- OP2(SLJIT_OR, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[4] - ranges[2]);
- OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[4]);
- add_jump(compiler, backtracks, CMP(ranges[1] != 0 ? SLJIT_C_LESS : SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, ranges[5] - ranges[4]));
- return TRUE;
- }
- return FALSE;
-
- default:
- return FALSE;
- }
-}
-
-static void get_ctype_ranges(compiler_common *common, int flag, int *ranges)
-{
-int i, bit, length;
-const pcre_uint8 *ctypes = (const pcre_uint8*)common->ctypes;
-
-bit = ctypes[0] & flag;
-ranges[0] = -1;
-ranges[1] = bit != 0 ? 1 : 0;
-length = 0;
-
-for (i = 1; i < 256; i++)
- if ((ctypes[i] & flag) != bit)
- {
- if (length >= MAX_RANGE_SIZE)
- return;
- ranges[2 + length] = i;
- length++;
- bit ^= flag;
- }
-
-if (bit != 0)
- {
- if (length >= MAX_RANGE_SIZE)
- return;
- ranges[2 + length] = 256;
- length++;
- }
-ranges[0] = length;
-}
-
-static BOOL check_class_ranges(compiler_common *common, const pcre_uint8 *bits, BOOL nclass, jump_list **backtracks)
-{
-int ranges[2 + MAX_RANGE_SIZE];
+int ranges[MAX_RANGE_SIZE];
pcre_uint8 bit, cbit, all;
int i, byte, length = 0;
bit = bits[0] & 0x1;
-ranges[1] = bit;
-/* Can be 0 or 255. */
+/* All bits will be zero or one (since bit is zero or one). */
all = -bit;
for (i = 0; i < 256; )
@@ -3615,7 +4223,7 @@ for (i = 0; i < 256; )
{
if (length >= MAX_RANGE_SIZE)
return FALSE;
- ranges[2 + length] = i;
+ ranges[length] = i;
length++;
bit = cbit;
all = -cbit;
@@ -3628,12 +4236,117 @@ if (((bit == 0) && nclass) || ((bit == 1) && !nclass))
{
if (length >= MAX_RANGE_SIZE)
return FALSE;
- ranges[2 + length] = 256;
+ ranges[length] = 256;
length++;
}
-ranges[0] = length;
-return check_ranges(common, ranges, backtracks, FALSE);
+if (length < 0 || length > 4)
+ return FALSE;
+
+bit = bits[0] & 0x1;
+if (invert) bit ^= 0x1;
+
+/* No character is accepted. */
+if (length == 0 && bit == 0)
+ add_jump(compiler, backtracks, JUMP(SLJIT_JUMP));
+
+switch(length)
+ {
+ case 0:
+ /* When bit != 0, all characters are accepted. */
+ return TRUE;
+
+ case 1:
+ add_jump(compiler, backtracks, CMP(bit == 0 ? SLJIT_C_LESS : SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, ranges[0]));
+ return TRUE;
+
+ case 2:
+ if (ranges[0] + 1 != ranges[1])
+ {
+ OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[0]);
+ add_jump(compiler, backtracks, CMP(bit != 0 ? SLJIT_C_LESS : SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, ranges[1] - ranges[0]));
+ }
+ else
+ add_jump(compiler, backtracks, CMP(bit != 0 ? SLJIT_C_EQUAL : SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, ranges[0]));
+ return TRUE;
+
+ case 3:
+ if (bit != 0)
+ {
+ add_jump(compiler, backtracks, CMP(SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, ranges[2]));
+ if (ranges[0] + 1 != ranges[1])
+ {
+ OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[0]);
+ add_jump(compiler, backtracks, CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, ranges[1] - ranges[0]));
+ }
+ else
+ add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, ranges[0]));
+ return TRUE;
+ }
+
+ add_jump(compiler, backtracks, CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, ranges[0]));
+ if (ranges[1] + 1 != ranges[2])
+ {
+ OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[1]);
+ add_jump(compiler, backtracks, CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, ranges[2] - ranges[1]));
+ }
+ else
+ add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, ranges[1]));
+ return TRUE;
+
+ case 4:
+ if ((ranges[1] - ranges[0]) == (ranges[3] - ranges[2])
+ && (ranges[0] | (ranges[2] - ranges[0])) == ranges[2]
+ && is_powerof2(ranges[2] - ranges[0]))
+ {
+ OP2(SLJIT_OR, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[2] - ranges[0]);
+ if (ranges[2] + 1 != ranges[3])
+ {
+ OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[2]);
+ add_jump(compiler, backtracks, CMP(bit != 0 ? SLJIT_C_LESS : SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, ranges[3] - ranges[2]));
+ }
+ else
+ add_jump(compiler, backtracks, CMP(bit != 0 ? SLJIT_C_EQUAL : SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, ranges[2]));
+ return TRUE;
+ }
+
+ if (bit != 0)
+ {
+ i = 0;
+ if (ranges[0] + 1 != ranges[1])
+ {
+ OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[0]);
+ add_jump(compiler, backtracks, CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, ranges[1] - ranges[0]));
+ i = ranges[0];
+ }
+ else
+ add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, ranges[0]));
+
+ if (ranges[2] + 1 != ranges[3])
+ {
+ OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[2] - i);
+ add_jump(compiler, backtracks, CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, ranges[3] - ranges[2]));
+ }
+ else
+ add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, ranges[2] - i));
+ return TRUE;
+ }
+
+ OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[0]);
+ add_jump(compiler, backtracks, CMP(SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, ranges[3] - ranges[0]));
+ if (ranges[1] + 1 != ranges[2])
+ {
+ OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, ranges[1] - ranges[0]);
+ add_jump(compiler, backtracks, CMP(SLJIT_C_LESS, TMP1, 0, SLJIT_IMM, ranges[2] - ranges[1]));
+ }
+ else
+ add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, ranges[1] - ranges[0]));
+ return TRUE;
+
+ default:
+ SLJIT_ASSERT_STOP();
+ return FALSE;
+ }
}
static void check_anynewline(compiler_common *common)
@@ -4000,20 +4713,20 @@ return cc;
#define SET_TYPE_OFFSET(value) \
if ((value) != typeoffset) \
{ \
- if ((value) > typeoffset) \
- OP2(SLJIT_SUB, typereg, 0, typereg, 0, SLJIT_IMM, (value) - typeoffset); \
- else \
+ if ((value) < typeoffset) \
OP2(SLJIT_ADD, typereg, 0, typereg, 0, SLJIT_IMM, typeoffset - (value)); \
+ else \
+ OP2(SLJIT_SUB, typereg, 0, typereg, 0, SLJIT_IMM, (value) - typeoffset); \
} \
typeoffset = (value);
#define SET_CHAR_OFFSET(value) \
if ((value) != charoffset) \
{ \
- if ((value) > charoffset) \
- OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, (value) - charoffset); \
+ if ((value) < charoffset) \
+ OP2(SLJIT_ADD, TMP1, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(charoffset - (value))); \
else \
- OP2(SLJIT_ADD, TMP1, 0, TMP1, 0, SLJIT_IMM, charoffset - (value)); \
+ OP2(SLJIT_SUB, TMP1, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)((value) - charoffset)); \
} \
charoffset = (value);
@@ -4021,84 +4734,53 @@ static void compile_xclass_matchingpath(compiler_common *common, pcre_uchar *cc,
{
DEFINE_COMPILER;
jump_list *found = NULL;
-jump_list **list = (*cc & XCL_NOT) == 0 ? &found : backtracks;
-pcre_int32 c, charoffset;
+jump_list **list = (cc[0] & XCL_NOT) == 0 ? &found : backtracks;
+sljit_uw c, charoffset, max = 256, min = READ_CHAR_MAX;
struct sljit_jump *jump = NULL;
pcre_uchar *ccbegin;
int compares, invertcmp, numberofcmps;
+#if defined SUPPORT_UTF && (defined COMPILE_PCRE8 || defined COMPILE_PCRE16)
+BOOL utf = common->utf;
+#endif
#ifdef SUPPORT_UCP
BOOL needstype = FALSE, needsscript = FALSE, needschar = FALSE;
BOOL charsaved = FALSE;
int typereg = TMP1, scriptreg = TMP1;
const pcre_uint32 *other_cases;
-pcre_int32 typeoffset;
+sljit_uw typeoffset;
#endif
-/* Although SUPPORT_UTF must be defined, we are
- not necessary in utf mode even in 8 bit mode. */
-detect_partial_match(common, backtracks);
-read_char(common);
-
-if ((*cc++ & XCL_MAP) != 0)
+/* Scanning the necessary info. */
+cc++;
+ccbegin = cc;
+compares = 0;
+if (cc[-1] & XCL_MAP)
{
- OP1(SLJIT_MOV, TMP3, 0, TMP1, 0);
-#ifndef COMPILE_PCRE8
- jump = CMP(SLJIT_C_GREATER, TMP1, 0, SLJIT_IMM, 255);
-#elif defined SUPPORT_UTF
- if (common->utf)
- jump = CMP(SLJIT_C_GREATER, TMP1, 0, SLJIT_IMM, 255);
-#endif
-
- if (!check_class_ranges(common, (const pcre_uint8 *)cc, TRUE, list))
- {
- OP2(SLJIT_AND, TMP2, 0, TMP1, 0, SLJIT_IMM, 0x7);
- OP2(SLJIT_LSHR, TMP1, 0, TMP1, 0, SLJIT_IMM, 3);
- OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP1), (sljit_sw)cc);
- OP2(SLJIT_SHL, TMP2, 0, SLJIT_IMM, 1, TMP2, 0);
- OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, TMP2, 0);
- add_jump(compiler, list, JUMP(SLJIT_C_NOT_ZERO));
- }
-
-#ifndef COMPILE_PCRE8
- JUMPHERE(jump);
-#elif defined SUPPORT_UTF
- if (common->utf)
- JUMPHERE(jump);
-#endif
- OP1(SLJIT_MOV, TMP1, 0, TMP3, 0);
-#ifdef SUPPORT_UCP
- charsaved = TRUE;
-#endif
+ min = 0;
cc += 32 / sizeof(pcre_uchar);
}
-/* Scanning the necessary info. */
-ccbegin = cc;
-compares = 0;
while (*cc != XCL_END)
{
compares++;
if (*cc == XCL_SINGLE)
{
- cc += 2;
-#ifdef SUPPORT_UTF
- if (common->utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
-#endif
+ cc ++;
+ GETCHARINCTEST(c, cc);
+ if (c > max) max = c;
+ if (c < min) min = c;
#ifdef SUPPORT_UCP
needschar = TRUE;
#endif
}
else if (*cc == XCL_RANGE)
{
- cc += 2;
-#ifdef SUPPORT_UTF
- if (common->utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
-#endif
- cc++;
-#ifdef SUPPORT_UTF
- if (common->utf && HAS_EXTRALEN(cc[-1])) cc += GET_EXTRALEN(cc[-1]);
-#endif
+ cc ++;
+ GETCHARINCTEST(c, cc);
+ if (c < min) min = c;
+ GETCHARINCTEST(c, cc);
+ if (c > max) max = c;
#ifdef SUPPORT_UCP
needschar = TRUE;
#endif
@@ -4108,6 +4790,22 @@ while (*cc != XCL_END)
{
SLJIT_ASSERT(*cc == XCL_PROP || *cc == XCL_NOTPROP);
cc++;
+ if (*cc == PT_CLIST)
+ {
+ other_cases = PRIV(ucd_caseless_sets) + cc[1];
+ while (*other_cases != NOTACHAR)
+ {
+ if (*other_cases > max) max = *other_cases;
+ if (*other_cases < min) min = *other_cases;
+ other_cases++;
+ }
+ }
+ else
+ {
+ max = READ_CHAR_MAX;
+ min = 0;
+ }
+
switch(*cc)
{
case PT_ANY:
@@ -4148,6 +4846,64 @@ while (*cc != XCL_END)
#endif
}
+/* We are not necessary in utf mode even in 8 bit mode. */
+cc = ccbegin;
+detect_partial_match(common, backtracks);
+read_char_range(common, min, max, (cc[-1] & XCL_NOT) != 0);
+
+if ((cc[-1] & XCL_HASPROP) == 0)
+ {
+ if ((cc[-1] & XCL_MAP) != 0)
+ {
+ jump = CMP(SLJIT_C_GREATER, TMP1, 0, SLJIT_IMM, 255);
+ if (!check_class_ranges(common, (const pcre_uint8 *)cc, (((const pcre_uint8 *)cc)[31] & 0x80) != 0, TRUE, &found))
+ {
+ OP2(SLJIT_AND, TMP2, 0, TMP1, 0, SLJIT_IMM, 0x7);
+ OP2(SLJIT_LSHR, TMP1, 0, TMP1, 0, SLJIT_IMM, 3);
+ OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP1), (sljit_sw)cc);
+ OP2(SLJIT_SHL, TMP2, 0, SLJIT_IMM, 1, TMP2, 0);
+ OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, TMP2, 0);
+ add_jump(compiler, &found, JUMP(SLJIT_C_NOT_ZERO));
+ }
+
+ add_jump(compiler, backtracks, JUMP(SLJIT_JUMP));
+ JUMPHERE(jump);
+
+ cc += 32 / sizeof(pcre_uchar);
+ }
+ else
+ {
+ OP2(SLJIT_SUB, TMP2, 0, TMP1, 0, SLJIT_IMM, min);
+ add_jump(compiler, (cc[-1] & XCL_NOT) == 0 ? backtracks : &found, CMP(SLJIT_C_GREATER, TMP2, 0, SLJIT_IMM, max - min));
+ }
+ }
+else if ((cc[-1] & XCL_MAP) != 0)
+ {
+ OP1(SLJIT_MOV, TMP3, 0, TMP1, 0);
+#ifdef SUPPORT_UCP
+ charsaved = TRUE;
+#endif
+ if (!check_class_ranges(common, (const pcre_uint8 *)cc, FALSE, TRUE, list))
+ {
+#ifdef COMPILE_PCRE8
+ SLJIT_ASSERT(common->utf);
+#endif
+ jump = CMP(SLJIT_C_GREATER, TMP1, 0, SLJIT_IMM, 255);
+
+ OP2(SLJIT_AND, TMP2, 0, TMP1, 0, SLJIT_IMM, 0x7);
+ OP2(SLJIT_LSHR, TMP1, 0, TMP1, 0, SLJIT_IMM, 3);
+ OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP1), (sljit_sw)cc);
+ OP2(SLJIT_SHL, TMP2, 0, SLJIT_IMM, 1, TMP2, 0);
+ OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, TMP2, 0);
+ add_jump(compiler, list, JUMP(SLJIT_C_NOT_ZERO));
+
+ JUMPHERE(jump);
+ }
+
+ OP1(SLJIT_MOV, TMP1, 0, TMP3, 0);
+ cc += 32 / sizeof(pcre_uchar);
+ }
+
#ifdef SUPPORT_UCP
/* Simple register allocation. TMP1 is preferred if possible. */
if (needstype || needsscript)
@@ -4189,7 +4945,6 @@ if (needstype || needsscript)
#endif
/* Generating code. */
-cc = ccbegin;
charoffset = 0;
numberofcmps = 0;
#ifdef SUPPORT_UCP
@@ -4205,70 +4960,50 @@ while (*cc != XCL_END)
if (*cc == XCL_SINGLE)
{
cc ++;
-#ifdef SUPPORT_UTF
- if (common->utf)
- {
- GETCHARINC(c, cc);
- }
- else
-#endif
- c = *cc++;
+ GETCHARINCTEST(c, cc);
if (numberofcmps < 3 && (*cc == XCL_SINGLE || *cc == XCL_RANGE))
{
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, c - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(c - charoffset));
OP_FLAGS(numberofcmps == 0 ? SLJIT_MOV : SLJIT_OR, TMP2, 0, numberofcmps == 0 ? SLJIT_UNUSED : TMP2, 0, SLJIT_C_EQUAL);
numberofcmps++;
}
else if (numberofcmps > 0)
{
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, c - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(c - charoffset));
OP_FLAGS(SLJIT_OR | SLJIT_SET_E, TMP2, 0, TMP2, 0, SLJIT_C_EQUAL);
jump = JUMP(SLJIT_C_NOT_ZERO ^ invertcmp);
numberofcmps = 0;
}
else
{
- jump = CMP(SLJIT_C_EQUAL ^ invertcmp, TMP1, 0, SLJIT_IMM, c - charoffset);
+ jump = CMP(SLJIT_C_EQUAL ^ invertcmp, TMP1, 0, SLJIT_IMM, (sljit_sw)(c - charoffset));
numberofcmps = 0;
}
}
else if (*cc == XCL_RANGE)
{
cc ++;
-#ifdef SUPPORT_UTF
- if (common->utf)
- {
- GETCHARINC(c, cc);
- }
- else
-#endif
- c = *cc++;
+ GETCHARINCTEST(c, cc);
SET_CHAR_OFFSET(c);
-#ifdef SUPPORT_UTF
- if (common->utf)
- {
- GETCHARINC(c, cc);
- }
- else
-#endif
- c = *cc++;
+ GETCHARINCTEST(c, cc);
+
if (numberofcmps < 3 && (*cc == XCL_SINGLE || *cc == XCL_RANGE))
{
- OP2(SLJIT_SUB | SLJIT_SET_U, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, c - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_U, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(c - charoffset));
OP_FLAGS(numberofcmps == 0 ? SLJIT_MOV : SLJIT_OR, TMP2, 0, numberofcmps == 0 ? SLJIT_UNUSED : TMP2, 0, SLJIT_C_LESS_EQUAL);
numberofcmps++;
}
else if (numberofcmps > 0)
{
- OP2(SLJIT_SUB | SLJIT_SET_U, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, c - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_U, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(c - charoffset));
OP_FLAGS(SLJIT_OR | SLJIT_SET_E, TMP2, 0, TMP2, 0, SLJIT_C_LESS_EQUAL);
jump = JUMP(SLJIT_C_NOT_ZERO ^ invertcmp);
numberofcmps = 0;
}
else
{
- jump = CMP(SLJIT_C_LESS_EQUAL ^ invertcmp, TMP1, 0, SLJIT_IMM, c - charoffset);
+ jump = CMP(SLJIT_C_LESS_EQUAL ^ invertcmp, TMP1, 0, SLJIT_IMM, (sljit_sw)(c - charoffset));
numberofcmps = 0;
}
}
@@ -4334,7 +5069,7 @@ while (*cc != XCL_END)
break;
case PT_WORD:
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, CHAR_UNDERSCORE - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(CHAR_UNDERSCORE - charoffset));
OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_UNUSED, 0, SLJIT_C_EQUAL);
/* Fall through. */
@@ -4382,35 +5117,35 @@ while (*cc != XCL_END)
OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP2, 0, SLJIT_IMM, other_cases[2]);
OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_UNUSED, 0, SLJIT_C_EQUAL);
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, other_cases[0] - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(other_cases[0] - charoffset));
OP_FLAGS(SLJIT_OR | ((other_cases[3] == NOTACHAR) ? SLJIT_SET_E : 0), TMP2, 0, TMP2, 0, SLJIT_C_EQUAL);
other_cases += 3;
}
else
{
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, *other_cases++ - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(*other_cases++ - charoffset));
OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_UNUSED, 0, SLJIT_C_EQUAL);
}
while (*other_cases != NOTACHAR)
{
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, *other_cases++ - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(*other_cases++ - charoffset));
OP_FLAGS(SLJIT_OR | ((*other_cases == NOTACHAR) ? SLJIT_SET_E : 0), TMP2, 0, TMP2, 0, SLJIT_C_EQUAL);
}
jump = JUMP(SLJIT_C_NOT_ZERO ^ invertcmp);
break;
case PT_UCNC:
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, CHAR_DOLLAR_SIGN - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(CHAR_DOLLAR_SIGN - charoffset));
OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_UNUSED, 0, SLJIT_C_EQUAL);
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, CHAR_COMMERCIAL_AT - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(CHAR_COMMERCIAL_AT - charoffset));
OP_FLAGS(SLJIT_OR, TMP2, 0, TMP2, 0, SLJIT_C_EQUAL);
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, CHAR_GRAVE_ACCENT - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(CHAR_GRAVE_ACCENT - charoffset));
OP_FLAGS(SLJIT_OR, TMP2, 0, TMP2, 0, SLJIT_C_EQUAL);
SET_CHAR_OFFSET(0xa0);
- OP2(SLJIT_SUB | SLJIT_SET_U, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, 0xd7ff - charoffset);
+ OP2(SLJIT_SUB | SLJIT_SET_U, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, (sljit_sw)(0xd7ff - charoffset));
OP_FLAGS(SLJIT_OR, TMP2, 0, TMP2, 0, SLJIT_C_LESS_EQUAL);
SET_CHAR_OFFSET(0);
OP2(SLJIT_SUB | SLJIT_SET_U, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, 0xe000 - 0);
@@ -4509,7 +5244,7 @@ struct sljit_label *label;
#ifdef SUPPORT_UCP
pcre_uchar propdata[5];
#endif
-#endif
+#endif /* SUPPORT_UTF */
switch(type)
{
@@ -4534,26 +5269,27 @@ switch(type)
case OP_NOT_DIGIT:
case OP_DIGIT:
/* Digits are usually 0-9, so it is worth to optimize them. */
- if (common->digits[0] == -2)
- get_ctype_ranges(common, ctype_digit, common->digits);
detect_partial_match(common, backtracks);
- /* Flip the starting bit in the negative case. */
- if (type == OP_NOT_DIGIT)
- common->digits[1] ^= 1;
- if (!check_ranges(common, common->digits, backtracks, TRUE))
- {
- read_char8_type(common);
- OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, ctype_digit);
- add_jump(compiler, backtracks, JUMP(type == OP_DIGIT ? SLJIT_C_ZERO : SLJIT_C_NOT_ZERO));
- }
- if (type == OP_NOT_DIGIT)
- common->digits[1] ^= 1;
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+ if (common->utf && is_char7_bitset((const pcre_uint8*)common->ctypes - cbit_length + cbit_digit, FALSE))
+ read_char7_type(common, type == OP_NOT_DIGIT);
+ else
+#endif
+ read_char8_type(common, type == OP_NOT_DIGIT);
+ /* Flip the starting bit in the negative case. */
+ OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, ctype_digit);
+ add_jump(compiler, backtracks, JUMP(type == OP_DIGIT ? SLJIT_C_ZERO : SLJIT_C_NOT_ZERO));
return cc;
case OP_NOT_WHITESPACE:
case OP_WHITESPACE:
detect_partial_match(common, backtracks);
- read_char8_type(common);
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+ if (common->utf && is_char7_bitset((const pcre_uint8*)common->ctypes - cbit_length + cbit_space, FALSE))
+ read_char7_type(common, type == OP_NOT_WHITESPACE);
+ else
+#endif
+ read_char8_type(common, type == OP_NOT_WHITESPACE);
OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, ctype_space);
add_jump(compiler, backtracks, JUMP(type == OP_WHITESPACE ? SLJIT_C_ZERO : SLJIT_C_NOT_ZERO));
return cc;
@@ -4561,14 +5297,19 @@ switch(type)
case OP_NOT_WORDCHAR:
case OP_WORDCHAR:
detect_partial_match(common, backtracks);
- read_char8_type(common);
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+ if (common->utf && is_char7_bitset((const pcre_uint8*)common->ctypes - cbit_length + cbit_word, FALSE))
+ read_char7_type(common, type == OP_NOT_WORDCHAR);
+ else
+#endif
+ read_char8_type(common, type == OP_NOT_WORDCHAR);
OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, ctype_word);
add_jump(compiler, backtracks, JUMP(type == OP_WORDCHAR ? SLJIT_C_ZERO : SLJIT_C_NOT_ZERO));
return cc;
case OP_ANY:
detect_partial_match(common, backtracks);
- read_char(common);
+ read_char_range(common, common->nlmin, common->nlmax, TRUE);
if (common->nltype == NLTYPE_FIXED && common->newline > 255)
{
jump[0] = CMP(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, (common->newline >> 8) & 0xff);
@@ -4624,7 +5365,7 @@ switch(type)
#ifdef SUPPORT_UCP
case OP_NOTPROP:
case OP_PROP:
- propdata[0] = 0;
+ propdata[0] = XCL_HASPROP;
propdata[1] = type == OP_NOTPROP ? XCL_NOTPROP : XCL_PROP;
propdata[2] = cc[0];
propdata[3] = cc[1];
@@ -4636,7 +5377,7 @@ switch(type)
case OP_ANYNL:
detect_partial_match(common, backtracks);
- read_char(common);
+ read_char_range(common, common->bsr_nlmin, common->bsr_nlmax, FALSE);
jump[0] = CMP(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, CHAR_CR);
/* We don't need to handle soft partial matching case. */
end_list = NULL;
@@ -4658,7 +5399,7 @@ switch(type)
case OP_NOT_HSPACE:
case OP_HSPACE:
detect_partial_match(common, backtracks);
- read_char(common);
+ read_char_range(common, 0x9, 0x3000, type == OP_NOT_HSPACE);
add_jump(compiler, &common->hspace, JUMP(SLJIT_FAST_CALL));
add_jump(compiler, backtracks, JUMP(type == OP_NOT_HSPACE ? SLJIT_C_NOT_ZERO : SLJIT_C_ZERO));
return cc;
@@ -4666,7 +5407,7 @@ switch(type)
case OP_NOT_VSPACE:
case OP_VSPACE:
detect_partial_match(common, backtracks);
- read_char(common);
+ read_char_range(common, 0xa, 0x2029, type == OP_NOT_VSPACE);
add_jump(compiler, &common->vspace, JUMP(SLJIT_FAST_CALL));
add_jump(compiler, backtracks, JUMP(type == OP_NOT_VSPACE ? SLJIT_C_NOT_ZERO : SLJIT_C_ZERO));
return cc;
@@ -4765,7 +5506,7 @@ switch(type)
else
{
OP1(SLJIT_MOV, SLJIT_MEM1(SLJIT_LOCALS_REG), LOCALS1, STR_PTR, 0);
- read_char(common);
+ read_char_range(common, common->nlmin, common->nlmax, TRUE);
add_jump(compiler, backtracks, CMP(SLJIT_C_NOT_EQUAL, STR_PTR, 0, STR_END, 0));
add_jump(compiler, &common->anynewline, JUMP(SLJIT_FAST_CALL));
add_jump(compiler, backtracks, JUMP(SLJIT_C_ZERO));
@@ -4813,7 +5554,7 @@ switch(type)
else
{
skip_char_back(common);
- read_char(common);
+ read_char_range(common, common->nlmin, common->nlmax, TRUE);
check_newlinechar(common, common->nltype, backtracks, FALSE);
}
JUMPHERE(jump[0]);
@@ -4864,7 +5605,7 @@ switch(type)
}
else
{
- peek_char(common);
+ peek_char(common, common->nlmax);
check_newlinechar(common, common->nltype, backtracks, FALSE);
}
JUMPHERE(jump[0]);
@@ -4888,8 +5629,8 @@ switch(type)
#endif
return byte_sequence_compare(common, type == OP_CHARI, cc, &context, backtracks);
}
+
detect_partial_match(common, backtracks);
- read_char(common);
#ifdef SUPPORT_UTF
if (common->utf)
{
@@ -4898,12 +5639,15 @@ switch(type)
else
#endif
c = *cc;
+
if (type == OP_CHAR || !char_has_othercase(common, cc))
{
+ read_char_range(common, c, c, FALSE);
add_jump(compiler, backtracks, CMP(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, c));
return cc + length;
}
oc = char_othercase(common, c);
+ read_char_range(common, c < oc ? c : oc, c > oc ? c : oc, FALSE);
bit = c ^ oc;
if (is_powerof2(bit))
{
@@ -4911,11 +5655,9 @@ switch(type)
add_jump(compiler, backtracks, CMP(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, c | bit));
return cc + length;
}
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, c);
- OP_FLAGS(SLJIT_MOV, TMP2, 0, SLJIT_UNUSED, 0, SLJIT_C_EQUAL);
- OP2(SLJIT_SUB | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, SLJIT_IMM, oc);
- OP_FLAGS(SLJIT_OR | SLJIT_SET_E, TMP2, 0, TMP2, 0, SLJIT_C_EQUAL);
- add_jump(compiler, backtracks, JUMP(SLJIT_C_ZERO));
+ jump[0] = CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, c);
+ add_jump(compiler, backtracks, CMP(SLJIT_C_NOT_EQUAL, TMP1, 0, SLJIT_IMM, oc));
+ JUMPHERE(jump[0]);
return cc + length;
case OP_NOT:
@@ -4950,21 +5692,21 @@ switch(type)
#endif /* COMPILE_PCRE8 */
{
GETCHARLEN(c, cc, length);
- read_char(common);
}
}
else
#endif /* SUPPORT_UTF */
- {
- read_char(common);
c = *cc;
- }
if (type == OP_NOT || !char_has_othercase(common, cc))
+ {
+ read_char_range(common, c, c, TRUE);
add_jump(compiler, backtracks, CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, c));
+ }
else
{
oc = char_othercase(common, c);
+ read_char_range(common, c < oc ? c : oc, c > oc ? c : oc, TRUE);
bit = c ^ oc;
if (is_powerof2(bit))
{
@@ -4982,36 +5724,49 @@ switch(type)
case OP_CLASS:
case OP_NCLASS:
detect_partial_match(common, backtracks);
- read_char(common);
- if (check_class_ranges(common, (const pcre_uint8 *)cc, type == OP_NCLASS, backtracks))
+
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+ bit = (common->utf && is_char7_bitset((const pcre_uint8 *)cc, type == OP_NCLASS)) ? 127 : 255;
+ read_char_range(common, 0, bit, type == OP_NCLASS);
+#else
+ read_char_range(common, 0, 255, type == OP_NCLASS);
+#endif
+
+ if (check_class_ranges(common, (const pcre_uint8 *)cc, type == OP_NCLASS, FALSE, backtracks))
return cc + 32 / sizeof(pcre_uchar);
-#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
jump[0] = NULL;
-#ifdef COMPILE_PCRE8
- /* This check only affects 8 bit mode. In other modes, we
- always need to compare the value with 255. */
if (common->utf)
-#endif /* COMPILE_PCRE8 */
{
- jump[0] = CMP(SLJIT_C_GREATER, TMP1, 0, SLJIT_IMM, 255);
+ jump[0] = CMP(SLJIT_C_GREATER, TMP1, 0, SLJIT_IMM, bit);
if (type == OP_CLASS)
{
add_jump(compiler, backtracks, jump[0]);
jump[0] = NULL;
}
}
-#endif /* SUPPORT_UTF || !COMPILE_PCRE8 */
+#elif !defined COMPILE_PCRE8
+ jump[0] = CMP(SLJIT_C_GREATER, TMP1, 0, SLJIT_IMM, 255);
+ if (type == OP_CLASS)
+ {
+ add_jump(compiler, backtracks, jump[0]);
+ jump[0] = NULL;
+ }
+#endif /* SUPPORT_UTF && COMPILE_PCRE8 */
+
OP2(SLJIT_AND, TMP2, 0, TMP1, 0, SLJIT_IMM, 0x7);
OP2(SLJIT_LSHR, TMP1, 0, TMP1, 0, SLJIT_IMM, 3);
OP1(SLJIT_MOV_UB, TMP1, 0, SLJIT_MEM1(TMP1), (sljit_sw)cc);
OP2(SLJIT_SHL, TMP2, 0, SLJIT_IMM, 1, TMP2, 0);
OP2(SLJIT_AND | SLJIT_SET_E, SLJIT_UNUSED, 0, TMP1, 0, TMP2, 0);
add_jump(compiler, backtracks, JUMP(SLJIT_C_ZERO));
+
#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
if (jump[0] != NULL)
JUMPHERE(jump[0]);
-#endif /* SUPPORT_UTF || !COMPILE_PCRE8 */
+#endif
+
return cc + 32 / sizeof(pcre_uchar);
#if defined SUPPORT_UTF || defined COMPILE_PCRE16 || defined COMPILE_PCRE32
@@ -7345,7 +8100,7 @@ if (*cc == OP_FAIL)
return cc + 1;
}
-if (*cc == OP_ASSERT_ACCEPT || common->currententry != NULL)
+if (*cc == OP_ASSERT_ACCEPT || common->currententry != NULL || !common->might_be_empty)
{
/* No need to check notempty conditions. */
if (common->accept_label == NULL)
@@ -8047,21 +8802,21 @@ if (bra == OP_BRAZERO)
static void compile_bracket_backtrackingpath(compiler_common *common, struct backtrack_common *current)
{
DEFINE_COMPILER;
-int opcode, stacksize, count;
+int opcode, stacksize, alt_count, alt_max;
int offset = 0;
int private_data_ptr = CURRENT_AS(bracket_backtrack)->private_data_ptr;
int repeat_ptr = 0, repeat_type = 0, repeat_count = 0;
pcre_uchar *cc = current->cc;
pcre_uchar *ccbegin;
pcre_uchar *ccprev;
-jump_list *jumplist = NULL;
-jump_list *jumplistitem = NULL;
pcre_uchar bra = OP_BRA;
pcre_uchar ket;
assert_backtrack *assert;
BOOL has_alternatives;
BOOL needs_control_head = FALSE;
struct sljit_jump *brazero = NULL;
+struct sljit_jump *alt1 = NULL;
+struct sljit_jump *alt2 = NULL;
struct sljit_jump *once = NULL;
struct sljit_jump *cond = NULL;
struct sljit_label *rmin_label = NULL;
@@ -8099,6 +8854,8 @@ if (SLJIT_UNLIKELY(opcode == OP_COND) && (*cc == OP_KETRMAX || *cc == OP_KETRMIN
if (SLJIT_UNLIKELY(opcode == OP_ONCE_NC))
opcode = OP_ONCE;
+alt_max = has_alternatives ? no_alternatives(ccbegin) : 0;
+
/* Decoding the needs_control_head in framesize. */
if (opcode == OP_ONCE)
{
@@ -8212,44 +8969,27 @@ else if (SLJIT_UNLIKELY(opcode == OP_COND) || SLJIT_UNLIKELY(opcode == OP_SCOND)
OP1(SLJIT_MOV, TMP1, 0, SLJIT_MEM1(STACK_TOP), STACK(0));
free_stack(common, 1);
- jumplistitem = sljit_alloc_memory(compiler, sizeof(jump_list));
- if (SLJIT_UNLIKELY(!jumplistitem))
- return;
- jumplist = jumplistitem;
- jumplistitem->next = NULL;
- jumplistitem->jump = CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, 1);
+ alt_max = 2;
+ alt1 = CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, sizeof(sljit_uw));
}
}
-else if (*cc == OP_ALT)
+else if (has_alternatives)
{
- /* Build a jump list. Get the last successfully matched branch index. */
OP1(SLJIT_MOV, TMP1, 0, SLJIT_MEM1(STACK_TOP), STACK(0));
free_stack(common, 1);
- count = 1;
- do
- {
- /* Append as the last item. */
- if (jumplist != NULL)
- {
- jumplistitem->next = sljit_alloc_memory(compiler, sizeof(jump_list));
- jumplistitem = jumplistitem->next;
- }
- else
- {
- jumplistitem = sljit_alloc_memory(compiler, sizeof(jump_list));
- jumplist = jumplistitem;
- }
- if (SLJIT_UNLIKELY(!jumplistitem))
- return;
-
- jumplistitem->next = NULL;
- jumplistitem->jump = CMP(SLJIT_C_EQUAL, TMP1, 0, SLJIT_IMM, count++);
- cc += GET(cc, 1);
+ if (alt_max > 4)
+ {
+ /* Table jump if alt_max is greater than 4. */
+ sljit_emit_ijump(compiler, SLJIT_JUMP, SLJIT_MEM1(TMP1), (sljit_sw)common->read_only_data_ptr);
+ add_label_addr(common);
+ }
+ else
+ {
+ if (alt_max == 4)
+ alt2 = CMP(SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, 2 * sizeof(sljit_uw));
+ alt1 = CMP(SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, sizeof(sljit_uw));
}
- while (*cc == OP_ALT);
-
- cc = ccbegin + GET(ccbegin, 1);
}
COMPILE_BACKTRACKINGPATH(current->top);
@@ -8284,7 +9024,7 @@ if (SLJIT_UNLIKELY(opcode == OP_COND) || SLJIT_UNLIKELY(opcode == OP_SCOND))
if (has_alternatives)
{
- count = 1;
+ alt_count = sizeof(sljit_uw);
do
{
current->top = NULL;
@@ -8360,7 +9100,7 @@ if (has_alternatives)
stacksize = match_capture_common(common, stacksize, offset, private_data_ptr);
if (opcode != OP_ONCE)
- OP1(SLJIT_MOV, SLJIT_MEM1(STACK_TOP), STACK(stacksize), SLJIT_IMM, count++);
+ OP1(SLJIT_MOV, SLJIT_MEM1(STACK_TOP), STACK(stacksize), SLJIT_IMM, alt_count);
if (offset != 0 && ket == OP_KETRMAX && common->optimized_cbracket[offset >> 1] != 0)
{
@@ -8373,9 +9113,24 @@ if (has_alternatives)
if (opcode != OP_ONCE)
{
- SLJIT_ASSERT(jumplist);
- JUMPHERE(jumplist->jump);
- jumplist = jumplist->next;
+ if (alt_max > 4)
+ add_label_addr(common);
+ else
+ {
+ if (alt_count != 2 * sizeof(sljit_uw))
+ {
+ JUMPHERE(alt1);
+ if (alt_max == 3 && alt_count == sizeof(sljit_uw))
+ alt2 = CMP(SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, 2 * sizeof(sljit_uw));
+ }
+ else
+ {
+ JUMPHERE(alt2);
+ if (alt_max == 4)
+ alt1 = CMP(SLJIT_C_GREATER_EQUAL, TMP1, 0, SLJIT_IMM, 3 * sizeof(sljit_uw));
+ }
+ }
+ alt_count += sizeof(sljit_uw);
}
COMPILE_BACKTRACKINGPATH(current->top);
@@ -8384,7 +9139,6 @@ if (has_alternatives)
SLJIT_ASSERT(!current->nextbacktracks);
}
while (*cc == OP_ALT);
- SLJIT_ASSERT(!jumplist);
if (cond != NULL)
{
@@ -8985,16 +9739,18 @@ pcre_uchar *ccend;
executable_functions *functions;
void *executable_func;
sljit_uw executable_size;
+sljit_uw total_length;
+label_addr_list *label_addr;
struct sljit_label *mainloop_label = NULL;
struct sljit_label *continue_match_label;
-struct sljit_label *empty_match_found_label;
-struct sljit_label *empty_match_backtrack_label;
+struct sljit_label *empty_match_found_label = NULL;
+struct sljit_label *empty_match_backtrack_label = NULL;
struct sljit_label *reset_match_label;
+struct sljit_label *quit_label;
struct sljit_jump *jump;
struct sljit_jump *minlength_check_failed = NULL;
struct sljit_jump *reqbyte_notfound = NULL;
-struct sljit_jump *empty_match;
-struct sljit_label *quit_label;
+struct sljit_jump *empty_match = NULL;
SLJIT_ASSERT((extra->flags & PCRE_EXTRA_STUDY_DATA) != 0);
study = extra->study_data;
@@ -9007,9 +9763,13 @@ memset(common, 0, sizeof(compiler_common));
rootbacktrack.cc = (pcre_uchar *)re + re->name_table_offset + re->name_count * re->name_entry_size;
common->start = rootbacktrack.cc;
+common->read_only_data = NULL;
+common->read_only_data_size = 0;
+common->read_only_data_ptr = NULL;
common->fcc = tables + fcc_offset;
common->lcc = (sljit_sw)(tables + lcc_offset);
common->mode = mode;
+common->might_be_empty = study->minlength == 0;
common->nltype = NLTYPE_FIXED;
switch(re->options & PCRE_NEWLINE_BITS)
{
@@ -9030,6 +9790,8 @@ switch(re->options & PCRE_NEWLINE_BITS)
case PCRE_NEWLINE_ANYCRLF: common->newline = (CHAR_CR << 8) | CHAR_NL; common->nltype = NLTYPE_ANYCRLF; break;
default: return;
}
+common->nlmax = READ_CHAR_MAX;
+common->nlmin = 0;
if ((re->options & PCRE_BSR_ANYCRLF) != 0)
common->bsr_nltype = NLTYPE_ANYCRLF;
else if ((re->options & PCRE_BSR_UNICODE) != 0)
@@ -9042,9 +9804,10 @@ else
common->bsr_nltype = NLTYPE_ANY;
#endif
}
+common->bsr_nlmax = READ_CHAR_MAX;
+common->bsr_nlmin = 0;
common->endonly = (re->options & PCRE_DOLLAR_ENDONLY) != 0;
common->ctypes = (sljit_sw)(tables + ctypes_offset);
-common->digits[0] = -2;
common->name_table = ((pcre_uchar *)re) + re->name_table_offset;
common->name_count = re->name_count;
common->name_entry_size = re->name_entry_size;
@@ -9055,8 +9818,31 @@ common->utf = (re->options & PCRE_UTF8) != 0;
#ifdef SUPPORT_UCP
common->use_ucp = (re->options & PCRE_UCP) != 0;
#endif
+if (common->utf)
+ {
+ if (common->nltype == NLTYPE_ANY)
+ common->nlmax = 0x2029;
+ else if (common->nltype == NLTYPE_ANYCRLF)
+ common->nlmax = (CHAR_CR > CHAR_NL) ? CHAR_CR : CHAR_NL;
+ else
+ {
+ /* We only care about the first newline character. */
+ common->nlmax = common->newline & 0xff;
+ }
+
+ if (common->nltype == NLTYPE_FIXED)
+ common->nlmin = common->newline & 0xff;
+ else
+ common->nlmin = (CHAR_CR < CHAR_NL) ? CHAR_CR : CHAR_NL;
+
+ if (common->bsr_nltype == NLTYPE_ANY)
+ common->bsr_nlmax = 0x2029;
+ else
+ common->bsr_nlmax = (CHAR_CR > CHAR_NL) ? CHAR_CR : CHAR_NL;
+ common->bsr_nlmin = (CHAR_CR < CHAR_NL) ? CHAR_CR : CHAR_NL;
+ }
#endif /* SUPPORT_UTF */
-ccend = bracketend(rootbacktrack.cc);
+ccend = bracketend(common->start);
/* Calculate the local space size on the stack. */
common->ovector_start = LIMIT_MATCH + sizeof(sljit_sw);
@@ -9069,12 +9855,12 @@ memset(common->optimized_cbracket, 0, re->top_bracket + 1);
memset(common->optimized_cbracket, 1, re->top_bracket + 1);
#endif
-SLJIT_ASSERT(*rootbacktrack.cc == OP_BRA && ccend[-(1 + LINK_SIZE)] == OP_KET);
+SLJIT_ASSERT(*common->start == OP_BRA && ccend[-(1 + LINK_SIZE)] == OP_KET);
#if defined DEBUG_FORCE_UNOPTIMIZED_CBRAS && DEBUG_FORCE_UNOPTIMIZED_CBRAS == 2
common->capture_last_ptr = common->ovector_start;
common->ovector_start += sizeof(sljit_sw);
#endif
-if (!check_opcode_types(common, rootbacktrack.cc, ccend))
+if (!check_opcode_types(common, common->start, ccend))
{
SLJIT_FREE(common->optimized_cbracket);
return;
@@ -9137,13 +9923,14 @@ if (common->capture_last_ptr != 0)
SLJIT_ASSERT(!(common->req_char_ptr != 0 && common->start_used_ptr != 0));
common->cbra_ptr = OVECTOR_START + (re->top_bracket + 1) * 2 * sizeof(sljit_sw);
-common->private_data_ptrs = (int *)SLJIT_MALLOC((ccend - rootbacktrack.cc) * sizeof(sljit_si));
+total_length = ccend - common->start;
+common->private_data_ptrs = (sljit_si *)SLJIT_MALLOC(total_length * (sizeof(sljit_si) + (common->has_then ? 1 : 0)));
if (!common->private_data_ptrs)
{
SLJIT_FREE(common->optimized_cbracket);
return;
}
-memset(common->private_data_ptrs, 0, (ccend - rootbacktrack.cc) * sizeof(int));
+memset(common->private_data_ptrs, 0, total_length * sizeof(sljit_si));
private_data_size = common->cbra_ptr + (re->top_bracket + 1) * sizeof(sljit_sw);
set_private_data_ptrs(common, &private_data_size, ccend);
@@ -9156,15 +9943,21 @@ if (private_data_size > SLJIT_MAX_LOCAL_SIZE)
if (common->has_then)
{
- common->then_offsets = (pcre_uint8 *)SLJIT_MALLOC(ccend - rootbacktrack.cc);
- if (!common->then_offsets)
+ common->then_offsets = (pcre_uint8 *)(common->private_data_ptrs + total_length);
+ memset(common->then_offsets, 0, total_length);
+ set_then_offsets(common, common->start, NULL);
+ }
+
+if (common->read_only_data_size > 0)
+ {
+ common->read_only_data = (sljit_uw *)SLJIT_MALLOC(common->read_only_data_size);
+ if (common->read_only_data == NULL)
{
SLJIT_FREE(common->optimized_cbracket);
SLJIT_FREE(common->private_data_ptrs);
return;
}
- memset(common->then_offsets, 0, ccend - rootbacktrack.cc);
- set_then_offsets(common, rootbacktrack.cc, NULL);
+ common->read_only_data_ptr = common->read_only_data;
}
compiler = sljit_create_compiler();
@@ -9172,8 +9965,8 @@ if (!compiler)
{
SLJIT_FREE(common->optimized_cbracket);
SLJIT_FREE(common->private_data_ptrs);
- if (common->has_then)
- SLJIT_FREE(common->then_offsets);
+ if (common->read_only_data)
+ SLJIT_FREE(common->read_only_data);
return;
}
common->compiler = compiler;
@@ -9212,13 +10005,22 @@ if ((re->options & PCRE_ANCHORED) == 0)
if ((re->options & PCRE_NO_START_OPTIMIZE) == 0)
{
if (mode == JIT_COMPILE && fast_forward_first_n_chars(common, (re->options & PCRE_FIRSTLINE) != 0))
- { /* Do nothing */ }
+ {
+ /* If read_only_data is reallocated, we might have an allocation failure. */
+ if (common->read_only_data_size > 0 && common->read_only_data == NULL)
+ {
+ sljit_free_compiler(compiler);
+ SLJIT_FREE(common->optimized_cbracket);
+ SLJIT_FREE(common->private_data_ptrs);
+ return;
+ }
+ }
else if ((re->flags & PCRE_FIRSTSET) != 0)
fast_forward_first_char(common, (pcre_uchar)re->first_char, (re->flags & PCRE_FCH_CASELESS) != 0, (re->options & PCRE_FIRSTLINE) != 0);
else if ((re->flags & PCRE_STARTLINE) != 0)
fast_forward_newline(common, (re->options & PCRE_FIRSTLINE) != 0);
else if ((re->flags & PCRE_STARTLINE) == 0 && study != NULL && (study->flags & PCRE_STUDY_MAPPED) != 0)
- fast_forward_start_bits(common, (sljit_uw)study->start_bits, (re->options & PCRE_FIRSTLINE) != 0);
+ fast_forward_start_bits(common, study->start_bits, (re->options & PCRE_FIRSTLINE) != 0);
}
}
else
@@ -9259,19 +10061,22 @@ if (mode == JIT_PARTIAL_SOFT_COMPILE)
else if (mode == JIT_PARTIAL_HARD_COMPILE)
OP1(SLJIT_MOV, SLJIT_MEM1(SLJIT_LOCALS_REG), common->start_used_ptr, STR_PTR, 0);
-compile_matchingpath(common, rootbacktrack.cc, ccend, &rootbacktrack);
+compile_matchingpath(common, common->start, ccend, &rootbacktrack);
if (SLJIT_UNLIKELY(sljit_get_compiler_error(compiler)))
{
sljit_free_compiler(compiler);
SLJIT_FREE(common->optimized_cbracket);
SLJIT_FREE(common->private_data_ptrs);
- if (common->has_then)
- SLJIT_FREE(common->then_offsets);
+ if (common->read_only_data)
+ SLJIT_FREE(common->read_only_data);
return;
}
-empty_match = CMP(SLJIT_C_EQUAL, STR_PTR, 0, SLJIT_MEM1(SLJIT_LOCALS_REG), OVECTOR(0));
-empty_match_found_label = LABEL();
+if (common->might_be_empty)
+ {
+ empty_match = CMP(SLJIT_C_EQUAL, STR_PTR, 0, SLJIT_MEM1(SLJIT_LOCALS_REG), OVECTOR(0));
+ empty_match_found_label = LABEL();
+ }
common->accept_label = LABEL();
if (common->accept != NULL)
@@ -9295,15 +10100,16 @@ if (mode != JIT_COMPILE)
return_with_partial_match(common, common->quit_label);
}
-empty_match_backtrack_label = LABEL();
+if (common->might_be_empty)
+ empty_match_backtrack_label = LABEL();
compile_backtrackingpath(common, rootbacktrack.top);
if (SLJIT_UNLIKELY(sljit_get_compiler_error(compiler)))
{
sljit_free_compiler(compiler);
SLJIT_FREE(common->optimized_cbracket);
SLJIT_FREE(common->private_data_ptrs);
- if (common->has_then)
- SLJIT_FREE(common->then_offsets);
+ if (common->read_only_data)
+ SLJIT_FREE(common->read_only_data);
return;
}
@@ -9331,10 +10137,19 @@ OP1(SLJIT_MOV, STR_PTR, 0, SLJIT_MEM1(SLJIT_LOCALS_REG), common->start_ptr);
if ((re->options & PCRE_ANCHORED) == 0)
{
- if ((re->options & PCRE_FIRSTLINE) == 0)
- CMPTO(SLJIT_C_LESS, STR_PTR, 0, STR_END, 0, mainloop_label);
+ if (common->ff_newline_shortcut != NULL)
+ {
+ if ((re->options & PCRE_FIRSTLINE) == 0)
+ CMPTO(SLJIT_C_LESS, STR_PTR, 0, STR_END, 0, common->ff_newline_shortcut);
+ /* There cannot be more newlines here. */
+ }
else
- CMPTO(SLJIT_C_LESS, STR_PTR, 0, TMP1, 0, mainloop_label);
+ {
+ if ((re->options & PCRE_FIRSTLINE) == 0)
+ CMPTO(SLJIT_C_LESS, STR_PTR, 0, STR_END, 0, mainloop_label);
+ else
+ CMPTO(SLJIT_C_LESS, STR_PTR, 0, TMP1, 0, mainloop_label);
+ }
}
/* No more remaining characters. */
@@ -9349,15 +10164,18 @@ JUMPTO(SLJIT_JUMP, common->quit_label);
flush_stubs(common);
-JUMPHERE(empty_match);
-OP1(SLJIT_MOV, TMP1, 0, ARGUMENTS, 0);
-OP1(SLJIT_MOV_UB, TMP2, 0, SLJIT_MEM1(TMP1), SLJIT_OFFSETOF(jit_arguments, notempty));
-CMPTO(SLJIT_C_NOT_EQUAL, TMP2, 0, SLJIT_IMM, 0, empty_match_backtrack_label);
-OP1(SLJIT_MOV_UB, TMP2, 0, SLJIT_MEM1(TMP1), SLJIT_OFFSETOF(jit_arguments, notempty_atstart));
-CMPTO(SLJIT_C_EQUAL, TMP2, 0, SLJIT_IMM, 0, empty_match_found_label);
-OP1(SLJIT_MOV, TMP2, 0, SLJIT_MEM1(TMP1), SLJIT_OFFSETOF(jit_arguments, str));
-CMPTO(SLJIT_C_NOT_EQUAL, TMP2, 0, STR_PTR, 0, empty_match_found_label);
-JUMPTO(SLJIT_JUMP, empty_match_backtrack_label);
+if (common->might_be_empty)
+ {
+ JUMPHERE(empty_match);
+ OP1(SLJIT_MOV, TMP1, 0, ARGUMENTS, 0);
+ OP1(SLJIT_MOV_UB, TMP2, 0, SLJIT_MEM1(TMP1), SLJIT_OFFSETOF(jit_arguments, notempty));
+ CMPTO(SLJIT_C_NOT_EQUAL, TMP2, 0, SLJIT_IMM, 0, empty_match_backtrack_label);
+ OP1(SLJIT_MOV_UB, TMP2, 0, SLJIT_MEM1(TMP1), SLJIT_OFFSETOF(jit_arguments, notempty_atstart));
+ CMPTO(SLJIT_C_EQUAL, TMP2, 0, SLJIT_IMM, 0, empty_match_found_label);
+ OP1(SLJIT_MOV, TMP2, 0, SLJIT_MEM1(TMP1), SLJIT_OFFSETOF(jit_arguments, str));
+ CMPTO(SLJIT_C_NOT_EQUAL, TMP2, 0, STR_PTR, 0, empty_match_found_label);
+ JUMPTO(SLJIT_JUMP, empty_match_backtrack_label);
+ }
common->currententry = common->entries;
common->local_exit = TRUE;
@@ -9371,8 +10189,8 @@ while (common->currententry != NULL)
sljit_free_compiler(compiler);
SLJIT_FREE(common->optimized_cbracket);
SLJIT_FREE(common->private_data_ptrs);
- if (common->has_then)
- SLJIT_FREE(common->then_offsets);
+ if (common->read_only_data)
+ SLJIT_FREE(common->read_only_data);
return;
}
flush_stubs(common);
@@ -9456,14 +10274,17 @@ if (common->reset_match != NULL)
JUMPTO(SLJIT_JUMP, reset_match_label);
}
#ifdef SUPPORT_UTF
-#ifndef COMPILE_PCRE32
+#ifdef COMPILE_PCRE8
if (common->utfreadchar != NULL)
{
set_jumps(common->utfreadchar, LABEL());
do_utfreadchar(common);
}
-#endif /* !COMPILE_PCRE32 */
-#ifdef COMPILE_PCRE8
+if (common->utfreadchar16 != NULL)
+ {
+ set_jumps(common->utfreadchar16, LABEL());
+ do_utfreadchar16(common);
+ }
if (common->utfreadtype8 != NULL)
{
set_jumps(common->utfreadtype8, LABEL());
@@ -9479,16 +10300,25 @@ if (common->getucd != NULL)
}
#endif
+SLJIT_ASSERT(common->read_only_data + (common->read_only_data_size >> SLJIT_WORD_SHIFT) == common->read_only_data_ptr);
SLJIT_FREE(common->optimized_cbracket);
SLJIT_FREE(common->private_data_ptrs);
-if (common->has_then)
- SLJIT_FREE(common->then_offsets);
executable_func = sljit_generate_code(compiler);
executable_size = sljit_get_generated_code_size(compiler);
+label_addr = common->label_addrs;
+while (label_addr != NULL)
+ {
+ *label_addr->addr = sljit_get_label_addr(label_addr->label);
+ label_addr = label_addr->next;
+ }
sljit_free_compiler(compiler);
if (executable_func == NULL)
+ {
+ if (common->read_only_data)
+ SLJIT_FREE(common->read_only_data);
return;
+ }
/* Reuse the function descriptor if possible. */
if ((extra->flags & PCRE_EXTRA_EXECUTABLE_JIT) != 0 && extra->executable_jit != NULL)
@@ -9508,8 +10338,10 @@ else
if (functions == NULL)
{
/* This case is highly unlikely since we just recently
- freed a lot of memory. Although not impossible. */
+ freed a lot of memory. Not impossible though. */
sljit_free_code(executable_func);
+ if (common->read_only_data)
+ SLJIT_FREE(common->read_only_data);
return;
}
memset(functions, 0, sizeof(executable_functions));
@@ -9520,6 +10352,7 @@ else
}
functions->executable_funcs[mode] = executable_func;
+functions->read_only_data[mode] = common->read_only_data;
functions->executable_sizes[mode] = executable_size;
}
@@ -9706,6 +10539,8 @@ for (i = 0; i < JIT_NUMBER_OF_COMPILE_MODES; i++)
{
if (functions->executable_funcs[i] != NULL)
sljit_free_code(functions->executable_funcs[i]);
+ if (functions->read_only_data[i] != NULL)
+ SLJIT_FREE(functions->read_only_data[i]);
}
SLJIT_FREE(functions);
}
diff --git a/pcre/pcre_jit_test.c b/pcre/pcre_jit_test.c
index cabd2560c57..a40913ef0a5 100644
--- a/pcre/pcre_jit_test.c
+++ b/pcre/pcre_jit_test.c
@@ -75,9 +75,14 @@ POSSIBILITY OF SUCH DAMAGE.
\xe1\xbf\xb8 = 0x1ff8 = 8184
\xf0\x90\x90\x80 = 0x10400 = 66560
\xf0\x90\x90\xa8 = 0x10428 = 66600
+ \xc7\x84 = 0x1c4 = 452
+ \xc7\x85 = 0x1c5 = 453
+ \xc7\x86 = 0x1c6 = 454
+
Mark property:
\xcc\x8d = 0x30d = 781
Special:
+ \xc2\x80 = 0x80 = 128 (lowest 2 byte character)
\xdf\xbf = 0x7ff = 2047 (highest 2 byte character)
\xe0\xa0\x80 = 0x800 = 2048 (lowest 2 byte character)
\xef\xbf\xbf = 0xffff = 65535 (highest 3 byte character)
@@ -326,6 +331,22 @@ static struct regression_test_case regression_test_cases[] = {
{ MUA, 0, "\\w(\\s|(?:\\d)*,)+\\w\\wb", "a 5, 4,, bb 5, 4,, aab" },
{ MUA, 0, "(\\v+)(\\V+)", "\x0e\xc2\x85\xe2\x80\xa8\x0b\x09\xe2\x80\xa9" },
{ MUA, 0, "(\\h+)(\\H+)", "\xe2\x80\xa8\xe2\x80\x80\x20\xe2\x80\x8a\xe2\x81\x9f\xe3\x80\x80\x09\x20\xc2\xa0\x0a" },
+ { MUA, 0, "x[bcef]+", "xaxdxecbfg" },
+ { MUA, 0, "x[bcdghij]+", "xaxexfxdgbjk" },
+ { MUA, 0, "x[^befg]+", "xbxexacdhg" },
+ { MUA, 0, "x[^bcdl]+", "xlxbxaekmd" },
+ { MUA, 0, "x[^bcdghi]+", "xbxdxgxaefji" },
+ { MUA, 0, "x[B-Fb-f]+", "xaxAxgxbfBFG" },
+ { CMUA, 0, "\\x{e9}+", "#\xf0\x90\x90\xa8\xc3\xa8\xc3\xa9\xc3\x89\xc3\x88" },
+ { CMUA, 0, "[^\\x{e9}]+", "\xc3\xa9#\xf0\x90\x90\xa8\xc3\xa8\xc3\x88\xc3\x89" },
+ { MUA, 0, "[\\x02\\x7e]+", "\xc3\x81\xe1\xbf\xb8\xf0\x90\x90\xa8\x01\x02\x7e\x7f" },
+ { MUA, 0, "[^\\x02\\x7e]+", "\x02\xc3\x81\xe1\xbf\xb8\xf0\x90\x90\xa8\x01\x7f\x7e" },
+ { MUA, 0, "[\\x{81}-\\x{7fe}]+", "#\xe1\xbf\xb8\xf0\x90\x90\xa8\xc2\x80\xc2\x81\xdf\xbe\xdf\xbf" },
+ { MUA, 0, "[^\\x{81}-\\x{7fe}]+", "\xc2\x81#\xe1\xbf\xb8\xf0\x90\x90\xa8\xc2\x80\xdf\xbf\xdf\xbe" },
+ { MUA, 0, "[\\x{801}-\\x{fffe}]+", "#\xc3\xa9\xf0\x90\x90\x80\xe0\xa0\x80\xe0\xa0\x81\xef\xbf\xbe\xef\xbf\xbf" },
+ { MUA, 0, "[^\\x{801}-\\x{fffe}]+", "\xe0\xa0\x81#\xc3\xa9\xf0\x90\x90\x80\xe0\xa0\x80\xef\xbf\xbf\xef\xbf\xbe" },
+ { MUA, 0, "[\\x{10001}-\\x{10fffe}]+", "#\xc3\xa9\xe2\xb1\xa5\xf0\x90\x80\x80\xf0\x90\x80\x81\xf4\x8f\xbf\xbe\xf4\x8f\xbf\xbf" },
+ { MUA, 0, "[^\\x{10001}-\\x{10fffe}]+", "\xf0\x90\x80\x81#\xc3\xa9\xe2\xb1\xa5\xf0\x90\x80\x80\xf4\x8f\xbf\xbf\xf4\x8f\xbf\xbe" },
/* Unicode properties. */
{ MUAP, 0, "[1-5\xc3\xa9\\w]", "\xc3\xa1_" },
@@ -371,6 +392,10 @@ static struct regression_test_case regression_test_cases[] = {
{ PCRE_MULTILINE | PCRE_NEWLINE_CRLF, 0, "\\W{0,2}[^#]{3}", "\r\n#....." },
{ PCRE_MULTILINE | PCRE_NEWLINE_CR, 0, "\\W{0,2}[^#]{3}", "\r\n#....." },
{ PCRE_MULTILINE | PCRE_NEWLINE_CRLF, 0, "\\W{1,3}[^#]", "\r\n##...." },
+ { MUA | PCRE_NO_UTF8_CHECK, 1, "^.a", "\n\x80\nxa" },
+ { MUA, 1, "^", "\r\n" },
+ { PCRE_MULTILINE | PCRE_NEWLINE_CRLF, 1 | F_NOMATCH, "^", "\r\n" },
+ { PCRE_MULTILINE | PCRE_NEWLINE_CRLF, 1, "^", "\r\na" },
/* Any character except newline or any newline. */
{ PCRE_NEWLINE_CRLF, 0, ".", "\r" },
@@ -629,6 +654,7 @@ static struct regression_test_case regression_test_cases[] = {
{ PCRE_MULTILINE | PCRE_UTF8 | PCRE_NEWLINE_CRLF | PCRE_FIRSTLINE, 0 | F_NOMATCH | F_PROPERTY, "\\p{Any}{4}|a", "\r\na" },
{ PCRE_MULTILINE | PCRE_UTF8 | PCRE_NEWLINE_CRLF | PCRE_FIRSTLINE, 1, ".", "\r\n" },
{ PCRE_FIRSTLINE | PCRE_NEWLINE_LF | PCRE_DOTALL, 0 | F_NOMATCH, "ab.", "ab" },
+ { MUA | PCRE_FIRSTLINE, 1 | F_NOMATCH, "^[a-d0-9]", "\nxx\nd" },
/* Recurse. */
{ MUA, 0, "(a)(?1)", "aa" },
@@ -959,7 +985,7 @@ static int convert_utf8_to_utf16(const char *input, PCRE_UCHAR16 *output, int *o
if (offsetmap)
*offsetmap++ = (int)(iptr - (unsigned char*)input);
- if (!(*iptr & 0x80))
+ if (*iptr < 0xc0)
c = *iptr++;
else if (!(*iptr & 0x20)) {
c = ((iptr[0] & 0x1f) << 6) | (iptr[1] & 0x3f);
@@ -1031,7 +1057,7 @@ static int convert_utf8_to_utf32(const char *input, PCRE_UCHAR32 *output, int *o
if (offsetmap)
*offsetmap++ = (int)(iptr - (unsigned char*)input);
- if (!(*iptr & 0x80))
+ if (*iptr < 0xc0)
c = *iptr++;
else if (!(*iptr & 0x20)) {
c = ((iptr[0] & 0x1f) << 6) | (iptr[1] & 0x3f);
@@ -1092,7 +1118,7 @@ static int regression_tests(void)
const char *error;
char *cpu_info;
int i, err_offs;
- int is_successful, is_ascii_pattern, is_ascii_input;
+ int is_successful, is_ascii;
int total = 0;
int successful = 0;
int successful_row = 0;
@@ -1173,13 +1199,9 @@ static int regression_tests(void)
while (current->pattern) {
/* printf("\nPattern: %s :\n", current->pattern); */
total++;
- if (current->start_offset & F_PROPERTY) {
- is_ascii_pattern = 0;
- is_ascii_input = 0;
- } else {
- is_ascii_pattern = check_ascii(current->pattern);
- is_ascii_input = check_ascii(current->input);
- }
+ is_ascii = 0;
+ if (!(current->start_offset & F_PROPERTY))
+ is_ascii = check_ascii(current->pattern) && check_ascii(current->input);
if (current->flags & PCRE_PARTIAL_SOFT)
study_mode = PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE;
@@ -1211,7 +1233,7 @@ static int regression_tests(void)
re8 = NULL;
}
extra8->flags |= PCRE_EXTRA_MARK;
- } else if (((utf && ucp) || is_ascii_pattern) && !(current->start_offset & F_NO8))
+ } else if (((utf && ucp) || is_ascii) && !(current->start_offset & F_NO8))
printf("\n8 bit: Cannot compile pattern \"%s\": %s\n", current->pattern, error);
#endif
#ifdef SUPPORT_PCRE16
@@ -1242,7 +1264,7 @@ static int regression_tests(void)
re16 = NULL;
}
extra16->flags |= PCRE_EXTRA_MARK;
- } else if (((utf && ucp) || is_ascii_pattern) && !(current->start_offset & F_NO16))
+ } else if (((utf && ucp) || is_ascii) && !(current->start_offset & F_NO16))
printf("\n16 bit: Cannot compile pattern \"%s\": %s\n", current->pattern, error);
#endif
#ifdef SUPPORT_PCRE32
@@ -1273,7 +1295,7 @@ static int regression_tests(void)
re32 = NULL;
}
extra32->flags |= PCRE_EXTRA_MARK;
- } else if (((utf && ucp) || is_ascii_pattern) && !(current->start_offset & F_NO32))
+ } else if (((utf && ucp) || is_ascii) && !(current->start_offset & F_NO32))
printf("\n32 bit: Cannot compile pattern \"%s\": %s\n", current->pattern, error);
#endif
@@ -1305,10 +1327,10 @@ static int regression_tests(void)
if ((counter & 0x1) != 0) {
setstack8(extra8);
return_value8[0] = pcre_exec(re8, extra8, current->input, strlen(current->input), current->start_offset & OFFSET_MASK,
- current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD), ovector8_1, 32);
+ current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD | PCRE_NO_UTF8_CHECK), ovector8_1, 32);
} else
return_value8[0] = pcre_jit_exec(re8, extra8, current->input, strlen(current->input), current->start_offset & OFFSET_MASK,
- current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD), ovector8_1, 32, getstack8());
+ current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD | PCRE_NO_UTF8_CHECK), ovector8_1, 32, getstack8());
memset(&dummy_extra8, 0, sizeof(pcre_extra));
dummy_extra8.flags = PCRE_EXTRA_MARK;
if (current->start_offset & F_STUDY) {
@@ -1317,7 +1339,7 @@ static int regression_tests(void)
}
dummy_extra8.mark = &mark8_2;
return_value8[1] = pcre_exec(re8, &dummy_extra8, current->input, strlen(current->input), current->start_offset & OFFSET_MASK,
- current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD), ovector8_2, 32);
+ current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD | PCRE_NO_UTF8_CHECK), ovector8_2, 32);
}
#endif
@@ -1339,10 +1361,10 @@ static int regression_tests(void)
if ((counter & 0x1) != 0) {
setstack16(extra16);
return_value16[0] = pcre16_exec(re16, extra16, regtest_buf16, length16, current->start_offset & OFFSET_MASK,
- current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD), ovector16_1, 32);
+ current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD | PCRE_NO_UTF8_CHECK), ovector16_1, 32);
} else
return_value16[0] = pcre16_jit_exec(re16, extra16, regtest_buf16, length16, current->start_offset & OFFSET_MASK,
- current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD), ovector16_1, 32, getstack16());
+ current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD | PCRE_NO_UTF8_CHECK), ovector16_1, 32, getstack16());
memset(&dummy_extra16, 0, sizeof(pcre16_extra));
dummy_extra16.flags = PCRE_EXTRA_MARK;
if (current->start_offset & F_STUDY) {
@@ -1351,7 +1373,7 @@ static int regression_tests(void)
}
dummy_extra16.mark = &mark16_2;
return_value16[1] = pcre16_exec(re16, &dummy_extra16, regtest_buf16, length16, current->start_offset & OFFSET_MASK,
- current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD), ovector16_2, 32);
+ current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD | PCRE_NO_UTF8_CHECK), ovector16_2, 32);
}
#endif
@@ -1373,10 +1395,10 @@ static int regression_tests(void)
if ((counter & 0x1) != 0) {
setstack32(extra32);
return_value32[0] = pcre32_exec(re32, extra32, regtest_buf32, length32, current->start_offset & OFFSET_MASK,
- current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD), ovector32_1, 32);
+ current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD | PCRE_NO_UTF8_CHECK), ovector32_1, 32);
} else
return_value32[0] = pcre32_jit_exec(re32, extra32, regtest_buf32, length32, current->start_offset & OFFSET_MASK,
- current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD), ovector32_1, 32, getstack32());
+ current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD | PCRE_NO_UTF8_CHECK), ovector32_1, 32, getstack32());
memset(&dummy_extra32, 0, sizeof(pcre32_extra));
dummy_extra32.flags = PCRE_EXTRA_MARK;
if (current->start_offset & F_STUDY) {
@@ -1385,7 +1407,7 @@ static int regression_tests(void)
}
dummy_extra32.mark = &mark32_2;
return_value32[1] = pcre32_exec(re32, &dummy_extra32, regtest_buf32, length32, current->start_offset & OFFSET_MASK,
- current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD), ovector32_2, 32);
+ current->flags & (PCRE_NOTBOL | PCRE_NOTEOL | PCRE_NOTEMPTY | PCRE_NOTEMPTY_ATSTART | PCRE_PARTIAL_SOFT | PCRE_PARTIAL_HARD | PCRE_NO_UTF8_CHECK), ovector32_2, 32);
}
#endif
@@ -1581,7 +1603,7 @@ static int regression_tests(void)
if (is_successful) {
#ifdef SUPPORT_PCRE8
- if (!(current->start_offset & F_NO8) && ((utf && ucp) || is_ascii_input)) {
+ if (!(current->start_offset & F_NO8) && ((utf && ucp) || is_ascii)) {
if (return_value8[0] < 0 && !(current->start_offset & F_NOMATCH)) {
printf("8 bit: Test should match: [%d] '%s' @ '%s'\n",
total, current->pattern, current->input);
@@ -1596,7 +1618,7 @@ static int regression_tests(void)
}
#endif
#ifdef SUPPORT_PCRE16
- if (!(current->start_offset & F_NO16) && ((utf && ucp) || is_ascii_input)) {
+ if (!(current->start_offset & F_NO16) && ((utf && ucp) || is_ascii)) {
if (return_value16[0] < 0 && !(current->start_offset & F_NOMATCH)) {
printf("16 bit: Test should match: [%d] '%s' @ '%s'\n",
total, current->pattern, current->input);
@@ -1611,7 +1633,7 @@ static int regression_tests(void)
}
#endif
#ifdef SUPPORT_PCRE32
- if (!(current->start_offset & F_NO32) && ((utf && ucp) || is_ascii_input)) {
+ if (!(current->start_offset & F_NO32) && ((utf && ucp) || is_ascii)) {
if (return_value32[0] < 0 && !(current->start_offset & F_NOMATCH)) {
printf("32 bit: Test should match: [%d] '%s' @ '%s'\n",
total, current->pattern, current->input);
diff --git a/pcre/pcre_printint.c b/pcre/pcre_printint.c
index e4ef152d071..60dcb55efbf 100644
--- a/pcre/pcre_printint.c
+++ b/pcre/pcre_printint.c
@@ -644,7 +644,9 @@ for(;;)
int i;
unsigned int min, max;
BOOL printmap;
+ BOOL invertmap = FALSE;
pcre_uint8 *map;
+ pcre_uint8 inverted_map[32];
fprintf(f, " [");
@@ -653,7 +655,12 @@ for(;;)
extra = GET(code, 1);
ccode = code + LINK_SIZE + 1;
printmap = (*ccode & XCL_MAP) != 0;
- if ((*ccode++ & XCL_NOT) != 0) fprintf(f, "^");
+ if ((*ccode & XCL_NOT) != 0)
+ {
+ invertmap = (*ccode & XCL_HASPROP) == 0;
+ fprintf(f, "^");
+ }
+ ccode++;
}
else
{
@@ -666,6 +673,12 @@ for(;;)
if (printmap)
{
map = (pcre_uint8 *)ccode;
+ if (invertmap)
+ {
+ for (i = 0; i < 32; i++) inverted_map[i] = ~map[i];
+ map = inverted_map;
+ }
+
for (i = 0; i < 256; i++)
{
if ((map[i/8] & (1 << (i&7))) != 0)
diff --git a/pcre/pcre_string_utils.c b/pcre/pcre_string_utils.c
index 10b53d5bcb4..25eacc85073 100644
--- a/pcre/pcre_string_utils.c
+++ b/pcre/pcre_string_utils.c
@@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
- Copyright (c) 1997-2013 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -91,8 +91,8 @@ pcre_uchar c2;
while (*str1 != '\0' || *str2 != '\0')
{
- c1 = RAWUCHARINC(str1);
- c2 = RAWUCHARINC(str2);
+ c1 = UCHAR21INC(str1);
+ c2 = UCHAR21INC(str2);
if (c1 != c2)
return ((c1 > c2) << 1) - 1;
}
@@ -131,7 +131,7 @@ pcre_uchar c2;
while (*str1 != '\0' || *ustr2 != '\0')
{
- c1 = RAWUCHARINC(str1);
+ c1 = UCHAR21INC(str1);
c2 = (pcre_uchar)*ustr2++;
if (c1 != c2)
return ((c1 > c2) << 1) - 1;
diff --git a/pcre/pcre_study.c b/pcre/pcre_study.c
index c2aff517a5d..ab9510e20ec 100644
--- a/pcre/pcre_study.c
+++ b/pcre/pcre_study.c
@@ -879,9 +879,6 @@ do
case OP_SOM:
case OP_THEN:
case OP_THEN_ARG:
-#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
- case OP_XCLASS:
-#endif
return SSB_FAIL;
/* We can ignore word boundary tests. */
@@ -1257,6 +1254,16 @@ do
with a value >= 0xc4 is a potentially valid starter because it starts a
character with a value > 255. */
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+ case OP_XCLASS:
+ if ((tcode[1 + LINK_SIZE] & XCL_HASPROP) != 0)
+ return SSB_FAIL;
+ /* All bits are set. */
+ if ((tcode[1 + LINK_SIZE] & XCL_MAP) == 0 && (tcode[1 + LINK_SIZE] & XCL_NOT) != 0)
+ return SSB_FAIL;
+#endif
+ /* Fall through */
+
case OP_NCLASS:
#if defined SUPPORT_UTF && defined COMPILE_PCRE8
if (utf)
@@ -1273,8 +1280,21 @@ do
case OP_CLASS:
{
pcre_uint8 *map;
- tcode++;
- map = (pcre_uint8 *)tcode;
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+ map = NULL;
+ if (*tcode == OP_XCLASS)
+ {
+ if ((tcode[1 + LINK_SIZE] & XCL_MAP) != 0)
+ map = (pcre_uint8 *)(tcode + 1 + LINK_SIZE + 1);
+ tcode += GET(tcode, 1);
+ }
+ else
+#endif
+ {
+ tcode++;
+ map = (pcre_uint8 *)tcode;
+ tcode += 32 / sizeof(pcre_uchar);
+ }
/* In UTF-8 mode, the bits in a bit map correspond to character
values, not to byte values. However, the bit map we are constructing is
@@ -1282,31 +1302,35 @@ do
value is > 127. In fact, there are only two possible starting bytes for
characters in the range 128 - 255. */
-#if defined SUPPORT_UTF && defined COMPILE_PCRE8
- if (utf)
+#if defined SUPPORT_UTF || !defined COMPILE_PCRE8
+ if (map != NULL)
+#endif
{
- for (c = 0; c < 16; c++) start_bits[c] |= map[c];
- for (c = 128; c < 256; c++)
+#if defined SUPPORT_UTF && defined COMPILE_PCRE8
+ if (utf)
{
- if ((map[c/8] && (1 << (c&7))) != 0)
+ for (c = 0; c < 16; c++) start_bits[c] |= map[c];
+ for (c = 128; c < 256; c++)
{
- int d = (c >> 6) | 0xc0; /* Set bit for this starter */
- start_bits[d/8] |= (1 << (d&7)); /* and then skip on to the */
- c = (c & 0xc0) + 0x40 - 1; /* next relevant character. */
+ if ((map[c/8] && (1 << (c&7))) != 0)
+ {
+ int d = (c >> 6) | 0xc0; /* Set bit for this starter */
+ start_bits[d/8] |= (1 << (d&7)); /* and then skip on to the */
+ c = (c & 0xc0) + 0x40 - 1; /* next relevant character. */
+ }
}
}
- }
- else
+ else
#endif
- {
- /* In non-UTF-8 mode, the two bit maps are completely compatible. */
- for (c = 0; c < 32; c++) start_bits[c] |= map[c];
+ {
+ /* In non-UTF-8 mode, the two bit maps are completely compatible. */
+ for (c = 0; c < 32; c++) start_bits[c] |= map[c];
+ }
}
/* Advance past the bit map, and act on what follows. For a zero
minimum repeat, continue; otherwise stop processing. */
- tcode += 32 / sizeof(pcre_uchar);
switch (*tcode)
{
case OP_CRSTAR:
diff --git a/pcre/pcre_xclass.c b/pcre/pcre_xclass.c
index ad153be7851..c2b61f0f920 100644
--- a/pcre/pcre_xclass.c
+++ b/pcre/pcre_xclass.c
@@ -81,6 +81,11 @@ additional data. */
if (c < 256)
{
+ if ((*data & XCL_HASPROP) == 0)
+ {
+ if ((*data & XCL_MAP) == 0) return negated;
+ return (((pcre_uint8 *)(data + 1))[c/8] & (1 << (c&7))) != 0;
+ }
if ((*data & XCL_MAP) != 0 &&
(((pcre_uint8 *)(data + 1))[c/8] & (1 << (c&7))) != 0)
return !negated; /* char found */
diff --git a/pcre/pcregrep.c b/pcre/pcregrep.c
index f6b6ec39608..3e8d05dc8ad 100644
--- a/pcre/pcregrep.c
+++ b/pcre/pcregrep.c
@@ -12,7 +12,7 @@ distribution because other apparatus is needed to compile pcregrep for z/OS.
The header can be found in the special z/OS distribution, which is available
from www.zaconsultants.net or from www.cbttape.org.
- Copyright (c) 1997-2013 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -1298,7 +1298,7 @@ switch(endlinetype)
while (p > startptr && p[-1] != '\n') p--;
if (p <= startptr + 1 || p[-2] == '\r') return p;
}
- return p; /* But control should never get here */
+ /* Control can never get here */
case EL_ANY:
case EL_ANYCRLF:
diff --git a/pcre/pcreposix.c b/pcre/pcreposix.c
index 7cf4a4a657b..ed506197a53 100644
--- a/pcre/pcreposix.c
+++ b/pcre/pcreposix.c
@@ -6,7 +6,7 @@
and semantics are as close as possible to those of the Perl 5 language.
Written by Philip Hazel
- Copyright (c) 1997-2012 University of Cambridge
+ Copyright (c) 1997-2014 University of Cambridge
-----------------------------------------------------------------------------
Redistribution and use in source and binary forms, with or without
@@ -170,7 +170,9 @@ static const int eint[] = {
REG_BADPAT, /* missing opening brace after \o */
REG_BADPAT, /* parentheses too deeply nested */
REG_BADPAT, /* invalid range in character class */
- REG_BADPAT /* group name must start with a non-digit */
+ REG_BADPAT, /* group name must start with a non-digit */
+ /* 85 */
+ REG_BADPAT /* parentheses too deeply nested (stack check) */
};
/* Table of texts corresponding to POSIX error codes */
diff --git a/pcre/pcretest.c b/pcre/pcretest.c
index 8452d2bab6f..b8dc3c67032 100644
--- a/pcre/pcretest.c
+++ b/pcre/pcretest.c
@@ -233,6 +233,9 @@ argument, the casting might be incorrectly applied. */
#define SET_PCRE_CALLOUT8(callout) \
pcre_callout = callout
+#define SET_PCRE_STACK_GUARD8(stack_guard) \
+ pcre_stack_guard = stack_guard
+
#define PCRE_ASSIGN_JIT_STACK8(extra, callback, userdata) \
pcre_assign_jit_stack(extra, callback, userdata)
@@ -317,6 +320,9 @@ argument, the casting might be incorrectly applied. */
#define SET_PCRE_CALLOUT16(callout) \
pcre16_callout = (int (*)(pcre16_callout_block *))callout
+#define SET_PCRE_STACK_GUARD16(stack_guard) \
+ pcre16_stack_guard = (int (*)(void))stack_guard
+
#define PCRE_ASSIGN_JIT_STACK16(extra, callback, userdata) \
pcre16_assign_jit_stack((pcre16_extra *)extra, \
(pcre16_jit_callback)callback, userdata)
@@ -406,6 +412,9 @@ argument, the casting might be incorrectly applied. */
#define SET_PCRE_CALLOUT32(callout) \
pcre32_callout = (int (*)(pcre32_callout_block *))callout
+#define SET_PCRE_STACK_GUARD32(stack_guard) \
+ pcre32_stack_guard = (int (*)(void))stack_guard
+
#define PCRE_ASSIGN_JIT_STACK32(extra, callback, userdata) \
pcre32_assign_jit_stack((pcre32_extra *)extra, \
(pcre32_jit_callback)callback, userdata)
@@ -533,6 +542,14 @@ cases separately. */
else \
SET_PCRE_CALLOUT8(callout)
+#define SET_PCRE_STACK_GUARD(stack_guard) \
+ if (pcre_mode == PCRE32_MODE) \
+ SET_PCRE_STACK_GUARD32(stack_guard); \
+ else if (pcre_mode == PCRE16_MODE) \
+ SET_PCRE_STACK_GUARD16(stack_guard); \
+ else \
+ SET_PCRE_STACK_GUARD8(stack_guard)
+
#define STRLEN(p) (pcre_mode == PCRE32_MODE ? STRLEN32(p) : pcre_mode == PCRE16_MODE ? STRLEN16(p) : STRLEN8(p))
#define PCRE_ASSIGN_JIT_STACK(extra, callback, userdata) \
@@ -756,6 +773,12 @@ the three different cases. */
else \
G(SET_PCRE_CALLOUT,BITTWO)(callout)
+#define SET_PCRE_STACK_GUARD(stack_guard) \
+ if (pcre_mode == G(G(PCRE,BITONE),_MODE)) \
+ G(SET_PCRE_STACK_GUARD,BITONE)(stack_guard); \
+ else \
+ G(SET_PCRE_STACK_GUARD,BITTWO)(stack_guard)
+
#define STRLEN(p) ((pcre_mode == G(G(PCRE,BITONE),_MODE)) ? \
G(STRLEN,BITONE)(p) : G(STRLEN,BITTWO)(p))
@@ -897,6 +920,7 @@ the three different cases. */
#define PCHARSV PCHARSV8
#define READ_CAPTURE_NAME READ_CAPTURE_NAME8
#define SET_PCRE_CALLOUT SET_PCRE_CALLOUT8
+#define SET_PCRE_STACK_GUARD SET_PCRE_STACK_GUARD8
#define STRLEN STRLEN8
#define PCRE_ASSIGN_JIT_STACK PCRE_ASSIGN_JIT_STACK8
#define PCRE_COMPILE PCRE_COMPILE8
@@ -927,6 +951,7 @@ the three different cases. */
#define PCHARSV PCHARSV16
#define READ_CAPTURE_NAME READ_CAPTURE_NAME16
#define SET_PCRE_CALLOUT SET_PCRE_CALLOUT16
+#define SET_PCRE_STACK_GUARD SET_PCRE_STACK_GUARD16
#define STRLEN STRLEN16
#define PCRE_ASSIGN_JIT_STACK PCRE_ASSIGN_JIT_STACK16
#define PCRE_COMPILE PCRE_COMPILE16
@@ -957,6 +982,7 @@ the three different cases. */
#define PCHARSV PCHARSV32
#define READ_CAPTURE_NAME READ_CAPTURE_NAME32
#define SET_PCRE_CALLOUT SET_PCRE_CALLOUT32
+#define SET_PCRE_STACK_GUARD SET_PCRE_STACK_GUARD32
#define STRLEN STRLEN32
#define PCRE_ASSIGN_JIT_STACK PCRE_ASSIGN_JIT_STACK32
#define PCRE_COMPILE PCRE_COMPILE32
@@ -1015,6 +1041,7 @@ static int first_callout;
static int jit_was_used;
static int locale_set = 0;
static int show_malloc;
+static int stack_guard_return;
static int use_utf;
static const unsigned char *last_callout_mark = NULL;
@@ -2201,6 +2228,18 @@ return p;
/*************************************************
+* Stack guard function *
+*************************************************/
+
+/* Called from PCRE when set in pcre_stack_guard. We give an error (non-zero)
+return when a count overflows. */
+
+static int stack_guard(void)
+{
+return stack_guard_return;
+}
+
+/*************************************************
* Callout function *
*************************************************/
@@ -2883,8 +2922,8 @@ printf(" -32 use the 32-bit library\n");
#endif
printf(" -b show compiled code\n");
printf(" -C show PCRE compile-time options and exit\n");
-printf(" -C arg show a specific compile-time option\n");
-printf(" and exit with its value. The arg can be:\n");
+printf(" -C arg show a specific compile-time option and exit\n");
+printf(" with its value if numeric (else 0). The arg can be:\n");
printf(" linksize internal link size [2, 3, 4]\n");
printf(" pcre8 8 bit library support enabled [0, 1]\n");
printf(" pcre16 16 bit library support enabled [0, 1]\n");
@@ -2892,7 +2931,8 @@ printf(" pcre32 32 bit library support enabled [0, 1]\n");
printf(" utf Unicode Transformation Format supported [0, 1]\n");
printf(" ucp Unicode Properties supported [0, 1]\n");
printf(" jit Just-in-time compiler supported [0, 1]\n");
-printf(" newline Newline type [CR, LF, CRLF, ANYCRLF, ANY, ???]\n");
+printf(" newline Newline type [CR, LF, CRLF, ANYCRLF, ANY]\n");
+printf(" bsr \\R type [ANYCRLF, ANY]\n");
printf(" -d debug: show compiled code and information (-b and -i)\n");
#if !defined NODFA
printf(" -dfa force DFA matching for all subjects\n");
@@ -3231,6 +3271,11 @@ while (argc > 1 && argv[op][0] == '-')
(void)PCRE_CONFIG(PCRE_CONFIG_NEWLINE, &rc);
print_newline_config(rc, TRUE);
}
+ else if (strcmp(argv[op + 1], "bsr") == 0)
+ {
+ (void)PCRE_CONFIG(PCRE_CONFIG_BSR, &rc);
+ printf("%s\n", rc? "ANYCRLF" : "ANY");
+ }
else if (strcmp(argv[op + 1], "ebcdic") == 0)
{
#ifdef EBCDIC
@@ -3439,6 +3484,7 @@ while (!done)
use_utf = 0;
debug_lengths = 1;
+ SET_PCRE_STACK_GUARD(NULL);
if (extend_inputline(infile, buffer, " re> ") == NULL) break;
if (infile != stdin) fprintf(outfile, "%s", (char *)buffer);
@@ -3739,6 +3785,21 @@ while (!done)
case 'P': do_posix = 1; break;
#endif
+ case 'Q':
+ switch (*pp)
+ {
+ case '0':
+ case '1':
+ stack_guard_return = *pp++ - '0';
+ break;
+
+ default:
+ fprintf(outfile, "** Missing 0 or 1 after /Q\n");
+ goto SKIP_DATA;
+ }
+ SET_PCRE_STACK_GUARD(stack_guard);
+ break;
+
case 'S':
do_study = 1;
for (;;)
@@ -4282,12 +4343,12 @@ while (!done)
if (new_info(re, extra, PCRE_INFO_FIRSTTABLE, &start_bits) == 0)
{
if (start_bits == NULL)
- fprintf(outfile, "No set of starting bytes\n");
+ fprintf(outfile, "No starting char list\n");
else
{
int i;
int c = 24;
- fprintf(outfile, "Starting byte set: ");
+ fprintf(outfile, "Starting chars: ");
for (i = 0; i < 256; i++)
{
if ((start_bits[i/8] & (1<<(i&7))) != 0)
@@ -5192,7 +5253,8 @@ while (!done)
if (count * 2 > use_size_offsets) count = use_size_offsets/2;
}
- /* Output the captured substrings */
+ /* Output the captured substrings. Note that, for the matched string,
+ the use of \K in an assertion can make the start later than the end. */
for (i = 0; i < count * 2; i += 2)
{
@@ -5208,11 +5270,25 @@ while (!done)
}
else
{
+ int start = use_offsets[i];
+ int end = use_offsets[i+1];
+
+ if (start > end)
+ {
+ start = use_offsets[i+1];
+ end = use_offsets[i];
+ fprintf(outfile, "Start of matched string is beyond its end - "
+ "displaying from end to start.\n");
+ }
+
fprintf(outfile, "%2d: ", i/2);
- PCHARSV(bptr, use_offsets[i],
- use_offsets[i+1] - use_offsets[i], outfile);
+ PCHARSV(bptr, start, end - start, outfile);
if (verify_jit && jit_was_used) fprintf(outfile, " (JIT)");
fprintf(outfile, "\n");
+
+ /* Note: don't use the start/end variables here because we want to
+ show the text from what is reported as the end. */
+
if (do_showcaprest || (i == 0 && do_showrest))
{
fprintf(outfile, "%2d+ ", i/2);
diff --git a/pcre/testdata/saved16BE-1 b/pcre/testdata/saved16BE-1
index e6edddc6e0b..5d2bd1be528 100644
--- a/pcre/testdata/saved16BE-1
+++ b/pcre/testdata/saved16BE-1
Binary files differ
diff --git a/pcre/testdata/saved16LE-1 b/pcre/testdata/saved16LE-1
index 5035ec07215..822ccd7012a 100644
--- a/pcre/testdata/saved16LE-1
+++ b/pcre/testdata/saved16LE-1
Binary files differ
diff --git a/pcre/testdata/saved32BE-1 b/pcre/testdata/saved32BE-1
index b4c2ffe42cc..609d97cdeba 100644
--- a/pcre/testdata/saved32BE-1
+++ b/pcre/testdata/saved32BE-1
Binary files differ
diff --git a/pcre/testdata/saved32LE-1 b/pcre/testdata/saved32LE-1
index 49392b89a10..901dfb63487 100644
--- a/pcre/testdata/saved32LE-1
+++ b/pcre/testdata/saved32LE-1
Binary files differ
diff --git a/pcre/testdata/testinput18 b/pcre/testdata/testinput18
index abff34e73a5..2dfb54cdfd4 100644
--- a/pcre/testdata/testinput18
+++ b/pcre/testdata/testinput18
@@ -207,7 +207,7 @@ correctly, but that messes up comparisons). --/
CDBABC
\x{2000}ABC
-/\R*A/SI8
+/\R*A/SI8<bsr_unicode>
CDBABC
\x{2028}A
diff --git a/pcre/testdata/testinput2 b/pcre/testdata/testinput2
index 00924ee98fa..da6e61499cd 100644
--- a/pcre/testdata/testinput2
+++ b/pcre/testdata/testinput2
@@ -907,6 +907,9 @@
/\U/I
+/a{1,3}b/U
+ ab
+
/[/I
/[a-/I
@@ -4045,4 +4048,18 @@ backtracking verbs. --/
/[a[:<:]] should give error/
+/(?=ab\K)/+
+ abcd
+
+/abcd/f<lf>
+ xx\nxabcd
+
+/ -- Test stack check external calls --/
+
+/(((((a)))))/Q0
+
+/(((((a)))))/Q1
+
+/(((((a)))))/Q
+
/-- End of testinput2 --/
diff --git a/pcre/testdata/testinput25 b/pcre/testdata/testinput25
index ce9d9e19a40..067ca12fdc0 100644
--- a/pcre/testdata/testinput25
+++ b/pcre/testdata/testinput25
@@ -1,6 +1,6 @@
/-- Tests for the 32-bit library only */
-< forbid 8w
+< forbid 8W
/-- Check maximum character size --/
diff --git a/pcre/testdata/testinput3 b/pcre/testdata/testinput3
index 1d2e855386a..fcd46255c93 100644
--- a/pcre/testdata/testinput3
+++ b/pcre/testdata/testinput3
@@ -1,7 +1,10 @@
-/-- This set of tests checks local-specific features, using the fr_FR locale.
- It is not Perl-compatible. There is different version called wintestinput3
- f or use on Windows, where the locale is called "french". --/
-
+/-- This set of tests checks local-specific features, using the "fr_FR" locale.
+ It is not Perl-compatible. When run via RunTest, the locale is edited to
+ be whichever of "fr_FR", "french", or "fr" is found to exist. There is
+ different version of this file called wintestinput3 for use on Windows,
+ where the locale is called "french" and the tests are run using
+ RunTest.bat. --/
+
< forbid 8W
/^[\w]+/
diff --git a/pcre/testdata/testinput4 b/pcre/testdata/testinput4
index 983f7a119b5..0110267bd80 100644
--- a/pcre/testdata/testinput4
+++ b/pcre/testdata/testinput4
@@ -716,4 +716,10 @@
/^a+[a\x{200}]/8
aa
+/^.\B.\B./8
+ \x{10123}\x{10124}\x{10125}
+
+/^#[^\x{ffff}]#[^\x{ffff}]#[^\x{ffff}]#/8
+ #\x{10000}#\x{100}#\x{10ffff}#
+
/-- End of testinput4 --/
diff --git a/pcre/testdata/testinput5 b/pcre/testdata/testinput5
index 9e9a22a1a1f..e36b09d6377 100644
--- a/pcre/testdata/testinput5
+++ b/pcre/testdata/testinput5
@@ -788,4 +788,6 @@
/^a+[a\x{200}]/8BZ
aa
+/[b-d\x{200}-\x{250}]*[ae-h]?#[\x{200}-\x{250}]{0,8}[\x00-\xff]*#[\x{200}-\x{250}]+[a-z]/8BZ
+
/-- End of testinput5 --/
diff --git a/pcre/testdata/testinput6 b/pcre/testdata/testinput6
index 1e450be04d3..7a6a53f1473 100644
--- a/pcre/testdata/testinput6
+++ b/pcre/testdata/testinput6
@@ -1484,4 +1484,13 @@
\x{a1}\x{a7}
\x{37e}
+/[RST]+/8iW
+ Ss\x{17f}
+
+/[R-T]+/8iW
+ Ss\x{17f}
+
+/[q-u]+/8iW
+ Ss\x{17f}
+
/-- End of testinput6 --/
diff --git a/pcre/testdata/testinput7 b/pcre/testdata/testinput7
index 9d145436350..6bd05864411 100644
--- a/pcre/testdata/testinput7
+++ b/pcre/testdata/testinput7
@@ -829,4 +829,10 @@ of case for anything other than the ASCII letters. --/
/\d+\s{0,5}=\s*\S?=\w{0,4}\W*/8WBZ
+/[RST]+/8iWBZ
+
+/[R-T]+/8iWBZ
+
+/[Q-U]+/8iWBZ
+
/-- End of testinput7 --/
diff --git a/pcre/testdata/testoutput12 b/pcre/testdata/testoutput12
index a76e2aef880..67ad2c8aecf 100644
--- a/pcre/testdata/testoutput12
+++ b/pcre/testdata/testoutput12
@@ -8,7 +8,7 @@ No options
First char = 'a'
Need char = 'c'
Subject length lower bound = 3
-No set of starting bytes
+No starting char list
JIT study was successful
/(?(?C1)(?=a)a)/S+I
@@ -27,7 +27,7 @@ No options
No first char
No need char
Subject length lower bound = -1
-No set of starting bytes
+No starting char list
JIT study was not successful
/abc/S+I>testsavedregex
@@ -36,7 +36,7 @@ No options
First char = 'a'
Need char = 'c'
Subject length lower bound = 3
-No set of starting bytes
+No starting char list
JIT study was successful
Compiled pattern written to testsavedregex
Study data written to testsavedregex
@@ -165,7 +165,7 @@ No options
First char = 'a'
Need char = 'd'
Subject length lower bound = 4
-No set of starting bytes
+No starting char list
JIT study was successful
/(*NO_START_OPT)a(*:m)b/KS++
diff --git a/pcre/testdata/testoutput13 b/pcre/testdata/testoutput13
index 9f73c5000f6..d6fb8a5ca29 100644
--- a/pcre/testdata/testoutput13
+++ b/pcre/testdata/testoutput13
@@ -8,7 +8,7 @@ No options
First char = 'a'
Need char = 'c'
Subject length lower bound = 3
-No set of starting bytes
+No starting char list
JIT support is not available in this version of PCRE
/a*/SI
diff --git a/pcre/testdata/testoutput14 b/pcre/testdata/testoutput14
index 52680a8f9cd..ae85681e0e0 100644
--- a/pcre/testdata/testoutput14
+++ b/pcre/testdata/testoutput14
@@ -361,7 +361,7 @@ Options: extended
No first char
No need char
Subject length lower bound = 3
-Starting byte set: \x09 \x20 ! " # $ % & ' ( * + - / 0 1 2 3 4 5 6 7 8
+Starting chars: \x09 \x20 ! " # $ % & ' ( * + - / 0 1 2 3 4 5 6 7 8
9 = ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ _ ` a b c d e
f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f
@@ -388,7 +388,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x09 \x20 \xa0
+Starting chars: \x09 \x20 \xa0
/\H/SI
Capturing subpattern count = 0
@@ -396,7 +396,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/\v/SI
Capturing subpattern count = 0
@@ -404,7 +404,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85
+Starting chars: \x0a \x0b \x0c \x0d \x85
/\V/SI
Capturing subpattern count = 0
@@ -412,7 +412,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/\R/SI
Capturing subpattern count = 0
@@ -420,7 +420,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85
+Starting chars: \x0a \x0b \x0c \x0d \x85
/[\h]/BZ
------------------------------------------------------------------
diff --git a/pcre/testdata/testoutput15 b/pcre/testdata/testoutput15
index 5792be72df7..5af369d06d9 100644
--- a/pcre/testdata/testoutput15
+++ b/pcre/testdata/testoutput15
@@ -481,7 +481,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
\x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19
\x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4
5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
@@ -519,7 +519,7 @@ Options: utf
First char = \x{c4}
Need char = \x{80}
Subject length lower bound = 3
-No set of starting bytes
+No starting char list
\x{100}\x{100}\x{100}\x{100\x{100}
0: \x{100}\x{100}\x{100}
@@ -539,7 +539,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: x \xc4
+Starting chars: x \xc4
/(\x{100}*a|x)/8SDZ
------------------------------------------------------------------
@@ -558,7 +558,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a x \xc4
+Starting chars: a x \xc4
/(\x{100}{0,2}a|x)/8SDZ
------------------------------------------------------------------
@@ -577,7 +577,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a x \xc4
+Starting chars: a x \xc4
/(\x{100}{1,2}a|x)/8SDZ
------------------------------------------------------------------
@@ -597,7 +597,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: x \xc4
+Starting chars: x \xc4
/\x{100}/8DZ
------------------------------------------------------------------
@@ -799,7 +799,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x09 \x20 \xc2 \xe1 \xe2 \xe3
+Starting chars: \x09 \x20 \xc2 \xe1 \xe2 \xe3
ABC\x{09}
0: \x{09}
ABC\x{20}
@@ -825,7 +825,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \xc2 \xe2
+Starting chars: \x0a \x0b \x0c \x0d \xc2 \xe2
ABC\x{0a}
0: \x{0a}
ABC\x{0b}
@@ -845,7 +845,7 @@ Options: utf
No first char
Need char = 'A'
Subject length lower bound = 1
-Starting byte set: \x09 \x20 A \xc2 \xe1 \xe2 \xe3
+Starting chars: \x09 \x20 A \xc2 \xe1 \xe2 \xe3
CDBABC
0: A
@@ -855,7 +855,7 @@ Options: utf
No first char
Need char = 'A'
Subject length lower bound = 2
-Starting byte set: \x0a \x0b \x0c \x0d \xc2 \xe2
+Starting chars: \x0a \x0b \x0c \x0d \xc2 \xe2
/\s?xxx\s/8SI
Capturing subpattern count = 0
@@ -863,7 +863,7 @@ Options: utf
No first char
Need char = 'x'
Subject length lower bound = 4
-Starting byte set: \x09 \x0a \x0b \x0c \x0d \x20 x
+Starting chars: \x09 \x0a \x0b \x0c \x0d \x20 x
/\sxxx\s/I8ST1
Capturing subpattern count = 0
@@ -871,7 +871,7 @@ Options: utf
No first char
Need char = 'x'
Subject length lower bound = 5
-Starting byte set: \x09 \x0a \x0c \x0d \x20 \xc2
+Starting chars: \x09 \x0a \x0c \x0d \x20 \xc2
AB\x{85}xxx\x{a0}XYZ
0: \x{85}xxx\x{a0}
AB\x{a0}xxx\x{85}XYZ
@@ -883,7 +883,7 @@ Options: utf
No first char
Need char = ' '
Subject length lower bound = 3
-Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x0b \x0e
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x0b \x0e
\x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 \x1a \x1b \x1c \x1d
\x1e \x1f ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e
@@ -917,7 +917,7 @@ Options: caseless utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \xe1
+Starting chars: \xe1
/\x{1234}+?/iS8I
Capturing subpattern count = 0
@@ -925,7 +925,7 @@ Options: caseless utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \xe1
+Starting chars: \xe1
/\x{1234}++/iS8I
Capturing subpattern count = 0
@@ -933,7 +933,7 @@ Options: caseless utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \xe1
+Starting chars: \xe1
/\x{1234}{2}/iS8I
Capturing subpattern count = 0
@@ -941,7 +941,7 @@ Options: caseless utf
No first char
No need char
Subject length lower bound = 2
-Starting byte set: \xe1
+Starting chars: \xe1
/[^\x{c4}]/8DZ
------------------------------------------------------------------
@@ -974,7 +974,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \xc2 \xe2
+Starting chars: \x0a \x0b \x0c \x0d \xc2 \xe2
/\777/8DZ
------------------------------------------------------------------
diff --git a/pcre/testdata/testoutput16 b/pcre/testdata/testoutput16
index 1d5f31d929a..63e9eb06ae6 100644
--- a/pcre/testdata/testoutput16
+++ b/pcre/testdata/testoutput16
@@ -64,7 +64,7 @@ Options: caseless utf
No first char
No need char
Subject length lower bound = 17
-Starting byte set: \xd0 \xd1
+Starting chars: \xd0 \xd1
\x{401}\x{420}\x{421}\x{422}\x{423}\x{424}\x{425}\x{426}\x{427}\x{428}\x{429}\x{42a}\x{42b}\x{42c}\x{42d}\x{42e}\x{42f}
0: \x{401}\x{420}\x{421}\x{422}\x{423}\x{424}\x{425}\x{426}\x{427}\x{428}\x{429}\x{42a}\x{42b}\x{42c}\x{42d}\x{42e}\x{42f}
\x{451}\x{440}\x{441}\x{442}\x{443}\x{444}\x{445}\x{446}\x{447}\x{448}\x{449}\x{44a}\x{44b}\x{44c}\x{44d}\x{44e}\x{44f}
@@ -92,7 +92,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x09 \x20 \xa0
+Starting chars: \x09 \x20 \xa0
/\v/SI
Capturing subpattern count = 0
@@ -100,7 +100,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85
+Starting chars: \x0a \x0b \x0c \x0d \x85
/\R/SI
Capturing subpattern count = 0
@@ -108,7 +108,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85
+Starting chars: \x0a \x0b \x0c \x0d \x85
/[[:blank:]]/WBZ
------------------------------------------------------------------
diff --git a/pcre/testdata/testoutput17 b/pcre/testdata/testoutput17
index 9a469c51ae1..1a3b492fb49 100644
--- a/pcre/testdata/testoutput17
+++ b/pcre/testdata/testoutput17
@@ -228,7 +228,7 @@ Options: extended
No first char
No need char
Subject length lower bound = 3
-Starting byte set: \x09 \x20 ! " # $ % & ' ( * + - / 0 1 2 3 4 5 6 7 8
+Starting chars: \x09 \x20 ! " # $ % & ' ( * + - / 0 1 2 3 4 5 6 7 8
9 = ? A B C D E F G H I J K L M N O P Q R S T U V W X Y Z ^ _ ` a b c d e
f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f \xff
@@ -274,7 +274,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x09 \x20 \xa0 \xff
+Starting chars: \x09 \x20 \xa0 \xff
\x{1681}\x{200b}\x{1680}\x{2000}\x{202f}\x{3000}
0: \x{1680}\x{2000}\x{202f}\x{3000}
\x{3001}\x{2fff}\x{200a}\xa0\x{2000}
@@ -292,7 +292,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+Starting chars: \x09 \x20 \xa0 \xff
\x{1681}\x{200b}\x{1680}\x{2000}\x{202f}\x{3000}
0: \x{1680}\x{2000}\x{202f}\x{3000}
\x{3001}\x{2fff}\x{200a}\xa0\x{2000}
@@ -304,7 +304,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
\x{1680}\x{180e}\x{167f}\x{1681}\x{180d}\x{180f}
0: \x{167f}\x{1681}\x{180d}\x{180f}
\x{2000}\x{200a}\x{1fff}\x{200b}
@@ -330,7 +330,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d \x85 \xff
\x{2027}\x{2030}\x{2028}\x{2029}
0: \x{2028}\x{2029}
\x09\x0e\x84\x86\x85\x0a\x0b\x0c\x0d
@@ -348,7 +348,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+Starting chars: \x0a \x0b \x0c \x0d \x85 \xff
\x{2027}\x{2030}\x{2028}\x{2029}
0: \x{2028}\x{2029}
\x09\x0e\x84\x86\x85\x0a\x0b\x0c\x0d
@@ -360,7 +360,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
\x{2028}\x{2029}\x{2027}\x{2030}
0: \x{2027}\x{2030}
\x85\x0a\x0b\x0c\x0d\x09\x0e\x84\x86
@@ -378,7 +378,7 @@ Options: bsr_unicode
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d \x85 \xff
\x{2027}\x{2030}\x{2028}\x{2029}
0: \x{2028}\x{2029}
\x09\x0e\x84\x86\x85\x0a\x0b\x0c\x0d
@@ -534,18 +534,18 @@ MK: 0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789AB
------------------------------------------------------------------
Bra
a*
- [b-\x{200}]?+
+ [b-\xff\x{100}-\x{200}]?+
a#
a*+
- [b-\x{200}]?
+ [b-\xff\x{100}-\x{200}]?
b#
- [a-f]*
- [g-\x{200}]*+
+ [a-f]*+
+ [g-\xff\x{100}-\x{200}]*+
#
- [g-\x{200}]*
+ [g-\xff\x{100}-\x{200}]*+
[a-c]*+
#
- [g-\x{200}]*
+ [g-\xff\x{100}-\x{200}]*
[a-h]*+
Ket
End
diff --git a/pcre/testdata/testoutput18-16 b/pcre/testdata/testoutput18-16
index 1ca9ee74018..a1962057050 100644
--- a/pcre/testdata/testoutput18-16
+++ b/pcre/testdata/testoutput18-16
@@ -339,7 +339,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
\x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19
\x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4
5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
@@ -378,7 +378,7 @@ Options: utf
First char = \x{100}
Need char = \x{100}
Subject length lower bound = 3
-No set of starting bytes
+No starting char list
\x{100}\x{100}\x{100}\x{100\x{100}
0: \x{100}\x{100}\x{100}
@@ -398,7 +398,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: x \xff
+Starting chars: x \xff
/(\x{100}*a|x)/8SDZ
------------------------------------------------------------------
@@ -417,7 +417,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a x \xff
+Starting chars: a x \xff
/(\x{100}{0,2}a|x)/8SDZ
------------------------------------------------------------------
@@ -436,7 +436,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a x \xff
+Starting chars: a x \xff
/(\x{100}{1,2}a|x)/8SDZ
------------------------------------------------------------------
@@ -456,7 +456,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: x \xff
+Starting chars: x \xff
/\x{100}/8DZ
------------------------------------------------------------------
@@ -666,7 +666,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x09 \x20 \xa0 \xff
+Starting chars: \x09 \x20 \xa0 \xff
ABC\x{09}
0: \x{09}
ABC\x{20}
@@ -692,7 +692,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d \x85 \xff
ABC\x{0a}
0: \x{0a}
ABC\x{0b}
@@ -712,19 +712,19 @@ Options: utf
No first char
Need char = 'A'
Subject length lower bound = 1
-Starting byte set: \x09 \x20 A \xa0 \xff
+Starting chars: \x09 \x20 A \xa0 \xff
CDBABC
0: A
\x{2000}ABC
0: \x{2000}A
-/\R*A/SI8
+/\R*A/SI8<bsr_unicode>
Capturing subpattern count = 0
-Options: utf
+Options: bsr_unicode utf
No first char
Need char = 'A'
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d A \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d A \x85 \xff
CDBABC
0: A
\x{2028}A
@@ -736,7 +736,7 @@ Options: utf
No first char
Need char = 'A'
Subject length lower bound = 2
-Starting byte set: \x0a \x0b \x0c \x0d \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d \x85 \xff
/\s?xxx\s/8SI
Capturing subpattern count = 0
@@ -744,7 +744,7 @@ Options: utf
No first char
Need char = 'x'
Subject length lower bound = 4
-Starting byte set: \x09 \x0a \x0b \x0c \x0d \x20 x
+Starting chars: \x09 \x0a \x0b \x0c \x0d \x20 x
/\sxxx\s/I8ST1
Capturing subpattern count = 0
@@ -752,7 +752,7 @@ Options: utf
No first char
Need char = 'x'
Subject length lower bound = 5
-Starting byte set: \x09 \x0a \x0c \x0d \x20 \x85 \xa0
+Starting chars: \x09 \x0a \x0c \x0d \x20 \x85 \xa0
AB\x{85}xxx\x{a0}XYZ
0: \x{85}xxx\x{a0}
AB\x{a0}xxx\x{85}XYZ
@@ -764,7 +764,7 @@ Options: utf
No first char
Need char = ' '
Subject length lower bound = 3
-Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x0b \x0e
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x0b \x0e
\x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 \x1a \x1b \x1c \x1d
\x1e \x1f ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e
@@ -803,7 +803,7 @@ Options: caseless utf
First char = \x{1234}
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/\x{1234}+?/iS8I
Capturing subpattern count = 0
@@ -811,7 +811,7 @@ Options: caseless utf
First char = \x{1234}
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/\x{1234}++/iS8I
Capturing subpattern count = 0
@@ -819,7 +819,7 @@ Options: caseless utf
First char = \x{1234}
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/\x{1234}{2}/iS8I
Capturing subpattern count = 0
@@ -827,7 +827,7 @@ Options: caseless utf
First char = \x{1234}
Need char = \x{1234}
Subject length lower bound = 2
-No set of starting bytes
+No starting char list
/[^\x{c4}]/8DZ
------------------------------------------------------------------
@@ -860,7 +860,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d \x85 \xff
/-- Check bad offset --/
diff --git a/pcre/testdata/testoutput18-32 b/pcre/testdata/testoutput18-32
index 89be3a4b051..1525994db98 100644
--- a/pcre/testdata/testoutput18-32
+++ b/pcre/testdata/testoutput18-32
@@ -337,7 +337,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
\x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19
\x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4
5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
@@ -376,7 +376,7 @@ Options: utf
First char = \x{100}
Need char = \x{100}
Subject length lower bound = 3
-No set of starting bytes
+No starting char list
\x{100}\x{100}\x{100}\x{100\x{100}
0: \x{100}\x{100}\x{100}
@@ -396,7 +396,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: x \xff
+Starting chars: x \xff
/(\x{100}*a|x)/8SDZ
------------------------------------------------------------------
@@ -415,7 +415,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a x \xff
+Starting chars: a x \xff
/(\x{100}{0,2}a|x)/8SDZ
------------------------------------------------------------------
@@ -434,7 +434,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a x \xff
+Starting chars: a x \xff
/(\x{100}{1,2}a|x)/8SDZ
------------------------------------------------------------------
@@ -454,7 +454,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: x \xff
+Starting chars: x \xff
/\x{100}/8DZ
------------------------------------------------------------------
@@ -663,7 +663,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x09 \x20 \xa0 \xff
+Starting chars: \x09 \x20 \xa0 \xff
ABC\x{09}
0: \x{09}
ABC\x{20}
@@ -689,7 +689,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d \x85 \xff
ABC\x{0a}
0: \x{0a}
ABC\x{0b}
@@ -709,19 +709,19 @@ Options: utf
No first char
Need char = 'A'
Subject length lower bound = 1
-Starting byte set: \x09 \x20 A \xa0 \xff
+Starting chars: \x09 \x20 A \xa0 \xff
CDBABC
0: A
\x{2000}ABC
0: \x{2000}A
-/\R*A/SI8
+/\R*A/SI8<bsr_unicode>
Capturing subpattern count = 0
-Options: utf
+Options: bsr_unicode utf
No first char
Need char = 'A'
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d A \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d A \x85 \xff
CDBABC
0: A
\x{2028}A
@@ -733,7 +733,7 @@ Options: utf
No first char
Need char = 'A'
Subject length lower bound = 2
-Starting byte set: \x0a \x0b \x0c \x0d \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d \x85 \xff
/\s?xxx\s/8SI
Capturing subpattern count = 0
@@ -741,7 +741,7 @@ Options: utf
No first char
Need char = 'x'
Subject length lower bound = 4
-Starting byte set: \x09 \x0a \x0b \x0c \x0d \x20 x
+Starting chars: \x09 \x0a \x0b \x0c \x0d \x20 x
/\sxxx\s/I8ST1
Capturing subpattern count = 0
@@ -749,7 +749,7 @@ Options: utf
No first char
Need char = 'x'
Subject length lower bound = 5
-Starting byte set: \x09 \x0a \x0c \x0d \x20 \x85 \xa0
+Starting chars: \x09 \x0a \x0c \x0d \x20 \x85 \xa0
AB\x{85}xxx\x{a0}XYZ
0: \x{85}xxx\x{a0}
AB\x{a0}xxx\x{85}XYZ
@@ -761,7 +761,7 @@ Options: utf
No first char
Need char = ' '
Subject length lower bound = 3
-Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x0b \x0e
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x0b \x0e
\x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 \x1a \x1b \x1c \x1d
\x1e \x1f ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ? @
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c d e
@@ -800,7 +800,7 @@ Options: caseless utf
First char = \x{1234}
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/\x{1234}+?/iS8I
Capturing subpattern count = 0
@@ -808,7 +808,7 @@ Options: caseless utf
First char = \x{1234}
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/\x{1234}++/iS8I
Capturing subpattern count = 0
@@ -816,7 +816,7 @@ Options: caseless utf
First char = \x{1234}
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/\x{1234}{2}/iS8I
Capturing subpattern count = 0
@@ -824,7 +824,7 @@ Options: caseless utf
First char = \x{1234}
Need char = \x{1234}
Subject length lower bound = 2
-No set of starting bytes
+No starting char list
/[^\x{c4}]/8DZ
------------------------------------------------------------------
@@ -857,7 +857,7 @@ Options: utf
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x0a \x0b \x0c \x0d \x85 \xff
+Starting chars: \x0a \x0b \x0c \x0d \x85 \xff
/-- Check bad offset --/
diff --git a/pcre/testdata/testoutput19 b/pcre/testdata/testoutput19
index ccc198cc153..21fe677900b 100644
--- a/pcre/testdata/testoutput19
+++ b/pcre/testdata/testoutput19
@@ -55,7 +55,7 @@ Options: caseless utf
First char = \x{401} (caseless)
Need char = \x{42f} (caseless)
Subject length lower bound = 17
-No set of starting bytes
+No starting char list
\x{401}\x{420}\x{421}\x{422}\x{423}\x{424}\x{425}\x{426}\x{427}\x{428}\x{429}\x{42a}\x{42b}\x{42c}\x{42d}\x{42e}\x{42f}
0: \x{401}\x{420}\x{421}\x{422}\x{423}\x{424}\x{425}\x{426}\x{427}\x{428}\x{429}\x{42a}\x{42b}\x{42c}\x{42d}\x{42e}\x{42f}
\x{451}\x{440}\x{441}\x{442}\x{443}\x{444}\x{445}\x{446}\x{447}\x{448}\x{449}\x{44a}\x{44b}\x{44c}\x{44d}\x{44e}\x{44f}
diff --git a/pcre/testdata/testoutput2 b/pcre/testdata/testoutput2
index 844497abcdc..b6da7df187e 100644
--- a/pcre/testdata/testoutput2
+++ b/pcre/testdata/testoutput2
@@ -178,7 +178,7 @@ No options
No first char
No need char
Subject length lower bound = 3
-Starting byte set: c d e
+Starting chars: c d e
this sentence eventually mentions a cat
0: cat
this sentences rambles on and on for a while and then reaches elephant
@@ -190,7 +190,7 @@ Options: caseless
No first char
No need char
Subject length lower bound = 3
-Starting byte set: C D E c d e
+Starting chars: C D E c d e
this sentence eventually mentions a CAT cat
0: CAT
this sentences rambles on and on for a while to elephant ElePhant
@@ -202,7 +202,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b c d
+Starting chars: a b c d
/(a|[^\dZ])/IS
Capturing subpattern count = 1
@@ -210,7 +210,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
\x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19
\x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / : ; < = >
? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y [ \ ] ^ _ ` a b c d
@@ -231,7 +231,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x09 \x0a \x0b \x0c \x0d \x20 a b
+Starting chars: \x09 \x0a \x0b \x0c \x0d \x20 a b
/(ab\2)/
Failed: reference to non-existent subpattern at offset 6
@@ -512,7 +512,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b c d
+Starting chars: a b c d
/(?i)[abcd]/IS
Capturing subpattern count = 0
@@ -520,7 +520,7 @@ Options: caseless
No first char
No need char
Subject length lower bound = 1
-Starting byte set: A B C D a b c d
+Starting chars: A B C D a b c d
/(?m)[xy]|(b|c)/IS
Capturing subpattern count = 1
@@ -528,7 +528,7 @@ Options: multiline
No first char
No need char
Subject length lower bound = 1
-Starting byte set: b c x y
+Starting chars: b c x y
/(^a|^b)/Im
Capturing subpattern count = 1
@@ -591,7 +591,7 @@ No options
First char = 'b' (caseless)
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/(a*b|(?i:c*(?-i)d))/IS
Capturing subpattern count = 1
@@ -599,7 +599,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: C a b c d
+Starting chars: C a b c d
/a$/I
Capturing subpattern count = 0
@@ -666,7 +666,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b
+Starting chars: a b
/(?<!foo)(alpha|omega)/IS
Capturing subpattern count = 1
@@ -675,7 +675,7 @@ No options
No first char
Need char = 'a'
Subject length lower bound = 5
-Starting byte set: a o
+Starting chars: a o
/(?!alphabet)[ab]/IS
Capturing subpattern count = 0
@@ -683,7 +683,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b
+Starting chars: a b
/(?<=foo\n)^bar/Im
Capturing subpattern count = 0
@@ -1642,7 +1642,7 @@ Options: anchored
No first char
Need char = 'd'
Subject length lower bound = 4
-No set of starting bytes
+No starting char list
/\( # ( at start
(?: # Non-capturing bracket
@@ -1875,7 +1875,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
+Starting chars: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
_ a b c d e f g h i j k l m n o p q r s t u v w x y z
/^[[:ascii:]]/DZ
@@ -1937,7 +1937,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: \x09 \x0a \x0b \x0c \x0d \x20
+Starting chars: \x09 \x0a \x0b \x0c \x0d \x20
/^[[:cntrl:]]/DZ
------------------------------------------------------------------
@@ -3178,6 +3178,10 @@ Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1
/\U/I
Failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 1
+/a{1,3}b/U
+ ab
+ 0: ab
+
/[/I
Failed: missing terminating ] for character class at offset 1
@@ -3434,7 +3438,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b
+Starting chars: a b
/[^a]/I
Capturing subpattern count = 0
@@ -3454,7 +3458,7 @@ No options
No first char
Need char = '6'
Subject length lower bound = 4
-Starting byte set: 0 1 2 3 4 5 6 7 8 9
+Starting chars: 0 1 2 3 4 5 6 7 8 9
/a^b/I
Capturing subpattern count = 0
@@ -3488,7 +3492,7 @@ Options: caseless
No first char
No need char
Subject length lower bound = 1
-Starting byte set: A B a b
+Starting chars: A B a b
/[ab](?i)cd/IS
Capturing subpattern count = 0
@@ -3496,7 +3500,7 @@ No options
No first char
Need char = 'd' (caseless)
Subject length lower bound = 3
-Starting byte set: a b
+Starting chars: a b
/abc(?C)def/I
Capturing subpattern count = 0
@@ -3537,7 +3541,7 @@ No options
No first char
Need char = 'f'
Subject length lower bound = 7
-Starting byte set: 0 1 2 3 4 5 6 7 8 9
+Starting chars: 0 1 2 3 4 5 6 7 8 9
1234abcdef
--->1234abcdef
1 ^ \d
@@ -3856,7 +3860,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b
+Starting chars: a b
/(?R)/I
Failed: recursive call could loop indefinitely at offset 3
@@ -4637,7 +4641,7 @@ Options: caseless
No first char
Need char = 'g' (caseless)
Subject length lower bound = 8
-No set of starting bytes
+No starting char list
Baby Bjorn Active Carrier - With free SHIPPING!!
0: Baby Bjorn Active Carrier - With free SHIPPING!!
1: Baby Bjorn Active Carrier - With free SHIPPING!!
@@ -4656,7 +4660,7 @@ No options
No first char
Need char = 'b'
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/(a|b)*.?c/ISDZ
------------------------------------------------------------------
@@ -4677,7 +4681,7 @@ No options
No first char
Need char = 'c'
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/abc(?C255)de(?C)f/DZ
------------------------------------------------------------------
@@ -4750,7 +4754,7 @@ Options:
No first char
Need char = 'b'
Subject length lower bound = 1
-Starting byte set: a b
+Starting chars: a b
ab
--->ab
+0 ^ a*
@@ -4893,7 +4897,7 @@ Options:
No first char
Need char = 'x'
Subject length lower bound = 4
-Starting byte set: a d
+Starting chars: a d
abcx
--->abcx
+0 ^ (abc|def)
@@ -5127,7 +5131,7 @@ Options:
No first char
No need char
Subject length lower bound = 2
-Starting byte set: a b x
+Starting chars: a b x
Note: that { does NOT introduce a quantifier
--->Note: that { does NOT introduce a quantifier
+0 ^ ([ab]{,4}c|xy)
@@ -5607,7 +5611,7 @@ No options
First char = 'a'
Need char = 'c'
Subject length lower bound = 3
-No set of starting bytes
+No starting char list
Compiled pattern written to testsavedregex
Study data written to testsavedregex
<testsavedregex
@@ -5642,7 +5646,7 @@ No options
First char = 'a'
Need char = 'c'
Subject length lower bound = 3
-No set of starting bytes
+No starting char list
Compiled pattern written to testsavedregex
Study data written to testsavedregex
<testsavedregex
@@ -5677,7 +5681,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b
+Starting chars: a b
Compiled pattern written to testsavedregex
Study data written to testsavedregex
<testsavedregex
@@ -5716,7 +5720,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b
+Starting chars: a b
Compiled pattern written to testsavedregex
Study data written to testsavedregex
<testsavedregex
@@ -6431,7 +6435,7 @@ No options
No first char
Need char = ','
Subject length lower bound = 1
-Starting byte set: \x09 \x0a \x0b \x0c \x0d \x20 ,
+Starting chars: \x09 \x0a \x0b \x0c \x0d \x20 ,
\x0b,\x0b
0: \x0b,\x0b
\x0c,\x0d
@@ -6738,7 +6742,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: C a b c d
+Starting chars: C a b c d
/()[ab]xyz/IS
Capturing subpattern count = 1
@@ -6746,7 +6750,7 @@ No options
No first char
Need char = 'z'
Subject length lower bound = 4
-Starting byte set: a b
+Starting chars: a b
/(|)[ab]xyz/IS
Capturing subpattern count = 1
@@ -6754,7 +6758,7 @@ No options
No first char
Need char = 'z'
Subject length lower bound = 4
-Starting byte set: a b
+Starting chars: a b
/(|c)[ab]xyz/IS
Capturing subpattern count = 1
@@ -6762,7 +6766,7 @@ No options
No first char
Need char = 'z'
Subject length lower bound = 4
-Starting byte set: a b c
+Starting chars: a b c
/(|c?)[ab]xyz/IS
Capturing subpattern count = 1
@@ -6770,7 +6774,7 @@ No options
No first char
Need char = 'z'
Subject length lower bound = 4
-Starting byte set: a b c
+Starting chars: a b c
/(d?|c?)[ab]xyz/IS
Capturing subpattern count = 1
@@ -6778,7 +6782,7 @@ No options
No first char
Need char = 'z'
Subject length lower bound = 4
-Starting byte set: a b c d
+Starting chars: a b c d
/(d?|c)[ab]xyz/IS
Capturing subpattern count = 1
@@ -6786,7 +6790,7 @@ No options
No first char
Need char = 'z'
Subject length lower bound = 4
-Starting byte set: a b c d
+Starting chars: a b c d
/^a*b\d/DZ
------------------------------------------------------------------
@@ -6879,7 +6883,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b c d
+Starting chars: a b c d
/(a+|b*)[cd]/IS
Capturing subpattern count = 1
@@ -6887,7 +6891,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b c d
+Starting chars: a b c d
/(a*|b+)[cd]/IS
Capturing subpattern count = 1
@@ -6895,7 +6899,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: a b c d
+Starting chars: a b c d
/(a+|b+)[cd]/IS
Capturing subpattern count = 1
@@ -6903,7 +6907,7 @@ No options
No first char
No need char
Subject length lower bound = 2
-Starting byte set: a b
+Starting chars: a b
/((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((
@@ -9307,7 +9311,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: x y z
+Starting chars: x y z
/(?(?=.*b)b|^)/CI
Capturing subpattern count = 0
@@ -10096,7 +10100,7 @@ No options
No first char
No need char
Subject length lower bound = 2
-Starting byte set: a b
+Starting chars: a b
/(a|bc)\1{2,3}/SI
Capturing subpattern count = 1
@@ -10105,7 +10109,7 @@ No options
No first char
No need char
Subject length lower bound = 3
-Starting byte set: a b
+Starting chars: a b
/(a|bc)(?1)/SI
Capturing subpattern count = 1
@@ -10113,7 +10117,7 @@ No options
No first char
No need char
Subject length lower bound = 2
-Starting byte set: a b
+Starting chars: a b
/(a|b\1)(a|b\1)/SI
Capturing subpattern count = 2
@@ -10122,7 +10126,7 @@ No options
No first char
No need char
Subject length lower bound = 2
-Starting byte set: a b
+Starting chars: a b
/(a|b\1){2}/SI
Capturing subpattern count = 1
@@ -10131,7 +10135,7 @@ No options
No first char
No need char
Subject length lower bound = 2
-Starting byte set: a b
+Starting chars: a b
/(a|bbbb\1)(a|bbbb\1)/SI
Capturing subpattern count = 2
@@ -10140,7 +10144,7 @@ No options
No first char
No need char
Subject length lower bound = 2
-Starting byte set: a b
+Starting chars: a b
/(a|bbbb\1){2}/SI
Capturing subpattern count = 1
@@ -10149,7 +10153,7 @@ No options
No first char
No need char
Subject length lower bound = 2
-Starting byte set: a b
+Starting chars: a b
/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/SI
Capturing subpattern count = 1
@@ -10157,7 +10161,7 @@ Options: anchored
No first char
Need char = ':'
Subject length lower bound = 22
-No set of starting bytes
+No starting char list
/<tr([\w\W\s\d][^<>]{0,})><TD([\w\W\s\d][^<>]{0,})>([\d]{0,}\.)(.*)((<BR>([\w\W\s\d][^<>]{0,})|[\s]{0,}))<\/a><\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><TD([\w\W\s\d][^<>]{0,})>([\w\W\s\d][^<>]{0,})<\/TD><\/TR>/isIS
Capturing subpattern count = 11
@@ -10165,7 +10169,7 @@ Options: caseless dotall
First char = '<'
Need char = '>'
Subject length lower bound = 47
-No set of starting bytes
+No starting char list
"(?>.*/)foo"SI
Capturing subpattern count = 0
@@ -10173,7 +10177,7 @@ No options
No first char
Need char = 'o'
Subject length lower bound = 4
-No set of starting bytes
+No starting char list
/(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /xSI
Capturing subpattern count = 0
@@ -10181,7 +10185,7 @@ Options: extended
No first char
Need char = '-'
Subject length lower bound = 8
-No set of starting bytes
+No starting char list
/(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))/iSI
Capturing subpattern count = 1
@@ -10189,7 +10193,7 @@ Options: caseless
No first char
No need char
Subject length lower bound = 1
-Starting byte set: A B C a b c
+Starting chars: A B C a b c
/(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))/SI
Capturing subpattern count = 0
@@ -10197,7 +10201,7 @@ No options
No first char
Need char = 'b'
Subject length lower bound = 41
-Starting byte set: c d
+Starting chars: c d
/<a[\s]+href[\s]*=[\s]* # find <a href=
([\"\'])? # find single or double quote
@@ -10210,7 +10214,7 @@ Options: caseless extended dotall
First char = '<'
Need char = '='
Subject length lower bound = 9
-No set of starting bytes
+No starting char list
/^(?!:) # colon disallowed at start
(?: # start of item
@@ -10226,7 +10230,7 @@ Options: anchored caseless extended
No first char
Need char = ':'
Subject length lower bound = 2
-No set of starting bytes
+No starting char list
/(?|(?<a>A)|(?<a>B))/I
Capturing subpattern count = 1
@@ -10450,7 +10454,7 @@ Options:
No first char
Need char = 'a'
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
cat
0: a
1:
@@ -10464,7 +10468,7 @@ No options
No first char
Need char = 'a'
Subject length lower bound = 3
-No set of starting bytes
+No starting char list
cat
No match
@@ -10476,7 +10480,7 @@ No options
First char = 'i'
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
i
0: i
@@ -10486,7 +10490,7 @@ No options
No first char
Need char = 'i'
Subject length lower bound = 1
-Starting byte set: i
+Starting chars: i
ia
0: ia
1:
@@ -11080,7 +11084,7 @@ No options
First char = 'a'
Need char = '4'
Subject length lower bound = 5
-No set of starting bytes
+No starting char list
/([abc])++1234/SI
Capturing subpattern count = 1
@@ -11088,7 +11092,7 @@ No options
No first char
Need char = '4'
Subject length lower bound = 5
-Starting byte set: a b c
+Starting chars: a b c
/(?<=(abc)+)X/
Failed: lookbehind assertion is not fixed length at offset 10
@@ -11369,7 +11373,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/(a(?2)|b)(b(?1)|a)(?:(?1)|(?2))/SI
Capturing subpattern count = 2
@@ -11377,7 +11381,7 @@ No options
No first char
No need char
Subject length lower bound = 3
-Starting byte set: a b
+Starting chars: a b
/(a(?2)|b)(b(?1)|a)(?1)(?2)/SI
Capturing subpattern count = 2
@@ -11385,7 +11389,7 @@ No options
No first char
No need char
Subject length lower bound = 4
-Starting byte set: a b
+Starting chars: a b
/(abc)(?1)/SI
Capturing subpattern count = 1
@@ -11393,7 +11397,7 @@ No options
First char = 'a'
Need char = 'c'
Subject length lower bound = 6
-No set of starting bytes
+No starting char list
/^(?>a)++/
aa\M
@@ -11711,7 +11715,7 @@ No options
First char = 't'
Need char = 't'
Subject length lower bound = 18
-No set of starting bytes
+No starting char list
/\btype\b\W*?\btext\b\W*?\bjavascript\b|\burl\b\W*?\bshell:|<input\b.*?\btype\b\W*?\bimage\b|\bonkeyup\b\W*?\=/IS
Capturing subpattern count = 0
@@ -11720,7 +11724,7 @@ No options
No first char
No need char
Subject length lower bound = 8
-Starting byte set: < o t u
+Starting chars: < o t u
/a(*SKIP)c|b(*ACCEPT)|/+S!I
Capturing subpattern count = 0
@@ -11729,7 +11733,7 @@ No options
No first char
No need char
Subject length lower bound = -1
-No set of starting bytes
+No starting char list
a
0:
0+
@@ -11740,7 +11744,7 @@ No options
No first char
No need char
Subject length lower bound = -1
-Starting byte set: a b x
+Starting chars: a b x
ax
0: x
@@ -12436,7 +12440,7 @@ No options
No first char
No need char
Subject length lower bound = -1
-No set of starting bytes
+No starting char list
/(?:(a)+(?C1)bb|aa(?C2)b)/
aab\C+
@@ -12722,7 +12726,7 @@ No options
No first char
Need char = 'z'
Subject length lower bound = 2
-Starting byte set: a z
+Starting chars: a z
aaaaaaaaaaaaaz
Error -21 (recursion limit exceeded)
aaaaaaaaaaaaaz\Q1000
@@ -12735,7 +12739,7 @@ No options
No first char
Need char = 'z'
Subject length lower bound = 2
-Starting byte set: a z
+Starting chars: a z
aaaaaaaaaaaaaz
Error -21 (recursion limit exceeded)
@@ -12746,7 +12750,7 @@ No options
No first char
Need char = 'z'
Subject length lower bound = 2
-Starting byte set: a z
+Starting chars: a z
aaaaaaaaaaaaaz
No match
aaaaaaaaaaaaaz\Q10
@@ -12790,7 +12794,7 @@ Options: dupnames
First char = 'a'
Need char = 'z'
Subject length lower bound = 5
-No set of starting bytes
+No starting char list
/a*[bcd]/BZ
------------------------------------------------------------------
@@ -13902,7 +13906,7 @@ No options
No first char
Need char = 'd'
Subject length lower bound = 1
-Starting byte set: a b c d
+Starting chars: a b c d
/[a-c]+d/DZS
------------------------------------------------------------------
@@ -13917,7 +13921,7 @@ No options
No first char
Need char = 'd'
Subject length lower bound = 2
-Starting byte set: a b c
+Starting chars: a b c
/[a-c]?d/DZS
------------------------------------------------------------------
@@ -13932,7 +13936,7 @@ No options
No first char
Need char = 'd'
Subject length lower bound = 1
-Starting byte set: a b c d
+Starting chars: a b c d
/[a-c]{4,6}d/DZS
------------------------------------------------------------------
@@ -13947,7 +13951,7 @@ No options
No first char
Need char = 'd'
Subject length lower bound = 5
-Starting byte set: a b c
+Starting chars: a b c
/[a-c]{0,6}d/DZS
------------------------------------------------------------------
@@ -13962,7 +13966,7 @@ No options
No first char
Need char = 'd'
Subject length lower bound = 1
-Starting byte set: a b c d
+Starting chars: a b c d
/-- End of special auto-possessive tests --/
@@ -14125,4 +14129,24 @@ No match
/[a[:<:]] should give error/
Failed: unknown POSIX class name at offset 4
+/(?=ab\K)/+
+ abcd
+Start of matched string is beyond its end - displaying from end to start.
+ 0: ab
+ 0+ abcd
+
+/abcd/f<lf>
+ xx\nxabcd
+No match
+
+/ -- Test stack check external calls --/
+
+/(((((a)))))/Q0
+
+/(((((a)))))/Q1
+Failed: parentheses are too deeply nested (stack check) at offset 0
+
+/(((((a)))))/Q
+** Missing 0 or 1 after /Q
+
/-- End of testinput2 --/
diff --git a/pcre/testdata/testoutput21-16 b/pcre/testdata/testoutput21-16
index 0e21350f891..da194d90e0b 100644
--- a/pcre/testdata/testoutput21-16
+++ b/pcre/testdata/testoutput21-16
@@ -50,7 +50,7 @@ Options: anchored extended
No first char
No need char
Subject length lower bound = 6
-No set of starting bytes
+No starting char list
<!testsaved16BE-1
Compiled pattern loaded from testsaved16BE-1
@@ -83,7 +83,7 @@ Options: anchored extended
No first char
No need char
Subject length lower bound = 6
-No set of starting bytes
+No starting char list
<!testsaved32LE-1
Compiled pattern loaded from testsaved32LE-1
diff --git a/pcre/testdata/testoutput21-32 b/pcre/testdata/testoutput21-32
index 183487aca13..d087bb6f4dd 100644
--- a/pcre/testdata/testoutput21-32
+++ b/pcre/testdata/testoutput21-32
@@ -62,7 +62,7 @@ Options: anchored extended
No first char
No need char
Subject length lower bound = 6
-No set of starting bytes
+No starting char list
<!testsaved32BE-1
Compiled pattern loaded from testsaved32BE-1
@@ -95,6 +95,6 @@ Options: anchored extended
No first char
No need char
Subject length lower bound = 6
-No set of starting bytes
+No starting char list
/-- End of testinput21 --/
diff --git a/pcre/testdata/testoutput22-16 b/pcre/testdata/testoutput22-16
index f896b13e18a..32a71cd4438 100644
--- a/pcre/testdata/testoutput22-16
+++ b/pcre/testdata/testoutput22-16
@@ -37,7 +37,7 @@ Options: extended utf
No first char
No need char
Subject length lower bound = 2
-No set of starting bytes
+No starting char list
<!testsaved16BE-2
Compiled pattern loaded from testsaved16BE-2
@@ -64,7 +64,7 @@ Options: extended utf
No first char
No need char
Subject length lower bound = 2
-No set of starting bytes
+No starting char list
<!testsaved32LE-2
Compiled pattern loaded from testsaved32LE-2
diff --git a/pcre/testdata/testoutput22-32 b/pcre/testdata/testoutput22-32
index 783926b8210..13e441d159c 100644
--- a/pcre/testdata/testoutput22-32
+++ b/pcre/testdata/testoutput22-32
@@ -49,7 +49,7 @@ Options: extended utf
No first char
No need char
Subject length lower bound = 2
-No set of starting bytes
+No starting char list
<!testsaved32BE-2
Compiled pattern loaded from testsaved32BE-2
@@ -76,6 +76,6 @@ Options: extended utf
No first char
No need char
Subject length lower bound = 2
-No set of starting bytes
+No starting char list
/-- End of testinput22 --/
diff --git a/pcre/testdata/testoutput23 b/pcre/testdata/testoutput23
index 6f5384c34e8..6dabf03b0fa 100644
--- a/pcre/testdata/testoutput23
+++ b/pcre/testdata/testoutput23
@@ -18,7 +18,7 @@ Failed: character value in \x{} or \o{} is too large at offset 8
/[\H]/BZSI
------------------------------------------------------------------
Bra
- [\x00-\x08\x0a-\x1f!-\x9f\x{a1}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{ffff}]
+ [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff\x{100}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{ffff}]
Ket
End
------------------------------------------------------------------
@@ -27,12 +27,25 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x0a \x0b
+ \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 \x1a
+ \x1b \x1c \x1d \x1e \x1f ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9
+ : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^
+ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f \x80
+ \x81 \x82 \x83 \x84 \x85 \x86 \x87 \x88 \x89 \x8a \x8b \x8c \x8d \x8e \x8f
+ \x90 \x91 \x92 \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9a \x9b \x9c \x9d \x9e
+ \x9f \xa1 \xa2 \xa3 \xa4 \xa5 \xa6 \xa7 \xa8 \xa9 \xaa \xab \xac \xad \xae
+ \xaf \xb0 \xb1 \xb2 \xb3 \xb4 \xb5 \xb6 \xb7 \xb8 \xb9 \xba \xbb \xbc \xbd
+ \xbe \xbf \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc
+ \xcd \xce \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb
+ \xdc \xdd \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea
+ \xeb \xec \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9
+ \xfa \xfb \xfc \xfd \xfe \xff
/[\V]/BZSI
------------------------------------------------------------------
Bra
- [\x00-\x09\x0e-\x84\x{86}-\x{2027}\x{202a}-\x{ffff}]
+ [\x00-\x09\x0e-\x84\x86-\xff\x{100}-\x{2027}\x{202a}-\x{ffff}]
Ket
End
------------------------------------------------------------------
@@ -41,6 +54,19 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0e
+ \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 \x1a \x1b \x1c \x1d
+ \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = >
+ ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c
+ d e f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f \x80 \x81 \x82
+ \x83 \x84 \x86 \x87 \x88 \x89 \x8a \x8b \x8c \x8d \x8e \x8f \x90 \x91 \x92
+ \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9a \x9b \x9c \x9d \x9e \x9f \xa0 \xa1
+ \xa2 \xa3 \xa4 \xa5 \xa6 \xa7 \xa8 \xa9 \xaa \xab \xac \xad \xae \xaf \xb0
+ \xb1 \xb2 \xb3 \xb4 \xb5 \xb6 \xb7 \xb8 \xb9 \xba \xbb \xbc \xbd \xbe \xbf
+ \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc \xcd \xce
+ \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb \xdc \xdd
+ \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea \xeb \xec
+ \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9 \xfa \xfb
+ \xfc \xfd \xfe \xff
/-- End of testinput23 --/
diff --git a/pcre/testdata/testoutput25 b/pcre/testdata/testoutput25
index 7ad3378368f..4c62c8d8079 100644
--- a/pcre/testdata/testoutput25
+++ b/pcre/testdata/testoutput25
@@ -1,6 +1,6 @@
/-- Tests for the 32-bit library only */
-< forbid 8w
+< forbid 8W
/-- Check maximum character size --/
@@ -65,7 +65,7 @@ Need char = \x{800000}
/[\H]/BZSI
------------------------------------------------------------------
Bra
- [\x00-\x08\x0a-\x1f!-\x9f\x{a1}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{ffffffff}]
+ [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff\x{100}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{ffffffff}]
Ket
End
------------------------------------------------------------------
@@ -74,12 +74,25 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x0a \x0b
+ \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 \x1a
+ \x1b \x1c \x1d \x1e \x1f ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9
+ : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^
+ _ ` a b c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f \x80
+ \x81 \x82 \x83 \x84 \x85 \x86 \x87 \x88 \x89 \x8a \x8b \x8c \x8d \x8e \x8f
+ \x90 \x91 \x92 \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9a \x9b \x9c \x9d \x9e
+ \x9f \xa1 \xa2 \xa3 \xa4 \xa5 \xa6 \xa7 \xa8 \xa9 \xaa \xab \xac \xad \xae
+ \xaf \xb0 \xb1 \xb2 \xb3 \xb4 \xb5 \xb6 \xb7 \xb8 \xb9 \xba \xbb \xbc \xbd
+ \xbe \xbf \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc
+ \xcd \xce \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb
+ \xdc \xdd \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea
+ \xeb \xec \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9
+ \xfa \xfb \xfc \xfd \xfe \xff
/[\V]/BZSI
------------------------------------------------------------------
Bra
- [\x00-\x09\x0e-\x84\x{86}-\x{2027}\x{202a}-\x{ffffffff}]
+ [\x00-\x09\x0e-\x84\x86-\xff\x{100}-\x{2027}\x{202a}-\x{ffffffff}]
Ket
End
------------------------------------------------------------------
@@ -88,6 +101,19 @@ No options
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+Starting chars: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0e
+ \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 \x1a \x1b \x1c \x1d
+ \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = >
+ ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b c
+ d e f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f \x80 \x81 \x82
+ \x83 \x84 \x86 \x87 \x88 \x89 \x8a \x8b \x8c \x8d \x8e \x8f \x90 \x91 \x92
+ \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9a \x9b \x9c \x9d \x9e \x9f \xa0 \xa1
+ \xa2 \xa3 \xa4 \xa5 \xa6 \xa7 \xa8 \xa9 \xaa \xab \xac \xad \xae \xaf \xb0
+ \xb1 \xb2 \xb3 \xb4 \xb5 \xb6 \xb7 \xb8 \xb9 \xba \xbb \xbc \xbd \xbe \xbf
+ \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc \xcd \xce
+ \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb \xdc \xdd
+ \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea \xeb \xec
+ \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9 \xfa \xfb
+ \xfc \xfd \xfe \xff
/-- End of testinput25 --/
diff --git a/pcre/testdata/testoutput3 b/pcre/testdata/testoutput3
index 12ffc9911b6..73119ab4b7b 100644
--- a/pcre/testdata/testoutput3
+++ b/pcre/testdata/testoutput3
@@ -1,7 +1,10 @@
-/-- This set of tests checks local-specific features, using the fr_FR locale.
- It is not Perl-compatible. There is different version called wintestinput3
- f or use on Windows, where the locale is called "french". --/
-
+/-- This set of tests checks local-specific features, using the "fr_FR" locale.
+ It is not Perl-compatible. When run via RunTest, the locale is edited to
+ be whichever of "fr_FR", "french", or "fr" is found to exist. There is
+ different version of this file called wintestinput3 for use on Windows,
+ where the locale is called "french" and the tests are run using
+ RunTest.bat. --/
+
< forbid 8W
/^[\w]+/
@@ -90,7 +93,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
+Starting chars: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
/\w/ISLfr_FR
@@ -99,7 +102,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
+Starting chars: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â
ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
diff --git a/pcre/testdata/testoutput3A b/pcre/testdata/testoutput3A
new file mode 100644
index 00000000000..0bde024e4e1
--- /dev/null
+++ b/pcre/testdata/testoutput3A
@@ -0,0 +1,174 @@
+/-- This set of tests checks local-specific features, using the "fr_FR" locale.
+ It is not Perl-compatible. When run via RunTest, the locale is edited to
+ be whichever of "fr_FR", "french", or "fr" is found to exist. There is
+ different version of this file called wintestinput3 for use on Windows,
+ where the locale is called "french" and the tests are run using
+ RunTest.bat. --/
+
+< forbid 8W
+
+/^[\w]+/
+ *** Failers
+No match
+ École
+No match
+
+/^[\w]+/Lfr_FR
+ École
+ 0: École
+
+/^[\w]+/
+ *** Failers
+No match
+ École
+No match
+
+/^[\W]+/
+ École
+ 0: \xc9
+
+/^[\W]+/Lfr_FR
+ *** Failers
+ 0: ***
+ École
+No match
+
+/[\b]/
+ \b
+ 0: \x08
+ *** Failers
+No match
+ a
+No match
+
+/[\b]/Lfr_FR
+ \b
+ 0: \x08
+ *** Failers
+No match
+ a
+No match
+
+/^\w+/
+ *** Failers
+No match
+ École
+No match
+
+/^\w+/Lfr_FR
+ École
+ 0: École
+
+/(.+)\b(.+)/
+ École
+ 0: \xc9cole
+ 1: \xc9
+ 2: cole
+
+/(.+)\b(.+)/Lfr_FR
+ *** Failers
+ 0: *** Failers
+ 1: ***
+ 2: Failers
+ École
+No match
+
+/École/i
+ École
+ 0: \xc9cole
+ *** Failers
+No match
+ école
+No match
+
+/École/iLfr_FR
+ École
+ 0: École
+ école
+ 0: école
+
+/\w/IS
+Capturing subpattern count = 0
+No options
+No first char
+No need char
+Subject length lower bound = 1
+Starting chars: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
+ Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
+
+/\w/ISLfr_FR
+Capturing subpattern count = 0
+No options
+No first char
+No need char
+Subject length lower bound = 1
+Starting chars: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
+ Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
+ ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â
+ ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
+
+/^[\xc8-\xc9]/iLfr_FR
+ École
+ 0: É
+ école
+ 0: é
+
+/^[\xc8-\xc9]/Lfr_FR
+ École
+ 0: É
+ *** Failers
+No match
+ école
+No match
+
+/\W+/Lfr_FR
+ >>>\xaa<<<
+ 0: >>>
+ >>>\xba<<<
+ 0: >>>
+
+/[\W]+/Lfr_FR
+ >>>\xaa<<<
+ 0: >>>
+ >>>\xba<<<
+ 0: >>>
+
+/[^[:alpha:]]+/Lfr_FR
+ >>>\xaa<<<
+ 0: >>>
+ >>>\xba<<<
+ 0: >>>
+
+/\w+/Lfr_FR
+ >>>\xaa<<<
+ 0: ª
+ >>>\xba<<<
+ 0: º
+
+/[\w]+/Lfr_FR
+ >>>\xaa<<<
+ 0: ª
+ >>>\xba<<<
+ 0: º
+
+/[[:alpha:]]+/Lfr_FR
+ >>>\xaa<<<
+ 0: ª
+ >>>\xba<<<
+ 0: º
+
+/[[:alpha:]][[:lower:]][[:upper:]]/DZLfr_FR
+------------------------------------------------------------------
+ Bra
+ [A-Za-z\xaa\xb5\xba\xc0-\xd6\xd8-\xf6\xf8-\xff]
+ [a-z\xaa\xb5\xba\xdf-\xf6\xf8-\xff]
+ [A-Z\xc0-\xd6\xd8-\xde]
+ Ket
+ End
+------------------------------------------------------------------
+Capturing subpattern count = 0
+No options
+No first char
+No need char
+
+/-- End of testinput3 --/
diff --git a/pcre/testdata/testoutput3B b/pcre/testdata/testoutput3B
new file mode 100644
index 00000000000..8d9fe7df772
--- /dev/null
+++ b/pcre/testdata/testoutput3B
@@ -0,0 +1,174 @@
+/-- This set of tests checks local-specific features, using the "fr_FR" locale.
+ It is not Perl-compatible. When run via RunTest, the locale is edited to
+ be whichever of "fr_FR", "french", or "fr" is found to exist. There is
+ different version of this file called wintestinput3 for use on Windows,
+ where the locale is called "french" and the tests are run using
+ RunTest.bat. --/
+
+< forbid 8W
+
+/^[\w]+/
+ *** Failers
+No match
+ École
+No match
+
+/^[\w]+/Lfr_FR
+ École
+ 0: École
+
+/^[\w]+/
+ *** Failers
+No match
+ École
+No match
+
+/^[\W]+/
+ École
+ 0: \xc9
+
+/^[\W]+/Lfr_FR
+ *** Failers
+ 0: ***
+ École
+No match
+
+/[\b]/
+ \b
+ 0: \x08
+ *** Failers
+No match
+ a
+No match
+
+/[\b]/Lfr_FR
+ \b
+ 0: \x08
+ *** Failers
+No match
+ a
+No match
+
+/^\w+/
+ *** Failers
+No match
+ École
+No match
+
+/^\w+/Lfr_FR
+ École
+ 0: École
+
+/(.+)\b(.+)/
+ École
+ 0: \xc9cole
+ 1: \xc9
+ 2: cole
+
+/(.+)\b(.+)/Lfr_FR
+ *** Failers
+ 0: *** Failers
+ 1: ***
+ 2: Failers
+ École
+No match
+
+/École/i
+ École
+ 0: \xc9cole
+ *** Failers
+No match
+ école
+No match
+
+/École/iLfr_FR
+ École
+ 0: École
+ école
+ 0: école
+
+/\w/IS
+Capturing subpattern count = 0
+No options
+No first char
+No need char
+Subject length lower bound = 1
+Starting chars: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
+ Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
+
+/\w/ISLfr_FR
+Capturing subpattern count = 0
+No options
+No first char
+No need char
+Subject length lower bound = 1
+Starting chars: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
+ Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
+ ª µ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â
+ ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý þ ÿ
+
+/^[\xc8-\xc9]/iLfr_FR
+ École
+ 0: É
+ école
+ 0: é
+
+/^[\xc8-\xc9]/Lfr_FR
+ École
+ 0: É
+ *** Failers
+No match
+ école
+No match
+
+/\W+/Lfr_FR
+ >>>\xaa<<<
+ 0: >>>
+ >>>\xba<<<
+ 0: >>>
+
+/[\W]+/Lfr_FR
+ >>>\xaa<<<
+ 0: >>>
+ >>>\xba<<<
+ 0: >>>
+
+/[^[:alpha:]]+/Lfr_FR
+ >>>\xaa<<<
+ 0: >>>
+ >>>\xba<<<
+ 0: >>>
+
+/\w+/Lfr_FR
+ >>>\xaa<<<
+ 0: ª
+ >>>\xba<<<
+ 0: º
+
+/[\w]+/Lfr_FR
+ >>>\xaa<<<
+ 0: ª
+ >>>\xba<<<
+ 0: º
+
+/[[:alpha:]]+/Lfr_FR
+ >>>\xaa<<<
+ 0: ª
+ >>>\xba<<<
+ 0: º
+
+/[[:alpha:]][[:lower:]][[:upper:]]/DZLfr_FR
+------------------------------------------------------------------
+ Bra
+ [A-Za-z\x83\x8a\x8c\x8e\x9a\x9c\x9e\x9f\xaa\xb5\xba\xc0-\xd6\xd8-\xf6\xf8-\xff]
+ [a-z\x83\x9a\x9c\x9e\xaa\xb5\xba\xdf-\xf6\xf8-\xff]
+ [A-Z\x8a\x8c\x8e\x9f\xc0-\xd6\xd8-\xde]
+ Ket
+ End
+------------------------------------------------------------------
+Capturing subpattern count = 0
+No options
+No first char
+No need char
+
+/-- End of testinput3 --/
diff --git a/pcre/testdata/testoutput4 b/pcre/testdata/testoutput4
index 0dbec4eccab..dcf13b08507 100644
--- a/pcre/testdata/testoutput4
+++ b/pcre/testdata/testoutput4
@@ -1263,4 +1263,12 @@ No match
aa
0: aa
+/^.\B.\B./8
+ \x{10123}\x{10124}\x{10125}
+ 0: \x{10123}\x{10124}\x{10125}
+
+/^#[^\x{ffff}]#[^\x{ffff}]#[^\x{ffff}]#/8
+ #\x{10000}#\x{100}#\x{10ffff}#
+ 0: #\x{10000}#\x{100}#\x{10ffff}#
+
/-- End of testinput4 --/
diff --git a/pcre/testdata/testoutput5 b/pcre/testdata/testoutput5
index 3fa581052e6..5c098e650ba 100644
--- a/pcre/testdata/testoutput5
+++ b/pcre/testdata/testoutput5
@@ -270,7 +270,7 @@ No match
/[z-\x{100}]/8DZ
------------------------------------------------------------------
Bra
- [z-\x{100}]
+ [z-\xff\x{100}]
Ket
End
------------------------------------------------------------------
@@ -812,7 +812,7 @@ No match
/[\H]/8BZ
------------------------------------------------------------------
Bra
- [\x00-\x08\x0a-\x1f!-\x9f\x{a1}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{10ffff}]
+ [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff\x{100}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{10ffff}]
Ket
End
------------------------------------------------------------------
@@ -820,7 +820,7 @@ No match
/[\V]/8BZ
------------------------------------------------------------------
Bra
- [\x00-\x09\x0e-\x84\x{86}-\x{2027}\x{202a}-\x{10ffff}]
+ [\x00-\x09\x0e-\x84\x86-\xff\x{100}-\x{2027}\x{202a}-\x{10ffff}]
Ket
End
------------------------------------------------------------------
@@ -1536,7 +1536,7 @@ Options: caseless utf
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/[^\x{1234}]+?/iS8I
Capturing subpattern count = 0
@@ -1544,7 +1544,7 @@ Options: caseless utf
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/[^\x{1234}]++/iS8I
Capturing subpattern count = 0
@@ -1552,7 +1552,7 @@ Options: caseless utf
No first char
No need char
Subject length lower bound = 1
-No set of starting bytes
+No starting char list
/[^\x{1234}]{2}/iS8I
Capturing subpattern count = 0
@@ -1560,7 +1560,7 @@ Options: caseless utf
No first char
No need char
Subject length lower bound = 2
-No set of starting bytes
+No starting char list
//<bsr_anycrlf><bsr_unicode>
Failed: inconsistent NEWLINE options at offset 0
@@ -1620,7 +1620,7 @@ Failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at offset 7
/[\H\x{d7ff}]+/8BZ
------------------------------------------------------------------
Bra
- [\x00-\x08\x0a-\x1f!-\x9f\x{a1}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{10ffff}\x{d7ff}]++
+ [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff\x{100}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{10ffff}\x{d7ff}]++
Ket
End
------------------------------------------------------------------
@@ -1660,7 +1660,7 @@ Failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at offset 7
/[\V\x{d7ff}]+/8BZ
------------------------------------------------------------------
Bra
- [\x00-\x09\x0e-\x84\x{86}-\x{2027}\x{202a}-\x{10ffff}\x{d7ff}]++
+ [\x00-\x09\x0e-\x84\x86-\xff\x{100}-\x{2027}\x{202a}-\x{10ffff}\x{d7ff}]++
Ket
End
------------------------------------------------------------------
@@ -1882,4 +1882,19 @@ Failed: disallowed Unicode code point (>= 0xd800 && <= 0xdfff) at offset 5
aa
0: aa
+/[b-d\x{200}-\x{250}]*[ae-h]?#[\x{200}-\x{250}]{0,8}[\x00-\xff]*#[\x{200}-\x{250}]+[a-z]/8BZ
+------------------------------------------------------------------
+ Bra
+ [b-d\x{200}-\x{250}]*+
+ [ae-h]?+
+ #
+ [\x{200}-\x{250}]{0,8}+
+ [\x00-\xff]*
+ #
+ [\x{200}-\x{250}]++
+ [a-z]
+ Ket
+ End
+------------------------------------------------------------------
+
/-- End of testinput5 --/
diff --git a/pcre/testdata/testoutput6 b/pcre/testdata/testoutput6
index 6c42fce1a5b..f355e601383 100644
--- a/pcre/testdata/testoutput6
+++ b/pcre/testdata/testoutput6
@@ -2445,4 +2445,16 @@ No match
\x{37e}
No match
+/[RST]+/8iW
+ Ss\x{17f}
+ 0: Ss\x{17f}
+
+/[R-T]+/8iW
+ Ss\x{17f}
+ 0: Ss\x{17f}
+
+/[q-u]+/8iW
+ Ss\x{17f}
+ 0: Ss\x{17f}
+
/-- End of testinput6 --/
diff --git a/pcre/testdata/testoutput7 b/pcre/testdata/testoutput7
index 45ac72fd8d4..c64e0499421 100644
--- a/pcre/testdata/testoutput7
+++ b/pcre/testdata/testoutput7
@@ -124,7 +124,7 @@ No match
/[z-\x{100}]/8iDZ
------------------------------------------------------------------
Bra
- [Z\x{39c}\x{3bc}\x{1e9e}\x{178}z-\x{101}]
+ [Zz-\xff\x{39c}\x{3bc}\x{212b}\x{1e9e}\x{212b}\x{178}\x{100}-\x{101}]
Ket
End
------------------------------------------------------------------
@@ -162,7 +162,7 @@ No match
/[z-\x{100}]/8DZi
------------------------------------------------------------------
Bra
- [Z\x{39c}\x{3bc}\x{1e9e}\x{178}z-\x{101}]
+ [Zz-\xff\x{39c}\x{3bc}\x{212b}\x{1e9e}\x{212b}\x{178}\x{100}-\x{101}]
Ket
End
------------------------------------------------------------------
@@ -2263,4 +2263,28 @@ No match
End
------------------------------------------------------------------
+/[RST]+/8iWBZ
+------------------------------------------------------------------
+ Bra
+ [R-Tr-t\x{17f}]++
+ Ket
+ End
+------------------------------------------------------------------
+
+/[R-T]+/8iWBZ
+------------------------------------------------------------------
+ Bra
+ [R-Tr-t\x{17f}]++
+ Ket
+ End
+------------------------------------------------------------------
+
+/[Q-U]+/8iWBZ
+------------------------------------------------------------------
+ Bra
+ [Q-Uq-u\x{17f}]++
+ Ket
+ End
+------------------------------------------------------------------
+
/-- End of testinput7 --/
diff --git a/pcre/testdata/testoutput8 b/pcre/testdata/testoutput8
index bb68d3e6452..3861ea41fdb 100644
--- a/pcre/testdata/testoutput8
+++ b/pcre/testdata/testoutput8
@@ -7232,7 +7232,7 @@ No options
No first char
No need char
Subject length lower bound = 3
-Starting byte set: a d x
+Starting chars: a d x
terhjk;abcdaadsfe
0: abc
the quick xyz brown fox
diff --git a/pcre/testdata/wintestoutput3 b/pcre/testdata/wintestoutput3
index 00880070670..456ad196b56 100644
--- a/pcre/testdata/wintestoutput3
+++ b/pcre/testdata/wintestoutput3
@@ -84,7 +84,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
+Starting chars: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
/\w/ISLfrench
@@ -93,7 +93,7 @@ No options
No first char
No need char
Subject length lower bound = 1
-Starting byte set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
+Starting chars: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
ƒ Š Œ Ž š œ ž Ÿ ª ² ³ µ ¹ º À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö
Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ø ù ú û ü ý