author: nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-02-24 21:40:24 +0000
committer: nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-02-24 21:40:24 +0000
commit: 4af6fcff808e079ca1aa09104d6146baa932af47
tree: dc14f3624835dd1275c31159a4c365ed439f3df7
parent: f08d5b6354f668c0047281d81eda8d0fd2a9e82d
download: pcre-4af6fcff808e079ca1aa09104d6146baa932af47.tar.gz
Load pcre-4.4 into code/trunk.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@71 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r--  ChangeLog                    56
-rw-r--r--  LICENCE                       2
-rw-r--r--  Makefile.in                  23
-rw-r--r--  NEWS                         15
-rwxr-xr-x  configure                    20
-rw-r--r--  configure.in                 17
-rw-r--r--  doc/Tech.Notes               35
-rw-r--r--  doc/html/pcre.html           15
-rw-r--r--  doc/html/pcre_compile.html    6
-rw-r--r--  doc/html/pcre_exec.html       3
-rw-r--r--  doc/html/pcreapi.html        34
-rw-r--r--  doc/html/pcrecallout.html     5
-rw-r--r--  doc/html/pcretest.html       16
-rw-r--r--  doc/pcre.3                   15
-rw-r--r--  doc/pcre.txt                 58
-rw-r--r--  doc/pcre_compile.3            6
-rw-r--r--  doc/pcre_exec.3               3
-rw-r--r--  doc/pcreapi.3                26
-rw-r--r--  doc/pcrecallout.3             5
-rw-r--r--  doc/pcretest.1               15
-rw-r--r--  doc/pcretest.txt             21
-rw-r--r--  internal.h                    5
-rw-r--r--  pcre.c                      130
-rw-r--r--  pcre.in                       4
-rw-r--r--  pcregrep.c                    4
-rw-r--r--  pcreposix.c                  11
-rw-r--r--  pcretest.c                   44
-rw-r--r--  study.c                      45
-rw-r--r--  testdata/testinput5          30
-rw-r--r--  testdata/testoutput1          2
-rw-r--r--  testdata/testoutput2         19
-rw-r--r--  testdata/testoutput3          4
-rw-r--r--  testdata/testoutput4          2
-rw-r--r--  testdata/testoutput5         76
34 files changed, 579 insertions, 193 deletions
diff --git a/ChangeLog b/ChangeLog
index b912314..1c0d36a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,6 +1,62 @@
ChangeLog for PCRE
------------------
+Version 4.4 13-Aug-03
+---------------------
+
+ 1. In UTF-8 mode, a character class containing characters with values between
+ 127 and 255 was not handled correctly if the compiled pattern was studied.
+ In fixing this, I have also improved the studying algorithm for such
+ classes (slightly).
+
+ 2. Three internal functions had redundant arguments passed to them. Removal
+ might give a very teeny performance improvement.
+
+ 3. Documentation bug: the value of the capture_top field in a callout is *one
+ more than* the number of the hightest numbered captured substring.
+
+ 4. The Makefile linked pcretest and pcregrep with -lpcre, which could result
+ in incorrectly linking with a previously installed version. They now link
+ explicitly with libpcre.la.
+
+ 5. configure.in no longer needs to recognize Cygwin specially.
+
+ 6. A problem in pcre.in for Windows platforms is fixed.
+
+ 7. If a pattern was successfully studied, and the -d (or /D) flag was given to
+ pcretest, it used to include the size of the study block as part of its
+ output. Unfortunately, the structure contains a field that has a different
+ size on different hardware architectures. This meant that the tests that
+ showed this size failed. As the block is currently always of a fixed size,
+ this information isn't actually particularly useful in pcretest output, so
+ I have just removed it.
+
+ 8. Three pre-processor statements accidentally did not start in column 1.
+ Sadly, there are *still* compilers around that complain, even though
+ standard C has not required this for well over a decade. Sigh.
+
+ 9. In pcretest, the code for checking callouts passed small integers in the
+ callout_data field, which is a void * field. However, some picky compilers
+ complained about the casts involved for this on 64-bit systems. Now
+ pcretest passes the address of the small integer instead, which should get
+ rid of the warnings.
+
+10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at
+ both compile and run time, and gives an error if an invalid UTF-8 sequence
+ is found. There is a option for disabling this check in cases where the
+ string is known to be correct and/or the maximum performance is wanted.
+
+11. In response to a bug report, I changed one line in Makefile.in from
+
+ -Wl,--out-implib,.libs/lib@WIN_PREFIX@pcreposix.dll.a \
+ to
+ -Wl,--out-implib,.libs/@WIN_PREFIX@libpcreposix.dll.a \
+
+ to look similar to other lines, but I have no way of telling whether this
+ is the right thing to do, as I do not use Windows. No doubt I'll get told
+ if it's wrong...
+
+
Version 4.3 21-May-03
---------------------
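
Note on item 9 of the 4.4 entry: the change from casting a small integer into the void * callout_data field to passing the integer's address can be sketched as follows. This is only an illustration of the idea, not the pcretest code; the struct callout_block_sketch and the function read_callout_value are hypothetical names invented here.

```c
#include <stddef.h>

/* Hypothetical stand-in for the part of a callout block that carries the
   caller's data. Only the void * field matters for this sketch. */
struct callout_block_sketch {
    void *callout_data;
};

/* Style that item 9 moves away from: the integer itself is forced into the
   pointer with a cast, which some compilers warn about when int and void *
   have different sizes:
       block.callout_data = (void *)some_small_int;                        */

/* Style that item 9 moves to: the caller stores the address of an int, and
   the receiving side dereferences it. A NULL pointer is read as zero here;
   that fallback is an assumption of this sketch. */
static int read_callout_value(const struct callout_block_sketch *cb)
{
    const int *p = (const int *)cb->callout_data;
    return (p == NULL) ? 0 : *p;
}
```

The integer must outlive the call that receives its address, which is the usual cost of replacing a value-in-pointer cast with a real pointer.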
diff --git a/LICENCE b/LICENCE
index 8d68061..09a242c 100644
--- a/LICENCE
+++ b/LICENCE
@@ -9,7 +9,7 @@ Written by: Philip Hazel <ph10@cam.ac.uk>
University of Cambridge Computing Service,
Cambridge, England. Phone: +44 1223 334714.
-Copyright (c) 1997-2001 University of Cambridge
+Copyright (c) 1997-2003 University of Cambridge
Permission is granted to anyone to use this software for any purpose on any
computer system, and to redistribute it freely, subject to the following
diff --git a/Makefile.in b/Makefile.in
index ecdd6ef..ef456aa 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -1,21 +1,14 @@
# Makefile.in for PCRE (Perl-Compatible Regular Expression) library.
-#---------------------------------------------------------------------------#
-# MinGW DLLs are built automatically with this configure.in and Makefile.in #
-# as long you are using autoconf 2.50 or higher. The Win32 static libraries #
-# have not been tested, but appear to be generated. This functionality is #
-# by courtesy of Fred Cox. I (Philip Hazel) don't know anything about it, #
-# as I live entirely in a non-Windows world. #
-#---------------------------------------------------------------------------#
-
#############################################################################
# PCRE is developed on a Unix system. I do not use Windows or Macs, and know
# nothing about building software on them. Although the code of PCRE should
# be very portable, the building system in this Makefile is designed for Unix
-# systems, with the exception of the mingw32 stuff just mentioned.
+# systems. However, there are features that have been supplied to me by various
+# people that should make it work on MinGW and Cygwin systems.
# This setting enables Unix-style directory scanning in pcregrep, triggered
# by the -f option. Maybe one day someone will add code for other systems.
@@ -106,11 +99,11 @@ LOBJ = maketables.lo get.lo study.lo pcre.lo @POSIX_LOBJ@
all: libpcre.la @POSIX_LIB@ pcretest@EXEEXT@ pcregrep@EXEEXT@ @ON_WINDOWS@ winshared
pcregrep@EXEEXT@: libpcre.la pcregrep.@OBJEXT@ @ON_WINDOWS@ winshared
- $(LINK) -o pcregrep@EXEEXT@ pcregrep.@OBJEXT@ -lpcre
+ $(LINK) -o pcregrep@EXEEXT@ pcregrep.@OBJEXT@ libpcre.la
pcretest@EXEEXT@: libpcre.la @POSIX_LIB@ pcretest.@OBJEXT@ @ON_WINDOWS@ winshared
$(LINK) $(PURIFY) $(EFENCE) -o pcretest@EXEEXT@ pcretest.@OBJEXT@ \
- -lpcre @POSIX_LIB@
+ libpcre.la @POSIX_LIB@
libpcre.la: $(OBJ)
-rm -f libpcre.la
@@ -119,7 +112,7 @@ libpcre.la: $(OBJ)
libpcreposix.la: pcreposix.@OBJEXT@
-rm -f libpcreposix.la
- $(LINKLIB) -rpath $(LIBDIR) -L. -lpcre -version-info \
+ $(LINKLIB) -rpath $(LIBDIR) libpcre.la -version-info \
'$(PCREPOSIXLIBVERSION)' -o libpcreposix.la pcreposix.lo
pcre.@OBJEXT@: $(top_srcdir)/chartables.c $(top_srcdir)/pcre.c \
@@ -151,7 +144,7 @@ pcretest.@OBJEXT@: $(top_srcdir)/pcretest.c $(top_srcdir)/internal.h \
pcregrep.@OBJEXT@: $(top_srcdir)/pcregrep.c pcre.h Makefile config.h
$(CC) -c $(CFLAGS) -I. $(UTF8) $(PCREGREP_OSTYPE) $(top_srcdir)/pcregrep.c
-# Some Windows-specific targets, for Cygwin and MinGW
+# Some Windows-specific targets for MinGW. Do not use for Cygwin.
winshared : .libs/@WIN_PREFIX@pcre.dll .libs/@WIN_PREFIX@pcreposix.dll
@@ -175,8 +168,8 @@ winshared : .libs/@WIN_PREFIX@pcre.dll .libs/@WIN_PREFIX@pcreposix.dll
.libs/@WIN_PREFIX@pcreposix.dll: libpcreposix.la libpcre.la
$(CC) $(CFLAGS) -shared -o $@ \
-Wl,--whole-archive .libs/libpcreposix.a \
- -Wl,--out-implib,.libs/lib@WIN_PREFIX@pcreposix.dll.a \
- -Wl,--output-def,.libs/@WIN_PREFIX@pcreposix.dll-def \
+ -Wl,--out-implib,.libs/@WIN_PREFIX@pcreposix.dll.a \
+ -Wl,--output-def,.libs/@WIN_PREFIX@libpcreposix.dll-def \
-Wl,--export-all-symbols \
-Wl,--no-whole-archive .libs/libpcre.a
sed -e "s#dlname=''#dlname='../bin/@WIN_PREFIX@pcreposix.dll'#" \
diff --git a/NEWS b/NEWS
index e620b2d..60d66d7 100644
--- a/NEWS
+++ b/NEWS
@@ -1,6 +1,21 @@
News about PCRE releases
------------------------
+Release 4.4 21-Aug-03
+---------------------
+
+This is mainly a bug-fix and tidying release. The only new feature is that PCRE
+checks UTF-8 strings for validity by default. There is an option to suppress
+this, just in case anybody wants that teeny extra bit of performance.
+
+
+Releases 4.1 - 4.3
+------------------
+
+Sorry, I forgot about updating the NEWS file for these releases. Please take a
+look at ChangeLog.
+
+
Release 4.0 17-Feb-03
---------------------
diff --git a/configure b/configure
index 51ac584..f89594f 100755
--- a/configure
+++ b/configure
@@ -936,6 +936,9 @@ if test "$ac_init_help" = "long"; then
# The list generated by autoconf has been trimmed to remove many
# options that are totally irrelevant to PCRE (e.g. relating to X),
# or are not supported by its Makefile.
+ # The list generated by autoconf has been trimmed to remove many
+ # options that are totally irrelevant to PCRE (e.g. relating to X),
+ # or are not supported by its Makefile.
# This message is too long to be a string in the A/UX 3.1 sh.
cat <<_ACEOF
\`configure' configures this package to adapt to many kinds of systems.
@@ -1432,8 +1435,8 @@ ac_compiler_gnu=$ac_cv_c_compiler_gnu
PCRE_MAJOR=4
-PCRE_MINOR=3
-PCRE_DATE=21-May-2003
+PCRE_MINOR=4
+PCRE_DATE=21-August-2003
PCRE_VERSION=${PCRE_MAJOR}.${PCRE_MINOR}
@@ -5094,7 +5097,7 @@ else
;;
darwin* | rhapsody*)
- # This patch put in by hand by PH (22-May-2003) for Darwin 1.3.
+ # This patch put in by hand by PH (21-Aug-2003) for Darwin 1.3.
case "$host_os" in
rhapsody* | darwin1.[[012]])
allow_undefined_flag='-undefined suppress'
@@ -7681,14 +7684,6 @@ mingw* )
NOT_ON_WINDOWS="#"
WIN_PREFIX=
;;
-cygwin* )
- ON_WINDOWS=
- POSIX_OBJ=pcreposix.o
- POSIX_LOBJ=pcreposix.lo
- POSIX_LIB=
- WIN_PREFIX=cyg
- NOT_ON_WINDOWS="#"
- ;;
* )
ON_WINDOWS="#"
NOT_ON_WINDOWS=
@@ -7706,7 +7701,8 @@ esac
if test "x$enable_shared" = "xno" ; then
- cat >>confdefs.h <<\_ACEOF
+
+cat >>confdefs.h <<\_ACEOF
#define PCRE_STATIC 1
_ACEOF
diff --git a/configure.in b/configure.in
index 5394f4f..69cb923 100644
--- a/configure.in
+++ b/configure.in
@@ -21,8 +21,8 @@ dnl digits for minor numbers less than 10. There are unlikely to be
dnl that many releases anyway.
PCRE_MAJOR=4
-PCRE_MINOR=3
-PCRE_DATE=21-May-2003
+PCRE_MINOR=4
+PCRE_DATE=21-August-2003
PCRE_VERSION=${PCRE_MAJOR}.${PCRE_MINOR}
dnl Default values for miscellaneous macros
@@ -146,7 +146,8 @@ AC_SUBST(PCRE_POSIXLIB_VERSION)
AC_SUBST(POSIX_MALLOC_THRESHOLD)
AC_SUBST(UTF8)
-dnl Stuff to make Win32 work better
+dnl Stuff to make MinGW work better. Special treatment is no longer
+dnl needed for Cygwin.
case $host_os in
mingw* )
@@ -157,14 +158,6 @@ mingw* )
NOT_ON_WINDOWS="#"
WIN_PREFIX=
;;
-cygwin* )
- ON_WINDOWS=
- POSIX_OBJ=pcreposix.o
- POSIX_LOBJ=pcreposix.lo
- POSIX_LIB=
- WIN_PREFIX=cyg
- NOT_ON_WINDOWS="#"
- ;;
* )
ON_WINDOWS="#"
NOT_ON_WINDOWS=
@@ -182,7 +175,7 @@ AC_SUBST(POSIX_LOBJ)
AC_SUBST(POSIX_LIB)
if test "x$enable_shared" = "xno" ; then
- AC_DEFINE(PCRE_STATIC,1)
+ AC_DEFINE([PCRE_STATIC],[1],[to link statically])
fi
dnl This must be last; it determines what files are written as well as config.h
diff --git a/doc/Tech.Notes b/doc/Tech.Notes
index dd01932..73c31c7 100644
--- a/doc/Tech.Notes
+++ b/doc/Tech.Notes
@@ -48,7 +48,9 @@ These items are all just one byte long
OP_END end of pattern
OP_ANY match any character
+ OP_ANYBYTE match any single byte, even in UTF-8 mode
OP_SOD match start of data: \A
+ OP_SOM, start of match (subject + offset): \G
OP_CIRC ^ (start of data, or after \n in multiline)
OP_NOT_WORD_BOUNDARY \W
OP_WORD_BOUNDARY \w
@@ -61,7 +63,6 @@ These items are all just one byte long
OP_EODN match end of data or \n at end: \Z
OP_EOD match end of data: \z
OP_DOLL $ (end of data, or before \n in multiline)
- OP_RECURSE match the pattern recursively
Repeating single characters
@@ -119,8 +120,7 @@ instances of OP_CHARS are used.
Character classes
-----------------
-When characters less than 256 are involved, OP_CLASS is used for a character
-class. If there is only one character, OP_CHARS is used for a positive class,
+If there is only one character, OP_CHARS is used for a positive class,
and OP_NOT for a negative one (that is, for something like [^a]). However, in
UTF-8 mode, this applies only to characters with values < 128, because OP_NOT
is confined to single bytes.
@@ -129,9 +129,15 @@ Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated,
negated, single-character class. The normal ones (OP_STAR etc.) are used for a
repeated positive single-character class.
-OP_CLASS is followed by a 32-byte bit map containing a 1 bit for every
-character that is acceptable. The bits are counted from the least significant
-end of each byte.
+When there's more than one character in a class and all the characters are less
+than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative
+one. In either case, the opcode is followed by a 32-byte bit map containing a 1
+bit for every character that is acceptable. The bits are counted from the least
+significant end of each byte.
+
+The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,
+subject characters with values greater than 256 can be handled correctly. For
+OP_CLASS they don't match, whereas for OP_NCLASS they do.
For classes containing characters with values > 255, OP_XCLASS is used. It
optionally uses a bit map (if any characters lie within it), followed by a list
@@ -243,6 +249,21 @@ same scheme is used, with a "reference number" of 0xffff. Otherwise, a
conditional subpattern always starts with one of the assertions.
+Recursion
+---------
+
+Recursion either matches the current regex, or some subexpression. The opcode
+OP_RECURSE is followed by an value which is the offset to the starting bracket
+from the start of the whole pattern.
+
+
+Callout
+-------
+
+OP_CALLOUT is followed by one byte of data that holds a callout number in the
+range 0 to 255.
+
+
Changing options
----------------
@@ -257,4 +278,4 @@ at compile time, and so does not cause anything to be put into the compiled
data.
Philip Hazel
-August 2002
+August 2003
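
Note on the character class hunk above: the 32-byte bit map, with bits counted from the least significant end of each byte, can be illustrated with the sketch below. It is not taken from pcre.c; the names CLASS_MAP_BYTES, class_map_clear, class_map_add and class_map_test are invented for this note, and the negated flag stands in for the OP_CLASS / OP_NCLASS distinction.

```c
#include <string.h>

#define CLASS_MAP_BYTES 32   /* 32 bytes * 8 bits = one bit per value 0..255 */

/* Start with an empty class: no character is acceptable. */
static void class_map_clear(unsigned char map[CLASS_MAP_BYTES])
{
    memset(map, 0, CLASS_MAP_BYTES);
}

/* Mark character c (which must be below 256) as acceptable. Bit 0 is the
   least significant bit of each byte. */
static void class_map_add(unsigned char map[CLASS_MAP_BYTES], unsigned int c)
{
    map[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Test a subject character against the map. For values that do not fit in
   the map, a positive class (negated == 0) fails and a negative class
   (negated != 0) succeeds, as the notes describe for UTF-8 mode. */
static int class_map_test(const unsigned char map[CLASS_MAP_BYTES],
                          unsigned int c, int negated)
{
    if (c > 255)
        return negated != 0;
    return (map[c / 8] & (1u << (c % 8))) != 0;
}
```

For a negative class the map itself already holds the acceptable characters, so the flag is consulted only for values above 255.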
diff --git a/doc/html/pcre.html b/doc/html/pcre.html
index fb319f3..bb0d354 100644
--- a/doc/html/pcre.html
+++ b/doc/html/pcre.html
@@ -125,9 +125,16 @@ to testing the PCRE_UTF8 flag in several places, so should not be very large.
The following comments apply when PCRE is running in UTF-8 mode:
</P>
<P>
-1. PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
-not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
-the results are undefined.
+1. When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects
+are checked for validity on entry to the relevant functions. If an invalid
+UTF-8 string is passed, an error return is given. In some situations, you may
+already know that your strings are valid, and therefore want to skip these
+checks in order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag
+at compile time or at run time, PCRE assumes that the pattern or subject it
+is given (respectively) contains only valid UTF-8 codes. In this case, it does
+not diagnose an invalid UTF-8 string. If you pass an invalid UTF-8 string to
+PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program
+may crash.
</P>
<P>
2. In a pattern, the escape sequence \x{...}, where the contents of the braces
@@ -178,6 +185,6 @@ Cambridge CB2 3QG, England.
Phone: +44 1223 334714
</P>
<P>
-Last updated: 04 February 2003
+Last updated: 20 August 2003
<br>
Copyright &copy; 1997-2003 University of Cambridge.
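
Note on the validity check described in the hunk above: the sketch below shows the kind of scan that a UTF-8 validity check performs before a string is trusted. It is not PCRE's own checker and may accept or reject different byte sequences than PCRE does; it follows the RFC 3629 rules (at most four bytes per character, no overlong forms, no surrogates), and the name utf8_first_invalid is hypothetical.

```c
#include <stddef.h>

/* Returns -1 if s[0..len) is well-formed UTF-8 under RFC 3629 rules,
   otherwise the offset of the lead byte of the first bad sequence. */
static long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp;      /* code point being assembled */

        if (c < 0x80) { i++; continue; }                 /* plain ASCII */
        else if (c >= 0xC2 && c <= 0xDF) { extra = 1; cp = c & 0x1Fu; }
        else if (c >= 0xE0 && c <= 0xEF) { extra = 2; cp = c & 0x0Fu; }
        else if (c >= 0xF0 && c <= 0xF4) { extra = 3; cp = c & 0x07u; }
        else return (long)i;   /* stray continuation byte, C0/C1, or above F4 */

        if (len - i - 1 < extra)
            return (long)i;    /* sequence is cut off by the end of the string */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i;                          /* not 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3Fu);
        }

        if (extra == 2 && cp < 0x800) return (long)i;    /* overlong form */
        if (extra == 3 && cp < 0x10000) return (long)i;  /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i; /* surrogate */
        if (cp > 0x10FFFF) return (long)i;               /* beyond Unicode */

        i += extra + 1;
    }
    return -1;
}
```

A scan like this touches every byte of the string once, which is the cost that PCRE_NO_UTF8_CHECK lets a caller avoid when the string is already known to be valid.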
diff --git a/doc/html/pcre_compile.html b/doc/html/pcre_compile.html
index 0b21683..e1a4379 100644
--- a/doc/html/pcre_compile.html
+++ b/doc/html/pcre_compile.html
@@ -52,10 +52,14 @@ The option bits are:
theses (named ones available)
PCRE_UNGREEDY Invert greediness of quantifiers
PCRE_UTF8 Run in UTF-8 mode
+ PCRE_NO_UTF8_CHECK Do not check the pattern for UTF-8
+ validity (only relevant if
+ PCRE_UTF8 is set)
</PRE>
</P>
<P>
-PCRE must have been compiled with UTF-8 support when PCRE_UTF8 is used.
+PCRE must be compiled with UTF-8 support in order to use PCRE_UTF8
+(or PCRE_NO_UTF8_CHECK).
</P>
<P>
The yield of the function is a pointer to a private data structure that
diff --git a/doc/html/pcre_exec.html b/doc/html/pcre_exec.html
index 915bc73..cf86dfd 100644
--- a/doc/html/pcre_exec.html
+++ b/doc/html/pcre_exec.html
@@ -47,6 +47,9 @@ The options are:
PCRE_NOTBOL Subject is not the beginning of a line
PCRE_NOTEOL Subject is not the end of a line
PCRE_NOTEMPTY An empty string is not a valid match
+ PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
+ validity (only relevant if PCRE_UTF8
+ was set at compile time)
</PRE>
</P>
<P>
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html
index 9a479e8..54f8f24 100644
--- a/doc/html/pcreapi.html
+++ b/doc/html/pcreapi.html
@@ -442,6 +442,21 @@ in the main
<a href="pcre.html"><b>pcre</b></a>
page.
</P>
+<P>
+<pre>
+ PCRE_NO_UTF8_CHECK
+</PRE>
+</P>
+<P>
+When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
+automatically checked. If an invalid UTF-8 sequence of bytes is found,
+<b>pcre_compile()</b> returns an error. If you already know that your pattern is
+valid, and you want to skip this check for performance reasons, you can set the
+PCRE_NO_UTF8_CHECK option. When it is set, the effect of passing an invalid
+UTF-8 string as a pattern is undefined. It may cause your program to crash.
+Note that there is a similar option for suppressing the checking of subject
+strings passed to <b>pcre_exec()</b>.
+</P>
<br><a name="SEC6" href="#TOC1">STUDYING A PATTERN</a><br>
<P>
<b>pcre_extra *pcre_study(const pcre *<i>code</i>, int <i>options</i>,</b>
@@ -862,6 +877,15 @@ or turned out to be anchored by virtue of its contents, it cannot be made
unachored at matching time.
</P>
<P>
+When PCRE_UTF8 was set at compile time, the validity of the subject as a UTF-8
+string is automatically checked. If an invalid UTF-8 sequence of bytes is
+found, <b>pcre_exec()</b> returns the error PCRE_ERROR_BADUTF8. If you already
+know that your subject is valid, and you want to skip this check for
+performance reasons, you can set the PCRE_NO_UTF8_CHECK option when calling
+<b>pcre_exec()</b>. When this option is set, the effect of passing an invalid
+UTF-8 string as a subject is undefined. It may cause your program to crash.
+</P>
+<P>
There are also three further options that can be set only at matching time:
</P>
<P>
@@ -1106,6 +1130,14 @@ This error is never generated by <b>pcre_exec()</b> itself. It is provided for
use by callout functions that want to yield a distinctive error code. See the
<b>pcrecallout</b> documentation for details.
</P>
+<P>
+<pre>
+ PCRE_ERROR_BADUTF8 (-10)
+</PRE>
+</P>
+<P>
+A string that contains an invalid UTF-8 byte sequence was passed as a subject.
+</P>
<br><a name="SEC11" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
<P>
<b>int pcre_copy_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
@@ -1257,6 +1289,6 @@ then call <i>pcre_copy_substring()</i> or <i>pcre_get_substring()</i>, as
appropriate.
</P>
<P>
-Last updated: 03 February 2003
+Last updated: 20 August 2003
<br>
Copyright &copy; 1997-2003 University of Cambridge.
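
Note on PCRE_ERROR_BADUTF8: a caller that leaves the UTF-8 check enabled has one more negative return value to distinguish from an ordinary failure to match. The sketch below shows one way to classify a pcre_exec() result. To stay compilable without pcre.h it defines its own constants: SKETCH_ERROR_BADUTF8 copies the value (-10) given in the hunk above, while SKETCH_ERROR_NOMATCH (-1) is an assumption of this note and should be checked against pcre.h. The enum and function names are hypothetical.

```c
/* Local mirrors of the library's error codes, so the sketch needs no pcre.h.
   Real code would use PCRE_ERROR_NOMATCH and PCRE_ERROR_BADUTF8 directly. */
#define SKETCH_ERROR_NOMATCH  (-1)    /* assumed value, not shown in this diff */
#define SKETCH_ERROR_BADUTF8  (-10)   /* value documented in the hunk above */

enum exec_outcome {
    OUTCOME_MATCH,        /* non-negative return value */
    OUTCOME_NO_MATCH,     /* subject did not match the pattern */
    OUTCOME_BAD_UTF8,     /* subject failed the UTF-8 validity check */
    OUTCOME_OTHER_ERROR   /* any other negative return value */
};

/* Classify the integer returned by a matching call. */
static enum exec_outcome classify_exec_result(int rc)
{
    if (rc >= 0)
        return OUTCOME_MATCH;
    switch (rc) {
    case SKETCH_ERROR_NOMATCH: return OUTCOME_NO_MATCH;
    case SKETCH_ERROR_BADUTF8: return OUTCOME_BAD_UTF8;
    default:                   return OUTCOME_OTHER_ERROR;
    }
}
```

Treating a bad subject separately from "no match" matters because the former usually points at a data problem upstream rather than at the pattern.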
diff --git a/doc/html/pcrecallout.html b/doc/html/pcrecallout.html
index 5516c99..f4b7104 100644
--- a/doc/html/pcrecallout.html
+++ b/doc/html/pcrecallout.html
@@ -81,8 +81,9 @@ The <i>current_position</i> field contains the offset within the subject of the
current match pointer.
</P>
<P>
-The <i>capture_top</i> field contains the number of the highest captured
-substring so far.
+The <i>capture_top</i> field contains one more than the number of the highest
+numbered captured substring so far. If no substrings have been captured,
+the value of <i>capture_top</i> is one.
</P>
<P>
The <i>capture_last</i> field contains the number of the most recently captured
diff --git a/doc/html/pcretest.html b/doc/html/pcretest.html
index 25b03d3..329fb79 100644
--- a/doc/html/pcretest.html
+++ b/doc/html/pcretest.html
@@ -149,9 +149,10 @@ respectively. For example:
</P>
<P>
These modifier letters have the same effect as they do in Perl. There are
-others which set PCRE options that do not correspond to anything in Perl:
-<b>/A</b>, <b>/E</b>, and <b>/X</b> set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and
-PCRE_EXTRA respectively.
+others that set PCRE options that do not correspond to anything in Perl:
+<b>/A</b>, <b>/E</b>, <b>/N</b>, <b>/U</b>, and <b>/X</b> set PCRE_ANCHORED,
+PCRE_DOLLAR_ENDONLY, PCRE_NO_AUTO_CAPTURE, PCRE_UNGREEDY, and PCRE_EXTRA
+respectively.
</P>
<P>
Searching for all possible matches within each subject string can be requested
@@ -233,6 +234,11 @@ provided that it was compiled with this support enabled. This modifier also
causes any non-printing characters in output strings to be printed using the
\x{hh...} notation if they are valid UTF-8 sequences.
</P>
+<P>
+If the <b>/?</b> modifier is used with <b>/8</b>, it causes <b>pcretest</b> to
+call <b>pcre_compile()</b> with the PCRE_NO_UTF8_CHECK option, to suppress the
+checking of the string for UTF-8 validity.
+</P>
<br><a name="SEC5" href="#TOC1">CALLOUTS</a><br>
<P>
If the pattern contains any callout requests, <b>pcretest</b>'s callout function
@@ -318,6 +324,8 @@ recognized:
<b>pcre_exec()</b> to dd (any number of decimal
digits)
\Z pass the PCRE_NOTEOL option to <b>pcre_exec()</b>
+ \? pass the PCRE_NO_UTF8_CHECK option to
+ <b>pcre_exec()</b>
</PRE>
</P>
<P>
@@ -429,6 +437,6 @@ University Computing Service,
Cambridge CB2 3QG, England.
</P>
<P>
-Last updated: 03 February 2003
+Last updated: 20 August 2003
<br>
Copyright &copy; 1997-2003 University of Cambridge.
diff --git a/doc/pcre.3 b/doc/pcre.3
index 7fd9851..c0c7141 100644
--- a/doc/pcre.3
+++ b/doc/pcre.3
@@ -116,9 +116,16 @@ to testing the PCRE_UTF8 flag in several places, so should not be very large.
The following comments apply when PCRE is running in UTF-8 mode:
-1. PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
-not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
-the results are undefined.
+1. When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects
+are checked for validity on entry to the relevant functions. If an invalid
+UTF-8 string is passed, an error return is given. In some situations, you may
+already know that your strings are valid, and therefore want to skip these
+checks in order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag
+at compile time or at run time, PCRE assumes that the pattern or subject it
+is given (respectively) contains only valid UTF-8 codes. In this case, it does
+not diagnose an invalid UTF-8 string. If you pass an invalid UTF-8 string to
+PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program
+may crash.
2. In a pattern, the escape sequence \\x{...}, where the contents of the braces
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
@@ -162,6 +169,6 @@ Cambridge CB2 3QG, England.
Phone: +44 1223 334714
.in 0
-Last updated: 04 February 2003
+Last updated: 20 August 2003
.br
Copyright (c) 1997-2003 University of Cambridge.
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 1ec5f2c..ad6f3b2 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -118,10 +118,19 @@ UTF-8 SUPPORT
The following comments apply when PCRE is running in UTF-8
mode:
- 1. PCRE assumes that the strings it is given contain valid
- UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
- you pass invalid UTF-8 strings to PCRE, the results are
- undefined.
+ 1. When you set the PCRE_UTF8 flag, the strings passed as
+ patterns and subjects are checked for validity on entry to
+ the relevant functions. If an invalid UTF-8 string is
+ passed, an error return is given. In some situations, you
+ may already know that your strings are valid, and therefore
+ want to skip these checks in order to improve performance.
+ If you set the PCRE_NO_UTF8_CHECK flag at compile time or at
+ run time, PCRE assumes that the pattern or subject it is
+ given (respectively) contains only valid UTF-8 codes. In
+ this case, it does not diagnose an invalid UTF-8 string. If
+ you pass an invalid UTF-8 string to PCRE when
+ PCRE_NO_UTF8_CHECK is set, the results are undefined. Your
+ program may crash.
2. In a pattern, the escape sequence \x{...}, where the con-
tents of the braces is a string of hexadecimal digits, is
@@ -164,7 +173,7 @@ AUTHOR
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
-Last updated: 04 February 2003
+Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------
@@ -654,6 +663,20 @@ COMPILING A PATTERN
option changes the behaviour of PCRE are given in the sec-
tion on UTF-8 support in the main pcre page.
+ PCRE_NO_UTF8_CHECK
+
+ When PCRE_UTF8 is set, the validity of the pattern as a
+ UTF-8 string is automatically checked. If an invalid UTF-8
+ sequence of bytes is found, pcre_compile() returns an error.
+ If you already know that your pattern is valid, and you want
+ to skip this check for performance reasons, you can set the
+ PCRE_NO_UTF8_CHECK option. When it is set, the effect of
+ passing an invalid UTF-8 string as a pattern is undefined.
+ It may cause your program to crash. Note that there is a
+ similar option for suppressing the checking of subject
+ strings passed to pcre_exec().
+
+
STUDYING A PATTERN
@@ -747,7 +770,6 @@ INFORMATION ABOUT A PATTERN
compiled pattern. It replaces the obsolete pcre_info() func-
tion, which is nevertheless retained for backwards compabil-
ity (and is documented below).
-
The first argument for pcre_fullinfo() is a pointer to the
compiled pattern. The second argument is the result of
pcre_study(), or NULL if the pattern was not studied. The
@@ -1014,6 +1036,16 @@ MATCHING A PATTERN
turned out to be anchored by virtue of its contents, it can-
not be made unachored at matching time.
+ When PCRE_UTF8 was set at compile time, the validity of the
+ subject as a UTF-8 string is automatically checked. If an
+ invalid UTF-8 sequence of bytes is found, pcre_exec()
+ returns the error PCRE_ERROR_BADUTF8. If you already know
+ that your subject is valid, and you want to skip this check
+ for performance reasons, you can set the PCRE_NO_UTF8_CHECK
+ option when calling pcre_exec(). When this option is set,
+ the effect of passing an invalid UTF-8 string as a subject
+ is undefined. It may cause your program to crash.
+
There are also three further options that can be set only at
matching time:
@@ -1103,7 +1135,6 @@ MATCHING A PATTERN
used for a fragment of a pattern that picks out a substring.
PCRE supports several other kinds of parenthesized subpat-
tern that do not cause substrings to be captured.
-
Captured substrings are returned to the caller via a vector
of integer offsets whose address is passed in ovector. The
number of elements in the vector is passed in ovecsize. The
@@ -1219,6 +1250,11 @@ MATCHING A PATTERN
distinctive error code. See the pcrecallout documentation
for details.
+ PCRE_ERROR_BADUTF8 (-10)
+
+ A string that contains an invalid UTF-8 byte sequence was
+ passed as a subject.
+
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
@@ -1255,7 +1291,6 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
returned zero, indicating that it ran out of space in ovec-
tor, the value passed as stringcount should be the size of
the vector divided by three.
-
The functions pcre_copy_substring() and pcre_get_substring()
extract a single substring, whose number is given as string-
number. A value of zero extracts the substring that matched
@@ -1352,7 +1387,7 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
succeeds, they then call pcre_copy_substring() or
pcre_get_substring(), as appropriate.
-Last updated: 03 February 2003
+Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------
@@ -1420,8 +1455,9 @@ PCRE CALLOUTS
The current_position field contains the offset within the
subject of the current match pointer.
- The capture_top field contains the number of the highest
- captured substring so far.
+ The capture_top field contains one more than the number of
+ the highest numbered captured substring so far. If no sub-
+ strings have been captured, the value of capture_top is one.
The capture_last field contains the number of the most
recently captured substring.
diff --git a/doc/pcre_compile.3 b/doc/pcre_compile.3
index f911623..a827315 100644
--- a/doc/pcre_compile.3
+++ b/doc/pcre_compile.3
@@ -42,8 +42,12 @@ The option bits are:
theses (named ones available)
PCRE_UNGREEDY Invert greediness of quantifiers
PCRE_UTF8 Run in UTF-8 mode
+ PCRE_NO_UTF8_CHECK Do not check the pattern for UTF-8
+ validity (only relevant if
+ PCRE_UTF8 is set)
-PCRE must have been compiled with UTF-8 support when PCRE_UTF8 is used.
+PCRE must be compiled with UTF-8 support in order to use PCRE_UTF8
+(or PCRE_NO_UTF8_CHECK).
The yield of the function is a pointer to a private data structure that
contains the compiled pattern, or NULL if an error was detected.
diff --git a/doc/pcre_exec.3 b/doc/pcre_exec.3
index f61c2a4..0d6c380 100644
--- a/doc/pcre_exec.3
+++ b/doc/pcre_exec.3
@@ -37,6 +37,9 @@ The options are:
PCRE_NOTBOL Subject is not the beginning of a line
PCRE_NOTEOL Subject is not the end of a line
PCRE_NOTEMPTY An empty string is not a valid match
+ PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
+ validity (only relevant if PCRE_UTF8
+ was set at compile time)
There is a complete description of the PCRE API in the
.\" HREF
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index fbd3d5d..0149f50 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -371,6 +371,18 @@ in the main
.\"
page.
+ PCRE_NO_UTF8_CHECK
+
+When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
+automatically checked. If an invalid UTF-8 sequence of bytes is found,
+\fBpcre_compile()\fR returns an error. If you already know that your pattern is
+valid, and you want to skip this check for performance reasons, you can set the
+PCRE_NO_UTF8_CHECK option. When it is set, the effect of passing an invalid
+UTF-8 string as a pattern is undefined. It may cause your program to crash.
+Note that there is a similar option for suppressing the checking of subject
+strings passed to \fBpcre_exec()\fR.
+
+
.SH STUDYING A PATTERN
.rs
.sp
@@ -698,6 +710,14 @@ first matching position. However, if a pattern was compiled with PCRE_ANCHORED,
or turned out to be anchored by virtue of its contents, it cannot be made
unachored at matching time.
+When PCRE_UTF8 was set at compile time, the validity of the subject as a UTF-8
+string is automatically checked. If an invalid UTF-8 sequence of bytes is
+found, \fBpcre_exec()\fR returns the error PCRE_ERROR_BADUTF8. If you already
+know that your subject is valid, and you want to skip this check for
+performance reasons, you can set the PCRE_NO_UTF8_CHECK option when calling
+\fBpcre_exec()\fR. When this option is set, the effect of passing an invalid
+UTF-8 string as a subject is undefined. It may cause your program to crash.
+
There are also three further options that can be set only at matching time:
PCRE_NOTBOL
@@ -872,6 +892,10 @@ This error is never generated by \fBpcre_exec()\fR itself. It is provided for
use by callout functions that want to yield a distinctive error code. See the
\fBpcrecallout\fR documentation for details.
+ PCRE_ERROR_BADUTF8 (-10)
+
+A string that contains an invalid UTF-8 byte sequence was passed as a subject.
+
.SH EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
.rs
.sp
@@ -1011,6 +1035,6 @@ then call \fIpcre_copy_substring()\fR or \fIpcre_get_substring()\fR, as
appropriate.
.in 0
-Last updated: 03 February 2003
+Last updated: 20 August 2003
.br
Copyright (c) 1997-2003 University of Cambridge.
diff --git a/doc/pcrecallout.3 b/doc/pcrecallout.3
index f54d0dd..bfbb66b 100644
--- a/doc/pcrecallout.3
+++ b/doc/pcrecallout.3
@@ -57,8 +57,9 @@ function may be called several times for different starting points.
The \fIcurrent_position\fR field contains the offset within the subject of the
current match pointer.
-The \fIcapture_top\fR field contains the number of the highest captured
-substring so far.
+The \fIcapture_top\fR field contains one more than the number of the highest
+numbered captured substring so far. If no substrings have been captured,
+the value of \fIcapture_top\fR is one.
The \fIcapture_last\fR field contains the number of the most recently captured
substring.
diff --git a/doc/pcretest.1 b/doc/pcretest.1
index 76daaf3..2c4fb42 100644
--- a/doc/pcretest.1
+++ b/doc/pcretest.1
@@ -111,9 +111,10 @@ respectively. For example:
/caseless/i
These modifier letters have the same effect as they do in Perl. There are
-others which set PCRE options that do not correspond to anything in Perl:
-\fB/A\fR, \fB/E\fR, and \fB/X\fR set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and
-PCRE_EXTRA respectively.
+others that set PCRE options that do not correspond to anything in Perl:
+\fB/A\fR, \fB/E\fR, \fB/N\fR, \fB/U\fR, and \fB/X\fR set PCRE_ANCHORED,
+PCRE_DOLLAR_ENDONLY, PCRE_NO_AUTO_CAPTURE, PCRE_UNGREEDY, and PCRE_EXTRA
+respectively.
Searching for all possible matches within each subject string can be requested
by the \fB/g\fR or \fB/G\fR modifier. After finding a match, PCRE is called
@@ -180,6 +181,10 @@ provided that it was compiled with this support enabled. This modifier also
causes any non-printing characters in output strings to be printed using the
\\x{hh...} notation if they are valid UTF-8 sequences.
+If the \fB/?\fR modifier is used with \fB/8\fR, it causes \fBpcretest\fR to
+call \fBpcre_compile()\fR with the PCRE_NO_UTF8_CHECK option, to suppress the
+checking of the string for UTF-8 validity.
+
.SH CALLOUTS
.rs
.sp
@@ -261,6 +266,8 @@ recognized:
\fBpcre_exec()\fR to dd (any number of decimal
digits)
\\Z pass the PCRE_NOTEOL option to \fBpcre_exec()\fR
+ \\? pass the PCRE_NO_UTF8_CHECK option to
+ \fBpcre_exec()\fR
If \\M is present, \fBpcretest\fR calls \fBpcre_exec()\fR several times, with
different values in the \fImatch_limit\fR field of the \fBpcre_extra\fR data
@@ -351,6 +358,6 @@ University Computing Service,
Cambridge CB2 3QG, England.
.in 0
-Last updated: 03 February 2003
+Last updated: 20 August 2003
.br
Copyright (c) 1997-2003 University of Cambridge.
diff --git a/doc/pcretest.txt b/doc/pcretest.txt
index 80585af..4fa9ca4 100644
--- a/doc/pcretest.txt
+++ b/doc/pcretest.txt
@@ -119,10 +119,10 @@ PATTERN MODIFIERS
/caseless/i
These modifier letters have the same effect as they do in
- Perl. There are others which set PCRE options that do not
- correspond to anything in Perl: /A, /E, and /X set
- PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respec-
- tively.
+ Perl. There are others that set PCRE options that do not
+ correspond to anything in Perl: /A, /E, /N, /U, and /X set
+ PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, PCRE_NO_AUTO_CAPTURE,
+ PCRE_UNGREEDY, and PCRE_EXTRA respectively.
Searching for all possible matches within each subject
string can be requested by the /g or /G modifier. After
@@ -199,6 +199,10 @@ PATTERN MODIFIERS
printing characters in output strings to be printed using
the \x{hh...} notation if they are valid UTF-8 sequences.
+ If the /? modifier is used with /8, it causes pcretest to
+ call pcre_compile() with the PCRE_NO_UTF8_CHECK option, to
+ suppress the checking of the string for UTF-8 validity.
+
CALLOUTS
@@ -255,12 +259,12 @@ DATA LINES
after a successful match (any decimal number
less than 32)
\Cname call pcre_copy_named_substring() for substring
+
"name" after a successful match (name termin-
ated by next non alphanumeric character)
\C+ show the current captured substrings at callout
time
-
- C- do not supply a callout function
+ \C- do not supply a callout function
\C!n return 1 instead of 0 when callout number n is
reached
\C!n!m return 1 instead of 0 when callout number n is
@@ -281,6 +285,8 @@ DATA LINES
pcre_exec() to dd (any number of decimal
digits)
\Z pass the PCRE_NOTEOL option to pcre_exec()
+ \? pass the PCRE_NO_UTF8_CHECK option to
+ pcre_exec()
If \M is present, pcretest calls pcre_exec() several times,
with different values in the match_limit field of the
@@ -306,7 +312,6 @@ DATA LINES
API to be used, only B, and Z have any effect, causing
REG_NOTBOL and REG_NOTEOL to be passed to regexec() respec-
tively.
-
The use of \x{hh...} to represent UTF-8 characters is not
dependent on the use of the /8 modifier on the pattern. It
is recognized always. There may be any number of hexadecimal
@@ -378,5 +383,5 @@ AUTHOR
University Computing Service,
Cambridge CB2 3QG, England.
-Last updated: 03 February 2003
+Last updated: 20 August 2003
Copyright (c) 1997-2003 University of Cambridge.
diff --git a/internal.h b/internal.h
index 973e7ee..92454a7 100644
--- a/internal.h
+++ b/internal.h
@@ -198,10 +198,10 @@ time, run time or study time, respectively. */
#define PUBLIC_OPTIONS \
(PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
- PCRE_NO_AUTO_CAPTURE)
+ PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK)
#define PUBLIC_EXEC_OPTIONS \
- (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY)
+ (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NO_UTF8_CHECK)
#define PUBLIC_STUDY_OPTIONS 0 /* None defined */
@@ -526,6 +526,7 @@ just to accommodate the POSIX wrapper. */
#define ERR41 "unrecognized character after (?P"
#define ERR42 "syntax error after (?P"
#define ERR43 "two named groups have the same name"
+#define ERR44 "invalid UTF-8 string"
/* All character handling must be done as unsigned characters. Otherwise there
are problems with top-bit-set characters and functions such as isspace().
diff --git a/pcre.c b/pcre.c
index 5da0f76..455782c 100644
--- a/pcre.c
+++ b/pcre.c
@@ -241,10 +241,16 @@ changed by the caller, but are shared between all threads. However, when
compiling for Virtual Pascal, things are done differently (see pcre.in). */
#ifndef VPCOMPAT
+#ifdef __cplusplus
+extern "C" void *(*pcre_malloc)(size_t) = malloc;
+extern "C" void (*pcre_free)(void *) = free;
+extern "C" int (*pcre_callout)(pcre_callout_block *) = NULL;
+#else
void *(*pcre_malloc)(size_t) = malloc;
void (*pcre_free)(void *) = free;
int (*pcre_callout)(pcre_callout_block *) = NULL;
#endif
+#endif
/*************************************************
@@ -511,7 +517,7 @@ if (re == NULL || where == NULL) return PCRE_ERROR_NULL;
if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC;
if (extra_data != NULL && (extra_data->flags & PCRE_EXTRA_STUDY_DATA) != 0)
- study = extra_data->study_data;
+ study = (const pcre_study_data *)extra_data->study_data;
switch (what)
{
@@ -592,11 +598,11 @@ pcre_config(int what, void *where)
switch (what)
{
case PCRE_CONFIG_UTF8:
- #ifdef SUPPORT_UTF8
+#ifdef SUPPORT_UTF8
*((int *)where) = 1;
- #else
+#else
*((int *)where) = 0;
- #endif
+#endif
break;
case PCRE_CONFIG_NEWLINE:
@@ -669,7 +675,6 @@ Arguments:
bracount number of previous extracting brackets
options the options bits
isclass TRUE if inside a character class
- cd pointer to char tables block
Returns: zero or positive => a data character
negative => a special escape sequence
@@ -678,7 +683,7 @@ Returns: zero or positive => a data character
static int
check_escape(const uschar **ptrptr, const char **errorptr, int bracount,
- int options, BOOL isclass, compile_data *cd)
+ int options, BOOL isclass)
{
const uschar *ptr = *ptrptr;
int c, i;
@@ -801,7 +806,8 @@ else
c = 0;
while (i++ < 2 && (digitab[ptr[1]] & ctype_xdigit) != 0)
{
- int cc = *(++ptr);
+ int cc; /* Some compilers don't like ++ */
+ cc = *(++ptr); /* in initializers */
if (cc >= 'a') cc -= 32; /* Convert to upper case */
c = c * 16 + cc - ((cc < 'A')? '0' : ('A' - 10));
}
@@ -858,13 +864,12 @@ where the ddds are digits.
Arguments:
p pointer to the first char after '{'
- cd pointer to char tables block
Returns: TRUE or FALSE
*/
static BOOL
-is_counted_repeat(const uschar *p, compile_data *cd)
+is_counted_repeat(const uschar *p)
{
if ((digitab[*p++] & ctype_digit) == 0) return FALSE;
while ((digitab[*p] & ctype_digit) != 0) p++;
@@ -895,15 +900,13 @@ Arguments:
maxp pointer to int for max
returned as -1 if no max
errorptr points to pointer to error message
- cd pointer to character tables clock
Returns: pointer to '}' on success;
current ptr on error, with errorptr set
*/
static const uschar *
-read_repeat_counts(const uschar *p, int *minp, int *maxp,
- const char **errorptr, compile_data *cd)
+read_repeat_counts(const uschar *p, int *minp, int *maxp, const char **errorptr)
{
int min = 0;
int max = -1;
@@ -1793,7 +1796,7 @@ for (;; ptr++)
if (c == '\\')
{
- c = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd);
+ c = check_escape(&ptr, errorptr, *brackets, options, TRUE);
if (-c == ESC_b) c = '\b'; /* \b is backspace in a class */
if (-c == ESC_Q) /* Handle start of quoted string */
@@ -1882,7 +1885,7 @@ for (;; ptr++)
if (d == '\\')
{
const uschar *oldptr = ptr;
- d = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd);
+ d = check_escape(&ptr, errorptr, *brackets, options, TRUE);
/* \b is backspace; any other special means the '-' was literal */
@@ -2091,8 +2094,8 @@ for (;; ptr++)
/* Various kinds of repeat */
case '{':
- if (!is_counted_repeat(ptr+1, cd)) goto NORMAL_CHAR;
- ptr = read_repeat_counts(ptr+1, &repeat_min, &repeat_max, errorptr, cd);
+ if (!is_counted_repeat(ptr+1)) goto NORMAL_CHAR;
+ ptr = read_repeat_counts(ptr+1, &repeat_min, &repeat_max, errorptr);
if (*errorptr != NULL) goto FAILED;
goto REPEAT;
@@ -3039,7 +3042,7 @@ for (;; ptr++)
case '\\':
tempptr = ptr;
- c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd);
+ c = check_escape(&ptr, errorptr, *brackets, options, FALSE);
/* Handle metacharacters introduced by \. For ones like \d, the ESC_ values
are arranged to be the negation of the corresponding OP_values. For the
@@ -3142,7 +3145,7 @@ for (;; ptr++)
if (c == '\\')
{
tempptr = ptr;
- c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd);
+ c = check_escape(&ptr, errorptr, *brackets, options, FALSE);
if (c < 0) { ptr = tempptr; break; }
/* If a character is > 127 in UTF-8 mode, we have to turn it into
@@ -3727,6 +3730,56 @@ return c;
+#ifdef SUPPORT_UTF8
+/*************************************************
+* Validate a UTF-8 string *
+*************************************************/
+
+/* This function is called (optionally) at the start of compile or match, to
+validate that a supposed UTF-8 string is actually valid. The early check means
+that subsequent code can assume it is dealing with a valid string. The check
+can be turned off for maximum performance, but the consequences of supplying
+an invalid string are then undefined.
+
+Arguments:
+ string points to the string
+ length length of string, or -1 if the string is zero-terminated
+
+Returns: < 0 if the string is a valid UTF-8 string
+ >= 0 otherwise; the value is the offset of the bad byte
+*/
+
+static int
+valid_utf8(const uschar *string, int length)
+{
+register const uschar *p;
+
+if (length < 0)
+ {
+ for (p = string; *p != 0; p++);
+ length = p - string;
+ }
+
+for (p = string; length-- > 0; p++)
+ {
+ int ab;
+ if (*p < 128) continue;
+ if ((*p & 0xc0) != 0xc0) return p - string;
+ ab = utf8_table4[*p & 0x3f]; /* Number of additional bytes */
+ if (length < ab) return p - string;
+ while (ab-- > 0)
+ {
+ if ((*(++p) & 0xc0) != 0x80) return p - string;
+ length--;
+ }
+ }
+
+return -1;
+}
+#endif
+
+
+
/*************************************************
* Compile a Regular Expression *
*************************************************/
@@ -3793,6 +3846,12 @@ if (erroroffset == NULL)
#ifdef SUPPORT_UTF8
utf8 = (options & PCRE_UTF8) != 0;
+if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0 &&
+ (*erroroffset = valid_utf8((uschar *)pattern, -1)) >= 0)
+ {
+ *errorptr = ERR44;
+ return NULL;
+ }
#else
if ((options & PCRE_UTF8) != 0)
{
@@ -3874,7 +3933,7 @@ while ((c = *(++ptr)) != 0)
case '\\':
{
const uschar *save_ptr = ptr;
- c = check_escape(&ptr, errorptr, bracount, options, FALSE, &compile_block);
+ c = check_escape(&ptr, errorptr, bracount, options, FALSE);
if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
if (c >= 0)
{
@@ -3910,9 +3969,9 @@ while ((c = *(++ptr)) != 0)
if (refnum > compile_block.top_backref)
compile_block.top_backref = refnum;
length += 2; /* For single back reference */
- if (ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block))
+ if (ptr[1] == '{' && is_counted_repeat(ptr+2))
{
- ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block);
+ ptr = read_repeat_counts(ptr+2, &min, &max, errorptr);
if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
if ((min == 0 && (max == 1 || max == -1)) ||
(min == 1 && max == -1))
@@ -3942,8 +4001,8 @@ while ((c = *(++ptr)) != 0)
class, or back reference. */
case '{':
- if (!is_counted_repeat(ptr+1, &compile_block)) goto NORMAL_CHAR;
- ptr = read_repeat_counts(ptr+1, &min, &max, errorptr, &compile_block);
+ if (!is_counted_repeat(ptr+1)) goto NORMAL_CHAR;
+ ptr = read_repeat_counts(ptr+1, &min, &max, errorptr);
if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
/* These special cases just insert one extra opcode */
@@ -4039,8 +4098,7 @@ while ((c = *(++ptr)) != 0)
#ifdef SUPPORT_UTF8
int prevchar = ptr[-1];
#endif
- int ch = check_escape(&ptr, errorptr, bracount, options, TRUE,
- &compile_block);
+ int ch = check_escape(&ptr, errorptr, bracount, options, TRUE);
if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
/* \b is backspace inside a class */
@@ -4151,9 +4209,9 @@ while ((c = *(++ptr)) != 0)
/* A repeat needs either 1 or 5 bytes. */
- if (*ptr != 0 && ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block))
+ if (*ptr != 0 && ptr[1] == '{' && is_counted_repeat(ptr+2))
{
- ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block);
+ ptr = read_repeat_counts(ptr+2, &min, &max, errorptr);
if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
if ((min == 0 && (max == 1 || max == -1)) ||
(min == 1 && max == -1))
@@ -4505,9 +4563,9 @@ while ((c = *(++ptr)) != 0)
/* Leave ptr at the final char; for read_repeat_counts this happens
automatically; for the others we need an increment. */
- if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2, &compile_block))
+ if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2))
{
- ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block);
+ ptr = read_repeat_counts(ptr+2, &min, &max, errorptr);
if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
}
else if (c == '*') { min = 0; max = -1; ptr++; }
@@ -4596,8 +4654,7 @@ while ((c = *(++ptr)) != 0)
if (c == '\\')
{
const uschar *saveptr = ptr;
- c = check_escape(&ptr, errorptr, bracount, options, FALSE,
- &compile_block);
+ c = check_escape(&ptr, errorptr, bracount, options, FALSE);
if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
if (c < 0) { ptr = saveptr; break; }
@@ -7307,7 +7364,7 @@ if (extra_data != NULL)
{
register unsigned int flags = extra_data->flags;
if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
- study = extra_data->study_data;
+ study = (const pcre_study_data *)extra_data->study_data;
if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0)
match_block.match_limit = extra_data->match_limit;
if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
@@ -7340,6 +7397,15 @@ match_block.recursive = NULL; /* No recursion at top level */
match_block.lcc = re->tables + lcc_offset;
match_block.ctypes = re->tables + ctypes_offset;
+/* Check a UTF-8 string if required. Unfortunately there's no way of passing
+back the character offset. */
+
+#ifdef SUPPORT_UTF8
+if (match_block.utf8 && (options & PCRE_NO_UTF8_CHECK) == 0 &&
+ valid_utf8((uschar *)subject, length) >= 0)
+ return PCRE_ERROR_BADUTF8;
+#endif
+
/* The ims options can vary during the matching as a result of the presence
of (?ims) items in the pattern. They are kept in a local variable so that
restoring at the exit of a group is easy. */
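
The validity check that this file now runs at the start of compiling and matching can be tried out in isolation. What follows is a minimal standalone sketch, not the PCRE source: the names sketch_valid_utf8 and extra_bytes are hypothetical, and extra_bytes derives the number of trailing bytes from the lead byte directly instead of consulting PCRE's internal utf8_table4. Like the patch, it checks only the lead-byte/continuation-byte structure and makes no attempt to reject overlong encodings.

```c
#include <string.h>

/* Hypothetical helper (not in PCRE): number of continuation bytes implied
   by a lead byte in the range 0xc0-0xff, using the original 1-6 byte
   UTF-8 scheme. */
static int extra_bytes(unsigned char lead)
{
    if (lead < 0xe0) return 1;
    if (lead < 0xf0) return 2;
    if (lead < 0xf8) return 3;
    if (lead < 0xfc) return 4;
    return 5;
}

/* Returns -1 if the string is structurally valid UTF-8, otherwise the
   offset of the offending byte. A negative length means the string is
   zero-terminated. */
static int sketch_valid_utf8(const unsigned char *s, int length)
{
    const unsigned char *p;

    if (length < 0) length = (int)strlen((const char *)s);

    for (p = s; length-- > 0; p++) {
        int ab;
        if (*p < 128) continue;                        /* plain ASCII byte */
        if ((*p & 0xc0) != 0xc0) return (int)(p - s);  /* stray continuation */
        ab = extra_bytes(*p);
        if (length < ab) return (int)(p - s);          /* truncated sequence */
        while (ab-- > 0) {
            if ((*(++p) & 0xc0) != 0x80) return (int)(p - s);
            length--;
        }
    }
    return -1;
}
```

For the three invalid patterns added to testdata/testinput5, this sketch reports offsets 2, 0 and 1, which agree with the "invalid UTF-8 string at offset" lines in testdata/testoutput5.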
diff --git a/pcre.in b/pcre.in
index 2aa44b9..7b5b209 100644
--- a/pcre.in
+++ b/pcre.in
@@ -23,7 +23,7 @@ make changes to pcre.in. */
# endif
# else
# ifndef PCRE_STATIC
-# define PCRE_DATA_SCOPE __declspec(dllimport)
+# define PCRE_DATA_SCOPE extern __declspec(dllimport)
# endif
# endif
#endif
@@ -57,6 +57,7 @@ extern "C" {
#define PCRE_NOTEMPTY 0x0400
#define PCRE_UTF8 0x0800
#define PCRE_NO_AUTO_CAPTURE 0x1000
+#define PCRE_NO_UTF8_CHECK 0x2000
/* Exec-time and get/set-time error codes */
@@ -69,6 +70,7 @@ extern "C" {
#define PCRE_ERROR_NOSUBSTRING (-7)
#define PCRE_ERROR_MATCHLIMIT (-8)
#define PCRE_ERROR_CALLOUT (-9) /* Never used by PCRE itself */
+#define PCRE_ERROR_BADUTF8 (-10)
/* Request types for pcre_fullinfo() */
diff --git a/pcregrep.c b/pcregrep.c
index f4a59f4..7a06993 100644
--- a/pcregrep.c
+++ b/pcregrep.c
@@ -545,8 +545,8 @@ for (i = 1; i < argc; i++)
}
}
-pattern_list = malloc(MAX_PATTERN_COUNT * sizeof(pcre *));
-hints_list = malloc(MAX_PATTERN_COUNT * sizeof(pcre_extra *));
+pattern_list = (pcre **)malloc(MAX_PATTERN_COUNT * sizeof(pcre *));
+hints_list = (pcre_extra **)malloc(MAX_PATTERN_COUNT * sizeof(pcre_extra *));
if (pattern_list == NULL || hints_list == NULL)
{
diff --git a/pcreposix.c b/pcreposix.c
index 49094f2..6152a15 100644
--- a/pcreposix.c
+++ b/pcreposix.c
@@ -48,7 +48,7 @@ static const char *estring[] = {
ERR11, ERR12, ERR13, ERR14, ERR15, ERR16, ERR17, ERR18, ERR19, ERR20,
ERR21, ERR22, ERR23, ERR24, ERR25, ERR26, ERR27, ERR28, ERR29, ERR30,
ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39, ERR40,
- ERR41, ERR42, ERR43 };
+ ERR41, ERR42, ERR43, ERR44 };
static int eint[] = {
REG_EESCAPE, /* "\\ at end of pattern" */
@@ -93,7 +93,8 @@ static int eint[] = {
REG_BADPAT, /* "recursive call could loop indefinitely" */
REG_BADPAT, /* "unrecognized character after (?P" */
REG_BADPAT, /* "syntax error after (?P" */
- REG_BADPAT /* "two named groups have the same name" */
+ REG_BADPAT, /* "two named groups have the same name" */
+ REG_BADPAT /* "invalid UTF-8 string" */
};
/* Table of texts corresponding to POSIX error codes */
@@ -217,7 +218,7 @@ preg->re_erroffset = erroffset;
if (preg->re_pcre == NULL) return pcre_posix_error_code(errorptr);
-preg->re_nsub = pcre_info(preg->re_pcre, NULL, NULL);
+preg->re_nsub = pcre_info((const pcre *)preg->re_pcre, NULL, NULL);
return 0;
}
@@ -264,8 +265,8 @@ if (nmatch > 0)
}
}
-rc = pcre_exec(preg->re_pcre, NULL, string, (int)strlen(string), 0, options,
- ovector, nmatch * 3);
+rc = pcre_exec((const pcre *)preg->re_pcre, NULL, string, (int)strlen(string),
+ 0, options, ovector, nmatch * 3);
if (rc == 0) rc = nmatch; /* All captured slots were filled in */
diff --git a/pcretest.c b/pcretest.c
index ad729b7..24196ac 100644
--- a/pcretest.c
+++ b/pcretest.c
@@ -52,7 +52,6 @@ static int use_utf8;
static size_t gotten_store;
-
static const int utf8_table1[] = {
0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff};
@@ -321,13 +320,16 @@ if (post_start > 0)
}
fprintf(outfile, "\n");
-
first_callout = 0;
-if ((int)(cb->callout_data) != 0)
+if (cb->callout_data != NULL)
{
- fprintf(outfile, "Callout data = %d\n", (int)(cb->callout_data));
- return (int)(cb->callout_data);
+ int callout_data = *((int *)(cb->callout_data));
+ if (callout_data != 0)
+ {
+ fprintf(outfile, "Callout data = %d\n", callout_data);
+ return callout_data;
+ }
}
return (cb->callout_number != callout_fail_id)? 0 :
@@ -397,8 +399,8 @@ unsigned char *dbuffer;
/* Get buffers from malloc() so that Electric Fence will check their misuse
when I am debugging. */
-buffer = malloc(BUFFER_SIZE);
-dbuffer = malloc(DBUFFER_SIZE);
+buffer = (unsigned char *)malloc(BUFFER_SIZE);
+dbuffer = (unsigned char *)malloc(DBUFFER_SIZE);
/* Static so that new_malloc can use it. */
@@ -464,7 +466,7 @@ while (argc > 1 && argv[op][0] == '-')
/* Get the store for the offsets vector, and remember what it was */
size_offsets_max = size_offsets;
-offsets = malloc(size_offsets_max * sizeof(int));
+offsets = (int *)malloc(size_offsets_max * sizeof(int));
if (offsets == NULL)
{
printf("** Failed to get %d bytes of memory for offsets vector\n",
@@ -619,6 +621,7 @@ while (!done)
case 'U': options |= PCRE_UNGREEDY; break;
case 'X': options |= PCRE_EXTRA; break;
case '8': options |= PCRE_UTF8; use_utf8 = 1; break;
+ case '?': options |= PCRE_NO_UTF8_CHECK; break;
case 'L':
ppp = pp;
@@ -787,7 +790,7 @@ while (!done)
}
if (get_options == 0) fprintf(outfile, "No options\n");
- else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s%s\n",
+ else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s%s%s\n",
((get_options & PCRE_ANCHORED) != 0)? " anchored" : "",
((get_options & PCRE_CASELESS) != 0)? " caseless" : "",
((get_options & PCRE_EXTENDED) != 0)? " extended" : "",
@@ -796,7 +799,8 @@ while (!done)
((get_options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
((get_options & PCRE_EXTRA) != 0)? " extra" : "",
((get_options & PCRE_UNGREEDY) != 0)? " ungreedy" : "",
- ((get_options & PCRE_UTF8) != 0)? " utf8" : "");
+ ((get_options & PCRE_UTF8) != 0)? " utf8" : "",
+ ((get_options & PCRE_NO_UTF8_CHECK) != 0)? " no_utf8_check" : "");
if (((((real_pcre *)re)->options) & PCRE_ICHANGED) != 0)
fprintf(outfile, "Case state changes\n");
@@ -861,13 +865,17 @@ while (!done)
else if (extra == NULL)
fprintf(outfile, "Study returned NULL\n");
+ /* Don't output study size; at present it is in any case a fixed
+ value, but it varies, depending on the computer architecture, and
+ so messes up the test suite. */
+
else if (do_showinfo)
{
size_t size;
uschar *start_bits = NULL;
new_info(re, extra, PCRE_INFO_STUDYSIZE, &size);
new_info(re, extra, PCRE_INFO_FIRSTTABLE, &start_bits);
- fprintf(outfile, "Study size = %d\n", size);
+ /* fprintf(outfile, "Study size = %d\n", size); */
if (start_bits == NULL)
fprintf(outfile, "No starting character set\n");
else
@@ -1105,7 +1113,7 @@ while (!done)
{
size_offsets_max = n;
free(offsets);
- use_offsets = offsets = malloc(size_offsets_max * sizeof(int));
+ use_offsets = offsets = (int *)malloc(size_offsets_max * sizeof(int));
if (offsets == NULL)
{
printf("** Failed to get %d bytes of memory for offsets vector\n",
@@ -1120,6 +1128,10 @@ while (!done)
case 'Z':
options |= PCRE_NOTEOL;
continue;
+
+ case '?':
+ options |= PCRE_NO_UTF8_CHECK;
+ continue;
}
*q++ = c;
}
@@ -1136,7 +1148,7 @@ while (!done)
int eflags = 0;
regmatch_t *pmatch = NULL;
if (use_size_offsets > 0)
- pmatch = malloc(sizeof(regmatch_t) * use_size_offsets);
+ pmatch = (regmatch_t *)malloc(sizeof(regmatch_t) * use_size_offsets);
if ((options & PCRE_NOTBOL) != 0) eflags |= REG_NOTBOL;
if ((options & PCRE_NOTEOL) != 0) eflags |= REG_NOTEOL;
@@ -1203,7 +1215,7 @@ while (!done)
if (extra == NULL)
{
- extra = malloc(sizeof(pcre_extra));
+ extra = (pcre_extra *)malloc(sizeof(pcre_extra));
extra->flags = 0;
}
extra->flags |= PCRE_EXTRA_MATCH_LIMIT;
@@ -1242,11 +1254,11 @@ while (!done)
{
if (extra == NULL)
{
- extra = malloc(sizeof(pcre_extra));
+ extra = (pcre_extra *)malloc(sizeof(pcre_extra));
extra->flags = 0;
}
extra->flags |= PCRE_EXTRA_CALLOUT_DATA;
- extra->callout_data = (void *)callout_data;
+ extra->callout_data = &callout_data;
count = pcre_exec(re, extra, (char *)bptr, len, start_offset,
options | g_notempty, use_offsets, use_size_offsets);
extra->flags &= ~PCRE_EXTRA_CALLOUT_DATA;
diff --git a/study.c b/study.c
index 4320bd2..5f0f196 100644
--- a/study.c
+++ b/study.c
@@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
Written by: Philip Hazel <ph10@cam.ac.uk>
- Copyright (c) 1997-2002 University of Cambridge
+ Copyright (c) 1997-2003 University of Cambridge
-----------------------------------------------------------------------------
Permission is granted to anyone to use this software for any purpose on any
@@ -297,19 +297,50 @@ do
/* Character class where all the information is in a bit map: set the
bits and either carry on or not, according to the repeat count. If it was
a negative class, and we are operating with UTF-8 characters, any byte
- with the top-bit set is a potentially valid starter because it may start
- a character with a value > 255. (This is sub-optimal in that the
- character may be in the range 128-255, and those characters might be
- unwanted, but that's as far as we go for the moment.) */
+ with a value >= 0xc4 is a potentially valid starter because it starts a
+ character with a value > 255. */
case OP_NCLASS:
- if (utf8) memset(start_bits+16, 0xff, 16);
+ if (utf8)
+ {
+ start_bits[24] |= 0xf0; /* Bits for 0xc4 - 0xc7 */
+ memset(start_bits+25, 0xff, 7); /* Bits for 0xc8 - 0xff */
+ }
/* Fall through */
case OP_CLASS:
{
tcode++;
- for (c = 0; c < 32; c++) start_bits[c] |= tcode[c];
+
+ /* In UTF-8 mode, the bits in a bit map correspond to character
+ values, not to byte values. However, the bit map we are constructing is
+ for byte values. So we have to do a conversion for characters whose
+ value is > 127. In fact, there are only two possible starting bytes for
+ characters in the range 128 - 255. */
+
+ if (utf8)
+ {
+ for (c = 0; c < 16; c++) start_bits[c] |= tcode[c];
+ for (c = 128; c < 256; c++)
+ {
+ if ((tcode[c/8] & (1 << (c&7))) != 0)
+ {
+ int d = (c >> 6) | 0xc0; /* Set bit for this starter */
+ start_bits[d/8] |= (1 << (d&7)); /* and then skip on to the */
+ c = (c & 0xc0) + 0x40 - 1; /* next relevant character. */
+ }
+ }
+ }
+
+ /* In non-UTF-8 mode, the two bit maps are completely compatible. */
+
+ else
+ {
+ for (c = 0; c < 32; c++) start_bits[c] |= tcode[c];
+ }
+
+ /* Advance past the bit map, and act on what follows */
+
tcode += 32;
switch (*tcode)
{
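
The conversion from a character-value bit map to a starting-byte bit map can be looked at on its own. The following is a minimal sketch under stated assumptions, not the code in study.c: the function name class_to_start_bits and its parameter names are hypothetical, and the bit test uses the bitwise & operator. It relies on the fact that in UTF-8 the characters 128-191 all begin with byte 0xc2 and the characters 192-255 all begin with byte 0xc3.

```c
/* Hypothetical helper (not in PCRE). class_map is a 32-byte bit map
   indexed by character value 0-255; start_bits is a 32-byte bit map
   indexed by the value of the first byte of a UTF-8 character. */
static void class_to_start_bits(const unsigned char *class_map,
                                unsigned char *start_bits)
{
    int c;

    /* Characters 0-127 are single bytes, so the first 16 bytes of the
       two maps correspond directly. */
    for (c = 0; c < 16; c++) start_bits[c] |= class_map[c];

    /* Characters 128-255 are two-byte sequences in UTF-8. */
    for (c = 128; c < 256; c++) {
        if ((class_map[c / 8] & (1 << (c & 7))) != 0) {
            int d = (c >> 6) | 0xc0;     /* 0xc2 or 0xc3 */
            start_bits[d / 8] |= (unsigned char)(1 << (d & 7));
            c = (c & 0xc0) + 0x40 - 1;   /* skip the rest of this block of 64 */
        }
    }
}
```

With only character 0xd6 set in class_map, the sketch sets the bit for byte 0xc3 in start_bits and leaves the bit for byte 0xd6 clear.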
diff --git a/testdata/testinput5 b/testdata/testinput5
index 9f07d6e..b82cee0 100644
--- a/testdata/testinput5
+++ b/testdata/testinput5
@@ -192,4 +192,34 @@
/[^\xff]/8D
+/[Ä-Ü]/8
+ Ö # Matches without Study
+ \x{d6}
+
+/[Ä-Ü]/8S
+ Ö <-- Same with Study
+ \x{d6}
+
+/[\x{c4}-\x{dc}]/8
+ Ö # Matches without Study
+ \x{d6}
+
+/[\x{c4}-\x{dc}]/8S
+ Ö <-- Same with Study
+ \x{d6}
+
+/[Ã]/8
+
+/Ã/8
+
+/ÃÃÃxxx/8
+
+/ÃÃÃxxx/8?D
+
+/abc/8
+ Ã]
+ Ã
+ ÃÃÃ
+ ÃÃÃ\?
+
/ End of testinput5 /
diff --git a/testdata/testoutput1 b/testdata/testoutput1
index 63214b7..8a7e6e6 100644
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
/the quick brown fox/
the quick brown fox
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index 22a345b..95b84c6 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
/(a)b|/
Capturing subpattern count = 1
@@ -136,7 +136,6 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Study size = 40
Starting character set: c d e
this sentence eventually mentions a cat
0: cat
@@ -148,7 +147,6 @@ Capturing subpattern count = 0
Options: caseless
No first char
No need char
-Study size = 40
Starting character set: C D E c d e
this sentence eventually mentions a CAT cat
0: CAT
@@ -160,7 +158,6 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Study size = 40
Starting character set: a b c d
/(a|[^\dZ])/S
@@ -168,7 +165,6 @@ Capturing subpattern count = 1
No options
No first char
No need char
-Study size = 40
Starting character set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
\x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19
\x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / : ; < = >
@@ -189,7 +185,6 @@ Capturing subpattern count = 1
No options
No first char
No need char
-Study size = 40
Starting character set: \x09 \x0a \x0c \x0d \x20 a b
/(ab\2)/
@@ -524,7 +519,6 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Study size = 40
Starting character set: a b c d
/(?i)[abcd]/S
@@ -532,7 +526,6 @@ Capturing subpattern count = 0
Options: caseless
No first char
No need char
-Study size = 40
Starting character set: A B C D a b c d
/(?m)[xy]|(b|c)/S
@@ -540,7 +533,6 @@ Capturing subpattern count = 1
Options: multiline
No first char
No need char
-Study size = 40
Starting character set: b c x y
/(^a|^b)/m
@@ -612,7 +604,6 @@ No options
Case state changes
No first char
No need char
-Study size = 40
Starting character set: C a b c d
/a$/
@@ -677,7 +668,6 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Study size = 40
Starting character set: a b
/(?<!foo)(alpha|omega)/S
@@ -685,7 +675,6 @@ Capturing subpattern count = 1
No options
No first char
Need char = 'a'
-Study size = 40
Starting character set: a o
/(?!alphabet)[ab]/S
@@ -693,7 +682,6 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Study size = 40
Starting character set: a b
/(?<=foo\n)^bar/m
@@ -3378,7 +3366,6 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Study size = 40
Starting character set: a b
/[^a]/I
@@ -3398,7 +3385,6 @@ Capturing subpattern count = 0
No options
No first char
Need char = '6'
-Study size = 40
Starting character set: 0 1 2 3 4 5 6 7 8 9
/a^b/I
@@ -3432,7 +3418,6 @@ Capturing subpattern count = 0
Options: caseless
No first char
No need char
-Study size = 40
Starting character set: A B a b
/[ab](?i)cd/IS
@@ -3441,7 +3426,6 @@ No options
Case state changes
No first char
Need char = 'd' (caseless)
-Study size = 40
Starting character set: a b
/abc(?C)def/
@@ -3742,7 +3726,6 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Study size = 40
Starting character set: a b
/(?R)/
diff --git a/testdata/testoutput3 b/testdata/testoutput3
index 5dac092..77c50b1 100644
--- a/testdata/testoutput3
+++ b/testdata/testoutput3
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
/^[\w]+/
*** Failers
@@ -85,7 +85,6 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Study size = 40
Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
@@ -94,7 +93,6 @@ Capturing subpattern count = 0
No options
No first char
No need char
-Study size = 40
Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö Ø Ù Ú Û Ü Ý Þ ß à á â ã ä å
diff --git a/testdata/testoutput4 b/testdata/testoutput4
index 312cfbe..a85c37b 100644
--- a/testdata/testoutput4
+++ b/testdata/testoutput4
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
/-- Do not use the \x{} construct except with patterns that have the --/
/-- /8 option set, because PCRE doesn't recognize them as UTF-8 unless --/
diff --git a/testdata/testoutput5 b/testdata/testoutput5
index b681214..6d0e89a 100644
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
/\x{100}/8DM
Memory allocation (code space): 11
@@ -402,21 +402,16 @@ Capturing subpattern count = 0
Options: utf8
No first char
No need char
-Study size = 40
Starting character set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a
\x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19
\x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4
5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y
Z [ \ ] ^ _ ` c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f
- \x80 \x81 \x82 \x83 \x84 \x85 \x86 \x87 \x88 \x89 \x8a \x8b \x8c \x8d \x8e
- \x8f \x90 \x91 \x92 \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9a \x9b \x9c \x9d
- \x9e \x9f \xa0 \xa1 \xa2 \xa3 \xa4 \xa5 \xa6 \xa7 \xa8 \xa9 \xaa \xab \xac
- \xad \xae \xaf \xb0 \xb1 \xb2 \xb3 \xb4 \xb5 \xb6 \xb7 \xb8 \xb9 \xba \xbb
- \xbc \xbd \xbe \xbf \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca
- \xcb \xcc \xcd \xce \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9
- \xda \xdb \xdc \xdd \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8
- \xe9 \xea \xeb \xec \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7
- \xf8 \xf9 \xfa \xfb \xfc \xfd \xfe \xff
+ \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc \xcd \xce \xcf \xd0
+ \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb \xdc \xdd \xde \xdf
+ \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea \xeb \xec \xed \xee
+ \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9 \xfa \xfb \xfc \xfd
+ \xfe \xff
\x{f1}
0: \x{f1}
\x{bf}
@@ -463,7 +458,6 @@ Capturing subpattern count = 1
Options: utf8
No first char
No need char
-Study size = 40
Starting character set: x \xc4
/(\x{100}*a|x)/8SD
@@ -482,7 +476,6 @@ Capturing subpattern count = 1
Options: utf8
No first char
No need char
-Study size = 40
Starting character set: a x \xc4
/(\x{100}{0,2}a|x)/8SD
@@ -501,7 +494,6 @@ Capturing subpattern count = 1
Options: utf8
No first char
No need char
-Study size = 40
Starting character set: a x \xc4
/(\x{100}{1,2}a|x)/8SD
@@ -521,7 +513,6 @@ Capturing subpattern count = 1
Options: utf8
No first char
No need char
-Study size = 40
Starting character set: x \xc4
/\x{100}*(\d+|"(?1)")/8
@@ -826,5 +817,60 @@ Options: utf8
No first char
No need char
+/[Ä-Ü]/8
+ Ö # Matches without Study
+ 0: \x{d6}
+ \x{d6}
+ 0: \x{d6}
+
+/[Ä-Ü]/8S
+ Ö <-- Same with Study
+ 0: \x{d6}
+ \x{d6}
+ 0: \x{d6}
+
+/[\x{c4}-\x{dc}]/8
+ Ö # Matches without Study
+ 0: \x{d6}
+ \x{d6}
+ 0: \x{d6}
+
+/[\x{c4}-\x{dc}]/8S
+ Ö <-- Same with Study
+ 0: \x{d6}
+ \x{d6}
+ 0: \x{d6}
+
+/[Ã]/8
+Failed: invalid UTF-8 string at offset 2
+
+/Ã/8
+Failed: invalid UTF-8 string at offset 0
+
+/ÃÃÃxxx/8
+Failed: invalid UTF-8 string at offset 1
+
+/ÃÃÃxxx/8?D
+------------------------------------------------------------------
+ 0 11 Bra 0
+ 3 6 \x{c3}\x{f8}xx
+ 11 11 Ket
+ 14 End
+------------------------------------------------------------------
+Capturing subpattern count = 0
+Options: utf8 no_utf8_check
+First char = 195
+Need char = 'x'
+
+/abc/8
+ Ã]
+Error -10
+ Ã
+Error -10
+ ÃÃÃ
+Error -10
+ ÃÃÃ\?
+No match
+
/ End of testinput5 /