Load pcre-4.4 into code/trunk.

git-svn-id: svn://vcs.exim.org/pcre/code/trunk@71 2f5784b3-3f2a-0410-8824-cb99058d5e15
author: nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-02-24 21:40:24 +0000
committer: nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2007-02-24 21:40:24 +0000
commit: 4af6fcff808e079ca1aa09104d6146baa932af47
tree: dc14f3624835dd1275c31159a4c365ed439f3df7
parent: f08d5b6354f668c0047281d81eda8d0fd2a9e82d
34 files changed, 579 insertions, 193 deletions
diff --git a/ChangeLog b/ChangeLog
index b912314..1c0d36a 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,6 +1,62 @@
 ChangeLog for PCRE
 ------------------
 
+Version 4.4 13-Aug-03
+---------------------
+
+ 1. In UTF-8 mode, a character class containing characters with values between
+    127 and 255 was not handled correctly if the compiled pattern was studied.
+    In fixing this, I have also improved the studying algorithm for such
+    classes (slightly).
+
+ 2. Three internal functions had redundant arguments passed to them. Removal
+    might give a very teeny performance improvement.
+
+ 3. Documentation bug: the value of the capture_top field in a callout is *one
+    more than* the number of the highest numbered captured substring.
+
+ 4. The Makefile linked pcretest and pcregrep with -lpcre, which could result
+    in incorrectly linking with a previously installed version. They now link
+    explicitly with libpcre.la.
+
+ 5. configure.in no longer needs to recognize Cygwin specially.
+
+ 6. A problem in pcre.in for Windows platforms is fixed.
+
+ 7. If a pattern was successfully studied, and the -d (or /D) flag was given to
+    pcretest, it used to include the size of the study block as part of its
+    output. Unfortunately, the structure contains a field that has a different
+    size on different hardware architectures. This meant that the tests that
+    showed this size failed. As the block is currently always of a fixed size,
+    this information isn't actually particularly useful in pcretest output, so
+    I have just removed it.
+
+ 8. Three pre-processor statements accidentally did not start in column 1.
+    Sadly, there are *still* compilers around that complain, even though
+    standard C has not required this for well over a decade. Sigh.
+
+ 9. In pcretest, the code for checking callouts passed small integers in the
+    callout_data field, which is a void * field. However, some picky compilers
+    complained about the casts involved for this on 64-bit systems. Now
+    pcretest passes the address of the small integer instead, which should get
+    rid of the warnings.
+
+10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at
+    both compile and run time, and gives an error if an invalid UTF-8 sequence
+    is found. There is an option for disabling this check in cases where the
+    string is known to be correct and/or the maximum performance is wanted.
+
+11. In response to a bug report, I changed one line in Makefile.in from
+
+        -Wl,--out-implib,.libs/lib@WIN_PREFIX@pcreposix.dll.a \
+    to
+        -Wl,--out-implib,.libs/@WIN_PREFIX@libpcreposix.dll.a \
+
+    to look similar to other lines, but I have no way of telling whether this
+    is the right thing to do, as I do not use Windows. No doubt I'll get told
+    if it's wrong...
+
+
 Version 4.3 21-May-03
 ---------------------
 
diff --git a/LICENCE b/LICENCE
index 8d68061..09a242c 100644
--- a/LICENCE
+++ b/LICENCE
@@ -9,7 +9,7 @@ Written by: Philip Hazel <ph10@cam.ac.uk>
 University of Cambridge Computing Service,
 Cambridge, England. Phone: +44 1223 334714.
 
-Copyright (c) 1997-2001 University of Cambridge
+Copyright (c) 1997-2003 University of Cambridge
 
 Permission is granted to anyone to use this software for any purpose on any
 computer system, and to redistribute it freely, subject to the following
diff --git a/Makefile.in b/Makefile.in
index ecdd6ef..ef456aa 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -1,21 +1,14 @@
 
 # Makefile.in for PCRE (Perl-Compatible Regular Expression) library.
 
-#---------------------------------------------------------------------------#
-# MinGW DLLs are built automatically with this configure.in and Makefile.in #
-# as long you are using autoconf 2.50 or higher. The Win32 static libraries #
-# have not been tested, but appear to be generated. This functionality is   #
-# by courtesy of Fred Cox. I (Philip Hazel) don't know anything about it,   #
-# as I live entirely in a non-Windows world.                                #
-#---------------------------------------------------------------------------#
-
 
 #############################################################################
 
 # PCRE is developed on a Unix system. I do not use Windows or Macs, and know
 # nothing about building software on them. Although the code of PCRE should
 # be very portable, the building system in this Makefile is designed for Unix
-# systems, with the exception of the mingw32 stuff just mentioned.
+# systems. However, there are features that have been supplied to me by various
+# people that should make it work on MinGW and Cygwin systems.
 
 # This setting enables Unix-style directory scanning in pcregrep, triggered
 # by the -f option. Maybe one day someone will add code for other systems.
@@ -106,11 +99,11 @@ LOBJ = maketables.lo get.lo study.lo pcre.lo @POSIX_LOBJ@
 all:            libpcre.la @POSIX_LIB@ pcretest@EXEEXT@ pcregrep@EXEEXT@ @ON_WINDOWS@ winshared
 
 pcregrep@EXEEXT@: libpcre.la pcregrep.@OBJEXT@ @ON_WINDOWS@ winshared
-		$(LINK) -o pcregrep@EXEEXT@ pcregrep.@OBJEXT@ -lpcre
+		$(LINK) -o pcregrep@EXEEXT@ pcregrep.@OBJEXT@ libpcre.la
 
 pcretest@EXEEXT@: libpcre.la @POSIX_LIB@ pcretest.@OBJEXT@ @ON_WINDOWS@ winshared
 		$(LINK) $(PURIFY) $(EFENCE) -o pcretest@EXEEXT@  pcretest.@OBJEXT@ \
-		-lpcre @POSIX_LIB@
+		libpcre.la @POSIX_LIB@
 
 libpcre.la:     $(OBJ)
 		-rm -f libpcre.la
@@ -119,7 +112,7 @@ libpcre.la:     $(OBJ)
 
 libpcreposix.la: pcreposix.@OBJEXT@
 		-rm -f libpcreposix.la
-		$(LINKLIB) -rpath $(LIBDIR) -L. -lpcre -version-info \
+		$(LINKLIB) -rpath $(LIBDIR) libpcre.la -version-info \
 		'$(PCREPOSIXLIBVERSION)' -o libpcreposix.la pcreposix.lo
 
 pcre.@OBJEXT@:  $(top_srcdir)/chartables.c $(top_srcdir)/pcre.c \
@@ -151,7 +144,7 @@ pcretest.@OBJEXT@:     $(top_srcdir)/pcretest.c $(top_srcdir)/internal.h \
 pcregrep.@OBJEXT@:     $(top_srcdir)/pcregrep.c pcre.h Makefile config.h
 		$(CC) -c $(CFLAGS) -I. $(UTF8) $(PCREGREP_OSTYPE) $(top_srcdir)/pcregrep.c
 
-# Some Windows-specific targets, for Cygwin and MinGW
+# Some Windows-specific targets for MinGW. Do not use for Cygwin.
 
 winshared : .libs/@WIN_PREFIX@pcre.dll .libs/@WIN_PREFIX@pcreposix.dll
 
@@ -175,8 +168,8 @@ winshared : .libs/@WIN_PREFIX@pcre.dll .libs/@WIN_PREFIX@pcreposix.dll
 .libs/@WIN_PREFIX@pcreposix.dll: libpcreposix.la libpcre.la
 	$(CC) $(CFLAGS) -shared -o $@ \
 	-Wl,--whole-archive .libs/libpcreposix.a \
-	-Wl,--out-implib,.libs/lib@WIN_PREFIX@pcreposix.dll.a \
-	-Wl,--output-def,.libs/@WIN_PREFIX@pcreposix.dll-def \
+	-Wl,--out-implib,.libs/@WIN_PREFIX@pcreposix.dll.a \
+	-Wl,--output-def,.libs/@WIN_PREFIX@libpcreposix.dll-def \
 	-Wl,--export-all-symbols \
 	-Wl,--no-whole-archive .libs/libpcre.a
 	sed -e "s#dlname=''#dlname='../bin/@WIN_PREFIX@pcreposix.dll'#" \
diff --git a/NEWS b/NEWS
index e620b2d..60d66d7 100644
--- a/NEWS
+++ b/NEWS
@@ -1,6 +1,21 @@
 News about PCRE releases
 ------------------------
 
+Release 4.4 21-Aug-03
+---------------------
+
+This is mainly a bug-fix and tidying release. The only new feature is that PCRE
+checks UTF-8 strings for validity by default. There is an option to suppress
+this, just in case anybody wants that teeny extra bit of performance.
+
+
+Releases 4.1 - 4.3
+------------------
+
+Sorry, I forgot about updating the NEWS file for these releases. Please take a
+look at ChangeLog.
+
+
 Release 4.0 17-Feb-03
 ---------------------
 
diff --git a/configure b/configure
index 51ac584..f89594f 100755
--- a/configure
+++ b/configure
@@ -936,6 +936,9 @@ if test "$ac_init_help" = "long"; then
     # The list generated by autoconf has been trimmed to remove many
     # options that are totally irrelevant to PCRE (e.g. relating to X),
     # or are not supported by its Makefile.
+    # The list generated by autoconf has been trimmed to remove many
+    # options that are totally irrelevant to PCRE (e.g. relating to X),
+    # or are not supported by its Makefile.
   # This message is too long to be a string in the A/UX 3.1 sh.
   cat <<_ACEOF
 \`configure' configures this package to adapt to many kinds of systems.
@@ -1432,8 +1435,8 @@ ac_compiler_gnu=$ac_cv_c_compiler_gnu
 
 
 PCRE_MAJOR=4
-PCRE_MINOR=3
-PCRE_DATE=21-May-2003
+PCRE_MINOR=4
+PCRE_DATE=21-August-2003
 PCRE_VERSION=${PCRE_MAJOR}.${PCRE_MINOR}
 
 
@@ -5094,7 +5097,7 @@ else
     ;;
 
   darwin* | rhapsody*)
-    # This patch put in by hand by PH (22-May-2003) for Darwin 1.3. 
+    # This patch put in by hand by PH (21-Aug-2003) for Darwin 1.3. 
     case "$host_os" in
       rhapsody* | darwin1.[[012]])
        allow_undefined_flag='-undefined suppress'
@@ -7681,14 +7684,6 @@ mingw* )
     NOT_ON_WINDOWS="#"
     WIN_PREFIX=
     ;;
-cygwin* )
-    ON_WINDOWS=
-    POSIX_OBJ=pcreposix.o
-    POSIX_LOBJ=pcreposix.lo
-    POSIX_LIB=
-    WIN_PREFIX=cyg
-    NOT_ON_WINDOWS="#"
-    ;;
 * )
     ON_WINDOWS="#"
     NOT_ON_WINDOWS=
@@ -7706,7 +7701,8 @@ esac
 
 
 if test "x$enable_shared" = "xno" ; then
-    cat >>confdefs.h <<\_ACEOF
+
+cat >>confdefs.h <<\_ACEOF
 #define PCRE_STATIC 1
 _ACEOF
 
diff --git a/configure.in b/configure.in
index 5394f4f..69cb923 100644
--- a/configure.in
+++ b/configure.in
@@ -21,8 +21,8 @@ dnl digits for minor numbers less than 10. There are unlikely to be
 dnl that many releases anyway.
 
 PCRE_MAJOR=4
-PCRE_MINOR=3
-PCRE_DATE=21-May-2003
+PCRE_MINOR=4
+PCRE_DATE=21-August-2003
 PCRE_VERSION=${PCRE_MAJOR}.${PCRE_MINOR}
 
 dnl Default values for miscellaneous macros
@@ -146,7 +146,8 @@ AC_SUBST(PCRE_POSIXLIB_VERSION)
 AC_SUBST(POSIX_MALLOC_THRESHOLD)
 AC_SUBST(UTF8)
 
-dnl Stuff to make Win32 work better
+dnl Stuff to make MinGW work better. Special treatment is no longer
+dnl needed for Cygwin.
 
 case $host_os in
 mingw* )
@@ -157,14 +158,6 @@ mingw* )
     NOT_ON_WINDOWS="#"
     WIN_PREFIX=
     ;;
-cygwin* )
-    ON_WINDOWS=
-    POSIX_OBJ=pcreposix.o
-    POSIX_LOBJ=pcreposix.lo
-    POSIX_LIB=
-    WIN_PREFIX=cyg
-    NOT_ON_WINDOWS="#"
-    ;;
 * )
     ON_WINDOWS="#"
     NOT_ON_WINDOWS=
@@ -182,7 +175,7 @@ AC_SUBST(POSIX_LOBJ)
 AC_SUBST(POSIX_LIB)
 
 if test "x$enable_shared" = "xno" ; then
-    AC_DEFINE(PCRE_STATIC,1)
+    AC_DEFINE([PCRE_STATIC],[1],[to link statically])
 fi
 
 dnl This must be last; it determines what files are written as well as config.h
diff --git a/doc/Tech.Notes b/doc/Tech.Notes
index dd01932..73c31c7 100644
--- a/doc/Tech.Notes
+++ b/doc/Tech.Notes
@@ -48,7 +48,9 @@ These items are all just one byte long
 
   OP_END                 end of pattern
   OP_ANY                 match any character
+  OP_ANYBYTE             match any single byte, even in UTF-8 mode 
   OP_SOD                 match start of data: \A
+  OP_SOM                 start of match (subject + offset): \G
   OP_CIRC                ^ (start of data, or after \n in multiline)
   OP_NOT_WORD_BOUNDARY   \W
   OP_WORD_BOUNDARY       \w
@@ -61,7 +63,6 @@ These items are all just one byte long
   OP_EODN                match end of data or \n at end: \Z
   OP_EOD                 match end of data: \z
   OP_DOLL                $ (end of data, or before \n in multiline)
-  OP_RECURSE             match the pattern recursively
 
 
 Repeating single characters
@@ -119,8 +120,7 @@ instances of OP_CHARS are used.
 Character classes
 -----------------
 
-When characters less than 256 are involved, OP_CLASS is used for a character
-class. If there is only one character, OP_CHARS is used for a positive class,
+If there is only one character, OP_CHARS is used for a positive class,
 and OP_NOT for a negative one (that is, for something like [^a]). However, in 
 UTF-8 mode, this applies only to characters with values < 128, because OP_NOT 
 is confined to single bytes.
@@ -129,9 +129,15 @@ Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a repeated,
 negated, single-character class. The normal ones (OP_STAR etc.) are used for a
 repeated positive single-character class.
 
-OP_CLASS is followed by a 32-byte bit map containing a 1 bit for every
-character that is acceptable. The bits are counted from the least significant
-end of each byte.
+When there's more than one character in a class and all the characters are less
+than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative 
+one. In either case, the opcode is followed by a 32-byte bit map containing a 1
+bit for every character that is acceptable. The bits are counted from the least
+significant end of each byte.
+
+The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode, 
+subject characters with values greater than 255 can be handled correctly. For
+OP_CLASS they don't match, whereas for OP_NCLASS they do.
 
 For classes containing characters with values > 255, OP_XCLASS is used. It
 optionally uses a bit map (if any characters lie within it), followed by a list
@@ -243,6 +249,21 @@ same scheme is used, with a "reference number" of 0xffff. Otherwise, a
 conditional subpattern always starts with one of the assertions.
 
 
+Recursion
+---------
+
+Recursion either matches the current regex, or some subexpression. The opcode
+OP_RECURSE is followed by a value that is the offset to the starting bracket
+from the start of the whole pattern.
+
+
+Callout
+-------
+
+OP_CALLOUT is followed by one byte of data that holds a callout number in the 
+range 0 to 255.
+
+
 Changing options
 ----------------
 
@@ -257,4 +278,4 @@ at compile time, and so does not cause anything to be put into the compiled
 data.
 
 Philip Hazel
-August 2002
+August 2003
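Note on the Tech.Notes class description in the diff above: the 32-byte bit map and the OP_CLASS/OP_NCLASS distinction can be sketched in a few lines of C. This is only an illustration under stated assumptions, not PCRE's internal code. The names CLASS_MAP_BYTES, class_map_set, and class_map_match are made up for the sketch, and the is_nclass argument stands in for the choice of opcode.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers for illustration; not taken from the PCRE sources. */
enum { CLASS_MAP_BYTES = 32 };   /* 32 bytes * 8 bits = 256 characters */

/* Mark character c (0..255) as acceptable. Bits are counted from the
   least significant end of each byte, as the notes describe. */
static void class_map_set(unsigned char *map, unsigned int c)
{
  assert(map != NULL && c < 256);
  map[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Test a subject character against the map. is_nclass is non-zero when
   the class was compiled as a negative one (the OP_NCLASS case). */
static int class_map_match(const unsigned char *map, unsigned int c,
                           int is_nclass)
{
  assert(map != NULL);
  if (c > 255)
    return is_nclass != 0;       /* outside the map: only a negative class matches */
  return (map[c / 8] & (1u << (c % 8))) != 0;
}
```

Because the map already lists every acceptable character, a negative class differs from a positive one only for subject characters that fall outside the map, which is the case the first branch of class_map_match handles.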
diff --git a/doc/html/pcre.html b/doc/html/pcre.html
index fb319f3..bb0d354 100644
--- a/doc/html/pcre.html
+++ b/doc/html/pcre.html
@@ -125,9 +125,16 @@ to testing the PCRE_UTF8 flag in several places, so should not be very large.
 The following comments apply when PCRE is running in UTF-8 mode:
 </P>
 <P>
-1. PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
-not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
-the results are undefined.
+1. When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects
+are checked for validity on entry to the relevant functions. If an invalid
+UTF-8 string is passed, an error return is given. In some situations, you may
+already know that your strings are valid, and therefore want to skip these
+checks in order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag
+at compile time or at run time, PCRE assumes that the pattern or subject it
+is given (respectively) contains only valid UTF-8 codes. In this case, it does
+not diagnose an invalid UTF-8 string. If you pass an invalid UTF-8 string to
+PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program
+may crash.
 </P>
 <P>
 2. In a pattern, the escape sequence \x{...}, where the contents of the braces
@@ -178,6 +185,6 @@ Cambridge CB2 3QG, England.
 Phone: +44 1223 334714
 </P>
 <P>
-Last updated: 04 February 2003
+Last updated: 20 August 2003
 <br>
 Copyright &copy; 1997-2003 University of Cambridge.
diff --git a/doc/html/pcre_compile.html b/doc/html/pcre_compile.html
index 0b21683..e1a4379 100644
--- a/doc/html/pcre_compile.html
+++ b/doc/html/pcre_compile.html
@@ -52,10 +52,14 @@ The option bits are:
                           theses (named ones available)
   PCRE_UNGREEDY         Invert greediness of quantifiers
   PCRE_UTF8             Run in UTF-8 mode
+  PCRE_NO_UTF8_CHECK    Do not check the pattern for UTF-8
+                          validity (only relevant if
+                          PCRE_UTF8 is set)
 </PRE>
 </P>
 <P>
-PCRE must have been compiled with UTF-8 support when PCRE_UTF8 is used.
+PCRE must be compiled with UTF-8 support in order to use PCRE_UTF8
+(or PCRE_NO_UTF8_CHECK).
 </P>
 <P>
 The yield of the function is a pointer to a private data structure that
diff --git a/doc/html/pcre_exec.html b/doc/html/pcre_exec.html
index 915bc73..cf86dfd 100644
--- a/doc/html/pcre_exec.html
+++ b/doc/html/pcre_exec.html
@@ -47,6 +47,9 @@ The options are:
   PCRE_NOTBOL        Subject is not the beginning of a line
   PCRE_NOTEOL        Subject is not the end of a line
   PCRE_NOTEMPTY      An empty string is not a valid match
+  PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
+                       validity (only relevant if PCRE_UTF8
+                       was set at compile time)
 </PRE>
 </P>
 <P>
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html
index 9a479e8..54f8f24 100644
--- a/doc/html/pcreapi.html
+++ b/doc/html/pcreapi.html
@@ -442,6 +442,21 @@ in the main
 <a href="pcre.html"><b>pcre</b></a>
 page.
 </P>
+<P>
+<pre>
+  PCRE_NO_UTF8_CHECK
+</PRE>
+</P>
+<P>
+When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
+automatically checked. If an invalid UTF-8 sequence of bytes is found,
+<b>pcre_compile()</b> returns an error. If you already know that your pattern is
+valid, and you want to skip this check for performance reasons, you can set the
+PCRE_NO_UTF8_CHECK option. When it is set, the effect of passing an invalid
+UTF-8 string as a pattern is undefined. It may cause your program to crash.
+Note that there is a similar option for suppressing the checking of subject
+strings passed to <b>pcre_exec()</b>.
+</P>
 <br><a name="SEC6" href="#TOC1">STUDYING A PATTERN</a><br>
 <P>
 <b>pcre_extra *pcre_study(const pcre *<i>code</i>, int <i>options</i>,</b>
@@ -862,6 +877,15 @@ or turned out to be anchored by virtue of its contents, it cannot be made
 unachored at matching time.
 </P>
 <P>
+When PCRE_UTF8 was set at compile time, the validity of the subject as a UTF-8
+string is automatically checked. If an invalid UTF-8 sequence of bytes is
+found, <b>pcre_exec()</b> returns the error PCRE_ERROR_BADUTF8. If you already
+know that your subject is valid, and you want to skip this check for
+performance reasons, you can set the PCRE_NO_UTF8_CHECK option when calling
+<b>pcre_exec()</b>. When this option is set, the effect of passing an invalid
+UTF-8 string as a subject is undefined. It may cause your program to crash.
+</P>
+<P>
 There are also three further options that can be set only at matching time:
 </P>
 <P>
@@ -1106,6 +1130,14 @@ This error is never generated by <b>pcre_exec()</b> itself. It is provided for
 use by callout functions that want to yield a distinctive error code. See the
 <b>pcrecallout</b> documentation for details.
 </P>
+<P>
+<pre>
+  PCRE_ERROR_BADUTF8       (-10)
+</PRE>
+</P>
+<P>
+A string that contains an invalid UTF-8 byte sequence was passed as a subject.
+</P>
 <br><a name="SEC11" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
 <P>
 <b>int pcre_copy_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
@@ -1257,6 +1289,6 @@ then call <i>pcre_copy_substring()</i> or <i>pcre_get_substring()</i>, as
 appropriate.
 </P>
 <P>
-Last updated: 03 February 2003
+Last updated: 20 August 2003
 <br>
 Copyright &copy; 1997-2003 University of Cambridge.
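Note on PCRE_NO_UTF8_CHECK as documented in the diff above: a caller that matches the same subject many times might validate it once and then set the flag on later calls. The sketch below shows one way such a pre-check could look. It is a standalone helper, assuming the stricter RFC 3629 rules (no overlong forms, no surrogates, nothing above U+10FFFF); it is not the check PCRE itself performs, and the name utf8_is_valid is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s[0..len) is well-formed UTF-8
   under RFC 3629, otherwise 0. Not PCRE's own validity check. */
static int utf8_is_valid(const unsigned char *s, size_t len)
{
  size_t i = 0;
  assert(s != NULL || len == 0);

  while (i < len) {
    unsigned int c = s[i];
    unsigned int cp, min;
    size_t extra, k;

    if (c < 0x80) { i++; continue; }            /* single-byte character */
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
    else return 0;             /* stray continuation or invalid lead byte */

    if (extra > len - i - 1) return 0;          /* sequence is truncated */

    for (k = 1; k <= extra; k++) {
      unsigned int cc = s[i + k];
      if ((cc & 0xC0) != 0x80) return 0;        /* not a continuation byte */
      cp = (cp << 6) | (cc & 0x3F);
    }

    if (cp < min) return 0;                     /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate code point */
    if (cp > 0x10FFFF) return 0;                /* beyond the Unicode range */
    i += extra + 1;
  }
  return 1;
}
```

If a check like this rejects a string, the safe course is to leave PCRE_NO_UTF8_CHECK unset so that the library reports the problem itself (PCRE_ERROR_BADUTF8 from pcre_exec()) rather than running into undefined behaviour.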
diff --git a/doc/html/pcrecallout.html b/doc/html/pcrecallout.html
index 5516c99..f4b7104 100644
--- a/doc/html/pcrecallout.html
+++ b/doc/html/pcrecallout.html
@@ -81,8 +81,9 @@ The <i>current_position</i> field contains the offset within the subject of the
 current match pointer.
 </P>
 <P>
-The <i>capture_top</i> field contains the number of the highest captured
-substring so far.
+The <i>capture_top</i> field contains one more than the number of the highest
+numbered captured substring so far. If no substrings have been captured,
+the value of <i>capture_top</i> is one.
 </P>
 <P>
 The <i>capture_last</i> field contains the number of the most recently captured
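Note on the corrected capture_top wording in the diff above: since the field holds one more than the highest captured number, a loop over the captures made so far runs from 1 up to, but not including, capture_top. The sketch below illustrates that. The struct callout_view is a hypothetical stand-in carrying only two fields of the callout block, and it assumes that offsets are stored in start/end pairs as in ovector and that an unset substring has negative offsets.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in with only the two callout fields used here. */
struct callout_view {
  const int *offset_vector;  /* assumed: start/end offset pairs, as in ovector */
  int capture_top;           /* one more than the highest captured number */
};

/* Number of the highest numbered captured substring, or 0 if none. */
static int highest_captured(const struct callout_view *cb)
{
  assert(cb != NULL && cb->capture_top >= 1);
  return cb->capture_top - 1;
}

/* Total length of substrings 1 .. capture_top-1, skipping any pair
   with negative offsets (assumed to mean "not set"). */
static int captured_length(const struct callout_view *cb)
{
  int i, total = 0;
  assert(cb != NULL && cb->offset_vector != NULL);

  for (i = 1; i < cb->capture_top; i++) {
    int start = cb->offset_vector[2 * i];
    int end = cb->offset_vector[2 * i + 1];
    if (start >= 0 && end >= start)
      total += end - start;
  }
  return total;
}
```

With capture_top equal to one, the loop body never runs, which matches the documented value when no substrings have been captured.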
diff --git a/doc/html/pcretest.html b/doc/html/pcretest.html
index 25b03d3..329fb79 100644
--- a/doc/html/pcretest.html
+++ b/doc/html/pcretest.html
@@ -149,9 +149,10 @@ respectively. For example:
 </P>
 <P>
 These modifier letters have the same effect as they do in Perl. There are
-others which set PCRE options that do not correspond to anything in Perl:
-<b>/A</b>, <b>/E</b>, and <b>/X</b> set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and
-PCRE_EXTRA respectively.
+others that set PCRE options that do not correspond to anything in Perl:
+<b>/A</b>, <b>/E</b>, <b>/N</b>, <b>/U</b>, and <b>/X</b> set PCRE_ANCHORED,
+PCRE_DOLLAR_ENDONLY, PCRE_NO_AUTO_CAPTURE, PCRE_UNGREEDY, and PCRE_EXTRA
+respectively.
 </P>
 <P>
 Searching for all possible matches within each subject string can be requested
@@ -233,6 +234,11 @@ provided that it was compiled with this support enabled. This modifier also
 causes any non-printing characters in output strings to be printed using the
 \x{hh...} notation if they are valid UTF-8 sequences.
 </P>
+<P>
+If the <b>/?</b> modifier is used with <b>/8</b>, it causes <b>pcretest</b> to
+call <b>pcre_compile()</b> with the PCRE_NO_UTF8_CHECK option, to suppress the
+checking of the string for UTF-8 validity.
+</P>
 <br><a name="SEC5" href="#TOC1">CALLOUTS</a><br>
 <P>
 If the pattern contains any callout requests, <b>pcretest</b>'s callout function
@@ -318,6 +324,8 @@ recognized:
                <b>pcre_exec()</b> to dd (any number of decimal
                digits)
   \Z         pass the PCRE_NOTEOL option to <b>pcre_exec()</b>
+  \?         pass the PCRE_NO_UTF8_CHECK option to
+               <b>pcre_exec()</b>
 </PRE>
 </P>
 <P>
@@ -429,6 +437,6 @@ University Computing Service,
 Cambridge CB2 3QG, England.
 </P>
 <P>
-Last updated: 03 February 2003
+Last updated: 20 August 2003
 <br>
 Copyright &copy; 1997-2003 University of Cambridge.
diff --git a/doc/pcre.3 b/doc/pcre.3
index 7fd9851..c0c7141 100644
--- a/doc/pcre.3
+++ b/doc/pcre.3
@@ -116,9 +116,16 @@ to testing the PCRE_UTF8 flag in several places, so should not be very large.
 
 The following comments apply when PCRE is running in UTF-8 mode:
 
-1. PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
-not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
-the results are undefined.
+1. When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects
+are checked for validity on entry to the relevant functions. If an invalid
+UTF-8 string is passed, an error return is given. In some situations, you may
+already know that your strings are valid, and therefore want to skip these
+checks in order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag
+at compile time or at run time, PCRE assumes that the pattern or subject it
+is given (respectively) contains only valid UTF-8 codes. In this case, it does
+not diagnose an invalid UTF-8 string. If you pass an invalid UTF-8 string to
+PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program
+may crash.
 
 2. In a pattern, the escape sequence \\x{...}, where the contents of the braces
 is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
@@ -162,6 +169,6 @@ Cambridge CB2 3QG, England.
 Phone: +44 1223 334714
 
 .in 0
-Last updated: 04 February 2003
+Last updated: 20 August 2003
 .br
 Copyright (c) 1997-2003 University of Cambridge.
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 1ec5f2c..ad6f3b2 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -118,10 +118,19 @@ UTF-8 SUPPORT
      The following comments apply when PCRE is running  in  UTF-8
      mode:
 
-     1. PCRE assumes that the strings it is given  contain  valid
-     UTF-8  codes. It does not diagnose invalid UTF-8 strings. If
-     you pass invalid UTF-8 strings  to  PCRE,  the  results  are
-     undefined.
+     1. When you set the PCRE_UTF8 flag, the  strings  passed  as
+     patterns  and  subjects are checked for validity on entry to
+     the relevant  functions.  If  an  invalid  UTF-8  string  is
+     passed,  an  error  return is given. In some situations, you
+     may already know that your strings are valid, and  therefore
+     want  to  skip these checks in order to improve performance.
+     If you set the PCRE_NO_UTF8_CHECK flag at compile time or at
+     run  time,  PCRE  assumes  that the pattern or subject it is
+     given (respectively) contains only  valid  UTF-8  codes.  In
+     this  case, it does not diagnose an invalid UTF-8 string. If
+     you  pass   an   invalid   UTF-8   string   to   PCRE   when
+     PCRE_NO_UTF8_CHECK  is  set, the results are undefined. Your
+     program may crash.
 
      2. In a pattern, the escape sequence \x{...}, where the con-
      tents  of  the  braces is a string of hexadecimal digits, is
@@ -164,7 +173,7 @@ AUTHOR
      Cambridge CB2 3QG, England.
      Phone: +44 1223 334714
 
-Last updated: 04 February 2003
+Last updated: 20 August 2003
 Copyright (c) 1997-2003 University of Cambridge.
 -----------------------------------------------------------------------------
 
@@ -654,6 +663,20 @@ COMPILING A PATTERN
      option  changes  the behaviour of PCRE are given in the sec-
      tion on UTF-8 support in the main pcre page.
 
+       PCRE_NO_UTF8_CHECK
+
+     When PCRE_UTF8 is set, the validity  of  the  pattern  as  a
+     UTF-8  string  is automatically checked. If an invalid UTF-8
+     sequence of bytes is found, pcre_compile() returns an error.
+     If you already know that your pattern is valid, and you want
+     to skip this check for performance reasons, you can set  the
+     PCRE_NO_UTF8_CHECK  option.  When  it  is set, the effect of
+     passing an invalid UTF-8 string as a pattern  is  undefined.
+     It  may  cause  your program to crash.  Note that there is a
+     similar option  for  suppressing  the  checking  of  subject
+     strings passed to pcre_exec().
+
+
 
 STUDYING A PATTERN
 
@@ -747,7 +770,6 @@ INFORMATION ABOUT A PATTERN
      compiled pattern. It replaces the obsolete pcre_info() func-
      tion, which is nevertheless retained for backwards compabil-
      ity (and is documented below).
-
      The first argument for pcre_fullinfo() is a pointer  to  the
      compiled  pattern.  The  second  argument  is  the result of
      pcre_study(), or NULL if the pattern was  not  studied.  The
@@ -1014,6 +1036,16 @@ MATCHING A PATTERN
      turned out to be anchored by virtue of its contents, it can-
      not be made unachored at matching time.
 
+     When PCRE_UTF8 was set at compile time, the validity of  the
+     subject  as  a  UTF-8 string is automatically checked. If an
+     invalid  UTF-8  sequence  of  bytes  is  found,  pcre_exec()
+     returns  the  error  PCRE_ERROR_BADUTF8. If you already know
+     that your subject is valid, and you want to skip this  check
+     for  performance reasons, you can set the PCRE_NO_UTF8_CHECK
+     option when calling pcre_exec(). When this  option  is  set,
+     the  effect  of passing an invalid UTF-8 string as a subject
+     is undefined. It may cause your program to crash.
+
      There are also three further options that can be set only at
      matching time:
 
@@ -1103,7 +1135,6 @@ MATCHING A PATTERN
      used for a fragment of a pattern that picks out a substring.
      PCRE supports several other kinds of  parenthesized  subpat-
      tern that do not cause substrings to be captured.
-
      Captured substrings are returned to the caller via a  vector
      of  integer  offsets whose address is passed in ovector. The
      number of elements in the vector is passed in ovecsize.  The
@@ -1219,6 +1250,11 @@ MATCHING A PATTERN
      distinctive error code. See  the  pcrecallout  documentation
      for details.
 
+       PCRE_ERROR_BADUTF8       (-10)
+
+     A string that contains an invalid UTF-8  byte  sequence  was
+     passed as a subject.
+
 
 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
 
@@ -1255,7 +1291,6 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
      returned zero, indicating that it ran out of space in  ovec-
      tor,  the  value passed as stringcount should be the size of
      the vector divided by three.
-
      The functions pcre_copy_substring() and pcre_get_substring()
      extract a single substring, whose number is given as string-
      number. A value of zero extracts the substring that  matched
@@ -1352,7 +1387,7 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
      succeeds,    they   then   call   pcre_copy_substring()   or
      pcre_get_substring(), as appropriate.
 
-Last updated: 03 February 2003
+Last updated: 20 August 2003
 Copyright (c) 1997-2003 University of Cambridge.
 -----------------------------------------------------------------------------
 
@@ -1420,8 +1455,9 @@ PCRE CALLOUTS
      The current_position field contains the  offset  within  the
      subject of the current match pointer.
 
-     The capture_top field contains the  number  of  the  highest
-     captured substring so far.
+     The capture_top field contains one more than the  number  of
+     the  highest  numbered captured substring so far. If no sub-
+     strings have been captured, the value of capture_top is one.
 
      The capture_last field  contains  the  number  of  the  most
      recently captured substring.
diff --git a/doc/pcre_compile.3 b/doc/pcre_compile.3
index f911623..a827315 100644
--- a/doc/pcre_compile.3
+++ b/doc/pcre_compile.3
@@ -42,8 +42,12 @@ The option bits are:
                           theses (named ones available)
   PCRE_UNGREEDY         Invert greediness of quantifiers
   PCRE_UTF8             Run in UTF-8 mode
+  PCRE_NO_UTF8_CHECK    Do not check the pattern for UTF-8
+                          validity (only relevant if
+                          PCRE_UTF8 is set)
 
-PCRE must have been compiled with UTF-8 support when PCRE_UTF8 is used.
+PCRE must be compiled with UTF-8 support in order to use PCRE_UTF8
+(or PCRE_NO_UTF8_CHECK).
 
 The yield of the function is a pointer to a private data structure that
 contains the compiled pattern, or NULL if an error was detected.
diff --git a/doc/pcre_exec.3 b/doc/pcre_exec.3
index f61c2a4..0d6c380 100644
--- a/doc/pcre_exec.3
+++ b/doc/pcre_exec.3
@@ -37,6 +37,9 @@ The options are:
   PCRE_NOTBOL        Subject is not the beginning of a line
   PCRE_NOTEOL        Subject is not the end of a line
   PCRE_NOTEMPTY      An empty string is not a valid match
+  PCRE_NO_UTF8_CHECK Do not check the subject for UTF-8
+                       validity (only relevant if PCRE_UTF8
+                       was set at compile time)
 
 There is a complete description of the PCRE API in the
 .\" HREF
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index fbd3d5d..0149f50 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -371,6 +371,18 @@ in the main
 .\"
 page.
 
+  PCRE_NO_UTF8_CHECK
+
+When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
+automatically checked. If an invalid UTF-8 sequence of bytes is found,
+\fBpcre_compile()\fR returns an error. If you already know that your pattern is
+valid, and you want to skip this check for performance reasons, you can set the
+PCRE_NO_UTF8_CHECK option. When it is set, the effect of passing an invalid
+UTF-8 string as a pattern is undefined. It may cause your program to crash.
+Note that there is a similar option for suppressing the checking of subject
+strings passed to \fBpcre_exec()\fR.
+
+
 .SH STUDYING A PATTERN
 .rs
 .sp
@@ -698,6 +710,14 @@ first matching position. However, if a pattern was compiled with PCRE_ANCHORED,
 or turned out to be anchored by virtue of its contents, it cannot be made
 unachored at matching time.
 
+When PCRE_UTF8 was set at compile time, the validity of the subject as a UTF-8
+string is automatically checked. If an invalid UTF-8 sequence of bytes is
+found, \fBpcre_exec()\fR returns the error PCRE_ERROR_BADUTF8. If you already
+know that your subject is valid, and you want to skip this check for
+performance reasons, you can set the PCRE_NO_UTF8_CHECK option when calling
+\fBpcre_exec()\fR. When this option is set, the effect of passing an invalid
+UTF-8 string as a subject is undefined. It may cause your program to crash.
+
 There are also three further options that can be set only at matching time:
 
   PCRE_NOTBOL
@@ -872,6 +892,10 @@ This error is never generated by \fBpcre_exec()\fR itself. It is provided for
 use by callout functions that want to yield a distinctive error code. See the
 \fBpcrecallout\fR documentation for details.
 
+  PCRE_ERROR_BADUTF8       (-10)
+
+A string that contains an invalid UTF-8 byte sequence was passed as a subject.
+
 .SH EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
 .rs
 .sp
@@ -1011,6 +1035,6 @@ then call \fIpcre_copy_substring()\fR or \fIpcre_get_substring()\fR, as
 appropriate.
 
 .in 0
-Last updated: 03 February 2003
+Last updated: 20 August 2003
 .br
 Copyright (c) 1997-2003 University of Cambridge.
diff --git a/doc/pcrecallout.3 b/doc/pcrecallout.3
index f54d0dd..bfbb66b 100644
--- a/doc/pcrecallout.3
+++ b/doc/pcrecallout.3
@@ -57,8 +57,9 @@ function may be called several times for different starting points.
 The \fIcurrent_position\fR field contains the offset within the subject of the
 current match pointer.
 
-The \fIcapture_top\fR field contains the number of the highest captured
-substring so far.
+The \fIcapture_top\fR field contains one more than the number of the highest
+numbered captured substring so far. If no substrings have been captured,
+the value of \fIcapture_top\fR is one.
 
 The \fIcapture_last\fR field contains the number of the most recently captured
 substring.
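The corrected wording matters when a callout walks the captures made so far: the loop bound is capture_top itself, not capture_top + 1. A minimal sketch follows, assuming a cut-down stand-in structure (sketch_callout_view is hypothetical and carries only two of the fields a real callout block has) and assuming that unset groups are marked by a negative start offset:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down view of a callout block: only the fields used here. */
typedef struct {
  int  capture_top;     /* one more than the highest numbered capture so far */
  int *offset_vector;   /* start/end offset pairs, two ints per group */
} sketch_callout_view;

/* Count the capturing groups (numbered from 1) that are currently set.
   Assumption: an unset group has a negative start offset. */
static int sketch_count_set_captures(const sketch_callout_view *cb)
{
int n, count = 0;
assert(cb != NULL && cb->offset_vector != NULL);
for (n = 1; n < cb->capture_top; n++)
  if (cb->offset_vector[2*n] >= 0) count++;
return count;
}
```

With capture_top equal to one, the loop body never runs, which matches the "no substrings have been captured" case described above.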
diff --git a/doc/pcretest.1 b/doc/pcretest.1
index 76daaf3..2c4fb42 100644
--- a/doc/pcretest.1
+++ b/doc/pcretest.1
@@ -111,9 +111,10 @@ respectively. For example:
   /caseless/i
 
 These modifier letters have the same effect as they do in Perl. There are
-others which set PCRE options that do not correspond to anything in Perl:
-\fB/A\fR, \fB/E\fR, and \fB/X\fR set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and
-PCRE_EXTRA respectively.
+others that set PCRE options that do not correspond to anything in Perl:
+\fB/A\fR, \fB/E\fR, \fB/N\fR, \fB/U\fR, and \fB/X\fR set PCRE_ANCHORED,
+PCRE_DOLLAR_ENDONLY, PCRE_NO_AUTO_CAPTURE, PCRE_UNGREEDY, and PCRE_EXTRA
+respectively.
 
 Searching for all possible matches within each subject string can be requested
 by the \fB/g\fR or \fB/G\fR modifier. After finding a match, PCRE is called
@@ -180,6 +181,10 @@ provided that it was compiled with this support enabled. This modifier also
 causes any non-printing characters in output strings to be printed using the
 \\x{hh...} notation if they are valid UTF-8 sequences.
 
+If the \fB/?\fR modifier is used with \fB/8\fR, it causes \fBpcretest\fR to
+call \fBpcre_compile()\fR with the PCRE_NO_UTF8_CHECK option, to suppress the
+checking of the string for UTF-8 validity.
+
 .SH CALLOUTS
 .rs
 .sp
@@ -261,6 +266,8 @@ recognized:
                \fBpcre_exec()\fR to dd (any number of decimal
                digits)
   \\Z         pass the PCRE_NOTEOL option to \fBpcre_exec()\fR
+  \\?         pass the PCRE_NO_UTF8_CHECK option to
+               \fBpcre_exec()\fR
 
 If \\M is present, \fBpcretest\fR calls \fBpcre_exec()\fR several times, with
 different values in the \fImatch_limit\fR field of the \fBpcre_extra\fR data
@@ -351,6 +358,6 @@ University Computing Service,
 Cambridge CB2 3QG, England.
 
 .in 0
-Last updated: 03 February 2003
+Last updated: 20 August 2003
 .br
 Copyright (c) 1997-2003 University of Cambridge.
diff --git a/doc/pcretest.txt b/doc/pcretest.txt
index 80585af..4fa9ca4 100644
--- a/doc/pcretest.txt
+++ b/doc/pcretest.txt
@@ -119,10 +119,10 @@ PATTERN MODIFIERS
        /caseless/i
 
      These modifier letters have the same effect as  they  do  in
-     Perl.  There  are  others which set PCRE options that do not
-     correspond  to  anything  in  Perl:   /A,  /E,  and  /X  set
-     PCRE_ANCHORED,  PCRE_DOLLAR_ENDONLY,  and PCRE_EXTRA respec-
-     tively.
+     Perl.  There  are  others  that set PCRE options that do not
+     correspond to anything in Perl:  /A, /E, /N, /U, and /X  set
+     PCRE_ANCHORED,   PCRE_DOLLAR_ENDONLY,  PCRE_NO_AUTO_CAPTURE,
+     PCRE_UNGREEDY, and PCRE_EXTRA respectively.
 
      Searching for  all  possible  matches  within  each  subject
      string  can  be  requested  by  the /g or /G modifier. After
@@ -199,6 +199,10 @@ PATTERN MODIFIERS
      printing characters in output strings to  be  printed  using
      the \x{hh...} notation if they are valid UTF-8 sequences.
 
+     If the /? modifier is used with /8, it  causes  pcretest  to
+     call  pcre_compile()  with the PCRE_NO_UTF8_CHECK option, to
+     suppress the checking of the string for UTF-8 validity.
+
 
 CALLOUTS
 
@@ -255,12 +259,12 @@ DATA LINES
                     after a successful match (any decimal number
                     less than 32)
        \Cname     call pcre_copy_named_substring() for substring
+
                     "name" after a successful match (name termin-
                     ated by next non alphanumeric character)
        \C+        show the current captured substrings at callout
                     time
-
-       C-        do not supply a callout function
+       \C-        do not supply a callout function
        \C!n       return 1 instead of 0 when callout number n is
                     reached
        \C!n!m     return 1 instead of 0 when callout number n is
@@ -281,6 +285,8 @@ DATA LINES
                     pcre_exec() to dd (any number of decimal
                     digits)
        \Z         pass the PCRE_NOTEOL option to pcre_exec()
+       \?         pass the PCRE_NO_UTF8_CHECK option to
+                    pcre_exec()
 
      If \M is present, pcretest calls pcre_exec() several  times,
      with  different  values  in  the  match_limit  field  of the
@@ -306,7 +312,6 @@ DATA LINES
      API  to  be  used,  only  B,  and Z have any effect, causing
      REG_NOTBOL and REG_NOTEOL to be passed to regexec()  respec-
      tively.
-
      The use of \x{hh...} to represent UTF-8  characters  is  not
      dependent  on  the use of the /8 modifier on the pattern. It
      is recognized always. There may be any number of hexadecimal
@@ -378,5 +383,5 @@ AUTHOR
      University Computing Service,
      Cambridge CB2 3QG, England.
 
-Last updated: 03 February 2003
+Last updated: 20 August 2003
 Copyright (c) 1997-2003 University of Cambridge.
diff --git a/internal.h b/internal.h
index 973e7ee..92454a7 100644
--- a/internal.h
+++ b/internal.h
@@ -198,10 +198,10 @@ time, run time or study time, respectively. */
 #define PUBLIC_OPTIONS \
   (PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \
    PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
-   PCRE_NO_AUTO_CAPTURE)
+   PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK)
 
 #define PUBLIC_EXEC_OPTIONS \
-  (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY)
+  (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NO_UTF8_CHECK)
 
 #define PUBLIC_STUDY_OPTIONS 0   /* None defined */
 
@@ -526,6 +526,7 @@ just to accommodate the POSIX wrapper. */
 #define ERR41 "unrecognized character after (?P"
 #define ERR42 "syntax error after (?P"
 #define ERR43 "two named groups have the same name"
+#define ERR44 "invalid UTF-8 string"
 
 /* All character handling must be done as unsigned characters. Otherwise there
 are problems with top-bit-set characters and functions such as isspace().
diff --git a/pcre.c b/pcre.c
index 5da0f76..455782c 100644
--- a/pcre.c
+++ b/pcre.c
@@ -241,10 +241,16 @@ changed by the caller, but are shared between all threads. However, when
 compiling for Virtual Pascal, things are done differently (see pcre.in). */
 
 #ifndef VPCOMPAT
+#ifdef __cplusplus
+extern "C" void *(*pcre_malloc)(size_t) = malloc;
+extern "C" void  (*pcre_free)(void *) = free;
+extern "C" int   (*pcre_callout)(pcre_callout_block *) = NULL;
+#else
 void *(*pcre_malloc)(size_t) = malloc;
 void  (*pcre_free)(void *) = free;
 int   (*pcre_callout)(pcre_callout_block *) = NULL;
 #endif
+#endif
 
 
 /*************************************************
@@ -511,7 +517,7 @@ if (re == NULL || where == NULL) return PCRE_ERROR_NULL;
 if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC;
 
 if (extra_data != NULL && (extra_data->flags & PCRE_EXTRA_STUDY_DATA) != 0)
-  study = extra_data->study_data;
+  study = (const pcre_study_data *)extra_data->study_data;
 
 switch (what)
   {
@@ -592,11 +598,11 @@ pcre_config(int what, void *where)
 switch (what)
   {
   case PCRE_CONFIG_UTF8:
-  #ifdef SUPPORT_UTF8
+#ifdef SUPPORT_UTF8
   *((int *)where) = 1;
-  #else
+#else
   *((int *)where) = 0;
-  #endif
+#endif
   break;
 
   case PCRE_CONFIG_NEWLINE:
@@ -669,7 +675,6 @@ Arguments:
   bracount   number of previous extracting brackets
   options    the options bits
   isclass    TRUE if inside a character class
-  cd         pointer to char tables block
 
 Returns:     zero or positive => a data character
              negative => a special escape sequence
@@ -678,7 +683,7 @@ Returns:     zero or positive => a data character
 
 static int
 check_escape(const uschar **ptrptr, const char **errorptr, int bracount,
-  int options, BOOL isclass, compile_data *cd)
+  int options, BOOL isclass)
 {
 const uschar *ptr = *ptrptr;
 int c, i;
@@ -801,7 +806,8 @@ else
     c = 0;
     while (i++ < 2 && (digitab[ptr[1]] & ctype_xdigit) != 0)
       {
-      int cc = *(++ptr);
+      int cc;                               /* Some compilers don't like ++ */
+      cc = *(++ptr);                        /* in initializers */
       if (cc >= 'a') cc -= 32;              /* Convert to upper case */
       c = c * 16 + cc - ((cc < 'A')? '0' : ('A' - 10));
       }
@@ -858,13 +864,12 @@ where the ddds are digits.
 
 Arguments:
   p         pointer to the first char after '{'
-  cd        pointer to char tables block
 
 Returns:    TRUE or FALSE
 */
 
 static BOOL
-is_counted_repeat(const uschar *p, compile_data *cd)
+is_counted_repeat(const uschar *p)
 {
 if ((digitab[*p++] & ctype_digit) == 0) return FALSE;
 while ((digitab[*p] & ctype_digit) != 0) p++;
@@ -895,15 +900,13 @@ Arguments:
   maxp       pointer to int for max
              returned as -1 if no max
   errorptr   points to pointer to error message
-  cd         pointer to character tables clock
 
 Returns:     pointer to '}' on success;
              current ptr on error, with errorptr set
 */
 
 static const uschar *
-read_repeat_counts(const uschar *p, int *minp, int *maxp,
-  const char **errorptr, compile_data *cd)
+read_repeat_counts(const uschar *p, int *minp, int *maxp, const char **errorptr)
 {
 int min = 0;
 int max = -1;
@@ -1793,7 +1796,7 @@ for (;; ptr++)
 
       if (c == '\\')
         {
-        c = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd);
+        c = check_escape(&ptr, errorptr, *brackets, options, TRUE);
         if (-c == ESC_b) c = '\b';  /* \b is backspace in a class */
 
         if (-c == ESC_Q)            /* Handle start of quoted string */
@@ -1882,7 +1885,7 @@ for (;; ptr++)
         if (d == '\\')
           {
           const uschar *oldptr = ptr;
-          d = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd);
+          d = check_escape(&ptr, errorptr, *brackets, options, TRUE);
 
           /* \b is backspace; any other special means the '-' was literal */
 
@@ -2091,8 +2094,8 @@ for (;; ptr++)
     /* Various kinds of repeat */
 
     case '{':
-    if (!is_counted_repeat(ptr+1, cd)) goto NORMAL_CHAR;
-    ptr = read_repeat_counts(ptr+1, &repeat_min, &repeat_max, errorptr, cd);
+    if (!is_counted_repeat(ptr+1)) goto NORMAL_CHAR;
+    ptr = read_repeat_counts(ptr+1, &repeat_min, &repeat_max, errorptr);
     if (*errorptr != NULL) goto FAILED;
     goto REPEAT;
 
@@ -3039,7 +3042,7 @@ for (;; ptr++)
 
     case '\\':
     tempptr = ptr;
-    c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd);
+    c = check_escape(&ptr, errorptr, *brackets, options, FALSE);
 
     /* Handle metacharacters introduced by \. For ones like \d, the ESC_ values
     are arranged to be the negation of the corresponding OP_values. For the
@@ -3142,7 +3145,7 @@ for (;; ptr++)
       if (c == '\\')
         {
         tempptr = ptr;
-        c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd);
+        c = check_escape(&ptr, errorptr, *brackets, options, FALSE);
         if (c < 0) { ptr = tempptr; break; }
 
         /* If a character is > 127 in UTF-8 mode, we have to turn it into
@@ -3727,6 +3730,56 @@ return c;
 
 
 
+#ifdef SUPPORT_UTF8
+/*************************************************
+*         Validate a UTF-8 string                *
+*************************************************/
+
+/* This function is called (optionally) at the start of compile or match, to
+validate that a supposed UTF-8 string is actually valid. The early check means
+that subsequent code can assume it is dealing with a valid string. The check
+can be turned off for maximum performance, but the consequences of supplying
+an invalid string are then undefined.
+
+Arguments:
+  string       points to the string
+  length       length of string, or -1 if the string is zero-terminated
+
+Returns:       < 0    if the string is a valid UTF-8 string
+               >= 0   otherwise; the value is the offset of the bad byte
+*/
+
+static int
+valid_utf8(const uschar *string, int length)
+{
+register const uschar *p;
+
+if (length < 0)
+  {
+  for (p = string; *p != 0; p++);
+  length = p - string;
+  }
+
+for (p = string; length-- > 0; p++)
+  {
+  int ab;
+  if (*p < 128) continue;
+  if ((*p & 0xc0) != 0xc0) return p - string;
+  ab = utf8_table4[*p & 0x3f];  /* Number of additional bytes */
+  if (length < ab) return p - string;
+  while (ab-- > 0)
+    {
+    if ((*(++p) & 0xc0) != 0x80) return p - string;
+    length--;
+    }
+  }
+
+return -1;
+}
+#endif
+
+
+
 /*************************************************
 *        Compile a Regular Expression            *
 *************************************************/
@@ -3793,6 +3846,12 @@ if (erroroffset == NULL)
 
 #ifdef SUPPORT_UTF8
 utf8 = (options & PCRE_UTF8) != 0;
+if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0 &&
+     (*erroroffset = valid_utf8((uschar *)pattern, -1)) >= 0)
+  {
+  *errorptr = ERR44;
+  return NULL;
+  }
 #else
 if ((options & PCRE_UTF8) != 0)
   {
@@ -3874,7 +3933,7 @@ while ((c = *(++ptr)) != 0)
     case '\\':
       {
       const uschar *save_ptr = ptr;
-      c = check_escape(&ptr, errorptr, bracount, options, FALSE, &compile_block);
+      c = check_escape(&ptr, errorptr, bracount, options, FALSE);
       if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
       if (c >= 0)
         {
@@ -3910,9 +3969,9 @@ while ((c = *(++ptr)) != 0)
       if (refnum > compile_block.top_backref)
         compile_block.top_backref = refnum;
       length += 2;   /* For single back reference */
-      if (ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block))
+      if (ptr[1] == '{' && is_counted_repeat(ptr+2))
         {
-        ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block);
+        ptr = read_repeat_counts(ptr+2, &min, &max, errorptr);
         if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
         if ((min == 0 && (max == 1 || max == -1)) ||
           (min == 1 && max == -1))
@@ -3942,8 +4001,8 @@ while ((c = *(++ptr)) != 0)
     class, or back reference. */
 
     case '{':
-    if (!is_counted_repeat(ptr+1, &compile_block)) goto NORMAL_CHAR;
-    ptr = read_repeat_counts(ptr+1, &min, &max, errorptr, &compile_block);
+    if (!is_counted_repeat(ptr+1)) goto NORMAL_CHAR;
+    ptr = read_repeat_counts(ptr+1, &min, &max, errorptr);
     if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
 
     /* These special cases just insert one extra opcode */
@@ -4039,8 +4098,7 @@ while ((c = *(++ptr)) != 0)
 #ifdef SUPPORT_UTF8
         int prevchar = ptr[-1];
 #endif
-        int ch = check_escape(&ptr, errorptr, bracount, options, TRUE,
-          &compile_block);
+        int ch = check_escape(&ptr, errorptr, bracount, options, TRUE);
         if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
 
         /* \b is backspace inside a class */
@@ -4151,9 +4209,9 @@ while ((c = *(++ptr)) != 0)
 
       /* A repeat needs either 1 or 5 bytes. */
 
-      if (*ptr != 0 && ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block))
+      if (*ptr != 0 && ptr[1] == '{' && is_counted_repeat(ptr+2))
         {
-        ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block);
+        ptr = read_repeat_counts(ptr+2, &min, &max, errorptr);
         if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
         if ((min == 0 && (max == 1 || max == -1)) ||
           (min == 1 && max == -1))
@@ -4505,9 +4563,9 @@ while ((c = *(++ptr)) != 0)
     /* Leave ptr at the final char; for read_repeat_counts this happens
     automatically; for the others we need an increment. */
 
-    if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2, &compile_block))
+    if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2))
       {
-      ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block);
+      ptr = read_repeat_counts(ptr+2, &min, &max, errorptr);
       if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
       }
     else if (c == '*') { min = 0; max = -1; ptr++; }
@@ -4596,8 +4654,7 @@ while ((c = *(++ptr)) != 0)
       if (c == '\\')
         {
         const uschar *saveptr = ptr;
-        c = check_escape(&ptr, errorptr, bracount, options, FALSE,
-          &compile_block);
+        c = check_escape(&ptr, errorptr, bracount, options, FALSE);
         if (*errorptr != NULL) goto PCRE_ERROR_RETURN;
         if (c < 0) { ptr = saveptr; break; }
 
@@ -7307,7 +7364,7 @@ if (extra_data != NULL)
   {
   register unsigned int flags = extra_data->flags;
   if ((flags & PCRE_EXTRA_STUDY_DATA) != 0)
-    study = extra_data->study_data;
+    study = (const pcre_study_data *)extra_data->study_data;
   if ((flags & PCRE_EXTRA_MATCH_LIMIT) != 0)
     match_block.match_limit = extra_data->match_limit;
   if ((flags & PCRE_EXTRA_CALLOUT_DATA) != 0)
@@ -7340,6 +7397,15 @@ match_block.recursive = NULL;                   /* No recursion at top level */
 match_block.lcc = re->tables + lcc_offset;
 match_block.ctypes = re->tables + ctypes_offset;
 
+/* Check a UTF-8 string if required. Unfortunately there's no way of passing
+back the character offset. */
+
+#ifdef SUPPORT_UTF8
+if (match_block.utf8 && (options & PCRE_NO_UTF8_CHECK) == 0 &&
+    valid_utf8((uschar *)subject, length) >= 0)
+  return PCRE_ERROR_BADUTF8;
+#endif
+
 /* The ims options can vary during the matching as a result of the presence
 of (?ims) items in the pattern. They are kept in a local variable so that
 restoring at the exit of a group is easy. */
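The validity check added to pcre.c in this patch is purely structural: a lead byte must announce some number of continuation bytes, and each of those must have the form 10xxxxxx. The stand-alone sketch below follows the same shape. The names sketch_valid_utf8 and extra_bytes are hypothetical, and extra_bytes is an assumed replacement for the utf8_table4 lookup, whose contents are not shown in this diff. Like the patched function, it makes no attempt to reject overlong encodings.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the table lookup in the patch: the number of
   continuation bytes implied by a lead byte whose top two bits are set. */
static int extra_bytes(unsigned char lead)
{
if (lead < 0xe0) return 1;   /* 110xxxxx */
if (lead < 0xf0) return 2;   /* 1110xxxx */
if (lead < 0xf8) return 3;   /* 11110xxx */
if (lead < 0xfc) return 4;   /* 111110xx (old-style 5-byte form) */
return 5;                    /* 1111110x (old-style 6-byte form) */
}

/* Returns -1 if the buffer is structurally valid, otherwise the offset of
   the first offending byte. A negative length means "zero-terminated". */
static int sketch_valid_utf8(const unsigned char *s, int length)
{
int i;
assert(s != NULL);
if (length < 0) length = (int)strlen((const char *)s);
for (i = 0; i < length; i++)
  {
  int ab;
  if (s[i] < 128) continue;                /* plain ASCII */
  if ((s[i] & 0xc0) != 0xc0) return i;     /* stray continuation byte */
  ab = extra_bytes(s[i]);
  if (length - i - 1 < ab) return i;       /* sequence is truncated */
  while (ab-- > 0)
    if ((s[++i] & 0xc0) != 0x80) return i; /* bad continuation byte */
  }
return -1;
}
```

A caller that has already validated its input once could skip a check like this on later calls, which is the trade-off PCRE_NO_UTF8_CHECK exposes.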
diff --git a/pcre.in b/pcre.in
index 2aa44b9..7b5b209 100644
--- a/pcre.in
+++ b/pcre.in
@@ -23,7 +23,7 @@ make changes to pcre.in. */
 #    endif
 #  else
 #    ifndef PCRE_STATIC
-#      define PCRE_DATA_SCOPE __declspec(dllimport)
+#      define PCRE_DATA_SCOPE extern __declspec(dllimport)
 #    endif
 #  endif
 #endif
@@ -57,6 +57,7 @@ extern "C" {
 #define PCRE_NOTEMPTY           0x0400
 #define PCRE_UTF8               0x0800
 #define PCRE_NO_AUTO_CAPTURE    0x1000
+#define PCRE_NO_UTF8_CHECK      0x2000
 
 /* Exec-time and get/set-time error codes */
 
@@ -69,6 +70,7 @@ extern "C" {
 #define PCRE_ERROR_NOSUBSTRING    (-7)
 #define PCRE_ERROR_MATCHLIMIT     (-8)
 #define PCRE_ERROR_CALLOUT        (-9)  /* Never used by PCRE itself */
+#define PCRE_ERROR_BADUTF8       (-10)
 
 /* Request types for pcre_fullinfo() */
 
diff --git a/pcregrep.c b/pcregrep.c
index f4a59f4..7a06993 100644
--- a/pcregrep.c
+++ b/pcregrep.c
@@ -545,8 +545,8 @@ for (i = 1; i < argc; i++)
     }
   }
 
-pattern_list = malloc(MAX_PATTERN_COUNT * sizeof(pcre *));
-hints_list = malloc(MAX_PATTERN_COUNT * sizeof(pcre_extra *));
+pattern_list = (pcre **)malloc(MAX_PATTERN_COUNT * sizeof(pcre *));
+hints_list = (pcre_extra **)malloc(MAX_PATTERN_COUNT * sizeof(pcre_extra *));
 
 if (pattern_list == NULL || hints_list == NULL)
   {
diff --git a/pcreposix.c b/pcreposix.c
index 49094f2..6152a15 100644
--- a/pcreposix.c
+++ b/pcreposix.c
@@ -48,7 +48,7 @@ static const char *estring[] = {
   ERR11, ERR12, ERR13, ERR14, ERR15, ERR16, ERR17, ERR18, ERR19, ERR20,
   ERR21, ERR22, ERR23, ERR24, ERR25, ERR26, ERR27, ERR28, ERR29, ERR30,
   ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39, ERR40,
-  ERR41, ERR42, ERR43 };
+  ERR41, ERR42, ERR43, ERR44 };
 
 static int eint[] = {
   REG_EESCAPE, /* "\\ at end of pattern" */
@@ -93,7 +93,8 @@ static int eint[] = {
   REG_BADPAT,  /* "recursive call could loop indefinitely" */
   REG_BADPAT,  /* "unrecognized character after (?P" */
   REG_BADPAT,  /* "syntax error after (?P" */
-  REG_BADPAT   /* "two named groups have the same name" */
+  REG_BADPAT,  /* "two named groups have the same name" */
+  REG_BADPAT   /* "invalid UTF-8 string" */
 };
 
 /* Table of texts corresponding to POSIX error codes */
@@ -217,7 +218,7 @@ preg->re_erroffset = erroffset;
 
 if (preg->re_pcre == NULL) return pcre_posix_error_code(errorptr);
 
-preg->re_nsub = pcre_info(preg->re_pcre, NULL, NULL);
+preg->re_nsub = pcre_info((const pcre *)preg->re_pcre, NULL, NULL);
 return 0;
 }
 
@@ -264,8 +265,8 @@ if (nmatch > 0)
     }
   }
 
-rc = pcre_exec(preg->re_pcre, NULL, string, (int)strlen(string), 0, options,
-  ovector, nmatch * 3);
+rc = pcre_exec((const pcre *)preg->re_pcre, NULL, string, (int)strlen(string),
+  0, options, ovector, nmatch * 3);
 
 if (rc == 0) rc = nmatch;    /* All captured slots were filled in */
 
diff --git a/pcretest.c b/pcretest.c
index ad729b7..24196ac 100644
--- a/pcretest.c
+++ b/pcretest.c
@@ -52,7 +52,6 @@ static int use_utf8;
 static size_t gotten_store;
 
 
-
 static const int utf8_table1[] = {
   0x0000007f, 0x000007ff, 0x0000ffff, 0x001fffff, 0x03ffffff, 0x7fffffff};
 
@@ -321,13 +320,16 @@ if (post_start > 0)
   }
 
 fprintf(outfile, "\n");
-
 first_callout = 0;
 
-if ((int)(cb->callout_data) != 0)
+if (cb->callout_data != NULL)
   {
-  fprintf(outfile, "Callout data = %d\n", (int)(cb->callout_data));
-  return (int)(cb->callout_data);
+  int callout_data = *((int *)(cb->callout_data));
+  if (callout_data != 0)
+    {
+    fprintf(outfile, "Callout data = %d\n", callout_data);
+    return callout_data;
+    }
   }
 
 return (cb->callout_number != callout_fail_id)? 0 :
@@ -397,8 +399,8 @@ unsigned char *dbuffer;
 /* Get buffers from malloc() so that Electric Fence will check their misuse
 when I am debugging. */
 
-buffer = malloc(BUFFER_SIZE);
-dbuffer = malloc(DBUFFER_SIZE);
+buffer = (unsigned char *)malloc(BUFFER_SIZE);
+dbuffer = (unsigned char *)malloc(DBUFFER_SIZE);
 
 /* Static so that new_malloc can use it. */
 
@@ -464,7 +466,7 @@ while (argc > 1 && argv[op][0] == '-')
 /* Get the store for the offsets vector, and remember what it was */
 
 size_offsets_max = size_offsets;
-offsets = malloc(size_offsets_max * sizeof(int));
+offsets = (int *)malloc(size_offsets_max * sizeof(int));
 if (offsets == NULL)
   {
   printf("** Failed to get %d bytes of memory for offsets vector\n",
@@ -619,6 +621,7 @@ while (!done)
       case 'U': options |= PCRE_UNGREEDY; break;
       case 'X': options |= PCRE_EXTRA; break;
       case '8': options |= PCRE_UTF8; use_utf8 = 1; break;
+      case '?': options |= PCRE_NO_UTF8_CHECK; break;
 
       case 'L':
       ppp = pp;
@@ -787,7 +790,7 @@ while (!done)
         }
 
       if (get_options == 0) fprintf(outfile, "No options\n");
-        else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s%s\n",
+        else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s%s%s\n",
           ((get_options & PCRE_ANCHORED) != 0)? " anchored" : "",
           ((get_options & PCRE_CASELESS) != 0)? " caseless" : "",
           ((get_options & PCRE_EXTENDED) != 0)? " extended" : "",
@@ -796,7 +799,8 @@ while (!done)
           ((get_options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
           ((get_options & PCRE_EXTRA) != 0)? " extra" : "",
           ((get_options & PCRE_UNGREEDY) != 0)? " ungreedy" : "",
-          ((get_options & PCRE_UTF8) != 0)? " utf8" : "");
+          ((get_options & PCRE_UTF8) != 0)? " utf8" : "",
+          ((get_options & PCRE_NO_UTF8_CHECK) != 0)? " no_utf8_check" : "");
 
       if (((((real_pcre *)re)->options) & PCRE_ICHANGED) != 0)
         fprintf(outfile, "Case state changes\n");
@@ -861,13 +865,17 @@ while (!done)
       else if (extra == NULL)
         fprintf(outfile, "Study returned NULL\n");
 
+      /* Don't output study size; at present it is in any case a fixed
+      value, but it varies, depending on the computer architecture, and
+      so messes up the test suite. */
+
       else if (do_showinfo)
         {
         size_t size;
         uschar *start_bits = NULL;
         new_info(re, extra, PCRE_INFO_STUDYSIZE, &size);
         new_info(re, extra, PCRE_INFO_FIRSTTABLE, &start_bits);
-        fprintf(outfile, "Study size = %d\n", size);
+        /* fprintf(outfile, "Study size = %d\n", size); */
         if (start_bits == NULL)
           fprintf(outfile, "No starting character set\n");
         else
@@ -1105,7 +1113,7 @@ while (!done)
           {
           size_offsets_max = n;
           free(offsets);
-          use_offsets = offsets = malloc(size_offsets_max * sizeof(int));
+          use_offsets = offsets = (int *)malloc(size_offsets_max * sizeof(int));
           if (offsets == NULL)
             {
             printf("** Failed to get %d bytes of memory for offsets vector\n",
@@ -1120,6 +1128,10 @@ while (!done)
         case 'Z':
         options |= PCRE_NOTEOL;
         continue;
+
+        case '?':
+        options |= PCRE_NO_UTF8_CHECK;
+        continue;
         }
       *q++ = c;
       }
@@ -1136,7 +1148,7 @@ while (!done)
       int eflags = 0;
       regmatch_t *pmatch = NULL;
       if (use_size_offsets > 0)
-        pmatch = malloc(sizeof(regmatch_t) * use_size_offsets);
+        pmatch = (regmatch_t *)malloc(sizeof(regmatch_t) * use_size_offsets);
       if ((options & PCRE_NOTBOL) != 0) eflags |= REG_NOTBOL;
       if ((options & PCRE_NOTEOL) != 0) eflags |= REG_NOTEOL;
 
@@ -1203,7 +1215,7 @@ while (!done)
 
         if (extra == NULL)
           {
-          extra = malloc(sizeof(pcre_extra));
+          extra = (pcre_extra *)malloc(sizeof(pcre_extra));
           extra->flags = 0;
           }
         extra->flags |= PCRE_EXTRA_MATCH_LIMIT;
@@ -1242,11 +1254,11 @@ while (!done)
         {
         if (extra == NULL)
           {
-          extra = malloc(sizeof(pcre_extra));
+          extra = (pcre_extra *)malloc(sizeof(pcre_extra));
           extra->flags = 0;
           }
         extra->flags |= PCRE_EXTRA_CALLOUT_DATA;
-        extra->callout_data = (void *)callout_data;
+        extra->callout_data = &callout_data;
         count = pcre_exec(re, extra, (char *)bptr, len, start_offset,
           options | g_notempty, use_offsets, use_size_offsets);
         extra->flags &= ~PCRE_EXTRA_CALLOUT_DATA;
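The pcretest.c change above stops squeezing an int through a void * and passes the address of the int instead, so that the callout dereferences a real pointer. A minimal sketch of the receiving side, assuming a hypothetical helper name (sketch_read_callout_data is not part of pcretest):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: recover an int passed by address through a void *
   user-data slot. A NULL pointer is treated as "no data" and yields 0. */
static int sketch_read_callout_data(void *callout_data)
{
if (callout_data == NULL) return 0;
return *((int *)callout_data);
}
```

The pointed-to int has to outlive the call that uses it; in the patch it is a variable that remains in scope across the pcre_exec() call.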
diff --git a/study.c b/study.c
index 4320bd2..5f0f196 100644
--- a/study.c
+++ b/study.c
@@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
 
 Written by: Philip Hazel <ph10@cam.ac.uk>
 
-           Copyright (c) 1997-2002 University of Cambridge
+           Copyright (c) 1997-2003 University of Cambridge
 
 -----------------------------------------------------------------------------
 Permission is granted to anyone to use this software for any purpose on any
@@ -297,19 +297,50 @@ do
       /* Character class where all the information is in a bit map: set the
       bits and either carry on or not, according to the repeat count. If it was
       a negative class, and we are operating with UTF-8 characters, any byte
-      with the top-bit set is a potentially valid starter because it may start
-      a character with a value > 255. (This is sub-optimal in that the
-      character may be in the range 128-255, and those characters might be
-      unwanted, but that's as far as we go for the moment.) */
+      with a value >= 0xc4 is a potentially valid starter because it starts a
+      character with a value > 255. */
 
       case OP_NCLASS:
-      if (utf8) memset(start_bits+16, 0xff, 16);
+      if (utf8)
+        {
+        start_bits[24] |= 0xf0;              /* Bits for 0xc4 - 0xc7 */
+        memset(start_bits+25, 0xff, 7);      /* Bits for 0xc8 - 0xff */
+        }
       /* Fall through */
 
       case OP_CLASS:
         {
         tcode++;
-        for (c = 0; c < 32; c++) start_bits[c] |= tcode[c];
+
+        /* In UTF-8 mode, the bits in a bit map correspond to character
+        values, not to byte values. However, the bit map we are constructing is
+        for byte values. So we have to do a conversion for characters whose
+        value is > 127. In fact, there are only two possible starting bytes for
+        characters in the range 128 - 255. */
+
+        if (utf8)
+          {
+          for (c = 0; c < 16; c++) start_bits[c] |= tcode[c];
+          for (c = 128; c < 256; c++)
+            {
+            if ((tcode[c/8] & (1 << (c&7))) != 0)
+              {
+              int d = (c >> 6) | 0xc0;            /* Set bit for this starter */
+              start_bits[d/8] |= (1 << (d&7));    /* and then skip on to the */
+              c = (c & 0xc0) + 0x40 - 1;          /* next relevant character. */
+              }
+            }
+          }
+
+        /* In non-UTF-8 mode, the two bit maps are completely compatible. */
+
+        else
+          {
+          for (c = 0; c < 32; c++) start_bits[c] |= tcode[c];
+          }
+
+        /* Advance past the bit map, and act on what follows */
+
         tcode += 32;
         switch (*tcode)
           {
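The class handling added to study.c converts a bitmap indexed by character value into one indexed by starting byte. Characters 0 to 127 map one-to-one, while characters 128 to 255 can only begin with the byte 0xc2 or 0xc3. A stand-alone sketch of that conversion, assuming both maps are 32-byte arrays and using a hypothetical function name:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: merge a class bitmap (indexed by character value)
   into a start-byte bitmap (indexed by byte value) for UTF-8 subjects. */
static void sketch_class_to_start_bits(const unsigned char *class_bits,
  unsigned char *start_bits)
{
int c;
assert(class_bits != NULL && start_bits != NULL);
for (c = 0; c < 16; c++) start_bits[c] |= class_bits[c];  /* 0-127 map 1:1 */
for (c = 128; c < 256; c++)
  {
  if ((class_bits[c/8] & (1 << (c&7))) != 0)   /* bitwise test of one bit */
    {
    int d = (c >> 6) | 0xc0;       /* 0xc2 for 128-191, 0xc3 for 192-255 */
    start_bits[d/8] |= (unsigned char)(1 << (d&7));
    c = (c & 0xc0) + 0x40 - 1;     /* skip the rest of this block of 64 */
    }
  }
}
```

For a class containing only U+00D6, the sketch sets the single start bit for byte 0xc3 and leaves the bit for byte 0xd6 clear.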
diff --git a/testdata/testinput5 b/testdata/testinput5
index 9f07d6e..b82cee0 100644
--- a/testdata/testinput5
+++ b/testdata/testinput5
@@ -192,4 +192,34 @@
 
 /[^\xff]/8D
 
+/[Ä-Ü]/8
+    Ö # Matches without Study
+    \x{d6}
+    
+/[Ä-Ü]/8S
+    Ö <-- Same with Study
+    \x{d6}
+    
+/[\x{c4}-\x{dc}]/8 
+    Ö # Matches without Study
+    \x{d6} 
+
+/[\x{c4}-\x{dc}]/8S
+    Ö <-- Same with Study
+    \x{d6} 
+
+/[�]/8
+
+/�/8
+
+/���xxx/8
+
+/���xxx/8?D
+
+/abc/8
+   �]
+   �
+   ���
+   ���\?
+
 / End of testinput5 /
diff --git a/testdata/testoutput1 b/testdata/testoutput1
index 63214b7..8a7e6e6 100644
--- a/testdata/testoutput1
+++ b/testdata/testoutput1
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
 
 /the quick brown fox/
     the quick brown fox
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index 22a345b..95b84c6 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
 
 /(a)b|/
 Capturing subpattern count = 1
@@ -136,7 +136,6 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: c d e 
     this sentence eventually mentions a cat
  0: cat
@@ -148,7 +147,6 @@ Capturing subpattern count = 0
 Options: caseless
 No first char
 No need char
-Study size = 40
 Starting character set: C D E c d e 
     this sentence eventually mentions a CAT cat
  0: CAT
@@ -160,7 +158,6 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: a b c d 
 
 /(a|[^\dZ])/S
@@ -168,7 +165,6 @@ Capturing subpattern count = 1
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a 
   \x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 
   \x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / : ; < = > 
@@ -189,7 +185,6 @@ Capturing subpattern count = 1
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: \x09 \x0a \x0c \x0d \x20 a b 
 
 /(ab\2)/
@@ -524,7 +519,6 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: a b c d 
 
 /(?i)[abcd]/S
@@ -532,7 +526,6 @@ Capturing subpattern count = 0
 Options: caseless
 No first char
 No need char
-Study size = 40
 Starting character set: A B C D a b c d 
 
 /(?m)[xy]|(b|c)/S
@@ -540,7 +533,6 @@ Capturing subpattern count = 1
 Options: multiline
 No first char
 No need char
-Study size = 40
 Starting character set: b c x y 
 
 /(^a|^b)/m
@@ -612,7 +604,6 @@ No options
 Case state changes
 No first char
 No need char
-Study size = 40
 Starting character set: C a b c d 
 
 /a$/
@@ -677,7 +668,6 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: a b 
 
 /(?<!foo)(alpha|omega)/S
@@ -685,7 +675,6 @@ Capturing subpattern count = 1
 No options
 No first char
 Need char = 'a'
-Study size = 40
 Starting character set: a o 
 
 /(?!alphabet)[ab]/S
@@ -693,7 +682,6 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: a b 
 
 /(?<=foo\n)^bar/m
@@ -3378,7 +3366,6 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: a b 
 
 /[^a]/I
@@ -3398,7 +3385,6 @@ Capturing subpattern count = 0
 No options
 No first char
 Need char = '6'
-Study size = 40
 Starting character set: 0 1 2 3 4 5 6 7 8 9 
 
 /a^b/I
@@ -3432,7 +3418,6 @@ Capturing subpattern count = 0
 Options: caseless
 No first char
 No need char
-Study size = 40
 Starting character set: A B a b 
 
 /[ab](?i)cd/IS
@@ -3441,7 +3426,6 @@ No options
 Case state changes
 No first char
 Need char = 'd' (caseless)
-Study size = 40
 Starting character set: a b 
 
 /abc(?C)def/
@@ -3742,7 +3726,6 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: a b 
 
 /(?R)/
diff --git a/testdata/testoutput3 b/testdata/testoutput3
index 5dac092..77c50b1 100644
--- a/testdata/testoutput3
+++ b/testdata/testoutput3
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
 
 /^[\w]+/
     *** Failers
@@ -85,7 +85,6 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P 
   Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z 
 
@@ -94,7 +93,6 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Study size = 40
 Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P 
   Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z 
   � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � 
diff --git a/testdata/testoutput4 b/testdata/testoutput4
index 312cfbe..a85c37b 100644
--- a/testdata/testoutput4
+++ b/testdata/testoutput4
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
 
 /-- Do not use the \x{} construct except with patterns that have the --/
 /-- /8 option set, because PCRE doesn't recognize them as UTF-8 unless --/
diff --git a/testdata/testoutput5 b/testdata/testoutput5
index b681214..6d0e89a 100644
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@@ -1,4 +1,4 @@
-PCRE version 4.3 21-May-2003
+PCRE version 4.4 21-August-2003
 
 /\x{100}/8DM
 Memory allocation (code space): 11
@@ -402,21 +402,16 @@ Capturing subpattern count = 0
 Options: utf8
 No first char
 No need char
-Study size = 40
 Starting character set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a 
   \x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 
   \x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 
   5 6 7 8 9 : ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y 
   Z [ \ ] ^ _ ` c d e f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f 
-  \x80 \x81 \x82 \x83 \x84 \x85 \x86 \x87 \x88 \x89 \x8a \x8b \x8c \x8d \x8e 
-  \x8f \x90 \x91 \x92 \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9a \x9b \x9c \x9d 
-  \x9e \x9f \xa0 \xa1 \xa2 \xa3 \xa4 \xa5 \xa6 \xa7 \xa8 \xa9 \xaa \xab \xac 
-  \xad \xae \xaf \xb0 \xb1 \xb2 \xb3 \xb4 \xb5 \xb6 \xb7 \xb8 \xb9 \xba \xbb 
-  \xbc \xbd \xbe \xbf \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca 
-  \xcb \xcc \xcd \xce \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 
-  \xda \xdb \xdc \xdd \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 
-  \xe9 \xea \xeb \xec \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 
-  \xf8 \xf9 \xfa \xfb \xfc \xfd \xfe \xff 
+  \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc \xcd \xce \xcf \xd0 
+  \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb \xdc \xdd \xde \xdf 
+  \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea \xeb \xec \xed \xee 
+  \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9 \xfa \xfb \xfc \xfd 
+  \xfe \xff 
     \x{f1}
  0: \x{f1}
     \x{bf}
@@ -463,7 +458,6 @@ Capturing subpattern count = 1
 Options: utf8
 No first char
 No need char
-Study size = 40
 Starting character set: x \xc4 
 
 /(\x{100}*a|x)/8SD
@@ -482,7 +476,6 @@ Capturing subpattern count = 1
 Options: utf8
 No first char
 No need char
-Study size = 40
 Starting character set: a x \xc4 
 
 /(\x{100}{0,2}a|x)/8SD
@@ -501,7 +494,6 @@ Capturing subpattern count = 1
 Options: utf8
 No first char
 No need char
-Study size = 40
 Starting character set: a x \xc4 
 
 /(\x{100}{1,2}a|x)/8SD
@@ -521,7 +513,6 @@ Capturing subpattern count = 1
 Options: utf8
 No first char
 No need char
-Study size = 40
 Starting character set: x \xc4 
 
 /\x{100}*(\d+|"(?1)")/8
@@ -826,5 +817,60 @@ Options: utf8
 No first char
 No need char
 
+/[Ä-Ü]/8
+    Ö # Matches without Study
+ 0: \x{d6}
+    \x{d6}
+ 0: \x{d6}
+    
+/[Ä-Ü]/8S
+    Ö <-- Same with Study
+ 0: \x{d6}
+    \x{d6}
+ 0: \x{d6}
+    
+/[\x{c4}-\x{dc}]/8 
+    Ö # Matches without Study
+ 0: \x{d6}
+    \x{d6} 
+ 0: \x{d6}
+
+/[\x{c4}-\x{dc}]/8S
+    Ö <-- Same with Study
+ 0: \x{d6}
+    \x{d6} 
+ 0: \x{d6}
+
+/[�]/8
+Failed: invalid UTF-8 string at offset 2
+
+/�/8
+Failed: invalid UTF-8 string at offset 0
+
+/���xxx/8
+Failed: invalid UTF-8 string at offset 1
+
+/���xxx/8?D
+------------------------------------------------------------------
+  0  11 Bra 0
+  3   6 \x{c3}\x{f8}xx
+ 11  11 Ket
+ 14     End
+------------------------------------------------------------------
+Capturing subpattern count = 0
+Options: utf8 no_utf8_check
+First char = 195
+Need char = 'x'
+
+/abc/8
+   �]
+Error -10
+   �
+Error -10
+   ���
+Error -10
+   ���\?
+No match
+
 / End of testinput5 /