Support (*UTF) in all libraries.

git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1219 2f5784b3-3f2a-0410-8824-cb99058d5e15
author: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2012-11-11 18:04:37 +0000
committer: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2012-11-11 18:04:37 +0000
commit: 1b1c6e22c899fa0586d368112f418588be1e1ac8 (patch)
tree: 932c40b78f8f0cb216a73342fe55fbbf84c12eeb
parent: f1660e461ec9b5150d4b59c8423261dab447f165 (diff)
download: pcre-1b1c6e22c899fa0586d368112f418588be1e1ac8.tar.gz
12 files changed, 77 insertions, 52 deletions
diff --git a/ChangeLog b/ChangeLog
index a0cdd9c..a117b0d 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -159,6 +159,8 @@ Version 8.32 11-November-2012
 34. If PCRE is built with --enable-valgrind, certain memory regions are marked
     unaddressable using valgrind annotations, allowing valgrind to detect 
     invalid memory accesses. This is mainly for the benefit of the developers.
+    
+25. (*UTF) can now be used to start a pattern in any of the three libraries. 
 
 
 Version 8.31 06-July-2012
diff --git a/doc/pcre.3 b/doc/pcre.3
index c95ee6c..73e8ee6 100644
--- a/doc/pcre.3
+++ b/doc/pcre.3
@@ -1,4 +1,4 @@
-.TH PCRE 3 "30 October 2012" "PCRE 8.32"
+.TH PCRE 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH INTRODUCTION
@@ -114,10 +114,12 @@ If you are using PCRE in a non-UTF application that permits users to supply
 arbitrary patterns for compilation, you should be aware of a feature that
 allows users to turn on UTF support from within a pattern, provided that PCRE
 was built with UTF support. For example, an 8-bit pattern that begins with
-"(*UTF8)" turns on UTF-8 mode. This causes both the pattern and any data
-against which it is matched to be checked for UTF-8 validity. If the data
-string is very long, such a check might use sufficiently many resources as to
-cause your application to lose performance.
+"(*UTF8)" or "(*UTF)" turns on UTF-8 mode, which interprets patterns and 
+subjects as strings of UTF-8 characters instead of individual 8-bit characters.
+This causes both the pattern and any data against which it is matched to be
+checked for UTF-8 validity. If the data string is very long, such a check might
+use sufficiently many resources as to cause your application to lose
+performance.
 .P
 The best way of guarding against this possibility is to use the
 \fBpcre_fullinfo()\fP function to check the compiled pattern's options for UTF.
@@ -195,6 +197,6 @@ two digits 10, at the domain cam.ac.uk.
 .rs
 .sp
 .nf
-Last updated: 30 October 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 46f780d..991c52d 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "07 November 2012" "PCRE 8.32"
+.TH PCREPATTERN 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -22,17 +22,19 @@ description of PCRE's regular expressions is intended as reference material.
 .P
 The original operation of PCRE was on strings of one-byte characters. However,
 there is now also support for UTF-8 strings in the original library, an
-extra library that supports 16-bit and UTF-16 character strings, and an
-extra library that supports 32-bit and UTF-32 character strings. To use these
+extra library that supports 16-bit and UTF-16 character strings, and a
+third library that supports 32-bit and UTF-32 character strings. To use these
 features, PCRE must be built to include appropriate support. When using UTF
 strings you must either call the compiling function with the PCRE_UTF8,
-PCRE_UTF16 or PCRE_UTF32 option, or the pattern must start with one of
+PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of
 these special sequences:
 .sp
   (*UTF8)
   (*UTF16)
   (*UTF32)
+  (*UTF) 
 .sp
+(*UTF) is a generic sequence that can be used with any of the libraries.
 Starting a pattern with such a sequence is equivalent to setting the relevant
 option. This feature is not Perl-compatible. How setting a UTF mode affects
 pattern matching is mentioned in several places below. There is also a summary
@@ -43,7 +45,7 @@ of features in the
 page.
 .P
 Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8) or (*UTF16) or (*UTF32) is:
+combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
 .sp
   (*UCP)
 .sp
@@ -573,10 +575,10 @@ change of newline convention; for example, a pattern can start with:
 .sp
   (*ANY)(*BSR_ANYCRLF)
 .sp
-They can also be combined with the (*UTF8), (*UTF16), (*UTF32) or (*UCP) special
-sequences. Inside a character class, \eR is treated as an unrecognized escape
-sequence, and so matches the letter "R" by default, but causes an error if
-PCRE_EXTRA is set.
+They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or
+(*UCP) special sequences. Inside a character class, \eR is treated as an
+unrecognized escape sequence, and so matches the letter "R" by default, but
+causes an error if PCRE_EXTRA is set.
 .
 .
 .\" HTML <a name="uniextseq"></a>
@@ -1349,10 +1351,11 @@ the section entitled
 .\" </a>
 "Newline sequences"
 .\"
-above. There are also the (*UTF8), (*UTF16),(*UTF32) and (*UCP) leading
+above. There are also the (*UTF8), (*UTF16),(*UTF32), and (*UCP) leading
 sequences that can be used to set UTF and Unicode property modes; they are
 equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
-options, respectively.
+options, respectively. The (*UTF) sequence is a generic version that can be
+used with any of the libraries.
 .
 .
 .\" HTML <a name="subpattern"></a>
@@ -2975,6 +2978,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 07 November 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi
diff --git a/doc/pcresyntax.3 b/doc/pcresyntax.3
index f634d4b..868f427 100644
--- a/doc/pcresyntax.3
+++ b/doc/pcresyntax.3
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "24 June 2012" "PCRE 8.30"
+.TH PCRESYNTAX 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -349,6 +349,7 @@ newline-setting options with similar syntax:
   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
   (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
+  (*UTF)          set appropriate UTF mode for the library in use
   (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
 .
 .
@@ -490,6 +491,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 25 August 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi
diff --git a/doc/pcreunicode.3 b/doc/pcreunicode.3
index 79b31bd..3faaa70 100644
--- a/doc/pcreunicode.3
+++ b/doc/pcreunicode.3
@@ -1,4 +1,4 @@
-.TH PCREUNICODE 3 "08 November 2012" "PCRE 8.32"
+.TH PCREUNICODE 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT"
@@ -18,9 +18,9 @@ support, and, in addition, you must call
 \fBpcre_compile()\fP
 .\"
 with the PCRE_UTF8 option flag, or the pattern must start with the sequence
-(*UTF8). When either of these is the case, both the pattern and any subject
-strings that are matched against it are treated as UTF-8 strings instead of
-strings of individual 1-byte characters.
+(*UTF8) or (*UTF). When either of these is the case, both the pattern and any
+subject strings that are matched against it are treated as UTF-8 strings
+instead of strings of individual 1-byte characters.
 .
 .
 .SH "UTF-16 AND UTF-32 SUPPORT"
@@ -36,10 +36,11 @@ or
 \fBpcre32_compile()\fP
 .\"
 with the PCRE_UTF16 or PCRE_UTF32 option flag, as appropriate. Alternatively,
-the pattern must start with the sequence (*UTF16) or (*UTF32). When UTF mode is 
-set, both the pattern and any subject strings that are matched against it are
-treated as UTF-16 or UTF-32 strings instead of strings of individual 16-bit or
-32-bit characters.
+the pattern must start with the sequence (*UTF16), (*UTF32), as appropriate, or
+(*UTF), which can be used with either library. When UTF mode is set, both the
+pattern and any subject strings that are matched against it are treated as
+UTF-16 or UTF-32 strings instead of strings of individual 16-bit or 32-bit
+characters.
 .
 .
 .SH "UTF SUPPORT OVERHEAD"
@@ -249,6 +250,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 08 November 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi
diff --git a/pcre_compile.c b/pcre_compile.c
index a57363a..aff0208 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -7848,19 +7848,26 @@ while (ptr[skipatstart] == CHAR_LEFT_PARENTHESIS &&
   {
   int newnl = 0;
   int newbsr = 0;
+  
+/* For completeness and backward compatibility, (*UTFn) is supported in the
+relevant libraries, but (*UTF) is generic and always supported. Note that 
+PCRE_UTF8 == PCRE_UTF16 == PCRE_UTF32. */ 
 
 #ifdef COMPILE_PCRE8
-  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 5) == 0)
+  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF8_RIGHTPAR, 5) == 0)
     { skipatstart += 7; options |= PCRE_UTF8; continue; }
 #endif
 #ifdef COMPILE_PCRE16
-  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF16_RIGHTPAR, 6) == 0)
     { skipatstart += 8; options |= PCRE_UTF16; continue; }
 #endif
 #ifdef COMPILE_PCRE32
-  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF32_RIGHTPAR, 6) == 0)
     { skipatstart += 8; options |= PCRE_UTF32; continue; }
 #endif
+
+  else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 4) == 0)
+    { skipatstart += 6; options |= PCRE_UTF8; continue; } 
   else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UCP_RIGHTPAR, 4) == 0)
     { skipatstart += 6; options |= PCRE_UCP; continue; }
   else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_NO_START_OPT_RIGHTPAR, 13) == 0)
diff --git a/pcre_internal.h b/pcre_internal.h
index 8d0a759..c5d4914 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -1528,15 +1528,10 @@ a positive value. */
 #define STRING_ANYCRLF_RIGHTPAR        "ANYCRLF)"
 #define STRING_BSR_ANYCRLF_RIGHTPAR    "BSR_ANYCRLF)"
 #define STRING_BSR_UNICODE_RIGHTPAR    "BSR_UNICODE)"
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR            "UTF8)"
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR            "UTF16)"
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR            "UTF32)"
-#endif
+#define STRING_UTF8_RIGHTPAR           "UTF8)"
+#define STRING_UTF16_RIGHTPAR          "UTF16)"
+#define STRING_UTF32_RIGHTPAR          "UTF32)"
+#define STRING_UTF_RIGHTPAR            "UTF)"
 #define STRING_UCP_RIGHTPAR            "UCP)"
 #define STRING_NO_START_OPT_RIGHTPAR   "NO_START_OPT)"
 
@@ -1794,15 +1789,10 @@ only. */
 #define STRING_ANYCRLF_RIGHTPAR        STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
 #define STRING_BSR_ANYCRLF_RIGHTPAR    STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
 #define STRING_BSR_UNICODE_RIGHTPAR    STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR            STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR            STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR            STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
-#endif
+#define STRING_UTF8_RIGHTPAR           STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
+#define STRING_UTF16_RIGHTPAR          STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
+#define STRING_UTF32_RIGHTPAR          STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
+#define STRING_UTF_RIGHTPAR            STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
 #define STRING_UCP_RIGHTPAR            STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
 #define STRING_NO_START_OPT_RIGHTPAR   STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
 
diff --git a/testdata/testinput15 b/testdata/testinput15
index 8753e5e..6445d43 100644
--- a/testdata/testinput15
+++ b/testdata/testinput15
@@ -327,7 +327,7 @@ correctly, but that messes up comparisons). --/
 /(*UTF8)\x{1234}/
   abcd\x{1234}pqr
 
-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I
 
 /\h/SI8
     ABC\x{09}
diff --git a/testdata/testinput18 b/testdata/testinput18
index e55b8bc..7f87ca2 100644
--- a/testdata/testinput18
+++ b/testdata/testinput18
@@ -174,6 +174,9 @@ correctly, but that messes up comparisons). --/
 /(*UTF16)\x{11234}/
   abcd\x{11234}pqr
 
+/(*UTF)\x{11234}/I
+  abcd\x{11234}pqr
+
 /(*UTF-32)\x{11234}/
   abcd\x{11234}pqr
 
diff --git a/testdata/testoutput15 b/testdata/testoutput15
index b9d5aa2..d6b7d09 100644
--- a/testdata/testoutput15
+++ b/testdata/testoutput15
@@ -922,7 +922,7 @@ No match
   abcd\x{1234}pqr
  0: \x{1234}
 
-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I
 Capturing subpattern count = 0
 Options: bsr_unicode utf
 Forced newline sequence: CRLF
diff --git a/testdata/testoutput18-16 b/testdata/testoutput18-16
index 33054ad..50a8301 100644
--- a/testdata/testoutput18-16
+++ b/testdata/testoutput18-16
@@ -641,6 +641,14 @@ Error -10 (bad UTF-16 string) offset=0 reason=4
   abcd\x{11234}pqr
  0: \x{11234}
 
+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{d804}
+Need char = \x{de34}
+  abcd\x{11234}pqr
+ 0: \x{11234}
+
 /(*UTF-32)\x{11234}/
 Failed: (*VERB) not recognized at offset 5
 
diff --git a/testdata/testoutput18-32 b/testdata/testoutput18-32
index b0f0e8f..334fae0 100644
--- a/testdata/testoutput18-32
+++ b/testdata/testoutput18-32
@@ -638,6 +638,14 @@ Error -10 (bad UTF-32 string) offset=0 reason=2
 /(*UTF16)\x{11234}/
 Failed: (*VERB) not recognized at offset 5
 
+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{11234}
+No need char
+  abcd\x{11234}pqr
+ 0: \x{11234}
+
 /(*UTF-32)\x{11234}/
 Failed: (*VERB) not recognized at offset 5
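
Reviewer note: the pcre_compile.c hunk above makes the leading-item scan accept the width-specific sequence for the library being built plus the generic (*UTF) in every library. The sketch below restates that rule in isolation. It is not PCRE code: the helper name leading_utf_requested is hypothetical, it works on plain char strings with strncmp() rather than PCRE_UCHAR with STRNCMP_UC_C8, and the list of other leading items is an assumption about what 8.32 accepts at the start of a pattern. For a real application, the pcre_fullinfo() check recommended in doc/pcre.3 remains the reliable test.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Returns 1 if the leading (*...)
   items of `pattern` request UTF mode for a library with the given code
   unit width (8 or 16; any other value is treated as 32), else 0. */
static int leading_utf_requested(const char *pattern, int width)
{
  /* Assumed list of other leading items; they are skipped, not acted on. */
  static const char *const other[] = {
    "UCP)", "NO_START_OPT)", "CR)", "LF)", "CRLF)", "ANY)", "ANYCRLF)",
    "BSR_ANYCRLF)", "BSR_UNICODE)"
  };
  const char *specific =
    (width == 8) ? "UTF8)" : (width == 16) ? "UTF16)" : "UTF32)";
  const char *p = pattern;
  int utf = 0;

  while (p[0] == '(' && p[1] == '*')
    {
    const char *item = p + 2;
    size_t i, n = 0;

    /* (*UTFn) is honoured only by the matching library ... */
    if (strncmp(item, specific, strlen(specific)) == 0)
      { utf = 1; n = strlen(specific); }
    /* ... whereas the generic (*UTF) is honoured by all of them. */
    else if (strncmp(item, "UTF)", 4) == 0)
      { utf = 1; n = 4; }
    else
      {
      for (i = 0; i < sizeof(other) / sizeof(other[0]); i++)
        {
        if (strncmp(item, other[i], strlen(other[i])) == 0)
          { n = strlen(other[i]); break; }
        }
      }

    if (n == 0) break;   /* not a recognized leading item: stop scanning */
    p += 2 + n;          /* step over "(*" and the matched item */
    }

  return utf;
}
```

With this rule, "(*UTF16)x" requests UTF mode only for width 16, while "(*UTF)x" requests it for every width, which mirrors the new testoutput18-16 and testoutput18-32 expectations.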