author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2012-11-11 18:04:37 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2012-11-11 18:04:37 +0000
commit 1b1c6e22c899fa0586d368112f418588be1e1ac8 (patch)
tree 932c40b78f8f0cb216a73342fe55fbbf84c12eeb
parent f1660e461ec9b5150d4b59c8423261dab447f165 (diff)
download pcre-1b1c6e22c899fa0586d368112f418588be1e1ac8.tar.gz
Support (*UTF) in all libraries.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1219 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- ChangeLog 2
-rw-r--r-- doc/pcre.3 14
-rw-r--r-- doc/pcrepattern.3 27
-rw-r--r-- doc/pcresyntax.3 5
-rw-r--r-- doc/pcreunicode.3 19
-rw-r--r-- pcre_compile.c 13
-rw-r--r-- pcre_internal.h 26
-rw-r--r-- testdata/testinput15 2
-rw-r--r-- testdata/testinput18 3
-rw-r--r-- testdata/testoutput15 2
-rw-r--r-- testdata/testoutput18-16 8
-rw-r--r-- testdata/testoutput18-32 8
12 files changed, 77 insertions, 52 deletions
diff --git a/ChangeLog b/ChangeLog
index a0cdd9c..a117b0d 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -159,6 +159,8 @@ Version 8.32 11-November-2012
34. If PCRE is built with --enable-valgrind, certain memory regions are marked
unaddressable using valgrind annotations, allowing valgrind to detect
invalid memory accesses. This is mainly for the benefit of the developers.
+
+25. (*UTF) can now be used to start a pattern in any of the three libraries.
Version 8.31 06-July-2012
diff --git a/doc/pcre.3 b/doc/pcre.3
index c95ee6c..73e8ee6 100644
--- a/doc/pcre.3
+++ b/doc/pcre.3
@@ -1,4 +1,4 @@
-.TH PCRE 3 "30 October 2012" "PCRE 8.32"
+.TH PCRE 3 "11 November 2012" "PCRE 8.32"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH INTRODUCTION
@@ -114,10 +114,12 @@ If you are using PCRE in a non-UTF application that permits users to supply
arbitrary patterns for compilation, you should be aware of a feature that
allows users to turn on UTF support from within a pattern, provided that PCRE
was built with UTF support. For example, an 8-bit pattern that begins with
-"(*UTF8)" turns on UTF-8 mode. This causes both the pattern and any data
-against which it is matched to be checked for UTF-8 validity. If the data
-string is very long, such a check might use sufficiently many resources as to
-cause your application to lose performance.
+"(*UTF8)" or "(*UTF)" turns on UTF-8 mode, which interprets patterns and
+subjects as strings of UTF-8 characters instead of individual 8-bit characters.
+This causes both the pattern and any data against which it is matched to be
+checked for UTF-8 validity. If the data string is very long, such a check might
+use sufficiently many resources as to cause your application to lose
+performance.
.P
The best way of guarding against this possibility is to use the
\fBpcre_fullinfo()\fP function to check the compiled pattern's options for UTF.
@@ -195,6 +197,6 @@ two digits 10, at the domain cam.ac.uk.
.rs
.sp
.nf
-Last updated: 30 October 2012
+Last updated: 11 November 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 46f780d..991c52d 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "07 November 2012" "PCRE 8.32"
+.TH PCREPATTERN 3 "11 November 2012" "PCRE 8.32"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -22,17 +22,19 @@ description of PCRE's regular expressions is intended as reference material.
.P
The original operation of PCRE was on strings of one-byte characters. However,
there is now also support for UTF-8 strings in the original library, an
-extra library that supports 16-bit and UTF-16 character strings, and an
-extra library that supports 32-bit and UTF-32 character strings. To use these
+extra library that supports 16-bit and UTF-16 character strings, and a
+third library that supports 32-bit and UTF-32 character strings. To use these
features, PCRE must be built to include appropriate support. When using UTF
strings you must either call the compiling function with the PCRE_UTF8,
-PCRE_UTF16 or PCRE_UTF32 option, or the pattern must start with one of
+PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of
these special sequences:
.sp
(*UTF8)
(*UTF16)
(*UTF32)
+ (*UTF)
.sp
+(*UTF) is a generic sequence that can be used with any of the libraries.
Starting a pattern with such a sequence is equivalent to setting the relevant
option. This feature is not Perl-compatible. How setting a UTF mode affects
pattern matching is mentioned in several places below. There is also a summary
@@ -43,7 +45,7 @@ of features in the
page.
.P
Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8) or (*UTF16) or (*UTF32) is:
+combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
.sp
(*UCP)
.sp
@@ -573,10 +575,10 @@ change of newline convention; for example, a pattern can start with:
.sp
(*ANY)(*BSR_ANYCRLF)
.sp
-They can also be combined with the (*UTF8), (*UTF16), (*UTF32) or (*UCP) special
-sequences. Inside a character class, \eR is treated as an unrecognized escape
-sequence, and so matches the letter "R" by default, but causes an error if
-PCRE_EXTRA is set.
+They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or
+(*UCP) special sequences. Inside a character class, \eR is treated as an
+unrecognized escape sequence, and so matches the letter "R" by default, but
+causes an error if PCRE_EXTRA is set.
.
.
.\" HTML <a name="uniextseq"></a>
@@ -1349,10 +1351,11 @@ the section entitled
.\" </a>
"Newline sequences"
.\"
-above. There are also the (*UTF8), (*UTF16),(*UTF32) and (*UCP) leading
+above. There are also the (*UTF8), (*UTF16),(*UTF32), and (*UCP) leading
sequences that can be used to set UTF and Unicode property modes; they are
equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
-options, respectively.
+options, respectively. The (*UTF) sequence is a generic version that can be
+used with any of the libraries.
.
.
.\" HTML <a name="subpattern"></a>
@@ -2975,6 +2978,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 07 November 2012
+Last updated: 11 November 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
diff --git a/doc/pcresyntax.3 b/doc/pcresyntax.3
index f634d4b..868f427 100644
--- a/doc/pcresyntax.3
+++ b/doc/pcresyntax.3
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "24 June 2012" "PCRE 8.30"
+.TH PCRESYNTAX 3 "11 November 2012" "PCRE 8.32"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -349,6 +349,7 @@ newline-setting options with similar syntax:
(*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
(*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
(*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
+ (*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE_UCP (use Unicode properties for \ed etc)
.
.
@@ -490,6 +491,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 25 August 2012
+Last updated: 11 November 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
diff --git a/doc/pcreunicode.3 b/doc/pcreunicode.3
index 79b31bd..3faaa70 100644
--- a/doc/pcreunicode.3
+++ b/doc/pcreunicode.3
@@ -1,4 +1,4 @@
-.TH PCREUNICODE 3 "08 November 2012" "PCRE 8.32"
+.TH PCREUNICODE 3 "11 November 2012" "PCRE 8.32"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT"
@@ -18,9 +18,9 @@ support, and, in addition, you must call
\fBpcre_compile()\fP
.\"
with the PCRE_UTF8 option flag, or the pattern must start with the sequence
-(*UTF8). When either of these is the case, both the pattern and any subject
-strings that are matched against it are treated as UTF-8 strings instead of
-strings of individual 1-byte characters.
+(*UTF8) or (*UTF). When either of these is the case, both the pattern and any
+subject strings that are matched against it are treated as UTF-8 strings
+instead of strings of individual 1-byte characters.
.
.
.SH "UTF-16 AND UTF-32 SUPPORT"
@@ -36,10 +36,11 @@ or
\fBpcre32_compile()\fP
.\"
with the PCRE_UTF16 or PCRE_UTF32 option flag, as appropriate. Alternatively,
-the pattern must start with the sequence (*UTF16) or (*UTF32). When UTF mode is
-set, both the pattern and any subject strings that are matched against it are
-treated as UTF-16 or UTF-32 strings instead of strings of individual 16-bit or
-32-bit characters.
+the pattern must start with the sequence (*UTF16), (*UTF32), as appropriate, or
+(*UTF), which can be used with either library. When UTF mode is set, both the
+pattern and any subject strings that are matched against it are treated as
+UTF-16 or UTF-32 strings instead of strings of individual 16-bit or 32-bit
+characters.
.
.
.SH "UTF SUPPORT OVERHEAD"
@@ -249,6 +250,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 08 November 2012
+Last updated: 11 November 2012
Copyright (c) 1997-2012 University of Cambridge.
.fi
diff --git a/pcre_compile.c b/pcre_compile.c
index a57363a..aff0208 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -7848,19 +7848,26 @@ while (ptr[skipatstart] == CHAR_LEFT_PARENTHESIS &&
{
int newnl = 0;
int newbsr = 0;
+
+/* For completeness and backward compatibility, (*UTFn) is supported in the
+relevant libraries, but (*UTF) is generic and always supported. Note that
+PCRE_UTF8 == PCRE_UTF16 == PCRE_UTF32. */
#ifdef COMPILE_PCRE8
- if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 5) == 0)
+ if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF8_RIGHTPAR, 5) == 0)
{ skipatstart += 7; options |= PCRE_UTF8; continue; }
#endif
#ifdef COMPILE_PCRE16
- if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+ if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF16_RIGHTPAR, 6) == 0)
{ skipatstart += 8; options |= PCRE_UTF16; continue; }
#endif
#ifdef COMPILE_PCRE32
- if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+ if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF32_RIGHTPAR, 6) == 0)
{ skipatstart += 8; options |= PCRE_UTF32; continue; }
#endif
+
+ else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 4) == 0)
+ { skipatstart += 6; options |= PCRE_UTF8; continue; }
else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UCP_RIGHTPAR, 4) == 0)
{ skipatstart += 6; options |= PCRE_UCP; continue; }
else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_NO_START_OPT_RIGHTPAR, 13) == 0)
diff --git a/pcre_internal.h b/pcre_internal.h
index 8d0a759..c5d4914 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -1528,15 +1528,10 @@ a positive value. */
#define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR "UTF8)"
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR "UTF16)"
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR "UTF32)"
-#endif
+#define STRING_UTF8_RIGHTPAR "UTF8)"
+#define STRING_UTF16_RIGHTPAR "UTF16)"
+#define STRING_UTF32_RIGHTPAR "UTF32)"
+#define STRING_UTF_RIGHTPAR "UTF)"
#define STRING_UCP_RIGHTPAR "UCP)"
#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
@@ -1794,15 +1789,10 @@ only. */
#define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
-#endif
+#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
+#define STRING_UTF16_RIGHTPAR STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
+#define STRING_UTF32_RIGHTPAR STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
+#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
diff --git a/testdata/testinput15 b/testdata/testinput15
index 8753e5e..6445d43 100644
--- a/testdata/testinput15
+++ b/testdata/testinput15
@@ -327,7 +327,7 @@ correctly, but that messes up comparisons). --/
/(*UTF8)\x{1234}/
abcd\x{1234}pqr
-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I
/\h/SI8
ABC\x{09}
diff --git a/testdata/testinput18 b/testdata/testinput18
index e55b8bc..7f87ca2 100644
--- a/testdata/testinput18
+++ b/testdata/testinput18
@@ -174,6 +174,9 @@ correctly, but that messes up comparisons). --/
/(*UTF16)\x{11234}/
abcd\x{11234}pqr
+/(*UTF)\x{11234}/I
+ abcd\x{11234}pqr
+
/(*UTF-32)\x{11234}/
abcd\x{11234}pqr
diff --git a/testdata/testoutput15 b/testdata/testoutput15
index b9d5aa2..d6b7d09 100644
--- a/testdata/testoutput15
+++ b/testdata/testoutput15
@@ -922,7 +922,7 @@ No match
abcd\x{1234}pqr
0: \x{1234}
-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I
Capturing subpattern count = 0
Options: bsr_unicode utf
Forced newline sequence: CRLF
diff --git a/testdata/testoutput18-16 b/testdata/testoutput18-16
index 33054ad..50a8301 100644
--- a/testdata/testoutput18-16
+++ b/testdata/testoutput18-16
@@ -641,6 +641,14 @@ Error -10 (bad UTF-16 string) offset=0 reason=4
abcd\x{11234}pqr
0: \x{11234}
+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{d804}
+Need char = \x{de34}
+ abcd\x{11234}pqr
+ 0: \x{11234}
+
/(*UTF-32)\x{11234}/
Failed: (*VERB) not recognized at offset 5
diff --git a/testdata/testoutput18-32 b/testdata/testoutput18-32
index b0f0e8f..334fae0 100644
--- a/testdata/testoutput18-32
+++ b/testdata/testoutput18-32
@@ -638,6 +638,14 @@ Error -10 (bad UTF-32 string) offset=0 reason=2
/(*UTF16)\x{11234}/
Failed: (*VERB) not recognized at offset 5
+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{11234}
+No need char
+ abcd\x{11234}pqr
+ 0: \x{11234}
+
/(*UTF-32)\x{11234}/
Failed: (*VERB) not recognized at offset 5
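
Note on the pcre_compile.c hunk: the start-of-pattern loop consumes leading "(*NAME)" items one at a time, adding an option bit and advancing past each. The following is a minimal standalone sketch of that idea, not the PCRE source. The names scan_start_items, start_items, OPT_UTF and OPT_UCP are hypothetical and exist only for this illustration, the option values are not PCRE's, and the table assumes an 8-bit build in which "UTF8)" is the width-specific name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical option bits for this sketch; these are not PCRE's values. */
#define OPT_UTF 0x1u
#define OPT_UCP 0x2u

/* Hypothetical table of leading items, written without the "(*" prefix in
   the same way as the STRING_..._RIGHTPAR macros in the diff above. */
struct start_item { const char *name; unsigned int option; };

static const struct start_item start_items[] = {
  { "UTF8)", OPT_UTF },  /* width-specific name, assuming an 8-bit build */
  { "UTF)",  OPT_UTF },  /* generic name, accepted whatever the width */
  { "UCP)",  OPT_UCP },
};

/* Consume leading "(*NAME)" items from pattern. Returns the number of
   characters skipped and ORs the matching option bits into *options.
   Scanning stops at the first item that is not in the table. */
static size_t scan_start_items(const char *pattern, unsigned int *options)
{
  const size_t n = sizeof(start_items) / sizeof(start_items[0]);
  size_t skip = 0;

  for (;;) {
    size_t i;
    int matched = 0;

    if (pattern[skip] != '(' || pattern[skip + 1] != '*') break;

    for (i = 0; i < n; i++) {
      size_t len = strlen(start_items[i].name);
      if (strncmp(pattern + skip + 2, start_items[i].name, len) == 0) {
        skip += len + 2;              /* item name plus the "(*" prefix */
        *options |= start_items[i].option;
        matched = 1;
        break;
      }
    }
    if (!matched) break;
  }
  return skip;
}
```

With this table, "(*UTF)(*UCP)abc" skips 12 characters and sets both bits, while "(*UTF16)x" is left untouched. That mirrors the testoutput18-32 result above, where a sequence for a different width is rejected but (*UTF) is accepted.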