author | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2012-11-11 18:04:37 +0000 |
---|---|---|
committer | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2012-11-11 18:04:37 +0000 |
commit | 1b1c6e22c899fa0586d368112f418588be1e1ac8 (patch) | |
tree | 932c40b78f8f0cb216a73342fe55fbbf84c12eeb | |
parent | f1660e461ec9b5150d4b59c8423261dab447f165 (diff) | |
download | pcre-1b1c6e22c899fa0586d368112f418588be1e1ac8.tar.gz |
Support (*UTF) in all libraries.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@1219 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- | ChangeLog | 2 |
-rw-r--r-- | doc/pcre.3 | 14 |
-rw-r--r-- | doc/pcrepattern.3 | 27 |
-rw-r--r-- | doc/pcresyntax.3 | 5 |
-rw-r--r-- | doc/pcreunicode.3 | 19 |
-rw-r--r-- | pcre_compile.c | 13 |
-rw-r--r-- | pcre_internal.h | 26 |
-rw-r--r-- | testdata/testinput15 | 2 |
-rw-r--r-- | testdata/testinput18 | 3 |
-rw-r--r-- | testdata/testoutput15 | 2 |
-rw-r--r-- | testdata/testoutput18-16 | 8 |
-rw-r--r-- | testdata/testoutput18-32 | 8 |
12 files changed, 77 insertions, 52 deletions
@@ -159,6 +159,8 @@ Version 8.32 11-November-2012
 34. If PCRE is built with --enable-valgrind, certain memory regions are marked
     unaddressable using valgrind annotations, allowing valgrind to detect
     invalid memory accesses. This is mainly for the benefit of the developers.
+
+25. (*UTF) can now be used to start a pattern in any of the three libraries.
 
 
 Version 8.31 06-July-2012
@@ -1,4 +1,4 @@
-.TH PCRE 3 "30 October 2012" "PCRE 8.32"
+.TH PCRE 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH INTRODUCTION
@@ -114,10 +114,12 @@ If you are using PCRE in a non-UTF application that permits users to supply
 arbitrary patterns for compilation, you should be aware of a feature that
 allows users to turn on UTF support from within a pattern, provided that PCRE
 was built with UTF support. For example, an 8-bit pattern that begins with
-"(*UTF8)" turns on UTF-8 mode. This causes both the pattern and any data
-against which it is matched to be checked for UTF-8 validity. If the data
-string is very long, such a check might use sufficiently many resources as to
-cause your application to lose performance.
+"(*UTF8)" or "(*UTF)" turns on UTF-8 mode, which interprets patterns and
+subjects as strings of UTF-8 characters instead of individual 8-bit characters.
+This causes both the pattern and any data against which it is matched to be
+checked for UTF-8 validity. If the data string is very long, such a check might
+use sufficiently many resources as to cause your application to lose
+performance.
 .P
 The best way of guarding against this possibility is to use the
 \fBpcre_fullinfo()\fP function to check the compiled pattern's options for UTF.
@@ -195,6 +197,6 @@ two digits 10, at the domain cam.ac.uk.
 .rs
 .sp
 .nf
-Last updated: 30 October 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 46f780d..991c52d 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3 "07 November 2012" "PCRE 8.32"
+.TH PCREPATTERN 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -22,17 +22,19 @@ description of PCRE's regular expressions is intended as reference material.
 .P
 The original operation of PCRE was on strings of one-byte characters. However,
 there is now also support for UTF-8 strings in the original library, an
-extra library that supports 16-bit and UTF-16 character strings, and an
-extra library that supports 32-bit and UTF-32 character strings. To use these
+extra library that supports 16-bit and UTF-16 character strings, and a
+third library that supports 32-bit and UTF-32 character strings. To use these
 features, PCRE must be built to include appropriate support. When using UTF
 strings you must either call the compiling function with the PCRE_UTF8,
-PCRE_UTF16 or PCRE_UTF32 option, or the pattern must start with one of
+PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of
 these special sequences:
 .sp
   (*UTF8)
   (*UTF16)
   (*UTF32)
+  (*UTF)
 .sp
+(*UTF) is a generic sequence that can be used with any of the libraries.
 Starting a pattern with such a sequence is equivalent to setting the relevant
 option. This feature is not Perl-compatible. How setting a UTF mode affects
 pattern matching is mentioned in several places below. There is also a summary
@@ -43,7 +45,7 @@ of features in the
 page.
 .P
 Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8) or (*UTF16) or (*UTF32) is:
+combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
 .sp
   (*UCP)
 .sp
@@ -573,10 +575,10 @@ change of newline convention; for example, a pattern can start with:
 .sp
   (*ANY)(*BSR_ANYCRLF)
 .sp
-They can also be combined with the (*UTF8), (*UTF16), (*UTF32) or (*UCP) special
-sequences. Inside a character class, \eR is treated as an unrecognized escape
-sequence, and so matches the letter "R" by default, but causes an error if
-PCRE_EXTRA is set.
+They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or
+(*UCP) special sequences. Inside a character class, \eR is treated as an
+unrecognized escape sequence, and so matches the letter "R" by default, but
+causes an error if PCRE_EXTRA is set.
 .
 .
 .\" HTML <a name="uniextseq"></a>
@@ -1349,10 +1351,11 @@ the section entitled
 .\" </a>
 "Newline sequences"
 .\"
-above. There are also the (*UTF8), (*UTF16),(*UTF32) and (*UCP) leading
+above. There are also the (*UTF8), (*UTF16),(*UTF32), and (*UCP) leading
 sequences that can be used to set UTF and Unicode property modes; they are
 equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
-options, respectively.
+options, respectively. The (*UTF) sequence is a generic version that can be
+used with any of the libraries.
 .
 .
 .\" HTML <a name="subpattern"></a>
@@ -2975,6 +2978,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 07 November 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi
diff --git a/doc/pcresyntax.3 b/doc/pcresyntax.3
index f634d4b..868f427 100644
--- a/doc/pcresyntax.3
+++ b/doc/pcresyntax.3
@@ -1,4 +1,4 @@
-.TH PCRESYNTAX 3 "24 June 2012" "PCRE 8.30"
+.TH PCRESYNTAX 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
@@ -349,6 +349,7 @@ newline-setting options with similar syntax:
   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
   (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
+  (*UTF)          set appropriate UTF mode for the library in use
   (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
 .
 .
@@ -490,6 +491,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 25 August 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi
diff --git a/doc/pcreunicode.3 b/doc/pcreunicode.3
index 79b31bd..3faaa70 100644
--- a/doc/pcreunicode.3
+++ b/doc/pcreunicode.3
@@ -1,4 +1,4 @@
-.TH PCREUNICODE 3 "08 November 2012" "PCRE 8.32"
+.TH PCREUNICODE 3 "11 November 2012" "PCRE 8.32"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT"
@@ -18,9 +18,9 @@ support, and, in addition, you must call
 \fBpcre_compile()\fP
 .\"
 with the PCRE_UTF8 option flag, or the pattern must start with the sequence
-(*UTF8). When either of these is the case, both the pattern and any subject
-strings that are matched against it are treated as UTF-8 strings instead of
-strings of individual 1-byte characters.
+(*UTF8) or (*UTF). When either of these is the case, both the pattern and any
+subject strings that are matched against it are treated as UTF-8 strings
+instead of strings of individual 1-byte characters.
 .
 .
 .SH "UTF-16 AND UTF-32 SUPPORT"
@@ -36,10 +36,11 @@ or
 \fBpcre32_compile()\fP
 .\"
 with the PCRE_UTF16 or PCRE_UTF32 option flag, as appropriate. Alternatively,
-the pattern must start with the sequence (*UTF16) or (*UTF32). When UTF mode is
-set, both the pattern and any subject strings that are matched against it are
-treated as UTF-16 or UTF-32 strings instead of strings of individual 16-bit or
-32-bit characters.
+the pattern must start with the sequence (*UTF16), (*UTF32), as appropriate, or
+(*UTF), which can be used with either library. When UTF mode is set, both the
+pattern and any subject strings that are matched against it are treated as
+UTF-16 or UTF-32 strings instead of strings of individual 16-bit or 32-bit
+characters.
 .
 .
 .SH "UTF SUPPORT OVERHEAD"
@@ -249,6 +250,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 08 November 2012
+Last updated: 11 November 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi
diff --git a/pcre_compile.c b/pcre_compile.c
index a57363a..aff0208 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -7848,19 +7848,26 @@ while (ptr[skipatstart] == CHAR_LEFT_PARENTHESIS &&
   {
   int newnl = 0;
   int newbsr = 0;
+
+/* For completeness and backward compatibility, (*UTFn) is supported in the
+relevant libraries, but (*UTF) is generic and always supported. Note that
+PCRE_UTF8 == PCRE_UTF16 == PCRE_UTF32. */
 
 #ifdef COMPILE_PCRE8
-  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 5) == 0)
+  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF8_RIGHTPAR, 5) == 0)
     { skipatstart += 7; options |= PCRE_UTF8; continue; }
 #endif
 #ifdef COMPILE_PCRE16
-  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF16_RIGHTPAR, 6) == 0)
     { skipatstart += 8; options |= PCRE_UTF16; continue; }
 #endif
 #ifdef COMPILE_PCRE32
-  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 6) == 0)
+  if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF32_RIGHTPAR, 6) == 0)
     { skipatstart += 8; options |= PCRE_UTF32; continue; }
 #endif
+
+  else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UTF_RIGHTPAR, 4) == 0)
+    { skipatstart += 6; options |= PCRE_UTF8; continue; }
   else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_UCP_RIGHTPAR, 4) == 0)
     { skipatstart += 6; options |= PCRE_UCP; continue; }
   else if (STRNCMP_UC_C8(ptr+skipatstart+2, STRING_NO_START_OPT_RIGHTPAR, 13) == 0)
diff --git a/pcre_internal.h b/pcre_internal.h
index 8d0a759..c5d4914 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -1528,15 +1528,10 @@ a positive value. */
 #define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
 #define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
 #define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR "UTF8)"
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR "UTF16)"
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR "UTF32)"
-#endif
+#define STRING_UTF8_RIGHTPAR "UTF8)"
+#define STRING_UTF16_RIGHTPAR "UTF16)"
+#define STRING_UTF32_RIGHTPAR "UTF32)"
+#define STRING_UTF_RIGHTPAR "UTF)"
 #define STRING_UCP_RIGHTPAR "UCP)"
 #define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
 
@@ -1794,15 +1789,10 @@ only. */
 #define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
 #define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
 #define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
-#ifdef COMPILE_PCRE8
-#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE16
-#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
-#endif
-#ifdef COMPILE_PCRE32
-#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
-#endif
+#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
+#define STRING_UTF16_RIGHTPAR STR_U STR_T STR_F STR_1 STR_6 STR_RIGHT_PARENTHESIS
+#define STRING_UTF32_RIGHTPAR STR_U STR_T STR_F STR_3 STR_2 STR_RIGHT_PARENTHESIS
+#define STRING_UTF_RIGHTPAR STR_U STR_T STR_F STR_RIGHT_PARENTHESIS
 #define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
 #define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
 
diff --git a/testdata/testinput15 b/testdata/testinput15
index 8753e5e..6445d43 100644
--- a/testdata/testinput15
+++ b/testdata/testinput15
@@ -327,7 +327,7 @@ correctly, but that messes up comparisons). --/
 /(*UTF8)\x{1234}/
     abcd\x{1234}pqr
 
-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I
 
 /\h/SI8
     ABC\x{09}
diff --git a/testdata/testinput18 b/testdata/testinput18
index e55b8bc..7f87ca2 100644
--- a/testdata/testinput18
+++ b/testdata/testinput18
@@ -174,6 +174,9 @@ correctly, but that messes up comparisons). --/
 /(*UTF16)\x{11234}/
     abcd\x{11234}pqr
 
+/(*UTF)\x{11234}/I
+    abcd\x{11234}pqr
+
 /(*UTF-32)\x{11234}/
     abcd\x{11234}pqr
 
diff --git a/testdata/testoutput15 b/testdata/testoutput15
index b9d5aa2..d6b7d09 100644
--- a/testdata/testoutput15
+++ b/testdata/testoutput15
@@ -922,7 +922,7 @@ No match
     abcd\x{1234}pqr
  0: \x{1234}
 
-/(*CRLF)(*UTF8)(*BSR_UNICODE)a\Rb/I
+/(*CRLF)(*UTF)(*BSR_UNICODE)a\Rb/I
 Capturing subpattern count = 0
 Options: bsr_unicode utf
 Forced newline sequence: CRLF
diff --git a/testdata/testoutput18-16 b/testdata/testoutput18-16
index 33054ad..50a8301 100644
--- a/testdata/testoutput18-16
+++ b/testdata/testoutput18-16
@@ -641,6 +641,14 @@ Error -10 (bad UTF-16 string) offset=0 reason=4
     abcd\x{11234}pqr
  0: \x{11234}
 
+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{d804}
+Need char = \x{de34}
+    abcd\x{11234}pqr
+ 0: \x{11234}
+
 /(*UTF-32)\x{11234}/
 Failed: (*VERB) not recognized at offset 5
 
diff --git a/testdata/testoutput18-32 b/testdata/testoutput18-32
index b0f0e8f..334fae0 100644
--- a/testdata/testoutput18-32
+++ b/testdata/testoutput18-32
@@ -638,6 +638,14 @@ Error -10 (bad UTF-32 string) offset=0 reason=2
 /(*UTF16)\x{11234}/
 Failed: (*VERB) not recognized at offset 5
 
+/(*UTF)\x{11234}/I
+Capturing subpattern count = 0
+Options: utf
+First char = \x{11234}
+No need char
+    abcd\x{11234}pqr
+ 0: \x{11234}
+
 /(*UTF-32)\x{11234}/
 Failed: (*VERB) not recognized at offset 5
 
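Note on the pcre_compile.c hunk: the loop tries the width-specific sequence first and then falls back to the generic (*UTF), so a single pattern string can enable UTF mode in any of the three libraries. The following is a minimal standalone sketch of that scanning order, not PCRE's code. The names `take`, `scan_leading_options`, `OPT_UTF` and `OPT_UCP` are hypothetical, the library width is passed as a runtime parameter instead of being selected with COMPILE_PCRE8/16/32, and the newline and BSR sequences are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values for this sketch only; they are not PCRE's. */
enum { OPT_UTF = 1 << 0, OPT_UCP = 1 << 1 };

/* Returns 1 and advances *pos if the text at pattern + *pos is "(*"
   followed by name (name includes the closing parenthesis). */
static int take(const char *pattern, size_t *pos, const char *name)
{
    const char *p = pattern + *pos;
    size_t n = strlen(name);

    if (p[0] != '(' || p[1] != '*' || strncmp(p + 2, name, n) != 0)
        return 0;
    *pos += 2 + n;
    return 1;
}

/* Consumes leading (*UTFn), (*UTF) and (*UCP) sequences for a library of
   the given code unit width (8, 16 or 32). Stores the flags found in
   *options and returns the number of characters consumed. */
size_t scan_leading_options(const char *pattern, int width, unsigned *options)
{
    size_t pos = 0;
    const char *specific =
        width == 8 ? "UTF8)" : width == 16 ? "UTF16)" : "UTF32)";

    assert(pattern != NULL && options != NULL);
    *options = 0;
    for (;;) {
        /* Width-specific name first, then the generic one. */
        if (take(pattern, &pos, specific) || take(pattern, &pos, "UTF)"))
            *options |= OPT_UTF;
        else if (take(pattern, &pos, "UCP)"))
            *options |= OPT_UCP;
        else
            break;
    }
    return pos;
}
```

In this sketch a sequence for the wrong width, such as "(*UTF16)" with width 8, is simply not consumed; in the test output above the real library reports "(*VERB) not recognized" for that case.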