From c920d7941295fd6c46fffed9aa9e2e0de0b18570 Mon Sep 17 00:00:00 2001
From: ph10
If you are using PCRE2 in a non-UTF application that permits users to supply
arbitrary patterns for compilation, you should be aware of a feature that
-allows users to turn on UTF support from within a pattern, provided that PCRE2
-was built with Unicode support. For example, an 8-bit pattern that begins with
-"(*UTF)" turns on UTF-8 mode, which interprets patterns and subjects as strings
-of UTF-8 code units instead of individual 8-bit characters. This causes both
-the pattern and any data against which it is matched to be checked for UTF-8
-validity. If the data string is very long, such a check might use sufficiently
-many resources as to cause your application to lose performance.
+allows users to turn on UTF support from within a pattern. For example, an
+8-bit pattern that begins with "(*UTF)" turns on UTF-8 mode, which interprets
+patterns and subjects as strings of UTF-8 code units instead of individual
+8-bit characters. This causes both the pattern and any data against which it is
+matched to be checked for UTF-8 validity. If the data string is very long, such
+a check might use sufficiently many resources as to cause your application to
+lose performance.
One way of guarding against this possibility is to use the
@@ -173,7 +174,7 @@ use my two initials, followed by the two digits 10, at the domain cam.ac.uk.
-Last updated: 28 September 2014
+Last updated: 03 November 2014
-To build PCRE2 with support for Unicode and UTF character strings, add
+By default, PCRE2 is built with support for Unicode and UTF character strings.
+To build it without Unicode support, add
pcre2test -C
@@ -95,13 +96,13 @@ not exported.
REVISION
Copyright © 1997-2014 University of Cambridge.
diff --git a/doc/html/pcre2build.html b/doc/html/pcre2build.html
index 9733057..2fbe197 100644
--- a/doc/html/pcre2build.html
+++ b/doc/html/pcre2build.html
@@ -115,27 +115,24 @@ to the configure command, as required.
Unicode and UTF SUPPORT
- --enable-unicode
+ --disable-unicode
-to the configure command. This setting applies to all three libraries,
-adding support for UTF-8 to the 8-bit library, support for UTF-16 to the 16-bit
-library, and support for UTF-32 to the to the 32-bit library.
-It is not possible to build one library with
-UTF support and another without in the same configuration.
+to the configure command. This setting applies to all three libraries. It
+is not possible to build one library with Unicode support, and another without,
+in the same configuration.
-Of itself, this setting does not make PCRE2 treat strings as UTF-8, UTF-16 or
-UTF-32. As well as compiling PCRE2 with this option, you also have have to set
-the PCRE2_UTF option when you call pcre2_compile() to compile a pattern.
+Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
+or UTF-32. To do that you have have to set the PCRE2_UTF option when you call
+pcre2_compile() to compile a pattern.
-If you set --enable-unicode when compiling in an EBCDIC environment, PCRE2
-expects its input to be either ASCII or UTF-8 (depending on the run-time
-option). It is not possible to support both EBCDIC and UTF-8 codes in the same
-version of the library. Consequently, --enable-unicode and --enable-ebcdic are
-mutually exclusive.
+It is not possible to support both EBCDIC and UTF-8 codes in the same version
+of the library. Consequently, --enable-unicode and --enable-ebcdic are mutually
+exclusive.
UTF support allows the libraries to process character codepoints up to 0x10ffff
@@ -301,12 +298,12 @@ code is ASCII (or Unicode, which is a superset of ASCII). This is the case for
most computer operating systems. PCRE2 can, however, be compiled to run in an
EBCDIC environment by adding
- --enable-ebcdic
+ --enable-ebcdic --disable-unicode
to the configure command. This setting implies --enable-rebuild-chartables. You
should only use it if you know that you are in an EBCDIC environment (for
example, an IBM mainframe operating system). The
---enable-ebcdic option is incompatible with --enable-unicode.
+--enable-ebcdic option is incompatible with Unicode support.
The EBCDIC character that corresponds to an ASCII LF is assumed to have the
@@ -469,7 +466,7 @@ Cambridge CB2 3QH, England.
-Last updated: 28 September 2014
+Last updated: 03 November 2014
Copyright © 1997-2014 University of Cambridge.
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 11d8056..e9da45a 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -89,11 +89,11 @@ In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
specified for the 32-bit library, in which case it constrains the character
values to valid Unicode code points. To process UTF strings, PCRE2 must be
-built to include Unicode support. When using UTF strings you must either call
-the compiling function with the PCRE2_UTF option, or the pattern must start
-with the special sequence (*UTF), which is equivalent to setting the relevant
-option. How setting a UTF mode affects pattern matching is mentioned in several
-places below. There is also a summary of features in the
+built to include Unicode support (which is the default). When using UTF strings
+you must either call the compiling function with the PCRE2_UTF option, or the
+pattern must start with the special sequence (*UTF), which is equivalent to
+setting the relevant option. How setting a UTF mode affects pattern matching is
+mentioned in several places below. There is also a summary of features in the
pcre2unicode
page.
\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
@@ -641,11 +641,11 @@ an error.
Unicode character properties
-When PCRE2 is built with Unicode support, three additional escape sequences
-that match characters with specific properties are available. In 8-bit
-non-UTF-8 mode, these sequences are of course limited to testing characters
-whose codepoints are less than 256, but they do work in this mode. The extra
-escape sequences are:
+When PCRE2 is built with Unicode support (the default), three additional escape
+sequences that match characters with specific properties are available. In
+8-bit non-UTF-8 mode, these sequences are of course limited to testing
+characters whose codepoints are less than 256, but they do work in this mode.
+The extra escape sequences are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
@@ -3193,7 +3193,7 @@ Cambridge CB2 3QH, England.
REVISION
-Last updated: 19 October 2014
+Last updated: 03 November 2014
Copyright © 1997-2014 University of Cambridge.
diff --git a/doc/html/pcre2unicode.html b/doc/html/pcre2unicode.html
index 52846fb..bc3b327 100644
--- a/doc/html/pcre2unicode.html
+++ b/doc/html/pcre2unicode.html
@@ -16,11 +16,12 @@ please consult the man page, in case the conversion went wrong.
UNICODE AND UTF SUPPORT
-When PCRE2 is built with Unicode support, it acquires knowledge of Unicode
-character properties and can process text strings in UTF-8, UTF-16, or UTF-32
-format (depending on the code unit width). By default, PCRE2 assumes that one
-code unit is one character. To process a pattern as a UTF string, where a
-character may require more than one code unit, you must call
+When PCRE2 is built with Unicode support (which is the default), it has
+knowledge of Unicode character properties and can process text strings in
+UTF-8, UTF-16, or UTF-32 format (depending on the code unit width). However, by
+default, PCRE2 assumes that one code unit is one character. To process a
+pattern as a UTF string, where a character may require more than one code unit,
+you must call
pcre2_compile()
with the PCRE2_UTF option flag, or the pattern must start with the sequence
(*UTF). When either of these is the case, both the pattern and any subject
@@ -28,9 +29,8 @@ strings that are matched against it are treated as UTF strings instead of
strings of individual one-code-unit characters.
-If you build PCRE2 with Unicode support, the library will be bigger, but the
-additional run time overhead is limited to testing the PCRE2_UTF flag
-occasionally, so should not be very much.
+If you do not need Unicode support you can build PCRE2 without it, in which
+case the library will be smaller.
UNICODE PROPERTY SUPPORT
@@ -261,7 +261,7 @@ Cambridge CB2 3QH, England.
REVISION
-Last updated: 16 September 2014
+Last updated: 03 November 2014
Copyright © 1997-2014 University of Cambridge.
-- cgit v1.2.1