summaryrefslogtreecommitdiff
path: root/doc/html/pcre2api.html
diff options
context:
space:
mode:
Diffstat (limited to 'doc/html/pcre2api.html')
-rw-r--r--doc/html/pcre2api.html56
1 files changed, 36 insertions, 20 deletions
diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index 9562a26..ee056ad 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -1481,13 +1481,13 @@ documentation.
</pre>
If this bit is set, letters in the pattern match both upper and lower case
letters in the subject. It is equivalent to Perl's /i option, and it can be
-changed within a pattern by a (?i) option setting. If PCRE2_UTF is set, Unicode
-properties are used for all characters with more than one other case, and for
-all characters whose code points are greater than U+007F. For lower valued
-characters with only one other case, a lookup table is used for speed. When
-PCRE2_UTF is not set, a lookup table is used for all code points less than 256,
-and higher code points (available only in 16-bit or 32-bit mode) are treated as
-not having another case.
+changed within a pattern by a (?i) option setting. If either PCRE2_UTF or
+PCRE2_UCP is set, Unicode properties are used for all characters with more than
+one other case, and for all characters whose code points are greater than
+U+007F. For lower valued characters with only one other case, a lookup table is
+used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set, a lookup table is
+used for all code points less than 256, and higher code points (available only
+in 16-bit or 32-bit mode) are treated as not having another case.
<pre>
PCRE2_DOLLAR_ENDONLY
</pre>
@@ -1820,16 +1820,23 @@ are not representable in UTF-16.
<pre>
PCRE2_UCP
</pre>
-This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
-\w, and some of the POSIX character classes. By default, only ASCII characters
-are recognized, but if PCRE2_UCP is set, Unicode properties are used instead to
-classify characters. More details are given in the section on
+This option has two effects. Firstly, it change the way PCRE2 processes \B,
+\b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes. By
+default, only ASCII characters are recognized, but if PCRE2_UCP is set, Unicode
+properties are used instead to classify characters. More details are given in
+the section on
<a href="pcre2pattern.html#genericchartypes">generic character types</a>
in the
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
page. If you set PCRE2_UCP, matching one of the items it affects takes much
-longer. The option is available only if PCRE2 has been compiled with Unicode
-support (which is the default).
+longer.
+</P>
+<P>
+The second effect of PCRE2_UCP is to force the use of Unicode properties for
+upper/lower casing operations on characters with code points greater than 127,
+even when PCRE2_UTF is not set. This makes it possible, for example, to process
+strings in the 16-bit UCS-2 code. This option is available only if PCRE2 has
+been compiled with Unicode support (which is the default).
<pre>
PCRE2_UNGREEDY
</pre>
@@ -1997,14 +2004,20 @@ PCRE2 handles caseless matching, and determines whether characters are letters,
digits, or whatever, by reference to a set of tables, indexed by character code
point. However, this applies only to characters whose code points are less than
256. By default, higher-valued code points never match escapes such as \w or
-\d. When PCRE2 is built with Unicode support (the default), all characters can
-be tested with \p and \P, or, alternatively, the PCRE2_UCP option can be set
-when a pattern is compiled; this causes \w and friends to use Unicode property
-support instead of the built-in tables.
+\d.
+</P>
+<P>
+When PCRE2 is built with Unicode support (the default), the Unicode properties
+of all characters can be tested with \p and \P, or, alternatively, the
+PCRE2_UCP option can be set when a pattern is compiled; this causes \w and
+friends to use Unicode property support instead of the built-in tables.
+PCRE2_UCP also causes upper/lower casing operations on characters with code
+points greater than 127 to use Unicode properties. These effects apply even
+when PCRE2_UTF is not set.
</P>
<P>
The use of locales with Unicode is discouraged. If you are handling characters
-with code points greater than 128, you should either use Unicode support, or
+with code points greater than 127, you should either use Unicode support, or
use locales, but not try to mix the two.
</P>
<P>
@@ -3494,7 +3507,10 @@ terminating a \Q quoted sequence) reverts to no case forcing. The sequences
\u and \l force the next character (if it is a letter) to upper or lower
case, respectively, and then the state automatically reverts to no case
forcing. Case forcing applies to all inserted characters, including those from
-capture groups and letters within \Q...\E quoted sequences.
+capture groups and letters within \Q...\E quoted sequences. If either
+PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
+properties are used for case forcing characters whose code points are greater
+than 127.
</P>
<P>
Note that case forcing sequences such as \U...\E do not nest. For example,
@@ -3915,7 +3931,7 @@ Cambridge, England.
</P>
<br><a name="SEC42" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 16 February 2020
+Last updated: 24 February 2020
<br>
Copyright &copy; 1997-2020 University of Cambridge.
<br>