author     ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069>  2015-01-26 14:21:45 +0000
committer  ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069>  2015-01-26 14:21:45 +0000
commit     1434a44f884ad6637bcd6ae1876dd997b78135d0
tree       755d211f4ea8f74b50a12f9a1d9795c30cda73e2
parent     61cb4c76713910670d84bc0e3bb9dad8c2661b37
Documentation clarifications.
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@186 6239d852-aaf2-0410-a92c-79f79f948069
-rw-r--r--  README                      16
-rw-r--r--  doc/html/README.txt         16
-rw-r--r--  doc/html/pcre2build.html    16
-rw-r--r--  doc/html/pcre2pattern.html  23
-rw-r--r--  doc/pcre2.txt               24
-rw-r--r--  doc/pcre2build.3            17
-rw-r--r--  doc/pcre2pattern.3          25
7 files changed, 84 insertions, 53 deletions
diff --git a/README b/README
index 71d6f72..508fd1e 100644
--- a/README
+++ b/README
@@ -179,20 +179,24 @@ library. They are also documented in the pcre2build man page.
. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
- library, and UTF-32 Unicode character strings in the 32-bit library, you can
+ library, or UTF-32 Unicode character strings in the 32-bit library, you can
add --disable-unicode to the "configure" command. This reduces the size of
the libraries. It is not possible to configure one library with Unicode
support, and another without, in the same configuration.
When Unicode support is available, the use of a UTF encoding still has to be
- enabled by an option at run time. When PCRE2 is compiled with Unicode
- support, its input can only either be ASCII or UTF-8/16/32, even when running
- on EBCDIC platforms. It is not possible to use both --enable-unicode and
- --enable-ebcdic at the same time.
+ enabled by setting the PCRE2_UTF option at run time or starting a pattern
+ with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+ either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+ not possible to use both --enable-unicode and --enable-ebcdic at the same
+ time.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
However, only the basic two-letter properties such as Lu are supported.
+ Escape sequences such as \d and \w in patterns do not by default make use of
+ Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+ or starting a pattern with (*UCP).
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
of the preceding, or any of the Unicode newline sequences, as indicating the
@@ -825,4 +829,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015
diff --git a/doc/html/README.txt b/doc/html/README.txt
index 71d6f72..508fd1e 100644
--- a/doc/html/README.txt
+++ b/doc/html/README.txt
@@ -179,20 +179,24 @@ library. They are also documented in the pcre2build man page.
. If you do not want to make use of the support for UTF-8 Unicode character
strings in the 8-bit library, UTF-16 Unicode character strings in the 16-bit
- library, and UTF-32 Unicode character strings in the 32-bit library, you can
+ library, or UTF-32 Unicode character strings in the 32-bit library, you can
add --disable-unicode to the "configure" command. This reduces the size of
the libraries. It is not possible to configure one library with Unicode
support, and another without, in the same configuration.
When Unicode support is available, the use of a UTF encoding still has to be
- enabled by an option at run time. When PCRE2 is compiled with Unicode
- support, its input can only either be ASCII or UTF-8/16/32, even when running
- on EBCDIC platforms. It is not possible to use both --enable-unicode and
- --enable-ebcdic at the same time.
+ enabled by setting the PCRE2_UTF option at run time or starting a pattern
+ with (*UTF). When PCRE2 is compiled with Unicode support, its input can only
+ either be ASCII or UTF-8/16/32, even when running on EBCDIC platforms. It is
+ not possible to use both --enable-unicode and --enable-ebcdic at the same
+ time.
As well as supporting UTF strings, Unicode support includes support for the
\P, \p, and \X sequences that recognize Unicode character properties.
However, only the basic two-letter properties such as Lu are supported.
+ Escape sequences such as \d and \w in patterns do not by default make use of
+ Unicode properties, but can be made to do so by setting the PCRE2_UCP option
+ or starting a pattern with (*UCP).
. You can build PCRE2 to recognize either CR or LF or the sequence CRLF, or any
of the preceding, or any of the Unicode newline sequences, as indicating the
@@ -825,4 +829,4 @@ The distribution should contain the files listed below.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 20 January 2015
+Last updated: 26 January 2015
diff --git a/doc/html/pcre2build.html b/doc/html/pcre2build.html
index b87dad7..13204d4 100644
--- a/doc/html/pcre2build.html
+++ b/doc/html/pcre2build.html
@@ -127,8 +127,10 @@ in the same configuration.
</P>
<P>
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call <b>pcre2_compile()</b> to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call <b>pcre2_compile()</b> to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
</P>
<P>
UTF support allows the libraries to process character code points up to
@@ -139,6 +141,12 @@ as \P, \p, and \X. Only the general category properties such as <i>Lu</i> and
<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
documentation.
</P>
+<P>
+Pattern escapes such as \d and \w do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
+</P>
<br><a name="SEC6" href="#TOC1">JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
Just-in-time compiler support is included in the build by specifying
@@ -471,9 +479,9 @@ Cambridge, England.
</P>
<br><a name="SEC21" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 23 November 2014
+Last updated: 26 January 2015
<br>
-Copyright &copy; 1997-2014 University of Cambridge.
+Copyright &copy; 1997-2015 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 27de3f0..dccb648 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -110,7 +110,7 @@ Unicode property support
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \d and \w to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
table.
</P>
<P>
@@ -572,8 +572,8 @@ Unicode is discouraged.
</P>
<P>
By default, characters whose code points are greater than 127 never match \d,
-\s, or \w, and always match \D, \S, and \W, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\s, or \w, and always match \D, \S, and \W, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
@@ -1369,11 +1369,12 @@ syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
</P>
<P>
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-<b>pcre2_compile()</b>, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to <b>pcre2_compile()</b>, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
<pre>
[:alnum:] becomes \p{Xan}
[:alpha:] becomes \p{L}
@@ -1408,12 +1409,12 @@ not controls, that is, characters with the Zs property.
<P>
[:punct:]
This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
property.
</P>
<P>
The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
</P>
<br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
<P>
@@ -3248,7 +3249,7 @@ Cambridge, England.
</P>
<br><a name="SEC30" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 02 January 2015
+Last updated: 26 January 2015
<br>
Copyright &copy; 1997-2015 University of Cambridge.
<br>
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index d4e0057..80d841b 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -2874,17 +2874,23 @@ UNICODE AND UTF SUPPORT
another without, in the same configuration.
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
- UTF-16 or UTF-32. To do that, applications that use the library have to
- set the PCRE2_UTF option when they call pcre2_compile() to compile a
- pattern.
+ UTF-16 or UTF-32. To do that, applications that use the library can set
+ the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
+ tern. Alternatively, patterns may be started with (*UTF) unless the
+ application has locked this out by setting PCRE2_NEVER_UTF.
UTF support allows the libraries to process character code points up to
- 0x10ffff in the strings that they handle. It also provides support for
- accessing the Unicode properties of such characters, using pattern
- escapes such as \P, \p, and \X. Only the general category properties
- such as Lu and Nd are supported. Details are given in the pcre2pattern
+ 0x10ffff in the strings that they handle. It also provides support for
+ accessing the Unicode properties of such characters, using pattern
+ escapes such as \P, \p, and \X. Only the general category properties
+ such as Lu and Nd are supported. Details are given in the pcre2pattern
documentation.
+ Pattern escapes such as \d and \w do not by default make use of Unicode
+ properties. The application can request that they do by setting the
+ PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
+ pattern may also request this by starting with (*UCP).
+
JUST-IN-TIME COMPILER SUPPORT
@@ -3226,8 +3232,8 @@ AUTHOR
REVISION
- Last updated: 23 November 2014
- Copyright (c) 1997-2014 University of Cambridge.
+ Last updated: 26 January 2015
+ Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/doc/pcre2build.3 b/doc/pcre2build.3
index 6a081b3..55eab15 100644
--- a/doc/pcre2build.3
+++ b/doc/pcre2build.3
@@ -1,4 +1,4 @@
-.TH PCRE2BUILD 3 "23 November 2014" "PCRE2 10.00"
+.TH PCRE2BUILD 3 "26 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.
@@ -113,8 +113,10 @@ is not possible to build one library with Unicode support, and another without,
in the same configuration.
.P
Of itself, Unicode support does not make PCRE2 treat strings as UTF-8, UTF-16
-or UTF-32. To do that, applications that use the library have to set the
-PCRE2_UTF option when they call \fBpcre2_compile()\fP to compile a pattern.
+or UTF-32. To do that, applications that use the library can set the PCRE2_UTF
+option when they call \fBpcre2_compile()\fP to compile a pattern.
+Alternatively, patterns may be started with (*UTF) unless the application has
+locked this out by setting PCRE2_NEVER_UTF.
.P
UTF support allows the libraries to process character code points up to
0x10ffff in the strings that they handle. It also provides support for
@@ -125,6 +127,11 @@ as \eP, \ep, and \eX. Only the general category properties such as \fILu\fP and
\fBpcre2pattern\fP
.\"
documentation.
+.P
+Pattern escapes such as \ed and \ew do not by default make use of Unicode
+properties. The application can request that they do by setting the PCRE2_UCP
+option. Unless the application has set PCRE2_NEVER_UCP, a pattern may also
+request this by starting with (*UCP).
.
.
.SH "JUST-IN-TIME COMPILER SUPPORT"
@@ -487,6 +494,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 23 November 2014
-Copyright (c) 1997-2014 University of Cambridge.
+Last updated: 26 January 2015
+Copyright (c) 1997-2015 University of Cambridge.
.fi
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3
index fcd76a1..80ff849 100644
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -1,4 +1,4 @@
-.TH PCRE2PATTERN 3 "02 January 2015" "PCRE2 10.00"
+.TH PCRE2PATTERN 3 "26 January 2015" "PCRE2 10.00"
.SH NAME
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION DETAILS"
@@ -73,7 +73,7 @@ appearance in a pattern causes an error.
Another special sequence that may appear at the start of a pattern is (*UCP).
This has the same effect as setting the PCRE2_UCP option: it causes sequences
such as \ed and \ew to use Unicode properties to determine character types,
-instead of recognizing only characters with codes less than 128 via a lookup
+instead of recognizing only characters with codes less than 256 via a lookup
table.
.P
Some applications that allow their users to supply patterns may wish to
@@ -575,8 +575,8 @@ accented letters, and these are then matched by \ew. The use of locales with
Unicode is discouraged.
.P
By default, characters whose code points are greater than 127 never match \ed,
-\es, or \ew, and always match \eD, \eS, and \eW, although this may vary for
-characters in the range 128-255 when locale-specific matching is happening.
+\es, or \ew, and always match \eD, \eS, and \eW, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
These escape sequences retain their original meanings from before Unicode
support was available, mainly for efficiency reasons. If the PCRE2_UCP option
is set, the behaviour is changed so that Unicode properties are used to
@@ -1369,11 +1369,12 @@ matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
.P
-By default, characters with values greater than 128 do not match any of the
-POSIX character classes. However, if the PCRE2_UCP option is passed to
-\fBpcre2_compile()\fP, some of the classes are changed so that Unicode
-character properties are used. This is achieved by replacing certain POSIX
-classes by other sequences, as follows:
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
.sp
[:alnum:] becomes \ep{Xan}
[:alpha:] becomes \ep{L}
@@ -1404,11 +1405,11 @@ not controls, that is, characters with the Zs property.
.TP 10
[:punct:]
This matches all characters that have the Unicode P (punctuation) property,
-plus those characters with code points less than 128 that have the S (Symbol)
+plus those characters with code points less than 256 that have the S (Symbol)
property.
.P
The other POSIX classes are unchanged, and match only characters with code
-points less than 128.
+points less than 256.
.
.
.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
@@ -3292,6 +3293,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 02 January 2015
+Last updated: 26 January 2015
Copyright (c) 1997-2015 University of Cambridge.
.fi
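
The pcre2pattern hunks in this patch change the stated cut-off for [:punct:] under PCRE2_UCP from 128 to 256, and the pcre2build hunks note that \d ignores Unicode properties unless PCRE2_UCP or (*UCP) is in effect. The sketch below models those two documented rules in Python using the standard unicodedata module. It is an illustration of the rules as worded in the patched text, not PCRE2's implementation. The function names are hypothetical, the default case assumes no locale-specific tables are in use, and the \p{Nd} mapping for \d comes from the wider pcre2pattern documentation rather than from these hunks.

```python
import unicodedata


def ucp_punct(ch: str) -> bool:
    """Model of [:punct:] with PCRE2_UCP set, as the patched text words it:
    any character with the Unicode P (punctuation) property, plus characters
    with the S (Symbol) property whose code point is less than 256."""
    category = unicodedata.category(ch)
    if category.startswith("P"):
        return True
    return category.startswith("S") and ord(ch) < 256


def default_digit(ch: str) -> bool:
    """Model of \\d without PCRE2_UCP (assumption: no locale tables),
    so only the ASCII digits 0-9 count."""
    return "0" <= ch <= "9"


def ucp_digit(ch: str) -> bool:
    """Model of \\d with PCRE2_UCP: the Unicode decimal digit category Nd."""
    return unicodedata.category(ch) == "Nd"


if __name__ == "__main__":
    # A few sample characters: ASCII punctuation, symbols on either side of
    # the 256 boundary, a dash above 255, a letter, and a non-ASCII digit.
    samples = ["!", "$", "\u00d7", "\u20ac", "\u2014", "a", "\u0663"]
    for ch in samples:
        print(f"U+{ord(ch):04X}",
              "punct" if ucp_punct(ch) else "-",
              "default-digit" if default_digit(ch) else "-",
              "ucp-digit" if ucp_digit(ch) else "-")
```

Under this model a symbol such as U+00D7 (below 256) counts as [:punct:], while U+20AC (a symbol above 255) does not. U+0663 is a digit only in the UCP model.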