Diffstat (limited to 'doc/pcrepattern.3')
-rw-r--r-- doc/pcrepattern.3 69
1 files changed, 38 insertions, 31 deletions
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 1e2c078..c8091b7 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -21,15 +21,17 @@ published by O'Reilly, covers regular expressions in great detail. This
description of PCRE's regular expressions is intended as reference material.
.P
The original operation of PCRE was on strings of one-byte characters. However,
-there is now also support for UTF-8 strings in the original library, and a
-second library that supports 16-bit and UTF-16 character strings. To use these
+there is now also support for UTF-8 strings in the original library, an
+extra library that supports 16-bit and UTF-16 character strings, and an
+extra library that supports 32-bit and UTF-32 character strings. To use these
features, PCRE must be built to include appropriate support. When using UTF
-strings you must either call the compiling function with the PCRE_UTF8 or
-PCRE_UTF16 option, or the pattern must start with one of these special
-sequences:
+strings you must either call the compiling function with the PCRE_UTF8,
+PCRE_UTF16 or PCRE_UTF32 option, or the pattern must start with one of
+these special sequences:
.sp
(*UTF8)
(*UTF16)
+ (*UTF32)
.sp
Starting a pattern with such a sequence is equivalent to setting the relevant
option. This feature is not Perl-compatible. How setting a UTF mode affects
@@ -41,7 +43,7 @@ of features in the
page.
.P
Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8) or (*UTF16) is:
+combination with (*UTF8), (*UTF16) or (*UTF32) is:
.sp
(*UCP)
.sp
@@ -57,12 +59,12 @@ of newlines; they are described below.
.P
The remainder of this document discusses the patterns that are supported by
PCRE when one of its main matching functions, \fBpcre_exec()\fP (8-bit) or
-\fBpcre16_exec()\fP (16-bit), is used. PCRE also has alternative matching
-functions, \fBpcre_dfa_exec()\fP and \fBpcre16_dfa_exec()\fP, which match using
-a different algorithm that is not Perl-compatible. Some of the features
-discussed below are not available when DFA matching is used. The advantages and
-disadvantages of the alternative functions, and how they differ from the normal
-functions, are discussed in the
+\fBpcre[16|32]_exec()\fP (16- or 32-bit), is used. PCRE also has alternative
+matching functions, \fBpcre_dfa_exec()\fP and \fBpcre[16|32]_dfa_exec()\fP,
+which match using a different algorithm that is not Perl-compatible. Some of
+the features discussed below are not available when DFA matching is used. The
+advantages and disadvantages of the alternative functions, and how they differ
+from the normal functions, are discussed in the
.\" HREF
\fBpcrematching\fP
.\"
@@ -280,9 +282,11 @@ between \ex{ and }, but the character code is constrained as follows:
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint
16-bit non-UTF mode less than 0x10000
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint
+ 32-bit non-UTF mode less than 0x80000000
+ 32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
.sp
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
-"surrogate" codepoints).
+"surrogate" codepoints), and 0xffef.
.P
If characters other than hexadecimal digits appear between \ex{ and }, or if
there is no terminating }, this form of escape is not recognized. Instead, the
@@ -568,7 +572,7 @@ change of newline convention; for example, a pattern can start with:
.sp
(*ANY)(*BSR_ANYCRLF)
.sp
-They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special
+They can also be combined with the (*UTF8), (*UTF16), (*UTF32) or (*UCP) special
sequences. Inside a character class, \eR is treated as an unrecognized escape
sequence, and so matches the letter "R" by default, but causes an error if
PCRE_EXTRA is set.
@@ -779,7 +783,8 @@ a modifier or "other".
The Cs (Surrogate) property applies only to characters in the range U+D800 to
U+DFFF. Such characters are not valid in Unicode strings and so
cannot be tested by PCRE, unless UTF validity checking has been turned off
-(see the discussion of PCRE_NO_UTF8_CHECK and PCRE_NO_UTF16_CHECK in the
+(see the discussion of PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and
+PCRE_NO_UTF32_CHECK in the
.\" HREF
\fBpcreapi\fP
.\"
@@ -1056,15 +1061,16 @@ name; PCRE does not support this.
.sp
Outside a character class, the escape sequence \eC matches any one data unit,
whether or not a UTF mode is set. In the 8-bit library, one data unit is one
-byte; in the 16-bit library it is a 16-bit unit. Unlike a dot, \eC always
+byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is
+a 32-bit unit. Unlike a dot, \eC always
matches line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode, but it is unclear how it can usefully be
used. Because \eC breaks up characters into individual data units, matching one
unit with \eC in a UTF mode means that the rest of the string may start with a
malformed UTF character. This has undefined results, because PCRE assumes that
it is dealing with valid UTF strings (and by default it checks this at the
-start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option
-is used).
+start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or
+PCRE_NO_UTF32_CHECK option is used).
.P
PCRE does not allow \eC to appear in lookbehind assertions
.\" HTML <a href="#lookbehind">
@@ -1123,9 +1129,9 @@ circumflex is not an assertion; it still consumes a character from the subject
string, and therefore it fails if the current pointer is at the end of the
string.
.P
-In UTF-8 (UTF-16) mode, characters with values greater than 255 (0xffff) can be
-included in a class as a literal string of data units, or by using the \ex{
-escaping mechanism.
+In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255 (0xffff)
+can be included in a class as a literal string of data units, or by using the
+\ex{ escaping mechanism.
.P
When caseless matching is set, any letters in a class represent both their
upper case and lower case versions, so for example, a caseless [aeiou] matches
@@ -1338,9 +1344,10 @@ the section entitled
.\" </a>
"Newline sequences"
.\"
-above. There are also the (*UTF8), (*UTF16), and (*UCP) leading sequences that
-can be used to set UTF and Unicode property modes; they are equivalent to
-setting the PCRE_UTF8, PCRE_UTF16, and the PCRE_UCP options, respectively.
+above. There are also the (*UTF8), (*UTF16), (*UTF32) and (*UCP) leading
+sequences that can be used to set UTF and Unicode property modes; they are
+equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
+options, respectively.
.
.
.\" HTML <a name="subpattern"></a>
@@ -2602,8 +2609,8 @@ same pair of parentheses when there is a repetition.
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
code. The feature is called "callout". The caller of PCRE provides an external
function by putting its entry point in the global variable \fIpcre_callout\fP
-(8-bit library) or \fIpcre16_callout\fP (16-bit library). By default, this
-variable contains NULL, which disables all calling out.
+(8-bit library) or \fIpcre[16|32]_callout\fP (16-bit or 32-bit library).
+By default, this variable contains NULL, which disables all calling out.
.P
Within a regular expression, (?C) indicates the points at which the external
function is to be called. If you want to identify different callout points, you
@@ -2658,10 +2665,10 @@ parenthesis followed by an asterisk. They are generally of the form
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
depending on whether or not an argument is present. A name is any sequence of
characters that does not include a closing parenthesis. The maximum length of
-name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name
-is empty, that is, if the closing parenthesis immediately follows the colon,
-the effect is as if the colon were not there. Any number of these verbs may
-occur in a pattern.
+name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit libraries.
+If the name is empty, that is, if the closing parenthesis immediately follows
+the colon, the effect is as if the colon were not there. Any number of these
+verbs may occur in a pattern.
.
.
.\" HTML <a name="nooptimize"></a>
@@ -2946,7 +2953,7 @@ overrides.
.rs
.sp
\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
-\fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP.
+\fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP, \fBpcre32(3)\fP.
.
.
.SH AUTHOR
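
The \ex{...} limits that this patch extends to the 32-bit library can be restated as a small lookup. The following is a minimal Python sketch and not PCRE code: the names LIMITS, is_invalid_codepoint and hex_escape_allowed are hypothetical, only the table rows visible in the hunk are encoded, and the "less than" wording of the table is applied literally, which is an assumption about how the library treats the boundary value.

```python
# Exclusive upper bounds copied from the table rows visible in the diff.
# Keys are (library width in bits, UTF mode enabled). Rows that the hunk
# does not show are deliberately left out.
LIMITS = {
    (8, True): 0x10FFFF,
    (16, False): 0x10000,
    (16, True): 0x10FFFF,
    (32, False): 0x80000000,
    (32, True): 0x10FFFF,
}


def is_invalid_codepoint(value):
    """Codepoints the patched text lists as invalid: surrogates and 0xffef."""
    return 0xD800 <= value <= 0xDFFF or value == 0xFFEF


def hex_escape_allowed(width, utf, value):
    """Return True if the table would accept value in the given mode.

    Assumption: "less than" is taken literally, so the bound is exclusive.
    """
    limit = LIMITS[(width, utf)]
    if value < 0 or value >= limit:
        return False
    # The validity restriction applies only to the UTF rows of the table.
    if utf and is_invalid_codepoint(value):
        return False
    return True
```

Under this reading, hex_escape_allowed(32, False, 0xD800) is true because the non-UTF rows carry no "valid codepoint" condition, while the same value is rejected in UTF-32 mode.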