Diffstat (limited to 'doc/pcrepattern.3')
-rw-r--r-- | doc/pcrepattern.3 | 69 |
1 files changed, 38 insertions, 31 deletions
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 1e2c078..c8091b7 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -21,15 +21,17 @@ published by O'Reilly, covers regular expressions in great detail. This
 description of PCRE's regular expressions is intended as reference material.
 .P
 The original operation of PCRE was on strings of one-byte characters. However,
-there is now also support for UTF-8 strings in the original library, and a
-second library that supports 16-bit and UTF-16 character strings. To use these
+there is now also support for UTF-8 strings in the original library, an
+extra library that supports 16-bit and UTF-16 character strings, and an
+extra library that supports 32-bit and UTF-32 character strings. To use these
 features, PCRE must be built to include appropriate support. When using UTF
-strings you must either call the compiling function with the PCRE_UTF8 or
-PCRE_UTF16 option, or the pattern must start with one of these special
-sequences:
+strings you must either call the compiling function with the PCRE_UTF8,
+PCRE_UTF16 or PCRE_UTF32 option, or the pattern must start with one of
+these special sequences:
 .sp
   (*UTF8)
   (*UTF16)
+  (*UTF32)
 .sp
 Starting a pattern with such a sequence is equivalent to setting the relevant
 option. This feature is not Perl-compatible. How setting a UTF mode affects
@@ -41,7 +43,7 @@ of features in the
 page.
 .P
 Another special sequence that may appear at the start of a pattern or in
-combination with (*UTF8) or (*UTF16) is:
+combination with (*UTF8) or (*UTF16) or (*UTF32) is:
 .sp
   (*UCP)
 .sp
@@ -57,12 +59,12 @@ of newlines; they are described below.
 .P
 The remainder of this document discusses the patterns that are supported by
 PCRE when one its main matching functions, \fBpcre_exec()\fP (8-bit) or
-\fBpcre16_exec()\fP (16-bit), is used. PCRE also has alternative matching
-functions, \fBpcre_dfa_exec()\fP and \fBpcre16_dfa_exec()\fP, which match using
-a different algorithm that is not Perl-compatible. Some of the features
-discussed below are not available when DFA matching is used. The advantages and
-disadvantages of the alternative functions, and how they differ from the normal
-functions, are discussed in the
+\fBpcre[16|32]_exec()\fP (16- or 32-bit), is used. PCRE also has alternative
+matching functions, \fBpcre_dfa_exec()\fP and \fBpcre[16|32_dfa_exec()\fP,
+which match using a different algorithm that is not Perl-compatible. Some of
+the features discussed below are not available when DFA matching is used. The
+advantages and disadvantages of the alternative functions, and how they differ
+from the normal functions, are discussed in the
 .\" HREF
 \fBpcrematching\fP
 .\"
@@ -280,9 +282,11 @@ between \ex{ and }, but the character code is constrained as follows:
   8-bit UTF-8 mode    less than 0x10ffff and a valid codepoint
   16-bit non-UTF mode less than 0x10000
   16-bit UTF-16 mode  less than 0x10ffff and a valid codepoint
+  32-bit non-UTF mode less than 0x80000000
+  32-bit UTF-32 mode  less than 0x10ffff and a valid codepoint
 .sp
 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
-"surrogate" codepoints).
+"surrogate" codepoints), and 0xffef.
 .P
 If characters other than hexadecimal digits appear between \ex{ and }, or if
 there is no terminating }, this form of escape is not recognized. Instead, the
@@ -568,7 +572,7 @@ change of newline convention; for example, a pattern can start with:
 .sp
   (*ANY)(*BSR_ANYCRLF)
 .sp
-They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special
+They can also be combined with the (*UTF8), (*UTF16), (*UTF32) or (*UCP) special
 sequences. Inside a character class, \eR is treated as an unrecognized escape
 sequence, and so matches the letter "R" by default, but causes an error if
 PCRE_EXTRA is set.
@@ -779,7 +783,8 @@ a modifier or "other".
 The Cs (Surrogate) property applies only to characters in the range U+D800 to
 U+DFFF. Such characters are not valid in Unicode strings and so cannot be tested
 by PCRE, unless UTF validity checking has been turned off
-(see the discussion of PCRE_NO_UTF8_CHECK and PCRE_NO_UTF16_CHECK in the
+(see the discussion of PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and
+PCRE_NO_UTF32_CHECK in the
 .\" HREF
 \fBpcreapi\fP
 .\"
@@ -1056,15 +1061,16 @@ name; PCRE does not support this.
 .sp
 Outside a character class, the escape sequence \eC matches any one data unit,
 whether or not a UTF mode is set. In the 8-bit library, one data unit is one
-byte; in the 16-bit library it is a 16-bit unit. Unlike a dot, \eC always
+byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is
+a 32-bit unit. Unlike a dot, \eC always
 matches line-ending characters. The feature is provided in Perl in order to
 match individual bytes in UTF-8 mode, but it is unclear how it can usefully be
 used. Because \eC breaks up characters into individual data units, matching one
 unit with \eC in a UTF mode means that the rest of the string may start with a
 malformed UTF character. This has undefined results, because PCRE assumes that
 it is dealing with valid UTF strings (and by default it checks this at the
-start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option
-is used).
+start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or
+PCRE_NO_UTF32_CHECK option is used).
 .P
 PCRE does not allow \eC to appear in lookbehind assertions
 .\" HTML <a href="#lookbehind">
@@ -1123,9 +1129,9 @@ circumflex is not an assertion; it still consumes a character from the subject
 string, and therefore it fails if the current pointer is at the end of the
 string.
 .P
-In UTF-8 (UTF-16) mode, characters with values greater than 255 (0xffff) can be
-included in a class as a literal string of data units, or by using the \ex{
-escaping mechanism.
+In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255 (0xffff)
+can be included in a class as a literal string of data units, or by using the
+\ex{ escaping mechanism.
 .P
 When caseless matching is set, any letters in a class represent both their
 upper case and lower case versions, so for example, a caseless [aeiou] matches
@@ -1338,9 +1344,10 @@ the section entitled
 .\" </a>
 "Newline sequences"
 .\"
-above. There are also the (*UTF8), (*UTF16), and (*UCP) leading sequences that
-can be used to set UTF and Unicode property modes; they are equivalent to
-setting the PCRE_UTF8, PCRE_UTF16, and the PCRE_UCP options, respectively.
+above. There are also the (*UTF8), (*UTF16),(*UTF32) and (*UCP) leading
+sequences that can be used to set UTF and Unicode property modes; they are
+equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
+options, respectively.
 .
 .
 .\" HTML <a name="subpattern"></a>
@@ -2602,8 +2609,8 @@ same pair of parentheses when there is a repetition.
 PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
 code. The feature is called "callout". The caller of PCRE provides an external
 function by putting its entry point in the global variable \fIpcre_callout\fP
-(8-bit library) or \fIpcre16_callout\fP (16-bit library). By default, this
-variable contains NULL, which disables all calling out.
+(8-bit library) or \fIpcre[16|32]_callout\fP (16-bit or 32-bit library).
+By default, this variable contains NULL, which disables all calling out.
 .P
 Within a regular expression, (?C) indicates the points at which the external
 function is to be called. If you want to identify different callout points, you
@@ -2658,10 +2665,10 @@ parenthesis followed by an asterisk. They are generally of the form
 (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
 depending on whether or not an argument is present. A name is any sequence of
 characters that does not include a closing parenthesis. The maximum length of
-name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name
-is empty, that is, if the closing parenthesis immediately follows the colon,
-the effect is as if the colon were not there. Any number of these verbs may
-occur in a pattern.
+name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit library.
+If the name is empty, that is, if the closing parenthesis immediately follows
+the colon, the effect is as if the colon were not there. Any number of these
+verbs may occur in a pattern.
 .
 .
 .\" HTML <a name="nooptimize"></a>
@@ -2946,7 +2953,7 @@ overrides.
 .rs
 .sp
 \fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
-\fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP.
+\fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP, \fBpcre32(3)\fP.
 .
 .
 .SH AUTHOR
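
Reviewer's sketch (not part of the commit): the \ex{ hunk above extends the code-point limits to the 32-bit library. The rows visible in that hunk could be expressed roughly as the C check below. The enum, the function names, and the inclusive reading of the 0x10ffff bound are assumptions made for illustration; none of this is PCRE's own code or API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the width/UTF combinations listed in the hunk. */
typedef enum {
    MODE_8BIT_UTF8,
    MODE_16BIT_NON_UTF,
    MODE_16BIT_UTF16,
    MODE_32BIT_NON_UTF,
    MODE_32BIT_UTF32
} code_mode;

/* Code points the patched text calls invalid in a UTF mode:
   the surrogate range 0xd800..0xdfff, plus 0xffef as the diff states. */
static bool is_invalid_utf_codepoint(uint32_t c)
{
    return (c >= 0xd800 && c <= 0xdfff) || c == 0xffef;
}

/* Returns true if a \x{...} value would be accepted under the table's rules.
   Assumption: the table says "less than 0x10ffff"; this sketch treats the
   bound as inclusive, because U+10FFFF is the last Unicode code point. */
static bool hex_escape_in_range(code_mode mode, uint32_t c)
{
    switch (mode) {
    case MODE_16BIT_NON_UTF:
        return c < 0x10000;
    case MODE_32BIT_NON_UTF:
        return c < 0x80000000u;
    case MODE_8BIT_UTF8:
    case MODE_16BIT_UTF16:
    case MODE_32BIT_UTF32:
        return c <= 0x10ffff && !is_invalid_utf_codepoint(c);
    }
    return false;
}
```

Under these assumptions, a value such as 0xd800 passes in 16-bit non-UTF mode but is rejected in any UTF mode, which matches the distinction the table draws between plain data units and validated code points.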