Diffstat (limited to 'doc/html/pcre2pattern.html')
-rw-r--r-- | doc/html/pcre2pattern.html | 67 |
1 files changed, 48 insertions, 19 deletions
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 0daddaf..c88e931 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -669,8 +669,8 @@ This is an example of an "atomic group", details of which are given
 This particular group matches either the two-character sequence CR followed by
 LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
 U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
-line, U+0085). The two-character sequence is treated as a single unit that
-cannot be split.
+line, U+0085). Because this is an atomic group, the two-character sequence is
+treated as a single unit that cannot be split.
 </P>
 <P>
 In other modes, two additional characters whose codepoints are greater than 255
@@ -1186,6 +1186,16 @@ when the <i>startoffset</i> argument of <b>pcre2_match()</b> is non-zero. The
 PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
 </P>
 <P>
+When the newline convention (see
+<a href="#newlines">"Newline conventions"</a>
+below) recognizes the two-character sequence CRLF as a newline, this is
+preferred, even if the single characters CR and LF are also recognized as
+newlines. For example, if the newline convention is "any", a multiline mode
+circumflex matches before "xyz" in the string "abc\r\nxyz" rather than after
+CR, even though CR on its own is a valid newline. (It also matches at the very
+start of the string, of course.)
+</P>
+<P>
 Note that the sequences \A, \Z, and \z can be used to match the start and end of
 the subject in both modes, and if all branches of a pattern start with \A it is
 always anchored, whether or not PCRE2_MULTILINE is set.
@@ -1236,7 +1246,7 @@ with \C in UTF-8 or UTF-16 mode means that the rest of the string may start
 with a malformed UTF character. This has undefined results, because PCRE2
 assumes that it is matching character by character in a valid UTF string (by
 default it checks the subject string's validity at the start of processing
-unless the PCRE2_NO_UTF_CHECK option is used).
+unless the PCRE2_NO_UTF_CHECK option is used).
 </P>
 <P>
 An application can lock out the use of \C by setting the
@@ -1247,9 +1257,9 @@ build PCRE2 with the use of \C permanently disabled.
 PCRE2 does not allow \C to appear in lookbehind assertions
 <a href="#lookbehind">(described below)</a>
 in a UTF mode, because this would make it impossible to calculate the length of
-the lookbehind. Neither the alternative matching function
-<b>pcre2_dfa_match()</b> not the JIT optimizer support \C in a UTF mode. The
-former gives a match-time error; the latter fails to optimize and so the match
+the lookbehind. Neither the alternative matching function
+<b>pcre2_dfa_match()</b> not the JIT optimizer support \C in a UTF mode. The
+former gives a match-time error; the latter fails to optimize and so the match
 is always run using the interpreter.
 </P>
 <P>
@@ -1341,11 +1351,11 @@ example [\000-\037]. Ranges can include any characters that are valid for the
 current mode.
 </P>
 <P>
-There is a special case in EBCDIC environments for ranges whose end points are
-both specified as literal letters in the same case. For compatibility with
-Perl, EBCDIC code points within the range that are not letters are omitted. For
-example, [h-k] matches only four characters, even though the codes for h and k
-are 0x88 and 0x92, a range of 11 code points. However, if the range is
+There is a special case in EBCDIC environments for ranges whose end points are
+both specified as literal letters in the same case. For compatibility with
+Perl, EBCDIC code points within the range that are not letters are omitted. For
+example, [h-k] matches only four characters, even though the codes for h and k
+are 0x88 and 0x92, a range of 11 code points. However, if the range is
 specified numerically, for example, [\x88-\x92] or [h-\x92], all code points are
 included.
 </P>
@@ -1672,6 +1682,10 @@ first one in the pattern with the given number. The following pattern matches
 <pre>
   /(?|(abc)|(def))(?1)/
 </pre>
+A relative reference such as (?-1) is no different: it is just a convenient way
+of computing an absolute group number.
+</P>
+<P>
 If a
 <a href="#conditions">condition test</a>
 for a subpattern's having matched refers to a non-unique number, the test is
@@ -2512,7 +2526,7 @@ For example:
   (?(VERSION>=10.4)yes|no)
 </pre>
 This pattern matches "yes" if the PCRE2 version is greater or equal to 10.4, or
-"no" otherwise. The fractional part of the version number may not contain more
+"no" otherwise. The fractional part of the version number may not contain more
 than two digits.
 </P>
 <br><b>
@@ -2626,6 +2640,21 @@ parentheses preceding the recursion. In other words, a negative number counts
 capturing parentheses leftwards from the point at which it is encountered.
 </P>
 <P>
+Be aware however, that if
+<a href="#dupsubpatternnumber">duplicate subpattern numbers</a>
+are in use, relative references refer to the earliest subpattern with the
+appropriate number. Consider, for example:
+<pre>
+  (?|(a)|(b)) (c) (?-2)
+</pre>
+The first two capturing groups (a) and (b) are both numbered 1, and group (c)
+is number 2. When the reference (?-2) is encountered, the second most recently
+opened parentheses has the number 1, but it is the first such group (the (a)
+group) to which the recursion refers. This would be the same if an absolute
+reference (?1) was used. In other words, relative references are just a
+shorthand for computing a group number.
+</P>
+<P>
 It is also possible to refer to subsequently opened parentheses, by writing
 references such as (?+2). However, these cannot be recursive because the
 reference is not inside the parentheses that are referenced. They are always
@@ -2929,13 +2958,13 @@ depending on whether or not a name is present.
 </P>
 <P>
 By default, for compatibility with Perl, a name is any sequence of characters
-that does not include a closing parenthesis. The name is not processed in
+that does not include a closing parenthesis. The name is not processed in
 any way, and it is not possible to include a closing parenthesis in the name.
-However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash processing
-is applied to verb names and only an unescaped closing parenthesis terminates
-the name. A closing parenthesis can be included in a name either as \) or
-between \Q and \E. If the PCRE2_EXTENDED option is set, unescaped whitespace
-in verb names is skipped and #-comments are recognized, exactly as in the rest
+However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash processing
+is applied to verb names and only an unescaped closing parenthesis terminates
+the name. A closing parenthesis can be included in a name either as \) or
+between \Q and \E. If the PCRE2_EXTENDED option is set, unescaped whitespace
+in verb names is skipped and #-comments are recognized, exactly as in the rest
 of the pattern.
 </P>
 <P>
@@ -3359,7 +3388,7 @@ Cambridge, England.
 </P>
 <br><a name="SEC30" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 01 November 2015
+Last updated: 13 November 2015
 <br>
 Copyright © 1997-2015 University of Cambridge.
 <br>