Diffstat (limited to 'doc/pcre2pattern.3')
-rw-r--r-- | doc/pcre2pattern.3 | 829 |
1 files changed, 410 insertions, 419 deletions
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index 8157f9e..f26117f 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -1,4 +1,4 @@ -.TH PCRE2PATTERN 3 "27 November 2018" "PCRE2 10.33" +.TH PCRE2PATTERN 3 "04 February 2019" "PCRE2 10.33" .SH NAME PCRE2 - Perl-compatible regular expressions (revised API) .SH "PCRE2 REGULAR EXPRESSION DETAILS" @@ -20,13 +20,13 @@ copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly, covers regular expressions in great detail. This description of PCRE2's regular expressions is intended as reference material. .P -This document discusses the patterns that are supported by PCRE2 when its main -matching function, \fBpcre2_match()\fP, is used. PCRE2 also has an alternative -matching function, \fBpcre2_dfa_match()\fP, which matches using a different -algorithm that is not Perl-compatible. Some of the features discussed below are -not available when DFA matching is used. The advantages and disadvantages of -the alternative function, and how it differs from the normal function, are -discussed in the +This document discusses the regular expression patterns that are supported by +PCRE2 when its main matching function, \fBpcre2_match()\fP, is used. PCRE2 also +has an alternative matching function, \fBpcre2_dfa_match()\fP, which matches +using a different algorithm that is not Perl-compatible. Some of the features +discussed below are not available when DFA matching is used. The advantages and +disadvantages of the alternative function, and how it differs from the normal +function, are discussed in the .\" HREF \fBpcre2matching\fP .\" @@ -149,8 +149,8 @@ this indirectly restricts the amount of heap memory that is used, but there is also an explicit memory limit that can be set. 
.P These facilities are provided to catch runaway matches that are provoked by -patterns with huge matching trees (a typical example is a pattern with nested -unlimited repeats applied to a long string that does not match). When one of +patterns with huge matching trees. A common example is a pattern with nested +unlimited repeats applied to a long string that does not match. When one of these limits is reached, \fBpcre2_match()\fP gives an error return. The limits can also be set by items at the start of the pattern of the form .sp @@ -264,10 +264,10 @@ matches a portion of a subject string that is identical to itself. When caseless matching is specified (the PCRE2_CASELESS option), letters are matched independently of case. .P -The power of regular expressions comes from the ability to include alternatives -and repetitions in the pattern. These are encoded in the pattern by the use of -\fImetacharacters\fP, which do not stand for themselves but instead are -interpreted in some special way. +The power of regular expressions comes from the ability to include wild cards, +character classes, alternatives, and repetitions in the pattern. These are +encoded in the pattern by the use of \fImetacharacters\fP, which do not stand +for themselves but instead are interpreted in some special way. .P There are two different sets of metacharacters: those that are recognized anywhere in the pattern except within square brackets, and those that are @@ -280,14 +280,11 @@ are as follows: . match any character except newline (by default) [ start character class definition | start of alternative branch - ( start subpattern - ) end subpattern - ? extends the meaning of ( - also 0 or 1 quantifier - also quantifier minimizer + ( start group or control verb + ) end group or control verb * 0 or more quantifier - + 1 or more quantifier - also "possessive quantifier" + + 1 or more quantifier; also "possessive quantifier" + ? 
0 or 1 quantifier; also quantifier minimizer { start min/max quantifier .sp Part of a pattern that is in square brackets is called a "character class". In @@ -296,9 +293,7 @@ a character class the only metacharacters are: \e general escape character ^ negate the class, but only if the first character - indicates character range -.\" JOIN - [ POSIX character class (only if followed by POSIX - syntax) + [ POSIX character class (if followed by POSIX syntax) ] terminates the character class .sp The following sections describe the use of each of the metacharacters. @@ -308,7 +303,7 @@ The following sections describe the use of each of the metacharacters. .rs .sp The backslash character has several uses. Firstly, if it is followed by a -character that is not a number or a letter, it takes away any special meaning +character that is not a digit or a letter, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes. .P @@ -318,7 +313,7 @@ would otherwise be interpreted as a metacharacter, so it is always safe to precede a non-alphanumeric with backslash to specify that it stands for itself. In particular, if you want to match a backslash, you write \e\e. .P -In a UTF mode, only ASCII numbers and letters have any special meaning after a +In a UTF mode, only ASCII digits and letters have any special meaning after a backslash. All other characters (in particular, those whose code points are greater than 127) are treated as literals. .P @@ -328,13 +323,13 @@ outside a character class and the next newline, inclusive, are ignored. An escaping backslash can be used to include a white space or # character as part of the pattern. .P -If you want to remove the special meaning from a sequence of characters, you -can do so by putting them between \eQ and \eE. 
This is different from Perl in -that $ and @ are handled as literals in \eQ...\eE sequences in PCRE2, whereas -in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish -backslash interpolation" on any backslashes between \eQ and \eE which, its -documentation says, "may lead to confusing results". PCRE2 treats a backslash -between \eQ and \eE just like any other character. Note the following examples: +If you want to treat all characters in a sequence as literals, you can do so by +putting them between \eQ and \eE. This is different from Perl in that $ and @ +are handled as literals in \eQ...\eE sequences in PCRE2, whereas in Perl, $ and +@ cause variable interpolation. Also, Perl does "double-quotish backslash +interpolation" on any backslashes between \eQ and \eE which, its documentation +says, "may lead to confusing results". PCRE2 treats a backslash between \eQ and +\eE just like any other character. Note the following examples: .sp Pattern PCRE2 matches Perl matches .sp @@ -362,8 +357,8 @@ A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters in a pattern, but when a pattern is being prepared by text editing, it is often easier to use one of the following escape sequences -than the binary character it represents. In an ASCII or Unicode environment, -these escapes are as follows: +instead of the binary character it represents. In an ASCII or Unicode +environment, these escapes are as follows: .sp \ea alarm, that is, the BEL character (hex 07) \ecx "control-x", where x is any printable ASCII character @@ -441,17 +436,17 @@ and Perl has changed over time, causing PCRE2 also to change. .P Outside a character class, PCRE2 reads the digit and any following digits as a decimal number. 
If the number is less than 10, begins with the digit 8 or 9, or -if there are at least that many previous capturing left parentheses in the -expression, the entire sequence is taken as a \fIbackreference\fP. A -description of how this works is given +if there are at least that many previous capture groups in the expression, the +entire sequence is taken as a \fIbackreference\fP. A description of how this +works is given .\" HTML <a href="#backreferences"> .\" </a> later, .\" following the discussion of -.\" HTML <a href="#subpattern"> +.\" HTML <a href="#group"> .\" </a> -parenthesized subpatterns. +parenthesized groups. .\" Otherwise, up to three octal digits are read to form a character code. .P @@ -463,7 +458,7 @@ for themselves. For example, outside a character class: \e040 is another way of writing an ASCII space .\" JOIN \e40 is the same, provided there are fewer than 40 - previous capturing subpatterns + previous capture groups \e7 is always a backreference .\" JOIN \e11 might be a backreference, or another way of @@ -493,7 +488,9 @@ If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just described only when it is followed by two hexadecimal digits. Otherwise, it matches a literal "x" character. In this mode, support for code points greater than 256 is provided by \eu, which must be followed by four hexadecimal digits; -otherwise it matches a literal "u" character. +otherwise it matches a literal "u" character. This syntax makes PCRE2 behave +like ECMAscript (aka JavaScript). Code points greater than 0xFFFF are not +supported. .P Characters whose value is less than 256 can be defined by either of the two syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in @@ -553,9 +550,9 @@ can be coded as \eg{name}. Backreferences are discussed later, .\" following the discussion of -.\" HTML <a href="#subpattern"> +.\" HTML <a href="#group"> .\" </a> -parenthesized subpatterns. +parenthesized groups. .\" . . 
@@ -564,14 +561,14 @@ parenthesized subpatterns. .sp For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or a number enclosed either in angle brackets or single quotes, is an alternative -syntax for referencing a subpattern as a "subroutine". Details are discussed +syntax for referencing a capture group as a subroutine. Details are discussed .\" HTML <a href="#onigurumasubroutines"> .\" </a> later. .\" Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP synonymous. The former is a backreference; the latter is a -.\" HTML <a href="#subpatternsassubroutines"> +.\" HTML <a href="#groupsassubroutines"> .\" </a> subroutine .\" @@ -751,21 +748,22 @@ an error. .rs .sp When PCRE2 is built with Unicode support (the default), three additional escape -sequences that match characters with specific properties are available. In -8-bit non-UTF-8 mode, these sequences are of course limited to testing -characters whose code points are less than 256, but they do work in this mode. -In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit) -may be encountered. These are all treated as being in the Unknown script and -with an unassigned type. The extra escape sequences are: +sequences that match characters with specific properties are available. They +can be used in any mode, though in 8-bit and 16-bit non-UTF modes these +sequences are of course limited to testing characters whose code points are +less than U+0100 and U+10000, respectively. In 32-bit non-UTF mode, code points +greater than 0x10ffff (the Unicode limit) may be encountered. These are all +treated as being in the Unknown script and with an unassigned type. 
The extra +escape sequences are: .sp \ep{\fIxx\fP} a character with the \fIxx\fP property \eP{\fIxx\fP} a character without the \fIxx\fP property \eX a Unicode extended grapheme cluster .sp -The property names represented by \fIxx\fP above are limited to the Unicode -script names, the general category properties, "Any", which matches any -character (including newline), and some special PCRE2 properties (described -in the +The property names represented by \fIxx\fP above are case-sensitive. There is +support for Unicode script names, Unicode general category properties, "Any", +which matches any character (including newline), and some special PCRE2 +properties (described in the .\" HTML <a href="#extraprops"> .\" </a> next section). @@ -999,14 +997,16 @@ The special property L& is also supported: it matches a character that has the Lu, Ll, or Lt property, in other words, a letter that is not classified as a modifier or "other". .P -The Cs (Surrogate) property applies only to characters in the range U+D800 to -U+DFFF. Such characters are not valid in Unicode strings and so -cannot be tested by PCRE2, unless UTF validity checking has been turned off -(see the discussion of PCRE2_NO_UTF_CHECK in the +The Cs (Surrogate) property applies only to characters whose code points are in +the range U+D800 to U+DFFF. These characters are no different to any other +character when PCRE2 is not in UTF mode (using the 16-bit or 32-bit library). +However, they are not valid in Unicode strings and so cannot be tested by PCRE2 +in UTF mode, unless UTF validity checking has been turned off (see the +discussion of PCRE2_NO_UTF_CHECK in the .\" HREF \fBpcre2api\fP .\" -page). Perl does not support the Cs property. +page). 
.P The long synonyms for property names that Perl supports (such as \ep{Letter}) are not supported by PCRE2, nor is it permitted to prefix any of these @@ -1130,7 +1130,7 @@ a lookbehind assertion However, in this case, the part of the subject before the real match does not have to be of fixed length, as lookbehind assertions do. The use of \eK does not interfere with the setting of -.\" HTML <a href="#subpattern"> +.\" HTML <a href="#group"> .\" </a> captured substrings. .\" @@ -1161,7 +1161,7 @@ start of the reported match is earlier than where the match started. The final use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of -subpatterns for more complicated assertions is described +groups for more complicated assertions is described .\" HTML <a href="#bigassertions"> .\" </a> below. @@ -1183,12 +1183,12 @@ character. If any other of these assertions appears in a character class, an A word boundary is a position in the subject string where the current character and the previous character do not both match \ew or \eW (i.e. one matches \ew and the other matches \eW), or the start or end of the string if the -first or last character matches \ew, respectively. In a UTF mode, the meanings -of \ew and \eW can be changed by setting the PCRE2_UCP option. When this is -done, it also affects \eb and \eB. Neither PCRE2 nor Perl has a separate "start -of word" or "end of word" metasequence. However, whatever follows \eb normally -determines which it is. For example, the fragment \eba matches "a" at the start -of a word. +first or last character matches \ew, respectively. When PCRE2 is built with +Unicode support, the meanings of \ew and \eW can be changed by setting the +PCRE2_UCP option. When this is done, it also affects \eb and \eB. Neither PCRE2 +nor Perl has a separate "start of word" or "end of word" metasequence. 
However, +whatever follows \eb normally determines which it is. For example, the fragment +\eba matches "a" at the start of a word. .P The \eA, \eZ, and \ez assertions differ from the traditional circumflex and dollar (described in the next section) in that they only ever match at the very @@ -1380,9 +1380,9 @@ could be used with a UTF-8 string (ignore white space and line breaks): .sp In this example, a group that starts with (?| resets the capturing parentheses numbers in each alternative (see -.\" HTML <a href="#dupsubpatternnumber"> +.\" HTML <a href="#dupgroupnumber"> .\" </a> -"Duplicate Subpattern Numbers" +"Duplicate Group Numbers" .\" below). The assertions at the start of each branch check the next UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The @@ -1624,13 +1624,13 @@ the pattern matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one -that succeeds is used. If the alternatives are within a subpattern -.\" HTML <a href="#subpattern"> +that succeeds is used. If the alternatives are within a group +.\" HTML <a href="#group"> .\" </a> (defined below), .\" "succeeds" means matching the rest of the main pattern as well as the -alternative in the subpattern. +alternative in the group. . . .SH "INTERNAL OPTION SETTING" @@ -1673,16 +1673,16 @@ the same way as the Perl-compatible options by using the characters J and U respectively. However, these are not unset by (?^). .P When one of these option changes occurs at top level (that is, not inside -subpattern parentheses), the change applies to the remainder of the pattern -that follows. 
An option change within a subpattern (see below for a description -of subpatterns) affects only that part of the subpattern that follows it, so +group parentheses), the change applies to the remainder of the pattern +that follows. An option change within a group (see below for a description +of groups) affects only that part of the group that follows it, so .sp (a(?i)b)c .sp matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on -into subsequent branches within the same subpattern. For example, +into subsequent branches within the same group. For example, .sp (a(?i)b|c) .sp @@ -1692,7 +1692,7 @@ option settings happen at compile time. There would be some very weird behaviour otherwise. .P As a convenient shorthand, if any option settings are required at the start of -a non-capturing subpattern (see the next section), the option letters may +a non-capturing group (see the next section), the option letters may appear between the "?" and the ":". Thus the two patterns .sp (?i:saturday|sunday) @@ -1700,10 +1700,11 @@ appear between the "?" and the ":". Thus the two patterns .sp match exactly the same set of strings. .P -\fBNote:\fP There are other PCRE2-specific options that can be set by the -application when the compiling function is called. The pattern can contain -special leading sequences such as (*CRLF) to override what the application has -set or what has been defaulted. Details are given in the section entitled +\fBNote:\fP There are other PCRE2-specific options, applying to the whole +pattern, which can be set by the application when the compiling function is +called. In addition, the pattern can contain special leading sequences such as +(*CRLF) to override what the application has set or what has been defaulted. 
+Details are given in the section entitled .\" HTML <a href="#newlineseq"> .\" </a> "Newline sequences" @@ -1715,12 +1716,12 @@ the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use of the (*UTF) and (*UCP) sequences. . . -.\" HTML <a name="subpattern"></a> -.SH SUBPATTERNS +.\" HTML <a name="group"></a> +.SH GROUPS .rs .sp -Subpatterns are delimited by parentheses (round brackets), which can be nested. -Turning part of a pattern into a subpattern does two things: +Groups are delimited by parentheses (round brackets), which can be nested. +Turning part of a pattern into a group does two things: .sp 1. It localizes a set of alternatives. For example, the pattern .sp @@ -1729,15 +1730,15 @@ Turning part of a pattern into a subpattern does two things: matches "cataract", "caterpillar", or "cat". Without the parentheses, it would match "cataract", "erpillar" or an empty string. .sp -2. It sets up the subpattern as a capturing subpattern. This means that, when -the whole pattern matches, the portion of the subject string that matched the -subpattern is passed back to the caller, separately from the portion that -matched the whole pattern. (This applies only to the traditional matching -function; the DFA matching function does not support capturing.) +2. It creates a "capture group". This means that, when the whole pattern +matches, the portion of the subject string that matched the group is passed +back to the caller, separately from the portion that matched the whole pattern. +(This applies only to the traditional matching function; the DFA matching +function does not support capturing.) .P Opening parentheses are counted from left to right (starting from 1) to obtain -numbers for the capturing subpatterns. For example, if the string "the red -king" is matched against the pattern +numbers for capture groups. 
For example, if the string "the red king" is +matched against the pattern .sp the ((red|white) (king|queen)) .sp @@ -1745,38 +1746,37 @@ the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively. .P The fact that plain parentheses fulfil two functions is not always helpful. -There are often times when a grouping subpattern is required without a -capturing requirement. If an opening parenthesis is followed by a question mark -and a colon, the subpattern does not do any capturing, and is not counted when -computing the number of any subsequent capturing subpatterns. For example, if -the string "the white queen" is matched against the pattern +There are often times when grouping is required without capturing. If an +opening parenthesis is followed by a question mark and a colon, the group +does not do any capturing, and is not counted when computing the number of any +subsequent capture groups. For example, if the string "the white queen" +is matched against the pattern .sp the ((?:red|white) (king|queen)) .sp the captured substrings are "white queen" and "queen", and are numbered 1 and -2. The maximum number of capturing subpatterns is 65535. +2. The maximum number of capture groups is 65535. .P As a convenient shorthand, if any option settings are required at the start of -a non-capturing subpattern, the option letters may appear between the "?" and -the ":". Thus the two patterns +a non-capturing group, the option letters may appear between the "?" and the +":". Thus the two patterns .sp (?i:saturday|sunday) (?:(?i)saturday|sunday) .sp match exactly the same set of strings. 
Because alternative branches are tried -from left to right, and options are not reset until the end of the subpattern -is reached, an option setting in one branch does affect subsequent branches, so +from left to right, and options are not reset until the end of the group is +reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday". . . -.\" HTML <a name="dupsubpatternnumber"></a> -.SH "DUPLICATE SUBPATTERN NUMBERS" +.\" HTML <a name="dupgroupnumber"></a> +.SH "DUPLICATE GROUP NUMBERS" .rs .sp -Perl 5.10 introduced a feature whereby each alternative in a subpattern uses -the same numbers for its capturing parentheses. Such a subpattern starts with -(?| and is itself a non-capturing subpattern. For example, consider this -pattern: +Perl 5.10 introduced a feature whereby each alternative in a group uses the +same numbers for its capturing parentheses. Such a group starts with (?| and is +itself a non-capturing group. For example, consider this pattern: .sp (?|(Sat)ur|(Sun))day .sp @@ -1786,7 +1786,7 @@ at captured substring number one, whichever alternative matched. This construct is useful when you want to capture part, but not all, of one of a number of alternatives. Inside a (?| group, parentheses are numbered as usual, but the number is reset at the start of each branch. The numbers of any capturing -parentheses that follow the subpattern start after the highest number used in +parentheses that follow the whole group start after the highest number used in any branch. The following example is taken from the Perl documentation. The numbers underneath show in which buffer the captured content will be stored. .sp @@ -1794,13 +1794,12 @@ numbers underneath show in which buffer the captured content will be stored. 
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x # 1 2 2 3 2 3 4 .sp -A backreference to a numbered subpattern uses the most recent value that is -set for that number by any subpattern. The following pattern matches "abcabc" -or "defdef": +A backreference to a capture group uses the most recent value that is set for +the group. The following pattern matches "abcabc" or "defdef": .sp /(?|(abc)|(def))\e1/ .sp -In contrast, a subroutine call to a numbered subpattern always refers to the +In contrast, a subroutine call to a capture group always refers to the first one in the pattern with the given number. The following pattern matches "abcabc" or "defabc": .sp @@ -1814,29 +1813,35 @@ If a .\" </a> condition test .\" -for a subpattern's having matched refers to a non-unique number, the test is -true if any of the subpatterns of that number have matched. +for a group's having matched refers to a non-unique number, the test is +true if any group with that number has matched. .P An alternative approach to using this "branch reset" feature is to use -duplicate named subpatterns, as described in the next section. +duplicate named groups, as described in the next section. . . -.SH "NAMED SUBPATTERNS" +.SH "NAMED CAPTURE GROUPS" .rs .sp -Identifying capturing parentheses by number is simple, but it can be very hard -to keep track of the numbers in complicated patterns. Furthermore, if an -expression is modified, the numbers may change. To help with this difficulty, -PCRE2 supports the naming of capturing subpatterns. This feature was not added -to Perl until release 5.10. Python had the feature earlier, and PCRE1 -introduced it at release 4.0, using the Python syntax. PCRE2 supports both the -Perl and the Python syntax. +Identifying capture groups by number is simple, but it can be very hard to keep +track of the numbers in complicated patterns. Furthermore, if an expression is +modified, the numbers may change. 
To help with this difficulty, PCRE2 supports +the naming of capture groups. This feature was not added to Perl until release +5.10. Python had the feature earlier, and PCRE1 introduced it at release 4.0, +using the Python syntax. PCRE2 supports both the Perl and the Python syntax. .P -In PCRE2, a capturing subpattern can be named in one of three ways: -(?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. Names -consist of up to 32 alphanumeric characters and underscores, but must start -with a non-digit. References to capturing parentheses from other parts of the -pattern, such as +In PCRE2, a capture group can be named in one of three ways: (?<name>...) or +(?'name'...) as in Perl, or (?P<name>...) as in Python. Names may be up to 32 +code units long. When PCRE2_UTF is not set, they may contain only ASCII +alphanumeric characters and underscores, but must start with a non-digit. When +PCRE2_UTF is set, the syntax of group names is extended to allow any Unicode +letter or Unicode decimal digit. In other words, group names must match one of +these patterns: +.sp + ^[_A-Za-z][_A-Za-z0-9]*\ez when PCRE2_UTF is not set + ^[_\ep{L}][_\ep{L}\ep{Nd}]*\ez when PCRE2_UTF is set +.sp +References to capture groups from other parts of the pattern, such as .\" HTML <a href="#backreferences"> .\" </a> backreferences, @@ -1852,17 +1857,17 @@ conditions, .\" can all be made by name as well as by number. .P -Named capturing parentheses are allocated numbers as well as names, exactly as -if the names were not present. In both PCRE2 and Perl, capturing subpatterns +Named capture groups are allocated numbers as well as names, exactly as +if the names were not present. In both PCRE2 and Perl, capture groups are primarily identified by numbers; any names are just aliases for these numbers. 
The PCRE2 API provides function calls for extracting the complete name-to-number translation table from a compiled pattern, as well as convenience functions for extracting captured substrings by name. .P -\fBWarning:\fP When more than one subpattern has the same number, as described -in the previous section, a name given to one of them applies to all of them. -Perl allows identically numbered subpatterns to have different names. Consider -this pattern, where there are two capturing subpatterns, both numbered 1: +\fBWarning:\fP When more than one capture group has the same number, as +described in the previous section, a name given to one of them applies to all +of them. Perl allows identically numbered groups to have different names. +Consider this pattern, where there are two capture groups, both numbered 1: .sp (?|(?<AA>aa)|(?<BB>bb)) .sp @@ -1876,20 +1881,20 @@ pattern: .sp (?|(?<AA>aa)|(bb)) .sp -Although the second subpattern number 1 is not explicitly named, the name AA is -still an alias for subpattern 1. Whether the pattern matches "aa" or "bb", a +Although the second group number 1 is not explicitly named, the name AA is +still an alias for any group 1. Whether the pattern matches "aa" or "bb", a reference by name to group AA yields the matched string. .P By default, a name must be unique within a pattern, except that duplicate names -are permitted for subpatterns with the same number, for example: +are permitted for groups with the same number, for example: .sp (?|(?<AA>aa)|(?<AA>bb)) .sp The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES option at compile time, or by the use of (?J) within the pattern. Duplicate -names can be useful for patterns where only one instance of the named -parentheses can match. 
Suppose you want to match the name of a weekday, either -as a 3-letter abbreviation or as the full name, and in both cases you want to +names can be useful for patterns where only one instance of the named capture +group can match. Suppose you want to match the name of a weekday, either as a +3-letter abbreviation or as the full name, and in both cases you want to extract the abbreviation. This pattern (ignoring the line breaks) does the job: .sp (?<DN>Mon|Fri|Sun)(?:day)?| @@ -1898,23 +1903,23 @@ extract the abbreviation. This pattern (ignoring the line breaks) does the job: (?<DN>Thu)(?:rsday)?| (?<DN>Sat)(?:urday)? .sp -There are five capturing substrings, but only one is ever set after a match. -The convenience functions for extracting the data by name returns the substring -for the first (and in this example, the only) subpattern of that name that -matched. This saves searching to find which numbered subpattern it was. (An -alternative way of solving this problem is to use a "branch reset" subpattern, -as described in the previous section.) +There are five capture groups, but only one is ever set after a match. The +convenience functions for extracting the data by name returns the substring for +the first (and in this example, the only) group of that name that matched. This +saves searching to find which numbered group it was. (An alternative way of +solving this problem is to use a "branch reset" group, as described in the +previous section.) .P -If you make a backreference to a non-unique named subpattern from elsewhere in -the pattern, the subpatterns to which the name refers are checked in the order -in which they appear in the overall pattern. The first one that is set is used -for the reference. 
For example, this pattern matches both "foofoo" and -"barbar" but not "foobar" or "barfoo": +If you make a backreference to a non-unique named group from elsewhere in the +pattern, the groups to which the name refers are checked in the order in which +they appear in the overall pattern. The first one that is set is used for the +reference. For example, this pattern matches both "foofoo" and "barbar" but not +"foobar" or "barfoo": .sp (?:(?<n>foo)|(?<n>bar))\ek<n> .sp .P -If you make a subroutine call to a non-unique named subpattern, the one that +If you make a subroutine call to a non-unique named group, the one that corresponds to the first occurrence of the name is used. In the absence of duplicate numbers this is the one with the lowest number. .P @@ -1925,11 +1930,11 @@ test (see the .\" </a> section about conditions .\" -below), either to check whether a subpattern has matched, or to check for -recursion, all subpatterns with the same name are tested. If the condition is -true for any one of them, the overall condition is true. This is the same -behaviour as testing by number. For further details of the interfaces for -handling named subpatterns, see the +below), either to check whether a capture group has matched, or to check for +recursion, all groups with the same name are tested. If the condition is true +for any one of them, the overall condition is true. This is the same behaviour +as testing by number. 
For further details of the interfaces for handling named +capture groups, see the .\" HREF \fBpcre2api\fP .\" @@ -1945,18 +1950,18 @@ items: a literal data character the dot metacharacter the \eC escape sequence - the \eX escape sequence the \eR escape sequence + the \eX escape sequence an escape such as \ed or \epL that matches a single character a character class a backreference - a parenthesized subpattern (including most assertions) - a subroutine call to a subpattern (recursive or otherwise) + a parenthesized group (including most assertions) + a subroutine call (recursive or otherwise) .sp The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must -be less than or equal to the second. For example: +be less than or equal to the second. For example, .sp z{2,4} .sp @@ -1984,18 +1989,18 @@ several code units long (and they may be of different lengths). .P The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present. This may be useful for -subpatterns that are referenced as -.\" HTML <a href="#subpatternsassubroutines"> +capture groups that are referenced as +.\" HTML <a href="#groupsassubroutines"> .\" </a> subroutines .\" from elsewhere in the pattern (but see also the section entitled .\" HTML <a href="#subdefine"> .\" </a> -"Defining subpatterns for use by reference only" +"Defining capture groups for use by reference only" .\" -below). Items other than subpatterns that have a {0} quantifier are omitted -from the compiled pattern. +below). Except for parenthesized groups, items that have a {0} quantifier are +omitted from the compiled pattern. .P For convenience, the three most common quantifiers have single-character abbreviations: @@ -2004,22 +2009,22 @@ abbreviations: + is equivalent to {1,} ? 
is equivalent to {0,1} .sp -It is possible to construct infinite loops by following a subpattern that can -match no characters with a quantifier that has no upper limit, for example: +It is possible to construct infinite loops by following a group that can match +no characters with a quantifier that has no upper limit, for example: .sp (a?)* .sp Earlier versions of Perl and PCRE1 used to give an error at compile time for such patterns. However, because there are cases where this can be useful, such -patterns are now accepted, but if any repetition of the subpattern does in fact +patterns are now accepted, but if any repetition of the group does in fact match no characters, the loop is forcibly broken. .P -By default, the quantifiers are "greedy", that is, they match as much as -possible (up to the maximum number of permitted times), without causing the -rest of the pattern to fail. The classic example of where this gives problems -is in trying to match comments in C programs. These appear between /* and */ -and within the comment, individual * and / characters may appear. An attempt to -match C comments by applying the pattern +By default, quantifiers are "greedy", that is, they match as much as possible +(up to the maximum number of permitted times), without causing the rest of the +pattern to fail. The classic example of where this gives problems is in trying +to match comments in C programs. These appear between /* and */ and within the +comment, individual * and / characters may appear. An attempt to match C +comments by applying the pattern .sp /\e*.*\e*/ .sp @@ -2028,10 +2033,9 @@ to the string /* first comment */ not comment /* second comment */ .sp fails, because it matches the entire string owing to the greediness of the .* -item. -.P -If a quantifier is followed by a question mark, it ceases to be greedy, and -instead matches the minimum number of times possible, so the pattern +item. 
However, if a quantifier is followed by a question mark, it ceases to be +greedy, and instead matches the minimum number of times possible, so the +pattern .sp /\e*.*?\e*/ .sp @@ -2050,7 +2054,7 @@ the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behaviour. .P -When a parenthesized subpattern is quantified with a minimum repeat count that +When a parenthesized group is quantified with a minimum repeat count that is greater than 1 or with a limited maximum, more memory is required for the compiled pattern, in proportion to the size of the minimum or maximum. .P @@ -2085,15 +2089,14 @@ It matches "ab" in the subject "aab". The use of the backtracking control verbs (*PRUNE) and (*SKIP) also disable this optimization, and there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly. .P -When a capturing subpattern is repeated, the value captured is the substring -that matched the final iteration. For example, after +When a capture group is repeated, the value captured is the substring that +matched the final iteration. For example, after .sp (tweedle[dume]{3}\es*)+ .sp has matched "tweedledum tweedledee" the value of the captured substring is -"tweedledee". However, if there are nested capturing subpatterns, the -corresponding captured values may have been set in previous iterations. For -example, after +"tweedledee". However, if there are nested capture groups, the corresponding +captured values may have been set in previous iterations. For example, after .sp (a|(b))+ .sp @@ -2119,7 +2122,7 @@ After matching all 6 digits and then failing to match "foo", the normal action of the matcher is to try again with only 5 digits matching the \ed+ item, and then with 4, and so on, before ultimately failing. 
"Atomic grouping" (a term taken from Jeffrey Friedl's book) provides the means for specifying -that once a subpattern has matched, it is not to be re-evaluated in this way. +that once a group has matched, it is not to be re-evaluated in this way. .P If we use atomic grouping for the previous example, the matcher gives up immediately on failing to match "foo" the first time. The notation is a kind of @@ -2132,23 +2135,23 @@ be easier to remember: .sp (*atomic:\ed+)foo .sp -This kind of parenthesis "locks up" the part of the pattern it contains once -it has matched, and a failure further into the pattern is prevented from -backtracking into it. Backtracking past it to previous items, however, works as -normal. +This kind of parenthesized group "locks up" the part of the pattern it +contains once it has matched, and a failure further into the pattern is +prevented from backtracking into it. Backtracking past it to previous items, +however, works as normal. .P -An alternative description is that a subpattern of this type matches exactly -the string of characters that an identical standalone pattern would match, if +An alternative description is that a group of this type matches exactly the +string of characters that an identical standalone pattern would match, if anchored at the current point in the subject string. .P -Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as -the above example can be thought of as a maximizing repeat that must swallow -everything it can. So, while both \ed+ and \ed+? are prepared to adjust the -number of digits they match in order to make the rest of the pattern match, -(?>\ed+) can only match an entire sequence of digits. +Atomic groups are not capture groups. Simple cases such as the above example +can be thought of as a maximizing repeat that must swallow everything it can. +So, while both \ed+ and \ed+? 
are prepared to adjust the number of digits they +match in order to make the rest of the pattern match, (?>\ed+) can only match +an entire sequence of digits. .P Atomic groups in general can of course contain arbitrarily complicated -subpatterns, and can be nested. However, when the subpattern for an atomic +expressions, and can be nested. However, when the contents of an atomic group is just a single repeated item, as in the example above, a simpler notation, called a "possessive quantifier" can be used. This consists of an additional + character following a quantifier. Using this notation, the @@ -2170,8 +2173,8 @@ difference; possessive quantifiers should be slightly faster. The possessive quantifier syntax is an extension to the Perl 5.8 syntax. Jeffrey Friedl originated the idea (and the name) in the first edition of his book. Mike McCloskey liked it, so implemented it when he built Sun's Java -package, and PCRE1 copied it from there. It ultimately found its way into Perl -at release 5.10. +package, and PCRE1 copied it from there. It found its way into Perl at release +5.10. .P PCRE2 has an optimization that automatically "possessifies" certain simple pattern constructs. For example, the sequence A+B is treated as A++B because @@ -2179,10 +2182,9 @@ there is no point in backtracking into a sequence of A's when B must follow. This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting the pattern with (*NO_AUTO_POSSESS). .P -When a pattern contains an unlimited repeat inside a subpattern that can itself -be repeated an unlimited number of times, the use of an atomic group is the -only way to avoid some failing matches taking a very long time indeed. The -pattern +When a pattern contains an unlimited repeat inside a group that can itself be +repeated an unlimited number of times, the use of an atomic group is the only +way to avoid some failing matches taking a very long time indeed. The pattern .sp (\eD+|<\ed+>)*[!?] 
.sp @@ -2211,29 +2213,28 @@ sequences of non-digits cannot be broken, and failure happens quickly. .rs .sp Outside a character class, a backslash followed by a digit greater than 0 (and -possibly further digits) is a backreference to a capturing subpattern earlier -(that is, to its left) in the pattern, provided there have been that many -previous capturing left parentheses. +possibly further digits) is a backreference to a capture group earlier (that +is, to its left) in the pattern, provided there have been that many previous +capture groups. .P However, if the decimal number following the backslash is less than 8, it is -always taken as a backreference, and causes an error only if there are not -that many capturing left parentheses in the entire pattern. In other words, the -parentheses that are referenced need not be to the left of the reference for -numbers less than 8. A "forward backreference" of this type can make sense -when a repetition is involved and the subpattern to the right has participated -in an earlier iteration. -.P -It is not possible to have a numerical "forward backreference" to a subpattern -whose number is 8 or more using this syntax because a sequence such as \e50 is +always taken as a backreference, and causes an error only if there are not that +many capture groups in the entire pattern. In other words, the group that is +referenced need not be to the left of the reference for numbers less than 8. A +"forward backreference" of this type can make sense when a repetition is +involved and the group to the right has participated in an earlier iteration. +.P +It is not possible to have a numerical "forward backreference" to a group whose +number is 8 or more using this syntax because a sequence such as \e50 is interpreted as a character defined in octal. See the subsection entitled "Non-printing characters" .\" HTML <a href="#digitsafterbackslash"> .\" </a> above .\" -for further details of the handling of digits following a backslash. 
There is -no such problem when named parentheses are used. A backreference to any -subpattern is possible using named parentheses (see below). +for further details of the handling of digits following a backslash. Other +forms of backreferencing do not suffer from this restriction. In particular, +there is no problem when named capture groups are used (see below). .P Another way of avoiding the ambiguity inherent in the use of digits following a backslash is to use the \eg escape sequence. This escape must be followed by a @@ -2250,22 +2251,22 @@ the reference. A signed number is a relative reference. Consider this example: .sp (abc(def)ghi)\eg{-1} .sp -The sequence \eg{-1} is a reference to the most recently started capturing -subpattern before \eg, that is, is it equivalent to \e2 in this example. -Similarly, \eg{-2} would be equivalent to \e1. The use of relative references -can be helpful in long patterns, and also in patterns that are created by -joining together fragments that contain references within themselves. +The sequence \eg{-1} is a reference to the most recently started capture group +before \eg, that is, is it equivalent to \e2 in this example. Similarly, +\eg{-2} would be equivalent to \e1. The use of relative references can be +helpful in long patterns, and also in patterns that are created by joining +together fragments that contain references within themselves. .P -The sequence \eg{+1} is a reference to the next capturing subpattern. This kind -of forward reference can be useful it patterns that repeat. Perl does not -support the use of + in this way. +The sequence \eg{+1} is a reference to the next capture group. This kind of +forward reference can be useful in patterns that repeat. Perl does not support +the use of + in this way. 
.P -A backreference matches whatever actually matched the capturing subpattern in -the current subject string, rather than anything matching the subpattern -itself (see -.\" HTML <a href="#subpatternsassubroutines"> +A backreference matches whatever actually most recently matched the capture +group in the current subject string, rather than anything at all that matches +the group (see +.\" HTML <a href="#groupsassubroutines"> .\" </a> -"Subpatterns as subroutines" +"Groups as subroutines" .\" below for a way of doing that). So the pattern .sp @@ -2278,26 +2279,26 @@ backreference, the case of letters is relevant. For example, ((?i)rah)\es+\e1 .sp matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original -capturing subpattern is matched caselessly. +capture group is matched caselessly. .P -There are several different ways of writing backreferences to named -subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or -\ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified +There are several different ways of writing backreferences to named capture +groups. The .NET syntax \ek{name} and the Perl syntax \ek<name> or \ek'name' +are supported, as is the Python syntax (?P=name). Perl 5.10's unified backreference syntax, in which \eg can be used for both numeric and named -references, is also supported. We could rewrite the above example in any of -the following ways: +references, is also supported. We could rewrite the above example in any of the +following ways: .sp (?<p1>(?i)rah)\es+\ek<p1> (?'p1'(?i)rah)\es+\ek{p1} (?P<p1>(?i)rah)\es+(?P=p1) (?<p1>(?i)rah)\es+\eg{p1} .sp -A subpattern that is referenced by name may appear in the pattern before or +A capture group that is referenced by name may appear in the pattern before or after the reference. .P -There may be more than one backreference to the same subpattern. 
If a -subpattern has not actually been used in a particular match, any backreferences -to it always fail by default. For example, the pattern +There may be more than one backreference to the same group. If a group has not +actually been used in a particular match, backreferences to it always fail by +default. For example, the pattern .sp (a|(bc))\e2 .sp @@ -2305,12 +2306,11 @@ always fails if it starts to match "a" rather than "bc". However, if the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an unset value matches an empty string. .P -Because there may be many capturing parentheses in a pattern, all digits -following a backslash are taken as part of a potential backreference number. -If the pattern continues with a digit character, some delimiter must be used to -terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE -option is set, this can be white space. Otherwise, the \eg{ syntax or an empty -comment (see +Because there may be many capture groups in a pattern, all digits following a +backslash are taken as part of a potential backreference number. If the pattern +continues with a digit character, some delimiter must be used to terminate the +backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this +can be white space. Otherwise, the \eg{} syntax or an empty comment (see .\" HTML <a href="#comments"> .\" </a> "Comments" @@ -2321,19 +2321,18 @@ below) can be used. .SS "Recursive backreferences" .rs .sp -A backreference that occurs inside the parentheses to which it refers fails -when the subpattern is first used, so, for example, (a\e1) never matches. -However, such references can be useful inside repeated subpatterns. For -example, the pattern +A backreference that occurs inside the group to which it refers fails when the +group is first used, so, for example, (a\e1) never matches. However, such +references can be useful inside repeated groups. 
For example, the pattern .sp (a|b\e1)+ .sp matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of -the subpattern, the backreference matches the character string corresponding -to the previous iteration. In order for this to work, the pattern must be such -that the first iteration does not need to match the backreference. This can be -done using alternation, as in the example above, or by a quantifier with a -minimum of zero. +the group, the backreference matches the character string corresponding to the +previous iteration. In order for this to work, the pattern must be such that +the first iteration does not need to match the backreference. This can be done +using alternation, as in the example above, or by a quantifier with a minimum +of zero. .P Backreferences of this type cause the group that they reference to be treated as an @@ -2357,28 +2356,28 @@ coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described above. .\" .P -More complicated assertions are coded as subpatterns. There are two kinds: -those that look ahead of the current position in the subject string, and those -that look behind it, and in each case an assertion may be positive (must match -for the assertion to be true) or negative (must not match for the assertion to -be true). An assertion subpattern is matched in the normal way, and if it is -true, matching continues after it, but with the matching position in the -subject string is was it was before the assertion was processed. +More complicated assertions are coded as parenthesized groups. There are two +kinds: those that look ahead of the current position in the subject string, and +those that look behind it, and in each case an assertion may be positive (must +match for the assertion to be true) or negative (must not match for the +assertion to be true). 
An assertion group is matched in the normal way, +and if it is true, matching continues after it, but with the matching position +in the subject string is was it was before the assertion was processed. .P A lookaround assertion may also appear as the condition in a .\" HTML <a href="#conditions"> .\" </a> -conditional subpattern +conditional group .\" (see below). In this case, the result of matching the assertion determines which branch of the condition is followed. .P -Assertion subpatterns are not capturing subpatterns. If an assertion contains -capturing subpatterns within it, these are counted for the purposes of -numbering the capturing subpatterns in the whole pattern. Within each branch of -an assertion, locally captured substrings may be referenced in the usual way. -For example, a sequence such as (.)\eg{-1} can be used to check that two -adjacent characters are the same. +Assertion groups are not capture groups. If an assertion contains capture +groups within it, these are counted for the purposes of numbering the capture +groups in the whole pattern. Within each branch of an assertion, locally +captured substrings may be referenced in the usual way. For example, a sequence +such as (.)\eg{-1} can be used to check that two adjacent characters are the +same. .P When a branch within an assertion fails to match, any substrings that were captured are discarded (as happens with any pattern branch that fails to @@ -2393,23 +2392,23 @@ the assertion. For a negative assertion, a matching branch means that the assertion is not true. If such an assertion is being used as a condition in a .\" HTML <a href="#conditions"> .\" </a> -conditional subpattern +conditional group .\" (see below), captured substrings are retained, because matching continues with the "no" branch of the condition. For other failing negative assertions, control passes to the previous backtracking point, thus discarding any captured strings within the assertion. 
.P -For compatibility with Perl, most assertion subpatterns may be repeated; though -it makes no sense to assert the same thing several times, the side effect of -capturing parentheses may occasionally be useful. However, an assertion that -forms the condition for a conditional subpattern may not be quantified. In -practice, for other assertions, there only three cases: +For compatibility with Perl, most assertion groups may be repeated; though it +makes no sense to assert the same thing several times, the side effect of +capturing may occasionally be useful. However, an assertion that forms the +condition for a conditional group may not be quantified. In practice, for +other assertions, there only three cases: .sp (1) If the quantifier is {0}, the assertion is never obeyed during matching. -However, it may contain internal capturing parenthesized groups that are called -from elsewhere via the -.\" HTML <a href="#subpatternsassubroutines"> +However, it may contain internal capture groups that are called from elsewhere +via the +.\" HTML <a href="#groupsassubroutines"> .\" </a> subroutine mechanism. .\" @@ -2436,9 +2435,9 @@ following synonyms: (*positive_lookbehind: or (*plb: is the same as (?<= (*negative_lookbehind: or (*nlb: is the same as (?<! .sp -For example, (*pla:foo) is the same assertion as (?=foo). However, in the -following sections, the various assertions are described using the original -symbolic forms. +For example, (*pla:foo) is the same assertion as (?=foo). In the following +sections, the various assertions are described using the original symbolic +forms. . . .SS "Lookahead assertions" @@ -2522,12 +2521,12 @@ because it makes it impossible to calculate the length of the lookbehind. The \eX and \eR escapes, which can match different numbers of code units, are never permitted in lookbehinds. 
.P -.\" HTML <a href="#subpatternsassubroutines"> +.\" HTML <a href="#groupsassubroutines"> .\" </a> "Subroutine" .\" calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long -as the subpattern matches a fixed-length string. However, +as the called capture group matches a fixed-length string. However, .\" HTML <a href="#recursion"> .\" </a> recursion, @@ -2538,10 +2537,10 @@ is not supported. Perl does not support backreferences in lookbehinds. PCRE2 does support them, but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use of (?| in the pattern (it creates -duplicate subpattern numbers), and if the backreference is by name, the name -must be unique. Of course, the referenced subpattern must itself be of fixed -length. The following pattern matches words containing at least two characters -that begin and end with the same character: +duplicate group numbers), and if the backreference is by name, the name +must be unique. Of course, the referenced group must itself match a fixed +length substring. The following pattern matches words containing at least two +characters that begin and end with the same character: .sp \eb(\ew)\ew++(?<=\e1) .P @@ -2669,18 +2668,18 @@ parentheses. .\" </a> (see below) .\" -should not be used within a script run subpattern, because it causes an -immediate exit from the subpattern, bypassing the script run checking. +should not be used within a script run group, because it causes an immediate +exit from the group, bypassing the script run checking. . . .\" HTML <a name="conditions"></a> -.SH "CONDITIONAL SUBPATTERNS" +.SH "CONDITIONAL GROUPS" .rs .sp -It is possible to cause the matching process to obey a subpattern -conditionally or to choose between two alternative subpatterns, depending on -the result of an assertion, or whether a specific capturing subpattern has -already been matched. 
The two possible forms of conditional subpattern are: +It is possible to cause the matching process to obey a pattern fragment +conditionally or to choose between two alternative fragments, depending on +the result of an assertion, or whether a specific capture group has +already been matched. The two possible forms of conditional group are: .sp (?(condition)yes-pattern) (?(condition)yes-pattern|no-pattern) @@ -2688,38 +2687,36 @@ already been matched. The two possible forms of conditional subpattern are: If the condition is satisfied, the yes-pattern is used; otherwise the no-pattern (if present) is used. An absent no-pattern is equivalent to an empty string (it always matches). If there are more than two alternatives in the -subpattern, a compile-time error occurs. Each of the two alternatives may -itself contain nested subpatterns of any form, including conditional -subpatterns; the restriction to two alternatives applies only at the level of -the condition. This pattern fragment is an example where the alternatives are -complex: +group, a compile-time error occurs. Each of the two alternatives may itself +contain nested groups of any form, including conditional groups; the +restriction to two alternatives applies only at the level of the condition +itself. This pattern fragment is an example where the alternatives are complex: .sp (?(1) (A|B|C) | (D | (?(2)E|F) | E) ) .sp .P -There are five kinds of condition: references to subpatterns, references to +There are five kinds of condition: references to capture groups, references to recursion, two pseudo-conditions called DEFINE and VERSION, and assertions. . . -.SS "Checking for a used subpattern by number" +.SS "Checking for a used capture group by number" .rs .sp If the text between the parentheses consists of a sequence of digits, the -condition is true if a capturing subpattern of that number has previously -matched. 
If there is more than one capturing subpattern with the same number -(see the earlier +condition is true if a capture group of that number has previously matched. If +there is more than one capture group with the same number (see the earlier .\" .\" HTML <a href="#recursion"> .\" </a> -section about duplicate subpattern numbers), +section about duplicate group numbers), .\" the condition is true if any of them have matched. An alternative notation is -to precede the digits with a plus or minus sign. In this case, the subpattern -number is relative rather than absolute. The most recently opened parentheses -can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside -loops it can also make sense to refer to subsequent groups. The next -parentheses to be opened can be referenced as (?(+1), and so on. (The value -zero in any of these forms is not used; it provokes a compile-time error.) +to precede the digits with a plus or minus sign. In this case, the group number +is relative rather than absolute. The most recently opened capture group can be +referenced by (?(-1), the next most recent by (?(-2), and so on. Inside loops +it can also make sense to refer to subsequent groups. The next capture group +can be referenced as (?(+1), and so on. (The value zero in any of these forms +is not used; it provokes a compile-time error.) .P Consider the following pattern, which contains non-significant white space to make it more readable (assume the PCRE2_EXTENDED option) and to divide it into @@ -2730,12 +2727,12 @@ three parts for ease of discussion: The first part matches an optional opening parenthesis, and if that character is present, sets it as the first captured substring. The second part matches one or more characters that are not parentheses. The third part is a -conditional subpattern that tests whether or not the first set of parentheses -matched. 
If they did, that is, if subject started with an opening parenthesis, +conditional group that tests whether or not the first capture group +matched. If it did, that is, if subject started with an opening parenthesis, the condition is true, and so the yes-pattern is executed and a closing parenthesis is required. Otherwise, since no-pattern is not present, the -subpattern matches nothing. In other words, this pattern matches a sequence of -non-parentheses, optionally enclosed in parentheses. +conditional group matches nothing. In other words, this pattern matches a +sequence of non-parentheses, optionally enclosed in parentheses. .P If you were embedding this pattern in a larger one, you could use a relative reference: @@ -2745,21 +2742,20 @@ reference: This makes the fragment independent of the parentheses in the larger pattern. . . -.SS "Checking for a used subpattern by name" +.SS "Checking for a used capture group by name" .rs .sp Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used -subpattern by name. For compatibility with earlier versions of PCRE1, which had -this facility before Perl, the syntax (?(name)...) is also recognized. Note, -however, that undelimited names consisting of the letter R followed by digits -are ambiguous (see the following section). -.P -Rewriting the above example to use a named subpattern gives this: +capture group by name. For compatibility with earlier versions of PCRE1, which +had this facility before Perl, the syntax (?(name)...) is also recognized. +Note, however, that undelimited names consisting of the letter R followed by +digits are ambiguous (see the following section). Rewriting the above example +to use a named group gives this: .sp (?<OPEN> \e( )? 
[^()]+ (?(<OPEN>) \e) ) .sp If the name used in a condition of this kind is a duplicate, the test is -applied to all subpatterns of the same name, and is true if any one of them has +applied to all groups of the same name, and is true if any one of them has matched. . . @@ -2774,22 +2770,22 @@ sections entitled "Recursive patterns" .\" and -.\" HTML <a href="#subpatternsassubroutines"> +.\" HTML <a href="#groupsassubroutines"> .\" </a> -"Subpatterns as subroutines" +"Groups as subroutines" .\" -below for details of recursion and subpattern calls. +below for details of recursion and subroutine calls. .P -If a condition is the string (R), and there is no subpattern with the name R, -the condition is true if matching is currently in a recursion or subroutine -call to the whole pattern or any subpattern. If digits follow the letter R, and -there is no subpattern with that name, the condition is true if the most recent -call is into a subpattern with the given number, which must exist somewhere in -the overall pattern. This is a contrived example that is equivalent to a+b: +If a condition is the string (R), and there is no capture group with the name +R, the condition is true if matching is currently in a recursion or subroutine +call to the whole pattern or any capture group. If digits follow the letter R, +and there is no group with that name, the condition is true if the most recent +call is into a group with the given number, which must exist somewhere in the +overall pattern. This is a contrived example that is equivalent to a+b: .sp ((?(R1)a+|(?1)b)) .sp -However, in both cases, if there is a subpattern with a matching name, the +However, in both cases, if there is a capture group with a matching name, the condition tests for its being set, as described in the section above, instead of testing for recursion. For example, creating a group with the name R1 by adding (?<R1>) to the above pattern completely changes its meaning. 
@@ -2798,27 +2794,27 @@ If a name preceded by ampersand follows the letter R, for example: .sp (?(R&name)...) .sp -the condition is true if the most recent recursion is into a subpattern of that -name (which must exist within the pattern). +the condition is true if the most recent recursion is into a group of that name +(which must exist within the pattern). .P This condition does not check the entire recursion stack. It tests only the current level. If the name used in a condition of this kind is a duplicate, the -test is applied to all subpatterns of the same name, and is true if any one of +test is applied to all groups of the same name, and is true if any one of them is the most recent recursion. .P At "top level", all these recursion test conditions are false. . . .\" HTML <a name="subdefine"></a> -.SS "Defining subpatterns for use by reference only" +.SS "Defining capture groups for use by reference only" .rs .sp If the condition is the string (DEFINE), the condition is always false, even if there is a group with the name DEFINE. In this case, there may be only one -alternative in the subpattern. It is always skipped if control reaches this -point in the pattern; the idea of DEFINE is that it can be used to define -subroutines that can be referenced from elsewhere. (The use of -.\" HTML <a href="#subpatternsassubroutines"> +alternative in the rest of the conditional group. It is always skipped if +control reaches this point in the pattern; the idea of DEFINE is that it can be +used to define subroutines that can be referenced from elsewhere. (The use of +.\" HTML <a href="#groupsassubroutines"> .\" </a> subroutines .\" @@ -2858,10 +2854,10 @@ than two digits. .SS "Assertion conditions" .rs .sp -If the condition is not in any of the above formats, it must be an assertion. -This may be a positive or negative lookahead or lookbehind assertion. 
Consider -this pattern, again containing non-significant white space, and with the two -alternatives on the second line: +If the condition is not in any of the above formats, it must be a parenthesized +assertion. This may be a positive or negative lookahead or lookbehind +assertion. Consider this pattern, again containing non-significant white space, +and with the two alternatives on the second line: .sp (?(?=[^a-z]*[a-z]) \ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} ) @@ -2873,11 +2869,11 @@ subject is matched against the first alternative; otherwise it is matched against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. .P -When an assertion that is a condition contains capturing subpatterns, any +When an assertion that is a condition contains capture groups, any capturing that occurs in a matching branch is retained afterwards, for both positive and negative assertions, because matching always continues after the assertion, whether it succeeds or fails. (Compare non-conditional assertions, -when captures are retained only for positive assertions that succeed.) +for which captures are retained only for positive assertions that succeed.) . . .\" HTML <a name="comments"></a> @@ -2887,7 +2883,7 @@ when captures are retained only for positive assertions that succeed.) There are two ways of including comments in patterns that are processed by PCRE2. In both cases, the start of the comment must not be in a character class, nor in the middle of any other sequence of related characters such as -(?: or a subpattern name or number. The characters that make up a comment play +(?: or a group name or number. The characters that make up a comment play no part in the pattern matching. .P The sequence (?# marks the start of a comment that continues up to the next @@ -2937,13 +2933,13 @@ recursively to the pattern in which it appears. .P Obviously, PCRE2 cannot support the interpolation of Perl code. 
Instead, it supports special syntax for recursion of the entire pattern, and also for -individual subpattern recursion. After its introduction in PCRE1 and Python, +individual capture group recursion. After its introduction in PCRE1 and Python, this kind of recursion was subsequently introduced into Perl at release 5.10. .P A special item that consists of (? followed by a number greater than zero and a -closing parenthesis is a recursive subroutine call of the subpattern of the -given number, provided that it occurs inside that subpattern. (If not, it is a -.\" HTML <a href="#subpatternsassubroutines"> +closing parenthesis is a recursive subroutine call of the capture group of the +given number, provided that it occurs inside that group. (If not, it is a +.\" HTML <a href="#groupsassubroutines"> .\" </a> non-recursive subroutine .\" @@ -2976,26 +2972,26 @@ parentheses preceding the recursion. In other words, a negative number counts capturing parentheses leftwards from the point at which it is encountered. .P Be aware however, that if -.\" HTML <a href="#dupsubpatternnumber"> +.\" HTML <a href="#dupgroupnumber"> .\" </a> -duplicate subpattern numbers +duplicate capture group numbers .\" -are in use, relative references refer to the earliest subpattern with the +are in use, relative references refer to the earliest group with the appropriate number. Consider, for example: .sp (?|(a)|(b)) (c) (?-2) .sp -The first two capturing groups (a) and (b) are both numbered 1, and group (c) +The first two capture groups (a) and (b) are both numbered 1, and group (c) is number 2. When the reference (?-2) is encountered, the second most recently opened parentheses has the number 1, but it is the first such group (the (a) group) to which the recursion refers. This would be the same if an absolute reference (?1) was used. In other words, relative references are just a shorthand for computing a group number. 
.P -It is also possible to refer to subsequently opened parentheses, by writing +It is also possible to refer to subsequent capture groups, by writing references such as (?+2). However, these cannot be recursive because the reference is not inside the parentheses that are referenced. They are always -.\" HTML <a href="#subpatternsassubroutines"> +.\" HTML <a href="#groupsassubroutines"> .\" </a> non-recursive subroutine .\" @@ -3007,7 +3003,7 @@ rewrite the above example as follows: .sp (?<pn> \e( ( [^()]++ | (?&pn) )* \e) ) .sp -If there is more than one subpattern with the same name, the earliest one is +If there is more than one group with the same name, the earliest one is used. .P The example pattern that we have been looking at contains nested unlimited @@ -3033,9 +3029,9 @@ documentation). If the pattern above is matched against (ab(cd)ef) .sp the value for the inner capturing parentheses (numbered 2) is "ef", which is -the last value taken on at the top level. If a capturing subpattern is not -matched at the top level, its final captured value is unset, even if it was -(temporarily) set at a deeper level during the matching process. +the last value taken on at the top level. If a capture group is not matched at +the top level, its final captured value is unset, even if it was (temporarily) +set at a deeper level during the matching process. .P Do not confuse the (?R) item with the condition (R), which tests for recursion. Consider this pattern, which matches text in angle brackets, allowing for @@ -3044,9 +3040,9 @@ recursing), whereas any characters are permitted at the outer level. .sp < (?: (?(R) \ed++ | [^<>]*+) | (?R)) * > .sp -In this pattern, (?(R) is the start of a conditional subpattern, with two -different alternatives for the recursive and non-recursive cases. The (?R) item -is the actual recursive call. +In this pattern, (?(R) is the start of a conditional group, with two different +alternatives for the recursive and non-recursive cases. 
The (?R) item is the +actual recursive call. . . .\" HTML <a name="recursiondifference"></a> @@ -3056,7 +3052,7 @@ is the actual recursive call. Some former differences between PCRE2 and Perl no longer exist. .P Before release 10.30, recursion processing in PCRE2 differed from Perl in that -a recursive subpattern call was always treated as an atomic group. That is, +a recursive subroutine call was always treated as an atomic group. That is, once it had matched some of the subject string, it was never re-entered, even if it contained untried alternatives and there was a subsequent matching failure. (Historical note: PCRE implemented recursion before Perl did.) @@ -3089,7 +3085,7 @@ Perl takes so long that you think it has gone into a loop. .P Another way in which PCRE2 and Perl used to differ in their recursion processing is in the handling of captured values. Formerly in Perl, when a -subpattern was called recursively or as a subpattern (see the next section), it +group was called recursively or as a subroutine (see the next section), it had no access to any values that were captured outside the recursion, whereas in PCRE2 these values can be referenced. Consider this pattern: .sp @@ -3102,17 +3098,16 @@ alternative matches "a" and then recurses. In the recursion, \e1 does now match later versions (I tried 5.024) it now works. . . -.\" HTML <a name="subpatternsassubroutines"></a> -.SH "SUBPATTERNS AS SUBROUTINES" +.\" HTML <a name="groupsassubroutines"></a> +.SH "GROUPS AS SUBROUTINES" .rs .sp -If the syntax for a recursive subpattern call (either by number or by -name) is used outside the parentheses to which it refers, it operates a bit -like a subroutine in a programming language. More accurately, PCRE2 treats the -referenced subpattern as an independent subpattern which it tries to match at -the current matching position. The called subpattern may be defined before or -after the reference. 
A numbered reference can be absolute or relative, as in -these examples: +If the syntax for a recursive group call (either by number or by name) is used +outside the parentheses to which it refers, it operates a bit like a subroutine +in a programming language. More accurately, PCRE2 treats the referenced group +as an independent subpattern which it tries to match at the current matching +position. The called group may be defined before or after the reference. A +numbered reference can be absolute or relative, as in these examples: .sp (...(absolute)...)...(?2)... (...(relative)...)...(?-1)... @@ -3135,21 +3130,21 @@ changed at PCRE2 release 10.30, so backtracking into subroutine calls can now occur. However, any capturing parentheses that are set during the subroutine call revert to their previous values afterwards. .P -Processing options such as case-independence are fixed when a subpattern is +Processing options such as case-independence are fixed when a group is defined, so if it is used as a subroutine, such options cannot be changed for different calls. For example, consider this pattern: .sp (abc)(?i:(?-1)) .sp It matches "abcabc". It does not match "abcABC" because the change of -processing option does not affect the called subpattern. +processing option does not affect the called group. .P The behaviour of .\" HTML <a href="#backtrackcontrol"> .\" </a> backtracking control verbs .\" -in subpatterns when called as subroutines is described in the section entitled +in groups when called as subroutines is described in the section entitled .\" HTML <a href="#btsub"> .\" </a> "Backtracking verbs in subroutines" @@ -3163,8 +3158,8 @@ below. .sp For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or a number enclosed either in angle brackets or single quotes, is an alternative -syntax for referencing a subpattern as a subroutine, possibly recursively. 
Here -are two of the examples used above, rewritten using this syntax: +syntax for calling a group as a subroutine, possibly recursively. Here are two +of the examples used above, rewritten using this syntax: .sp (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) ) (sens|respons)e and \eg'1'ibility @@ -3306,7 +3301,7 @@ assertions, and in .\" HTML <a href="#btsub"> .\" </a> -subpatterns called as subroutines +capture groups called as subroutines .\" (whether or not recursively) is documented below. . @@ -3347,8 +3342,8 @@ The following verbs act as soon as they are encountered. (*ACCEPT) or (*ACCEPT:NAME) .sp This verb causes the match to end successfully, skipping the remainder of the -pattern. However, when it is inside a subpattern that is called as a -subroutine, only that subpattern is ended successfully. Matching then continues +pattern. However, when it is inside a capture group that is called as a +subroutine, only that group is ended successfully. Matching then continues at the outer level. If (*ACCEPT) in triggered in a positive assertion, the assertion succeeds; in a negative assertion, the assertion fails. .P @@ -3360,9 +3355,8 @@ example: This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by the outer parentheses. .P -\fBWarning:\fP (*ACCEPT) should not be used within a script run subpattern, -because it causes an immediate exit from the subpattern, bypassing the script -run checking. +\fBWarning:\fP (*ACCEPT) should not be used within a script run group, because +it causes an immediate exit from the group, bypassing the script run checking. .sp (*FAIL) or (*FAIL:NAME) .sp @@ -3602,30 +3596,28 @@ like (*MARK:NAME) in that the name is remembered for passing back to the caller. However, (*SKIP:NAME) searches only for names set with (*MARK), ignoring those set by other backtracking verbs. 
.P -A subpattern that does not contain a | character is just a part of the -enclosing alternative; it is not a nested alternation with only one -alternative. The effect of (*THEN) extends beyond such a subpattern to the -enclosing alternative. Consider this pattern, where A, B, etc. are complex -pattern fragments that do not contain any | characters at this level: +A group that does not contain a | character is just a part of the enclosing +alternative; it is not a nested alternation with only one alternative. The +effect of (*THEN) extends beyond such a group to the enclosing alternative. +Consider this pattern, where A, B, etc. are complex pattern fragments that do +not contain any | characters at this level: .sp A (B(*THEN)C) | D .sp If A and B are matched, but there is a failure in C, matching does not backtrack into A; instead it moves to the next alternative, that is, D. -However, if the subpattern containing (*THEN) is given an alternative, it +However, if the group containing (*THEN) is given an alternative, it behaves differently: .sp A (B(*THEN)C | (*FAIL)) | D .sp -The effect of (*THEN) is now confined to the inner subpattern. After a failure -in C, matching moves to (*FAIL), which causes the whole subpattern to fail -because there are no more alternatives to try. In this case, matching does now -backtrack into A. +The effect of (*THEN) is now confined to the inner group. After a failure in C, +matching moves to (*FAIL), which causes the whole group to fail because there +are no more alternatives to try. In this case, matching does backtrack into A. .P -Note that a conditional subpattern is not considered as having two -alternatives, because only one is ever used. In other words, the | character in -a conditional subpattern has a different meaning. Ignoring white space, -consider: +Note that a conditional group is not considered as having two alternatives, +because only one is ever used. 
In other words, the | character in a conditional +group has a different meaning. Ignoring white space, consider: .sp ^.*? (?(?=a) a | b(*THEN)c ) .sp @@ -3633,7 +3625,7 @@ If the subject is "ba", this pattern does not match. Because .*? is ungreedy, it initially matches zero characters. The condition (?=a) then fails, the character "b" is matched, but "c" is not. At this point, matching does not backtrack to .*? as might perhaps be expected from the presence of the | -character. The conditional subpattern is part of the single alternative that +character. The conditional group is part of the single alternative that comprises the whole pattern, and so the match fails. (If there was a backtrack into .*?, allowing it to match "b", the match would succeed.) .P @@ -3690,7 +3682,7 @@ acts. (*FAIL) in any assertion has its normal effect: it forces an immediate backtrack. The behaviour of the other backtracking verbs depends on whether or not the assertion is standalone or acting as the condition in a conditional -subpattern. +group. .P (*ACCEPT) in a standalone positive assertion causes the assertion to succeed without any further processing; captured strings and a mark name (if set) are @@ -3725,25 +3717,24 @@ the assertion to be true, without considering any further alternative branches. .SS "Backtracking verbs in subroutines" .rs .sp -These behaviours occur whether or not the subpattern is called recursively. +These behaviours occur whether or not the group is called recursively. .P -(*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to +(*ACCEPT) in a group called as a subroutine causes the subroutine match to succeed without any further processing. Matching then continues after the subroutine call. Perl documents this behaviour. Perl's treatment of the other verbs in subroutines is different in some cases. 
.P -(*FAIL) in a subpattern called as a subroutine has its normal effect: it forces +(*FAIL) in a group called as a subroutine has its normal effect: it forces an immediate backtrack. .P (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when -triggered by being backtracked to in a subpattern called as a subroutine. There -is then a backtrack at the outer level. +triggered by being backtracked to in a group called as a subroutine. There is +then a backtrack at the outer level. .P (*THEN), when triggered, skips to the next alternative in the innermost -enclosing group within the subpattern that has alternatives (its normal -behaviour). However, if there is no such group within the subroutine -subpattern, the subroutine match fails and there is a backtrack at the outer -level. +enclosing group that has alternatives (its normal behaviour). However, if there +is no such group within the subroutine's group, the subroutine match fails and +there is a backtrack at the outer level. . . .SH "SEE ALSO" @@ -3767,6 +3758,6 @@ Cambridge, England. .rs .sp .nf -Last updated: 27 November 2018 -Copyright (c) 1997-2018 University of Cambridge. +Last updated: 04 February 2019 +Copyright (c) 1997-2019 University of Cambridge. .fi
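The "capture group by name" condition reworded in this diff, (?<OPEN> \e( )? [^()]+ (?(<OPEN>) \e) ), can be tried out with a short sketch. It uses Python's re module as a stand-in, not PCRE2 itself, so it is an illustration only. Python spells the named group (?P<OPEN>...) and the condition (?(OPEN)...), which corresponds to the undelimited (?(name)...) form that the man page says is also recognized. The helper name is_optionally_parenthesized is hypothetical.

```python
import re

# Assumption: Python's re is used here only to illustrate the idea of a
# conditional group that tests a named capture group; PCRE2 syntax differs
# slightly ((?<OPEN>...) and (?(<OPEN>)...) in the man page example).
OPTIONAL_PARENS = re.compile(
    r"""
    (?P<OPEN> \( )?     # optional opening parenthesis, captured as OPEN
    [^()]+              # one or more non-parenthesis characters
    (?(OPEN) \) )       # closing parenthesis required only if OPEN matched
    """,
    re.VERBOSE,         # white space in the pattern is non-significant
)


def is_optionally_parenthesized(text: str) -> bool:
    """Return True if text is a run of non-parentheses, optionally enclosed."""
    return OPTIONAL_PARENS.fullmatch(text) is not None
```

With this sketch, "abc" and "(abc)" are accepted, while "(abc" and "abc)" are rejected, matching the description "a sequence of non-parentheses, optionally enclosed in parentheses".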