From dfd7749120e592299fddf2710bc336ee1000ae07 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Vicen=C8=9Biu=20Ciorbaru?=
-Thus, apart from \?, these escapes generate the same character code values as +Thus, apart from \c?, these escapes generate the same character code values as they do in an ASCII environment, though the meanings of the values mostly -differ. For example, \G always generates code value 7, which is BEL in ASCII +differ. For example, \cG always generates code value 7, which is BEL in ASCII but DEL in EBCDIC.
-The sequence \? generates DEL (127, hex 7F) in an ASCII environment, but +The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but because 127 is not a control character in EBCDIC, Perl makes it generate the APC character. Unfortunately, there are several variants of EBCDIC. In most of them the APC character has the value 255 (hex FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC -values, PCRE makes \? generate 95; otherwise it generates 255. +values, PCRE makes \c? generate 95; otherwise it generates 255.
After \0 up to two further octal digits are read. If there are fewer than two @@ -1512,13 +1512,8 @@ J, U and X respectively.
When one of these option changes occurs at top level (that is, not inside subpattern parentheses), the change applies to the remainder of the pattern -that follows. If the change is placed right at the start of a pattern, PCRE -extracts it into the global options (and it will therefore show up in data -extracted by the pcre_fullinfo() function). -
--An option change within a subpattern (see below for a description of -subpatterns) affects only that part of the subpattern that follows it, so +that follows. An option change within a subpattern (see below for a description +of subpatterns) affects only that part of the subpattern that follows it, so
(a(?i)b)c@@ -2160,6 +2155,14 @@ capturing is carried out only for positive assertions. (Perl sometimes, but not always, does do capturing in negative assertions.)
+WARNING: If a positive assertion containing one or more capturing subpatterns +succeeds, but failure to match later in the pattern causes backtracking over +this assertion, the captures within the assertion are reset only if no higher +numbered captures are already set. This is, unfortunately, a fundamental +limitation of the current implementation, and as PCRE1 is now in +maintenance-only status, it is unlikely ever to change. +
+For compatibility with Perl, assertion subpatterns may be repeated; though it makes no sense to assert the same thing several times, the side effect of capturing parentheses may occasionally be useful. In practice, there only three @@ -3264,9 +3267,9 @@ Cambridge CB2 3QH, England.
-Last updated: 14 June 2015
+Last updated: 23 October 2016
-Copyright © 1997-2015 University of Cambridge.
+Copyright © 1997-2016 University of Cambridge.
Return to the PCRE index page.
diff --git a/pcre/doc/pcre.txt b/pcre/doc/pcre.txt
index f5eb347e18a..d68d5033ceb 100644
--- a/pcre/doc/pcre.txt
+++ b/pcre/doc/pcre.txt
@@ -4640,7 +4640,7 @@ DIFFERENCES BETWEEN PCRE AND PERL
pattern names is not as general as Perl's. This is a consequence of the
fact the PCRE works internally just with numbers, using an external ta-
ble to translate between numbers and names. In particular, a pattern
- such as (?|(?A)|(?A)|(?B), where the two capturing parentheses have
the same number but different names, is not supported, and causes an
error at compile time. If it were allowed, it would not be possible to
distinguish which parentheses matched, because both names map to cap-
@@ -5028,55 +5028,56 @@ BACKSLASH
ate the appropriate EBCDIC code values. The \c escape is processed as
specified for Perl in the perlebcdic document. The only characters that
are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?.
- Any other character provokes a compile-time error. The sequence \@
- encodes character code 0; the letters (in either case) encode charac-
- ters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters 27-31
- (hex 1B to hex 1F), and \? becomes either 255 (hex FF) or 95 (hex 5F).
-
- Thus, apart from \?, these escapes generate the same character code
- values as they do in an ASCII environment, though the meanings of the
- values mostly differ. For example, \G always generates code value 7,
+ Any other character provokes a compile-time error. The sequence \c@
+ encodes character code 0; after \c the letters (in either case) encode
+ characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
+ 27-31 (hex 1B to hex 1F), and \c? becomes either 255 (hex FF) or 95
+ (hex 5F).
+
+ Thus, apart from \c?, these escapes generate the same character code
+ values as they do in an ASCII environment, though the meanings of the
+ values mostly differ. For example, \cG always generates code value 7,
which is BEL in ASCII but DEL in EBCDIC.
- The sequence \? generates DEL (127, hex 7F) in an ASCII environment,
- but because 127 is not a control character in EBCDIC, Perl makes it
- generate the APC character. Unfortunately, there are several variants
- of EBCDIC. In most of them the APC character has the value 255 (hex
- FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
- certain other characters have POSIX-BC values, PCRE makes \? generate
+ The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
+ but because 127 is not a control character in EBCDIC, Perl makes it
+ generate the APC character. Unfortunately, there are several variants
+ of EBCDIC. In most of them the APC character has the value 255 (hex
+ FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
+ certain other characters have POSIX-BC values, PCRE makes \c? generate
95; otherwise it generates 255.
- After \0 up to two further octal digits are read. If there are fewer
- than two digits, just those that are present are used. Thus the
+ After \0 up to two further octal digits are read. If there are fewer
+ than two digits, just those that are present are used. Thus the
sequence \0\x\015 specifies two binary zeros followed by a CR character
(code value 13). Make sure you supply two digits after the initial zero
if the pattern character that follows is itself an octal digit.
- The escape \o must be followed by a sequence of octal digits, enclosed
- in braces. An error occurs if this is not the case. This escape is a
- recent addition to Perl; it provides way of specifying character code
- points as octal numbers greater than 0777, and it also allows octal
+ The escape \o must be followed by a sequence of octal digits, enclosed
+ in braces. An error occurs if this is not the case. This escape is a
+ recent addition to Perl; it provides way of specifying character code
+ points as octal numbers greater than 0777, and it also allows octal
numbers and back references to be unambiguously specified.
For greater clarity and unambiguity, it is best to avoid following \ by
a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
- ter numbers, and \g{} to specify back references. The following para-
+ ter numbers, and \g{} to specify back references. The following para-
graphs describe the old, ambiguous syntax.
The handling of a backslash followed by a digit other than 0 is compli-
- cated, and Perl has changed in recent releases, causing PCRE also to
+ cated, and Perl has changed in recent releases, causing PCRE also to
change. Outside a character class, PCRE reads the digit and any follow-
- ing digits as a decimal number. If the number is less than 8, or if
- there have been at least that many previous capturing left parentheses
- in the expression, the entire sequence is taken as a back reference. A
- description of how this works is given later, following the discussion
+ ing digits as a decimal number. If the number is less than 8, or if
+ there have been at least that many previous capturing left parentheses
+ in the expression, the entire sequence is taken as a back reference. A
+ description of how this works is given later, following the discussion
of parenthesized subpatterns.
- Inside a character class, or if the decimal number following \ is
+ Inside a character class, or if the decimal number following \ is
greater than 7 and there have not been that many capturing subpatterns,
- PCRE handles \8 and \9 as the literal characters "8" and "9", and oth-
+ PCRE handles \8 and \9 as the literal characters "8" and "9", and oth-
erwise re-reads up to three octal digits following the backslash, using
- them to generate a data character. Any subsequent digits stand for
+ them to generate a data character. Any subsequent digits stand for
themselves. For example:
\040 is another way of writing an ASCII space
@@ -5094,31 +5095,31 @@ BACKSLASH
\81 is either a back reference, or the two
characters "8" and "1"
- Note that octal values of 100 or greater that are specified using this
- syntax must not be introduced by a leading zero, because no more than
+ Note that octal values of 100 or greater that are specified using this
+ syntax must not be introduced by a leading zero, because no more than
three octal digits are ever read.
- By default, after \x that is not followed by {, from zero to two hexa-
- decimal digits are read (letters can be in upper or lower case). Any
+ By default, after \x that is not followed by {, from zero to two hexa-
+ decimal digits are read (letters can be in upper or lower case). Any
number of hexadecimal digits may appear between \x{ and }. If a charac-
- ter other than a hexadecimal digit appears between \x{ and }, or if
+ ter other than a hexadecimal digit appears between \x{ and }, or if
there is no terminating }, an error occurs.
- If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
- is as just described only when it is followed by two hexadecimal dig-
- its. Otherwise, it matches a literal "x" character. In JavaScript
+ If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
+ is as just described only when it is followed by two hexadecimal dig-
+ its. Otherwise, it matches a literal "x" character. In JavaScript
mode, support for code points greater than 256 is provided by \u, which
- must be followed by four hexadecimal digits; otherwise it matches a
+ must be followed by four hexadecimal digits; otherwise it matches a
literal "u" character.
Characters whose value is less than 256 can be defined by either of the
- two syntaxes for \x (or by \u in JavaScript mode). There is no differ-
+ two syntaxes for \x (or by \u in JavaScript mode). There is no differ-
ence in the way they are handled. For example, \xdc is exactly the same
as \x{dc} (or \u00dc in JavaScript mode).
Constraints on character values
- Characters that are specified using octal or hexadecimal numbers are
+ Characters that are specified using octal or hexadecimal numbers are
limited to certain values, as follows:
8-bit non-UTF mode less than 0x100
@@ -5128,44 +5129,44 @@ BACKSLASH
32-bit non-UTF mode less than 0x100000000
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint
- Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
+ Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-
called "surrogate" codepoints), and 0xffef.
Escape sequences in character classes
All the sequences that define a single character value can be used both
- inside and outside character classes. In addition, inside a character
+ inside and outside character classes. In addition, inside a character
class, \b is interpreted as the backspace character (hex 08).
- \N is not allowed in a character class. \B, \R, and \X are not special
- inside a character class. Like other unrecognized escape sequences,
- they are treated as the literal characters "B", "R", and "X" by
- default, but cause an error if the PCRE_EXTRA option is set. Outside a
+ \N is not allowed in a character class. \B, \R, and \X are not special
+ inside a character class. Like other unrecognized escape sequences,
+ they are treated as the literal characters "B", "R", and "X" by
+ default, but cause an error if the PCRE_EXTRA option is set. Outside a
character class, these sequences have different meanings.
Unsupported escape sequences
- In Perl, the sequences \l, \L, \u, and \U are recognized by its string
- handler and used to modify the case of following characters. By
- default, PCRE does not support these escape sequences. However, if the
- PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
+ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
+ handler and used to modify the case of following characters. By
+ default, PCRE does not support these escape sequences. However, if the
+ PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
\u can be used to define a character by code point, as described in the
previous section.
Absolute and relative back references
- The sequence \g followed by an unsigned or a negative number, option-
- ally enclosed in braces, is an absolute or relative back reference. A
+ The sequence \g followed by an unsigned or a negative number, option-
+ ally enclosed in braces, is an absolute or relative back reference. A
named back reference can be coded as \g{name}. Back references are dis-
cussed later, following the discussion of parenthesized subpatterns.
Absolute and relative subroutine calls
- For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
- an alternative syntax for referencing a subpattern as a "subroutine".
- Details are discussed later. Note that \g{...} (Perl syntax) and
- \g<...> (Oniguruma syntax) are not synonymous. The former is a back
+ an alternative syntax for referencing a subpattern as a "subroutine".
+ Details are discussed later. Note that \g{...} (Perl syntax) and
+ \g<...> (Oniguruma syntax) are not synonymous. The former is a back
reference; the latter is a subroutine call.
Generic character types
@@ -5184,59 +5185,59 @@ BACKSLASH
\W any "non-word" character
There is also the single sequence \N, which matches a non-newline char-
- acter. This is the same as the "." metacharacter when PCRE_DOTALL is
- not set. Perl also uses \N to match characters by name; PCRE does not
+ acter. This is the same as the "." metacharacter when PCRE_DOTALL is
+ not set. Perl also uses \N to match characters by name; PCRE does not
support this.
- Each pair of lower and upper case escape sequences partitions the com-
- plete set of characters into two disjoint sets. Any given character
- matches one, and only one, of each pair. The sequences can appear both
- inside and outside character classes. They each match one character of
- the appropriate type. If the current matching point is at the end of
- the subject string, all of them fail, because there is no character to
+ Each pair of lower and upper case escape sequences partitions the com-
+ plete set of characters into two disjoint sets. Any given character
+ matches one, and only one, of each pair. The sequences can appear both
+ inside and outside character classes. They each match one character of
+ the appropriate type. If the current matching point is at the end of
+ the subject string, all of them fail, because there is no character to
match.
- For compatibility with Perl, \s did not used to match the VT character
- (code 11), which made it different from the the POSIX "space" class.
- However, Perl added VT at release 5.18, and PCRE followed suit at
- release 8.34. The default \s characters are now HT (9), LF (10), VT
- (11), FF (12), CR (13), and space (32), which are defined as white
+ For compatibility with Perl, \s did not used to match the VT character
+ (code 11), which made it different from the the POSIX "space" class.
+ However, Perl added VT at release 5.18, and PCRE followed suit at
+ release 8.34. The default \s characters are now HT (9), LF (10), VT
+ (11), FF (12), CR (13), and space (32), which are defined as white
space in the "C" locale. This list may vary if locale-specific matching
- is taking place. For example, in some locales the "non-breaking space"
- character (\xA0) is recognized as white space, and in others the VT
+ is taking place. For example, in some locales the "non-breaking space"
+ character (\xA0) is recognized as white space, and in others the VT
character is not.
- A "word" character is an underscore or any character that is a letter
- or digit. By default, the definition of letters and digits is con-
- trolled by PCRE's low-valued character tables, and may vary if locale-
- specific matching is taking place (see "Locale support" in the pcreapi
- page). For example, in a French locale such as "fr_FR" in Unix-like
- systems, or "french" in Windows, some character codes greater than 127
- are used for accented letters, and these are then matched by \w. The
+ A "word" character is an underscore or any character that is a letter
+ or digit. By default, the definition of letters and digits is con-
+ trolled by PCRE's low-valued character tables, and may vary if locale-
+ specific matching is taking place (see "Locale support" in the pcreapi
+ page). For example, in a French locale such as "fr_FR" in Unix-like
+ systems, or "french" in Windows, some character codes greater than 127
+ are used for accented letters, and these are then matched by \w. The
use of locales with Unicode is discouraged.
- By default, characters whose code points are greater than 127 never
+ By default, characters whose code points are greater than 127 never
match \d, \s, or \w, and always match \D, \S, and \W, although this may
- vary for characters in the range 128-255 when locale-specific matching
- is happening. These escape sequences retain their original meanings
- from before Unicode support was available, mainly for efficiency rea-
- sons. If PCRE is compiled with Unicode property support, and the
- PCRE_UCP option is set, the behaviour is changed so that Unicode prop-
+ vary for characters in the range 128-255 when locale-specific matching
+ is happening. These escape sequences retain their original meanings
+ from before Unicode support was available, mainly for efficiency rea-
+ sons. If PCRE is compiled with Unicode property support, and the
+ PCRE_UCP option is set, the behaviour is changed so that Unicode prop-
erties are used to determine character types, as follows:
\d any character that matches \p{Nd} (decimal digit)
\s any character that matches \p{Z} or \h or \v
\w any character that matches \p{L} or \p{N}, plus underscore
- The upper case escapes match the inverse sets of characters. Note that
- \d matches only decimal digits, whereas \w matches any Unicode digit,
- as well as any Unicode letter, and underscore. Note also that PCRE_UCP
- affects \b, and \B because they are defined in terms of \w and \W.
+ The upper case escapes match the inverse sets of characters. Note that
+ \d matches only decimal digits, whereas \w matches any Unicode digit,
+ as well as any Unicode letter, and underscore. Note also that PCRE_UCP
+ affects \b, and \B because they are defined in terms of \w and \W.
Matching these sequences is noticeably slower when PCRE_UCP is set.
- The sequences \h, \H, \v, and \V are features that were added to Perl
- at release 5.10. In contrast to the other sequences, which match only
- ASCII characters by default, these always match certain high-valued
+ The sequences \h, \H, \v, and \V are features that were added to Perl
+ at release 5.10. In contrast to the other sequences, which match only
+ ASCII characters by default, these always match certain high-valued
code points, whether or not PCRE_UCP is set. The horizontal space char-
acters are:
@@ -5275,110 +5276,110 @@ BACKSLASH
Newline sequences
- Outside a character class, by default, the escape sequence \R matches
- any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
+ Outside a character class, by default, the escape sequence \R matches
+ any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
to the following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
- This is an example of an "atomic group", details of which are given
+ This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
- CR followed by LF, or one of the single characters LF (linefeed,
- U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
- riage return, U+000D), or NEL (next line, U+0085). The two-character
+ CR followed by LF, or one of the single characters LF (linefeed,
+ U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
+ riage return, U+000D), or NEL (next line, U+0085). The two-character
sequence is treated as a single unit that cannot be split.
- In other modes, two additional characters whose codepoints are greater
+ In other modes, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
- rator, U+2029). Unicode character property support is not needed for
+ rator, U+2029). Unicode character property support is not needed for
these characters to be recognized.
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
- the complete set of Unicode line endings) by setting the option
+ the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbrevation for "backslash R".) This can be made the default
- when PCRE is built; if this is the case, the other behaviour can be
- requested via the PCRE_BSR_UNICODE option. It is also possible to
- specify these settings by starting a pattern string with one of the
+ when PCRE is built; if this is the case, the other behaviour can be
+ requested via the PCRE_BSR_UNICODE option. It is also possible to
+ specify these settings by starting a pattern string with one of the
following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
These override the default and the options given to the compiling func-
- tion, but they can themselves be overridden by options given to a
- matching function. Note that these special settings, which are not
- Perl-compatible, are recognized only at the very start of a pattern,
- and that they must be in upper case. If more than one of them is
- present, the last one is used. They can be combined with a change of
+ tion, but they can themselves be overridden by options given to a
+ matching function. Note that these special settings, which are not
+ Perl-compatible, are recognized only at the very start of a pattern,
+ and that they must be in upper case. If more than one of them is
+ present, the last one is used. They can be combined with a change of
newline convention; for example, a pattern can start with:
(*ANY)(*BSR_ANYCRLF)
- They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF)
+ They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF)
or (*UCP) special sequences. Inside a character class, \R is treated as
- an unrecognized escape sequence, and so matches the letter "R" by
+ an unrecognized escape sequence, and so matches the letter "R" by
default, but causes an error if PCRE_EXTRA is set.
Unicode character properties
When PCRE is built with Unicode character property support, three addi-
- tional escape sequences that match characters with specific properties
- are available. When in 8-bit non-UTF-8 mode, these sequences are of
- course limited to testing characters whose codepoints are less than
+ tional escape sequences that match characters with specific properties
+ are available. When in 8-bit non-UTF-8 mode, these sequences are of
+ course limited to testing characters whose codepoints are less than
256, but they do work in this mode. The extra escape sequences are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X a Unicode extended grapheme cluster
- The property names represented by xx above are limited to the Unicode
+ The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any
- character (including newline), and some special PCRE properties
- (described in the next section). Other Perl properties such as "InMu-
- sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
+ character (including newline), and some special PCRE properties
+ (described in the next section). Other Perl properties such as "InMu-
+ sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
does not match any characters, so always causes a match failure.
Sets of Unicode characters are defined as belonging to certain scripts.
- A character from one of these sets can be matched using a script name.
+ A character from one of these sets can be matched using a script name.
For example:
\p{Greek}
\P{Han}
- Those that are not part of an identified script are lumped together as
+ Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
- Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali,
- Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Car-
+ Arabic, Armenian, Avestan, Balinese, Bamum, Bassa_Vah, Batak, Bengali,
+ Bopomofo, Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Car-
ian, Caucasian_Albanian, Chakma, Cham, Cherokee, Common, Coptic, Cunei-
form, Cypriot, Cyrillic, Deseret, Devanagari, Duployan, Egyptian_Hiero-
glyphs, Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha,
- Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,
- Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip-
- tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
- Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Lin-
- ear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic,
- Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
- Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean,
- New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian,
+ Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,
+ Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip-
+ tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li,
+ Kharoshthi, Khmer, Khojki, Khudawadi, Lao, Latin, Lepcha, Limbu, Lin-
+ ear_A, Linear_B, Lisu, Lycian, Lydian, Mahajani, Malayalam, Mandaic,
+ Manichaean, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
+ Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Myanmar, Nabataean,
+ New_Tai_Lue, Nko, Ogham, Ol_Chiki, Old_Italic, Old_North_Arabian,
Old_Permic, Old_Persian, Old_South_Arabian, Old_Turkic, Oriya, Osmanya,
Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
- Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
- vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac,
- Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu,
- Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi,
+ Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
+ vian, Siddham, Sinhala, Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac,
+ Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu,
+ Thaana, Thai, Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi,
Yi.
Each character has exactly one Unicode general category property, spec-
- ified by a two-letter abbreviation. For compatibility with Perl, nega-
- tion can be specified by including a circumflex between the opening
- brace and the property name. For example, \p{^Lu} is the same as
+ ified by a two-letter abbreviation. For compatibility with Perl, nega-
+ tion can be specified by including a circumflex between the opening
+ brace and the property name. For example, \p{^Lu} is the same as
\P{Lu}.
If only one letter is specified with \p or \P, it includes all the gen-
- eral category properties that start with that letter. In this case, in
- the absence of negation, the curly brackets in the escape sequence are
+ eral category properties that start with that letter. In this case, in
+ the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:
\p{L}
@@ -5430,73 +5431,73 @@ BACKSLASH
Zp Paragraph separator
Zs Space separator
- The special property L& is also supported: it matches a character that
- has the Lu, Ll, or Lt property, in other words, a letter that is not
+ The special property L& is also supported: it matches a character that
+ has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".
- The Cs (Surrogate) property applies only to characters in the range
- U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
- so cannot be tested by PCRE, unless UTF validity checking has been
+ The Cs (Surrogate) property applies only to characters in the range
+ U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
+ so cannot be tested by PCRE, unless UTF validity checking has been
turned off (see the discussion of PCRE_NO_UTF8_CHECK,
- PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl
+ PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl
does not support the Cs property.
- The long synonyms for property names that Perl supports (such as
- \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
+ The long synonyms for property names that Perl supports (such as
+ \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".
No character that is in the Unicode table has the Cn (unassigned) prop-
erty. Instead, this property is assumed for any code point that is not
in the Unicode table.
- Specifying caseless matching does not affect these escape sequences.
- For example, \p{Lu} always matches only upper case letters. This is
+ Specifying caseless matching does not affect these escape sequences.
+ For example, \p{Lu} always matches only upper case letters. This is
different from the behaviour of current versions of Perl.
- Matching characters by Unicode property is not fast, because PCRE has
- to do a multistage table lookup in order to find a character's prop-
+ Matching characters by Unicode property is not fast, because PCRE has
+ to do a multistage table lookup in order to find a character's prop-
erty. That is why the traditional escape sequences such as \d and \w do
not use Unicode properties in PCRE by default, though you can make them
- do so by setting the PCRE_UCP option or by starting the pattern with
+ do so by setting the PCRE_UCP option or by starting the pattern with
(*UCP).
Extended grapheme clusters
- The \X escape matches any number of Unicode characters that form an
+ The \X escape matches any number of Unicode characters that form an
"extended grapheme cluster", and treats the sequence as an atomic group
- (see below). Up to and including release 8.31, PCRE matched an ear-
+ (see below). Up to and including release 8.31, PCRE matched an ear-
lier, simpler definition that was equivalent to
(?>\PM\pM*)
- That is, it matched a character without the "mark" property, followed
- by zero or more characters with the "mark" property. Characters with
- the "mark" property are typically non-spacing accents that affect the
+ That is, it matched a character without the "mark" property, followed
+ by zero or more characters with the "mark" property. Characters with
+ the "mark" property are typically non-spacing accents that affect the
preceding character.
- This simple definition was extended in Unicode to include more compli-
- cated kinds of composite character by giving each character a grapheme
- breaking property, and creating rules that use these properties to
- define the boundaries of extended grapheme clusters. In releases of
+ This simple definition was extended in Unicode to include more compli-
+ cated kinds of composite character by giving each character a grapheme
+ breaking property, and creating rules that use these properties to
+ define the boundaries of extended grapheme clusters. In releases of
PCRE later than 8.31, \X matches one of these clusters.
- \X always matches at least one character. Then it decides whether to
+ \X always matches at least one character. Then it decides whether to
add additional characters according to the following rules for ending a
cluster:
1. End at the end of the subject string.
- 2. Do not end between CR and LF; otherwise end after any control char-
+ 2. Do not end between CR and LF; otherwise end after any control char-
acter.
- 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
- characters are of five types: L, V, T, LV, and LVT. An L character may
- be followed by an L, V, LV, or LVT character; an LV or V character may
+ 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
+ characters are of five types: L, V, T, LV, and LVT. An L character may
+ be followed by an L, V, LV, or LVT character; an LV or V character may
be followed by a V or T character; an LVT or T character may be follwed
only by a T character.
- 4. Do not end before extending characters or spacing marks. Characters
- with the "mark" property always have the "extend" grapheme breaking
+ 4. Do not end before extending characters or spacing marks. Characters
+ with the "mark" property always have the "extend" grapheme breaking
property.
5. Do not end after prepend characters.
@@ -5505,9 +5506,9 @@ BACKSLASH
PCRE's additional properties
- As well as the standard Unicode properties described above, PCRE sup-
- ports four more that make it possible to convert traditional escape
- sequences such as \w and \s to use Unicode properties. PCRE uses these
+ As well as the standard Unicode properties described above, PCRE sup-
+ ports four more that make it possible to convert traditional escape
+ sequences such as \w and \s to use Unicode properties. PCRE uses these
non-standard, non-Perl properties internally when PCRE_UCP is set. How-
ever, they may also be used explicitly. These properties are:
@@ -5516,54 +5517,54 @@ BACKSLASH
Xsp Any Perl space character
Xwd Any Perl "word" character
- Xan matches characters that have either the L (letter) or the N (num-
- ber) property. Xps matches the characters tab, linefeed, vertical tab,
- form feed, or carriage return, and any other character that has the Z
- (separator) property. Xsp is the same as Xps; it used to exclude ver-
- tical tab, for Perl compatibility, but Perl changed, and so PCRE fol-
- lowed at release 8.34. Xwd matches the same characters as Xan, plus
+ Xan matches characters that have either the L (letter) or the N (num-
+ ber) property. Xps matches the characters tab, linefeed, vertical tab,
+ form feed, or carriage return, and any other character that has the Z
+ (separator) property. Xsp is the same as Xps; it used to exclude ver-
+ tical tab, for Perl compatibility, but Perl changed, and so PCRE fol-
+ lowed at release 8.34. Xwd matches the same characters as Xan, plus
underscore.
- There is another non-standard property, Xuc, which matches any charac-
- ter that can be represented by a Universal Character Name in C++ and
- other programming languages. These are the characters $, @, ` (grave
- accent), and all characters with Unicode code points greater than or
- equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
- most base (ASCII) characters are excluded. (Universal Character Names
- are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
+ There is another non-standard property, Xuc, which matches any charac-
+ ter that can be represented by a Universal Character Name in C++ and
+ other programming languages. These are the characters $, @, ` (grave
+ accent), and all characters with Unicode code points greater than or
+ equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
+ most base (ASCII) characters are excluded. (Universal Character Names
+ are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
Note that the Xuc property does not match these sequences but the char-
acters that they represent.)
Resetting the match start
- The escape sequence \K causes any previously matched characters not to
+ The escape sequence \K causes any previously matched characters not to
be included in the final matched sequence. For example, the pattern:
foo\Kbar
- matches "foobar", but reports that it has matched "bar". This feature
- is similar to a lookbehind assertion (described below). However, in
- this case, the part of the subject before the real match does not have
- to be of fixed length, as lookbehind assertions do. The use of \K does
- not interfere with the setting of captured substrings. For example,
+ matches "foobar", but reports that it has matched "bar". This feature
+ is similar to a lookbehind assertion (described below). However, in
+ this case, the part of the subject before the real match does not have
+ to be of fixed length, as lookbehind assertions do. The use of \K does
+ not interfere with the setting of captured substrings. For example,
when the pattern
(foo)\Kbar
matches "foobar", the first substring is still set to "foo".
- Perl documents that the use of \K within assertions is "not well
- defined". In PCRE, \K is acted upon when it occurs inside positive
- assertions, but is ignored in negative assertions. Note that when a
- pattern such as (?=ab\K) matches, the reported start of the match can
+ Perl documents that the use of \K within assertions is "not well
+ defined". In PCRE, \K is acted upon when it occurs inside positive
+ assertions, but is ignored in negative assertions. Note that when a
+ pattern such as (?=ab\K) matches, the reported start of the match can
be greater than the end of the match.
Simple assertions
- The final use of backslash is for certain simple assertions. An asser-
- tion specifies a condition that has to be met at a particular point in
- a match, without consuming any characters from the subject string. The
- use of subpatterns for more complicated assertions is described below.
+ The final use of backslash is for certain simple assertions. An asser-
+ tion specifies a condition that has to be met at a particular point in
+ a match, without consuming any characters from the subject string. The
+ use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:
\b matches at a word boundary
@@ -5574,161 +5575,161 @@ BACKSLASH
\z matches only at the end of the subject
\G matches at the first matching position in the subject
- Inside a character class, \b has a different meaning; it matches the
- backspace character. If any other of these assertions appears in a
- character class, by default it matches the corresponding literal char-
+ Inside a character class, \b has a different meaning; it matches the
+ backspace character. If any other of these assertions appears in a
+ character class, by default it matches the corresponding literal char-
acter (for example, \B matches the letter B). However, if the
- PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
+ PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
ated instead.
- A word boundary is a position in the subject string where the current
- character and the previous character do not both match \w or \W (i.e.
- one matches \w and the other matches \W), or the start or end of the
- string if the first or last character matches \w, respectively. In a
- UTF mode, the meanings of \w and \W can be changed by setting the
- PCRE_UCP option. When this is done, it also affects \b and \B. Neither
- PCRE nor Perl has a separate "start of word" or "end of word" metase-
- quence. However, whatever follows \b normally determines which it is.
+ A word boundary is a position in the subject string where the current
+ character and the previous character do not both match \w or \W (i.e.
+ one matches \w and the other matches \W), or the start or end of the
+ string if the first or last character matches \w, respectively. In a
+ UTF mode, the meanings of \w and \W can be changed by setting the
+ PCRE_UCP option. When this is done, it also affects \b and \B. Neither
+ PCRE nor Perl has a separate "start of word" or "end of word" metase-
+ quence. However, whatever follows \b normally determines which it is.
For example, the fragment \ba matches "a" at the start of a word.
- The \A, \Z, and \z assertions differ from the traditional circumflex
+ The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
- at the very start and end of the subject string, whatever options are
- set. Thus, they are independent of multiline mode. These three asser-
+ at the very start and end of the subject string, whatever options are
+ set. Thus, they are independent of multiline mode. These three asser-
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
- affect only the behaviour of the circumflex and dollar metacharacters.
- However, if the startoffset argument of pcre_exec() is non-zero, indi-
+ affect only the behaviour of the circumflex and dollar metacharacters.
+ However, if the startoffset argument of pcre_exec() is non-zero, indi-
cating that matching is to start at a point other than the beginning of
- the subject, \A can never match. The difference between \Z and \z is
+ the subject, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at
the very end, whereas \z matches only at the end.
- The \G assertion is true only when the current matching position is at
- the start point of the match, as specified by the startoffset argument
- of pcre_exec(). It differs from \A when the value of startoffset is
- non-zero. By calling pcre_exec() multiple times with appropriate argu-
+ The \G assertion is true only when the current matching position is at
+ the start point of the match, as specified by the startoffset argument
+ of pcre_exec(). It differs from \A when the value of startoffset is
+ non-zero. By calling pcre_exec() multiple times with appropriate argu-
ments, you can mimic Perl's /g option, and it is in this kind of imple-
mentation where \G can be useful.
- Note, however, that PCRE's interpretation of \G, as the start of the
+ Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
- end of the previous match. In Perl, these can be different when the
- previously matched string was empty. Because PCRE does just one match
+ end of the previous match. In Perl, these can be different when the
+ previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.
- If all the alternatives of a pattern begin with \G, the expression is
+ If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.
CIRCUMFLEX AND DOLLAR
- The circumflex and dollar metacharacters are zero-width assertions.
- That is, they test for a particular condition being true without con-
+ The circumflex and dollar metacharacters are zero-width assertions.
+ That is, they test for a particular condition being true without con-
suming any characters from the subject string.
Outside a character class, in the default matching mode, the circumflex
- character is an assertion that is true only if the current matching
- point is at the start of the subject string. If the startoffset argu-
- ment of pcre_exec() is non-zero, circumflex can never match if the
- PCRE_MULTILINE option is unset. Inside a character class, circumflex
+ character is an assertion that is true only if the current matching
+ point is at the start of the subject string. If the startoffset argu-
+ ment of pcre_exec() is non-zero, circumflex can never match if the
+ PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).
- Circumflex need not be the first character of the pattern if a number
- of alternatives are involved, but it should be the first thing in each
- alternative in which it appears if the pattern is ever to match that
- branch. If all possible alternatives start with a circumflex, that is,
- if the pattern is constrained to match only at the start of the sub-
- ject, it is said to be an "anchored" pattern. (There are also other
+ Circumflex need not be the first character of the pattern if a number
+ of alternatives are involved, but it should be the first thing in each
+ alternative in which it appears if the pattern is ever to match that
+ branch. If all possible alternatives start with a circumflex, that is,
+ if the pattern is constrained to match only at the start of the sub-
+ ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
- The dollar character is an assertion that is true only if the current
- matching point is at the end of the subject string, or immediately
- before a newline at the end of the string (by default). Note, however,
- that it does not actually match the newline. Dollar need not be the
+ The dollar character is an assertion that is true only if the current
+ matching point is at the end of the subject string, or immediately
+ before a newline at the end of the string (by default). Note, however,
+ that it does not actually match the newline. Dollar need not be the
last character of the pattern if a number of alternatives are involved,
- but it should be the last item in any branch in which it appears. Dol-
+ but it should be the last item in any branch in which it appears. Dol-
lar has no special meaning in a character class.
- The meaning of dollar can be changed so that it matches only at the
- very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
+ The meaning of dollar can be changed so that it matches only at the
+ very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are changed if the
- PCRE_MULTILINE option is set. When this is the case, a circumflex
- matches immediately after internal newlines as well as at the start of
- the subject string. It does not match after a newline that ends the
- string. A dollar matches before any newlines in the string, as well as
- at the very end, when PCRE_MULTILINE is set. When newline is specified
- as the two-character sequence CRLF, isolated CR and LF characters do
+ PCRE_MULTILINE option is set. When this is the case, a circumflex
+ matches immediately after internal newlines as well as at the start of
+ the subject string. It does not match after a newline that ends the
+ string. A dollar matches before any newlines in the string, as well as
+ at the very end, when PCRE_MULTILINE is set. When newline is specified
+ as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.
- For example, the pattern /^abc$/ matches the subject string "def\nabc"
- (where \n represents a newline) in multiline mode, but not otherwise.
- Consequently, patterns that are anchored in single line mode because
- all branches start with ^ are not anchored in multiline mode, and a
- match for circumflex is possible when the startoffset argument of
- pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
+ For example, the pattern /^abc$/ matches the subject string "def\nabc"
+ (where \n represents a newline) in multiline mode, but not otherwise.
+ Consequently, patterns that are anchored in single line mode because
+ all branches start with ^ are not anchored in multiline mode, and a
+ match for circumflex is possible when the startoffset argument of
+ pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.
- Note that the sequences \A, \Z, and \z can be used to match the start
- and end of the subject in both modes, and if all branches of a pattern
- start with \A it is always anchored, whether or not PCRE_MULTILINE is
+ Note that the sequences \A, \Z, and \z can be used to match the start
+ and end of the subject in both modes, and if all branches of a pattern
+ start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.
FULL STOP (PERIOD, DOT) AND \N
Outside a character class, a dot in the pattern matches any one charac-
- ter in the subject string except (by default) a character that signi-
+ ter in the subject string except (by default) a character that signi-
fies the end of a line.
- When a line ending is defined as a single character, dot never matches
- that character; when the two-character sequence CRLF is used, dot does
- not match CR if it is immediately followed by LF, but otherwise it
- matches all characters (including isolated CRs and LFs). When any Uni-
- code line endings are being recognized, dot does not match CR or LF or
+ When a line ending is defined as a single character, dot never matches
+ that character; when the two-character sequence CRLF is used, dot does
+ not match CR if it is immediately followed by LF, but otherwise it
+ matches all characters (including isolated CRs and LFs). When any Uni-
+ code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters.
- The behaviour of dot with regard to newlines can be changed. If the
- PCRE_DOTALL option is set, a dot matches any one character, without
+ The behaviour of dot with regard to newlines can be changed. If the
+ PCRE_DOTALL option is set, a dot matches any one character, without
exception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it.
- The handling of dot is entirely independent of the handling of circum-
- flex and dollar, the only relationship being that they both involve
+ The handling of dot is entirely independent of the handling of circum-
+ flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.
- The escape sequence \N behaves like a dot, except that it is not
- affected by the PCRE_DOTALL option. In other words, it matches any
- character except one that signifies the end of a line. Perl also uses
+ The escape sequence \N behaves like a dot, except that it is not
+ affected by the PCRE_DOTALL option. In other words, it matches any
+ character except one that signifies the end of a line. Perl also uses
\N to match characters by name; PCRE does not support this.
MATCHING A SINGLE DATA UNIT
- Outside a character class, the escape sequence \C matches any one data
- unit, whether or not a UTF mode is set. In the 8-bit library, one data
- unit is one byte; in the 16-bit library it is a 16-bit unit; in the
- 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
- line-ending characters. The feature is provided in Perl in order to
+ Outside a character class, the escape sequence \C matches any one data
+ unit, whether or not a UTF mode is set. In the 8-bit library, one data
+ unit is one byte; in the 16-bit library it is a 16-bit unit; in the
+ 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
+ line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode, but it is unclear how it can use-
- fully be used. Because \C breaks up characters into individual data
- units, matching one unit with \C in a UTF mode means that the rest of
+ fully be used. Because \C breaks up characters into individual data
+ units, matching one unit with \C in a UTF mode means that the rest of
the string may start with a malformed UTF character. This has undefined
results, because PCRE assumes that it is dealing with valid UTF strings
- (and by default it checks this at the start of processing unless the
- PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or PCRE_NO_UTF32_CHECK option
+ (and by default it checks this at the start of processing unless the
+ PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or PCRE_NO_UTF32_CHECK option
is used).
- PCRE does not allow \C to appear in lookbehind assertions (described
- below) in a UTF mode, because this would make it impossible to calcu-
+ PCRE does not allow \C to appear in lookbehind assertions (described
+ below) in a UTF mode, because this would make it impossible to calcu-
late the length of the lookbehind.
In general, the \C escape sequence is best avoided. However, one way of
- using it that avoids the problem of malformed UTF characters is to use
- a lookahead to check the length of the next character, as in this pat-
- tern, which could be used with a UTF-8 string (ignore white space and
+ using it that avoids the problem of malformed UTF characters is to use
+ a lookahead to check the length of the next character, as in this pat-
+ tern, which could be used with a UTF-8 string (ignore white space and
line breaks):
(?| (?=[\x00-\x7f])(\C) |
@@ -5736,11 +5737,11 @@ MATCHING A SINGLE DATA UNIT
(?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
(?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
- A group that starts with (?| resets the capturing parentheses numbers
- in each alternative (see "Duplicate Subpattern Numbers" below). The
- assertions at the start of each branch check the next UTF-8 character
- for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
- character's individual bytes are then captured by the appropriate num-
+ A group that starts with (?| resets the capturing parentheses numbers
+ in each alternative (see "Duplicate Subpattern Numbers" below). The
+ assertions at the start of each branch check the next UTF-8 character
+ for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
+ character's individual bytes are then captured by the appropriate num-
ber of groups.
@@ -5750,109 +5751,109 @@ SQUARE BRACKETS AND CHARACTER CLASSES
closing square bracket. A closing square bracket on its own is not spe-
cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
a lone closing square bracket causes a compile-time error. If a closing
- square bracket is required as a member of the class, it should be the
- first data character in the class (after an initial circumflex, if
+ square bracket is required as a member of the class, it should be the
+ first data character in the class (after an initial circumflex, if
present) or escaped with a backslash.
- A character class matches a single character in the subject. In a UTF
- mode, the character may be more than one data unit long. A matched
+ A character class matches a single character in the subject. In a UTF
+ mode, the character may be more than one data unit long. A matched
character must be in the set of characters defined by the class, unless
- the first character in the class definition is a circumflex, in which
+ the first character in the class definition is a circumflex, in which
case the subject character must not be in the set defined by the class.
- If a circumflex is actually required as a member of the class, ensure
+ If a circumflex is actually required as a member of the class, ensure
it is not the first character, or escape it with a backslash.
- For example, the character class [aeiou] matches any lower case vowel,
- while [^aeiou] matches any character that is not a lower case vowel.
+ For example, the character class [aeiou] matches any lower case vowel,
+ while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
- characters that are in the class by enumerating those that are not. A
- class that starts with a circumflex is not an assertion; it still con-
- sumes a character from the subject string, and therefore it fails if
+ characters that are in the class by enumerating those that are not. A
+ class that starts with a circumflex is not an assertion; it still con-
+ sumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.
In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
- (0xffff) can be included in a class as a literal string of data units,
+ (0xffff) can be included in a class as a literal string of data units,
or by using the \x{ escaping mechanism.
- When caseless matching is set, any letters in a class represent both
- their upper case and lower case versions, so for example, a caseless
- [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
- match "A", whereas a caseful version would. In a UTF mode, PCRE always
- understands the concept of case for characters whose values are less
- than 128, so caseless matching is always possible. For characters with
- higher values, the concept of case is supported if PCRE is compiled
- with Unicode property support, but not otherwise. If you want to use
- caseless matching in a UTF mode for characters 128 and above, you must
- ensure that PCRE is compiled with Unicode property support as well as
+ When caseless matching is set, any letters in a class represent both
+ their upper case and lower case versions, so for example, a caseless
+ [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
+ match "A", whereas a caseful version would. In a UTF mode, PCRE always
+ understands the concept of case for characters whose values are less
+ than 128, so caseless matching is always possible. For characters with
+ higher values, the concept of case is supported if PCRE is compiled
+ with Unicode property support, but not otherwise. If you want to use
+ caseless matching in a UTF mode for characters 128 and above, you must
+ ensure that PCRE is compiled with Unicode property support as well as
with UTF support.
- Characters that might indicate line breaks are never treated in any
- special way when matching character classes, whatever line-ending
- sequence is in use, and whatever setting of the PCRE_DOTALL and
+ Characters that might indicate line breaks are never treated in any
+ special way when matching character classes, whatever line-ending
+ sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.
- The minus (hyphen) character can be used to specify a range of charac-
- ters in a character class. For example, [d-m] matches any letter
- between d and m, inclusive. If a minus character is required in a
- class, it must be escaped with a backslash or appear in a position
- where it cannot be interpreted as indicating a range, typically as the
+ The minus (hyphen) character can be used to specify a range of charac-
+ ters in a character class. For example, [d-m] matches any letter
+ between d and m, inclusive. If a minus character is required in a
+ class, it must be escaped with a backslash or appear in a position
+ where it cannot be interpreted as indicating a range, typically as the
first or last character in the class, or immediately after a range. For
- example, [b-d-z] matches letters in the range b to d, a hyphen charac-
+ example, [b-d-z] matches letters in the range b to d, a hyphen charac-
ter, or z.
It is not possible to have the literal character "]" as the end charac-
- ter of a range. A pattern such as [W-]46] is interpreted as a class of
- two characters ("W" and "-") followed by a literal string "46]", so it
- would match "W46]" or "-46]". However, if the "]" is escaped with a
- backslash it is interpreted as the end of range, so [W-\]46] is inter-
- preted as a class containing a range followed by two other characters.
- The octal or hexadecimal representation of "]" can also be used to end
+ ter of a range. A pattern such as [W-]46] is interpreted as a class of
+ two characters ("W" and "-") followed by a literal string "46]", so it
+ would match "W46]" or "-46]". However, if the "]" is escaped with a
+ backslash it is interpreted as the end of range, so [W-\]46] is inter-
+ preted as a class containing a range followed by two other characters.
+ The octal or hexadecimal representation of "]" can also be used to end
a range.
- An error is generated if a POSIX character class (see below) or an
- escape sequence other than one that defines a single character appears
- at a point where a range ending character is expected. For example,
+ An error is generated if a POSIX character class (see below) or an
+ escape sequence other than one that defines a single character appears
+ at a point where a range ending character is expected. For example,
[z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
- Ranges operate in the collating sequence of character values. They can
- also be used for characters specified numerically, for example
- [\000-\037]. Ranges can include any characters that are valid for the
+ Ranges operate in the collating sequence of character values. They can
+ also be used for characters specified numerically, for example
+ [\000-\037]. Ranges can include any characters that are valid for the
current mode.
If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
- to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
- character tables for a French locale are in use, [\xc8-\xcb] matches
- accented E characters in both cases. In UTF modes, PCRE supports the
- concept of case for characters with values greater than 128 only when
+ to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
+ character tables for a French locale are in use, [\xc8-\xcb] matches
+ accented E characters in both cases. In UTF modes, PCRE supports the
+ concept of case for characters with values greater than 128 only when
it is compiled with Unicode property support.
- The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
+ The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
\w, and \W may appear in a character class, and add the characters that
- they match to the class. For example, [\dABCDEF] matches any hexadeci-
- mal digit. In UTF modes, the PCRE_UCP option affects the meanings of
- \d, \s, \w and their upper case partners, just as it does when they
- appear outside a character class, as described in the section entitled
+ they match to the class. For example, [\dABCDEF] matches any hexadeci-
+ mal digit. In UTF modes, the PCRE_UCP option affects the meanings of
+ \d, \s, \w and their upper case partners, just as it does when they
+ appear outside a character class, as described in the section entitled
"Generic character types" above. The escape sequence \b has a different
- meaning inside a character class; it matches the backspace character.
- The sequences \B, \N, \R, and \X are not special inside a character
- class. Like any other unrecognized escape sequences, they are treated
- as the literal characters "B", "N", "R", and "X" by default, but cause
+ meaning inside a character class; it matches the backspace character.
+ The sequences \B, \N, \R, and \X are not special inside a character
+ class. Like any other unrecognized escape sequences, they are treated
+ as the literal characters "B", "N", "R", and "X" by default, but cause
an error if the PCRE_EXTRA option is set.
- A circumflex can conveniently be used with the upper case character
- types to specify a more restricted set of characters than the matching
- lower case type. For example, the class [^\W_] matches any letter or
+ A circumflex can conveniently be used with the upper case character
+ types to specify a more restricted set of characters than the matching
+ lower case type. For example, the class [^\W_] matches any letter or
digit, but not underscore, whereas [\w] includes underscore. A positive
character class should be read as "something OR something OR ..." and a
negative class as "NOT something AND NOT something AND NOT ...".
- The only metacharacters that are recognized in character classes are
- backslash, hyphen (only where it can be interpreted as specifying a
- range), circumflex (only at the start), opening square bracket (only
- when it can be interpreted as introducing a POSIX class name, or for a
- special compatibility feature - see the next two sections), and the
+ The only metacharacters that are recognized in character classes are
+ backslash, hyphen (only where it can be interpreted as specifying a
+ range), circumflex (only at the start), opening square bracket (only
+ when it can be interpreted as introducing a POSIX class name, or for a
+ special compatibility feature - see the next two sections), and the
terminating closing square bracket. However, escaping other non-
alphanumeric characters does no harm.
@@ -5860,7 +5861,7 @@ SQUARE BRACKETS AND CHARACTER CLASSES
POSIX CHARACTER CLASSES
Perl supports the POSIX notation for character classes. This uses names
- enclosed by [: and :] within the enclosing square brackets. PCRE also
+ enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,
[01[:alpha:]%]
@@ -5883,28 +5884,28 @@ POSIX CHARACTER CLASSES
word "word" characters (same as \w)
xdigit hexadecimal digits
- The default "space" characters are HT (9), LF (10), VT (11), FF (12),
- CR (13), and space (32). If locale-specific matching is taking place,
- the list of space characters may be different; there may be fewer or
+ The default "space" characters are HT (9), LF (10), VT (11), FF (12),
+ CR (13), and space (32). If locale-specific matching is taking place,
+ the list of space characters may be different; there may be fewer or
more of them. "Space" used to be different to \s, which did not include
VT, for Perl compatibility. However, Perl changed at release 5.18, and
- PCRE followed at release 8.34. "Space" and \s now match the same set
+ PCRE followed at release 8.34. "Space" and \s now match the same set
of characters.
- The name "word" is a Perl extension, and "blank" is a GNU extension
- from Perl 5.8. Another Perl extension is negation, which is indicated
+ The name "word" is a Perl extension, and "blank" is a GNU extension
+ from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,
[12[:^digit:]]
- matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
+ matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.
By default, characters with values greater than 128 do not match any of
- the POSIX character classes. However, if the PCRE_UCP option is passed
- to pcre_compile(), some of the classes are changed so that Unicode
- character properties are used. This is achieved by replacing certain
+ the POSIX character classes. However, if the PCRE_UCP option is passed
+ to pcre_compile(), some of the classes are changed so that Unicode
+ character properties are used. This is achieved by replacing certain
POSIX classes by other sequences, as follows:
[:alnum:] becomes \p{Xan}
@@ -5916,10 +5917,10 @@ POSIX CHARACTER CLASSES
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}
- Negated versions, such as [:^alpha:] use \P instead of \p. Three other
+ Negated versions, such as [:^alpha:] use \P instead of \p. Three other
POSIX classes are handled specially in UCP mode:
- [:graph:] This matches characters that have glyphs that mark the page
+ [:graph:] This matches characters that have glyphs that mark the page
when printed. In Unicode property terms, it matches all char-
acters with the L, M, N, P, S, or Cf properties, except for:
@@ -5928,58 +5929,58 @@ POSIX CHARACTER CLASSES
U+2066 - U+2069 Various "isolate"s
- [:print:] This matches the same characters as [:graph:] plus space
- characters that are not controls, that is, characters with
+ [:print:] This matches the same characters as [:graph:] plus space
+ characters that are not controls, that is, characters with
the Zs property.
[:punct:] This matches all characters that have the Unicode P (punctua-
- tion) property, plus those characters whose code points are
+ tion) property, plus those characters whose code points are
less than 128 that have the S (Symbol) property.
- The other POSIX classes are unchanged, and match only characters with
+ The other POSIX classes are unchanged, and match only characters with
code points less than 128.
COMPATIBILITY FEATURE FOR WORD BOUNDARIES
- In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
- ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
+ In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
+ ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
and "end of word". PCRE treats these items as follows:
[[:<:]] is converted to \b(?=\w)
[[:>:]] is converted to \b(?<=\w)
Only these exact character sequences are recognized. A sequence such as
- [a[:<:]b] provokes error for an unrecognized POSIX class name. This
- support is not compatible with Perl. It is provided to help migrations
+ [a[:<:]b] provokes error for an unrecognized POSIX class name. This
+ support is not compatible with Perl. It is provided to help migrations
from other environments, and is best not used in any new patterns. Note
- that \b matches at the start and the end of a word (see "Simple asser-
- tions" above), and in a Perl-style pattern the preceding or following
- character normally shows which is wanted, without the need for the
- assertions that are used above in order to give exactly the POSIX be-
+ that \b matches at the start and the end of a word (see "Simple asser-
+ tions" above), and in a Perl-style pattern the preceding or following
+ character normally shows which is wanted, without the need for the
+ assertions that are used above in order to give exactly the POSIX be-
haviour.
VERTICAL BAR
- Vertical bar characters are used to separate alternative patterns. For
+ Vertical bar characters are used to separate alternative patterns. For
example, the pattern
gilbert|sullivan
- matches either "gilbert" or "sullivan". Any number of alternatives may
- appear, and an empty alternative is permitted (matching the empty
+ matches either "gilbert" or "sullivan". Any number of alternatives may
+ appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
- to right, and the first one that succeeds is used. If the alternatives
- are within a subpattern (defined below), "succeeds" means matching the
+ to right, and the first one that succeeds is used. If the alternatives
+ are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.
INTERNAL OPTION SETTING
- The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
- PCRE_EXTENDED options (which are Perl-compatible) can be changed from
- within the pattern by a sequence of Perl option letters enclosed
+ The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
+ PCRE_EXTENDED options (which are Perl-compatible) can be changed from
+ within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are
i for PCRE_CASELESS
@@ -5989,51 +5990,47 @@ INTERNAL OPTION SETTING
For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a
- combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
- LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
- is also permitted. If a letter appears both before and after the
+ combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
+ LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
+ is also permitted. If a letter appears both before and after the
hyphen, the option is unset.
- The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
- can be changed in the same way as the Perl-compatible options by using
+ The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
+ can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.
- When one of these option changes occurs at top level (that is, not
- inside subpattern parentheses), the change applies to the remainder of
- the pattern that follows. If the change is placed right at the start of
- a pattern, PCRE extracts it into the global options (and it will there-
- fore show up in data extracted by the pcre_fullinfo() function).
-
- An option change within a subpattern (see below for a description of
- subpatterns) affects only that part of the subpattern that follows it,
- so
+ When one of these option changes occurs at top level (that is, not
+ inside subpattern parentheses), the change applies to the remainder of
+ the pattern that follows. An option change within a subpattern (see
+ below for a description of subpatterns) affects only that part of the
+ subpattern that follows it, so
(a(?i)b)c
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
- used). By this means, options can be made to have different settings
- in different parts of the pattern. Any changes made in one alternative
- do carry on into subsequent branches within the same subpattern. For
+ used). By this means, options can be made to have different settings
+ in different parts of the pattern. Any changes made in one alternative
+ do carry on into subsequent branches within the same subpattern. For
example,
(a(?i)b|c)
- matches "ab", "aB", "c", and "C", even though when matching "C" the
- first branch is abandoned before the option setting. This is because
- the effects of option settings happen at compile time. There would be
+ matches "ab", "aB", "c", and "C", even though when matching "C" the
+ first branch is abandoned before the option setting. This is because
+ the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.
- Note: There are other PCRE-specific options that can be set by the
- application when the compiling or matching functions are called. In
- some cases the pattern can contain special leading sequences such as
- (*CRLF) to override what the application has set or what has been
- defaulted. Details are given in the section entitled "Newline
- sequences" above. There are also the (*UTF8), (*UTF16),(*UTF32), and
- (*UCP) leading sequences that can be used to set UTF and Unicode prop-
- erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16,
- PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF) sequence
- is a generic version that can be used with any of the libraries. How-
- ever, the application can set the PCRE_NEVER_UTF option, which locks
+ Note: There are other PCRE-specific options that can be set by the
+ application when the compiling or matching functions are called. In
+ some cases the pattern can contain special leading sequences such as
+ (*CRLF) to override what the application has set or what has been
+ defaulted. Details are given in the section entitled "Newline
+ sequences" above. There are also the (*UTF8), (*UTF16),(*UTF32), and
+ (*UCP) leading sequences that can be used to set UTF and Unicode prop-
+ erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16,
+ PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF) sequence
+ is a generic version that can be used with any of the libraries. How-
+ ever, the application can set the PCRE_NEVER_UTF option, which locks
out the use of the (*UTF) sequences.
@@ -6046,18 +6043,18 @@ SUBPATTERNS
cat(aract|erpillar|)
- matches "cataract", "caterpillar", or "cat". Without the parentheses,
+ matches "cataract", "caterpillar", or "cat". Without the parentheses,
it would match "cataract", "erpillar" or an empty string.
- 2. It sets up the subpattern as a capturing subpattern. This means
- that, when the whole pattern matches, that portion of the subject
+ 2. It sets up the subpattern as a capturing subpattern. This means
+ that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
- ovector argument of the matching function. (This applies only to the
- traditional matching functions; the DFA matching functions do not sup-
+ ovector argument of the matching function. (This applies only to the
+ traditional matching functions; the DFA matching functions do not sup-
port capturing.)
Opening parentheses are counted from left to right (starting from 1) to
- obtain numbers for the capturing subpatterns. For example, if the
+ obtain numbers for the capturing subpatterns. For example, if the
string "the red king" is matched against the pattern
the ((red|white) (king|queen))
@@ -6065,12 +6062,12 @@ SUBPATTERNS
the captured substrings are "red king", "red", and "king", and are num-
bered 1, 2, and 3, respectively.
- The fact that plain parentheses fulfil two functions is not always
- helpful. There are often times when a grouping subpattern is required
- without a capturing requirement. If an opening parenthesis is followed
- by a question mark and a colon, the subpattern does not do any captur-
- ing, and is not counted when computing the number of any subsequent
- capturing subpatterns. For example, if the string "the white queen" is
+ The fact that plain parentheses fulfil two functions is not always
+ helpful. There are often times when a grouping subpattern is required
+ without a capturing requirement. If an opening parenthesis is followed
+ by a question mark and a colon, the subpattern does not do any captur-
+ ing, and is not counted when computing the number of any subsequent
+ capturing subpatterns. For example, if the string "the white queen" is
matched against the pattern
the ((?:red|white) (king|queen))
@@ -6078,37 +6075,37 @@ SUBPATTERNS
the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
- As a convenient shorthand, if any option settings are required at the
- start of a non-capturing subpattern, the option letters may appear
+ As a convenient shorthand, if any option settings are required at the
+ start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are
- tried from left to right, and options are not reset until the end of
- the subpattern is reached, an option setting in one branch does affect
- subsequent branches, so the above patterns match "SUNDAY" as well as
+ tried from left to right, and options are not reset until the end of
+ the subpattern is reached, an option setting in one branch does affect
+ subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".
DUPLICATE SUBPATTERN NUMBERS
Perl 5.10 introduced a feature whereby each alternative in a subpattern
- uses the same numbers for its capturing parentheses. Such a subpattern
- starts with (?| and is itself a non-capturing subpattern. For example,
+ uses the same numbers for its capturing parentheses. Such a subpattern
+ starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:
(?|(Sat)ur|(Sun))day
- Because the two alternatives are inside a (?| group, both sets of cap-
- turing parentheses are numbered one. Thus, when the pattern matches,
- you can look at captured substring number one, whichever alternative
- matched. This construct is useful when you want to capture part, but
+ Because the two alternatives are inside a (?| group, both sets of cap-
+ turing parentheses are numbered one. Thus, when the pattern matches,
+ you can look at captured substring number one, whichever alternative
+ matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group, paren-
- theses are numbered as usual, but the number is reset at the start of
- each branch. The numbers of any capturing parentheses that follow the
- subpattern start after the highest number used in any branch. The fol-
+ theses are numbered as usual, but the number is reset at the start of
+ each branch. The numbers of any capturing parentheses that follow the
+ subpattern start after the highest number used in any branch. The fol-
lowing example is taken from the Perl documentation. The numbers under-
neath show in which buffer the captured content will be stored.
@@ -6116,58 +6113,58 @@ DUPLICATE SUBPATTERN NUMBERS
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
- A back reference to a numbered subpattern uses the most recent value
- that is set for that number by any subpattern. The following pattern
+ A back reference to a numbered subpattern uses the most recent value
+ that is set for that number by any subpattern. The following pattern
matches "abcabc" or "defdef":
/(?|(abc)|(def))\1/
- In contrast, a subroutine call to a numbered subpattern always refers
- to the first one in the pattern with the given number. The following
+ In contrast, a subroutine call to a numbered subpattern always refers
+ to the first one in the pattern with the given number. The following
pattern matches "abcabc" or "defabc":
/(?|(abc)|(def))(?1)/
- If a condition test for a subpattern's having matched refers to a non-
- unique number, the test is true if any of the subpatterns of that num-
+ If a condition test for a subpattern's having matched refers to a non-
+ unique number, the test is true if any of the subpatterns of that num-
ber have matched.
- An alternative approach to using this "branch reset" feature is to use
+ An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
NAMED SUBPATTERNS
- Identifying capturing parentheses by number is simple, but it can be
- very hard to keep track of the numbers in complicated regular expres-
- sions. Furthermore, if an expression is modified, the numbers may
- change. To help with this difficulty, PCRE supports the naming of sub-
+ Identifying capturing parentheses by number is simple, but it can be
+ very hard to keep track of the numbers in complicated regular expres-
+ sions. Furthermore, if an expression is modified, the numbers may
+ change. To help with this difficulty, PCRE supports the naming of sub-
patterns. This feature was not added to Perl until release 5.10. Python
- had the feature earlier, and PCRE introduced it at release 4.0, using
- the Python syntax. PCRE now supports both the Perl and the Python syn-
- tax. Perl allows identically numbered subpatterns to have different
+ had the feature earlier, and PCRE introduced it at release 4.0, using
+ the Python syntax. PCRE now supports both the Perl and the Python syn-
+ tax. Perl allows identically numbered subpatterns to have different
names, but PCRE does not.
- In PCRE, a subpattern can be named in one of three ways: (?