author    ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-03-10 16:08:01 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-03-10 16:08:01 +0000
commit    2b6afaf95389abfb23a75f791481df6f04f7dcd0
tree      00561b490a13ba30f5b33fb93dd83d46817b44c1
parent    967823bcbc807492f562c85260a653a39eed2e23
download  pcre-2b6afaf95389abfb23a75f791481df6f04f7dcd0.tar.gz
Tidies for 8.02-RC1 release.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@507 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r--  ChangeLog                   72
-rw-r--r--  NEWS                         6
-rwxr-xr-x  RunTest                     12
-rw-r--r--  configure.ac                 4
-rw-r--r--  doc/html/pcre.html           6
-rw-r--r--  doc/html/pcrepattern.html   33
-rw-r--r--  doc/html/pcreperform.html   25
-rw-r--r--  doc/html/pcresyntax.html    21
-rw-r--r--  doc/pcre.txt              1496
-rw-r--r--  doc/pcrepattern.3            4
-rw-r--r--  maint/README                32
-rw-r--r--  pcre_compile.c              40
-rw-r--r--  pcre_dfa_exec.c              6
-rw-r--r--  pcre_globals.c               6
-rw-r--r--  pcre_internal.h              2
-rw-r--r--  pcre_tables.c              266
-rw-r--r--  pcreposix.c                  2
-rw-r--r--  pcretest.c                   2
-rw-r--r--  testdata/testinput12         9
-rw-r--r--  testdata/testinput6          7
-rw-r--r--  testdata/testoutput12       14
-rw-r--r--  testdata/testoutput6        12
22 files changed, 1088 insertions, 989 deletions
diff --git a/ChangeLog b/ChangeLog
index c56871b..8a83325 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,69 +1,69 @@
ChangeLog for PCRE
------------------
-Version 8.02 01-Mar-2010
+Version 8.02 10-Mar-2010
------------------------
1. The Unicode data tables have been updated to Unicode 5.2.0.
2. Added the option --libs-cpp to pcre-config, but only when C++ support is
configured.
-
+
3. Updated the licensing terms in the pcregexp.pas file, as agreed with the
- original author of that file, following a query about its status.
-
-4. On systems that do not have stdint.h (e.g. Solaris), check for and include
+ original author of that file, following a query about its status.
+
+4. On systems that do not have stdint.h (e.g. Solaris), check for and include
inttypes.h instead. This fixes a bug that was introduced by change 8.01/8.
-
+
5. A pattern such as (?&t)*+(?(DEFINE)(?<t>.)) which has a possessive
quantifier applied to a forward-referencing subroutine call, could compile
incorrect code or give the error "internal error: previously-checked
referenced subpattern not found".
-
-6. Both MS Visual Studio and Symbian OS have problems with initializing
+
+6. Both MS Visual Studio and Symbian OS have problems with initializing
variables to point to external functions. For these systems, therefore,
pcre_malloc etc. are now initialized to local functions that call the
relevant global functions.
-
+
7. There were two entries missing in the vectors called coptable and poptable
in pcre_dfa_exec.c. This could lead to memory accesses outsize the vectors.
- I've fixed the data, and added a kludgy way of testing at compile time that
- the lengths are correct (equal to the number of opcodes).
-
-8. Following on from 7, I added a similar kludge to check the length of the
- eint vector in pcreposix.c.
-
-9. Error texts for pcre_compile() are held as one long string to avoid too
- much relocation at load time. To find a text, the string is searched,
+ I've fixed the data, and added a kludgy way of testing at compile time that
+ the lengths are correct (equal to the number of opcodes).
+
+8. Following on from 7, I added a similar kludge to check the length of the
+ eint vector in pcreposix.c.
+
+9. Error texts for pcre_compile() are held as one long string to avoid too
+ much relocation at load time. To find a text, the string is searched,
counting zeros. There was no check for running off the end of the string,
which could happen if a new error number was added without updating the
- string.
-
-10. \K gave a compile-time error if it appeared in a lookbehind assersion.
-
+ string.
+
+10. \K gave a compile-time error if it appeared in a lookbehind assersion.
+
11. \K was not working if it appeared in an atomic group or in a group that
- was called as a "subroutine", or in an assertion. Perl 5.11 documents that
- \K is "not well defined" if used in an assertion. PCRE now accepts it if
- the assertion is positive, but not if it is negative.
-
+ was called as a "subroutine", or in an assertion. Perl 5.11 documents that
+ \K is "not well defined" if used in an assertion. PCRE now accepts it if
+ the assertion is positive, but not if it is negative.
+
12. Change 11 fortuitously reduced the size of the stack frame used in the
- "match()" function of pcre_exec.c by one pointer. Forthcoming
- implementation of support for (*MARK) will need an extra pointer on the
+ "match()" function of pcre_exec.c by one pointer. Forthcoming
+ implementation of support for (*MARK) will need an extra pointer on the
stack; I have reserved it now, so that the stack frame size does not
decrease.
-
-13. A pattern such as (?P<L1>(?P<L2>0)|(?P>L2)(?P>L1)) in which the only other
- item in branch that calls a recursion is a subroutine call - as in the
+
+13. A pattern such as (?P<L1>(?P<L2>0)|(?P>L2)(?P>L1)) in which the only other
+ item in branch that calls a recursion is a subroutine call - as in the
second branch in the above example - was incorrectly given the compile-
time error "recursive call could loop indefinitely" because pcre_compile()
- was not correctly checking the subroutine for matching a non-empty string.
-
+ was not correctly checking the subroutine for matching a non-empty string.
+
14. The checks for overrunning compiling workspace could trigger after an
- overrun had occurred. This is a "should never occur" error, but it can be
+ overrun had occurred. This is a "should never occur" error, but it can be
triggered by pathological patterns such as hundreds of nested parentheses.
- The checks now trigger 100 bytes before the end of the workspace.
-
-15. Fix typo in configure.ac: "srtoq" should be "strtoq".
+ The checks now trigger 100 bytes before the end of the workspace.
+
+15. Fix typo in configure.ac: "srtoq" should be "strtoq".
Version 8.01 19-Jan-2010
diff --git a/NEWS b/NEWS
index 6fc8c68..09a5b15 100644
--- a/NEWS
+++ b/NEWS
@@ -1,6 +1,12 @@
News about PCRE releases
------------------------
+Release 8.02 10-Mar-2010
+------------------------
+
+Another bug-fix release.
+
+
Release 8.01 19-Jan-2010
------------------------
diff --git a/RunTest b/RunTest
index 787f9cf..7091fab 100755
--- a/RunTest
+++ b/RunTest
@@ -133,10 +133,10 @@ echo ""
echo PCRE C library tests
./pcretest /dev/null
-# Primary test, Perl-compatible for both 5.8 and 5.10
+# Primary test, compatible with Perl 5.8, 5.10, 5.11
if [ $do1 = yes ] ; then
- echo "Test 1: main functionality (Perl 5.8 & 5.10 compatible)"
+ echo "Test 1: main functionality (Perl 5.8, 5.10, 5.11 compatible)"
$valgrind ./pcretest -q $testdata/testinput1 testtry
if [ $? = 0 ] ; then
$cf $testdata/testoutput1 testtry
@@ -215,7 +215,7 @@ fi
# Additional tests for UTF8 support
if [ $do4 = yes ] ; then
- echo "Test 4: UTF-8 support (Perl 5.8 & 5.10 compatible)"
+ echo "Test 4: UTF-8 support (Perl 5.8, 5.10, 5.11 compatible)"
$valgrind ./pcretest -q $testdata/testinput4 testtry
if [ $? = 0 ] ; then
$cf $testdata/testoutput4 testtry
@@ -237,7 +237,7 @@ if [ $do5 = yes ] ; then
fi
if [ $do6 = yes ] ; then
- echo "Test 6: Unicode property support (Perl 5.10 compatible)"
+ echo "Test 6: Unicode property support (Perl 5.10, 5.11 compatible)"
$valgrind ./pcretest -q $testdata/testinput6 testtry
if [ $? = 0 ] ; then
$cf $testdata/testoutput6 testtry
@@ -299,10 +299,10 @@ if [ $do10 = yes ] ; then
echo "OK"
fi
-# Test of Perl 5.10 features
+# Test of Perl 5.10, 5.11 features
if [ $do11 = yes ] ; then
- echo "Test 11: Perl 5.10 features"
+ echo "Test 11: Perl 5.10, 5.11 features"
$valgrind ./pcretest -q $testdata/testinput11 testtry
if [ $? = 0 ] ; then
$cf $testdata/testoutput11 testtry
diff --git a/configure.ac b/configure.ac
index 7d93840..d40d7e6 100644
--- a/configure.ac
+++ b/configure.ac
@@ -11,7 +11,7 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre_major, [8])
m4_define(pcre_minor, [02])
m4_define(pcre_prerelease, [-RC1])
-m4_define(pcre_date, [2010-03-01])
+m4_define(pcre_date, [2010-03-10])
# Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [0:1:0])
@@ -110,7 +110,7 @@ AC_ARG_ENABLE(cpp,
AS_HELP_STRING([--disable-cpp],
[disable C++ support]),
, enable_cpp=yes)
-AC_SUBST(enable_cpp)
+AC_SUBST(enable_cpp)
# Handle --enable-rebuild-chartables
AC_ARG_ENABLE(rebuild-chartables,
diff --git a/doc/html/pcre.html b/doc/html/pcre.html
index 8ea03a1..c4512df 100644
--- a/doc/html/pcre.html
+++ b/doc/html/pcre.html
@@ -33,7 +33,7 @@ for requesting some minor changes that give better JavaScript compatibility.
The current implementation of PCRE corresponds approximately with Perl 5.10,
including support for UTF-8 encoded strings and Unicode general category
properties. However, UTF-8 and Unicode support has to be explicitly enabled; it
-is not the default. The Unicode tables correspond to Unicode release 5.1.
+is not the default. The Unicode tables correspond to Unicode release 5.2.0.
</P>
<P>
In addition to the Perl-compatible matching function, PCRE contains an
@@ -298,9 +298,9 @@ two digits 10, at the domain cam.ac.uk.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 28 September 2009
+Last updated: 01 March 2010
<br>
-Copyright &copy; 1997-2009 University of Cambridge.
+Copyright &copy; 1997-2010 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index 3390c56..1334e74 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -511,13 +511,17 @@ Those that are not part of an identified script are lumped together as
<P>
Arabic,
Armenian,
+Avestan,
Balinese,
+Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
+Carian,
+Cham,
Cherokee,
Common,
Coptic,
@@ -526,6 +530,7 @@ Cypriot,
Cyrillic,
Deseret,
Devanagari,
+Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
@@ -538,16 +543,27 @@ Hangul,
Hanunoo,
Hebrew,
Hiragana,
+Imperial_Aramaic,
Inherited,
+Inscriptional_Pahlavi,
+Inscriptional_Parthian,
+Javanese,
+Kaithi,
Kannada,
Katakana,
+Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
+Lepcha,
Limbu,
Linear_B,
+Lisu,
+Lycian,
+Lydian,
Malayalam,
+Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
@@ -555,18 +571,27 @@ Nko,
Ogham,
Old_Italic,
Old_Persian,
+Old_South_Arabian,
+Old_Turkic,
+Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
+Rejang,
Runic,
+Samaritan,
+Saurashtra,
Shavian,
Sinhala,
+Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
+Tai_Tham,
+Tai_Viet,
Tamil,
Telugu,
Thaana,
@@ -574,6 +599,7 @@ Thai,
Tibetan,
Tifinagh,
Ugaritic,
+Vai,
Yi.
</P>
<P>
@@ -705,6 +731,11 @@ For example, when the pattern
(foo)\Kbar
</pre>
matches "foobar", the first substring is still set to "foo".
+</P>
+<P>
+Perl documents that the use of \K within assertions is "not well defined". In
+PCRE, \K is acted upon when it occurs inside positive assertions, but is
+ignored in negative assertions.
<a name="smallassertions"></a></P>
<br><b>
Simple assertions
@@ -2396,7 +2427,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 January 2010
+Last updated: 06 March 2010
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
diff --git a/doc/html/pcreperform.html b/doc/html/pcreperform.html
index 41d893d..0c746be 100644
--- a/doc/html/pcreperform.html
+++ b/doc/html/pcreperform.html
@@ -21,14 +21,15 @@ time. The way you express your pattern as a regular expression can affect both
of them.
</P>
<br><b>
-MEMORY USAGE
+COMPILED PATTERN MEMORY USAGE
</b><br>
<P>
Patterns are compiled by PCRE into a reasonably efficient byte code, so that
most simple patterns do not use much memory. However, there is one case where
-memory usage can be unexpectedly large. When a parenthesized subpattern has a
-quantifier with a minimum greater than 1 and/or a limited maximum, the whole
-subpattern is repeated in the compiled code. For example, the pattern
+the memory usage of a compiled pattern can be unexpectedly large. If a
+parenthesized subpattern has a quantifier with a minimum greater than 1 and/or
+a limited maximum, the whole subpattern is repeated in the compiled code. For
+example, the pattern
<pre>
(abc|def){2,4}
</pre>
@@ -73,6 +74,18 @@ speed is acceptable, this kind of rewriting will allow you to process patterns
that PCRE cannot otherwise handle.
</P>
<br><b>
+STACK USAGE AT RUN TIME
+</b><br>
+<P>
+When <b>pcre_exec()</b> is used for matching, certain kinds of pattern can cause
+it to use large amounts of the process stack. In some environments the default
+process stack is quite small, and if it runs out the result is often SIGSEGV.
+This issue is probably the most frequently raised problem with PCRE. Rewriting
+your pattern can often help. The
+<a href="pcrestack.html"><b>pcrestack</b></a>
+documentation discusses this issue in detail.
+</P>
+<br><b>
PROCESSING TIME
</b><br>
<P>
@@ -164,9 +177,9 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 06 March 2007
+Last updated: 07 March 2010
<br>
-Copyright &copy; 1997-2007 University of Cambridge.
+Copyright &copy; 1997-2010 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/doc/html/pcresyntax.html b/doc/html/pcresyntax.html
index 2a1a686..1a2749f 100644
--- a/doc/html/pcresyntax.html
+++ b/doc/html/pcresyntax.html
@@ -146,7 +146,9 @@ In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
<P>
Arabic,
Armenian,
+Avestan,
Balinese,
+Bamum,
Bengali,
Bopomofo,
Braille,
@@ -163,6 +165,7 @@ Cypriot,
Cyrillic,
Deseret,
Devanagari,
+Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
@@ -175,7 +178,12 @@ Hangul,
Hanunoo,
Hebrew,
Hiragana,
+Imperial_Aramaic,
Inherited,
+Inscriptional_Pahlavi,
+Inscriptional_Parthian,
+Javanese,
+Kaithi,
Kannada,
Katakana,
Kayah_Li,
@@ -186,9 +194,11 @@ Latin,
Lepcha,
Limbu,
Linear_B,
+Lisu,
Lycian,
Lydian,
Malayalam,
+Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
@@ -196,6 +206,8 @@ Nko,
Ogham,
Old_Italic,
Old_Persian,
+Old_South_Arabian,
+Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
@@ -203,15 +215,18 @@ Phags_Pa,
Phoenician,
Rejang,
Runic,
+Samaritan,
Saurashtra,
Shavian,
Sinhala,
-Sudanese,
+Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
+Tai_Tham,
+Tai_Viet,
Tamil,
Telugu,
Thaana,
@@ -464,9 +479,9 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC26" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 11 April 2009
+Last updated: 01 March 2010
<br>
-Copyright &copy; 1997-2009 University of Cambridge.
+Copyright &copy; 1997-2010 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 0dda0cf..aedf853 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -29,7 +29,7 @@ INTRODUCTION
5.10, including support for UTF-8 encoded strings and Unicode general
category properties. However, UTF-8 and Unicode support has to be
explicitly enabled; it is not the default. The Unicode tables corre-
- spond to Unicode release 5.1.
+ spond to Unicode release 5.2.0.
In addition to the Perl-compatible matching function, PCRE contains an
alternative function that matches the same compiled patterns in a dif-
@@ -263,8 +263,8 @@ AUTHOR
REVISION
- Last updated: 28 September 2009
- Copyright (c) 1997-2009 University of Cambridge.
+ Last updated: 01 March 2010
+ Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -3488,24 +3488,29 @@ BACKSLASH
Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
- Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
- Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
- Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
- Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
- gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
- Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
- Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
- Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
- Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
-
- Each character has exactly one general category property, specified by
+ Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
+ Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
+ Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp-
+ tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
+ Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe-
+ rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
+ Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
+ Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam,
+ Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
+ Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,
+ Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,
+ Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
+ Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
+ Ugaritic, Vai, Yi.
+
+ Each character has exactly one general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
- specified by including a circumflex between the opening brace and the
+ specified by including a circumflex between the opening brace and the
property name. For example, \p{^Lu} is the same as \P{Lu}.
If only one letter is specified with \p or \P, it includes all the gen-
- eral category properties that start with that letter. In this case, in
- the absence of negation, the curly brackets in the escape sequence are
+ eral category properties that start with that letter. In this case, in
+ the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:
\p{L}
@@ -3557,69 +3562,73 @@ BACKSLASH
Zp Paragraph separator
Zs Space separator
- The special property L& is also supported: it matches a character that
- has the Lu, Ll, or Lt property, in other words, a letter that is not
+ The special property L& is also supported: it matches a character that
+ has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".
- The Cs (Surrogate) property applies only to characters in the range
- U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
+ The Cs (Surrogate) property applies only to characters in the range
+ U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
- ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
+ ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
the pcreapi page). Perl does not support the Cs property.
- The long synonyms for property names that Perl supports (such as
- \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
+ The long synonyms for property names that Perl supports (such as
+ \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".
No character that is in the Unicode table has the Cn (unassigned) prop-
erty. Instead, this property is assumed for any code point that is not
in the Unicode table.
- Specifying caseless matching does not affect these escape sequences.
+ Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.
- The \X escape matches any number of Unicode characters that form an
+ The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to
(?>\PM\pM*)
- That is, it matches a character without the "mark" property, followed
- by zero or more characters with the "mark" property, and treats the
- sequence as an atomic group (see below). Characters with the "mark"
- property are typically accents that affect the preceding character.
- None of them have codepoints less than 256, so in non-UTF-8 mode \X
+ That is, it matches a character without the "mark" property, followed
+ by zero or more characters with the "mark" property, and treats the
+ sequence as an atomic group (see below). Characters with the "mark"
+ property are typically accents that affect the preceding character.
+ None of them have codepoints less than 256, so in non-UTF-8 mode \X
matches any one character.
- Matching characters by Unicode property is not fast, because PCRE has
- to search a structure that contains data for over fifteen thousand
+ Matching characters by Unicode property is not fast, because PCRE has
+ to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE.
Resetting the match start
The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
- ously matched characters not to be included in the final matched
+ ously matched characters not to be included in the final matched
sequence. For example, the pattern:
foo\Kbar
- matches "foobar", but reports that it has matched "bar". This feature
- is similar to a lookbehind assertion (described below). However, in
- this case, the part of the subject before the real match does not have
- to be of fixed length, as lookbehind assertions do. The use of \K does
- not interfere with the setting of captured substrings. For example,
+ matches "foobar", but reports that it has matched "bar". This feature
+ is similar to a lookbehind assertion (described below). However, in
+ this case, the part of the subject before the real match does not have
+ to be of fixed length, as lookbehind assertions do. The use of \K does
+ not interfere with the setting of captured substrings. For example,
when the pattern
(foo)\Kbar
matches "foobar", the first substring is still set to "foo".
+ Perl documents that the use of \K within assertions is "not well
+ defined". In PCRE, \K is acted upon when it occurs inside positive
+ assertions, but is ignored in negative assertions.
+
Simple assertions
- The final use of backslash is for certain simple assertions. An asser-
- tion specifies a condition that has to be met at a particular point in
- a match, without consuming any characters from the subject string. The
- use of subpatterns for more complicated assertions is described below.
+ The final use of backslash is for certain simple assertions. An asser-
+ tion specifies a condition that has to be met at a particular point in
+ a match, without consuming any characters from the subject string. The
+ use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:
\b matches at a word boundary
@@ -3630,44 +3639,44 @@ BACKSLASH
\z matches only at the end of the subject
\G matches at the first matching position in the subject
- These assertions may not appear in character classes (but note that \b
+ These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a char-
acter class).
- A word boundary is a position in the subject string where the current
- character and the previous character do not both match \w or \W (i.e.
- one matches \w and the other matches \W), or the start or end of the
+ A word boundary is a position in the subject string where the current
+ character and the previous character do not both match \w or \W (i.e.
+ one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively. Neither
- PCRE nor Perl has a separte "start of word" or "end of word" metase-
- quence. However, whatever follows \b normally determines which it is.
+ PCRE nor Perl has a separte "start of word" or "end of word" metase-
+ quence. However, whatever follows \b normally determines which it is.
For example, the fragment \ba matches "a" at the start of a word.
- The \A, \Z, and \z assertions differ from the traditional circumflex
+ The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
- at the very start and end of the subject string, whatever options are
- set. Thus, they are independent of multiline mode. These three asser-
+ at the very start and end of the subject string, whatever options are
+ set. Thus, they are independent of multiline mode. These three asser-
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
- affect only the behaviour of the circumflex and dollar metacharacters.
- However, if the startoffset argument of pcre_exec() is non-zero, indi-
+ affect only the behaviour of the circumflex and dollar metacharacters.
+ However, if the startoffset argument of pcre_exec() is non-zero, indi-
cating that matching is to start at a point other than the beginning of
- the subject, \A can never match. The difference between \Z and \z is
+ the subject, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at
the very end, whereas \z matches only at the end.
- The \G assertion is true only when the current matching position is at
- the start point of the match, as specified by the startoffset argument
- of pcre_exec(). It differs from \A when the value of startoffset is
- non-zero. By calling pcre_exec() multiple times with appropriate argu-
+ The \G assertion is true only when the current matching position is at
+ the start point of the match, as specified by the startoffset argument
+ of pcre_exec(). It differs from \A when the value of startoffset is
+ non-zero. By calling pcre_exec() multiple times with appropriate argu-
ments, you can mimic Perl's /g option, and it is in this kind of imple-
mentation where \G can be useful.
- Note, however, that PCRE's interpretation of \G, as the start of the
+ Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
- end of the previous match. In Perl, these can be different when the
- previously matched string was empty. Because PCRE does just one match
+ end of the previous match. In Perl, these can be different when the
+ previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.
- If all the alternatives of a pattern begin with \G, the expression is
+ If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.
@@ -3675,90 +3684,90 @@ BACKSLASH
CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the circumflex
- character is an assertion that is true only if the current matching
- point is at the start of the subject string. If the startoffset argu-
- ment of pcre_exec() is non-zero, circumflex can never match if the
- PCRE_MULTILINE option is unset. Inside a character class, circumflex
+ character is an assertion that is true only if the current matching
+ point is at the start of the subject string. If the startoffset argu-
+ ment of pcre_exec() is non-zero, circumflex can never match if the
+ PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).
- Circumflex need not be the first character of the pattern if a number
- of alternatives are involved, but it should be the first thing in each
- alternative in which it appears if the pattern is ever to match that
- branch. If all possible alternatives start with a circumflex, that is,
- if the pattern is constrained to match only at the start of the sub-
- ject, it is said to be an "anchored" pattern. (There are also other
+ Circumflex need not be the first character of the pattern if a number
+ of alternatives are involved, but it should be the first thing in each
+ alternative in which it appears if the pattern is ever to match that
+ branch. If all possible alternatives start with a circumflex, that is,
+ if the pattern is constrained to match only at the start of the sub-
+ ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
- A dollar character is an assertion that is true only if the current
- matching point is at the end of the subject string, or immediately
+ A dollar character is an assertion that is true only if the current
+ matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
- be the last character of the pattern if a number of alternatives are
- involved, but it should be the last item in any branch in which it
+ be the last character of the pattern if a number of alternatives are
+ involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.
- The meaning of dollar can be changed so that it matches only at the
- very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
+ The meaning of dollar can be changed so that it matches only at the
+ very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are changed if the
- PCRE_MULTILINE option is set. When this is the case, a circumflex
- matches immediately after internal newlines as well as at the start of
- the subject string. It does not match after a newline that ends the
- string. A dollar matches before any newlines in the string, as well as
- at the very end, when PCRE_MULTILINE is set. When newline is specified
- as the two-character sequence CRLF, isolated CR and LF characters do
+ PCRE_MULTILINE option is set. When this is the case, a circumflex
+ matches immediately after internal newlines as well as at the start of
+ the subject string. It does not match after a newline that ends the
+ string. A dollar matches before any newlines in the string, as well as
+ at the very end, when PCRE_MULTILINE is set. When newline is specified
+ as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.
- For example, the pattern /^abc$/ matches the subject string "def\nabc"
- (where \n represents a newline) in multiline mode, but not otherwise.
- Consequently, patterns that are anchored in single line mode because
- all branches start with ^ are not anchored in multiline mode, and a
- match for circumflex is possible when the startoffset argument of
- pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
+ For example, the pattern /^abc$/ matches the subject string "def\nabc"
+ (where \n represents a newline) in multiline mode, but not otherwise.
+ Consequently, patterns that are anchored in single line mode because
+ all branches start with ^ are not anchored in multiline mode, and a
+ match for circumflex is possible when the startoffset argument of
+ pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.
- Note that the sequences \A, \Z, and \z can be used to match the start
- and end of the subject in both modes, and if all branches of a pattern
- start with \A it is always anchored, whether or not PCRE_MULTILINE is
+ Note that the sequences \A, \Z, and \z can be used to match the start
+ and end of the subject in both modes, and if all branches of a pattern
+ start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.
FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any one charac-
- ter in the subject string except (by default) a character that signi-
- fies the end of a line. In UTF-8 mode, the matched character may be
+ ter in the subject string except (by default) a character that signi-
+ fies the end of a line. In UTF-8 mode, the matched character may be
more than one byte long.
- When a line ending is defined as a single character, dot never matches
- that character; when the two-character sequence CRLF is used, dot does
- not match CR if it is immediately followed by LF, but otherwise it
- matches all characters (including isolated CRs and LFs). When any Uni-
- code line endings are being recognized, dot does not match CR or LF or
+ When a line ending is defined as a single character, dot never matches
+ that character; when the two-character sequence CRLF is used, dot does
+ not match CR if it is immediately followed by LF, but otherwise it
+ matches all characters (including isolated CRs and LFs). When any Uni-
+ code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters.
- The behaviour of dot with regard to newlines can be changed. If the
- PCRE_DOTALL option is set, a dot matches any one character, without
+ The behaviour of dot with regard to newlines can be changed. If the
+ PCRE_DOTALL option is set, a dot matches any one character, without
exception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it.
- The handling of dot is entirely independent of the handling of circum-
- flex and dollar, the only relationship being that they both involve
+ The handling of dot is entirely independent of the handling of circum-
+ flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.
MATCHING A SINGLE BYTE
Outside a character class, the escape sequence \C matches any one byte,
- both in and out of UTF-8 mode. Unlike a dot, it always matches any
- line-ending characters. The feature is provided in Perl in order to
- match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
- acters into individual bytes, what remains in the string may be a mal-
- formed UTF-8 string. For this reason, the \C escape sequence is best
+ both in and out of UTF-8 mode. Unlike a dot, it always matches any
+ line-ending characters. The feature is provided in Perl in order to
+ match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
+ acters into individual bytes, what remains in the string may be a mal-
+ formed UTF-8 string. For this reason, the \C escape sequence is best
avoided.
- PCRE does not allow \C to appear in lookbehind assertions (described
- below), because in UTF-8 mode this would make it impossible to calcu-
+ PCRE does not allow \C to appear in lookbehind assertions (described
+ below), because in UTF-8 mode this would make it impossible to calcu-
late the length of the lookbehind.
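The circumflex, dollar, \A and dot rules described in the text above can be tried out with a short sketch. It uses Python's re module as a stand-in, not PCRE itself; it is an assumption that the two engines agree on these particular cases. re.MULTILINE and re.DOTALL play the role of PCRE_MULTILINE and PCRE_DOTALL, and the subject string is the "def\nabc" example from the text.

```python
import re

subject = "def\nabc"

# Without multiline mode, ^ matches only at the very start of the subject.
single_line = re.search(r"^abc$", subject)

# In multiline mode, ^ also matches immediately after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# \A stays anchored to the start of the subject even in multiline mode.
always_anchored = re.search(r"\Aabc", subject, re.MULTILINE)

# By default a dot does not match the newline; with DOTALL it does.
dot_default = re.search(r"def.abc", subject)
dot_all = re.search(r"def.abc", subject, re.DOTALL)

print("single line:", single_line)
print("multiline match starts at offset:", multi_line.start())
print("\\A in multiline mode:", always_anchored)
print("dot default:", dot_default, "| dot with DOTALL:", dot_all is not None)
```

The CRLF and Unicode newline conventions mentioned above have no counterpart in this sketch and are not covered by it.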
@@ -3768,97 +3777,97 @@ SQUARE BRACKETS AND CHARACTER CLASSES
closing square bracket. A closing square bracket on its own is not spe-
cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
a lone closing square bracket causes a compile-time error. If a closing
- square bracket is required as a member of the class, it should be the
- first data character in the class (after an initial circumflex, if
+ square bracket is required as a member of the class, it should be the
+ first data character in the class (after an initial circumflex, if
present) or escaped with a backslash.
- A character class matches a single character in the subject. In UTF-8
+ A character class matches a single character in the subject. In UTF-8
mode, the character may be more than one byte long. A matched character
must be in the set of characters defined by the class, unless the first
- character in the class definition is a circumflex, in which case the
- subject character must not be in the set defined by the class. If a
- circumflex is actually required as a member of the class, ensure it is
+ character in the class definition is a circumflex, in which case the
+ subject character must not be in the set defined by the class. If a
+ circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.
- For example, the character class [aeiou] matches any lower case vowel,
- while [^aeiou] matches any character that is not a lower case vowel.
+ For example, the character class [aeiou] matches any lower case vowel,
+ while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
- characters that are in the class by enumerating those that are not. A
- class that starts with a circumflex is not an assertion; it still con-
- sumes a character from the subject string, and therefore it fails if
+ characters that are in the class by enumerating those that are not. A
+ class that starts with a circumflex is not an assertion; it still con-
+ sumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.
- In UTF-8 mode, characters with values greater than 255 can be included
- in a class as a literal string of bytes, or by using the \x{ escaping
+ In UTF-8 mode, characters with values greater than 255 can be included
+ in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.
- When caseless matching is set, any letters in a class represent both
- their upper case and lower case versions, so for example, a caseless
- [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
- match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
- understands the concept of case for characters whose values are less
- than 128, so caseless matching is always possible. For characters with
- higher values, the concept of case is supported if PCRE is compiled
- with Unicode property support, but not otherwise. If you want to use
- caseless matching in UTF8-mode for characters 128 and above, you must
- ensure that PCRE is compiled with Unicode property support as well as
+ When caseless matching is set, any letters in a class represent both
+ their upper case and lower case versions, so for example, a caseless
+ [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
+ match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
+ understands the concept of case for characters whose values are less
+ than 128, so caseless matching is always possible. For characters with
+ higher values, the concept of case is supported if PCRE is compiled
+ with Unicode property support, but not otherwise. If you want to use
+ caseless matching in UTF8-mode for characters 128 and above, you must
+ ensure that PCRE is compiled with Unicode property support as well as
with UTF-8 support.
- Characters that might indicate line breaks are never treated in any
- special way when matching character classes, whatever line-ending
- sequence is in use, and whatever setting of the PCRE_DOTALL and
+ Characters that might indicate line breaks are never treated in any
+ special way when matching character classes, whatever line-ending
+ sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.
- The minus (hyphen) character can be used to specify a range of charac-
- ters in a character class. For example, [d-m] matches any letter
- between d and m, inclusive. If a minus character is required in a
- class, it must be escaped with a backslash or appear in a position
- where it cannot be interpreted as indicating a range, typically as the
+ The minus (hyphen) character can be used to specify a range of charac-
+ ters in a character class. For example, [d-m] matches any letter
+ between d and m, inclusive. If a minus character is required in a
+ class, it must be escaped with a backslash or appear in a position
+ where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.
It is not possible to have the literal character "]" as the end charac-
- ter of a range. A pattern such as [W-]46] is interpreted as a class of
- two characters ("W" and "-") followed by a literal string "46]", so it
- would match "W46]" or "-46]". However, if the "]" is escaped with a
- backslash it is interpreted as the end of range, so [W-\]46] is inter-
- preted as a class containing a range followed by two other characters.
- The octal or hexadecimal representation of "]" can also be used to end
+ ter of a range. A pattern such as [W-]46] is interpreted as a class of
+ two characters ("W" and "-") followed by a literal string "46]", so it
+ would match "W46]" or "-46]". However, if the "]" is escaped with a
+ backslash it is interpreted as the end of range, so [W-\]46] is inter-
+ preted as a class containing a range followed by two other characters.
+ The octal or hexadecimal representation of "]" can also be used to end
a range.
- Ranges operate in the collating sequence of character values. They can
- also be used for characters specified numerically, for example
- [\000-\037]. In UTF-8 mode, ranges can include characters whose values
+ Ranges operate in the collating sequence of character values. They can
+ also be used for characters specified numerically, for example
+ [\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].
If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
- to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
- character tables for a French locale are in use, [\xc8-\xcb] matches
- accented E characters in both cases. In UTF-8 mode, PCRE supports the
- concept of case for characters with values greater than 128 only when
+ to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
+ character tables for a French locale are in use, [\xc8-\xcb] matches
+ accented E characters in both cases. In UTF-8 mode, PCRE supports the
+ concept of case for characters with values greater than 128 only when
it is compiled with Unicode property support.
- The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
- in a character class, and add the characters that they match to the
+ The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
+ in a character class, and add the characters that they match to the
class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
- flex can conveniently be used with the upper case character types to
- specify a more restricted set of characters than the matching lower
- case type. For example, the class [^\W_] matches any letter or digit,
+ flex can conveniently be used with the upper case character types to
+ specify a more restricted set of characters than the matching lower
+ case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.
- The only metacharacters that are recognized in character classes are
- backslash, hyphen (only where it can be interpreted as specifying a
- range), circumflex (only at the start), opening square bracket (only
- when it can be interpreted as introducing a POSIX class name - see the
- next section), and the terminating closing square bracket. However,
+ The only metacharacters that are recognized in character classes are
+ backslash, hyphen (only where it can be interpreted as specifying a
+ range), circumflex (only at the start), opening square bracket (only
+ when it can be interpreted as introducing a POSIX class name - see the
+ next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.
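The character class examples in this section can be checked with a small sketch. It runs the patterns through Python's re module rather than PCRE, on the assumption that both engines treat these particular classes the same way. The helper name matches() is hypothetical and exists only for this illustration.

```python
import re


def matches(pattern, text, flags=0):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text, flags) is not None


results = {
    # Plain and negated classes.
    "vowel": matches(r"[aeiou]", "e"),
    "not_vowel": matches(r"[^aeiou]", "x"),
    # Caseless matching applies to the letters inside the class.
    "caseless_vowel": matches(r"[aeiou]", "A", re.IGNORECASE),
    "caseless_negated": matches(r"[^aeiou]", "A", re.IGNORECASE),
    "caseful_negated": matches(r"[^aeiou]", "A"),
    # A negated class still matches a newline character.
    "negated_newline": matches(r"[^a]", "\n"),
    # [W-]46] is the class [W-] followed by the literal string "46]".
    "literal_tail": matches(r"[W-]46]", "-46]"),
    # With the "]" escaped, W-\] is a range and 4 and 6 are extra members.
    "escaped_range": matches(r"[W-\]46]", "Y"),
    # A caseless range matches its letters in either case.
    "caseless_range": matches(r"[W-c]", "x", re.IGNORECASE),
    # Character types inside a class.
    "hex_digit": matches(r"[\dABCDEF]", "C"),
    "word_char": matches(r"[^\W_]", "a"),
    "underscore": matches(r"[^\W_]", "_"),
}

for name, value in results.items():
    print(f"{name}: {value}")
```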
POSIX CHARACTER CLASSES
Perl supports the POSIX notation for character classes. This uses names
- enclosed by [: and :] within the enclosing square brackets. PCRE also
+ enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,
[01[:alpha:]%]
@@ -3881,18 +3890,18 @@ POSIX CHARACTER CLASSES
word "word" characters (same as \w)
xdigit hexadecimal digits
- The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
- and space (32). Notice that this list includes the VT character (code
+ The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
+ and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).
- The name "word" is a Perl extension, and "blank" is a GNU extension
- from Perl 5.8. Another Perl extension is negation, which is indicated
+ The name "word" is a Perl extension, and "blank" is a GNU extension
+ from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,
[12[:^digit:]]
- matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
+ matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.
@@ -3902,24 +3911,24 @@ POSIX CHARACTER CLASSES
VERTICAL BAR
- Vertical bar characters are used to separate alternative patterns. For
+ Vertical bar characters are used to separate alternative patterns. For
example, the pattern
gilbert|sullivan
- matches either "gilbert" or "sullivan". Any number of alternatives may
- appear, and an empty alternative is permitted (matching the empty
+ matches either "gilbert" or "sullivan". Any number of alternatives may
+ appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
- to right, and the first one that succeeds is used. If the alternatives
- are within a subpattern (defined below), "succeeds" means matching the
+ to right, and the first one that succeeds is used. If the alternatives
+ are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.
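The left-to-right rule for alternatives, and what "succeeds" means inside a subpattern, can be seen in a minimal sketch. It uses Python's re module as a stand-in for PCRE, assuming the two engines order alternatives the same way. The short patterns a|ab and (a|ab)c are made up for this illustration.

```python
import re

# Either alternative can match.
either = re.search(r"gilbert|sullivan", "arthur sullivan").group()

# Alternatives are tried left to right, so "a" wins even though "ab" is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern the rest of the pattern must match too: the "a"
# alternative fails to reach "c", so the engine moves on to "ab".
needs_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|", "") is not None

print(either, first_wins, needs_rest, empty_alternative)
```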
INTERNAL OPTION SETTING
- The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
- PCRE_EXTENDED options (which are Perl-compatible) can be changed from
- within the pattern by a sequence of Perl option letters enclosed
+ The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
+ PCRE_EXTENDED options (which are Perl-compatible) can be changed from
+ within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are
i for PCRE_CASELESS
@@ -3929,46 +3938,46 @@ INTERNAL OPTION SETTING
For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a
- combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
- LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
- is also permitted. If a letter appears both before and after the
+ combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
+ LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
+ is also permitted. If a letter appears both before and after the
hyphen, the option is unset.
- The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
- can be changed in the same way as the Perl-compatible options by using
+ The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
+ can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.
- When one of these option changes occurs at top level (that is, not
- inside subpattern parentheses), the change applies to the remainder of
+ When one of these option changes occurs at top level (that is, not
+ inside subpattern parentheses), the change applies to the remainder of
the pattern that follows. If the change is placed right at the start of
a pattern, PCRE extracts it into the global options (and it will there-
fore show up in data extracted by the pcre_fullinfo() function).
- An option change within a subpattern (see below for a description of
+ An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the current pattern that follows
it, so
(a(?i)b)c
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
- used). By this means, options can be made to have different settings
- in different parts of the pattern. Any changes made in one alternative
- do carry on into subsequent branches within the same subpattern. For
+ used). By this means, options can be made to have different settings
+ in different parts of the pattern. Any changes made in one alternative
+ do carry on into subsequent branches within the same subpattern. For
example,
(a(?i)b|c)
- matches "ab", "aB", "c", and "C", even though when matching "C" the
- first branch is abandoned before the option setting. This is because
- the effects of option settings happen at compile time. There would be
+ matches "ab", "aB", "c", and "C", even though when matching "C" the
+ first branch is abandoned before the option setting. This is because
+ the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.
- Note: There are other PCRE-specific options that can be set by the
- application when the compile or match functions are called. In some
+ Note: There are other PCRE-specific options that can be set by the
+ application when the compile or match functions are called. In some
cases the pattern can contain special leading sequences such as (*CRLF)
- to override what the application has set or what has been defaulted.
- Details are given in the section entitled "Newline sequences" above.
- There is also the (*UTF8) leading sequence that can be used to set
+ to override what the application has set or what has been defaulted.
+ Details are given in the section entitled "Newline sequences" above.
+ There is also the (*UTF8) leading sequence that can be used to set
UTF-8 mode; this is equivalent to setting the PCRE_UTF8 option.
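A rough sketch of option letters placed at the very start of a pattern, using Python's re module in place of PCRE. The assumption is that a leading (?im) behaves the same way in both. The flags attribute of the compiled Python pattern is only a loose analogue of the data returned by pcre_fullinfo(). Option changes in the middle of a pattern and the hyphen form for unsetting options are deliberately not attempted here.

```python
import re

subject = "def\nABC"

# Option letters at the very start of the pattern...
inline = re.compile(r"(?im)^abc$")

# ...compared with the same options passed by the application.
explicit = re.compile(r"^abc$", re.IGNORECASE | re.MULTILINE)

inline_match = inline.search(subject)
explicit_match = explicit.search(subject)

# The leading option setting shows up in the compiled pattern's flags.
flags_visible = bool(inline.flags & re.IGNORECASE) and bool(
    inline.flags & re.MULTILINE
)

print(inline_match.group(), explicit_match.group(), flags_visible)
```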
@@ -3981,18 +3990,18 @@ SUBPATTERNS
cat(aract|erpillar|)
- matches one of the words "cat", "cataract", or "caterpillar". Without
- the parentheses, it would match "cataract", "erpillar" or an empty
+ matches one of the words "cat", "cataract", or "caterpillar". Without
+ the parentheses, it would match "cataract", "erpillar" or an empty
string.
- 2. It sets up the subpattern as a capturing subpattern. This means
- that, when the whole pattern matches, that portion of the subject
+ 2. It sets up the subpattern as a capturing subpattern. This means
+ that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
- ovector argument of pcre_exec(). Opening parentheses are counted from
- left to right (starting from 1) to obtain numbers for the capturing
+ ovector argument of pcre_exec(). Opening parentheses are counted from
+ left to right (starting from 1) to obtain numbers for the capturing
subpatterns.
- For example, if the string "the red king" is matched against the pat-
+ For example, if the string "the red king" is matched against the pat-
tern
the ((red|white) (king|queen))
@@ -4000,12 +4009,12 @@ SUBPATTERNS
the captured substrings are "red king", "red", and "king", and are num-
bered 1, 2, and 3, respectively.
- The fact that plain parentheses fulfil two functions is not always
- helpful. There are often times when a grouping subpattern is required
- without a capturing requirement. If an opening parenthesis is followed
- by a question mark and a colon, the subpattern does not do any captur-
- ing, and is not counted when computing the number of any subsequent
- capturing subpatterns. For example, if the string "the white queen" is
+ The fact that plain parentheses fulfil two functions is not always
+ helpful. There are often times when a grouping subpattern is required
+ without a capturing requirement. If an opening parenthesis is followed
+ by a question mark and a colon, the subpattern does not do any captur-
+ ing, and is not counted when computing the number of any subsequent
+ capturing subpatterns. For example, if the string "the white queen" is
matched against the pattern
the ((?:red|white) (king|queen))
@@ -4013,96 +4022,96 @@ SUBPATTERNS
the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
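The numbering of capturing and non-capturing parentheses described above can be reproduced with a short sketch. It uses Python's re module instead of the ovector argument of pcre_exec(), on the assumption that both count opening parentheses from left to right, starting from 1.

```python
import re

# Localized alternatives: the empty third alternative lets plain "cat" match.
localized = [
    re.fullmatch(r"cat(aract|erpillar|)", word) is not None
    for word in ("cat", "cataract", "caterpillar")
]

# Three capturing subpatterns, numbered by their opening parenthesis.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)

print(localized)
print(capturing.groups())
print(non_capturing.groups())
```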
- As a convenient shorthand, if any option settings are required at the
- start of a non-capturing subpattern, the option letters may appear
+ As a convenient shorthand, if any option settings are required at the
+ start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns
(?i:saturday|sunday)
(?:(?i)saturday|sunday)
match exactly the same set of strings. Because alternative branches are
- tried from left to right, and options are not reset until the end of
- the subpattern is reached, an option setting in one branch does affect
- subsequent branches, so the above patterns match "SUNDAY" as well as
+ tried from left to right, and options are not reset until the end of
+ the subpattern is reached, an option setting in one branch does affect
+ subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".
DUPLICATE SUBPATTERN NUMBERS
Perl 5.10 introduced a feature whereby each alternative in a subpattern
- uses the same numbers for its capturing parentheses. Such a subpattern
- starts with (?| and is itself a non-capturing subpattern. For example,
+ uses the same numbers for its capturing parentheses. Such a subpattern
+ starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:
(?|(Sat)ur|(Sun))day
- Because the two alternatives are inside a (?| group, both sets of cap-
- turing parentheses are numbered one. Thus, when the pattern matches,
- you can look at captured substring number one, whichever alternative
- matched. This construct is useful when you want to capture part, but
+ Because the two alternatives are inside a (?| group, both sets of cap-
+ turing parentheses are numbered one. Thus, when the pattern matches,
+ you can look at captured substring number one, whichever alternative
+ matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group, paren-
- theses are numbered as usual, but the number is reset at the start of
- each branch. The numbers of any capturing buffers that follow the sub-
- pattern start after the highest number used in any branch. The follow-
- ing example is taken from the Perl documentation. The numbers under-
+ theses are numbered as usual, but the number is reset at the start of
+ each branch. The numbers of any capturing buffers that follow the sub-
+ pattern start after the highest number used in any branch. The follow-
+ ing example is taken from the Perl documentation. The numbers under-
neath show in which buffer the captured content will be stored.
# before ---------------branch-reset----------- after
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# 1 2 2 3 2 3 4
- A back reference to a numbered subpattern uses the most recent value
- that is set for that number by any subpattern. The following pattern
+ A back reference to a numbered subpattern uses the most recent value
+ that is set for that number by any subpattern. The following pattern
matches "abcabc" or "defdef":
/(?|(abc)|(def))\1/
- In contrast, a recursive or "subroutine" call to a numbered subpattern
- always refers to the first one in the pattern with the given number.
+ In contrast, a recursive or "subroutine" call to a numbered subpattern
+ always refers to the first one in the pattern with the given number.
The following pattern matches "abcabc" or "defabc":
/(?|(abc)|(def))(?1)/
- If a condition test for a subpattern's having matched refers to a non-
- unique number, the test is true if any of the subpatterns of that num-
+ If a condition test for a subpattern's having matched refers to a non-
+ unique number, the test is true if any of the subpatterns of that num-
ber have matched.
- An alternative approach to using this "branch reset" feature is to use
+ An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.
NAMED SUBPATTERNS
- Identifying capturing parentheses by number is simple, but it can be
- very hard to keep track of the numbers in complicated regular expres-
- sions. Furthermore, if an expression is modified, the numbers may
- change. To help with this difficulty, PCRE supports the naming of sub-
+ Identifying capturing parentheses by number is simple, but it can be
+ very hard to keep track of the numbers in complicated regular expres-
+ sions. Furthermore, if an expression is modified, the numbers may
+ change. To help with this difficulty, PCRE supports the naming of sub-
patterns. This feature was not added to Perl until release 5.10. Python
- had the feature earlier, and PCRE introduced it at release 4.0, using
- the Python syntax. PCRE now supports both the Perl and the Python syn-
- tax. Perl allows identically numbered subpatterns to have different
+ had the feature earlier, and PCRE introduced it at release 4.0, using
+ the Python syntax. PCRE now supports both the Perl and the Python syn-
+ tax. Perl allows identically numbered subpatterns to have different
names, but PCRE does not.
- In PCRE, a subpattern can be named in one of three ways: (?<name>...)
- or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
- to capturing parentheses from other parts of the pattern, such as back
- references, recursion, and conditions, can be made by name as well as
+ In PCRE, a subpattern can be named in one of three ways: (?<name>...)
+ or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
+ to capturing parentheses from other parts of the pattern, such as back
+ references, recursion, and conditions, can be made by name as well as
by number.
- Names consist of up to 32 alphanumeric characters and underscores.
- Named capturing parentheses are still allocated numbers as well as
- names, exactly as if the names were not present. The PCRE API provides
+ Names consist of up to 32 alphanumeric characters and underscores.
+ Named capturing parentheses are still allocated numbers as well as
+ names, exactly as if the names were not present. The PCRE API provides
function calls for extracting the name-to-number translation table from
a compiled pattern. There is also a convenience function for extracting
a captured substring by name.
- By default, a name must be unique within a pattern, but it is possible
+ By default, a name must be unique within a pattern, but it is possible
to relax this constraint by setting the PCRE_DUPNAMES option at compile
- time. (Duplicate names are also always permitted for subpatterns with
- the same number, set up as described in the previous section.) Dupli-
- cate names can be useful for patterns where only one instance of the
- named parentheses can match. Suppose you want to match the name of a
- weekday, either as a 3-letter abbreviation or as the full name, and in
+ time. (Duplicate names are also always permitted for subpatterns with
+ the same number, set up as described in the previous section.) Dupli-
+ cate names can be useful for patterns where only one instance of the
+ named parentheses can match. Suppose you want to match the name of a
+ weekday, either as a 3-letter abbreviation or as the full name, and in
both cases you want to extract the abbreviation. This pattern (ignoring
the line breaks) does the job:
@@ -4112,38 +4121,38 @@ NAMED SUBPATTERNS
(?<DN>Thu)(?:rsday)?|
(?<DN>Sat)(?:urday)?
- There are five capturing substrings, but only one is ever set after a
+ There are five capturing substrings, but only one is ever set after a
match. (An alternative way of solving this problem is to use a "branch
reset" subpattern, as described in the previous section.)
- The convenience function for extracting the data by name returns the
- substring for the first (and in this example, the only) subpattern of
- that name that matched. This saves searching to find which numbered
+ The convenience function for extracting the data by name returns the
+ substring for the first (and in this example, the only) subpattern of
+ that name that matched. This saves searching to find which numbered
subpattern it was.
- If you make a back reference to a non-unique named subpattern from
- elsewhere in the pattern, the one that corresponds to the first occur-
+ If you make a back reference to a non-unique named subpattern from
+ elsewhere in the pattern, the one that corresponds to the first occur-
rence of the name is used. In the absence of duplicate numbers (see the
- previous section) this is the one with the lowest number. If you use a
- named reference in a condition test (see the section about conditions
- below), either to check whether a subpattern has matched, or to check
- for recursion, all subpatterns with the same name are tested. If the
- condition is true for any one of them, the overall condition is true.
+ previous section) this is the one with the lowest number. If you use a
+ named reference in a condition test (see the section about conditions
+ below), either to check whether a subpattern has matched, or to check
+ for recursion, all subpatterns with the same name are tested. If the
+ condition is true for any one of them, the overall condition is true.
This is the same behaviour as testing by number. For further details of
the interfaces for handling named subpatterns, see the pcreapi documen-
tation.
Warning: You cannot use different names to distinguish between two sub-
- patterns with the same number because PCRE uses only the numbers when
+ patterns with the same number because PCRE uses only the numbers when
matching. For this reason, an error is given at compile time if differ-
- ent names are given to subpatterns with the same number. However, you
- can give the same name to subpatterns with the same number, even when
+ ent names are given to subpatterns with the same number. However, you
+ can give the same name to subpatterns with the same number, even when
PCRE_DUPNAMES is not set.
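A minimal sketch of named subpatterns using the Python syntax (?P<name>...), which the text above says PCRE also accepts. It runs on Python's re module, not PCRE. The groupindex attribute is Python's rough counterpart of the name-to-number table that the PCRE API exposes. The date pattern and its group names are hypothetical. Duplicate names (PCRE_DUPNAMES) are not shown.

```python
import re

# Two named subpatterns; they are still numbered 1 and 2 as well.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = date.fullmatch("2010-03")

# Name-to-number translation table for the compiled pattern.
name_to_number = dict(date.groupindex)

# The same captured substring can be fetched by name or by number.
by_name = match.group("year")
by_number = match.group(1)

print(name_to_number, by_name, by_number)
```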
REPETITION
- Repetition is specified by quantifiers, which can follow any of the
+ Repetition is specified by quantifiers, which can follow any of the
following items:
a literal data character
@@ -4157,17 +4166,17 @@ REPETITION
a parenthesized subpattern (unless it is an assertion)
a recursive or "subroutine" call to a subpattern
- The general repetition quantifier specifies a minimum and maximum num-
- ber of permitted matches, by giving the two numbers in curly brackets
- (braces), separated by a comma. The numbers must be less than 65536,
+ The general repetition quantifier specifies a minimum and maximum num-
+ ber of permitted matches, by giving the two numbers in curly brackets
+ (braces), separated by a comma. The numbers must be less than 65536,
and the first must be less than or equal to the second. For example:
z{2,4}
- matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
- special character. If the second number is omitted, but the comma is
- present, there is no upper limit; if the second number and the comma
- are both omitted, the quantifier specifies an exact number of required
+ matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
+ special character. If the second number is omitted, but the comma is
+ present, there is no upper limit; if the second number and the comma
+ are both omitted, the quantifier specifies an exact number of required
matches. Thus
[aeiou]{3,}
@@ -4176,49 +4185,49 @@ REPETITION
\d{8}
- matches exactly 8 digits. An opening curly bracket that appears in a
- position where a quantifier is not allowed, or one that does not match
- the syntax of a quantifier, is taken as a literal character. For exam-
+ matches exactly 8 digits. An opening curly bracket that appears in a
+ position where a quantifier is not allowed, or one that does not match
+ the syntax of a quantifier, is taken as a literal character. For exam-
ple, {,6} is not a quantifier, but a literal string of four characters.
- In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
+ In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
acters, each of which is represented by a two-byte sequence. Similarly,
when Unicode property support is available, \X{3} matches three Unicode
- extended sequences, each of which may be several bytes long (and they
+ extended sequences, each of which may be several bytes long (and they
may be of different lengths).
The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present. This may be use-
- ful for subpatterns that are referenced as subroutines from elsewhere
+ ful for subpatterns that are referenced as subroutines from elsewhere
in the pattern. Items other than subpatterns that have a {0} quantifier
are omitted from the compiled pattern.
- For convenience, the three most common quantifiers have single-charac-
+ For convenience, the three most common quantifiers have single-charac-
ter abbreviations:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
- It is possible to construct infinite loops by following a subpattern
+ It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit,
for example:
(a?)*
Earlier versions of Perl and PCRE used to give an error at compile time
- for such patterns. However, because there are cases where this can be
- useful, such patterns are now accepted, but if any repetition of the
- subpattern does in fact match no characters, the loop is forcibly bro-
+ for such patterns. However, because there are cases where this can be
+ useful, such patterns are now accepted, but if any repetition of the
+ subpattern does in fact match no characters, the loop is forcibly bro-
ken.
- By default, the quantifiers are "greedy", that is, they match as much
- as possible (up to the maximum number of permitted times), without
- causing the rest of the pattern to fail. The classic example of where
+ By default, the quantifiers are "greedy", that is, they match as much
+ as possible (up to the maximum number of permitted times), without
+ causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs. These
- appear between /* and */ and within the comment, individual * and /
- characters may appear. An attempt to match C comments by applying the
+ appear between /* and */ and within the comment, individual * and /
+ characters may appear. An attempt to match C comments by applying the
pattern
/\*.*\*/
@@ -4227,19 +4236,19 @@ REPETITION
/* first comment */ not comment /* second comment */
- fails, because it matches the entire string owing to the greediness of
+ fails, because it matches the entire string owing to the greediness of
the .* item.
- However, if a quantifier is followed by a question mark, it ceases to
+ However, if a quantifier is followed by a question mark, it ceases to
be greedy, and instead matches the minimum number of times possible, so
the pattern
/\*.*?\*/
- does the right thing with the C comments. The meaning of the various
- quantifiers is not otherwise changed, just the preferred number of
- matches. Do not confuse this use of question mark with its use as a
- quantifier in its own right. Because it has two uses, it can sometimes
+ does the right thing with the C comments. The meaning of the various
+ quantifiers is not otherwise changed, just the preferred number of
+ matches. Do not confuse this use of question mark with its use as a
+ quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in
\d??\d
@@ -4247,36 +4256,36 @@ REPETITION
which matches one digit by preference, but can match two if that is the
only way the rest of the pattern matches.
- If the PCRE_UNGREEDY option is set (an option that is not available in
- Perl), the quantifiers are not greedy by default, but individual ones
- can be made greedy by following them with a question mark. In other
+ If the PCRE_UNGREEDY option is set (an option that is not available in
+ Perl), the quantifiers are not greedy by default, but individual ones
+ can be made greedy by following them with a question mark. In other
words, it inverts the default behaviour.
- When a parenthesized subpattern is quantified with a minimum repeat
- count that is greater than 1 or with a limited maximum, more memory is
- required for the compiled pattern, in proportion to the size of the
+ When a parenthesized subpattern is quantified with a minimum repeat
+ count that is greater than 1 or with a limited maximum, more memory is
+ required for the compiled pattern, in proportion to the size of the
minimum or maximum.
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
- alent to Perl's /s) is set, thus allowing the dot to match newlines,
- the pattern is implicitly anchored, because whatever follows will be
- tried against every character position in the subject string, so there
- is no point in retrying the overall match at any position after the
- first. PCRE normally treats such a pattern as though it were preceded
+ alent to Perl's /s) is set, thus allowing the dot to match newlines,
+ the pattern is implicitly anchored, because whatever follows will be
+ tried against every character position in the subject string, so there
+ is no point in retrying the overall match at any position after the
+ first. PCRE normally treats such a pattern as though it were preceded
by \A.
- In cases where it is known that the subject string contains no new-
- lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
+ In cases where it is known that the subject string contains no new-
+ lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
mization, or alternatively using ^ to indicate anchoring explicitly.
- However, there is one situation where the optimization cannot be used.
+ However, there is one situation where the optimization cannot be used.
When .* is inside capturing parentheses that are the subject of a back
reference elsewhere in the pattern, a match at the start may fail where
a later one succeeds. Consider, for example:
(.*)abc\1
- If the subject is "xyz123abc123" the match point is the fourth charac-
+ If the subject is "xyz123abc123" the match point is the fourth charac-
ter. For this reason, such a pattern is not implicitly anchored.
When a capturing subpattern is repeated, the value captured is the sub-
@@ -4285,8 +4294,8 @@ REPETITION
(tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the captured substring
- is "tweedledee". However, if there are nested capturing subpatterns,
- the corresponding captured values may have been set in previous itera-
+ is "tweedledee". However, if there are nested capturing subpatterns,
+ the corresponding captured values may have been set in previous itera-
tions. For example, after
/(a|(b))+/
@@ -4296,53 +4305,53 @@ REPETITION
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
- With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
- repetition, failure of what follows normally causes the repeated item
- to be re-evaluated to see if a different number of repeats allows the
- rest of the pattern to match. Sometimes it is useful to prevent this,
- either to change the nature of the match, or to cause it to fail earlier
- than it otherwise might, when the author of the pattern knows there is
+ With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
+ repetition, failure of what follows normally causes the repeated item
+ to be re-evaluated to see if a different number of repeats allows the
+ rest of the pattern to match. Sometimes it is useful to prevent this,
+ either to change the nature of the match, or to cause it to fail earlier
+ than it otherwise might, when the author of the pattern knows there is
no point in carrying on.
- Consider, for example, the pattern \d+foo when applied to the subject
+ Consider, for example, the pattern \d+foo when applied to the subject
line
123456bar
After matching all 6 digits and then failing to match "foo", the normal
- action of the matcher is to try again with only 5 digits matching the
- \d+ item, and then with 4, and so on, before ultimately failing.
- "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
- the means for specifying that once a subpattern has matched, it is not
+ action of the matcher is to try again with only 5 digits matching the
+ \d+ item, and then with 4, and so on, before ultimately failing.
+ "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
+ the means for specifying that once a subpattern has matched, it is not
to be re-evaluated in this way.
- If we use atomic grouping for the previous example, the matcher gives
- up immediately on failing to match "foo" the first time. The notation
+ If we use atomic grouping for the previous example, the matcher gives
+ up immediately on failing to match "foo" the first time. The notation
is a kind of special parenthesis, starting with (?> as in this example:
(?>\d+)foo
- This kind of parenthesis "locks up" the part of the pattern it con-
- tains once it has matched, and a failure further into the pattern is
- prevented from backtracking into it. Backtracking past it to previous
+ This kind of parenthesis "locks up" the part of the pattern it con-
+ tains once it has matched, and a failure further into the pattern is
+ prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.
- An alternative description is that a subpattern of this type matches
- the string of characters that an identical standalone pattern would
+ An alternative description is that a subpattern of this type matches
+ the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.
Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that
- must swallow everything it can. So, while both \d+ and \d+? are pre-
- pared to adjust the number of digits they match in order to make the
+ must swallow everything it can. So, while both \d+ and \d+? are pre-
+ pared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.
- Atomic groups in general can of course contain arbitrarily complicated
- subpatterns, and can be nested. However, when the subpattern for an
+ Atomic groups in general can of course contain arbitrarily complicated
+ subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a
- simpler notation, called a "possessive quantifier" can be used. This
- consists of an additional + character following a quantifier. Using
+ simpler notation, called a "possessive quantifier" can be used. This
+ consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as
\d++foo
@@ -4352,45 +4361,45 @@ ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
(abc|xyz){2,3}+
- Possessive quantifiers are always greedy; the setting of the
+ Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation for the
- simpler forms of atomic group. However, there is no difference in the
- meaning of a possessive quantifier and the equivalent atomic group,
- though there may be a performance difference; possessive quantifiers
+ simpler forms of atomic group. However, there is no difference in the
+ meaning of a possessive quantifier and the equivalent atomic group,
+ though there may be a performance difference; possessive quantifiers
should be slightly faster.
- The possessive quantifier syntax is an extension to the Perl 5.8 syn-
- tax. Jeffrey Friedl originated the idea (and the name) in the first
+ The possessive quantifier syntax is an extension to the Perl 5.8 syn-
+ tax. Jeffrey Friedl originated the idea (and the name) in the first
edition of his book. Mike McCloskey liked it, so implemented it when he
- built Sun's Java package, and PCRE copied it from there. It ultimately
+ built Sun's Java package, and PCRE copied it from there. It ultimately
found its way into Perl at release 5.10.
PCRE has an optimization that automatically "possessifies" certain sim-
- ple pattern constructs. For example, the sequence A+B is treated as
- A++B because there is no point in backtracking into a sequence of A's
+ ple pattern constructs. For example, the sequence A+B is treated as
+ A++B because there is no point in backtracking into a sequence of A's
when B must follow.
- When a pattern contains an unlimited repeat inside a subpattern that
- can itself be repeated an unlimited number of times, the use of an
- atomic group is the only way to avoid some failing matches taking a
+ When a pattern contains an unlimited repeat inside a subpattern that
+ can itself be repeated an unlimited number of times, the use of an
+ atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern
(\D+|<\d+>)*[!?]
- matches an unlimited number of substrings that either consist of non-
- digits, or digits enclosed in <>, followed by either ! or ?. When it
+ matches an unlimited number of substrings that either consist of non-
+ digits, or digits enclosed in <>, followed by either ! or ?. When it
matches, it runs quickly. However, if it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
- it takes a long time before reporting failure. This is because the
- string can be divided between the internal \D+ repeat and the external
- * repeat in a large number of ways, and all have to be tried. (The
- example uses [!?] rather than a single character at the end, because
- both PCRE and Perl have an optimization that allows for fast failure
- when a single character is used. They remember the last single charac-
- ter that is required for a match, and fail early if it is not present
- in the string.) If the pattern is changed so that it uses an atomic
+ it takes a long time before reporting failure. This is because the
+ string can be divided between the internal \D+ repeat and the external
+ * repeat in a large number of ways, and all have to be tried. (The
+ example uses [!?] rather than a single character at the end, because
+ both PCRE and Perl have an optimization that allows for fast failure
+ when a single character is used. They remember the last single charac-
+ ter that is required for a match, and fail early if it is not present
+ in the string.) If the pattern is changed so that it uses an atomic
group, like this:
((?>\D+)|<\d+>)*[!?]
@@ -4402,37 +4411,37 @@ BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing sub-
- pattern earlier (that is, to its left) in the pattern, provided there
+ pattern earlier (that is, to its left) in the pattern, provided there
have been that many previous capturing left parentheses.
However, if the decimal number following the backslash is less than 10,
- it is always taken as a back reference, and causes an error only if
- there are not that many capturing left parentheses in the entire pat-
- tern. In other words, the parentheses that are referenced need not be
- to the left of the reference for numbers less than 10. A "forward back
- reference" of this type can make sense when a repetition is involved
- and the subpattern to the right has participated in an earlier itera-
+ it is always taken as a back reference, and causes an error only if
+ there are not that many capturing left parentheses in the entire pat-
+ tern. In other words, the parentheses that are referenced need not be
+ to the left of the reference for numbers less than 10. A "forward back
+ reference" of this type can make sense when a repetition is involved
+ and the subpattern to the right has participated in an earlier itera-
tion.
- It is not possible to have a numerical "forward back reference" to a
- subpattern whose number is 10 or more using this syntax because a
- sequence such as \50 is interpreted as a character defined in octal.
+ It is not possible to have a numerical "forward back reference" to a
+ subpattern whose number is 10 or more using this syntax because a
+ sequence such as \50 is interpreted as a character defined in octal.
See the subsection entitled "Non-printing characters" above for further
- details of the handling of digits following a backslash. There is no
- such problem when named parentheses are used. A back reference to any
+ details of the handling of digits following a backslash. There is no
+ such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).
- Another way of avoiding the ambiguity inherent in the use of digits
+ Another way of avoiding the ambiguity inherent in the use of digits
following a backslash is to use the \g escape sequence, which is a fea-
- ture introduced in Perl 5.10. This escape must be followed by an
- unsigned number or a negative number, optionally enclosed in braces.
+ ture introduced in Perl 5.10. This escape must be followed by an
+ unsigned number or a negative number, optionally enclosed in braces.
These examples are all identical:
(ring), \1
(ring), \g1
(ring), \g{1}
- An unsigned number specifies an absolute reference without the ambigu-
+ An unsigned number specifies an absolute reference without the ambigu-
ity that is present in the older syntax. It is also useful when literal
digits follow the reference. A negative number is a relative reference.
Consider this example:
@@ -4440,33 +4449,33 @@ BACK REFERENCES
(abc(def)ghi)\g{-1}
The sequence \g{-1} is a reference to the most recently started captur-
- ing subpattern before \g, that is, it is equivalent to \2. Similarly,
+ ing subpattern before \g, that is, it is equivalent to \2. Similarly,
\g{-2} would be equivalent to \1. The use of relative references can be
- helpful in long patterns, and also in patterns that are created by
+ helpful in long patterns, and also in patterns that are created by
joining together fragments that contain references within themselves.
- A back reference matches whatever actually matched the capturing sub-
- pattern in the current subject string, rather than anything matching
+ A back reference matches whatever actually matched the capturing sub-
+ pattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern
(sens|respons)e and \1ibility
- matches "sense and sensibility" and "response and responsibility", but
- not "sense and responsibility". If caseful matching is in force at the
- time of the back reference, the case of letters is relevant. For exam-
+ matches "sense and sensibility" and "response and responsibility", but
+ not "sense and responsibility". If caseful matching is in force at the
+ time of the back reference, the case of letters is relevant. For exam-
ple,
((?i)rah)\s+\1
- matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
+ matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
- There are several different ways of writing back references to named
- subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
- \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
+ There are several different ways of writing back references to named
+ subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
+ \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
unified back reference syntax, in which \g can be used for both numeric
- and named references, is also supported. We could rewrite the above
+ and named references, is also supported. We could rewrite the above
example in any of the following ways:
(?<p1>(?i)rah)\s+\k<p1>
@@ -4474,67 +4483,67 @@ BACK REFERENCES
(?P<p1>(?i)rah)\s+(?P=p1)
(?<p1>(?i)rah)\s+\g{p1}
- A subpattern that is referenced by name may appear in the pattern
+ A subpattern that is referenced by name may appear in the pattern
before or after the reference.
- There may be more than one back reference to the same subpattern. If a
- subpattern has not actually been used in a particular match, any back
+ There may be more than one back reference to the same subpattern. If a
+ subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern
(a|(bc))\2
- always fails if it starts to match "a" rather than "bc". However, if
+ always fails if it starts to match "a" rather than "bc". However, if
the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
ence to an unset value matches an empty string.
- Because there may be many capturing parentheses in a pattern, all dig-
- its following a backslash are taken as part of a potential back refer-
- ence number. If the pattern continues with a digit character, some
- delimiter must be used to terminate the back reference. If the
+ Because there may be many capturing parentheses in a pattern, all dig-
+ its following a backslash are taken as part of a potential back refer-
+ ence number. If the pattern continues with a digit character, some
+ delimiter must be used to terminate the back reference. If the
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
syntax or an empty comment (see "Comments" below) can be used.
Recursive back references
- A back reference that occurs inside the parentheses to which it refers
- fails when the subpattern is first used, so, for example, (a\1) never
- matches. However, such references can be useful inside repeated sub-
+ A back reference that occurs inside the parentheses to which it refers
+ fails when the subpattern is first used, so, for example, (a\1) never
+ matches. However, such references can be useful inside repeated sub-
patterns. For example, the pattern
(a|b\1)+
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
- ation of the subpattern, the back reference matches the character
- string corresponding to the previous iteration. In order for this to
- work, the pattern must be such that the first iteration does not need
- to match the back reference. This can be done using alternation, as in
+ ation of the subpattern, the back reference matches the character
+ string corresponding to the previous iteration. In order for this to
+ work, the pattern must be such that the first iteration does not need
+ to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.
- Back references of this type cause the group that they reference to be
- treated as an atomic group. Once the whole group has been matched, a
- subsequent matching failure cannot cause backtracking into the middle
+ Back references of this type cause the group that they reference to be
+ treated as an atomic group. Once the whole group has been matched, a
+ subsequent matching failure cannot cause backtracking into the middle
of the group.
ASSERTIONS
- An assertion is a test on the characters following or preceding the
- current matching point that does not actually consume any characters.
- The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
+ An assertion is a test on the characters following or preceding the
+ current matching point that does not actually consume any characters.
+ The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above.
- More complicated assertions are coded as subpatterns. There are two
- kinds: those that look ahead of the current position in the subject
- string, and those that look behind it. An assertion subpattern is
- matched in the normal way, except that it does not cause the current
+ More complicated assertions are coded as subpatterns. There are two
+ kinds: those that look ahead of the current position in the subject
+ string, and those that look behind it. An assertion subpattern is
+ matched in the normal way, except that it does not cause the current
matching position to be changed.
- Assertion subpatterns are not capturing subpatterns, and may not be
- repeated, because it makes no sense to assert the same thing several
- times. If any kind of assertion contains capturing subpatterns within
- it, these are counted for the purposes of numbering the capturing sub-
+ Assertion subpatterns are not capturing subpatterns, and may not be
+ repeated, because it makes no sense to assert the same thing several
+ times. If any kind of assertion contains capturing subpatterns within
+ it, these are counted for the purposes of numbering the capturing sub-
patterns in the whole pattern. However, substring capturing is carried
- out only for positive assertions, because it does not make sense for
+ out only for positive assertions, because it does not make sense for
negative assertions.
Lookahead assertions
@@ -4544,38 +4553,38 @@ ASSERTIONS
\w+(?=;)
- matches a word followed by a semicolon, but does not include the semi-
+ matches a word followed by a semicolon, but does not include the semi-
colon in the match, and
foo(?!bar)
- matches any occurrence of "foo" that is not followed by "bar". Note
+ matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern
(?!foo)bar
- does not find an occurrence of "bar" that is preceded by something
- other than "foo"; it finds any occurrence of "bar" whatsoever, because
+ does not find an occurrence of "bar" that is preceded by something
+ other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve the other effect.
If you want to force a matching failure at some point in a pattern, the
- most convenient way to do it is with (?!) because an empty string
- always matches, so an assertion that requires there not to be an empty
- string must always fail. The Perl 5.10 backtracking control verb
+ most convenient way to do it is with (?!) because an empty string
+ always matches, so an assertion that requires there not to be an empty
+ string must always fail. The Perl 5.10 backtracking control verb
(*FAIL) or (*F) is essentially a synonym for (?!).
Lookbehind assertions
- Lookbehind assertions start with (?<= for positive assertions and (?<!
+ Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,
(?<!foo)bar
- does find an occurrence of "bar" that is not preceded by "foo". The
- contents of a lookbehind assertion are restricted such that all the
+ does find an occurrence of "bar" that is not preceded by "foo". The
+ contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are sev-
- eral top-level alternatives, they do not all have to have the same
+ eral top-level alternatives, they do not all have to have the same
fixed length. Thus
(?<=bullock|donkey)
@@ -4584,62 +4593,62 @@ ASSERTIONS
(?<!dogs?|cats?)
- causes an error at compile time. Branches that match different length
- strings are permitted only at the top level of a lookbehind assertion.
- This is an extension compared with Perl (5.8 and 5.10), which requires
+ causes an error at compile time. Branches that match different length
+ strings are permitted only at the top level of a lookbehind assertion.
+ This is an extension compared with Perl (5.8 and 5.10), which requires
all branches to match the same length of string. An assertion such as
(?<=ab(c|de))
- is not permitted, because its single top-level branch can match two
+ is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable to PCRE if rewritten to use two
top-level branches:
(?<=abc|abde)
In some cases, the Perl 5.10 escape sequence \K (see above) can be used
- instead of a lookbehind assertion to get round the fixed-length
+ instead of a lookbehind assertion to get round the fixed-length
restriction.
- The implementation of lookbehind assertions is, for each alternative,
- to temporarily move the current position back by the fixed length and
+ The implementation of lookbehind assertions is, for each alternative,
+ to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the cur-
rent position, the assertion fails.
PCRE does not allow the \C escape (which matches a single byte in UTF-8
- mode) to appear in lookbehind assertions, because it makes it impossi-
- ble to calculate the length of the lookbehind. The \X and \R escapes,
+ mode) to appear in lookbehind assertions, because it makes it impossi-
+ ble to calculate the length of the lookbehind. The \X and \R escapes,
which can match different numbers of bytes, are also not permitted.
- "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
- lookbehinds, as long as the subpattern matches a fixed-length string.
+ "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
+ lookbehinds, as long as the subpattern matches a fixed-length string.
Recursion, however, is not supported.
- Possessive quantifiers can be used in conjunction with lookbehind
+ Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching of fixed-length strings at the
end of subject strings. Consider a simple pattern such as
abcd$
- when applied to a long string that does not match. Because matching
+ when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
- and then see if what follows matches the rest of the pattern. If the
+ and then see if what follows matches the rest of the pattern. If the
pattern is specified as
^.*abcd$
- the initial .* matches the entire string at first, but when this fails
+ the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
- last character, then all but the last two characters, and so on. Once
- again the search for "a" covers the entire string, from right to left,
+ last character, then all but the last two characters, and so on. Once
+ again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as
^.*+(?<=abcd)
- there can be no backtracking for the .*+ item; it can match only the
- entire string. The subsequent lookbehind assertion does a single test
- on the last four characters. If it fails, the match fails immediately.
- For long strings, this approach makes a significant difference to the
+ there can be no backtracking for the .*+ item; it can match only the
+ entire string. The subsequent lookbehind assertion does a single test
+ on the last four characters. If it fails, the match fails immediately.
+ For long strings, this approach makes a significant difference to the
processing time.
Using multiple assertions
@@ -4648,18 +4657,18 @@ ASSERTIONS
(?<=\d{3})(?<!999)foo
- matches "foo" preceded by three digits that are not "999". Notice that
- each of the assertions is applied independently at the same point in
- the subject string. First there is a check that the previous three
- characters are all digits, and then there is a check that the same
+ matches "foo" preceded by three digits that are not "999". Notice that
+ each of the assertions is applied independently at the same point in
+ the subject string. First there is a check that the previous three
+ characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo" pre-
- ceded by six characters, the first of which are digits and the last
- three of which are not "999". For example, it doesn't match "123abc-
+ ceded by six characters, the first of which are digits and the last
+ three of which are not "999". For example, it doesn't match "123abc-
foo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo
- This time the first assertion looks at the preceding six characters,
+ This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
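The two patterns above can be tried out with a short script. This is only a sketch: it uses Python's re module rather than PCRE itself, on the assumption that both treat these fixed-length lookbehinds the same way, and the subject strings are made up for illustration.

```python
import re

# Both lookbehinds are tested at the same point, immediately before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first lookbehind spans six characters: three digits, then any three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          bool(same_three.search(subject)),
          bool(six_back.search(subject)))
```

The first pattern should accept only "123foo" from this list, and the second only "123abcfoo".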
@@ -4667,96 +4676,96 @@ ASSERTIONS
(?<=(?<!foo)bar)baz
- matches an occurrence of "baz" that is preceded by "bar" which in turn
+ matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while
(?<=\d{3}(?!999)...)foo
- is another pattern that matches "foo" preceded by three digits and any
+ is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".
CONDITIONAL SUBPATTERNS
- It is possible to cause the matching process to obey a subpattern con-
- ditionally or to choose between two alternative subpatterns, depending
- on the result of an assertion, or whether a specific capturing subpat-
- tern has already been matched. The two possible forms of conditional
+ It is possible to cause the matching process to obey a subpattern con-
+ ditionally or to choose between two alternative subpatterns, depending
+ on the result of an assertion, or whether a specific capturing subpat-
+ tern has already been matched. The two possible forms of conditional
subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
- If the condition is satisfied, the yes-pattern is used; otherwise the
- no-pattern (if present) is used. If there are more than two alterna-
+ If the condition is satisfied, the yes-pattern is used; otherwise the
+ no-pattern (if present) is used. If there are more than two alterna-
tives in the subpattern, a compile-time error occurs.
- There are four kinds of condition: references to subpatterns, refer-
+ There are four kinds of condition: references to subpatterns, refer-
ences to recursion, a pseudo-condition called DEFINE, and assertions.
Checking for a used subpattern by number
- If the text between the parentheses consists of a sequence of digits,
+ If the text between the parentheses consists of a sequence of digits,
the condition is true if a capturing subpattern of that number has pre-
- viously matched. If there is more than one capturing subpattern with
- the same number (see the earlier section about duplicate subpattern
+ viously matched. If there is more than one capturing subpattern with
+ the same number (see the earlier section about duplicate subpattern
numbers), the condition is true if any of them have been set. An alter-
- native notation is to precede the digits with a plus or minus sign. In
- this case, the subpattern number is relative rather than absolute. The
- most recently opened parentheses can be referenced by (?(-1), the next
- most recent by (?(-2), and so on. In looping constructs it can also
- make sense to refer to subsequent groups with constructs such as
+ native notation is to precede the digits with a plus or minus sign. In
+ this case, the subpattern number is relative rather than absolute. The
+ most recently opened parentheses can be referenced by (?(-1), the next
+ most recent by (?(-2), and so on. In looping constructs it can also
+ make sense to refer to subsequent groups with constructs such as
(?(+2).
- Consider the following pattern, which contains non-significant white
+ Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
- The first part matches an optional opening parenthesis, and if that
+ The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The sec-
- ond part matches one or more characters that are not parentheses. The
+ ond part matches one or more characters that are not parentheses. The
third part is a conditional subpattern that tests whether the first set
of parentheses matched or not. If they did, that is, if the subject started
with an opening parenthesis, the condition is true, and so the yes-pat-
- tern is executed and a closing parenthesis is required. Otherwise,
- since no-pattern is not present, the subpattern matches nothing. In
- other words, this pattern matches a sequence of non-parentheses,
+ tern is executed and a closing parenthesis is required. Otherwise,
+ since no-pattern is not present, the subpattern matches nothing. In
+ other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
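As a rough illustration, the same conditional can be exercised with Python's re module, which also accepts the (?(1)...) form; re.VERBOSE plays the role that PCRE_EXTENDED plays here. This is a sketch under that assumption, not PCRE itself, and the helper name is hypothetical.

```python
import re

# Optional "(", a run of non-parentheses, and ")" only if group 1 was set.
PAREN_OPT = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

def wholly_matches(subject):
    """True if the entire subject matches the conditional pattern."""
    return PAREN_OPT.fullmatch(subject) is not None

for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, wholly_matches(subject))
```

fullmatch is used so that an unbalanced subject such as "(abc" is rejected outright instead of matching on the inner "abc".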
- If you were embedding this pattern in a larger one, you could use a
+ If you were embedding this pattern in a larger one, you could use a
relative reference:
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
- This makes the fragment independent of the parentheses in the larger
+ This makes the fragment independent of the parentheses in the larger
pattern.
Checking for a used subpattern by name
- Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
- used subpattern by name. For compatibility with earlier versions of
- PCRE, which had this facility before Perl, the syntax (?(name)...) is
- also recognized. However, there is a possible ambiguity with this syn-
- tax, because subpattern names may consist entirely of digits. PCRE
- looks first for a named subpattern; if it cannot find one and the name
- consists entirely of digits, PCRE looks for a subpattern of that num-
- ber, which must be greater than zero. Using subpattern names that con-
+ Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
+ used subpattern by name. For compatibility with earlier versions of
+ PCRE, which had this facility before Perl, the syntax (?(name)...) is
+ also recognized. However, there is a possible ambiguity with this syn-
+ tax, because subpattern names may consist entirely of digits. PCRE
+ looks first for a named subpattern; if it cannot find one and the name
+ consists entirely of digits, PCRE looks for a subpattern of that num-
+ ber, which must be greater than zero. Using subpattern names that con-
sist entirely of digits is not recommended.
Rewriting the above example to use a named subpattern gives this:
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
- If the name used in a condition of this kind is a duplicate, the test
- is applied to all subpatterns of the same name, and is true if any one
+ If the name used in a condition of this kind is a duplicate, the test
+ is applied to all subpatterns of the same name, and is true if any one
of them has matched.
Checking for pattern recursion
If the condition is the string (R), and there is no subpattern with the
- name R, the condition is true if a recursive call to the whole pattern
+ name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by amper-
sand follow the letter R, for example:
@@ -4764,77 +4773,77 @@ CONDITIONAL SUBPATTERNS
the condition is true if the most recent recursion is into a subpattern
whose number or name is given. This condition does not check the entire
- recursion stack. If the name used in a condition of this kind is a
+ recursion stack. If the name used in a condition of this kind is a
duplicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion.
- At "top level", all these recursion test conditions are false. The
+ At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below.
Defining subpatterns for use by reference only
- If the condition is the string (DEFINE), and there is no subpattern
- with the name DEFINE, the condition is always false. In this case,
- there may be only one alternative in the subpattern. It is always
- skipped if control reaches this point in the pattern; the idea of
- DEFINE is that it can be used to define "subroutines" that can be ref-
- erenced from elsewhere. (The use of "subroutines" is described below.)
- For example, a pattern to match an IPv4 address could be written like
+ If the condition is the string (DEFINE), and there is no subpattern
+ with the name DEFINE, the condition is always false. In this case,
+ there may be only one alternative in the subpattern. It is always
+ skipped if control reaches this point in the pattern; the idea of
+ DEFINE is that it can be used to define "subroutines" that can be ref-
+ erenced from elsewhere. (The use of "subroutines" is described below.)
+ For example, a pattern to match an IPv4 address could be written like
this (ignore whitespace and line breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
- The first part of the pattern is a DEFINE group inside which another
- group named "byte" is defined. This matches an individual component of
- an IPv4 address (a number less than 256). When matching takes place,
- this part of the pattern is skipped because DEFINE acts like a false
- condition. The rest of the pattern uses references to the named group
- to match the four dot-separated components of an IPv4 address, insist-
+ The first part of the pattern is a DEFINE group inside which another
+ group named "byte" is defined. This matches an individual component of
+ an IPv4 address (a number less than 256). When matching takes place,
+ this part of the pattern is skipped because DEFINE acts like a false
+ condition. The rest of the pattern uses references to the named group
+ to match the four dot-separated components of an IPv4 address, insist-
ing on a word boundary at each end.
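Python's re module has neither DEFINE groups nor subroutine calls, so the sketch below imitates the idea by plain text substitution: the "byte" fragment is written once and spliced in wherever the PCRE pattern would call (?&byte). The names BYTE, IPV4 and is_ipv4 are hypothetical.

```python
import re

# Stand-in for the "byte" subroutine: a decimal number less than 256.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Splice the fragment in at each point where PCRE would call (?&byte).
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")

def is_ipv4(text):
    """True if the whole of text is a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None

for text in ("192.168.0.1", "10.0.0.255", "256.1.1.1", "1.2.3"):
    print(text, is_ipv4(text))
```

Unlike a real DEFINE group, the fragment is copied into the pattern four times, but it is still defined in only one place.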
Assertion conditions
- If the condition is not in any of the above formats, it must be an
- assertion. This may be a positive or negative lookahead or lookbehind
- assertion. Consider this pattern, again containing non-significant
+ If the condition is not in any of the above formats, it must be an
+ assertion. This may be a positive or negative lookahead or lookbehind
+ assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
- The condition is a positive lookahead assertion that matches an
- optional sequence of non-letters followed by a letter. In other words,
- it tests for the presence of at least one letter in the subject. If a
- letter is found, the subject is matched against the first alternative;
- otherwise it is matched against the second. This pattern matches
- strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
+ The condition is a positive lookahead assertion that matches an
+ optional sequence of non-letters followed by a letter. In other words,
+ it tests for the presence of at least one letter in the subject. If a
+ letter is found, the subject is matched against the first alternative;
+ otherwise it is matched against the second. This pattern matches
+ strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
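Python's re module does not accept an assertion as a condition, so the following sketch approximates the pattern with two alternatives, the first guarded by the positive lookahead and the second by its negation. It is an emulation under that assumption, not the PCRE construct itself, and the names are hypothetical.

```python
import re

# The condition from the pattern above: some letter occurs ahead.
HAS_LETTER = r"[^a-z]*[a-z]"

DATE = re.compile(
    r"(?=" + HAS_LETTER + r")\d{2}-[a-z]{3}-\d{2}"    # condition true
    + r"|"
    + r"(?!" + HAS_LETTER + r")\d{2}-\d{2}-\d{2}"     # condition false
)

for subject in ("12-abc-34", "12-34-56", "12-ab-34"):
    print(subject, DATE.fullmatch(subject) is not None)
```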
COMMENTS
- The sequence (?# marks the start of a comment that continues up to the
- next closing parenthesis. Nested parentheses are not permitted. The
- characters that make up a comment play no part in the pattern matching
+ The sequence (?# marks the start of a comment that continues up to the
+ next closing parenthesis. Nested parentheses are not permitted. The
+ characters that make up a comment play no part in the pattern matching
at all.
- If the PCRE_EXTENDED option is set, an unescaped # character outside a
- character class introduces a comment that continues to immediately
+ If the PCRE_EXTENDED option is set, an unescaped # character outside a
+ character class introduces a comment that continues to immediately
after the next newline in the pattern.
RECURSIVE PATTERNS
- Consider the problem of matching a string in parentheses, allowing for
- unlimited nested parentheses. Without the use of recursion, the best
- that can be done is to use a pattern that matches up to some fixed
- depth of nesting. It is not possible to handle an arbitrary nesting
+ Consider the problem of matching a string in parentheses, allowing for
+ unlimited nested parentheses. Without the use of recursion, the best
+ that can be done is to use a pattern that matches up to some fixed
+ depth of nesting. It is not possible to handle an arbitrary nesting
depth.
For some time, Perl has provided a facility that allows regular expres-
- sions to recurse (amongst other things). It does this by interpolating
- Perl code in the expression at run time, and the code can refer to the
+ sions to recurse (amongst other things). It does this by interpolating
+ Perl code in the expression at run time, and the code can refer to the
expression itself. A Perl pattern using code interpolation to solve the
parentheses problem can be created like this:
@@ -4844,182 +4853,182 @@ RECURSIVE PATTERNS
refers recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code. Instead,
- it supports special syntax for recursion of the entire pattern, and
- also for individual subpattern recursion. After its introduction in
- PCRE and Python, this kind of recursion was subsequently introduced
+ it supports special syntax for recursion of the entire pattern, and
+ also for individual subpattern recursion. After its introduction in
+ PCRE and Python, this kind of recursion was subsequently introduced
into Perl at release 5.10.
- A special item that consists of (? followed by a number greater than
+ A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive call of the subpattern of
- the given number, provided that it occurs inside that subpattern. (If
- not, it is a "subroutine" call, which is described in the next sec-
- tion.) The special item (?R) or (?0) is a recursive call of the entire
+ the given number, provided that it occurs inside that subpattern. (If
+ not, it is a "subroutine" call, which is described in the next sec-
+ tion.) The special item (?R) or (?0) is a recursive call of the entire
regular expression.
- This PCRE pattern solves the nested parentheses problem (assume the
+ This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
\( ( [^()]++ | (?R) )* \)
- First it matches an opening parenthesis. Then it matches any number of
- substrings which can either be a sequence of non-parentheses, or a
- recursive match of the pattern itself (that is, a correctly parenthe-
+ First it matches an opening parenthesis. Then it matches any number of
+ substrings which can either be a sequence of non-parentheses, or a
+ recursive match of the pattern itself (that is, a correctly parenthe-
sized substring). Finally there is a closing parenthesis. Note the use
of a possessive quantifier to avoid backtracking into sequences of non-
parentheses.
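The structure of this pattern can be mirrored by a small hand-written recursive function. The sketch below is plain Python, not a regular expression (Python's re module has no (?R)), and the function name is hypothetical; it is meant only to show how the recursive call corresponds to a nested pair of parentheses.

```python
def match_parens(s, i=0):
    """Return the index just past the balanced group that starts at
    s[i], or None if there is no such group."""
    if i >= len(s) or s[i] != "(":
        return None                  # the leading \( fails
    i += 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_parens(s, i)   # corresponds to the (?R) call
            if i is None:
                return None
        else:
            i += 1                   # one character of [^()]++
    if i < len(s):                   # s[i] is the closing \)
        return i + 1
    return None

for subject in ("(ab(cd)ef)", "()", "(a(b)", "(aaaa()"):
    print(subject, match_parens(subject))
```

Because each character is consumed once and never revisited, the function behaves like the possessive form of the pattern: a subject with a missing closing parenthesis is rejected after a single pass.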
- If this were part of a larger pattern, you would not want to recurse
+ If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:
( \( ( [^()]++ | (?1) )* \) )
- We have put the pattern into parentheses, and caused the recursion to
+ We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.
- In a larger pattern, keeping track of parenthesis numbers can be
- tricky. This is made easier by the use of relative references (a Perl
- 5.10 feature). Instead of (?1) in the pattern above you can write
+ In a larger pattern, keeping track of parenthesis numbers can be
+ tricky. This is made easier by the use of relative references (a Perl
+ 5.10 feature). Instead of (?1) in the pattern above you can write
(?-2) to refer to the second most recently opened parentheses preceding
- the recursion. In other words, a negative number counts capturing
+ the recursion. In other words, a negative number counts capturing
parentheses leftwards from the point at which it is encountered.
- It is also possible to refer to subsequently opened parentheses, by
- writing references such as (?+2). However, these cannot be recursive
- because the reference is not inside the parentheses that are refer-
- enced. They are always "subroutine" calls, as described in the next
+ It is also possible to refer to subsequently opened parentheses, by
+ writing references such as (?+2). However, these cannot be recursive
+ because the reference is not inside the parentheses that are refer-
+ enced. They are always "subroutine" calls, as described in the next
section.
- An alternative approach is to use named parentheses instead. The Perl
- syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
+ An alternative approach is to use named parentheses instead. The Perl
+ syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
- If there is more than one subpattern with the same name, the earliest
+ If there is more than one subpattern with the same name, the earliest
one is used.
- This particular example pattern that we have been looking at contains
+ This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of a possessive quantifier for
matching strings of non-parentheses is important when applying the pat-
- tern to strings that do not match. For example, when this pattern is
+ tern to strings that do not match. For example, when this pattern is
applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
- it yields "no match" quickly. However, if a possessive quantifier is
- not used, the match runs for a very long time indeed because there are
- so many different ways the + and * repeats can carve up the subject,
+ it yields "no match" quickly. However, if a possessive quantifier is
+ not used, the match runs for a very long time indeed because there are
+ so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported.
- At the end of a match, the values of capturing parentheses are those
- from the outermost level. If you want to obtain intermediate values, a
- callout function can be used (see below and the pcrecallout documenta-
+ At the end of a match, the values of capturing parentheses are those
+ from the outermost level. If you want to obtain intermediate values, a
+ callout function can be used (see below and the pcrecallout documenta-
tion). If the pattern above is matched against
(ab(cd)ef)
- the value for the inner capturing parentheses (numbered 2) is "ef",
- which is the last value taken on at the top level. If a capturing sub-
+ the value for the inner capturing parentheses (numbered 2) is "ef",
+ which is the last value taken on at the top level. If a capturing sub-
pattern is not matched at the top level, its final value is unset, even
if it is (temporarily) set at a deeper level.
- If there are more than 15 capturing parentheses in a pattern, PCRE has
- to obtain extra memory to store data during a recursion, which it does
+ If there are more than 15 capturing parentheses in a pattern, PCRE has
+ to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
- Do not confuse the (?R) item with the condition (R), which tests for
- recursion. Consider this pattern, which matches text in angle brack-
- ets, allowing for arbitrary nesting. Only digits are allowed in nested
- brackets (that is, when recursing), whereas any characters are permit-
+ Do not confuse the (?R) item with the condition (R), which tests for
+ recursion. Consider this pattern, which matches text in angle brack-
+ ets, allowing for arbitrary nesting. Only digits are allowed in nested
+ brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
- In this pattern, (?(R) is the start of a conditional subpattern, with
- two different alternatives for the recursive and non-recursive cases.
+ In this pattern, (?(R) is the start of a conditional subpattern, with
+ two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
Recursion difference from Perl
- In PCRE (like Python, but unlike Perl), a recursive subpattern call is
+ In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
- alternatives and there is a subsequent matching failure. This can be
- illustrated by the following pattern, which purports to match a palin-
- dromic string that contains an odd number of characters (for example,
+ alternatives and there is a subsequent matching failure. This can be
+ illustrated by the following pattern, which purports to match a palin-
+ dromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"):
^(.|(.)(?1)\2)$
The idea is that it either matches a single character, or two identical
- characters surrounding a sub-palindrome. In Perl, this pattern works;
- in PCRE it does not if the subject string is longer than three characters.
+ characters surrounding a sub-palindrome. In Perl, this pattern works;
+ in PCRE it does not if the subject string is longer than three characters.
Consider the subject string "abcba":
- At the top level, the first character is matched, but as it is not at
+ At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alterna-
tive is taken and the recursion kicks in. The recursive call to subpat-
- tern 1 successfully matches the next character ("b"). (Note that the
+ tern 1 successfully matches the next character ("b"). (Note that the
beginning and end of line tests are not part of the recursion.)
- Back at the top level, the next character ("c") is compared with what
- subpattern 2 matched, which was "a". This fails. Because the recursion
- is treated as an atomic group, there are now no backtracking points,
- and so the entire match fails. (Perl is able, at this point, to re-
- enter the recursion and try the second alternative.) However, if the
+ Back at the top level, the next character ("c") is compared with what
+ subpattern 2 matched, which was "a". This fails. Because the recursion
+ is treated as an atomic group, there are now no backtracking points,
+ and so the entire match fails. (Perl is able, at this point, to re-
+ enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:
^((.)(?1)\2|.)$
- This time, the recursing alternative is tried first, and continues to
- recurse until it runs out of characters, at which point the recursion
- fails. But this time we do have another alternative to try at the
- higher level. That is the big difference: in the previous case the
+ This time, the recursing alternative is tried first, and continues to
+ recurse until it runs out of characters, at which point the recursion
+ fails. But this time we do have another alternative to try at the
+ higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot
use.
To change the pattern so that it matches all palindromic strings, not just
- those with an odd number of characters, it is tempting to change the
+ those with an odd number of characters, it is tempting to change the
pattern to this:
^((.)(?1)\2|.?)$
- Again, this works in Perl, but not in PCRE, and for the same reason.
- When a deeper recursion has matched a single character, it cannot be
- entered again in order to match an empty string. The solution is to
- separate the two cases, and write out the odd and even cases as alter-
+ Again, this works in Perl, but not in PCRE, and for the same reason.
+ When a deeper recursion has matched a single character, it cannot be
+ entered again in order to match an empty string. The solution is to
+ separate the two cases, and write out the odd and even cases as alter-
natives at the higher level:
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
- If you want to match typical palindromic phrases, the pattern has to
+ If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
- Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
- ing into sequences of non-word characters. Without this, PCRE takes a
- great deal longer (ten times or more) to match typical phrases, and
+ Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
+ ing into sequences of non-word characters. Without this, PCRE takes a
+ great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
- WARNING: The palindrome-matching patterns above work only if the sub-
- ject string does not start with a palindrome that is shorter than the
- entire string. For example, although "abcba" is correctly matched, if
- the subject is "ababa", PCRE finds the palindrome "aba" at the start,
- then fails at top level because the end of the string does not follow.
- Once again, it cannot jump back into the recursion to try other alter-
+ WARNING: The palindrome-matching patterns above work only if the sub-
+ ject string does not start with a palindrome that is shorter than the
+ entire string. For example, although "abcba" is correctly matched, if
+ the subject is "ababa", PCRE finds the palindrome "aba" at the start,
+ then fails at top level because the end of the string does not follow.
+ Once again, it cannot jump back into the recursion to try other alter-
natives, so the entire match fails.
SUBPATTERNS AS SUBROUTINES
If the syntax for a recursive subpattern reference (either by number or
- by name) is used outside the parentheses to which it refers, it oper-
- ates like a subroutine in a programming language. The "called" subpat-
+ by name) is used outside the parentheses to which it refers, it oper-
+ ates like a subroutine in a programming language. The "called" subpat-
tern may be defined before or after the reference. A numbered reference
can be absolute or relative, as in these examples:
@@ -5031,113 +5040,113 @@ SUBPATTERNS AS SUBROUTINES
(sens|respons)e and \1ibility
- matches "sense and sensibility" and "response and responsibility", but
+ matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
- is used, it does match "sense and responsibility" as well as the other
- two strings. Another example is given in the discussion of DEFINE
+ is used, it does match "sense and responsibility" as well as the other
+ two strings. Another example is given in the discussion of DEFINE
above.
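The difference between the back reference and the subroutine call can be sketched in Python. The back reference \1 works as written; Python's re module has no (?1), so the second pattern imitates it by repeating the text of group 1, which is an assumption of this sketch and not how PCRE implements the call.

```python
import re

# \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Imitation of (?1): the pattern of group 1 is applied again,
# independently of what the group matched the first time.
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject,
          backref.fullmatch(subject) is not None,
          subroutine_like.fullmatch(subject) is not None)
```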
- Like recursive subpatterns, a subroutine call is always treated as an
- atomic group. That is, once it has matched some of the subject string,
- it is never re-entered, even if it contains untried alternatives and
- there is a subsequent matching failure. Any capturing parentheses that
- are set during the subroutine call revert to their previous values
+ Like recursive subpatterns, a subroutine call is always treated as an
+ atomic group. That is, once it has matched some of the subject string,
+ it is never re-entered, even if it contains untried alternatives and
+ there is a subsequent matching failure. Any capturing parentheses that
+ are set during the subroutine call revert to their previous values
afterwards.
- When a subpattern is used as a subroutine, processing options such as
+ When a subpattern is used as a subroutine, processing options such as
case-independence are fixed when the subpattern is defined. They cannot
be changed for different calls. For example, consider this pattern:
(abc)(?i:(?-1))
- It matches "abcabc". It does not match "abcABC" because the change of
+ It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
ONIGURUMA SUBROUTINE SYNTAX
- For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
- an alternative syntax for referencing a subpattern as a subroutine,
- possibly recursively. Here are two of the examples used above, rewrit-
+ an alternative syntax for referencing a subpattern as a subroutine,
+ possibly recursively. Here are two of the examples used above, rewrit-
ten using this syntax:
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility
- PCRE supports an extension to Oniguruma: if a number is preceded by a
+ PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\g<-1>)
- Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
- synonymous. The former is a back reference; the latter is a subroutine
+ Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
+ synonymous. The former is a back reference; the latter is a subroutine
call.
CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
- Perl code to be obeyed in the middle of matching a regular expression.
+ Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different sub-
strings that match the same pair of parentheses when there is a repeti-
tion.
PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
- an external function by putting its entry point in the global variable
- pcre_callout. By default, this variable contains NULL, which disables
+ an external function by putting its entry point in the global variable
+ pcre_callout. By default, this variable contains NULL, which disables
all calling out.
- Within a regular expression, (?C) indicates the points at which the
- external function is to be called. If you want to identify different
- callout points, you can put a number less than 256 after the letter C.
- The default value is zero. For example, this pattern has two callout
+ Within a regular expression, (?C) indicates the points at which the
+ external function is to be called. If you want to identify different
+ callout points, you can put a number less than 256 after the letter C.
+ The default value is zero. For example, this pattern has two callout
points:
(?C1)abc(?C2)def
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
- automatically installed before each item in the pattern. They are all
+ automatically installed before each item in the pattern. They are all
numbered 255.
During matching, when PCRE reaches a callout point (and pcre_callout is
- set), the external function is called. It is provided with the number
- of the callout, the position in the pattern, and, optionally, one item
- of data originally supplied by the caller of pcre_exec(). The callout
- function may cause matching to proceed, to backtrack, or to fail alto-
+ set), the external function is called. It is provided with the number
+ of the callout, the position in the pattern, and, optionally, one item
+ of data originally supplied by the caller of pcre_exec(). The callout
+ function may cause matching to proceed, to backtrack, or to fail alto-
gether. A complete description of the interface to the callout function
is given in the pcrecallout documentation.
BACKTRACKING CONTROL
- Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
+ Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are described in the Perl documentation as "experimental and sub-
- ject to change or removal in a future version of Perl". It goes on to
- say: "Their usage in production code should be noted to avoid problems
+ ject to change or removal in a future version of Perl". It goes on to
+ say: "Their usage in production code should be noted to avoid problems
during upgrades." The same remarks apply to the PCRE features described
in this section.
- Since these verbs are specifically related to backtracking, most of
- them can be used only when the pattern is to be matched using
+ Since these verbs are specifically related to backtracking, most of
+ them can be used only when the pattern is to be matched using
pcre_exec(), which uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, they cause an
error if encountered by pcre_dfa_exec().
If any of these verbs are used in an assertion or subroutine subpattern
- (including recursive subpatterns), their effect is confined to that
- subpattern; it does not extend to the surrounding pattern. Note that
- such subpatterns are processed as anchored at the point where they are
+ (including recursive subpatterns), their effect is confined to that
+ subpattern; it does not extend to the surrounding pattern. Note that
+ such subpatterns are processed as anchored at the point where they are
tested.
- The new verbs make use of what was previously invalid syntax: an open-
+ The new verbs make use of what was previously invalid syntax: an open-
ing parenthesis followed by an asterisk. In Perl, they are generally of
the form (*VERB:ARG) but PCRE does not support the use of arguments, so
- its general form is just (*VERB). Any number of these verbs may occur
+ its general form is just (*VERB). Any number of these verbs may occur
in a pattern. There are two kinds:
Verbs that act immediately
@@ -5146,94 +5155,94 @@ BACKTRACKING CONTROL
(*ACCEPT)
- This verb causes the match to end successfully, skipping the remainder
- of the pattern. When inside a recursion, only the innermost pattern is
- ended immediately. If (*ACCEPT) is inside capturing parentheses, the
- data so far is captured. (This feature was added to PCRE at release
+ This verb causes the match to end successfully, skipping the remainder
+ of the pattern. When inside a recursion, only the innermost pattern is
+ ended immediately. If (*ACCEPT) is inside capturing parentheses, the
+ data so far is captured. (This feature was added to PCRE at release
8.00.) For example:
A((?:A|B(*ACCEPT)|C)D)
- This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
+ This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
tured by the outer parentheses.
(*FAIL) or (*F)
- This verb causes the match to fail, forcing backtracking to occur. It
- is equivalent to (?!) but easier to read. The Perl documentation notes
- that it is probably useful only when combined with (?{}) or (??{}).
- Those are, of course, Perl features that are not present in PCRE. The
- nearest equivalent is the callout feature, as for example in this pat-
+ This verb causes the match to fail, forcing backtracking to occur. It
+ is equivalent to (?!) but easier to read. The Perl documentation notes
+ that it is probably useful only when combined with (?{}) or (??{}).
+ Those are, of course, Perl features that are not present in PCRE. The
+ nearest equivalent is the callout feature, as for example in this pat-
tern:
a+(?C)(*FAIL)
- A match with the string "aaaa" always fails, but the callout is taken
+ A match with the string "aaaa" always fails, but the callout is taken
before each backtrack happens (in this example, 10 times).
Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching con-
- tinues with what follows, but if there is no subsequent match, a fail-
- ure is forced. The verbs differ in exactly what kind of failure
+ tinues with what follows, but if there is no subsequent match, a fail-
+ ure is forced. The verbs differ in exactly what kind of failure
occurs.
(*COMMIT)
- This verb causes the whole match to fail outright if the rest of the
- pattern does not match. Even if the pattern is unanchored, no further
- attempts to find a match by advancing the starting point take place.
- Once (*COMMIT) has been passed, pcre_exec() is committed to finding a
+ This verb causes the whole match to fail outright if the rest of the
+ pattern does not match. Even if the pattern is unanchored, no further
+ attempts to find a match by advancing the starting point take place.
+ Once (*COMMIT) has been passed, pcre_exec() is committed to finding a
match at the current starting point, or not at all. For example:
a+(*COMMIT)b
- This matches "xxaab" but not "aacaab". It can be thought of as a kind
+ This matches "xxaab" but not "aacaab". It can be thought of as a kind
of dynamic anchor, or "I've started, so I must finish."
(*PRUNE)
- This verb causes the match to fail at the current position if the rest
+ This verb causes the match to fail at the current position if the rest
of the pattern does not match. If the pattern is unanchored, the normal
- "bumpalong" advance to the next starting character then happens. Back-
- tracking can occur as usual to the left of (*PRUNE), or when matching
- to the right of (*PRUNE), but if there is no match to the right, back-
- tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE)
+ "bumpalong" advance to the next starting character then happens. Back-
+ tracking can occur as usual to the left of (*PRUNE), or when matching
+ to the right of (*PRUNE), but if there is no match to the right, back-
+ tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE)
is just an alternative to an atomic group or possessive quantifier, but
- there are some uses of (*PRUNE) that cannot be expressed in any other
+ there are some uses of (*PRUNE) that cannot be expressed in any other
way.
(*SKIP)
- This verb is like (*PRUNE), except that if the pattern is unanchored,
- the "bumpalong" advance is not to the next character, but to the posi-
- tion in the subject where (*SKIP) was encountered. (*SKIP) signifies
- that whatever text was matched leading up to it cannot be part of a
+ This verb is like (*PRUNE), except that if the pattern is unanchored,
+ the "bumpalong" advance is not to the next character, but to the posi-
+ tion in the subject where (*SKIP) was encountered. (*SKIP) signifies
+ that whatever text was matched leading up to it cannot be part of a
successful match. Consider:
a+(*SKIP)b
- If the subject is "aaaac...", after the first match attempt fails
- (starting at the first character in the string), the starting point
+ If the subject is "aaaac...", after the first match attempt fails
+ (starting at the first character in the string), the starting point
skips on to start the next attempt at "c". Note that a possessive quan-
- tifier does not have the same effect as this example; although it would
- suppress backtracking during the first match attempt, the second
- attempt would start at the second character instead of skipping on to
+ tifier does not have the same effect as this example; although it would
+ suppress backtracking during the first match attempt, the second
+ attempt would start at the second character instead of skipping on to
"c".
(*THEN)
This verb causes a skip to the next alternation if the rest of the pat-
tern does not match. That is, it cancels pending backtracking, but only
- within the current alternation. Its name comes from the observation
+ within the current alternation. Its name comes from the observation
that it can be used for a pattern-based if-then-else block:
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
- If the COND1 pattern matches, FOO is tried (and possibly further items
- after the end of the group if FOO succeeds); on failure the matcher
- skips to the second alternative and tries COND2, without backtracking
- into COND1. If (*THEN) is used outside of any alternation, it acts
+ If the COND1 pattern matches, FOO is tried (and possibly further items
+ after the end of the group if FOO succeeds); on failure the matcher
+ skips to the second alternative and tries COND2, without backtracking
+ into COND1. If (*THEN) is used outside of any alternation, it acts
exactly like (*PRUNE).
@@ -5251,7 +5260,7 @@ AUTHOR
REVISION
- Last updated: 11 January 2010
+ Last updated: 06 March 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -5363,16 +5372,19 @@ GENERAL CATEGORY PROPERTY CODES FOR \p and \P
SCRIPT NAMES FOR \p AND \P
- Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
- Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, Coptic, Cu-
- neiform, Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian,
- Glagolitic, Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo,
- Hebrew, Hiragana, Inherited, Kannada, Katakana, Kayah_Li, Kharoshthi,
- Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lycian, Lydian, Malayalam,
- Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, Old_Persian,
- Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Saurash-
- tra, Shavian, Sinhala, Sudanese, Syloti_Nagri, Syriac, Tagalog, Tag-
- banwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
+ Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
+ Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
+ Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp-
+ tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
+ Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe-
+ rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
+ Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
+ Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam,
+ Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
+ Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,
+ Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,
+ Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
+ Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Vai, Yi.
@@ -5552,7 +5564,7 @@ BACKTRACKING CONTROL
(*ACCEPT) force successful match
(*FAIL) force backtrack; synonym (*F)
- The following act only when a subsequent match failure causes a back-
+ The following act only when a subsequent match failure causes a back-
track to reach them. They all force a match failure, but they differ in
what happens afterwards. Those that advance the start-of-match point do
so only if the pattern is not anchored.
@@ -5565,7 +5577,7 @@ BACKTRACKING CONTROL
NEWLINE CONVENTIONS
- These are recognized only at the very start of the pattern or after a
+ These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) option.
(*CR) carriage return only
@@ -5577,7 +5589,7 @@ NEWLINE CONVENTIONS
WHAT \R MATCHES
- These are recognized only at the very start of the pattern or after a
+ These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 mode.
(*BSR_ANYCRLF) CR, LF, or CRLF
@@ -5604,8 +5616,8 @@ AUTHOR
REVISION
- Last updated: 11 April 2009
- Copyright (c) 1997-2009 University of Cambridge.
+ Last updated: 01 March 2010
+ Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -6129,14 +6141,14 @@ PCRE PERFORMANCE
can affect both of them.
-MEMORY USAGE
+COMPILED PATTERN MEMORY USAGE
Patterns are compiled by PCRE into a reasonably efficient byte code, so
that most simple patterns do not use much memory. However, there is one
- case where memory usage can be unexpectedly large. When a parenthesized
- subpattern has a quantifier with a minimum greater than 1 and/or a lim-
- ited maximum, the whole subpattern is repeated in the compiled code.
- For example, the pattern
+ case where the memory usage of a compiled pattern can be unexpectedly
+ large. If a parenthesized subpattern has a quantifier with a minimum
+ greater than 1 and/or a limited maximum, the whole subpattern is
+ repeated in the compiled code. For example, the pattern
(abc|def){2,4}
@@ -6178,73 +6190,83 @@ MEMORY USAGE
otherwise handle.
+STACK USAGE AT RUN TIME
+
+ When pcre_exec() is used for matching, certain kinds of pattern can
+ cause it to use large amounts of the process stack. In some environ-
+ ments the default process stack is quite small, and if it runs out the
+ result is often SIGSEGV. This issue is probably the most frequently
+ raised problem with PCRE. Rewriting your pattern can often help. The
+ pcrestack documentation discusses this issue in detail.
+
+
PROCESSING TIME
- Certain items in regular expression patterns are processed more effi-
+ Certain items in regular expression patterns are processed more effi-
ciently than others. It is more efficient to use a character class like
- [aeiou] than a set of single-character alternatives such as
- (a|e|i|o|u). In general, the simplest construction that provides the
+ [aeiou] than a set of single-character alternatives such as
+ (a|e|i|o|u). In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey Friedl's book
- contains a lot of useful general discussion about optimizing regular
- expressions for efficient performance. This document contains a few
+ contains a lot of useful general discussion about optimizing regular
+ expressions for efficient performance. This document contains a few
observations about PCRE.
- Using Unicode character properties (the \p, \P, and \X escapes) is
- slow, because PCRE has to scan a structure that contains data for over
- fifteen thousand characters whenever it needs a character's property.
- If you can find an alternative pattern that does not use character
+ Using Unicode character properties (the \p, \P, and \X escapes) is
+ slow, because PCRE has to scan a structure that contains data for over
+ fifteen thousand characters whenever it needs a character's property.
+ If you can find an alternative pattern that does not use character
properties, it will probably be faster.
- When a pattern begins with .* not in parentheses, or in parentheses
+ When a pattern begins with .* not in parentheses, or in parentheses
that are not the subject of a backreference, and the PCRE_DOTALL option
- is set, the pattern is implicitly anchored by PCRE, since it can match
- only at the start of a subject string. However, if PCRE_DOTALL is not
- set, PCRE cannot make this optimization, because the . metacharacter
- does not then match a newline, and if the subject string contains new-
- lines, the pattern may match from the character immediately following
+ is set, the pattern is implicitly anchored by PCRE, since it can match
+ only at the start of a subject string. However, if PCRE_DOTALL is not
+ set, PCRE cannot make this optimization, because the . metacharacter
+ does not then match a newline, and if the subject string contains new-
+ lines, the pattern may match from the character immediately following
one of them instead of from the very start. For example, the pattern
.*second
- matches the subject "first\nand second" (where \n stands for a newline
- character), with the match starting at the seventh character. In order
+ matches the subject "first\nand second" (where \n stands for a newline
+ character), with the match starting at the seventh character. In order
to do this, PCRE has to retry the match starting after every newline in
the subject.
- If you are using such a pattern with subject strings that do not con-
+ If you are using such a pattern with subject strings that do not con-
tain newlines, the best performance is obtained by setting PCRE_DOTALL,
- or starting the pattern with ^.* or ^.*? to indicate explicit anchor-
- ing. That saves PCRE from having to scan along the subject looking for
+ or starting the pattern with ^.* or ^.*? to indicate explicit anchor-
+ ing. That saves PCRE from having to scan along the subject looking for
a newline to restart at.
- Beware of patterns that contain nested indefinite repeats. These can
- take a long time to run when applied to a string that does not match.
+ Beware of patterns that contain nested indefinite repeats. These can
+ take a long time to run when applied to a string that does not match.
Consider the pattern fragment
^(a+)*
- This can match "aaaa" in 16 different ways, and this number increases
- very rapidly as the string gets longer. (The * repeat can match 0, 1,
- 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
- repeats can match different numbers of times.) When the remainder of
+ This can match "aaaa" in 16 different ways, and this number increases
+ very rapidly as the string gets longer. (The * repeat can match 0, 1,
+ 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
+ repeats can match different numbers of times.) When the remainder of
the pattern is such that the entire match is going to fail, PCRE has in
- principle to try every possible variation, and this can take an
+ principle to try every possible variation, and this can take an
extremely long time, even for relatively short strings.
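The figure of 16 can be reproduced with a short calculation. The sketch below is not PCRE code; it assumes that each distinct way of splitting each prefix of the subject (including the empty prefix) into non-empty a+ iterations counts as one way of matching. The names count_splits and count_match_ways are illustrative only.

```c
/* Number of ways to split a run of k characters into consecutive
   non-empty pieces, where each piece is one iteration of a+. */
static unsigned long count_splits(unsigned k)
{
    unsigned long total = 0;
    unsigned first;
    if (k == 0) return 1;        /* empty run: one way (no more iterations) */
    for (first = 1; first <= k; first++)
        total += count_splits(k - first);   /* first piece takes 'first' chars */
    return total;
}

/* Total number of ways ^(a+)* can match some prefix of n 'a' characters. */
unsigned long count_match_ways(unsigned n)
{
    unsigned long total = 0;
    unsigned k;
    for (k = 0; k <= n; k++)     /* k = length of the matched prefix */
        total += count_splits(k);
    return total;
}
```

Under that assumption count_match_ways(4) gives 1 + 1 + 2 + 4 + 8 = 16, and the total doubles with every extra character, which is why a failing match becomes slow so quickly.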
An optimization catches some of the more simple cases such as
(a+)*b
- where a literal character follows. Before embarking on the standard
- matching procedure, PCRE checks that there is a "b" later in the sub-
- ject string, and if there is not, it fails the match immediately. How-
- ever, when there is no following literal this optimization cannot be
+ where a literal character follows. Before embarking on the standard
+ matching procedure, PCRE checks that there is a "b" later in the sub-
+ ject string, and if there is not, it fails the match immediately. How-
+ ever, when there is no following literal this optimization cannot be
used. You can see the difference by comparing the behaviour of
(a+)*\d
- with the pattern above. The former gives a failure almost instantly
- when applied to a whole line of "a" characters, whereas the latter
+ with the pattern above. The former gives a failure almost instantly
+ when applied to a whole line of "a" characters, whereas the latter
takes an appreciable time with strings longer than about 20 characters.
In many cases, the solution to this kind of performance issue is to use
@@ -6260,8 +6282,8 @@ AUTHOR
REVISION
- Last updated: 06 March 2007
- Copyright (c) 1997-2007 University of Cambridge.
+ Last updated: 07 March 2010
+ Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/doc/pcrepattern.3 b/doc/pcrepattern.3
index 8849872..27afc4f 100644
--- a/doc/pcrepattern.3
+++ b/doc/pcrepattern.3
@@ -738,8 +738,8 @@ For example, when the pattern
.sp
matches "foobar", the first substring is still set to "foo".
.P
-Perl documents that the use of \eK within assertions is "not well defined". In
-PCRE, \eK is acted upon when it occurs inside positive assertions, but is
+Perl documents that the use of \eK within assertions is "not well defined". In
+PCRE, \eK is acted upon when it occurs inside positive assertions, but is
ignored in negative assertions.
.
.
diff --git a/maint/README b/maint/README
index aab8009..82062ea 100644
--- a/maint/README
+++ b/maint/README
@@ -1,5 +1,5 @@
MAINTENANCE README FOR PCRE
----------------------------
+===========================
The files in the "maint" directory of the PCRE source contain data, scripts,
and programs that are used for the maintenance of PCRE, but which do not form
@@ -14,14 +14,14 @@ also contains some notes for maintainers. Its contents are:
Files in the maint directory
-----------------------------
+============================
------------------ This file is now OBSOLETE and no longer used ----------------
+---------------- This file is now OBSOLETE and no longer used ----------------
Builducptable A Perl script that creates the contents of the ucptable.h file
from two Unicode data files, which themselves are downloaded
from the Unicode web site. Run this script in the "maint"
directory.
------------------ This file is now OBSOLETE and no longer used ----------------
+---------------- This file is now OBSOLETE and no longer used ----------------
GenerateUtt.py A Python script to generate part of the pcre_tables.c file
that contains Unicode script names in a long string with
@@ -61,7 +61,7 @@ utf8.c A short, freestanding C program for converting a Unicode code
Updating to a new Unicode release
----------------------------------
+=================================
When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. If the new version of Unicode adds new character
@@ -88,7 +88,7 @@ of Unicode script names.
Preparing for a PCRE release
-----------------------------
+============================
This section contains a checklist of things that I consult before building a
distribution for a new release.
@@ -135,7 +135,9 @@ distribution for a new release.
Many of these won't need changing, but over the long term things do change.
. Man pages: Check all man pages for \ not followed by e or f or " because
- that indicates a markup error.
+ that indicates a markup error. However, there is one exception: pcredemo.3,
+ which is created from the pcredemo.c program. It contains three instances
+ of \\n.
. When the release is built, test it on a number of different operating
systems if possible, and using different compilers as well. For example,
@@ -145,7 +147,7 @@ distribution for a new release.
Making a PCRE release
----------------------
+=====================
Run PrepareRelease and commit the files that it changes (by removing trailing
spaces). Then run "make distcheck" to create the tarballs and the zipball.
@@ -155,11 +157,12 @@ Double-check with "svn status", then create an SVN tagged copy:
svn://vcs.exim.org/pcre/code/tags/pcre-8.xx
Don't forget to update Freshmeat when the new release is out, and to tell
-webmaster@pcre.org and the mailing list.
+webmaster@pcre.org and the mailing list. Also, update the list of version
+numbers in Bugzilla (edit products).
Future ideas (wish list)
-------------------------
+========================
This section records a list of ideas so that they do not get forgotten. They
vary enormously in their usefulness and potential for implementation. Some are
@@ -280,7 +283,7 @@ others are relatively new.
. Callouts with arguments: (?Cn:ARG) for instance.
. A user is going to supply a patch to generalize the API for user-specific
- memory allocation so that it is more flexible in threaded environments. Thiw
+ memory allocation so that it is more flexible in threaded environments. This
was promised a long time ago, and never appeared...
. Write a function that generates random matching strings for a compiled regex.
@@ -309,8 +312,13 @@ others are relatively new.
. PCRE cannot at present distinguish between subpatterns with different names,
but the same number (created by the use of ?|). In order to do so, a way of
remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
+
+. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
+ "something" and then the #ifdef appears only in one place, in "something".
+
+. Support for (*MARK) and arguments for (*PRUNE) and friends.
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
-Last updated: 26 September 2009
+Last updated: 10 March 2010
diff --git a/pcre_compile.c b/pcre_compile.c
index 090a613..6ea9c74 100644
--- a/pcre_compile.c
+++ b/pcre_compile.c
@@ -92,7 +92,7 @@ is 4 there is plenty of room. */
#define COMPILE_WORK_SIZE (4096)
-/* The overrun tests check for a slightly smaller size so that they detect the
+/* The overrun tests check for a slightly smaller size so that they detect the
overrun before it actually does run off the end of the data block. */
#define WORK_SIZE_CHECK (COMPILE_WORK_SIZE - 100)
@@ -268,10 +268,10 @@ the number of relocations needed when a shared library is loaded dynamically,
it is now one long string. We cannot use a table of offsets, because the
lengths of inserts such as XSTRING(MAX_NAME_SIZE) are not known. Instead, we
simply count through to the one we want - this isn't a performance issue
-because these strings are used only when there is a compilation error.
+because these strings are used only when there is a compilation error.
-Each substring ends with \0 to insert a null character. This includes the final
-substring, so that the whole string ends with \0\0, which can be detected when
+Each substring ends with \0 to insert a null character. This includes the final
+substring, so that the whole string ends with \0\0, which can be detected when
counting through. */
static const char error_texts[] =
@@ -511,11 +511,11 @@ static const char *
find_error_text(int n)
{
const char *s = error_texts;
-for (; n > 0; n--)
+for (; n > 0; n--)
{
while (*s++ != 0) {};
if (*s == 0) return "Error text not found (please report)";
- }
+ }
return s;
}
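The counting-through technique used by find_error_text() can be tried in isolation. The following is a minimal sketch with a made-up three-entry table; demo_texts, demo_find_text, and the message strings are hypothetical and are not taken from pcre_compile.c.

```c
/* Hypothetical message table. Each entry ends with \0, and the implicit
   terminator added by the compiler makes the whole string end with \0\0. */
static const char demo_texts[] =
    "no error\0"
    "missing )\0"
    "nothing to repeat\0";

/* Return the n-th message (0-based), or a fallback when n is past the end. */
const char *demo_find_text(int n)
{
    const char *s = demo_texts;
    for (; n > 0; n--) {
        while (*s++ != 0) {}                 /* step past the current entry */
        if (*s == 0) return "(unknown)";     /* reached the final \0\0 */
    }
    return s;
}
```

With this table, demo_find_text(1) points at "missing )" and any index of 3 or more yields the fallback. The lookup is linear in n, which matters little when, as the comment in the hunk above says, the strings are needed only after a compilation error.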
@@ -1807,7 +1807,7 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
const uschar *ccode;
c = *code;
-
+
/* Skip over forward assertions; the other assertions are skipped by
first_significant_code() with a TRUE final argument. */
@@ -1827,13 +1827,13 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
c = *code;
continue;
}
-
+
/* For a recursion/subroutine call, if its end has been reached, which
implies a subroutine call, we can scan it. */
-
+
if (c == OP_RECURSE)
{
- BOOL empty_branch = FALSE;
+ BOOL empty_branch = FALSE;
const uschar *scode = cd->start_code + GET(code, 1);
if (GET(scode, 1) == 0) return TRUE; /* Unclosed */
do
@@ -1841,14 +1841,14 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
if (could_be_empty_branch(scode, endcode, utf8, cd))
{
empty_branch = TRUE;
- break;
- }
+ break;
+ }
scode += GET(scode, 1);
}
while (*scode == OP_ALT);
if (!empty_branch) return FALSE; /* All branches are non-empty */
continue;
- }
+ }
/* For other groups, scan the branches. */
@@ -2004,9 +2004,9 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
#endif
/* None of the remaining opcodes are required to match a character. */
-
+
default:
- break;
+ break;
}
}
@@ -2029,7 +2029,7 @@ Arguments:
endcode points to where to stop (current RECURSE item)
bcptr points to the chain of current (unclosed) branch starts
utf8 TRUE if in UTF-8 mode
- cd pointers to tables etc
+ cd pointers to tables etc
Returns: TRUE if what is matched could be empty
*/
@@ -4475,7 +4475,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
/* Because we are moving code along, we must ensure that any
pending recursive references are updated. */
-
+
default:
*code = OP_END;
adjust_recurse(tempcode, 1 + LINK_SIZE, utf8, cd, save_hwm);
@@ -5197,11 +5197,11 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
*errorcodeptr = ERR15;
goto FAILED;
}
-
+
/* Fudge the value of "called" so that when it is inserted as an
- offset below, what is actually inserted is the reference number
+ offset below, what is actually inserted is the reference number
of the group. */
-
+
called = cd->start_code + recno;
PUTINC(cd->hwm, 0, code + 2 + LINK_SIZE - cd->start_code);
}
diff --git a/pcre_dfa_exec.c b/pcre_dfa_exec.c
index 83538d0..d953f99 100644
--- a/pcre_dfa_exec.c
+++ b/pcre_dfa_exec.c
@@ -714,12 +714,12 @@ for (;;)
opcode, are not the correct length. It seems to be the only way to do
such a check at compile time, as the sizeof() operator does not work
in the C preprocessor. */
-
+
case OP_TABLE_LENGTH:
- case OP_TABLE_LENGTH +
+ case OP_TABLE_LENGTH +
((sizeof(coptable) == OP_TABLE_LENGTH) &&
(sizeof(poptable) == OP_TABLE_LENGTH)):
- break;
+ break;
/* ========================================================================== */
/* Reached a closing bracket. If not at the end of the pattern, carry
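The case-label check in the pcre_dfa_exec.c hunk above relies on the rule that two case labels in one switch may not have the same value. A minimal sketch of the idea follows, assuming a made-up opcode enum and table; all of the DEMO_ and demo_ names are hypothetical.

```c
/* Hypothetical opcodes; the last enumerator is the expected table size. */
enum { DEMO_OP_A, DEMO_OP_B, DEMO_OP_C, DEMO_TABLE_LENGTH };

/* One byte per opcode, so sizeof() equals the number of entries. */
static const unsigned char demo_lengths[] = { 1, 3, 2 };

int demo_length_of(int op)
{
    switch (op) {
    /* If the table has the right size the comparison is 1 and the two
       labels differ. If it does not, the comparison is 0, both labels
       have the same value, and the compiler rejects the duplicate. */
    case DEMO_TABLE_LENGTH:
    case DEMO_TABLE_LENGTH + (sizeof(demo_lengths) == DEMO_TABLE_LENGTH):
        return -1;
    default:
        return (op >= 0 && op < DEMO_TABLE_LENGTH) ? demo_lengths[op] : -1;
    }
}
```

Removing one element from demo_lengths should make this fail to compile with a duplicate case value error. In C11 and later the same check can be written with _Static_assert, but that was not available to code that had to build with older compilers.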
diff --git a/pcre_globals.c b/pcre_globals.c
index 10d0b2b..4562e0a 100644
--- a/pcre_globals.c
+++ b/pcre_globals.c
@@ -43,10 +43,10 @@ PCRE is thread-clean and doesn't use any global variables in the normal sense.
However, it calls memory allocation and freeing functions via the four
indirections below, and it can optionally do callouts, using the fifth
indirection. These values can be changed by the caller, but are shared between
-all threads.
+all threads.
-For MS Visual Studio and Symbian OS, there are problems in initializing these
-variables to non-local functions. In these cases, therefore, an indirection via
+For MS Visual Studio and Symbian OS, there are problems in initializing these
+variables to non-local functions. In these cases, therefore, an indirection via
a local function is used.
Also, when compiling for Virtual Pascal, things are done differently, and
diff --git a/pcre_internal.h b/pcre_internal.h
index 67a3475..4554657 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -1503,7 +1503,7 @@ condition. */
#define RREF_ANY 0xffff
/* Compile time error code numbers. They are given names so that they can more
-easily be tracked. When a new number is added, the table called eint in
+easily be tracked. When a new number is added, the table called eint in
pcreposix.c must be updated. */
enum { ERR0, ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9,
diff --git a/pcre_tables.c b/pcre_tables.c
index 2fd7031..b7f7ba5 100644
--- a/pcre_tables.c
+++ b/pcre_tables.c
@@ -249,7 +249,7 @@ strings to make sure that UTF-8 support works on EBCDIC platforms. */
#define STRING_Zp0 STR_Z STR_p "\0"
#define STRING_Zs0 STR_Z STR_s "\0"
-const char _pcre_utt_names[] =
+const char _pcre_utt_names[] =
STRING_Any0
STRING_Arabic0
STRING_Armenian0
@@ -382,138 +382,138 @@ const char _pcre_utt_names[] =
STRING_Zp0
STRING_Zs0;
-const ucp_type_table _pcre_utt[] = {
- { 0, PT_ANY, 0 },
- { 4, PT_SC, ucp_Arabic },
- { 11, PT_SC, ucp_Armenian },
- { 20, PT_SC, ucp_Avestan },
- { 28, PT_SC, ucp_Balinese },
- { 37, PT_SC, ucp_Bamum },
- { 43, PT_SC, ucp_Bengali },
- { 51, PT_SC, ucp_Bopomofo },
- { 60, PT_SC, ucp_Braille },
- { 68, PT_SC, ucp_Buginese },
- { 77, PT_SC, ucp_Buhid },
- { 83, PT_GC, ucp_C },
- { 85, PT_SC, ucp_Canadian_Aboriginal },
- { 105, PT_SC, ucp_Carian },
- { 112, PT_PC, ucp_Cc },
- { 115, PT_PC, ucp_Cf },
- { 118, PT_SC, ucp_Cham },
- { 123, PT_SC, ucp_Cherokee },
- { 132, PT_PC, ucp_Cn },
- { 135, PT_PC, ucp_Co },
- { 138, PT_SC, ucp_Common },
- { 145, PT_SC, ucp_Coptic },
- { 152, PT_PC, ucp_Cs },
- { 155, PT_SC, ucp_Cuneiform },
- { 165, PT_SC, ucp_Cypriot },
- { 173, PT_SC, ucp_Cyrillic },
- { 182, PT_SC, ucp_Deseret },
- { 190, PT_SC, ucp_Devanagari },
- { 201, PT_SC, ucp_Egyptian_Hieroglyphs },
- { 222, PT_SC, ucp_Ethiopic },
- { 231, PT_SC, ucp_Georgian },
- { 240, PT_SC, ucp_Glagolitic },
- { 251, PT_SC, ucp_Gothic },
- { 258, PT_SC, ucp_Greek },
- { 264, PT_SC, ucp_Gujarati },
- { 273, PT_SC, ucp_Gurmukhi },
- { 282, PT_SC, ucp_Han },
- { 286, PT_SC, ucp_Hangul },
- { 293, PT_SC, ucp_Hanunoo },
- { 301, PT_SC, ucp_Hebrew },
- { 308, PT_SC, ucp_Hiragana },
- { 317, PT_SC, ucp_Imperial_Aramaic },
- { 334, PT_SC, ucp_Inherited },
- { 344, PT_SC, ucp_Inscriptional_Pahlavi },
- { 366, PT_SC, ucp_Inscriptional_Parthian },
- { 389, PT_SC, ucp_Javanese },
- { 398, PT_SC, ucp_Kaithi },
- { 405, PT_SC, ucp_Kannada },
- { 413, PT_SC, ucp_Katakana },
- { 422, PT_SC, ucp_Kayah_Li },
- { 431, PT_SC, ucp_Kharoshthi },
- { 442, PT_SC, ucp_Khmer },
- { 448, PT_GC, ucp_L },
- { 450, PT_LAMP, 0 },
- { 453, PT_SC, ucp_Lao },
- { 457, PT_SC, ucp_Latin },
- { 463, PT_SC, ucp_Lepcha },
- { 470, PT_SC, ucp_Limbu },
- { 476, PT_SC, ucp_Linear_B },
- { 485, PT_SC, ucp_Lisu },
- { 490, PT_PC, ucp_Ll },
- { 493, PT_PC, ucp_Lm },
- { 496, PT_PC, ucp_Lo },
- { 499, PT_PC, ucp_Lt },
- { 502, PT_PC, ucp_Lu },
- { 505, PT_SC, ucp_Lycian },
- { 512, PT_SC, ucp_Lydian },
- { 519, PT_GC, ucp_M },
- { 521, PT_SC, ucp_Malayalam },
- { 531, PT_PC, ucp_Mc },
- { 534, PT_PC, ucp_Me },
- { 537, PT_SC, ucp_Meetei_Mayek },
- { 550, PT_PC, ucp_Mn },
- { 553, PT_SC, ucp_Mongolian },
- { 563, PT_SC, ucp_Myanmar },
- { 571, PT_GC, ucp_N },
- { 573, PT_PC, ucp_Nd },
- { 576, PT_SC, ucp_New_Tai_Lue },
- { 588, PT_SC, ucp_Nko },
- { 592, PT_PC, ucp_Nl },
- { 595, PT_PC, ucp_No },
- { 598, PT_SC, ucp_Ogham },
- { 604, PT_SC, ucp_Ol_Chiki },
- { 613, PT_SC, ucp_Old_Italic },
- { 624, PT_SC, ucp_Old_Persian },
- { 636, PT_SC, ucp_Old_South_Arabian },
- { 654, PT_SC, ucp_Old_Turkic },
- { 665, PT_SC, ucp_Oriya },
- { 671, PT_SC, ucp_Osmanya },
- { 679, PT_GC, ucp_P },
- { 681, PT_PC, ucp_Pc },
- { 684, PT_PC, ucp_Pd },
- { 687, PT_PC, ucp_Pe },
- { 690, PT_PC, ucp_Pf },
- { 693, PT_SC, ucp_Phags_Pa },
- { 702, PT_SC, ucp_Phoenician },
- { 713, PT_PC, ucp_Pi },
- { 716, PT_PC, ucp_Po },
- { 719, PT_PC, ucp_Ps },
- { 722, PT_SC, ucp_Rejang },
- { 729, PT_SC, ucp_Runic },
- { 735, PT_GC, ucp_S },
- { 737, PT_SC, ucp_Samaritan },
- { 747, PT_SC, ucp_Saurashtra },
- { 758, PT_PC, ucp_Sc },
- { 761, PT_SC, ucp_Shavian },
- { 769, PT_SC, ucp_Sinhala },
- { 777, PT_PC, ucp_Sk },
- { 780, PT_PC, ucp_Sm },
- { 783, PT_PC, ucp_So },
- { 786, PT_SC, ucp_Sundanese },
- { 796, PT_SC, ucp_Syloti_Nagri },
- { 809, PT_SC, ucp_Syriac },
- { 816, PT_SC, ucp_Tagalog },
- { 824, PT_SC, ucp_Tagbanwa },
- { 833, PT_SC, ucp_Tai_Le },
- { 840, PT_SC, ucp_Tai_Tham },
- { 849, PT_SC, ucp_Tai_Viet },
- { 858, PT_SC, ucp_Tamil },
- { 864, PT_SC, ucp_Telugu },
- { 871, PT_SC, ucp_Thaana },
- { 878, PT_SC, ucp_Thai },
- { 883, PT_SC, ucp_Tibetan },
- { 891, PT_SC, ucp_Tifinagh },
- { 900, PT_SC, ucp_Ugaritic },
- { 909, PT_SC, ucp_Vai },
- { 913, PT_SC, ucp_Yi },
- { 916, PT_GC, ucp_Z },
- { 918, PT_PC, ucp_Zl },
- { 921, PT_PC, ucp_Zp },
- { 924, PT_PC, ucp_Zs }
+const ucp_type_table _pcre_utt[] = {
+ { 0, PT_ANY, 0 },
+ { 4, PT_SC, ucp_Arabic },
+ { 11, PT_SC, ucp_Armenian },
+ { 20, PT_SC, ucp_Avestan },
+ { 28, PT_SC, ucp_Balinese },
+ { 37, PT_SC, ucp_Bamum },
+ { 43, PT_SC, ucp_Bengali },
+ { 51, PT_SC, ucp_Bopomofo },
+ { 60, PT_SC, ucp_Braille },
+ { 68, PT_SC, ucp_Buginese },
+ { 77, PT_SC, ucp_Buhid },
+ { 83, PT_GC, ucp_C },
+ { 85, PT_SC, ucp_Canadian_Aboriginal },
+ { 105, PT_SC, ucp_Carian },
+ { 112, PT_PC, ucp_Cc },
+ { 115, PT_PC, ucp_Cf },
+ { 118, PT_SC, ucp_Cham },
+ { 123, PT_SC, ucp_Cherokee },
+ { 132, PT_PC, ucp_Cn },
+ { 135, PT_PC, ucp_Co },
+ { 138, PT_SC, ucp_Common },
+ { 145, PT_SC, ucp_Coptic },
+ { 152, PT_PC, ucp_Cs },
+ { 155, PT_SC, ucp_Cuneiform },
+ { 165, PT_SC, ucp_Cypriot },
+ { 173, PT_SC, ucp_Cyrillic },
+ { 182, PT_SC, ucp_Deseret },
+ { 190, PT_SC, ucp_Devanagari },
+ { 201, PT_SC, ucp_Egyptian_Hieroglyphs },
+ { 222, PT_SC, ucp_Ethiopic },
+ { 231, PT_SC, ucp_Georgian },
+ { 240, PT_SC, ucp_Glagolitic },
+ { 251, PT_SC, ucp_Gothic },
+ { 258, PT_SC, ucp_Greek },
+ { 264, PT_SC, ucp_Gujarati },
+ { 273, PT_SC, ucp_Gurmukhi },
+ { 282, PT_SC, ucp_Han },
+ { 286, PT_SC, ucp_Hangul },
+ { 293, PT_SC, ucp_Hanunoo },
+ { 301, PT_SC, ucp_Hebrew },
+ { 308, PT_SC, ucp_Hiragana },
+ { 317, PT_SC, ucp_Imperial_Aramaic },
+ { 334, PT_SC, ucp_Inherited },
+ { 344, PT_SC, ucp_Inscriptional_Pahlavi },
+ { 366, PT_SC, ucp_Inscriptional_Parthian },
+ { 389, PT_SC, ucp_Javanese },
+ { 398, PT_SC, ucp_Kaithi },
+ { 405, PT_SC, ucp_Kannada },
+ { 413, PT_SC, ucp_Katakana },
+ { 422, PT_SC, ucp_Kayah_Li },
+ { 431, PT_SC, ucp_Kharoshthi },
+ { 442, PT_SC, ucp_Khmer },
+ { 448, PT_GC, ucp_L },
+ { 450, PT_LAMP, 0 },
+ { 453, PT_SC, ucp_Lao },
+ { 457, PT_SC, ucp_Latin },
+ { 463, PT_SC, ucp_Lepcha },
+ { 470, PT_SC, ucp_Limbu },
+ { 476, PT_SC, ucp_Linear_B },
+ { 485, PT_SC, ucp_Lisu },
+ { 490, PT_PC, ucp_Ll },
+ { 493, PT_PC, ucp_Lm },
+ { 496, PT_PC, ucp_Lo },
+ { 499, PT_PC, ucp_Lt },
+ { 502, PT_PC, ucp_Lu },
+ { 505, PT_SC, ucp_Lycian },
+ { 512, PT_SC, ucp_Lydian },
+ { 519, PT_GC, ucp_M },
+ { 521, PT_SC, ucp_Malayalam },
+ { 531, PT_PC, ucp_Mc },
+ { 534, PT_PC, ucp_Me },
+ { 537, PT_SC, ucp_Meetei_Mayek },
+ { 550, PT_PC, ucp_Mn },
+ { 553, PT_SC, ucp_Mongolian },
+ { 563, PT_SC, ucp_Myanmar },
+ { 571, PT_GC, ucp_N },
+ { 573, PT_PC, ucp_Nd },
+ { 576, PT_SC, ucp_New_Tai_Lue },
+ { 588, PT_SC, ucp_Nko },
+ { 592, PT_PC, ucp_Nl },
+ { 595, PT_PC, ucp_No },
+ { 598, PT_SC, ucp_Ogham },
+ { 604, PT_SC, ucp_Ol_Chiki },
+ { 613, PT_SC, ucp_Old_Italic },
+ { 624, PT_SC, ucp_Old_Persian },
+ { 636, PT_SC, ucp_Old_South_Arabian },
+ { 654, PT_SC, ucp_Old_Turkic },
+ { 665, PT_SC, ucp_Oriya },
+ { 671, PT_SC, ucp_Osmanya },
+ { 679, PT_GC, ucp_P },
+ { 681, PT_PC, ucp_Pc },
+ { 684, PT_PC, ucp_Pd },
+ { 687, PT_PC, ucp_Pe },
+ { 690, PT_PC, ucp_Pf },
+ { 693, PT_SC, ucp_Phags_Pa },
+ { 702, PT_SC, ucp_Phoenician },
+ { 713, PT_PC, ucp_Pi },
+ { 716, PT_PC, ucp_Po },
+ { 719, PT_PC, ucp_Ps },
+ { 722, PT_SC, ucp_Rejang },
+ { 729, PT_SC, ucp_Runic },
+ { 735, PT_GC, ucp_S },
+ { 737, PT_SC, ucp_Samaritan },
+ { 747, PT_SC, ucp_Saurashtra },
+ { 758, PT_PC, ucp_Sc },
+ { 761, PT_SC, ucp_Shavian },
+ { 769, PT_SC, ucp_Sinhala },
+ { 777, PT_PC, ucp_Sk },
+ { 780, PT_PC, ucp_Sm },
+ { 783, PT_PC, ucp_So },
+ { 786, PT_SC, ucp_Sundanese },
+ { 796, PT_SC, ucp_Syloti_Nagri },
+ { 809, PT_SC, ucp_Syriac },
+ { 816, PT_SC, ucp_Tagalog },
+ { 824, PT_SC, ucp_Tagbanwa },
+ { 833, PT_SC, ucp_Tai_Le },
+ { 840, PT_SC, ucp_Tai_Tham },
+ { 849, PT_SC, ucp_Tai_Viet },
+ { 858, PT_SC, ucp_Tamil },
+ { 864, PT_SC, ucp_Telugu },
+ { 871, PT_SC, ucp_Thaana },
+ { 878, PT_SC, ucp_Thai },
+ { 883, PT_SC, ucp_Tibetan },
+ { 891, PT_SC, ucp_Tifinagh },
+ { 900, PT_SC, ucp_Ugaritic },
+ { 909, PT_SC, ucp_Vai },
+ { 913, PT_SC, ucp_Yi },
+ { 916, PT_GC, ucp_Z },
+ { 918, PT_PC, ucp_Zl },
+ { 921, PT_PC, ucp_Zp },
+ { 924, PT_PC, ucp_Zs }
};
const int _pcre_utt_size = sizeof(_pcre_utt)/sizeof(ucp_type_table);
diff --git a/pcreposix.c b/pcreposix.c
index 44c3ff9..76f3f87 100644
--- a/pcreposix.c
+++ b/pcreposix.c
@@ -372,7 +372,7 @@ switch(rc)
error if the vector eint, which is indexed by compile-time error number, is
not the correct length. It seems to be the only way to do such a check at
compile time, as the sizeof() operator does not work in the C preprocessor.
- As all the PCRE_ERROR_xxx values are negative, we can use 0 and 1. */
+ As all the PCRE_ERROR_xxx values are negative, we can use 0 and 1. */
case 0:
case (sizeof(eint)/sizeof(int) == ERRCOUNT):
diff --git a/pcretest.c b/pcretest.c
index 11403a3..5ed3289 100644
--- a/pcretest.c
+++ b/pcretest.c
@@ -118,7 +118,7 @@ external symbols to prevent clashes. */
/* We also need the pcre_printint() function for printing out compiled
patterns. This function is in a separate file so that it can be included in
-pcre_compile.c when that module is compiled with debugging enabled. It needs to
+pcre_compile.c when that module is compiled with debugging enabled. It needs to
know which case is being compiled. */
#define COMPILING_PCRETEST
diff --git a/testdata/testinput12 b/testdata/testinput12
index ab42f45..6588a8d 100644
--- a/testdata/testinput12
+++ b/testdata/testinput12
@@ -202,5 +202,14 @@ of case for anything other than the ASCII letters. --/
/(?i:[\x{c0}])/8
\x{c0}
\x{e0}
+
+/-- This should be Perl-compatible but Perl 5.11 gets \x{300} wrong. --/8
+/^\X/8
+ A
+ A\x{300}BC
+ A\x{300}\x{301}\x{302}BC
+ *** Failers
+ \x{300}
+
/-- End of testinput12 --/
diff --git a/testdata/testinput6 b/testdata/testinput6
index 9e5d052..759018a 100644
--- a/testdata/testinput6
+++ b/testdata/testinput6
@@ -370,13 +370,6 @@
\x{3b1}
\x{ff5a}
-/^\X/8
- A
- A\x{300}BC
- A\x{300}\x{301}\x{302}BC
- *** Failers
- \x{300}
-
/^[\X]/8
X123
*** Failers
diff --git a/testdata/testoutput12 b/testdata/testoutput12
index 21190d5..073cf2b 100644
--- a/testdata/testoutput12
+++ b/testdata/testoutput12
@@ -470,5 +470,19 @@ of case for anything other than the ASCII letters. --/
0: \x{c0}
\x{e0}
0: \x{e0}
+
+/-- This should be Perl-compatible but Perl 5.11 gets \x{300} wrong. --/8
+/^\X/8
+ A
+ 0: A
+ A\x{300}BC
+ 0: A\x{300}
+ A\x{300}\x{301}\x{302}BC
+ 0: A\x{300}\x{301}\x{302}
+ *** Failers
+ 0: *
+ \x{300}
+No match
+
/-- End of testinput12 --/
diff --git a/testdata/testoutput6 b/testdata/testoutput6
index 7a4410b..b4176eb 100644
--- a/testdata/testoutput6
+++ b/testdata/testoutput6
@@ -618,18 +618,6 @@ No match
\x{ff5a}
0: \x{ff5a}
-/^\X/8
- A
- 0: A
- A\x{300}BC
- 0: A\x{300}
- A\x{300}\x{301}\x{302}BC
- 0: A\x{300}\x{301}\x{302}
- *** Failers
- 0: *
- \x{300}
-No match
-
/^[\X]/8
X123
0: X