author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2011-12-05 12:33:44 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2011-12-05 12:33:44 +0000
commit fe230b59c018dd441d38ccc8eff23f35fd009a03
tree c70d7f16605bb43d8a2b221e70e7852b299fc9b1
parent 757205faa5e41a044d79120d188bf6edf2d0e2d6
Tidies for 8.21-RC1 release.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@784 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- ChangeLog 10
-rw-r--r-- NEWS 7
-rw-r--r-- configure.ac 2
-rw-r--r-- doc/html/index.html 16
-rw-r--r-- doc/html/pcreapi.html 36
-rw-r--r-- doc/html/pcrecallout.html 9
-rw-r--r-- doc/html/pcrecompat.html 5
-rw-r--r-- doc/html/pcrejit.html 141
-rw-r--r-- doc/html/pcrelimits.html 8
-rw-r--r-- doc/html/pcrematching.html 8
-rw-r--r-- doc/html/pcrepattern.html 148
-rw-r--r-- doc/html/pcretest.html 7
-rw-r--r-- doc/pcre.txt 1369
-rw-r--r-- doc/pcretest.txt 307
-rw-r--r-- testdata/testoutput15 1
15 files changed, 1179 insertions, 895 deletions
diff --git a/ChangeLog b/ChangeLog
index cce48aa..c75bcad 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,8 +1,8 @@
ChangeLog for PCRE
------------------
-Version 8.21
-------------
+Version 8.21 05-Dec-2011
+------------------------
1. Updating the JIT compiler.
@@ -13,7 +13,7 @@ Version 8.21
PCRE_EXTRA_TABLES is not supported by JIT, and should be checked before
calling _pcre_jit_exec. Some extra comments are added.
-4. Mark settings inside atomic groups that do not contain any capturing
+4. (*MARK) settings inside atomic groups that do not contain any capturing
parentheses, for example, (?>a(*:m)), were not being passed out. This bug
was introduced by change 18 for 8.20.
@@ -99,6 +99,10 @@ Version 8.21
24. Added PCRE_INFO_JITSIZE to pass on the value from (21) above, and also
output it when the /M option is used in pcretest.
+
+25. The CheckMan script was not being included in the distribution. Also, added
+ an explicit "perl" to run Perl scripts from the PrepareRelease script
+ because this is reportedly needed in Windows.
Version 8.20 21-Oct-2011
diff --git a/NEWS b/NEWS
index 1d76c86..16972d0 100644
--- a/NEWS
+++ b/NEWS
@@ -1,6 +1,13 @@
News about PCRE releases
------------------------
+Release 8.21 05-Dec-2011
+------------------------
+
+This is mostly a bug-fix release. The only new feature is the ability to obtain
+the memory used by the JIT compiler.
+
+
Release 8.20 21-Oct-2011
------------------------
diff --git a/configure.ac b/configure.ac
index ddee8e8..ea97c35 100644
--- a/configure.ac
+++ b/configure.ac
@@ -11,7 +11,7 @@ dnl be defined as -RC2, for example. For real releases, it should be empty.
m4_define(pcre_major, [8])
m4_define(pcre_minor, [21])
m4_define(pcre_prerelease, [-RC1])
-m4_define(pcre_date, [2011-11-14])
+m4_define(pcre_date, [2011-12-05])
# Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [0:1:0])
diff --git a/doc/html/index.html b/doc/html/index.html
index 75361fd..fc93ed0 100644
--- a/doc/html/index.html
+++ b/doc/html/index.html
@@ -1,10 +1,10 @@
<html>
-<!-- This is a manually maintained file that is the root of the HTML version of
- the PCRE documentation. When the HTML documents are built from the man
- page versions, the entire doc/html directory is emptied, this file is then
- copied into doc/html/index.html, and the remaining files therein are
+<!-- This is a manually maintained file that is the root of the HTML version of
+ the PCRE documentation. When the HTML documents are built from the man
+ page versions, the entire doc/html directory is emptied, this file is then
+ copied into doc/html/index.html, and the remaining files therein are
created by the 132html script.
--->
+-->
<head>
<title>PCRE specification</title>
</head>
@@ -83,11 +83,11 @@ The HTML documentation for PCRE comprises the following pages:
</table>
<p>
-There are also individual pages that summarize the interface for each function
+There are also individual pages that summarize the interface for each function
in the library:
</p>
-<table>
+<table>
<tr><td><a href="pcre_assign_jit_stack.html">pcre_assign_jit_stack</a></td>
<td>&nbsp;&nbsp;Assign stack for JIT matching</td></tr>
@@ -150,7 +150,7 @@ in the library:
<tr><td><a href="pcre_maketables.html">pcre_maketables</a></td>
<td>&nbsp;&nbsp;Build character tables in current locale</td></tr>
-
+
<tr><td><a href="pcre_refcount.html">pcre_refcount</a></td>
<td>&nbsp;&nbsp;Maintain reference count in compiled pattern</td></tr>
diff --git a/doc/html/pcreapi.html b/doc/html/pcreapi.html
index cd90766..9ddae5b 100644
--- a/doc/html/pcreapi.html
+++ b/doc/html/pcreapi.html
@@ -649,6 +649,23 @@ character). Thus, the pattern AB]CD becomes illegal when this option is set.
string (by default this causes the current matching alternative to fail). A
pattern such as (\1)(a) succeeds when this option is set (assuming it can find
an "a" in the subject), whereas it fails by default, for Perl compatibility.
+</P>
+<P>
+(3) \U matches an upper case "U" character; by default \U causes a compile
+time error (Perl uses \U to upper case subsequent characters).
+</P>
+<P>
+(4) \u matches a lower case "u" character unless it is followed by four
+hexadecimal digits, in which case the hexadecimal number defines the code point
+to match. By default, \u causes a compile time error (Perl uses it to upper
+case the following character).
+</P>
+<P>
+(5) \x matches a lower case "x" character unless it is followed by two
+hexadecimal digits, in which case the hexadecimal number defines the code point
+to match. By default, as in Perl, a hexadecimal number is always expected after
+\x, but it may have zero, one, or two digits (so, for example, \xz matches a
+binary zero character followed by z).
<pre>
PCRE_MULTILINE
</pre>
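Items (3) to (5) in the hunk above describe how \x and \u are read with and without PCRE_JAVASCRIPT_COMPAT. The following is a minimal sketch of those rules in plain C, not PCRE's own code: the function name `decode_hex_escape`, its `js_compat` flag, and its return convention are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit; the caller guarantees isxdigit(c). */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    return tolower(c) - 'a' + 10;
}

/*
 * Hypothetical helper (not part of the PCRE API). `p` points at the
 * character that follows the backslash, which must be 'x' or 'u'.
 * Returns the code point the escape stands for, or -1 where the text
 * above says a compile-time error is given. *consumed receives the
 * number of characters of `p` that belong to the escape.
 */
static int decode_hex_escape(const char *p, int js_compat, size_t *consumed)
{
    size_t want = (*p == 'u') ? 4 : 2;  /* \uhhhh or \xhh */
    size_t n = 0;
    int value = 0;

    if (!js_compat && *p == 'u') {      /* \u is an error by default */
        *consumed = 0;
        return -1;
    }

    /* Read up to `want` hex digits; stops at the first non-hex character. */
    while (n < want && isxdigit((unsigned char)p[1 + n])) {
        value = value * 16 + hex_value((unsigned char)p[1 + n]);
        n++;
    }

    if (js_compat) {
        if (n == want) {                /* exactly 2 (or 4) digits: code point */
            *consumed = 1 + n;
            return value;
        }
        *consumed = 1;                  /* otherwise a literal 'x' or 'u' */
        return (unsigned char)*p;
    }

    /* Default \x: zero, one, or two digits; zero digits gives value 0. */
    *consumed = 1 + n;
    return value;
}
```

With this sketch, `decode_hex_escape("xz", 0, &used)` yields 0 and consumes only the "x", which corresponds to the documented case where \xz matches a binary zero followed by "z"; with `js_compat` set, the same input yields a literal "x".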
@@ -1127,6 +1144,12 @@ particular pattern. See the
<a href="pcrejit.html"><b>pcrejit</b></a>
documentation for details of what can and cannot be handled.
<pre>
+ PCRE_INFO_JITSIZE
+</pre>
+If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE option,
+return the size of the JIT compiled code, otherwise return zero. The fourth
+argument should point to a <b>size_t</b> variable.
+<pre>
PCRE_INFO_LASTLITERAL
</pre>
Return the value of the rightmost literal byte that must exist in any matched
@@ -1235,10 +1258,13 @@ For such patterns, the PCRE_ANCHORED bit is set in the options returned by
<pre>
PCRE_INFO_SIZE
</pre>
-Return the size of the compiled pattern, that is, the value that was passed as
-the argument to <b>pcre_malloc()</b> when PCRE was getting memory in which to
-place the compiled data. The fourth argument should point to a <b>size_t</b>
-variable.
+Return the size of the compiled pattern. The fourth argument should point to a
+<b>size_t</b> variable. This value does not include the size of the <b>pcre</b>
+structure that is returned by <b>pcre_compile()</b>. The value that is passed as
+the argument to <b>pcre_malloc()</b> when <b>pcre_compile()</b> is getting memory
+in which to place the compiled data is the value returned by this option plus
+the size of the <b>pcre</b> structure. Studying a compiled pattern, with or
+without JIT, does not alter the value returned by this option.
<pre>
PCRE_INFO_STUDYSIZE
</pre>
@@ -2486,7 +2512,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC24" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 23 September 2011
+Last updated: 02 December 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrecallout.html b/doc/html/pcrecallout.html
index 40d5fa2..e94ffec 100644
--- a/doc/html/pcrecallout.html
+++ b/doc/html/pcrecallout.html
@@ -189,9 +189,10 @@ same callout number. However, they are set for all callouts.
<P>
The <i>mark</i> field is present from version 2 of the <i>pcre_callout</i>
structure. In callouts from <b>pcre_exec()</b> it contains a pointer to the
-zero-terminated name of the most recently passed (*MARK) item in the match, or
-NULL if there are no (*MARK)s in the current matching path. In callouts from
-<b>pcre_dfa_exec()</b> this field always contains NULL.
+zero-terminated name of the most recently passed (*MARK), (*PRUNE), or (*THEN)
+item in the match, or NULL if no such items have been passed. Instances of
+(*PRUNE) or (*THEN) without a name do not obliterate a previous (*MARK). In
+callouts from <b>pcre_dfa_exec()</b> this field always contains NULL.
</P>
<br><a name="SEC4" href="#TOC1">RETURN VALUES</a><br>
<P>
@@ -219,7 +220,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC6" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 26 August 2011
+Last updated: 30 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrecompat.html b/doc/html/pcrecompat.html
index 69d9d1d..9a09318 100644
--- a/doc/html/pcrecompat.html
+++ b/doc/html/pcrecompat.html
@@ -53,7 +53,8 @@ represent a binary zero.
own, matching a non-newline character, is supported.) In fact these are
implemented by Perl's general string-handling and are not part of its pattern
matching engine. If any of these are encountered by PCRE, an error is
-generated.
+generated by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
+\U and \u are interpreted as JavaScript interprets them.
</P>
<P>
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE is
@@ -202,7 +203,7 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 09 October 2011
+Last updated: 14 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrejit.html b/doc/html/pcrejit.html
index c257d0d..c5b2a48 100644
--- a/doc/html/pcrejit.html
+++ b/doc/html/pcrejit.html
@@ -20,10 +20,11 @@ man page, in case the conversion went wrong.
<li><a name="TOC5" href="#SEC5">RETURN VALUES FROM JIT EXECUTION</a>
<li><a name="TOC6" href="#SEC6">SAVING AND RESTORING COMPILED PATTERNS</a>
<li><a name="TOC7" href="#SEC7">CONTROLLING THE JIT STACK</a>
-<li><a name="TOC8" href="#SEC8">EXAMPLE CODE</a>
-<li><a name="TOC9" href="#SEC9">SEE ALSO</a>
-<li><a name="TOC10" href="#SEC10">AUTHOR</a>
-<li><a name="TOC11" href="#SEC11">REVISION</a>
+<li><a name="TOC8" href="#SEC8">JIT STACK FAQ</a>
+<li><a name="TOC9" href="#SEC9">EXAMPLE CODE</a>
+<li><a name="TOC10" href="#SEC10">SEE ALSO</a>
+<li><a name="TOC11" href="#SEC11">AUTHOR</a>
+<li><a name="TOC12" href="#SEC12">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE JUST-IN-TIME COMPILER SUPPORT</a><br>
<P>
@@ -57,11 +58,17 @@ fully tested. If --enable-jit is set on an unsupported platform, compilation
fails.
</P>
<P>
-A program can tell if JIT support is available by calling <b>pcre_config()</b>
-with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available, and 0
-otherwise. However, a simple program does not need to check this in order to
-use JIT. The API is implemented in a way that falls back to the ordinary PCRE
-code if JIT is not available.
+A program that is linked with PCRE 8.20 or later can tell if JIT support is
+available by calling <b>pcre_config()</b> with the PCRE_CONFIG_JIT option. The
+result is 1 when JIT is available, and 0 otherwise. However, a simple program
+does not need to check this in order to use JIT. The API is implemented in a
+way that falls back to the ordinary PCRE code if JIT is not available.
+</P>
+<P>
+If your program may sometimes be linked with versions of PCRE that are older
+than 8.20, but you want to use JIT when it is available, you can test
+the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT macro such
+as PCRE_CONFIG_JIT, for compile-time control of your code.
</P>
<br><a name="SEC3" href="#TOC1">SIMPLE USE OF JIT</a><br>
<P>
@@ -75,6 +82,21 @@ You have to do two things to make use of the JIT support in the simplest way:
no longer needed instead of just freeing it yourself. This
ensures that any JIT data is also freed.
</pre>
+For a program that may be linked with pre-8.20 versions of PCRE, you can insert
+<pre>
+ #ifndef PCRE_STUDY_JIT_COMPILE
+ #define PCRE_STUDY_JIT_COMPILE 0
+ #endif
+</pre>
+so that no option is passed to <b>pcre_study()</b>, and then use something like
+this to free the study data:
+<pre>
+ #ifdef PCRE_CONFIG_JIT
+ pcre_free_study(study_ptr);
+ #else
+ pcre_free(study_ptr);
+ #endif
+</pre>
In some circumstances you may need to call additional functions. These are
described in the section entitled
<a href="#stackcontrol">"Controlling the JIT stack"</a>
@@ -116,12 +138,8 @@ supported.
<P>
The unsupported pattern items are:
<pre>
- \C match a single byte; not supported in UTF-8 mode
+ \C match a single byte; not supported in UTF-8 mode
(?Cn) callouts
- (?(&#60;name&#62;)... conditional test on setting of a named subpattern
- (?(R)... conditional test on whole pattern recursion
- (?(Rn)... conditional test on recursion, by number
- (?(R&name)... conditional test on recursion, by name
(*COMMIT) )
(*MARK) )
(*PRUNE) ) the backtracking control verbs
@@ -167,7 +185,10 @@ When the compiled JIT code runs, it needs a block of memory to use as a stack.
By default, it uses 32K on the machine stack. However, some large or
complicated patterns need more than this. The error PCRE_ERROR_JIT_STACKLIMIT
is given when there is not enough stack. Three functions are provided for
-managing blocks of memory for use as JIT stacks.
+managing blocks of memory for use as JIT stacks. There is further discussion
+about the use of JIT stacks in the section entitled
+<a href="#stackfaq">"JIT stack FAQ"</a>
+below.
</P>
<P>
The <b>pcre_jit_stack_alloc()</b> function creates a JIT stack. Its arguments
@@ -234,8 +255,86 @@ All the functions described in this section do nothing if JIT is not available,
and <b>pcre_assign_jit_stack()</b> does nothing unless the <b>extra</b> argument
is non-NULL and points to a <b>pcre_extra</b> block that is the result of a
successful study with PCRE_STUDY_JIT_COMPILE.
+<a name="stackfaq"></a></P>
+<br><a name="SEC8" href="#TOC1">JIT STACK FAQ</a><br>
+<P>
+(1) Why do we need JIT stacks?
+<br>
+<br>
+PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack where
+the local data of the current node is pushed before checking its child nodes.
+Allocating real machine stack on some platforms is difficult. For example, the
+stack chain needs to be updated every time if we extend the stack on PowerPC.
+Although it is possible, its updating time overhead decreases performance. So
+we do the recursion in memory.
+</P>
+<P>
+(2) Why don't we simply allocate blocks of memory with <b>malloc()</b>?
+<br>
+<br>
+Modern operating systems have a nice feature: they can reserve an address space
+instead of allocating memory. We can safely allocate memory pages inside this
+address space, so the stack could grow without moving memory data (this is
+important because of pointers). Thus we can allocate 1M address space, and use
+only a single memory page (usually 4K) if that is enough. However, we can still
+grow up to 1M anytime if needed.
+</P>
+<P>
+(3) Who "owns" a JIT stack?
+<br>
+<br>
+The owner of the stack is the user program, not the JIT studied pattern or
+anything else. The user program must ensure that if a stack is used by
+<b>pcre_exec()</b>, (that is, it is assigned to the pattern currently running),
+that stack must not be used by any other threads (to avoid overwriting the same
+memory area). The best practice for multithreaded programs is to allocate a
+stack for each thread, and return this stack through the JIT callback function.
+</P>
+<P>
+(4) When should a JIT stack be freed?
+<br>
+<br>
+You can free a JIT stack at any time, as long as it will not be used by
+<b>pcre_exec()</b> again. When you assign the stack to a pattern, only a pointer
+is set. There is no reference counting or any other magic. You can free the
+patterns and stacks in any order, anytime. Just <i>do not</i> call
+<b>pcre_exec()</b> with a pattern pointing to an already freed stack, as that
+will cause SEGFAULT. (Also, do not free a stack currently used by
+<b>pcre_exec()</b> in another thread). You can also replace the stack for a
+pattern at any time. You can even free the previous stack before assigning a
+replacement.
+</P>
+<P>
+(5) Should I allocate/free a stack every time before/after calling
+<b>pcre_exec()</b>?
+<br>
+<br>
+No, because this is too costly in terms of resources. However, you could
+implement some clever idea which releases the stack if it is not used in let's
+say two minutes. The JIT callback can help to achieve this without keeping a
+list of the currently JIT studied patterns.
+</P>
+<P>
+(6) OK, the stack is for long term memory allocation. But what happens if a
+pattern causes stack overflow with a stack of 1M? Is that 1M kept until the
+stack is freed?
+<br>
+<br>
+Especially on embedded systems, it might be a good idea to release
+memory sometimes without freeing the stack. There is no API for this at the
+moment. Probably a function call which returns with the currently allocated
+memory for any stack and another which allows releasing memory (shrinking the
+stack) would be a good idea if someone needs this.
+</P>
+<P>
+(7) This is too much of a headache. Isn't there any better solution for JIT
+stack handling?
+<br>
+<br>
+No, thanks to Windows. If POSIX threads were used everywhere, we could throw
+out this complicated API.
</P>
-<br><a name="SEC8" href="#TOC1">EXAMPLE CODE</a><br>
+<br><a name="SEC9" href="#TOC1">EXAMPLE CODE</a><br>
<P>
This is a single-threaded example that specifies a JIT stack without using a
callback.
@@ -260,22 +359,22 @@ callback.
</PRE>
</P>
-<br><a name="SEC9" href="#TOC1">SEE ALSO</a><br>
+<br><a name="SEC10" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcreapi</b>(3)
</P>
-<br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
<P>
-Philip Hazel
+Philip Hazel (FAQ by Zoltan Herczeg)
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
-<br><a name="SEC11" href="#TOC1">REVISION</a><br>
+<br><a name="SEC12" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 19 October 2011
+Last updated: 26 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrelimits.html b/doc/html/pcrelimits.html
index 4dc28f7..2cab81f 100644
--- a/doc/html/pcrelimits.html
+++ b/doc/html/pcrelimits.html
@@ -37,6 +37,12 @@ There is no limit to the number of parenthesized subpatterns, but there can be
no more than 65535 capturing subpatterns.
</P>
<P>
+There is a limit to the number of forward references to subsequent subpatterns
+of around 200,000. Repeated forward references with fixed upper limits, for
+example, (?2){0,100} when subpattern number 2 is to the right, are included in
+the count. There is no limit to the number of backward references.
+</P>
+<P>
The maximum length of name for a named subpattern is 32 characters, and the
maximum number of named subpatterns is 10000.
</P>
@@ -65,7 +71,7 @@ Cambridge CB2 3QH, England.
REVISION
</b><br>
<P>
-Last updated: 24 August 2011
+Last updated: 30 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcrematching.html b/doc/html/pcrematching.html
index 3d1acf6..ad17c98 100644
--- a/doc/html/pcrematching.html
+++ b/doc/html/pcrematching.html
@@ -164,9 +164,9 @@ always 1, and the value of the <i>capture_last</i> field is always -1.
</P>
<P>
7. The \C escape sequence, which (in the standard algorithm) matches a single
-byte, even in UTF-8 mode, is not supported because the alternative algorithm
-moves through the subject string one character at a time, for all active paths
-through the tree.
+byte, even in UTF-8 mode, is not supported in UTF-8 mode, because the
+alternative algorithm moves through the subject string one character at a time,
+for all active paths through the tree.
</P>
<P>
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
@@ -220,7 +220,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 17 November 2010
+Last updated: 19 November 2011
<br>
Copyright &copy; 1997-2010 University of Cambridge.
<br>
diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index 349c98c..3efb367 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -268,7 +268,8 @@ one of the following escape sequences than the binary character it represents:
\t tab (hex 09)
\ddd character with octal code ddd, or back reference
\xhh character with hex code hh
- \x{hhh..} character with hex code hhh..
+ \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
+ \uhhhh character with hex code hhhh (JavaScript mode only)
</pre>
The precise effect of \cx is as follows: if x is a lower case letter, it
is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
@@ -280,12 +281,12 @@ values are valid. A lower case letter is converted to upper case, and then the
0xc0 bits are flipped.)
</P>
<P>
-After \x, from zero to two hexadecimal digits are read (letters can be in
-upper or lower case). Any number of hexadecimal digits may appear between \x{
-and }, but the value of the character code must be less than 256 in non-UTF-8
-mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in
-hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code
-point, which is 10FFFF.
+By default, after \x, from zero to two hexadecimal digits are read (letters
+can be in upper or lower case). Any number of hexadecimal digits may appear
+between \x{ and }, but the value of the character code must be less than 256
+in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
+value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
+Unicode code point, which is 10FFFF.
</P>
<P>
If characters other than hexadecimal digits appear between \x{ and }, or if
@@ -294,9 +295,17 @@ initial \x will be interpreted as a basic hexadecimal escape, with no
following digits, giving a character whose value is zero.
</P>
<P>
+If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
+as just described only when it is followed by two hexadecimal digits.
+Otherwise, it matches a literal "x" character. In JavaScript mode, support for
+code points greater than 256 is provided by \u, which must be followed by
+four hexadecimal digits; otherwise it matches a literal "u" character.
+</P>
+<P>
Characters whose value is less than 256 can be defined by either of the two
-syntaxes for \x. There is no difference in the way they are handled. For
-example, \xdc is exactly the same as \x{dc}.
+syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
+way they are handled. For example, \xdc is exactly the same as \x{dc} (or
+\u00dc in JavaScript mode).
</P>
<P>
After \0 up to two further octal digits are read. If there are fewer than two
@@ -338,12 +347,25 @@ zero, because no more than three octal digits are ever read.
</P>
<P>
All the sequences that define a single character value can be used both inside
-and outside character classes. In addition, inside a character class, the
-sequence \b is interpreted as the backspace character (hex 08). The sequences
-\B, \N, \R, and \X are not special inside a character class. Like any other
-unrecognized escape sequences, they are treated as the literal characters "B",
-"N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is
-set. Outside a character class, these sequences have different meanings.
+and outside character classes. In addition, inside a character class, \b is
+interpreted as the backspace character (hex 08).
+</P>
+<P>
+\N is not allowed in a character class. \B, \R, and \X are not special
+inside a character class. Like other unrecognized escape sequences, they are
+treated as the literal characters "B", "R", and "X" by default, but cause an
+error if the PCRE_EXTRA option is set. Outside a character class, these
+sequences have different meanings.
+</P>
+<br><b>
+Unsupported escape sequences
+</b><br>
+<P>
+In Perl, the sequences \l, \L, \u, and \U are recognized by its string
+handler and used to modify the case of following characters. By default, PCRE
+does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
+option is set, \U matches a "U" character, and \u can be used to define a
+character by code point, as described in the previous section.
</P>
<br><b>
Absolute and relative back references
@@ -389,7 +411,8 @@ Another use of backslash is for specifying generic character types:
There is also the single sequence \N, which matches a non-newline character.
This is the same as
<a href="#fullstopdot">the "." metacharacter</a>
-when PCRE_DOTALL is not set.
+when PCRE_DOTALL is not set. Perl also uses \N to match characters by name;
+PCRE does not support this.
</P>
<P>
Each pair of lower and upper case escape sequences partitions the complete set
@@ -963,7 +986,8 @@ special meaning in a character class.
<P>
The escape sequence \N behaves like a dot, except that it is not affected by
the PCRE_DOTALL option. In other words, it matches any character except one
-that signifies the end of a line.
+that signifies the end of a line. Perl also uses \N to match characters by
+name; PCRE does not support this.
</P>
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
<P>
@@ -979,8 +1003,8 @@ processing unless the PCRE_NO_UTF8_CHECK option is used).
</P>
<P>
PCRE does not allow \C to appear in lookbehind assertions
-<a href="#lookbehind">(described below),</a>
-because in UTF-8 mode this would make it impossible to calculate the length of
+<a href="#lookbehind">(described below)</a>
+in UTF-8 mode, because this would make it impossible to calculate the length of
the lookbehind.
</P>
<P>
@@ -1926,10 +1950,10 @@ match. If there are insufficient characters before the current position, the
assertion fails.
</P>
<P>
-PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode)
-to appear in lookbehind assertions, because it makes it impossible to calculate
-the length of the lookbehind. The \X and \R escapes, which can match
-different numbers of bytes, are also not permitted.
+In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte,
+even in UTF-8 mode) to appear in lookbehind assertions, because it makes it
+impossible to calculate the length of the lookbehind. The \X and \R escapes,
+which can match different numbers of bytes, are also not permitted.
</P>
<P>
<a href="#subpatternsassubroutines">"Subroutine"</a>
@@ -2511,10 +2535,11 @@ failing negative assertion, they cause an error if encountered by
If any of these verbs are used in an assertion or in a subpattern that is
called as a subroutine (whether or not recursively), their effect is confined
to that subpattern; it does not extend to the surrounding pattern, with one
-exception: a *MARK that is encountered in a positive assertion <i>is</i> passed
-back (compare capturing parentheses in assertions). Note that such subpatterns
-are processed as anchored at the point where they are tested. Note also that
-Perl's treatment of subroutines is different in some cases.
+exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in
+a successful positive assertion <i>is</i> passed back when a match succeeds
+(compare capturing parentheses in assertions). Note that such subpatterns are
+processed as anchored at the point where they are tested. Note also that Perl's
+treatment of subroutines is different in some cases.
</P>
<P>
The new verbs make use of what was previously invalid syntax: an opening
@@ -2536,6 +2561,10 @@ the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the
pattern with (*NO_START_OPT).
</P>
+<P>
+Experiments with Perl suggest that it too has similar optimizations, sometimes
+leading to anomalous results.
+</P>
<br><b>
Verbs that act immediately
</b><br>
@@ -2583,17 +2612,17 @@ A name is always required with this verb. There may be as many instances of
(*MARK) as you like in a pattern, and their names do not have to be unique.
</P>
<P>
-When a match succeeds, the name of the last-encountered (*MARK) is passed back
-to the caller via the <i>pcre_extra</i> data structure, as described in the
+When a match succeeds, the name of the last-encountered (*MARK) on the matching
+path is passed back to the caller via the <i>pcre_extra</i> data structure, as
+described in the
<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a>
in the
<a href="pcreapi.html"><b>pcreapi</b></a>
-documentation. No data is returned for a partial match. Here is an example of
-<b>pcretest</b> output, where the /K modifier requests the retrieval and
-outputting of (*MARK) data:
+documentation. Here is an example of <b>pcretest</b> output, where the /K
+modifier requests the retrieval and outputting of (*MARK) data:
<pre>
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XY
+ re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data&#62; XY
0: XY
MK: A
XZ
@@ -2611,32 +2640,17 @@ passed back if it is the last-encountered. This does not happen for negative
assertions.
</P>
<P>
-A name may also be returned after a failed match if the final path through the
-pattern involves (*MARK). However, unless (*MARK) used in conjunction with
-(*COMMIT), this is unlikely to happen for an unanchored pattern because, as the
-starting point for matching is advanced, the final check is often with an empty
-string, causing a failure before (*MARK) is reached. For example:
+After a partial match or a failed match, the name of the last encountered
+(*MARK) in the entire match process is returned. For example:
<pre>
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XP
- No match
-</pre>
-There are three potential starting points for this match (starting with X,
-starting with P, and with an empty string). If the pattern is anchored, the
-result is different:
-<pre>
- /^X(*MARK:A)Y|^X(*MARK:B)Z/K
- XP
+ re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data&#62; XP
No match, mark = B
</pre>
-PCRE's start-of-match optimizations can also interfere with this. For example,
-if, as a result of a call to <b>pcre_study()</b>, it knows the minimum
-subject length for a match, a shorter subject will not be scanned at all.
-</P>
-<P>
-Note that similar anomalies (though different in detail) exist in Perl, no
-doubt for the same reasons. The use of (*MARK) data after a failed match of an
-unanchored pattern is not recommended, unless (*COMMIT) is involved.
+Note that in this unanchored example the mark is retained from the match
+attempt that started at the letter "X". Subsequent match attempts starting at
+"P" and then with an empty string do not get as far as the (*MARK) item, but
+nevertheless do not reset it.
</P>
<br><b>
Verbs that act after backtracking
@@ -2675,8 +2689,8 @@ Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
unless PCRE's start-of-match optimizations are turned off, as shown in this
<b>pcretest</b> example:
<pre>
- /(*COMMIT)abc/
- xyzabc
+ re&#62; /(*COMMIT)abc/
+ data&#62; xyzabc
0: abc
xyzabc\Y
No match
@@ -2697,10 +2711,8 @@ reached, or when matching to the right of (*PRUNE), but if there is no match to
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
(*PRUNE) is just an alternative to an atomic group or possessive quantifier,
but there are some uses of (*PRUNE) that cannot be expressed in any other way.
-The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
-match fails completely; the name is passed back if this is the final attempt.
-(*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored
-pattern (*PRUNE) has the same effect as (*COMMIT).
+The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
+anchored pattern (*PRUNE) has the same effect as (*COMMIT).
<pre>
(*SKIP)
</pre>
@@ -2726,8 +2738,7 @@ following pattern fails to match, the previous path through the pattern is
searched for the most recent (*MARK) that has the same name. If one is found,
the "bumpalong" advance is to the subject position that corresponds to that
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
-matching name is found, normal "bumpalong" of one character happens (that is,
-the (*SKIP) is ignored).
+matching name is found, the (*SKIP) is ignored.
<pre>
(*THEN) or (*THEN:NAME)
</pre>
@@ -2741,9 +2752,8 @@ be used for a pattern-based if-then-else block:
If the COND1 pattern matches, FOO is tried (and possibly further items after
the end of the group if FOO succeeds); on failure, the matcher skips to the
second alternative and tries COND2, without backtracking into COND1. The
-behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
-overall match fails. If (*THEN) is not inside an alternation, it acts like
-(*PRUNE).
+behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
+If (*THEN) is not inside an alternation, it acts like (*PRUNE).
</P>
<P>
Note that a subpattern that does not contain a | character is just a part of
@@ -2819,7 +2829,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 19 October 2011
+Last updated: 29 November 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/html/pcretest.html b/doc/html/pcretest.html
index 40b970d..80d224b 100644
--- a/doc/html/pcretest.html
+++ b/doc/html/pcretest.html
@@ -364,7 +364,10 @@ which it appears.
</P>
<P>
The <b>/M</b> modifier causes the size of memory block used to hold the compiled
-pattern to be output.
+pattern to be output. This does not include the size of the <b>pcre</b> block;
+it is just the actual compiled data. If the pattern is successfully studied
+with the PCRE_STUDY_JIT_COMPILE option, the size of the JIT compiled code is
+also output.
</P>
<P>
If the <b>/S</b> modifier appears once, it causes <b>pcre_study()</b> to be
@@ -856,7 +859,7 @@ Cambridge CB2 3QH, England.
</P>
<br><a name="SEC15" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 26 August 2011
+Last updated: 02 December 2011
<br>
Copyright &copy; 1997-2011 University of Cambridge.
<br>
diff --git a/doc/pcre.txt b/doc/pcre.txt
index 52ba310..126b7dd 100644
--- a/doc/pcre.txt
+++ b/doc/pcre.txt
@@ -120,8 +120,8 @@ REVISION
Last updated: 24 August 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREBUILD(3) PCREBUILD(3)
@@ -484,8 +484,8 @@ REVISION
Last updated: 06 September 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREMATCHING(3) PCREMATCHING(3)
@@ -633,9 +633,9 @@ THE ALTERNATIVE MATCHING ALGORITHM
always 1, and the value of the capture_last field is always -1.
7. The \C escape sequence, which (in the standard algorithm) matches a
- single byte, even in UTF-8 mode, is not supported because the alterna-
- tive algorithm moves through the subject string one character at a
- time, for all active paths through the tree.
+ single byte, even in UTF-8 mode, is not supported in UTF-8 mode,
+ because the alternative algorithm moves through the subject string one
+ character at a time, for all active paths through the tree.
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
are not supported. (*FAIL) is supported, and behaves like a failing
@@ -685,11 +685,11 @@ AUTHOR
REVISION
- Last updated: 17 November 2010
+ Last updated: 19 November 2011
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREAPI(3) PCREAPI(3)
@@ -1256,6 +1256,20 @@ COMPILING A PATTERN
set (assuming it can find an "a" in the subject), whereas it fails by
default, for Perl compatibility.
+ (3) \U matches an upper case "U" character; by default \U causes a com-
+ pile time error (Perl uses \U to upper case subsequent characters).
+
+ (4) \u matches a lower case "u" character unless it is followed by four
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, \u causes a compile time error (Perl
+ uses it to upper case the following character).
+
+ (5) \x matches a lower case "x" character unless it is followed by two
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, as in Perl, a hexadecimal number is
+ always expected after \x, but it may have zero, one, or two digits (so,
+ for example, \xz matches a binary zero character followed by z).
+
PCRE_MULTILINE
By default, PCRE treats the subject string as consisting of a single
@@ -1710,6 +1724,12 @@ INFORMATION ABOUT A PATTERN
compiler could not handle this particular pattern. See the pcrejit doc-
umentation for details of what can and cannot be handled.
+ PCRE_INFO_JITSIZE
+
+ If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE
+ option, return the size of the JIT compiled code, otherwise return
+ zero. The fourth argument should point to a size_t variable.
+
PCRE_INFO_LASTLITERAL
Return the value of the rightmost literal byte that must exist in any
@@ -1818,10 +1838,14 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_SIZE
- Return the size of the compiled pattern, that is, the value that was
- passed as the argument to pcre_malloc() when PCRE was getting memory in
- which to place the compiled data. The fourth argument should point to a
- size_t variable.
+ Return the size of the compiled pattern. The fourth argument should
+ point to a size_t variable. This value does not include the size of the
+ pcre structure that is returned by pcre_compile(). The value that is
+ passed as the argument to pcre_malloc() when pcre_compile() is getting
+ memory in which to place the compiled data is the value returned by
+ this option plus the size of the pcre structure. Studying a compiled
+ pattern, with or without JIT, does not alter the value returned by this
+ option.
PCRE_INFO_STUDYSIZE
@@ -2980,11 +3004,11 @@ AUTHOR
REVISION
- Last updated: 23 September 2011
+ Last updated: 02 December 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRECALLOUT(3) PCRECALLOUT(3)
@@ -3143,9 +3167,11 @@ THE CALLOUT INTERFACE
The mark field is present from version 2 of the pcre_callout structure.
In callouts from pcre_exec() it contains a pointer to the zero-termi-
- nated name of the most recently passed (*MARK) item in the match, or
- NULL if there are no (*MARK)s in the current matching path. In callouts
- from pcre_dfa_exec() this field always contains NULL.
+ nated name of the most recently passed (*MARK), (*PRUNE), or (*THEN)
+ item in the match, or NULL if no such items have been passed. Instances
+ of (*PRUNE) or (*THEN) without a name do not obliterate a previous
+ (*MARK). In callouts from pcre_dfa_exec() this field always contains
+ NULL.
RETURN VALUES
@@ -3173,11 +3199,11 @@ AUTHOR
REVISION
- Last updated: 26 August 2011
+ Last updated: 30 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRECOMPAT(3) PCRECOMPAT(3)
@@ -3218,7 +3244,9 @@ DIFFERENCES BETWEEN PCRE AND PERL
its own, matching a non-newline character, is supported.) In fact these
are implemented by Perl's general string-handling and are not part of
its pattern matching engine. If any of these are encountered by PCRE,
- an error is generated.
+ an error is generated by default. However, if the PCRE_JAVASCRIPT_COM-
+ PAT option is set, \U and \u are interpreted as JavaScript interprets
+ them.
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
is built with Unicode character property support. The properties that
@@ -3345,11 +3373,11 @@ AUTHOR
REVISION
- Last updated: 09 October 2011
+ Last updated: 14 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPATTERN(3) PCREPATTERN(3)
@@ -3572,7 +3600,8 @@ BACKSLASH
\t tab (hex 09)
\ddd character with octal code ddd, or back reference
\xhh character with hex code hh
- \x{hhh..} character with hex code hhh..
+ \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
+ \uhhhh character with hex code hhhh (JavaScript mode only)
The precise effect of \cx is as follows: if x is a lower case letter,
it is converted to upper case. Then bit 6 of the character (hex 40) is
@@ -3583,12 +3612,12 @@ BACKSLASH
is compiled in EBCDIC mode, all byte values are valid. A lower case
letter is converted to upper case, and then the 0xc0 bits are flipped.)
- After \x, from zero to two hexadecimal digits are read (letters can be
- in upper or lower case). Any number of hexadecimal digits may appear
- between \x{ and }, but the value of the character code must be less
- than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
- the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
- than the largest Unicode code point, which is 10FFFF.
+ By default, after \x, from zero to two hexadecimal digits are read
+ (letters can be in upper or lower case). Any number of hexadecimal dig-
+ its may appear between \x{ and }, but the value of the character code
+ must be less than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8
+ mode. That is, the maximum value in hexadecimal is 7FFFFFFF. Note that
+ this is bigger than the largest Unicode code point, which is 10FFFF.
If characters other than hexadecimal digits appear between \x{ and },
or if there is no terminating }, this form of escape is not recognized.
@@ -3596,9 +3625,17 @@ BACKSLASH
escape, with no following digits, giving a character whose value is
zero.
+ If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
+ is as just described only when it is followed by two hexadecimal dig-
+ its. Otherwise, it matches a literal "x" character. In JavaScript
+ mode, support for code points greater than 256 is provided by \u, which
+ must be followed by four hexadecimal digits; otherwise it matches a
+ literal "u" character.
+
Characters whose value is less than 256 can be defined by either of the
- two syntaxes for \x. There is no difference in the way they are han-
- dled. For example, \xdc is exactly the same as \x{dc}.
+ two syntaxes for \x (or by \u in JavaScript mode). There is no differ-
+ ence in the way they are handled. For example, \xdc is exactly the same
+ as \x{dc} (or \u00dc in JavaScript mode).
After \0 up to two further octal digits are read. If there are fewer
than two digits, just those that are present are used. Thus the
@@ -3642,12 +3679,22 @@ BACKSLASH
All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
- class, the sequence \b is interpreted as the backspace character (hex
- 08). The sequences \B, \N, \R, and \X are not special inside a charac-
- ter class. Like any other unrecognized escape sequences, they are
- treated as the literal characters "B", "N", "R", and "X" by default,
- but cause an error if the PCRE_EXTRA option is set. Outside a character
- class, these sequences have different meanings.
+ class, \b is interpreted as the backspace character (hex 08).
+
+ \N is not allowed in a character class. \B, \R, and \X are not special
+ inside a character class. Like other unrecognized escape sequences,
+ they are treated as the literal characters "B", "R", and "X" by
+ default, but cause an error if the PCRE_EXTRA option is set. Outside a
+ character class, these sequences have different meanings.
+
+ Unsupported escape sequences
+
+ In Perl, the sequences \l, \L, \u, and \U are recognized by its string
+ handler and used to modify the case of following characters. By
+ default, PCRE does not support these escape sequences. However, if the
+ PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and
+ \u can be used to define a character by code point, as described in the
+ previous section.
Absolute and relative back references
@@ -3682,53 +3729,54 @@ BACKSLASH
There is also the single sequence \N, which matches a non-newline char-
acter. This is the same as the "." metacharacter when PCRE_DOTALL is
- not set.
-
- Each pair of lower and upper case escape sequences partitions the com-
- plete set of characters into two disjoint sets. Any given character
- matches one, and only one, of each pair. The sequences can appear both
- inside and outside character classes. They each match one character of
- the appropriate type. If the current matching point is at the end of
- the subject string, all of them fail, because there is no character to
+ not set. Perl also uses \N to match characters by name; PCRE does not
+ support this.
+
+ Each pair of lower and upper case escape sequences partitions the com-
+ plete set of characters into two disjoint sets. Any given character
+ matches one, and only one, of each pair. The sequences can appear both
+ inside and outside character classes. They each match one character of
+ the appropriate type. If the current matching point is at the end of
+ the subject string, all of them fail, because there is no character to
match.
- For compatibility with Perl, \s does not match the VT character (code
- 11). This makes it different from the POSIX "space" class. The \s
- characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
+ For compatibility with Perl, \s does not match the VT character (code
+ 11). This makes it different from the POSIX "space" class. The \s
+ characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
"use locale;" is included in a Perl script, \s may match the VT charac-
ter. In PCRE, it never does.
- A "word" character is an underscore or any character that is a letter
- or digit. By default, the definition of letters and digits is con-
- trolled by PCRE's low-valued character tables, and may vary if locale-
- specific matching is taking place (see "Locale support" in the pcreapi
- page). For example, in a French locale such as "fr_FR" in Unix-like
- systems, or "french" in Windows, some character codes greater than 128
- are used for accented letters, and these are then matched by \w. The
+ A "word" character is an underscore or any character that is a letter
+ or digit. By default, the definition of letters and digits is con-
+ trolled by PCRE's low-valued character tables, and may vary if locale-
+ specific matching is taking place (see "Locale support" in the pcreapi
+ page). For example, in a French locale such as "fr_FR" in Unix-like
+ systems, or "french" in Windows, some character codes greater than 128
+ are used for accented letters, and these are then matched by \w. The
use of locales with Unicode is discouraged.
- By default, in UTF-8 mode, characters with values greater than 128
- never match \d, \s, or \w, and always match \D, \S, and \W. These
- sequences retain their original meanings from before UTF-8 support was
- available, mainly for efficiency reasons. However, if PCRE is compiled
- with Unicode property support, and the PCRE_UCP option is set, the be-
- haviour is changed so that Unicode properties are used to determine
+ By default, in UTF-8 mode, characters with values greater than 128
+ never match \d, \s, or \w, and always match \D, \S, and \W. These
+ sequences retain their original meanings from before UTF-8 support was
+ available, mainly for efficiency reasons. However, if PCRE is compiled
+ with Unicode property support, and the PCRE_UCP option is set, the be-
+ haviour is changed so that Unicode properties are used to determine
character types, as follows:
\d any character that \p{Nd} matches (decimal digit)
\s any character that \p{Z} matches, plus HT, LF, FF, CR
\w any character that \p{L} or \p{N} matches, plus underscore
- The upper case escapes match the inverse sets of characters. Note that
- \d matches only decimal digits, whereas \w matches any Unicode digit,
- as well as any Unicode letter, and underscore. Note also that PCRE_UCP
- affects \b, and \B because they are defined in terms of \w and \W.
+ The upper case escapes match the inverse sets of characters. Note that
+ \d matches only decimal digits, whereas \w matches any Unicode digit,
+ as well as any Unicode letter, and underscore. Note also that PCRE_UCP
+ affects \b, and \B because they are defined in terms of \w and \W.
Matching these sequences is noticeably slower when PCRE_UCP is set.
- The sequences \h, \H, \v, and \V are features that were added to Perl
- at release 5.10. In contrast to the other sequences, which match only
- ASCII characters by default, these always match certain high-valued
- codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon-
+ The sequences \h, \H, \v, and \V are features that were added to Perl
+ at release 5.10. In contrast to the other sequences, which match only
+ ASCII characters by default, these always match certain high-valued
+ codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon-
tal space characters are:
U+0009 Horizontal tab
@@ -3763,104 +3811,104 @@ BACKSLASH
Newline sequences
- Outside a character class, by default, the escape sequence \R matches
+ Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
- This is an example of an "atomic group", details of which are given
+ This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
- CR followed by LF, or one of the single characters LF (linefeed,
+ CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
return, U+000D), or NEL (next line, U+0085). The two-character sequence
is treated as a single unit that cannot be split.
- In UTF-8 mode, two additional characters whose codepoints are greater
+ In UTF-8 mode, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
- rator, U+2029). Unicode character property support is not needed for
+ rator, U+2029). Unicode character property support is not needed for
these characters to be recognized.
It is possible to restrict \R to match only CR, LF, or CRLF (instead of
- the complete set of Unicode line endings) by setting the option
+ the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbreviation for "backslash R".) This can be made the default
- when PCRE is built; if this is the case, the other behaviour can be
- requested via the PCRE_BSR_UNICODE option. It is also possible to
- specify these settings by starting a pattern string with one of the
+ when PCRE is built; if this is the case, the other behaviour can be
+ requested via the PCRE_BSR_UNICODE option. It is also possible to
+ specify these settings by starting a pattern string with one of the
following sequences:
(*BSR_ANYCRLF) CR, LF, or CRLF only
(*BSR_UNICODE) any Unicode newline sequence
- These override the default and the options given to pcre_compile() or
- pcre_compile2(), but they can be overridden by options given to
+ These override the default and the options given to pcre_compile() or
+ pcre_compile2(), but they can be overridden by options given to
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
- are not Perl-compatible, are recognized only at the very start of a
- pattern, and that they must be in upper case. If more than one of them
+ are not Perl-compatible, are recognized only at the very start of a
+ pattern, and that they must be in upper case. If more than one of them
is present, the last one is used. They can be combined with a change of
newline convention; for example, a pattern can start with:
(*ANY)(*BSR_ANYCRLF)
They can also be combined with the (*UTF8) or (*UCP) special sequences.
- Inside a character class, \R is treated as an unrecognized escape
+ Inside a character class, \R is treated as an unrecognized escape
sequence, and so matches the letter "R" by default, but causes an error
if PCRE_EXTRA is set.
Unicode character properties
When PCRE is built with Unicode character property support, three addi-
- tional escape sequences that match characters with specific properties
- are available. When not in UTF-8 mode, these sequences are of course
- limited to testing characters whose codepoints are less than 256, but
+ tional escape sequences that match characters with specific properties
+ are available. When not in UTF-8 mode, these sequences are of course
+ limited to testing characters whose codepoints are less than 256, but
they do work in this mode. The extra escape sequences are:
\p{xx} a character with the xx property
\P{xx} a character without the xx property
\X an extended Unicode sequence
- The property names represented by xx above are limited to the Unicode
+ The property names represented by xx above are limited to the Unicode
script names, the general category properties, "Any", which matches any
- character (including newline), and some special PCRE properties
- (described in the next section). Other Perl properties such as "InMu-
- sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
+ character (including newline), and some special PCRE properties
+ (described in the next section). Other Perl properties such as "InMu-
+ sicalSymbols" are not currently supported by PCRE. Note that \P{Any}
does not match any characters, so always causes a match failure.
Sets of Unicode characters are defined as belonging to certain scripts.
- A character from one of these sets can be matched using a script name.
+ A character from one of these sets can be matched using a script name.
For example:
\p{Greek}
\P{Han}
- Those that are not part of an identified script are lumped together as
+ Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
- Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
- Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp-
- tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
- Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe-
+ Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
+ Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp-
+ tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek,
+ Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe-
rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
- Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
+ Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam,
- Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
- Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,
- Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,
- Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
- Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
+ Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,
+ Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya,
+ Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian,
+ Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le,
+ Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh,
Ugaritic, Vai, Yi.
Each character has exactly one Unicode general category property, spec-
- ified by a two-letter abbreviation. For compatibility with Perl, nega-
- tion can be specified by including a circumflex between the opening
- brace and the property name. For example, \p{^Lu} is the same as
+ ified by a two-letter abbreviation. For compatibility with Perl, nega-
+ tion can be specified by including a circumflex between the opening
+ brace and the property name. For example, \p{^Lu} is the same as
\P{Lu}.
If only one letter is specified with \p or \P, it includes all the gen-
- eral category properties that start with that letter. In this case, in
- the absence of negation, the curly brackets in the escape sequence are
+ eral category properties that start with that letter. In this case, in
+ the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:
\p{L}
@@ -3912,54 +3960,54 @@ BACKSLASH
Zp Paragraph separator
Zs Space separator
- The special property L& is also supported: it matches a character that
- has the Lu, Ll, or Lt property, in other words, a letter that is not
+ The special property L& is also supported: it matches a character that
+ has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".
- The Cs (Surrogate) property applies only to characters in the range
- U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
+ The Cs (Surrogate) property applies only to characters in the range
+ U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
- ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
+ ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
the pcreapi page). Perl does not support the Cs property.
- The long synonyms for property names that Perl supports (such as
- \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
+ The long synonyms for property names that Perl supports (such as
+ \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".
No character that is in the Unicode table has the Cn (unassigned) prop-
erty. Instead, this property is assumed for any code point that is not
in the Unicode table.
- Specifying caseless matching does not affect these escape sequences.
+ Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.
- The \X escape matches any number of Unicode characters that form an
+ The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to
(?>\PM\pM*)
- That is, it matches a character without the "mark" property, followed
- by zero or more characters with the "mark" property, and treats the
- sequence as an atomic group (see below). Characters with the "mark"
- property are typically accents that affect the preceding character.
- None of them have codepoints less than 256, so in non-UTF-8 mode \X
+ That is, it matches a character without the "mark" property, followed
+ by zero or more characters with the "mark" property, and treats the
+ sequence as an atomic group (see below). Characters with the "mark"
+ property are typically accents that affect the preceding character.
+ None of them have codepoints less than 256, so in non-UTF-8 mode \X
matches any one character.
Note that recent versions of Perl have changed \X to match what Unicode
calls an "extended grapheme cluster", which has a more complicated def-
inition.
- Matching characters by Unicode property is not fast, because PCRE has
- to search a structure that contains data for over fifteen thousand
+ Matching characters by Unicode property is not fast, because PCRE has
+ to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
- \w do not use Unicode properties in PCRE by default, though you can
+ \w do not use Unicode properties in PCRE by default, though you can
make them do so by setting the PCRE_UCP option for pcre_compile() or by
starting the pattern with (*UCP).
PCRE's additional properties
- As well as the standard Unicode properties described in the previous
- section, PCRE supports four more that make it possible to convert tra-
+ As well as the standard Unicode properties described in the previous
+ section, PCRE supports four more that make it possible to convert tra-
ditional escape sequences such as \w and \s and POSIX character classes
to use Unicode properties. PCRE uses these non-standard, non-Perl prop-
erties internally when PCRE_UCP is set. They are:
@@ -3969,40 +4017,40 @@ BACKSLASH
Xsp Any Perl space character
Xwd Any Perl "word" character
- Xan matches characters that have either the L (letter) or the N (num-
- ber) property. Xps matches the characters tab, linefeed, vertical tab,
- formfeed, or carriage return, and any other character that has the Z
+ Xan matches characters that have either the L (letter) or the N (num-
+ ber) property. Xps matches the characters tab, linefeed, vertical tab,
+ formfeed, or carriage return, and any other character that has the Z
(separator) property. Xsp is the same as Xps, except that vertical tab
is excluded. Xwd matches the same characters as Xan, plus underscore.
Resetting the match start
- The escape sequence \K causes any previously matched characters not to
+ The escape sequence \K causes any previously matched characters not to
be included in the final matched sequence. For example, the pattern:
foo\Kbar
- matches "foobar", but reports that it has matched "bar". This feature
- is similar to a lookbehind assertion (described below). However, in
- this case, the part of the subject before the real match does not have
- to be of fixed length, as lookbehind assertions do. The use of \K does
- not interfere with the setting of captured substrings. For example,
+ matches "foobar", but reports that it has matched "bar". This feature
+ is similar to a lookbehind assertion (described below). However, in
+ this case, the part of the subject before the real match does not have
+ to be of fixed length, as lookbehind assertions do. The use of \K does
+ not interfere with the setting of captured substrings. For example,
when the pattern
(foo)\Kbar
matches "foobar", the first substring is still set to "foo".
- Perl documents that the use of \K within assertions is "not well
- defined". In PCRE, \K is acted upon when it occurs inside positive
+ Perl documents that the use of \K within assertions is "not well
+ defined". In PCRE, \K is acted upon when it occurs inside positive
assertions, but is ignored in negative assertions.
Simple assertions
- The final use of backslash is for certain simple assertions. An asser-
- tion specifies a condition that has to be met at a particular point in
- a match, without consuming any characters from the subject string. The
- use of subpatterns for more complicated assertions is described below.
+ The final use of backslash is for certain simple assertions. An asser-
+ tion specifies a condition that has to be met at a particular point in
+ a match, without consuming any characters from the subject string. The
+ use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:
\b matches at a word boundary
@@ -4013,49 +4061,49 @@ BACKSLASH
\z matches only at the end of the subject
\G matches at the first matching position in the subject
- Inside a character class, \b has a different meaning; it matches the
- backspace character. If any other of these assertions appears in a
- character class, by default it matches the corresponding literal char-
+ Inside a character class, \b has a different meaning; it matches the
+ backspace character. If any other of these assertions appears in a
+ character class, by default it matches the corresponding literal char-
acter (for example, \B matches the letter B). However, if the
- PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
+ PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
ated instead.
- A word boundary is a position in the subject string where the current
- character and the previous character do not both match \w or \W (i.e.
- one matches \w and the other matches \W), or the start or end of the
- string if the first or last character matches \w, respectively. In
- UTF-8 mode, the meanings of \w and \W can be changed by setting the
- PCRE_UCP option. When this is done, it also affects \b and \B. Neither
- PCRE nor Perl has a separate "start of word" or "end of word" metase-
- quence. However, whatever follows \b normally determines which it is.
+ A word boundary is a position in the subject string where the current
+ character and the previous character do not both match \w or \W (i.e.
+ one matches \w and the other matches \W), or the start or end of the
+ string if the first or last character matches \w, respectively. In
+ UTF-8 mode, the meanings of \w and \W can be changed by setting the
+ PCRE_UCP option. When this is done, it also affects \b and \B. Neither
+ PCRE nor Perl has a separate "start of word" or "end of word" metase-
+ quence. However, whatever follows \b normally determines which it is.
For example, the fragment \ba matches "a" at the start of a word.
- The \A, \Z, and \z assertions differ from the traditional circumflex
+ The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
- at the very start and end of the subject string, whatever options are
- set. Thus, they are independent of multiline mode. These three asser-
+ at the very start and end of the subject string, whatever options are
+ set. Thus, they are independent of multiline mode. These three asser-
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
- affect only the behaviour of the circumflex and dollar metacharacters.
- However, if the startoffset argument of pcre_exec() is non-zero, indi-
+ affect only the behaviour of the circumflex and dollar metacharacters.
+ However, if the startoffset argument of pcre_exec() is non-zero, indi-
cating that matching is to start at a point other than the beginning of
- the subject, \A can never match. The difference between \Z and \z is
+ the subject, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at
the very end, whereas \z matches only at the end.
- The \G assertion is true only when the current matching position is at
- the start point of the match, as specified by the startoffset argument
- of pcre_exec(). It differs from \A when the value of startoffset is
- non-zero. By calling pcre_exec() multiple times with appropriate argu-
+ The \G assertion is true only when the current matching position is at
+ the start point of the match, as specified by the startoffset argument
+ of pcre_exec(). It differs from \A when the value of startoffset is
+ non-zero. By calling pcre_exec() multiple times with appropriate argu-
ments, you can mimic Perl's /g option, and it is in this kind of imple-
mentation where \G can be useful.
- Note, however, that PCRE's interpretation of \G, as the start of the
+ Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
- end of the previous match. In Perl, these can be different when the
- previously matched string was empty. Because PCRE does just one match
+ end of the previous match. In Perl, these can be different when the
+ previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.
- If all the alternatives of a pattern begin with \G, the expression is
+ If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.
@@ -4063,80 +4111,81 @@ BACKSLASH
CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the circumflex
- character is an assertion that is true only if the current matching
- point is at the start of the subject string. If the startoffset argu-
- ment of pcre_exec() is non-zero, circumflex can never match if the
- PCRE_MULTILINE option is unset. Inside a character class, circumflex
+ character is an assertion that is true only if the current matching
+ point is at the start of the subject string. If the startoffset argu-
+ ment of pcre_exec() is non-zero, circumflex can never match if the
+ PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).
- Circumflex need not be the first character of the pattern if a number
- of alternatives are involved, but it should be the first thing in each
- alternative in which it appears if the pattern is ever to match that
- branch. If all possible alternatives start with a circumflex, that is,
- if the pattern is constrained to match only at the start of the sub-
- ject, it is said to be an "anchored" pattern. (There are also other
+ Circumflex need not be the first character of the pattern if a number
+ of alternatives are involved, but it should be the first thing in each
+ alternative in which it appears if the pattern is ever to match that
+ branch. If all possible alternatives start with a circumflex, that is,
+ if the pattern is constrained to match only at the start of the sub-
+ ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
- A dollar character is an assertion that is true only if the current
- matching point is at the end of the subject string, or immediately
+ A dollar character is an assertion that is true only if the current
+ matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
- be the last character of the pattern if a number of alternatives are
- involved, but it should be the last item in any branch in which it
+ be the last character of the pattern if a number of alternatives are
+ involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.
- The meaning of dollar can be changed so that it matches only at the
- very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
+ The meaning of dollar can be changed so that it matches only at the
+ very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are changed if the
- PCRE_MULTILINE option is set. When this is the case, a circumflex
- matches immediately after internal newlines as well as at the start of
- the subject string. It does not match after a newline that ends the
- string. A dollar matches before any newlines in the string, as well as
- at the very end, when PCRE_MULTILINE is set. When newline is specified
- as the two-character sequence CRLF, isolated CR and LF characters do
+ PCRE_MULTILINE option is set. When this is the case, a circumflex
+ matches immediately after internal newlines as well as at the start of
+ the subject string. It does not match after a newline that ends the
+ string. A dollar matches before any newlines in the string, as well as
+ at the very end, when PCRE_MULTILINE is set. When newline is specified
+ as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.
- For example, the pattern /^abc$/ matches the subject string "def\nabc"
- (where \n represents a newline) in multiline mode, but not otherwise.
- Consequently, patterns that are anchored in single line mode because
- all branches start with ^ are not anchored in multiline mode, and a
- match for circumflex is possible when the startoffset argument of
- pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
+ For example, the pattern /^abc$/ matches the subject string "def\nabc"
+ (where \n represents a newline) in multiline mode, but not otherwise.
+ Consequently, patterns that are anchored in single line mode because
+ all branches start with ^ are not anchored in multiline mode, and a
+ match for circumflex is possible when the startoffset argument of
+ pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.
- Note that the sequences \A, \Z, and \z can be used to match the start
- and end of the subject in both modes, and if all branches of a pattern
- start with \A it is always anchored, whether or not PCRE_MULTILINE is
+ Note that the sequences \A, \Z, and \z can be used to match the start
+ and end of the subject in both modes, and if all branches of a pattern
+ start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.
FULL STOP (PERIOD, DOT) AND \N
Outside a character class, a dot in the pattern matches any one charac-
- ter in the subject string except (by default) a character that signi-
- fies the end of a line. In UTF-8 mode, the matched character may be
+ ter in the subject string except (by default) a character that signi-
+ fies the end of a line. In UTF-8 mode, the matched character may be
more than one byte long.
- When a line ending is defined as a single character, dot never matches
- that character; when the two-character sequence CRLF is used, dot does
- not match CR if it is immediately followed by LF, but otherwise it
- matches all characters (including isolated CRs and LFs). When any Uni-
- code line endings are being recognized, dot does not match CR or LF or
+ When a line ending is defined as a single character, dot never matches
+ that character; when the two-character sequence CRLF is used, dot does
+ not match CR if it is immediately followed by LF, but otherwise it
+ matches all characters (including isolated CRs and LFs). When any Uni-
+ code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters.
- The behaviour of dot with regard to newlines can be changed. If the
- PCRE_DOTALL option is set, a dot matches any one character, without
+ The behaviour of dot with regard to newlines can be changed. If the
+ PCRE_DOTALL option is set, a dot matches any one character, without
exception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it.
- The handling of dot is entirely independent of the handling of circum-
- flex and dollar, the only relationship being that they both involve
+ The handling of dot is entirely independent of the handling of circum-
+ flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.
- The escape sequence \N behaves like a dot, except that it is not
- affected by the PCRE_DOTALL option. In other words, it matches any
- character except one that signifies the end of a line.
+ The escape sequence \N behaves like a dot, except that it is not
+ affected by the PCRE_DOTALL option. In other words, it matches any
+ character except one that signifies the end of a line. Perl also uses
+ \N to match characters by name; PCRE does not support this.
MATCHING A SINGLE BYTE
@@ -4153,7 +4202,7 @@ MATCHING A SINGLE BYTE
PCRE_NO_UTF8_CHECK option is used).
PCRE does not allow \C to appear in lookbehind assertions (described
- below), because in UTF-8 mode this would make it impossible to calcu-
+ below) in UTF-8 mode, because this would make it impossible to calcu-
late the length of the lookbehind.
In general, the \C escape sequence is best avoided in UTF-8 mode. How-
@@ -5060,40 +5109,41 @@ ASSERTIONS
then try to match. If there are insufficient characters before the cur-
rent position, the assertion fails.
- PCRE does not allow the \C escape (which matches a single byte in UTF-8
- mode) to appear in lookbehind assertions, because it makes it impossi-
- ble to calculate the length of the lookbehind. The \X and \R escapes,
- which can match different numbers of bytes, are also not permitted.
+ In UTF-8 mode, PCRE does not allow the \C escape (which matches a sin-
+ gle byte, even in UTF-8 mode) to appear in lookbehind assertions,
+ because it makes it impossible to calculate the length of the lookbe-
+ hind. The \X and \R escapes, which can match different numbers of
+ bytes, are also not permitted.
- "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
- lookbehinds, as long as the subpattern matches a fixed-length string.
+ "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
+ lookbehinds, as long as the subpattern matches a fixed-length string.
Recursion, however, is not supported.
- Possessive quantifiers can be used in conjunction with lookbehind
+ Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching of fixed-length strings at the
end of subject strings. Consider a simple pattern such as
abcd$
- when applied to a long string that does not match. Because matching
+ when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
- and then see if what follows matches the rest of the pattern. If the
+ and then see if what follows matches the rest of the pattern. If the
pattern is specified as
^.*abcd$
- the initial .* matches the entire string at first, but when this fails
+ the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
- last character, then all but the last two characters, and so on. Once
- again the search for "a" covers the entire string, from right to left,
+ last character, then all but the last two characters, and so on. Once
+ again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as
^.*+(?<=abcd)
- there can be no backtracking for the .*+ item; it can match only the
- entire string. The subsequent lookbehind assertion does a single test
- on the last four characters. If it fails, the match fails immediately.
- For long strings, this approach makes a significant difference to the
+ there can be no backtracking for the .*+ item; it can match only the
+ entire string. The subsequent lookbehind assertion does a single test
+ on the last four characters. If it fails, the match fails immediately.
+ For long strings, this approach makes a significant difference to the
processing time.
Using multiple assertions
@@ -5102,18 +5152,18 @@ ASSERTIONS
(?<=\d{3})(?<!999)foo
- matches "foo" preceded by three digits that are not "999". Notice that
- each of the assertions is applied independently at the same point in
- the subject string. First there is a check that the previous three
- characters are all digits, and then there is a check that the same
+ matches "foo" preceded by three digits that are not "999". Notice that
+ each of the assertions is applied independently at the same point in
+ the subject string. First there is a check that the previous three
+ characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo" pre-
- ceded by six characters, the first three of which are digits and the last
- three of which are not "999". For example, it doesn't match "123abc-
+ ceded by six characters, the first three of which are digits and the last
+ three of which are not "999". For example, it doesn't match "123abc-
foo". A pattern to do that is
(?<=\d{3}...)(?<!999)foo
- This time the first assertion looks at the preceding six characters,
+ This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".
@@ -5121,29 +5171,29 @@ ASSERTIONS
(?<=(?<!foo)bar)baz
- matches an occurrence of "baz" that is preceded by "bar" which in turn
+ matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while
(?<=\d{3}(?!999)...)foo
- is another pattern that matches "foo" preceded by three digits and any
+ is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".
CONDITIONAL SUBPATTERNS
- It is possible to cause the matching process to obey a subpattern con-
- ditionally or to choose between two alternative subpatterns, depending
- on the result of an assertion, or whether a specific capturing subpat-
- tern has already been matched. The two possible forms of conditional
+ It is possible to cause the matching process to obey a subpattern con-
+ ditionally or to choose between two alternative subpatterns, depending
+ on the result of an assertion, or whether a specific capturing subpat-
+ tern has already been matched. The two possible forms of conditional
subpattern are:
(?(condition)yes-pattern)
(?(condition)yes-pattern|no-pattern)
- If the condition is satisfied, the yes-pattern is used; otherwise the
- no-pattern (if present) is used. If there are more than two alterna-
- tives in the subpattern, a compile-time error occurs. Each of the two
+ If the condition is satisfied, the yes-pattern is used; otherwise the
+ no-pattern (if present) is used. If there are more than two alterna-
+ tives in the subpattern, a compile-time error occurs. Each of the two
alternatives may itself contain nested subpatterns of any form, includ-
ing conditional subpatterns; the restriction to two alternatives
applies only at the level of the condition. This pattern fragment is an
@@ -5152,73 +5202,73 @@ CONDITIONAL SUBPATTERNS
(?(1) (A|B|C) | (D | (?(2)E|F) | E) )
- There are four kinds of condition: references to subpatterns, refer-
+ There are four kinds of condition: references to subpatterns, refer-
ences to recursion, a pseudo-condition called DEFINE, and assertions.
Checking for a used subpattern by number
- If the text between the parentheses consists of a sequence of digits,
+ If the text between the parentheses consists of a sequence of digits,
the condition is true if a capturing subpattern of that number has pre-
- viously matched. If there is more than one capturing subpattern with
- the same number (see the earlier section about duplicate subpattern
- numbers), the condition is true if any of them have matched. An alter-
- native notation is to precede the digits with a plus or minus sign. In
- this case, the subpattern number is relative rather than absolute. The
- most recently opened parentheses can be referenced by (?(-1), the next
- most recent by (?(-2), and so on. Inside loops it can also make sense
+ viously matched. If there is more than one capturing subpattern with
+ the same number (see the earlier section about duplicate subpattern
+ numbers), the condition is true if any of them have matched. An alter-
+ native notation is to precede the digits with a plus or minus sign. In
+ this case, the subpattern number is relative rather than absolute. The
+ most recently opened parentheses can be referenced by (?(-1), the next
+ most recent by (?(-2), and so on. Inside loops it can also make sense
to refer to subsequent groups. The next parentheses to be opened can be
- referenced as (?(+1), and so on. (The value zero in any of these forms
+ referenced as (?(+1), and so on. (The value zero in any of these forms
is not used; it provokes a compile-time error.)
- Consider the following pattern, which contains non-significant white
+ Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
- The first part matches an optional opening parenthesis, and if that
+ The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The sec-
- ond part matches one or more characters that are not parentheses. The
- third part is a conditional subpattern that tests whether or not the
- first set of parentheses matched. If they did, that is, if the subject
- started with an opening parenthesis, the condition is true, and so the
- yes-pattern is executed and a closing parenthesis is required. Other-
- wise, since no-pattern is not present, the subpattern matches nothing.
- In other words, this pattern matches a sequence of non-parentheses,
+ ond part matches one or more characters that are not parentheses. The
+ third part is a conditional subpattern that tests whether or not the
+ first set of parentheses matched. If they did, that is, if the subject
+ started with an opening parenthesis, the condition is true, and so the
+ yes-pattern is executed and a closing parenthesis is required. Other-
+ wise, since no-pattern is not present, the subpattern matches nothing.
+ In other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
- If you were embedding this pattern in a larger one, you could use a
+ If you were embedding this pattern in a larger one, you could use a
relative reference:
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
- This makes the fragment independent of the parentheses in the larger
+ This makes the fragment independent of the parentheses in the larger
pattern.
Checking for a used subpattern by name
- Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
- used subpattern by name. For compatibility with earlier versions of
- PCRE, which had this facility before Perl, the syntax (?(name)...) is
- also recognized. However, there is a possible ambiguity with this syn-
- tax, because subpattern names may consist entirely of digits. PCRE
- looks first for a named subpattern; if it cannot find one and the name
- consists entirely of digits, PCRE looks for a subpattern of that num-
- ber, which must be greater than zero. Using subpattern names that con-
+ Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
+ used subpattern by name. For compatibility with earlier versions of
+ PCRE, which had this facility before Perl, the syntax (?(name)...) is
+ also recognized. However, there is a possible ambiguity with this syn-
+ tax, because subpattern names may consist entirely of digits. PCRE
+ looks first for a named subpattern; if it cannot find one and the name
+ consists entirely of digits, PCRE looks for a subpattern of that num-
+ ber, which must be greater than zero. Using subpattern names that con-
sist entirely of digits is not recommended.
Rewriting the above example to use a named subpattern gives this:
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
- If the name used in a condition of this kind is a duplicate, the test
- is applied to all subpatterns of the same name, and is true if any one
+ If the name used in a condition of this kind is a duplicate, the test
+ is applied to all subpatterns of the same name, and is true if any one
of them has matched.
Checking for pattern recursion
If the condition is the string (R), and there is no subpattern with the
- name R, the condition is true if a recursive call to the whole pattern
+ name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by amper-
sand follow the letter R, for example:
@@ -5226,51 +5276,51 @@ CONDITIONAL SUBPATTERNS
the condition is true if the most recent recursion is into a subpattern
whose number or name is given. This condition does not check the entire
- recursion stack. If the name used in a condition of this kind is a
+ recursion stack. If the name used in a condition of this kind is a
duplicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion.
- At "top level", all these recursion test conditions are false. The
+ At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below.
Defining subpatterns for use by reference only
- If the condition is the string (DEFINE), and there is no subpattern
- with the name DEFINE, the condition is always false. In this case,
- there may be only one alternative in the subpattern. It is always
- skipped if control reaches this point in the pattern; the idea of
- DEFINE is that it can be used to define subroutines that can be refer-
- enced from elsewhere. (The use of subroutines is described below.) For
- example, a pattern to match an IPv4 address such as "192.168.23.245"
+ If the condition is the string (DEFINE), and there is no subpattern
+ with the name DEFINE, the condition is always false. In this case,
+ there may be only one alternative in the subpattern. It is always
+ skipped if control reaches this point in the pattern; the idea of
+ DEFINE is that it can be used to define subroutines that can be refer-
+ enced from elsewhere. (The use of subroutines is described below.) For
+ example, a pattern to match an IPv4 address such as "192.168.23.245"
could be written like this (ignore whitespace and line breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
- The first part of the pattern is a DEFINE group inside which another
- group named "byte" is defined. This matches an individual component of
- an IPv4 address (a number less than 256). When matching takes place,
- this part of the pattern is skipped because DEFINE acts like a false
- condition. The rest of the pattern uses references to the named group
- to match the four dot-separated components of an IPv4 address, insist-
+ The first part of the pattern is a DEFINE group inside which another
+ group named "byte" is defined. This matches an individual component of
+ an IPv4 address (a number less than 256). When matching takes place,
+ this part of the pattern is skipped because DEFINE acts like a false
+ condition. The rest of the pattern uses references to the named group
+ to match the four dot-separated components of an IPv4 address, insist-
ing on a word boundary at each end.
Assertion conditions
- If the condition is not in any of the above formats, it must be an
- assertion. This may be a positive or negative lookahead or lookbehind
- assertion. Consider this pattern, again containing non-significant
+ If the condition is not in any of the above formats, it must be an
+ assertion. This may be a positive or negative lookahead or lookbehind
+ assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
- The condition is a positive lookahead assertion that matches an
- optional sequence of non-letters followed by a letter. In other words,
- it tests for the presence of at least one letter in the subject. If a
- letter is found, the subject is matched against the first alternative;
- otherwise it is matched against the second. This pattern matches
- strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
+ The condition is a positive lookahead assertion that matches an
+ optional sequence of non-letters followed by a letter. In other words,
+ it tests for the presence of at least one letter in the subject. If a
+ letter is found, the subject is matched against the first alternative;
+ otherwise it is matched against the second. This pattern matches
+ strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
@@ -5279,41 +5329,41 @@ COMMENTS
There are two ways of including comments in patterns that are processed
by PCRE. In both cases, the start of the comment must not be in a char-
acter class, nor in the middle of any other sequence of related charac-
- ters such as (?: or a subpattern name or number. The characters that
+ ters such as (?: or a subpattern name or number. The characters that
make up a comment play no part in the pattern matching.
- The sequence (?# marks the start of a comment that continues up to the
- next closing parenthesis. Nested parentheses are not permitted. If the
+ The sequence (?# marks the start of a comment that continues up to the
+ next closing parenthesis. Nested parentheses are not permitted. If the
PCRE_EXTENDED option is set, an unescaped # character also introduces a
- comment, which in this case continues to immediately after the next
- newline character or character sequence in the pattern. Which charac-
+ comment, which in this case continues to immediately after the next
+ newline character or character sequence in the pattern. Which charac-
ters are interpreted as newlines is controlled by the options passed to
pcre_compile() or by a special sequence at the start of the pattern, as
- described in the section entitled "Newline conventions" above. Note
- that the end of this type of comment is a literal newline sequence in
+ described in the section entitled "Newline conventions" above. Note
+ that the end of this type of comment is a literal newline sequence in
the pattern; escape sequences that happen to represent a newline do not
- count. For example, consider this pattern when PCRE_EXTENDED is set,
+ count. For example, consider this pattern when PCRE_EXTENDED is set,
and the default newline convention is in force:
abc #comment \n still comment
- On encountering the # character, pcre_compile() skips along, looking
- for a newline in the pattern. The sequence \n is still literal at this
- stage, so it does not terminate the comment. Only an actual character
+ On encountering the # character, pcre_compile() skips along, looking
+ for a newline in the pattern. The sequence \n is still literal at this
+ stage, so it does not terminate the comment. Only an actual character
with the code value 0x0a (the default newline) does so.
RECURSIVE PATTERNS
- Consider the problem of matching a string in parentheses, allowing for
- unlimited nested parentheses. Without the use of recursion, the best
- that can be done is to use a pattern that matches up to some fixed
- depth of nesting. It is not possible to handle an arbitrary nesting
+ Consider the problem of matching a string in parentheses, allowing for
+ unlimited nested parentheses. Without the use of recursion, the best
+ that can be done is to use a pattern that matches up to some fixed
+ depth of nesting. It is not possible to handle an arbitrary nesting
depth.
For some time, Perl has provided a facility that allows regular expres-
- sions to recurse (amongst other things). It does this by interpolating
- Perl code in the expression at run time, and the code can refer to the
+ sions to recurse (amongst other things). It does this by interpolating
+ Perl code in the expression at run time, and the code can refer to the
expression itself. A Perl pattern using code interpolation to solve the
parentheses problem can be created like this:
@@ -5323,201 +5373,201 @@ RECURSIVE PATTERNS
refers recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code. Instead,
- it supports special syntax for recursion of the entire pattern, and
- also for individual subpattern recursion. After its introduction in
- PCRE and Python, this kind of recursion was subsequently introduced
+ it supports special syntax for recursion of the entire pattern, and
+ also for individual subpattern recursion. After its introduction in
+ PCRE and Python, this kind of recursion was subsequently introduced
into Perl at release 5.10.
- A special item that consists of (? followed by a number greater than
- zero and a closing parenthesis is a recursive subroutine call of the
- subpattern of the given number, provided that it occurs inside that
- subpattern. (If not, it is a non-recursive subroutine call, which is
- described in the next section.) The special item (?R) or (?0) is a
+ A special item that consists of (? followed by a number greater than
+ zero and a closing parenthesis is a recursive subroutine call of the
+ subpattern of the given number, provided that it occurs inside that
+ subpattern. (If not, it is a non-recursive subroutine call, which is
+ described in the next section.) The special item (?R) or (?0) is a
recursive call of the entire regular expression.
- This PCRE pattern solves the nested parentheses problem (assume the
+ This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
\( ( [^()]++ | (?R) )* \)
- First it matches an opening parenthesis. Then it matches any number of
- substrings which can either be a sequence of non-parentheses, or a
- recursive match of the pattern itself (that is, a correctly parenthe-
+ First it matches an opening parenthesis. Then it matches any number of
+ substrings which can either be a sequence of non-parentheses, or a
+ recursive match of the pattern itself (that is, a correctly parenthe-
sized substring). Finally there is a closing parenthesis. Note the use
of a possessive quantifier to avoid backtracking into sequences of non-
parentheses.
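The behaviour of this pattern can be sketched outside PCRE as a small recursive function. The following Python is only an illustration, not PCRE code: the name match_parens is hypothetical, and the function is anchored at the given position rather than searching along the subject. It mirrors the three parts of the pattern: an opening parenthesis, any number of non-parenthesis runs or nested groups, and a closing parenthesis.

```python
def match_parens(s: str, pos: int = 0) -> int:
    """Return the index just past a balanced group starting at pos, or -1.

    Hand-written analogue of  \\( ( [^()]++ | (?R) )* \\)  (hypothetical helper).
    """
    # \( : the group must start with an opening parenthesis
    if pos >= len(s) or s[pos] != "(":
        return -1
    i = pos + 1
    while i < len(s):
        if s[i] == "(":
            # (?R) : recurse for a nested, correctly parenthesized substring
            end = match_parens(s, i)
            if end < 0:
                return -1
            i = end
        elif s[i] == ")":
            # \) : closing parenthesis ends this level
            return i + 1
        else:
            # [^()]++ : consume the whole run of non-parentheses; like the
            # possessive quantifier, the run is never given back piecemeal
            while i < len(s) and s[i] not in "()":
                i += 1
    # Ran out of subject before the closing parenthesis
    return -1
```

Because each run of non-parentheses is consumed exactly once, an unbalanced subject is rejected in a single pass, which is the effect the possessive quantifier has in the pattern.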
- If this were part of a larger pattern, you would not want to recurse
+ If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:
( \( ( [^()]++ | (?1) )* \) )
- We have put the pattern into parentheses, and caused the recursion to
+ We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.
- In a larger pattern, keeping track of parenthesis numbers can be
- tricky. This is made easier by the use of relative references. Instead
+ In a larger pattern, keeping track of parenthesis numbers can be
+ tricky. This is made easier by the use of relative references. Instead
of (?1) in the pattern above you can write (?-2) to refer to the second
- most recently opened parentheses preceding the recursion. In other
- words, a negative number counts capturing parentheses leftwards from
+ most recently opened parentheses preceding the recursion. In other
+ words, a negative number counts capturing parentheses leftwards from
the point at which it is encountered.
- It is also possible to refer to subsequently opened parentheses, by
- writing references such as (?+2). However, these cannot be recursive
- because the reference is not inside the parentheses that are refer-
- enced. They are always non-recursive subroutine calls, as described in
+ It is also possible to refer to subsequently opened parentheses, by
+ writing references such as (?+2). However, these cannot be recursive
+ because the reference is not inside the parentheses that are refer-
+ enced. They are always non-recursive subroutine calls, as described in
the next section.
- An alternative approach is to use named parentheses instead. The Perl
- syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
+ An alternative approach is to use named parentheses instead. The Perl
+ syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
- If there is more than one subpattern with the same name, the earliest
+ If there is more than one subpattern with the same name, the earliest
one is used.
- This particular example pattern that we have been looking at contains
+ This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of a possessive quantifier for
matching strings of non-parentheses is important when applying the pat-
- tern to strings that do not match. For example, when this pattern is
+ tern to strings that do not match. For example, when this pattern is
applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
- it yields "no match" quickly. However, if a possessive quantifier is
- not used, the match runs for a very long time indeed because there are
- so many different ways the + and * repeats can carve up the subject,
+ it yields "no match" quickly. However, if a possessive quantifier is
+ not used, the match runs for a very long time indeed because there are
+ so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported.
- At the end of a match, the values of capturing parentheses are those
- from the outermost level. If you want to obtain intermediate values, a
- callout function can be used (see below and the pcrecallout documenta-
+ At the end of a match, the values of capturing parentheses are those
+ from the outermost level. If you want to obtain intermediate values, a
+ callout function can be used (see below and the pcrecallout documenta-
tion). If the pattern above is matched against
(ab(cd)ef)
- the value for the inner capturing parentheses (numbered 2) is "ef",
- which is the last value taken on at the top level. If a capturing sub-
- pattern is not matched at the top level, its final captured value is
- unset, even if it was (temporarily) set at a deeper level during the
+ the value for the inner capturing parentheses (numbered 2) is "ef",
+ which is the last value taken on at the top level. If a capturing sub-
+ pattern is not matched at the top level, its final captured value is
+ unset, even if it was (temporarily) set at a deeper level during the
matching process.
- If there are more than 15 capturing parentheses in a pattern, PCRE has
- to obtain extra memory to store data during a recursion, which it does
+ If there are more than 15 capturing parentheses in a pattern, PCRE has
+ to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
- Do not confuse the (?R) item with the condition (R), which tests for
- recursion. Consider this pattern, which matches text in angle brack-
- ets, allowing for arbitrary nesting. Only digits are allowed in nested
- brackets (that is, when recursing), whereas any characters are permit-
+ Do not confuse the (?R) item with the condition (R), which tests for
+ recursion. Consider this pattern, which matches text in angle brack-
+ ets, allowing for arbitrary nesting. Only digits are allowed in nested
+ brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
- In this pattern, (?(R) is the start of a conditional subpattern, with
- two different alternatives for the recursive and non-recursive cases.
+ In this pattern, (?(R) is the start of a conditional subpattern, with
+ two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
Differences in recursion processing between PCRE and Perl
- Recursion processing in PCRE differs from Perl in two important ways.
- In PCRE (like Python, but unlike Perl), a recursive subpattern call is
+ Recursion processing in PCRE differs from Perl in two important ways.
+ In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
- alternatives and there is a subsequent matching failure. This can be
- illustrated by the following pattern, which purports to match a palin-
- dromic string that contains an odd number of characters (for example,
+ alternatives and there is a subsequent matching failure. This can be
+ illustrated by the following pattern, which purports to match a palin-
+ dromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"):
^(.|(.)(?1)\2)$
The idea is that it either matches a single character, or two identical
- characters surrounding a sub-palindrome. In Perl, this pattern works;
- in PCRE it does not if the subject is longer than three characters.
+ characters surrounding a sub-palindrome. In Perl, this pattern works;
+ in PCRE it does not if the subject is longer than three characters.
Consider the subject string "abcba":
- At the top level, the first character is matched, but as it is not at
+ At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alterna-
tive is taken and the recursion kicks in. The recursive call to subpat-
- tern 1 successfully matches the next character ("b"). (Note that the
+ tern 1 successfully matches the next character ("b"). (Note that the
beginning and end of line tests are not part of the recursion).
- Back at the top level, the next character ("c") is compared with what
- subpattern 2 matched, which was "a". This fails. Because the recursion
- is treated as an atomic group, there are now no backtracking points,
- and so the entire match fails. (Perl is able, at this point, to re-
- enter the recursion and try the second alternative.) However, if the
+ Back at the top level, the next character ("c") is compared with what
+ subpattern 2 matched, which was "a". This fails. Because the recursion
+ is treated as an atomic group, there are now no backtracking points,
+ and so the entire match fails. (Perl is able, at this point, to re-
+ enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:
^((.)(?1)\2|.)$
- This time, the recursing alternative is tried first, and continues to
- recurse until it runs out of characters, at which point the recursion
- fails. But this time we do have another alternative to try at the
- higher level. That is the big difference: in the previous case the
+ This time, the recursing alternative is tried first, and continues to
+ recurse until it runs out of characters, at which point the recursion
+ fails. But this time we do have another alternative to try at the
+ higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot
use.
- To change the pattern so that it matches all palindromic strings, not
- just those with an odd number of characters, it is tempting to change
+ To change the pattern so that it matches all palindromic strings, not
+ just those with an odd number of characters, it is tempting to change
the pattern to this:
^((.)(?1)\2|.?)$
- Again, this works in Perl, but not in PCRE, and for the same reason.
- When a deeper recursion has matched a single character, it cannot be
- entered again in order to match an empty string. The solution is to
- separate the two cases, and write out the odd and even cases as alter-
+ Again, this works in Perl, but not in PCRE, and for the same reason.
+ When a deeper recursion has matched a single character, it cannot be
+ entered again in order to match an empty string. The solution is to
+ separate the two cases, and write out the odd and even cases as alter-
natives at the higher level:
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
- If you want to match typical palindromic phrases, the pattern has to
+ If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
- Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
- ing into sequences of non-word characters. Without this, PCRE takes a
- great deal longer (ten times or more) to match typical phrases, and
+ Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
+ ing into sequences of non-word characters. Without this, PCRE takes a
+ great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
- WARNING: The palindrome-matching patterns above work only if the sub-
- ject string does not start with a palindrome that is shorter than the
- entire string. For example, although "abcba" is correctly matched, if
- the subject is "ababa", PCRE finds the palindrome "aba" at the start,
- then fails at top level because the end of the string does not follow.
- Once again, it cannot jump back into the recursion to try other alter-
+ WARNING: The palindrome-matching patterns above work only if the sub-
+ ject string does not start with a palindrome that is shorter than the
+ entire string. For example, although "abcba" is correctly matched, if
+ the subject is "ababa", PCRE finds the palindrome "aba" at the start,
+ then fails at top level because the end of the string does not follow.
+ Once again, it cannot jump back into the recursion to try other alter-
natives, so the entire match fails.
- The second way in which PCRE and Perl differ in their recursion pro-
- cessing is in the handling of captured values. In Perl, when a subpat-
- tern is called recursively or as a subroutine (see the next section),
- it has no access to any values that were captured outside the recur-
- sion, whereas in PCRE these values can be referenced. Consider this
+ The second way in which PCRE and Perl differ in their recursion pro-
+ cessing is in the handling of captured values. In Perl, when a subpat-
+ tern is called recursively or as a subroutine (see the next section),
+ it has no access to any values that were captured outside the recur-
+ sion, whereas in PCRE these values can be referenced. Consider this
pattern:
^(.)(\1|a(?2))
- In PCRE, this pattern matches "bab". The first capturing parentheses
- match "b", then in the second group, when the back reference \1 fails
- to match "b", the second alternative matches "a" and then recurses. In
- the recursion, \1 does now match "b" and so the whole match succeeds.
- In Perl, the pattern fails to match because inside the recursive call
+ In PCRE, this pattern matches "bab". The first capturing parentheses
+ match "b", then in the second group, when the back reference \1 fails
+ to match "b", the second alternative matches "a" and then recurses. In
+ the recursion, \1 does now match "b" and so the whole match succeeds.
+ In Perl, the pattern fails to match because inside the recursive call
\1 cannot access the externally set value.
SUBPATTERNS AS SUBROUTINES
- If the syntax for a recursive subpattern call (either by number or by
- name) is used outside the parentheses to which it refers, it operates
- like a subroutine in a programming language. The called subpattern may
- be defined before or after the reference. A numbered reference can be
+ If the syntax for a recursive subpattern call (either by number or by
+ name) is used outside the parentheses to which it refers, it operates
+ like a subroutine in a programming language. The called subpattern may
+ be defined before or after the reference. A numbered reference can be
absolute or relative, as in these examples:
(...(absolute)...)...(?2)...
@@ -5528,108 +5578,109 @@ SUBPATTERNS AS SUBROUTINES
(sens|respons)e and \1ibility
- matches "sense and sensibility" and "response and responsibility", but
+ matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
- is used, it does match "sense and responsibility" as well as the other
- two strings. Another example is given in the discussion of DEFINE
+ is used, it does match "sense and responsibility" as well as the other
+ two strings. Another example is given in the discussion of DEFINE
above.
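The difference between a back reference and a subroutine call can be tried out with a rough sketch in Python. Python's re module has no (?1) syntax, so the subroutine call is emulated here by repeating the text of the subpattern as a non-capturing group; that emulation is an assumption made for illustration, and it does not reproduce PCRE's atomic treatment of subroutine calls.

```python
import re

# The capturing group shared by both patterns
group = r"(sens|respons)"

# Back reference: \1 must repeat the exact text that group 1 matched
backref = re.compile(group + r"e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again,
# so either alternative may match the second time round
subcall = re.compile(group + r"e and (?:sens|respons)ibility")

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(subcall.fullmatch(subject)))
```

Only the emulated subroutine form accepts "sense and responsibility", because it re-applies the subpattern instead of re-using the captured text.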
- All subroutine calls, whether recursive or not, are always treated as
- atomic groups. That is, once a subroutine has matched some of the sub-
+ All subroutine calls, whether recursive or not, are always treated as
+ atomic groups. That is, once a subroutine has matched some of the sub-
ject string, it is never re-entered, even if it contains untried alter-
- natives and there is a subsequent matching failure. Any capturing
- parentheses that are set during the subroutine call revert to their
+ natives and there is a subsequent matching failure. Any capturing
+ parentheses that are set during the subroutine call revert to their
previous values afterwards.
- Processing options such as case-independence are fixed when a subpat-
- tern is defined, so if it is used as a subroutine, such options cannot
+ Processing options such as case-independence are fixed when a subpat-
+ tern is defined, so if it is used as a subroutine, such options cannot
be changed for different calls. For example, consider this pattern:
(abc)(?i:(?-1))
- It matches "abcabc". It does not match "abcABC" because the change of
+ It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
ONIGURUMA SUBROUTINE SYNTAX
- For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
- an alternative syntax for referencing a subpattern as a subroutine,
- possibly recursively. Here are two of the examples used above, rewrit-
+ an alternative syntax for referencing a subpattern as a subroutine,
+ possibly recursively. Here are two of the examples used above, rewrit-
ten using this syntax:
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility
- PCRE supports an extension to Oniguruma: if a number is preceded by a
+ PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\g<-1>)
- Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
- synonymous. The former is a back reference; the latter is a subroutine
+ Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
+ synonymous. The former is a back reference; the latter is a subroutine
call.
CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
- Perl code to be obeyed in the middle of matching a regular expression.
+ Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different sub-
strings that match the same pair of parentheses when there is a repeti-
tion.
PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
- an external function by putting its entry point in the global variable
- pcre_callout. By default, this variable contains NULL, which disables
+ an external function by putting its entry point in the global variable
+ pcre_callout. By default, this variable contains NULL, which disables
all calling out.
- Within a regular expression, (?C) indicates the points at which the
- external function is to be called. If you want to identify different
- callout points, you can put a number less than 256 after the letter C.
- The default value is zero. For example, this pattern has two callout
+ Within a regular expression, (?C) indicates the points at which the
+ external function is to be called. If you want to identify different
+ callout points, you can put a number less than 256 after the letter C.
+ The default value is zero. For example, this pattern has two callout
points:
(?C1)abc(?C2)def
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
- automatically installed before each item in the pattern. They are all
+ automatically installed before each item in the pattern. They are all
numbered 255.
During matching, when PCRE reaches a callout point (and pcre_callout is
- set), the external function is called. It is provided with the number
- of the callout, the position in the pattern, and, optionally, one item
- of data originally supplied by the caller of pcre_exec(). The callout
- function may cause matching to proceed, to backtrack, or to fail alto-
+ set), the external function is called. It is provided with the number
+ of the callout, the position in the pattern, and, optionally, one item
+ of data originally supplied by the caller of pcre_exec(). The callout
+ function may cause matching to proceed, to backtrack, or to fail alto-
gether. A complete description of the interface to the callout function
is given in the pcrecallout documentation.
BACKTRACKING CONTROL
- Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
+ Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are described in the Perl documentation as "experimental and sub-
- ject to change or removal in a future version of Perl". It goes on to
- say: "Their usage in production code should be noted to avoid problems
+ ject to change or removal in a future version of Perl". It goes on to
+ say: "Their usage in production code should be noted to avoid problems
during upgrades." The same remarks apply to the PCRE features described
in this section.
- Since these verbs are specifically related to backtracking, most of
- them can be used only when the pattern is to be matched using
+ Since these verbs are specifically related to backtracking, most of
+ them can be used only when the pattern is to be matched using
pcre_exec(), which uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, they cause an
error if encountered by pcre_dfa_exec().
- If any of these verbs are used in an assertion or in a subpattern that
+ If any of these verbs are used in an assertion or in a subpattern that
is called as a subroutine (whether or not recursively), their effect is
confined to that subpattern; it does not extend to the surrounding pat-
- tern, with one exception: a *MARK that is encountered in a positive
- assertion is passed back (compare capturing parentheses in assertions).
+ tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)
+ that is encountered in a successful positive assertion is passed back
+ when a match succeeds (compare capturing parentheses in assertions).
Note that such subpatterns are processed as anchored at the point where
they are tested. Note also that Perl's treatment of subroutines is dif-
ferent in some cases.
@@ -5652,59 +5703,61 @@ BACKTRACKING CONTROL
by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com-
pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
+ Experiments with Perl suggest that it too has similar optimizations,
+ sometimes leading to anomalous results.
+
Verbs that act immediately
- The following verbs act as soon as they are encountered. They may not
+ The following verbs act as soon as they are encountered. They may not
be followed by a name.
(*ACCEPT)
- This verb causes the match to end successfully, skipping the remainder
- of the pattern. However, when it is inside a subpattern that is called
- as a subroutine, only that subpattern is ended successfully. Matching
- then continues at the outer level. If (*ACCEPT) is inside capturing
+ This verb causes the match to end successfully, skipping the remainder
+ of the pattern. However, when it is inside a subpattern that is called
+ as a subroutine, only that subpattern is ended successfully. Matching
+ then continues at the outer level. If (*ACCEPT) is inside capturing
parentheses, the data so far is captured. For example:
A((?:A|B(*ACCEPT)|C)D)
- This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
+ This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
tured by the outer parentheses.
(*FAIL) or (*F)
- This verb causes a matching failure, forcing backtracking to occur. It
- is equivalent to (?!) but easier to read. The Perl documentation notes
- that it is probably useful only when combined with (?{}) or (??{}).
- Those are, of course, Perl features that are not present in PCRE. The
- nearest equivalent is the callout feature, as for example in this pat-
+ This verb causes a matching failure, forcing backtracking to occur. It
+ is equivalent to (?!) but easier to read. The Perl documentation notes
+ that it is probably useful only when combined with (?{}) or (??{}).
+ Those are, of course, Perl features that are not present in PCRE. The
+ nearest equivalent is the callout feature, as for example in this pat-
tern:
a+(?C)(*FAIL)
- A match with the string "aaaa" always fails, but the callout is taken
+ A match with the string "aaaa" always fails, but the callout is taken
before each backtrack happens (in this example, 10 times).
Recording which path was taken
- There is one verb whose main purpose is to track how a match was
- arrived at, though it also has a secondary use in conjunction with
+ There is one verb whose main purpose is to track how a match was
+ arrived at, though it also has a secondary use in conjunction with
advancing the match starting point (see (*SKIP) below).
(*MARK:NAME) or (*:NAME)
- A name is always required with this verb. There may be as many
- instances of (*MARK) as you like in a pattern, and their names do not
+ A name is always required with this verb. There may be as many
+ instances of (*MARK) as you like in a pattern, and their names do not
have to be unique.
- When a match succeeds, the name of the last-encountered (*MARK) is
- passed back to the caller via the pcre_extra data structure, as
- described in the section on pcre_extra in the pcreapi documentation. No
- data is returned for a partial match. Here is an example of pcretest
- output, where the /K modifier requests the retrieval and outputting of
- (*MARK) data:
+ When a match succeeds, the name of the last-encountered (*MARK) on the
+ matching path is passed back to the caller via the pcre_extra data
+ structure, as described in the section on pcre_extra in the pcreapi
+ documentation. Here is an example of pcretest output, where the /K mod-
+ ifier requests the retrieval and outputting of (*MARK) data:
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XY
+ re> /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data> XY
0: XY
MK: A
XZ
@@ -5720,98 +5773,78 @@ BACKTRACKING CONTROL
and passed back if it is the last-encountered. This does not happen for
negative assertions.
- A name may also be returned after a failed match if the final path
- through the pattern involves (*MARK). However, unless (*MARK) used in
- conjunction with (*COMMIT), this is unlikely to happen for an unan-
- chored pattern because, as the starting point for matching is advanced,
- the final check is often with an empty string, causing a failure before
- (*MARK) is reached. For example:
-
- /X(*MARK:A)Y|X(*MARK:B)Z/K
- XP
- No match
+ After a partial match or a failed match, the name of the last encoun-
+ tered (*MARK) in the entire match process is returned. For example:
- There are three potential starting points for this match (starting with
- X, starting with P, and with an empty string). If the pattern is
- anchored, the result is different:
-
- /^X(*MARK:A)Y|^X(*MARK:B)Z/K
- XP
+ re> /X(*MARK:A)Y|X(*MARK:B)Z/K
+ data> XP
No match, mark = B
- PCRE's start-of-match optimizations can also interfere with this. For
- example, if, as a result of a call to pcre_study(), it knows the mini-
- mum subject length for a match, a shorter subject will not be scanned
- at all.
-
- Note that similar anomalies (though different in detail) exist in Perl,
- no doubt for the same reasons. The use of (*MARK) data after a failed
- match of an unanchored pattern is not recommended, unless (*COMMIT) is
- involved.
+ Note that in this unanchored example the mark is retained from the
+ match attempt that started at the letter "X". Subsequent match attempts
+ starting at "P" and then with an empty string do not get as far as the
+ (*MARK) item, but nevertheless do not reset it.
Verbs that act after backtracking
The following verbs do nothing when they are encountered. Matching con-
- tinues with what follows, but if there is no subsequent match, causing
- a backtrack to the verb, a failure is forced. That is, backtracking
- cannot pass to the left of the verb. However, when one of these verbs
- appears inside an atomic group, its effect is confined to that group,
- because once the group has been matched, there is never any backtrack-
- ing into it. In this situation, backtracking can "jump back" to the
- left of the entire atomic group. (Remember also, as stated above, that
+ tinues with what follows, but if there is no subsequent match, causing
+ a backtrack to the verb, a failure is forced. That is, backtracking
+ cannot pass to the left of the verb. However, when one of these verbs
+ appears inside an atomic group, its effect is confined to that group,
+ because once the group has been matched, there is never any backtrack-
+ ing into it. In this situation, backtracking can "jump back" to the
+ left of the entire atomic group. (Remember also, as stated above, that
this localization also applies in subroutine calls and assertions.)
- These verbs differ in exactly what kind of failure occurs when back-
+ These verbs differ in exactly what kind of failure occurs when back-
tracking reaches them.
(*COMMIT)
- This verb, which may not be followed by a name, causes the whole match
+ This verb, which may not be followed by a name, causes the whole match
to fail outright if the rest of the pattern does not match. Even if the
pattern is unanchored, no further attempts to find a match by advancing
the starting point take place. Once (*COMMIT) has been passed,
- pcre_exec() is committed to finding a match at the current starting
+ pcre_exec() is committed to finding a match at the current starting
point, or not at all. For example:
a+(*COMMIT)b
- This matches "xxaab" but not "aacaab". It can be thought of as a kind
+ This matches "xxaab" but not "aacaab". It can be thought of as a kind
of dynamic anchor, or "I've started, so I must finish." The name of the
- most recently passed (*MARK) in the path is passed back when (*COMMIT)
+ most recently passed (*MARK) in the path is passed back when (*COMMIT)
forces a match failure.
- Note that (*COMMIT) at the start of a pattern is not the same as an
- anchor, unless PCRE's start-of-match optimizations are turned off, as
+ Note that (*COMMIT) at the start of a pattern is not the same as an
+ anchor, unless PCRE's start-of-match optimizations are turned off, as
shown in this pcretest example:
- /(*COMMIT)abc/
- xyzabc
+ re> /(*COMMIT)abc/
+ data> xyzabc
0: abc
xyzabc\Y
No match
- PCRE knows that any match must start with "a", so the optimization
- skips along the subject to "a" before running the first match attempt,
- which succeeds. When the optimization is disabled by the \Y escape in
+ PCRE knows that any match must start with "a", so the optimization
+ skips along the subject to "a" before running the first match attempt,
+ which succeeds. When the optimization is disabled by the \Y escape in
the second subject, the match starts at "x" and so the (*COMMIT) causes
it to fail without trying any other starting points.
(*PRUNE) or (*PRUNE:NAME)
- This verb causes the match to fail at the current starting position in
- the subject if the rest of the pattern does not match. If the pattern
- is unanchored, the normal "bumpalong" advance to the next starting
- character then happens. Backtracking can occur as usual to the left of
- (*PRUNE), before it is reached, or when matching to the right of
- (*PRUNE), but if there is no match to the right, backtracking cannot
- cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
- native to an atomic group or possessive quantifier, but there are some
+ This verb causes the match to fail at the current starting position in
+ the subject if the rest of the pattern does not match. If the pattern
+ is unanchored, the normal "bumpalong" advance to the next starting
+ character then happens. Backtracking can occur as usual to the left of
+ (*PRUNE), before it is reached, or when matching to the right of
+ (*PRUNE), but if there is no match to the right, backtracking cannot
+ cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
+ native to an atomic group or possessive quantifier, but there are some
uses of (*PRUNE) that cannot be expressed in any other way. The behav-
- iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the
- match fails completely; the name is passed back if this is the final
- attempt. (*PRUNE:NAME) does not pass back a name if the match suc-
- ceeds. In an anchored pattern (*PRUNE) has the same effect as (*COM-
- MIT).
+ iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
+ anchored pattern (*PRUNE) has the same effect as (*COMMIT).
(*SKIP)
@@ -5838,67 +5871,66 @@ BACKTRACKING CONTROL
is searched for the most recent (*MARK) that has the same name. If one
is found, the "bumpalong" advance is to the subject position that cor-
responds to that (*MARK) instead of to where (*SKIP) was encountered.
- If no (*MARK) with a matching name is found, normal "bumpalong" of one
- character happens (that is, the (*SKIP) is ignored).
+ If no (*MARK) with a matching name is found, the (*SKIP) is ignored.
(*THEN) or (*THEN:NAME)
- This verb causes a skip to the next innermost alternative if the rest
- of the pattern does not match. That is, it cancels pending backtrack-
- ing, but only within the current alternative. Its name comes from the
+ This verb causes a skip to the next innermost alternative if the rest
+ of the pattern does not match. That is, it cancels pending backtrack-
+ ing, but only within the current alternative. Its name comes from the
observation that it can be used for a pattern-based if-then-else block:
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
- If the COND1 pattern matches, FOO is tried (and possibly further items
- after the end of the group if FOO succeeds); on failure, the matcher
- skips to the second alternative and tries COND2, without backtracking
- into COND1. The behaviour of (*THEN:NAME) is exactly the same as
- (*MARK:NAME)(*THEN) if the overall match fails. If (*THEN) is not
- inside an alternation, it acts like (*PRUNE).
-
- Note that a subpattern that does not contain a | character is just a
- part of the enclosing alternative; it is not a nested alternation with
- only one alternative. The effect of (*THEN) extends beyond such a sub-
- pattern to the enclosing alternative. Consider this pattern, where A,
+ If the COND1 pattern matches, FOO is tried (and possibly further items
+ after the end of the group if FOO succeeds); on failure, the matcher
+ skips to the second alternative and tries COND2, without backtracking
+ into COND1. The behaviour of (*THEN:NAME) is exactly the same as
+ (*MARK:NAME)(*THEN). If (*THEN) is not inside an alternation, it acts
+ like (*PRUNE).
+
+ Note that a subpattern that does not contain a | character is just a
+ part of the enclosing alternative; it is not a nested alternation with
+ only one alternative. The effect of (*THEN) extends beyond such a sub-
+ pattern to the enclosing alternative. Consider this pattern, where A,
B, etc. are complex pattern fragments that do not contain any | charac-
ters at this level:
A (B(*THEN)C) | D
- If A and B are matched, but there is a failure in C, matching does not
+ If A and B are matched, but there is a failure in C, matching does not
backtrack into A; instead it moves to the next alternative, that is, D.
- However, if the subpattern containing (*THEN) is given an alternative,
+ However, if the subpattern containing (*THEN) is given an alternative,
it behaves differently:
A (B(*THEN)C | (*FAIL)) | D
- The effect of (*THEN) is now confined to the inner subpattern. After a
+ The effect of (*THEN) is now confined to the inner subpattern. After a
failure in C, matching moves to (*FAIL), which causes the whole subpat-
- tern to fail because there are no more alternatives to try. In this
+ tern to fail because there are no more alternatives to try. In this
case, matching does now backtrack into A.
Note also that a conditional subpattern is not considered as having two
- alternatives, because only one is ever used. In other words, the |
+ alternatives, because only one is ever used. In other words, the |
character in a conditional subpattern has a different meaning. Ignoring
white space, consider:
^.*? (?(?=a) a | b(*THEN)c )
- If the subject is "ba", this pattern does not match. Because .*? is
- ungreedy, it initially matches zero characters. The condition (?=a)
- then fails, the character "b" is matched, but "c" is not. At this
- point, matching does not backtrack to .*? as might perhaps be expected
- from the presence of the | character. The conditional subpattern is
+ If the subject is "ba", this pattern does not match. Because .*? is
+ ungreedy, it initially matches zero characters. The condition (?=a)
+ then fails, the character "b" is matched, but "c" is not. At this
+ point, matching does not backtrack to .*? as might perhaps be expected
+ from the presence of the | character. The conditional subpattern is
part of the single alternative that comprises the whole pattern, and so
- the match fails. (If there was a backtrack into .*?, allowing it to
+ the match fails. (If there was a backtrack into .*?, allowing it to
match "b", the match would succeed.)
- The verbs just described provide four different "strengths" of control
+ The verbs just described provide four different "strengths" of control
when subsequent matching fails. (*THEN) is the weakest, carrying on the
- match at the next alternative. (*PRUNE) comes next, failing the match
- at the current starting position, but allowing an advance to the next
- character (for an unanchored pattern). (*SKIP) is similar, except that
+ match at the next alternative. (*PRUNE) comes next, failing the match
+ at the current starting position, but allowing an advance to the next
+ character (for an unanchored pattern). (*SKIP) is similar, except that
the advance may be more than one character. (*COMMIT) is the strongest,
causing the entire match to fail.
@@ -5908,8 +5940,8 @@ BACKTRACKING CONTROL
(A(*COMMIT)B(*THEN)C|D)
- Once A has matched, PCRE is committed to this match, at the current
- starting position. If subsequently B matches, but C does not, the nor-
+ Once A has matched, PCRE is committed to this match, at the current
+ starting position. If subsequently B matches, but C does not, the nor-
mal (*THEN) action of trying the next alternative (that is, D) does not
happen because (*COMMIT) overrides.
@@ -5928,11 +5960,11 @@ AUTHOR
REVISION
- Last updated: 19 October 2011
+ Last updated: 29 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRESYNTAX(3) PCRESYNTAX(3)
@@ -6301,8 +6333,8 @@ REVISION
Last updated: 21 November 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREUNICODE(3) PCREUNICODE(3)
@@ -6455,8 +6487,8 @@ REVISION
Last updated: 19 October 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREJIT(3) PCREJIT(3)
@@ -6497,11 +6529,17 @@ AVAILABILITY OF JIT SUPPORT
been fully tested. If --enable-jit is set on an unsupported platform,
compilation fails.
- A program can tell if JIT support is available by calling pcre_config()
- with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available,
- and 0 otherwise. However, a simple program does not need to check this
- in order to use JIT. The API is implemented in a way that falls back to
- the ordinary PCRE code if JIT is not available.
+ A program that is linked with PCRE 8.20 or later can tell if JIT sup-
+ port is available by calling pcre_config() with the PCRE_CONFIG_JIT
+ option. The result is 1 when JIT is available, and 0 otherwise. How-
+ ever, a simple program does not need to check this in order to use JIT.
+ The API is implemented in a way that falls back to the ordinary PCRE
+ code if JIT is not available.
+
+ If your program may sometimes be linked with versions of PCRE that are
+ older than 8.20, but you want to use JIT when it is available, you can
+ test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
+ macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
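As a hedged sketch of the compile-time test described above, the block
below derives a single flag from PCRE_MAJOR and PCRE_MINOR. The names
APP_HAVE_PCRE_JIT_API and app_have_pcre_jit_api() are hypothetical, and
the __has_include guard (a compiler extension before C23) is an assump-
tion added only so that the sketch also builds where pcre.h is not
installed.

```c
#include <stdio.h>

/* Assumption: include pcre.h only when the compiler can find it, so the
   sketch still compiles on a machine without PCRE installed. */
#if defined(__has_include)
#if __has_include(<pcre.h>)
#include <pcre.h>
#endif
#endif

/* The JIT interface appeared in 8.20, so test the version macros that
   pcre.h defines. APP_HAVE_PCRE_JIT_API is a hypothetical name. */
#if defined(PCRE_MAJOR) && defined(PCRE_MINOR) && \
    (PCRE_MAJOR > 8 || (PCRE_MAJOR == 8 && PCRE_MINOR >= 20))
#define APP_HAVE_PCRE_JIT_API 1
#else
#define APP_HAVE_PCRE_JIT_API 0
#endif

/* Run-time view of the compile-time decision, handy for logging. */
int app_have_pcre_jit_api(void)
{
    return APP_HAVE_PCRE_JIT_API;
}
```

The flag only says that the headers are new enough to offer the JIT
interface; whether JIT was actually built into the library is still
reported by pcre_config() with PCRE_CONFIG_JIT.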
SIMPLE USE OF JIT
@@ -6517,6 +6555,22 @@ SIMPLE USE OF JIT
no longer needed instead of just freeing it yourself. This
ensures that any JIT data is also freed.
+ For a program that may be linked with pre-8.20 versions of PCRE, you
+ can insert
+
+ #ifndef PCRE_STUDY_JIT_COMPILE
+ #define PCRE_STUDY_JIT_COMPILE 0
+ #endif
+
+ so that no option is passed to pcre_study(), and then use something
+ like this to free the study data:
+
+ #ifdef PCRE_CONFIG_JIT
+ pcre_free_study(study_ptr);
+ #else
+ pcre_free(study_ptr);
+ #endif
+
In some circumstances you may need to call additional functions. These
are described in the section entitled "Controlling the JIT stack"
below.
@@ -6555,12 +6609,8 @@ UNSUPPORTED OPTIONS AND PATTERN ITEMS
The unsupported pattern items are:
- \C match a single byte; not supported in UTF-8 mode
+ \C match a single byte; not supported in UTF-8 mode
(?Cn) callouts
- (?(<name>)... conditional test on setting of a named subpattern
- (?(R)... conditional test on whole pattern recursion
- (?(Rn)... conditional test on recursion, by number
- (?(R&name)... conditional test on recursion, by name
(*COMMIT) )
(*MARK) )
(*PRUNE) ) the backtracking control verbs
@@ -6609,28 +6659,29 @@ CONTROLLING THE JIT STACK
large or complicated patterns need more than this. The error
PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
Three functions are provided for managing blocks of memory for use as
- JIT stacks.
-
- The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
- are a starting size and a maximum size, and it returns a pointer to an
- opaque structure of type pcre_jit_stack, or NULL if there is an error.
- The pcre_jit_stack_free() function can be used to free a stack that is
- no longer needed. (For the technically minded: the address space is
+ JIT stacks. There is further discussion about the use of JIT stacks in
+ the section entitled "JIT stack FAQ" below.
+
+ The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
+ are a starting size and a maximum size, and it returns a pointer to an
+ opaque structure of type pcre_jit_stack, or NULL if there is an error.
+ The pcre_jit_stack_free() function can be used to free a stack that is
+ no longer needed. (For the technically minded: the address space is
allocated by mmap or VirtualAlloc.)
- JIT uses far less memory for recursion than the interpretive code, and
- a maximum stack size of 512K to 1M should be more than enough for any
+ JIT uses far less memory for recursion than the interpretive code, and
+ a maximum stack size of 512K to 1M should be more than enough for any
pattern.
- The pcre_assign_jit_stack() function specifies which stack JIT code
+ The pcre_assign_jit_stack() function specifies which stack JIT code
should use. Its arguments are as follows:
pcre_extra *extra
pcre_jit_callback callback
void *data
- The extra argument must be the result of studying a pattern with
- PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the
+ The extra argument must be the result of studying a pattern with
+ PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the
other two options:
(1) If callback is NULL and data is NULL, an internal 32K block
@@ -6645,18 +6696,18 @@ CONTROLLING THE JIT STACK
is used; otherwise the return value must be a valid JIT stack,
the result of calling pcre_jit_stack_alloc().
- You may safely assign the same JIT stack to more than one pattern, as
+ You may safely assign the same JIT stack to more than one pattern, as
long as they are all matched sequentially in the same thread. In a mul-
tithread application, each thread must use its own JIT stack.
- Strictly speaking, even more is allowed. You can assign the same stack
- to any number of patterns as long as they are not used for matching by
+ Strictly speaking, even more is allowed. You can assign the same stack
+ to any number of patterns as long as they are not used for matching by
multiple threads at the same time. For example, you can assign the same
- stack to all compiled patterns, and use a global mutex in the callback
+ stack to all compiled patterns, and use a global mutex in the callback
to wait until the stack is available for use. However, this is an inef-
ficient solution, and not recommended.
- This is a suggestion for how a typical multithreaded program might
+ This is a suggestion for how a typical multithreaded program might
operate:
During thread initialization
@@ -6668,12 +6719,80 @@ CONTROLLING THE JIT STACK
Use a one-line callback function
return thread_local_var
- All the functions described in this section do nothing if JIT is not
- available, and pcre_assign_jit_stack() does nothing unless the extra
- argument is non-NULL and points to a pcre_extra block that is the
+ All the functions described in this section do nothing if JIT is not
+ available, and pcre_assign_jit_stack() does nothing unless the extra
+ argument is non-NULL and points to a pcre_extra block that is the
result of a successful study with PCRE_STUDY_JIT_COMPILE.
+JIT STACK FAQ
+
+ (1) Why do we need JIT stacks?
+
+ PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack
+ where the local data of the current node is pushed before checking its
+ child nodes. Allocating real machine stack on some platforms is diffi-
+ cult. For example, on PowerPC the stack chain needs to be updated every
+ time the stack is extended. Although this is possible, the overhead of
+ the updates decreases performance. So we do the recursion in memory.
+
+ (2) Why don't we simply allocate blocks of memory with malloc()?
+
+ Modern operating systems have a nice feature: they can reserve an
+ address space instead of allocating memory. We can safely allocate mem-
+ ory pages inside this address space, so the stack could grow without
+ moving memory data (this is important because of pointers). Thus we can
+ allocate 1M address space, and use only a single memory page (usually
+ 4K) if that is enough. However, we can still grow up to 1M anytime if
+ needed.
+
+ (3) Who "owns" a JIT stack?
+
+ The owner of the stack is the user program, not the JIT studied pattern
+ or anything else. The user program must ensure that if a stack is used
+ by pcre_exec(), (that is, it is assigned to the pattern currently run-
+ ning), that stack must not be used by any other threads (to avoid over-
+ writing the same memory area). The best practice for multithreaded pro-
+ grams is to allocate a stack for each thread, and return this stack
+ through the JIT callback function.
+
+ (4) When should a JIT stack be freed?
+
+ You can free a JIT stack at any time, as long as it will not be used by
+ pcre_exec() again. When you assign the stack to a pattern, only a
+ pointer is set. There is no reference counting or any other magic. You
+ can free the patterns and stacks in any order, anytime. Just do not
+ call pcre_exec() with a pattern pointing to an already freed stack, as
+ that will cause SEGFAULT. (Also, do not free a stack currently used by
+ pcre_exec() in another thread). You can also replace the stack for a
+ pattern at any time. You can even free the previous stack before
+ assigning a replacement.
+
+ (5) Should I allocate/free a stack every time before/after calling
+ pcre_exec()?
+
+ No, because this is too costly in terms of resources. However, you
+ could implement some clever idea which releases the stack if it is not
+ used in, let's say, two minutes. The JIT callback can help to achieve this
+ without keeping a list of the currently JIT studied patterns.
+
+ (6) OK, the stack is for long term memory allocation. But what happens
+ if a pattern causes stack overflow with a stack of 1M? Is that 1M kept
+ until the stack is freed?
+
+ Especially on embedded systems, it might be a good idea to release mem-
+ ory sometimes without freeing the stack. There is no API for this at
+ the moment. Probably a function call which returns with the currently
+ allocated memory for any stack and another which allows releasing mem-
+ ory (shrinking the stack) would be a good idea if someone needs this.
+
+ (7) This is too much of a headache. Isn't there any better solution for
+ JIT stack handling?
+
+ No, thanks to Windows. If POSIX threads were used everywhere, we could
+ throw out this complicated API.
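The reserve-then-commit idea in FAQ item (2) can be sketched for a POSIX
system as follows. This is not PCRE's implementation: grow_region,
region_reserve(), region_grow() and region_free() are hypothetical
names, and the use of mmap() with PROT_NONE followed by mprotect() is
one assumed way of getting the behaviour the answer describes.

```c
#define _DEFAULT_SOURCE        /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical growable region; not PCRE's pcre_jit_stack. */
typedef struct {
    unsigned char *base;   /* start of the reserved address range */
    size_t committed;      /* bytes currently readable and writable */
    size_t reserved;       /* total size of the reserved range */
} grow_region;

/* Round n up to a whole number of memory pages. */
static size_t round_up_to_page(size_t n)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    return (n + page - 1) / page * page;
}

/* Reserve max_size bytes of address space and make the first start_size
   bytes usable. Returns 0 on success, -1 on failure. */
int region_reserve(grow_region *r, size_t start_size, size_t max_size)
{
    void *p;
    start_size = round_up_to_page(start_size);
    max_size = round_up_to_page(max_size);
    if (start_size > max_size) return -1;

    /* PROT_NONE: the range is reserved but not yet accessible. */
    p = mmap(NULL, max_size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    if (start_size > 0 &&
        mprotect(p, start_size, PROT_READ | PROT_WRITE) != 0) {
        munmap(p, max_size);
        return -1;
    }
    r->base = p;
    r->committed = start_size;
    r->reserved = max_size;
    return 0;
}

/* Make the first new_size bytes usable. The base address never changes,
   so pointers into the region stay valid. */
int region_grow(grow_region *r, size_t new_size)
{
    new_size = round_up_to_page(new_size);
    if (new_size > r->reserved) return -1;
    if (new_size <= r->committed) return 0;
    if (mprotect(r->base + r->committed, new_size - r->committed,
                 PROT_READ | PROT_WRITE) != 0)
        return -1;
    r->committed = new_size;
    return 0;
}

/* Release the whole reserved range. */
void region_free(grow_region *r)
{
    munmap(r->base, r->reserved);
    r->base = NULL;
    r->committed = r->reserved = 0;
}
```

A caller might reserve 1M, start with a single page, and call
region_grow() only when more room is needed. Data already written stays
at the same address, which is the property the FAQ answer relies on.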
+
+
EXAMPLE CODE
This is a single-threaded example that specifies a JIT stack without
@@ -6705,18 +6824,18 @@ SEE ALSO
AUTHOR
- Philip Hazel
+ Philip Hazel (FAQ by Zoltan Herczeg)
University Computing Service
Cambridge CB2 3QH, England.
REVISION
- Last updated: 19 October 2011
+ Last updated: 26 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPARTIAL(3) PCREPARTIAL(3)
@@ -7137,8 +7256,8 @@ REVISION
Last updated: 26 August 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPRECOMPILE(3) PCREPRECOMPILE(3)
@@ -7268,8 +7387,8 @@ REVISION
Last updated: 26 August 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPERFORM(3) PCREPERFORM(3)
@@ -7436,8 +7555,8 @@ REVISION
Last updated: 16 May 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCREPOSIX(3) PCREPOSIX(3)
@@ -7699,8 +7818,8 @@ REVISION
Last updated: 16 May 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRECPP(3) PCRECPP(3)
@@ -8041,8 +8160,8 @@ REVISION
Last updated: 17 March 2009
Minor typo fixed: 25 July 2011
------------------------------------------------------------------------------
-
-
+
+
PCRESAMPLE(3) PCRESAMPLE(3)
@@ -8153,6 +8272,12 @@ SIZE AND OTHER LIMITATIONS
There is no limit to the number of parenthesized subpatterns, but there
can be no more than 65535 capturing subpatterns.
+ There is a limit to the number of forward references to subsequent sub-
+ patterns of around 200,000. Repeated forward references with fixed
+ upper limits, for example, (?2){0,100} when subpattern number 2 is to
+ the right, are included in the count. There is no limit to the number
+ of backward references.
+
The maximum length of name for a named subpattern is 32 characters, and
the maximum number of named subpatterns is 10000.
@@ -8173,11 +8298,11 @@ AUTHOR
REVISION
- Last updated: 24 August 2011
+ Last updated: 30 November 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
PCRESTACK(3) PCRESTACK(3)
@@ -8337,5 +8462,5 @@ REVISION
Last updated: 26 August 2011
Copyright (c) 1997-2011 University of Cambridge.
------------------------------------------------------------------------------
-
-
+
+
diff --git a/doc/pcretest.txt b/doc/pcretest.txt
index 999ee0c..3835f48 100644
--- a/doc/pcretest.txt
+++ b/doc/pcretest.txt
@@ -305,30 +305,33 @@ PATTERN MODIFIERS
it appears.
The /M modifier causes the size of memory block used to hold the com-
- piled pattern to be output.
-
- If the /S modifier appears once, it causes pcre_study() to be called
- after the expression has been compiled, and the results used when the
- expression is matched. If /S appears twice, it suppresses studying,
+ piled pattern to be output. This does not include the size of the pcre
+ block; it is just the actual compiled data. If the pattern is success-
+ fully studied with the PCRE_STUDY_JIT_COMPILE option, the size of the
+ JIT compiled code is also output.
+
+ If the /S modifier appears once, it causes pcre_study() to be called
+ after the expression has been compiled, and the results used when the
+ expression is matched. If /S appears twice, it suppresses studying,
even if it was requested externally by the -s command line option. This
- makes it possible to specify that certain patterns are always studied,
+ makes it possible to specify that certain patterns are always studied,
and others are never studied, independently of -s. This feature is used
in the test files in a few cases where the output is different when the
pattern is studied.
- If the /S modifier is immediately followed by a + character, the call
- to pcre_study() is made with the PCRE_STUDY_JIT_COMPILE option,
- requesting just-in-time optimization support if it is available. Note
- that there is also a /+ modifier; it must not be given immediately
- after /S because this will be misinterpreted. If JIT studying is suc-
- cessful, it will automatically be used when pcre_exec() is run, except
- when incompatible run-time options are specified. These include the
+ If the /S modifier is immediately followed by a + character, the call
+ to pcre_study() is made with the PCRE_STUDY_JIT_COMPILE option,
+ requesting just-in-time optimization support if it is available. Note
+ that there is also a /+ modifier; it must not be given immediately
+ after /S because this will be misinterpreted. If JIT studying is suc-
+ cessful, it will automatically be used when pcre_exec() is run, except
+ when incompatible run-time options are specified. These include the
partial matching options; a complete list is given in the pcrejit docu-
- mentation. See also the \J escape sequence below for a way of setting
+ mentation. See also the \J escape sequence below for a way of setting
the size of the JIT stack.
- The /T modifier must be followed by a single digit. It causes a spe-
- cific set of built-in character tables to be passed to pcre_compile().
+ The /T modifier must be followed by a single digit. It causes a spe-
+ cific set of built-in character tables to be passed to pcre_compile().
It is used in the standard PCRE tests to check behaviour with different
character tables. The digit specifies the tables as follows:
@@ -336,12 +339,12 @@ PATTERN MODIFIERS
pcre_chartables.c.dist
1 a set of tables defining ISO 8859 characters
- In table 1, some characters whose codes are greater than 128 are iden-
+ In table 1, some characters whose codes are greater than 128 are iden-
tified as letters, digits, spaces, etc.
Using the POSIX wrapper API
- The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
+ The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
rather than its native API. When /P is set, the following modifiers set
options for the regcomp() function:
@@ -353,17 +356,17 @@ PATTERN MODIFIERS
/W REG_UCP ) the POSIX standard
/8 REG_UTF8 )
- The /+ modifier works as described above. All other modifiers are
+ The /+ modifier works as described above. All other modifiers are
ignored.
DATA LINES
- Before each data line is passed to pcre_exec(), leading and trailing
- white space is removed, and it is then scanned for \ escapes. Some of
- these are pretty esoteric features, intended for checking out some of
- the more complicated features of PCRE. If you are just testing "ordi-
- nary" regular expressions, you probably don't need any of these. The
+ Before each data line is passed to pcre_exec(), leading and trailing
+ white space is removed, and it is then scanned for \ escapes. Some of
+ these are pretty esoteric features, intended for checking out some of
+ the more complicated features of PCRE. If you are just testing "ordi-
+ nary" regular expressions, you probably don't need any of these. The
following escapes are recognized:
\a alarm (BEL, \x07)
@@ -444,95 +447,95 @@ DATA LINES
\<any> pass the PCRE_NEWLINE_ANY option to pcre_exec()
or pcre_dfa_exec()
- Note that \xhh always specifies one byte, even in UTF-8 mode; this
+ Note that \xhh always specifies one byte, even in UTF-8 mode; this
makes it possible to construct invalid UTF-8 sequences for testing pur-
poses. On the other hand, \x{hh} is interpreted as a UTF-8 character in
- UTF-8 mode, generating more than one byte if the value is greater than
+ UTF-8 mode, generating more than one byte if the value is greater than
127. When not in UTF-8 mode, it generates one byte for values less than
256, and causes an error for greater values.
- The escapes that specify line ending sequences are literal strings,
+ The escapes that specify line ending sequences are literal strings,
exactly as shown. No more than one newline setting should be present in
any data line.
- A backslash followed by anything else just escapes the anything else.
- If the very last character is a backslash, it is ignored. This gives a
- way of passing an empty line as data, since a real empty line termi-
+ A backslash followed by anything else just escapes the anything else.
+ If the very last character is a backslash, it is ignored. This gives a
+ way of passing an empty line as data, since a real empty line termi-
nates the data input.
- The \J escape provides a way of setting the maximum stack size that is
- used by the just-in-time optimization code. It is ignored if JIT opti-
- mization is not being used. Providing a stack that is larger than the
+ The \J escape provides a way of setting the maximum stack size that is
+ used by the just-in-time optimization code. It is ignored if JIT opti-
+ mization is not being used. Providing a stack that is larger than the
default 32K is necessary only for very complicated patterns.
- If \M is present, pcretest calls pcre_exec() several times, with dif-
- ferent values in the match_limit and match_limit_recursion fields of
- the pcre_extra data structure, until it finds the minimum numbers for
- each parameter that allow pcre_exec() to complete without error.
- Because this is testing a specific feature of the normal interpretive
- pcre_exec() execution, the use of any JIT optimization that might have
+ If \M is present, pcretest calls pcre_exec() several times, with dif-
+ ferent values in the match_limit and match_limit_recursion fields of
+ the pcre_extra data structure, until it finds the minimum numbers for
+ each parameter that allow pcre_exec() to complete without error.
+ Because this is testing a specific feature of the normal interpretive
+ pcre_exec() execution, the use of any JIT optimization that might have
been set up by the /S+ qualifier or -s+ option is disabled.
- The match_limit number is a measure of the amount of backtracking that
- takes place, and checking it out can be instructive. For most simple
- matches, the number is quite small, but for patterns with very large
- numbers of matching possibilities, it can become large very quickly
- with increasing length of subject string. The match_limit_recursion
- number is a measure of how much stack (or, if PCRE is compiled with
- NO_RECURSE, how much heap) memory is needed to complete the match
+ The match_limit number is a measure of the amount of backtracking that
+ takes place, and checking it out can be instructive. For most simple
+ matches, the number is quite small, but for patterns with very large
+ numbers of matching possibilities, it can become large very quickly
+ with increasing length of subject string. The match_limit_recursion
+ number is a measure of how much stack (or, if PCRE is compiled with
+ NO_RECURSE, how much heap) memory is needed to complete the match
attempt.
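The search that \M performs can be pictured with the following sketch,
which finds the smallest limit that lets a match attempt complete. It
is not taken from pcretest: find_min_limit(), completes_fn and
needs_at_least() are hypothetical names, and the doubling-then-bisect-
ing strategy is an assumption that relies on a larger limit never fail-
ing where a smaller one succeeds.

```c
/* Hypothetical stand-in for "run the match with this limit": returns
   nonzero if the attempt completes without hitting the limit. */
typedef int (*completes_fn)(unsigned long limit, void *ctx);

/* Return the smallest limit in 1..max for which completes() succeeds,
   or 0 if even max is not enough. */
unsigned long find_min_limit(completes_fn completes, void *ctx,
                             unsigned long max)
{
    unsigned long lo = 0;      /* largest limit known to be too small */
    unsigned long hi = 1;      /* limit currently being tried */

    /* Phase 1: double the limit until the match completes. */
    while (!completes(hi, ctx)) {
        if (hi >= max) return 0;
        lo = hi;
        hi = (hi > max / 2) ? max : hi * 2;
    }

    /* Phase 2: bisect between the failing lo and the succeeding hi. */
    while (hi - lo > 1) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (completes(mid, ctx)) hi = mid; else lo = mid;
    }
    return hi;
}

/* Demonstration predicate: pretend the match needs *ctx units. */
int needs_at_least(unsigned long limit, void *ctx)
{
    return limit >= *(unsigned long *)ctx;
}
```

In a real program the predicate would set the match_limit (or
match_limit_recursion) field of the pcre_extra block, call pcre_exec(),
and report whether the result was something other than
PCRE_ERROR_MATCHLIMIT (or PCRE_ERROR_RECURSIONLIMIT).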
- When \O is used, the value specified may be higher or lower than the
+ When \O is used, the value specified may be higher or lower than the
size set by the -O command line option (or defaulted to 45); \O applies
only to the call of pcre_exec() for the line in which it appears.
- If the /P modifier was present on the pattern, causing the POSIX wrap-
- per API to be used, the only option-setting sequences that have any
- effect are \B, \N, and \Z, causing REG_NOTBOL, REG_NOTEMPTY, and
+ If the /P modifier was present on the pattern, causing the POSIX wrap-
+ per API to be used, the only option-setting sequences that have any
+ effect are \B, \N, and \Z, causing REG_NOTBOL, REG_NOTEMPTY, and
REG_NOTEOL, respectively, to be passed to regexec().
- The use of \x{hh...} to represent UTF-8 characters is not dependent on
- the use of the /8 modifier on the pattern. It is recognized always.
- There may be any number of hexadecimal digits inside the braces. The
- result is from one to six bytes, encoded according to the original
- UTF-8 rules of RFC 2279. This allows for values in the range 0 to
- 0x7FFFFFFF. Note that not all of those are valid Unicode code points,
- or indeed valid UTF-8 characters according to the later rules in RFC
+ The use of \x{hh...} to represent UTF-8 characters is not dependent on
+ the use of the /8 modifier on the pattern. It is recognized always.
+ There may be any number of hexadecimal digits inside the braces. The
+ result is from one to six bytes, encoded according to the original
+ UTF-8 rules of RFC 2279. This allows for values in the range 0 to
+ 0x7FFFFFFF. Note that not all of those are valid Unicode code points,
+ or indeed valid UTF-8 characters according to the later rules in RFC
3629.
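To make the one-to-six-byte rule concrete, here is a minimal sketch of
an encoder that follows the original RFC 2279 layout. It is an illus-
tration only, not pcretest's own code, and utf8_encode_rfc2279() is a
hypothetical name.

```c
#include <stddef.h>

/* Encode value (0..0x7FFFFFFF) using the original RFC 2279 scheme.
   Writes 1 to 6 bytes to out and returns the count, or 0 if the value
   is out of range. */
size_t utf8_encode_rfc2279(unsigned long value, unsigned char out[6])
{
    /* Largest value that fits in a sequence of 1..6 bytes. */
    static const unsigned long limit[6] = {
        0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL
    };
    /* Marker bits for the first byte of a 1..6 byte sequence. */
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    size_t n, i;

    /* n becomes the number of continuation bytes needed. */
    for (n = 0; n < 6; n++)
        if (value <= limit[n]) break;
    if (n == 6) return 0;             /* above 0x7FFFFFFF */

    /* Fill continuation bytes from the end, six payload bits each. */
    for (i = n; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (value & 0x3F));
        value >>= 6;
    }
    out[0] = (unsigned char)(lead[n] | value);
    return n + 1;
}
```

Values above 0x10FFFF, and the five- and six-byte forms they need, are
accepted here because RFC 2279 allowed them; RFC 3629 does not.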
THE ALTERNATIVE MATCHING FUNCTION
- By default, pcretest uses the standard PCRE matching function,
+ By default, pcretest uses the standard PCRE matching function,
pcre_exec() to match each data line. From release 6.0, PCRE supports an
- alternative matching function, pcre_dfa_exec(), which operates in a
- different way, and has some restrictions. The differences between the
+ alternative matching function, pcre_dfa_exec(), which operates in a
+ different way, and has some restrictions. The differences between the
two functions are described in the pcrematching documentation.
- If a data line contains the \D escape sequence, or if the command line
- contains the -dfa option, the alternative matching function is called.
+ If a data line contains the \D escape sequence, or if the command line
+ contains the -dfa option, the alternative matching function is called.
This function finds all possible matches at a given point. If, however,
- the \F escape sequence is present in the data line, it stops after the
+ the \F escape sequence is present in the data line, it stops after the
first match is found. This is always the shortest possible match.
DEFAULT OUTPUT FROM PCRETEST
- This section describes the output when the normal matching function,
+ This section describes the output when the normal matching function,
pcre_exec(), is being used.
When a match succeeds, pcretest outputs the list of captured substrings
- that pcre_exec() returns, starting with number 0 for the string that
- matched the whole pattern. Otherwise, it outputs "No match" when the
+ that pcre_exec() returns, starting with number 0 for the string that
+ matched the whole pattern. Otherwise, it outputs "No match" when the
return is PCRE_ERROR_NOMATCH, and "Partial match:" followed by the par-
- tially matching substring when pcre_exec() returns PCRE_ERROR_PARTIAL.
- (Note that this is the entire substring that was inspected during the
- partial match; it may include characters before the actual match start
- if a lookbehind assertion, \K, \b, or \B was involved.) For any other
- return, pcretest outputs the PCRE negative error number and a short
- descriptive phrase. If the error is a failed UTF-8 string check, the
- byte offset of the start of the failing character and the reason code
- are also output, provided that the size of the output vector is at
+ tially matching substring when pcre_exec() returns PCRE_ERROR_PARTIAL.
+ (Note that this is the entire substring that was inspected during the
+ partial match; it may include characters before the actual match start
+ if a lookbehind assertion, \K, \b, or \B was involved.) For any other
+ return, pcretest outputs the PCRE negative error number and a short
+ descriptive phrase. If the error is a failed UTF-8 string check, the
+ byte offset of the start of the failing character and the reason code
+ are also output, provided that the size of the output vector is at
least two. Here is an example of an interactive pcretest run.
$ pcretest
@@ -547,9 +550,9 @@ DEFAULT OUTPUT FROM PCRETEST
Unset capturing substrings that are not followed by one that is set are
not returned by pcre_exec(), and are not shown by pcretest. In the fol-
- lowing example, there are two capturing substrings, but when the first
- data line is matched, the second, unset substring is not shown. An
- "internal" unset substring is shown as "<unset>", as for the second
+ lowing example, there are two capturing substrings, but when the first
+ data line is matched, the second, unset substring is not shown. An
+ "internal" unset substring is shown as "<unset>", as for the second
data line.
re> /(a)|(b)/
@@ -561,11 +564,11 @@ DEFAULT OUTPUT FROM PCRETEST
1: <unset>
2: b
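The listing rule described above can be sketched in C. This is a minimal illustration, not pcretest's own code: the helper name format_captures and the two-column width for the substring number are assumptions. It relies on two documented pcre_exec() behaviours: a positive return value is one more than the highest-numbered substring that was set, and an unset substring has both offsets set to -1.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE or pcretest).
   Formats the substrings described by a positive pcre_exec() return
   value `rc` and its offset vector into `out`, one per line.
   Trailing unset substrings never appear because they are not counted
   in `rc`; an "internal" unset substring has offsets of -1 and is
   written as <unset>.
   Returns the number of characters written, or -1 if `out` is too small. */
static int format_captures(char *out, size_t outsize, const char *subject,
                           const int *ovector, int rc)
{
    size_t used = 0;

    assert(out != NULL && subject != NULL && ovector != NULL);
    if (outsize == 0) return -1;
    out[0] = '\0';

    for (int i = 0; i < rc; i++) {
        int start = ovector[2 * i];
        int end = ovector[2 * i + 1];
        int n;

        if (start < 0)
            n = snprintf(out + used, outsize - used, "%2d: <unset>\n", i);
        else
            n = snprintf(out + used, outsize - used, "%2d: %.*s\n",
                         i, end - start, subject + start);

        /* snprintf reports the length it wanted; treat truncation as failure */
        if (n < 0 || (size_t)n >= outsize - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

For the pattern /(a)|(b)/ matched against "b", pcre_exec() would return 3 with offsets {0, 1, -1, -1, 0, 1}, and the helper produces the three lines shown for the second data line.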
- If the strings contain any non-printing characters, they are output as
- \0x escapes, or as \x{...} escapes if the /8 modifier was present on
- the pattern. See below for the definition of non-printing characters.
- If the pattern has the /+ modifier, the output for substring 0 is fol-
- lowed by the the rest of the subject string, identified by "0+" like
+ If the strings contain any non-printing characters, they are output as
+ \xhh escapes, or as \x{...} escapes if the /8 modifier was present on
+ the pattern. See below for the definition of non-printing characters.
+ If the pattern has the /+ modifier, the output for substring 0 is
+ followed by the rest of the subject string, identified by "0+" like
this:
re> /cat/+
@@ -573,7 +576,7 @@ DEFAULT OUTPUT FROM PCRETEST
0: cat
0+ aract
- If the pattern has the /g or /G modifier, the results of successive
+ If the pattern has the /g or /G modifier, the results of successive
matching attempts are output in sequence, like this:
re> /\Bi(\w\w)/g
@@ -585,32 +588,32 @@ DEFAULT OUTPUT FROM PCRETEST
0: ipp
1: pp
- "No match" is output only if the first match attempt fails. Here is an
- example of a failure message (the offset 4 that is specified by \>4 is
+ "No match" is output only if the first match attempt fails. Here is an
+ example of a failure message (the offset 4 that is specified by \>4 is
past the end of the subject string):
re> /xyz/
data> xyz\>4
Error -24 (bad offset value)
- If any of the sequences \C, \G, or \L are present in a data line that
- is successfully matched, the substrings extracted by the convenience
+ If any of the sequences \C, \G, or \L are present in a data line that
+ is successfully matched, the substrings extracted by the convenience
functions are output with C, G, or L after the string number instead of
a colon. This is in addition to the normal full list. The string length
- (that is, the return from the extraction function) is given in paren-
+ (that is, the return from the extraction function) is given in paren-
theses after each string for \C and \G.
Note that whereas patterns can be continued over several lines (a plain
">" prompt is used for continuations), data lines may not. However new-
- lines can be included in data by means of the \n escape (or \r, \r\n,
+ lines can be included in data by means of the \n escape (or \r, \r\n,
etc., depending on the newline sequence setting).
OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
- When the alternative matching function, pcre_dfa_exec(), is used (by
- means of the \D escape sequence or the -dfa command line option), the
- output consists of a list of all the matches that start at the first
+ When the alternative matching function, pcre_dfa_exec(), is used (by
+ means of the \D escape sequence or the -dfa command line option), the
+ output consists of a list of all the matches that start at the first
point in the subject where there is at least one match. For example:
re> /(tang|tangerine|tan)/
@@ -619,11 +622,11 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tang
2: tan
- (Using the normal matching function on this data finds only "tang".)
- The longest matching string is always given first (and numbered zero).
+ (Using the normal matching function on this data finds only "tang".)
+ The longest matching string is always given first (and numbered zero).
After a PCRE_ERROR_PARTIAL return, the output is "Partial match:", fol-
- lowed by the partially matching substring. (Note that this is the
- entire substring that was inspected during the partial match; it may
+ lowed by the partially matching substring. (Note that this is the
+ entire substring that was inspected during the partial match; it may
include characters before the actual match start if a lookbehind asser-
tion, \K, \b, or \B was involved.)
@@ -639,16 +642,16 @@ OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
1: tan
0: tan
- Since the matching function does not support substring capture, the
- escape sequences that are concerned with captured substrings are not
+ Since the matching function does not support substring capture, the
+ escape sequences that are concerned with captured substrings are not
relevant.
RESTARTING AFTER A PARTIAL MATCH
When the alternative matching function has given the PCRE_ERROR_PARTIAL
- return, indicating that the subject partially matched the pattern, you
- can restart the match with additional subject data by means of the \R
+ return, indicating that the subject partially matched the pattern, you
+ can restart the match with additional subject data by means of the \R
escape sequence. For example:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -657,30 +660,30 @@ RESTARTING AFTER A PARTIAL MATCH
data> n05\R\D
0: n05
- For further information about partial matching, see the pcrepartial
+ For further information about partial matching, see the pcrepartial
documentation.
CALLOUTS
- If the pattern contains any callout requests, pcretest's callout func-
- tion is called during matching. This works with both matching func-
+ If the pattern contains any callout requests, pcretest's callout func-
+ tion is called during matching. This works with both matching func-
tions. By default, the called function displays the callout number, the
- start and current positions in the text at the callout time, and the
+ start and current positions in the text at the callout time, and the
next pattern item to be tested. For example, the output
--->pqrabcdef
0 ^ ^ \d
- indicates that callout number 0 occurred for a match attempt starting
- at the fourth character of the subject string, when the pointer was at
- the seventh character of the data, and when the next pattern item was
- \d. Just one circumflex is output if the start and current positions
+ indicates that callout number 0 occurred for a match attempt starting
+ at the fourth character of the subject string, when the pointer was at
+ the seventh character of the data, and when the next pattern item was
+ \d. Just one circumflex is output if the start and current positions
are the same.
Callouts numbered 255 are assumed to be automatic callouts, inserted as
- a result of the /C pattern modifier. In this case, instead of showing
- the callout number, the offset in the pattern, preceded by a plus, is
+ a result of the /C pattern modifier. In this case, instead of showing
+ the callout number, the offset in the pattern, preceded by a plus, is
output. For example:
re> /\d?[A-E]\*/C
@@ -693,7 +696,7 @@ CALLOUTS
0: E*
If a pattern contains (*MARK) items, an additional line is output when-
- ever a change of latest mark is passed to the callout function. For
+ ever a change of latest mark is passed to the callout function. For
example:
re> /a(*MARK:X)bc/C
@@ -707,59 +710,59 @@ CALLOUTS
+12 ^ ^
0: abc
- The mark changes between matching "a" and "b", but stays the same for
- the rest of the match, so nothing more is output. If, as a result of
- backtracking, the mark reverts to being unset, the text "<unset>" is
+ The mark changes between matching "a" and "b", but stays the same for
+ the rest of the match, so nothing more is output. If, as a result of
+ backtracking, the mark reverts to being unset, the text "<unset>" is
output.
- The callout function in pcretest returns zero (carry on matching) by
- default, but you can use a \C item in a data line (as described above)
+ The callout function in pcretest returns zero (carry on matching) by
+ default, but you can use a \C item in a data line (as described above)
to change this and other parameters of the callout.
- Inserting callouts can be helpful when using pcretest to check compli-
- cated regular expressions. For further information about callouts, see
+ Inserting callouts can be helpful when using pcretest to check compli-
+ cated regular expressions. For further information about callouts, see
the pcrecallout documentation.
NON-PRINTING CHARACTERS
- When pcretest is outputting text in the compiled version of a pattern,
- bytes other than 32-126 are always treated as non-printing characters
+ When pcretest is outputting text in the compiled version of a pattern,
+ bytes other than 32-126 are always treated as non-printing characters
and are therefore shown as hex escapes.
- When pcretest is outputting text that is a matched part of a subject
- string, it behaves in the same way, unless a different locale has been
- set for the pattern (using the /L modifier). In this case, the
+ When pcretest is outputting text that is a matched part of a subject
+ string, it behaves in the same way, unless a different locale has been
+ set for the pattern (using the /L modifier). In this case, the
isprint() function is used to distinguish printing and non-printing characters.
SAVING AND RELOADING COMPILED PATTERNS
- The facilities described in this section are not available when the
- POSIX interface to PCRE is being used, that is, when the /P pattern
+ The facilities described in this section are not available when the
+ POSIX interface to PCRE is being used, that is, when the /P pattern
modifier is specified.
When the POSIX interface is not in use, you can cause pcretest to write
- a compiled pattern to a file, by following the modifiers with > and a
+ a compiled pattern to a file, by following the modifiers with > and a
file name. For example:
/pattern/im >/some/file
- See the pcreprecompile documentation for a discussion about saving and
- re-using compiled patterns. Note that if the pattern was successfully
+ See the pcreprecompile documentation for a discussion about saving and
+ re-using compiled patterns. Note that if the pattern was successfully
studied with JIT optimization, the JIT data cannot be saved.
- The data that is written is binary. The first eight bytes are the
- length of the compiled pattern data followed by the length of the
- optional study data, each written as four bytes in big-endian order
- (most significant byte first). If there is no study data (either the
+ The data that is written is binary. The first eight bytes are the
+ length of the compiled pattern data followed by the length of the
+ optional study data, each written as four bytes in big-endian order
+ (most significant byte first). If there is no study data (either the
pattern was not studied, or studying did not return any data), the sec-
- ond length is zero. The lengths are followed by an exact copy of the
- compiled pattern. If there is additional study data, this (excluding
- any JIT data) follows immediately after the compiled pattern. After
+ ond length is zero. The lengths are followed by an exact copy of the
+ compiled pattern. If there is additional study data, this (excluding
+ any JIT data) follows immediately after the compiled pattern. After
writing the file, pcretest expects to read a new pattern.
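The eight-byte header described above can be decoded with a few shifts. The following C sketch only illustrates the layout; the helper names read_be32 and parse_saved_header are hypothetical and do not come from pcretest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one four-byte length stored most significant byte first. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: split the header of a saved-pattern file into
   the compiled pattern length and the study data length (zero when
   there is no study data).
   Returns 0 on success, or -1 if fewer than eight bytes are available. */
static int parse_saved_header(const unsigned char *buf, size_t len,
                              uint32_t *pattern_len, uint32_t *study_len)
{
    assert(buf != NULL && pattern_len != NULL && study_len != NULL);
    if (len < 8) return -1;
    *pattern_len = read_be32(buf);      /* bytes 0-3 */
    *study_len = read_be32(buf + 4);    /* bytes 4-7 */
    return 0;
}
```

Because the lengths are assembled byte by byte, the result is the same on big-endian and little-endian hosts. The compiled pattern would then occupy the next pattern_len bytes of the file, followed by study_len bytes of study data.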
- A saved pattern can be reloaded into pcretest by specifying < and a
+ A saved pattern can be reloaded into pcretest by specifying < and a
file name instead of a pattern. The name of the file must not contain a
< character, as otherwise pcretest will interpret the line as a pattern
delimited by < characters. For example:
@@ -768,27 +771,27 @@ SAVING AND RELOADING COMPILED PATTERNS
Compiled pattern loaded from /some/file
No study data
- If the pattern was previously studied with the JIT optimization, the
- JIT information cannot be saved and restored, and so is lost. When the
- pattern has been loaded, pcretest proceeds to read data lines in the
+ If the pattern was previously studied with the JIT optimization, the
+ JIT information cannot be saved and restored, and so is lost. When the
+ pattern has been loaded, pcretest proceeds to read data lines in the
usual way.
- You can copy a file written by pcretest to a different host and reload
- it there, even if the new host has opposite endianness to the one on
- which the pattern was compiled. For example, you can compile on an i86
+ You can copy a file written by pcretest to a different host and reload
+ it there, even if the new host has opposite endianness to the one on
+ which the pattern was compiled. For example, you can compile on an i86
machine and run on a SPARC machine.
- File names for saving and reloading can be absolute or relative, but
- note that the shell facility of expanding a file name that starts with
+ File names for saving and reloading can be absolute or relative, but
+ note that the shell facility of expanding a file name that starts with
a tilde (~) is not available.
- The ability to save and reload files in pcretest is intended for test-
- ing and experimentation. It is not intended for production use because
- only a single pattern can be written to a file. Furthermore, there is
- no facility for supplying custom character tables for use with a
- reloaded pattern. If the original pattern was compiled with custom
- tables, an attempt to match a subject string using a reloaded pattern
- is likely to cause pcretest to crash. Finally, if you attempt to load
+ The ability to save and reload files in pcretest is intended for test-
+ ing and experimentation. It is not intended for production use because
+ only a single pattern can be written to a file. Furthermore, there is
+ no facility for supplying custom character tables for use with a
+ reloaded pattern. If the original pattern was compiled with custom
+ tables, an attempt to match a subject string using a reloaded pattern
+ is likely to cause pcretest to crash. Finally, if you attempt to load
a file that is not in the correct format, the result is undefined.
@@ -807,5 +810,5 @@ AUTHOR
REVISION
- Last updated: 26 August 2011
+ Last updated: 02 December 2011
Copyright (c) 1997-2011 University of Cambridge.
diff --git a/testdata/testoutput15 b/testdata/testoutput15
index 944659b..10fbc54 100644
--- a/testdata/testoutput15
+++ b/testdata/testoutput15
@@ -17,6 +17,5 @@ No options
No first char
No need char
Study returned NULL
-JIT support is not available in this version of PCRE
/-- End of testinput15 --/