summaryrefslogtreecommitdiff
path: root/ext/pcre/pcrelib
diff options
context:
space:
mode:
authorIlia Alshanetsky <iliaa@php.net>2010-12-24 15:03:29 +0000
committerIlia Alshanetsky <iliaa@php.net>2010-12-24 15:03:29 +0000
commit34c6447e9e8bc32462cd42d5aa6b0c99bff5052e (patch)
tree1baba7850d869cf79918437a6029d2e7bf007279 /ext/pcre/pcrelib
parentcd34d68cddda0ab3e5d8b6ca3a641545dcff6bc6 (diff)
downloadphp-git-34c6447e9e8bc32462cd42d5aa6b0c99bff5052e.tar.gz
Upgraded\ bundled\ PCRE\ to\ version\ 8.11.
Diffstat (limited to 'ext/pcre/pcrelib')
-rw-r--r--ext/pcre/pcrelib/ChangeLog130
-rw-r--r--ext/pcre/pcrelib/HACKING34
-rw-r--r--ext/pcre/pcrelib/NEWS21
-rw-r--r--ext/pcre/pcrelib/config.h7
-rw-r--r--ext/pcre/pcrelib/doc/pcre.txt2632
-rw-r--r--ext/pcre/pcrelib/pcre.h78
-rw-r--r--ext/pcre/pcrelib/pcre_compile.c158
-rw-r--r--ext/pcre/pcrelib/pcre_exec.c126
-rw-r--r--ext/pcre/pcrelib/pcre_internal.h222
-rw-r--r--ext/pcre/pcrelib/pcre_printint.src17
-rw-r--r--ext/pcre/pcrelib/pcre_study.c5
-rw-r--r--ext/pcre/pcrelib/pcre_valid_utf8.c17
-rw-r--r--ext/pcre/pcrelib/pcredemo.c76
-rw-r--r--ext/pcre/pcrelib/pcreposix.c1
14 files changed, 2075 insertions, 1449 deletions
diff --git a/ext/pcre/pcrelib/ChangeLog b/ext/pcre/pcrelib/ChangeLog
index b2fef0dab7..32cd0029b8 100644
--- a/ext/pcre/pcrelib/ChangeLog
+++ b/ext/pcre/pcrelib/ChangeLog
@@ -1,6 +1,136 @@
ChangeLog for PCRE
------------------
+Version 8.11 10-Dec-2010
+------------------------
+
+1. (*THEN) was not working properly if there were untried alternatives prior
+ to it in the current branch. For example, in ((a|b)(*THEN)(*F)|c..) it
+ backtracked to try for "b" instead of moving to the next alternative branch
+ at the same level (in this case, to look for "c"). The Perl documentation
+ is clear that when (*THEN) is backtracked onto, it goes to the "next
+ alternative in the innermost enclosing group".
+
+2. (*COMMIT) was not overriding (*THEN), as it does in Perl. In a pattern
+ such as (A(*COMMIT)B(*THEN)C|D) any failure after matching A should
+ result in overall failure. Similarly, (*COMMIT) now overrides (*PRUNE) and
+ (*SKIP), (*SKIP) overrides (*PRUNE) and (*THEN), and (*PRUNE) overrides
+ (*THEN).
+
+3. If \s appeared in a character class, it removed the VT character from
+ the class, even if it had been included by some previous item, for example
+ in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part
+ of \s, but is part of the POSIX "space" class.)
+
+4. A partial match never returns an empty string (because you can always
+ match an empty string at the end of the subject); however the checking for
+ an empty string was starting at the "start of match" point. This has been
+ changed to the "earliest inspected character" point, because the returned
+ data for a partial match starts at this character. This means that, for
+ example, /(?<=abc)def/ gives a partial match for the subject "abc"
+ (previously it gave "no match").
+
+5. Changes have been made to the way PCRE_PARTIAL_HARD affects the matching
+ of $, \z, \Z, \b, and \B. If the match point is at the end of the string,
+ previously a full match would be given. However, setting PCRE_PARTIAL_HARD
+ has an implication that the given string is incomplete (because a partial
+ match is preferred over a full match). For this reason, these items now
+ give a partial match in this situation. [Aside: previously, the one case
+ /t\b/ matched against "cat" with PCRE_PARTIAL_HARD set did return a partial
+ match rather than a full match, which was wrong by the old rules, but is
+ now correct.]
+
+6. There was a bug in the handling of #-introduced comments, recognized when
+ PCRE_EXTENDED is set, when PCRE_NEWLINE_ANY and PCRE_UTF8 were also set.
+ If a UTF-8 multi-byte character included the byte 0x85 (e.g. +U0445, whose
+ UTF-8 encoding is 0xd1,0x85), this was misinterpreted as a newline when
+ scanning for the end of the comment. (*Character* 0x85 is an "any" newline,
+ but *byte* 0x85 is not, in UTF-8 mode). This bug was present in several
+ places in pcre_compile().
+
+7. Related to (6) above, when pcre_compile() was skipping #-introduced
+ comments when looking ahead for named forward references to subpatterns,
+ the only newline sequence it recognized was NL. It now handles newlines
+ according to the set newline convention.
+
+8. SunOS4 doesn't have strerror() or strtoul(); pcregrep dealt with the
+ former, but used strtoul(), whereas pcretest avoided strtoul() but did not
+ cater for a lack of strerror(). These oversights have been fixed.
+
+9. Added --match-limit and --recursion-limit to pcregrep.
+
+10. Added two casts needed to build with Visual Studio when NO_RECURSE is set.
+
+11. When the -o option was used, pcregrep was setting a return code of 1, even
+ when matches were found, and --line-buffered was not being honoured.
+
+12. Added an optional parentheses number to the -o and --only-matching options
+ of pcregrep.
+
+13. Imitating Perl's /g action for multiple matches is tricky when the pattern
+ can match an empty string. The code to do it in pcretest and pcredemo
+ needed fixing:
+
+ (a) When the newline convention was "crlf", pcretest got it wrong, skipping
+ only one byte after an empty string match just before CRLF (this case
+ just got forgotten; "any" and "anycrlf" were OK).
+
+ (b) The pcretest code also had a bug, causing it to loop forever in UTF-8
+ mode when an empty string match preceded an ASCII character followed by
+ a non-ASCII character. (The code for advancing by one character rather
+ than one byte was nonsense.)
+
+ (c) The pcredemo.c sample program did not have any code at all to handle
+ the cases when CRLF is a valid newline sequence.
+
+14. Neither pcre_exec() nor pcre_dfa_exec() was checking that the value given
+ as a starting offset was within the subject string. There is now a new
+ error, PCRE_ERROR_BADOFFSET, which is returned if the starting offset is
+ negative or greater than the length of the string. In order to test this,
+ pcretest is extended to allow the setting of negative starting offsets.
+
+15. In both pcre_exec() and pcre_dfa_exec() the code for checking that the
+ starting offset points to the beginning of a UTF-8 character was
+ unnecessarily clumsy. I tidied it up.
+
+16. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
+ bad UTF-8 sequence and one that is incomplete when using PCRE_PARTIAL_HARD.
+
+17. Nobody had reported that the --include_dir option, which was added in
+ release 7.7 should have been called --include-dir (hyphen, not underscore)
+ for compatibility with GNU grep. I have changed it to --include-dir, but
+ left --include_dir as an undocumented synonym, and the same for
+ --exclude-dir, though that is not available in GNU grep, at least as of
+ release 2.5.4.
+
+18. At a user's suggestion, the macros GETCHAR and friends (which pick up UTF-8
+ characters from a string of bytes) have been redefined so as not to use
+ loops, in order to improve performance in some environments. At the same
+ time, I abstracted some of the common code into auxiliary macros to save
+ repetition (this should not affect the compiled code).
+
+19. If \c was followed by a multibyte UTF-8 character, bad things happened. A
+ compile-time error is now given if \c is not followed by an ASCII
+ character, that is, a byte less than 128. (In EBCDIC mode, the code is
+ different, and any byte value is allowed.)
+
+20. Recognize (*NO_START_OPT) at the start of a pattern to set the PCRE_NO_
+ START_OPTIMIZE option, which is now allowed at compile time - but just
+ passed through to pcre_exec() or pcre_dfa_exec(). This makes it available
+ to pcregrep and other applications that have no direct access to PCRE
+ options. The new /Y option in pcretest sets this option when calling
+ pcre_compile().
+
+21. Change 18 of release 8.01 broke the use of named subpatterns for recursive
+ back references. Groups containing recursive back references were forced to
+ be atomic by that change, but in the case of named groups, the amount of
+ memory required was incorrectly computed, leading to "Failed: internal
+ error: code overflow". This has been fixed.
+
+22. Some patches to pcre_stringpiece.h, pcre_stringpiece_unittest.cc, and
+ pcretest.c, to avoid build problems in some Borland environments.
+
+
Version 8.10 25-Jun-2010
------------------------
diff --git a/ext/pcre/pcrelib/HACKING b/ext/pcre/pcrelib/HACKING
index 8c6a42de39..ee0913214c 100644
--- a/ext/pcre/pcrelib/HACKING
+++ b/ext/pcre/pcrelib/HACKING
@@ -4,6 +4,7 @@ Technical Notes about PCRE
These are very rough technical notes that record potentially useful information
about PCRE internals.
+
Historical note 1
-----------------
@@ -22,6 +23,7 @@ the one matching the longest subset of the subject string was chosen. This did
not necessarily maximize the individual wild portions of the pattern, as is
expected in Unix and Perl-style regular expressions.
+
Historical note 2
-----------------
@@ -34,6 +36,7 @@ maximizing (or, optionally, minimizing in Perl) the amount of the subject that
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.
+
OK, here's the real stuff
-------------------------
@@ -44,6 +47,7 @@ in the pattern, to save on compiling time. However, because of the greater
complexity in Perl regular expressions, I couldn't do this. In any case, a
first pass through the pattern is helpful for other reasons.
+
Computing the memory requirement: how it was
--------------------------------------------
@@ -54,6 +58,7 @@ idea was that this would turn out faster than the Henry Spencer code because
the first pass is degenerate and the second pass can just store stuff straight
into the vector, which it knows is big enough.
+
Computing the memory requirement: how it is
-------------------------------------------
@@ -75,6 +80,7 @@ runs more slowly than before (30% or more, depending on the pattern) because it
is doing a full analysis of the pattern. My hope was that this would not be a
big issue, and in the event, nobody has commented on it.
+
Traditional matching function
-----------------------------
@@ -84,6 +90,7 @@ and the way that Perl works. This is not surprising, since it is intended to be
as compatible with Perl as possible. This is the function most users of PCRE
will use most of the time.
+
Supplementary matching function
-------------------------------
@@ -119,7 +126,6 @@ quantifiers) are always just two bytes long.
A list of the opcodes follows:
-
Opcodes with no following data
------------------------------
@@ -151,12 +157,24 @@ These items are all just one byte long
OP_EXTUNI match an extended Unicode character
OP_ANYNL match any Unicode newline sequence
- OP_ACCEPT ) These are Perl 5.10's "backtracking
- OP_COMMIT ) control verbs". If OP_ACCEPT is inside
- OP_FAIL ) capturing parentheses, it may be preceded
- OP_PRUNE ) by one or more OP_CLOSE, followed by a 2-byte
- OP_SKIP ) number, indicating which parentheses must be
- OP_THEN ) closed.
+ OP_ACCEPT ) These are Perl 5.10's "backtracking control
+ OP_COMMIT ) verbs". If OP_ACCEPT is inside capturing
+ OP_FAIL ) parentheses, it may be preceded by one or more
+ OP_PRUNE ) OP_CLOSE, followed by a 2-byte number,
+ OP_SKIP ) indicating which parentheses must be closed.
+
+
+Backtracking control verbs with data
+------------------------------------
+
+OP_THEN is followed by a LINK_SIZE offset, which is the distance back to the
+start of the current branch.
+
+OP_MARK is followed by the mark name, preceded by a one-byte length, and
+followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
+the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used. For the first
+two, the name follows immediately; for OP_THEN_ARG, it follows the LINK_SIZE
+offset value.
Repeating single characters
@@ -419,4 +437,4 @@ at compile time, and so does not cause anything to be put into the compiled
data.
Philip Hazel
-October 2009
+October 2010
diff --git a/ext/pcre/pcrelib/NEWS b/ext/pcre/pcrelib/NEWS
index 4f21e0bdaf..fdc4ba1e67 100644
--- a/ext/pcre/pcrelib/NEWS
+++ b/ext/pcre/pcrelib/NEWS
@@ -1,6 +1,27 @@
News about PCRE releases
------------------------
+Release 8.11 10-Dec-2010
+------------------------
+
+A number of bugs in the library and in pcregrep have been fixed. As always, see
+ChangeLog for details. The following are the non-bug-fix changes:
+
+. Added --match-limit and --recursion-limit to pcregrep.
+
+. Added an optional parentheses number to the -o and --only-matching options
+ of pcregrep.
+
+. Changed the way PCRE_PARTIAL_HARD affects the matching of $, \z, \Z, \b, and
+ \B.
+
+. Added PCRE_ERROR_SHORTUTF8 to make it possible to distinguish between a
+ bad UTF-8 sequence and one that is incomplete when using PCRE_PARTIAL_HARD.
+
+. Recognize (*NO_START_OPT) at the start of a pattern to set the PCRE_NO_
+ START_OPTIMIZE option, which is now allowed at compile time
+
+
Release 8.10 25-Jun-2010
------------------------
diff --git a/ext/pcre/pcrelib/config.h b/ext/pcre/pcrelib/config.h
index 43bfa44053..b7e02b326f 100644
--- a/ext/pcre/pcrelib/config.h
+++ b/ext/pcre/pcrelib/config.h
@@ -23,6 +23,7 @@
# define PCRE_EXP_DATA_DEFN __attribute__ ((visibility("default")))
#endif
+
/* Exclude these below definitions when building within PHP */
#ifndef ZEND_API
@@ -281,7 +282,7 @@ them both to 0; an emulation function will be used. */
#define PACKAGE_NAME "PCRE"
/* Define to the full name and version of this package. */
-#define PACKAGE_STRING "PCRE 8.10"
+#define PACKAGE_STRING "PCRE 8.11"
/* Define to the one symbol short name of this package. */
#define PACKAGE_TARNAME "pcre"
@@ -290,7 +291,7 @@ them both to 0; an emulation function will be used. */
#define PACKAGE_URL ""
/* Define to the version of this package. */
-#define PACKAGE_VERSION "8.10"
+#define PACKAGE_VERSION "8.11"
/* If you are compiling for a system other than a Unix-like system or
@@ -346,7 +347,7 @@ them both to 0; an emulation function will be used. */
/* Version number of package */
#ifndef VERSION
-#define VERSION "8.10"
+#define VERSION "8.11"
#endif
/* Define to empty if `const' does not conform to ANSI C. */
diff --git a/ext/pcre/pcrelib/doc/pcre.txt b/ext/pcre/pcrelib/doc/pcre.txt
index 5f65871425..5d769dfb0c 100644
--- a/ext/pcre/pcrelib/doc/pcre.txt
+++ b/ext/pcre/pcrelib/doc/pcre.txt
@@ -26,8 +26,8 @@ INTRODUCTION
give better JavaScript compatibility.
The current implementation of PCRE corresponds approximately with Perl
- 5.10/5.11, including support for UTF-8 encoded strings and Unicode gen-
- eral category properties. However, UTF-8 and Unicode support has to be
+ 5.12, including support for UTF-8 encoded strings and Unicode general
+ category properties. However, UTF-8 and Unicode support has to be
explicitly enabled; it is not the default. The Unicode tables corre-
spond to Unicode release 5.2.0.
@@ -226,32 +226,31 @@ UTF-8 AND UNICODE PROPERTY SUPPORT
PCRE recognizes as digits, spaces, or word characters remain the same
set as before, all with values less than 256. This remains true even
when PCRE is built to include Unicode property support, because to do
- otherwise would slow down PCRE in many common cases. Note that this
- also applies to \b, because it is defined in terms of \w and \W. If you
- really want to test for a wider sense of, say, "digit", you can use
- explicit Unicode property tests such as \p{Nd}. Alternatively, if you
- set the PCRE_UCP option, the way that the character escapes work is
- changed so that Unicode properties are used to determine which charac-
- ters match. There are more details in the section on generic character
- types in the pcrepattern documentation.
+ otherwise would slow down PCRE in many common cases. Note in particular
+ that this applies to \b and \B, because they are defined in terms of \w
+ and \W. If you really want to test for a wider sense of, say, "digit",
+ you can use explicit Unicode property tests such as \p{Nd}. Alterna-
+ tively, if you set the PCRE_UCP option, the way that the character
+ escapes work is changed so that Unicode properties are used to deter-
+ mine which characters match. There are more details in the section on
+ generic character types in the pcrepattern documentation.
7. Similarly, characters that match the POSIX named character classes
are all low-valued characters, unless the PCRE_UCP option is set.
- 8. However, the Perl 5.10 horizontal and vertical whitespace matching
- escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
- acters, whether or not PCRE_UCP is set.
+ 8. However, the horizontal and vertical whitespace matching escapes
+ (\h, \H, \v, and \V) do match all the appropriate Unicode characters,
+ whether or not PCRE_UCP is set.
9. Case-insensitive matching applies only to characters whose values
are less than 128, unless PCRE is built with Unicode property support.
Even when Unicode property support is available, PCRE still uses its
own character tables when checking the case of low-valued characters,
so as not to degrade performance. The Unicode property information is
- used only for characters with higher values. Even when Unicode property
- support is available, PCRE supports case-insensitive matching only when
- there is a one-to-one mapping between a letter's cases. There are a
- small number of many-to-one mappings in Unicode; these are not sup-
- ported by PCRE.
+ used only for characters with higher values. Furthermore, PCRE supports
+ case-insensitive matching only when there is a one-to-one mapping
+ between a letter's cases. There are a small number of many-to-one map-
+ pings in Unicode; these are not supported by PCRE.
AUTHOR
@@ -260,14 +259,14 @@ AUTHOR
University Computing Service
Cambridge CB2 3QH, England.
- Putting an actual email address here seems to have been a spam magnet,
- so I've taken it away. If you want to email me, use my two initials,
+ Putting an actual email address here seems to have been a spam magnet,
+ so I've taken it away. If you want to email me, use my two initials,
followed by the two digits 10, at the domain cam.ac.uk.
REVISION
- Last updated: 12 May 2010
+ Last updated: 13 November 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -697,82 +696,86 @@ THE ALTERNATIVE MATCHING ALGORITHM
represent the different matching possibilities (if there are none, the
match has failed). Thus, if there is more than one possible match,
this algorithm finds all of them, and in particular, it finds the long-
- est. There is an option to stop the algorithm after the first match
- (which is necessarily the shortest) is found.
+ est. The matches are returned in decreasing order of length. There is
+ an option to stop the algorithm after the first match (which is neces-
+ sarily the shortest) is found.
Note that all the matches that are found start at the same point in the
subject. If the pattern
- cat(er(pillar)?)
+ cat(er(pillar)?)?
- is matched against the string "the caterpillar catchment", the result
- will be the three strings "cat", "cater", and "caterpillar" that start
- at the fourth character of the subject. The algorithm does not automat-
- ically move on to find matches that start at later positions.
+ is matched against the string "the caterpillar catchment", the result
+ will be the three strings "caterpillar", "cater", and "cat" that start
+ at the fifth character of the subject. The algorithm does not automati-
+ cally move on to find matches that start at later positions.
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
- 1. Because the algorithm finds all possible matches, the greedy or
- ungreedy nature of repetition quantifiers is not relevant. Greedy and
+ 1. Because the algorithm finds all possible matches, the greedy or
+ ungreedy nature of repetition quantifiers is not relevant. Greedy and
ungreedy quantifiers are treated in exactly the same way. However, pos-
- sessive quantifiers can make a difference when what follows could also
+ sessive quantifiers can make a difference when what follows could also
match what is quantified, for example in a pattern like this:
^a++\w!
- This pattern matches "aaab!" but not "aaa!", which would be matched by
- a non-possessive quantifier. Similarly, if an atomic group is present,
- it is matched as if it were a standalone pattern at the current point,
- and the longest match is then "locked in" for the rest of the overall
+ This pattern matches "aaab!" but not "aaa!", which would be matched by
+ a non-possessive quantifier. Similarly, if an atomic group is present,
+ it is matched as if it were a standalone pattern at the current point,
+ and the longest match is then "locked in" for the rest of the overall
pattern.
2. When dealing with multiple paths through the tree simultaneously, it
- is not straightforward to keep track of captured substrings for the
- different matching possibilities, and PCRE's implementation of this
+ is not straightforward to keep track of captured substrings for the
+ different matching possibilities, and PCRE's implementation of this
algorithm does not attempt to do this. This means that no captured sub-
strings are available.
- 3. Because no substrings are captured, back references within the pat-
+ 3. Because no substrings are captured, back references within the pat-
tern are not supported, and cause errors if encountered.
- 4. For the same reason, conditional expressions that use a backrefer-
- ence as the condition or test for a specific group recursion are not
+ 4. For the same reason, conditional expressions that use a backrefer-
+ ence as the condition or test for a specific group recursion are not
supported.
- 5. Because many paths through the tree may be active, the \K escape
+ 5. Because many paths through the tree may be active, the \K escape
sequence, which resets the start of the match when encountered (but may
- be on some paths and not on others), is not supported. It causes an
+ be on some paths and not on others), is not supported. It causes an
error if encountered.
- 6. Callouts are supported, but the value of the capture_top field is
+ 6. Callouts are supported, but the value of the capture_top field is
always 1, and the value of the capture_last field is always -1.
- 7. The \C escape sequence, which (in the standard algorithm) matches a
- single byte, even in UTF-8 mode, is not supported because the alterna-
- tive algorithm moves through the subject string one character at a
+ 7. The \C escape sequence, which (in the standard algorithm) matches a
+ single byte, even in UTF-8 mode, is not supported because the alterna-
+ tive algorithm moves through the subject string one character at a
time, for all active paths through the tree.
- 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
- are not supported. (*FAIL) is supported, and behaves like a failing
+ 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
+ are not supported. (*FAIL) is supported, and behaves like a failing
negative assertion.
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
- Using the alternative matching algorithm provides the following advan-
+ Using the alternative matching algorithm provides the following advan-
tages:
1. All possible matches (at a single point in the subject) are automat-
- ically found, and in particular, the longest match is found. To find
+ ically found, and in particular, the longest match is found. To find
more than one match using the standard algorithm, you have to do kludgy
things with callouts.
- 2. Because the alternative algorithm scans the subject string just
- once, and never needs to backtrack, it is possible to pass very long
- subject strings to the matching function in several pieces, checking
- for partial matching each time. The pcrepartial documentation gives
- details of partial matching.
+ 2. Because the alternative algorithm scans the subject string just
+ once, and never needs to backtrack, it is possible to pass very long
+ subject strings to the matching function in several pieces, checking
+ for partial matching each time. Although it is possible to do multi-
+ segment matching using the standard algorithm (pcre_exec()), by retain-
+ ing partially matched substrings, it is more complicated. The pcrepar-
+ tial documentation gives details of partial matching and discusses
+ multi-segment matching.
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
@@ -798,8 +801,8 @@ AUTHOR
REVISION
- Last updated: 29 September 2009
- Copyright (c) 1997-2009 University of Cambridge.
+ Last updated: 17 November 2010
+ Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -1162,34 +1165,39 @@ COMPILING A PATTERN
pcrepattern documentation). For those options that can be different in
different parts of the pattern, the contents of the options argument
specifies their settings at the start of compilation and execution. The
- PCRE_ANCHORED, PCRE_BSR_xxx, and PCRE_NEWLINE_xxx options can be set at
- the time of matching as well as at compile time.
+ PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
+ PCRE_NO_START_OPT options can be set at the time of matching as well as
+ at compile time.
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
- if compilation of a pattern fails, pcre_compile() returns NULL, and
+ if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error mes-
sage. This is a static string that is part of the library. You must not
- try to free it. The byte offset from the start of the pattern to the
- character that was being processed when the error was discovered is
- placed in the variable pointed to by erroffset, which must not be NULL.
- If it is, an immediate error is given. Some errors are not detected
- until checks are carried out when the whole pattern has been scanned;
- in this case the offset is set to the end of the pattern.
-
- If pcre_compile2() is used instead of pcre_compile(), and the error-
- codeptr argument is not NULL, a non-zero error code number is returned
- via this argument in the event of an error. This is in addition to the
+ try to free it. The offset from the start of the pattern to the byte
+ that was being processed when the error was discovered is placed in the
+ variable pointed to by erroffset, which must not be NULL. If it is, an
+ immediate error is given. Some errors are not detected until checks are
+ carried out when the whole pattern has been scanned; in this case the
+ offset is set to the end of the pattern.
+
+ Note that the offset is in bytes, not characters, even in UTF-8 mode.
+ It may point into the middle of a UTF-8 character (for example, when
+ PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string).
+
+ If pcre_compile2() is used instead of pcre_compile(), and the error-
+ codeptr argument is not NULL, a non-zero error code number is returned
+ via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below.
- If the final argument, tableptr, is NULL, PCRE uses a default set of
- character tables that are built when PCRE is compiled, using the
- default C locale. Otherwise, tableptr must be an address that is the
- result of a call to pcre_maketables(). This value is stored with the
- compiled pattern, and used again by pcre_exec(), unless another table
+ If the final argument, tableptr, is NULL, PCRE uses a default set of
+ character tables that are built when PCRE is compiled, using the
+ default C locale. Otherwise, tableptr must be an address that is the
+ result of a call to pcre_maketables(). This value is stored with the
+ compiled pattern, and used again by pcre_exec(), unless another table
pointer is passed to it. For more discussion, see the section on locale
support below.
- This code fragment shows a typical straightforward call to pcre_com-
+ This code fragment shows a typical straightforward call to pcre_com-
pile():
pcre *re;
@@ -1202,86 +1210,95 @@ COMPILING A PATTERN
&erroffset, /* for error offset */
NULL); /* use default character tables */
- The following names for option bits are defined in the pcre.h header
+ The following names for option bits are defined in the pcre.h header
file:
PCRE_ANCHORED
If this bit is set, the pattern is forced to be "anchored", that is, it
- is constrained to match only at the first matching point in the string
- that is being searched (the "subject string"). This effect can also be
- achieved by appropriate constructs in the pattern itself, which is the
+ is constrained to match only at the first matching point in the string
+ that is being searched (the "subject string"). This effect can also be
+ achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.
PCRE_AUTO_CALLOUT
If this bit is set, pcre_compile() automatically inserts callout items,
- all with number 255, before each pattern item. For discussion of the
+ all with number 255, before each pattern item. For discussion of the
callout facility, see the pcrecallout documentation.
PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE
These options (which are mutually exclusive) control what the \R escape
- sequence matches. The choice is either to match only CR, LF, or CRLF,
+ sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. The default is specified when
PCRE is built. It can be overridden from within the pattern, or by set-
ting an option when a compiled pattern is matched.
PCRE_CASELESS
- If this bit is set, letters in the pattern match both upper and lower
- case letters. It is equivalent to Perl's /i option, and it can be
- changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
- always understands the concept of case for characters whose values are
- less than 128, so caseless matching is always possible. For characters
- with higher values, the concept of case is supported if PCRE is com-
- piled with Unicode property support, but not otherwise. If you want to
- use caseless matching for characters 128 and above, you must ensure
- that PCRE is compiled with Unicode property support as well as with
+ If this bit is set, letters in the pattern match both upper and lower
+ case letters. It is equivalent to Perl's /i option, and it can be
+ changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
+ always understands the concept of case for characters whose values are
+ less than 128, so caseless matching is always possible. For characters
+ with higher values, the concept of case is supported if PCRE is com-
+ piled with Unicode property support, but not otherwise. If you want to
+ use caseless matching for characters 128 and above, you must ensure
+ that PCRE is compiled with Unicode property support as well as with
UTF-8 support.
PCRE_DOLLAR_ENDONLY
- If this bit is set, a dollar metacharacter in the pattern matches only
- at the end of the subject string. Without this option, a dollar also
- matches immediately before a newline at the end of the string (but not
- before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
- if PCRE_MULTILINE is set. There is no equivalent to this option in
+ If this bit is set, a dollar metacharacter in the pattern matches only
+ at the end of the subject string. Without this option, a dollar also
+ matches immediately before a newline at the end of the string (but not
+ before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
+ if PCRE_MULTILINE is set. There is no equivalent to this option in
Perl, and no way to set it within a pattern.
PCRE_DOTALL
- If this bit is set, a dot metacharater in the pattern matches all char-
- acters, including those that indicate newline. Without it, a dot does
- not match when the current position is at a newline. This option is
- equivalent to Perl's /s option, and it can be changed within a pattern
- by a (?s) option setting. A negative class such as [^a] always matches
- newline characters, independent of the setting of this option.
+ If this bit is set, a dot metacharacter in the pattern matches a char-
+ acter of any value, including one that indicates a newline. However, it
+ only ever matches one character, even if newlines are coded as CRLF.
+ Without this option, a dot does not match when the current position is
+ at a newline. This option is equivalent to Perl's /s option, and it can
+ be changed within a pattern by a (?s) option setting. A negative class
+ such as [^a] always matches newline characters, independent of the set-
+ ting of this option.
PCRE_DUPNAMES
- If this bit is set, names used to identify capturing subpatterns need
+ If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
- is known that only one instance of the named subpattern can ever be
- matched. There are more details of named subpatterns below; see also
+ is known that only one instance of the named subpattern can ever be
+ matched. There are more details of named subpatterns below; see also
the pcrepattern documentation.
PCRE_EXTENDED
- If this bit is set, whitespace data characters in the pattern are
+ If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class. White-
space does not include the VT character (code 11). In addition, charac-
ters between an unescaped # outside a character class and the next new-
- line, inclusive, are also ignored. This is equivalent to Perl's /x
- option, and it can be changed within a pattern by a (?x) option set-
+ line, inclusive, are also ignored. This is equivalent to Perl's /x
+ option, and it can be changed within a pattern by a (?x) option set-
ting.
+ Which characters are interpreted as newlines is controlled by the
+ options passed to pcre_compile() or by a special sequence at the start
+ of the pattern, as described in the section entitled "Newline conven-
+ tions" in the pcrepattern documentation. Note that the end of this type
+ of comment is a literal newline sequence in the pattern; escape
+ sequences that happen to represent a newline do not count.
+
This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters.
Whitespace characters may never appear within special character
- sequences in a pattern, for example within the sequence (?( which
- introduces a conditional subpattern.
+ sequences in a pattern, for example within the sequence (?( that intro-
+ duces a conditional subpattern.
PCRE_EXTRA
@@ -1363,13 +1380,12 @@ COMPILING A PATTERN
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
cause an error.
- The only time that a line break is specially recognized when compiling
- a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
- character class is encountered. This indicates a comment that lasts
- until after the next line break sequence. In other circumstances, line
- break sequences are treated as literal data, except that in
- PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
- and are therefore ignored.
+ The only time that a line break in a pattern is specially recognized
+ when compiling is when PCRE_EXTENDED is set. CR and LF are whitespace
+ characters, and so are ignored in this mode. Also, an unescaped # out-
+ side a character class indicates a comment that lasts until after the
+ next line break sequence. In other circumstances, line break sequences
+ in patterns are treated as literal data.
The newline option that is set at compile time becomes the default that
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
@@ -1377,20 +1393,29 @@ COMPILING A PATTERN
PCRE_NO_AUTO_CAPTURE
If this option is set, it disables the use of numbered capturing paren-
- theses in the pattern. Any opening parenthesis that is not followed by
- ? behaves as if it were followed by ?: but named parentheses can still
- be used for capturing (and they acquire numbers in the usual way).
+ theses in the pattern. Any opening parenthesis that is not followed by
+ ? behaves as if it were followed by ?: but named parentheses can still
+ be used for capturing (and they acquire numbers in the usual way).
There is no equivalent of this option in Perl.
+ NO_START_OPTIMIZE
+
+ This is an option that acts at matching time; that is, it is really an
+ option for pcre_exec() or pcre_dfa_exec(). If it is set at compile
+ time, it is remembered with the compiled pattern and assumed at match-
+ ing time. For details see the discussion of PCRE_NO_START_OPTIMIZE
+ below.
+
PCRE_UCP
- This option changes the way PCRE processes \b, \d, \s, \w, and some of
- the POSIX character classes. By default, only ASCII characters are rec-
- ognized, but if PCRE_UCP is set, Unicode properties are used instead to
- classify characters. More details are given in the section on generic
- character types in the pcrepattern page. If you set PCRE_UCP, matching
- one of the items it affects takes much longer. The option is available
- only if PCRE has been compiled with Unicode property support.
+ This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
+ \w, and some of the POSIX character classes. By default, only ASCII
+ characters are recognized, but if PCRE_UCP is set, Unicode properties
+ are used instead to classify characters. More details are given in the
+ section on generic character types in the pcrepattern page. If you set
+ PCRE_UCP, matching one of the items it affects takes much longer. The
+ option is available only if PCRE has been compiled with Unicode prop-
+ erty support.
PCRE_UNGREEDY
@@ -1562,9 +1587,9 @@ STUDYING A PATTERN
The two optimizations just described can be disabled by setting the
PCRE_NO_START_OPTIMIZE option when calling pcre_exec() or
pcre_dfa_exec(). You might want to do this if your pattern contains
- callouts, or make use of (*MARK), and you make use of these in cases
- where matching fails. See the discussion of PCRE_NO_START_OPTIMIZE
- below.
+ callouts or (*MARK), and you want to make use of these facilities in
+ cases where matching fails. See the discussion of PCRE_NO_START_OPTI-
+ MIZE below.
LOCALE SUPPORT
@@ -2123,28 +2148,34 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that
fails, by advancing the starting offset (see below) and trying an ordi-
nary match again. There is some code that demonstrates how to do this
- in the pcredemo sample program.
+ in the pcredemo sample program. In the most general case, you have to
+ check to see if the newline convention recognizes CRLF as a newline,
+ and if so, and the current character is CR followed by LF, advance the
+ starting offset by two characters instead of one.
PCRE_NO_START_OPTIMIZE
- There are a number of optimizations that pcre_exec() uses at the start
- of a match, in order to speed up the process. For example, if it is
+ There are a number of optimizations that pcre_exec() uses at the start
+ of a match, in order to speed up the process. For example, if it is
known that an unanchored match must start with a specific character, it
- searches the subject for that character, and fails immediately if it
- cannot find it, without actually running the main matching function.
+ searches the subject for that character, and fails immediately if it
+ cannot find it, without actually running the main matching function.
This means that a special item such as (*COMMIT) at the start of a pat-
- tern is not considered until after a suitable starting point for the
- match has been found. When callouts or (*MARK) items are in use, these
+ tern is not considered until after a suitable starting point for the
+ match has been found. When callouts or (*MARK) items are in use, these
"start-up" optimizations can cause them to be skipped if the pattern is
- never actually used. The start-up optimizations are in effect a pre-
+ never actually used. The start-up optimizations are in effect a pre-
scan of the subject that takes place before the pattern is run.
- The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
- possibly causing performance to suffer, but ensuring that in cases
- where the result is "no match", the callouts do occur, and that items
+ The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
+ possibly causing performance to suffer, but ensuring that in cases
+ where the result is "no match", the callouts do occur, and that items
such as (*COMMIT) and (*MARK) are considered at every possible starting
- position in the subject string. Setting PCRE_NO_START_OPTIMIZE can
- change the outcome of a matching operation. Consider the pattern
+ position in the subject string. If PCRE_NO_START_OPTIMIZE is set at
+ compile time, it cannot be unset at matching time.
+
+ Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching
+ operation. Consider the pattern
(*COMMIT)ABC
@@ -2179,64 +2210,90 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
points to the start of a UTF-8 character. There is a discussion about
the validity of UTF-8 strings in the section on UTF-8 support in the
main pcre page. If an invalid UTF-8 sequence of bytes is found,
- pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con-
- tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
-
- If you already know that your subject is valid, and you want to skip
- these checks for performance reasons, you can set the
- PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
- do this for the second and subsequent calls to pcre_exec() if you are
- making repeated calls to find all the matches in a single subject
- string. However, you should be sure that the value of startoffset
- points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
- set, the effect of passing an invalid UTF-8 string as a subject, or a
- value of startoffset that does not point to the start of a UTF-8 char-
- acter, is undefined. Your program may crash.
+ pcre_exec() returns the error PCRE_ERROR_BADUTF8 or, if PCRE_PAR-
+ TIAL_HARD is set and the problem is a truncated UTF-8 character at the
+ end of the subject, PCRE_ERROR_SHORTUTF8. If startoffset contains a
+ value that does not point to the start of a UTF-8 character (or to the
+ end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
+
+ If you already know that your subject is valid, and you want to skip
+ these checks for performance reasons, you can set the
+ PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
+ do this for the second and subsequent calls to pcre_exec() if you are
+ making repeated calls to find all the matches in a single subject
+ string. However, you should be sure that the value of startoffset
+ points to the start of a UTF-8 character (or the end of the subject).
+ When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8
+ string as a subject or an invalid value of startoffset is undefined.
+ Your program may crash.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
- These options turn on the partial matching feature. For backwards com-
- patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
- match occurs if the end of the subject string is reached successfully,
- but there are not enough subject characters to complete the match. If
- this happens when PCRE_PARTIAL_HARD is set, pcre_exec() immediately
- returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set,
- matching continues by testing any other alternatives. Only if they all
- fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH).
- The portion of the string that was inspected when the partial match was
- found is set as the first matching string. There is a more detailed
- discussion in the pcrepartial documentation.
+ These options turn on the partial matching feature. For backwards com-
+ patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
+ match occurs if the end of the subject string is reached successfully,
+ but there are not enough subject characters to complete the match. If
+ this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
+ matching continues by testing any remaining alternatives. Only if no
+ complete match can be found is PCRE_ERROR_PARTIAL returned instead of
+ PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the
+ caller is prepared to handle a partial match, but only if no complete
+ match can be found.
+
+ If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this
+ case, if a partial match is found, pcre_exec() immediately returns
+ PCRE_ERROR_PARTIAL, without considering any other alternatives. In
+ other words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
+ ered to be more important that an alternative complete match.
+
+ In both cases, the portion of the string that was inspected when the
+ partial match was found is set as the first matching string. There is a
+ more detailed discussion of partial and multi-segment matching, with
+ examples, in the pcrepartial documentation.
The string to be matched by pcre_exec()
- The subject string is passed to pcre_exec() as a pointer in subject, a
+ The subject string is passed to pcre_exec() as a pointer in subject, a
length (in bytes) in length, and a starting byte offset in startoffset.
- In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
- acter. Unlike the pattern string, the subject may contain binary zero
- bytes. When the starting offset is zero, the search for a match starts
- at the beginning of the subject, and this is by far the most common
- case.
-
- A non-zero starting offset is useful when searching for another match
- in the same subject by calling pcre_exec() again after a previous suc-
- cess. Setting startoffset differs from just passing over a shortened
- string and setting PCRE_NOTBOL in the case of a pattern that begins
+ If this is negative or greater than the length of the subject,
+ pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is
+ zero, the search for a match starts at the beginning of the subject,
+ and this is by far the most common case. In UTF-8 mode, the byte offset
+ must point to the start of a UTF-8 character (or the end of the sub-
+ ject). Unlike the pattern string, the subject may contain binary zero
+ bytes.
+
+ A non-zero starting offset is useful when searching for another match
+ in the same subject by calling pcre_exec() again after a previous suc-
+ cess. Setting startoffset differs from just passing over a shortened
+ string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern
\Biss\B
- which finds occurrences of "iss" in the middle of words. (\B matches
- only if the current position in the subject is not a word boundary.)
- When applied to the string "Mississipi" the first call to pcre_exec()
- finds the first occurrence. If pcre_exec() is called again with just
- the remainder of the subject, namely "issipi", it does not match,
+ which finds occurrences of "iss" in the middle of words. (\B matches
+ only if the current position in the subject is not a word boundary.)
+ When applied to the string "Mississipi" the first call to pcre_exec()
+ finds the first occurrence. If pcre_exec() is called again with just
+ the remainder of the subject, namely "issipi", it does not match,
because \B is always false at the start of the subject, which is deemed
- to be a word boundary. However, if pcre_exec() is passed the entire
+ to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second occur-
- rence of "iss" because it is able to look behind the starting point to
+ rence of "iss" because it is able to look behind the starting point to
discover that it is preceded by a letter.
+ Finding all the matches in a subject is tricky when the pattern can
+ match an empty string. It is possible to emulate Perl's /g behaviour by
+ first trying the match again at the same offset, with the
+ PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED options, and then if that
+ fails, advancing the starting offset and trying an ordinary match
+ again. There is some code that demonstrates how to do this in the pcre-
+ demo sample program. In the most general case, you have to check to see
+ if the newline convention recognizes CRLF as a newline, and if so, and
+ the current character is CR followed by LF, advance the starting offset
+ by two characters instead of one.
+
If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed
if the pattern does not require the match to be at the start of the
@@ -2309,9 +2366,15 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used
- capturing subpattern number is 1. However, you can refer to the offsets
- for the second and third capturing subpatterns if you wish (assuming
- the vector is large enough, of course).
+ capturing subpattern number is 1, and the offsets for for the second
+ and third capturing subpatterns (assuming the vector is large enough,
+ of course) are set to -1.
+
+ Note: Elements of ovector that do not correspond to capturing parenthe-
+ ses in the pattern are never changed. That is, if a pattern contains n
+ capturing parentheses, no more than ovector[0] to ovector[2n+1] are set
+ by pcre_exec(). The other elements retain whatever values they previ-
+ ously had.
Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.
@@ -2381,13 +2444,15 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
PCRE_ERROR_BADUTF8 (-10)
A string that contains an invalid UTF-8 byte sequence was passed as a
- subject.
+ subject. However, if PCRE_PARTIAL_HARD is set and the problem is a
+ truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORT-
+ UTF8 is used instead.
PCRE_ERROR_BADUTF8_OFFSET (-11)
The UTF-8 byte sequence that was passed as a subject was valid, but the
value of startoffset did not point to the beginning of a UTF-8 charac-
- ter.
+ ter or the end of the subject.
PCRE_ERROR_PARTIAL (-12)
@@ -2420,6 +2485,17 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION
An invalid combination of PCRE_NEWLINE_xxx options was given.
+ PCRE_ERROR_BADOFFSET (-24)
+
+ The value of startoffset was negative or greater than the length of the
+ subject, that is, the value in length.
+
+ PCRE_ERROR_SHORTUTF8 (-25)
+
+ The subject string ended with an incomplete (truncated) UTF-8 charac-
+ ter, and the PCRE_PARTIAL_HARD option was set. Without this option,
+ PCRE_ERROR_BADUTF8 is returned in this situation.
+
Error numbers -16 to -20 and -22 are not used by pcre_exec().
@@ -2436,78 +2512,78 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
int pcre_get_substring_list(const char *subject,
int *ovector, int stringcount, const char ***listptr);
- Captured substrings can be accessed directly by using the offsets
- returned by pcre_exec() in ovector. For convenience, the functions
+ Captured substrings can be accessed directly by using the offsets
+ returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
- string_list() are provided for extracting captured substrings as new,
- separate, zero-terminated strings. These functions identify substrings
- by number. The next section describes functions for extracting named
+ string_list() are provided for extracting captured substrings as new,
+ separate, zero-terminated strings. These functions identify substrings
+ by number. The next section describes functions for extracting named
substrings.
- A substring that contains a binary zero is correctly extracted and has
- a further zero added on the end, but the result is not, of course, a C
- string. However, you can process such a string by referring to the
- length that is returned by pcre_copy_substring() and pcre_get_sub-
+ A substring that contains a binary zero is correctly extracted and has
+ a further zero added on the end, but the result is not, of course, a C
+ string. However, you can process such a string by referring to the
+ length that is returned by pcre_copy_substring() and pcre_get_sub-
string(). Unfortunately, the interface to pcre_get_substring_list() is
- not adequate for handling strings containing binary zeros, because the
+ not adequate for handling strings containing binary zeros, because the
end of the final string is not independently indicated.
- The first three arguments are the same for all three of these func-
- tions: subject is the subject string that has just been successfully
+ The first three arguments are the same for all three of these func-
+ tions: subject is the subject string that has just been successfully
matched, ovector is a pointer to the vector of integer offsets that was
passed to pcre_exec(), and stringcount is the number of substrings that
- were captured by the match, including the substring that matched the
+ were captured by the match, including the substring that matched the
entire regular expression. This is the value returned by pcre_exec() if
- it is greater than zero. If pcre_exec() returned zero, indicating that
- it ran out of space in ovector, the value passed as stringcount should
+ it is greater than zero. If pcre_exec() returned zero, indicating that
+ it ran out of space in ovector, the value passed as stringcount should
be the number of elements in the vector divided by three.
- The functions pcre_copy_substring() and pcre_get_substring() extract a
- single substring, whose number is given as stringnumber. A value of
- zero extracts the substring that matched the entire pattern, whereas
- higher values extract the captured substrings. For pcre_copy_sub-
- string(), the string is placed in buffer, whose length is given by
- buffersize, while for pcre_get_substring() a new block of memory is
- obtained via pcre_malloc, and its address is returned via stringptr.
- The yield of the function is the length of the string, not including
+ The functions pcre_copy_substring() and pcre_get_substring() extract a
+ single substring, whose number is given as stringnumber. A value of
+ zero extracts the substring that matched the entire pattern, whereas
+ higher values extract the captured substrings. For pcre_copy_sub-
+ string(), the string is placed in buffer, whose length is given by
+ buffersize, while for pcre_get_substring() a new block of memory is
+ obtained via pcre_malloc, and its address is returned via stringptr.
+ The yield of the function is the length of the string, not including
the terminating zero, or one of these error codes:
PCRE_ERROR_NOMEMORY (-6)
- The buffer was too small for pcre_copy_substring(), or the attempt to
+ The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().
PCRE_ERROR_NOSUBSTRING (-7)
There is no substring whose number is stringnumber.
- The pcre_get_substring_list() function extracts all available sub-
- strings and builds a list of pointers to them. All this is done in a
+ The pcre_get_substring_list() function extracts all available sub-
+ strings and builds a list of pointers to them. All this is done in a
single block of memory that is obtained via pcre_malloc. The address of
- the memory block is returned via listptr, which is also the start of
- the list of string pointers. The end of the list is marked by a NULL
- pointer. The yield of the function is zero if all went well, or the
+ the memory block is returned via listptr, which is also the start of
+ the list of string pointers. The end of the list is marked by a NULL
+ pointer. The yield of the function is zero if all went well, or the
error code
PCRE_ERROR_NOMEMORY (-6)
if the attempt to get the memory block failed.
- When any of these functions encounter a substring that is unset, which
- can happen when capturing subpattern number n+1 matches some part of
- the subject, but subpattern n has not been used at all, they return an
+ When any of these functions encounter a substring that is unset, which
+ can happen when capturing subpattern number n+1 matches some part of
+ the subject, but subpattern n has not been used at all, they return an
empty string. This can be distinguished from a genuine zero-length sub-
- string by inspecting the appropriate offset in ovector, which is nega-
+ string by inspecting the appropriate offset in ovector, which is nega-
tive for unset substrings.
- The two convenience functions pcre_free_substring() and pcre_free_sub-
- string_list() can be used to free the memory returned by a previous
+ The two convenience functions pcre_free_substring() and pcre_free_sub-
+ string_list() can be used to free the memory returned by a previous
call of pcre_get_substring() or pcre_get_substring_list(), respec-
- tively. They do nothing more than call the function pointed to by
- pcre_free, which of course could be called directly from a C program.
- However, PCRE is used in some situations where it is linked via a spe-
- cial interface to another programming language that cannot use
- pcre_free directly; it is for these cases that the functions are pro-
+ tively. They do nothing more than call the function pointed to by
+ pcre_free, which of course could be called directly from a C program.
+ However, PCRE is used in some situations where it is linked via a spe-
+ cial interface to another programming language that cannot use
+ pcre_free directly; it is for these cases that the functions are pro-
vided.
@@ -2526,7 +2602,7 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
int stringcount, const char *stringname,
const char **stringptr);
- To extract a substring by name, you first have to find associated num-
+ To extract a substring by name, you first have to find associated num-
ber. For example, for this pattern
(a+)b(?<xxx>\d+)...
@@ -2535,35 +2611,35 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME
be unique (PCRE_DUPNAMES was not set), you can find the number from the
name by calling pcre_get_stringnumber(). The first argument is the com-
piled pattern, and the second is the name. The yield of the function is
- the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
+ the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
subpattern of that name.
Given the number, you can extract the substring directly, or use one of
the functions described in the previous section. For convenience, there
are also two functions that do the whole job.
- Most of the arguments of pcre_copy_named_substring() and
- pcre_get_named_substring() are the same as those for the similarly
- named functions that extract by number. As these are described in the
- previous section, they are not re-described here. There are just two
+ Most of the arguments of pcre_copy_named_substring() and
+ pcre_get_named_substring() are the same as those for the similarly
+ named functions that extract by number. As these are described in the
+ previous section, they are not re-described here. There are just two
differences:
- First, instead of a substring number, a substring name is given. Sec-
+ First, instead of a substring number, a substring name is given. Sec-
ond, there is an extra argument, given at the start, which is a pointer
- to the compiled pattern. This is needed in order to gain access to the
+ to the compiled pattern. This is needed in order to gain access to the
name-to-number translation table.
- These functions call pcre_get_stringnumber(), and if it succeeds, they
- then call pcre_copy_substring() or pcre_get_substring(), as appropri-
- ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
+ These functions call pcre_get_stringnumber(), and if it succeeds, they
+ then call pcre_copy_substring() or pcre_get_substring(), as appropri-
+ ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
behaviour may not be what you want (see the next section).
Warning: If the pattern uses the (?| feature to set up multiple subpat-
- terns with the same number, as described in the section on duplicate
- subpattern numbers in the pcrepattern page, you cannot use names to
- distinguish the different subpatterns, because names are not included
- in the compiled code. The matching process uses only numbers. For this
- reason, the use of different names for subpatterns of the same number
+ terns with the same number, as described in the section on duplicate
+ subpattern numbers in the pcrepattern page, you cannot use names to
+ distinguish the different subpatterns, because names are not included
+ in the compiled code. The matching process uses only numbers. For this
+ reason, the use of different names for subpatterns of the same number
causes an error at compile time.
@@ -2572,51 +2648,51 @@ DUPLICATE SUBPATTERN NAMES
int pcre_get_stringtable_entries(const pcre *code,
const char *name, char **first, char **last);
- When a pattern is compiled with the PCRE_DUPNAMES option, names for
- subpatterns are not required to be unique. (Duplicate names are always
- allowed for subpatterns with the same number, created by using the (?|
- feature. Indeed, if such subpatterns are named, they are required to
+ When a pattern is compiled with the PCRE_DUPNAMES option, names for
+ subpatterns are not required to be unique. (Duplicate names are always
+ allowed for subpatterns with the same number, created by using the (?|
+ feature. Indeed, if such subpatterns are named, they are required to
use the same names.)
Normally, patterns with duplicate names are such that in any one match,
- only one of the named subpatterns participates. An example is shown in
+ only one of the named subpatterns participates. An example is shown in
the pcrepattern documentation.
- When duplicates are present, pcre_copy_named_substring() and
- pcre_get_named_substring() return the first substring corresponding to
- the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
- (-7) is returned; no data is returned. The pcre_get_stringnumber()
- function returns one of the numbers that are associated with the name,
+ When duplicates are present, pcre_copy_named_substring() and
+ pcre_get_named_substring() return the first substring corresponding to
+ the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
+ (-7) is returned; no data is returned. The pcre_get_stringnumber()
+ function returns one of the numbers that are associated with the name,
but it is not defined which it is.
- If you want to get full details of all captured substrings for a given
- name, you must use the pcre_get_stringtable_entries() function. The
+ If you want to get full details of all captured substrings for a given
+ name, you must use the pcre_get_stringtable_entries() function. The
first argument is the compiled pattern, and the second is the name. The
- third and fourth are pointers to variables which are updated by the
+ third and fourth are pointers to variables which are updated by the
function. After it has run, they point to the first and last entries in
- the name-to-number table for the given name. The function itself
- returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
- there are none. The format of the table is described above in the sec-
- tion entitled Information about a pattern. Given all the relevant
- entries for the name, you can extract each of their numbers, and hence
+ the name-to-number table for the given name. The function itself
+ returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
+ there are none. The format of the table is described above in the sec-
+ tion entitled Information about a pattern. Given all the relevant
+ entries for the name, you can extract each of their numbers, and hence
the captured data, if any.
FINDING ALL POSSIBLE MATCHES
- The traditional matching function uses a similar algorithm to Perl,
+ The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in
- the subject. If you want to find all possible matches, or the longest
- possible match, consider using the alternative matching function (see
- below) instead. If you cannot use the alternative function, but still
- need to find all possible matches, you can kludge it up by making use
+ the subject. If you want to find all possible matches, or the longest
+ possible match, consider using the alternative matching function (see
+ below) instead. If you cannot use the alternative function, but still
+ need to find all possible matches, you can kludge it up by making use
of the callout facility, which is described in the pcrecallout documen-
tation.
What you have to do is to insert a callout right at the end of the pat-
- tern. When your callout function is called, extract and save the cur-
- rent matched substring. Then return 1, which forces pcre_exec() to
- backtrack and try other alternatives. Ultimately, when it runs out of
+ tern. When your callout function is called, extract and save the cur-
+ rent matched substring. Then return 1, which forces pcre_exec() to
+ backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
@@ -2627,26 +2703,26 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
int options, int *ovector, int ovecsize,
int *workspace, int wscount);
- The function pcre_dfa_exec() is called to match a subject string
- against a compiled pattern, using a matching algorithm that scans the
- subject string just once, and does not backtrack. This has different
- characteristics to the normal algorithm, and is not compatible with
- Perl. Some of the features of PCRE patterns are not supported. Never-
- theless, there are times when this kind of matching can be useful. For
- a discussion of the two matching algorithms, and a list of features
- that pcre_dfa_exec() does not support, see the pcrematching documenta-
+ The function pcre_dfa_exec() is called to match a subject string
+ against a compiled pattern, using a matching algorithm that scans the
+ subject string just once, and does not backtrack. This has different
+ characteristics to the normal algorithm, and is not compatible with
+ Perl. Some of the features of PCRE patterns are not supported. Never-
+ theless, there are times when this kind of matching can be useful. For
+ a discussion of the two matching algorithms, and a list of features
+ that pcre_dfa_exec() does not support, see the pcrematching documenta-
tion.
- The arguments for the pcre_dfa_exec() function are the same as for
+ The arguments for the pcre_dfa_exec() function are the same as for
pcre_exec(), plus two extras. The ovector argument is used in a differ-
- ent way, and this is described below. The other common arguments are
- used in the same way as for pcre_exec(), so their description is not
+ ent way, and this is described below. The other common arguments are
+ used in the same way as for pcre_exec(), so their description is not
repeated here.
- The two additional arguments provide workspace for the function. The
- workspace vector should contain at least 20 elements. It is used for
+ The two additional arguments provide workspace for the function. The
+ workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More
- workspace will be needed for patterns and subjects where there are a
+ workspace will be needed for patterns and subjects where there are a
lot of potential matches.
Here is an example of a simple call to pcre_dfa_exec():
@@ -2668,53 +2744,55 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
Option bits for pcre_dfa_exec()
- The unused bits of the options argument for pcre_dfa_exec() must be
- zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
+ The unused bits of the options argument for pcre_dfa_exec() must be
+ zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
- PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
- PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
- TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
- four of these are exactly the same as for pcre_exec(), so their
+ PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF,
+ PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
+ TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
+ four of these are exactly the same as for pcre_exec(), so their
description is not repeated here.
PCRE_PARTIAL_HARD
PCRE_PARTIAL_SOFT
- These have the same general effect as they do for pcre_exec(), but the
- details are slightly different. When PCRE_PARTIAL_HARD is set for
- pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
- ject is reached and there is still at least one matching possibility
+ These have the same general effect as they do for pcre_exec(), but the
+ details are slightly different. When PCRE_PARTIAL_HARD is set for
+ pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub-
+ ject is reached and there is still at least one matching possibility
that requires additional characters. This happens even if some complete
matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
- of the subject is reached, there have been no complete matches, but
- there is still at least one matching possibility. The portion of the
- string that was inspected when the longest partial match was found is
- set as the first matching string in both cases.
+ of the subject is reached, there have been no complete matches, but
+ there is still at least one matching possibility. The portion of the
+ string that was inspected when the longest partial match was found is
+ set as the first matching string in both cases. There is a more
+ detailed discussion of partial and multi-segment matching, with exam-
+ ples, in the pcrepartial documentation.
PCRE_DFA_SHORTEST
- Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
+ Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the alterna-
- tive algorithm works, this is necessarily the shortest possible match
+ tive algorithm works, this is necessarily the shortest possible match
at the first possible matching point in the subject string.
PCRE_DFA_RESTART
When pcre_dfa_exec() returns a partial match, it is possible to call it
- again, with additional subject characters, and have it continue with
- the same match. The PCRE_DFA_RESTART option requests this action; when
- it is set, the workspace and wscount options must reference the same
- vector as before because data about the match so far is left in them
+ again, with additional subject characters, and have it continue with
+ the same match. The PCRE_DFA_RESTART option requests this action; when
+ it is set, the workspace and wscount options must reference the same
+ vector as before because data about the match so far is left in them
after a partial match. There is more discussion of this facility in the
pcrepartial documentation.
Successful returns from pcre_dfa_exec()
- When pcre_dfa_exec() succeeds, it may have matched more than one sub-
+ When pcre_dfa_exec() succeeds, it may have matched more than one sub-
string in the subject. Note, however, that all the matches from one run
- of the function start at the same point in the subject. The shorter
- matches are all initial substrings of the longer matches. For example,
+ of the function start at the same point in the subject. The shorter
+ matches are all initial substrings of the longer matches. For example,
if the pattern
<.*>
@@ -2729,61 +2807,61 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
<something> <something else>
<something> <something else> <something further>
- On success, the yield of the function is a number greater than zero,
- which is the number of matched substrings. The substrings themselves
- are returned in ovector. Each string uses two elements; the first is
- the offset to the start, and the second is the offset to the end. In
- fact, all the strings have the same start offset. (Space could have
- been saved by giving this only once, but it was decided to retain some
- compatibility with the way pcre_exec() returns data, even though the
+ On success, the yield of the function is a number greater than zero,
+ which is the number of matched substrings. The substrings themselves
+ are returned in ovector. Each string uses two elements; the first is
+ the offset to the start, and the second is the offset to the end. In
+ fact, all the strings have the same start offset. (Space could have
+ been saved by giving this only once, but it was decided to retain some
+ compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.)
The strings are returned in reverse order of length; that is, the long-
- est matching string is given first. If there were too many matches to
- fit into ovector, the yield of the function is zero, and the vector is
+ est matching string is given first. If there were too many matches to
+ fit into ovector, the yield of the function is zero, and the vector is
filled with the longest matches.
Error returns from pcre_dfa_exec()
- The pcre_dfa_exec() function returns a negative number when it fails.
- Many of the errors are the same as for pcre_exec(), and these are
- described above. There are in addition the following errors that are
+ The pcre_dfa_exec() function returns a negative number when it fails.
+ Many of the errors are the same as for pcre_exec(), and these are
+ described above. There are in addition the following errors that are
specific to pcre_dfa_exec():
PCRE_ERROR_DFA_UITEM (-16)
- This return is given if pcre_dfa_exec() encounters an item in the pat-
- tern that it does not support, for instance, the use of \C or a back
+ This return is given if pcre_dfa_exec() encounters an item in the pat-
+ tern that it does not support, for instance, the use of \C or a back
reference.
PCRE_ERROR_DFA_UCOND (-17)
- This return is given if pcre_dfa_exec() encounters a condition item
- that uses a back reference for the condition, or a test for recursion
+ This return is given if pcre_dfa_exec() encounters a condition item
+ that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.
PCRE_ERROR_DFA_UMLIMIT (-18)
- This return is given if pcre_dfa_exec() is called with an extra block
+ This return is given if pcre_dfa_exec() is called with an extra block
that contains a setting of the match_limit field. This is not supported
(it is meaningless).
PCRE_ERROR_DFA_WSSIZE (-19)
- This return is given if pcre_dfa_exec() runs out of space in the
+ This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.
PCRE_ERROR_DFA_RECURSE (-20)
- When a recursive subpattern is processed, the matching function calls
- itself recursively, using private vectors for ovector and workspace.
- This error is given if the output vector is not large enough. This
+ When a recursive subpattern is processed, the matching function calls
+ itself recursively, using private vectors for ovector and workspace.
+ This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.
SEE ALSO
- pcrebuild(3), pcrecallout(3), pcrecpp(3)(3), pcrematching(3), pcrepar-
+ pcrebuild(3), pcrecallout(3), pcrecpp(3)(3), pcrematching(3), pcrepar-
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
@@ -2796,7 +2874,7 @@ AUTHOR
REVISION
- Last updated: 21 June 2010
+ Last updated: 21 November 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -2864,17 +2942,18 @@ MISSING CALLOUTS
patterns, if it has been scanned far enough.
You can disable these optimizations by passing the PCRE_NO_START_OPTI-
- MIZE option to pcre_exec() or pcre_dfa_exec(). This slows down the
- matching process, but does ensure that callouts such as the example
- above are obeyed.
+ MIZE option to pcre_compile(), pcre_exec(), or pcre_dfa_exec(), or by
+ starting the pattern with (*NO_START_OPT). This slows down the matching
+ process, but does ensure that callouts such as the example above are
+ obeyed.
THE CALLOUT INTERFACE
- During matching, when PCRE reaches a callout point, the external func-
- tion defined by pcre_callout is called (if it is set). This applies to
- both the pcre_exec() and the pcre_dfa_exec() matching functions. The
- only argument to the callout function is a pointer to a pcre_callout
+ During matching, when PCRE reaches a callout point, the external func-
+ tion defined by pcre_callout is called (if it is set). This applies to
+ both the pcre_exec() and the pcre_dfa_exec() matching functions. The
+ only argument to the callout function is a pointer to a pcre_callout
block. This structure contains the following fields:
int version;
@@ -2890,81 +2969,81 @@ THE CALLOUT INTERFACE
int pattern_position;
int next_item_length;
- The version field is an integer containing the version number of the
- block format. The initial version was 0; the current version is 1. The
- version number will change again in future if additional fields are
+ The version field is an integer containing the version number of the
+ block format. The initial version was 0; the current version is 1. The
+ version number will change again in future if additional fields are
added, but the intention is never to remove any of the existing fields.
- The callout_number field contains the number of the callout, as com-
- piled into the pattern (that is, the number after ?C for manual call-
+ The callout_number field contains the number of the callout, as com-
+ piled into the pattern (that is, the number after ?C for manual call-
outs, and 255 for automatically generated callouts).
- The offset_vector field is a pointer to the vector of offsets that was
- passed by the caller to pcre_exec() or pcre_dfa_exec(). When
- pcre_exec() is used, the contents can be inspected in order to extract
- substrings that have been matched so far, in the same way as for
- extracting substrings after a match has completed. For pcre_dfa_exec()
+ The offset_vector field is a pointer to the vector of offsets that was
+ passed by the caller to pcre_exec() or pcre_dfa_exec(). When
+ pcre_exec() is used, the contents can be inspected in order to extract
+ substrings that have been matched so far, in the same way as for
+ extracting substrings after a match has completed. For pcre_dfa_exec()
this field is not useful.
The subject and subject_length fields contain copies of the values that
were passed to pcre_exec().
- The start_match field normally contains the offset within the subject
- at which the current match attempt started. However, if the escape
- sequence \K has been encountered, this value is changed to reflect the
- modified starting point. If the pattern is not anchored, the callout
+ The start_match field normally contains the offset within the subject
+ at which the current match attempt started. However, if the escape
+ sequence \K has been encountered, this value is changed to reflect the
+ modified starting point. If the pattern is not anchored, the callout
function may be called several times from the same point in the pattern
for different starting points in the subject.
- The current_position field contains the offset within the subject of
+ The current_position field contains the offset within the subject of
the current match pointer.
- When the pcre_exec() function is used, the capture_top field contains
- one more than the number of the highest numbered captured substring so
- far. If no substrings have been captured, the value of capture_top is
- one. This is always the case when pcre_dfa_exec() is used, because it
+ When the pcre_exec() function is used, the capture_top field contains
+ one more than the number of the highest numbered captured substring so
+ far. If no substrings have been captured, the value of capture_top is
+ one. This is always the case when pcre_dfa_exec() is used, because it
does not support captured substrings.
- The capture_last field contains the number of the most recently cap-
- tured substring. If no substrings have been captured, its value is -1.
+ The capture_last field contains the number of the most recently cap-
+ tured substring. If no substrings have been captured, its value is -1.
This is always the case when pcre_dfa_exec() is used.
- The callout_data field contains a value that is passed to pcre_exec()
- or pcre_dfa_exec() specifically so that it can be passed back in call-
- outs. It is passed in the pcre_callout field of the pcre_extra data
- structure. If no such data was passed, the value of callout_data in a
- pcre_callout block is NULL. There is a description of the pcre_extra
+ The callout_data field contains a value that is passed to pcre_exec()
+ or pcre_dfa_exec() specifically so that it can be passed back in call-
+ outs. It is passed in the pcre_callout field of the pcre_extra data
+ structure. If no such data was passed, the value of callout_data in a
+ pcre_callout block is NULL. There is a description of the pcre_extra
structure in the pcreapi documentation.
- The pattern_position field is present from version 1 of the pcre_call-
+ The pattern_position field is present from version 1 of the pcre_call-
out structure. It contains the offset to the next item to be matched in
the pattern string.
- The next_item_length field is present from version 1 of the pcre_call-
+ The next_item_length field is present from version 1 of the pcre_call-
out structure. It contains the length of the next item to be matched in
- the pattern string. When the callout immediately precedes an alterna-
- tion bar, a closing parenthesis, or the end of the pattern, the length
- is zero. When the callout precedes an opening parenthesis, the length
+ the pattern string. When the callout immediately precedes an alterna-
+ tion bar, a closing parenthesis, or the end of the pattern, the length
+ is zero. When the callout precedes an opening parenthesis, the length
is that of the entire subpattern.
- The pattern_position and next_item_length fields are intended to help
- in distinguishing between different automatic callouts, which all have
+ The pattern_position and next_item_length fields are intended to help
+ in distinguishing between different automatic callouts, which all have
the same callout number. However, they are set for all callouts.
RETURN VALUES
- The external callout function returns an integer to PCRE. If the value
- is zero, matching proceeds as normal. If the value is greater than
- zero, matching fails at the current point, but the testing of other
+ The external callout function returns an integer to PCRE. If the value
+ is zero, matching proceeds as normal. If the value is greater than
+ zero, matching fails at the current point, but the testing of other
matching possibilities goes ahead, just as if a lookahead assertion had
- failed. If the value is less than zero, the match is abandoned, and
+ failed. If the value is less than zero, the match is abandoned, and
pcre_exec() or pcre_dfa_exec() returns the negative value.
- Negative values should normally be chosen from the set of
+ Negative values should normally be chosen from the set of
PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
- dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
- reserved for use by callout functions; it will never be used by PCRE
+ dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
+ reserved for use by callout functions; it will never be used by PCRE
itself.
@@ -2977,8 +3056,8 @@ AUTHOR
REVISION
- Last updated: 29 September 2009
- Copyright (c) 1997-2009 University of Cambridge.
+ Last updated: 21 November 2010
+ Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -2993,7 +3072,7 @@ DIFFERENCES BETWEEN PCRE AND PERL
This document describes the differences in the ways that PCRE and Perl
handle regular expressions. The differences described here are with
- respect to Perl 5.10/5.11.
+ respect to Perl versions 5.10 and above.
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
of what it does have are given in the section on UTF-8 support in the
@@ -3075,24 +3154,27 @@ DIFFERENCES BETWEEN PCRE AND PERL
turing subpattern number 1. To avoid this confusing situation, an error
is given at compile time.
- 12. PCRE provides some extensions to the Perl regular expression facil-
- ities. Perl 5.10 includes new features that are not in earlier ver-
- sions of Perl, some of which (such as named parentheses) have been in
+ 12. Perl recognizes comments in some places that PCRE doesn't, for
+ example, between the ( and ? at the start of a subpattern.
+
+ 13. PCRE provides some extensions to the Perl regular expression facil-
+ ities. Perl 5.10 includes new features that are not in earlier ver-
+ sions of Perl, some of which (such as named parentheses) have been in
PCRE for some time. This list is with respect to Perl 5.10:
- (a) Although lookbehind assertions in PCRE must match fixed length
- strings, each alternative branch of a lookbehind assertion can match a
- different length of string. Perl requires them all to have the same
+ (a) Although lookbehind assertions in PCRE must match fixed length
+ strings, each alternative branch of a lookbehind assertion can match a
+ different length of string. Perl requires them all to have the same
length.
- (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
+ (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
meta-character matches only at the very end of the string.
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
ignored. (Perl can be made to issue a warning.)
- (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
+ (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
fiers is inverted, that is, by default they are not greedy, but if fol-
lowed by a question mark they are.
@@ -3100,10 +3182,10 @@ DIFFERENCES BETWEEN PCRE AND PERL
tried only at the first matching position in the subject string.
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
- and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva-
+ and PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl equiva-
lents.
- (g) The \R escape sequence can be restricted to match only CR, LF, or
+ (g) The \R escape sequence can be restricted to match only CR, LF, or
CRLF by the PCRE_BSR_ANYCRLF option.
(h) The callout facility is PCRE-specific.
@@ -3113,10 +3195,10 @@ DIFFERENCES BETWEEN PCRE AND PERL
(j) Patterns compiled by PCRE can be saved and re-used at a later time,
even on different hosts that have the other endianness.
- (k) The alternative matching function (pcre_dfa_exec()) matches in a
+ (k) The alternative matching function (pcre_dfa_exec()) matches in a
different way and is not Perl-compatible.
- (l) PCRE recognizes some special sequences such as (*CR) at the start
+ (l) PCRE recognizes some special sequences such as (*CR) at the start
of a pattern that set overall options that cannot be changed within the
pattern.
@@ -3130,7 +3212,7 @@ AUTHOR
REVISION
- Last updated: 12 May 2010
+ Last updated: 31 October 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -3183,26 +3265,31 @@ PCRE REGULAR EXPRESSION DETAILS
character types, instead of recognizing only characters with codes less
than 128 via a lookup table.
- The remainder of this document discusses the patterns that are sup-
- ported by PCRE when its main matching function, pcre_exec(), is used.
- From release 6.0, PCRE offers a second matching function,
- pcre_dfa_exec(), which matches using a different algorithm that is not
+ If a pattern starts with (*NO_START_OPT), it has the same effect as
+ setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
+ time. There are also some more of these special sequences that are con-
+ cerned with the handling of newlines; they are described below.
+
+ The remainder of this document discusses the patterns that are sup-
+ ported by PCRE when its main matching function, pcre_exec(), is used.
+ From release 6.0, PCRE offers a second matching function,
+ pcre_dfa_exec(), which matches using a different algorithm that is not
Perl-compatible. Some of the features discussed below are not available
- when pcre_dfa_exec() is used. The advantages and disadvantages of the
- alternative function, and how it differs from the normal function, are
+ when pcre_dfa_exec() is used. The advantages and disadvantages of the
+ alternative function, and how it differs from the normal function, are
discussed in the pcrematching page.
NEWLINE CONVENTIONS
- PCRE supports five different conventions for indicating line breaks in
- strings: a single CR (carriage return) character, a single LF (line-
+ PCRE supports five different conventions for indicating line breaks in
+ strings: a single CR (carriage return) character, a single LF (line-
feed) character, the two-character sequence CRLF, any of the three pre-
- ceding, or any Unicode newline sequence. The pcreapi page has further
- discussion about newlines, and shows how to set the newline convention
+ ceding, or any Unicode newline sequence. The pcreapi page has further
+ discussion about newlines, and shows how to set the newline convention
in the options arguments for the compiling and matching functions.
- It is also possible to specify a newline convention by starting a pat-
+ It is also possible to specify a newline convention by starting a pat-
tern string with one of the following five sequences:
(*CR) carriage return
@@ -3211,54 +3298,54 @@ NEWLINE CONVENTIONS
(*ANYCRLF) any of the three above
(*ANY) all Unicode newline sequences
- These override the default and the options given to pcre_compile() or
- pcre_compile2(). For example, on a Unix system where LF is the default
+ These override the default and the options given to pcre_compile() or
+ pcre_compile2(). For example, on a Unix system where LF is the default
newline sequence, the pattern
(*CR)a.b
changes the convention to CR. That pattern matches "a\nb" because LF is
- no longer a newline. Note that these special settings, which are not
- Perl-compatible, are recognized only at the very start of a pattern,
- and that they must be in upper case. If more than one of them is
+ no longer a newline. Note that these special settings, which are not
+ Perl-compatible, are recognized only at the very start of a pattern,
+ and that they must be in upper case. If more than one of them is
present, the last one is used.
- The newline convention affects the interpretation of the dot metachar-
- acter when PCRE_DOTALL is not set, and also the behaviour of \N. How-
- ever, it does not affect what the \R escape sequence matches. By
- default, this is any Unicode newline sequence, for Perl compatibility.
- However, this can be changed; see the description of \R in the section
- entitled "Newline sequences" below. A change of \R setting can be com-
+ The newline convention affects the interpretation of the dot metachar-
+ acter when PCRE_DOTALL is not set, and also the behaviour of \N. How-
+ ever, it does not affect what the \R escape sequence matches. By
+ default, this is any Unicode newline sequence, for Perl compatibility.
+ However, this can be changed; see the description of \R in the section
+ entitled "Newline sequences" below. A change of \R setting can be com-
bined with a change of newline convention.
CHARACTERS AND METACHARACTERS
- A regular expression is a pattern that is matched against a subject
- string from left to right. Most characters stand for themselves in a
- pattern, and match the corresponding characters in the subject. As a
+ A regular expression is a pattern that is matched against a subject
+ string from left to right. Most characters stand for themselves in a
+ pattern, and match the corresponding characters in the subject. As a
trivial example, the pattern
The quick brown fox
matches a portion of a subject string that is identical to itself. When
- caseless matching is specified (the PCRE_CASELESS option), letters are
- matched independently of case. In UTF-8 mode, PCRE always understands
- the concept of case for characters whose values are less than 128, so
- caseless matching is always possible. For characters with higher val-
- ues, the concept of case is supported if PCRE is compiled with Unicode
- property support, but not otherwise. If you want to use caseless
- matching for characters 128 and above, you must ensure that PCRE is
+ caseless matching is specified (the PCRE_CASELESS option), letters are
+ matched independently of case. In UTF-8 mode, PCRE always understands
+ the concept of case for characters whose values are less than 128, so
+ caseless matching is always possible. For characters with higher val-
+ ues, the concept of case is supported if PCRE is compiled with Unicode
+ property support, but not otherwise. If you want to use caseless
+ matching for characters 128 and above, you must ensure that PCRE is
compiled with Unicode property support as well as with UTF-8 support.
- The power of regular expressions comes from the ability to include
- alternatives and repetitions in the pattern. These are encoded in the
+ The power of regular expressions comes from the ability to include
+ alternatives and repetitions in the pattern. These are encoded in the
pattern by the use of metacharacters, which do not stand for themselves
but instead are interpreted in some special way.
- There are two different sets of metacharacters: those that are recog-
- nized anywhere in the pattern except within square brackets, and those
- that are recognized within square brackets. Outside square brackets,
+ There are two different sets of metacharacters: those that are recog-
+ nized anywhere in the pattern except within square brackets, and those
+ that are recognized within square brackets. Outside square brackets,
the metacharacters are as follows:
\ general escape character with several uses
@@ -3277,7 +3364,7 @@ CHARACTERS AND METACHARACTERS
also "possessive quantifier"
{ start min/max quantifier
- Part of a pattern that is in square brackets is called a "character
+ Part of a pattern that is in square brackets is called a "character
class". In a character class the only metacharacters are:
\ general escape character
@@ -3293,27 +3380,31 @@ CHARACTERS AND METACHARACTERS
BACKSLASH
The backslash character has several uses. Firstly, if it is followed by
- a non-alphanumeric character, it takes away any special meaning that
- character may have. This use of backslash as an escape character
- applies both inside and outside character classes.
-
- For example, if you want to match a * character, you write \* in the
- pattern. This escaping action applies whether or not the following
- character would otherwise be interpreted as a metacharacter, so it is
- always safe to precede a non-alphanumeric with backslash to specify
- that it stands for itself. In particular, if you want to match a back-
+ a character that is not a number or a letter, it takes away any special
+ meaning that character may have. This use of backslash as an escape
+ character applies both inside and outside character classes.
+
+ For example, if you want to match a * character, you write \* in the
+ pattern. This escaping action applies whether or not the following
+ character would otherwise be interpreted as a metacharacter, so it is
+ always safe to precede a non-alphanumeric with backslash to specify
+ that it stands for itself. In particular, if you want to match a back-
slash, you write \\.
- If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
- the pattern (other than in a character class) and characters between a
+ In UTF-8 mode, only ASCII numbers and letters have any special meaning
+ after a backslash. All other characters (in particular, those whose
+ codepoints are greater than 127) are treated as literals.
+
+ If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
+ the pattern (other than in a character class) and characters between a
# outside a character class and the next newline are ignored. An escap-
- ing backslash can be used to include a whitespace or # character as
+ ing backslash can be used to include a whitespace or # character as
part of the pattern.
- If you want to remove the special meaning from a sequence of charac-
- ters, you can do so by putting them between \Q and \E. This is differ-
- ent from Perl in that $ and @ are handled as literals in \Q...\E
- sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
+ If you want to remove the special meaning from a sequence of charac-
+ ters, you can do so by putting them between \Q and \E. This is differ-
+ ent from Perl in that $ and @ are handled as literals in \Q...\E
+ sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
tion. Note the following examples:
Pattern PCRE matches Perl matches
@@ -3323,20 +3414,20 @@ BACKSLASH
\Qabc\$xyz\E abc\$xyz abc\$xyz
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz
- The \Q...\E sequence is recognized both inside and outside character
- classes.
+ The \Q...\E sequence is recognized both inside and outside character
+ classes. An isolated \E that is not preceded by \Q is ignored.
Non-printing characters
A second use of backslash provides a way of encoding non-printing char-
- acters in patterns in a visible manner. There is no restriction on the
- appearance of non-printing characters, apart from the binary zero that
- terminates a pattern, but when a pattern is being prepared by text
- editing, it is often easier to use one of the following escape
+ acters in patterns in a visible manner. There is no restriction on the
+ appearance of non-printing characters, apart from the binary zero that
+ terminates a pattern, but when a pattern is being prepared by text
+ editing, it is often easier to use one of the following escape
sequences than the binary character it represents:
\a alarm, that is, the BEL character (hex 07)
- \cx "control-x", where x is any character
+ \cx "control-x", where x is any ASCII character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n linefeed (hex 0A)
@@ -3346,48 +3437,52 @@ BACKSLASH
\xhh character with hex code hh
\x{hhh..} character with hex code hhh..
- The precise effect of \cx is as follows: if x is a lower case letter,
- it is converted to upper case. Then bit 6 of the character (hex 40) is
- inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
- becomes hex 7B.
-
- After \x, from zero to two hexadecimal digits are read (letters can be
- in upper or lower case). Any number of hexadecimal digits may appear
- between \x{ and }, but the value of the character code must be less
+ The precise effect of \cx is as follows: if x is a lower case letter,
+ it is converted to upper case. Then bit 6 of the character (hex 40) is
+ inverted. Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({
+ is 7B), while \c; becomes hex 7B (; is 3B). If the byte following \c
+ has a value greater than 127, a compile-time error occurs. This locks
+ out non-ASCII characters in both byte mode and UTF-8 mode. (When PCRE
+ is compiled in EBCDIC mode, all byte values are valid. A lower case
+ letter is converted to upper case, and then the 0xc0 bits are flipped.)
+
+ After \x, from zero to two hexadecimal digits are read (letters can be
+ in upper or lower case). Any number of hexadecimal digits may appear
+ between \x{ and }, but the value of the character code must be less
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
- the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
+ the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
than the largest Unicode code point, which is 10FFFF.
- If characters other than hexadecimal digits appear between \x{ and },
+ If characters other than hexadecimal digits appear between \x{ and },
or if there is no terminating }, this form of escape is not recognized.
- Instead, the initial \x will be interpreted as a basic hexadecimal
- escape, with no following digits, giving a character whose value is
+ Instead, the initial \x will be interpreted as a basic hexadecimal
+ escape, with no following digits, giving a character whose value is
zero.
Characters whose value is less than 256 can be defined by either of the
- two syntaxes for \x. There is no difference in the way they are han-
+ two syntaxes for \x. There is no difference in the way they are han-
dled. For example, \xdc is exactly the same as \x{dc}.
- After \0 up to two further octal digits are read. If there are fewer
- than two digits, just those that are present are used. Thus the
+ After \0 up to two further octal digits are read. If there are fewer
+ than two digits, just those that are present are used. Thus the
sequence \0\x\07 specifies two binary zeros followed by a BEL character
- (code value 7). Make sure you supply two digits after the initial zero
+ (code value 7). Make sure you supply two digits after the initial zero
if the pattern character that follows is itself an octal digit.
The handling of a backslash followed by a digit other than 0 is compli-
cated. Outside a character class, PCRE reads it and any following dig-
- its as a decimal number. If the number is less than 10, or if there
+ its as a decimal number. If the number is less than 10, or if there
have been at least that many previous capturing left parentheses in the
- expression, the entire sequence is taken as a back reference. A
- description of how this works is given later, following the discussion
+ expression, the entire sequence is taken as a back reference. A
+ description of how this works is given later, following the discussion
of parenthesized subpatterns.
- Inside a character class, or if the decimal number is greater than 9
- and there have not been that many capturing subpatterns, PCRE re-reads
+ Inside a character class, or if the decimal number is greater than 9
+ and there have not been that many capturing subpatterns, PCRE re-reads
up to three octal digits following the backslash, and uses them to gen-
- erate a data character. Any subsequent digits stand for themselves. In
- non-UTF-8 mode, the value of a character specified in octal must be
- less than \400. In UTF-8 mode, values up to \777 are permitted. For
+ erate a data character. Any subsequent digits stand for themselves. In
+ non-UTF-8 mode, the value of a character specified in octal must be
+ less than \400. In UTF-8 mode, values up to \777 are permitted. For
example:
\040 is another way of writing a space
@@ -3405,32 +3500,32 @@ BACKSLASH
\81 is either a back reference, or a binary zero
followed by the two characters "8" and "1"
- Note that octal values of 100 or greater must not be introduced by a
+ Note that octal values of 100 or greater must not be introduced by a
leading zero, because no more than three octal digits are ever read.
All the sequences that define a single character value can be used both
- inside and outside character classes. In addition, inside a character
- class, the sequence \b is interpreted as the backspace character (hex
- 08). The sequences \B, \N, \R, and \X are not special inside a charac-
- ter class. Like any other unrecognized escape sequences, they are
- treated as the literal characters "B", "N", "R", and "X" by default,
+ inside and outside character classes. In addition, inside a character
+ class, the sequence \b is interpreted as the backspace character (hex
+ 08). The sequences \B, \N, \R, and \X are not special inside a charac-
+ ter class. Like any other unrecognized escape sequences, they are
+ treated as the literal characters "B", "N", "R", and "X" by default,
but cause an error if the PCRE_EXTRA option is set. Outside a character
class, these sequences have different meanings.
Absolute and relative back references
- The sequence \g followed by an unsigned or a negative number, option-
- ally enclosed in braces, is an absolute or relative back reference. A
+ The sequence \g followed by an unsigned or a negative number, option-
+ ally enclosed in braces, is an absolute or relative back reference. A
named back reference can be coded as \g{name}. Back references are dis-
cussed later, following the discussion of parenthesized subpatterns.
Absolute and relative subroutine calls
- For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
- an alternative syntax for referencing a subpattern as a "subroutine".
- Details are discussed later. Note that \g{...} (Perl syntax) and
- \g<...> (Oniguruma syntax) are not synonymous. The former is a back
+ an alternative syntax for referencing a subpattern as a "subroutine".
+ Details are discussed later. Note that \g{...} (Perl syntax) and
+ \g<...> (Oniguruma syntax) are not synonymous. The former is a back
reference; the latter is a subroutine call.
Generic character types
@@ -3449,54 +3544,55 @@ BACKSLASH
\W any "non-word" character
There is also the single sequence \N, which matches a non-newline char-
- acter. This is the same as the "." metacharacter when PCRE_DOTALL is
+ acter. This is the same as the "." metacharacter when PCRE_DOTALL is
not set.
- Each pair of lower and upper case escape sequences partitions the com-
- plete set of characters into two disjoint sets. Any given character
- matches one, and only one, of each pair. The sequences can appear both
- inside and outside character classes. They each match one character of
- the appropriate type. If the current matching point is at the end of
- the subject string, all of them fail, because there is no character to
+ Each pair of lower and upper case escape sequences partitions the com-
+ plete set of characters into two disjoint sets. Any given character
+ matches one, and only one, of each pair. The sequences can appear both
+ inside and outside character classes. They each match one character of
+ the appropriate type. If the current matching point is at the end of
+ the subject string, all of them fail, because there is no character to
match.
- For compatibility with Perl, \s does not match the VT character (code
- 11). This makes it different from the the POSIX "space" class. The \s
- characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
+ For compatibility with Perl, \s does not match the VT character (code
+ 11). This makes it different from the the POSIX "space" class. The \s
+ characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
"use locale;" is included in a Perl script, \s may match the VT charac-
ter. In PCRE, it never does.
- A "word" character is an underscore or any character that is a letter
- or digit. By default, the definition of letters and digits is con-
- trolled by PCRE's low-valued character tables, and may vary if locale-
- specific matching is taking place (see "Locale support" in the pcreapi
- page). For example, in a French locale such as "fr_FR" in Unix-like
- systems, or "french" in Windows, some character codes greater than 128
- are used for accented letters, and these are then matched by \w. The
+ A "word" character is an underscore or any character that is a letter
+ or digit. By default, the definition of letters and digits is con-
+ trolled by PCRE's low-valued character tables, and may vary if locale-
+ specific matching is taking place (see "Locale support" in the pcreapi
+ page). For example, in a French locale such as "fr_FR" in Unix-like
+ systems, or "french" in Windows, some character codes greater than 128
+ are used for accented letters, and these are then matched by \w. The
use of locales with Unicode is discouraged.
- By default, in UTF-8 mode, characters with values greater than 128
- never match \d, \s, or \w, and always match \D, \S, and \W. These
- sequences retain their original meanings from before UTF-8 support was
- available, mainly for efficiency reasons. However, if PCRE is compiled
- with Unicode property support, and the PCRE_UCP option is set, the be-
- haviour is changed so that Unicode properties are used to determine
+ By default, in UTF-8 mode, characters with values greater than 128
+ never match \d, \s, or \w, and always match \D, \S, and \W. These
+ sequences retain their original meanings from before UTF-8 support was
+ available, mainly for efficiency reasons. However, if PCRE is compiled
+ with Unicode property support, and the PCRE_UCP option is set, the be-
+ haviour is changed so that Unicode properties are used to determine
character types, as follows:
\d any character that \p{Nd} matches (decimal digit)
\s any character that \p{Z} matches, plus HT, LF, FF, CR
\w any character that \p{L} or \p{N} matches, plus underscore
- The upper case escapes match the inverse sets of characters. Note that
- \d matches only decimal digits, whereas \w matches any Unicode digit,
- as well as any Unicode letter, and underscore. Note also that PCRE_UCP
- affects \b, and \B because they are defined in terms of \w and \W.
+ The upper case escapes match the inverse sets of characters. Note that
+ \d matches only decimal digits, whereas \w matches any Unicode digit,
+ as well as any Unicode letter, and underscore. Note also that PCRE_UCP
+ affects \b, and \B because they are defined in terms of \w and \W.
Matching these sequences is noticeably slower when PCRE_UCP is set.
- The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
- the other sequences, which match only ASCII characters by default,
- these always match certain high-valued codepoints in UTF-8 mode,
- whether or not PCRE_UCP is set. The horizontal space characters are:
+ The sequences \h, \H, \v, and \V are features that were added to Perl
+ at release 5.10. In contrast to the other sequences, which match only
+ ASCII characters by default, these always match certain high-valued
+ codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon-
+ tal space characters are:
U+0009 Horizontal tab
U+0020 Space
@@ -3531,8 +3627,8 @@ BACKSLASH
Newline sequences
Outside a character class, by default, the escape sequence \R matches
- any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
- mode \R is equivalent to the following:
+ any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
+ following:
(?>\r\n|\n|\x0b|\f|\r|\x85)
@@ -3740,33 +3836,32 @@ BACKSLASH
Resetting the match start
- The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
- ously matched characters not to be included in the final matched
- sequence. For example, the pattern:
+ The escape sequence \K causes any previously matched characters not to
+ be included in the final matched sequence. For example, the pattern:
foo\Kbar
- matches "foobar", but reports that it has matched "bar". This feature
- is similar to a lookbehind assertion (described below). However, in
- this case, the part of the subject before the real match does not have
- to be of fixed length, as lookbehind assertions do. The use of \K does
- not interfere with the setting of captured substrings. For example,
+ matches "foobar", but reports that it has matched "bar". This feature
+ is similar to a lookbehind assertion (described below). However, in
+ this case, the part of the subject before the real match does not have
+ to be of fixed length, as lookbehind assertions do. The use of \K does
+ not interfere with the setting of captured substrings. For example,
when the pattern
(foo)\Kbar
matches "foobar", the first substring is still set to "foo".
- Perl documents that the use of \K within assertions is "not well
- defined". In PCRE, \K is acted upon when it occurs inside positive
+ Perl documents that the use of \K within assertions is "not well
+ defined". In PCRE, \K is acted upon when it occurs inside positive
assertions, but is ignored in negative assertions.
Simple assertions
- The final use of backslash is for certain simple assertions. An asser-
- tion specifies a condition that has to be met at a particular point in
- a match, without consuming any characters from the subject string. The
- use of subpatterns for more complicated assertions is described below.
+ The final use of backslash is for certain simple assertions. An asser-
+ tion specifies a condition that has to be met at a particular point in
+ a match, without consuming any characters from the subject string. The
+ use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:
\b matches at a word boundary
@@ -3777,49 +3872,49 @@ BACKSLASH
\z matches only at the end of the subject
\G matches at the first matching position in the subject
- Inside a character class, \b has a different meaning; it matches the
- backspace character. If any other of these assertions appears in a
- character class, by default it matches the corresponding literal char-
+ Inside a character class, \b has a different meaning; it matches the
+ backspace character. If any other of these assertions appears in a
+ character class, by default it matches the corresponding literal char-
acter (for example, \B matches the letter B). However, if the
- PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
+ PCRE_EXTRA option is set, an "invalid escape sequence" error is gener-
ated instead.
- A word boundary is a position in the subject string where the current
- character and the previous character do not both match \w or \W (i.e.
- one matches \w and the other matches \W), or the start or end of the
- string if the first or last character matches \w, respectively. In
- UTF-8 mode, the meanings of \w and \W can be changed by setting the
- PCRE_UCP option. When this is done, it also affects \b and \B. Neither
- PCRE nor Perl has a separate "start of word" or "end of word" metase-
- quence. However, whatever follows \b normally determines which it is.
+ A word boundary is a position in the subject string where the current
+ character and the previous character do not both match \w or \W (i.e.
+ one matches \w and the other matches \W), or the start or end of the
+ string if the first or last character matches \w, respectively. In
+ UTF-8 mode, the meanings of \w and \W can be changed by setting the
+ PCRE_UCP option. When this is done, it also affects \b and \B. Neither
+ PCRE nor Perl has a separate "start of word" or "end of word" metase-
+ quence. However, whatever follows \b normally determines which it is.
For example, the fragment \ba matches "a" at the start of a word.
- The \A, \Z, and \z assertions differ from the traditional circumflex
+ The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
- at the very start and end of the subject string, whatever options are
- set. Thus, they are independent of multiline mode. These three asser-
+ at the very start and end of the subject string, whatever options are
+ set. Thus, they are independent of multiline mode. These three asser-
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
- affect only the behaviour of the circumflex and dollar metacharacters.
- However, if the startoffset argument of pcre_exec() is non-zero, indi-
+ affect only the behaviour of the circumflex and dollar metacharacters.
+ However, if the startoffset argument of pcre_exec() is non-zero, indi-
cating that matching is to start at a point other than the beginning of
- the subject, \A can never match. The difference between \Z and \z is
+ the subject, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at
the very end, whereas \z matches only at the end.
- The \G assertion is true only when the current matching position is at
- the start point of the match, as specified by the startoffset argument
- of pcre_exec(). It differs from \A when the value of startoffset is
- non-zero. By calling pcre_exec() multiple times with appropriate argu-
+ The \G assertion is true only when the current matching position is at
+ the start point of the match, as specified by the startoffset argument
+ of pcre_exec(). It differs from \A when the value of startoffset is
+ non-zero. By calling pcre_exec() multiple times with appropriate argu-
ments, you can mimic Perl's /g option, and it is in this kind of imple-
mentation where \G can be useful.
- Note, however, that PCRE's interpretation of \G, as the start of the
+ Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
- end of the previous match. In Perl, these can be different when the
- previously matched string was empty. Because PCRE does just one match
+ end of the previous match. In Perl, these can be different when the
+ previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.
- If all the alternatives of a pattern begin with \G, the expression is
+ If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.
@@ -3827,94 +3922,94 @@ BACKSLASH
CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the circumflex
- character is an assertion that is true only if the current matching
- point is at the start of the subject string. If the startoffset argu-
- ment of pcre_exec() is non-zero, circumflex can never match if the
- PCRE_MULTILINE option is unset. Inside a character class, circumflex
+ character is an assertion that is true only if the current matching
+ point is at the start of the subject string. If the startoffset argu-
+ ment of pcre_exec() is non-zero, circumflex can never match if the
+ PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).
- Circumflex need not be the first character of the pattern if a number
- of alternatives are involved, but it should be the first thing in each
- alternative in which it appears if the pattern is ever to match that
- branch. If all possible alternatives start with a circumflex, that is,
- if the pattern is constrained to match only at the start of the sub-
- ject, it is said to be an "anchored" pattern. (There are also other
+ Circumflex need not be the first character of the pattern if a number
+ of alternatives are involved, but it should be the first thing in each
+ alternative in which it appears if the pattern is ever to match that
+ branch. If all possible alternatives start with a circumflex, that is,
+ if the pattern is constrained to match only at the start of the sub-
+ ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)
- A dollar character is an assertion that is true only if the current
- matching point is at the end of the subject string, or immediately
+ A dollar character is an assertion that is true only if the current
+ matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
- be the last character of the pattern if a number of alternatives are
- involved, but it should be the last item in any branch in which it
+ be the last character of the pattern if a number of alternatives are
+ involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.
- The meaning of dollar can be changed so that it matches only at the
- very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
+ The meaning of dollar can be changed so that it matches only at the
+ very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.
The meanings of the circumflex and dollar characters are changed if the
- PCRE_MULTILINE option is set. When this is the case, a circumflex
- matches immediately after internal newlines as well as at the start of
- the subject string. It does not match after a newline that ends the
- string. A dollar matches before any newlines in the string, as well as
- at the very end, when PCRE_MULTILINE is set. When newline is specified
- as the two-character sequence CRLF, isolated CR and LF characters do
+ PCRE_MULTILINE option is set. When this is the case, a circumflex
+ matches immediately after internal newlines as well as at the start of
+ the subject string. It does not match after a newline that ends the
+ string. A dollar matches before any newlines in the string, as well as
+ at the very end, when PCRE_MULTILINE is set. When newline is specified
+ as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.
- For example, the pattern /^abc$/ matches the subject string "def\nabc"
- (where \n represents a newline) in multiline mode, but not otherwise.
- Consequently, patterns that are anchored in single line mode because
- all branches start with ^ are not anchored in multiline mode, and a
- match for circumflex is possible when the startoffset argument of
- pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
+ For example, the pattern /^abc$/ matches the subject string "def\nabc"
+ (where \n represents a newline) in multiline mode, but not otherwise.
+ Consequently, patterns that are anchored in single line mode because
+ all branches start with ^ are not anchored in multiline mode, and a
+ match for circumflex is possible when the startoffset argument of
+ pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.
- Note that the sequences \A, \Z, and \z can be used to match the start
- and end of the subject in both modes, and if all branches of a pattern
- start with \A it is always anchored, whether or not PCRE_MULTILINE is
+ Note that the sequences \A, \Z, and \z can be used to match the start
+ and end of the subject in both modes, and if all branches of a pattern
+ start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.
FULL STOP (PERIOD, DOT) AND \N
Outside a character class, a dot in the pattern matches any one charac-
- ter in the subject string except (by default) a character that signi-
- fies the end of a line. In UTF-8 mode, the matched character may be
+ ter in the subject string except (by default) a character that signi-
+ fies the end of a line. In UTF-8 mode, the matched character may be
more than one byte long.
- When a line ending is defined as a single character, dot never matches
- that character; when the two-character sequence CRLF is used, dot does
- not match CR if it is immediately followed by LF, but otherwise it
- matches all characters (including isolated CRs and LFs). When any Uni-
- code line endings are being recognized, dot does not match CR or LF or
+ When a line ending is defined as a single character, dot never matches
+ that character; when the two-character sequence CRLF is used, dot does
+ not match CR if it is immediately followed by LF, but otherwise it
+ matches all characters (including isolated CRs and LFs). When any Uni-
+ code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters.
- The behaviour of dot with regard to newlines can be changed. If the
- PCRE_DOTALL option is set, a dot matches any one character, without
+ The behaviour of dot with regard to newlines can be changed. If the
+ PCRE_DOTALL option is set, a dot matches any one character, without
exception. If the two-character sequence CRLF is present in the subject
string, it takes two dots to match it.
- The handling of dot is entirely independent of the handling of circum-
- flex and dollar, the only relationship being that they both involve
+ The handling of dot is entirely independent of the handling of circum-
+ flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.
- The escape sequence \N always behaves as a dot does when PCRE_DOTALL is
- not set. In other words, it matches any one character except one that
- signifies the end of a line.
+ The escape sequence \N behaves like a dot, except that it is not
+ affected by the PCRE_DOTALL option. In other words, it matches any
+ character except one that signifies the end of a line.
MATCHING A SINGLE BYTE
Outside a character class, the escape sequence \C matches any one byte,
- both in and out of UTF-8 mode. Unlike a dot, it always matches any
- line-ending characters. The feature is provided in Perl in order to
- match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
- acters into individual bytes, what remains in the string may be a mal-
- formed UTF-8 string. For this reason, the \C escape sequence is best
- avoided.
-
- PCRE does not allow \C to appear in lookbehind assertions (described
- below), because in UTF-8 mode this would make it impossible to calcu-
+ both in and out of UTF-8 mode. Unlike a dot, it always matches any
+ line-ending characters. The feature is provided in Perl in order to
+ match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
+ acters into individual bytes, the rest of the string may start with a
+ malformed UTF-8 character. For this reason, the \C escape sequence is
+ best avoided.
+
+ PCRE does not allow \C to appear in lookbehind assertions (described
+ below), because in UTF-8 mode this would make it impossible to calcu-
late the length of the lookbehind.
@@ -3924,97 +4019,109 @@ SQUARE BRACKETS AND CHARACTER CLASSES
closing square bracket. A closing square bracket on its own is not spe-
cial by default. However, if the PCRE_JAVASCRIPT_COMPAT option is set,
a lone closing square bracket causes a compile-time error. If a closing
- square bracket is required as a member of the class, it should be the
- first data character in the class (after an initial circumflex, if
+ square bracket is required as a member of the class, it should be the
+ first data character in the class (after an initial circumflex, if
present) or escaped with a backslash.
- A character class matches a single character in the subject. In UTF-8
+ A character class matches a single character in the subject. In UTF-8
mode, the character may be more than one byte long. A matched character
must be in the set of characters defined by the class, unless the first
- character in the class definition is a circumflex, in which case the
- subject character must not be in the set defined by the class. If a
- circumflex is actually required as a member of the class, ensure it is
+ character in the class definition is a circumflex, in which case the
+ subject character must not be in the set defined by the class. If a
+ circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.
- For example, the character class [aeiou] matches any lower case vowel,
- while [^aeiou] matches any character that is not a lower case vowel.
+ For example, the character class [aeiou] matches any lower case vowel,
+ while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
- characters that are in the class by enumerating those that are not. A
- class that starts with a circumflex is not an assertion; it still con-
- sumes a character from the subject string, and therefore it fails if
+ characters that are in the class by enumerating those that are not. A
+ class that starts with a circumflex is not an assertion; it still con-
+ sumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.
- In UTF-8 mode, characters with values greater than 255 can be included
- in a class as a literal string of bytes, or by using the \x{ escaping
+ In UTF-8 mode, characters with values greater than 255 can be included
+ in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.
- When caseless matching is set, any letters in a class represent both
- their upper case and lower case versions, so for example, a caseless
- [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
- match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
- understands the concept of case for characters whose values are less
- than 128, so caseless matching is always possible. For characters with
- higher values, the concept of case is supported if PCRE is compiled
- with Unicode property support, but not otherwise. If you want to use
- caseless matching in UTF8-mode for characters 128 and above, you must
- ensure that PCRE is compiled with Unicode property support as well as
+ When caseless matching is set, any letters in a class represent both
+ their upper case and lower case versions, so for example, a caseless
+ [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
+ match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
+ understands the concept of case for characters whose values are less
+ than 128, so caseless matching is always possible. For characters with
+ higher values, the concept of case is supported if PCRE is compiled
+ with Unicode property support, but not otherwise. If you want to use
+ caseless matching in UTF8-mode for characters 128 and above, you must
+ ensure that PCRE is compiled with Unicode property support as well as
with UTF-8 support.
- Characters that might indicate line breaks are never treated in any
- special way when matching character classes, whatever line-ending
- sequence is in use, and whatever setting of the PCRE_DOTALL and
+ Characters that might indicate line breaks are never treated in any
+ special way when matching character classes, whatever line-ending
+ sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.
- The minus (hyphen) character can be used to specify a range of charac-
- ters in a character class. For example, [d-m] matches any letter
- between d and m, inclusive. If a minus character is required in a
- class, it must be escaped with a backslash or appear in a position
- where it cannot be interpreted as indicating a range, typically as the
+ The minus (hyphen) character can be used to specify a range of charac-
+ ters in a character class. For example, [d-m] matches any letter
+ between d and m, inclusive. If a minus character is required in a
+ class, it must be escaped with a backslash or appear in a position
+ where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.
It is not possible to have the literal character "]" as the end charac-
- ter of a range. A pattern such as [W-]46] is interpreted as a class of
- two characters ("W" and "-") followed by a literal string "46]", so it
- would match "W46]" or "-46]". However, if the "]" is escaped with a
- backslash it is interpreted as the end of range, so [W-\]46] is inter-
- preted as a class containing a range followed by two other characters.
- The octal or hexadecimal representation of "]" can also be used to end
+ ter of a range. A pattern such as [W-]46] is interpreted as a class of
+ two characters ("W" and "-") followed by a literal string "46]", so it
+ would match "W46]" or "-46]". However, if the "]" is escaped with a
+ backslash it is interpreted as the end of range, so [W-\]46] is inter-
+ preted as a class containing a range followed by two other characters.
+ The octal or hexadecimal representation of "]" can also be used to end
a range.
- Ranges operate in the collating sequence of character values. They can
- also be used for characters specified numerically, for example
- [\000-\037]. In UTF-8 mode, ranges can include characters whose values
+ Ranges operate in the collating sequence of character values. They can
+ also be used for characters specified numerically, for example
+ [\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].
If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
- to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
- character tables for a French locale are in use, [\xc8-\xcb] matches
- accented E characters in both cases. In UTF-8 mode, PCRE supports the
- concept of case for characters with values greater than 128 only when
+ to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
+ character tables for a French locale are in use, [\xc8-\xcb] matches
+ accented E characters in both cases. In UTF-8 mode, PCRE supports the
+ concept of case for characters with values greater than 128 only when
it is compiled with Unicode property support.
- The character types \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, \w, and \W
- may also appear in a character class, and add the characters that they
- match to the class. For example, [\dABCDEF] matches any hexadecimal
- digit. A circumflex can conveniently be used with the upper case char-
- acter types to specify a more restricted set of characters than the
- matching lower case type. For example, the class [^\W_] matches any
- letter or digit, but not underscore.
-
- The only metacharacters that are recognized in character classes are
- backslash, hyphen (only where it can be interpreted as specifying a
- range), circumflex (only at the start), opening square bracket (only
- when it can be interpreted as introducing a POSIX class name - see the
- next section), and the terminating closing square bracket. However,
+ The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
+ \w, and \W may appear in a character class, and add the characters that
+ they match to the class. For example, [\dABCDEF] matches any hexadeci-
+ mal digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of
+ \d, \s, \w and their upper case partners, just as it does when they
+ appear outside a character class, as described in the section entitled
+ "Generic character types" above. The escape sequence \b has a different
+ meaning inside a character class; it matches the backspace character.
+ The sequences \B, \N, \R, and \X are not special inside a character
+ class. Like any other unrecognized escape sequences, they are treated
+ as the literal characters "B", "N", "R", and "X" by default, but cause
+ an error if the PCRE_EXTRA option is set.
+
+ A circumflex can conveniently be used with the upper case character
+ types to specify a more restricted set of characters than the matching
+ lower case type. For example, the class [^\W_] matches any letter or
+ digit, but not underscore, whereas [\w] includes underscore. A positive
+ character class should be read as "something OR something OR ..." and a
+ negative class as "NOT something AND NOT something AND NOT ...".
+
+ The only metacharacters that are recognized in character classes are
+ backslash, hyphen (only where it can be interpreted as specifying a
+ range), circumflex (only at the start), opening square bracket (only
+ when it can be interpreted as introducing a POSIX class name - see the
+ next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.
POSIX CHARACTER CLASSES
Perl supports the POSIX notation for character classes. This uses names
- enclosed by [: and :] within the enclosing square brackets. PCRE also
+ enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,
[01[:alpha:]%]
@@ -4037,24 +4144,24 @@ POSIX CHARACTER CLASSES
word "word" characters (same as \w)
xdigit hexadecimal digits
- The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
- and space (32). Notice that this list includes the VT character (code
+ The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
+ and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).
- The name "word" is a Perl extension, and "blank" is a GNU extension
- from Perl 5.8. Another Perl extension is negation, which is indicated
+ The name "word" is a Perl extension, and "blank" is a GNU extension
+ from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,
[12[:^digit:]]
- matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
+ matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.
- By default, in UTF-8 mode, characters with values greater than 128 do
- not match any of the POSIX character classes. However, if the PCRE_UCP
- option is passed to pcre_compile(), some of the classes are changed so
+ By default, in UTF-8 mode, characters with values greater than 128 do
+ not match any of the POSIX character classes. However, if the PCRE_UCP
+ option is passed to pcre_compile(), some of the classes are changed so
that Unicode character properties are used. This is achieved by replac-
ing the POSIX classes by other sequences, as follows:
@@ -4067,31 +4174,31 @@ POSIX CHARACTER CLASSES
[:upper:] becomes \p{Lu}
[:word:] becomes \p{Xwd}
- Negated versions, such as [:^alpha:] use \P instead of \p. The other
+ Negated versions, such as [:^alpha:] use \P instead of \p. The other
POSIX classes are unchanged, and match only characters with code points
less than 128.
VERTICAL BAR
- Vertical bar characters are used to separate alternative patterns. For
+ Vertical bar characters are used to separate alternative patterns. For
example, the pattern
gilbert|sullivan
- matches either "gilbert" or "sullivan". Any number of alternatives may
- appear, and an empty alternative is permitted (matching the empty
+ matches either "gilbert" or "sullivan". Any number of alternatives may
+ appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
- to right, and the first one that succeeds is used. If the alternatives
- are within a subpattern (defined below), "succeeds" means matching the
+ to right, and the first one that succeeds is used. If the alternatives
+ are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.
INTERNAL OPTION SETTING
- The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
- PCRE_EXTENDED options (which are Perl-compatible) can be changed from
- within the pattern by a sequence of Perl option letters enclosed
+ The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
+ PCRE_EXTENDED options (which are Perl-compatible) can be changed from
+ within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are
i for PCRE_CASELESS
@@ -4101,47 +4208,47 @@ INTERNAL OPTION SETTING
For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a
- combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
- LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
- is also permitted. If a letter appears both before and after the
+ combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
+ LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
+ is also permitted. If a letter appears both before and after the
hyphen, the option is unset.
- The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
- can be changed in the same way as the Perl-compatible options by using
+ The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
+ can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.
- When one of these option changes occurs at top level (that is, not
- inside subpattern parentheses), the change applies to the remainder of
+ When one of these option changes occurs at top level (that is, not
+ inside subpattern parentheses), the change applies to the remainder of
the pattern that follows. If the change is placed right at the start of
a pattern, PCRE extracts it into the global options (and it will there-
fore show up in data extracted by the pcre_fullinfo() function).
- An option change within a subpattern (see below for a description of
- subpatterns) affects only that part of the current pattern that follows
- it, so
+ An option change within a subpattern (see below for a description of
+ subpatterns) affects only that part of the subpattern that follows it,
+ so
(a(?i)b)c
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
- used). By this means, options can be made to have different settings
- in different parts of the pattern. Any changes made in one alternative
- do carry on into subsequent branches within the same subpattern. For
+ used). By this means, options can be made to have different settings
+ in different parts of the pattern. Any changes made in one alternative
+ do carry on into subsequent branches within the same subpattern. For
example,
(a(?i)b|c)
- matches "ab", "aB", "c", and "C", even though when matching "C" the
- first branch is abandoned before the option setting. This is because
- the effects of option settings happen at compile time. There would be
+ matches "ab", "aB", "c", and "C", even though when matching "C" the
+ first branch is abandoned before the option setting. This is because
+ the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.
- Note: There are other PCRE-specific options that can be set by the
- application when the compile or match functions are called. In some
+ Note: There are other PCRE-specific options that can be set by the
+ application when the compile or match functions are called. In some
cases the pattern can contain special leading sequences such as (*CRLF)
- to override what the application has set or what has been defaulted.
- Details are given in the section entitled "Newline sequences" above.
- There are also the (*UTF8) and (*UCP) leading sequences that can be
- used to set UTF-8 and Unicode property modes; they are equivalent to
+ to override what the application has set or what has been defaulted.
+ Details are given in the section entitled "Newline sequences" above.
+ There are also the (*UTF8) and (*UCP) leading sequences that can be
+ used to set UTF-8 and Unicode property modes; they are equivalent to
setting the PCRE_UTF8 and the PCRE_UCP options, respectively.
@@ -4154,19 +4261,16 @@ SUBPATTERNS
cat(aract|erpillar|)
- matches one of the words "cat", "cataract", or "caterpillar". Without
- the parentheses, it would match "cataract", "erpillar" or an empty
- string.
+ matches "cataract", "caterpillar", or "cat". Without the parentheses,
+ it would match "cataract", "erpillar" or an empty string.
2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain numbers for the capturing
- subpatterns.
-
- For example, if the string "the red king" is matched against the pat-
- tern
+ subpatterns. For example, if the string "the red king" is matched
+ against the pattern
the ((red|white) (king|queen))
@@ -4215,9 +4319,9 @@ DUPLICATE SUBPATTERN NUMBERS
matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group, paren-
theses are numbered as usual, but the number is reset at the start of
- each branch. The numbers of any capturing buffers that follow the sub-
- pattern start after the highest number used in any branch. The follow-
- ing example is taken from the Perl documentation. The numbers under-
+ each branch. The numbers of any capturing parentheses that follow the
+ subpattern start after the highest number used in any branch. The fol-
+ lowing example is taken from the Perl documentation. The numbers under-
neath show in which buffer the captured content will be stored.
# before ---------------branch-reset----------- after
@@ -4324,7 +4428,7 @@ REPETITION
the \C escape sequence
the \X escape sequence (in UTF-8 mode with Unicode properties)
the \R escape sequence
- an escape such as \d that matches a single character
+ an escape such as \d or \pL that matches a single character
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion)
@@ -4364,34 +4468,35 @@ REPETITION
The quantifier {0} is permitted, causing the expression to behave as if
the previous item and the quantifier were not present. This may be use-
ful for subpatterns that are referenced as subroutines from elsewhere
- in the pattern. Items other than subpatterns that have a {0} quantifier
- are omitted from the compiled pattern.
+ in the pattern (but see also the section entitled "Defining subpatterns
+ for use by reference only" below). Items other than subpatterns that
+ have a {0} quantifier are omitted from the compiled pattern.
- For convenience, the three most common quantifiers have single-charac-
+ For convenience, the three most common quantifiers have single-charac-
ter abbreviations:
* is equivalent to {0,}
+ is equivalent to {1,}
? is equivalent to {0,1}
- It is possible to construct infinite loops by following a subpattern
+ It is possible to construct infinite loops by following a subpattern
that can match no characters with a quantifier that has no upper limit,
for example:
(a?)*
Earlier versions of Perl and PCRE used to give an error at compile time
- for such patterns. However, because there are cases where this can be
- useful, such patterns are now accepted, but if any repetition of the
- subpattern does in fact match no characters, the loop is forcibly bro-
+ for such patterns. However, because there are cases where this can be
+ useful, such patterns are now accepted, but if any repetition of the
+ subpattern does in fact match no characters, the loop is forcibly bro-
ken.
- By default, the quantifiers are "greedy", that is, they match as much
- as possible (up to the maximum number of permitted times), without
- causing the rest of the pattern to fail. The classic example of where
+ By default, the quantifiers are "greedy", that is, they match as much
+ as possible (up to the maximum number of permitted times), without
+ causing the rest of the pattern to fail. The classic example of where
this gives problems is in trying to match comments in C programs. These
- appear between /* and */ and within the comment, individual * and /
- characters may appear. An attempt to match C comments by applying the
+ appear between /* and */ and within the comment, individual * and /
+ characters may appear. An attempt to match C comments by applying the
pattern
/\*.*\*/
@@ -4400,19 +4505,19 @@ REPETITION
/* first comment */ not comment /* second comment */
- fails, because it matches the entire string owing to the greediness of
+ fails, because it matches the entire string owing to the greediness of
the .* item.
- However, if a quantifier is followed by a question mark, it ceases to
+ However, if a quantifier is followed by a question mark, it ceases to
be greedy, and instead matches the minimum number of times possible, so
the pattern
/\*.*?\*/
- does the right thing with the C comments. The meaning of the various
- quantifiers is not otherwise changed, just the preferred number of
- matches. Do not confuse this use of question mark with its use as a
- quantifier in its own right. Because it has two uses, it can sometimes
+ does the right thing with the C comments. The meaning of the various
+ quantifiers is not otherwise changed, just the preferred number of
+ matches. Do not confuse this use of question mark with its use as a
+ quantifier in its own right. Because it has two uses, it can sometimes
appear doubled, as in
\d??\d
@@ -4420,36 +4525,36 @@ REPETITION
which matches one digit by preference, but can match two if that is the
only way the rest of the pattern matches.
- If the PCRE_UNGREEDY option is set (an option that is not available in
- Perl), the quantifiers are not greedy by default, but individual ones
- can be made greedy by following them with a question mark. In other
+ If the PCRE_UNGREEDY option is set (an option that is not available in
+ Perl), the quantifiers are not greedy by default, but individual ones
+ can be made greedy by following them with a question mark. In other
words, it inverts the default behaviour.
- When a parenthesized subpattern is quantified with a minimum repeat
- count that is greater than 1 or with a limited maximum, more memory is
- required for the compiled pattern, in proportion to the size of the
+ When a parenthesized subpattern is quantified with a minimum repeat
+ count that is greater than 1 or with a limited maximum, more memory is
+ required for the compiled pattern, in proportion to the size of the
minimum or maximum.
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
- alent to Perl's /s) is set, thus allowing the dot to match newlines,
- the pattern is implicitly anchored, because whatever follows will be
- tried against every character position in the subject string, so there
- is no point in retrying the overall match at any position after the
- first. PCRE normally treats such a pattern as though it were preceded
+ alent to Perl's /s) is set, thus allowing the dot to match newlines,
+ the pattern is implicitly anchored, because whatever follows will be
+ tried against every character position in the subject string, so there
+ is no point in retrying the overall match at any position after the
+ first. PCRE normally treats such a pattern as though it were preceded
by \A.
- In cases where it is known that the subject string contains no new-
- lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
+ In cases where it is known that the subject string contains no new-
+ lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
mization, or alternatively using ^ to indicate anchoring explicitly.
- However, there is one situation where the optimization cannot be used.
+ However, there is one situation where the optimization cannot be used.
When .* is inside capturing parentheses that are the subject of a back
reference elsewhere in the pattern, a match at the start may fail where
a later one succeeds. Consider, for example:
(.*)abc\1
- If the subject is "xyz123abc123" the match point is the fourth charac-
+ If the subject is "xyz123abc123" the match point is the fourth charac-
ter. For this reason, such a pattern is not implicitly anchored.
When a capturing subpattern is repeated, the value captured is the sub-
@@ -4458,8 +4563,8 @@ REPETITION
(tweedle[dume]{3}\s*)+
has matched "tweedledum tweedledee" the value of the captured substring
- is "tweedledee". However, if there are nested capturing subpatterns,
- the corresponding captured values may have been set in previous itera-
+ is "tweedledee". However, if there are nested capturing subpatterns,
+ the corresponding captured values may have been set in previous itera-
tions. For example, after
/(a|(b))+/
@@ -4469,53 +4574,53 @@ REPETITION
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
- With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
- repetition, failure of what follows normally causes the repeated item
- to be re-evaluated to see if a different number of repeats allows the
- rest of the pattern to match. Sometimes it is useful to prevent this,
- either to change the nature of the match, or to cause it fail earlier
- than it otherwise might, when the author of the pattern knows there is
+ With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
+ repetition, failure of what follows normally causes the repeated item
+ to be re-evaluated to see if a different number of repeats allows the
+ rest of the pattern to match. Sometimes it is useful to prevent this,
+ either to change the nature of the match, or to cause it fail earlier
+ than it otherwise might, when the author of the pattern knows there is
no point in carrying on.
- Consider, for example, the pattern \d+foo when applied to the subject
+ Consider, for example, the pattern \d+foo when applied to the subject
line
123456bar
After matching all 6 digits and then failing to match "foo", the normal
- action of the matcher is to try again with only 5 digits matching the
- \d+ item, and then with 4, and so on, before ultimately failing.
- "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
- the means for specifying that once a subpattern has matched, it is not
+ action of the matcher is to try again with only 5 digits matching the
+ \d+ item, and then with 4, and so on, before ultimately failing.
+ "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
+ the means for specifying that once a subpattern has matched, it is not
to be re-evaluated in this way.
- If we use atomic grouping for the previous example, the matcher gives
- up immediately on failing to match "foo" the first time. The notation
+ If we use atomic grouping for the previous example, the matcher gives
+ up immediately on failing to match "foo" the first time. The notation
is a kind of special parenthesis, starting with (?> as in this example:
(?>\d+)foo
- This kind of parenthesis "locks up" the part of the pattern it con-
- tains once it has matched, and a failure further into the pattern is
- prevented from backtracking into it. Backtracking past it to previous
+ This kind of parenthesis "locks up" the part of the pattern it con-
+ tains once it has matched, and a failure further into the pattern is
+ prevented from backtracking into it. Backtracking past it to previous
items, however, works as normal.
- An alternative description is that a subpattern of this type matches
- the string of characters that an identical standalone pattern would
+ An alternative description is that a subpattern of this type matches
+ the string of characters that an identical standalone pattern would
match, if anchored at the current point in the subject string.
Atomic grouping subpatterns are not capturing subpatterns. Simple cases
such as the above example can be thought of as a maximizing repeat that
- must swallow everything it can. So, while both \d+ and \d+? are pre-
- pared to adjust the number of digits they match in order to make the
+ must swallow everything it can. So, while both \d+ and \d+? are pre-
+ pared to adjust the number of digits they match in order to make the
rest of the pattern match, (?>\d+) can only match an entire sequence of
digits.
- Atomic groups in general can of course contain arbitrarily complicated
- subpatterns, and can be nested. However, when the subpattern for an
+ Atomic groups in general can of course contain arbitrarily complicated
+ subpatterns, and can be nested. However, when the subpattern for an
atomic group is just a single repeated item, as in the example above, a
- simpler notation, called a "possessive quantifier" can be used. This
- consists of an additional + character following a quantifier. Using
+ simpler notation, called a "possessive quantifier" can be used. This
+ consists of an additional + character following a quantifier. Using
this notation, the previous example can be rewritten as
\d++foo
@@ -4525,45 +4630,45 @@ ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
(abc|xyz){2,3}+
- Possessive quantifiers are always greedy; the setting of the
+ Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation for the
- simpler forms of atomic group. However, there is no difference in the
- meaning of a possessive quantifier and the equivalent atomic group,
- though there may be a performance difference; possessive quantifiers
+ simpler forms of atomic group. However, there is no difference in the
+ meaning of a possessive quantifier and the equivalent atomic group,
+ though there may be a performance difference; possessive quantifiers
should be slightly faster.
- The possessive quantifier syntax is an extension to the Perl 5.8 syn-
- tax. Jeffrey Friedl originated the idea (and the name) in the first
+ The possessive quantifier syntax is an extension to the Perl 5.8 syn-
+ tax. Jeffrey Friedl originated the idea (and the name) in the first
edition of his book. Mike McCloskey liked it, so implemented it when he
- built Sun's Java package, and PCRE copied it from there. It ultimately
+ built Sun's Java package, and PCRE copied it from there. It ultimately
found its way into Perl at release 5.10.
PCRE has an optimization that automatically "possessifies" certain sim-
- ple pattern constructs. For example, the sequence A+B is treated as
- A++B because there is no point in backtracking into a sequence of A's
+ ple pattern constructs. For example, the sequence A+B is treated as
+ A++B because there is no point in backtracking into a sequence of A's
when B must follow.
- When a pattern contains an unlimited repeat inside a subpattern that
- can itself be repeated an unlimited number of times, the use of an
- atomic group is the only way to avoid some failing matches taking a
+ When a pattern contains an unlimited repeat inside a subpattern that
+ can itself be repeated an unlimited number of times, the use of an
+ atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern
(\D+|<\d+>)*[!?]
- matches an unlimited number of substrings that either consist of non-
- digits, or digits enclosed in <>, followed by either ! or ?. When it
+ matches an unlimited number of substrings that either consist of non-
+ digits, or digits enclosed in <>, followed by either ! or ?. When it
matches, it runs quickly. However, if it is applied to
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
- it takes a long time before reporting failure. This is because the
- string can be divided between the internal \D+ repeat and the external
- * repeat in a large number of ways, and all have to be tried. (The
- example uses [!?] rather than a single character at the end, because
- both PCRE and Perl have an optimization that allows for fast failure
- when a single character is used. They remember the last single charac-
- ter that is required for a match, and fail early if it is not present
- in the string.) If the pattern is changed so that it uses an atomic
+ it takes a long time before reporting failure. This is because the
+ string can be divided between the internal \D+ repeat and the external
+ * repeat in a large number of ways, and all have to be tried. (The
+ example uses [!?] rather than a single character at the end, because
+ both PCRE and Perl have an optimization that allows for fast failure
+ when a single character is used. They remember the last single charac-
+ ter that is required for a match, and fail early if it is not present
+ in the string.) If the pattern is changed so that it uses an atomic
group, like this:
((?>\D+)|<\d+>)*[!?]
@@ -4575,31 +4680,30 @@ BACK REFERENCES
Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing sub-
- pattern earlier (that is, to its left) in the pattern, provided there
+ pattern earlier (that is, to its left) in the pattern, provided there
have been that many previous capturing left parentheses.
However, if the decimal number following the backslash is less than 10,
- it is always taken as a back reference, and causes an error only if
- there are not that many capturing left parentheses in the entire pat-
- tern. In other words, the parentheses that are referenced need not be
- to the left of the reference for numbers less than 10. A "forward back
- reference" of this type can make sense when a repetition is involved
- and the subpattern to the right has participated in an earlier itera-
+ it is always taken as a back reference, and causes an error only if
+ there are not that many capturing left parentheses in the entire pat-
+ tern. In other words, the parentheses that are referenced need not be
+ to the left of the reference for numbers less than 10. A "forward back
+ reference" of this type can make sense when a repetition is involved
+ and the subpattern to the right has participated in an earlier itera-
tion.
- It is not possible to have a numerical "forward back reference" to a
- subpattern whose number is 10 or more using this syntax because a
- sequence such as \50 is interpreted as a character defined in octal.
+ It is not possible to have a numerical "forward back reference" to a
+ subpattern whose number is 10 or more using this syntax because a
+ sequence such as \50 is interpreted as a character defined in octal.
See the subsection entitled "Non-printing characters" above for further
- details of the handling of digits following a backslash. There is no
- such problem when named parentheses are used. A back reference to any
+ details of the handling of digits following a backslash. There is no
+ such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).
- Another way of avoiding the ambiguity inherent in the use of digits
- following a backslash is to use the \g escape sequence, which is a fea-
- ture introduced in Perl 5.10. This escape must be followed by an
- unsigned number or a negative number, optionally enclosed in braces.
- These examples are all identical:
+ Another way of avoiding the ambiguity inherent in the use of digits
+ following a backslash is to use the \g escape sequence. This escape
+ must be followed by an unsigned number or a negative number, optionally
+ enclosed in braces. These examples are all identical:
(ring), \1
(ring), \g1
@@ -4613,33 +4717,34 @@ BACK REFERENCES
(abc(def)ghi)\g{-1}
The sequence \g{-1} is a reference to the most recently started captur-
- ing subpattern before \g, that is, is it equivalent to \2. Similarly,
- \g{-2} would be equivalent to \1. The use of relative references can be
- helpful in long patterns, and also in patterns that are created by
- joining together fragments that contain references within themselves.
-
- A back reference matches whatever actually matched the capturing sub-
- pattern in the current subject string, rather than anything matching
+ ing subpattern before \g, that is, is it equivalent to \2 in this exam-
+ ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
+ references can be helpful in long patterns, and also in patterns that
+ are created by joining together fragments that contain references
+ within themselves.
+
+ A back reference matches whatever actually matched the capturing sub-
+ pattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern
(sens|respons)e and \1ibility
- matches "sense and sensibility" and "response and responsibility", but
- not "sense and responsibility". If caseful matching is in force at the
- time of the back reference, the case of letters is relevant. For exam-
+ matches "sense and sensibility" and "response and responsibility", but
+ not "sense and responsibility". If caseful matching is in force at the
+ time of the back reference, the case of letters is relevant. For exam-
ple,
((?i)rah)\s+\1
- matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
+ matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
- There are several different ways of writing back references to named
- subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
- \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
+ There are several different ways of writing back references to named
+ subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
+ \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
unified back reference syntax, in which \g can be used for both numeric
- and named references, is also supported. We could rewrite the above
+ and named references, is also supported. We could rewrite the above
example in any of the following ways:
(?<p1>(?i)rah)\s+\k<p1>
@@ -4647,67 +4752,67 @@ BACK REFERENCES
(?P<p1>(?i)rah)\s+(?P=p1)
(?<p1>(?i)rah)\s+\g{p1}
- A subpattern that is referenced by name may appear in the pattern
+ A subpattern that is referenced by name may appear in the pattern
before or after the reference.
- There may be more than one back reference to the same subpattern. If a
- subpattern has not actually been used in a particular match, any back
+ There may be more than one back reference to the same subpattern. If a
+ subpattern has not actually been used in a particular match, any back
references to it always fail by default. For example, the pattern
(a|(bc))\2
- always fails if it starts to match "a" rather than "bc". However, if
+ always fails if it starts to match "a" rather than "bc". However, if
the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
ence to an unset value matches an empty string.
- Because there may be many capturing parentheses in a pattern, all dig-
- its following a backslash are taken as part of a potential back refer-
- ence number. If the pattern continues with a digit character, some
- delimiter must be used to terminate the back reference. If the
+ Because there may be many capturing parentheses in a pattern, all dig-
+ its following a backslash are taken as part of a potential back refer-
+ ence number. If the pattern continues with a digit character, some
+ delimiter must be used to terminate the back reference. If the
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
syntax or an empty comment (see "Comments" below) can be used.
Recursive back references
- A back reference that occurs inside the parentheses to which it refers
- fails when the subpattern is first used, so, for example, (a\1) never
- matches. However, such references can be useful inside repeated sub-
+ A back reference that occurs inside the parentheses to which it refers
+ fails when the subpattern is first used, so, for example, (a\1) never
+ matches. However, such references can be useful inside repeated sub-
patterns. For example, the pattern
(a|b\1)+
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
- ation of the subpattern, the back reference matches the character
- string corresponding to the previous iteration. In order for this to
- work, the pattern must be such that the first iteration does not need
- to match the back reference. This can be done using alternation, as in
+ ation of the subpattern, the back reference matches the character
+ string corresponding to the previous iteration. In order for this to
+ work, the pattern must be such that the first iteration does not need
+ to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.
- Back references of this type cause the group that they reference to be
- treated as an atomic group. Once the whole group has been matched, a
- subsequent matching failure cannot cause backtracking into the middle
+ Back references of this type cause the group that they reference to be
+ treated as an atomic group. Once the whole group has been matched, a
+ subsequent matching failure cannot cause backtracking into the middle
of the group.
ASSERTIONS
- An assertion is a test on the characters following or preceding the
- current matching point that does not actually consume any characters.
- The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
+ An assertion is a test on the characters following or preceding the
+ current matching point that does not actually consume any characters.
+ The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above.
- More complicated assertions are coded as subpatterns. There are two
- kinds: those that look ahead of the current position in the subject
- string, and those that look behind it. An assertion subpattern is
- matched in the normal way, except that it does not cause the current
+ More complicated assertions are coded as subpatterns. There are two
+ kinds: those that look ahead of the current position in the subject
+ string, and those that look behind it. An assertion subpattern is
+ matched in the normal way, except that it does not cause the current
matching position to be changed.
- Assertion subpatterns are not capturing subpatterns, and may not be
- repeated, because it makes no sense to assert the same thing several
- times. If any kind of assertion contains capturing subpatterns within
- it, these are counted for the purposes of numbering the capturing sub-
+ Assertion subpatterns are not capturing subpatterns, and may not be
+ repeated, because it makes no sense to assert the same thing several
+ times. If any kind of assertion contains capturing subpatterns within
+ it, these are counted for the purposes of numbering the capturing sub-
patterns in the whole pattern. However, substring capturing is carried
- out only for positive assertions, because it does not make sense for
+ out only for positive assertions, because it does not make sense for
negative assertions.
Lookahead assertions
@@ -4717,38 +4822,38 @@ ASSERTIONS
\w+(?=;)
- matches a word followed by a semicolon, but does not include the semi-
+ matches a word followed by a semicolon, but does not include the semi-
colon in the match, and
foo(?!bar)
- matches any occurrence of "foo" that is not followed by "bar". Note
+ matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern
(?!foo)bar
- does not find an occurrence of "bar" that is preceded by something
- other than "foo"; it finds any occurrence of "bar" whatsoever, because
+ does not find an occurrence of "bar" that is preceded by something
+ other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve the other effect.
If you want to force a matching failure at some point in a pattern, the
- most convenient way to do it is with (?!) because an empty string
- always matches, so an assertion that requires there not to be an empty
- string must always fail. The Perl 5.10 backtracking control verb
- (*FAIL) or (*F) is essentially a synonym for (?!).
+ most convenient way to do it is with (?!) because an empty string
+ always matches, so an assertion that requires there not to be an empty
+ string must always fail. The backtracking control verb (*FAIL) or (*F)
+ is a synonym for (?!).
Lookbehind assertions
- Lookbehind assertions start with (?<= for positive assertions and (?<!
+ Lookbehind assertions start with (?<= for positive assertions and (?<!
for negative assertions. For example,
(?<!foo)bar
- does find an occurrence of "bar" that is not preceded by "foo". The
- contents of a lookbehind assertion are restricted such that all the
+ does find an occurrence of "bar" that is not preceded by "foo". The
+ contents of a lookbehind assertion are restricted such that all the
strings it matches must have a fixed length. However, if there are sev-
- eral top-level alternatives, they do not all have to have the same
+ eral top-level alternatives, they do not all have to have the same
fixed length. Thus
(?<=bullock|donkey)
@@ -4757,22 +4862,21 @@ ASSERTIONS
(?<!dogs?|cats?)
- causes an error at compile time. Branches that match different length
- strings are permitted only at the top level of a lookbehind assertion.
- This is an extension compared with Perl (5.8 and 5.10), which requires
- all branches to match the same length of string. An assertion such as
+ causes an error at compile time. Branches that match different length
+ strings are permitted only at the top level of a lookbehind assertion.
+ This is an extension compared with Perl, which requires all branches to
+ match the same length of string. An assertion such as
(?<=ab(c|de))
- is not permitted, because its single top-level branch can match two
+ is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable to PCRE if rewritten to use two
top-level branches:
(?<=abc|abde)
- In some cases, the Perl 5.10 escape sequence \K (see above) can be used
- instead of a lookbehind assertion to get round the fixed-length
- restriction.
+ In some cases, the escape sequence \K (see above) can be used instead
+ of a lookbehind assertion to get round the fixed-length restriction.
The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed length and
@@ -4862,7 +4966,14 @@ CONDITIONAL SUBPATTERNS
If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two alterna-
- tives in the subpattern, a compile-time error occurs.
+ tives in the subpattern, a compile-time error occurs. Each of the two
+ alternatives may itself contain nested subpatterns of any form, includ-
+ ing conditional subpatterns; the restriction to two alternatives
+ applies only at the level of the condition. This pattern fragment is an
+ example where the alternatives are complex:
+
+ (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
+
There are four kinds of condition: references to subpatterns, refer-
ences to recursion, a pseudo-condition called DEFINE, and assertions.
@@ -4873,63 +4984,64 @@ CONDITIONAL SUBPATTERNS
the condition is true if a capturing subpattern of that number has pre-
viously matched. If there is more than one capturing subpattern with
the same number (see the earlier section about duplicate subpattern
- numbers), the condition is true if any of them have been set. An alter-
+ numbers), the condition is true if any of them have matched. An alter-
native notation is to precede the digits with a plus or minus sign. In
this case, the subpattern number is relative rather than absolute. The
most recently opened parentheses can be referenced by (?(-1), the next
- most recent by (?(-2), and so on. In looping constructs it can also
- make sense to refer to subsequent groups with constructs such as
- (?(+2).
+ most recent by (?(-2), and so on. Inside loops it can also make sense
+ to refer to subsequent groups. The next parentheses to be opened can be
+ referenced as (?(+1), and so on. (The value zero in any of these forms
+ is not used; it provokes a compile-time error.)
- Consider the following pattern, which contains non-significant white
+ Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:
( \( )? [^()]+ (?(1) \) )
- The first part matches an optional opening parenthesis, and if that
+ The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The sec-
- ond part matches one or more characters that are not parentheses. The
- third part is a conditional subpattern that tests whether the first set
- of parentheses matched or not. If they did, that is, if subject started
- with an opening parenthesis, the condition is true, and so the yes-pat-
- tern is executed and a closing parenthesis is required. Otherwise,
- since no-pattern is not present, the subpattern matches nothing. In
- other words, this pattern matches a sequence of non-parentheses,
+ ond part matches one or more characters that are not parentheses. The
+ third part is a conditional subpattern that tests whether or not the
+ first set of parentheses matched. If they did, that is, if subject
+ started with an opening parenthesis, the condition is true, and so the
+ yes-pattern is executed and a closing parenthesis is required. Other-
+ wise, since no-pattern is not present, the subpattern matches nothing.
+ In other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
- If you were embedding this pattern in a larger one, you could use a
+ If you were embedding this pattern in a larger one, you could use a
relative reference:
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
- This makes the fragment independent of the parentheses in the larger
+ This makes the fragment independent of the parentheses in the larger
pattern.
Checking for a used subpattern by name
- Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
- used subpattern by name. For compatibility with earlier versions of
- PCRE, which had this facility before Perl, the syntax (?(name)...) is
- also recognized. However, there is a possible ambiguity with this syn-
- tax, because subpattern names may consist entirely of digits. PCRE
- looks first for a named subpattern; if it cannot find one and the name
- consists entirely of digits, PCRE looks for a subpattern of that num-
- ber, which must be greater than zero. Using subpattern names that con-
+ Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
+ used subpattern by name. For compatibility with earlier versions of
+ PCRE, which had this facility before Perl, the syntax (?(name)...) is
+ also recognized. However, there is a possible ambiguity with this syn-
+ tax, because subpattern names may consist entirely of digits. PCRE
+ looks first for a named subpattern; if it cannot find one and the name
+ consists entirely of digits, PCRE looks for a subpattern of that num-
+ ber, which must be greater than zero. Using subpattern names that con-
sist entirely of digits is not recommended.
Rewriting the above example to use a named subpattern gives this:
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
- If the name used in a condition of this kind is a duplicate, the test
- is applied to all subpatterns of the same name, and is true if any one
+ If the name used in a condition of this kind is a duplicate, the test
+ is applied to all subpatterns of the same name, and is true if any one
of them has matched.
Checking for pattern recursion
If the condition is the string (R), and there is no subpattern with the
- name R, the condition is true if a recursive call to the whole pattern
+ name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by amper-
sand follow the letter R, for example:
@@ -4937,23 +5049,24 @@ CONDITIONAL SUBPATTERNS
the condition is true if the most recent recursion is into a subpattern
whose number or name is given. This condition does not check the entire
- recursion stack. If the name used in a condition of this kind is a
+ recursion stack. If the name used in a condition of this kind is a
duplicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion.
- At "top level", all these recursion test conditions are false. The
+ At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below.
Defining subpatterns for use by reference only
- If the condition is the string (DEFINE), and there is no subpattern
- with the name DEFINE, the condition is always false. In this case,
- there may be only one alternative in the subpattern. It is always
- skipped if control reaches this point in the pattern; the idea of
- DEFINE is that it can be used to define "subroutines" that can be ref-
- erenced from elsewhere. (The use of "subroutines" is described below.)
- For example, a pattern to match an IPv4 address could be written like
- this (ignore whitespace and line breaks):
+ If the condition is the string (DEFINE), and there is no subpattern
+ with the name DEFINE, the condition is always false. In this case,
+ there may be only one alternative in the subpattern. It is always
+ skipped if control reaches this point in the pattern; the idea of
+ DEFINE is that it can be used to define "subroutines" that can be ref-
+ erenced from elsewhere. (The use of "subroutines" is described below.)
+ For example, a pattern to match an IPv4 address such as
+ "192.168.23.245" could be written like this (ignore whitespace and line
+ breaks):
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
\b (?&byte) (\.(?&byte)){3} \b
@@ -4987,27 +5100,44 @@ CONDITIONAL SUBPATTERNS
COMMENTS
- The sequence (?# marks the start of a comment that continues up to the
- next closing parenthesis. Nested parentheses are not permitted. The
- characters that make up a comment play no part in the pattern matching
- at all.
+ There are two ways of including comments in patterns that are processed
+ by PCRE. In both cases, the start of the comment must not be in a char-
+ acter class, nor in the middle of any other sequence of related charac-
+ ters such as (?: or a subpattern name or number. The characters that
+ make up a comment play no part in the pattern matching.
- If the PCRE_EXTENDED option is set, an unescaped # character outside a
- character class introduces a comment that continues to immediately
- after the next newline in the pattern.
+ The sequence (?# marks the start of a comment that continues up to the
+ next closing parenthesis. Nested parentheses are not permitted. If the
+ PCRE_EXTENDED option is set, an unescaped # character also introduces a
+ comment, which in this case continues to immediately after the next
+ newline character or character sequence in the pattern. Which charac-
+ ters are interpreted as newlines is controlled by the options passed to
+ pcre_compile() or by a special sequence at the start of the pattern, as
+ described in the section entitled "Newline conventions" above. Note
+ that the end of this type of comment is a literal newline sequence in
+ the pattern; escape sequences that happen to represent a newline do not
+ count. For example, consider this pattern when PCRE_EXTENDED is set,
+ and the default newline convention is in force:
+
+ abc #comment \n still comment
+
+ On encountering the # character, pcre_compile() skips along, looking
+ for a newline in the pattern. The sequence \n is still literal at this
+ stage, so it does not terminate the comment. Only an actual character
+ with the code value 0x0a (the default newline) does so.
RECURSIVE PATTERNS
- Consider the problem of matching a string in parentheses, allowing for
- unlimited nested parentheses. Without the use of recursion, the best
- that can be done is to use a pattern that matches up to some fixed
- depth of nesting. It is not possible to handle an arbitrary nesting
+ Consider the problem of matching a string in parentheses, allowing for
+ unlimited nested parentheses. Without the use of recursion, the best
+ that can be done is to use a pattern that matches up to some fixed
+ depth of nesting. It is not possible to handle an arbitrary nesting
depth.
For some time, Perl has provided a facility that allows regular expres-
- sions to recurse (amongst other things). It does this by interpolating
- Perl code in the expression at run time, and the code can refer to the
+ sions to recurse (amongst other things). It does this by interpolating
+ Perl code in the expression at run time, and the code can refer to the
expression itself. A Perl pattern using code interpolation to solve the
parentheses problem can be created like this:
@@ -5017,182 +5147,182 @@ RECURSIVE PATTERNS
refers recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code. Instead,
- it supports special syntax for recursion of the entire pattern, and
- also for individual subpattern recursion. After its introduction in
- PCRE and Python, this kind of recursion was subsequently introduced
+ it supports special syntax for recursion of the entire pattern, and
+ also for individual subpattern recursion. After its introduction in
+ PCRE and Python, this kind of recursion was subsequently introduced
into Perl at release 5.10.
- A special item that consists of (? followed by a number greater than
+ A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive call of the subpattern of
- the given number, provided that it occurs inside that subpattern. (If
- not, it is a "subroutine" call, which is described in the next sec-
- tion.) The special item (?R) or (?0) is a recursive call of the entire
+ the given number, provided that it occurs inside that subpattern. (If
+ not, it is a "subroutine" call, which is described in the next sec-
+ tion.) The special item (?R) or (?0) is a recursive call of the entire
regular expression.
- This PCRE pattern solves the nested parentheses problem (assume the
+ This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):
\( ( [^()]++ | (?R) )* \)
- First it matches an opening parenthesis. Then it matches any number of
- substrings which can either be a sequence of non-parentheses, or a
- recursive match of the pattern itself (that is, a correctly parenthe-
+ First it matches an opening parenthesis. Then it matches any number of
+ substrings which can either be a sequence of non-parentheses, or a
+ recursive match of the pattern itself (that is, a correctly parenthe-
sized substring). Finally there is a closing parenthesis. Note the use
of a possessive quantifier to avoid backtracking into sequences of non-
parentheses.
- If this were part of a larger pattern, you would not want to recurse
+ If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:
( \( ( [^()]++ | (?1) )* \) )
- We have put the pattern into parentheses, and caused the recursion to
+ We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.
- In a larger pattern, keeping track of parenthesis numbers can be
- tricky. This is made easier by the use of relative references (a Perl
- 5.10 feature). Instead of (?1) in the pattern above you can write
- (?-2) to refer to the second most recently opened parentheses preceding
- the recursion. In other words, a negative number counts capturing
- parentheses leftwards from the point at which it is encountered.
-
- It is also possible to refer to subsequently opened parentheses, by
- writing references such as (?+2). However, these cannot be recursive
- because the reference is not inside the parentheses that are refer-
- enced. They are always "subroutine" calls, as described in the next
+ In a larger pattern, keeping track of parenthesis numbers can be
+ tricky. This is made easier by the use of relative references. Instead
+ of (?1) in the pattern above you can write (?-2) to refer to the second
+ most recently opened parentheses preceding the recursion. In other
+ words, a negative number counts capturing parentheses leftwards from
+ the point at which it is encountered.
+
+ It is also possible to refer to subsequently opened parentheses, by
+ writing references such as (?+2). However, these cannot be recursive
+ because the reference is not inside the parentheses that are refer-
+ enced. They are always "subroutine" calls, as described in the next
section.
- An alternative approach is to use named parentheses instead. The Perl
- syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
+ An alternative approach is to use named parentheses instead. The Perl
+ syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:
(?<pn> \( ( [^()]++ | (?&pn) )* \) )
- If there is more than one subpattern with the same name, the earliest
+ If there is more than one subpattern with the same name, the earliest
one is used.
- This particular example pattern that we have been looking at contains
+ This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of a possessive quantifier for
matching strings of non-parentheses is important when applying the pat-
- tern to strings that do not match. For example, when this pattern is
+ tern to strings that do not match. For example, when this pattern is
applied to
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
- it yields "no match" quickly. However, if a possessive quantifier is
- not used, the match runs for a very long time indeed because there are
- so many different ways the + and * repeats can carve up the subject,
+ it yields "no match" quickly. However, if a possessive quantifier is
+ not used, the match runs for a very long time indeed because there are
+ so many different ways the + and * repeats can carve up the subject,
and all have to be tested before failure can be reported.
- At the end of a match, the values of capturing parentheses are those
- from the outermost level. If you want to obtain intermediate values, a
- callout function can be used (see below and the pcrecallout documenta-
+ At the end of a match, the values of capturing parentheses are those
+ from the outermost level. If you want to obtain intermediate values, a
+ callout function can be used (see below and the pcrecallout documenta-
tion). If the pattern above is matched against
(ab(cd)ef)
- the value for the inner capturing parentheses (numbered 2) is "ef",
- which is the last value taken on at the top level. If a capturing sub-
+ the value for the inner capturing parentheses (numbered 2) is "ef",
+ which is the last value taken on at the top level. If a capturing sub-
pattern is not matched at the top level, its final value is unset, even
if it is (temporarily) set at a deeper level.
- If there are more than 15 capturing parentheses in a pattern, PCRE has
- to obtain extra memory to store data during a recursion, which it does
+ If there are more than 15 capturing parentheses in a pattern, PCRE has
+ to obtain extra memory to store data during a recursion, which it does
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
- Do not confuse the (?R) item with the condition (R), which tests for
- recursion. Consider this pattern, which matches text in angle brack-
- ets, allowing for arbitrary nesting. Only digits are allowed in nested
- brackets (that is, when recursing), whereas any characters are permit-
+ Do not confuse the (?R) item with the condition (R), which tests for
+ recursion. Consider this pattern, which matches text in angle brack-
+ ets, allowing for arbitrary nesting. Only digits are allowed in nested
+ brackets (that is, when recursing), whereas any characters are permit-
ted at the outer level.
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
- In this pattern, (?(R) is the start of a conditional subpattern, with
- two different alternatives for the recursive and non-recursive cases.
+ In this pattern, (?(R) is the start of a conditional subpattern, with
+ two different alternatives for the recursive and non-recursive cases.
The (?R) item is the actual recursive call.
Recursion difference from Perl
- In PCRE (like Python, but unlike Perl), a recursive subpattern call is
+ In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
- alternatives and there is a subsequent matching failure. This can be
- illustrated by the following pattern, which purports to match a palin-
- dromic string that contains an odd number of characters (for example,
+ alternatives and there is a subsequent matching failure. This can be
+ illustrated by the following pattern, which purports to match a palin-
+ dromic string that contains an odd number of characters (for example,
"a", "aba", "abcba", "abcdcba"):
^(.|(.)(?1)\2)$
The idea is that it either matches a single character, or two identical
- characters surrounding a sub-palindrome. In Perl, this pattern works;
- in PCRE it does not if the pattern is longer than three characters.
+ characters surrounding a sub-palindrome. In Perl, this pattern works;
+ in PCRE it does not if the pattern is longer than three characters.
Consider the subject string "abcba":
- At the top level, the first character is matched, but as it is not at
+ At the top level, the first character is matched, but as it is not at
the end of the string, the first alternative fails; the second alterna-
tive is taken and the recursion kicks in. The recursive call to subpat-
- tern 1 successfully matches the next character ("b"). (Note that the
+ tern 1 successfully matches the next character ("b"). (Note that the
beginning and end of line tests are not part of the recursion).
- Back at the top level, the next character ("c") is compared with what
- subpattern 2 matched, which was "a". This fails. Because the recursion
- is treated as an atomic group, there are now no backtracking points,
- and so the entire match fails. (Perl is able, at this point, to re-
- enter the recursion and try the second alternative.) However, if the
+ Back at the top level, the next character ("c") is compared with what
+ subpattern 2 matched, which was "a". This fails. Because the recursion
+ is treated as an atomic group, there are now no backtracking points,
+ and so the entire match fails. (Perl is able, at this point, to re-
+ enter the recursion and try the second alternative.) However, if the
pattern is written with the alternatives in the other order, things are
different:
^((.)(?1)\2|.)$
- This time, the recursing alternative is tried first, and continues to
- recurse until it runs out of characters, at which point the recursion
- fails. But this time we do have another alternative to try at the
- higher level. That is the big difference: in the previous case the
+ This time, the recursing alternative is tried first, and continues to
+ recurse until it runs out of characters, at which point the recursion
+ fails. But this time we do have another alternative to try at the
+ higher level. That is the big difference: in the previous case the
remaining alternative is at a deeper recursion level, which PCRE cannot
use.
- To change the pattern so that matches all palindromic strings, not just
- those with an odd number of characters, it is tempting to change the
- pattern to this:
+ To change the pattern so that it matches all palindromic strings, not
+ just those with an odd number of characters, it is tempting to change
+ the pattern to this:
^((.)(?1)\2|.?)$
- Again, this works in Perl, but not in PCRE, and for the same reason.
- When a deeper recursion has matched a single character, it cannot be
- entered again in order to match an empty string. The solution is to
- separate the two cases, and write out the odd and even cases as alter-
+ Again, this works in Perl, but not in PCRE, and for the same reason.
+ When a deeper recursion has matched a single character, it cannot be
+ entered again in order to match an empty string. The solution is to
+ separate the two cases, and write out the odd and even cases as alter-
natives at the higher level:
^(?:((.)(?1)\2|)|((.)(?3)\4|.))
- If you want to match typical palindromic phrases, the pattern has to
+ If you want to match typical palindromic phrases, the pattern has to
ignore all non-word characters, which can be done like this:
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
If run with the PCRE_CASELESS option, this pattern matches phrases such
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
- Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
- ing into sequences of non-word characters. Without this, PCRE takes a
- great deal longer (ten times or more) to match typical phrases, and
+ Perl. Note the use of the possessive quantifier *+ to avoid backtrack-
+ ing into sequences of non-word characters. Without this, PCRE takes a
+ great deal longer (ten times or more) to match typical phrases, and
Perl takes so long that you think it has gone into a loop.
- WARNING: The palindrome-matching patterns above work only if the sub-
- ject string does not start with a palindrome that is shorter than the
- entire string. For example, although "abcba" is correctly matched, if
- the subject is "ababa", PCRE finds the palindrome "aba" at the start,
- then fails at top level because the end of the string does not follow.
- Once again, it cannot jump back into the recursion to try other alter-
+ WARNING: The palindrome-matching patterns above work only if the sub-
+ ject string does not start with a palindrome that is shorter than the
+ entire string. For example, although "abcba" is correctly matched, if
+ the subject is "ababa", PCRE finds the palindrome "aba" at the start,
+ then fails at top level because the end of the string does not follow.
+ Once again, it cannot jump back into the recursion to try other alter-
natives, so the entire match fails.
SUBPATTERNS AS SUBROUTINES
If the syntax for a recursive subpattern reference (either by number or
- by name) is used outside the parentheses to which it refers, it oper-
- ates like a subroutine in a programming language. The "called" subpat-
+ by name) is used outside the parentheses to which it refers, it oper-
+ ates like a subroutine in a programming language. The "called" subpat-
tern may be defined before or after the reference. A numbered reference
can be absolute or relative, as in these examples:
@@ -5204,125 +5334,126 @@ SUBPATTERNS AS SUBROUTINES
(sens|respons)e and \1ibility
- matches "sense and sensibility" and "response and responsibility", but
+ matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern
(sens|respons)e and (?1)ibility
- is used, it does match "sense and responsibility" as well as the other
- two strings. Another example is given in the discussion of DEFINE
+ is used, it does match "sense and responsibility" as well as the other
+ two strings. Another example is given in the discussion of DEFINE
above.
- Like recursive subpatterns, a subroutine call is always treated as an
- atomic group. That is, once it has matched some of the subject string,
- it is never re-entered, even if it contains untried alternatives and
- there is a subsequent matching failure. Any capturing parentheses that
- are set during the subroutine call revert to their previous values
+ Like recursive subpatterns, a subroutine call is always treated as an
+ atomic group. That is, once it has matched some of the subject string,
+ it is never re-entered, even if it contains untried alternatives and
+ there is a subsequent matching failure. Any capturing parentheses that
+ are set during the subroutine call revert to their previous values
afterwards.
- When a subpattern is used as a subroutine, processing options such as
+ When a subpattern is used as a subroutine, processing options such as
case-independence are fixed when the subpattern is defined. They cannot
be changed for different calls. For example, consider this pattern:
(abc)(?i:(?-1))
- It matches "abcabc". It does not match "abcABC" because the change of
+ It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
ONIGURUMA SUBROUTINE SYNTAX
- For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
- an alternative syntax for referencing a subpattern as a subroutine,
- possibly recursively. Here are two of the examples used above, rewrit-
+ an alternative syntax for referencing a subpattern as a subroutine,
+ possibly recursively. Here are two of the examples used above, rewrit-
ten using this syntax:
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
(sens|respons)e and \g'1'ibility
- PCRE supports an extension to Oniguruma: if a number is preceded by a
+ PCRE supports an extension to Oniguruma: if a number is preceded by a
plus or a minus sign it is taken as a relative reference. For example:
(abc)(?i:\g<-1>)
- Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
- synonymous. The former is a back reference; the latter is a subroutine
+ Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
+ synonymous. The former is a back reference; the latter is a subroutine
call.
CALLOUTS
Perl has a feature whereby using the sequence (?{...}) causes arbitrary
- Perl code to be obeyed in the middle of matching a regular expression.
+ Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different sub-
strings that match the same pair of parentheses when there is a repeti-
tion.
PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
- an external function by putting its entry point in the global variable
- pcre_callout. By default, this variable contains NULL, which disables
+ an external function by putting its entry point in the global variable
+ pcre_callout. By default, this variable contains NULL, which disables
all calling out.
- Within a regular expression, (?C) indicates the points at which the
- external function is to be called. If you want to identify different
- callout points, you can put a number less than 256 after the letter C.
- The default value is zero. For example, this pattern has two callout
+ Within a regular expression, (?C) indicates the points at which the
+ external function is to be called. If you want to identify different
+ callout points, you can put a number less than 256 after the letter C.
+ The default value is zero. For example, this pattern has two callout
points:
(?C1)abc(?C2)def
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
- automatically installed before each item in the pattern. They are all
+ automatically installed before each item in the pattern. They are all
numbered 255.
During matching, when PCRE reaches a callout point (and pcre_callout is
- set), the external function is called. It is provided with the number
- of the callout, the position in the pattern, and, optionally, one item
- of data originally supplied by the caller of pcre_exec(). The callout
- function may cause matching to proceed, to backtrack, or to fail alto-
+ set), the external function is called. It is provided with the number
+ of the callout, the position in the pattern, and, optionally, one item
+ of data originally supplied by the caller of pcre_exec(). The callout
+ function may cause matching to proceed, to backtrack, or to fail alto-
gether. A complete description of the interface to the callout function
is given in the pcrecallout documentation.
BACKTRACKING CONTROL
- Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
+ Perl 5.10 introduced a number of "Special Backtracking Control Verbs",
which are described in the Perl documentation as "experimental and sub-
- ject to change or removal in a future version of Perl". It goes on to
- say: "Their usage in production code should be noted to avoid problems
+ ject to change or removal in a future version of Perl". It goes on to
+ say: "Their usage in production code should be noted to avoid problems
during upgrades." The same remarks apply to the PCRE features described
in this section.
- Since these verbs are specifically related to backtracking, most of
- them can be used only when the pattern is to be matched using
+ Since these verbs are specifically related to backtracking, most of
+ them can be used only when the pattern is to be matched using
pcre_exec(), which uses a backtracking algorithm. With the exception of
(*FAIL), which behaves like a failing negative assertion, they cause an
error if encountered by pcre_dfa_exec().
If any of these verbs are used in an assertion or subroutine subpattern
- (including recursive subpatterns), their effect is confined to that
- subpattern; it does not extend to the surrounding pattern. Note that
- such subpatterns are processed as anchored at the point where they are
+ (including recursive subpatterns), their effect is confined to that
+ subpattern; it does not extend to the surrounding pattern. Note that
+ such subpatterns are processed as anchored at the point where they are
tested.
- The new verbs make use of what was previously invalid syntax: an open-
+ The new verbs make use of what was previously invalid syntax: an open-
ing parenthesis followed by an asterisk. They are generally of the form
- (*VERB) or (*VERB:NAME). Some may take either form, with differing be-
+ (*VERB) or (*VERB:NAME). Some may take either form, with differing be-
haviour, depending on whether or not an argument is present. An name is
- a sequence of letters, digits, and underscores. If the name is empty,
- that is, if the closing parenthesis immediately follows the colon, the
+ a sequence of letters, digits, and underscores. If the name is empty,
+ that is, if the closing parenthesis immediately follows the colon, the
effect is as if the colon were not there. Any number of these verbs may
occur in a pattern.
- PCRE contains some optimizations that are used to speed up matching by
+ PCRE contains some optimizations that are used to speed up matching by
running some checks at the start of each match attempt. For example, it
- may know the minimum length of matching subject, or that a particular
- character must be present. When one of these optimizations suppresses
- the running of a match, any included backtracking verbs will not, of
+ may know the minimum length of matching subject, or that a particular
+ character must be present. When one of these optimizations suppresses
+ the running of a match, any included backtracking verbs will not, of
course, be processed. You can suppress the start-of-match optimizations
- by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_exec().
+ by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com-
+ pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
Verbs that act immediately
@@ -5511,20 +5642,40 @@ BACKTRACKING CONTROL
(*THEN) or (*THEN:NAME)
- This verb causes a skip to the next alternation if the rest of the pat-
- tern does not match. That is, it cancels pending backtracking, but only
- within the current alternation. Its name comes from the observation
- that it can be used for a pattern-based if-then-else block:
+ This verb causes a skip to the next alternation in the innermost
+ enclosing group if the rest of the pattern does not match. That is, it
+ cancels pending backtracking, but only within the current alternation.
+ Its name comes from the observation that it can be used for a pattern-
+ based if-then-else block:
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
- If the COND1 pattern matches, FOO is tried (and possibly further items
- after the end of the group if FOO succeeds); on failure the matcher
- skips to the second alternative and tries COND2, without backtracking
- into COND1. The behaviour of (*THEN:NAME) is exactly the same as
- (*MARK:NAME)(*THEN) if the overall match fails. If (*THEN) is not
+ If the COND1 pattern matches, FOO is tried (and possibly further items
+ after the end of the group if FOO succeeds); on failure the matcher
+ skips to the second alternative and tries COND2, without backtracking
+ into COND1. The behaviour of (*THEN:NAME) is exactly the same as
+ (*MARK:NAME)(*THEN) if the overall match fails. If (*THEN) is not
directly inside an alternation, it acts like (*PRUNE).
+ The above verbs provide four different "strengths" of control when sub-
+ sequent matching fails. (*THEN) is the weakest, carrying on the match
+ at the next alternation. (*PRUNE) comes next, failing the match at the
+ current starting position, but allowing an advance to the next charac-
+ ter (for an unanchored pattern). (*SKIP) is similar, except that the
+ advance may be more than one character. (*COMMIT) is the strongest,
+ causing the entire match to fail.
+
+ If more than one is present in a pattern, the "stongest" one wins. For
+ example, consider this pattern, where A, B, etc. are complex pattern
+ fragments:
+
+ (A(*COMMIT)B(*THEN)C|D)
+
+ Once A has matched, PCRE is committed to this match, at the current
+ starting position. If subsequently B matches, but C does not, the nor-
+ mal (*THEN) action of trying the next alternation (that is, D) does not
+ happen because (*COMMIT) overrides.
+
SEE ALSO
@@ -5540,7 +5691,7 @@ AUTHOR
REVISION
- Last updated: 18 May 2010
+ Last updated: 21 November 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -5568,7 +5719,7 @@ QUOTING
CHARACTERS
\a alarm, that is, the BEL character (hex 07)
- \cx "control-x", where x is any character
+ \cx "control-x", where x is any ASCII character
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
@@ -5787,6 +5938,7 @@ OPTION SETTING
The following are recognized only at the start of a pattern or after
one of the newline-setting options with similar syntax:
+ (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
(*UTF8) set UTF-8 mode (PCRE_UTF8)
(*UCP) set PCRE_UCP (use Unicode properties for \d etc)
@@ -5909,7 +6061,7 @@ AUTHOR
REVISION
- Last updated: 12 May 2010
+ Last updated: 21 November 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -5941,8 +6093,8 @@ PARTIAL MATCHING IN PCRE
reflecting the character that has been typed, for example. This immedi-
ate feedback is likely to be a better user interface than a check that
is delayed until the entire string has been entered. Partial matching
- can also sometimes be useful when the subject string is very long and
- is not all available at once.
+ can also be useful when the subject string is very long and is not all
+ available at once.
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
@@ -5964,20 +6116,22 @@ PARTIAL MATCHING IN PCRE
PARTIAL MATCHING USING pcre_exec()
- A partial match occurs during a call to pcre_exec() whenever the end of
- the subject string is reached successfully, but matching cannot con-
- tinue because more characters are needed. However, at least one charac-
- ter must have been matched. (In other words, a partial match can never
- be an empty string.)
-
- If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but
- matching continues as normal, and other alternatives in the pattern are
- tried. If no complete match can be found, pcre_exec() returns
- PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. If there are at least
- two slots in the offsets vector, the first of them is set to the offset
- of the earliest character that was inspected when the partial match was
- found. For convenience, the second offset points to the end of the
- string so that a substring can easily be identified.
+ A partial match occurs during a call to pcre_exec() when the end of the
+ subject string is reached successfully, but matching cannot continue
+ because more characters are needed. However, at least one character in
+ the subject must have been inspected. This character need not form part
+ of the final matched string; lookbehind assertions and the \K escape
+ sequence provide ways of inspecting characters before the start of a
+ matched substring. The requirement for inspecting at least one charac-
+ ter exists because an empty string can always be matched; without such
+ a restriction there would always be a partial match of an empty string
+ at the end of the subject.
+
+ If there are at least two slots in the offsets vector when pcre_exec()
+ returns with a partial match, the first slot is set to the offset of
+ the earliest character that was inspected when the partial match was
+ found. For convenience, the second offset points to the end of the sub-
+ ject so that a substring can easily be identified.
For the majority of patterns, the first offset identifies the start of
the partially matched string. However, for patterns that contain look-
@@ -5989,120 +6143,153 @@ PARTIAL MATCHING USING pcre_exec()
This pattern matches "123", but only if it is preceded by "abc". If the
subject string is "xyzabc12", the offsets after a partial match are for
the substring "abc12", because all these characters are needed if
- another match is tried with extra characters added.
+ another match is tried with extra characters added to the subject.
+
+ What happens when a partial match is identified depends on which of the
+ two partial matching options are set.
+
+ PCRE_PARTIAL_SOFT with pcre_exec()
- If there is more than one partial match, the first one that was found
+ If PCRE_PARTIAL_SOFT is set when pcre_exec() identifies a partial
+ match, the partial match is remembered, but matching continues as nor-
+ mal, and other alternatives in the pattern are tried. If no complete
+ match can be found, pcre_exec() returns PCRE_ERROR_PARTIAL instead of
+ PCRE_ERROR_NOMATCH.
+
+ This option is "soft" because it prefers a complete match over a par-
+ tial match. All the various matching items in a pattern behave as if
+ the subject string is potentially complete. For example, \z, \Z, and $
+ match at the end of the subject, as normal, and for \b and \B the end
+ of the subject is treated as a non-alphanumeric.
+
+ If there is more than one partial match, the first one that was found
provides the data that is returned. Consider this pattern:
/123\w+X|dogY/
- If this is matched against the subject string "abc123dog", both alter-
- natives fail to match, but the end of the subject is reached during
- matching, so PCRE_ERROR_PARTIAL is returned instead of
- PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying
- "123dog" as the first partial match that was found. (In this example,
- there are two partial matches, because "dog" on its own partially
- matches the second alternative.)
+ If this is matched against the subject string "abc123dog", both alter-
+ natives fail to match, but the end of the subject is reached during
+ matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3
+ and 9, identifying "123dog" as the first partial match that was found.
+ (In this example, there are two partial matches, because "dog" on its
+ own partially matches the second alternative.)
+
+ PCRE_PARTIAL_HARD with pcre_exec()
If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR-
TIAL as soon as a partial match is found, without continuing to search
- for possible complete matches. The difference between the two options
- can be illustrated by a pattern such as:
+ for possible complete matches. This option is "hard" because it prefers
+ an earlier partial match over a later complete match. For this reason,
+ the assumption is made that the end of the supplied subject string may
+ not be the true end of the available data, and so, if \z, \Z, \b, \B,
+ or $ are encountered at the end of the subject, the result is
+ PCRE_ERROR_PARTIAL.
+
+ Setting PCRE_PARTIAL_HARD also affects the way pcre_exec() checks UTF-8
+ subject strings for validity. Normally, an invalid UTF-8 sequence
+ causes the error PCRE_ERROR_BADUTF8. However, in the special case of a
+ truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORT-
+ UTF8 is returned when PCRE_PARTIAL_HARD is set.
+
+ Comparing hard and soft partial matching
+
+ The difference between the two partial matching options can be illus-
+ trated by a pattern such as:
/dog(sbody)?/
- This matches either "dog" or "dogsbody", greedily (that is, it prefers
- the longer string if possible). If it is matched against the string
- "dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
+ This matches either "dog" or "dogsbody", greedily (that is, it prefers
+ the longer string if possible). If it is matched against the string
+ "dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog".
However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
- On the other hand, if the pattern is made ungreedy the result is dif-
+ On the other hand, if the pattern is made ungreedy the result is dif-
ferent:
/dog(sbody)??/
- In this case the result is always a complete match because pcre_exec()
- finds that first, and it never continues after finding a match. It
- might be easier to follow this explanation by thinking of the two pat-
+ In this case the result is always a complete match because pcre_exec()
+ finds that first, and it never continues after finding a match. It
+ might be easier to follow this explanation by thinking of the two pat-
terns like this:
/dog(sbody)?/ is the same as /dogsbody|dog/
/dog(sbody)??/ is the same as /dog|dogsbody/
- The second pattern will never match "dogsbody" when pcre_exec() is
+ The second pattern will never match "dogsbody" when pcre_exec() is
used, because it will always find the shorter match first.
PARTIAL MATCHING USING pcre_dfa_exec()
- The pcre_dfa_exec() function moves along the subject string character
- by character, without backtracking, searching for all possible matches
- simultaneously. If the end of the subject is reached before the end of
- the pattern, there is the possibility of a partial match, again pro-
- vided that at least one character has matched.
-
- When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
- there have been no complete matches. Otherwise, the complete matches
- are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
- takes precedence over any complete matches. The portion of the string
- that was inspected when the longest partial match was found is set as
+ The pcre_dfa_exec() function moves along the subject string character
+ by character, without backtracking, searching for all possible matches
+ simultaneously. If the end of the subject is reached before the end of
+ the pattern, there is the possibility of a partial match, again pro-
+ vided that at least one character has been inspected.
+
+ When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if
+ there have been no complete matches. Otherwise, the complete matches
+ are returned. However, if PCRE_PARTIAL_HARD is set, a partial match
+ takes precedence over any complete matches. The portion of the string
+ that was inspected when the longest partial match was found is set as
the first matching string, provided there are at least two slots in the
offsets vector.
- Because pcre_dfa_exec() always searches for all possible matches, and
- there is no difference between greedy and ungreedy repetition, its be-
+ Because pcre_dfa_exec() always searches for all possible matches, and
+ there is no difference between greedy and ungreedy repetition, its be-
haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con-
- sider the string "dog" matched against the ungreedy pattern shown
+ sider the string "dog" matched against the ungreedy pattern shown
above:
/dog(sbody)??/
- Whereas pcre_exec() stops as soon as it finds the complete match for
+ Whereas pcre_exec() stops as soon as it finds the complete match for
"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.
PARTIAL MATCHING AND WORD BOUNDARIES
- If a pattern ends with one of sequences \b or \B, which test for word
- boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-
+ If a pattern ends with one of sequences \b or \B, which test for word
+ boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-
intuitive results. Consider this pattern:
/\bcat\b/
This matches "cat", provided there is a word boundary at either end. If
the subject string is "the cat", the comparison of the final "t" with a
- following character cannot take place, so a partial match is found.
- However, pcre_exec() carries on with normal matching, which matches \b
- at the end of the subject when the last character is a letter, thus
+ following character cannot take place, so a partial match is found.
+ However, pcre_exec() carries on with normal matching, which matches \b
+ at the end of the subject when the last character is a letter, thus
finding a complete match. The result, therefore, is not PCRE_ERROR_PAR-
- TIAL. The same thing happens with pcre_dfa_exec(), because it also
+ TIAL. The same thing happens with pcre_dfa_exec(), because it also
finds the complete match.
- Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
+ Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL,
because then the partial match takes precedence.
FORMERLY RESTRICTED PATTERNS
For releases of PCRE prior to 8.00, because of the way certain internal
- optimizations were implemented in the pcre_exec() function, the
- PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
- used with all patterns. From release 8.00 onwards, the restrictions no
- longer apply, and partial matching with pcre_exec() can be requested
+ optimizations were implemented in the pcre_exec() function, the
+ PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be
+ used with all patterns. From release 8.00 onwards, the restrictions no
+ longer apply, and partial matching with pcre_exec() can be requested
for any pattern.
Items that were formerly restricted were repeated single characters and
- repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
- not conform to the restrictions, pcre_exec() returned the error code
- PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
- PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
+ repeated metasequences. If PCRE_PARTIAL was set for a pattern that did
+ not conform to the restrictions, pcre_exec() returned the error code
+ PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
+ PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if a compiled
pattern can be used for partial matching now always returns 1.
EXAMPLE OF PARTIAL MATCHING USING PCRETEST
- If the escape sequence \P is present in a pcretest data line, the
- PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
+ If the escape sequence \P is present in a pcretest data line, the
+ PCRE_PARTIAL_SOFT option is used for the match. Here is a run of
pcretest that uses the date example quoted above:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -6118,24 +6305,24 @@ EXAMPLE OF PARTIAL MATCHING USING PCRETEST
data> j\P
No match
- The first data string is matched completely, so pcretest shows the
- matched substrings. The remaining four strings do not match the com-
+ The first data string is matched completely, so pcretest shows the
+ matched substrings. The remaining four strings do not match the com-
plete pattern, but the first two are partial matches. Similar output is
obtained when pcre_dfa_exec() is used.
- If the escape sequence \P is present more than once in a pcretest data
+ If the escape sequence \P is present more than once in a pcretest data
line, the PCRE_PARTIAL_HARD option is set for the match.
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
When a partial match has been found using pcre_dfa_exec(), it is possi-
- ble to continue the match by providing additional subject data and
- calling pcre_dfa_exec() again with the same compiled regular expres-
- sion, this time setting the PCRE_DFA_RESTART option. You must pass the
+ ble to continue the match by providing additional subject data and
+ calling pcre_dfa_exec() again with the same compiled regular expres-
+ sion, this time setting the PCRE_DFA_RESTART option. You must pass the
same working space as before, because this is where details of the pre-
- vious partial match are stored. Here is an example using pcretest,
- using the \R escape sequence to set the PCRE_DFA_RESTART option (\D
+ vious partial match are stored. Here is an example using pcretest,
+ using the \R escape sequence to set the PCRE_DFA_RESTART option (\D
specifies the use of pcre_dfa_exec()):
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
@@ -6144,30 +6331,33 @@ MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
data> n05\R\D
0: n05
- The first call has "23ja" as the subject, and requests partial match-
- ing; the second call has "n05" as the subject for the continued
- (restarted) match. Notice that when the match is complete, only the
- last part is shown; PCRE does not retain the previously partially-
- matched string. It is up to the calling program to do that if it needs
+ The first call has "23ja" as the subject, and requests partial match-
+ ing; the second call has "n05" as the subject for the continued
+ (restarted) match. Notice that when the match is complete, only the
+ last part is shown; PCRE does not retain the previously partially-
+ matched string. It is up to the calling program to do that if it needs
to.
- You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
- PCRE_DFA_RESTART to continue partial matching over multiple segments.
- This facility can be used to pass very long subject strings to
+ You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
+ PCRE_DFA_RESTART to continue partial matching over multiple segments.
+ This facility can be used to pass very long subject strings to
pcre_dfa_exec().
MULTI-SEGMENT MATCHING WITH pcre_exec()
- From release 8.00, pcre_exec() can also be used to do multi-segment
- matching. Unlike pcre_dfa_exec(), it is not possible to restart the
- previous match with a new segment of data. Instead, new data must be
- added to the previous subject string, and the entire match re-run,
- starting from the point where the partial match occurred. Earlier data
- can be discarded. Consider an unanchored pattern that matches dates:
+ From release 8.00, pcre_exec() can also be used to do multi-segment
+ matching. Unlike pcre_dfa_exec(), it is not possible to restart the
+ previous match with a new segment of data. Instead, new data must be
+ added to the previous subject string, and the entire match re-run,
+ starting from the point where the partial match occurred. Earlier data
+ can be discarded. It is best to use PCRE_PARTIAL_HARD in this situa-
+ tion, because it does not treat the end of a segment as the end of the
+ subject when matching \z, \Z, \b, \B, and $. Consider an unanchored
+ pattern that matches dates:
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
- data> The date is 23ja\P
+ data> The date is 23ja\P\P
Partial match: 23ja
At this stage, an application could discard the text preceding "23ja",
@@ -6188,29 +6378,30 @@ ISSUES WITH MULTI-SEGMENT MATCHING
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
- 1. If the pattern contains tests for the beginning or end of a line,
- you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
- ate, when the subject string for any call does not contain the begin-
- ning or end of a line.
-
- 2. Lookbehind assertions at the start of a pattern are catered for in
- the offsets that are returned for a partial match. However, in theory,
- a lookbehind assertion later in the pattern could require even earlier
- characters to be inspected, and it might not have been reached when a
- partial match occurs. This is probably an extremely unlikely case; you
- could guard against it to a certain extent by always including extra
+ 1. If the pattern contains a test for the beginning of a line, you need
+ to pass the PCRE_NOTBOL option when the subject string for any call
+ does start at the beginning of a line. There is also a PCRE_NOTEOL
+ option, but in practice when doing multi-segment matching you should be
+ using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.
+
+ 2. Lookbehind assertions at the start of a pattern are catered for in
+ the offsets that are returned for a partial match. However, in theory,
+ a lookbehind assertion later in the pattern could require even earlier
+ characters to be inspected, and it might not have been reached when a
+ partial match occurs. This is probably an extremely unlikely case; you
+ could guard against it to a certain extent by always including extra
characters at the start.
- 3. Matching a subject string that is split into multiple segments may
- not always produce exactly the same result as matching over one single
- long string, especially when PCRE_PARTIAL_SOFT is used. The section
- "Partial Matching and Word Boundaries" above describes an issue that
- arises if the pattern ends with \b or \B. Another kind of difference
- may occur when there are multiple matching possibilities, because a
- partial match result is given only when there are no completed matches.
- This means that as soon as the shortest match has been found, continua-
- tion to a new subject segment is no longer possible. Consider again
- this pcretest example:
+ 3. Matching a subject string that is split into multiple segments may
+ not always produce exactly the same result as matching over one single
+ long string, especially when PCRE_PARTIAL_SOFT is used. The section
+ "Partial Matching and Word Boundaries" above describes an issue that
+ arises if the pattern ends with \b or \B. Another kind of difference
+ may occur when there are multiple matching possibilities, because (for
+ PCRE_PARTIAL_SOFT) a partial match result is given only when there are
+ no completed matches. This means that as soon as the shortest match has
+ been found, continuation to a new subject segment is no longer possi-
+ ble. Consider again this pcretest example:
re> /dog(sbody)?/
data> dogsb\P
@@ -6223,18 +6414,18 @@ ISSUES WITH MULTI-SEGMENT MATCHING
0: dogsbody
1: dog
- The first data line passes the string "dogsb" to pcre_exec(), setting
- the PCRE_PARTIAL_SOFT option. Although the string is a partial match
- for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
- shorter string "dog" is a complete match. Similarly, when the subject
- is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
+ The first data line passes the string "dogsb" to pcre_exec(), setting
+ the PCRE_PARTIAL_SOFT option. Although the string is a partial match
+ for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the
+ shorter string "dog" is a complete match. Similarly, when the subject
+ is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being
the first two) the match stops when "dog" has been found, and it is not
- possible to continue. On the other hand, if "dogsbody" is presented as
+ possible to continue. On the other hand, if "dogsbody" is presented as
a single string, pcre_dfa_exec() finds both matches.
- Because of these problems, it is probably best to use PCRE_PARTIAL_HARD
- when matching multi-segment data. The example above then behaves dif-
- ferently:
+ Because of these problems, it is best to use PCRE_PARTIAL_HARD when
+ matching multi-segment data. The example above then behaves differ-
+ ently:
re> /dog(sbody)?/
data> dogsb\P\P
@@ -6244,41 +6435,40 @@ ISSUES WITH MULTI-SEGMENT MATCHING
data> gsb\R\P\P\D
Partial match: gsb
-
4. Patterns that contain alternatives at the top level which do not all
- start with the same pattern item may not work as expected when
- PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider
+ start with the same pattern item may not work as expected when
+ PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider
this pattern:
1234|3789
- If the first part of the subject is "ABC123", a partial match of the
- first alternative is found at offset 3. There is no partial match for
+ If the first part of the subject is "ABC123", a partial match of the
+ first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
- point in the subject string. Attempting to continue with the string
- "7890" does not yield a match because only those alternatives that
- match at one point in the subject are remembered. The problem arises
- because the start of the second alternative matches within the first
- alternative. There is no problem with anchored patterns or patterns
+ point in the subject string. Attempting to continue with the string
+ "7890" does not yield a match because only those alternatives that
+ match at one point in the subject are remembered. The problem arises
+ because the start of the second alternative matches within the first
+ alternative. There is no problem with anchored patterns or patterns
such as:
1234|ABCD
- where no string can be a partial match for both alternatives. This is
- not a problem if pcre_exec() is used, because the entire match has to
+ where no string can be a partial match for both alternatives. This is
+ not a problem if pcre_exec() is used, because the entire match has to
be rerun each time:
re> /1234|3789/
- data> ABC123\P
+ data> ABC123\P\P
Partial match: 123
data> 1237890
0: 3789
- Of course, instead of using PCRE_DFA_PARTIAL, the same technique of re-
+ Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
running the entire match can also be used with pcre_dfa_exec(). Another
possibility is to work with two buffers. If a partial match at offset n
- in the first buffer is followed by "no match" when PCRE_DFA_RESTART is
- used on the second buffer, you can then try a new match starting at
+ in the first buffer is followed by "no match" when PCRE_DFA_RESTART is
+ used on the second buffer, you can then try a new match starting at
offset n+1 in the first buffer.
@@ -6291,8 +6481,8 @@ AUTHOR
REVISION
- Last updated: 19 October 2009
- Copyright (c) 1997-2009 University of Cambridge.
+ Last updated: 07 November 2010
+ Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -6403,7 +6593,7 @@ COMPATIBILITY WITH DIFFERENT PCRE RELEASES
In general, it is safest to recompile all saved patterns when you
update to a new PCRE release, though not all updates actually require
- this. Recompiling is definitely needed for release 7.2.
+ this.
AUTHOR
@@ -6415,8 +6605,8 @@ AUTHOR
REVISION
- Last updated: 13 June 2007
- Copyright (c) 1997-2007 University of Cambridge.
+ Last updated: 17 November 2010
+ Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
@@ -7250,7 +7440,7 @@ PCRE SAMPLE PROGRAM
expressions and the PCRE library. The pcredemo program is provided as a
simple coding example.
- When you try to run pcredemo when PCRE is not installed in the standard
+ If you try to run pcredemo when PCRE is not installed in the standard
library directory, you may get an error like this on some operating
systems (e.g. Solaris):
@@ -7274,7 +7464,7 @@ AUTHOR
REVISION
- Last updated: 26 May 2010
+ Last updated: 17 November 2010
Copyright (c) 1997-2010 University of Cambridge.
------------------------------------------------------------------------------
PCRESTACK(3) PCRESTACK(3)
diff --git a/ext/pcre/pcrelib/pcre.h b/ext/pcre/pcrelib/pcre.h
index febb617422..ec454ee60d 100644
--- a/ext/pcre/pcrelib/pcre.h
+++ b/ext/pcre/pcrelib/pcre.h
@@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
/* The current PCRE version information. */
#define PCRE_MAJOR 8
-#define PCRE_MINOR 10
+#define PCRE_MINOR 11
#define PCRE_PRERELEASE
-#define PCRE_DATE 2010-06-25
+#define PCRE_DATE 2010-12-10
/* When an application links to a PCRE DLL in Windows, the symbols that are
imported have to be identified as such. When building PCRE, the appropriate
@@ -96,42 +96,44 @@ extern "C" {
#endif
/* Options. Some are compile-time only, some are run-time only, and some are
-both, so we keep them all distinct. */
-
-#define PCRE_CASELESS 0x00000001
-#define PCRE_MULTILINE 0x00000002
-#define PCRE_DOTALL 0x00000004
-#define PCRE_EXTENDED 0x00000008
-#define PCRE_ANCHORED 0x00000010
-#define PCRE_DOLLAR_ENDONLY 0x00000020
-#define PCRE_EXTRA 0x00000040
-#define PCRE_NOTBOL 0x00000080
-#define PCRE_NOTEOL 0x00000100
-#define PCRE_UNGREEDY 0x00000200
-#define PCRE_NOTEMPTY 0x00000400
-#define PCRE_UTF8 0x00000800
-#define PCRE_NO_AUTO_CAPTURE 0x00001000
-#define PCRE_NO_UTF8_CHECK 0x00002000
-#define PCRE_AUTO_CALLOUT 0x00004000
-#define PCRE_PARTIAL_SOFT 0x00008000
+both, so we keep them all distinct. However, almost all the bits in the options
+word are now used. In the long run, we may have to re-use some of the
+compile-time only bits for runtime options, or vice versa. */
+
+#define PCRE_CASELESS 0x00000001 /* Compile */
+#define PCRE_MULTILINE 0x00000002 /* Compile */
+#define PCRE_DOTALL 0x00000004 /* Compile */
+#define PCRE_EXTENDED 0x00000008 /* Compile */
+#define PCRE_ANCHORED 0x00000010 /* Compile, exec, DFA exec */
+#define PCRE_DOLLAR_ENDONLY 0x00000020 /* Compile */
+#define PCRE_EXTRA 0x00000040 /* Compile */
+#define PCRE_NOTBOL 0x00000080 /* Exec, DFA exec */
+#define PCRE_NOTEOL 0x00000100 /* Exec, DFA exec */
+#define PCRE_UNGREEDY 0x00000200 /* Compile */
+#define PCRE_NOTEMPTY 0x00000400 /* Exec, DFA exec */
+#define PCRE_UTF8 0x00000800 /* Compile */
+#define PCRE_NO_AUTO_CAPTURE 0x00001000 /* Compile */
+#define PCRE_NO_UTF8_CHECK 0x00002000 /* Compile, exec, DFA exec */
+#define PCRE_AUTO_CALLOUT 0x00004000 /* Compile */
+#define PCRE_PARTIAL_SOFT 0x00008000 /* Exec, DFA exec */
#define PCRE_PARTIAL 0x00008000 /* Backwards compatible synonym */
-#define PCRE_DFA_SHORTEST 0x00010000
-#define PCRE_DFA_RESTART 0x00020000
-#define PCRE_FIRSTLINE 0x00040000
-#define PCRE_DUPNAMES 0x00080000
-#define PCRE_NEWLINE_CR 0x00100000
-#define PCRE_NEWLINE_LF 0x00200000
-#define PCRE_NEWLINE_CRLF 0x00300000
-#define PCRE_NEWLINE_ANY 0x00400000
-#define PCRE_NEWLINE_ANYCRLF 0x00500000
-#define PCRE_BSR_ANYCRLF 0x00800000
-#define PCRE_BSR_UNICODE 0x01000000
-#define PCRE_JAVASCRIPT_COMPAT 0x02000000
-#define PCRE_NO_START_OPTIMIZE 0x04000000
-#define PCRE_NO_START_OPTIMISE 0x04000000
-#define PCRE_PARTIAL_HARD 0x08000000
-#define PCRE_NOTEMPTY_ATSTART 0x10000000
-#define PCRE_UCP 0x20000000
+#define PCRE_DFA_SHORTEST 0x00010000 /* DFA exec */
+#define PCRE_DFA_RESTART 0x00020000 /* DFA exec */
+#define PCRE_FIRSTLINE 0x00040000 /* Compile */
+#define PCRE_DUPNAMES 0x00080000 /* Compile */
+#define PCRE_NEWLINE_CR 0x00100000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_LF 0x00200000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_CRLF 0x00300000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANY 0x00400000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANYCRLF 0x00500000 /* Compile, exec, DFA exec */
+#define PCRE_BSR_ANYCRLF 0x00800000 /* Compile, exec, DFA exec */
+#define PCRE_BSR_UNICODE 0x01000000 /* Compile, exec, DFA exec */
+#define PCRE_JAVASCRIPT_COMPAT 0x02000000 /* Compile */
+#define PCRE_NO_START_OPTIMIZE 0x04000000 /* Compile, exec, DFA exec */
+#define PCRE_NO_START_OPTIMISE 0x04000000 /* Synonym */
+#define PCRE_PARTIAL_HARD 0x08000000 /* Exec, DFA exec */
+#define PCRE_NOTEMPTY_ATSTART 0x10000000 /* Exec, DFA exec */
+#define PCRE_UCP 0x20000000 /* Compile */
/* Exec-time and get/set-time error codes */
@@ -159,6 +161,8 @@ both, so we keep them all distinct. */
#define PCRE_ERROR_RECURSIONLIMIT (-21)
#define PCRE_ERROR_NULLWSLIMIT (-22) /* No longer actually used */
#define PCRE_ERROR_BADNEWLINE (-23)
+#define PCRE_ERROR_BADOFFSET (-24)
+#define PCRE_ERROR_SHORTUTF8 (-25)
/* Request types for pcre_fullinfo() */
diff --git a/ext/pcre/pcrelib/pcre_compile.c b/ext/pcre/pcrelib/pcre_compile.c
index 53027e603d..b0d81ac94c 100644
--- a/ext/pcre/pcrelib/pcre_compile.c
+++ b/ext/pcre/pcrelib/pcre_compile.c
@@ -406,6 +406,7 @@ static const char error_texts[] =
"different names for subpatterns of the same number are not allowed\0"
"(*MARK) must have an argument\0"
"this version of PCRE is not compiled with PCRE_UCP support\0"
+ "\\c must be followed by an ASCII character\0"
;
/* Table to identify digits and hex digits. This is used when compiling
@@ -839,7 +840,8 @@ else
break;
/* For \c, a following letter is upper-cased; then the 0x40 bit is flipped.
- This coding is ASCII-specific, but then the whole concept of \cx is
+ An error is given if the byte following \c is not an ASCII character. This
+ coding is ASCII-specific, but then the whole concept of \cx is
ASCII-specific. (However, an EBCDIC equivalent has now been added.) */
case CHAR_c:
@@ -849,11 +851,15 @@ else
*errorcodeptr = ERR2;
break;
}
-
-#ifndef EBCDIC /* ASCII/UTF-8 coding */
+#ifndef EBCDIC /* ASCII/UTF-8 coding */
+ if (c > 127) /* Excludes all non-ASCII in either mode */
+ {
+ *errorcodeptr = ERR68;
+ break;
+ }
if (c >= CHAR_a && c <= CHAR_z) c -= 32;
c ^= 0x40;
-#else /* EBCDIC coding */
+#else /* EBCDIC coding */
if (c >= CHAR_a && c <= CHAR_z) c += 64;
c ^= 0xC0;
#endif
@@ -1097,10 +1103,21 @@ top-level call starts at the beginning of the pattern. All other calls must
start at a parenthesis. It scans along a pattern's text looking for capturing
subpatterns, and counting them. If it finds a named pattern that matches the
name it is given, it returns its number. Alternatively, if the name is NULL, it
-returns when it reaches a given numbered subpattern. We know that if (?P< is
-encountered, the name will be terminated by '>' because that is checked in the
-first pass. Recursion is used to keep track of subpatterns that reset the
-capturing group numbers - the (?| feature.
+returns when it reaches a given numbered subpattern. Recursion is used to keep
+track of subpatterns that reset the capturing group numbers - the (?| feature.
+
+This function was originally called only from the second pass, in which we know
+that if (?< or (?' or (?P< is encountered, the name will be correctly
+terminated because that is checked in the first pass. There is now one call to
+this function in the first pass, to check for a recursive back reference by
+name (so that we can make the whole group atomic). In this case, we need check
+only up to the current position in the pattern, and that is still OK because
+and previous occurrences will have been checked. To make this work, the test
+for "end of pattern" is a check against cd->end_pattern in the main loop,
+instead of looking for a binary zero. This means that the special first-pass
+call can adjust cd->end_pattern temporarily. (Checks for binary zero while
+processing items within the loop are OK, because afterwards the main loop will
+terminate.)
Arguments:
ptrptr address of the current character pointer (updated)
@@ -1108,6 +1125,7 @@ Arguments:
name name to seek, or NULL if seeking a numbered subpattern
lorn name length, or subpattern number if name is NULL
xmode TRUE if we are in /x mode
+ utf8 TRUE if we are in UTF-8 mode
count pointer to the current capturing subpattern number (updated)
Returns: the number of the named subpattern, or -1 if not found
@@ -1115,7 +1133,7 @@ Returns: the number of the named subpattern, or -1 if not found
static int
find_parens_sub(uschar **ptrptr, compile_data *cd, const uschar *name, int lorn,
- BOOL xmode, int *count)
+ BOOL xmode, BOOL utf8, int *count)
{
uschar *ptr = *ptrptr;
int start_count = *count;
@@ -1200,9 +1218,11 @@ if (ptr[0] == CHAR_LEFT_PARENTHESIS)
}
/* Past any initial parenthesis handling, scan for parentheses or vertical
-bars. */
+bars. Stop if we get to cd->end_pattern. Note that this is important for the
+first-pass call when this value is temporarily adjusted to stop at the current
+position. So DO NOT change this to a test for binary zero. */
-for (; *ptr != 0; ptr++)
+for (; ptr < cd->end_pattern; ptr++)
{
/* Skip over backslashed characters and also entire \Q...\E */
@@ -1276,7 +1296,15 @@ for (; *ptr != 0; ptr++)
if (xmode && *ptr == CHAR_NUMBER_SIGN)
{
- while (*(++ptr) != 0 && *ptr != CHAR_NL) {};
+ ptr++;
+ while (*ptr != 0)
+ {
+ if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
+ ptr++;
+#ifdef SUPPORT_UTF8
+ if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
+#endif
+ }
if (*ptr == 0) goto FAIL_EXIT;
continue;
}
@@ -1285,7 +1313,7 @@ for (; *ptr != 0; ptr++)
if (*ptr == CHAR_LEFT_PARENTHESIS)
{
- int rc = find_parens_sub(&ptr, cd, name, lorn, xmode, count);
+ int rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf8, count);
if (rc > 0) return rc;
if (*ptr == 0) goto FAIL_EXIT;
}
@@ -1331,12 +1359,14 @@ Arguments:
name name to seek, or NULL if seeking a numbered subpattern
lorn name length, or subpattern number if name is NULL
xmode TRUE if we are in /x mode
+ utf8 TRUE if we are in UTF-8 mode
Returns: the number of the found subpattern, or -1 if not found
*/
static int
-find_parens(compile_data *cd, const uschar *name, int lorn, BOOL xmode)
+find_parens(compile_data *cd, const uschar *name, int lorn, BOOL xmode,
+ BOOL utf8)
{
uschar *ptr = (uschar *)cd->start_pattern;
int count = 0;
@@ -1349,7 +1379,7 @@ matching closing parens. That is why we have to have a loop. */
for (;;)
{
- rc = find_parens_sub(&ptr, cd, name, lorn, xmode, &count);
+ rc = find_parens_sub(&ptr, cd, name, lorn, xmode, utf8, &count);
if (rc > 0 || *ptr++ == 0) break;
}
@@ -1722,9 +1752,12 @@ for (;;)
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
code += code[1];
break;
+
+ case OP_THEN_ARG:
+ code += code[1+LINK_SIZE];
+ break;
}
/* Add in the fixed length from the table */
@@ -1825,9 +1858,12 @@ for (;;)
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
code += code[1];
break;
+
+ case OP_THEN_ARG:
+ code += code[1+LINK_SIZE];
+ break;
}
/* Add in the fixed length from the table */
@@ -2103,10 +2139,13 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
code += code[1];
break;
+ case OP_THEN_ARG:
+ code += code[1+LINK_SIZE];
+ break;
+
/* None of the remaining opcodes are required to match a character. */
default:
@@ -2504,8 +2543,15 @@ if ((options & PCRE_EXTENDED) != 0)
while ((cd->ctypes[*ptr] & ctype_space) != 0) ptr++;
if (*ptr == CHAR_NUMBER_SIGN)
{
- while (*(++ptr) != 0)
+ ptr++;
+ while (*ptr != 0)
+ {
if (IS_NEWLINE(ptr)) { ptr += cd->nllen; break; }
+ ptr++;
+#ifdef SUPPORT_UTF8
+ if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
+#endif
+ }
}
else break;
}
@@ -2541,8 +2587,15 @@ if ((options & PCRE_EXTENDED) != 0)
while ((cd->ctypes[*ptr] & ctype_space) != 0) ptr++;
if (*ptr == CHAR_NUMBER_SIGN)
{
- while (*(++ptr) != 0)
+ ptr++;
+ while (*ptr != 0)
+ {
if (IS_NEWLINE(ptr)) { ptr += cd->nllen; break; }
+ ptr++;
+#ifdef SUPPORT_UTF8
+ if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
+#endif
+ }
}
else break;
}
@@ -3115,9 +3168,14 @@ for (;; ptr++)
if ((cd->ctypes[c] & ctype_space) != 0) continue;
if (c == CHAR_NUMBER_SIGN)
{
- while (*(++ptr) != 0)
+ ptr++;
+ while (*ptr != 0)
{
if (IS_NEWLINE(ptr)) { ptr += cd->nllen - 1; break; }
+ ptr++;
+#ifdef SUPPORT_UTF8
+ if (utf8) while ((*ptr & 0xc0) == 0x80) ptr++;
+#endif
}
if (*ptr != 0) continue;
@@ -3492,9 +3550,14 @@ for (;; ptr++)
for (c = 0; c < 32; c++) classbits[c] |= ~cbits[c+cbit_word];
continue;
+ /* Perl 5.004 onwards omits VT from \s, but we must preserve it
+ if it was previously set by something earlier in the character
+ class. */
+
case ESC_s:
- for (c = 0; c < 32; c++) classbits[c] |= cbits[c+cbit_space];
- classbits[1] &= ~0x08; /* Perl 5.004 onwards omits VT from \s */
+ classbits[0] |= cbits[cbit_space];
+ classbits[1] |= cbits[cbit_space+1] & ~0x08;
+ for (c = 2; c < 32; c++) classbits[c] |= cbits[c+cbit_space];
continue;
case ESC_S:
@@ -4806,7 +4869,12 @@ for (;; ptr++)
*errorcodeptr = ERR66;
goto FAILED;
}
- *code++ = verbs[i].op;
+ *code = verbs[i].op;
+ if (*code++ == OP_THEN)
+ {
+ PUT(code, 0, code - bcptr->current_branch - 1);
+ code += LINK_SIZE;
+ }
}
else
@@ -4816,7 +4884,12 @@ for (;; ptr++)
*errorcodeptr = ERR59;
goto FAILED;
}
- *code++ = verbs[i].op_arg;
+ *code = verbs[i].op_arg;
+ if (*code++ == OP_THEN_ARG)
+ {
+ PUT(code, 0, code - bcptr->current_branch - 1);
+ code += LINK_SIZE;
+ }
*code++ = arglen;
memcpy(code, arg, arglen);
code += arglen;
@@ -5010,7 +5083,7 @@ for (;; ptr++)
/* Search the pattern for a forward reference */
else if ((i = find_parens(cd, name, namelen,
- (options & PCRE_EXTENDED) != 0)) > 0)
+ (options & PCRE_EXTENDED) != 0, utf8)) > 0)
{
PUT2(code, 2+LINK_SIZE, i);
code[1+LINK_SIZE]++;
@@ -5311,11 +5384,17 @@ for (;; ptr++)
while ((cd->ctypes[*ptr] & ctype_word) != 0) ptr++;
namelen = (int)(ptr - name);
- /* In the pre-compile phase, do a syntax check and set a dummy
- reference number. */
+ /* In the pre-compile phase, do a syntax check. We used to just set
+ a dummy reference number, because it was not used in the first pass.
+ However, with the change of recursive back references to be atomic,
+ we have to look for the number so that this state can be identified, as
+ otherwise the incorrect length is computed. If it's not a backwards
+ reference, the dummy number will do. */
if (lengthptr != NULL)
{
+ const uschar *temp;
+
if (namelen == 0)
{
*errorcodeptr = ERR62;
@@ -5331,7 +5410,22 @@ for (;; ptr++)
*errorcodeptr = ERR48;
goto FAILED;
}
- recno = 0;
+
+ /* The name table does not exist in the first pass, so we cannot
+ do a simple search as in the code below. Instead, we have to scan the
+ pattern to find the number. It is important that we scan it only as
+ far as we have got because the syntax of named subpatterns has not
+ been checked for the rest of the pattern, and find_parens() assumes
+ correct syntax. In any case, it's a waste of resources to scan
+ further. We stop the scan at the current point by temporarily
+ adjusting the value of cd->endpattern. */
+
+ temp = cd->end_pattern;
+ cd->end_pattern = ptr;
+ recno = find_parens(cd, name, namelen,
+ (options & PCRE_EXTENDED) != 0, utf8);
+ cd->end_pattern = temp;
+ if (recno < 0) recno = 0; /* Forward ref; set dummy number */
}
/* In the real compile, seek the name in the table. We check the name
@@ -5356,7 +5450,7 @@ for (;; ptr++)
}
else if ((recno = /* Forward back reference */
find_parens(cd, name, namelen,
- (options & PCRE_EXTENDED) != 0)) <= 0)
+ (options & PCRE_EXTENDED) != 0, utf8)) <= 0)
{
*errorcodeptr = ERR15;
goto FAILED;
@@ -5467,7 +5561,7 @@ for (;; ptr++)
if (called == NULL)
{
if (find_parens(cd, NULL, recno,
- (options & PCRE_EXTENDED) != 0) < 0)
+ (options & PCRE_EXTENDED) != 0, utf8) < 0)
{
*errorcodeptr = ERR15;
goto FAILED;
@@ -6797,6 +6891,8 @@ while (ptr[skipatstart] == CHAR_LEFT_PARENTHESIS &&
{ skipatstart += 7; options |= PCRE_UTF8; continue; }
else if (strncmp((char *)(ptr+skipatstart+2), STRING_UCP_RIGHTPAR, 4) == 0)
{ skipatstart += 6; options |= PCRE_UCP; continue; }
+ else if (strncmp((char *)(ptr+skipatstart+2), STRING_NO_START_OPT_RIGHTPAR, 13) == 0)
+ { skipatstart += 15; options |= PCRE_NO_START_OPTIMIZE; continue; }
if (strncmp((char *)(ptr+skipatstart+2), STRING_CR_RIGHTPAR, 3) == 0)
{ skipatstart += 5; newnl = PCRE_NEWLINE_CR; }
diff --git a/ext/pcre/pcrelib/pcre_exec.c b/ext/pcre/pcrelib/pcre_exec.c
index 55883548c3..a007786dea 100644
--- a/ext/pcre/pcrelib/pcre_exec.c
+++ b/ext/pcre/pcrelib/pcre_exec.c
@@ -292,7 +292,7 @@ argument of match(), which never changes. */
#define RMATCH(ra,rb,rc,rd,re,rf,rg,rw)\
{\
- heapframe *newframe = (pcre_stack_malloc)(sizeof(heapframe));\
+ heapframe *newframe = (heapframe *)(pcre_stack_malloc)(sizeof(heapframe));\
if (newframe == NULL) RRETURN(PCRE_ERROR_NOMEMORY);\
frame->Xwhere = rw; \
newframe->Xeptr = ra;\
@@ -420,17 +420,18 @@ immediately. The second one is used when we already know we are past the end of
the subject. */
#define CHECK_PARTIAL()\
- if (md->partial != 0 && eptr >= md->end_subject && eptr > mstart)\
- {\
- md->hitend = TRUE;\
- if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+ if (md->partial != 0 && eptr >= md->end_subject && \
+ eptr > md->start_used_ptr) \
+ { \
+ md->hitend = TRUE; \
+ if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
}
#define SCHECK_PARTIAL()\
- if (md->partial != 0 && eptr > mstart)\
- {\
- md->hitend = TRUE;\
- if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+ if (md->partial != 0 && eptr > md->start_used_ptr) \
+ { \
+ md->hitend = TRUE; \
+ if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
}
@@ -486,7 +487,7 @@ heap storage. Set up the top-level frame here; others are obtained from the
heap whenever RMATCH() does a "recursion". See the macro definitions above. */
#ifdef NO_RECURSE
-heapframe *frame = (pcre_stack_malloc)(sizeof(heapframe));
+heapframe *frame = (heapframe *)(pcre_stack_malloc)(sizeof(heapframe));
if (frame == NULL) RRETURN(PCRE_ERROR_NOMEMORY);
frame->Xprevframe = NULL; /* Marks the top level */
@@ -708,36 +709,47 @@ for (;;)
case OP_FAIL:
MRRETURN(MATCH_NOMATCH);
+ /* COMMIT overrides PRUNE, SKIP, and THEN */
+
case OP_COMMIT:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM52);
- if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE &&
+ rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
+ rrc != MATCH_THEN)
+ RRETURN(rrc);
MRRETURN(MATCH_COMMIT);
+ /* PRUNE overrides THEN */
+
case OP_PRUNE:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM51);
- if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
MRRETURN(MATCH_PRUNE);
case OP_PRUNE_ARG:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
ims, eptrb, flags, RM56);
- if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
md->mark = ecode + 2;
RRETURN(MATCH_PRUNE);
+ /* SKIP overrides PRUNE and THEN */
+
case OP_SKIP:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM53);
- if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
+ RRETURN(rrc);
md->start_match_ptr = eptr; /* Pass back current position */
MRRETURN(MATCH_SKIP);
case OP_SKIP_ARG:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
ims, eptrb, flags, RM57);
- if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
+ RRETURN(rrc);
/* Pass back the current skip name by overloading md->start_match_ptr and
returning the special MATCH_SKIP_ARG return code. This will either be
@@ -747,17 +759,24 @@ for (;;)
md->start_match_ptr = ecode + 2;
RRETURN(MATCH_SKIP_ARG);
+ /* For THEN (and THEN_ARG) we pass back the address of the bracket or
+ the alt that is at the start of the current branch. This makes it possible
+ to skip back past alternatives that precede the THEN within the current
+ branch. */
+
case OP_THEN:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM54);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+ md->start_match_ptr = ecode - GET(ecode, 1);
MRRETURN(MATCH_THEN);
case OP_THEN_ARG:
- RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
- ims, eptrb, flags, RM58);
+ RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
+ offset_top, md, ims, eptrb, flags, RM58);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
- md->mark = ecode + 2;
+ md->start_match_ptr = ecode - GET(ecode, 1);
+ md->mark = ecode + LINK_SIZE + 2;
RRETURN(MATCH_THEN);
/* Handle a capturing bracket. If there is space in the offset vector, save
@@ -802,7 +821,9 @@ for (;;)
{
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM1);
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
md->capture_last = save_capture_last;
ecode += GET(ecode, 1);
}
@@ -863,7 +884,9 @@ for (;;)
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, ims,
eptrb, flags, RM2);
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
ecode += GET(ecode, 1);
}
/* Control never reaches here. */
@@ -1064,7 +1087,8 @@ for (;;)
ecode += 1 + LINK_SIZE + GET(ecode, LINK_SIZE + 2);
while (*ecode == OP_ALT) ecode += GET(ecode, 1);
}
- else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+ else if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
{
RRETURN(rrc); /* Need braces because of following else */
}
@@ -1192,7 +1216,9 @@ for (;;)
mstart = md->start_match_ptr; /* In case \K reset it */
break;
}
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
ecode += GET(ecode, 1);
}
while (*ecode == OP_ALT);
@@ -1226,7 +1252,9 @@ for (;;)
do ecode += GET(ecode,1); while (*ecode == OP_ALT);
break;
}
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
ecode += GET(ecode,1);
}
while (*ecode == OP_ALT);
@@ -1363,7 +1391,8 @@ for (;;)
(pcre_free)(new_recursive.offset_save);
MRRETURN(MATCH_MATCH);
}
- else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN)
+ else if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
{
DPRINTF(("Recursion gave error %d\n", rrc));
if (new_recursive.offset_save != stacksave)
@@ -1406,7 +1435,9 @@ for (;;)
mstart = md->start_match_ptr;
break;
}
- if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
+ if (rrc != MATCH_NOMATCH &&
+ (rrc != MATCH_THEN || md->start_match_ptr != ecode))
+ RRETURN(rrc);
ecode += GET(ecode,1);
}
while (*ecode == OP_ALT);
@@ -1672,37 +1703,40 @@ for (;;)
if (eptr < md->end_subject)
{ if (!IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH); }
else
- { if (md->noteol) MRRETURN(MATCH_NOMATCH); }
+ {
+ if (md->noteol) MRRETURN(MATCH_NOMATCH);
+ SCHECK_PARTIAL();
+ }
ecode++;
break;
}
- else
+ else /* Not multiline */
{
if (md->noteol) MRRETURN(MATCH_NOMATCH);
- if (!md->endonly)
- {
- if (eptr != md->end_subject &&
- (!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
- MRRETURN(MATCH_NOMATCH);
- ecode++;
- break;
- }
+ if (!md->endonly) goto ASSERT_NL_OR_EOS;
}
+
/* ... else fall through for endonly */
/* End of subject assertion (\z) */
case OP_EOD:
if (eptr < md->end_subject) MRRETURN(MATCH_NOMATCH);
+ SCHECK_PARTIAL();
ecode++;
break;
/* End of subject or ending \n assertion (\Z) */
case OP_EODN:
- if (eptr != md->end_subject &&
+ ASSERT_NL_OR_EOS:
+ if (eptr < md->end_subject &&
(!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
MRRETURN(MATCH_NOMATCH);
+
+ /* Either at end of string or \n before end. */
+
+ SCHECK_PARTIAL();
ecode++;
break;
@@ -5598,6 +5632,7 @@ if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
if (re == NULL || subject == NULL ||
(offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
if (offsetcount < 0) return PCRE_ERROR_BADCOUNT;
+if (start_offset < 0 || start_offset > length) return PCRE_ERROR_BADOFFSET;
/* This information is for finding all the numbers associated with a given
name, for condition testing. */
@@ -5764,16 +5799,14 @@ back the character offset. */
#ifdef SUPPORT_UTF8
if (utf8 && (options & PCRE_NO_UTF8_CHECK) == 0)
{
- if (_pcre_valid_utf8((USPTR)subject, length) >= 0)
- return PCRE_ERROR_BADUTF8;
+ int tb;
+ if ((tb = _pcre_valid_utf8((USPTR)subject, length)) >= 0)
+ return (tb == length && md->partial > 1)?
+ PCRE_ERROR_SHORTUTF8 : PCRE_ERROR_BADUTF8;
if (start_offset > 0 && start_offset < length)
{
- int tb = ((USPTR)subject)[start_offset];
- if (tb > 127)
- {
- tb &= 0xc0;
- if (tb != 0 && tb != 0xc0) return PCRE_ERROR_BADUTF8_OFFSET;
- }
+ tb = ((USPTR)subject)[start_offset] & 0xc0;
+ if (tb == 0x80) return PCRE_ERROR_BADUTF8_OFFSET;
}
}
#endif
@@ -5901,9 +5934,10 @@ for(;;)
/* There are some optimizations that avoid running the match if a known
starting point is not found, or if a known later character is not present.
However, there is an option that disables these, for testing and for ensuring
- that all callouts do actually occur. */
+ that all callouts do actually occur. The option can be set in the regex by
+ (*NO_START_OPT) or passed in match-time options. */
- if ((options & PCRE_NO_START_OPTIMIZE) == 0)
+ if (((options | re->options) & PCRE_NO_START_OPTIMIZE) == 0)
{
/* Advance to a unique first byte if there is one. */
diff --git a/ext/pcre/pcrelib/pcre_internal.h b/ext/pcre/pcrelib/pcre_internal.h
index 039a140945..5f736d13ac 100644
--- a/ext/pcre/pcrelib/pcre_internal.h
+++ b/ext/pcre/pcrelib/pcre_internal.h
@@ -192,9 +192,7 @@ stdint.h is available, include it; it may define INT64_MAX. Systems that do not
have stdint.h (e.g. Solaris) may have inttypes.h. The macro int64_t may be set
by "configure". */
-#ifdef PHP_WIN32
-#include "win32/php_stdint.h"
-#elif HAVE_STDINT_H
+#if HAVE_STDINT_H
#include <stdint.h>
#elif HAVE_INTTYPES_H
#include <inttypes.h>
@@ -410,9 +408,10 @@ capturing parenthesis numbers in back references. */
/* When UTF-8 encoding is being used, a character is no longer just a single
byte. The macros for character handling generate simple sequences when used in
-byte-mode, and more complicated ones for UTF-8 characters. BACKCHAR should
-never be called in byte mode. To make sure it can never even appear when UTF-8
-support is omitted, we don't even define it. */
+byte-mode, and more complicated ones for UTF-8 characters. GETCHARLENTEST is
+not used when UTF-8 is not supported, so it is not defined, and BACKCHAR should
+never be called in byte mode. To make sure they can never even appear when
+UTF-8 support is omitted, we don't even define them. */
#ifndef SUPPORT_UTF8
#define GETCHAR(c, eptr) c = *eptr;
@@ -420,43 +419,83 @@ support is omitted, we don't even define it. */
#define GETCHARINC(c, eptr) c = *eptr++;
#define GETCHARINCTEST(c, eptr) c = *eptr++;
#define GETCHARLEN(c, eptr, len) c = *eptr;
+/* #define GETCHARLENTEST(c, eptr, len) */
/* #define BACKCHAR(eptr) */
#else /* SUPPORT_UTF8 */
+/* These macros were originally written in the form of loops that used data
+from the tables whose names start with _pcre_utf8_table. They were rewritten by
+a user so as not to use loops, because in some environments this gives a
+significant performance advantage, and it seems never to do any harm. */
+
+/* Base macro to pick up the remaining bytes of a UTF-8 character, not
+advancing the pointer. */
+
+#define GETUTF8(c, eptr) \
+ { \
+ if ((c & 0x20) == 0) \
+ c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
+ else if ((c & 0x10) == 0) \
+ c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
+ else if ((c & 0x08) == 0) \
+ c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
+ ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
+ else if ((c & 0x04) == 0) \
+ c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
+ ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
+ (eptr[4] & 0x3f); \
+ else \
+ c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
+ ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
+ ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
+ }
+
/* Get the next UTF-8 character, not advancing the pointer. This is called when
we know we are in UTF-8 mode. */
#define GETCHAR(c, eptr) \
c = *eptr; \
- if (c >= 0xc0) \
- { \
- int gcii; \
- int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
- int gcss = 6*gcaa; \
- c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
- for (gcii = 1; gcii <= gcaa; gcii++) \
- { \
- gcss -= 6; \
- c |= (eptr[gcii] & 0x3f) << gcss; \
- } \
- }
+ if (c >= 0xc0) GETUTF8(c, eptr);
/* Get the next UTF-8 character, testing for UTF-8 mode, and not advancing the
pointer. */
#define GETCHARTEST(c, eptr) \
c = *eptr; \
- if (utf8 && c >= 0xc0) \
+ if (utf8 && c >= 0xc0) GETUTF8(c, eptr);
+
+/* Base macro to pick up the remaining bytes of a UTF-8 character, advancing
+the pointer. */
+
+#define GETUTF8INC(c, eptr) \
{ \
- int gcii; \
- int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
- int gcss = 6*gcaa; \
- c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
- for (gcii = 1; gcii <= gcaa; gcii++) \
+ if ((c & 0x20) == 0) \
+ c = ((c & 0x1f) << 6) | (*eptr++ & 0x3f); \
+ else if ((c & 0x10) == 0) \
+ { \
+ c = ((c & 0x0f) << 12) | ((*eptr & 0x3f) << 6) | (eptr[1] & 0x3f); \
+ eptr += 2; \
+ } \
+ else if ((c & 0x08) == 0) \
+ { \
+ c = ((c & 0x07) << 18) | ((*eptr & 0x3f) << 12) | \
+ ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
+ eptr += 3; \
+ } \
+ else if ((c & 0x04) == 0) \
{ \
- gcss -= 6; \
- c |= (eptr[gcii] & 0x3f) << gcss; \
+ c = ((c & 0x03) << 24) | ((*eptr & 0x3f) << 18) | \
+ ((eptr[1] & 0x3f) << 12) | ((eptr[2] & 0x3f) << 6) | \
+ (eptr[3] & 0x3f); \
+ eptr += 4; \
+ } \
+ else \
+ { \
+ c = ((c & 0x01) << 30) | ((*eptr & 0x3f) << 24) | \
+ ((eptr[1] & 0x3f) << 18) | ((eptr[2] & 0x3f) << 12) | \
+ ((eptr[3] & 0x3f) << 6) | (eptr[4] & 0x3f); \
+ eptr += 5; \
} \
}
@@ -465,32 +504,49 @@ know we are in UTF-8 mode. */
#define GETCHARINC(c, eptr) \
c = *eptr++; \
- if (c >= 0xc0) \
- { \
- int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
- int gcss = 6*gcaa; \
- c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
- while (gcaa-- > 0) \
- { \
- gcss -= 6; \
- c |= (*eptr++ & 0x3f) << gcss; \
- } \
- }
+ if (c >= 0xc0) GETUTF8INC(c, eptr);
/* Get the next character, testing for UTF-8 mode, and advancing the pointer.
This is called when we don't know if we are in UTF-8 mode. */
#define GETCHARINCTEST(c, eptr) \
c = *eptr++; \
- if (utf8 && c >= 0xc0) \
+ if (utf8 && c >= 0xc0) GETUTF8INC(c, eptr);
+
+/* Base macro to pick up the remaining bytes of a UTF-8 character, not
+advancing the pointer, incrementing the length. */
+
+#define GETUTF8LEN(c, eptr, len) \
{ \
- int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
- int gcss = 6*gcaa; \
- c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
- while (gcaa-- > 0) \
+ if ((c & 0x20) == 0) \
+ { \
+ c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
+ len++; \
+ } \
+ else if ((c & 0x10) == 0) \
{ \
- gcss -= 6; \
- c |= (*eptr++ & 0x3f) << gcss; \
+ c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
+ len += 2; \
+ } \
+ else if ((c & 0x08) == 0) \
+ {\
+ c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
+ ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
+ len += 3; \
+ } \
+ else if ((c & 0x04) == 0) \
+ { \
+ c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
+ ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
+ (eptr[4] & 0x3f); \
+ len += 4; \
+ } \
+ else \
+ {\
+ c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
+ ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
+ ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
+ len += 5; \
} \
}
@@ -499,19 +555,7 @@ if there are extra bytes. This is called when we know we are in UTF-8 mode. */
#define GETCHARLEN(c, eptr, len) \
c = *eptr; \
- if (c >= 0xc0) \
- { \
- int gcii; \
- int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
- int gcss = 6*gcaa; \
- c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
- for (gcii = 1; gcii <= gcaa; gcii++) \
- { \
- gcss -= 6; \
- c |= (eptr[gcii] & 0x3f) << gcss; \
- } \
- len += gcaa; \
- }
+ if (c >= 0xc0) GETUTF8LEN(c, eptr, len);
/* Get the next UTF-8 character, testing for UTF-8 mode, not advancing the
pointer, incrementing length if there are extra bytes. This is called when we
@@ -519,19 +563,7 @@ do not know if we are in UTF-8 mode. */
#define GETCHARLENTEST(c, eptr, len) \
c = *eptr; \
- if (utf8 && c >= 0xc0) \
- { \
- int gcii; \
- int gcaa = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
- int gcss = 6*gcaa; \
- c = (c & _pcre_utf8_table3[gcaa]) << gcss; \
- for (gcii = 1; gcii <= gcaa; gcii++) \
- { \
- gcss -= 6; \
- c |= (eptr[gcii] & 0x3f) << gcss; \
- } \
- len += gcaa; \
- }
+ if (utf8 && c >= 0xc0) GETUTF8LEN(c, eptr, len);
/* If the pointer is not at the start of a character, move it back until
it is. This is called only in UTF-8 mode - we don't put a test within the macro
@@ -539,7 +571,7 @@ because almost all calls are already within a block of UTF-8 only code. */
#define BACKCHAR(eptr) while((*eptr & 0xc0) == 0x80) eptr--
-#endif
+#endif /* SUPPORT_UTF8 */
/* In case there is no definition of offsetof() provided - though any proper
@@ -583,7 +615,7 @@ time, run time, or study time, respectively. */
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK|PCRE_AUTO_CALLOUT|PCRE_FIRSTLINE| \
PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE| \
- PCRE_JAVASCRIPT_COMPAT|PCRE_UCP)
+ PCRE_JAVASCRIPT_COMPAT|PCRE_UCP|PCRE_NO_START_OPTIMIZE)
#define PUBLIC_EXEC_OPTIONS \
(PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NOTEMPTY_ATSTART| \
@@ -900,15 +932,16 @@ so that PCRE works on both ASCII and EBCDIC platforms, in non-UTF-mode only. */
#define STRING_DEFINE "DEFINE"
-#define STRING_CR_RIGHTPAR "CR)"
-#define STRING_LF_RIGHTPAR "LF)"
-#define STRING_CRLF_RIGHTPAR "CRLF)"
-#define STRING_ANY_RIGHTPAR "ANY)"
-#define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
-#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
-#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
-#define STRING_UTF8_RIGHTPAR "UTF8)"
-#define STRING_UCP_RIGHTPAR "UCP)"
+#define STRING_CR_RIGHTPAR "CR)"
+#define STRING_LF_RIGHTPAR "LF)"
+#define STRING_CRLF_RIGHTPAR "CRLF)"
+#define STRING_ANY_RIGHTPAR "ANY)"
+#define STRING_ANYCRLF_RIGHTPAR "ANYCRLF)"
+#define STRING_BSR_ANYCRLF_RIGHTPAR "BSR_ANYCRLF)"
+#define STRING_BSR_UNICODE_RIGHTPAR "BSR_UNICODE)"
+#define STRING_UTF8_RIGHTPAR "UTF8)"
+#define STRING_UCP_RIGHTPAR "UCP)"
+#define STRING_NO_START_OPT_RIGHTPAR "NO_START_OPT)"
#else /* SUPPORT_UTF8 */
@@ -1154,15 +1187,16 @@ only. */
#define STRING_DEFINE STR_D STR_E STR_F STR_I STR_N STR_E
-#define STRING_CR_RIGHTPAR STR_C STR_R STR_RIGHT_PARENTHESIS
-#define STRING_LF_RIGHTPAR STR_L STR_F STR_RIGHT_PARENTHESIS
-#define STRING_CRLF_RIGHTPAR STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
-#define STRING_ANY_RIGHTPAR STR_A STR_N STR_Y STR_RIGHT_PARENTHESIS
-#define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
-#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
-#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
-#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
-#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
+#define STRING_CR_RIGHTPAR STR_C STR_R STR_RIGHT_PARENTHESIS
+#define STRING_LF_RIGHTPAR STR_L STR_F STR_RIGHT_PARENTHESIS
+#define STRING_CRLF_RIGHTPAR STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
+#define STRING_ANY_RIGHTPAR STR_A STR_N STR_Y STR_RIGHT_PARENTHESIS
+#define STRING_ANYCRLF_RIGHTPAR STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
+#define STRING_BSR_ANYCRLF_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_A STR_N STR_Y STR_C STR_R STR_L STR_F STR_RIGHT_PARENTHESIS
+#define STRING_BSR_UNICODE_RIGHTPAR STR_B STR_S STR_R STR_UNDERSCORE STR_U STR_N STR_I STR_C STR_O STR_D STR_E STR_RIGHT_PARENTHESIS
+#define STRING_UTF8_RIGHTPAR STR_U STR_T STR_F STR_8 STR_RIGHT_PARENTHESIS
+#define STRING_UCP_RIGHTPAR STR_U STR_C STR_P STR_RIGHT_PARENTHESIS
+#define STRING_NO_START_OPT_RIGHTPAR STR_N STR_O STR_UNDERSCORE STR_S STR_T STR_A STR_R STR_T STR_UNDERSCORE STR_O STR_P STR_T STR_RIGHT_PARENTHESIS
#endif /* SUPPORT_UTF8 */
@@ -1516,8 +1550,9 @@ in UTF-8 mode. The code that uses this table must know about such things. */
3, 3, /* RREF, NRREF */ \
1, /* DEF */ \
1, 1, /* BRAZERO, BRAMINZERO */ \
- 3, 1, 3, /* MARK, PRUNE, PRUNE_ARG, */ \
- 1, 3, 1, 3, /* SKIP, SKIP_ARG, THEN, THEN_ARG, */ \
+ 3, 1, 3, /* MARK, PRUNE, PRUNE_ARG */ \
+ 1, 3, /* SKIP, SKIP_ARG */ \
+ 1+LINK_SIZE, 3+LINK_SIZE, /* THEN, THEN_ARG */ \
1, 1, 1, 3, 1 /* COMMIT, FAIL, ACCEPT, CLOSE, SKIPZERO */
@@ -1536,7 +1571,8 @@ enum { ERR0, ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9,
ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
- ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERRCOUNT };
+ ERR60, ERR61, ERR62, ERR63, ERR64, ERR65, ERR66, ERR67, ERR68,
+ ERRCOUNT };
/* The real format of the start of the pcre block; the index of names and the
code vector run on as long as necessary after the end. We store an explicit
diff --git a/ext/pcre/pcrelib/pcre_printint.src b/ext/pcre/pcrelib/pcre_printint.src
index 49d93174cc..c7d8629477 100644
--- a/ext/pcre/pcrelib/pcre_printint.src
+++ b/ext/pcre/pcrelib/pcre_printint.src
@@ -537,11 +537,26 @@ for(;;)
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
fprintf(f, " %s %s", OP_names[*code], code + 2);
extra += code[1];
break;
+ case OP_THEN:
+ if (print_lengths)
+ fprintf(f, " %s %d", OP_names[*code], GET(code, 1));
+ else
+ fprintf(f, " %s", OP_names[*code]);
+ break;
+
+ case OP_THEN_ARG:
+ if (print_lengths)
+ fprintf(f, " %s %d %s", OP_names[*code], GET(code, 1),
+ code + 2 + LINK_SIZE);
+ else
+ fprintf(f, " %s %s", OP_names[*code], code + 2 + LINK_SIZE);
+ extra += code[1+LINK_SIZE];
+ break;
+
/* Anything else is just an item with no data*/
default:
diff --git a/ext/pcre/pcrelib/pcre_study.c b/ext/pcre/pcrelib/pcre_study.c
index 3ac7e81496..431ad9d0dd 100644
--- a/ext/pcre/pcrelib/pcre_study.c
+++ b/ext/pcre/pcrelib/pcre_study.c
@@ -417,10 +417,13 @@ for (;;)
case OP_MARK:
case OP_PRUNE_ARG:
case OP_SKIP_ARG:
- case OP_THEN_ARG:
cc += _pcre_OP_lengths[op] + cc[1];
break;
+ case OP_THEN_ARG:
+ cc += _pcre_OP_lengths[op] + cc[1+LINK_SIZE];
+ break;
+
/* For the record, these are the opcodes that are matched by "default":
OP_ACCEPT, OP_CLOSE, OP_COMMIT, OP_FAIL, OP_PRUNE, OP_SET_SOM, OP_SKIP,
OP_THEN. */
diff --git a/ext/pcre/pcrelib/pcre_valid_utf8.c b/ext/pcre/pcrelib/pcre_valid_utf8.c
index 3c81dc9ecc..d54d3bd2d6 100644
--- a/ext/pcre/pcrelib/pcre_valid_utf8.c
+++ b/ext/pcre/pcrelib/pcre_valid_utf8.c
@@ -70,6 +70,20 @@ Arguments:
Returns: < 0 if the string is a valid UTF-8 string
>= 0 otherwise; the value is the offset of the bad byte
+
+Bad bytes can be:
+
+ . An isolated byte whose most significant bits are 0x80, because this
+ can only correctly appear within a UTF-8 character;
+
+ . A byte whose most significant bits are 0xc0, but whose other bits indicate
+ that there are more than 3 additional bytes (i.e. an RFC 2279 starting
+ byte, which is no longer valid under RFC 3629);
+
+ .
+
+The returned offset may also be equal to the length of the string; this means
+that one or more bytes is missing from the final UTF-8 character.
*/
int
@@ -91,7 +105,8 @@ for (p = string; length-- > 0; p++)
if (c < 128) continue;
if (c < 0xc0) return p - string;
ab = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */
- if (length < ab || ab > 3) return p - string;
+ if (ab > 3) return p - string; /* Too many for RFC 3629 */
+ if (length < ab) return p + 1 + length - string; /* Missing bytes */
length -= ab;
/* Check top bits in the second byte */
diff --git a/ext/pcre/pcrelib/pcredemo.c b/ext/pcre/pcrelib/pcredemo.c
index d565aecddd..3a38ced9bf 100644
--- a/ext/pcre/pcrelib/pcredemo.c
+++ b/ext/pcre/pcrelib/pcredemo.c
@@ -50,13 +50,16 @@ const char *error;
char *pattern;
char *subject;
unsigned char *name_table;
+unsigned int option_bits;
int erroffset;
int find_all;
+int crlf_is_newline;
int namecount;
int name_entry_size;
int ovector[OVECCOUNT];
int subject_length;
int rc, i;
+int utf8;
/**************************************************************************
@@ -238,15 +241,56 @@ if (namecount <= 0) printf("No named substrings\n"); else
* subject is not a valid match; other possibilities must be tried. The *
* second flag restricts PCRE to one match attempt at the initial string *
* position. If this match succeeds, an alternative to the empty string *
-* match has been found, and we can proceed round the loop. *
+* match has been found, and we can print it and proceed round the loop, *
+* advancing by the length of whatever was found. If this match does not *
+* succeed, we still stay in the loop, advancing by just one character. *
+* In UTF-8 mode, which can be set by (*UTF8) in the pattern, this may be *
+* more than one byte. *
+* *
+* However, there is a complication concerned with newlines. When the *
+* newline convention is such that CRLF is a valid newline, we want must *
+* advance by two characters rather than one. The newline convention can *
+* be set in the regex by (*CR), etc.; if not, we must find the default. *
*************************************************************************/
-if (!find_all)
+if (!find_all) /* Check for -g */
{
pcre_free(re); /* Release the memory used for the compiled pattern */
return 0; /* Finish unless -g was given */
}
+/* Before running the loop, check for UTF-8 and whether CRLF is a valid newline
+sequence. First, find the options with which the regex was compiled; extract
+the UTF-8 state, and mask off all but the newline options. */
+
+(void)pcre_fullinfo(re, NULL, PCRE_INFO_OPTIONS, &option_bits);
+utf8 = option_bits & PCRE_UTF8;
+option_bits &= PCRE_NEWLINE_CR|PCRE_NEWLINE_LF|PCRE_NEWLINE_CRLF|
+ PCRE_NEWLINE_ANY|PCRE_NEWLINE_ANYCRLF;
+
+/* If no newline options were set, find the default newline convention from the
+build configuration. */
+
+if (option_bits == 0)
+ {
+ int d;
+ (void)pcre_config(PCRE_CONFIG_NEWLINE, &d);
+ /* Note that these values are always the ASCII ones, even in
+ EBCDIC environments. CR = 13, NL = 10. */
+ option_bits = (d == 13)? PCRE_NEWLINE_CR :
+ (d == 10)? PCRE_NEWLINE_LF :
+ (d == (13<<8 | 10))? PCRE_NEWLINE_CRLF :
+ (d == -2)? PCRE_NEWLINE_ANYCRLF :
+ (d == -1)? PCRE_NEWLINE_ANY : 0;
+ }
+
+/* See if CRLF is a valid newline sequence. */
+
+crlf_is_newline =
+ option_bits == PCRE_NEWLINE_ANY ||
+ option_bits == PCRE_NEWLINE_CRLF ||
+ option_bits == PCRE_NEWLINE_ANYCRLF;
+
/* Loop for second and subsequent matches */
for (;;)
@@ -280,14 +324,32 @@ for (;;)
is zero, it just means we have found all possible matches, so the loop ends.
Otherwise, it means we have failed to find a non-empty-string match at a
point where there was a previous empty-string match. In this case, we do what
- Perl does: advance the matching position by one, and continue. We do this by
- setting the "end of previous match" offset, because that is picked up at the
- top of the loop as the point at which to start again. */
+ Perl does: advance the matching position by one character, and continue. We
+ do this by setting the "end of previous match" offset, because that is picked
+ up at the top of the loop as the point at which to start again.
+
+ There are two complications: (a) When CRLF is a valid newline sequence, and
+ the current position is just before it, advance by an extra byte. (b)
+ Otherwise we must ensure that we skip an entire UTF-8 character if we are in
+ UTF-8 mode. */
if (rc == PCRE_ERROR_NOMATCH)
{
- if (options == 0) break;
- ovector[1] = start_offset + 1;
+ if (options == 0) break; /* All matches found */
+ ovector[1] = start_offset + 1; /* Advance one byte */
+ if (crlf_is_newline && /* If CRLF is newline & */
+ start_offset < subject_length - 1 && /* we are at CRLF, */
+ subject[start_offset] == '\r' &&
+ subject[start_offset + 1] == '\n')
+ ovector[1] += 1; /* Advance by one more. */
+ else if (utf8) /* Otherwise, ensure we */
+ { /* advance a whole UTF-8 */
+ while (ovector[1] < subject_length) /* character. */
+ {
+ if ((subject[ovector[1]] & 0xc0) != 0x80) break;
+ ovector[1] += 1;
+ }
+ }
continue; /* Go round the loop again */
}
diff --git a/ext/pcre/pcrelib/pcreposix.c b/ext/pcre/pcrelib/pcreposix.c
index 29d8446829..ec2517c87a 100644
--- a/ext/pcre/pcrelib/pcreposix.c
+++ b/ext/pcre/pcrelib/pcreposix.c
@@ -149,6 +149,7 @@ static const int eint[] = {
REG_BADPAT, /* different names for subpatterns of the same number are not allowed */
REG_BADPAT, /* (*MARK) must have an argument */
REG_INVARG, /* this version of PCRE is not compiled with PCRE_UCP support */
+ REG_BADPAT, /* \c must be followed by an ASCII character */
};
/* Table of texts corresponding to POSIX error codes */