field | value | date |
---|---|---|
author | Nuno Lopes <nlopess@php.net> | 2007-09-01 17:48:43 +0000 |
committer | Nuno Lopes <nlopess@php.net> | 2007-09-01 17:48:43 +0000 |
commit | 3edd2a69f27242b50b7afb46dd9220da46820178 | |
tree | 96da4f8dc21d66c25cecbea0a2d2560214bcba51 | |
parent | a3e6be974fc62869c3f4d72bcbbf73865168e613 | |
download | php-git-3edd2a69f27242b50b7afb46dd9220da46820178.tar.gz | |
upgrade to PCRE 7.3
52 files changed, 3873 insertions, 1650 deletions
@@ -2,6 +2,7 @@ PHP NEWS ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ?? ??? 20??, PHP 5.2.5 - Added optional parameter $provide_object to debug_backtrace(). (Sebastian) +- Upgraded PCRE to version 7.3 (Nuno) - Fixed bug #42462 (Segmentation when trying to set an attribute in a DOMElement). (Rob) diff --git a/ext/pcre/pcrelib/ChangeLog b/ext/pcre/pcrelib/ChangeLog index 87c2de74dc..3b18524fe6 100644 --- a/ext/pcre/pcrelib/ChangeLog +++ b/ext/pcre/pcrelib/ChangeLog @@ -1,7 +1,172 @@ ChangeLog for PCRE ------------------ -Version 7.2 19-June-07 +Version 7.3 28-Aug-07 +--------------------- + + 1. In the rejigging of the build system that eventually resulted in 7.1, the + line "#include <pcre.h>" was included in pcre_internal.h. The use of angle + brackets there is not right, since it causes compilers to look for an + installed pcre.h, not the version that is in the source that is being + compiled (which of course may be different). I have changed it back to: + + #include "pcre.h" + + I have a vague recollection that the change was concerned with compiling in + different directories, but in the new build system, that is taken care of + by the VPATH setting the Makefile. + + 2. The pattern .*$ when run in not-DOTALL UTF-8 mode with newline=any failed + when the subject happened to end in the byte 0x85 (e.g. if the last + character was \x{1ec5}). *Character* 0x85 is one of the "any" newline + characters but of course it shouldn't be taken as a newline when it is part + of another character. The bug was that, for an unlimited repeat of . in + not-DOTALL UTF-8 mode, PCRE was advancing by bytes rather than by + characters when looking for a newline. + + 3. A small performance improvement in the DOTALL UTF-8 mode .* case. + + 4. Debugging: adjusted the names of opcodes for different kinds of parentheses + in debug output. + + 5. 
Arrange to use "%I64d" instead of "%lld" and "%I64u" instead of "%llu" for + long printing in the pcrecpp unittest when running under MinGW. + + 6. ESC_K was left out of the EBCDIC table. + + 7. Change 7.0/38 introduced a new limit on the number of nested non-capturing + parentheses; I made it 1000, which seemed large enough. Unfortunately, the + limit also applies to "virtual nesting" when a pattern is recursive, and in + this case 1000 isn't so big. I have been able to remove this limit at the + expense of backing off one optimization in certain circumstances. Normally, + when pcre_exec() would call its internal match() function recursively and + immediately return the result unconditionally, it uses a "tail recursion" + feature to save stack. However, when a subpattern that can match an empty + string has an unlimited repetition quantifier, it no longer makes this + optimization. That gives it a stack frame in which to save the data for + checking that an empty string has been matched. Previously this was taken + from the 1000-entry workspace that had been reserved. So now there is no + explicit limit, but more stack is used. + + 8. Applied Daniel's patches to solve problems with the import/export magic + syntax that is required for Windows, and which was going wrong for the + pcreposix and pcrecpp parts of the library. These were overlooked when this + problem was solved for the main library. + + 9. There were some crude static tests to avoid integer overflow when computing + the size of patterns that contain repeated groups with explicit upper + limits. As the maximum quantifier is 65535, the maximum group length was + set at 30,000 so that the product of these two numbers did not overflow a + 32-bit integer. However, it turns out that people want to use groups that + are longer than 30,000 bytes (though not repeat them that many times). 
+ Change 7.0/17 (the refactoring of the way the pattern size is computed) has + made it possible to implement the integer overflow checks in a much more + dynamic way, which I have now done. The artificial limitation on group + length has been removed - we now have only the limit on the total length of + the compiled pattern, which depends on the LINK_SIZE setting. + +10. Fixed a bug in the documentation for get/copy named substring when + duplicate names are permitted. If none of the named substrings are set, the + functions return PCRE_ERROR_NOSUBSTRING (7); the doc said they returned an + empty string. + +11. Because Perl interprets \Q...\E at a high level, and ignores orphan \E + instances, patterns such as [\Q\E] or [\E] or even [^\E] cause an error, + because the ] is interpreted as the first data character and the + terminating ] is not found. PCRE has been made compatible with Perl in this + regard. Previously, it interpreted [\Q\E] as an empty class, and [\E] could + cause memory overwriting. + +10. Like Perl, PCRE automatically breaks an unlimited repeat after an empty + string has been matched (to stop an infinite loop). It was not recognizing + a conditional subpattern that could match an empty string if that + subpattern was within another subpattern. For example, it looped when + trying to match (((?(1)X|))*) but it was OK with ((?(1)X|)*) where the + condition was not nested. This bug has been fixed. + +12. A pattern like \X?\d or \P{L}?\d in non-UTF-8 mode could cause a backtrack + past the start of the subject in the presence of bytes with the top bit + set, for example "\x8aBCD". + +13. Added Perl 5.10 experimental backtracking controls (*FAIL), (*F), (*PRUNE), + (*SKIP), (*THEN), (*COMMIT), and (*ACCEPT). + +14. Optimized (?!) to (*FAIL). + +15. Updated the test for a valid UTF-8 string to conform to the later RFC 3629. + This restricts code points to be within the range 0 to 0x10FFFF, excluding + the "low surrogate" sequence 0xD800 to 0xDFFF. 
Previously, PCRE allowed the + full range 0 to 0x7FFFFFFF, as defined by RFC 2279. Internally, it still + does: it's just the validity check that is more restrictive. + +16. Inserted checks for integer overflows during escape sequence (backslash) + processing, and also fixed erroneous offset values for syntax errors during + backslash processing. + +17. Fixed another case of looking too far back in non-UTF-8 mode (cf 12 above) + for patterns like [\PPP\x8a]{1,}\x80 with the subject "A\x80". + +18. An unterminated class in a pattern like (?1)\c[ with a "forward reference" + caused an overrun. + +19. A pattern like (?:[\PPa*]*){8,} which had an "extended class" (one with + something other than just ASCII characters) inside a group that had an + unlimited repeat caused a loop at compile time (while checking to see + whether the group could match an empty string). + +20. Debugging a pattern containing \p or \P could cause a crash. For example, + [\P{Any}] did so. (Error in the code for printing property names.) + +21. An orphan \E inside a character class could cause a crash. + +22. A repeated capturing bracket such as (A)? could cause a wild memory + reference during compilation. + +23. There are several functions in pcre_compile() that scan along a compiled + expression for various reasons (e.g. to see if it's fixed length for look + behind). There were bugs in these functions when a repeated \p or \P was + present in the pattern. These operators have additional parameters compared + with \d, etc, and these were not being taken into account when moving along + the compiled data. Specifically: + + (a) A item such as \p{Yi}{3} in a lookbehind was not treated as fixed + length. + + (b) An item such as \pL+ within a repeated group could cause crashes or + loops. + + (c) A pattern such as \p{Yi}+(\P{Yi}+)(?1) could give an incorrect + "reference to non-existent subpattern" error. + + (d) A pattern like (\P{Yi}{2}\277)? could loop at compile time. + +24. 
A repeated \S or \W in UTF-8 mode could give wrong answers when multibyte + characters were involved (for example /\S{2}/8g with "A\x{a3}BC"). + +25. Using pcregrep in multiline, inverted mode (-Mv) caused it to loop. + +26. Patterns such as [\P{Yi}A] which include \p or \P and just one other + character were causing crashes (broken optimization). + +27. Patterns such as (\P{Yi}*\277)* (group with possible zero repeat containing + \p or \P) caused a compile-time loop. + +28. More problems have arisen in unanchored patterns when CRLF is a valid line + break. For example, the unstudied pattern [\r\n]A does not match the string + "\r\nA" because change 7.0/46 below moves the current point on by two + characters after failing to match at the start. However, the pattern \nA + *does* match, because it doesn't start till \n, and if [\r\n]A is studied, + the same is true. There doesn't seem any very clean way out of this, but + what I have chosen to do makes the common cases work: PCRE now takes note + of whether there can be an explicit match for \r or \n anywhere in the + pattern, and if so, 7.0/46 no longer applies. As part of this change, + there's a new PCRE_INFO_HASCRORLF option for finding out whether a compiled + pattern has explicit CR or LF references. + +29. Added (*CR) etc for changing newline setting at start of pattern. + + +Version 7.2 19-Jun-07 --------------------- 1. If the fr_FR locale cannot be found for test 3, try the "french" locale, diff --git a/ext/pcre/pcrelib/HACKING b/ext/pcre/pcrelib/HACKING index 49bba8a702..c946cd2bdb 100644 --- a/ext/pcre/pcrelib/HACKING +++ b/ext/pcre/pcrelib/HACKING @@ -109,15 +109,15 @@ variable length. The first byte in an item is an opcode, and the length of the item is either implicit in the opcode or contained in the data bytes that follow it. -In many cases below "two-byte" data values are specified. This is in fact just -a default when the number is an offset within the compiled pattern. 
PCRE can be +In many cases below LINK_SIZE data values are specified for offsets within the +compiled pattern. The default value for LINK_SIZE is 2, but PCRE can be compiled to use 3-byte or 4-byte values for these offsets (impairing the performance). This is necessary only when patterns whose compiled length is greater than 64K are going to be processed. In this description, we assume the -"normal" compilation options. "Two-byte" data values that are counts (e.g. for -quantifiers) are always just two bytes. +"normal" compilation options. Data values that are counts (e.g. for +quantifiers) are always just two bytes long. -A list of all the opcodes follows: +A list of the opcodes follows: Opcodes with no following data ------------------------------ @@ -149,6 +149,13 @@ These items are all just one byte long OP_EXTUNI match an extended Unicode character OP_ANYNL match any Unicode newline sequence + OP_ACCEPT ) + OP_COMMIT ) + OP_FAIL ) These are Perl 5.10's "backtracking + OP_PRUNE ) control verbs". + OP_SKIP ) + OP_THEN ) + Repeating single characters --------------------------- @@ -404,4 +411,4 @@ at compile time, and so does not cause anything to be put into the compiled data. Philip Hazel -June 2007 +August 2007 diff --git a/ext/pcre/pcrelib/NEWS b/ext/pcre/pcrelib/NEWS index f1083e8601..6a30805bb7 100644 --- a/ext/pcre/pcrelib/NEWS +++ b/ext/pcre/pcrelib/NEWS @@ -2,6 +2,30 @@ News about PCRE releases ------------------------ +Release 7.3 28-Aug-07 +--------------------- + +Most changes are bug fixes. Some that are not: + +1. There is some support for Perl 5.10's experimental "backtracking control + verbs" such as (*PRUNE). + +2. UTF-8 checking is now as per RFC 3629 instead of RFC 2279; this is more + restrictive in the strings it accepts. + +3. Checking for potential integer overflow has been made more dynamic, and as a + consequence there is no longer a hard limit on the size of a subpattern that + has a limited repeat count. + +4. 
When CRLF is a valid line-ending sequence, pcre_exec() and pcre_dfa_exec() + no longer advance by two characters instead of one when an unanchored match + fails at CRLF if there are explicit CR or LF matches within the pattern. + This gets rid of some anomalous effects that previously occurred. + +5. Some PCRE-specific settings for varying the newline options at the start of + a pattern have been added. + + Release 7.2 19-Jun-07 --------------------- diff --git a/ext/pcre/pcrelib/NON-UNIX-USE b/ext/pcre/pcrelib/NON-UNIX-USE index a10c7041aa..f1047baa70 100644 --- a/ext/pcre/pcrelib/NON-UNIX-USE +++ b/ext/pcre/pcrelib/NON-UNIX-USE @@ -7,6 +7,7 @@ This document contains the following sections: Generic instructions for the PCRE C library The C++ wrapper functions Building for virtual Pascal + Stack size in Windows environments Comments about Win32 builds Building under Windows with BCC5.5 Building PCRE on OpenVMS @@ -14,7 +15,7 @@ This document contains the following sections: GENERAL -I (Philip Hazel) have no knowledge of Windows or VMS sytems and how their +I (Philip Hazel) have no experience of Windows or VMS sytems and how their libraries work. The items in the PCRE distribution and Makefile that relate to anything other than Unix-like systems are untested by me. @@ -38,79 +39,97 @@ GENERIC INSTRUCTIONS FOR THE PCRE C LIBRARY The following are generic comments about building the PCRE C library "by hand". -(1) Copy or rename the file config.h.generic as config.h, and edit the macro - settings that it contains to whatever is appropriate for your environment. - In particular, if you want to force a specific value for newline, you can - define the NEWLINE macro. - - An alternative approach is not to edit config.h, but to use -D on the - compiler command line to make any changes that you need. - - NOTE: There have been occasions when the way in which certain parameters in - config.h are used has changed between releases. 
(In the configure/make - world, this is handled automatically.) When upgrading to a new release, you - are strongly advised to review config.h.generic before re-using what you - had previously. - -(2) Copy or rename the file pcre.h.generic as pcre.h. - -(3) EITHER: - Copy or rename file pcre_chartables.c.dist as pcre_chartables.c. - - OR: - Compile dftables.c as a stand-alone program, and then run it with the - single argument "pcre_chartables.c". This generates a set of standard - character tables and writes them to that file. The tables are generated - using the default C locale for your system. If you want to use a locale - that is specified by LC_xxx environment variables, add the -L option to - the dftables command. You must use this method if you are building on - a system that uses EBCDIC code. - - The tables in pcre_chartables.c are defaults. The caller of PCRE can - specify alternative tables at run time. - -(4) Compile the following source files: - - pcre_chartables.c - pcre_compile.c - pcre_config.c - pcre_dfa_exec.c - pcre_exec.c - pcre_fullinfo.c - pcre_get.c - pcre_globals.c - pcre_info.c - pcre_maketables.c - pcre_newline.c - pcre_ord2utf8.c - pcre_refcount.c - pcre_study.c - pcre_tables.c - pcre_try_flipped.c - pcre_ucp_searchfuncs.c - pcre_valid_utf8.c - pcre_version.c - pcre_xclass.c - - Now link them all together into an object library in whichever form your - system keeps such libraries. This is the basic PCRE C library. If your - system has static and shared libraries, you may have to do this once for - each type. - -(5) Similarly, compile pcreposix.c and link it (on its own) as the pcreposix - library. - -(6) Compile the test program pcretest.c. This needs the functions in the - pcre and pcreposix libraries when linking. - -(7) Run pcretest on the testinput files in the testdata directory, and check - that the output matches the corresponding testoutput files. 
Note that the - supplied files are in Unix format, with just LF characters as line - terminators. You may need to edit them to change this if your system uses a - different convention. - -(8) If you want to use the pcregrep command, compile and link pcregrep.c; it - uses only the basic PCRE library (it does not need the pcreposix library). + (1) Copy or rename the file config.h.generic as config.h, and edit the macro + settings that it contains to whatever is appropriate for your environment. + In particular, if you want to force a specific value for newline, you can + define the NEWLINE macro. + + An alternative approach is not to edit config.h, but to use -D on the + compiler command line to make any changes that you need. + + NOTE: There have been occasions when the way in which certain parameters + in config.h are used has changed between releases. (In the configure/make + world, this is handled automatically.) When upgrading to a new release, + you are strongly advised to review config.h.generic before re-using what + you had previously. + + (2) Copy or rename the file pcre.h.generic as pcre.h. + + (3) EITHER: + Copy or rename file pcre_chartables.c.dist as pcre_chartables.c. + + OR: + Compile dftables.c as a stand-alone program, and then run it with the + single argument "pcre_chartables.c". This generates a set of standard + character tables and writes them to that file. The tables are generated + using the default C locale for your system. If you want to use a locale + that is specified by LC_xxx environment variables, add the -L option to + the dftables command. You must use this method if you are building on + a system that uses EBCDIC code. + + The tables in pcre_chartables.c are defaults. The caller of PCRE can + specify alternative tables at run time. 
+ + (4) Ensure that you have the following header files: + + pcre_internal.h + ucp.h + ucpinternal.h + ucptable.h + + (5) Also ensure that you have the following file, which is #included as source + when building a debugging version of PCRE and is also used by pcretest. + + pcre_printint.src + + (6) Compile the following source files: + + pcre_chartables.c + pcre_compile.c + pcre_config.c + pcre_dfa_exec.c + pcre_exec.c + pcre_fullinfo.c + pcre_get.c + pcre_globals.c + pcre_info.c + pcre_maketables.c + pcre_newline.c + pcre_ord2utf8.c + pcre_refcount.c + pcre_study.c + pcre_tables.c + pcre_try_flipped.c + pcre_ucp_searchfuncs.c + pcre_valid_utf8.c + pcre_version.c + pcre_xclass.c + + Make sure that you include -I. in the compiler command (or equivalent for + an unusual compiler) so that all included PCRE header files are first + sought in the current directory. Otherwise you run the risk of picking up + a previously-installed file from somewhere else. + + (7) Now link all the compiled code into an object library in whichever form + your system keeps such libraries. This is the basic PCRE C library. If + your system has static and shared libraries, you may have to do this once + for each type. + + (8) Similarly, compile pcreposix.c and link the result (on its own) as the + pcreposix library. + + (9) Compile the test program pcretest.c. This needs the functions in the + pcre and pcreposix libraries when linking. It also needs the + pcre_printint.src source file, which it #includes. + +(10) Run pcretest on the testinput files in the testdata directory, and check + that the output matches the corresponding testoutput files. Note that the + supplied files are in Unix format, with just LF characters as line + terminators. You may need to edit them to change this if your system uses + a different convention. + +(11) If you want to use the pcregrep command, compile and link pcregrep.c; it + uses only the basic PCRE library (it does not need the pcreposix library). 
THE C++ WRAPPER FUNCTIONS @@ -131,6 +150,18 @@ additional files. The following files in the distribution are for building PCRE for use with VP/Borland: makevp_c.txt, makevp_l.txt, makevp.bat, pcregexp.pas. +STACK SIZE IN WINDOWS ENVIRONMENTS + +The default processor stack size of 1Mb in some Windows environments is too +small for matching patterns that need much recursion. In particular, test 2 may +fail because of this. Normally, running out of stack causes a crash, but there +have been cases where the test program has just died silently. See your linker +documentation for how to increase stack size if you experience problems. The +Linux default of 8Mb is a reasonable choice for the stack, though even that can +be too small for some pattern/subject combinations. There is more about stack +usage in the "pcrestack" documentation. + + COMMENTS ABOUT WIN32 BUILDS There are two ways of building PCRE using the "configure, make, make install" @@ -284,5 +315,5 @@ $! Locale could not be set to fr $! ========================= -Last Updated: 13 June 2007 +Last Updated: 01 August 2007 **** diff --git a/ext/pcre/pcrelib/config.h b/ext/pcre/pcrelib/config.h index 510197fa3e..b16c6b6500 100644 --- a/ext/pcre/pcrelib/config.h +++ b/ext/pcre/pcrelib/config.h @@ -178,13 +178,6 @@ them both to 0; an emulation function will be used. */ /* This limit is parameterized just in case anybody ever wants to change it. Care must be taken if it is increased, because it guards against integer overflow caused by enormously large patterns. */ -#ifndef MAX_DUPLENGTH -#define MAX_DUPLENGTH 30000 -#endif - -/* This limit is parameterized just in case anybody ever wants to change it. - Care must be taken if it is increased, because it guards against integer - overflow caused by enormously large patterns. */ #ifndef MAX_NAME_COUNT #define MAX_NAME_COUNT 10000 #endif @@ -224,13 +217,13 @@ them both to 0; an emulation function will be used. 
*/ #define PACKAGE_NAME "PCRE" /* Define to the full name and version of this package. */ -#define PACKAGE_STRING "PCRE 7.2" +#define PACKAGE_STRING "PCRE 7.3" /* Define to the one symbol short name of this package. */ #define PACKAGE_TARNAME "pcre" /* Define to the version of this package. */ -#define PACKAGE_VERSION "7.2" +#define PACKAGE_VERSION "7.3" /* If you are compiling for a system other than a Unix-like system or @@ -272,7 +265,7 @@ them both to 0; an emulation function will be used. */ /* Version number of package */ #ifndef VERSION -#define VERSION "7.2" +#define VERSION "7.3" #endif /* Define to empty if `const' does not conform to ANSI C. */ diff --git a/ext/pcre/pcrelib/dftables.c b/ext/pcre/pcrelib/dftables.c index baa56a15c9..eb9a1a4b7d 100644 --- a/ext/pcre/pcrelib/dftables.c +++ b/ext/pcre/pcrelib/dftables.c @@ -43,6 +43,10 @@ character tables for PCRE. The tables are built according to the current locale. Now that pcre_maketables is a function visible to the outside world, we make use of its code from here in order to be consistent. */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include <ctype.h> #include <stdio.h> #include <string.h> @@ -99,12 +103,15 @@ fprintf(f, "tables are passed to PCRE by the application that calls it. The tables\n" "are used only for characters whose code values are less than 256.\n\n"); fprintf(f, - "The following #include is present because without it gcc 4.x may remove\n" + "The following #includes are present because without them gcc 4.x may remove\n" "the array definition from the final binary if PCRE is built into a static\n" "library and dead code stripping is activated. This leads to link errors.\n" "Pulling in the header ensures that the array gets flagged as \"someone\n" "outside this compilation unit might reference this\" and so it will always\n" "be supplied to the linker. 
*/\n\n" + "#ifdef HAVE_CONFIG_H\n" + "#include <config.h>\n" + "#endif\n\n" "#include \"pcre_internal.h\"\n\n"); fprintf(f, "const unsigned char _pcre_default_tables[] = {\n\n" diff --git a/ext/pcre/pcrelib/doc/pcre.txt b/ext/pcre/pcrelib/doc/pcre.txt index 823f15c440..f924f6de89 100644 --- a/ext/pcre/pcrelib/doc/pcre.txt +++ b/ext/pcre/pcrelib/doc/pcre.txt @@ -45,30 +45,31 @@ INTRODUCTION Details of exactly which Perl regular expression features are and are not supported by PCRE are given in separate documents. See the pcrepat- - tern and pcrecompat pages. + tern and pcrecompat pages. There is a syntax summary in the pcresyntax + page. - Some features of PCRE can be included, excluded, or changed when the - library is built. The pcre_config() function makes it possible for a - client to discover which features are available. The features them- - selves are described in the pcrebuild page. Documentation about build- - ing PCRE for various operating systems can be found in the README file + Some features of PCRE can be included, excluded, or changed when the + library is built. The pcre_config() function makes it possible for a + client to discover which features are available. The features them- + selves are described in the pcrebuild page. Documentation about build- + ing PCRE for various operating systems can be found in the README file in the source distribution. - The library contains a number of undocumented internal functions and - data tables that are used by more than one of the exported external - functions, but which are not intended for use by external callers. - Their names all begin with "_pcre_", which hopefully will not provoke + The library contains a number of undocumented internal functions and + data tables that are used by more than one of the exported external + functions, but which are not intended for use by external callers. + Their names all begin with "_pcre_", which hopefully will not provoke any name clashes. 
In some environments, it is possible to control which - external symbols are exported when a shared library is built, and in + external symbols are exported when a shared library is built, and in these cases the undocumented symbols are not exported. USER DOCUMENTATION - The user documentation for PCRE comprises a number of different sec- - tions. In the "man" format, each of these is a separate "man page". In - the HTML format, each is a separate page, linked from the index page. - In the plain text format, all the sections are concatenated, for ease + The user documentation for PCRE comprises a number of different sec- + tions. In the "man" format, each of these is a separate "man page". In + the HTML format, each is a separate page, linked from the index page. + In the plain text format, all the sections are concatenated, for ease of searching. The sections are as follows: pcre this document @@ -83,6 +84,7 @@ USER DOCUMENTATION pcrepartial details of the partial matching facility pcrepattern syntax and semantics of supported regular expressions + pcresyntax quick syntax reference pcreperform discussion of performance issues pcreposix the POSIX-compatible C API pcreprecompile details of saving and re-using precompiled patterns @@ -90,26 +92,24 @@ USER DOCUMENTATION pcrestack discussion of stack usage pcretest description of the pcretest testing command - In addition, in the "man" and HTML formats, there is a short page for + In addition, in the "man" and HTML formats, there is a short page for each C library function, listing its arguments and results. LIMITATIONS - There are some size limitations in PCRE but it is hoped that they will + There are some size limitations in PCRE but it is hoped that they will never in practice be relevant. - The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE + The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE is compiled with the default internal linkage size of 2. 
If you want to - process regular expressions that are truly enormous, you can compile - PCRE with an internal linkage size of 3 or 4 (see the README file in - the source distribution and the pcrebuild documentation for details). - In these cases the limit is substantially larger. However, the speed + process regular expressions that are truly enormous, you can compile + PCRE with an internal linkage size of 3 or 4 (see the README file in + the source distribution and the pcrebuild documentation for details). + In these cases the limit is substantially larger. However, the speed of execution is slower. - All values in repeating quantifiers must be less than 65536. The maxi- - mum compiled length of subpattern with an explicit repeat count is - 30000 bytes. The maximum number of capturing subpatterns is 65535. + All values in repeating quantifiers must be less than 65536. There is no limit to the number of parenthesized subpatterns, but there can be no more than 65535 capturing subpatterns. @@ -117,99 +117,129 @@ LIMITATIONS The maximum length of name for a named subpattern is 32 characters, and the maximum number of named subpatterns is 10000. - The maximum length of a subject string is the largest positive number - that an integer variable can hold. However, when using the traditional + The maximum length of a subject string is the largest positive number + that an integer variable can hold. However, when using the traditional matching function, PCRE uses recursion to handle subpatterns and indef- - inite repetition. This means that the available stack space may limit + inite repetition. This means that the available stack space may limit the size of a subject string that can be processed by certain patterns. For a discussion of stack issues, see the pcrestack documentation. UTF-8 AND UNICODE PROPERTY SUPPORT - From release 3.3, PCRE has had some support for character strings - encoded in the UTF-8 format. 
For release 4.0 this was greatly extended - to cover most common requirements, and in release 5.0 additional sup- + From release 3.3, PCRE has had some support for character strings + encoded in the UTF-8 format. For release 4.0 this was greatly extended + to cover most common requirements, and in release 5.0 additional sup- port for Unicode general category properties was added. - In order process UTF-8 strings, you must build PCRE to include UTF-8 - support in the code, and, in addition, you must call pcre_compile() - with the PCRE_UTF8 option flag. When you do this, both the pattern and - any subject strings that are matched against it are treated as UTF-8 + In order process UTF-8 strings, you must build PCRE to include UTF-8 + support in the code, and, in addition, you must call pcre_compile() + with the PCRE_UTF8 option flag. When you do this, both the pattern and + any subject strings that are matched against it are treated as UTF-8 strings instead of just strings of bytes. - If you compile PCRE with UTF-8 support, but do not use it at run time, - the library will be a bit bigger, but the additional run time overhead + If you compile PCRE with UTF-8 support, but do not use it at run time, + the library will be a bit bigger, but the additional run time overhead is limited to testing the PCRE_UTF8 flag occasionally, so should not be very big. If PCRE is built with Unicode character property support (which implies - UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- + UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- ported. The available properties that can be tested are limited to the - general category properties such as Lu for an upper case letter or Nd - for a decimal number, the Unicode script names such as Arabic or Han, - and the derived properties Any and L&. 
A full list is given in the + general category properties such as Lu for an upper case letter or Nd + for a decimal number, the Unicode script names such as Arabic or Han, + and the derived properties Any and L&. A full list is given in the pcrepattern documentation. Only the short names for properties are sup- - ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- - ter}, is not supported. Furthermore, in Perl, many properties may - optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE + ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- + ter}, is not supported. Furthermore, in Perl, many properties may + optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE does not support this. - The following comments apply when PCRE is running in UTF-8 mode: - - 1. When you set the PCRE_UTF8 flag, the strings passed as patterns and - subjects are checked for validity on entry to the relevant functions. - If an invalid UTF-8 string is passed, an error return is given. In some - situations, you may already know that your strings are valid, and - therefore want to skip these checks in order to improve performance. If - you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, - PCRE assumes that the pattern or subject it is given (respectively) - contains only valid UTF-8 codes. In this case, it does not diagnose an - invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when - PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may - crash. - - 2. An unbraced hexadecimal escape sequence (such as \xb3) matches a + Validity of UTF-8 strings + + When you set the PCRE_UTF8 flag, the strings passed as patterns and + subjects are (by default) checked for validity on entry to the relevant + functions. From release 7.3 of PCRE, the check is according the rules + of RFC 3629, which are themselves derived from the Unicode specifica- + tion. 
Earlier releases of PCRE followed the rules of RFC 2279, which + allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current + check allows only values in the range U+0 to U+10FFFF, excluding U+D800 + to U+DFFF. + + The excluded code points are the "Low Surrogate Area" of Unicode, of + which the Unicode Standard says this: "The Low Surrogate Area does not + contain any character assignments, consequently no character code + charts or namelists are provided for this area. Surrogates are reserved + for use with UTF-16 and then must be used in pairs." The code points + that are encoded by UTF-16 pairs are available as independent code + points in the UTF-8 encoding. (In other words, the whole surrogate + thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) + + If an invalid UTF-8 string is passed to PCRE, an error return + (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know + that your strings are valid, and therefore want to skip these checks in + order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at + compile time or at run time, PCRE assumes that the pattern or subject + it is given (respectively) contains only valid UTF-8 codes. In this + case, it does not diagnose an invalid UTF-8 string. + + If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, + what happens depends on why the string is invalid. If the string con- + forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a + string of characters in the range 0 to 0x7FFFFFFF. In other words, + apart from the initial validity test, PCRE (when in UTF-8 mode) handles + strings according to the more liberal rules of RFC 2279. However, if + the string does not even conform to RFC 2279, the result is undefined. + Your program may crash. 
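The stricter validity rule described above can be sketched in plain C. This is an illustration only and makes no claim about how PCRE implements its own check: the helper name utf8_is_valid_rfc3629 is hypothetical and is not part of the PCRE API. It accepts code points U+0 to U+10FFFF, rejects U+D800 to U+DFFF, and also rejects overlong encodings, which RFC 3629 forbids.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns 1 if the first len bytes
   of s are valid UTF-8 under the RFC 3629 rules, 0 otherwise. */
int utf8_is_valid_rfc3629(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest value allowed for this length */
        size_t extra;       /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80) {     /* single-byte (ASCII) character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            /* Stray continuation byte, or a 5/6-byte lead byte that only
               the old RFC 2279 definition allowed. */
            return 0;
        }

        if (len - i <= extra)
            return 0;       /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char d = s[i + k];
            if ((d & 0xC0) != 0x80)
                return 0;   /* not a continuation byte */
            cp = (cp << 6) | (d & 0x3F);
        }

        if (cp < min)
            return 0;       /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0;       /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;       /* surrogate code point */

        i += extra + 1;
    }
    return 1;
}
```

Under these rules the three bytes E1 BB 85 (the character \x{1ec5}) are accepted, while ED A0 80 (U+D800), F4 90 80 80 (one past U+10FFFF), and any 5- or 6-byte RFC 2279 sequence are rejected. A caller that sets PCRE_NO_UTF8_CHECK would need a check of roughly this kind, or a looser one of its own, before handing strings to PCRE.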
+ + If you want to process strings of values in the full range 0 to + 0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can + set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in + this situation, you will have to apply your own validity check. + + General comments about UTF-8 mode + + 1. An unbraced hexadecimal escape sequence (such as \xb3) matches a two-byte UTF-8 character if the value is greater than 127. - 3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 + 2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 characters for values greater than \177. - 4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- + 3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- vidual bytes, for example: \x{100}{3}. - 5. The dot metacharacter matches one UTF-8 character instead of a sin- + 4. The dot metacharacter matches one UTF-8 character instead of a sin- gle byte. - 6. The escape sequence \C can be used to match a single byte in UTF-8 - mode, but its use can lead to some strange effects. This facility is + 5. The escape sequence \C can be used to match a single byte in UTF-8 + mode, but its use can lead to some strange effects. This facility is not available in the alternative matching function, pcre_dfa_exec(). - 7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly - test characters of any code value, but the characters that PCRE recog- - nizes as digits, spaces, or word characters remain the same set as + 6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly + test characters of any code value, but the characters that PCRE recog- + nizes as digits, spaces, or word characters remain the same set as before, all with values less than 256. This remains true even when PCRE - includes Unicode property support, because to do otherwise would slow - down PCRE in many common cases. 
If you really want to test for a wider - sense of, say, "digit", you must use Unicode property tests such as + includes Unicode property support, because to do otherwise would slow + down PCRE in many common cases. If you really want to test for a wider + sense of, say, "digit", you must use Unicode property tests such as \p{Nd}. - 8. Similarly, characters that match the POSIX named character classes + 7. Similarly, characters that match the POSIX named character classes are all low-valued characters. - 9. However, the Perl 5.10 horizontal and vertical whitespace matching + 8. However, the Perl 5.10 horizontal and vertical whitespace matching escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- acters. - 10. Case-insensitive matching applies only to characters whose values - are less than 128, unless PCRE is built with Unicode property support. - Even when Unicode property support is available, PCRE still uses its - own character tables when checking the case of low-valued characters, - so as not to degrade performance. The Unicode property information is + 9. Case-insensitive matching applies only to characters whose values + are less than 128, unless PCRE is built with Unicode property support. + Even when Unicode property support is available, PCRE still uses its + own character tables when checking the case of low-valued characters, + so as not to degrade performance. The Unicode property information is used only for characters with higher values. Even when Unicode property support is available, PCRE supports case-insensitive matching only when - there is a one-to-one mapping between a letter's cases. There are a - small number of many-to-one mappings in Unicode; these are not sup- + there is a one-to-one mapping between a letter's cases. There are a + small number of many-to-one mappings in Unicode; these are not sup- ported by PCRE. @@ -219,14 +249,14 @@ AUTHOR University Computing Service Cambridge CB2 3QH, England. 
- Putting an actual email address here seems to have been a spam magnet, - so I've taken it away. If you want to email me, use my two initials, + Putting an actual email address here seems to have been a spam magnet, + so I've taken it away. If you want to email me, use my two initials, followed by the two digits 10, at the domain cam.ac.uk. REVISION - Last updated: 13 June 2007 + Last updated: 09 August 2007 Copyright (c) 1997-2007 University of Cambridge. ------------------------------------------------------------------------------ @@ -459,13 +489,14 @@ USING EBCDIC CODE PCRE assumes by default that it will run in an environment where the character code is ASCII (or Unicode, which is a superset of ASCII). - PCRE can, however, be compiled to run in an EBCDIC environment by - adding + This is the case for most computer operating systems. PCRE can, how- + ever, be compiled to run in an EBCDIC environment by adding --enable-ebcdic to the configure command. This setting implies --enable-rebuild-charta- - bles. + bles. You should only use it if you know that you are in an EBCDIC + environment (for example, an IBM mainframe operating system). SEE ALSO @@ -482,7 +513,7 @@ AUTHOR REVISION - Last updated: 05 June 2007 + Last updated: 30 July 2007 Copyright (c) 1997-2007 University of Cambridge. ------------------------------------------------------------------------------ @@ -626,31 +657,34 @@ THE ALTERNATIVE MATCHING ALGORITHM 6. Callouts are supported, but the value of the capture_top field is always 1, and the value of the capture_last field is always -1. - 7. The \C escape sequence, which (in the standard algorithm) matches a + 7. The \C escape sequence, which (in the standard algorithm) matches a single byte, even in UTF-8 mode, is not supported because the alterna- tive algorithm moves through the subject string one character at a time, for all active paths through the tree. + 8. None of the backtracking control verbs such as (*PRUNE) are sup- + ported. 
+ ADVANTAGES OF THE ALTERNATIVE ALGORITHM - Using the alternative matching algorithm provides the following advan- + Using the alternative matching algorithm provides the following advan- tages: 1. All possible matches (at a single point in the subject) are automat- - ically found, and in particular, the longest match is found. To find + ically found, and in particular, the longest match is found. To find more than one match using the standard algorithm, you have to do kludgy things with callouts. - 2. There is much better support for partial matching. The restrictions - on the content of the pattern that apply when using the standard algo- - rithm for partial matching do not apply to the alternative algorithm. - For non-anchored patterns, the starting position of a partial match is + 2. There is much better support for partial matching. The restrictions + on the content of the pattern that apply when using the standard algo- + rithm for partial matching do not apply to the alternative algorithm. + For non-anchored patterns, the starting position of a partial match is available. - 3. Because the alternative algorithm scans the subject string just - once, and never needs to backtrack, it is possible to pass very long - subject strings to the matching function in several pieces, checking + 3. Because the alternative algorithm scans the subject string just + once, and never needs to backtrack, it is possible to pass very long + subject strings to the matching function in several pieces, checking for partial matching each time. @@ -658,8 +692,8 @@ DISADVANTAGES OF THE ALTERNATIVE ALGORITHM The alternative algorithm suffers from a number of disadvantages: - 1. It is substantially slower than the standard algorithm. This is - partly because it has to search for all possible matches, but is also + 1. It is substantially slower than the standard algorithm. 
This is + partly because it has to search for all possible matches, but is also because it is less susceptible to optimization. 2. Capturing parentheses and back references are not supported. @@ -677,7 +711,7 @@ AUTHOR REVISION - Last updated: 29 May 2007 + Last updated: 08 August 2007 Copyright (c) 1997-2007 University of Cambridge. ------------------------------------------------------------------------------ @@ -874,13 +908,19 @@ NEWLINES dard. When PCRE is run, the default can be overridden, either when a pattern is compiled, or when it is matched. + At compile time, the newline convention can be specified by the options + argument of pcre_compile(), or it can be specified by special text at + the start of the pattern itself; this overrides any other settings. See + the pcrepattern page for details of the special character sequences. + In the PCRE documentation the word "newline" is used to mean "the char- - acter or pair of characters that indicate a line break". The choice of - newline convention affects the handling of the dot, circumflex, and + acter or pair of characters that indicate a line break". The choice of + newline convention affects the handling of the dot, circumflex, and dollar metacharacters, the handling of #-comments in /x mode, and, when - CRLF is a recognized line ending sequence, the match position advance- - ment for a non-anchored pattern. The choice of newline convention does - not affect the interpretation of the \n or \r escape sequences. + CRLF is a recognized line ending sequence, the match position advance- + ment for a non-anchored pattern. There is more detail about this in the + section on pcre_exec() options below. The choice of newline convention + does not affect the interpretation of the \n or \r escape sequences. MULTITHREADING @@ -1221,21 +1261,22 @@ COMPILING A PATTERN PCRE_NO_UTF8_CHECK When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is - automatically checked. 
If an invalid UTF-8 sequence of bytes is found, - pcre_compile() returns an error. If you already know that your pattern - is valid, and you want to skip this check for performance reasons, you - can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of - passing an invalid UTF-8 string as a pattern is undefined. It may cause - your program to crash. Note that this option can also be passed to - pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity check- - ing of subject strings. + automatically checked. There is a discussion about the validity of + UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of + bytes is found, pcre_compile() returns an error. If you already know + that your pattern is valid, and you want to skip this check for perfor- + mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is + set, the effect of passing an invalid UTF-8 string as a pattern is + undefined. It may cause your program to crash. Note that this option + can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the + UTF-8 validity checking of subject strings. COMPILATION ERROR CODES - The following table lists the error codes than may be returned by - pcre_compile2(), along with the error messages that may be returned by - both compiling functions. As PCRE has developed, some error codes have + The following table lists the error codes than may be returned by + pcre_compile2(), along with the error messages that may be returned by + both compiling functions. As PCRE has developed, some error codes have fallen out of use. To avoid confusion, they have not been re-used. 
0 no error @@ -1288,10 +1329,10 @@ COMPILATION ERROR CODES 47 unknown property name after \P or \p 48 subpattern name is too long (maximum 32 characters) 49 too many named subpatterns (maximum 10,000) - 50 repeated subpattern is too long + 50 [this code is not in use] 51 octal value is greater than \377 (not in UTF-8 mode) 52 internal error: overran compiling workspace - 53 internal error: previously-checked referenced subpattern not + 53 internal error: previously-checked referenced subpattern not found 54 DEFINE group contains more than one branch 55 repeating a DEFINE group is not allowed @@ -1306,32 +1347,32 @@ STUDYING A PATTERN pcre_extra *pcre_study(const pcre *code, int options const char **errptr); - If a compiled pattern is going to be used several times, it is worth + If a compiled pattern is going to be used several times, it is worth spending more time analyzing it in order to speed up the time taken for - matching. The function pcre_study() takes a pointer to a compiled pat- + matching. The function pcre_study() takes a pointer to a compiled pat- tern as its first argument. If studying the pattern produces additional - information that will help speed up matching, pcre_study() returns a - pointer to a pcre_extra block, in which the study_data field points to + information that will help speed up matching, pcre_study() returns a + pointer to a pcre_extra block, in which the study_data field points to the results of the study. The returned value from pcre_study() can be passed directly to - pcre_exec(). However, a pcre_extra block also contains other fields - that can be set by the caller before the block is passed; these are + pcre_exec(). However, a pcre_extra block also contains other fields + that can be set by the caller before the block is passed; these are described below in the section on matching a pattern. 
- If studying the pattern does not produce any additional information + If studying the pattern does not produce any additional information pcre_study() returns NULL. In that circumstance, if the calling program - wants to pass any of the other fields to pcre_exec(), it must set up + wants to pass any of the other fields to pcre_exec(), it must set up its own pcre_extra block. - The second argument of pcre_study() contains option bits. At present, + The second argument of pcre_study() contains option bits. At present, no options are defined, and this argument should always be zero. - The third argument for pcre_study() is a pointer for an error message. - If studying succeeds (even if no data is returned), the variable it - points to is set to NULL. Otherwise it is set to point to a textual + The third argument for pcre_study() is a pointer for an error message. + If studying succeeds (even if no data is returned), the variable it + points to is set to NULL. Otherwise it is set to point to a textual error message. This is a static string that is part of the library. You - must not try to free it. You should test the error pointer for NULL + must not try to free it. You should test the error pointer for NULL after calling pcre_study(), to be sure that it has run successfully. This is a typical call to pcre_study(): @@ -1343,62 +1384,62 @@ STUDYING A PATTERN &error); /* set to NULL or points to a message */ At present, studying a pattern is useful only for non-anchored patterns - that do not have a single fixed starting character. A bitmap of possi- + that do not have a single fixed starting character. A bitmap of possi- ble starting bytes is created. LOCALE SUPPORT - PCRE handles caseless matching, and determines whether characters are - letters, digits, or whatever, by reference to a set of tables, indexed - by character value. When running in UTF-8 mode, this applies only to - characters with codes less than 128. 
Higher-valued codes never match - escapes such as \w or \d, but can be tested with \p if PCRE is built - with Unicode character property support. The use of locales with Uni- - code is discouraged. If you are handling characters with codes greater - than 128, you should either use UTF-8 and Unicode, or use locales, but + PCRE handles caseless matching, and determines whether characters are + letters, digits, or whatever, by reference to a set of tables, indexed + by character value. When running in UTF-8 mode, this applies only to + characters with codes less than 128. Higher-valued codes never match + escapes such as \w or \d, but can be tested with \p if PCRE is built + with Unicode character property support. The use of locales with Uni- + code is discouraged. If you are handling characters with codes greater + than 128, you should either use UTF-8 and Unicode, or use locales, but not try to mix the two. - PCRE contains an internal set of tables that are used when the final - argument of pcre_compile() is NULL. These are sufficient for many + PCRE contains an internal set of tables that are used when the final + argument of pcre_compile() is NULL. These are sufficient for many applications. Normally, the internal tables recognize only ASCII char- acters. However, when PCRE is built, it is possible to cause the inter- nal tables to be rebuilt in the default "C" locale of the local system, which may cause them to be different. - The internal tables can always be overridden by tables supplied by the + The internal tables can always be overridden by tables supplied by the application that calls PCRE. These may be created in a different locale - from the default. As more and more applications change to using Uni- + from the default. As more and more applications change to using Uni- code, the need for this locale support is expected to die away. - External tables are built by calling the pcre_maketables() function, - which has no arguments, in the relevant locale. 
The result can then be - passed to pcre_compile() or pcre_exec() as often as necessary. For - example, to build and use tables that are appropriate for the French - locale (where accented characters with values greater than 128 are + External tables are built by calling the pcre_maketables() function, + which has no arguments, in the relevant locale. The result can then be + passed to pcre_compile() or pcre_exec() as often as necessary. For + example, to build and use tables that are appropriate for the French + locale (where accented characters with values greater than 128 are treated as letters), the following code could be used: setlocale(LC_CTYPE, "fr_FR"); tables = pcre_maketables(); re = pcre_compile(..., tables); - The locale name "fr_FR" is used on Linux and other Unix-like systems; + The locale name "fr_FR" is used on Linux and other Unix-like systems; if you are using Windows, the name for the French locale is "french". - When pcre_maketables() runs, the tables are built in memory that is - obtained via pcre_malloc. It is the caller's responsibility to ensure - that the memory containing the tables remains available for as long as + When pcre_maketables() runs, the tables are built in memory that is + obtained via pcre_malloc. It is the caller's responsibility to ensure + that the memory containing the tables remains available for as long as it is needed. The pointer that is passed to pcre_compile() is saved with the compiled - pattern, and the same tables are used via this pointer by pcre_study() + pattern, and the same tables are used via this pointer by pcre_study() and normally also by pcre_exec(). Thus, by default, for any single pat- tern, compilation, studying and matching all happen in the same locale, but different patterns can be compiled in different locales. - It is possible to pass a table pointer or NULL (indicating the use of - the internal tables) to pcre_exec(). 
Although not intended for this - purpose, this facility could be used to match a pattern in a different + It is possible to pass a table pointer or NULL (indicating the use of + the internal tables) to pcre_exec(). Although not intended for this + purpose, this facility could be used to match a pattern in a different locale from the one in which it was compiled. Passing table pointers at run time is discussed below in the section on matching a pattern. @@ -1408,15 +1449,15 @@ INFORMATION ABOUT A PATTERN int pcre_fullinfo(const pcre *code, const pcre_extra *extra, int what, void *where); - The pcre_fullinfo() function returns information about a compiled pat- + The pcre_fullinfo() function returns information about a compiled pat- tern. It replaces the obsolete pcre_info() function, which is neverthe- less retained for backwards compability (and is documented below). - The first argument for pcre_fullinfo() is a pointer to the compiled - pattern. The second argument is the result of pcre_study(), or NULL if - the pattern was not studied. The third argument specifies which piece - of information is required, and the fourth argument is a pointer to a - variable to receive the data. The yield of the function is zero for + The first argument for pcre_fullinfo() is a pointer to the compiled + pattern. The second argument is the result of pcre_study(), or NULL if + the pattern was not studied. The third argument specifies which piece + of information is required, and the fourth argument is a pointer to a + variable to receive the data. The yield of the function is zero for success, or one of the following negative numbers: PCRE_ERROR_NULL the argument code was NULL @@ -1424,9 +1465,9 @@ INFORMATION ABOUT A PATTERN PCRE_ERROR_BADMAGIC the "magic number" was not found PCRE_ERROR_BADOPTION the value of what was invalid - The "magic number" is placed at the start of each compiled pattern as - an simple check against passing an arbitrary memory pointer. 
Here is a - typical call of pcre_fullinfo(), to obtain the length of the compiled + The "magic number" is placed at the start of each compiled pattern as + an simple check against passing an arbitrary memory pointer. Here is a + typical call of pcre_fullinfo(), to obtain the length of the compiled pattern: int rc; @@ -1437,69 +1478,75 @@ INFORMATION ABOUT A PATTERN PCRE_INFO_SIZE, /* what is required */ &length); /* where to put the data */ - The possible values for the third argument are defined in pcre.h, and + The possible values for the third argument are defined in pcre.h, and are as follows: PCRE_INFO_BACKREFMAX - Return the number of the highest back reference in the pattern. The - fourth argument should point to an int variable. Zero is returned if + Return the number of the highest back reference in the pattern. The + fourth argument should point to an int variable. Zero is returned if there are no back references. PCRE_INFO_CAPTURECOUNT - Return the number of capturing subpatterns in the pattern. The fourth + Return the number of capturing subpatterns in the pattern. The fourth argument should point to an int variable. PCRE_INFO_DEFAULT_TABLES - Return a pointer to the internal default character tables within PCRE. - The fourth argument should point to an unsigned char * variable. This + Return a pointer to the internal default character tables within PCRE. + The fourth argument should point to an unsigned char * variable. This information call is provided for internal use by the pcre_study() func- - tion. External callers can cause PCRE to use its internal tables by + tion. External callers can cause PCRE to use its internal tables by passing a NULL table pointer. PCRE_INFO_FIRSTBYTE - Return information about the first byte of any matched string, for a - non-anchored pattern. The fourth argument should point to an int vari- - able. 
(This option used to be called PCRE_INFO_FIRSTCHAR; the old name + Return information about the first byte of any matched string, for a + non-anchored pattern. The fourth argument should point to an int vari- + able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name is still recognized for backwards compatibility.) - If there is a fixed first byte, for example, from a pattern such as + If there is a fixed first byte, for example, from a pattern such as (cat|cow|coyote), its value is returned. Otherwise, if either - (a) the pattern was compiled with the PCRE_MULTILINE option, and every + (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch starts with "^", or (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set (if it were set, the pattern would be anchored), - -1 is returned, indicating that the pattern matches only at the start - of a subject string or after any newline within the string. Otherwise + -1 is returned, indicating that the pattern matches only at the start + of a subject string or after any newline within the string. Otherwise -2 is returned. For anchored patterns, -2 is returned. PCRE_INFO_FIRSTTABLE - If the pattern was studied, and this resulted in the construction of a + If the pattern was studied, and this resulted in the construction of a 256-bit table indicating a fixed set of bytes for the first byte in any - matching string, a pointer to the table is returned. Otherwise NULL is - returned. The fourth argument should point to an unsigned char * vari- + matching string, a pointer to the table is returned. Otherwise NULL is + returned. The fourth argument should point to an unsigned char * vari- able. + PCRE_INFO_HASCRORLF + + Return 1 if the pattern contains any explicit matches for CR or LF + characters, otherwise 0. The fourth argument should point to an int + variable. 
+ PCRE_INFO_JCHANGED - Return 1 if the (?J) option setting is used in the pattern, otherwise + Return 1 if the (?J) option setting is used in the pattern, otherwise 0. The fourth argument should point to an int variable. The (?J) inter- nal option setting changes the local PCRE_DUPNAMES option. PCRE_INFO_LASTLITERAL - Return the value of the rightmost literal byte that must exist in any - matched string, other than at its start, if such a byte has been + Return the value of the rightmost literal byte that must exist in any + matched string, other than at its start, if such a byte has been recorded. The fourth argument should point to an int variable. If there - is no such byte, -1 is returned. For anchored patterns, a last literal - byte is recorded only if it follows something of variable length. For + is no such byte, -1 is returned. For anchored patterns, a last literal + byte is recorded only if it follows something of variable length. For example, for the pattern /^a\d+z\d+/ the returned value is "z", but for /^a\dz\d/ the returned value is -1. @@ -1507,34 +1554,34 @@ INFORMATION ABOUT A PATTERN PCRE_INFO_NAMEENTRYSIZE PCRE_INFO_NAMETABLE - PCRE supports the use of named as well as numbered capturing parenthe- - ses. The names are just an additional way of identifying the parenthe- + PCRE supports the use of named as well as numbered capturing parenthe- + ses. The names are just an additional way of identifying the parenthe- ses, which still acquire numbers. Several convenience functions such as - pcre_get_named_substring() are provided for extracting captured sub- - strings by name. It is also possible to extract the data directly, by - first converting the name to a number in order to access the correct + pcre_get_named_substring() are provided for extracting captured sub- + strings by name. 
It is also possible to extract the data directly, by + first converting the name to a number in order to access the correct pointers in the output vector (described with pcre_exec() below). To do - the conversion, you need to use the name-to-number map, which is + the conversion, you need to use the name-to-number map, which is described by these three values. The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size - of each entry; both of these return an int value. The entry size - depends on the length of the longest name. PCRE_INFO_NAMETABLE returns - a pointer to the first entry of the table (a pointer to char). The + of each entry; both of these return an int value. The entry size + depends on the length of the longest name. PCRE_INFO_NAMETABLE returns + a pointer to the first entry of the table (a pointer to char). The first two bytes of each entry are the number of the capturing parenthe- - sis, most significant byte first. The rest of the entry is the corre- - sponding name, zero terminated. The names are in alphabetical order. + sis, most significant byte first. The rest of the entry is the corre- + sponding name, zero terminated. The names are in alphabetical order. When PCRE_DUPNAMES is set, duplicate names are in order of their paren- - theses numbers. For example, consider the following pattern (assume - PCRE_EXTENDED is set, so white space - including newlines - is + theses numbers. For example, consider the following pattern (assume + PCRE_EXTENDED is set, so white space - including newlines - is ignored): (?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) ) - There are four named subpatterns, so the table has four entries, and - each entry in the table is eight bytes long. The table is as follows, + There are four named subpatterns, so the table has four entries, and + each entry in the table is eight bytes long. 
The table is as follows, with non-printing bytes shows in hexadecimal, and undefined bytes shown as ??: @@ -1543,25 +1590,29 @@ INFORMATION ABOUT A PATTERN 00 04 m o n t h 00 00 02 y e a r 00 ?? - When writing code to extract data from named subpatterns using the - name-to-number map, remember that the length of the entries is likely + When writing code to extract data from named subpatterns using the + name-to-number map, remember that the length of the entries is likely to be different for each compiled pattern. PCRE_INFO_OKPARTIAL - Return 1 if the pattern can be used for partial matching, otherwise 0. - The fourth argument should point to an int variable. The pcrepartial - documentation lists the restrictions that apply to patterns when par- + Return 1 if the pattern can be used for partial matching, otherwise 0. + The fourth argument should point to an int variable. The pcrepartial + documentation lists the restrictions that apply to patterns when par- tial matching is used. PCRE_INFO_OPTIONS - Return a copy of the options with which the pattern was compiled. The - fourth argument should point to an unsigned long int variable. These + Return a copy of the options with which the pattern was compiled. The + fourth argument should point to an unsigned long int variable. These option bits are those specified in the call to pcre_compile(), modified - by any top-level option settings within the pattern itself. + by any top-level option settings at the start of the pattern itself. In + other words, they are the options that will be in force when matching + starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with + the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, + and PCRE_EXTENDED. 
- A pattern is automatically anchored by PCRE if all of its top-level + A pattern is automatically anchored by PCRE if all of its top-level alternatives begin with one of the following: ^ unless PCRE_MULTILINE is set @@ -1575,7 +1626,7 @@ INFORMATION ABOUT A PATTERN PCRE_INFO_SIZE - Return the size of the compiled pattern, that is, the value that was + Return the size of the compiled pattern, that is, the value that was passed as the argument to pcre_malloc() when PCRE was getting memory in which to place the compiled data. The fourth argument should point to a size_t variable. @@ -1583,9 +1634,9 @@ INFORMATION ABOUT A PATTERN PCRE_INFO_STUDYSIZE Return the size of the data block pointed to by the study_data field in - a pcre_extra block. That is, it is the value that was passed to + a pcre_extra block. That is, it is the value that was passed to pcre_malloc() when PCRE was getting memory into which to place the data - created by pcre_study(). The fourth argument should point to a size_t + created by pcre_study(). The fourth argument should point to a size_t variable. @@ -1593,21 +1644,21 @@ OBSOLETE INFO FUNCTION int pcre_info(const pcre *code, int *optptr, int *firstcharptr); - The pcre_info() function is now obsolete because its interface is too - restrictive to return all the available data about a compiled pattern. - New programs should use pcre_fullinfo() instead. The yield of - pcre_info() is the number of capturing subpatterns, or one of the fol- + The pcre_info() function is now obsolete because its interface is too + restrictive to return all the available data about a compiled pattern. + New programs should use pcre_fullinfo() instead. 
The yield of + pcre_info() is the number of capturing subpatterns, or one of the fol- lowing negative numbers: PCRE_ERROR_NULL the argument code was NULL PCRE_ERROR_BADMAGIC the "magic number" was not found - If the optptr argument is not NULL, a copy of the options with which - the pattern was compiled is placed in the integer it points to (see + If the optptr argument is not NULL, a copy of the options with which + the pattern was compiled is placed in the integer it points to (see PCRE_INFO_OPTIONS above). - If the pattern is not anchored and the firstcharptr argument is not - NULL, it is used to pass back information about the first character of + If the pattern is not anchored and the firstcharptr argument is not + NULL, it is used to pass back information about the first character of any matched string (see PCRE_INFO_FIRSTBYTE above). @@ -1615,21 +1666,21 @@ REFERENCE COUNTS int pcre_refcount(pcre *code, int adjust); - The pcre_refcount() function is used to maintain a reference count in + The pcre_refcount() function is used to maintain a reference count in the data block that contains a compiled pattern. It is provided for the - benefit of applications that operate in an object-oriented manner, + benefit of applications that operate in an object-oriented manner, where different parts of the application may be using the same compiled pattern, but you want to free the block when they are all done. When a pattern is compiled, the reference count field is initialized to - zero. It is changed only by calling this function, whose action is to - add the adjust value (which may be positive or negative) to it. The + zero. It is changed only by calling this function, whose action is to + add the adjust value (which may be positive or negative) to it. The yield of the function is the new value. However, the value of the count - is constrained to lie between 0 and 65535, inclusive. If the new value + is constrained to lie between 0 and 65535, inclusive. 
If the new value is outside these limits, it is forced to the appropriate limit value. - Except when it is zero, the reference count is not correctly preserved - if a pattern is compiled on one host and then transferred to a host + Except when it is zero, the reference count is not correctly preserved + if a pattern is compiled on one host and then transferred to a host whose byte-order is different. (This seems a highly unlikely scenario.) @@ -1639,18 +1690,18 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION const char *subject, int length, int startoffset, int options, int *ovector, int ovecsize); - The function pcre_exec() is called to match a subject string against a - compiled pattern, which is passed in the code argument. If the pattern + The function pcre_exec() is called to match a subject string against a + compiled pattern, which is passed in the code argument. If the pattern has been studied, the result of the study should be passed in the extra - argument. This function is the main matching facility of the library, + argument. This function is the main matching facility of the library, and it operates in a Perl-like manner. For specialist use there is also - an alternative matching function, which is described below in the sec- + an alternative matching function, which is described below in the sec- tion about the pcre_dfa_exec() function. - In most applications, the pattern will have been compiled (and option- - ally studied) in the same process that calls pcre_exec(). However, it + In most applications, the pattern will have been compiled (and option- + ally studied) in the same process that calls pcre_exec(). However, it is possible to save compiled patterns and study data, and then use them - later in different processes, possibly even on different hosts. For a + later in different processes, possibly even on different hosts. For a discussion about this, see the pcreprecompile documentation. 
Here is an example of a simple call to pcre_exec(): @@ -1669,10 +1720,10 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION Extra data for pcre_exec() - If the extra argument is not NULL, it must point to a pcre_extra data - block. The pcre_study() function returns such a block (when it doesn't - return NULL), but you can also create one for yourself, and pass addi- - tional information in it. The pcre_extra block contains the following + If the extra argument is not NULL, it must point to a pcre_extra data + block. The pcre_study() function returns such a block (when it doesn't + return NULL), but you can also create one for yourself, and pass addi- + tional information in it. The pcre_extra block contains the following fields (not necessarily in this order): unsigned long int flags; @@ -1682,7 +1733,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION void *callout_data; const unsigned char *tables; - The flags field is a bitmap that specifies which of the other fields + The flags field is a bitmap that specifies which of the other fields are set. The flag bits are: PCRE_EXTRA_STUDY_DATA @@ -1691,75 +1742,75 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION PCRE_EXTRA_CALLOUT_DATA PCRE_EXTRA_TABLES - Other flag bits should be set to zero. The study_data field is set in - the pcre_extra block that is returned by pcre_study(), together with + Other flag bits should be set to zero. The study_data field is set in + the pcre_extra block that is returned by pcre_study(), together with the appropriate flag bit. You should not set this yourself, but you may - add to the block by setting the other fields and their corresponding + add to the block by setting the other fields and their corresponding flag bits. The match_limit field provides a means of preventing PCRE from using up - a vast amount of resources when running patterns that are not going to - match, but which have a very large number of possibilities in their - search trees. 
The classic example is the use of nested unlimited + a vast amount of resources when running patterns that are not going to + match, but which have a very large number of possibilities in their + search trees. The classic example is the use of nested unlimited repeats. - Internally, PCRE uses a function called match() which it calls repeat- - edly (sometimes recursively). The limit set by match_limit is imposed - on the number of times this function is called during a match, which - has the effect of limiting the amount of backtracking that can take + Internally, PCRE uses a function called match() which it calls repeat- + edly (sometimes recursively). The limit set by match_limit is imposed + on the number of times this function is called during a match, which + has the effect of limiting the amount of backtracking that can take place. For patterns that are not anchored, the count restarts from zero for each position in the subject string. - The default value for the limit can be set when PCRE is built; the - default default is 10 million, which handles all but the most extreme - cases. You can override the default by suppling pcre_exec() with a - pcre_extra block in which match_limit is set, and - PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is + The default value for the limit can be set when PCRE is built; the + default default is 10 million, which handles all but the most extreme + cases. You can override the default by suppling pcre_exec() with a + pcre_extra block in which match_limit is set, and + PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. - The match_limit_recursion field is similar to match_limit, but instead + The match_limit_recursion field is similar to match_limit, but instead of limiting the total number of times that match() is called, it limits - the depth of recursion. 
The recursion depth is a smaller number than - the total number of calls, because not all calls to match() are recur- + the depth of recursion. The recursion depth is a smaller number than + the total number of calls, because not all calls to match() are recur- sive. This limit is of use only if it is set smaller than match_limit. - Limiting the recursion depth limits the amount of stack that can be + Limiting the recursion depth limits the amount of stack that can be used, or, when PCRE has been compiled to use memory on the heap instead of the stack, the amount of heap memory that can be used. - The default value for match_limit_recursion can be set when PCRE is - built; the default default is the same value as the default for - match_limit. You can override the default by suppling pcre_exec() with - a pcre_extra block in which match_limit_recursion is set, and - PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the + The default value for match_limit_recursion can be set when PCRE is + built; the default default is the same value as the default for + match_limit. You can override the default by suppling pcre_exec() with + a pcre_extra block in which match_limit_recursion is set, and + PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. - The pcre_callout field is used in conjunction with the "callout" fea- + The pcre_callout field is used in conjunction with the "callout" fea- ture, which is described in the pcrecallout documentation. - The tables field is used to pass a character tables pointer to - pcre_exec(); this overrides the value that is stored with the compiled - pattern. A non-NULL value is stored with the compiled pattern only if - custom tables were supplied to pcre_compile() via its tableptr argu- + The tables field is used to pass a character tables pointer to + pcre_exec(); this overrides the value that is stored with the compiled + pattern. 
A non-NULL value is stored with the compiled pattern only if + custom tables were supplied to pcre_compile() via its tableptr argu- ment. If NULL is passed to pcre_exec() using this mechanism, it forces - PCRE's internal tables to be used. This facility is helpful when re- - using patterns that have been saved after compiling with an external - set of tables, because the external tables might be at a different - address when pcre_exec() is called. See the pcreprecompile documenta- + PCRE's internal tables to be used. This facility is helpful when re- + using patterns that have been saved after compiling with an external + set of tables, because the external tables might be at a different + address when pcre_exec() is called. See the pcreprecompile documenta- tion for a discussion of saving compiled patterns for later use. Option bits for pcre_exec() - The unused bits of the options argument for pcre_exec() must be zero. - The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, + The unused bits of the options argument for pcre_exec() must be zero. + The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. PCRE_ANCHORED - The PCRE_ANCHORED option limits pcre_exec() to matching at the first - matching position. If a pattern was compiled with PCRE_ANCHORED, or - turned out to be anchored by virtue of its contents, it cannot be made + The PCRE_ANCHORED option limits pcre_exec() to matching at the first + matching position. If a pattern was compiled with PCRE_ANCHORED, or + turned out to be anchored by virtue of its contents, it cannot be made unachored at matching time. PCRE_NEWLINE_CR @@ -1768,16 +1819,33 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION PCRE_NEWLINE_ANYCRLF PCRE_NEWLINE_ANY - These options override the newline definition that was chosen or - defaulted when the pattern was compiled. For details, see the descrip- - tion of pcre_compile() above. 
During matching, the newline choice - affects the behaviour of the dot, circumflex, and dollar metacharac- - ters. It may also alter the way the match position is advanced after a - match failure for an unanchored pattern. When PCRE_NEWLINE_CRLF, - PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a match attempt - fails when the current position is at a CRLF sequence, the match posi- - tion is advanced by two characters instead of one, in other words, to - after the CRLF. + These options override the newline definition that was chosen or + defaulted when the pattern was compiled. For details, see the descrip- + tion of pcre_compile() above. During matching, the newline choice + affects the behaviour of the dot, circumflex, and dollar metacharac- + ters. It may also alter the way the match position is advanced after a + match failure for an unanchored pattern. + + When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is + set, and a match attempt for an unanchored pattern fails when the cur- + rent position is at a CRLF sequence, and the pattern contains no + explicit matches for CR or NL characters, the match position is + advanced by two characters instead of one, in other words, to after the + CRLF. + + The above rule is a compromise that makes the most common cases work as + expected. For example, if the pattern is .+A (and the PCRE_DOTALL + option is not set), it does not match the string "\r\nA" because, after + failing at the start, it skips both the CR and the LF before retrying. + However, the pattern [\r\n]A does match that string, because it con- + tains an explicit CR or LF reference, and so advances only by one char- + acter after the first failure. Note than an explicit CR or LF refer- + ence occurs for negated character classes such as [^X] because they can + match CR or LF characters. 
+ + Notwithstanding the above, anomalous effects may still occur when CRLF + is a valid newline sequence and explicit \r or \n escapes appear in the + pattern. PCRE_NOTBOL @@ -1824,140 +1892,141 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION When PCRE_UTF8 is set at compile time, the validity of the subject as a UTF-8 string is automatically checked when pcre_exec() is subsequently called. The value of startoffset is also checked to ensure that it - points to the start of a UTF-8 character. If an invalid UTF-8 sequence - of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If - startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is - returned. - - If you already know that your subject is valid, and you want to skip - these checks for performance reasons, you can set the - PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to - do this for the second and subsequent calls to pcre_exec() if you are - making repeated calls to find all the matches in a single subject - string. However, you should be sure that the value of startoffset - points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is - set, the effect of passing an invalid UTF-8 string as a subject, or a - value of startoffset that does not point to the start of a UTF-8 char- + points to the start of a UTF-8 character. There is a discussion about + the validity of UTF-8 strings in the section on UTF-8 support in the + main pcre page. If an invalid UTF-8 sequence of bytes is found, + pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con- + tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned. + + If you already know that your subject is valid, and you want to skip + these checks for performance reasons, you can set the + PCRE_NO_UTF8_CHECK option when calling pcre_exec(). 
You might want to + do this for the second and subsequent calls to pcre_exec() if you are + making repeated calls to find all the matches in a single subject + string. However, you should be sure that the value of startoffset + points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is + set, the effect of passing an invalid UTF-8 string as a subject, or a + value of startoffset that does not point to the start of a UTF-8 char- acter, is undefined. Your program may crash. PCRE_PARTIAL - This option turns on the partial matching feature. If the subject - string fails to match the pattern, but at some point during the match- - ing process the end of the subject was reached (that is, the subject - partially matches the pattern and the failure to match occurred only - because there were not enough subject characters), pcre_exec() returns - PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is - used, there are restrictions on what may appear in the pattern. These + This option turns on the partial matching feature. If the subject + string fails to match the pattern, but at some point during the match- + ing process the end of the subject was reached (that is, the subject + partially matches the pattern and the failure to match occurred only + because there were not enough subject characters), pcre_exec() returns + PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is + used, there are restrictions on what may appear in the pattern. These are discussed in the pcrepartial documentation. The string to be matched by pcre_exec() - The subject string is passed to pcre_exec() as a pointer in subject, a - length in length, and a starting byte offset in startoffset. In UTF-8 - mode, the byte offset must point to the start of a UTF-8 character. - Unlike the pattern string, the subject may contain binary zero bytes. 
- When the starting offset is zero, the search for a match starts at the + The subject string is passed to pcre_exec() as a pointer in subject, a + length in length, and a starting byte offset in startoffset. In UTF-8 + mode, the byte offset must point to the start of a UTF-8 character. + Unlike the pattern string, the subject may contain binary zero bytes. + When the starting offset is zero, the search for a match starts at the beginning of the subject, and this is by far the most common case. - A non-zero starting offset is useful when searching for another match - in the same subject by calling pcre_exec() again after a previous suc- - cess. Setting startoffset differs from just passing over a shortened - string and setting PCRE_NOTBOL in the case of a pattern that begins + A non-zero starting offset is useful when searching for another match + in the same subject by calling pcre_exec() again after a previous suc- + cess. Setting startoffset differs from just passing over a shortened + string and setting PCRE_NOTBOL in the case of a pattern that begins with any kind of lookbehind. For example, consider the pattern \Biss\B - which finds occurrences of "iss" in the middle of words. (\B matches - only if the current position in the subject is not a word boundary.) - When applied to the string "Mississipi" the first call to pcre_exec() - finds the first occurrence. If pcre_exec() is called again with just - the remainder of the subject, namely "issipi", it does not match, + which finds occurrences of "iss" in the middle of words. (\B matches + only if the current position in the subject is not a word boundary.) + When applied to the string "Mississipi" the first call to pcre_exec() + finds the first occurrence. If pcre_exec() is called again with just + the remainder of the subject, namely "issipi", it does not match, because \B is always false at the start of the subject, which is deemed - to be a word boundary. 
However, if pcre_exec() is passed the entire + to be a word boundary. However, if pcre_exec() is passed the entire string again, but with startoffset set to 4, it finds the second occur- - rence of "iss" because it is able to look behind the starting point to + rence of "iss" because it is able to look behind the starting point to discover that it is preceded by a letter. - If a non-zero starting offset is passed when the pattern is anchored, + If a non-zero starting offset is passed when the pattern is anchored, one attempt to match at the given offset is made. This can only succeed - if the pattern does not require the match to be at the start of the + if the pattern does not require the match to be at the start of the subject. How pcre_exec() returns captured substrings - In general, a pattern matches a certain portion of the subject, and in - addition, further substrings from the subject may be picked out by - parts of the pattern. Following the usage in Jeffrey Friedl's book, - this is called "capturing" in what follows, and the phrase "capturing - subpattern" is used for a fragment of a pattern that picks out a sub- - string. PCRE supports several other kinds of parenthesized subpattern + In general, a pattern matches a certain portion of the subject, and in + addition, further substrings from the subject may be picked out by + parts of the pattern. Following the usage in Jeffrey Friedl's book, + this is called "capturing" in what follows, and the phrase "capturing + subpattern" is used for a fragment of a pattern that picks out a sub- + string. PCRE supports several other kinds of parenthesized subpattern that do not cause substrings to be captured. - Captured substrings are returned to the caller via a vector of integer - offsets whose address is passed in ovector. The number of elements in - the vector is passed in ovecsize, which must be a non-negative number. 
+ Captured substrings are returned to the caller via a vector of integer + offsets whose address is passed in ovector. The number of elements in + the vector is passed in ovecsize, which must be a non-negative number. Note: this argument is NOT the size of ovector in bytes. - The first two-thirds of the vector is used to pass back captured sub- - strings, each substring using a pair of integers. The remaining third - of the vector is used as workspace by pcre_exec() while matching cap- - turing subpatterns, and is not available for passing back information. - The length passed in ovecsize should always be a multiple of three. If + The first two-thirds of the vector is used to pass back captured sub- + strings, each substring using a pair of integers. The remaining third + of the vector is used as workspace by pcre_exec() while matching cap- + turing subpatterns, and is not available for passing back information. + The length passed in ovecsize should always be a multiple of three. If it is not, it is rounded down. - When a match is successful, information about captured substrings is - returned in pairs of integers, starting at the beginning of ovector, - and continuing up to two-thirds of its length at the most. The first + When a match is successful, information about captured substrings is + returned in pairs of integers, starting at the beginning of ovector, + and continuing up to two-thirds of its length at the most. The first element of a pair is set to the offset of the first character in a sub- - string, and the second is set to the offset of the first character - after the end of a substring. The first pair, ovector[0] and ovec- - tor[1], identify the portion of the subject string matched by the - entire pattern. The next pair is used for the first capturing subpat- + string, and the second is set to the offset of the first character + after the end of a substring. 
The first pair, ovector[0] and ovec- + tor[1], identify the portion of the subject string matched by the + entire pattern. The next pair is used for the first capturing subpat- tern, and so on. The value returned by pcre_exec() is one more than the highest numbered pair that has been set. For example, if two substrings - have been captured, the returned value is 3. If there are no capturing - subpatterns, the return value from a successful match is 1, indicating + have been captured, the returned value is 3. If there are no capturing + subpatterns, the return value from a successful match is 1, indicating that just the first pair of offsets has been set. If a capturing subpattern is matched repeatedly, it is the last portion of the string that it matched that is returned. - If the vector is too small to hold all the captured substring offsets, + If the vector is too small to hold all the captured substring offsets, it is used as far as possible (up to two-thirds of its length), and the - function returns a value of zero. In particular, if the substring off- + function returns a value of zero. In particular, if the substring off- sets are not of interest, pcre_exec() may be called with ovector passed - as NULL and ovecsize as zero. However, if the pattern contains back - references and the ovector is not big enough to remember the related - substrings, PCRE has to get additional memory for use during matching. + as NULL and ovecsize as zero. However, if the pattern contains back + references and the ovector is not big enough to remember the related + substrings, PCRE has to get additional memory for use during matching. Thus it is usually advisable to supply an ovector. - The pcre_info() function can be used to find out how many capturing - subpatterns there are in a compiled pattern. 
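
Reviewer note: the offset-pair layout described in this hunk can be illustrated with a small, self-contained C sketch. It is an assumption-laden illustration, not part of PCRE: copy_group, min_ovecsize, and example_ovector are hypothetical names. The example vector holds the values the documentation describes for matching "abc" against (a|(z))(bc), where pcre_exec() returns 4 and the pair for the unused subpattern 2 is set to -1. PCRE itself provides pcre_copy_substring() for this purpose.

```c
#include <string.h>

/* Hypothetical data: offsets for "abc" matched against (a|(z))(bc).
 * Pairs: whole match, group 1, group 2 (unset), group 3. The last
 * third of the vector is workspace and carries no results. */
const int example_ovector[12] = { 0, 3,  0, 1,  -1, -1,  1, 3,  0, 0, 0, 0 };

/* Smallest ovecsize that allows for n captured substrings in addition
 * to the whole match, per the (n+1)*3 rule in the documentation. */
int min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* Hypothetical helper: copy capture group n into buf as a C string.
 *   rc  the positive value returned by pcre_exec(), that is, one more
 *       than the highest numbered pair that was set
 * Returns the substring length, or -1 if the group is out of range,
 * unset (offsets of -1), or too long for the buffer. */
int copy_group(const char *subject, const int *ovector, int rc,
               int n, char *buf, int bufsize)
{
    int start, end, len;

    if (n < 0 || n >= rc)
        return -1;
    start = ovector[2 * n];       /* offset of first character */
    end = ovector[2 * n + 1];     /* offset just past the last character */
    if (start < 0 || end < start)
        return -1;                /* group did not take part in the match */
    len = end - start;
    if (len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

With the example data, group 1 yields "a", group 3 yields "bc", and group 2 is reported as unset. The sketch returns a single -1 for all failure cases to stay short; a real caller would probably want to distinguish an unset group from a buffer that is too small.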
The smallest size for - ovector that will allow for n captured substrings, in addition to the + The pcre_info() function can be used to find out how many capturing + subpatterns there are in a compiled pattern. The smallest size for + ovector that will allow for n captured substrings, in addition to the offsets of the substring matched by the whole pattern, is (n+1)*3. - It is possible for capturing subpattern number n+1 to match some part + It is possible for capturing subpattern number n+1 to match some part of the subject when subpattern n has not been used at all. For example, - if the string "abc" is matched against the pattern (a|(z))(bc) the + if the string "abc" is matched against the pattern (a|(z))(bc) the return from the function is 4, and subpatterns 1 and 3 are matched, but - 2 is not. When this happens, both values in the offset pairs corre- + 2 is not. When this happens, both values in the offset pairs corre- sponding to unused subpatterns are set to -1. - Offset values that correspond to unused subpatterns at the end of the - expression are also set to -1. For example, if the string "abc" is - matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not - matched. The return from the function is 2, because the highest used + Offset values that correspond to unused subpatterns at the end of the + expression are also set to -1. For example, if the string "abc" is + matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not + matched. The return from the function is 2, because the highest used capturing subpattern number is 1. However, you can refer to the offsets - for the second and third capturing subpatterns if you wish (assuming + for the second and third capturing subpatterns if you wish (assuming the vector is large enough, of course). - Some convenience functions are provided for extracting the captured + Some convenience functions are provided for extracting the captured substrings as separate strings. 
These are described below. Error return values from pcre_exec() - If pcre_exec() fails, it returns a negative number. The following are + If pcre_exec() fails, it returns a negative number. The following are defined in the header file: PCRE_ERROR_NOMATCH (-1) @@ -1966,7 +2035,7 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION PCRE_ERROR_NULL (-2) - Either code or subject was passed as NULL, or ovector was NULL and + Either code or subject was passed as NULL, or ovector was NULL and ovecsize was not zero. PCRE_ERROR_BADOPTION (-3) @@ -1975,94 +2044,86 @@ MATCHING A PATTERN: THE TRADITIONAL FUNCTION PCRE_ERROR_BADMAGIC (-4) - PCRE stores a 4-byte "magic number" at the start of the compiled code, + PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch the case when it is passed a junk pointer and to detect when a pattern that was compiled in an environment of one endianness is run in - an environment with the other endianness. This is the error that PCRE + an environment with the other endianness. This is the error that PCRE gives when the magic number is not present. PCRE_ERROR_UNKNOWN_OPCODE (-5) While running the pattern match, an unknown item was encountered in the - compiled pattern. This error could be caused by a bug in PCRE or by + compiled pattern. This error could be caused by a bug in PCRE or by overwriting of the compiled pattern. PCRE_ERROR_NOMEMORY (-6) - If a pattern contains back references, but the ovector that is passed + If a pattern contains back references, but the ovector that is passed to pcre_exec() is not big enough to remember the referenced substrings, - PCRE gets a block of memory at the start of matching to use for this - purpose. If the call via pcre_malloc() fails, this error is given. The + PCRE gets a block of memory at the start of matching to use for this + purpose. If the call via pcre_malloc() fails, this error is given. The memory is automatically freed at the end of matching. 
PCRE_ERROR_NOSUBSTRING (-7) - This error is used by the pcre_copy_substring(), pcre_get_substring(), + This error is used by the pcre_copy_substring(), pcre_get_substring(), and pcre_get_substring_list() functions (see below). It is never returned by pcre_exec(). PCRE_ERROR_MATCHLIMIT (-8) - The backtracking limit, as specified by the match_limit field in a - pcre_extra structure (or defaulted) was reached. See the description + The backtracking limit, as specified by the match_limit field in a + pcre_extra structure (or defaulted) was reached. See the description above. PCRE_ERROR_CALLOUT (-9) This error is never generated by pcre_exec() itself. It is provided for - use by callout functions that want to yield a distinctive error code. + use by callout functions that want to yield a distinctive error code. See the pcrecallout documentation for details. PCRE_ERROR_BADUTF8 (-10) - A string that contains an invalid UTF-8 byte sequence was passed as a + A string that contains an invalid UTF-8 byte sequence was passed as a subject. PCRE_ERROR_BADUTF8_OFFSET (-11) The UTF-8 byte sequence that was passed as a subject was valid, but the - value of startoffset did not point to the beginning of a UTF-8 charac- + value of startoffset did not point to the beginning of a UTF-8 charac- ter. PCRE_ERROR_PARTIAL (-12) - The subject string did not match, but it did match partially. See the + The subject string did not match, but it did match partially. See the pcrepartial documentation for details of partial matching. PCRE_ERROR_BADPARTIAL (-13) - The PCRE_PARTIAL option was used with a compiled pattern containing - items that are not supported for partial matching. See the pcrepartial + The PCRE_PARTIAL option was used with a compiled pattern containing + items that are not supported for partial matching. See the pcrepartial documentation for details of partial matching. PCRE_ERROR_INTERNAL (-14) - An unexpected internal error has occurred. 
This error could be caused + An unexpected internal error has occurred. This error could be caused by a bug in PCRE or by overwriting of the compiled pattern. PCRE_ERROR_BADCOUNT (-15) - This error is given if the value of the ovecsize argument is negative. + This error is given if the value of the ovecsize argument is negative. PCRE_ERROR_RECURSIONLIMIT (-21) The internal recursion limit, as specified by the match_limit_recursion - field in a pcre_extra structure (or defaulted) was reached. See the + field in a pcre_extra structure (or defaulted) was reached. See the description above. - PCRE_ERROR_NULLWSLIMIT (-22) - - When a group that can match an empty substring is repeated with an - unbounded upper limit, the subject position at the start of the group - must be remembered, so that a test for an empty string can be made when - the end of the group is reached. Some workspace is required for this; - if it runs out, this error is given. - PCRE_ERROR_BADNEWLINE (-23) An invalid combination of PCRE_NEWLINE_xxx options was given. - Error numbers -16 to -20 are not used by pcre_exec(). + Error numbers -16 to -20 and -22 are not used by pcre_exec(). EXTRACTING CAPTURED SUBSTRINGS BY NUMBER @@ -2078,78 +2139,78 @@ EXTRACTING CAPTURED SUBSTRINGS BY NUMBER int pcre_get_substring_list(const char *subject, int *ovector, int stringcount, const char ***listptr); - Captured substrings can be accessed directly by using the offsets - returned by pcre_exec() in ovector. For convenience, the functions + Captured substrings can be accessed directly by using the offsets + returned by pcre_exec() in ovector. For convenience, the functions pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- - string_list() are provided for extracting captured substrings as new, - separate, zero-terminated strings. These functions identify substrings - by number. 
The next section describes functions for extracting named + string_list() are provided for extracting captured substrings as new, + separate, zero-terminated strings. These functions identify substrings + by number. The next section describes functions for extracting named substrings. - A substring that contains a binary zero is correctly extracted and has - a further zero added on the end, but the result is not, of course, a C - string. However, you can process such a string by referring to the - length that is returned by pcre_copy_substring() and pcre_get_sub- + A substring that contains a binary zero is correctly extracted and has + a further zero added on the end, but the result is not, of course, a C + string. However, you can process such a string by referring to the + length that is returned by pcre_copy_substring() and pcre_get_sub- string(). Unfortunately, the interface to pcre_get_substring_list() is - not adequate for handling strings containing binary zeros, because the + not adequate for handling strings containing binary zeros, because the end of the final string is not independently indicated. - The first three arguments are the same for all three of these func- - tions: subject is the subject string that has just been successfully + The first three arguments are the same for all three of these func- + tions: subject is the subject string that has just been successfully matched, ovector is a pointer to the vector of integer offsets that was passed to pcre_exec(), and stringcount is the number of substrings that - were captured by the match, including the substring that matched the + were captured by the match, including the substring that matched the entire regular expression. This is the value returned by pcre_exec() if - it is greater than zero. If pcre_exec() returned zero, indicating that - it ran out of space in ovector, the value passed as stringcount should + it is greater than zero. 
If pcre_exec() returned zero, indicating that + it ran out of space in ovector, the value passed as stringcount should be the number of elements in the vector divided by three. - The functions pcre_copy_substring() and pcre_get_substring() extract a - single substring, whose number is given as stringnumber. A value of - zero extracts the substring that matched the entire pattern, whereas - higher values extract the captured substrings. For pcre_copy_sub- - string(), the string is placed in buffer, whose length is given by - buffersize, while for pcre_get_substring() a new block of memory is - obtained via pcre_malloc, and its address is returned via stringptr. - The yield of the function is the length of the string, not including + The functions pcre_copy_substring() and pcre_get_substring() extract a + single substring, whose number is given as stringnumber. A value of + zero extracts the substring that matched the entire pattern, whereas + higher values extract the captured substrings. For pcre_copy_sub- + string(), the string is placed in buffer, whose length is given by + buffersize, while for pcre_get_substring() a new block of memory is + obtained via pcre_malloc, and its address is returned via stringptr. + The yield of the function is the length of the string, not including the terminating zero, or one of these error codes: PCRE_ERROR_NOMEMORY (-6) - The buffer was too small for pcre_copy_substring(), or the attempt to + The buffer was too small for pcre_copy_substring(), or the attempt to get memory failed for pcre_get_substring(). PCRE_ERROR_NOSUBSTRING (-7) There is no substring whose number is stringnumber. - The pcre_get_substring_list() function extracts all available sub- - strings and builds a list of pointers to them. All this is done in a + The pcre_get_substring_list() function extracts all available sub- + strings and builds a list of pointers to them. All this is done in a single block of memory that is obtained via pcre_malloc. 
The address of - the memory block is returned via listptr, which is also the start of - the list of string pointers. The end of the list is marked by a NULL - pointer. The yield of the function is zero if all went well, or the + the memory block is returned via listptr, which is also the start of + the list of string pointers. The end of the list is marked by a NULL + pointer. The yield of the function is zero if all went well, or the error code PCRE_ERROR_NOMEMORY (-6) if the attempt to get the memory block failed. - When any of these functions encounter a substring that is unset, which - can happen when capturing subpattern number n+1 matches some part of - the subject, but subpattern n has not been used at all, they return an + When any of these functions encounter a substring that is unset, which + can happen when capturing subpattern number n+1 matches some part of + the subject, but subpattern n has not been used at all, they return an empty string. This can be distinguished from a genuine zero-length sub- - string by inspecting the appropriate offset in ovector, which is nega- + string by inspecting the appropriate offset in ovector, which is nega- tive for unset substrings. - The two convenience functions pcre_free_substring() and pcre_free_sub- - string_list() can be used to free the memory returned by a previous + The two convenience functions pcre_free_substring() and pcre_free_sub- + string_list() can be used to free the memory returned by a previous call of pcre_get_substring() or pcre_get_substring_list(), respec- - tively. They do nothing more than call the function pointed to by - pcre_free, which of course could be called directly from a C program. - However, PCRE is used in some situations where it is linked via a spe- - cial interface to another programming language that cannot use - pcre_free directly; it is for these cases that the functions are pro- + tively. 
They do nothing more than call the function pointed to by + pcre_free, which of course could be called directly from a C program. + However, PCRE is used in some situations where it is linked via a spe- + cial interface to another programming language that cannot use + pcre_free directly; it is for these cases that the functions are pro- vided. @@ -2168,7 +2229,7 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME int stringcount, const char *stringname, const char **stringptr); - To extract a substring by name, you first have to find associated num- + To extract a substring by name, you first have to find associated num- ber. For example, for this pattern (a+)b(?<xxx>\d+)... @@ -2177,27 +2238,27 @@ EXTRACTING CAPTURED SUBSTRINGS BY NAME be unique (PCRE_DUPNAMES was not set), you can find the number from the name by calling pcre_get_stringnumber(). The first argument is the com- piled pattern, and the second is the name. The yield of the function is - the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no + the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no subpattern of that name. Given the number, you can extract the substring directly, or use one of the functions described in the previous section. For convenience, there are also two functions that do the whole job. - Most of the arguments of pcre_copy_named_substring() and - pcre_get_named_substring() are the same as those for the similarly - named functions that extract by number. As these are described in the - previous section, they are not re-described here. There are just two + Most of the arguments of pcre_copy_named_substring() and + pcre_get_named_substring() are the same as those for the similarly + named functions that extract by number. As these are described in the + previous section, they are not re-described here. There are just two differences: - First, instead of a substring number, a substring name is given. 
Sec- + First, instead of a substring number, a substring name is given. Sec- ond, there is an extra argument, given at the start, which is a pointer - to the compiled pattern. This is needed in order to gain access to the + to the compiled pattern. This is needed in order to gain access to the name-to-number translation table. - These functions call pcre_get_stringnumber(), and if it succeeds, they - then call pcre_copy_substring() or pcre_get_substring(), as appropri- - ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the + These functions call pcre_get_stringnumber(), and if it succeeds, they + then call pcre_copy_substring() or pcre_get_substring(), as appropri- + ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the behaviour may not be what you want (see the next section). @@ -2206,45 +2267,47 @@ DUPLICATE SUBPATTERN NAMES int pcre_get_stringtable_entries(const pcre *code, const char *name, char **first, char **last); - When a pattern is compiled with the PCRE_DUPNAMES option, names for - subpatterns are not required to be unique. Normally, patterns with - duplicate names are such that in any one match, only one of the named - subpatterns participates. An example is shown in the pcrepattern docu- - mentation. When duplicates are present, pcre_copy_named_substring() and - pcre_get_named_substring() return the first substring corresponding to - the given name that is set. If none are set, an empty string is - returned. The pcre_get_stringnumber() function returns one of the num- - bers that are associated with the name, but it is not defined which it - is. - - If you want to get full details of all captured substrings for a given - name, you must use the pcre_get_stringtable_entries() function. The + When a pattern is compiled with the PCRE_DUPNAMES option, names for + subpatterns are not required to be unique. 
Normally, patterns with + duplicate names are such that in any one match, only one of the named + subpatterns participates. An example is shown in the pcrepattern docu- + mentation. + + When duplicates are present, pcre_copy_named_substring() and + pcre_get_named_substring() return the first substring corresponding to + the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING + (-7) is returned; no data is returned. The pcre_get_stringnumber() + function returns one of the numbers that are associated with the name, + but it is not defined which it is. + + If you want to get full details of all captured substrings for a given + name, you must use the pcre_get_stringtable_entries() function. The first argument is the compiled pattern, and the second is the name. The - third and fourth are pointers to variables which are updated by the + third and fourth are pointers to variables which are updated by the function. After it has run, they point to the first and last entries in - the name-to-number table for the given name. The function itself - returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if - there are none. The format of the table is described above in the sec- - tion entitled Information about a pattern. Given all the relevant - entries for the name, you can extract each of their numbers, and hence + the name-to-number table for the given name. The function itself + returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if + there are none. The format of the table is described above in the sec- + tion entitled Information about a pattern. Given all the relevant + entries for the name, you can extract each of their numbers, and hence the captured data, if any. FINDING ALL POSSIBLE MATCHES - The traditional matching function uses a similar algorithm to Perl, + The traditional matching function uses a similar algorithm to Perl, which stops when it finds the first match, starting at a given point in - the subject. 
If you want to find all possible matches, or the longest - possible match, consider using the alternative matching function (see - below) instead. If you cannot use the alternative function, but still - need to find all possible matches, you can kludge it up by making use + the subject. If you want to find all possible matches, or the longest + possible match, consider using the alternative matching function (see + below) instead. If you cannot use the alternative function, but still + need to find all possible matches, you can kludge it up by making use of the callout facility, which is described in the pcrecallout documen- tation. What you have to do is to insert a callout right at the end of the pat- - tern. When your callout function is called, extract and save the cur- - rent matched substring. Then return 1, which forces pcre_exec() to - backtrack and try other alternatives. Ultimately, when it runs out of + tern. When your callout function is called, extract and save the cur- + rent matched substring. Then return 1, which forces pcre_exec() to + backtrack and try other alternatives. Ultimately, when it runs out of matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. @@ -2255,25 +2318,25 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION int options, int *ovector, int ovecsize, int *workspace, int wscount); - The function pcre_dfa_exec() is called to match a subject string - against a compiled pattern, using a matching algorithm that scans the - subject string just once, and does not backtrack. This has different - characteristics to the normal algorithm, and is not compatible with - Perl. Some of the features of PCRE patterns are not supported. Never- - theless, there are times when this kind of matching can be useful. For + The function pcre_dfa_exec() is called to match a subject string + against a compiled pattern, using a matching algorithm that scans the + subject string just once, and does not backtrack. 
This has different + characteristics to the normal algorithm, and is not compatible with + Perl. Some of the features of PCRE patterns are not supported. Never- + theless, there are times when this kind of matching can be useful. For a discussion of the two matching algorithms, see the pcrematching docu- mentation. - The arguments for the pcre_dfa_exec() function are the same as for + The arguments for the pcre_dfa_exec() function are the same as for pcre_exec(), plus two extras. The ovector argument is used in a differ- - ent way, and this is described below. The other common arguments are - used in the same way as for pcre_exec(), so their description is not + ent way, and this is described below. The other common arguments are + used in the same way as for pcre_exec(), so their description is not repeated here. - The two additional arguments provide workspace for the function. The - workspace vector should contain at least 20 elements. It is used for + The two additional arguments provide workspace for the function. The + workspace vector should contain at least 20 elements. It is used for keeping track of multiple paths through the pattern tree. More - workspace will be needed for patterns and subjects where there are a + workspace will be needed for patterns and subjects where there are a lot of potential matches. Here is an example of a simple call to pcre_dfa_exec(): @@ -2295,47 +2358,47 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION Option bits for pcre_dfa_exec() - The unused bits of the options argument for pcre_dfa_exec() must be - zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- - LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, + The unused bits of the options argument for pcre_dfa_exec() must be + zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- + LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. 
All but the last three of these are the same as for pcre_exec(), so their description is not repeated here. PCRE_PARTIAL - This has the same general effect as it does for pcre_exec(), but the - details are slightly different. When PCRE_PARTIAL is set for - pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into - PCRE_ERROR_PARTIAL if the end of the subject is reached, there have + This has the same general effect as it does for pcre_exec(), but the + details are slightly different. When PCRE_PARTIAL is set for + pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into + PCRE_ERROR_PARTIAL if the end of the subject is reached, there have been no complete matches, but there is still at least one matching pos- - sibility. The portion of the string that provided the partial match is + sibility. The portion of the string that provided the partial match is set as the first matching string. PCRE_DFA_SHORTEST - Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to + Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to stop as soon as it has found one match. Because of the way the alterna- - tive algorithm works, this is necessarily the shortest possible match + tive algorithm works, this is necessarily the shortest possible match at the first possible matching point in the subject string. PCRE_DFA_RESTART - When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and - returns a partial match, it is possible to call it again, with addi- - tional subject characters, and have it continue with the same match. - The PCRE_DFA_RESTART option requests this action; when it is set, the - workspace and wscount options must reference the same vector as before - because data about the match so far is left in them after a partial - match. 
There is more discussion of this facility in the pcrepartial + When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and + returns a partial match, it is possible to call it again, with addi- + tional subject characters, and have it continue with the same match. + The PCRE_DFA_RESTART option requests this action; when it is set, the + workspace and wscount options must reference the same vector as before + because data about the match so far is left in them after a partial + match. There is more discussion of this facility in the pcrepartial documentation. Successful returns from pcre_dfa_exec() - When pcre_dfa_exec() succeeds, it may have matched more than one sub- + When pcre_dfa_exec() succeeds, it may have matched more than one sub- string in the subject. Note, however, that all the matches from one run - of the function start at the same point in the subject. The shorter - matches are all initial substrings of the longer matches. For example, + of the function start at the same point in the subject. The shorter + matches are all initial substrings of the longer matches. For example, if the pattern <.*> @@ -2350,62 +2413,62 @@ MATCHING A PATTERN: THE ALTERNATIVE FUNCTION <something> <something else> <something> <something else> <something further> - On success, the yield of the function is a number greater than zero, - which is the number of matched substrings. The substrings themselves - are returned in ovector. Each string uses two elements; the first is - the offset to the start, and the second is the offset to the end. In - fact, all the strings have the same start offset. (Space could have - been saved by giving this only once, but it was decided to retain some - compatibility with the way pcre_exec() returns data, even though the + On success, the yield of the function is a number greater than zero, + which is the number of matched substrings. The substrings themselves + are returned in ovector. 
Each string uses two elements; the first is + the offset to the start, and the second is the offset to the end. In + fact, all the strings have the same start offset. (Space could have + been saved by giving this only once, but it was decided to retain some + compatibility with the way pcre_exec() returns data, even though the meaning of the strings is different.) The strings are returned in reverse order of length; that is, the long- - est matching string is given first. If there were too many matches to - fit into ovector, the yield of the function is zero, and the vector is + est matching string is given first. If there were too many matches to + fit into ovector, the yield of the function is zero, and the vector is filled with the longest matches. Error returns from pcre_dfa_exec() - The pcre_dfa_exec() function returns a negative number when it fails. - Many of the errors are the same as for pcre_exec(), and these are - described above. There are in addition the following errors that are + The pcre_dfa_exec() function returns a negative number when it fails. + Many of the errors are the same as for pcre_exec(), and these are + described above. There are in addition the following errors that are specific to pcre_dfa_exec(): PCRE_ERROR_DFA_UITEM (-16) - This return is given if pcre_dfa_exec() encounters an item in the pat- - tern that it does not support, for instance, the use of \C or a back + This return is given if pcre_dfa_exec() encounters an item in the pat- + tern that it does not support, for instance, the use of \C or a back reference. PCRE_ERROR_DFA_UCOND (-17) - This return is given if pcre_dfa_exec() encounters a condition item - that uses a back reference for the condition, or a test for recursion + This return is given if pcre_dfa_exec() encounters a condition item + that uses a back reference for the condition, or a test for recursion in a specific group. These are not supported. 
PCRE_ERROR_DFA_UMLIMIT (-18) - This return is given if pcre_dfa_exec() is called with an extra block + This return is given if pcre_dfa_exec() is called with an extra block that contains a setting of the match_limit field. This is not supported (it is meaningless). PCRE_ERROR_DFA_WSSIZE (-19) - This return is given if pcre_dfa_exec() runs out of space in the + This return is given if pcre_dfa_exec() runs out of space in the workspace vector. PCRE_ERROR_DFA_RECURSE (-20) - When a recursive subpattern is processed, the matching function calls - itself recursively, using private vectors for ovector and workspace. - This error is given if the output vector is not large enough. This + When a recursive subpattern is processed, the matching function calls + itself recursively, using private vectors for ovector and workspace. + This error is given if the output vector is not large enough. This should be extremely rare, as a vector of size 1000 is used. SEE ALSO - pcrebuild(3), pcrecallout(3), pcrecpp(3)(3), pcrematching(3), pcrepar- - tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). + pcrebuild(3), pcrecallout(3), pcrecpp(3)(3), pcrematching(3), pcrepar- + tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). AUTHOR @@ -2417,7 +2480,7 @@ AUTHOR REVISION - Last updated: 13 June 2007 + Last updated: 21 August 2007 Copyright (c) 1997-2007 University of Cambridge. ------------------------------------------------------------------------------ @@ -2670,7 +2733,13 @@ DIFFERENCES BETWEEN PCRE AND PERL matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set to "b". - 11. PCRE provides some extensions to the Perl regular expression facil- + 11. PCRE does support Perl 5.10's backtracking verbs (*ACCEPT), + (*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in + the forms without an argument. PCRE does not support (*MARK). 
If + (*ACCEPT) is within capturing parentheses, PCRE does not set that cap- + ture group; this is different to Perl. + + 12. PCRE provides some extensions to the Perl regular expression facil- ities. Perl 5.10 will include new features that are not in earlier versions, some of which (such as named parentheses) have been in PCRE for some time. This list is with respect to Perl 5.10: @@ -2716,7 +2785,7 @@ AUTHOR REVISION - Last updated: 13 June 2007 + Last updated: 08 August 2007 Copyright (c) 1997-2007 University of Cambridge. ------------------------------------------------------------------------------ @@ -2730,12 +2799,14 @@ NAME PCRE REGULAR EXPRESSION DETAILS - The syntax and semantics of the regular expressions supported by PCRE - are described below. Regular expressions are also described in the Perl - documentation and in a number of books, some of which have copious - examples. Jeffrey Friedl's "Mastering Regular Expressions", published - by O'Reilly, covers regular expressions in great detail. This descrip- - tion of PCRE's regular expressions is intended as reference material. + The syntax and semantics of the regular expressions that are supported + by PCRE are described in detail below. There is a quick-reference syn- + tax summary in the pcresyntax page. Perl's regular expressions are + described in its own documentation, and regular expressions in general + are covered in a number of books, some of which have copious examples. + Jeffrey Friedl's "Mastering Regular Expressions", published by + O'Reilly, covers regular expressions in great detail. This description + of PCRE's regular expressions is intended as reference material. The original operation of PCRE was on strings of one-byte characters. However, there is now also support for UTF-8 character strings. To use @@ -2755,33 +2826,63 @@ PCRE REGULAR EXPRESSION DETAILS discussed in the pcrematching page. 
+NEWLINE CONVENTIONS + + PCRE supports five different conventions for indicating line breaks in + strings: a single CR (carriage return) character, a single LF (line- + feed) character, the two-character sequence CRLF, any of the three pre- + ceding, or any Unicode newline sequence. The pcreapi page has further + discussion about newlines, and shows how to set the newline convention + in the options arguments for the compiling and matching functions. + + It is also possible to specify a newline convention by starting a pat- + tern string with one of the following five sequences: + + (*CR) carriage return + (*LF) linefeed + (*CRLF) carriage return, followed by linefeed + (*ANYCRLF) any of the three above + (*ANY) all Unicode newline sequences + + These override the default and the options given to pcre_compile(). For + example, on a Unix system where LF is the default newline sequence, the + pattern + + (*CR)a.b + + changes the convention to CR. That pattern matches "a\nb" because LF is + no longer a newline. Note that these special settings, which are not + Perl-compatible, are recognized only at the very start of a pattern, + and that they must be in upper case. + + CHARACTERS AND METACHARACTERS - A regular expression is a pattern that is matched against a subject - string from left to right. Most characters stand for themselves in a - pattern, and match the corresponding characters in the subject. As a + A regular expression is a pattern that is matched against a subject + string from left to right. Most characters stand for themselves in a + pattern, and match the corresponding characters in the subject. As a trivial example, the pattern The quick brown fox matches a portion of a subject string that is identical to itself. When - caseless matching is specified (the PCRE_CASELESS option), letters are - matched independently of case. 
In UTF-8 mode, PCRE always understands - the concept of case for characters whose values are less than 128, so - caseless matching is always possible. For characters with higher val- - ues, the concept of case is supported if PCRE is compiled with Unicode - property support, but not otherwise. If you want to use caseless - matching for characters 128 and above, you must ensure that PCRE is + caseless matching is specified (the PCRE_CASELESS option), letters are + matched independently of case. In UTF-8 mode, PCRE always understands + the concept of case for characters whose values are less than 128, so + caseless matching is always possible. For characters with higher val- + ues, the concept of case is supported if PCRE is compiled with Unicode + property support, but not otherwise. If you want to use caseless + matching for characters 128 and above, you must ensure that PCRE is compiled with Unicode property support as well as with UTF-8 support. - The power of regular expressions comes from the ability to include - alternatives and repetitions in the pattern. These are encoded in the + The power of regular expressions comes from the ability to include + alternatives and repetitions in the pattern. These are encoded in the pattern by the use of metacharacters, which do not stand for themselves but instead are interpreted in some special way. - There are two different sets of metacharacters: those that are recog- - nized anywhere in the pattern except within square brackets, and those - that are recognized within square brackets. Outside square brackets, + There are two different sets of metacharacters: those that are recog- + nized anywhere in the pattern except within square brackets, and those + that are recognized within square brackets. 
Outside square brackets, the metacharacters are as follows: \ general escape character with several uses @@ -2800,7 +2901,7 @@ CHARACTERS AND METACHARACTERS also "possessive quantifier" { start min/max quantifier - Part of a pattern that is in square brackets is called a "character + Part of a pattern that is in square brackets is called a "character class". In a character class the only metacharacters are: \ general escape character @@ -2810,33 +2911,33 @@ CHARACTERS AND METACHARACTERS syntax) ] terminates the character class - The following sections describe the use of each of the metacharacters. + The following sections describe the use of each of the metacharacters. BACKSLASH The backslash character has several uses. Firstly, if it is followed by - a non-alphanumeric character, it takes away any special meaning that - character may have. This use of backslash as an escape character + a non-alphanumeric character, it takes away any special meaning that + character may have. This use of backslash as an escape character applies both inside and outside character classes. - For example, if you want to match a * character, you write \* in the - pattern. This escaping action applies whether or not the following - character would otherwise be interpreted as a metacharacter, so it is - always safe to precede a non-alphanumeric with backslash to specify - that it stands for itself. In particular, if you want to match a back- + For example, if you want to match a * character, you write \* in the + pattern. This escaping action applies whether or not the following + character would otherwise be interpreted as a metacharacter, so it is + always safe to precede a non-alphanumeric with backslash to specify + that it stands for itself. In particular, if you want to match a back- slash, you write \\. 
- If a pattern is compiled with the PCRE_EXTENDED option, whitespace in - the pattern (other than in a character class) and characters between a + If a pattern is compiled with the PCRE_EXTENDED option, whitespace in + the pattern (other than in a character class) and characters between a # outside a character class and the next newline are ignored. An escap- - ing backslash can be used to include a whitespace or # character as + ing backslash can be used to include a whitespace or # character as part of the pattern. - If you want to remove the special meaning from a sequence of charac- - ters, you can do so by putting them between \Q and \E. This is differ- - ent from Perl in that $ and @ are handled as literals in \Q...\E - sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- + If you want to remove the special meaning from a sequence of charac- + ters, you can do so by putting them between \Q and \E. This is differ- + ent from Perl in that $ and @ are handled as literals in \Q...\E + sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- tion. Note the following examples: Pattern PCRE matches Perl matches @@ -2846,43 +2947,46 @@ BACKSLASH \Qabc\$xyz\E abc\$xyz abc\$xyz \Qabc\E\$\Qxyz\E abc$xyz abc$xyz - The \Q...\E sequence is recognized both inside and outside character + The \Q...\E sequence is recognized both inside and outside character classes. Non-printing characters A second use of backslash provides a way of encoding non-printing char- - acters in patterns in a visible manner. There is no restriction on the - appearance of non-printing characters, apart from the binary zero that - terminates a pattern, but when a pattern is being prepared by text - editing, it is usually easier to use one of the following escape + acters in patterns in a visible manner. 
There is no restriction on the + appearance of non-printing characters, apart from the binary zero that + terminates a pattern, but when a pattern is being prepared by text + editing, it is usually easier to use one of the following escape sequences than the binary character it represents: \a alarm, that is, the BEL character (hex 07) \cx "control-x", where x is any character \e escape (hex 1B) \f formfeed (hex 0C) - \n newline (hex 0A) + \n linefeed (hex 0A) \r carriage return (hex 0D) \t tab (hex 09) \ddd character with octal code ddd, or backreference \xhh character with hex code hh \x{hhh..} character with hex code hhh.. - The precise effect of \cx is as follows: if x is a lower case letter, - it is converted to upper case. Then bit 6 of the character (hex 40) is - inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; + The precise effect of \cx is as follows: if x is a lower case letter, + it is converted to upper case. Then bit 6 of the character (hex 40) is + inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex 7B. - After \x, from zero to two hexadecimal digits are read (letters can be - in upper or lower case). Any number of hexadecimal digits may appear - between \x{ and }, but the value of the character code must be less - than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is, - the maximum hexadecimal value is 7FFFFFFF). If characters other than - hexadecimal digits appear between \x{ and }, or if there is no termi- - nating }, this form of escape is not recognized. Instead, the initial - \x will be interpreted as a basic hexadecimal escape, with no following - digits, giving a character whose value is zero. + After \x, from zero to two hexadecimal digits are read (letters can be + in upper or lower case). Any number of hexadecimal digits may appear + between \x{ and }, but the value of the character code must be less + than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. 
That is, + the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger + than the largest Unicode code point, which is 10FFFF. + + If characters other than hexadecimal digits appear between \x{ and }, + or if there is no terminating }, this form of escape is not recognized. + Instead, the initial \x will be interpreted as a basic hexadecimal + escape, with no following digits, giving a character whose value is + zero. Characters whose value is less than 256 can be defined by either of the two syntaxes for \x. There is no difference in the way they are han- @@ -2937,10 +3041,10 @@ BACKSLASH Absolute and relative back references - The sequence \g followed by a positive or negative number, optionally - enclosed in braces, is an absolute or relative back reference. A named - back reference can be coded as \g{name}. Back references are discussed - later, following the discussion of parenthesized subpatterns. + The sequence \g followed by an unsigned or a negative number, option- + ally enclosed in braces, is an absolute or relative back reference. A + named back reference can be coded as \g{name}. Back references are dis- + cussed later, following the discussion of parenthesized subpatterns. Generic character types @@ -3145,6 +3249,12 @@ BACKSLASH has the Lu, Ll, or Lt property, in other words, a letter that is not classified as a modifier or "other". + The Cs (Surrogate) property applies only to characters in the range + U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see + RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- + ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in + the pcreapi page). + The long synonyms for these properties that Perl supports (such as \p{Letter}) are not supported by PCRE, nor is it permitted to prefix any of these properties with "Is". 
@@ -3876,121 +3986,126 @@ ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS \d++foo - Possessive quantifiers are always greedy; the setting of the + Note that a possessive quantifier can be used with an entire group, for + example: + + (abc|xyz){2,3}+ + + Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY option is ignored. They are a convenient notation for the - simpler forms of atomic group. However, there is no difference in the - meaning of a possessive quantifier and the equivalent atomic group, - though there may be a performance difference; possessive quantifiers + simpler forms of atomic group. However, there is no difference in the + meaning of a possessive quantifier and the equivalent atomic group, + though there may be a performance difference; possessive quantifiers should be slightly faster. - The possessive quantifier syntax is an extension to the Perl 5.8 syn- - tax. Jeffrey Friedl originated the idea (and the name) in the first + The possessive quantifier syntax is an extension to the Perl 5.8 syn- + tax. Jeffrey Friedl originated the idea (and the name) in the first edition of his book. Mike McCloskey liked it, so implemented it when he - built Sun's Java package, and PCRE copied it from there. It ultimately + built Sun's Java package, and PCRE copied it from there. It ultimately found its way into Perl at release 5.10. PCRE has an optimization that automatically "possessifies" certain sim- - ple pattern constructs. For example, the sequence A+B is treated as - A++B because there is no point in backtracking into a sequence of A's + ple pattern constructs. For example, the sequence A+B is treated as + A++B because there is no point in backtracking into a sequence of A's when B must follow. 
- When a pattern contains an unlimited repeat inside a subpattern that - can itself be repeated an unlimited number of times, the use of an - atomic group is the only way to avoid some failing matches taking a + When a pattern contains an unlimited repeat inside a subpattern that + can itself be repeated an unlimited number of times, the use of an + atomic group is the only way to avoid some failing matches taking a very long time indeed. The pattern (\D+|<\d+>)*[!?] - matches an unlimited number of substrings that either consist of non- - digits, or digits enclosed in <>, followed by either ! or ?. When it + matches an unlimited number of substrings that either consist of non- + digits, or digits enclosed in <>, followed by either ! or ?. When it matches, it runs quickly. However, if it is applied to aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - it takes a long time before reporting failure. This is because the - string can be divided between the internal \D+ repeat and the external - * repeat in a large number of ways, and all have to be tried. (The - example uses [!?] rather than a single character at the end, because - both PCRE and Perl have an optimization that allows for fast failure - when a single character is used. They remember the last single charac- - ter that is required for a match, and fail early if it is not present - in the string.) If the pattern is changed so that it uses an atomic + it takes a long time before reporting failure. This is because the + string can be divided between the internal \D+ repeat and the external + * repeat in a large number of ways, and all have to be tried. (The + example uses [!?] rather than a single character at the end, because + both PCRE and Perl have an optimization that allows for fast failure + when a single character is used. They remember the last single charac- + ter that is required for a match, and fail early if it is not present + in the string.) 
If the pattern is changed so that it uses an atomic group, like this: ((?>\D+)|<\d+>)*[!?] - sequences of non-digits cannot be broken, and failure happens quickly. + sequences of non-digits cannot be broken, and failure happens quickly. BACK REFERENCES Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing sub- - pattern earlier (that is, to its left) in the pattern, provided there + pattern earlier (that is, to its left) in the pattern, provided there have been that many previous capturing left parentheses. However, if the decimal number following the backslash is less than 10, - it is always taken as a back reference, and causes an error only if - there are not that many capturing left parentheses in the entire pat- - tern. In other words, the parentheses that are referenced need not be - to the left of the reference for numbers less than 10. A "forward back - reference" of this type can make sense when a repetition is involved - and the subpattern to the right has participated in an earlier itera- + it is always taken as a back reference, and causes an error only if + there are not that many capturing left parentheses in the entire pat- + tern. In other words, the parentheses that are referenced need not be + to the left of the reference for numbers less than 10. A "forward back + reference" of this type can make sense when a repetition is involved + and the subpattern to the right has participated in an earlier itera- tion. - It is not possible to have a numerical "forward back reference" to a - subpattern whose number is 10 or more using this syntax because a - sequence such as \50 is interpreted as a character defined in octal. + It is not possible to have a numerical "forward back reference" to a + subpattern whose number is 10 or more using this syntax because a + sequence such as \50 is interpreted as a character defined in octal. 
See the subsection entitled "Non-printing characters" above for further - details of the handling of digits following a backslash. There is no - such problem when named parentheses are used. A back reference to any + details of the handling of digits following a backslash. There is no + such problem when named parentheses are used. A back reference to any subpattern is possible using named parentheses (see below). - Another way of avoiding the ambiguity inherent in the use of digits + Another way of avoiding the ambiguity inherent in the use of digits following a backslash is to use the \g escape sequence, which is a fea- - ture introduced in Perl 5.10. This escape must be followed by a posi- - tive or a negative number, optionally enclosed in braces. These exam- - ples are all identical: + ture introduced in Perl 5.10. This escape must be followed by an + unsigned number or a negative number, optionally enclosed in braces. + These examples are all identical: (ring), \1 (ring), \g1 (ring), \g{1} - A positive number specifies an absolute reference without the ambiguity - that is present in the older syntax. It is also useful when literal + An unsigned number specifies an absolute reference without the ambigu- + ity that is present in the older syntax. It is also useful when literal digits follow the reference. A negative number is a relative reference. Consider this example: (abc(def)ghi)\g{-1} The sequence \g{-1} is a reference to the most recently started captur- - ing subpattern before \g, that is, is it equivalent to \2. Similarly, + ing subpattern before \g, that is, is it equivalent to \2. Similarly, \g{-2} would be equivalent to \1. The use of relative references can be - helpful in long patterns, and also in patterns that are created by + helpful in long patterns, and also in patterns that are created by joining together fragments that contain references within themselves. 
- A back reference matches whatever actually matched the capturing sub- - pattern in the current subject string, rather than anything matching + A back reference matches whatever actually matched the capturing sub- + pattern in the current subject string, rather than anything matching the subpattern itself (see "Subpatterns as subroutines" below for a way of doing that). So the pattern (sens|respons)e and \1ibility - matches "sense and sensibility" and "response and responsibility", but - not "sense and responsibility". If caseful matching is in force at the - time of the back reference, the case of letters is relevant. For exam- + matches "sense and sensibility" and "response and responsibility", but + not "sense and responsibility". If caseful matching is in force at the + time of the back reference, the case of letters is relevant. For exam- ple, ((?i)rah)\s+\1 - matches "rah rah" and "RAH RAH", but not "RAH rah", even though the + matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original capturing subpattern is matched caselessly. - There are several different ways of writing back references to named - subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or - \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's + There are several different ways of writing back references to named + subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or + \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified back reference syntax, in which \g can be used for both numeric - and named references, is also supported. We could rewrite the above + and named references, is also supported. 
We could rewrite the above example in any of the following ways: (?<p1>(?i)rah)\s+\k<p1> @@ -3998,57 +4113,57 @@ BACK REFERENCES (?P<p1>(?i)rah)\s+(?P=p1) (?<p1>(?i)rah)\s+\g{p1} - A subpattern that is referenced by name may appear in the pattern + A subpattern that is referenced by name may appear in the pattern before or after the reference. - There may be more than one back reference to the same subpattern. If a - subpattern has not actually been used in a particular match, any back + There may be more than one back reference to the same subpattern. If a + subpattern has not actually been used in a particular match, any back references to it always fail. For example, the pattern (a|(bc))\2 - always fails if it starts to match "a" rather than "bc". Because there - may be many capturing parentheses in a pattern, all digits following - the backslash are taken as part of a potential back reference number. + always fails if it starts to match "a" rather than "bc". Because there + may be many capturing parentheses in a pattern, all digits following + the backslash are taken as part of a potential back reference number. If the pattern continues with a digit character, some delimiter must be - used to terminate the back reference. If the PCRE_EXTENDED option is - set, this can be whitespace. Otherwise an empty comment (see "Com- + used to terminate the back reference. If the PCRE_EXTENDED option is + set, this can be whitespace. Otherwise an empty comment (see "Com- ments" below) can be used. - A back reference that occurs inside the parentheses to which it refers - fails when the subpattern is first used, so, for example, (a\1) never - matches. However, such references can be useful inside repeated sub- + A back reference that occurs inside the parentheses to which it refers + fails when the subpattern is first used, so, for example, (a\1) never + matches. However, such references can be useful inside repeated sub- patterns. 
For example, the pattern (a|b\1)+ matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- - ation of the subpattern, the back reference matches the character - string corresponding to the previous iteration. In order for this to - work, the pattern must be such that the first iteration does not need - to match the back reference. This can be done using alternation, as in + ation of the subpattern, the back reference matches the character + string corresponding to the previous iteration. In order for this to + work, the pattern must be such that the first iteration does not need + to match the back reference. This can be done using alternation, as in the example above, or by a quantifier with a minimum of zero. ASSERTIONS - An assertion is a test on the characters following or preceding the - current matching point that does not actually consume any characters. - The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are + An assertion is a test on the characters following or preceding the + current matching point that does not actually consume any characters. + The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described above. - More complicated assertions are coded as subpatterns. There are two - kinds: those that look ahead of the current position in the subject - string, and those that look behind it. An assertion subpattern is - matched in the normal way, except that it does not cause the current + More complicated assertions are coded as subpatterns. There are two + kinds: those that look ahead of the current position in the subject + string, and those that look behind it. An assertion subpattern is + matched in the normal way, except that it does not cause the current matching position to be changed. - Assertion subpatterns are not capturing subpatterns, and may not be - repeated, because it makes no sense to assert the same thing several - times. 
If any kind of assertion contains capturing subpatterns within - it, these are counted for the purposes of numbering the capturing sub- + Assertion subpatterns are not capturing subpatterns, and may not be + repeated, because it makes no sense to assert the same thing several + times. If any kind of assertion contains capturing subpatterns within + it, these are counted for the purposes of numbering the capturing sub- patterns in the whole pattern. However, substring capturing is carried - out only for positive assertions, because it does not make sense for + out only for positive assertions, because it does not make sense for negative assertions. Lookahead assertions @@ -4058,37 +4173,37 @@ ASSERTIONS \w+(?=;) - matches a word followed by a semicolon, but does not include the semi- + matches a word followed by a semicolon, but does not include the semi- colon in the match, and foo(?!bar) - matches any occurrence of "foo" that is not followed by "bar". Note + matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern (?!foo)bar - does not find an occurrence of "bar" that is preceded by something - other than "foo"; it finds any occurrence of "bar" whatsoever, because + does not find an occurrence of "bar" that is preceded by something + other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always true when the next three characters are "bar". A lookbehind assertion is needed to achieve the other effect. If you want to force a matching failure at some point in a pattern, the - most convenient way to do it is with (?!) because an empty string - always matches, so an assertion that requires there not to be an empty + most convenient way to do it is with (?!) because an empty string + always matches, so an assertion that requires there not to be an empty string must always fail. Lookbehind assertions - Lookbehind assertions start with (?<= for positive assertions and (?<! 
+ Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example, (?<!foo)bar - does find an occurrence of "bar" that is not preceded by "foo". The - contents of a lookbehind assertion are restricted such that all the + does find an occurrence of "bar" that is not preceded by "foo". The + contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are sev- - eral top-level alternatives, they do not all have to have the same + eral top-level alternatives, they do not all have to have the same fixed length. Thus (?<=bullock|donkey) @@ -4097,59 +4212,59 @@ ASSERTIONS (?<!dogs?|cats?) - causes an error at compile time. Branches that match different length - strings are permitted only at the top level of a lookbehind assertion. - This is an extension compared with Perl (at least for 5.8), which - requires all branches to match the same length of string. An assertion + causes an error at compile time. Branches that match different length + strings are permitted only at the top level of a lookbehind assertion. + This is an extension compared with Perl (at least for 5.8), which + requires all branches to match the same length of string. An assertion such as (?<=ab(c|de)) - is not permitted, because its single top-level branch can match two - different lengths, but it is acceptable if rewritten to use two top- + is not permitted, because its single top-level branch can match two + different lengths, but it is acceptable if rewritten to use two top- level branches: (?<=abc|abde) In some cases, the Perl 5.10 escape sequence \K (see above) can be used - instead of a lookbehind assertion; this is not restricted to a fixed- + instead of a lookbehind assertion; this is not restricted to a fixed- length. 
- The implementation of lookbehind assertions is, for each alternative, - to temporarily move the current position back by the fixed length and + The implementation of lookbehind assertions is, for each alternative, + to temporarily move the current position back by the fixed length and then try to match. If there are insufficient characters before the cur- rent position, the assertion fails. PCRE does not allow the \C escape (which matches a single byte in UTF-8 - mode) to appear in lookbehind assertions, because it makes it impossi- - ble to calculate the length of the lookbehind. The \X and \R escapes, + mode) to appear in lookbehind assertions, because it makes it impossi- + ble to calculate the length of the lookbehind. The \X and \R escapes, which can match different numbers of bytes, are also not permitted. - Possessive quantifiers can be used in conjunction with lookbehind - assertions to specify efficient matching at the end of the subject + Possessive quantifiers can be used in conjunction with lookbehind + assertions to specify efficient matching at the end of the subject string. Consider a simple pattern such as abcd$ - when applied to a long string that does not match. Because matching + when applied to a long string that does not match. Because matching proceeds from left to right, PCRE will look for each "a" in the subject - and then see if what follows matches the rest of the pattern. If the + and then see if what follows matches the rest of the pattern. If the pattern is specified as ^.*abcd$ - the initial .* matches the entire string at first, but when this fails + the initial .* matches the entire string at first, but when this fails (because there is no following "a"), it backtracks to match all but the - last character, then all but the last two characters, and so on. Once - again the search for "a" covers the entire string, from right to left, + last character, then all but the last two characters, and so on. 
Once + again the search for "a" covers the entire string, from right to left, so we are no better off. However, if the pattern is written as ^.*+(?<=abcd) - there can be no backtracking for the .*+ item; it can match only the - entire string. The subsequent lookbehind assertion does a single test - on the last four characters. If it fails, the match fails immediately. - For long strings, this approach makes a significant difference to the + there can be no backtracking for the .*+ item; it can match only the + entire string. The subsequent lookbehind assertion does a single test + on the last four characters. If it fails, the match fails immediately. + For long strings, this approach makes a significant difference to the processing time. Using multiple assertions @@ -4158,18 +4273,18 @@ ASSERTIONS (?<=\d{3})(?<!999)foo - matches "foo" preceded by three digits that are not "999". Notice that - each of the assertions is applied independently at the same point in - the subject string. First there is a check that the previous three - characters are all digits, and then there is a check that the same + matches "foo" preceded by three digits that are not "999". Notice that + each of the assertions is applied independently at the same point in + the subject string. First there is a check that the previous three + characters are all digits, and then there is a check that the same three characters are not "999". This pattern does not match "foo" pre- - ceded by six characters, the first of which are digits and the last - three of which are not "999". For example, it doesn't match "123abc- + ceded by six characters, the first of which are digits and the last + three of which are not "999". For example, it doesn't match "123abc- foo". 
A pattern to do that is (?<=\d{3}...)(?<!999)foo - This time the first assertion looks at the preceding six characters, + This time the first assertion looks at the preceding six characters, checking that the first three are digits, and then the second assertion checks that the preceding three characters are not "999". @@ -4177,79 +4292,79 @@ ASSERTIONS (?<=(?<!foo)bar)baz - matches an occurrence of "baz" that is preceded by "bar" which in turn + matches an occurrence of "baz" that is preceded by "bar" which in turn is not preceded by "foo", while (?<=\d{3}(?!999)...)foo - is another pattern that matches "foo" preceded by three digits and any + is another pattern that matches "foo" preceded by three digits and any three characters that are not "999". CONDITIONAL SUBPATTERNS - It is possible to cause the matching process to obey a subpattern con- - ditionally or to choose between two alternative subpatterns, depending - on the result of an assertion, or whether a previous capturing subpat- - tern matched or not. The two possible forms of conditional subpattern + It is possible to cause the matching process to obey a subpattern con- + ditionally or to choose between two alternative subpatterns, depending + on the result of an assertion, or whether a previous capturing subpat- + tern matched or not. The two possible forms of conditional subpattern are (?(condition)yes-pattern) (?(condition)yes-pattern|no-pattern) - If the condition is satisfied, the yes-pattern is used; otherwise the - no-pattern (if present) is used. If there are more than two alterna- + If the condition is satisfied, the yes-pattern is used; otherwise the + no-pattern (if present) is used. If there are more than two alterna- tives in the subpattern, a compile-time error occurs. - There are four kinds of condition: references to subpatterns, refer- + There are four kinds of condition: references to subpatterns, refer- ences to recursion, a pseudo-condition called DEFINE, and assertions. 
Checking for a used subpattern by number - If the text between the parentheses consists of a sequence of digits, - the condition is true if the capturing subpattern of that number has - previously matched. An alternative notation is to precede the digits + If the text between the parentheses consists of a sequence of digits, + the condition is true if the capturing subpattern of that number has + previously matched. An alternative notation is to precede the digits with a plus or minus sign. In this case, the subpattern number is rela- tive rather than absolute. The most recently opened parentheses can be - referenced by (?(-1), the next most recent by (?(-2), and so on. In + referenced by (?(-1), the next most recent by (?(-2), and so on. In looping constructs it can also make sense to refer to subsequent groups with constructs such as (?(+2). - Consider the following pattern, which contains non-significant white + Consider the following pattern, which contains non-significant white space to make it more readable (assume the PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: ( \( )? [^()]+ (?(1) \) ) - The first part matches an optional opening parenthesis, and if that + The first part matches an optional opening parenthesis, and if that character is present, sets it as the first captured substring. The sec- - ond part matches one or more characters that are not parentheses. The + ond part matches one or more characters that are not parentheses. The third part is a conditional subpattern that tests whether the first set of parentheses matched or not. If they did, that is, if subject started with an opening parenthesis, the condition is true, and so the yes-pat- - tern is executed and a closing parenthesis is required. Otherwise, - since no-pattern is not present, the subpattern matches nothing. In - other words, this pattern matches a sequence of non-parentheses, + tern is executed and a closing parenthesis is required. 
Otherwise, + since no-pattern is not present, the subpattern matches nothing. In + other words, this pattern matches a sequence of non-parentheses, optionally enclosed in parentheses. - If you were embedding this pattern in a larger one, you could use a + If you were embedding this pattern in a larger one, you could use a relative reference: ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... - This makes the fragment independent of the parentheses in the larger + This makes the fragment independent of the parentheses in the larger pattern. Checking for a used subpattern by name - Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a - used subpattern by name. For compatibility with earlier versions of - PCRE, which had this facility before Perl, the syntax (?(name)...) is - also recognized. However, there is a possible ambiguity with this syn- - tax, because subpattern names may consist entirely of digits. PCRE - looks first for a named subpattern; if it cannot find one and the name - consists entirely of digits, PCRE looks for a subpattern of that num- - ber, which must be greater than zero. Using subpattern names that con- + Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a + used subpattern by name. For compatibility with earlier versions of + PCRE, which had this facility before Perl, the syntax (?(name)...) is + also recognized. However, there is a possible ambiguity with this syn- + tax, because subpattern names may consist entirely of digits. PCRE + looks first for a named subpattern; if it cannot find one and the name + consists entirely of digits, PCRE looks for a subpattern of that num- + ber, which must be greater than zero. Using subpattern names that con- sist entirely of digits is not recommended. 
Rewriting the above example to use a named subpattern gives this: @@ -4260,85 +4375,85 @@ CONDITIONAL SUBPATTERNS Checking for pattern recursion If the condition is the string (R), and there is no subpattern with the - name R, the condition is true if a recursive call to the whole pattern + name R, the condition is true if a recursive call to the whole pattern or any subpattern has been made. If digits or a name preceded by amper- sand follow the letter R, for example: (?(R3)...) or (?(R&name)...) - the condition is true if the most recent recursion is into the subpat- - tern whose number or name is given. This condition does not check the + the condition is true if the most recent recursion is into the subpat- + tern whose number or name is given. This condition does not check the entire recursion stack. - At "top level", all these recursion test conditions are false. Recur- + At "top level", all these recursion test conditions are false. Recur- sive patterns are described below. Defining subpatterns for use by reference only - If the condition is the string (DEFINE), and there is no subpattern - with the name DEFINE, the condition is always false. In this case, - there may be only one alternative in the subpattern. It is always - skipped if control reaches this point in the pattern; the idea of - DEFINE is that it can be used to define "subroutines" that can be ref- - erenced from elsewhere. (The use of "subroutines" is described below.) - For example, a pattern to match an IPv4 address could be written like + If the condition is the string (DEFINE), and there is no subpattern + with the name DEFINE, the condition is always false. In this case, + there may be only one alternative in the subpattern. It is always + skipped if control reaches this point in the pattern; the idea of + DEFINE is that it can be used to define "subroutines" that can be ref- + erenced from elsewhere. (The use of "subroutines" is described below.) 
+ For example, a pattern to match an IPv4 address could be written like this (ignore whitespace and line breaks): (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) \b (?&byte) (\.(?&byte)){3} \b - The first part of the pattern is a DEFINE group inside which a another - group named "byte" is defined. This matches an individual component of - an IPv4 address (a number less than 256). When matching takes place, - this part of the pattern is skipped because DEFINE acts like a false + The first part of the pattern is a DEFINE group inside which a another + group named "byte" is defined. This matches an individual component of + an IPv4 address (a number less than 256). When matching takes place, + this part of the pattern is skipped because DEFINE acts like a false condition. The rest of the pattern uses references to the named group to match the - four dot-separated components of an IPv4 address, insisting on a word + four dot-separated components of an IPv4 address, insisting on a word boundary at each end. Assertion conditions - If the condition is not in any of the above formats, it must be an - assertion. This may be a positive or negative lookahead or lookbehind - assertion. Consider this pattern, again containing non-significant + If the condition is not in any of the above formats, it must be an + assertion. This may be a positive or negative lookahead or lookbehind + assertion. Consider this pattern, again containing non-significant white space, and with the two alternatives on the second line: (?(?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) - The condition is a positive lookahead assertion that matches an - optional sequence of non-letters followed by a letter. In other words, - it tests for the presence of at least one letter in the subject. If a - letter is found, the subject is matched against the first alternative; - otherwise it is matched against the second. 
This pattern matches - strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are + The condition is a positive lookahead assertion that matches an + optional sequence of non-letters followed by a letter. In other words, + it tests for the presence of at least one letter in the subject. If a + letter is found, the subject is matched against the first alternative; + otherwise it is matched against the second. This pattern matches + strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. COMMENTS - The sequence (?# marks the start of a comment that continues up to the - next closing parenthesis. Nested parentheses are not permitted. The - characters that make up a comment play no part in the pattern matching + The sequence (?# marks the start of a comment that continues up to the + next closing parenthesis. Nested parentheses are not permitted. The + characters that make up a comment play no part in the pattern matching at all. - If the PCRE_EXTENDED option is set, an unescaped # character outside a - character class introduces a comment that continues to immediately + If the PCRE_EXTENDED option is set, an unescaped # character outside a + character class introduces a comment that continues to immediately after the next newline in the pattern. RECURSIVE PATTERNS - Consider the problem of matching a string in parentheses, allowing for - unlimited nested parentheses. Without the use of recursion, the best - that can be done is to use a pattern that matches up to some fixed - depth of nesting. It is not possible to handle an arbitrary nesting + Consider the problem of matching a string in parentheses, allowing for + unlimited nested parentheses. Without the use of recursion, the best + that can be done is to use a pattern that matches up to some fixed + depth of nesting. It is not possible to handle an arbitrary nesting depth. 
For some time, Perl has provided a facility that allows regular expres- - sions to recurse (amongst other things). It does this by interpolating - Perl code in the expression at run time, and the code can refer to the + sions to recurse (amongst other things). It does this by interpolating + Perl code in the expression at run time, and the code can refer to the expression itself. A Perl pattern using code interpolation to solve the parentheses problem can be created like this: @@ -4348,117 +4463,117 @@ RECURSIVE PATTERNS refers recursively to the pattern in which it appears. Obviously, PCRE cannot support the interpolation of Perl code. Instead, - it supports special syntax for recursion of the entire pattern, and - also for individual subpattern recursion. After its introduction in - PCRE and Python, this kind of recursion was introduced into Perl at + it supports special syntax for recursion of the entire pattern, and + also for individual subpattern recursion. After its introduction in + PCRE and Python, this kind of recursion was introduced into Perl at release 5.10. - A special item that consists of (? followed by a number greater than + A special item that consists of (? followed by a number greater than zero and a closing parenthesis is a recursive call of the subpattern of - the given number, provided that it occurs inside that subpattern. (If - not, it is a "subroutine" call, which is described in the next sec- - tion.) The special item (?R) or (?0) is a recursive call of the entire + the given number, provided that it occurs inside that subpattern. (If + not, it is a "subroutine" call, which is described in the next sec- + tion.) The special item (?R) or (?0) is a recursive call of the entire regular expression. - In PCRE (like Python, but unlike Perl), a recursive subpattern call is + In PCRE (like Python, but unlike Perl), a recursive subpattern call is always treated as an atomic group. 
That is, once it has matched some of the subject string, it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure. - This PCRE pattern solves the nested parentheses problem (assume the + This PCRE pattern solves the nested parentheses problem (assume the PCRE_EXTENDED option is set so that white space is ignored): \( ( (?>[^()]+) | (?R) )* \) - First it matches an opening parenthesis. Then it matches any number of - substrings which can either be a sequence of non-parentheses, or a - recursive match of the pattern itself (that is, a correctly parenthe- + First it matches an opening parenthesis. Then it matches any number of + substrings which can either be a sequence of non-parentheses, or a + recursive match of the pattern itself (that is, a correctly parenthe- sized substring). Finally there is a closing parenthesis. - If this were part of a larger pattern, you would not want to recurse + If this were part of a larger pattern, you would not want to recurse the entire pattern, so instead you could use this: ( \( ( (?>[^()]+) | (?1) )* \) ) - We have put the pattern into parentheses, and caused the recursion to + We have put the pattern into parentheses, and caused the recursion to refer to them instead of the whole pattern. - In a larger pattern, keeping track of parenthesis numbers can be - tricky. This is made easier by the use of relative references. (A Perl - 5.10 feature.) Instead of (?1) in the pattern above you can write + In a larger pattern, keeping track of parenthesis numbers can be + tricky. This is made easier by the use of relative references. (A Perl + 5.10 feature.) Instead of (?1) in the pattern above you can write (?-2) to refer to the second most recently opened parentheses preceding - the recursion. In other words, a negative number counts capturing + the recursion. In other words, a negative number counts capturing parentheses leftwards from the point at which it is encountered. 
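Editorial illustration (not part of the commit): to make the control flow of the nested-parentheses pattern above concrete, here is a hand-written matcher with the same shape. It is plain Python, not PCRE, because Python's standard `re` module has no `(?R)` item. The function name `match_parens` is hypothetical. Unlike an unanchored regex search, it only tests for a group that starts exactly at `pos`.

```python
def match_parens(s, pos=0):
    """Return the index just past a balanced parenthesized group that
    starts at s[pos], or -1 if there is no such group."""
    # \(  : the group must open with a parenthesis
    if pos >= len(s) or s[pos] != "(":
        return -1
    pos += 1
    # ( ... | ... )*  : any number of items until the closing parenthesis
    while pos < len(s) and s[pos] != ")":
        if s[pos] == "(":
            # the (?R) branch: a nested, correctly parenthesized substring
            pos = match_parens(s, pos)
            if pos == -1:
                return -1
        else:
            # the (?>[^()]+) branch: consume a whole run of
            # non-parentheses and never give any of it back
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    # \)  : the group must close; running off the end is a failure
    if pos < len(s):
        return pos + 1
    return -1
```

The inner loop never gives back characters it has consumed, which loosely mirrors the atomic group in the pattern. Very deep nesting would hit Python's recursion limit, a restriction of this sketch rather than of PCRE.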
- It is also possible to refer to subsequently opened parentheses, by - writing references such as (?+2). However, these cannot be recursive - because the reference is not inside the parentheses that are refer- - enced. They are always "subroutine" calls, as described in the next + It is also possible to refer to subsequently opened parentheses, by + writing references such as (?+2). However, these cannot be recursive + because the reference is not inside the parentheses that are refer- + enced. They are always "subroutine" calls, as described in the next section. - An alternative approach is to use named parentheses instead. The Perl - syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also + An alternative approach is to use named parentheses instead. The Perl + syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also supported. We could rewrite the above example as follows: (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) - If there is more than one subpattern with the same name, the earliest + If there is more than one subpattern with the same name, the earliest one is used. - This particular example pattern that we have been looking at contains - nested unlimited repeats, and so the use of atomic grouping for match- - ing strings of non-parentheses is important when applying the pattern + This particular example pattern that we have been looking at contains + nested unlimited repeats, and so the use of atomic grouping for match- + ing strings of non-parentheses is important when applying the pattern to strings that do not match. For example, when this pattern is applied to (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() - it yields "no match" quickly. However, if atomic grouping is not used, - the match runs for a very long time indeed because there are so many - different ways the + and * repeats can carve up the subject, and all + it yields "no match" quickly. 
However, if atomic grouping is not used, + the match runs for a very long time indeed because there are so many + different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported. At the end of a match, the values set for any capturing subpatterns are those from the outermost level of the recursion at which the subpattern - value is set. If you want to obtain intermediate values, a callout - function can be used (see below and the pcrecallout documentation). If + value is set. If you want to obtain intermediate values, a callout + function can be used (see below and the pcrecallout documentation). If the pattern above is matched against (ab(cd)ef) - the value for the capturing parentheses is "ef", which is the last - value taken on at the top level. If additional parentheses are added, + the value for the capturing parentheses is "ef", which is the last + value taken on at the top level. If additional parentheses are added, giving \( ( ( (?>[^()]+) | (?R) )* ) \) ^ ^ ^ ^ - the string they capture is "ab(cd)ef", the contents of the top level - parentheses. If there are more than 15 capturing parentheses in a pat- + the string they capture is "ab(cd)ef", the contents of the top level + parentheses. If there are more than 15 capturing parentheses in a pat- tern, PCRE has to obtain extra memory to store data during a recursion, - which it does by using pcre_malloc, freeing it via pcre_free after- - wards. If no memory can be obtained, the match fails with the + which it does by using pcre_malloc, freeing it via pcre_free after- + wards. If no memory can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. - Do not confuse the (?R) item with the condition (R), which tests for - recursion. Consider this pattern, which matches text in angle brack- - ets, allowing for arbitrary nesting. 
Only digits are allowed in nested - brackets (that is, when recursing), whereas any characters are permit- + Do not confuse the (?R) item with the condition (R), which tests for + recursion. Consider this pattern, which matches text in angle brack- + ets, allowing for arbitrary nesting. Only digits are allowed in nested + brackets (that is, when recursing), whereas any characters are permit- ted at the outer level. < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > - In this pattern, (?(R) is the start of a conditional subpattern, with - two different alternatives for the recursive and non-recursive cases. + In this pattern, (?(R) is the start of a conditional subpattern, with + two different alternatives for the recursive and non-recursive cases. The (?R) item is the actual recursive call. SUBPATTERNS AS SUBROUTINES If the syntax for a recursive subpattern reference (either by number or - by name) is used outside the parentheses to which it refers, it oper- - ates like a subroutine in a programming language. The "called" subpat- + by name) is used outside the parentheses to which it refers, it oper- + ates like a subroutine in a programming language. The "called" subpat- tern may be defined before or after the reference. A numbered reference can be absolute or relative, as in these examples: @@ -4470,65 +4585,182 @@ SUBPATTERNS AS SUBROUTINES (sens|respons)e and \1ibility - matches "sense and sensibility" and "response and responsibility", but + matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If instead the pattern (sens|respons)e and (?1)ibility - is used, it does match "sense and responsibility" as well as the other - two strings. Another example is given in the discussion of DEFINE + is used, it does match "sense and responsibility" as well as the other + two strings. Another example is given in the discussion of DEFINE above. Like recursive subpatterns, a "subroutine" call is always treated as an - atomic group. 
That is, once it has matched some of the subject string, - it is never re-entered, even if it contains untried alternatives and + atomic group. That is, once it has matched some of the subject string, + it is never re-entered, even if it contains untried alternatives and there is a subsequent matching failure. - When a subpattern is used as a subroutine, processing options such as + When a subpattern is used as a subroutine, processing options such as case-independence are fixed when the subpattern is defined. They cannot be changed for different calls. For example, consider this pattern: (abc)(?i:(?-1)) - It matches "abcabc". It does not match "abcABC" because the change of + It matches "abcabc". It does not match "abcABC" because the change of processing option does not affect the called subpattern. CALLOUTS Perl has a feature whereby using the sequence (?{...}) causes arbitrary - Perl code to be obeyed in the middle of matching a regular expression. + Perl code to be obeyed in the middle of matching a regular expression. This makes it possible, amongst other things, to extract different sub- strings that match the same pair of parentheses when there is a repeti- tion. PCRE provides a similar feature, but of course it cannot obey arbitrary Perl code. The feature is called "callout". The caller of PCRE provides - an external function by putting its entry point in the global variable - pcre_callout. By default, this variable contains NULL, which disables + an external function by putting its entry point in the global variable + pcre_callout. By default, this variable contains NULL, which disables all calling out. - Within a regular expression, (?C) indicates the points at which the - external function is to be called. If you want to identify different - callout points, you can put a number less than 256 after the letter C. - The default value is zero. 
For example, this pattern has two callout + Within a regular expression, (?C) indicates the points at which the + external function is to be called. If you want to identify different + callout points, you can put a number less than 256 after the letter C. + The default value is zero. For example, this pattern has two callout points: (?C1)abc(?C2)def If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are - automatically installed before each item in the pattern. They are all + automatically installed before each item in the pattern. They are all numbered 255. During matching, when PCRE reaches a callout point (and pcre_callout is - set), the external function is called. It is provided with the number - of the callout, the position in the pattern, and, optionally, one item - of data originally supplied by the caller of pcre_exec(). The callout - function may cause matching to proceed, to backtrack, or to fail alto- + set), the external function is called. It is provided with the number + of the callout, the position in the pattern, and, optionally, one item + of data originally supplied by the caller of pcre_exec(). The callout + function may cause matching to proceed, to backtrack, or to fail alto- gether. A complete description of the interface to the callout function is given in the pcrecallout documentation. +BACTRACKING CONTROL + + Perl 5.10 introduced a number of "Special Backtracking Control Verbs", + which are described in the Perl documentation as "experimental and sub- + ject to change or removal in a future version of Perl". It goes on to + say: "Their usage in production code should be noted to avoid problems + during upgrades." The same remarks apply to the PCRE features described + in this section. + + Since these verbs are specifically related to backtracking, they can be + used only when the pattern is to be matched using pcre_exec(), which + uses a backtracking algorithm. They cause an error if encountered by + pcre_dfa_exec(). 
+ + The new verbs make use of what was previously invalid syntax: an open- + ing parenthesis followed by an asterisk. In Perl, they are generally of + the form (*VERB:ARG) but PCRE does not support the use of arguments, so + its general form is just (*VERB). Any number of these verbs may occur + in a pattern. There are two kinds: + + Verbs that act immediately + + The following verbs act as soon as they are encountered: + + (*ACCEPT) + + This verb causes the match to end successfully, skipping the remainder + of the pattern. When inside a recursion, only the innermost pattern is + ended immediately. PCRE differs from Perl in what happens if the + (*ACCEPT) is inside capturing parentheses. In Perl, the data so far is + captured: in PCRE no data is captured. For example: + + A(A|B(*ACCEPT)|C)D + + This matches "AB", "AAD", or "ACD", but when it matches "AB", no data + is captured. + + (*FAIL) or (*F) + + This verb causes the match to fail, forcing backtracking to occur. It + is equivalent to (?!) but easier to read. The Perl documentation notes + that it is probably useful only when combined with (?{}) or (??{}). + Those are, of course, Perl features that are not present in PCRE. The + nearest equivalent is the callout feature, as for example in this pat- + tern: + + a+(?C)(*FAIL) + + A match with the string "aaaa" always fails, but the callout is taken + before each backtrack happens (in this example, 10 times). + + Verbs that act after backtracking + + The following verbs do nothing when they are encountered. Matching con- + tinues with what follows, but if there is no subsequent match, a fail- + ure is forced. The verbs differ in exactly what kind of failure + occurs. + + (*COMMIT) + + This verb causes the whole match to fail outright if the rest of the + pattern does not match. Even if the pattern is unanchored, no further + attempts to find a match by advancing the start point take place. 
Once + (*COMMIT) has been passed, pcre_exec() is committed to finding a match + at the current starting point, or not at all. For example: + + a+(*COMMIT)b + + This matches "xxaab" but not "aacaab". It can be thought of as a kind + of dynamic anchor, or "I've started, so I must finish." + + (*PRUNE) + + This verb causes the match to fail at the current position if the rest + of the pattern does not match. If the pattern is unanchored, the normal + "bumpalong" advance to the next starting character then happens. Back- + tracking can occur as usual to the left of (*PRUNE), or when matching + to the right of (*PRUNE), but if there is no match to the right, back- + tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) + is just an alternative to an atomic group or possessive quantifier, but + there are some uses of (*PRUNE) that cannot be expressed in any other + way. + + (*SKIP) + + This verb is like (*PRUNE), except that if the pattern is unanchored, + the "bumpalong" advance is not to the next character, but to the posi- + tion in the subject where (*SKIP) was encountered. (*SKIP) signifies + that whatever text was matched leading up to it cannot be part of a + successful match. Consider: + + a+(*SKIP)b + + If the subject is "aaaac...", after the first match attempt fails + (starting at the first character in the string), the starting point + skips on to start the next attempt at "c". Note that a possessive quan- + tifer does not have the same effect in this example; although it would + suppress backtracking during the first match attempt, the second + attempt would start at the second character instead of skipping on to + "c". + + (*THEN) + + This verb causes a skip to the next alternation if the rest of the pat- + tern does not match. That is, it cancels pending backtracking, but only + within the current alternation. 
Its name comes from the observation + that it can be used for a pattern-based if-then-else block: + + ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... + + If the COND1 pattern matches, FOO is tried (and possibly further items + after the end of the group if FOO succeeds); on failure the matcher + skips to the second alternative and tries COND2, without backtracking + into COND1. If (*THEN) is used outside of any alternation, it acts + exactly like (*PRUNE). + + SEE ALSO pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3). @@ -4543,7 +4775,335 @@ AUTHOR REVISION - Last updated: 19 June 2007 + Last updated: 21 August 2007 + Copyright (c) 1997-2007 University of Cambridge. +------------------------------------------------------------------------------ + + +PCRESYNTAX(3) PCRESYNTAX(3) + + +NAME + PCRE - Perl-compatible regular expressions + + +PCRE REGULAR EXPRESSION SYNTAX SUMMARY + + The full syntax and semantics of the regular expressions that are sup- + ported by PCRE are described in the pcrepattern documentation. This + document contains just a quick-reference summary of the syntax. + + +QUOTING + + \x where x is non-alphanumeric is a literal x + \Q...\E treat enclosed characters as literal + + +CHARACTERS + + \a alarm, that is, the BEL character (hex 07) + \cx "control-x", where x is any character + \e escape (hex 1B) + \f formfeed (hex 0C) + \n newline (hex 0A) + \r carriage return (hex 0D) + \t tab (hex 09) + \ddd character with octal code ddd, or backreference + \xhh character with hex code hh + \x{hhh..} character with hex code hhh.. + + +CHARACTER TYPES + + . 
any character except newline; + in dotall mode, any character whatsoever + \C one byte, even in UTF-8 mode (best avoided) + \d a decimal digit + \D a character that is not a decimal digit + \h a horizontal whitespace character + \H a character that is not a horizontal whitespace character + \p{xx} a character with the xx property + \P{xx} a character without the xx property + \R a newline sequence + \s a whitespace character + \S a character that is not a whitespace character + \v a vertical whitespace character + \V a character that is not a vertical whitespace character + \w a "word" character + \W a "non-word" character + \X an extended Unicode sequence + + In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters. + + +GENERAL CATEGORY PROPERTY CODES FOR \p and \P + + C Other + Cc Control + Cf Format + Cn Unassigned + Co Private use + Cs Surrogate + + L Letter + Ll Lower case letter + Lm Modifier letter + Lo Other letter + Lt Title case letter + Lu Upper case letter + L& Ll, Lu, or Lt + + M Mark + Mc Spacing mark + Me Enclosing mark + Mn Non-spacing mark + + N Number + Nd Decimal number + Nl Letter number + No Other number + + P Punctuation + Pc Connector punctuation + Pd Dash punctuation + Pe Close punctuation + Pf Final punctuation + Pi Initial punctuation + Po Other punctuation + Ps Open punctuation + + S Symbol + Sc Currency symbol + Sk Modifier symbol + Sm Mathematical symbol + So Other symbol + + Z Separator + Zl Line separator + Zp Paragraph separator + Zs Space separator + + +SCRIPT NAMES FOR \p AND \P + + Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, + Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, + Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, + Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- + gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, + Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, + Ogham, Old_Italic, 
Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, + Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, + Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. + + +CHARACTER CLASSES + + [...] positive character class + [^...] negative character class + [x-y] range (can be used for hex characters) + [[:xxx:]] positive POSIX named set + [[^:xxx:]] negative POSIX named set + + alnum alphanumeric + alpha alphabetic + ascii 0-127 + blank space or tab + cntrl control character + digit decimal digit + graph printing, excluding space + lower lower case letter + print printing, including space + punct printing, excluding alphanumeric + space whitespace + upper upper case letter + word same as \w + xdigit hexadecimal digit + + In PCRE, POSIX character set names recognize only ASCII characters. You + can use \Q...\E inside a character class. + + +QUANTIFIERS + + ? 0 or 1, greedy + ?+ 0 or 1, possessive + ?? 0 or 1, lazy + * 0 or more, greedy + *+ 0 or more, possessive + *? 0 or more, lazy + + 1 or more, greedy + ++ 1 or more, possessive + +? 1 or more, lazy + {n} exactly n + {n,m} at least n, no more than m, greedy + {n,m}+ at least n, no more than m, possessive + {n,m}? at least n, no more than m, lazy + {n,} n or more, greedy + {n,}+ n or more, possessive + {n,}? n or more, lazy + + +ANCHORS AND SIMPLE ASSERTIONS + + \b word boundary + \B not a word boundary + ^ start of subject + also after internal newline in multiline mode + \A start of subject + $ end of subject + also before newline at end of subject + also before internal newline in multiline mode + \Z end of subject + also before newline at end of subject + \z end of subject + \G first matching position in subject + + +MATCH POINT RESET + + \K reset start of match + + +ALTERNATION + + expr|expr|expr... + + +CAPTURING + + (...) capturing group + (?<name>...) named capturing group (Perl) + (?'name'...) named capturing group (Perl) + (?P<name>...) named capturing group (Python) + (?:...) 
non-capturing group + (?|...) non-capturing group; reset group numbers for + capturing groups in each alternative + + +ATOMIC GROUPS + + (?>...) atomic, non-capturing group + + +COMMENT + + (?#....) comment (not nestable) + + +OPTION SETTING + + (?i) caseless + (?J) allow duplicate names + (?m) multiline + (?s) single line (dotall) + (?U) default ungreedy (lazy) + (?x) extended (ignore white space) + (?-...) unset option(s) + + +LOOKAHEAD AND LOOKBEHIND ASSERTIONS + + (?=...) positive look ahead + (?!...) negative look ahead + (?<=...) positive look behind + (?<!...) negative look behind + + Each top-level branch of a look behind must be of a fixed length. + + +BACKREFERENCES + + \n reference by number (can be ambiguous) + \gn reference by number + \g{n} reference by number + \g{-n} relative reference by number + \k<name> reference by name (Perl) + \k'name' reference by name (Perl) + \g{name} reference by name (Perl) + \k{name} reference by name (.NET) + (?P=name) reference by name (Python) + + +SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) + + (?R) recurse whole pattern + (?n) call subpattern by absolute number + (?+n) call subpattern by relative number + (?-n) call subpattern by relative number + (?&name) call subpattern by name (Perl) + (?P>name) call subpattern by name (Python) + + +CONDITIONAL PATTERNS + + (?(condition)yes-pattern) + (?(condition)yes-pattern|no-pattern) + + (?(n)... absolute reference condition + (?(+n)... relative reference condition + (?(-n)... relative reference condition + (?(<name>)... named reference condition (Perl) + (?('name')... named reference condition (Perl) + (?(name)... named reference condition (PCRE) + (?(R)... overall recursion condition + (?(Rn)... specific group recursion condition + (?(R&name)... specific recursion condition + (?(DEFINE)... define subpattern for reference + (?(assert)... 
assertion condition + + +BACKTRACKING CONTROL + + The following act immediately they are reached: + + (*ACCEPT) force successful match + (*FAIL) force backtrack; synonym (*F) + + The following act only when a subsequent match failure causes a back- + track to reach them. They all force a match failure, but they differ in + what happens afterwards. Those that advance the start-of-match point do + so only if the pattern is not anchored. + + (*COMMIT) overall failure, no advance of starting point + (*PRUNE) advance to next starting character + (*SKIP) advance start to current matching position + (*THEN) local failure, backtrack to next alternation + + +NEWLINE CONVENTIONS + + These are recognized only at the very start of a pattern. + + (*CR) + (*LF) + (*CRLF) + (*ANYCRLF) + (*ANY) + + +CALLOUTS + + (?C) callout + (?Cn) callout with data n + + +SEE ALSO + + pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3). + + +AUTHOR + + Philip Hazel + University Computing Service + Cambridge CB2 3QH, England. + + +REVISION + + Last updated: 21 August 2007 Copyright (c) 1997-2007 University of Cambridge. ------------------------------------------------------------------------------ diff --git a/ext/pcre/pcrelib/pcre.h b/ext/pcre/pcrelib/pcre.h index 7399896298..a002bbd5aa 100644 --- a/ext/pcre/pcrelib/pcre.h +++ b/ext/pcre/pcrelib/pcre.h @@ -42,19 +42,25 @@ POSSIBILITY OF SUCH DAMAGE. /* The current PCRE version information. */ #define PCRE_MAJOR 7 -#define PCRE_MINOR 2 +#define PCRE_MINOR 3 #define PCRE_PRERELEASE -#define PCRE_DATE 2007-06-19 +#define PCRE_DATE 2007-08-28 /* When an application links to a PCRE DLL in Windows, the symbols that are imported have to be identified as such. When building PCRE, the appropriate export setting is defined in pcre_internal.h, which includes this file. So we -don't change an existing definition of PCRE_EXP_DECL. */ +don't change existing definitions of PCRE_EXP_DECL and PCRECPP_EXP_DECL. 
*/

-#ifndef PCRE_EXP_DECL
-#  ifdef _WIN32
-#    ifndef PCRE_STATIC
-#      define PCRE_EXP_DECL extern __declspec(dllimport)
+#if defined(_WIN32) && !defined(PCRE_STATIC)
+#  ifndef PCRE_EXP_DECL
+#    define PCRE_EXP_DECL extern __declspec(dllimport)
+#  endif
+#  ifdef __cplusplus
+#    ifndef PCRECPP_EXP_DECL
+#      define PCRECPP_EXP_DECL extern __declspec(dllimport)
+#    endif
+#    ifndef PCRECPP_EXP_DEFN
+#      define PCRECPP_EXP_DEFN __declspec(dllimport)
 #    endif
 #  endif
 #endif
@@ -63,9 +69,18 @@ don't change an existing definition of PCRE_EXP_DECL. */

 #ifndef PCRE_EXP_DECL
 #  ifdef __cplusplus
-#    define PCRE_EXP_DECL extern "C"
+#    define PCRE_EXP_DECL extern "C"
 #  else
-#    define PCRE_EXP_DECL extern
+#    define PCRE_EXP_DECL extern
+#  endif
+#endif
+
+#ifdef __cplusplus
+#  ifndef PCRECPP_EXP_DECL
+#    define PCRECPP_EXP_DECL extern
+#  endif
+#  ifndef PCRECPP_EXP_DEFN
+#    define PCRECPP_EXP_DEFN
 #  endif
 #endif

@@ -132,7 +147,7 @@ extern "C" {
 #define PCRE_ERROR_DFA_WSSIZE     (-19)
 #define PCRE_ERROR_DFA_RECURSE    (-20)
 #define PCRE_ERROR_RECURSIONLIMIT (-21)
-#define PCRE_ERROR_NULLWSLIMIT    (-22)
+#define PCRE_ERROR_NOTUSED        (-22)
 #define PCRE_ERROR_BADNEWLINE     (-23)

 /* Request types for pcre_fullinfo() */
@@ -152,6 +167,7 @@ extern "C" {
 #define PCRE_INFO_DEFAULT_TABLES    11
 #define PCRE_INFO_OKPARTIAL         12
 #define PCRE_INFO_JCHANGED          13
+#define PCRE_INFO_HASCRORLF         14

 /* Request types for pcre_config(). Do not re-arrange, in order to remain
 compatible. */
diff --git a/ext/pcre/pcrelib/pcre_chartables.c b/ext/pcre/pcrelib/pcre_chartables.c
index 6494d8e98c..3d6a4fff9c 100644
--- a/ext/pcre/pcrelib/pcre_chartables.c
+++ b/ext/pcre/pcrelib/pcre_chartables.c
@@ -14,12 +14,16 @@ example ISO-8859-1. When dftables is run, it creates these tables in the
 current locale. If PCRE is configured with --enable-rebuild-chartables, this
 happens automatically.

-The following #include is present because without it gcc 4.x may remove the
+The following #includes are present because without the gcc 4.x may remove the
 array definition from the final binary if PCRE is built into a static library
 and dead code stripping is activated. This leads to link errors. Pulling in the
 header ensures that the array gets flagged as "someone outside this compilation
 unit might reference this" and so it will always be supplied to the linker. */

+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
 #include "pcre_internal.h"

 const unsigned char _pcre_default_tables[] = {
diff --git a/ext/pcre/pcrelib/pcre_compile.c b/ext/pcre/pcrelib/pcre_compile.c
index c191539c8b..8361e148bb 100644
--- a/ext/pcre/pcrelib/pcre_compile.c
+++ b/ext/pcre/pcrelib/pcre_compile.c
@@ -42,11 +42,14 @@ POSSIBILITY OF SUCH DAMAGE.
 supporting internal functions that are not used by other modules. */


+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
 #define NLBLOCK cd             /* Block containing newline information */
 #define PSSTART start_pattern  /* Field containing processed string start */
 #define PSEND   end_pattern    /* Field containing processed string end */

-
 #include "pcre_internal.h"


@@ -62,6 +65,13 @@ used by pcretest. DEBUG is not defined when building a production library. */

 #define SETBIT(a,b) a[b/8] |= (1 << (b%8))

+/* Maximum length value to check against when making sure that the integer that
+holds the compiled pattern length does not overflow. We make it a bit less than
+INT_MAX to allow for adding in group terminating bytes, so that we don't have
+to check them every time.
*/
+
+#define OFLOW_MAX (INT_MAX - 20)
+

 /*************************************************
 *      Code parameters and static tables         *
@@ -120,7 +130,7 @@ static const short int escapes[] = {
 /*  B8 */ 0, 0, 0, 0, 0, ']', '=', '-',
 /*  C0 */ '{',-ESC_A, -ESC_B, -ESC_C, -ESC_D,-ESC_E, 0, -ESC_G,
 /*  C8 */-ESC_H, 0, 0, 0, 0, 0, 0, 0,
-/*  D0 */ '}', 0, 0, 0, 0, 0, 0, -ESC_P,
+/*  D0 */ '}', 0, -ESC_K, 0, 0, 0, 0, -ESC_P,
 /*  D8 */-ESC_Q,-ESC_R, 0, 0, 0, 0, 0, 0,
 /*  E0 */ '\\', 0, -ESC_S, 0, 0,-ESC_V, -ESC_W, -ESC_X,
 /*  E8 */ 0,-ESC_Z, 0, 0, 0, 0, 0, 0,
@@ -130,6 +140,27 @@ static const short int escapes[] = {
 #endif


+/* Table of special "verbs" like (*PRUNE) */
+
+typedef struct verbitem {
+  const char *name;
+  int len;
+  int op;
+} verbitem;
+
+static verbitem verbs[] = {
+  { "ACCEPT", 6, OP_ACCEPT },
+  { "COMMIT", 6, OP_COMMIT },
+  { "F", 1, OP_FAIL },
+  { "FAIL", 4, OP_FAIL },
+  { "PRUNE", 5, OP_PRUNE },
+  { "SKIP", 4, OP_SKIP },
+  { "THEN", 4, OP_THEN }
+};
+
+static int verbcount = sizeof(verbs)/sizeof(verbitem);
+
+
 /* Tables of names of POSIX character classes and their lengths. The list is
 terminated by a zero length entry. The first three must be alpha, lower, upper,
 as this is assumed for handling case independence.
*/
@@ -203,7 +234,7 @@ static const char *error_texts[] = {
   "missing ) after comment",
   "parentheses nested too deeply", /** DEAD **/
   /* 20 */
-  "regular expression too large",
+  "regular expression is too large",
   "failed to get memory",
   "unmatched parentheses",
   "internal error: code overflow",
@@ -239,7 +270,7 @@ static const char *error_texts[] = {
   "subpattern name is too long (maximum " XSTRING(MAX_NAME_SIZE) " characters)",
   "too many named subpatterns (maximum " XSTRING(MAX_NAME_COUNT) ")",
   /* 50 */
-  "repeated subpattern is too long",
+  "repeated subpattern is too long", /** DEAD **/
   "octal value is greater than \\377 (not in UTF-8 mode)",
   "internal error: overran compiling workspace",
   "internal error: previously-checked referenced subpattern not found",
@@ -248,7 +279,11 @@ static const char *error_texts[] = {
   "repeating a DEFINE group is not allowed",
   "inconsistent NEWLINE options",
   "\\g is not followed by a braced name or an optionally braced non-zero number",
-  "(?+ or (?- or (?(+ or (?(- must be followed by a non-zero number"
+  "(?+ or (?- or (?(+ or (?(- must be followed by a non-zero number",
+  "(*VERB) with an argument is not supported",
+  /* 60 */
+  "(*VERB) not recognized",
+  "number is too big"
 };


@@ -405,7 +440,7 @@ Arguments:

 Returns:         zero or positive => a data character
                  negative => a special escape sequence
-                 on error, errorptr is set
+                 on error, errorcodeptr is set
 */

 static int
@@ -490,10 +525,16 @@ else
     while ((digitab[ptr[1]] & ctype_digit) != 0)
       c = c * 10 + *(++ptr) - '0';

+    if (c < 0)
+      {
+      *errorcodeptr = ERR61;
+      break;
+      }
+
     if (c == 0 || (braced && *(++ptr) != '}'))
       {
       *errorcodeptr = ERR57;
-      return 0;
+      break;
       }

     if (negated)
@@ -501,7 +542,7 @@ else
       if (c > bracount)
         {
         *errorcodeptr = ERR15;
-        return 0;
+        break;
         }
       c = bracount - (c - 1);
       }
@@ -530,6 +571,11 @@ else
       c -= '0';
       while ((digitab[ptr[1]] & ctype_digit) != 0)
         c = c * 10 + *(++ptr) - '0';
+      if (c < 0)
+        {
+        *errorcodeptr = ERR61;
+        break;
+        }
       if (c < 10 || c <= bracount)
         {
         c =
-(ESC_REF + c);
@@ -625,7 +671,7 @@ else
     if (c == 0)
       {
       *errorcodeptr = ERR2;
-      return 0;
+      break;
       }

 #ifndef EBCDIC  /* ASCII coding */
@@ -701,7 +747,7 @@ if (c == '{')
     *negptr = TRUE;
     ptr++;
     }
-  for (i = 0; i < sizeof(name) - 1; i++)
+  for (i = 0; i < (int)sizeof(name) - 1; i++)
     {
     c = *(++ptr);
     if (c == 0) goto ERROR_RETURN;
@@ -904,6 +950,7 @@ for (; *ptr != 0; ptr++)
     {
     while (*(++ptr) != ']')
       {
+      if (*ptr == 0) return -1;
       if (*ptr == '\\')
         {
         if (*(++ptr) == 0) return -1;
@@ -931,7 +978,7 @@ for (; *ptr != 0; ptr++)
   /* An opening parens must now be a real metacharacter */

   if (*ptr != '(') continue;
-  if (ptr[1] != '?')
+  if (ptr[1] != '?' && ptr[1] != '*')
     {
     count++;
     if (name == NULL && count == lorn) return count;
@@ -1059,7 +1106,6 @@ for (;;)
   {
   int d;
   register int op = *cc;
-
   switch (op)
     {
     case OP_CBRA:
@@ -1148,6 +1194,7 @@ for (;;)

     case OP_TYPEEXACT:
     branchlength += GET2(cc,1);
+    if (cc[3] == OP_PROP || cc[3] == OP_NOTPROP) cc += 2;
     cc += 4;
     break;

@@ -1256,13 +1303,42 @@ for (;;)
     code += _pcre_OP_lengths[c];
     }

-  /* In UTF-8 mode, opcodes that are followed by a character may be followed by
-  a multi-byte character. The length in the table is a minimum, so we have to
-  arrange to skip the extra bytes. */
+  /* Otherwise, we can get the item's length from the table, except that for
+  repeated character types, we have to test for \p and \P, which have an extra
+  two bytes of parameters.
*/

   else
     {
+    switch(c)
+      {
+      case OP_TYPESTAR:
+      case OP_TYPEMINSTAR:
+      case OP_TYPEPLUS:
+      case OP_TYPEMINPLUS:
+      case OP_TYPEQUERY:
+      case OP_TYPEMINQUERY:
+      case OP_TYPEPOSSTAR:
+      case OP_TYPEPOSPLUS:
+      case OP_TYPEPOSQUERY:
+      if (code[1] == OP_PROP || code[1] == OP_NOTPROP) code += 2;
+      break;
+
+      case OP_TYPEUPTO:
+      case OP_TYPEMINUPTO:
+      case OP_TYPEEXACT:
+      case OP_TYPEPOSUPTO:
+      if (code[3] == OP_PROP || code[3] == OP_NOTPROP) code += 2;
+      break;
+      }
+
+    /* Add in the fixed length from the table */
+
     code += _pcre_OP_lengths[c];
+
+  /* In UTF-8 mode, opcodes that are followed by a character may be followed by
+  a multi-byte character. The length in the table is a minimum, so we have to
+  arrange to skip the extra bytes. */
+
 #ifdef SUPPORT_UTF8
     if (utf8) switch(c)
       {
@@ -1320,14 +1396,42 @@ for (;;)

   if (c == OP_XCLASS) code += GET(code, 1);

-  /* Otherwise, we get the item's length from the table. In UTF-8 mode, opcodes
-  that are followed by a character may be followed by a multi-byte character.
-  The length in the table is a minimum, so we have to arrange to skip the extra
-  bytes. */
+  /* Otherwise, we can get the item's length from the table, except that for
+  repeated character types, we have to test for \p and \P, which have an extra
+  two bytes of parameters. */

   else
     {
+    switch(c)
+      {
+      case OP_TYPESTAR:
+      case OP_TYPEMINSTAR:
+      case OP_TYPEPLUS:
+      case OP_TYPEMINPLUS:
+      case OP_TYPEQUERY:
+      case OP_TYPEMINQUERY:
+      case OP_TYPEPOSSTAR:
+      case OP_TYPEPOSPLUS:
+      case OP_TYPEPOSQUERY:
+      if (code[1] == OP_PROP || code[1] == OP_NOTPROP) code += 2;
+      break;
+
+      case OP_TYPEPOSUPTO:
+      case OP_TYPEUPTO:
+      case OP_TYPEMINUPTO:
+      case OP_TYPEEXACT:
+      if (code[3] == OP_PROP || code[3] == OP_NOTPROP) code += 2;
+      break;
+      }
+
+    /* Add in the fixed length from the table */
+
     code += _pcre_OP_lengths[c];
+
+    /* In UTF-8 mode, opcodes that are followed by a character may be followed
+    by a multi-byte character.
The length in the table is a minimum, so we have
+    to arrange to skip the extra bytes. */
+
 #ifdef SUPPORT_UTF8
     if (utf8) switch(c)
       {
@@ -1399,7 +1503,7 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE

   /* For other groups, scan the branches. */

-  if (c == OP_BRA || c == OP_CBRA || c == OP_ONCE)
+  if (c == OP_BRA || c == OP_CBRA || c == OP_ONCE || c == OP_COND)
     {
     BOOL empty_branch;
     if (GET(code, 1) == 0) return TRUE;    /* Hit unclosed bracket */
@@ -1423,11 +1527,15 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE

   switch (c)
     {
-    /* Check for quantifiers after a class */
+    /* Check for quantifiers after a class. XCLASS is used for classes that
+    cannot be represented just by a bit map. This includes negated single
+    high-valued characters. The length in _pcre_OP_lengths[] is zero; the
+    actual length is stored in the compiled code, so we must update "code"
+    here. */

 #ifdef SUPPORT_UTF8
     case OP_XCLASS:
-    ccode = code + GET(code, 1);
+    ccode = code += GET(code, 1);
     goto CHECK_CLASS_REPEAT;
 #endif

@@ -1489,6 +1597,26 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
     case OP_TYPEEXACT:
     return FALSE;

+    /* These are going to continue, as they may be empty, but we have to
+    fudge the length for the \p and \P cases.
*/
+
+    case OP_TYPESTAR:
+    case OP_TYPEMINSTAR:
+    case OP_TYPEPOSSTAR:
+    case OP_TYPEQUERY:
+    case OP_TYPEMINQUERY:
+    case OP_TYPEPOSQUERY:
+    if (code[1] == OP_PROP || code[1] == OP_NOTPROP) code += 2;
+    break;
+
+    /* Same for these */
+
+    case OP_TYPEUPTO:
+    case OP_TYPEMINUPTO:
+    case OP_TYPEPOSUPTO:
+    if (code[3] == OP_PROP || code[3] == OP_NOTPROP) code += 2;
+    break;
+
     /* End of branch */

     case OP_KET:
@@ -1651,6 +1779,7 @@ adjust_recurse(uschar *group, int adjust, BOOL utf8, compile_data *cd,
   uschar *save_hwm)
 {
 uschar *ptr = group;
+
 while ((ptr = (uschar *)find_recurse(ptr, utf8)) != NULL)
   {
   int offset;
@@ -2255,6 +2384,15 @@ for (;; ptr++)
     */

     if (code < last_code) code = last_code;
+
+    /* Paranoid check for integer overflow */
+
+    if (OFLOW_MAX - *lengthptr < code - last_code)
+      {
+      *errorcodeptr = ERR20;
+      goto FAILED;
+      }
+
     *lengthptr += code - last_code;
     DPRINTF(("length=%d added %d c=%c\n", *lengthptr, code - last_code, c));

@@ -2367,6 +2505,11 @@ for (;; ptr++)
     *ptrptr = ptr;
     if (lengthptr != NULL)
       {
+      if (OFLOW_MAX - *lengthptr < code - last_code)
+        {
+        *errorcodeptr = ERR20;
+        goto FAILED;
+        }
       *lengthptr += code - last_code;   /* To include callout length */
       DPRINTF((">> end branch\n"));
       }
@@ -2429,16 +2572,23 @@ for (;; ptr++)
       goto FAILED;
       }

-    /* If the first character is '^', set the negation flag and skip it. */
+    /* If the first character is '^', set the negation flag and skip it. Also,
+    if the first few characters (either before or after ^) are \Q\E or \E we
+    skip them too. This makes for compatibility with Perl.
*/

-    if ((c = *(++ptr)) == '^')
+    negate_class = FALSE;
+    for (;;)
       {
-      negate_class = TRUE;
       c = *(++ptr);
-      }
-    else
-      {
-      negate_class = FALSE;
+      if (c == '\\')
+        {
+        if (ptr[1] == 'E') ptr++;
+          else if (strncmp((const char *)ptr+1, "Q\\E", 3) == 0) ptr += 3;
+            else break;
+        }
+      else if (!negate_class && c == '^')
+        negate_class = TRUE;
+      else break;
       }

     /* Keep a count of chars with values < 256 so that we can optimize the case
@@ -2579,7 +2729,7 @@ for (;; ptr++)
       of the specials, which just set a flag. The sequence \b is a special
       case. Inside a class (and only there) it is treated as backspace.
       Elsewhere it marks a word boundary. Other escapes have preset maps ready
-      to or into the one we are building. We assume they have more than one
+      to 'or' into the one we are building. We assume they have more than one
       character in them, so set class_charcount bigger than one. */

       if (c == '\\')
@@ -2599,6 +2749,7 @@ for (;; ptr++)
           else inescq = TRUE;
           continue;
           }
+        else if (-c == ESC_E) continue;  /* Ignore orphan \E */

         if (c < 0)
           {
@@ -3045,12 +3196,26 @@ for (;; ptr++)
       goto FAILED;
       }

+    /* Remember whether \r or \n are in this class */
+
+    if (negate_class)
+      {
+      if ((classbits[1] & 0x24) != 0x24) cd->external_options |= PCRE_HASCRORLF;
+      }
+    else
+      {
+      if ((classbits[1] & 0x24) != 0) cd->external_options |= PCRE_HASCRORLF;
+      }
+
     /* If class_charcount is 1, we saw precisely one character whose value is
-    less than 256. In non-UTF-8 mode we can always optimize. In UTF-8 mode, we
-    can optimize the negative case only if there were no characters >= 128
-    because OP_NOT and the related opcodes like OP_NOTSTAR operate on
-    single-bytes only. This is an historical hangover. Maybe one day we can
-    tidy these opcodes to handle multi-byte characters.
+    less than 256. As long as there were no characters >= 128 and there was no
+    use of \p or \P, in other words, no use of any XCLASS features, we can
+    optimize.
+
+    In UTF-8 mode, we can optimize the negative case only if there were no
+    characters >= 128 because OP_NOT and the related opcodes like OP_NOTSTAR
+    operate on single-bytes only. This is an historical hangover. Maybe one day
+    we can tidy these opcodes to handle multi-byte characters.

     The optimization throws away the bit map. We turn the item into a
     1-character OP_CHAR[NC] if it's positive, or OP_NOT if it's negative. Note
@@ -3060,10 +3225,8 @@ for (;; ptr++)
     reqbyte, save the previous value for reinstating. */

 #ifdef SUPPORT_UTF8
-    if (class_charcount == 1 &&
-          (!utf8 ||
-          (!class_utf8 && (!negate_class || class_lastchar < 128))))
-
+    if (class_charcount == 1 && !class_utf8 &&
+      (!utf8 || !negate_class || class_lastchar < 128))
 #else
     if (class_charcount == 1)
 #endif
@@ -3521,14 +3684,6 @@ for (;; ptr++)
         goto FAILED;
         }

-      /* This is a paranoid check to stop integer overflow later on */
-
-      if (len > MAX_DUPLENGTH)
-        {
-        *errorcodeptr = ERR50;
-        goto FAILED;
-        }
-
       /* If the maximum repeat count is unlimited, find the end of the bracket
       by scanning through from the start, and compute the offset back to it
       from the current code pointer. There may be an OP_OPT setting following
@@ -3617,10 +3772,21 @@ for (;; ptr++)
         if (repeat_min > 1)
           {
           /* In the pre-compile phase, we don't actually do the replication. We
-          just adjust the length as if we had. */
+          just adjust the length as if we had. Do some paranoid checks for
+          potential integer overflow. */

           if (lengthptr != NULL)
-            *lengthptr += (repeat_min - 1)*length_prevgroup;
+            {
+            int delta = (repeat_min - 1)*length_prevgroup;
+            if ((double)(repeat_min - 1)*(double)length_prevgroup >
+                                                            (double)INT_MAX ||
+                OFLOW_MAX - *lengthptr < delta)
+              {
+              *errorcodeptr = ERR20;
+              goto FAILED;
+              }
+            *lengthptr += delta;
+            }

           /* This is compiling for real */

@@ -3658,11 +3824,23 @@ for (;; ptr++)
         /* In the pre-compile phase, we don't actually do the replication. We
         just adjust the length as if we had.
For each repetition we must add 1 to the
         length for BRAZERO and for all but the last repetition we must
-        add 2 + 2*LINKSIZE to allow for the nesting that occurs. */
+        add 2 + 2*LINKSIZE to allow for the nesting that occurs. Do some
+        paranoid checks to avoid integer overflow. */

         if (lengthptr != NULL && repeat_max > 0)
-          *lengthptr += repeat_max * (length_prevgroup + 1 + 2 + 2*LINK_SIZE) -
-            2 - 2*LINK_SIZE;  /* Last one doesn't nest */
+          {
+          int delta = repeat_max * (length_prevgroup + 1 + 2 + 2*LINK_SIZE) -
+                      2 - 2*LINK_SIZE;   /* Last one doesn't nest */
+          if ((double)repeat_max *
+                (double)(length_prevgroup + 1 + 2 + 2*LINK_SIZE)
+                  > (double)INT_MAX ||
+              OFLOW_MAX - *lengthptr < delta)
+            {
+            *errorcodeptr = ERR20;
+            goto FAILED;
+            }
+          *lengthptr += delta;
+          }

         /* This is compiling for real */

@@ -3814,9 +3992,7 @@ for (;; ptr++)
     /* ===================================================================*/
     /* Start of nested parenthesized sub-expression, or comment or lookahead or
     lookbehind or option setting or condition or all the other extended
-    parenthesis forms. First deal with the specials; all are introduced by ?,
-    and the appearance of any of them means that this is not a capturing
-    group. */
+    parenthesis forms. */

     case '(':
     newoptions = options;
@@ -3825,7 +4001,44 @@ for (;; ptr++)
     save_hwm = cd->hwm;
     reset_bracount = FALSE;

-    if (*(++ptr) == '?')
+    /* First deal with various "verbs" that can be introduced by '*'.
*/
+
+    if (*(++ptr) == '*' && (cd->ctypes[ptr[1]] & ctype_letter) != 0)
+      {
+      int i, namelen;
+      const uschar *name = ++ptr;
+      previous = NULL;
+      while ((cd->ctypes[*++ptr] & ctype_letter) != 0);
+      if (*ptr == ':')
+        {
+        *errorcodeptr = ERR59;   /* Not supported */
+        goto FAILED;
+        }
+      if (*ptr != ')')
+        {
+        *errorcodeptr = ERR60;
+        goto FAILED;
+        }
+      namelen = ptr - name;
+      for (i = 0; i < verbcount; i++)
+        {
+        if (namelen == verbs[i].len &&
+            strncmp((char *)name, verbs[i].name, namelen) == 0)
+          {
+          *code = verbs[i].op;
+          if (*code++ == OP_ACCEPT) cd->had_accept = TRUE;
+          break;
+          }
+        }
+      if (i < verbcount) continue;
+      *errorcodeptr = ERR60;
+      goto FAILED;
+      }
+
+    /* Deal with the extended parentheses; all are introduced by '?', and the
+    appearance of any of them means that this is not a capturing group. */
+
+    else if (*ptr == '?')
       {
       int i, set, unset, namelen;
       int *optset;
@@ -4067,8 +4280,14 @@ for (;; ptr++)

         /* ------------------------------------------------------------ */
         case '!':                 /* Negative lookahead */
-        bravalue = OP_ASSERT_NOT;
         ptr++;
+        if (*ptr == ')')          /* Optimize (?!) */
+          {
+          *code++ = OP_FAIL;
+          previous = NULL;
+          continue;
+          }
+        bravalue = OP_ASSERT_NOT;
         break;


@@ -4617,23 +4836,29 @@ for (;; ptr++)
       goto FAILED;
       }

-    /* In the pre-compile phase, update the length by the length of the nested
-    group, less the brackets at either end. Then reduce the compiled code to
-    just the brackets so that it doesn't use much memory if it is duplicated by
-    a quantifier. */
+    /* In the pre-compile phase, update the length by the length of the group,
+    less the brackets at either end.
Then reduce the compiled code to just a
+    set of non-capturing brackets so that it doesn't use much memory if it is
+    duplicated by a quantifier.*/

     if (lengthptr != NULL)
       {
+      if (OFLOW_MAX - *lengthptr < length_prevgroup - 2 - 2*LINK_SIZE)
+        {
+        *errorcodeptr = ERR20;
+        goto FAILED;
+        }
       *lengthptr += length_prevgroup - 2 - 2*LINK_SIZE;
-      code++;
+      *code++ = OP_BRA;
       PUTINC(code, 0, 1 + LINK_SIZE);
       *code++ = OP_KET;
       PUTINC(code, 0, 1 + LINK_SIZE);
+      break;    /* No need to waste time with special character handling */
       }

     /* Otherwise update the main code pointer to the end of the group. */

-    else code = tempcode;
+    code = tempcode;

     /* For a DEFINE group, required and first character settings are not
     relevant. */
@@ -4837,6 +5062,11 @@ for (;; ptr++)
     *code++ = ((options & PCRE_CASELESS) != 0)? OP_CHARNC : OP_CHAR;
     for (c = 0; c < mclength; c++) *code++ = mcbuffer[c];

+    /* Remember if \r or \n were seen */
+
+    if (mcbuffer[0] == '\r' || mcbuffer[0] == '\n')
+      cd->external_options |= PCRE_HASCRORLF;
+
     /* Set the first and required bytes appropriately. If no previous first
     byte, set it from this character, but revert to none on a zero repeat.
     Otherwise, leave the firstbyte value alone, and don't change it on a zero
@@ -5119,7 +5349,15 @@ for (;;)
     *ptrptr = ptr;
     *firstbyteptr = firstbyte;
     *reqbyteptr = reqbyte;
-    if (lengthptr != NULL) *lengthptr += length;
+    if (lengthptr != NULL)
+      {
+      if (OFLOW_MAX - *lengthptr < length)
+        {
+        *errorcodeptr = ERR20;
+        return FALSE;
+        }
+      *lengthptr += length;
+      }
     return TRUE;
     }

@@ -5428,6 +5666,7 @@ real_pcre *re;
 int length = 1;  /* For final END opcode */
 int firstbyte, reqbyte, newline;
 int errorcode = 0;
+int skipatstart = 0;
 #ifdef SUPPORT_UTF8
 BOOL utf8;
 #endif
@@ -5506,13 +5745,33 @@ cd->fcc = tables + fcc_offset;
 cd->cbits = tables + cbits_offset;
 cd->ctypes = tables + ctypes_offset;

+/* Check for newline settings at the start of the pattern, and remember the
+offset for later.
*/
+
+if (ptr[0] == '(' && ptr[1] == '*')
+  {
+  int newnl = 0;
+  if (strncmp((char *)(ptr+2), "CR)", 3) == 0)
+    { skipatstart = 5; newnl = PCRE_NEWLINE_CR; }
+  else if (strncmp((char *)(ptr+2), "LF)", 3) == 0)
+    { skipatstart = 5; newnl = PCRE_NEWLINE_LF; }
+  else if (strncmp((char *)(ptr+2), "CRLF)", 5) == 0)
+    { skipatstart = 7; newnl = PCRE_NEWLINE_CR + PCRE_NEWLINE_LF; }
+  else if (strncmp((char *)(ptr+2), "ANY)", 4) == 0)
+    { skipatstart = 6; newnl = PCRE_NEWLINE_ANY; }
+  else if (strncmp((char *)(ptr+2), "ANYCRLF)", 8) == 0)
+    { skipatstart = 10; newnl = PCRE_NEWLINE_ANYCRLF; }
+  if (skipatstart > 0)
+    options = (options & ~PCRE_NEWLINE_BITS) | newnl;
+  }
+
 /* Handle different types of newline. The three bits give seven cases. The
 current code allows for fixed one- or two-byte sequences, plus "any" and
 "anycrlf". */

-switch (options & (PCRE_NEWLINE_CRLF | PCRE_NEWLINE_ANY))
+switch (options & PCRE_NEWLINE_BITS)
   {
-  case 0: newline = NEWLINE; break;   /* Compile-time default */
+  case 0: newline = NEWLINE; break;   /* Build-time default */
   case PCRE_NEWLINE_CR: newline = '\r'; break;
   case PCRE_NEWLINE_LF: newline = '\n'; break;
   case PCRE_NEWLINE_CR+
@@ -5584,6 +5843,7 @@ been put into the cd block so that they can be changed if an option setting is
 found within the regex right at the beginning. Bringing initial option settings
 outside can help speed up starting point checks. */

+ptr += skipatstart;
 code = cworkspace;
 *code = OP_BRA;
 (void)compile_regex(cd->external_options, cd->external_options & PCRE_IMS,
@@ -5647,12 +5907,13 @@ cd->start_code = codestart;
 cd->hwm = cworkspace;
 cd->req_varyopt = 0;
 cd->nopartial = FALSE;
+cd->had_accept = FALSE;

 /* Set up a starting, non-extracting bracket, then compile the expression. On
 error, errorcode will be set non-zero, so we don't need to look at the result
 of the function here.
*/

-ptr = (const uschar *)pattern;
+ptr = (const uschar *)pattern + skipatstart;
 code = (uschar *)codestart;
 *code = OP_BRA;
 (void)compile_regex(re->options, re->options & PCRE_IMS, &code, &ptr,
@@ -5661,6 +5922,7 @@ re->top_bracket = cd->bracount;
 re->top_backref = cd->top_backref;

 if (cd->nopartial) re->options |= PCRE_NOPARTIAL;
+if (cd->had_accept) reqbyte = -1;   /* Must disable after (*ACCEPT) */

 /* If not reached end of pattern on success, there's an excess bracket. */

@@ -5759,19 +6021,7 @@ case when building a production library. */
 printf("Length = %d top_bracket = %d top_backref = %d\n",
   length, re->top_bracket, re->top_backref);

-if (re->options != 0)
-  {
-  printf("%s%s%s%s%s%s%s%s%s\n",
-    ((re->options & PCRE_NOPARTIAL) != 0)? "nopartial " : "",
-    ((re->options & PCRE_ANCHORED) != 0)? "anchored " : "",
-    ((re->options & PCRE_CASELESS) != 0)? "caseless " : "",
-    ((re->options & PCRE_EXTENDED) != 0)? "extended " : "",
-    ((re->options & PCRE_MULTILINE) != 0)? "multiline " : "",
-    ((re->options & PCRE_DOTALL) != 0)? "dotall " : "",
-    ((re->options & PCRE_DOLLAR_ENDONLY) != 0)? "endonly " : "",
-    ((re->options & PCRE_EXTRA) != 0)? "extra " : "",
-    ((re->options & PCRE_UNGREEDY) != 0)? "ungreedy " : "");
-  }
+printf("Options=%08x\n", re->options);

 if ((re->options & PCRE_FIRSTSET) != 0)
   {
diff --git a/ext/pcre/pcrelib/pcre_config.c b/ext/pcre/pcrelib/pcre_config.c
index 52d2594661..ea0e317db8 100644
--- a/ext/pcre/pcrelib/pcre_config.c
+++ b/ext/pcre/pcrelib/pcre_config.c
@@ -41,6 +41,10 @@ POSSIBILITY OF SUCH DAMAGE.
 /* This module contains the external function pcre_config(). */


+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
 #include "pcre_internal.h"


diff --git a/ext/pcre/pcrelib/pcre_exec.c b/ext/pcre/pcrelib/pcre_exec.c
index a749a216ea..bd650e9b18 100644
--- a/ext/pcre/pcrelib/pcre_exec.c
+++ b/ext/pcre/pcrelib/pcre_exec.c
@@ -42,6 +42,10 @@ POSSIBILITY OF SUCH DAMAGE.
 pattern matching using an NFA algorithm, trying to mimic Perl as closely as
 possible. There are also some static supporting functions. */

+#ifdef HAVE_CONFIG_H
+#include <config.h>
+#endif
+
 #define NLBLOCK md             /* Block containing newline information */
 #define PSSTART start_subject  /* Field containing processed string start */
 #define PSEND   end_subject    /* Field containing processed string end */
@@ -53,16 +57,10 @@ possible. There are also some static supporting functions. */
 #undef min
 #undef max

-/* The chain of eptrblocks for tail recursions uses memory in stack workspace,
-obtained at top level, the size of which is defined by EPTR_WORK_SIZE. */
-
-#define EPTR_WORK_SIZE (1000)
-
 /* Flag bits for the match() function */

 #define match_condassert     0x01  /* Called to check a condition assertion */
 #define match_cbegroup       0x02  /* Could-be-empty unlimited repeat group */
-#define match_tail_recursed  0x04  /* Tail recursive call */

 /* Non-error returns from the match() function. Error returns are externally
 defined PCRE_ERROR_xxx codes, which are all negative. */
@@ -70,6 +68,14 @@ defined PCRE_ERROR_xxx codes, which are all negative. */
 #define MATCH_MATCH        1
 #define MATCH_NOMATCH      0

+/* Special internal returns from the match() function. Make them sufficiently
+negative to avoid the external error codes. */
+
+#define MATCH_COMMIT       (-999)
+#define MATCH_PRUNE        (-998)
+#define MATCH_SKIP         (-997)
+#define MATCH_THEN         (-996)
+
 /* Maximum number of ints of offset to save on the stack for recursive calls.
 If the offset vector is bigger, malloc is used. This should be a multiple of 3,
 because the offset vector is always a multiple of 3 long. */
@@ -205,15 +211,15 @@ variable instead of being passed in the frame.
 ****************************************************************************
 ***************************************************************************/

-
-/* Numbers for RMATCH calls */
+/* Numbers for RMATCH calls.
When this list is changed, the code at HEAP_RETURN
+below must be updated in sync. */

 enum { RM1=1, RM2, RM3, RM4, RM5, RM6, RM7, RM8, RM9, RM10,
        RM11, RM12, RM13, RM14, RM15, RM16, RM17, RM18, RM19, RM20,
        RM21, RM22, RM23, RM24, RM25, RM26, RM27, RM28, RM29, RM30,
        RM31, RM32, RM33, RM34, RM35, RM36, RM37, RM38, RM39, RM40,
-       RM41, RM42, RM43, RM44, RM45, RM46, RM47 };
-
+       RM41, RM42, RM43, RM44, RM45, RM46, RM47, RM48, RM49, RM50,
+       RM51, RM52, RM53, RM54 };

 /* These versions of the macros use the stack, as normal. There are debugging
 versions and production versions. Note that the "rw" argument of RMATCH isn't
@@ -384,7 +390,6 @@ Arguments:
                  match_condassert - this is an assertion condition
                  match_cbegroup - this is the start of an unlimited repeat
                    group that can match an empty string
-                 match_tail_recursed - this is a tail_recursed group
    rdepth      the recursion depth

 Returns:       MATCH_MATCH if matched            )  these values are >= 0
@@ -586,22 +591,16 @@ original_ims = ims;    /* Save for resetting on ')' */
 string, the match_cbegroup flag is set. When this is the case, add the current
 subject pointer to the chain of such remembered pointers, to be checked when we
 hit the closing ket, in order to break infinite loops that match no characters.
-When match() is called in other circumstances, don't add to the chain. If this
-is a tail recursion, use a block from the workspace, as the one on the stack is
-already used. */
+When match() is called in other circumstances, don't add to the chain. The
+match_cbegroup flag must NOT be used with tail recursion, because the memory
+block that is used is on the stack, so a new one may be required for each
+match().
*/

 if ((flags & match_cbegroup) != 0)
   {
-  eptrblock *p;
-  if ((flags & match_tail_recursed) != 0)
-    {
-    if (md->eptrn >= EPTR_WORK_SIZE) RRETURN(PCRE_ERROR_NULLWSLIMIT);
-    p = md->eptrchain + md->eptrn++;
-    }
-  else p = &newptrb;
-  p->epb_saved_eptr = eptr;
-  p->epb_prev = eptrb;
-  eptrb = p;
+  newptrb.epb_saved_eptr = eptr;
+  newptrb.epb_prev = eptrb;
+  eptrb = &newptrb;
   }

 /* Now start processing the opcodes. */
@@ -621,6 +620,34 @@ for (;;)

   switch(op)
     {
+    case OP_FAIL:
+    RRETURN(MATCH_NOMATCH);
+
+    case OP_PRUNE:
+    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
+      ims, eptrb, flags, RM51);
+    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    RRETURN(MATCH_PRUNE);
+
+    case OP_COMMIT:
+    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
+      ims, eptrb, flags, RM52);
+    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    RRETURN(MATCH_COMMIT);
+
+    case OP_SKIP:
+    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
+      ims, eptrb, flags, RM53);
+    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    md->start_match_ptr = eptr;   /* Pass back current position */
+    RRETURN(MATCH_SKIP);
+
+    case OP_THEN:
+    RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
+      ims, eptrb, flags, RM54);
+    if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+    RRETURN(MATCH_THEN);
+
     /* Handle a capturing bracket. If there is space in the offset vector, save
     the current subject position in the working slot at the top of the vector.
     We mustn't change the current values of the data slot, because they may be
@@ -662,7 +689,7 @@ for (;;)
         {
         RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
           ims, eptrb, flags, RM1);
-        if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+        if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
         md->capture_last = save_capture_last;
         ecode += GET(ecode, 1);
         }
@@ -677,15 +704,22 @@ for (;;)
       RRETURN(MATCH_NOMATCH);
       }

-    /* Insufficient room for saving captured contents. Treat as a non-capturing
-    bracket. */
+    /* FALL THROUGH ...
Insufficient room for saving captured contents. Treat
+    as a non-capturing bracket. */
+
+    /* VVVVVVVVVVVVVVVVVVVVVVVVV */
+    /* VVVVVVVVVVVVVVVVVVVVVVVVV */

     DPRINTF(("insufficient capture room: treat as non-capturing\n"));

+    /* VVVVVVVVVVVVVVVVVVVVVVVVV */
+    /* VVVVVVVVVVVVVVVVVVVVVVVVV */
+
     /* Non-capturing bracket. Loop for all the alternatives. When we get to the
     final alternative within the brackets, we would return the result of a
     recursive call to match() whatever happened. We can reduce stack usage by
-    turning this into a tail recursion. */
+    turning this into a tail recursion, except in the case when match_cbegroup
+    is set.*/

     case OP_BRA:
     case OP_SBRA:
@@ -693,12 +727,20 @@ for (;;)
     flags = (op >= OP_SBRA)? match_cbegroup : 0;
     for (;;)
       {
-      if (ecode[GET(ecode, 1)] != OP_ALT)
+      if (ecode[GET(ecode, 1)] != OP_ALT)   /* Final alternative */
         {
-        ecode += _pcre_OP_lengths[*ecode];
-        flags |= match_tail_recursed;
-        DPRINTF(("bracket 0 tail recursion\n"));
-        goto TAIL_RECURSE;
+        if (flags == 0)    /* Not a possibly empty group */
+          {
+          ecode += _pcre_OP_lengths[*ecode];
+          DPRINTF(("bracket 0 tail recursion\n"));
+          goto TAIL_RECURSE;
+          }
+
+        /* Possibly empty group; can't use tail recursion. */
+
+        RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, ims,
+          eptrb, flags, RM48);
+        RRETURN(rrc);
         }

       /* For non-final alternatives, continue the loop for a NOMATCH result;
@@ -706,7 +748,7 @@ for (;;)

       RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md, ims,
         eptrb, flags, RM2);
-      if (rrc != MATCH_NOMATCH) RRETURN(rrc);
+      if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc);
       ecode += GET(ecode, 1);
       }
     /* Control never reaches here.
*/ @@ -754,7 +796,7 @@ for (;;) ecode += 1 + LINK_SIZE + GET(ecode, LINK_SIZE + 2); while (*ecode == OP_ALT) ecode += GET(ecode, 1); } - else if (rrc != MATCH_NOMATCH) + else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) { RRETURN(rrc); /* Need braces because of following else */ } @@ -766,25 +808,36 @@ for (;;) } /* We are now at the branch that is to be obeyed. As there is only one, - we can use tail recursion to avoid using another stack frame. If the second - alternative doesn't exist, we can just plough on. */ + we can use tail recursion to avoid using another stack frame, except when + match_cbegroup is required for an unlimited repeat of a possibly empty + group. If the second alternative doesn't exist, we can just plough on. */ if (condition || *ecode == OP_ALT) { ecode += 1 + LINK_SIZE; - flags = match_tail_recursed | ((op == OP_SCOND)? match_cbegroup : 0); - goto TAIL_RECURSE; + if (op == OP_SCOND) /* Possibly empty group */ + { + RMATCH(eptr, ecode, offset_top, md, ims, eptrb, match_cbegroup, RM49); + RRETURN(rrc); + } + else /* Group must match something */ + { + flags = 0; + goto TAIL_RECURSE; + } } - else + else /* Condition false & no 2nd alternative */ { ecode += 1 + LINK_SIZE; } break; - /* End of the pattern. If we are in a top-level recursion, we should - restore the offsets appropriately and continue from after the call. */ + /* End of the pattern, either real or forced. If we are in a top-level + recursion, we should restore the offsets appropriately and continue from + after the call. 
*/ + case OP_ACCEPT: case OP_END: if (md->recursive != NULL && md->recursive->group_num == 0) { @@ -805,7 +858,7 @@ for (;;) if (md->notempty && eptr == mstart) RRETURN(MATCH_NOMATCH); md->end_match_ptr = eptr; /* Record where we ended */ md->end_offset_top = offset_top; /* and how many extracts were taken */ - md->start_match_ptr = mstart; /* and the start (\K can modify) */ + md->start_match_ptr = mstart; /* and the start (\K can modify) */ RRETURN(MATCH_MATCH); /* Change option settings */ @@ -829,7 +882,7 @@ for (;;) RMATCH(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, NULL, 0, RM4); if (rrc == MATCH_MATCH) break; - if (rrc != MATCH_NOMATCH) RRETURN(rrc); + if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc); ecode += GET(ecode, 1); } while (*ecode == OP_ALT); @@ -856,7 +909,7 @@ for (;;) RMATCH(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, NULL, 0, RM5); if (rrc == MATCH_MATCH) RRETURN(MATCH_NOMATCH); - if (rrc != MATCH_NOMATCH) RRETURN(rrc); + if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc); ecode += GET(ecode,1); } while (*ecode == OP_ALT); @@ -880,7 +933,7 @@ for (;;) { eptr--; if (eptr < md->start_subject) RRETURN(MATCH_NOMATCH); - BACKCHAR(eptr) + BACKCHAR(eptr); } } else @@ -993,7 +1046,7 @@ for (;;) (pcre_free)(new_recursive.offset_save); RRETURN(MATCH_MATCH); } - else if (rrc != MATCH_NOMATCH) + else if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) { DPRINTF(("Recursion gave error %d\n", rrc)); RRETURN(rrc); @@ -1027,10 +1080,9 @@ for (;;) do { - RMATCH(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, - eptrb, 0, RM7); + RMATCH(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, eptrb, 0, RM7); if (rrc == MATCH_MATCH) break; - if (rrc != MATCH_NOMATCH) RRETURN(rrc); + if (rrc != MATCH_NOMATCH && rrc != MATCH_THEN) RRETURN(rrc); ecode += GET(ecode,1); } while (*ecode == OP_ALT); @@ -1073,11 +1125,10 @@ for (;;) if (*ecode == OP_KETRMIN) { - RMATCH(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, eptrb, 0, - RM8); + RMATCH(eptr, 
ecode + 1 + LINK_SIZE, offset_top, md, ims, eptrb, 0, RM8); if (rrc != MATCH_NOMATCH) RRETURN(rrc); ecode = prev; - flags = match_tail_recursed; + flags = 0; goto TAIL_RECURSE; } else /* OP_KETRMAX */ @@ -1085,7 +1136,7 @@ for (;;) RMATCH(eptr, prev, offset_top, md, ims, eptrb, match_cbegroup, RM9); if (rrc != MATCH_NOMATCH) RRETURN(rrc); ecode += 1 + LINK_SIZE; - flags = match_tail_recursed; + flags = 0; goto TAIL_RECURSE; } /* Control never gets here */ @@ -1216,17 +1267,21 @@ for (;;) /* The repeating kets try the rest of the pattern or restart from the preceding bracket, in the appropriate order. In the second case, we can use - tail recursion to avoid using another stack frame. */ + tail recursion to avoid using another stack frame, unless we have an + unlimited repeat of a group that can match an empty string. */ flags = (*prev >= OP_SBRA)? match_cbegroup : 0; if (*ecode == OP_KETRMIN) { - RMATCH(eptr, ecode + 1+LINK_SIZE, offset_top, md, ims, eptrb, 0, - RM12); + RMATCH(eptr, ecode + 1 + LINK_SIZE, offset_top, md, ims, eptrb, 0, RM12); if (rrc != MATCH_NOMATCH) RRETURN(rrc); + if (flags != 0) /* Could match an empty string */ + { + RMATCH(eptr, prev, offset_top, md, ims, eptrb, flags, RM50); + RRETURN(rrc); + } ecode = prev; - flags |= match_tail_recursed; goto TAIL_RECURSE; } else /* OP_KETRMAX */ @@ -1234,7 +1289,7 @@ for (;;) RMATCH(eptr, prev, offset_top, md, ims, eptrb, flags, RM13); if (rrc != MATCH_NOMATCH) RRETURN(rrc); ecode += 1 + LINK_SIZE; - flags = match_tail_recursed; + flags = 0; goto TAIL_RECURSE; } /* Control never gets here */ @@ -2033,7 +2088,7 @@ for (;;) RMATCH(eptr, ecode, offset_top, md, ims, eptrb, 0, RM21); if (rrc != MATCH_NOMATCH) RRETURN(rrc); if (eptr-- == pp) break; /* Stop if tried at original pos */ - BACKCHAR(eptr) + if (utf8) BACKCHAR(eptr); } RRETURN(MATCH_NOMATCH); } @@ -3038,9 +3093,9 @@ for (;;) for (i = 1; i <= min; i++) { if (eptr >= md->end_subject || - (*eptr < 128 && (md->ctypes[*eptr++] & ctype_space) != 0)) + 
(*eptr < 128 && (md->ctypes[*eptr] & ctype_space) != 0)) RRETURN(MATCH_NOMATCH); - while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++; + while (++eptr < md->end_subject && (*eptr & 0xc0) == 0x80); } break; @@ -3058,9 +3113,9 @@ for (;;) for (i = 1; i <= min; i++) { if (eptr >= md->end_subject || - (*eptr < 128 && (md->ctypes[*eptr++] & ctype_word) != 0)) + (*eptr < 128 && (md->ctypes[*eptr] & ctype_word) != 0)) RRETURN(MATCH_NOMATCH); - while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++; + while (++eptr < md->end_subject && (*eptr & 0xc0) == 0x80); } break; @@ -3702,7 +3757,7 @@ for (;;) RMATCH(eptr, ecode, offset_top, md, ims, eptrb, 0, RM44); if (rrc != MATCH_NOMATCH) RRETURN(rrc); if (eptr-- == pp) break; /* Stop if tried at original pos */ - BACKCHAR(eptr); + if (utf8) BACKCHAR(eptr); } } @@ -3741,9 +3796,9 @@ for (;;) for (;;) /* Move back over one extended */ { int len = 1; - BACKCHAR(eptr); if (!utf8) c = *eptr; else { + BACKCHAR(eptr); GETCHARLEN(c, eptr, len); } prop_category = _pcre_ucp_findprop(c, &prop_chartype, &prop_script); @@ -3764,11 +3819,6 @@ for (;;) switch(ctype) { case OP_ANY: - - /* Special code is required for UTF8, but when the maximum is - unlimited we don't need it, so we repeat the non-UTF8 code. This is - probably worth it, because .* is quite a common idiom. 
*/ - if (max < INT_MAX) { if ((ims & PCRE_DOTALL) == 0) @@ -3801,15 +3851,12 @@ for (;;) { if (eptr >= md->end_subject || IS_NEWLINE(eptr)) break; eptr++; + while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++; } - break; } else { - c = max - min; - if (c > (unsigned int)(md->end_subject - eptr)) - c = md->end_subject - eptr; - eptr += c; + eptr = md->end_subject; } } break; @@ -3990,7 +4037,7 @@ for (;;) } } else -#endif +#endif /* SUPPORT_UTF8 */ /* Not UTF-8 mode */ { @@ -4181,7 +4228,8 @@ switch (frame->Xwhere) LBL(17) LBL(18) LBL(19) LBL(20) LBL(21) LBL(22) LBL(23) LBL(24) LBL(25) LBL(26) LBL(27) LBL(28) LBL(29) LBL(30) LBL(31) LBL(32) LBL(33) LBL(34) LBL(35) LBL(36) LBL(37) LBL(38) LBL(39) LBL(40) - LBL(41) LBL(42) LBL(43) LBL(44) LBL(45) LBL(46) LBL(47) + LBL(41) LBL(42) LBL(43) LBL(44) LBL(45) LBL(46) LBL(47) LBL(48) + LBL(49) LBL(50) LBL(51) LBL(52) LBL(53) LBL(54) default: DPRINTF(("jump error in pcre match: label %d non-existent\n", frame->Xwhere)); return PCRE_ERROR_INTERNAL; @@ -4298,7 +4346,6 @@ const uschar *start_bits = NULL; USPTR start_match = (USPTR)subject + start_offset; USPTR end_subject; USPTR req_byte_ptr = start_match - 1; -eptrblock eptrchain[EPTR_WORK_SIZE]; pcre_study_data internal_study; const pcre_study_data *study; @@ -4384,7 +4431,6 @@ md->partial = (options & PCRE_PARTIAL) != 0; md->hitend = FALSE; md->recursive = NULL; /* No recursion at top level */ -md->eptrchain = eptrchain; /* Make workspace generally available */ md->lcc = tables + lcc_offset; md->ctypes = tables + ctypes_offset; @@ -4540,6 +4586,7 @@ the loop runs just once. */ for(;;) { USPTR save_end_subject = end_subject; + USPTR new_start_match; /* Reset the maximum number of extractions we might see. */ @@ -4680,15 +4727,48 @@ for(;;) /* OK, we can now run the match. 
*/ - md->start_match_ptr = start_match; /* Insurance */ + md->start_match_ptr = start_match; md->match_call_count = 0; - md->eptrn = 0; /* Next free eptrchain slot */ - rc = match(start_match, md->start_code, start_match, 2, md, - ims, NULL, 0, 0); + rc = match(start_match, md->start_code, start_match, 2, md, ims, NULL, 0, 0); + + switch(rc) + { + /* NOMATCH and PRUNE advance by one character. THEN at this level acts + exactly like PRUNE. */ + + case MATCH_NOMATCH: + case MATCH_PRUNE: + case MATCH_THEN: + new_start_match = start_match + 1; +#ifdef SUPPORT_UTF8 + if (utf8) + while(new_start_match < end_subject && (*new_start_match & 0xc0) == 0x80) + new_start_match++; +#endif + break; + + /* SKIP passes back the next starting point explicitly. */ - /* Any return other than MATCH_NOMATCH breaks the loop. */ + case MATCH_SKIP: + new_start_match = md->start_match_ptr; + break; + + /* COMMIT disables the bumpalong, but otherwise behaves as NOMATCH. */ - if (rc != MATCH_NOMATCH) break; + case MATCH_COMMIT: + rc = MATCH_NOMATCH; + goto ENDLOOP; + + /* Any other return is some kind of error. */ + + default: + goto ENDLOOP; + } + + /* Control reaches here for the various types of "no match at this point" + result. Reset the code to MATCH_NOMATCH for subsequent checking. */ + + rc = MATCH_NOMATCH; /* If PCRE_FIRSTLINE is set, the match must happen before or at the first newline in the subject (though it may continue over the newline). Therefore, @@ -4696,30 +4776,26 @@ for(;;) if (firstline && IS_NEWLINE(start_match)) break; - /* Advance the match position by one character. */ + /* Advance to new matching position */ - start_match++; -#ifdef SUPPORT_UTF8 - if (utf8) - while(start_match < end_subject && (*start_match & 0xc0) == 0x80) - start_match++; -#endif + start_match = new_start_match; /* Break the loop if the pattern is anchored or if we have passed the end of the subject. 
*/ if (anchored || start_match > end_subject) break; - /* If we have just passed a CR and the newline option is CRLF or ANY or - ANYCRLF, and we are now at a LF, advance the match position by one more - character. */ + /* If we have just passed a CR and we are now at a LF, and the pattern does + not contain any explicit matches for \r or \n, and the newline option is CRLF + or ANY or ANYCRLF, advance the match position by one more character. */ if (start_match[-1] == '\r' && - (md->nltype == NLTYPE_ANY || - md->nltype == NLTYPE_ANYCRLF || - md->nllen == 2) && - start_match < end_subject && - *start_match == '\n') + start_match < end_subject && + *start_match == '\n' && + (re->options & PCRE_HASCRORLF) == 0 && + (md->nltype == NLTYPE_ANY || + md->nltype == NLTYPE_ANYCRLF || + md->nllen == 2)) start_match++; } /* End of for(;;) "bumpalong" loop */ @@ -4729,7 +4805,7 @@ for(;;) /* We reach here when rc is not MATCH_NOMATCH, or if one of the stopping conditions is true: -(1) The pattern is anchored; +(1) The pattern is anchored or the match was failed by (*COMMIT); (2) We are past the end of the subject; @@ -4744,6 +4820,8 @@ processing, copy those that we can. In this case there need not be overflow if certain parts of the pattern were not used, even though there are more capturing parentheses than vector slots. */ +ENDLOOP: + if (rc == MATCH_MATCH) { if (using_temporary_offsets) diff --git a/ext/pcre/pcrelib/pcre_fullinfo.c b/ext/pcre/pcrelib/pcre_fullinfo.c index 797eddecf2..b082473351 100644 --- a/ext/pcre/pcrelib/pcre_fullinfo.c +++ b/ext/pcre/pcrelib/pcre_fullinfo.c @@ -42,6 +42,10 @@ POSSIBILITY OF SUCH DAMAGE. information about a compiled pattern. 
*/ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" @@ -148,6 +152,10 @@ switch (what) *((int *)where) = (re->options & PCRE_JCHANGED) != 0; break; + case PCRE_INFO_HASCRORLF: + *((int *)where) = (re->options & PCRE_HASCRORLF) != 0; + break; + default: return PCRE_ERROR_BADOPTION; } diff --git a/ext/pcre/pcrelib/pcre_get.c b/ext/pcre/pcrelib/pcre_get.c index ba0c8cb21f..64a195e940 100644 --- a/ext/pcre/pcrelib/pcre_get.c +++ b/ext/pcre/pcrelib/pcre_get.c @@ -43,6 +43,10 @@ from the subject string after a regex match has succeeded. The original idea for these functions came from Scott Wimer. */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" diff --git a/ext/pcre/pcrelib/pcre_globals.c b/ext/pcre/pcrelib/pcre_globals.c index dbde57e023..41b89dd75c 100644 --- a/ext/pcre/pcrelib/pcre_globals.c +++ b/ext/pcre/pcrelib/pcre_globals.c @@ -46,6 +46,10 @@ indirection. These values can be changed by the caller, but are shared between all threads. However, when compiling for Virtual Pascal, things are done differently, and global variables are not used (see pcre.in). */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" #ifndef VPCOMPAT diff --git a/ext/pcre/pcrelib/pcre_info.c b/ext/pcre/pcrelib/pcre_info.c index 52ac23f708..c40eb7c375 100644 --- a/ext/pcre/pcrelib/pcre_info.c +++ b/ext/pcre/pcrelib/pcre_info.c @@ -43,6 +43,10 @@ information about a compiled pattern. However, use of this function is now deprecated, as it has been superseded by pcre_fullinfo(). */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" diff --git a/ext/pcre/pcrelib/pcre_internal.h b/ext/pcre/pcrelib/pcre_internal.h index d87b95cb27..a2409f9425 100644 --- a/ext/pcre/pcrelib/pcre_internal.h +++ b/ext/pcre/pcrelib/pcre_internal.h @@ -67,10 +67,6 @@ be absolutely sure we get our version. 
*/ #endif -/* Get the definitions provided by running "configure" */ - -#include "config.h" - /* Standard C headers plus the external interface definition. The only time setjmp and stdarg are used is when NO_RECURSE is set. */ @@ -112,7 +108,7 @@ PCRE_EXP_DATA_DEFN only if they are not already set. */ #ifndef PCRE_EXP_DECL # ifdef _WIN32 -# ifdef DLL_EXPORT +# ifndef PCRE_STATIC # define PCRE_EXP_DECL extern __declspec(dllexport) # define PCRE_EXP_DEFN __declspec(dllexport) # define PCRE_EXP_DATA_DEFN __declspec(dllexport) @@ -121,7 +117,6 @@ PCRE_EXP_DATA_DEFN only if they are not already set. */ # define PCRE_EXP_DEFN # define PCRE_EXP_DATA_DEFN # endif -# # else # ifdef __cplusplus # define PCRE_EXP_DECL extern "C" @@ -234,7 +229,7 @@ must begin with PCRE_. */ /* Include the public PCRE header and the definitions of UCP character property values. */ -#include <pcre.h> +#include "pcre.h" #include "ucp.h" /* When compiling for use with the Virtual Pascal compiler, these functions @@ -363,7 +358,9 @@ capturing parenthesis numbers in back references. */ /* When UTF-8 encoding is being used, a character is no longer just a single byte. The macros for character handling generate simple sequences when used in -byte-mode, and more complicated ones for UTF-8 characters. */ +byte-mode, and more complicated ones for UTF-8 characters. BACKCHAR should +never be called in byte mode. To make sure it can never even appear when UTF-8 +support is omitted, we don't even define it. */ #ifndef SUPPORT_UTF8 #define GETCHAR(c, eptr) c = *eptr; @@ -371,7 +368,7 @@ byte-mode, and more complicated ones for UTF-8 characters. */ #define GETCHARINC(c, eptr) c = *eptr++; #define GETCHARINCTEST(c, eptr) c = *eptr++; #define GETCHARLEN(c, eptr, len) c = *eptr; -#define BACKCHAR(eptr) +/* #define BACKCHAR(eptr) */ #else /* SUPPORT_UTF8 */ @@ -464,9 +461,10 @@ if there are extra bytes. This is called when we know we are in UTF-8 mode. 
*/ } /* If the pointer is not at the start of a character, move it back until -it is. Called only in UTF-8 mode. */ +it is. This is called only in UTF-8 mode - we don't put a test within the macro +because almost all calls are already within a block of UTF-8 only code. */ -#define BACKCHAR(eptr) while((*eptr & 0xc0) == 0x80) eptr--; +#define BACKCHAR(eptr) while((*eptr & 0xc0) == 0x80) eptr-- #endif @@ -494,6 +492,7 @@ bits. */ #define PCRE_REQCHSET 0x20000000 /* req_byte is set */ #define PCRE_STARTLINE 0x10000000 /* start after \n for multiline */ #define PCRE_JCHANGED 0x08000000 /* j option changes within regex */ +#define PCRE_HASCRORLF 0x04000000 /* explicit \r or \n in pattern */ /* Options for the "extra" block produced by pcre_study(). */ @@ -610,14 +609,9 @@ enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_k, ESC_REF }; -/* Opcode table: OP_BRA must be last, as all values >= it are used for brackets -that extract substrings. Starting from 1 (i.e. after OP_END), the values up to +/* Opcode table: Starting from 1 (i.e. after OP_END), the values up to OP_EOD must correspond in order to the list of escapes immediately above. -To keep stored, compiled patterns compatible, new opcodes should be added -immediately before OP_BRA, where (since release 7.0) a gap is left for this -purpose. - *** NOTE NOTE NOTE *** Whenever this list is updated, the two macro definitions that follow must also be updated to match. There is also a table called "coptable" in pcre_dfa_exec.c that must be updated. */ @@ -744,7 +738,7 @@ enum { as there's a test for >= ONCE for a subpattern that isn't an assertion. 
*/ OP_ONCE, /* 92 Atomic group */ - OP_BRA, /* 83 Start of non-capturing bracket */ + OP_BRA, /* 93 Start of non-capturing bracket */ OP_CBRA, /* 94 Start of capturing bracket */ OP_COND, /* 95 Conditional group */ @@ -760,7 +754,19 @@ enum { OP_DEF, /* 101 The DEFINE condition */ OP_BRAZERO, /* 102 These two must remain together and in this */ - OP_BRAMINZERO /* 103 order. */ + OP_BRAMINZERO, /* 103 order. */ + + /* These are backtracking control verbs */ + + OP_PRUNE, /* 104 */ + OP_SKIP, /* 105 */ + OP_THEN, /* 106 */ + OP_COMMIT, /* 107 */ + + /* These are forced failure and success verbs */ + + OP_FAIL, /* 108 */ + OP_ACCEPT /* 109 */ }; @@ -783,8 +789,9 @@ for debugging. The macro is referenced only in pcre_printint.c. */ "class", "nclass", "xclass", "Ref", "Recurse", "Callout", \ "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", \ "AssertB", "AssertB not", "Reverse", \ - "Once", "Bra 0", "Bra", "Cond", "SBra 0", "SBra", "SCond", \ - "Cond ref", "Cond rec", "Cond def", "Brazero", "Braminzero" + "Once", "Bra", "CBra", "Cond", "SBra", "SCBra", "SCond", \ + "Cond ref", "Cond rec", "Cond def", "Brazero", "Braminzero", \ + "*PRUNE", "*SKIP", "*THEN", "*COMMIT", "*FAIL", "*ACCEPT" /* This macro defines the length of fixed length operations in the compiled @@ -848,6 +855,8 @@ in UTF-8 mode. The code that uses this table must know about such things. */ 3, /* RREF */ \ 1, /* DEF */ \ 1, 1, /* BRAZERO, BRAMINZERO */ \ + 1, 1, 1, 1, /* PRUNE, SKIP, THEN, COMMIT, */ \ + 1, 1 /* FAIL, ACCEPT */ /* A magic value for OP_RREF to indicate the "any recursion" condition. 
*/ @@ -862,7 +871,8 @@ enum { ERR0, ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9, ERR20, ERR21, ERR22, ERR23, ERR24, ERR25, ERR26, ERR27, ERR28, ERR29, ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39, ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49, - ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58 }; + ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59, + ERR60, ERR61 }; /* The real format of the start of the pcre block; the index of names and the code vector run on as long as necessary after the end. We store an explicit @@ -931,6 +941,7 @@ typedef struct compile_data { int external_options; /* External (initial) options */ int req_varyopt; /* "After variable item" flag for reqbyte */ BOOL nopartial; /* Set TRUE if partial won't work */ + BOOL had_accept; /* (*ACCEPT) encountered */ int nltype; /* Newline type */ int nllen; /* Newline string length */ uschar nl[4]; /* Newline string when fixed length */ diff --git a/ext/pcre/pcrelib/pcre_maketables.c b/ext/pcre/pcrelib/pcre_maketables.c index 9963e13fb0..1e6381aa34 100644 --- a/ext/pcre/pcrelib/pcre_maketables.c +++ b/ext/pcre/pcrelib/pcre_maketables.c @@ -45,7 +45,10 @@ compilation of dftables.c, in which case the macro DFTABLES is defined. */ #ifndef DFTABLES -#include "pcre_internal.h" +# ifdef HAVE_CONFIG_H +# include <config.h> +# endif +# include "pcre_internal.h" #endif diff --git a/ext/pcre/pcrelib/pcre_newline.c b/ext/pcre/pcrelib/pcre_newline.c index ae66c730f6..db02a8cf20 100644 --- a/ext/pcre/pcrelib/pcre_newline.c +++ b/ext/pcre/pcrelib/pcre_newline.c @@ -47,6 +47,10 @@ and NLTYPE_ANY. The full list of Unicode newline characters is taken from http://unicode.org/unicode/reports/tr18/. 
*/ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" @@ -124,12 +128,16 @@ _pcre_was_newline(const uschar *ptr, int type, const uschar *startptr, { int c; ptr--; +#ifdef SUPPORT_UTF8 if (utf8) { BACKCHAR(ptr); GETCHAR(c, ptr); } else c = *ptr; +#else /* no UTF-8 support */ +c = *ptr; +#endif /* SUPPORT_UTF8 */ if (type == NLTYPE_ANYCRLF) switch(c) { diff --git a/ext/pcre/pcrelib/pcre_ord2utf8.c b/ext/pcre/pcrelib/pcre_ord2utf8.c index a72761285e..7552034e44 100644 --- a/ext/pcre/pcrelib/pcre_ord2utf8.c +++ b/ext/pcre/pcrelib/pcre_ord2utf8.c @@ -41,6 +41,9 @@ POSSIBILITY OF SUCH DAMAGE. /* This file contains a private PCRE function that converts an ordinal character value into a UTF8 string. */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif #include "pcre_internal.h" diff --git a/ext/pcre/pcrelib/pcre_printint.src b/ext/pcre/pcrelib/pcre_printint.src index 79299f7e68..90381ed969 100644 --- a/ext/pcre/pcrelib/pcre_printint.src +++ b/ext/pcre/pcrelib/pcre_printint.src @@ -122,7 +122,7 @@ get_ucpname(int ptype, int pvalue) { #ifdef SUPPORT_UCP int i; -for (i = _pcre_utt_size; i >= 0; i--) +for (i = _pcre_utt_size - 1; i >= 0; i--) { if (ptype == _pcre_utt[i].type && pvalue == _pcre_utt[i].value) break; } diff --git a/ext/pcre/pcrelib/pcre_refcount.c b/ext/pcre/pcrelib/pcre_refcount.c index 8339019db5..b6a464ce81 100644 --- a/ext/pcre/pcrelib/pcre_refcount.c +++ b/ext/pcre/pcrelib/pcre_refcount.c @@ -43,6 +43,11 @@ auxiliary function that can be used to maintain a reference count in a compiled pattern data block. This might be helpful in applications where the block is shared by different users. */ + +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" diff --git a/ext/pcre/pcrelib/pcre_study.c b/ext/pcre/pcrelib/pcre_study.c index 3cf5b5954d..6a6c314b02 100644 --- a/ext/pcre/pcrelib/pcre_study.c +++ b/ext/pcre/pcrelib/pcre_study.c @@ -42,6 +42,10 @@ POSSIBILITY OF SUCH DAMAGE. supporting functions. 
*/ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" diff --git a/ext/pcre/pcrelib/pcre_tables.c b/ext/pcre/pcrelib/pcre_tables.c index 3e36c931a0..7d79eff205 100644 --- a/ext/pcre/pcrelib/pcre_tables.c +++ b/ext/pcre/pcrelib/pcre_tables.c @@ -44,6 +44,10 @@ uses macros to change their names from _pcre_xxx to xxxx, thereby avoiding name clashes with the library. */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" diff --git a/ext/pcre/pcrelib/pcre_try_flipped.c b/ext/pcre/pcrelib/pcre_try_flipped.c index cd45968a4a..4f98918538 100644 --- a/ext/pcre/pcrelib/pcre_try_flipped.c +++ b/ext/pcre/pcrelib/pcre_try_flipped.c @@ -43,6 +43,10 @@ see if it was compiled with the opposite endianness. If so, it uses an auxiliary local function to flip the appropriate bytes. */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" diff --git a/ext/pcre/pcrelib/pcre_ucp_searchfuncs.c b/ext/pcre/pcrelib/pcre_ucp_searchfuncs.c index 5ecba6b1f8..d3edc2d71b 100644 --- a/ext/pcre/pcrelib/pcre_ucp_searchfuncs.c +++ b/ext/pcre/pcrelib/pcre_ucp_searchfuncs.c @@ -41,6 +41,10 @@ POSSIBILITY OF SUCH DAMAGE. /* This module contains code for searching the table of Unicode character properties. */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" #include "ucp.h" /* Category definitions */ diff --git a/ext/pcre/pcrelib/pcre_valid_utf8.c b/ext/pcre/pcrelib/pcre_valid_utf8.c index 9a35a202e4..0486ea381f 100644 --- a/ext/pcre/pcrelib/pcre_valid_utf8.c +++ b/ext/pcre/pcrelib/pcre_valid_utf8.c @@ -42,6 +42,10 @@ POSSIBILITY OF SUCH DAMAGE. strings. */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" @@ -55,6 +59,13 @@ that subsequent code can assume it is dealing with a valid string. The check can be turned off for maximum performance, but the consequences of supplying an invalid string are then undefined. 
+Originally, this function checked according to RFC 2279, allowing for values in +the range 0 to 0x7fffffff, up to 6 bytes long, but ensuring that they were in +the canonical format. Once somebody had pointed out RFC 3629 to me (it +obsoletes 2279), additional restrictions were applied. The values are now +limited to be between 0 and 0x0010ffff, no more than 4 bytes long, and the +subrange 0xd800 to 0xdfff is excluded. + Arguments: string points to the string length length of string, or -1 if the string is zero-terminated @@ -81,31 +92,48 @@ for (p = string; length-- > 0; p++) register int c = *p; if (c < 128) continue; if (c < 0xc0) return p - string; - ab = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ - if (length < ab) return p - string; + ab = _pcre_utf8_table4[c & 0x3f]; /* Number of additional bytes */ + if (length < ab || ab > 3) return p - string; length -= ab; /* Check top bits in the second byte */ if ((*(++p) & 0xc0) != 0x80) return p - string; - /* Check for overlong sequences for each different length */ + /* Check for overlong sequences for each different length, and for the + excluded range 0xd800 to 0xdfff. 
*/ + switch (ab) { - /* Check for xx00 000x */ + /* Check for xx00 000x (overlong sequence) */ + case 1: if ((c & 0x3e) == 0) return p - string; continue; /* We know there aren't any more bytes to check */ - /* Check for 1110 0000, xx0x xxxx */ + /* Check for 1110 0000, xx0x xxxx (overlong sequence) or + 1110 1101, 101x xxxx (0xd800 - 0xdfff) */ + case 2: - if (c == 0xe0 && (*p & 0x20) == 0) return p - string; + if ((c == 0xe0 && (*p & 0x20) == 0) || + (c == 0xed && *p >= 0xa0)) + return p - string; break; - /* Check for 1111 0000, xx00 xxxx */ + /* Check for 1111 0000, xx00 xxxx (overlong sequence) or + greater than 0x0010ffff (f4 8f bf bf) */ + case 3: - if (c == 0xf0 && (*p & 0x30) == 0) return p - string; + if ((c == 0xf0 && (*p & 0x30) == 0) || + (c > 0xf4 ) || + (c == 0xf4 && *p > 0x8f)) + return p - string; break; +#if 0 + /* These cases can no longer occur, as we restrict to a maximum of four + bytes nowadays. Leave the code here in case we ever want to add an option + for longer sequences. */ + /* Check for 1111 1000, xx00 0xxx */ case 4: if (c == 0xf8 && (*p & 0x38) == 0) return p - string; @@ -116,6 +144,8 @@ for (p = string; length-- > 0; p++) if (c == 0xfe || c == 0xff || (c == 0xfc && (*p & 0x3c) == 0)) return p - string; break; +#endif + } /* Check for valid bytes after the 2nd, if any; all must start 10 */ diff --git a/ext/pcre/pcrelib/pcre_version.c b/ext/pcre/pcrelib/pcre_version.c index 1a9ecb2c50..e1ab4d4107 100644 --- a/ext/pcre/pcrelib/pcre_version.c +++ b/ext/pcre/pcrelib/pcre_version.c @@ -42,6 +42,10 @@ POSSIBILITY OF SUCH DAMAGE. string that identifies the PCRE version that is in use. 
*/ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" diff --git a/ext/pcre/pcrelib/pcre_xclass.c b/ext/pcre/pcrelib/pcre_xclass.c index 0b333515f2..502e718bf4 100644 --- a/ext/pcre/pcrelib/pcre_xclass.c +++ b/ext/pcre/pcrelib/pcre_xclass.c @@ -43,6 +43,10 @@ class (one that contains characters whose values are > 255). 
It is used by both pcre_exec() and pcre_dfa_exec(). */ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + #include "pcre_internal.h" diff --git a/ext/pcre/pcrelib/pcredemo.c b/ext/pcre/pcrelib/pcredemo.c index 80aba0e19d..4068e3e04d 100644 --- a/ext/pcre/pcrelib/pcredemo.c +++ b/ext/pcre/pcrelib/pcredemo.c @@ -11,15 +11,12 @@ Compile thuswise: -R/usr/local/lib -lpcre Replace "/usr/local/include" and "/usr/local/lib" with wherever the include and -library files for PCRE are installed on your system. Only some operating +library files for PCRE are installed on your system. You don't need -I and -L +if PCRE is installed in the standard system libraries. Only some operating systems (e.g. Solaris) use the -R option. */ -#ifdef HAVE_CONFIG_H -# include <config.h> -#endif - #include <stdio.h> #include <string.h> #include <pcre.h> diff --git a/ext/pcre/pcrelib/pcregrep.c b/ext/pcre/pcrelib/pcregrep.c index f14c973cb3..36b618a8c4 100644 --- a/ext/pcre/pcrelib/pcregrep.c +++ b/ext/pcre/pcrelib/pcregrep.c @@ -38,7 +38,7 @@ POSSIBILITY OF SUCH DAMAGE. */ #ifdef HAVE_CONFIG_H -# include <config.h> +#include <config.h> #endif #include <ctype.h> @@ -50,8 +50,9 @@ POSSIBILITY OF SUCH DAMAGE. #include <sys/types.h> #include <sys/stat.h> + #ifdef HAVE_UNISTD_H -# include <unistd.h> +#include <unistd.h> #endif #include <pcre.h> @@ -855,7 +856,7 @@ while (ptr < endptr) t = end_of_line(t, endptr, &endlinelength); linelength = t - ptr - endlinelength; - length = multiline? endptr - ptr : linelength; + length = multiline? (size_t)(endptr - ptr) : linelength; /* Extra processing for Jeffrey Friedl's debugging. */ @@ -1063,18 +1064,23 @@ while (ptr < endptr) /* In multiline mode, we want to print to the end of the line in which the end of the matched string is found, so we adjust linelength and the - line number appropriately. Because the PCRE_FIRSTLINE option is set, the - start of the match will always be before the first newline sequence. 
*/ + line number appropriately, but only when there actually was a match + (invert not set). Because the PCRE_FIRSTLINE option is set, the start of + the match will always be before the first newline sequence. */ if (multiline) { int ellength; - char *endmatch = ptr + offsets[1]; - t = ptr; - while (t < endmatch) + char *endmatch = ptr; + if (!invert) { - t = end_of_line(t, endptr, &ellength); - if (t <= endmatch) linenumber++; else break; + endmatch += offsets[1]; + t = ptr; + while (t < endmatch) + { + t = end_of_line(t, endptr, &ellength); + if (t <= endmatch) linenumber++; else break; + } } endmatch = end_of_line(endmatch, endptr, &ellength); linelength = endmatch - ptr - ellength; @@ -1123,6 +1129,24 @@ while (ptr < endptr) lastmatchnumber = linenumber + 1; } + /* For a match in multiline inverted mode (which of course did not cause + anything to be printed), we have to move on to the end of the match before + proceeding. */ + + if (multiline && invert && match) + { + int ellength; + char *endmatch = ptr + offsets[1]; + t = ptr; + while (t < endmatch) + { + t = end_of_line(t, endptr, &ellength); + if (t <= endmatch) linenumber++; else break; + } + endmatch = end_of_line(endmatch, endptr, &ellength); + linelength = endmatch - ptr - ellength; + } + /* Advance to after the newline and increment the line number. */ ptr += linelength + endlinelength; @@ -1625,7 +1649,7 @@ for (i = 1; i < argc; i++) else /* Special case xxx=data */ { int oplen = equals - op->long_name; - int arglen = (argequals == NULL)? strlen(arg) : argequals - arg; + int arglen = (argequals == NULL)? (int)strlen(arg) : argequals - arg; if (oplen == arglen && strncmp(arg, op->long_name, oplen) == 0) { option_data = arg + arglen; diff --git a/ext/pcre/pcrelib/pcreposix.c b/ext/pcre/pcrelib/pcreposix.c index 8582fba097..83263deaab 100644 --- a/ext/pcre/pcrelib/pcreposix.c +++ b/ext/pcre/pcrelib/pcreposix.c @@ -42,9 +42,24 @@ POSSIBILITY OF SUCH DAMAGE. functions. 
*/ +#ifdef HAVE_CONFIG_H +#include <config.h> +#endif + + +/* Ensure that the PCREPOSIX_EXP_xxx macros are set appropriately for +compiling these functions. This must come before including pcreposix.h, where +they are set for an application (using these functions) if they have not +previously been set. */ + +#if defined(_WIN32) && !defined(PCRE_STATIC) +# define PCREPOSIX_EXP_DECL extern __declspec(dllexport) +# define PCREPOSIX_EXP_DEFN __declspec(dllexport) +#endif + +#include <pcre.h> #include "pcre_internal.h" #include "pcreposix.h" -#include "stdlib.h" @@ -109,7 +124,8 @@ static const int eint[] = { REG_BADPAT, /* repeating a DEFINE group is not allowed */ REG_INVARG, /* inconsistent NEWLINE options */ REG_BADPAT, /* \g is not followed followed by an (optionally braced) non-zero number */ - REG_BADPAT /* (?+ or (?- must be followed by a non-zero number */ + REG_BADPAT, /* (?+ or (?- must be followed by a non-zero number */ + REG_BADPAT /* number is too big */ }; /* Table of texts corresponding to POSIX error codes */ diff --git a/ext/pcre/pcrelib/pcreposix.h b/ext/pcre/pcrelib/pcreposix.h index cca559b3b2..875e1ff18b 100644 --- a/ext/pcre/pcrelib/pcreposix.h +++ b/ext/pcre/pcrelib/pcreposix.h @@ -107,13 +107,12 @@ typedef struct { /* When an application links to a PCRE DLL in Windows, the symbols that are imported have to be identified as such. When building PCRE, the appropriate -export settings are needed. */ +export settings are needed, and are set in pcreposix.c before including this +file. */ -#ifdef _WIN32 -# ifndef PCREPOSIX_STATIC -# define PCREPOSIX_EXP_DECL extern __declspec(dllimport) -# define PCREPOSIX_EXP_DEFN __declspec(dllimport) -# endif +#if defined(_WIN32) && !defined(PCRE_STATIC) && !defined(PCREPOSIX_EXP_DECL) +# define PCREPOSIX_EXP_DECL extern __declspec(dllimport) +# define PCREPOSIX_EXP_DEFN __declspec(dllimport) #endif /* By default, we use the standard "extern" declarations. 
*/ diff --git a/ext/pcre/pcrelib/testdata/grepoutput b/ext/pcre/pcrelib/testdata/grepoutput index 2e8cdc7d69..d5506a1097 100644 --- a/ext/pcre/pcrelib/testdata/grepoutput +++ b/ext/pcre/pcrelib/testdata/grepoutput @@ -383,3 +383,5 @@ AB.VE AB.VE the turtle PUT NEW DATA ABOVE THIS LINE. ---------------------------- Test 49 ------------------------------ +---------------------------- Test 50 ------------------------------ +over the lazy dog. diff --git a/ext/pcre/pcrelib/testdata/testinput1 b/ext/pcre/pcrelib/testdata/testinput1 index 7e8e9f0c4d..79c98fa7bd 100644 --- a/ext/pcre/pcrelib/testdata/testinput1 +++ b/ext/pcre/pcrelib/testdata/testinput1 @@ -4021,4 +4021,13 @@ /(.*(.)?)*/ abcd +/( (A | (?(1)0|) )* )/x + abcd + +/( ( (?(1)0|) )* )/x + abcd + +/( (?(1)0|)* )/x + abcd + / End of testinput1 / diff --git a/ext/pcre/pcrelib/testdata/testinput10 b/ext/pcre/pcrelib/testdata/testinput10 index 369af4a3d3..726a3890a2 100644 --- a/ext/pcre/pcrelib/testdata/testinput10 +++ b/ext/pcre/pcrelib/testdata/testinput10 @@ -101,4 +101,24 @@ are all themselves checked in other tests. --/ /[\x{105}-\x{109}]/8iBM +/( ( (?(1)0|) )* )/xBM + +/( (?(1)0|)* )/xBM + +/[a]/BM + +/[a]/8BM + +/[\xaa]/BM + +/[\xaa]/8BM + +/[^a]/BM + +/[^a]/8BM + +/[^\xaa]/BM + +/[^\xaa]/8BM + / End of testinput10 / diff --git a/ext/pcre/pcrelib/testdata/testinput2 b/ext/pcre/pcrelib/testdata/testinput2 index 2ce7ad0321..c9f1018a90 100644 --- a/ext/pcre/pcrelib/testdata/testinput2 +++ b/ext/pcre/pcrelib/testdata/testinput2 @@ -2326,4 +2326,142 @@ a random value. 
/Ix /\V+\v\V+\w/BZ +/\( (?: [^()]* | (?R) )* \)/x +(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(
0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(00)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0
)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0) + +/[\E]AAA/ + +/[\Q\E]AAA/ + +/[^\E]AAA/ + +/[^\Q\E]AAA/ + +/[\E^]AAA/ + +/[\Q\E^]AAA/ + +/A(*PRUNE)B(*SKIP)C(*THEN)D(*COMMIT)E(*F)F(*FAIL)G(?!)H(*ACCEPT)I/BZ + +/^a+(*FAIL)/ + aaaaaa + +/a+b?c+(*FAIL)/ + aaabccc + +/a+b?(*PRUNE)c+(*FAIL)/ + aaabccc + +/a+b?(*COMMIT)c+(*FAIL)/ + aaabccc + +/a+b?(*SKIP)c+(*FAIL)/ + aaabcccaaabccc + +/^(?:aaa(*THEN)\w{6}|bbb(*THEN)\w{5}|ccc(*THEN)\w{4}|\w{3})/ + aaaxxxxxx + aaa++++++ + bbbxxxxx + bbb+++++ + cccxxxx + ccc++++ + dddddddd + +/^(aaa(*THEN)\w{6}|bbb(*THEN)\w{5}|ccc(*THEN)\w{4}|\w{3})/ + aaaxxxxxx + aaa++++++ + bbbxxxxx + bbb+++++ + cccxxxx + ccc++++ + dddddddd + +/a+b?(*THEN)c+(*FAIL)/ + aaabccc + +/(A (A|B(*ACCEPT)|C) D)(E)/x + ABX + AADE + ACDE + ** Failers + AD + +/^a+(*FAIL)/C + aaaaaa + +/a+b?c+(*FAIL)/C + aaabccc + +/a+b?(*PRUNE)c+(*FAIL)/C + aaabccc + +/a+b?(*COMMIT)c+(*FAIL)/C + aaabccc + +/a+b?(*SKIP)c+(*FAIL)/C + aaabcccaaabccc + +/a+b?(*THEN)c+(*FAIL)/C + aaabccc + +/a(*PRUNE:XXX)b/ + +/a(*MARK)b/ + +/(?i:A{1,}\6666666666)/ + +/\g6666666666/ + +/[\g6666666666]/ + +/(?1)\c[/ + +/.+A/<crlf> + \r\nA + +/\nA/<crlf> + \r\nA + +/[\r\n]A/<crlf> + \r\nA + +/(\r|\n)A/<crlf> + \r\nA + +/a(*CR)b/ + +/(*CR)a.b/ + a\nb + ** Failers + a\rb + +/(*CR)a.b/<lf> + a\nb + ** Failers + a\rb + +/(*LF)a.b/<CRLF> + a\rb + ** Failers + a\nb + +/(*CRLF)a.b/ + a\rb + a\nb + ** Failers + a\r\nb + +/(*ANYCRLF)a.b/<CR> + ** Failers + a\rb + a\nb + a\r\nb + +/(*ANY)a.b/<cr> + ** Failers + a\rb + a\nb + a\r\nb + a\x85b + / End of testinput2 / diff --git a/ext/pcre/pcrelib/testdata/testinput4 b/ext/pcre/pcrelib/testdata/testinput4 index 0fb850bffc..630fb1d532 100644 --- a/ext/pcre/pcrelib/testdata/testinput4 +++ b/ext/pcre/pcrelib/testdata/testinput4 @@ -523,4 +523,16 @@ /a*\x{100}*\w/8 a +/\S\S/8g + A\x{a3}BC + +/\S{2}/8g + A\x{a3}BC + +/\W\W/8g + +\x{a3}== + +/\W{2}/8g + +\x{a3}== + / End of testinput4 / diff --git a/ext/pcre/pcrelib/testdata/testinput5 
b/ext/pcre/pcrelib/testdata/testinput5 index e8e3cf799f..aa0123b3ea 100644 --- a/ext/pcre/pcrelib/testdata/testinput5 +++ b/ext/pcre/pcrelib/testdata/testinput5 @@ -238,6 +238,10 @@ can't tell the difference.) --/ \xf9\x87\x80\x80\x80 \xfc\x84\x80\x80\x80\x80 \xfd\x83\x80\x80\x80\x80 + \?\xf8\x88\x80\x80\x80 + \?\xf9\x87\x80\x80\x80 + \?\xfc\x84\x80\x80\x80\x80 + \?\xfd\x83\x80\x80\x80\x80 /\x{100}abc(xyz(?1))/8DZ @@ -393,4 +397,24 @@ can't tell the difference.) --/ /[\V]/8BZ +/.*$/8<any> + \x{1ec5} + +/-- This tests the stricter UTF-8 check according to RFC 3629. --/ + +/X/8 + \x{0}\x{d7ff}\x{e000}\x{10ffff} + \x{d800} + \x{d800}\? + \x{da00} + \x{da00}\? + \x{dfff} + \x{dfff}\? + \x{110000} + \x{110000}\? + \x{2000000} + \x{2000000}\? + \x{7fffffff} + \x{7fffffff}\? + / End of testinput5 / diff --git a/ext/pcre/pcrelib/testdata/testinput6 b/ext/pcre/pcrelib/testdata/testinput6 index 05e8feb026..53d2b328ff 100644 --- a/ext/pcre/pcrelib/testdata/testinput6 +++ b/ext/pcre/pcrelib/testdata/testinput6 @@ -61,7 +61,7 @@ \x{09f} /^\p{Cs}/8 - \x{dfff} + \?\x{dfff} ** Failers \x{09f} @@ -69,7 +69,7 @@ a ** Failers Z - \x{dfff} + \x{e000} /^\p{Lm}/8 \x{2b0} @@ -778,4 +778,58 @@ was broken in all cases./ 123abcdefg 123abc\xc4\xc5zz +/\X{1,3}\d/ + \x8aBCD + +/\X?\d/ + \x8aBCD + +/\P{L}?\d/ + \x8aBCD + +/[\PPP\x8a]{1,}\x80/ + A\x80 + +/(?:[\PPa*]*){8,}/ + +/[\P{Any}]/BZ + +/[\P{Any}\E]/BZ + +/(\P{Yi}+\277)/ + +/(\P{Yi}+\277)?/ + +/(?<=\P{Yi}{3}A)X/ + +/\p{Yi}+(\P{Yi}+)(?1)/ + +/(\P{Yi}{2}\277)?/ + +/[\P{Yi}A]/ + +/[\P{Yi}\P{Yi}\P{Yi}A]/ + +/[^\P{Yi}A]/ + +/[^\P{Yi}\P{Yi}\P{Yi}A]/ + +/(\P{Yi}*\277)*/ + +/(\P{Yi}*?\277)*/ + +/(\p{Yi}*+\277)*/ + +/(\P{Yi}?\277)*/ + +/(\P{Yi}??\277)*/ + +/(\p{Yi}?+\277)*/ + +/(\P{Yi}{0,3}\277)*/ + +/(\P{Yi}{0,3}?\277)*/ + +/(\p{Yi}{0,3}+\277)*/ + / End of testinput6 / diff --git a/ext/pcre/pcrelib/testdata/testinput7 b/ext/pcre/pcrelib/testdata/testinput7 index 2722980ad6..76524b725a 100644 --- a/ext/pcre/pcrelib/testdata/testinput7 +++ 
b/ext/pcre/pcrelib/testdata/testinput7 @@ -4298,4 +4298,16 @@ >XY\x0aZ\x0aA\x0bNN\x0c >\x0a\x0dX\x0aY\x0a\x0bZZZ\x0aAAA\x0bNNN\x0c +/.+A/<crlf> + \r\nA + +/\nA/<crlf> + \r\nA + +/[\r\n]A/<crlf> + \r\nA + +/(\r|\n)A/<crlf> + \r\nA + / End of testinput7 / diff --git a/ext/pcre/pcrelib/testdata/testinput9 b/ext/pcre/pcrelib/testdata/testinput9 index 07d9548af5..8a606318b7 100644 --- a/ext/pcre/pcrelib/testdata/testinput9 +++ b/ext/pcre/pcrelib/testdata/testinput9 @@ -148,7 +148,7 @@ \x{09f} /^\p{Cs}/8 - \x{dfff} + \?\x{dfff} ** Failers \x{09f} @@ -156,7 +156,7 @@ a ** Failers Z - \x{dfff} + \x{e000} /^\p{Lm}/8 \x{2b0} diff --git a/ext/pcre/pcrelib/testdata/testoutput1 b/ext/pcre/pcrelib/testdata/testoutput1 index 209b0d3f4c..4c0e680d11 100644 --- a/ext/pcre/pcrelib/testdata/testoutput1 +++ b/ext/pcre/pcrelib/testdata/testoutput1 @@ -6576,4 +6576,21 @@ No match 0: abcd 1: +/( (A | (?(1)0|) )* )/x + abcd + 0: + 1: + 2: + +/( ( (?(1)0|) )* )/x + abcd + 0: + 1: + 2: + +/( (?(1)0|)* )/x + abcd + 0: + 1: + / End of testinput1 / diff --git a/ext/pcre/pcrelib/testdata/testoutput10 b/ext/pcre/pcrelib/testdata/testoutput10 index bfda261bc8..dbd59241ad 100644 --- a/ext/pcre/pcrelib/testdata/testoutput10 +++ b/ext/pcre/pcrelib/testdata/testoutput10 @@ -6,8 +6,8 @@ are all themselves checked in other tests. 
--/ /((?i)b)/BM Memory allocation (code space): 21 ------------------------------------------------------------------ - 0 17 Bra 0 - 3 9 Bra 1 + 0 17 Bra + 3 9 CBra 1 8 01 Opt 10 NC b 12 9 Ket @@ -19,8 +19,8 @@ Memory allocation (code space): 21 /(?s)(.*X|^B)/BM Memory allocation (code space): 25 ------------------------------------------------------------------ - 0 21 Bra 0 - 3 9 Bra 1 + 0 21 Bra + 3 9 CBra 1 8 Any* 10 X 12 6 Alt @@ -34,8 +34,8 @@ Memory allocation (code space): 25 /(?s:.*X|^B)/BM Memory allocation (code space): 29 ------------------------------------------------------------------ - 0 25 Bra 0 - 3 9 Bra 0 + 0 25 Bra + 3 9 Bra 6 04 Opt 8 Any* 10 X @@ -52,7 +52,7 @@ Memory allocation (code space): 29 /^[[:alnum:]]/BM Memory allocation (code space): 41 ------------------------------------------------------------------ - 0 37 Bra 0 + 0 37 Bra 3 ^ 4 [0-9A-Za-z] 37 37 Ket @@ -62,7 +62,7 @@ Memory allocation (code space): 41 /#/IxMD Memory allocation (code space): 7 ------------------------------------------------------------------ - 0 3 Bra 0 + 0 3 Bra 3 3 Ket 6 End ------------------------------------------------------------------ @@ -74,7 +74,7 @@ No need char /a#/IxMD Memory allocation (code space): 9 ------------------------------------------------------------------ - 0 5 Bra 0 + 0 5 Bra 3 a 5 5 Ket 8 End @@ -87,7 +87,7 @@ No need char /x?+/BM Memory allocation (code space): 9 ------------------------------------------------------------------ - 0 5 Bra 0 + 0 5 Bra 3 x?+ 5 5 Ket 8 End @@ -96,7 +96,7 @@ Memory allocation (code space): 9 /x++/BM Memory allocation (code space): 9 ------------------------------------------------------------------ - 0 5 Bra 0 + 0 5 Bra 3 x++ 5 5 Ket 8 End @@ -105,7 +105,7 @@ Memory allocation (code space): 9 /x{1,3}+/BM Memory allocation (code space): 19 ------------------------------------------------------------------ - 0 15 Bra 0 + 0 15 Bra 3 9 Once 6 x 8 x{0,2} @@ -117,10 +117,10 @@ Memory allocation (code space): 19 
/(x)*+/BM Memory allocation (code space): 24 ------------------------------------------------------------------ - 0 20 Bra 0 + 0 20 Bra 3 14 Once 6 Brazero - 7 7 Bra 1 + 7 7 CBra 1 12 x 14 7 KetRmax 17 14 Ket @@ -131,19 +131,19 @@ Memory allocation (code space): 24 /^((a+)(?U)([ab]+)(?-U)([bc]+)(\w*))/BM Memory allocation (code space): 120 ------------------------------------------------------------------ - 0 116 Bra 0 + 0 116 Bra 3 ^ - 4 109 Bra 1 - 9 7 Bra 2 + 4 109 CBra 1 + 9 7 CBra 2 14 a+ 16 7 Ket - 19 39 Bra 3 + 19 39 CBra 3 24 [ab]+? 58 39 Ket - 61 39 Bra 4 + 61 39 CBra 4 66 [bc]+ 100 39 Ket -103 7 Bra 5 +103 7 CBra 5 108 \w* 110 7 Ket 113 109 Ket @@ -154,7 +154,7 @@ Memory allocation (code space): 120 |8J\$WE\<\.rX\+ix\[d1b\!H\#\?vV0vrK\:ZH1\=2M\>iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\<EjmhUZ\?\.akp2dF\>qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|BM Memory allocation (code space): 826 ------------------------------------------------------------------ - 0 822 Bra 0 + 0 822 Bra 3 8J$WE<.rX+ix[d1b!H#?vV0vrK:ZH1=2M>iV;?aPhFB<*vW@QW@sO9}cfZA-i'w%hKd6gt1UJP,15_#QY$M^Mss_U/]&LK9[5vQub^w[KDD<EjmhUZ?.akp2dF>qmj;2}YWFdYx.Ap]hjCPTP(n28k+3;o&WXqs/gOXdr$:r'do0;b4c(f_Gr="\4)[01T7ajQJvL$W~mL_sS/4h:x*[ZN=KLs&L5zX//>it,o:aU(;Z>pW&T7oP'2K^E:x9'c[%z-,64JQ5AeH_G#KijUKghQw^\vea3a?kka_G$8#`*kynsxzBLru']k_[7FrVx}^=$blx>s-N%j;D*aZDnsw:YKZ%Q.Kne9#hP?+b3(SOvL,^;&u5@?5C5Bhb=m-vEh_L15Jl]U)0RP6{q%L^_z5E'Dw6X 821 \b 822 822 Ket @@ -164,7 +164,7 @@ Memory allocation (code space): 826 
|\$\<\.X\+ix\[d1b\!H\#\?vV0vrK\:ZH1\=2M\>iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\<EjmhUZ\?\.akp2dF\>qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|BM Memory allocation (code space): 816 ------------------------------------------------------------------ - 0 812 Bra 0 + 0 812 Bra 3 $<.X+ix[d1b!H#?vV0vrK:ZH1=2M>iV;?aPhFB<*vW@QW@sO9}cfZA-i'w%hKd6gt1UJP,15_#QY$M^Mss_U/]&LK9[5vQub^w[KDD<EjmhUZ?.akp2dF>qmj;2}YWFdYx.Ap]hjCPTP(n28k+3;o&WXqs/gOXdr$:r'do0;b4c(f_Gr="\4)[01T7ajQJvL$W~mL_sS/4h:x*[ZN=KLs&L5zX//>it,o:aU(;Z>pW&T7oP'2K^E:x9'c[%z-,64JQ5AeH_G#KijUKghQw^\vea3a?kka_G$8#`*kynsxzBLru']k_[7FrVx}^=$blx>s-N%j;D*aZDnsw:YKZ%Q.Kne9#hP?+b3(SOvL,^;&u5@?5C5Bhb=m-vEh_L15Jl]U)0RP6{q%L^_z5E'Dw6X 811 \b 812 812 Ket @@ -174,8 +174,8 @@ Memory allocation (code space): 816 /(a(?1)b)/BM Memory allocation (code space): 28 ------------------------------------------------------------------ - 0 24 Bra 0 - 3 18 Bra 1 + 0 24 Bra + 3 18 CBra 1 8 a 10 6 Once 13 3 Recurse @@ -189,8 +189,8 @@ Memory allocation (code space): 28 /(a(?1)+b)/BM Memory allocation (code space): 28 ------------------------------------------------------------------ - 0 24 Bra 0 - 3 18 Bra 1 + 0 24 Bra + 3 18 CBra 1 8 a 10 6 Once 13 3 Recurse @@ -204,15 +204,15 @@ Memory allocation (code space): 28 /a(?P<name1>b|c)d(?P<longername2>e)/BM Memory allocation (code space): 42 ------------------------------------------------------------------ - 0 32 Bra 0 + 0 32 Bra 3 a - 5 7 Bra 1 + 5 7 CBra 1 10 b 12 5 Alt 15 c 17 12 Ket 20 d - 22 7 Bra 2 + 22 7 CBra 2 27 e 29 7 Ket 32 32 Ket @@ -222,17 +222,17 @@ Memory allocation (code space): 42 
/(?:a(?P<c>c(?P<d>d)))(?P<a>a)/BM Memory allocation (code space): 54 ------------------------------------------------------------------ - 0 41 Bra 0 - 3 25 Bra 0 + 0 41 Bra + 3 25 Bra 6 a - 8 17 Bra 1 + 8 17 CBra 1 13 c - 15 7 Bra 2 + 15 7 CBra 2 20 d 22 7 Ket 25 17 Ket 28 25 Ket - 31 7 Bra 3 + 31 7 CBra 3 36 a 38 7 Ket 41 41 Ket @@ -242,8 +242,8 @@ Memory allocation (code space): 54 /(?P<a>a)...(?P=a)bbb(?P>a)d/BM Memory allocation (code space): 43 ------------------------------------------------------------------ - 0 36 Bra 0 - 3 7 Bra 1 + 0 36 Bra + 3 7 CBra 1 8 a 10 7 Ket 13 Any @@ -262,7 +262,7 @@ Memory allocation (code space): 43 /abc(?C255)de(?C)f/BM Memory allocation (code space): 31 ------------------------------------------------------------------ - 0 27 Bra 0 + 0 27 Bra 3 abc 9 Callout 255 10 1 15 de @@ -275,7 +275,7 @@ Memory allocation (code space): 31 /abcde/CBM Memory allocation (code space): 53 ------------------------------------------------------------------ - 0 49 Bra 0 + 0 49 Bra 3 Callout 255 0 1 9 a 11 Callout 255 1 1 @@ -294,7 +294,7 @@ Memory allocation (code space): 53 /\x{100}/8BM Memory allocation (code space): 10 ------------------------------------------------------------------ - 0 6 Bra 0 + 0 6 Bra 3 \x{100} 6 6 Ket 9 End @@ -303,7 +303,7 @@ Memory allocation (code space): 10 /\x{1000}/8BM Memory allocation (code space): 11 ------------------------------------------------------------------ - 0 7 Bra 0 + 0 7 Bra 3 \x{1000} 7 7 Ket 10 End @@ -312,7 +312,7 @@ Memory allocation (code space): 11 /\x{10000}/8BM Memory allocation (code space): 12 ------------------------------------------------------------------ - 0 8 Bra 0 + 0 8 Bra 3 \x{10000} 8 8 Ket 11 End @@ -321,7 +321,7 @@ Memory allocation (code space): 12 /\x{100000}/8BM Memory allocation (code space): 12 ------------------------------------------------------------------ - 0 8 Bra 0 + 0 8 Bra 3 \x{100000} 8 8 Ket 11 End @@ -330,7 +330,7 @@ Memory allocation (code space): 12 
/\x{1000000}/8BM Memory allocation (code space): 13 ------------------------------------------------------------------ - 0 9 Bra 0 + 0 9 Bra 3 \x{1000000} 9 9 Ket 12 End @@ -339,7 +339,7 @@ Memory allocation (code space): 13 /\x{4000000}/8BM Memory allocation (code space): 14 ------------------------------------------------------------------ - 0 10 Bra 0 + 0 10 Bra 3 \x{4000000} 10 10 Ket 13 End @@ -348,7 +348,7 @@ Memory allocation (code space): 14 /\x{7fffFFFF}/8BM Memory allocation (code space): 14 ------------------------------------------------------------------ - 0 10 Bra 0 + 0 10 Bra 3 \x{7fffffff} 10 10 Ket 13 End @@ -357,7 +357,7 @@ Memory allocation (code space): 14 /[\x{ff}]/8BM Memory allocation (code space): 10 ------------------------------------------------------------------ - 0 6 Bra 0 + 0 6 Bra 3 \x{ff} 6 6 Ket 9 End @@ -366,7 +366,7 @@ Memory allocation (code space): 10 /[\x{100}]/8BM Memory allocation (code space): 15 ------------------------------------------------------------------ - 0 11 Bra 0 + 0 11 Bra 3 [\x{100}] 11 11 Ket 14 End @@ -375,7 +375,7 @@ Memory allocation (code space): 15 /\x80/8BM Memory allocation (code space): 10 ------------------------------------------------------------------ - 0 6 Bra 0 + 0 6 Bra 3 \x{80} 6 6 Ket 9 End @@ -384,7 +384,7 @@ Memory allocation (code space): 10 /\xff/8BM Memory allocation (code space): 10 ------------------------------------------------------------------ - 0 6 Bra 0 + 0 6 Bra 3 \x{ff} 6 6 Ket 9 End @@ -393,7 +393,7 @@ Memory allocation (code space): 10 /\x{0041}\x{2262}\x{0391}\x{002e}/D8M Memory allocation (code space): 18 ------------------------------------------------------------------ - 0 14 Bra 0 + 0 14 Bra 3 A\x{2262}\x{391}. 14 14 Ket 17 End @@ -406,7 +406,7 @@ Need char = '.' 
/\x{D55c}\x{ad6d}\x{C5B4}/D8M Memory allocation (code space): 19 ------------------------------------------------------------------ - 0 15 Bra 0 + 0 15 Bra 3 \x{d55c}\x{ad6d}\x{c5b4} 15 15 Ket 18 End @@ -419,7 +419,7 @@ Need char = 180 /\x{65e5}\x{672c}\x{8a9e}/D8M Memory allocation (code space): 19 ------------------------------------------------------------------ - 0 15 Bra 0 + 0 15 Bra 3 \x{65e5}\x{672c}\x{8a9e} 15 15 Ket 18 End @@ -432,7 +432,7 @@ Need char = 158 /[\x{100}]/8BM Memory allocation (code space): 15 ------------------------------------------------------------------ - 0 11 Bra 0 + 0 11 Bra 3 [\x{100}] 11 11 Ket 14 End @@ -441,7 +441,7 @@ Memory allocation (code space): 15 /[Z\x{100}]/8BM Memory allocation (code space): 47 ------------------------------------------------------------------ - 0 43 Bra 0 + 0 43 Bra 3 [Z\x{100}] 43 43 Ket 46 End @@ -450,7 +450,7 @@ Memory allocation (code space): 47 /^[\x{100}\E-\Q\E\x{150}]/B8M Memory allocation (code space): 18 ------------------------------------------------------------------ - 0 14 Bra 0 + 0 14 Bra 3 ^ 4 [\x{100}-\x{150}] 14 14 Ket @@ -460,7 +460,7 @@ Memory allocation (code space): 18 /^[\QĀ\E-\QŐ\E]/B8M Memory allocation (code space): 18 ------------------------------------------------------------------ - 0 14 Bra 0 + 0 14 Bra 3 ^ 4 [\x{100}-\x{150}] 14 14 Ket @@ -473,7 +473,7 @@ Failed: missing terminating ] for character class at offset 15 /[\p{L}]/BM Memory allocation (code space): 15 ------------------------------------------------------------------ - 0 11 Bra 0 + 0 11 Bra 3 [\p{L}] 11 11 Ket 14 End @@ -482,7 +482,7 @@ Memory allocation (code space): 15 /[\p{^L}]/BM Memory allocation (code space): 15 ------------------------------------------------------------------ - 0 11 Bra 0 + 0 11 Bra 3 [\P{L}] 11 11 Ket 14 End @@ -491,7 +491,7 @@ Memory allocation (code space): 15 /[\P{L}]/BM Memory allocation (code space): 15 ------------------------------------------------------------------ - 0 11 Bra 0 
+ 0 11 Bra 3 [\P{L}] 11 11 Ket 14 End @@ -500,7 +500,7 @@ Memory allocation (code space): 15 /[\P{^L}]/BM Memory allocation (code space): 15 ------------------------------------------------------------------ - 0 11 Bra 0 + 0 11 Bra 3 [\p{L}] 11 11 Ket 14 End @@ -509,7 +509,7 @@ Memory allocation (code space): 15 /[abc\p{L}\x{0660}]/8BM Memory allocation (code space): 50 ------------------------------------------------------------------ - 0 46 Bra 0 + 0 46 Bra 3 [a-c\p{L}\x{660}] 46 46 Ket 49 End @@ -518,7 +518,7 @@ Memory allocation (code space): 50 /[\p{Nd}]/8BM Memory allocation (code space): 15 ------------------------------------------------------------------ - 0 11 Bra 0 + 0 11 Bra 3 [\p{Nd}] 11 11 Ket 14 End @@ -527,7 +527,7 @@ Memory allocation (code space): 15 /[\p{Nd}+-]+/8BM Memory allocation (code space): 48 ------------------------------------------------------------------ - 0 44 Bra 0 + 0 44 Bra 3 [+\-\p{Nd}]+ 44 44 Ket 47 End @@ -536,7 +536,7 @@ Memory allocation (code space): 48 /A\x{391}\x{10427}\x{ff3a}\x{1fb0}/8iBM Memory allocation (code space): 25 ------------------------------------------------------------------ - 0 21 Bra 0 + 0 21 Bra 3 NC A\x{391}\x{10427}\x{ff3a}\x{1fb0} 21 21 Ket 24 End @@ -545,7 +545,7 @@ Memory allocation (code space): 25 /A\x{391}\x{10427}\x{ff3a}\x{1fb0}/8BM Memory allocation (code space): 25 ------------------------------------------------------------------ - 0 21 Bra 0 + 0 21 Bra 3 A\x{391}\x{10427}\x{ff3a}\x{1fb0} 21 21 Ket 24 End @@ -554,10 +554,116 @@ Memory allocation (code space): 25 /[\x{105}-\x{109}]/8iBM Memory allocation (code space): 17 ------------------------------------------------------------------ - 0 13 Bra 0 + 0 13 Bra 3 [\x{104}-\x{109}] 13 13 Ket 16 End ------------------------------------------------------------------ +/( ( (?(1)0|) )* )/xBM +Memory allocation (code space): 38 +------------------------------------------------------------------ + 0 34 Bra + 3 28 CBra 1 + 8 Brazero + 9 19 SCBra 2 + 
14 8 Cond + 17 1 Cond ref + 20 0 + 22 3 Alt + 25 11 Ket + 28 19 KetRmax + 31 28 Ket + 34 34 Ket + 37 End +------------------------------------------------------------------ + +/( (?(1)0|)* )/xBM +Memory allocation (code space): 30 +------------------------------------------------------------------ + 0 26 Bra + 3 20 CBra 1 + 8 Brazero + 9 8 SCond + 12 1 Cond ref + 15 0 + 17 3 Alt + 20 11 KetRmax + 23 20 Ket + 26 26 Ket + 29 End +------------------------------------------------------------------ + +/[a]/BM +Memory allocation (code space): 9 +------------------------------------------------------------------ + 0 5 Bra + 3 a + 5 5 Ket + 8 End +------------------------------------------------------------------ + +/[a]/8BM +Memory allocation (code space): 9 +------------------------------------------------------------------ + 0 5 Bra + 3 a + 5 5 Ket + 8 End +------------------------------------------------------------------ + +/[\xaa]/BM +Memory allocation (code space): 9 +------------------------------------------------------------------ + 0 5 Bra + 3 \xaa + 5 5 Ket + 8 End +------------------------------------------------------------------ + +/[\xaa]/8BM +Memory allocation (code space): 10 +------------------------------------------------------------------ + 0 6 Bra + 3 \x{aa} + 6 6 Ket + 9 End +------------------------------------------------------------------ + +/[^a]/BM +Memory allocation (code space): 9 +------------------------------------------------------------------ + 0 5 Bra + 3 [^a] + 5 5 Ket + 8 End +------------------------------------------------------------------ + +/[^a]/8BM +Memory allocation (code space): 9 +------------------------------------------------------------------ + 0 5 Bra + 3 [^a] + 5 5 Ket + 8 End +------------------------------------------------------------------ + +/[^\xaa]/BM +Memory allocation (code space): 9 +------------------------------------------------------------------ + 0 5 Bra + 3 [^\xaa] + 5 5 Ket + 8 End 
+------------------------------------------------------------------ + +/[^\xaa]/8BM +Memory allocation (code space): 40 +------------------------------------------------------------------ + 0 36 Bra + 3 [\x00-\xa9\xab-\xff] (neg) + 36 36 Ket + 39 End +------------------------------------------------------------------ + / End of testinput10 / diff --git a/ext/pcre/pcrelib/testdata/testoutput2 b/ext/pcre/pcrelib/testdata/testoutput2 index cd8f82eb10..a1c071d8b0 100644 --- a/ext/pcre/pcrelib/testdata/testoutput2 +++ b/ext/pcre/pcrelib/testdata/testoutput2 @@ -166,6 +166,7 @@ Starting byte set: a b c d /(a|[^\dZ])/IS Capturing subpattern count = 1 +Contains explicit CR or LF match No options No first char No need char @@ -402,6 +403,7 @@ Failed: missing terminating ] for character class at offset 4 /[^aeiou ]{3,}/I Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match No options No first char No need char @@ -599,8 +601,8 @@ Need char = 'h' (caseless) /((?i)b)/IDZS ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 01 Opt NC b Ket @@ -703,6 +705,7 @@ Starting byte set: a b /(?<=foo\n)^bar/Im Capturing subpattern count = 0 +Contains explicit CR or LF match Options: multiline No first char Need char = 'r' @@ -719,6 +722,7 @@ No match /^(?<=foo\n)bar/Im Capturing subpattern count = 0 +Contains explicit CR or LF match Options: multiline First char at start or follows newline Need char = 'r' @@ -1105,13 +1109,14 @@ No need char )?)?)?)?)?)?)?)?)?otherword/I Capturing subpattern count = 8 Partial matching not supported +Contains explicit CR or LF match No options First char = 'w' Need char = 'd' /.*X/IDZ ------------------------------------------------------------------ - Bra 0 + Bra Any* X Ket @@ -1125,7 +1130,7 @@ Need char = 'X' /.*X/IDZs ------------------------------------------------------------------ - Bra 0 + Bra Any* X Ket @@ -1139,8 +1144,8 @@ Need char = 'X' /(.*X|^B)/IDZ 
------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 Any* X Alt @@ -1158,8 +1163,8 @@ No need char /(.*X|^B)/IDZs ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 Any* X Alt @@ -1177,8 +1182,8 @@ No need char /(?s)(.*X|^B)/IDZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 Any* X Alt @@ -1196,8 +1201,8 @@ No need char /(?s:.*X|^B)/IDZ ------------------------------------------------------------------ - Bra 0 - Bra 0 + Bra + Bra 04 Opt Any* X @@ -1347,6 +1352,7 @@ No need char /^ab\n/Ig+ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored No first char No need char @@ -1356,6 +1362,7 @@ No need char /^ab\n/Img+ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: multiline First char at start or follows newline Need char = 10 @@ -1433,6 +1440,7 @@ Need char = 'a' /"([^\\"]+|\\.)*"/I Capturing subpattern count = 1 Partial matching not supported +Contains explicit CR or LF match No options First char = '"' Need char = '"' @@ -1708,6 +1716,7 @@ Study returned NULL /Ix Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1737,6 +1746,7 @@ No match /\( ( (?>[^()]+) | (?R) )* \) /Ixg Capturing subpattern count = 1 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1752,6 +1762,7 @@ Need char = ')' /\( (?: (?>[^()]+) | (?R) ) \) /Ix Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1771,6 +1782,7 @@ No match /\( (?: (?>[^()]+) | (?R) )? 
\) /Ix Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1782,6 +1794,7 @@ Need char = ')' /\( ( (?>[^()]+) | (?R) )* \) /Ix Capturing subpattern count = 1 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1792,6 +1805,7 @@ Need char = ')' /\( ( ( (?>[^()]+) | (?R) )* ) \) /Ix Capturing subpattern count = 2 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1803,6 +1817,7 @@ Need char = ')' /\( (123)? ( ( (?>[^()]+) | (?R) )* ) \) /Ix Capturing subpattern count = 3 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1820,6 +1835,7 @@ Need char = ')' /\( ( (123)? ( (?>[^()]+) | (?R) )* ) \) /Ix Capturing subpattern count = 3 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1837,6 +1853,7 @@ Need char = ')' /\( (((((((((( ( (?>[^()]+) | (?R) )* )))))))))) \) /Ix Capturing subpattern count = 11 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1857,6 +1874,7 @@ Need char = ')' /\( ( ( (?>[^()<>]+) | ((?>[^()]+)) | (?R) )* ) \) /Ix Capturing subpattern count = 3 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1869,6 +1887,7 @@ Need char = ')' /\( ( ( (?>[^()]+) | ((?R)) )* ) \) /Ix Capturing subpattern count = 3 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '(' Need char = ')' @@ -1885,7 +1904,7 @@ Need char = ')' /^[[:alnum:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [0-9A-Za-z] Ket @@ -1898,20 +1917,21 @@ No need char /^[[:^alnum:]]/DZ 
------------------------------------------------------------------ - Bra 0 + Bra ^ [\x00-/:-@[-`{-\xff] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored No first char No need char /^[[:alpha:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [A-Za-z] Ket @@ -1924,13 +1944,14 @@ No need char /^[[:^alpha:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\x00-@[-`{-\xff] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored No first char No need char @@ -1945,20 +1966,21 @@ Starting byte set: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z /^[[:ascii:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\x00-\x7f] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored No first char No need char /^[[:^ascii:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\x80-\xff] Ket @@ -1971,7 +1993,7 @@ No need char /^[[:blank:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\x09 ] Ket @@ -1984,19 +2006,21 @@ No need char /^[[:^blank:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\x00-\x08\x0a-\x1f!-\xff] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored No first char No need char /[\n\x0b\x0c\x0d[:blank:]]/IS Capturing subpattern count = 0 +Contains explicit CR or LF match No options No first char No need char @@ -2004,20 +2028,21 @@ Starting byte set: \x09 \x0a \x0b \x0c \x0d \x20 /^[[:cntrl:]]/DZ 
------------------------------------------------------------------ - Bra 0 + Bra ^ [\x00-\x1f\x7f] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored No first char No need char /^[[:digit:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [0-9] Ket @@ -2030,7 +2055,7 @@ No need char /^[[:graph:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [!-~] Ket @@ -2043,7 +2068,7 @@ No need char /^[[:lower:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [a-z] Ket @@ -2056,7 +2081,7 @@ No need char /^[[:print:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [ -~] Ket @@ -2069,7 +2094,7 @@ No need char /^[[:punct:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [!-/:-@[-`{-~] Ket @@ -2082,20 +2107,21 @@ No need char /^[[:space:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\x09-\x0d ] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored No first char No need char /^[[:upper:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [A-Z] Ket @@ -2108,7 +2134,7 @@ No need char /^[[:xdigit:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [0-9A-Fa-f] Ket @@ -2121,7 +2147,7 @@ No need char /^[[:word:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [0-9A-Z_a-z] Ket @@ -2134,7 +2160,7 @@ No need char /^[[:^cntrl:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [ -~\x80-\xff] Ket @@ -2147,33 +2173,35 @@ No need char /^[12[:^digit:]]/DZ 
------------------------------------------------------------------ - Bra 0 + Bra ^ [\x00-/12:-\xff] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored No first char No need char /^[[:^blank:]]/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\x00-\x08\x0a-\x1f!-\xff] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored No first char No need char /[01[:alpha:]%]/DZ ------------------------------------------------------------------ - Bra 0 + Bra [%01A-Za-z] Ket End @@ -2694,7 +2722,7 @@ Need char = '-' /#/IxDZ ------------------------------------------------------------------ - Bra 0 + Bra Ket End ------------------------------------------------------------------ @@ -2705,7 +2733,7 @@ No need char /a#/IxDZ ------------------------------------------------------------------ - Bra 0 + Bra a Ket End @@ -2717,7 +2745,7 @@ No need char /[\s]/DZ ------------------------------------------------------------------ - Bra 0 + Bra [\x09\x0a\x0c\x0d ] Ket End @@ -2729,7 +2757,7 @@ No need char /[\S]/DZ ------------------------------------------------------------------ - Bra 0 + Bra [\x00-\x08\x0b\x0e-\x1f!-\xff] Ket End @@ -2741,7 +2769,7 @@ No need char /a(?i)b/DZ ------------------------------------------------------------------ - Bra 0 + Bra a 01 Opt NC b @@ -2763,8 +2791,8 @@ No match /(a(?i)b)/DZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 a 01 Opt NC b @@ -2790,7 +2818,7 @@ No match / (?i)abc/IxDZ ------------------------------------------------------------------ - Bra 0 + Bra NC abc Ket End @@ -2803,7 +2831,7 @@ Need char = 'c' (caseless) /#this is a comment (?i)abc/IxDZ ------------------------------------------------------------------ - Bra 0 + Bra NC abc Ket End @@ -2815,7 
+2843,7 @@ Need char = 'c' (caseless) /123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/DZ ------------------------------------------------------------------ - Bra 0 + Bra 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 Ket End @@ -2827,7 +2855,7 @@ Need char = '0' /\Q123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890/DZ ------------------------------------------------------------------ - Bra 0 + Bra 123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890 Ket End @@ -2839,7 +2867,7 @@ Need char = '0' /\Q\E/DZ ------------------------------------------------------------------ - Bra 0 + Bra Ket End ------------------------------------------------------------------ @@ -2852,7 +2880,7 @@ No need char /\Q\Ex/DZ ------------------------------------------------------------------ - Bra 0 + Bra x Ket End @@ -2864,7 +2892,7 @@ No need char / \Q\E/DZ ------------------------------------------------------------------ - Bra 0 + Bra Ket End @@ -2876,7 +2904,7 @@ No need char /a\Q\E/DZ 
------------------------------------------------------------------ - Bra 0 + Bra a Ket End @@ -2894,7 +2922,7 @@ No need char /a\Q\Eb/DZ ------------------------------------------------------------------ - Bra 0 + Bra ab Ket End @@ -2908,7 +2936,7 @@ Need char = 'b' /\Q\Eabc/DZ ------------------------------------------------------------------ - Bra 0 + Bra abc Ket End @@ -2920,7 +2948,7 @@ Need char = 'c' /x*+\w/DZ ------------------------------------------------------------------ - Bra 0 + Bra x*+ \w Ket @@ -2938,7 +2966,7 @@ No match /x?+/DZ ------------------------------------------------------------------ - Bra 0 + Bra x?+ Ket End @@ -2950,7 +2978,7 @@ No need char /x++/DZ ------------------------------------------------------------------ - Bra 0 + Bra x++ Ket End @@ -2963,7 +2991,7 @@ No need char /x{1,3}+/DZ ------------------------------------------------------------------ - Bra 0 + Bra Once x x{0,2} @@ -2979,10 +3007,10 @@ No need char /(x)*+/DZ ------------------------------------------------------------------ - Bra 0 + Bra Once Brazero - Bra 1 + CBra 1 x KetRmax Ket @@ -3055,6 +3083,7 @@ Need char = 'b' /([^()]++|\([^()]*\))+/I Capturing subpattern count = 1 Partial matching not supported +Contains explicit CR or LF match No options No first char No need char @@ -3065,6 +3094,7 @@ No need char /\(([^()]++|\([^()]+\))+\)/I Capturing subpattern count = 1 Partial matching not supported +Contains explicit CR or LF match No options First char = '(' Need char = ')' @@ -3081,18 +3111,18 @@ No match /(abc){1,3}+/DZ ------------------------------------------------------------------ - Bra 0 + Bra Once - Bra 1 + CBra 1 abc Ket Brazero - Bra 0 - Bra 1 + Bra + CBra 1 abc Ket Brazero - Bra 1 + CBra 1 abc Ket Ket @@ -3119,7 +3149,7 @@ Failed: nothing to repeat at offset 7 /x(?U)a++b/DZ ------------------------------------------------------------------ - Bra 0 + Bra x a++ b @@ -3136,7 +3166,7 @@ Need char = 'b' /(?U)xa++b/DZ 
------------------------------------------------------------------ - Bra 0 + Bra x a++ b @@ -3153,19 +3183,19 @@ Need char = 'b' /^((a+)(?U)([ab]+)(?-U)([bc]+)(\w*))/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ - Bra 1 - Bra 2 + CBra 1 + CBra 2 a+ Ket - Bra 3 + CBra 3 [ab]+? Ket - Bra 4 + CBra 4 [bc]+ Ket - Bra 5 + CBra 5 \w* Ket Ket @@ -3180,7 +3210,7 @@ No need char /^x(?U)a+b/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ x a++ @@ -3196,10 +3226,10 @@ Need char = 'b' /^x(?U)(a+)b/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ x - Bra 1 + CBra 1 a+? Ket b @@ -3247,7 +3277,7 @@ Failed: missing terminating ] for character class at offset 10 /[\s]/IDZ ------------------------------------------------------------------ - Bra 0 + Bra [\x09\x0a\x0c\x0d ] Ket End @@ -3259,24 +3289,26 @@ No need char /[[:space:]]/IDZ ------------------------------------------------------------------ - Bra 0 + Bra [\x09-\x0d ] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match No options No first char No need char /[[:space:]abcde]/IDZ ------------------------------------------------------------------ - Bra 0 + Bra [\x09-\x0d a-e] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match No options No first char No need char @@ -3284,6 +3316,7 @@ No need char /< (?: (?(R) \d++ | [^<>]*+) | (?R)) * >/Ix Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '<' Need char = '>' @@ -3306,7 +3339,7 @@ No match 
|8J\$WE\<\.rX\+ix\[d1b\!H\#\?vV0vrK\:ZH1\=2M\>iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\<EjmhUZ\?\.akp2dF\>qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|IDZ ------------------------------------------------------------------ - Bra 0 + Bra 8J$WE<.rX+ix[d1b!H#?vV0vrK:ZH1=2M>iV;?aPhFB<*vW@QW@sO9}cfZA-i'w%hKd6gt1UJP,15_#QY$M^Mss_U/]&LK9[5vQub^w[KDD<EjmhUZ?.akp2dF>qmj;2}YWFdYx.Ap]hjCPTP(n28k+3;o&WXqs/gOXdr$:r'do0;b4c(f_Gr="\4)[01T7ajQJvL$W~mL_sS/4h:x*[ZN=KLs&L5zX//>it,o:aU(;Z>pW&T7oP'2K^E:x9'c[%z-,64JQ5AeH_G#KijUKghQw^\vea3a?kka_G$8#`*kynsxzBLru']k_[7FrVx}^=$blx>s-N%j;D*aZDnsw:YKZ%Q.Kne9#hP?+b3(SOvL,^;&u5@?5C5Bhb=m-vEh_L15Jl]U)0RP6{q%L^_z5E'Dw6X \b Ket @@ -3319,7 +3352,7 @@ Need char = 'X' |\$\<\.X\+ix\[d1b\!H\#\?vV0vrK\:ZH1\=2M\>iV\;\?aPhFB\<\*vW\@QW\@sO9\}cfZA\-i\'w\%hKd6gt1UJP\,15_\#QY\$M\^Mss_U\/\]\&LK9\[5vQub\^w\[KDD\<EjmhUZ\?\.akp2dF\>qmj\;2\}YWFdYx\.Ap\]hjCPTP\(n28k\+3\;o\&WXqs\/gOXdr\$\:r\'do0\;b4c\(f_Gr\=\"\\4\)\[01T7ajQJvL\$W\~mL_sS\/4h\:x\*\[ZN\=KLs\&L5zX\/\/\>it\,o\:aU\(\;Z\>pW\&T7oP\'2K\^E\:x9\'c\[\%z\-\,64JQ5AeH_G\#KijUKghQw\^\\vea3a\?kka_G\$8\#\`\*kynsxzBLru\'\]k_\[7FrVx\}\^\=\$blx\>s\-N\%j\;D\*aZDnsw\:YKZ\%Q\.Kne9\#hP\?\+b3\(SOvL\,\^\;\&u5\@\?5C5Bhb\=m\-vEh_L15Jl\]U\)0RP6\{q\%L\^_z5E\'Dw6X\b|IDZ ------------------------------------------------------------------ - Bra 0 + Bra 
$<.X+ix[d1b!H#?vV0vrK:ZH1=2M>iV;?aPhFB<*vW@QW@sO9}cfZA-i'w%hKd6gt1UJP,15_#QY$M^Mss_U/]&LK9[5vQub^w[KDD<EjmhUZ?.akp2dF>qmj;2}YWFdYx.Ap]hjCPTP(n28k+3;o&WXqs/gOXdr$:r'do0;b4c(f_Gr="\4)[01T7ajQJvL$W~mL_sS/4h:x*[ZN=KLs&L5zX//>it,o:aU(;Z>pW&T7oP'2K^E:x9'c[%z-,64JQ5AeH_G#KijUKghQw^\vea3a?kka_G$8#`*kynsxzBLru']k_[7FrVx}^=$blx>s-N%j;D*aZDnsw:YKZ%Q.Kne9#hP?+b3(SOvL,^;&u5@?5C5Bhb=m-vEh_L15Jl]U)0RP6{q%L^_z5E'Dw6X \b Ket @@ -3498,6 +3531,7 @@ Starting byte set: a b /[^a]/I Capturing subpattern count = 0 +Contains explicit CR or LF match No options No first char No need char @@ -3957,6 +3991,7 @@ Failed: recursive call could loop indefinitely at offset 16 /^([^()]|\((?1)*\))*$/I Capturing subpattern count = 1 +Contains explicit CR or LF match Options: anchored No first char No need char @@ -3976,6 +4011,7 @@ No match /^>abc>([^()]|\((?1)*\))*<xyz<$/I Capturing subpattern count = 1 +Contains explicit CR or LF match Options: anchored No first char Need char = '<' @@ -3991,8 +4027,8 @@ Need char = '<' /(a(?1)b)/DZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 a Once Recurse @@ -4009,8 +4045,8 @@ Need char = 'b' /(a(?1)+b)/DZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 a Once Recurse @@ -4103,6 +4139,7 @@ No match /((< (?: (?(R) \d++ | [^<>]*+) | (?2)) * >))/Ix Capturing subpattern count = 2 Partial matching not supported +Contains explicit CR or LF match Options: extended First char = '<' Need char = '>' @@ -4185,15 +4222,15 @@ No need char /a(?P<name1>b|c)d(?P<longername2>e)/DZ ------------------------------------------------------------------ - Bra 0 + Bra a - Bra 1 + CBra 1 b Alt c Ket d - Bra 2 + CBra 2 e Ket Ket @@ -4217,17 +4254,17 @@ Need char = 'e' /(?:a(?P<c>c(?P<d>d)))(?P<a>a)/DZ ------------------------------------------------------------------ - Bra 0 - Bra 0 + Bra + Bra a - Bra 1 + CBra 1 c - Bra 2 + CBra 2 d Ket Ket Ket - Bra 3 + CBra 3 a Ket Ket @@ -4244,8 
+4281,8 @@ Need char = 'a' /(?P<a>a)...(?P=a)bbb(?P>a)d/DZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 a Ket Any @@ -4407,11 +4444,11 @@ No need char /(a)(bc)/INDZ ------------------------------------------------------------------ - Bra 0 - Bra 0 + Bra + Bra a Ket - Bra 0 + Bra bc Ket Ket @@ -4426,11 +4463,11 @@ Need char = 'c' /(?P<one>a)(bc)/INDZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 a Ket - Bra 0 + Bra bc Ket Ket @@ -4448,11 +4485,11 @@ Need char = 'c' /(a)(?P<named>bc)/INDZ ------------------------------------------------------------------ - Bra 0 - Bra 0 + Bra + Bra a Ket - Bra 1 + CBra 1 bc Ket Ket @@ -4541,10 +4578,10 @@ copy substring three failed -7 /(?P<Tes>)(?P<Test>)/DZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 Ket - Bra 2 + CBra 2 Ket Ket End @@ -4559,10 +4596,10 @@ No need char /(?P<Test>)(?P<Tes>)/DZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 Ket - Bra 2 + CBra 2 Ket Ket End @@ -4636,11 +4673,11 @@ Need char = ']' /(a(b(?2)c))?/DZ ------------------------------------------------------------------ - Bra 0 + Bra Brazero - Bra 1 + CBra 1 a - Bra 2 + CBra 2 b Once Recurse @@ -4658,11 +4695,11 @@ No need char /(a(b(?2)c))*/DZ ------------------------------------------------------------------ - Bra 0 + Bra Brazero - Bra 1 + CBra 1 a - Bra 2 + CBra 2 b Once Recurse @@ -4680,12 +4717,12 @@ No need char /(a(b(?2)c)){0,2}/DZ ------------------------------------------------------------------ - Bra 0 + Bra Brazero - Bra 0 - Bra 1 + Bra + CBra 1 a - Bra 2 + CBra 2 b Once Recurse @@ -4694,9 +4731,9 @@ No need char Ket Ket Brazero - Bra 1 + CBra 1 a - Bra 2 + CBra 2 b Once Recurse @@ -4715,7 +4752,7 @@ No need char /[ab]{1}+/DZ ------------------------------------------------------------------ - Bra 0 + Bra Once [ab]{1,1} Ket @@ -4750,7 +4787,7 
@@ Study returned NULL /a*.*b/ISDZ ------------------------------------------------------------------ - Bra 0 + Bra a* Any* b @@ -4766,9 +4803,9 @@ Study returned NULL /(a|b)*.?c/ISDZ ------------------------------------------------------------------ - Bra 0 + Bra Brazero - Bra 1 + CBra 1 a Alt b @@ -4786,7 +4823,7 @@ Study returned NULL /abc(?C255)de(?C)f/DZ ------------------------------------------------------------------ - Bra 0 + Bra abc Callout 255 10 1 de @@ -4802,7 +4839,7 @@ Need char = 'f' /abcde/ICDZ ------------------------------------------------------------------ - Bra 0 + Bra Callout 255 0 1 a Callout 255 1 1 @@ -4841,7 +4878,7 @@ No match /a*b/ICDZ ------------------------------------------------------------------ - Bra 0 + Bra Callout 255 0 2 a*+ Callout 255 2 1 @@ -4886,7 +4923,7 @@ Need char = 'b' /a+b/ICDZ ------------------------------------------------------------------ - Bra 0 + Bra Callout 255 0 2 a++ Callout 255 2 1 @@ -4926,9 +4963,9 @@ No match /(abc|def)x/ICDZ ------------------------------------------------------------------ - Bra 0 + Bra Callout 255 0 9 - Bra 1 + CBra 1 Callout 255 1 1 a Callout 255 2 1 @@ -5080,9 +5117,9 @@ No need char /([ab]{,4}c|xy)/ICDZ ------------------------------------------------------------------ - Bra 0 + Bra Callout 255 0 14 - Bra 1 + CBra 1 Callout 255 1 4 [ab] Callout 255 5 1 @@ -5255,9 +5292,9 @@ No match /([ab]{1,4}c|xy){4,5}?123/ICDZ ------------------------------------------------------------------ - Bra 0 + Bra Callout 255 0 21 - Bra 1 + CBra 1 Callout 255 1 9 [ab]{1,4} Callout 255 10 1 @@ -5270,7 +5307,7 @@ No match y Callout 255 14 0 Ket - Bra 1 + CBra 1 Callout 255 1 9 [ab]{1,4} Callout 255 10 1 @@ -5283,7 +5320,7 @@ No match y Callout 255 14 0 Ket - Bra 1 + CBra 1 Callout 255 1 9 [ab]{1,4} Callout 255 10 1 @@ -5296,7 +5333,7 @@ No match y Callout 255 14 0 Ket - Bra 1 + CBra 1 Callout 255 1 9 [ab]{1,4} Callout 255 10 1 @@ -5310,7 +5347,7 @@ No match Callout 255 14 0 Ket Braminzero - Bra 1 + CBra 
1 Callout 255 1 9 [ab]{1,4} Callout 255 10 1 @@ -5631,6 +5668,7 @@ No need char /line\nbreak/I Capturing subpattern count = 0 +Contains explicit CR or LF match No options First char = 'l' Need char = 'k' @@ -5641,6 +5679,7 @@ Need char = 'k' /line\nbreak/If Capturing subpattern count = 0 +Contains explicit CR or LF match Options: firstline First char = 'l' Need char = 'k' @@ -5653,6 +5692,7 @@ No match /line\nbreak/Imf Capturing subpattern count = 0 +Contains explicit CR or LF match Options: multiline firstline First char = 'l' Need char = 'k' @@ -5918,6 +5958,7 @@ Matched, but too many substrings /[^()]*(?:\((?R)\)[^()]*)*/I Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match No options No first char No need char @@ -5931,6 +5972,7 @@ No need char /[^()]*(?:\((?>(?R))\)[^()]*)*/I Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match No options No first char No need char @@ -5942,6 +5984,7 @@ No need char /[^()]*(?:\((?R)\))*[^()]*/I Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match No options No first char No need char @@ -5953,6 +5996,7 @@ No need char /(?:\((?R)\))*[^()]*/I Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match No options No first char No need char @@ -5966,6 +6010,7 @@ No need char /(?:\((?R)\))|[^()]*/I Capturing subpattern count = 0 Partial matching not supported +Contains explicit CR or LF match No options No first char No need char @@ -6664,7 +6709,7 @@ Starting byte set: a b c d /^a*b\d/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ a*+ b @@ -6680,7 +6725,7 @@ Need char = 'b' /^a*+b\d/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ a*+ b @@ -6696,7 +6741,7 @@ Need char = 'b' /^a*?b\d/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ a*+ b @@ -6712,7 +6757,7 
@@ Need char = 'b' /^a+A\d/DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ a++ A @@ -6734,7 +6779,7 @@ No match /^a*A\d/IiDZ ------------------------------------------------------------------ - Bra 0 + Bra ^ a* NC A @@ -6816,7 +6861,7 @@ Matched, but too many substrings /a*\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra a*+ \d Ket @@ -6825,7 +6870,7 @@ Matched, but too many substrings /a*\D/BZ ------------------------------------------------------------------ - Bra 0 + Bra a* \D Ket @@ -6834,7 +6879,7 @@ Matched, but too many substrings /0*\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra 0* \d Ket @@ -6843,7 +6888,7 @@ Matched, but too many substrings /0*\D/BZ ------------------------------------------------------------------ - Bra 0 + Bra 0*+ \D Ket @@ -6852,7 +6897,7 @@ Matched, but too many substrings /a*\s/BZ ------------------------------------------------------------------ - Bra 0 + Bra a*+ \s Ket @@ -6861,7 +6906,7 @@ Matched, but too many substrings /a*\S/BZ ------------------------------------------------------------------ - Bra 0 + Bra a* \S Ket @@ -6870,7 +6915,7 @@ Matched, but too many substrings / *\s/BZ ------------------------------------------------------------------ - Bra 0 + Bra * \s Ket @@ -6879,7 +6924,7 @@ Matched, but too many substrings / *\S/BZ ------------------------------------------------------------------ - Bra 0 + Bra *+ \S Ket @@ -6888,7 +6933,7 @@ Matched, but too many substrings /a*\w/BZ ------------------------------------------------------------------ - Bra 0 + Bra a* \w Ket @@ -6897,7 +6942,7 @@ Matched, but too many substrings /a*\W/BZ ------------------------------------------------------------------ - Bra 0 + Bra a*+ \W Ket @@ -6906,7 +6951,7 @@ Matched, but too many substrings /=*\w/BZ ------------------------------------------------------------------ - Bra 0 + Bra =*+ \w Ket @@ -6915,7 +6960,7 @@ Matched, but 
too many substrings /=*\W/BZ ------------------------------------------------------------------ - Bra 0 + Bra =* \W Ket @@ -6924,7 +6969,7 @@ Matched, but too many substrings /\d*a/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d*+ a Ket @@ -6933,7 +6978,7 @@ Matched, but too many substrings /\d*2/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d* 2 Ket @@ -6942,7 +6987,7 @@ Matched, but too many substrings /\d*\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d* \d Ket @@ -6951,7 +6996,7 @@ Matched, but too many substrings /\d*\D/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d*+ \D Ket @@ -6960,7 +7005,7 @@ Matched, but too many substrings /\d*\s/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d*+ \s Ket @@ -6969,7 +7014,7 @@ Matched, but too many substrings /\d*\S/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d* \S Ket @@ -6978,7 +7023,7 @@ Matched, but too many substrings /\d*\w/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d* \w Ket @@ -6987,7 +7032,7 @@ Matched, but too many substrings /\d*\W/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d*+ \W Ket @@ -6996,7 +7041,7 @@ Matched, but too many substrings /\D*a/BZ ------------------------------------------------------------------ - Bra 0 + Bra \D* a Ket @@ -7005,7 +7050,7 @@ Matched, but too many substrings /\D*2/BZ ------------------------------------------------------------------ - Bra 0 + Bra \D*+ 2 Ket @@ -7014,7 +7059,7 @@ Matched, but too many substrings /\D*\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra \D*+ \d Ket @@ -7023,7 +7068,7 @@ Matched, but too many substrings /\D*\D/BZ ------------------------------------------------------------------ - Bra 0 + Bra \D* \D 
Ket @@ -7032,7 +7077,7 @@ Matched, but too many substrings /\D*\s/BZ ------------------------------------------------------------------ - Bra 0 + Bra \D* \s Ket @@ -7041,7 +7086,7 @@ Matched, but too many substrings /\D*\S/BZ ------------------------------------------------------------------ - Bra 0 + Bra \D* \S Ket @@ -7050,7 +7095,7 @@ Matched, but too many substrings /\D*\w/BZ ------------------------------------------------------------------ - Bra 0 + Bra \D* \w Ket @@ -7059,7 +7104,7 @@ Matched, but too many substrings /\D*\W/BZ ------------------------------------------------------------------ - Bra 0 + Bra \D* \W Ket @@ -7068,7 +7113,7 @@ Matched, but too many substrings /\s*a/BZ ------------------------------------------------------------------ - Bra 0 + Bra \s*+ a Ket @@ -7077,7 +7122,7 @@ Matched, but too many substrings /\s*2/BZ ------------------------------------------------------------------ - Bra 0 + Bra \s*+ 2 Ket @@ -7086,7 +7131,7 @@ Matched, but too many substrings /\s*\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra \s*+ \d Ket @@ -7095,7 +7140,7 @@ Matched, but too many substrings /\s*\D/BZ ------------------------------------------------------------------ - Bra 0 + Bra \s* \D Ket @@ -7104,7 +7149,7 @@ Matched, but too many substrings /\s*\s/BZ ------------------------------------------------------------------ - Bra 0 + Bra \s* \s Ket @@ -7113,7 +7158,7 @@ Matched, but too many substrings /\s*\S/BZ ------------------------------------------------------------------ - Bra 0 + Bra \s*+ \S Ket @@ -7122,7 +7167,7 @@ Matched, but too many substrings /\s*\w/BZ ------------------------------------------------------------------ - Bra 0 + Bra \s*+ \w Ket @@ -7131,7 +7176,7 @@ Matched, but too many substrings /\s*\W/BZ ------------------------------------------------------------------ - Bra 0 + Bra \s* \W Ket @@ -7140,7 +7185,7 @@ Matched, but too many substrings /\S*a/BZ 
------------------------------------------------------------------ - Bra 0 + Bra \S* a Ket @@ -7149,7 +7194,7 @@ Matched, but too many substrings /\S*2/BZ ------------------------------------------------------------------ - Bra 0 + Bra \S* 2 Ket @@ -7158,7 +7203,7 @@ Matched, but too many substrings /\S*\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra \S* \d Ket @@ -7167,7 +7212,7 @@ Matched, but too many substrings /\S*\D/BZ ------------------------------------------------------------------ - Bra 0 + Bra \S* \D Ket @@ -7176,7 +7221,7 @@ Matched, but too many substrings /\S*\s/BZ ------------------------------------------------------------------ - Bra 0 + Bra \S*+ \s Ket @@ -7185,7 +7230,7 @@ Matched, but too many substrings /\S*\S/BZ ------------------------------------------------------------------ - Bra 0 + Bra \S* \S Ket @@ -7194,7 +7239,7 @@ Matched, but too many substrings /\S*\w/BZ ------------------------------------------------------------------ - Bra 0 + Bra \S* \w Ket @@ -7203,7 +7248,7 @@ Matched, but too many substrings /\S*\W/BZ ------------------------------------------------------------------ - Bra 0 + Bra \S* \W Ket @@ -7212,7 +7257,7 @@ Matched, but too many substrings /\w*a/BZ ------------------------------------------------------------------ - Bra 0 + Bra \w* a Ket @@ -7221,7 +7266,7 @@ Matched, but too many substrings /\w*2/BZ ------------------------------------------------------------------ - Bra 0 + Bra \w* 2 Ket @@ -7230,7 +7275,7 @@ Matched, but too many substrings /\w*\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra \w* \d Ket @@ -7239,7 +7284,7 @@ Matched, but too many substrings /\w*\D/BZ ------------------------------------------------------------------ - Bra 0 + Bra \w* \D Ket @@ -7248,7 +7293,7 @@ Matched, but too many substrings /\w*\s/BZ ------------------------------------------------------------------ - Bra 0 + Bra \w*+ \s Ket @@ -7257,7 +7302,7 @@ 
Matched, but too many substrings /\w*\S/BZ ------------------------------------------------------------------ - Bra 0 + Bra \w* \S Ket @@ -7266,7 +7311,7 @@ Matched, but too many substrings /\w*\w/BZ ------------------------------------------------------------------ - Bra 0 + Bra \w* \w Ket @@ -7275,7 +7320,7 @@ Matched, but too many substrings /\w*\W/BZ ------------------------------------------------------------------ - Bra 0 + Bra \w*+ \W Ket @@ -7284,7 +7329,7 @@ Matched, but too many substrings /\W*a/BZ ------------------------------------------------------------------ - Bra 0 + Bra \W*+ a Ket @@ -7293,7 +7338,7 @@ Matched, but too many substrings /\W*2/BZ ------------------------------------------------------------------ - Bra 0 + Bra \W*+ 2 Ket @@ -7302,7 +7347,7 @@ Matched, but too many substrings /\W*\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra \W*+ \d Ket @@ -7311,7 +7356,7 @@ Matched, but too many substrings /\W*\D/BZ ------------------------------------------------------------------ - Bra 0 + Bra \W* \D Ket @@ -7320,7 +7365,7 @@ Matched, but too many substrings /\W*\s/BZ ------------------------------------------------------------------ - Bra 0 + Bra \W* \s Ket @@ -7329,7 +7374,7 @@ Matched, but too many substrings /\W*\S/BZ ------------------------------------------------------------------ - Bra 0 + Bra \W* \S Ket @@ -7338,7 +7383,7 @@ Matched, but too many substrings /\W*\w/BZ ------------------------------------------------------------------ - Bra 0 + Bra \W*+ \w Ket @@ -7347,7 +7392,7 @@ Matched, but too many substrings /\W*\W/BZ ------------------------------------------------------------------ - Bra 0 + Bra \W* \W Ket @@ -7356,7 +7401,7 @@ Matched, but too many substrings /[^a]+a/BZ ------------------------------------------------------------------ - Bra 0 + Bra [^a]++ a Ket @@ -7365,7 +7410,7 @@ Matched, but too many substrings /[^a]+a/BZi ------------------------------------------------------------------ 
- Bra 0 + Bra [^A]++ NC a Ket @@ -7374,7 +7419,7 @@ Matched, but too many substrings /[^a]+A/BZi ------------------------------------------------------------------ - Bra 0 + Bra [^A]++ NC A Ket @@ -7383,7 +7428,7 @@ Matched, but too many substrings /[^a]+b/BZ ------------------------------------------------------------------ - Bra 0 + Bra [^a]+ b Ket @@ -7392,7 +7437,7 @@ Matched, but too many substrings /[^a]+\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra [^a]+ \d Ket @@ -7401,7 +7446,7 @@ Matched, but too many substrings /a*[^a]/BZ ------------------------------------------------------------------ - Bra 0 + Bra a* [^a] Ket @@ -7542,7 +7587,7 @@ No match /^[\E\Qa\E-\Qz\E]+/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [a-z]+ Ket @@ -7551,7 +7596,7 @@ No match /^[a\Q]bc\E]/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\]a-c] Ket @@ -7560,7 +7605,7 @@ No match /^[a-\Q\E]/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\-a] Ket @@ -7569,13 +7614,13 @@ No match /^(?P>abc)[()](?<abc>)/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ Once Recurse Ket [()] - Bra 1 + CBra 1 Ket Ket End @@ -7583,15 +7628,15 @@ No match /^((?(abc)y)[()](?P<abc>x))+/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ - Bra 1 + CBra 1 Cond 2 Cond ref y Ket [()] - Bra 2 + CBra 2 x Ket KetRmax @@ -7605,13 +7650,13 @@ No match /^(?P>abc)\Q()\E(?<abc>)/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ Once Recurse Ket () - Bra 1 + CBra 1 Ket Ket End @@ -7619,13 +7664,13 @@ No match /^(?P>abc)[a\Q(]\E(](?<abc>)/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ Once Recurse Ket [(\]a] - Bra 1 + CBra 1 Ket Ket End @@ -7634,12 +7679,12 @@ No match /^(?P>abc) # this is (a comment) (?<abc>)/BZx 
------------------------------------------------------------------ - Bra 0 + Bra ^ Once Recurse Ket - Bra 1 + CBra 1 Ket Ket End @@ -8064,16 +8109,16 @@ No match 2: b /^(a)\g-2/ -Failed: reference to non-existent subpattern at offset 4 +Failed: reference to non-existent subpattern at offset 7 /^(a)\g/ -Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 4 +Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 5 /^(a)\g{0}/ -Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 4 +Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 7 /^(a)\g{3/ -Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 4 +Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 8 /^(a)\g{4a}/ Failed: reference to non-existent subpattern at offset 9 @@ -8172,8 +8217,8 @@ No match /(ab|c)(?-1)/BZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 ab Alt c @@ -8190,12 +8235,12 @@ No match /xy(?+1)(abc)/BZ ------------------------------------------------------------------ - Bra 0 + Bra xy Once Recurse Ket - Bra 1 + CBra 1 abc Ket Ket @@ -8223,10 +8268,10 @@ Failed: reference to non-existent subpattern at offset 5 /^(abc)?(?(-1)X|Y)/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ Brazero - Bra 1 + CBra 1 abc Ket Cond @@ -8250,16 +8295,16 @@ No match /^((?(+1)X|Y)(abc))+/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ - Bra 1 + CBra 1 Cond 2 Cond ref X Alt Y Ket - Bra 2 + CBra 2 abc Ket KetRmax @@ -8284,8 +8329,8 @@ Failed: reference to non-existent subpattern at offset 6 /((?(-1)a))/BZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 Cond 1 Cond ref a @@ -8300,7 +8345,7 @@ Failed: 
reference to non-existent subpattern at offset 7 /^(?(+1)X|Y)/BZ ------------------------------------------------------------------ - Bra 0 + Bra ^ Cond 1 Cond ref @@ -8359,13 +8404,13 @@ Failed: syntax error in subpattern name (missing terminator) at offset 4 /(?|(abc)|(xyz))/BZ ------------------------------------------------------------------ - Bra 0 - Bra 0 - Bra 1 + Bra + Bra + CBra 1 abc Ket Alt - Bra 1 + CBra 1 xyz Ket Ket @@ -8381,20 +8426,20 @@ Failed: syntax error in subpattern name (missing terminator) at offset 4 /(x)(?|(abc)|(xyz))(x)/BZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 x Ket - Bra 0 - Bra 2 + Bra + CBra 2 abc Ket Alt - Bra 2 + CBra 2 xyz Ket Ket - Bra 3 + CBra 3 x Ket Ket @@ -8413,23 +8458,23 @@ Failed: syntax error in subpattern name (missing terminator) at offset 4 /(x)(?|(abc)(pqr)|(xyz))(x)/BZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 x Ket - Bra 0 - Bra 2 + Bra + CBra 2 abc Ket - Bra 3 + CBra 3 pqr Ket Alt - Bra 2 + CBra 2 xyz Ket Ket - Bra 4 + CBra 4 x Ket Ket @@ -8526,7 +8571,7 @@ No match /[\h]/BZ ------------------------------------------------------------------ - Bra 0 + Bra [\x09 \xa0] Ket End @@ -8536,7 +8581,7 @@ No match /[\h]+/BZ ------------------------------------------------------------------ - Bra 0 + Bra [\x09 \xa0]+ Ket End @@ -8546,7 +8591,7 @@ No match /[\v]/BZ ------------------------------------------------------------------ - Bra 0 + Bra [\x0a-\x0d\x85] Ket End @@ -8554,7 +8599,7 @@ No match /[\H]/BZ ------------------------------------------------------------------ - Bra 0 + Bra [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff] Ket End @@ -8562,7 +8607,7 @@ No match /[^\h]/BZ ------------------------------------------------------------------ - Bra 0 + Bra [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff] (neg) Ket End @@ -8570,7 +8615,7 @@ No match /[\V]/BZ ------------------------------------------------------------------ - Bra 0 + 
Bra [\x00-\x09\x0e-\x84\x86-\xff] Ket End @@ -8578,7 +8623,7 @@ No match /[\x0a\V]/BZ ------------------------------------------------------------------ - Bra 0 + Bra [\x00-\x0a\x0e-\x84\x86-\xff] Ket End @@ -8586,7 +8631,7 @@ No match /\H++X/BZ ------------------------------------------------------------------ - Bra 0 + Bra \H++ X Ket @@ -8599,7 +8644,7 @@ No match /\H+\hY/BZ ------------------------------------------------------------------ - Bra 0 + Bra \H++ \h Y @@ -8611,7 +8656,7 @@ No match /\H+ Y/BZ ------------------------------------------------------------------ - Bra 0 + Bra \H++ Y Ket @@ -8620,7 +8665,7 @@ No match /\h+A/BZ ------------------------------------------------------------------ - Bra 0 + Bra \h++ A Ket @@ -8629,7 +8674,7 @@ No match /\v*B/BZ ------------------------------------------------------------------ - Bra 0 + Bra \v*+ B Ket @@ -8638,7 +8683,7 @@ No match /\V+\x0a/BZ ------------------------------------------------------------------ - Bra 0 + Bra \V++ \x0a Ket @@ -8647,7 +8692,7 @@ No match /A+\h/BZ ------------------------------------------------------------------ - Bra 0 + Bra A++ \h Ket @@ -8656,7 +8701,7 @@ No match / *\H/BZ ------------------------------------------------------------------ - Bra 0 + Bra *+ \H Ket @@ -8665,7 +8710,7 @@ No match /A*\v/BZ ------------------------------------------------------------------ - Bra 0 + Bra A*+ \v Ket @@ -8674,7 +8719,7 @@ No match /\x0b*\V/BZ ------------------------------------------------------------------ - Bra 0 + Bra \x0b*+ \V Ket @@ -8683,7 +8728,7 @@ No match /\d+\h/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d++ \h Ket @@ -8692,7 +8737,7 @@ No match /\d*\v/BZ ------------------------------------------------------------------ - Bra 0 + Bra \d*+ \v Ket @@ -8701,7 +8746,7 @@ No match /S+\h\S+\v/BZ ------------------------------------------------------------------ - Bra 0 + Bra S++ \h \S++ @@ -8712,7 +8757,7 @@ No match /\w{3,}\h\w+\v/BZ 
------------------------------------------------------------------ - Bra 0 + Bra \w{3} \w*+ \h @@ -8724,7 +8769,7 @@ No match /\h+\d\h+\w\h+\S\h+\H/BZ ------------------------------------------------------------------ - Bra 0 + Bra \h++ \d \h++ @@ -8739,7 +8784,7 @@ No match /\v+\d\v+\w\v+\S\v+\V/BZ ------------------------------------------------------------------ - Bra 0 + Bra \v++ \d \v++ @@ -8754,7 +8799,7 @@ No match /\H+\h\H+\d/BZ ------------------------------------------------------------------ - Bra 0 + Bra \H++ \h \H+ @@ -8765,7 +8810,7 @@ No match /\V+\v\V+\w/BZ ------------------------------------------------------------------ - Bra 0 + Bra \V++ \v \V+ @@ -8774,4 +8819,353 @@ No match End ------------------------------------------------------------------ +/\( (?: [^()]* | (?R) )* \)/x +(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(
0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(00)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0
)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0) + 0: (0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(
0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(0(00)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0
)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0)0) + +/[\E]AAA/ +Failed: missing terminating ] for character class at offset 7 + +/[\Q\E]AAA/ +Failed: missing terminating ] for character class at offset 9 + +/[^\E]AAA/ +Failed: missing terminating ] for character class at offset 8 + +/[^\Q\E]AAA/ +Failed: missing terminating ] for character class at offset 10 + +/[\E^]AAA/ +Failed: missing terminating ] for character class at offset 8 + +/[\Q\E^]AAA/ +Failed: missing terminating ] for character class at offset 10 + +/A(*PRUNE)B(*SKIP)C(*THEN)D(*COMMIT)E(*F)F(*FAIL)G(?!)H(*ACCEPT)I/BZ +------------------------------------------------------------------ + Bra + A + *PRUNE + B + *SKIP + C + *THEN + D + *COMMIT + E + *FAIL + F + *FAIL + G + *FAIL + H + *ACCEPT + I + Ket + End +------------------------------------------------------------------ + +/^a+(*FAIL)/ + aaaaaa +No match + +/a+b?c+(*FAIL)/ + aaabccc +No match + +/a+b?(*PRUNE)c+(*FAIL)/ + aaabccc +No match + +/a+b?(*COMMIT)c+(*FAIL)/ + aaabccc +No match + +/a+b?(*SKIP)c+(*FAIL)/ + aaabcccaaabccc +No match + +/^(?:aaa(*THEN)\w{6}|bbb(*THEN)\w{5}|ccc(*THEN)\w{4}|\w{3})/ + aaaxxxxxx + 0: aaaxxxxxx + aaa++++++ + 0: aaa + bbbxxxxx + 0: bbbxxxxx + 
bbb+++++ + 0: bbb + cccxxxx + 0: cccxxxx + ccc++++ + 0: ccc + dddddddd + 0: ddd + +/^(aaa(*THEN)\w{6}|bbb(*THEN)\w{5}|ccc(*THEN)\w{4}|\w{3})/ + aaaxxxxxx + 0: aaaxxxxxx + 1: aaaxxxxxx + aaa++++++ + 0: aaa + 1: aaa + bbbxxxxx + 0: bbbxxxxx + 1: bbbxxxxx + bbb+++++ + 0: bbb + 1: bbb + cccxxxx + 0: cccxxxx + 1: cccxxxx + ccc++++ + 0: ccc + 1: ccc + dddddddd + 0: ddd + 1: ddd + +/a+b?(*THEN)c+(*FAIL)/ + aaabccc +No match + +/(A (A|B(*ACCEPT)|C) D)(E)/x + ABX + 0: AB + AADE + 0: AADE + 1: AAD + 2: A + 3: E + ACDE + 0: ACDE + 1: ACD + 2: C + 3: E + ** Failers +No match + AD +No match + +/^a+(*FAIL)/C + aaaaaa +--->aaaaaa + +0 ^ ^ + +1 ^ a+ + +3 ^ ^ (*FAIL) + +3 ^ ^ (*FAIL) + +3 ^ ^ (*FAIL) + +3 ^ ^ (*FAIL) + +3 ^ ^ (*FAIL) + +3 ^^ (*FAIL) +No match + +/a+b?c+(*FAIL)/C + aaabccc +--->aaabccc + +0 ^ a+ + +2 ^ ^ b? + +4 ^ ^ c+ + +6 ^ ^ (*FAIL) + +6 ^ ^ (*FAIL) + +6 ^ ^ (*FAIL) + +4 ^ ^ c+ + +2 ^ ^ b? + +4 ^ ^ c+ + +2 ^^ b? + +4 ^^ c+ + +0 ^ a+ + +2 ^ ^ b? + +4 ^ ^ c+ + +6 ^ ^ (*FAIL) + +6 ^ ^ (*FAIL) + +6 ^ ^ (*FAIL) + +4 ^ ^ c+ + +2 ^^ b? + +4 ^^ c+ + +0 ^ a+ + +2 ^^ b? + +4 ^ ^ c+ + +6 ^ ^ (*FAIL) + +6 ^ ^ (*FAIL) + +6 ^ ^ (*FAIL) + +4 ^^ c+ +No match + +/a+b?(*PRUNE)c+(*FAIL)/C + aaabccc +--->aaabccc + +0 ^ a+ + +2 ^ ^ b? + +4 ^ ^ (*PRUNE) ++12 ^ ^ c+ ++14 ^ ^ (*FAIL) ++14 ^ ^ (*FAIL) ++14 ^ ^ (*FAIL) + +0 ^ a+ + +2 ^ ^ b? + +4 ^ ^ (*PRUNE) ++12 ^ ^ c+ ++14 ^ ^ (*FAIL) ++14 ^ ^ (*FAIL) ++14 ^ ^ (*FAIL) + +0 ^ a+ + +2 ^^ b? + +4 ^ ^ (*PRUNE) ++12 ^ ^ c+ ++14 ^ ^ (*FAIL) ++14 ^ ^ (*FAIL) ++14 ^ ^ (*FAIL) +No match + +/a+b?(*COMMIT)c+(*FAIL)/C + aaabccc +--->aaabccc + +0 ^ a+ + +2 ^ ^ b? + +4 ^ ^ (*COMMIT) ++13 ^ ^ c+ ++15 ^ ^ (*FAIL) ++15 ^ ^ (*FAIL) ++15 ^ ^ (*FAIL) +No match + +/a+b?(*SKIP)c+(*FAIL)/C + aaabcccaaabccc +--->aaabcccaaabccc + +0 ^ a+ + +2 ^ ^ b? + +4 ^ ^ (*SKIP) ++11 ^ ^ c+ ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) + +0 ^ a+ + +2 ^ ^ b? 
+ +4 ^ ^ (*SKIP) ++11 ^ ^ c+ ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) +No match + +/a+b?(*THEN)c+(*FAIL)/C + aaabccc +--->aaabccc + +0 ^ a+ + +2 ^ ^ b? + +4 ^ ^ (*THEN) ++11 ^ ^ c+ ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) + +0 ^ a+ + +2 ^ ^ b? + +4 ^ ^ (*THEN) ++11 ^ ^ c+ ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) + +0 ^ a+ + +2 ^^ b? + +4 ^ ^ (*THEN) ++11 ^ ^ c+ ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) ++13 ^ ^ (*FAIL) +No match + +/a(*PRUNE:XXX)b/ +Failed: (*VERB) with an argument is not supported at offset 8 + +/a(*MARK)b/ +Failed: (*VERB) not recognized at offset 7 + +/(?i:A{1,}\6666666666)/ +Failed: number is too big at offset 19 + +/\g6666666666/ +Failed: number is too big at offset 11 + +/[\g6666666666]/ +Failed: number is too big at offset 12 + +/(?1)\c[/ +Failed: reference to non-existent subpattern at offset 3 + +/.+A/<crlf> + \r\nA +No match + +/\nA/<crlf> + \r\nA + 0: \x0aA + +/[\r\n]A/<crlf> + \r\nA + 0: \x0aA + +/(\r|\n)A/<crlf> + \r\nA + 0: \x0aA + 1: \x0a + +/a(*CR)b/ +Failed: (*VERB) not recognized at offset 5 + +/(*CR)a.b/ + a\nb + 0: a\x0ab + ** Failers +No match + a\rb +No match + +/(*CR)a.b/<lf> + a\nb + 0: a\x0ab + ** Failers +No match + a\rb +No match + +/(*LF)a.b/<CRLF> + a\rb + 0: a\x0db + ** Failers +No match + a\nb +No match + +/(*CRLF)a.b/ + a\rb + 0: a\x0db + a\nb + 0: a\x0ab + ** Failers +No match + a\r\nb +No match + +/(*ANYCRLF)a.b/<CR> + ** Failers +No match + a\rb +No match + a\nb +No match + a\r\nb +No match + +/(*ANY)a.b/<cr> + ** Failers +No match + a\rb +No match + a\nb +No match + a\r\nb +No match + a\x85b +No match + / End of testinput2 / diff --git a/ext/pcre/pcrelib/testdata/testoutput3 b/ext/pcre/pcrelib/testdata/testoutput3 index 839ae8a0dc..28b1c3aaaf 100644 --- a/ext/pcre/pcrelib/testdata/testoutput3 +++ b/ext/pcre/pcrelib/testdata/testoutput3 @@ -148,7 +148,7 @@ No match /[[:alpha:]][[:lower:]][[:upper:]]/DZLfr_FR ------------------------------------------------------------------ - Bra 0 + Bra 
[A-Za-z\xaa\xb5\xba\xc0-\xd6\xd8-\xf6\xf8-\xff] [a-z\xb5\xdf-\xf6\xf8-\xff] [A-Z\xc0-\xd6\xd8-\xde] diff --git a/ext/pcre/pcrelib/testdata/testoutput4 b/ext/pcre/pcrelib/testdata/testoutput4 index 966b28568c..b49d4f98ae 100644 --- a/ext/pcre/pcrelib/testdata/testoutput4 +++ b/ext/pcre/pcrelib/testdata/testoutput4 @@ -918,4 +918,24 @@ No match a 0: a +/\S\S/8g + A\x{a3}BC + 0: A\x{a3} + 0: BC + +/\S{2}/8g + A\x{a3}BC + 0: A\x{a3} + 0: BC + +/\W\W/8g + +\x{a3}== + 0: +\x{a3} + 0: == + +/\W{2}/8g + +\x{a3}== + 0: +\x{a3} + 0: == + / End of testinput4 / diff --git a/ext/pcre/pcrelib/testdata/testoutput5 b/ext/pcre/pcrelib/testdata/testoutput5 index 1f0b2b1243..2d9ee69495 100644 --- a/ext/pcre/pcrelib/testdata/testoutput5 +++ b/ext/pcre/pcrelib/testdata/testoutput5 @@ -1,6 +1,6 @@ /\x{100}/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100} Ket End @@ -12,7 +12,7 @@ Need char = 128 /\x{1000}/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{1000} Ket End @@ -24,7 +24,7 @@ Need char = 128 /\x{10000}/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{10000} Ket End @@ -36,7 +36,7 @@ Need char = 128 /\x{100000}/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100000} Ket End @@ -48,7 +48,7 @@ Need char = 128 /\x{1000000}/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{1000000} Ket End @@ -60,7 +60,7 @@ Need char = 128 /\x{4000000}/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{4000000} Ket End @@ -72,7 +72,7 @@ Need char = 128 /\x{7fffFFFF}/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{7fffffff} Ket End @@ -84,7 +84,7 @@ Need char = 191 /[\x{ff}]/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{ff} Ket End @@ -96,7 +96,7 @@ Need char = 191 /[\x{100}]/8DZ 
------------------------------------------------------------------ - Bra 0 + Bra [\x{100}] Ket End @@ -118,7 +118,7 @@ Failed: character value in \x{...} sequence is too large at offset 12 /\x80/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{80} Ket End @@ -130,7 +130,7 @@ Need char = 128 /\xff/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{ff} Ket End @@ -142,7 +142,7 @@ Need char = 191 /\x{0041}\x{2262}\x{0391}\x{002e}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra A\x{2262}\x{391}. Ket End @@ -156,7 +156,7 @@ Need char = '.' /\x{D55c}\x{ad6d}\x{C5B4}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra \x{d55c}\x{ad6d}\x{c5b4} Ket End @@ -170,7 +170,7 @@ Need char = 180 /\x{65e5}\x{672c}\x{8a9e}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra \x{65e5}\x{672c}\x{8a9e} Ket End @@ -184,7 +184,7 @@ Need char = 158 /\x{80}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra \x{80} Ket End @@ -196,7 +196,7 @@ Need char = 128 /\x{084}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra \x{84} Ket End @@ -208,7 +208,7 @@ Need char = 132 /\x{104}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra \x{104} Ket End @@ -220,7 +220,7 @@ Need char = 132 /\x{861}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra \x{861} Ket End @@ -232,7 +232,7 @@ Need char = 161 /\x{212ab}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra \x{212ab} Ket End @@ -244,7 +244,7 @@ Need char = 171 /.{3,5}X/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra Any{3} Any{0,2} X @@ -262,7 +262,7 @@ Need char = 'X' /.{3,5}?/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra Any{3} 
Any{0,2}? Ket @@ -334,7 +334,7 @@ can't tell the difference.) --/ /^[ab]/8DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [ab] Ket @@ -357,13 +357,14 @@ No match /^[^ab]/8DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ [\x00-`c-\xff] (neg) Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: anchored utf8 No first char No need char @@ -380,12 +381,13 @@ No match /[^ab\xC0-\xF0]/8SDZ ------------------------------------------------------------------ - Bra 0 + Bra [\x00-`c-\xbf\xf1-\xff] (neg) Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: utf8 No first char No need char @@ -416,7 +418,7 @@ No match /Ā{3,4}/8SDZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100}{3} \x{100}? Ket @@ -433,8 +435,8 @@ Study returned NULL /(\x{100}+|x)/8SDZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 \x{100}+ Alt x @@ -451,8 +453,8 @@ Starting byte set: x \xc4 /(\x{100}*a|x)/8SDZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 \x{100}*+ a Alt @@ -470,8 +472,8 @@ Starting byte set: a x \xc4 /(\x{100}{0,2}a|x)/8SDZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 \x{100}{0,2} a Alt @@ -489,8 +491,8 @@ Starting byte set: a x \xc4 /(\x{100}{1,2}a|x)/8SDZ ------------------------------------------------------------------ - Bra 0 - Bra 1 + Bra + CBra 1 \x{100} \x{100}{0,1} a @@ -533,7 +535,7 @@ No match /\x{100}/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100} Ket End @@ -545,7 +547,7 @@ Need char = 128 /\x{100}*/8DZ ------------------------------------------------------------------ - Bra 0 + 
Bra \x{100}* Ket End @@ -558,7 +560,7 @@ No need char /a\x{100}*/8DZ ------------------------------------------------------------------ - Bra 0 + Bra a \x{100}* Ket @@ -572,7 +574,7 @@ No need char /ab\x{100}*/8DZ ------------------------------------------------------------------ - Bra 0 + Bra ab \x{100}* Ket @@ -586,7 +588,7 @@ Need char = 'b' /a\x{100}\x{101}*/8DZ ------------------------------------------------------------------ - Bra 0 + Bra a\x{100} \x{101}* Ket @@ -600,7 +602,7 @@ Need char = 128 /a\x{100}\x{101}+/8DZ ------------------------------------------------------------------ - Bra 0 + Bra a\x{100} \x{101}+ Ket @@ -614,7 +616,7 @@ Need char = 129 /\x{100}*A/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100}*+ A Ket @@ -630,7 +632,7 @@ Need char = 'A' /\x{100}*\d(?R)/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100}*+ \d Once @@ -647,31 +649,33 @@ No need char /[^\x{c4}]/DZ ------------------------------------------------------------------ - Bra 0 + Bra [^\xc4] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match No options No first char No need char /[^\x{c4}]/8DZ ------------------------------------------------------------------ - Bra 0 + Bra [\x00-\xc3\xc5-\xff] (neg) Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: utf8 No first char No need char /[\x{100}]/8DZ ------------------------------------------------------------------ - Bra 0 + Bra [\x{100}] Ket End @@ -691,7 +695,7 @@ No match /[Z\x{100}]/8DZ ------------------------------------------------------------------ - Bra 0 + Bra [Z\x{100}] Ket End @@ -726,7 +730,7 @@ No match /[z-\x{100}]/8DZ ------------------------------------------------------------------ - Bra 0 + Bra [z-\x{100}] Ket End @@ -738,7 +742,7 @@ No need char 
/[z\Qa-d]Ā\E]/8DZ ------------------------------------------------------------------ - Bra 0 + Bra [\-\]adz\x{100}] Ket End @@ -754,7 +758,7 @@ No need char /[\xFF]/DZ ------------------------------------------------------------------ - Bra 0 + Bra \xff Ket End @@ -768,7 +772,7 @@ No need char /[\xff]/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra \x{ff} Ket End @@ -782,24 +786,26 @@ Need char = 191 /[^\xFF]/DZ ------------------------------------------------------------------ - Bra 0 + Bra [^\xff] Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match No options No first char No need char /[^\xff]/8DZ ------------------------------------------------------------------ - Bra 0 + Bra [\x00-\xfe] (neg) Ket End ------------------------------------------------------------------ Capturing subpattern count = 0 +Contains explicit CR or LF match Options: utf8 No first char No need char @@ -839,7 +845,7 @@ Failed: invalid UTF-8 string at offset 1 /xxx/8?DZ ------------------------------------------------------------------ - Bra 0 + Bra \X{c0}\X{c0}\X{c0}xxx Ket End @@ -887,19 +893,27 @@ No match \xf1\x8f\x80\x80 No match \xf8\x88\x80\x80\x80 -No match +Error -10 \xf9\x87\x80\x80\x80 -No match +Error -10 \xfc\x84\x80\x80\x80\x80 -No match +Error -10 \xfd\x83\x80\x80\x80\x80 +Error -10 + \?\xf8\x88\x80\x80\x80 +No match + \?\xf9\x87\x80\x80\x80 +No match + \?\xfc\x84\x80\x80\x80\x80 +No match + \?\xfd\x83\x80\x80\x80\x80 No match /\x{100}abc(xyz(?1))/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100}abc - Bra 1 + CBra 1 xyz Once Recurse @@ -915,10 +929,10 @@ Need char = 'z' /[^\x{100}]abc(xyz(?1))/8DZ ------------------------------------------------------------------ - Bra 0 + Bra [^\x{100}] abc - Bra 1 + CBra 1 xyz Once Recurse @@ -928,16 +942,17 @@ Need char = 'z' End 
------------------------------------------------------------------ Capturing subpattern count = 1 +Contains explicit CR or LF match Options: utf8 No first char Need char = 'z' /[ab\x{100}]abc(xyz(?1))/8DZ ------------------------------------------------------------------ - Bra 0 + Bra [ab\x{100}] abc - Bra 1 + CBra 1 xyz Once Recurse @@ -953,11 +968,11 @@ Need char = 'z' /(\x{100}(b(?2)c))?/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra Brazero - Bra 1 + CBra 1 \x{100} - Bra 2 + CBra 2 b Once Recurse @@ -975,12 +990,12 @@ No need char /(\x{100}(b(?2)c)){0,2}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra Brazero - Bra 0 - Bra 1 + Bra + CBra 1 \x{100} - Bra 2 + CBra 2 b Once Recurse @@ -989,9 +1004,9 @@ No need char Ket Ket Brazero - Bra 1 + CBra 1 \x{100} - Bra 2 + CBra 2 b Once Recurse @@ -1010,11 +1025,11 @@ No need char /(\x{100}(b(?1)c))?/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra Brazero - Bra 1 + CBra 1 \x{100} - Bra 2 + CBra 2 b Once Recurse @@ -1032,12 +1047,12 @@ No need char /(\x{100}(b(?1)c)){0,2}/DZ8 ------------------------------------------------------------------ - Bra 0 + Bra Brazero - Bra 0 - Bra 1 + Bra + CBra 1 \x{100} - Bra 2 + CBra 2 b Once Recurse @@ -1046,9 +1061,9 @@ No need char Ket Ket Brazero - Bra 1 + CBra 1 \x{100} - Bra 2 + CBra 2 b Once Recurse @@ -1081,7 +1096,7 @@ No need char /^\ሴ/8DZ ------------------------------------------------------------------ - Bra 0 + Bra ^ \x{1234} Ket @@ -1107,7 +1122,7 @@ Need char = 191 /\x{100}*\d/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100}*+ \d Ket @@ -1121,7 +1136,7 @@ No need char /\x{100}*\s/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100}*+ \s Ket @@ -1135,7 +1150,7 @@ No need char /\x{100}*\w/8DZ ------------------------------------------------------------------ - Bra 0 + Bra \x{100}*+ \w 
  Ket
@@ -1149,7 +1164,7 @@ No need char
 
 /\x{100}*\D/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  \x{100}*
  \D
  Ket
@@ -1163,7 +1178,7 @@ No need char
 
 /\x{100}*\S/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  \x{100}*
  \S
  Ket
@@ -1177,7 +1192,7 @@ No need char
 
 /\x{100}*\W/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  \x{100}*
  \W
  Ket
@@ -1191,7 +1206,7 @@ No need char
 
 /\x{100}+\x{200}/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  \x{100}++
  \x{200}
  Ket
@@ -1205,7 +1220,7 @@ Need char = 128
 
 /\x{100}+X/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  \x{100}++
  X
  Ket
@@ -1219,7 +1234,7 @@ Need char = 'X'
 
 /X+\x{200}/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  X++
  \x{200}
  Ket
@@ -1256,7 +1271,7 @@ Matched, but too many substrings
 
 /^[\x{100}\E-\Q\E\x{150}]/BZ8
 ------------------------------------------------------------------
- Bra 0
+ Bra
  ^
  [\x{100}-\x{150}]
  Ket
@@ -1265,7 +1280,7 @@ Matched, but too many substrings
 
 /^[\QĀ\E-\QŐ\E]/BZ8
 ------------------------------------------------------------------
- Bra 0
+ Bra
  ^
  [\x{100}-\x{150}]
  Ket
@@ -1431,7 +1446,7 @@ No match
 
 /[\h]/8BZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\x09 \xa0\x{1680}\x{180e}\x{2000}-\x{200a}\x{202f}\x{205f}\x{3000}]
  Ket
  End
@@ -1441,7 +1456,7 @@ No match
 
 /[\h]{3,}/8BZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\x09 \xa0\x{1680}\x{180e}\x{2000}-\x{200a}\x{202f}\x{205f}\x{3000}]{3,}
  Ket
  End
@@ -1451,7 +1466,7 @@ No match
 
 /[\v]/8BZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\x0a-\x0d\x85\x{2028}-\x{2029}]
  Ket
  End
@@ -1459,7 +1474,7 @@ No match
 
 /[\H]/8BZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\x00-\x08\x0a-\x1f!-\x9f\xa1-\xff\x{100}-\x{167f}\x{1681}-\x{180d}\x{180f}-\x{1fff}\x{200b}-\x{202e}\x{2030}-\x{205e}\x{2060}-\x{2fff}\x{3001}-\x{7fffffff}]
  Ket
  End
@@ -1467,10 +1482,44 @@ No match
 
 /[\V]/8BZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\x00-\x09\x0e-\x84\x86-\xff\x{100}-\x{2027}\x{2029}-\x{7fffffff}]
  Ket
  End
 ------------------------------------------------------------------
 
+/.*$/8<any>
+ \x{1ec5}
+ 0: \x{1ec5}
+
+/-- This tests the stricter UTF-8 check according to RFC 3629. --/
+
+/X/8
+ \x{0}\x{d7ff}\x{e000}\x{10ffff}
+No match
+ \x{d800}
+Error -10
+ \x{d800}\?
+No match
+ \x{da00}
+Error -10
+ \x{da00}\?
+No match
+ \x{dfff}
+Error -10
+ \x{dfff}\?
+No match
+ \x{110000}
+Error -10
+ \x{110000}\?
+No match
+ \x{2000000}
+Error -10
+ \x{2000000}\?
+No match
+ \x{7fffffff}
+Error -10
+ \x{7fffffff}\?
+No match
+
 / End of testinput5 /
diff --git a/ext/pcre/pcrelib/testdata/testoutput6 b/ext/pcre/pcrelib/testdata/testoutput6
index 776eed4e5f..0a58b844f1 100644
--- a/ext/pcre/pcrelib/testdata/testoutput6
+++ b/ext/pcre/pcrelib/testdata/testoutput6
@@ -99,7 +99,7 @@ No match
 No match
 
 /^\p{Cs}/8
- \x{dfff}
+ \?\x{dfff}
  0: \x{dfff}
  ** Failers
 No match
@@ -113,7 +113,7 @@ No match
 No match
  Z
 No match
- \x{dfff}
+ \x{e000}
 No match
 
 /^\p{Lm}/8
@@ -550,7 +550,7 @@ No match
 
 /[\p{L}]/DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\p{L}]
  Ket
  End
@@ -562,7 +562,7 @@ No need char
 
 /[\p{^L}]/DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\P{L}]
  Ket
  End
@@ -574,7 +574,7 @@ No need char
 
 /[\P{L}]/DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\P{L}]
  Ket
  End
@@ -586,7 +586,7 @@ No need char
 
 /[\P{^L}]/DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\p{L}]
  Ket
  End
@@ -598,7 +598,7 @@ No need char
 
 /[abc\p{L}\x{0660}]/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [a-c\p{L}\x{660}]
  Ket
  End
@@ -610,7 +610,7 @@ No need char
 
 /[\p{Nd}]/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\p{Nd}]
  Ket
  End
@@ -624,7 +624,7 @@ No need char
 
 /[\p{Nd}+-]+/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [+\-\p{Nd}]+
  Ket
  End
@@ -779,7 +779,7 @@ No match
 
 /A\x{391}\x{10427}\x{ff3a}\x{1fb0}/8iDZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  NC A\x{391}\x{10427}\x{ff3a}\x{1fb0}
  Ket
  End
@@ -791,7 +791,7 @@ No need char
 
 /A\x{391}\x{10427}\x{ff3a}\x{1fb0}/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  A\x{391}\x{10427}\x{ff3a}\x{1fb0}
  Ket
  End
@@ -803,7 +803,7 @@ Need char = 176
 
 /AB\x{1fb0}/8DZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  AB\x{1fb0}
  Ket
  End
@@ -815,7 +815,7 @@ Need char = 176
 
 /AB\x{1fb0}/8DZi
 ------------------------------------------------------------------
- Bra 0
+ Bra
  NC AB\x{1fb0}
  Ket
  End
@@ -857,7 +857,7 @@ Need char = 'B' (caseless)
 
 /[\x{105}-\x{109}]/8iDZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [\x{104}-\x{109}]
  Ket
  End
@@ -881,7 +881,7 @@ No match
 
 /[z-\x{100}]/8iDZ
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [Z\x{39c}\x{178}z-\x{101}]
  Ket
  End
@@ -919,7 +919,7 @@ No match
 
 /[z-\x{100}]/8DZi
 ------------------------------------------------------------------
- Bra 0
+ Bra
  [Z\x{39c}\x{178}z-\x{101}]
  Ket
  End
@@ -1452,4 +1452,74 @@ was broken in all cases./
  123abc\xc4\xc5zz
  0: abc\xc4
 
+/\X{1,3}\d/
+ \x8aBCD
+No match
+
+/\X?\d/
+ \x8aBCD
+No match
+
+/\P{L}?\d/
+ \x8aBCD
+No match
+
+/[\PPP\x8a]{1,}\x80/
+ A\x80
+ 0: A\x80
+
+/(?:[\PPa*]*){8,}/
+
+/[\P{Any}]/BZ
+------------------------------------------------------------------
+ Bra
+ [\P{Any}]
+ Ket
+ End
+------------------------------------------------------------------
+
+/[\P{Any}\E]/BZ
+------------------------------------------------------------------
+ Bra
+ [\P{Any}]
+ Ket
+ End
+------------------------------------------------------------------
+
+/(\P{Yi}+\277)/
+
+/(\P{Yi}+\277)?/
+
+/(?<=\P{Yi}{3}A)X/
+
+/\p{Yi}+(\P{Yi}+)(?1)/
+
+/(\P{Yi}{2}\277)?/
+
+/[\P{Yi}A]/
+
+/[\P{Yi}\P{Yi}\P{Yi}A]/
+
+/[^\P{Yi}A]/
+
+/[^\P{Yi}\P{Yi}\P{Yi}A]/
+
+/(\P{Yi}*\277)*/
+
+/(\P{Yi}*?\277)*/
+
+/(\p{Yi}*+\277)*/
+
+/(\P{Yi}?\277)*/
+
+/(\P{Yi}??\277)*/
+
+/(\p{Yi}?+\277)*/
+
+/(\P{Yi}{0,3}\277)*/
+
+/(\P{Yi}{0,3}?\277)*/
+
+/(\p{Yi}{0,3}+\277)*/
+
 / End of testinput6 /
diff --git a/ext/pcre/pcrelib/testdata/testoutput7 b/ext/pcre/pcrelib/testdata/testoutput7
index a77186dbd5..39c50750ec 100644
--- a/ext/pcre/pcrelib/testdata/testoutput7
+++ b/ext/pcre/pcrelib/testdata/testoutput7
@@ -7072,4 +7072,20 @@ No match
  >\x0a\x0dX\x0aY\x0a\x0bZZZ\x0aAAA\x0bNNN\x0c
  0: \x0a\x0dX\x0aY\x0a\x0bZZZ\x0aAAA\x0bNNN\x0c
 
+/.+A/<crlf>
+ \r\nA
+No match
+
+/\nA/<crlf>
+ \r\nA
+ 0: \x0aA
+
+/[\r\n]A/<crlf>
+ \r\nA
+ 0: \x0aA
+
+/(\r|\n)A/<crlf>
+ \r\nA
+ 0: \x0aA
+
 / End of testinput7 /
diff --git a/ext/pcre/pcrelib/testdata/testoutput9 b/ext/pcre/pcrelib/testdata/testoutput9
index bc5f0e71a2..acaeb398dd 100644
--- a/ext/pcre/pcrelib/testdata/testoutput9
+++ b/ext/pcre/pcrelib/testdata/testoutput9
@@ -271,7 +271,7 @@ No match
 No match
 
 /^\p{Cs}/8
- \x{dfff}
+ \?\x{dfff}
  0: \x{dfff}
  ** Failers
 No match
@@ -285,7 +285,7 @@ No match
 No match
  Z
 No match
- \x{dfff}
+ \x{e000}
 No match
 
 /^\p{Lm}/8 |