author | Andrey Hristov <andrey@php.net> | 1999-05-26 21:47:57 +0000 |
---|---|---|
committer | Andrey Hristov <andrey@php.net> | 1999-05-26 21:47:57 +0000 |
commit | 66f850e5b7f97bb8b07df7ca0d2a82acf4c239d4 (patch) | |
tree | 534d993fe1afe902042082be0457acfcd10bd4f1 /ext/pcre/pcrelib | |
parent | d73c63852645a0a4324f24b367dff43b54eada85 (diff) | |
download | php-git-66f850e5b7f97bb8b07df7ca0d2a82acf4c239d4.tar.gz |
-Added PCRE library source
-Updated configuration process
Diffstat (limited to 'ext/pcre/pcrelib')
30 files changed, 20933 insertions, 0 deletions
diff --git a/ext/pcre/pcrelib/ChangeLog b/ext/pcre/pcrelib/ChangeLog new file mode 100644 index 0000000000..2259f873ef --- /dev/null +++ b/ext/pcre/pcrelib/ChangeLog @@ -0,0 +1,439 @@ +ChangeLog for PCRE +------------------ + + +Version 2.05 21-Apr-99 +---------------------- + +1. Changed the type of magic_number from int to long int so that it works +properly on 16-bit systems. + +2. Fixed a bug which caused patterns starting with .* not to work correctly +when the subject string contained newline characters. PCRE was assuming +anchoring for such patterns in all cases, which is not correct because .* will +not pass a newline unless PCRE_DOTALL is set. It now assumes anchoring only if +DOTALL is set at top level; otherwise it knows that patterns starting with .* +must be retried after every newline in the subject. + + +Version 2.04 18-Feb-99 +---------------------- + +1. For parenthesized subpatterns with repeats whose minimum was zero, the +computation of the store needed to hold the pattern was incorrect (too large). +If such patterns were nested a few deep, this could multiply and become a real +problem. + +2. Added /M option to pcretest to show the memory requirement of a specific +pattern. Made -m a synonym of -s (which does this globally) for compatibility. + +3. Subpatterns of the form (regex){n,m} (i.e. limited maximum) were being +compiled in such a way that the backtracking after subsequent failure was +pessimal. Something like (a){0,3} was compiled as (a)?(a)?(a)? instead of +((a)((a)(a)?)?)? with disastrous performance if the maximum was of any size. + + +Version 2.03 02-Feb-99 +---------------------- + +1. Fixed typo and small mistake in man page. + +2. Added 4th condition (GPL supersedes if conflict) and created separate +LICENCE file containing the conditions. + +3. Updated pcretest so that patterns such as /abc\/def/ work like they do in +Perl, that is the internal \ allows the delimiter to be included in the +pattern. 
Locked out the use of \ as a delimiter. If \ immediately follows +the final delimiter, add \ to the end of the pattern (to test the error). + +4. Added the convenience functions for extracting substrings after a successful +match. Updated pcretest to make it able to test these functions. + + +Version 2.02 14-Jan-99 +---------------------- + +1. Initialized the working variables associated with each extraction so that +their saving and restoring doesn't refer to uninitialized store. + +2. Put dummy code into study.c in order to trick the optimizer of the IBM C +compiler for OS/2 into generating correct code. Apparently IBM isn't going to +fix the problem. + +3. Pcretest: the timing code wasn't using LOOPREPEAT for timing execution +calls, and wasn't printing the correct value for compiling calls. Increased the +default value of LOOPREPEAT, and the number of significant figures in the +times. + +4. Changed "/bin/rm" in the Makefile to "-rm" so it works on Windows NT. + +5. Renamed "deftables" as "dftables" to get it down to 8 characters, to avoid +a building problem on Windows NT with a FAT file system. + + +Version 2.01 21-Oct-98 +---------------------- + +1. Changed the API for pcre_compile() to allow for the provision of a pointer +to character tables built by pcre_maketables() in the current locale. If NULL +is passed, the default tables are used. + + +Version 2.00 24-Sep-98 +---------------------- + +1. Since the (>?) facility is in Perl 5.005, don't require PCRE_EXTRA to enable +it any more. + +2. Allow quantification of (?>) groups, and make it work correctly. + +3. The first character computation wasn't working for (?>) groups. + +4. Correct the implementation of \Z (it is permitted to match on the \n at the +end of the subject) and add 5.005's \z, which really does match only at the +very end of the subject. + +5. Remove the \X "cut" facility; Perl doesn't have it, and (?> is neater. + +6. 
Remove the ability to specify CASELESS, MULTILINE, DOTALL, and +DOLLAR_END_ONLY at runtime, to make it possible to implement the Perl 5.005 +localized options. All options to pcre_study() were also removed. + +7. Add other new features from 5.005: + + $(?<= positive lookbehind + $(?<! negative lookbehind + (?imsx-imsx) added the unsetting capability + such a setting is global if at outer level; local otherwise + (?imsx-imsx:) non-capturing groups with option setting + (?(cond)re|re) conditional pattern matching + + A backreference to itself in a repeated group matches the previous + captured string. + +8. General tidying up of studying (both automatic and via "study") +consequential on the addition of new assertions. + +9. As in 5.005, unlimited repeated groups that could match an empty substring +are no longer faulted at compile time. Instead, the loop is forcibly broken at +runtime if any iteration does actually match an empty substring. + +10. Include the RunTest script in the distribution. + +11. Added tests from the Perl 5.005_02 distribution. This showed up a few +discrepancies, some of which were old and were also with respect to 5.004. They +have now been fixed. + + +Version 1.09 28-Apr-98 +---------------------- + +1. A negated single character class followed by a quantifier with a minimum +value of one (e.g. [^x]{1,6} ) was not compiled correctly. This could lead to +program crashes, or just wrong answers. This did not apply to negated classes +containing more than one character, or to minima other than one. + + +Version 1.08 27-Mar-98 +---------------------- + +1. Add PCRE_UNGREEDY to invert the greediness of quantifiers. + +2. Add (?U) and (?X) to set PCRE_UNGREEDY and PCRE_EXTRA respectively. The +latter must appear before anything that relies on it in the pattern. + + +Version 1.07 16-Feb-98 +---------------------- + +1. A pattern such as /((a)*)*/ was not being diagnosed as in error (unlimited +repeat of a potentially empty string). 
+ + +Version 1.06 23-Jan-98 +---------------------- + +1. Added Markus Oberhumer's little patches for C++. + +2. Literal strings longer than 255 characters were broken. + + +Version 1.05 23-Dec-97 +---------------------- + +1. Negated character classes containing more than one character were failing if +PCRE_CASELESS was set at run time. + + +Version 1.04 19-Dec-97 +---------------------- + +1. Corrected the man page, where some "const" qualifiers had been omitted. + +2. Made debugging output print "{0,xxx}" instead of just "{,xxx}" to agree with +input syntax. + +3. Fixed memory leak which occurred when a regex with back references was +matched with an offsets vector that wasn't big enough. The temporary memory +that is used in this case wasn't being freed if the match failed. + +4. Tidied pcretest to ensure it frees memory that it gets. + +5. Temporary memory was being obtained in the case where the passed offsets +vector was exactly big enough. + +6. Corrected definition of offsetof() from change 5 below. + +7. I had screwed up change 6 below and broken the rules for the use of +setjmp(). Now fixed. + + +Version 1.03 18-Dec-97 +---------------------- + +1. A erroneous regex with a missing opening parenthesis was correctly +diagnosed, but PCRE attempted to access brastack[-1], which could cause crashes +on some systems. + +2. Replaced offsetof(real_pcre, code) by offsetof(real_pcre, code[0]) because +it was reported that one broken compiler failed on the former because "code" is +also an independent variable. + +3. The erroneous regex a[]b caused an array overrun reference. + +4. A regex ending with a one-character negative class (e.g. /[^k]$/) did not +fail on data ending with that character. (It was going on too far, and checking +the next character, typically a binary zero.) This was specific to the +optimized code for single-character negative classes. + +5. 
Added a contributed patch from the TIN world which does the following: + + + Add an undef for memmove, in case the the system defines a macro for it. + + + Add a definition of offsetof(), in case there isn't one. (I don't know + the reason behind this - offsetof() is part of the ANSI standard - but + it does no harm). + + + Reduce the ifdef's in pcre.c using macro DPRINTF, thereby eliminating + most of the places where whitespace preceded '#'. I have given up and + allowed the remaining 2 cases to be at the margin. + + + Rename some variables in pcre to eliminate shadowing. This seems very + pedantic, but does no harm, of course. + +6. Moved the call to setjmp() into its own function, to get rid of warnings +from gcc -Wall, and avoided calling it at all unless PCRE_EXTRA is used. + +7. Constructs such as \d{8,} were compiling into the equivalent of +\d{8}\d{0,65527} instead of \d{8}\d* which didn't make much difference to the +outcome, but in this particular case used more store than had been allocated, +which caused the bug to be discovered because it threw up an internal error. + +8. The debugging code in both pcre and pcretest for outputting the compiled +form of a regex was going wrong in the case of back references followed by +curly-bracketed repeats. + + +Version 1.02 12-Dec-97 +---------------------- + +1. Typos in pcre.3 and comments in the source fixed. + +2. Applied a contributed patch to get rid of places where it used to remove +'const' from variables, and fixed some signed/unsigned and uninitialized +variable warnings. + +3. Added the "runtest" target to Makefile. + +4. Set default compiler flag to -O2 rather than just -O. + + +Version 1.01 19-Nov-97 +---------------------- + +1. PCRE was failing to diagnose unlimited repeat of empty string for patterns +like /([ab]*)*/, that is, for classes with more than one character in them. + +2. Likewise, it wasn't diagnosing patterns with "once-only" subpatterns, such +as /((?>a*))*/ (a PCRE_EXTRA facility). 
+ + +Version 1.00 18-Nov-97 +---------------------- + +1. Added compile-time macros to support systems such as SunOS4 which don't have +memmove() or strerror() but have other things that can be used instead. + +2. Arranged that "make clean" removes the executables. + + +Version 0.99 27-Oct-97 +---------------------- + +1. Fixed bug in code for optimizing classes with only one character. It was +initializing a 32-byte map regardless, which could cause it to run off the end +of the memory it had got. + +2. Added, conditional on PCRE_EXTRA, the proposed (?>REGEX) construction. + + +Version 0.98 22-Oct-97 +---------------------- + +1. Fixed bug in code for handling temporary memory usage when there are more +back references than supplied space in the ovector. This could cause segfaults. + + +Version 0.97 21-Oct-97 +---------------------- + +1. Added the \X "cut" facility, conditional on PCRE_EXTRA. + +2. Optimized negated single characters not to use a bit map. + +3. Brought error texts together as macro definitions; clarified some of them; +fixed one that was wrong - it said "range out of order" when it meant "invalid +escape sequence". + +4. Changed some char * arguments to const char *. + +5. Added PCRE_NOTBOL and PCRE_NOTEOL (from POSIX). + +6. Added the POSIX-style API wrapper in pcreposix.a and testing facilities in +pcretest. + + +Version 0.96 16-Oct-97 +---------------------- + +1. Added a simple "pgrep" utility to the distribution. + +2. Fixed an incompatibility with Perl: "{" is now treated as a normal character +unless it appears in one of the precise forms "{ddd}", "{ddd,}", or "{ddd,ddd}" +where "ddd" means "one or more decimal digits". + +3. Fixed serious bug. If a pattern had a back reference, but the call to +pcre_exec() didn't supply a large enough ovector to record the related +identifying subpattern, the match always failed. 
PCRE now remembers the number +of the largest back reference, and gets some temporary memory in which to save +the offsets during matching if necessary, in order to ensure that +backreferences always work. + +4. Increased the compatibility with Perl in a number of ways: + + (a) . no longer matches \n by default; an option PCRE_DOTALL is provided + to request this handling. The option can be set at compile or exec time. + + (b) $ matches before a terminating newline by default; an option + PCRE_DOLLAR_ENDONLY is provided to override this (but not in multiline + mode). The option can be set at compile or exec time. + + (c) The handling of \ followed by a digit other than 0 is now supposed to be + the same as Perl's. If the decimal number it represents is less than 10 + or there aren't that many previous left capturing parentheses, an octal + escape is read. Inside a character class, it's always an octal escape, + even if it is a single digit. + + (d) An escaped but undefined alphabetic character is taken as a literal, + unless PCRE_EXTRA is set. Currently this just reserves the remaining + escapes. + + (e) {0} is now permitted. (The previous item is removed from the compiled + pattern). + +5. Changed all the names of code files so that the basic parts are no longer +than 10 characters, and abolished the teeny "globals.c" file. + +6. Changed the handling of character classes; they are now done with a 32-byte +bit map always. + +7. Added the -d and /D options to pcretest to make it possible to look at the +internals of compilation without having to recompile pcre. + + +Version 0.95 23-Sep-97 +---------------------- + +1. Fixed bug in pre-pass concerning escaped "normal" characters such as \x5c or +\x20 at the start of a run of normal characters. These were being treated as +real characters, instead of the source characters being re-checked. + + +Version 0.94 18-Sep-97 +---------------------- + +1. 
The functions are now thread-safe, with the caveat that the global variables +containing pointers to malloc() and free() or alternative functions are the +same for all threads. + +2. Get pcre_study() to generate a bitmap of initial characters for non- +anchored patterns when this is possible, and use it if passed to pcre_exec(). + + +Version 0.93 15-Sep-97 +---------------------- + +1. /(b)|(:+)/ was computing an incorrect first character. + +2. Add pcre_study() to the API and the passing of pcre_extra to pcre_exec(), +but not actually doing anything yet. + +3. Treat "-" characters in classes that cannot be part of ranges as literals, +as Perl does (e.g. [-az] or [az-]). + +4. Set the anchored flag if a branch starts with .* or .*? because that tests +all possible positions. + +5. Split up into different modules to avoid including unneeded functions in a +compiled binary. However, compile and exec are still in one module. The "study" +function is split off. + +6. The character tables are now in a separate module whose source is generated +by an auxiliary program - but can then be edited by hand if required. There are +now no calls to isalnum(), isspace(), isdigit(), isxdigit(), tolower() or +toupper() in the code. + +7. Turn the malloc/free funtions variables into pcre_malloc and pcre_free and +make them global. Abolish the function for setting them, as the caller can now +set them directly. + + +Version 0.92 11-Sep-97 +---------------------- + +1. A repeat with a fixed maximum and a minimum of 1 for an ordinary character +(e.g. /a{1,3}/) was broken (I mis-optimized it). + +2. Caseless matching was not working in character classes if the characters in +the pattern were in upper case. + +3. Make ranges like [W-c] work in the same way as Perl for caseless matching. + +4. Make PCRE_ANCHORED public and accept as a compile option. + +5. Add an options word to pcre_exec() and accept PCRE_ANCHORED and +PCRE_CASELESS at run time. 
Add escapes \A and \I to pcretest to cause it to +pass them. + +6. Give an error if bad option bits passed at compile or run time. + +7. Add PCRE_MULTILINE at compile and exec time, and (?m) as well. Add \M to +pcretest to cause it to pass that flag. + +8. Add pcre_info(), to get the number of identifying subpatterns, the stored +options, and the first character, if set. + +9. Recognize C+ or C{n,m} where n >= 1 as providing a fixed starting character. + + +Version 0.91 10-Sep-97 +---------------------- + +1. PCRE was failing to diagnose unlimited repeats of subpatterns that could +match the empty string as in /(a*)*/. It was looping and ultimately crashing. + +2. PCRE was looping on encountering an indefinitely repeated back reference to +a subpattern that had matched an empty string, e.g. /(a|)\1*/. It now does what +Perl does - treats the match as successful. + +**** diff --git a/ext/pcre/pcrelib/LICENCE b/ext/pcre/pcrelib/LICENCE new file mode 100644 index 0000000000..246515ae75 --- /dev/null +++ b/ext/pcre/pcrelib/LICENCE @@ -0,0 +1,32 @@ +PCRE LICENCE +------------ + +PCRE is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. + +Written by: Philip Hazel <ph10@cam.ac.uk> + +University of Cambridge Computing Service, +Cambridge, England. Phone: +44 1223 334714. + +Copyright (c) 1997-1999 University of Cambridge + +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. 
Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. + +End diff --git a/ext/pcre/pcrelib/Makefile b/ext/pcre/pcrelib/Makefile new file mode 100644 index 0000000000..2da3012aa1 --- /dev/null +++ b/ext/pcre/pcrelib/Makefile @@ -0,0 +1,80 @@ +# Make file for PCRE (Perl-Compatible Regular Expression) library. + +# Edit CC, CFLAGS, and RANLIB for your system. + +# It is believed that RANLIB=ranlib is required for AIX, BSDI, FreeBSD, Linux, +# MIPS RISCOS, NetBSD, OpenBSD, Digital Unix, and Ultrix. + +# Use CFLAGS = -DUSE_BCOPY on SunOS4 and any other system that lacks the +# memmove() function, but has bcopy(). + +# Use CFLAGS = -DSTRERROR_FROM_ERRLIST on SunOS4 and any other system that +# lacks the strerror() function, but can provide the equivalent by indexing +# into errlist. 
+
+AR = ar cq
+CC = gcc -O2 -Wall
+CFLAGS =
+RANLIB = @true
+
+##########################################################################
+
+OBJ = maketables.o get.o study.o pcre.o
+
+all: libpcre.a libpcreposix.a pcretest pgrep
+
+pgrep: libpcre.a pgrep.o
+	$(CC) $(CFLAGS) -o pgrep pgrep.o libpcre.a
+
+pcretest: libpcre.a libpcreposix.a pcretest.o
+	$(PURIFY) $(CC) $(CFLAGS) -o pcretest pcretest.o libpcre.a libpcreposix.a
+
+libpcre.a: $(OBJ)
+	-rm -f libpcre.a
+	$(AR) libpcre.a $(OBJ)
+	$(RANLIB) libpcre.a
+
+libpcreposix.a: pcreposix.o
+	-rm -f libpcreposix.a
+	$(AR) libpcreposix.a pcreposix.o
+	$(RANLIB) libpcreposix.a
+
+pcre.o: chartables.c pcre.c pcre.h internal.h Makefile
+	$(CC) -c $(CFLAGS) pcre.c
+
+pcreposix.o: pcreposix.c pcreposix.h internal.h pcre.h Makefile
+	$(CC) -c $(CFLAGS) pcreposix.c
+
+maketables.o: maketables.c pcre.h internal.h Makefile
+	$(CC) -c $(CFLAGS) maketables.c
+
+get.o: get.c pcre.h internal.h Makefile
+	$(CC) -c $(CFLAGS) get.c
+
+study.o: study.c pcre.h internal.h Makefile
+	$(CC) -c $(CFLAGS) study.c
+
+pcretest.o: pcretest.c pcre.h Makefile
+	$(CC) -c $(CFLAGS) pcretest.c
+
+pgrep.o: pgrep.c pcre.h Makefile
+	$(CC) -c $(CFLAGS) pgrep.c
+
+# An auxiliary program makes the default character table source
+
+chartables.c: dftables
+	./dftables >chartables.c
+
+dftables: dftables.c maketables.c pcre.h internal.h Makefile
+	$(CC) -o dftables $(CFLAGS) dftables.c
+
+# We deliberately omit dftables and chartables.c from 'make clean'; once made
+# chartables.c shouldn't change, and if people have edited the tables by hand,
+# you don't want to throw them away.
+ +clean:; -rm -f *.o *.a pcretest pgrep + +runtest: all + ./RunTest + +# End diff --git a/ext/pcre/pcrelib/README b/ext/pcre/pcrelib/README new file mode 100644 index 0000000000..2db0070f16 --- /dev/null +++ b/ext/pcre/pcrelib/README @@ -0,0 +1,333 @@ +README file for PCRE (Perl-compatible regular expressions) +---------------------------------------------------------- + +******************************************************************************* +* IMPORTANT FOR THOSE UPGRADING FROM VERSIONS BEFORE 2.00 * +* * +* Please note that there has been a change in the API such that a larger * +* ovector is required at matching time, to provide some additional workspace. * +* The new man page has details. This change was necessary in order to support * +* some of the new functionality in Perl 5.005. * +* * +* IMPORTANT FOR THOSE UPGRADING FROM VERSION 2.00 * +* * +* Another (I hope this is the last!) change has been made to the API for the * +* pcre_compile() function. An additional argument has been added to make it * +* possible to pass over a pointer to character tables built in the current * +* locale by pcre_maketables(). To use the default tables, this new arguement * +* should be passed as NULL. 
* +******************************************************************************* + +The distribution should contain the following files: + + ChangeLog log of changes to the code + LICENCE conditions for the use of PCRE + Makefile for building PCRE + README this file + RunTest a shell script for running tests + Tech.Notes notes on the encoding + pcre.3 man page for the functions + pcreposix.3 man page for the POSIX wrapper API + dftables.c auxiliary program for building chartables.c + get.c ) + maketables.c ) + study.c ) source of + pcre.c ) the functions + pcreposix.c ) + pcre.h header for the external API + pcreposix.h header for the external POSIX wrapper API + internal.h header for internal use + pcretest.c test program + pgrep.1 man page for pgrep + pgrep.c source of a grep utility that uses PCRE + perltest Perl test program + testinput1 test data, compatible with Perl 5.004 and 5.005 + testinput2 test data for error messages and non-Perl things + testinput3 test data, compatible with Perl 5.005 + testinput4 test data for locale-specific tests + testoutput1 test results corresponding to testinput + testoutput2 test results corresponding to testinput2 + testoutput3 test results corresponding to testinput3 + testoutput4 test results corresponding to testinput4 + +To build PCRE, edit Makefile for your system (it is a fairly simple make file, +and there are some comments at the top) and then run it. It builds two +libraries called libpcre.a and libpcreposix.a, a test program called pcretest, +and the pgrep command. + +To test PCRE, run the RunTest script in the pcre directory. This runs pcretest +on each of the testinput files in turn, and compares the output with the +contents of the corresponding testoutput file. A file called testtry is used to +hold the output from pcretest (which is documented below). 
+ +To run pcretest on just one of the test files, give its number as an argument +to RunTest, for example: + + RunTest 3 + +The first and third test files can also be fed directly into the perltest +program to check that Perl gives the same results. The third file requires the +additional features of release 5.005, which is why it is kept separate from the +main test input, which needs only Perl 5.004. In the long run, when 5.005 is +widespread, these two test files may get amalgamated. + +The second set of tests check pcre_info(), pcre_study(), pcre_copy_substring(), +pcre_get_substring(), pcre_get_substring_list(), error detection and run-time +flags that are specific to PCRE, as well as the POSIX wrapper API. + +The fourth set of tests checks pcre_maketables(), the facility for building a +set of character tables for a specific locale and using them instead of the +default tables. The tests make use of the "fr" (French) locale. Before running +the test, the script checks for the presence of this locale by running the +"locale" command. If that command fails, or if it doesn't include "fr" in the +list of available locales, the fourth test cannot be run, and a comment is +output to say why. If running this test produces instances of the error + + ** Failed to set locale "fr" + +in the comparison output, it means that locale is not available on your system, +despite being listed by "locale". This does not mean that PCRE is broken. + +To install PCRE, copy libpcre.a to any suitable library directory (e.g. +/usr/local/lib), pcre.h to any suitable include directory (e.g. +/usr/local/include), and pcre.3 to any suitable man directory (e.g. +/usr/local/man/man3). + +To install the pgrep command, copy it to any suitable binary directory, (e.g. +/usr/local/bin) and pgrep.1 to any suitable man directory (e.g. +/usr/local/man/man1). 
+ +PCRE has its own native API, but a set of "wrapper" functions that are based on +the POSIX API are also supplied in the library libpcreposix.a. Note that this +just provides a POSIX calling interface to PCRE: the regular expressions +themselves still follow Perl syntax and semantics. The header file +for the POSIX-style functions is called pcreposix.h. The official POSIX name is +regex.h, but I didn't want to risk possible problems with existing files of +that name by distributing it that way. To use it with an existing program that +uses the POSIX API, it will have to be renamed or pointed at by a link. + + +Character tables +---------------- + +PCRE uses four tables for manipulating and identifying characters. The final +argument of the pcre_compile() function is a pointer to a block of memory +containing the concatenated tables. A call to pcre_maketables() is used to +generate a set of tables in the current locale. However, if the final argument +is passed as NULL, a set of default tables that is built into the binary is +used. + +The source file called chartables.c contains the default set of tables. This is +not supplied in the distribution, but is built by the program dftables +(compiled from dftables.c), which uses the ANSI C character handling functions +such as isalnum(), isalpha(), isupper(), islower(), etc. to build the table +sources. This means that the default C locale set your system will control the +contents of the tables. You can change the default tables by editing +chartables.c and then re-building PCRE. If you do this, you should probably +also edit Makefile to ensure that the file doesn't ever get re-generated. + +The first two 256-byte tables provide lower casing and case flipping functions, +respectively. The next table consists of three 32-byte bit maps which identify +digits, "word" characters, and white space, respectively. These are used when +building 32-byte bit maps that represent character classes. 
+ +The final 256-byte table has bits indicating various character types, as +follows: + + 1 white space character + 2 letter + 4 decimal digit + 8 hexadecimal digit + 16 alphanumeric or '_' + 128 regular expression metacharacter or binary zero + +You should not alter the set of characters that contain the 128 bit, as that +will cause PCRE to malfunction. + + +The pcretest program +-------------------- + +This program is intended for testing PCRE, but it can also be used for +experimenting with regular expressions. + +If it is given two filename arguments, it reads from the first and writes to +the second. If it is given only one filename argument, it reads from that file +and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and +prompts for each line of input. + +The program handles any number of sets of input on a single input file. Each +set starts with a regular expression, and continues with any number of data +lines to be matched against the pattern. An empty line signals the end of the +set. The regular expressions are given enclosed in any non-alphameric +delimiters other than backslash, for example + + /(a|bc)x+yz/ + +White space before the initial delimiter is ignored. A regular expression may +be continued over several input lines, in which case the newline characters are +included within it. See the testinput files for many examples. It is possible +to include the delimiter within the pattern by escaping it, for example + + /abc\/def/ + +If you do so, the escape and the delimiter form part of the pattern, but since +delimiters are always non-alphameric, this does not affect its interpretation. +If the terminating delimiter is immediately followed by a backslash, for +example, + + /abc/\ + +then a backslash is added to the end of the pattern. 
This provides a way of +testing the error condition that arises if a pattern finishes with a backslash, +because + + /abc\/ + +is interpreted as the first line of a pattern that starts with "abc/", causing +pcretest to read the next line as a continuation of the regular expression. + +The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS, +PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. These +options have the same effect as they do in Perl. + +There are also some upper case options that do not match Perl options: /A, /E, +and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively. + +The /L option must be followed directly by the name of a locale, for example, + + /pattern/Lfr + +For this reason, it must be the last option letter. The given locale is set, +pcre_maketables() is called to build a set of character tables for the locale, +and this is then passed to pcre_compile() when compiling the regular +expression. Without an /L option, NULL is passed as the tables pointer; that +is, /L applies only to the expression on which it appears. + +The /I option requests that pcretest output information about the compiled +expression (whether it is anchored, has a fixed first character, and so on). It +does this by calling pcre_info() after compiling an expression, and outputting +the information it gets back. If the pattern is studied, the results of that +are also output. + +The /D option is a PCRE debugging feature, which also assumes /I. It causes the +internal form of compiled regular expressions to be output after compilation. + +The /S option causes pcre_study() to be called after the expression has been +compiled, and the results used when the expression is matched. + +The /M option causes information about the size of memory block used to hold +the compile pattern to be output. + +Finally, the /P option causes pcretest to call PCRE via the POSIX wrapper API +rather than its native API. 
When this is done, all other options except /i and +/m are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m +is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and +PCRE_DOTALL unless REG_NEWLINE is set. + +Before each data line is passed to pcre_exec(), leading and trailing whitespace +is removed, and it is then scanned for \ escapes. The following are recognized: + + \a alarm (= BEL) + \b backspace + \e escape + \f formfeed + \n newline + \r carriage return + \t tab + \v vertical tab + \nnn octal character (up to 3 octal digits) + \xhh hexadecimal character (up to 2 hex digits) + + \A pass the PCRE_ANCHORED option to pcre_exec() + \B pass the PCRE_NOTBOL option to pcre_exec() + \Cdd call pcre_copy_substring() for substring dd after a successful match + (any decimal number less than 32) + \Gdd call pcre_get_substring() for substring dd after a successful match + (any decimal number less than 32) + \L call pcre_get_substringlist() after a successful match + \Odd set the size of the output vector passed to pcre_exec() to dd + (any number of decimal digits) + \Z pass the PCRE_NOTEOL option to pcre_exec() + +A backslash followed by anything else just escapes the anything else. If the +very last character is a backslash, it is ignored. This gives a way of passing +an empty line as data, since a real empty line terminates the data input. + +If /P was present on the regex, causing the POSIX wrapper API to be used, only +\B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to +regexec() respectively. + +When a match succeeds, pcretest outputs the list of captured substrings that +pcre_exec() returns, starting with number 0 for the string that matched the +whole pattern. Here is an example of an interactive pcretest run. 
+ + $ pcretest + Testing Perl-Compatible Regular Expressions + PCRE version 0.90 08-Sep-1997 + + re> /^abc(\d+)/ + data> abc123 + 0: abc123 + 1: 123 + data> xyz + No match + +If any of \C, \G, or \L are present in a data line that is successfully +matched, the substrings extracted by the convenience functions are output with +C, G, or L after the string number instead of a colon. This is in addition to +the normal full list. The string length (that is, the return from the +extraction function) is given in parentheses after each string for \C and \G. + +Note that while patterns can be continued over several lines (a plain ">" +prompt is used for continuations), data lines may not. However newlines can be +included in data by means of the \n escape. + +If the -p option is given to pcretest, it is equivalent to adding /P to each +regular expression: the POSIX wrapper API is used to call PCRE. None of the +following flags has any effect in this case. + +If the option -d is given to pcretest, it is equivalent to adding /D to each +regular expression: the internal form is output after compilation. + +If the option -i is given to pcretest, it is equivalent to adding /I to each +regular expression: information about the compiled pattern is given after +compilation. + +If the option -m is given to pcretest, it outputs the size of each compiled +pattern after it has been compiled. It is equivalent to adding /M to each +regular expression. For compatibility with earlier versions of pcretest, -s is +a synonym for -m. + +If the -t option is given, each compile, study, and match is run 20000 times +while being timed, and the resulting time per compile or match is output in +milliseconds. Do not set -t with -s, because you will then get the size output +20000 times and the timing will be distorted. 
If you want to change the number +of repetitions used for timing, edit the definition of LOOPREPEAT at the top of +pcretest.c + + + +The perltest program +-------------------- + +The perltest program tests Perl's regular expressions; it has the same +specification as pcretest, and so can be given identical input, except that +input patterns can be followed only by Perl's lower case options. The contents +of testinput1 and testinput3 meet this condition. + +The data lines are processed as Perl strings, so if they contain $ or @ +characters, these have to be escaped. For this reason, all such characters in +the testinput file are escaped so that it can be used for perltest as well as +for pcretest, and the special upper case options such as /A that pcretest +recognizes are not used in this file. The output should be identical, apart +from the initial identifying banner. + +The testinput2 and testinput4 files are not suitable for feeding to Perltest, +since they do make use of the special upper case options and escapes that +pcretest uses to test some features of PCRE. The first of these files also +contains malformed regular expressions, in order to check that PCRE diagnoses +them correctly. + +Philip Hazel <ph10@cam.ac.uk> +April 1999 diff --git a/ext/pcre/pcrelib/RunTest b/ext/pcre/pcrelib/RunTest new file mode 100755 index 0000000000..a23c51108f --- /dev/null +++ b/ext/pcre/pcrelib/RunTest @@ -0,0 +1,94 @@ +#! /bin/sh + +# Run PCRE tests + +cf=diff + +# Select which tests to run; if no selection, run all + +do1=no +do2=no +do3=no +do4=no + +while [ $# -gt 0 ] ; do + case $1 in + 1) do1=yes;; + 2) do2=yes;; + 3) do3=yes;; + 4) do4=yes;; + *) echo "Unknown test number $1"; exit 1;; + esac + shift +done + +if [ $do1 = no -a $do2 = no -a $do3 = no -a $do4 = no ] ; then + do1=yes + do2=yes + do3=yes + do4=yes +fi + +# Primary test, Perl-compatible + +if [ $do1 = yes ] ; then + echo "Testing main functionality (Perl compatible)" + ./pcretest testinput1 testtry + if [ $? 
= 0 ] ; then + $cf testtry testoutput1 + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi +fi + +# PCRE tests that are not Perl-compatible - API & error tests, mostly + +if [ $do2 = yes ] ; then + echo "Testing API and error handling (not Perl compatible)" + ./pcretest -i testinput2 testtry + if [ $? = 0 ] ; then + $cf testtry testoutput2 + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi +fi + +# Additional Perl-compatible tests for Perl 5.005's new features + +if [ $do3 = yes ] ; then + echo "Testing Perl 5.005 features (Perl 5.005 compatible)" + ./pcretest testinput3 testtry + if [ $? = 0 ] ; then + $cf testtry testoutput3 + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi +fi + +if [ $do1 = yes -a $do2 = yes -a $do3 = yes ] ; then + echo "The three main tests all ran OK" + echo " " +fi + +# Locale-specific tests, provided the "fr" locale is available + +if [ $do4 = yes ] ; then + locale -a | grep '^fr$' >/dev/null + if [ $? -eq 0 ] ; then + echo "Testing locale-specific features (using 'fr' locale)" + ./pcretest testinput4 testtry + if [ $? = 0 ] ; then + $cf testtry testoutput4 + if [ $? != 0 ] ; then exit 1; fi + echo "Locale test ran OK" + echo " " + else exit 1 + fi + else + echo "Cannot test locale-specific features - 'fr' locale not found," + echo "or the \"locale\" command is not available to check for it." + echo " " + fi +fi + +# End diff --git a/ext/pcre/pcrelib/Tech.Notes b/ext/pcre/pcrelib/Tech.Notes new file mode 100644 index 0000000000..d485a4ec59 --- /dev/null +++ b/ext/pcre/pcrelib/Tech.Notes @@ -0,0 +1,239 @@ +Technical Notes about PCRE +-------------------------- + +Many years ago I implemented some regular expression functions to an algorithm +suggested by Martin Richards. These were not Unix-like in form, and were quite +restricted in what they could do by comparison with Perl. The interesting part +about the algorithm was that the amount of space required to hold the compiled +form of an expression was known in advance. 
The code to apply an expression did +not operate by backtracking, as the Henry Spencer and Perl code does, but +instead checked all possibilities simultaneously by keeping a list of current +states and checking all of them as it advanced through the subject string. (In +the terminology of Jeffrey Friedl's book, it was a "DFA algorithm".) When the +pattern was all used up, all remaining states were possible matches, and the +one matching the longest subset of the subject string was chosen. This did not +necessarily maximize the individual wild portions of the pattern, as is +expected in Unix and Perl-style regular expressions. + +By contrast, the code originally written by Henry Spencer and subsequently +heavily modified for Perl actually compiles the expression twice: once in a +dummy mode in order to find out how much store will be needed, and then for +real. The execution function operates by backtracking and maximizing (or, +optionally, minimizing in Perl) the amount of the subject that matches +individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's +terminology. + +For this set of functions that forms PCRE, I tried at first to invent an +algorithm that used an amount of store bounded by a multiple of the number of +characters in the pattern, to save on compiling time. However, because of the +greater complexity in Perl regular expressions, I couldn't do this. In any +case, a first pass through the pattern is needed, in order to find internal +flag settings like (?i) at top level. So it works by running a very degenerate +first pass to calculate a maximum store size, and then a second pass to do the +real compile - which may use a bit less than the predicted amount of store. The +idea is that this is going to turn out faster because the first pass is +degenerate and the second can just store stuff straight into the vector. 
It
+does make the compiling functions bigger, of course, but they have got quite
+big anyway to handle all the Perl stuff.
+
+The compiled form of a pattern is a vector of bytes, containing items of
+variable length. The first byte in an item is an opcode, and the length of the
+item is either implicit in the opcode or contained in the data bytes which
+follow it. A list of all the opcodes follows:
+
+Opcodes with no following data
+------------------------------
+
+These items are all just one byte long
+
+  OP_END                 end of pattern
+  OP_ANY                 match any character
+  OP_SOD                 match start of data: \A
+  OP_CIRC                ^ (start of data, or after \n in multiline)
+  OP_NOT_WORD_BOUNDARY   \B
+  OP_WORD_BOUNDARY       \b
+  OP_NOT_DIGIT           \D
+  OP_DIGIT               \d
+  OP_NOT_WHITESPACE      \S
+  OP_WHITESPACE          \s
+  OP_NOT_WORDCHAR        \W
+  OP_WORDCHAR            \w
+  OP_EODN                match end of data or \n at end: \Z
+  OP_EOD                 match end of data: \z
+  OP_DOLL                $ (end of data, or before \n in multiline)
+
+
+Repeating single characters
+---------------------------
+
+The common repeats (*, +, ?) when applied to a single character appear as
+two-byte items using the following opcodes:
+
+  OP_STAR
+  OP_MINSTAR
+  OP_PLUS
+  OP_MINPLUS
+  OP_QUERY
+  OP_MINQUERY
+
+Those with "MIN" in their name are the minimizing versions. Each is followed by
+the character that is to be repeated. Other repeats make use of
+
+  OP_UPTO
+  OP_MINUPTO
+  OP_EXACT
+
+which are followed by a two-byte count (most significant first) and the
+repeated character. OP_UPTO matches from 0 to the given number. A repeat with a
+non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an
+OP_UPTO (or OP_MINUPTO).
+
+
+Repeating character types
+-------------------------
+
+Repeats of things like \d are done exactly as for single characters, except
+that instead of a character, the opcode for the type is stored in the data
+byte.
The opcodes are: + + OP_TYPESTAR + OP_TYPEMINSTAR + OP_TYPEPLUS + OP_TYPEMINPLUS + OP_TYPEQUERY + OP_TYPEMINQUERY + OP_TYPEUPTO + OP_TYPEMINUPTO + OP_TYPEEXACT + + +Matching a character string +--------------------------- + +The OP_CHARS opcode is followed by a one-byte count and then that number of +characters. If there are more than 255 characters in sequence, successive +instances of OP_CHARS are used. + + +Character classes +----------------- + +OP_CLASS is used for a character class, provided there are at least two +characters in the class. If there is only one character, OP_CHARS is used for a +positive class, and OP_NOT for a negative one (that is, for something like +[^a]). Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a +repeated, negated, single-character class. The normal ones (OP_STAR etc.) are +used for a repeated positive single-character class. + +OP_CLASS is followed by a 32-byte bit map containing a 1 +bit for every character that is acceptable. The bits are counted from the least +significant end of each byte. + + +Back references +--------------- + +OP_REF is followed by a single byte containing the reference number. + + +Repeating character classes and back references +----------------------------------------------- + +Single-character classes are handled specially (see above). This applies to +OP_CLASS and OP_REF. In both cases, the repeat information follows the base +item. The matching code looks at the following opcode to see if it is one of + + OP_CRSTAR + OP_CRMINSTAR + OP_CRPLUS + OP_CRMINPLUS + OP_CRQUERY + OP_CRMINQUERY + OP_CRRANGE + OP_CRMINRANGE + +All but the last two are just single-byte items. The others are followed by +four bytes of data, comprising the minimum and maximum repeat counts. + + +Brackets and alternation +------------------------ + +A pair of non-identifying (round) brackets is wrapped round each expression at +compile time, so alternation always happens in the context of brackets. 
+Non-identifying brackets use the opcode OP_BRA, while identifying brackets use +OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English +speakers, including myself, can be round, square, or curly. Hence this usage.] + +A bracket opcode is followed by two bytes which give the offset to the next +alternative OP_ALT or, if there aren't any branches, to the matching KET +opcode. Each OP_ALT is followed by two bytes giving the offset to the next one, +or to the KET opcode. + +OP_KET is used for subpatterns that do not repeat indefinitely, while +OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or +maximally respectively. All three are followed by two bytes giving (as a +positive number) the offset back to the matching BRA opcode. + +If a subpattern is quantified such that it is permitted to match zero times, it +is preceded by one of OP_BRAZERO or OP_BRAMINZERO. These are single-byte +opcodes which tell the matcher that skipping this subpattern entirely is a +valid branch. + +A subpattern with an indefinite maximum repetition is replicated in the +compiled data its minimum number of times (or once with a BRAZERO if the +minimum is zero), with the final copy terminating with a KETRMIN or KETRMAX as +appropriate. + +A subpattern with a bounded maximum repetition is replicated in a nested +fashion up to the maximum number of times, with BRAZERO or BRAMINZERO before +each replication after the minimum, so that, for example, (abc){2,5} is +compiled as (abc)(abc)((abc)((abc)(abc)?)?)?. The 200-bracket limit does not +apply to these internally generated brackets. + + +Assertions +---------- + +Forward assertions are just like other subpatterns, but starting with one of +the opcodes OP_ASSERT or OP_ASSERT_NOT. 
Backward assertions use the opcodes +OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion +is OP_REVERSE, followed by a two byte count of the number of characters to move +back the pointer in the subject string. A separate count is present in each +alternative of a lookbehind assertion, allowing them to have different fixed +lengths. + + +Once-only subpatterns +--------------------- + +These are also just like other subpatterns, but they start with the opcode +OP_ONCE. + + +Conditional subpatterns +----------------------- + +These are like other subpatterns, but they start with the opcode OP_COND. If +the condition is a back reference, this is stored at the start of the +subpattern using the opcode OP_CREF followed by one byte containing the +reference number. Otherwise, a conditional subpattern will always start with +one of the assertions. + + +Changing options +---------------- + +If any of the /i, /m, or /s options are changed within a parenthesized group, +an OP_OPT opcode is compiled, followed by one byte containing the new settings +of these flags. If there are several alternatives in a group, there is an +occurrence of OP_OPT at the start of all those following the first options +change, to set appropriate options for the start of the alternative. +Immediately after the end of the group there is another such item to reset the +flags to their previous values. Other changes of flag within the pattern can be +handled entirely at compile time, and so do not cause anything to be put into +the compiled data. 
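
The nested replication of bounded repeats described under "Brackets and alternation" can be sketched in a few lines of C. This is an illustration only and is not part of the committed Tech.Notes: expand_bounded() and append() are hypothetical helpers, not PCRE functions, and they operate on pattern text, whereas PCRE itself emits BRAZERO or BRAMINZERO opcodes in the compiled vector.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: append src to dst if the result still fits in a
   buffer of the given size. Returns 1 on success, 0 if it would overflow. */
static int append(char *dst, size_t size, const char *src)
{
    if (strlen(dst) + strlen(src) + 1 > size) return 0;
    strcat(dst, src);
    return 1;
}

/* Hypothetical helper: write the textual equivalent of (sub){min,max} into
   out, following the scheme in the notes: min plain copies, then the
   remaining (max - min) copies nested inside one another, each optional.
   Returns 1 on success, 0 on invalid bounds or a buffer that is too small. */
static int expand_bounded(const char *sub, int min, int max,
                          char *out, size_t size)
{
    int i;
    int optional = max - min;

    if (size == 0 || min < 0 || optional < 0) return 0;
    out[0] = 0;

    /* The compulsory copies are simply laid end to end. */
    for (i = 0; i < min; i++)
        if (!append(out, size, "(") || !append(out, size, sub) ||
            !append(out, size, ")")) return 0;

    /* Every optional copy except the innermost opens a wrapper bracket that
       encloses itself and all the copies that follow it. */
    for (i = 0; i < optional; i++) {
        if (i < optional - 1 && !append(out, size, "(")) return 0;
        if (!append(out, size, "(") || !append(out, size, sub) ||
            !append(out, size, ")")) return 0;
    }

    /* The innermost copy takes a bare "?"; each wrapper is then closed and
       made optional in turn. */
    if (optional > 0 && !append(out, size, "?")) return 0;
    for (i = 0; i < optional - 1; i++)
        if (!append(out, size, ")?")) return 0;

    return 1;
}
```

Called as expand_bounded("abc", 2, 5, buf, sizeof buf), the sketch yields the form quoted in the notes, (abc)(abc)((abc)((abc)(abc)?)?)?. The nesting matters because once an optional copy fails to match, the matcher abandons all the later copies at once rather than trying each one separately.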
+ + +Philip Hazel +January 1999 diff --git a/ext/pcre/pcrelib/chartables.c b/ext/pcre/pcrelib/chartables.c new file mode 100644 index 0000000000..7bd4e7775b --- /dev/null +++ b/ext/pcre/pcrelib/chartables.c @@ -0,0 +1,146 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* This file is automatically written by the makechartables auxiliary +program. If you edit it by hand, you might like to edit the Makefile to +prevent its ever being regenerated. + +This file is #included in the compilation of pcre.c to build the default +character tables which are used when no tables are passed to the compile +function. */ + +static unsigned char pcre_default_tables[] = { + +/* This table is a lower casing table. */ + + 0, 1, 2, 3, 4, 5, 6, 7, + 8, 9, 10, 11, 12, 13, 14, 15, + 16, 17, 18, 19, 20, 21, 22, 23, + 24, 25, 26, 27, 28, 29, 30, 31, + 32, 33, 34, 35, 36, 37, 38, 39, + 40, 41, 42, 43, 44, 45, 46, 47, + 48, 49, 50, 51, 52, 53, 54, 55, + 56, 57, 58, 59, 60, 61, 62, 63, + 64, 97, 98, 99,100,101,102,103, + 104,105,106,107,108,109,110,111, + 112,113,114,115,116,117,118,119, + 120,121,122, 91, 92, 93, 94, 95, + 96, 97, 98, 99,100,101,102,103, + 104,105,106,107,108,109,110,111, + 112,113,114,115,116,117,118,119, + 120,121,122,123,124,125,126,127, + 128,129,130,131,132,133,134,135, + 136,137,138,139,140,141,142,143, + 144,145,146,147,148,149,150,151, + 152,153,154,155,156,157,158,159, + 160,161,162,163,164,165,166,167, + 168,169,170,171,172,173,174,175, + 176,177,178,179,180,181,182,183, + 184,185,186,187,188,189,190,191, + 192,193,194,195,196,197,198,199, + 200,201,202,203,204,205,206,207, + 208,209,210,211,212,213,214,215, + 216,217,218,219,220,221,222,223, + 224,225,226,227,228,229,230,231, + 232,233,234,235,236,237,238,239, + 240,241,242,243,244,245,246,247, + 248,249,250,251,252,253,254,255, + +/* This table is a case flipping table. 
*/ + + 0, 1, 2, 3, 4, 5, 6, 7, + 8, 9, 10, 11, 12, 13, 14, 15, + 16, 17, 18, 19, 20, 21, 22, 23, + 24, 25, 26, 27, 28, 29, 30, 31, + 32, 33, 34, 35, 36, 37, 38, 39, + 40, 41, 42, 43, 44, 45, 46, 47, + 48, 49, 50, 51, 52, 53, 54, 55, + 56, 57, 58, 59, 60, 61, 62, 63, + 64, 97, 98, 99,100,101,102,103, + 104,105,106,107,108,109,110,111, + 112,113,114,115,116,117,118,119, + 120,121,122, 91, 92, 93, 94, 95, + 96, 65, 66, 67, 68, 69, 70, 71, + 72, 73, 74, 75, 76, 77, 78, 79, + 80, 81, 82, 83, 84, 85, 86, 87, + 88, 89, 90,123,124,125,126,127, + 128,129,130,131,132,133,134,135, + 136,137,138,139,140,141,142,143, + 144,145,146,147,148,149,150,151, + 152,153,154,155,156,157,158,159, + 160,161,162,163,164,165,166,167, + 168,169,170,171,172,173,174,175, + 176,177,178,179,180,181,182,183, + 184,185,186,187,188,189,190,191, + 192,193,194,195,196,197,198,199, + 200,201,202,203,204,205,206,207, + 208,209,210,211,212,213,214,215, + 216,217,218,219,220,221,222,223, + 224,225,226,227,228,229,230,231, + 232,233,234,235,236,237,238,239, + 240,241,242,243,244,245,246,247, + 248,249,250,251,252,253,254,255, + +/* This table contains bit maps for digits, 'word' chars, and white +space. Each map is 32 bytes long and the bits run from the least +significant end of each byte. 
*/ + + 0x00,0x00,0x00,0x00,0x00,0x00,0xff,0x03, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x00,0x00,0x00,0x00,0x00,0xff,0x03, + 0xfe,0xff,0xff,0x87,0xfe,0xff,0xff,0x07, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + + 0x00,0x3e,0x00,0x00,0x01,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00 , + +/* This table identifies various classes of character by individual bits: + 0x01 white space character + 0x02 letter + 0x04 decimal digit + 0x08 hexadecimal digit + 0x10 alphanumeric or '_' + 0x80 regular expression metacharacter or binary zero +*/ + + 0x80,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 0- 7 */ + 0x00,0x01,0x01,0x01,0x01,0x01,0x00,0x00, /* 8- 15 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 16- 23 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 24- 31 */ + 0x01,0x00,0x00,0x00,0x80,0x00,0x00,0x00, /* - ' */ + 0x80,0x80,0x80,0x80,0x00,0x00,0x80,0x00, /* ( - / */ + 0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c,0x1c, /* 0 - 7 */ + 0x1c,0x1c,0x00,0x00,0x00,0x00,0x00,0x80, /* 8 - ? 
*/ + 0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* @ - G */ + 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* H - O */ + 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* P - W */ + 0x12,0x12,0x12,0x80,0x00,0x00,0x80,0x10, /* X - _ */ + 0x00,0x1a,0x1a,0x1a,0x1a,0x1a,0x1a,0x12, /* ` - g */ + 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* h - o */ + 0x12,0x12,0x12,0x12,0x12,0x12,0x12,0x12, /* p - w */ + 0x12,0x12,0x12,0x80,0x80,0x00,0x00,0x00, /* x -127 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 128-135 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 136-143 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 144-151 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 152-159 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 160-167 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 168-175 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 176-183 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 184-191 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 192-199 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 200-207 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 208-215 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 216-223 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 224-231 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 232-239 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00, /* 240-247 */ + 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00};/* 248-255 */ + +/* End of chartables.c */ diff --git a/ext/pcre/pcrelib/dftables.c b/ext/pcre/pcrelib/dftables.c new file mode 100644 index 0000000000..729049f23b --- /dev/null +++ b/ext/pcre/pcrelib/dftables.c @@ -0,0 +1,146 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +PCRE is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. 
+ +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-1999 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- + +See the file Tech.Notes for some information on the internals. +*/ + + +/* This is a support program to generate the file chartables.c, containing +character tables of various kinds. They are built according to the default C +locale and used as the default tables by PCRE. Now that pcre_maketables is +a function visible to the outside world, we make use of its code from here in +order to be consistent. */ + +#include <ctype.h> +#include <stdio.h> +#include <string.h> + +#include "internal.h" + +#define DFTABLES /* maketables.c notices this */ +#include "maketables.c" + + +int main(void) +{ +int i; +unsigned const char *tables = pcre_maketables(); + +printf( + "/*************************************************\n" + "* Perl-Compatible Regular Expressions *\n" + "*************************************************/\n\n" + "/* This file is automatically written by the makechartables auxiliary \n" + "program. 
If you edit it by hand, you might like to edit the Makefile to \n" + "prevent its ever being regenerated.\n\n" + "This file is #included in the compilation of pcre.c to build the default\n" + "character tables which are used when no tables are passed to the compile\n" + "function. */\n\n" + "static unsigned char pcre_default_tables[] = {\n\n" + "/* This table is a lower casing table. */\n\n"); + +printf(" "); +for (i = 0; i < 256; i++) + { + if ((i & 7) == 0 && i != 0) printf("\n "); + printf("%3d", *tables++); + if (i != 255) printf(","); + } +printf(",\n\n"); + +printf("/* This table is a case flipping table. */\n\n"); + +printf(" "); +for (i = 0; i < 256; i++) + { + if ((i & 7) == 0 && i != 0) printf("\n "); + printf("%3d", *tables++); + if (i != 255) printf(","); + } +printf(",\n\n"); + +printf( + "/* This table contains bit maps for digits, 'word' chars, and white\n" + "space. Each map is 32 bytes long and the bits run from the least\n" + "significant end of each byte. */\n\n"); + +printf(" "); +for (i = 0; i < cbit_length; i++) + { + if ((i & 7) == 0 && i != 0) + { + if ((i & 31) == 0) printf("\n"); + printf("\n "); + } + printf("0x%02x", *tables++); + if (i != cbit_length - 1) printf(","); + } +printf(" ,\n\n"); + +printf( + "/* This table identifies various classes of character by individual bits:\n" + " 0x%02x white space character\n" + " 0x%02x letter\n" + " 0x%02x decimal digit\n" + " 0x%02x hexadecimal digit\n" + " 0x%02x alphanumeric or '_'\n" + " 0x%02x regular expression metacharacter or binary zero\n*/\n\n", + ctype_space, ctype_letter, ctype_digit, ctype_xdigit, ctype_word, + ctype_meta); + +printf(" "); +for (i = 0; i < 256; i++) + { + if ((i & 7) == 0 && i != 0) + { + printf(" /* "); + if (isprint(i-8)) printf(" %c -", i-8); + else printf("%3d-", i-8); + if (isprint(i-1)) printf(" %c ", i-1); + else printf("%3d", i-1); + printf(" */\n "); + } + printf("0x%02x", *tables++); + if (i != 255) printf(","); + } + +printf("};/* "); +if (isprint(i-8)) 
printf(" %c -", i-8); + else printf("%3d-", i-8); +if (isprint(i-1)) printf(" %c ", i-1); + else printf("%3d", i-1); +printf(" */\n\n/* End of chartables.c */\n"); + +return 0; +} + +/* End of dftables.c */ diff --git a/ext/pcre/pcrelib/get.c b/ext/pcre/pcrelib/get.c new file mode 100644 index 0000000000..035668e301 --- /dev/null +++ b/ext/pcre/pcrelib/get.c @@ -0,0 +1,189 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-1999 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + +/* This module contains some convenience functions for extracting substrings +from the subject string after a regex match has succeeded. 
The original idea +for these functions came from Scott Wimer <scottw@cgibuilder.com>. */ + + +/* Include the internals header, which itself includes Standard C headers plus +the external pcre header. */ + +#include "internal.h" + + + +/************************************************* +* Copy captured string to given buffer * +*************************************************/ + +/* This function copies a single captured substring into a given buffer. +Note that we use memcpy() rather than strncpy() in case there are binary zeros +in the string. + +Arguments: + subject the subject string that was matched + ovector pointer to the offsets table + stringcount the number of substrings that were captured + (i.e. the yield of the pcre_exec call, unless + that was zero, in which case it should be 1/3 + of the offset table size) + stringnumber the number of the required substring + buffer where to put the substring + size the size of the buffer + +Returns: if successful: + the length of the copied string, not including the zero + that is put on the end; can be zero + if not successful: + PCRE_ERROR_NOMEMORY (-6) buffer too small + PCRE_ERROR_NOSUBSTRING (-7) no such captured substring +*/ + +int +pcre_copy_substring(const char *subject, int *ovector, int stringcount, + int stringnumber, char *buffer, int size) +{ +int yield; +if (stringnumber < 0 || stringnumber >= stringcount) + return PCRE_ERROR_NOSUBSTRING; +stringnumber *= 2; +yield = ovector[stringnumber+1] - ovector[stringnumber]; +if (size < yield + 1) return PCRE_ERROR_NOMEMORY; +memcpy(buffer, subject + ovector[stringnumber], yield); +buffer[yield] = 0; +return yield; +} + + + +/************************************************* +* Copy all captured strings to new store * +*************************************************/ + +/* This function gets one chunk of store and builds a list of pointers and all +of the captured substrings in it. A NULL pointer is put on the end of the list. 
+ +Arguments: + subject the subject string that was matched + ovector pointer to the offsets table + stringcount the number of substrings that were captured + (i.e. the yield of the pcre_exec call, unless + that was zero, in which case it should be 1/3 + of the offset table size) + listptr set to point to the list of pointers + +Returns: if successful: 0 + if not successful: + PCRE_ERROR_NOMEMORY (-6) failed to get store +*/ + +int +pcre_get_substring_list(const char *subject, int *ovector, int stringcount, + const char ***listptr) +{ +int i; +int size = sizeof(char *); +int double_count = stringcount * 2; +char **stringlist; +char *p; + +for (i = 0; i < double_count; i += 2) + size += sizeof(char *) + ovector[i+1] - ovector[i] + 1; + +stringlist = (char **)(pcre_malloc)(size); +if (stringlist == NULL) return PCRE_ERROR_NOMEMORY; + +*listptr = (const char **)stringlist; +p = (char *)(stringlist + stringcount + 1); + +for (i = 0; i < double_count; i += 2) + { + int len = ovector[i+1] - ovector[i]; + memcpy(p, subject + ovector[i], len); + *stringlist++ = p; + p += len; + *p++ = 0; + } + +*stringlist = NULL; +return 0; +} + + + +/************************************************* +* Copy captured string to new store * +*************************************************/ + +/* This function copies a single captured substring into a piece of new +store + +Arguments: + subject the subject string that was matched + ovector pointer to the offsets table + stringcount the number of substrings that were captured + (i.e. 
the yield of the pcre_exec call, unless + that was zero, in which case it should be 1/3 + of the offset table size) + stringnumber the number of the required substring + stringptr where to put a pointer to the substring + +Returns: if successful: + the length of the string, not including the zero that + is put on the end; can be zero + if not successful: + PCRE_ERROR_NOMEMORY (-6) failed to get store + PCRE_ERROR_NOSUBSTRING (-7) substring not present +*/ + +int +pcre_get_substring(const char *subject, int *ovector, int stringcount, + int stringnumber, const char **stringptr) +{ +int yield; +char *substring; +if (stringnumber < 0 || stringnumber >= stringcount) + return PCRE_ERROR_NOSUBSTRING; +stringnumber *= 2; +yield = ovector[stringnumber+1] - ovector[stringnumber]; +substring = (char *)(pcre_malloc)(yield + 1); +if (substring == NULL) return PCRE_ERROR_NOMEMORY; +memcpy(substring, subject + ovector[stringnumber], yield); +substring[yield] = 0; +*stringptr = substring; +return yield; +} + +/* End of get.c */ diff --git a/ext/pcre/pcrelib/internal.h b/ext/pcre/pcrelib/internal.h new file mode 100644 index 0000000000..2b28ac1a54 --- /dev/null +++ b/ext/pcre/pcrelib/internal.h @@ -0,0 +1,338 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + + +#define PCRE_VERSION "2.05 21-Apr-1999" + + +/* This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-1999 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. 
This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + +/* This header contains definitions that are shared between the different +modules, but which are not relevant to the outside. */ + +/* To cope with SunOS4 and other systems that lack memmove() but have bcopy(), +define a macro for memmove() if USE_BCOPY is defined. */ + +#ifdef USE_BCOPY +#undef memmove /* some systems may have a macro */ +#define memmove(a, b, c) bcopy(b, a, c) +#endif + +/* Standard C headers plus the external interface definition */ + +#include <ctype.h> +#include <limits.h> +#include <stddef.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include "pcre.h" + +/* In case there is no definition of offsetof() provided - though any proper +Standard C system should have one. */ + +#ifndef offsetof +#define offsetof(p_type,field) ((size_t)&(((p_type *)0)->field)) +#endif + +/* These are the public options that can change during matching. */ + +#define PCRE_IMS (PCRE_CASELESS|PCRE_MULTILINE|PCRE_DOTALL) + +/* Private options flags start at the most significant end of the two bytes. +The public options defined in pcre.h start at the least significant end. Make +sure they don't overlap! 
*/ + +#define PCRE_FIRSTSET 0x8000 /* first_char is set */ +#define PCRE_STARTLINE 0x4000 /* start after \n for multiline */ +#define PCRE_INGROUP 0x2000 /* compiling inside a group */ + +/* Options for the "extra" block produced by pcre_study(). */ + +#define PCRE_STUDY_MAPPED 0x01 /* a map of starting chars exists */ + +/* Masks for identifying the public options which are permitted at compile +time, run time or study time, respectively. */ + +#define PUBLIC_OPTIONS \ + (PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \ + PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY) + +#define PUBLIC_EXEC_OPTIONS (PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL) + +#define PUBLIC_STUDY_OPTIONS 0 /* None defined */ + +/* Magic number to provide a small check against being handed junk. */ + +#define MAGIC_NUMBER 0x50435245UL /* 'PCRE' */ + +/* Miscellaneous definitions */ + +typedef int BOOL; + +#define FALSE 0 +#define TRUE 1 + +/* These are escaped items that aren't just an encoding of a particular data +value such as \n. They must have non-zero values, as check_escape() returns +their negation. Also, they must appear in the same order as in the opcode +definitions below, up to ESC_z. The final one must be ESC_REF as subsequent +values are used for \1, \2, \3, etc. There is a test in the code for an escape +greater than ESC_b and less than ESC_X to detect the types that may be +repeated. If any new escapes are put in-between that don't consume a character, +that code will have to change. */ + +enum { ESC_A = 1, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s, ESC_W, ESC_w, + ESC_Z, ESC_z, ESC_REF }; + +/* Opcode table: OP_BRA must be last, as all values >= it are used for brackets +that extract substrings. Starting from 1 (i.e. after OP_END), the values up to +OP_EOD must correspond in order to the list of escapes immediately above. 
*/ + +enum { + OP_END, /* End of pattern */ + + /* Values corresponding to backslashed metacharacters */ + + OP_SOD, /* Start of data: \A */ + OP_NOT_WORD_BOUNDARY, /* \B */ + OP_WORD_BOUNDARY, /* \b */ + OP_NOT_DIGIT, /* \D */ + OP_DIGIT, /* \d */ + OP_NOT_WHITESPACE, /* \S */ + OP_WHITESPACE, /* \s */ + OP_NOT_WORDCHAR, /* \W */ + OP_WORDCHAR, /* \w */ + OP_EODN, /* End of data or \n at end of data: \Z. */ + OP_EOD, /* End of data: \z */ + + OP_OPT, /* Set runtime options */ + OP_CIRC, /* Start of line - varies with multiline switch */ + OP_DOLL, /* End of line - varies with multiline switch */ + OP_ANY, /* Match any character */ + OP_CHARS, /* Match string of characters */ + OP_NOT, /* Match anything but the following char */ + + OP_STAR, /* The maximizing and minimizing versions of */ + OP_MINSTAR, /* all these opcodes must come in pairs, with */ + OP_PLUS, /* the minimizing one second. */ + OP_MINPLUS, /* This first set applies to single characters */ + OP_QUERY, + OP_MINQUERY, + OP_UPTO, /* From 0 to n matches */ + OP_MINUPTO, + OP_EXACT, /* Exactly n matches */ + + OP_NOTSTAR, /* The maximizing and minimizing versions of */ + OP_NOTMINSTAR, /* all these opcodes must come in pairs, with */ + OP_NOTPLUS, /* the minimizing one second. */ + OP_NOTMINPLUS, /* This first set applies to "not" single characters */ + OP_NOTQUERY, + OP_NOTMINQUERY, + OP_NOTUPTO, /* From 0 to n matches */ + OP_NOTMINUPTO, + OP_NOTEXACT, /* Exactly n matches */ + + OP_TYPESTAR, /* The maximizing and minimizing versions of */ + OP_TYPEMINSTAR, /* all these opcodes must come in pairs, with */ + OP_TYPEPLUS, /* the minimizing one second. These codes must */ + OP_TYPEMINPLUS, /* be in exactly the same order as those above. 
*/ + OP_TYPEQUERY, /* This set applies to character types such as \d */ + OP_TYPEMINQUERY, + OP_TYPEUPTO, /* From 0 to n matches */ + OP_TYPEMINUPTO, + OP_TYPEEXACT, /* Exactly n matches */ + + OP_CRSTAR, /* The maximizing and minimizing versions of */ + OP_CRMINSTAR, /* all these opcodes must come in pairs, with */ + OP_CRPLUS, /* the minimizing one second. These codes must */ + OP_CRMINPLUS, /* be in exactly the same order as those above. */ + OP_CRQUERY, /* These are for character classes and back refs */ + OP_CRMINQUERY, + OP_CRRANGE, /* These are different to the three sets above. */ + OP_CRMINRANGE, + + OP_CLASS, /* Match a character class */ + OP_REF, /* Match a back reference */ + + OP_ALT, /* Start of alternation */ + OP_KET, /* End of group that doesn't have an unbounded repeat */ + OP_KETRMAX, /* These two must remain together and in this */ + OP_KETRMIN, /* order. They are for groups that repeat for ever. */ + + /* The assertions must come before ONCE and COND */ + + OP_ASSERT, /* Positive lookahead */ + OP_ASSERT_NOT, /* Negative lookahead */ + OP_ASSERTBACK, /* Positive lookbehind */ + OP_ASSERTBACK_NOT, /* Negative lookbehind */ + OP_REVERSE, /* Move pointer back - used in lookbehind assertions */ + + /* ONCE and COND must come after the assertions, with ONCE first, as there's + a test for >= ONCE for a subpattern that isn't an assertion. */ + + OP_ONCE, /* Once matched, don't back up into the subpattern */ + OP_COND, /* Conditional group */ + OP_CREF, /* Used to hold an extraction string number */ + + OP_BRAZERO, /* These two must remain together and in this */ + OP_BRAMINZERO, /* order. */ + + OP_BRA /* This and greater values are used for brackets that + extract substrings. */ +}; + +/* The highest extraction number. This is limited by the number of opcodes +left after OP_BRA, i.e. 255 - OP_BRA. We actually set it somewhat lower.
*/ + +#define EXTRACT_MAX 99 + +/* The texts of compile-time error messages are defined as macros here so that +they can be accessed by the POSIX wrapper and converted into error codes. Yes, +I could have used error codes in the first place, but didn't feel like changing +just to accommodate the POSIX wrapper. */ + +#define ERR1 "\\ at end of pattern" +#define ERR2 "\\c at end of pattern" +#define ERR3 "unrecognized character follows \\" +#define ERR4 "numbers out of order in {} quantifier" +#define ERR5 "number too big in {} quantifier" +#define ERR6 "missing terminating ] for character class" +#define ERR7 "invalid escape sequence in character class" +#define ERR8 "range out of order in character class" +#define ERR9 "nothing to repeat" +#define ERR10 "operand of unlimited repeat could match the empty string" +#define ERR11 "internal error: unexpected repeat" +#define ERR12 "unrecognized character after (?" +#define ERR13 "too many capturing parenthesized sub-patterns" +#define ERR14 "missing )" +#define ERR15 "back reference to non-existent subpattern" +#define ERR16 "erroffset passed as NULL" +#define ERR17 "unknown option bit(s) set" +#define ERR18 "missing ) after comment" +#define ERR19 "too many sets of parentheses" +#define ERR20 "regular expression too large" +#define ERR21 "failed to get memory" +#define ERR22 "unmatched parentheses" +#define ERR23 "internal error: code overflow" +#define ERR24 "unrecognized character after (?<" +#define ERR25 "lookbehind assertion is not fixed length" +#define ERR26 "malformed number after (?(" +#define ERR27 "conditional group contains more than two branches" +#define ERR28 "assertion expected after (?(" + +/* All character handling must be done as unsigned characters. Otherwise there +are problems with top-bit-set characters and functions such as isspace(). +However, we leave the interface to the outside world as char *, because that +should make things easier for callers. 
We define a short type for unsigned char +to save lots of typing. I tried "uchar", but it causes problems on Digital +Unix, where it is defined in sys/types, so use "uschar" instead. */ + +typedef unsigned char uschar; + +/* The real format of the start of the pcre block; the actual code vector +runs on as long as necessary after the end. */ + +typedef struct real_pcre { + unsigned long int magic_number; + const unsigned char *tables; + unsigned short int options; + unsigned char top_bracket; + unsigned char top_backref; + unsigned char first_char; + unsigned char code[1]; +} real_pcre; + +/* The real format of the extra block returned by pcre_study(). */ + +typedef struct real_pcre_extra { + unsigned char options; + unsigned char start_bits[32]; +} real_pcre_extra; + + +/* Structure for passing "static" information around between the functions +doing the compiling, so that they are thread-safe. */ + +typedef struct compile_data { + const uschar *lcc; /* Points to lower casing table */ + const uschar *fcc; /* Points to case-flipping table */ + const uschar *cbits; /* Points to character type table */ + const uschar *ctypes; /* Points to table of type maps */ +} compile_data; + +/* Structure for passing "static" information around between the functions +doing the matching, so that they are thread-safe.
*/ + +typedef struct match_data { + int errorcode; /* As it says */ + int *offset_vector; /* Offset vector */ + int offset_end; /* One past the end */ + int offset_max; /* The maximum usable for return data */ + const uschar *lcc; /* Points to lower casing table */ + const uschar *ctypes; /* Points to table of type maps */ + BOOL offset_overflow; /* Set if too many extractions */ + BOOL notbol; /* NOTBOL flag */ + BOOL noteol; /* NOTEOL flag */ + BOOL endonly; /* Dollar not before final \n */ + const uschar *start_subject; /* Start of the subject string */ + const uschar *end_subject; /* End of the subject string */ + const uschar *end_match_ptr; /* Subject position at end match */ + int end_offset_top; /* Highwater mark at end of match */ +} match_data; + +/* Bit definitions for entries in the pcre_ctypes table. */ + +#define ctype_space 0x01 +#define ctype_letter 0x02 +#define ctype_digit 0x04 +#define ctype_xdigit 0x08 +#define ctype_word 0x10 /* alphameric or '_' */ +#define ctype_meta 0x80 /* regexp meta char or zero (end pattern) */ + +/* Offsets for the bitmap tables in pcre_cbits. Each table contains a set +of bits for a class map. */ + +#define cbit_digit 0 /* for \d */ +#define cbit_word 32 /* for \w */ +#define cbit_space 64 /* for \s */ +#define cbit_length 96 /* Length of the cbits table */ + +/* Offsets of the various tables from the base tables pointer, and +total length. 
*/ + +#define lcc_offset 0 +#define fcc_offset 256 +#define cbits_offset 512 +#define ctypes_offset (cbits_offset + cbit_length) +#define tables_length (ctypes_offset + 256) + +/* End of internal.h */ diff --git a/ext/pcre/pcrelib/maketables.c b/ext/pcre/pcrelib/maketables.c new file mode 100644 index 0000000000..1b764556a2 --- /dev/null +++ b/ext/pcre/pcrelib/maketables.c @@ -0,0 +1,113 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +PCRE is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. + +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-1999 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- + +See the file Tech.Notes for some information on the internals. +*/ + + +/* This file is compiled on its own as part of the PCRE library. However, +it is also included in the compilation of dftables.c, in which case the macro +DFTABLES is defined. 
*/ + +#ifndef DFTABLES +#include "internal.h" +#endif + + + +/************************************************* +* Create PCRE character tables * +*************************************************/ + +/* This function builds a set of character tables for use by PCRE and returns +a pointer to them. They are built using the ctype functions, and consequently +their contents will depend upon the current locale setting. When compiled as +part of the library, the store is obtained via pcre_malloc(), but when compiled +inside dftables, use malloc(). + +Arguments: none +Returns: pointer to the contiguous block of data +*/ + +unsigned const char * +pcre_maketables(void) +{ +unsigned char *yield, *p; +int i; + +#ifndef DFTABLES +yield = (pcre_malloc)(tables_length); +#else +yield = malloc(tables_length); +#endif + +if (yield == NULL) return NULL; +p = yield; + +/* First comes the lower casing table */ + +for (i = 0; i < 256; i++) *p++ = tolower(i); + +/* Next the case-flipping table */ + +for (i = 0; i < 256; i++) *p++ = islower(i)?
toupper(i) : tolower(i); + +/* Then the character class tables */ + +memset(p, 0, cbit_length); +for (i = 0; i < 256; i++) + { + if (isdigit(i)) p[cbit_digit + i/8] |= 1 << (i&7); + if (isalnum(i) || i == '_') + p[cbit_word + i/8] |= 1 << (i&7); + if (isspace(i)) p[cbit_space + i/8] |= 1 << (i&7); + } +p += cbit_length; + +/* Finally, the character type table */ + +for (i = 0; i < 256; i++) + { + int x = 0; + if (isspace(i)) x += ctype_space; + if (isalpha(i)) x += ctype_letter; + if (isdigit(i)) x += ctype_digit; + if (isxdigit(i)) x += ctype_xdigit; + if (isalnum(i) || i == '_') x += ctype_word; + if (strchr("*+?{^.$|()[", i) != 0) x += ctype_meta; + *p++ = x; + } + +return yield; +} + +/* End of maketables.c */ diff --git a/ext/pcre/pcrelib/pcre.3 b/ext/pcre/pcrelib/pcre.3 new file mode 100644 index 0000000000..ec356e13d3 --- /dev/null +++ b/ext/pcre/pcrelib/pcre.3 @@ -0,0 +1,1393 @@ +.TH PCRE 3 +.SH NAME +pcre - Perl-compatible regular expressions. +.SH SYNOPSIS +.B #include <pcre.h> +.PP +.SM +.br +.B pcre *pcre_compile(const char *\fIpattern\fR, int \fIoptions\fR, +.ti +5n +.B const char **\fIerrptr\fR, int *\fIerroffset\fR, +.ti +5n +.B const unsigned char *\fItableptr\fR); +.PP +.br +.B pcre_extra *pcre_study(const pcre *\fIcode\fR, int \fIoptions\fR, +.ti +5n +.B const char **\fIerrptr\fR); +.PP +.br +.B int pcre_exec(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR," +.ti +5n +.B "const char *\fIsubject\fR," int \fIlength\fR, int \fIoptions\fR, +.ti +5n +.B int *\fIovector\fR, int \fIovecsize\fR); +.PP +.br +.B int pcre_copy_substring(const char *\fIsubject\fR, int *\fIovector\fR, +.ti +5n +.B int \fIstringcount\fR, int \fIstringnumber\fR, char *\fIbuffer\fR, +.ti +5n +.B int \fIbuffersize\fR); +.PP +.br +.B int pcre_get_substring(const char *\fIsubject\fR, int *\fIovector\fR, +.ti +5n +.B int \fIstringcount\fR, int \fIstringnumber\fR, +.ti +5n +.B const char **\fIstringptr\fR); +.PP +.br +.B int pcre_get_substring_list(const char *\fIsubject\fR, 
+.ti +5n +.B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);" +.PP +.br +.B const unsigned char *pcre_maketables(void); +.PP +.br +.B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int +.B *\fIfirstcharptr\fR); +.PP +.br +.B char *pcre_version(void); +.PP +.br +.B void *(*pcre_malloc)(size_t); +.PP +.br +.B void (*pcre_free)(void *); + + + +.SH DESCRIPTION +The PCRE library is a set of functions that implement regular expression +pattern matching using the same syntax and semantics as Perl 5, with just a few +differences (see below). The current implementation corresponds to Perl 5.005. + +PCRE has its own native API, which is described in this man page. There is also +a set of wrapper functions that correspond to the POSIX API. See +\fBpcreposix (3)\fR. + +The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR +are used for compiling and matching regular expressions, while +\fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and +\fBpcre_get_substring_list()\fR are convenience functions for extracting +captured substrings from a matched subject string. The function +\fBpcre_maketables()\fR is used (optionally) to build a set of character tables +in the current locale for passing to \fBpcre_compile()\fR. + +The function \fBpcre_info()\fR is used to find out information about a compiled +pattern, while the function \fBpcre_version()\fR returns a pointer to a string +containing the version of PCRE and its date of release. + +The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain +the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions +respectively. PCRE calls the memory management functions via these variables, +so a calling program can replace them if it wishes to intercept the calls. This +should be done before calling any PCRE functions. 
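The replacement of the memory management entry points described above can be sketched as follows. This is only an illustration under stated assumptions: the two function pointers are declared locally as stand-ins so that the fragment compiles on its own (a real program gets them from pcre.h and must not redefine them), and counting_malloc, counting_free, live_blocks and install_counting_hooks are hypothetical names that are not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-ins for the library's globals (assumption: in a real program
   these come from pcre.h and these two lines are omitted). */
static void *(*pcre_malloc)(size_t) = malloc;
static void (*pcre_free)(void *) = free;

/* Hypothetical counter of blocks currently held. */
static long live_blocks = 0;

/* Wrapper that counts successful allocations. */
static void *counting_malloc(size_t size)
{
void *block = malloc(size);
if (block != NULL) live_blocks++;
return block;
}

/* Wrapper that counts releases; freeing NULL is a no-op. */
static void counting_free(void *block)
{
if (block != NULL) live_blocks--;
free(block);
}

/* Install the wrappers. As the text says, do this before calling any
   PCRE function, so that every block is released by the matching hook. */
static void install_counting_hooks(void)
{
assert(live_blocks == 0);
pcre_malloc = counting_malloc;
pcre_free = counting_free;
}
```

After install_counting_hooks() has run, any call made through (pcre_malloc)(n) and (pcre_free)(p) passes through the wrappers, so live_blocks returns to zero once every block has been freed.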
+ + +.SH MULTI-THREADING +The PCRE functions can be used in multi-threading applications, with the +proviso that the memory management functions pointed to by \fBpcre_malloc\fR +and \fBpcre_free\fR are shared by all threads. + +The compiled form of a regular expression is not altered during matching, so +the same compiled pattern can safely be used by several threads at once. + + +.SH COMPILING A PATTERN +The function \fBpcre_compile()\fR is called to compile a pattern into an +internal form. The pattern is a C string terminated by a binary zero, and +is passed in the argument \fIpattern\fR. A pointer to a single block of memory +that is obtained via \fBpcre_malloc\fR is returned. This contains the +compiled code and related data. The \fBpcre\fR type is defined for this for +convenience, but in fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the +contents of the block are not externally defined. It is up to the caller to +free the memory when it is no longer required. +.PP +The size of a compiled pattern is roughly proportional to the length of the +pattern string, except that each character class (other than those containing +just a single character, negated or not) requires 33 bytes, and repeat +quantifiers with a minimum greater than one or a bounded maximum cause the +relevant portions of the compiled pattern to be replicated. +.PP +The \fIoptions\fR argument contains independent bits that affect the +compilation. It should be zero if no options are required. Some of the options, +in particular, those that are compatible with Perl, can also be set and unset +from within the pattern (see the detailed description of regular expressions +below). For these options, the contents of the \fIoptions\fR argument specifies +their initial settings at the start of compilation and execution. The +PCRE_ANCHORED option can be set at the time of matching as well as at compile +time. +.PP +If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately. 
+Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns +NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual +error message. The offset from the start of the pattern to the character where +the error was discovered is placed in the variable pointed to by +\fIerroffset\fR, which must not be NULL. If it is, an immediate error is given. +.PP +If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of +character tables which are built when it is compiled, using the default C +locale. Otherwise, \fItableptr\fR must be the result of a call to +\fBpcre_maketables()\fR. See the section on locale support below. +.PP +The following option bits are defined in the header file: + + PCRE_ANCHORED + +If this bit is set, the pattern is forced to be "anchored", that is, it is +constrained to match only at the start of the string which is being searched +(the "subject string"). This effect can also be achieved by appropriate +constructs in the pattern itself, which is the only way to do it in Perl. + + PCRE_CASELESS + +If this bit is set, letters in the pattern match both upper and lower case +letters. It is equivalent to Perl's /i option. + + PCRE_DOLLAR_ENDONLY + +If this bit is set, a dollar metacharacter in the pattern matches only at the +end of the subject string. Without this option, a dollar also matches +immediately before the final character if it is a newline (but not before any +other newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is +set. There is no equivalent to this option in Perl. + + PCRE_DOTALL + +If this bit is set, a dot metacharacter in the pattern matches all characters, +including newlines. Without it, newlines are excluded. This option is +equivalent to Perl's /s option. A negative class such as [^a] always matches a +newline character, independent of the setting of this option.
+ + PCRE_EXTENDED + +If this bit is set, whitespace data characters in the pattern are totally +ignored except when escaped or inside a character class, and characters between +an unescaped # outside a character class and the next newline character, +inclusive, are also ignored. This is equivalent to Perl's /x option, and makes +it possible to include comments inside complicated patterns. Note, however, +that this applies only to data characters. Whitespace characters may never +appear within special character sequences in a pattern, for example within the +sequence (?( which introduces a conditional subpattern. + + PCRE_EXTRA + +This option turns on additional functionality of PCRE that is incompatible with +Perl. Any backslash in a pattern that is followed by a letter that has no +special meaning causes an error, thus reserving these combinations for future +expansion. By default, as in Perl, a backslash followed by a letter with no +special meaning is treated as a literal. There are at present no other features +controlled by this option. + + PCRE_MULTILINE + +By default, PCRE treats the subject string as consisting of a single "line" of +characters (even if it actually contains several newlines). The "start of line" +metacharacter (^) matches only at the start of the string, while the "end of +line" metacharacter ($) matches only at the end of the string, or before a +terminating newline (unless PCRE_DOLLAR_ENDONLY is set). This is the same as +Perl. + +When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs +match immediately following or immediately before any newline in the subject +string, respectively, as well as at the very start and end. This is equivalent +to Perl's /m option. If there are no "\\n" characters in a subject string, or +no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no +effect.
+ + PCRE_UNGREEDY + +This option inverts the "greediness" of the quantifiers so that they are not +greedy by default, but become greedy if followed by "?". It is not compatible +with Perl. It can also be set by a (?U) option setting within the pattern. + + +.SH STUDYING A PATTERN +When a pattern is going to be used several times, it is worth spending more +time analyzing it in order to speed up the time taken for matching. The +function \fBpcre_study()\fR takes a pointer to a compiled pattern as its first +argument, and returns a pointer to a \fBpcre_extra\fR block (another \fBvoid\fR +typedef) containing additional information about the pattern; this can be +passed to \fBpcre_exec()\fR. If no additional information is available, NULL +is returned. + +The second argument contains option bits. At present, no options are defined +for \fBpcre_study()\fR, and this argument should always be zero. + +The third argument for \fBpcre_study()\fR is a pointer to an error message. If +studying succeeds (even if no data is returned), the variable it points to is +set to NULL. Otherwise it points to a textual error message. + +At present, studying a pattern is useful only for non-anchored patterns that do +not have a single fixed starting character. A bitmap of possible starting +characters is created. + + +.SH LOCALE SUPPORT +PCRE handles caseless matching, and determines whether characters are letters, +digits, or whatever, by reference to a set of tables. The library contains a +default set of tables which is created in the default C locale when PCRE is +compiled. This is used when the final argument of \fBpcre_compile()\fR is NULL, +and is sufficient for many applications. + +An alternative set of tables can, however, be supplied. Such tables are built +by calling the \fBpcre_maketables()\fR function, which has no arguments, in the +relevant locale. The result can then be passed to \fBpcre_compile()\fR as often +as necessary.
For example, to build and use tables that are appropriate for the +French locale (where accented characters with codes greater than 128 are +treated as letters), the following code could be used: + + setlocale(LC_CTYPE, "fr"); + tables = pcre_maketables(); + re = pcre_compile(..., tables); + +The tables are built in memory that is obtained via \fBpcre_malloc\fR. The +pointer that is passed to \fBpcre_compile\fR is saved with the compiled +pattern, and the same tables are used via this pointer by \fBpcre_study()\fR +and \fBpcre_exec()\fR. Thus for any single pattern, compilation, studying and +matching all happen in the same locale, but different patterns can be compiled +in different locales. It is the caller's responsibility to ensure that the +memory containing the tables remains available for as long as it is needed. + + +.SH INFORMATION ABOUT A PATTERN +The \fBpcre_info()\fR function returns information about a compiled pattern. +Its yield is the number of capturing subpatterns, or one of the following +negative numbers: + + PCRE_ERROR_NULL the argument \fIcode\fR was NULL + PCRE_ERROR_BADMAGIC the "magic number" was not found + +If the \fIoptptr\fR argument is not NULL, a copy of the options with which the +pattern was compiled is placed in the integer it points to. These option bits +are those specified in the call to \fBpcre_compile()\fR, modified by any +top-level option settings within the pattern itself, and with the PCRE_ANCHORED +bit set if the form of the pattern implies that it can match only at the start +of a subject string. + +If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL, +it is used to pass back information about the first character of any matched +string. If there is a fixed first character, e.g. from a pattern such as +(cat|cow|coyote), then it is returned in the integer pointed to by +\fIfirstcharptr\fR.
Otherwise, if either + + (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch + starts with "^", or + + (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set + (if it were set, the pattern would be anchored), + +then -1 is returned, indicating that the pattern matches only at the +start of a subject string or after any "\\n" within the string. Otherwise -2 is +returned. + + +.SH MATCHING A PATTERN +The function \fBpcre_exec()\fR is called to match a subject string against a +pre-compiled pattern, which is passed in the \fIcode\fR argument. If the +pattern has been studied, the result of the study should be passed in the +\fIextra\fR argument. Otherwise this must be NULL. + +The subject string is passed as a pointer in \fIsubject\fR and a length in +\fIlength\fR. Unlike the pattern string, it may contain binary zero characters. + +The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose +unused bits must be zero. However, if a pattern was compiled with +PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it +cannot be made unanchored at matching time. + +There are also two further options that can be set only at matching time: + + PCRE_NOTBOL + +The first character of the string is not the beginning of a line, so the +circumflex metacharacter should not match before it. Setting this without +PCRE_MULTILINE (at compile time) causes circumflex never to match. + + PCRE_NOTEOL + +The end of the string is not the end of a line, so the dollar metacharacter +should not match it nor (except in multiline mode) a newline immediately before +it. Setting this without PCRE_MULTILINE (at compile time) causes dollar never +to match. + +In general, a pattern matches a certain portion of the subject, and in +addition, further substrings from the subject may be picked out by parts of the +pattern.
Following the usage in Jeffrey Friedl's book, this is called +"capturing" in what follows, and the phrase "capturing subpattern" is used for +a fragment of a pattern that picks out a substring. PCRE supports several other +kinds of parenthesized subpattern that do not cause substrings to be captured. + +Captured substrings are returned to the caller via a vector of integer offsets +whose address is passed in \fIovector\fR. The number of elements in the vector +is passed in \fIovecsize\fR. The first two-thirds of the vector is used to pass +back captured substrings, each substring using a pair of integers. The +remaining third of the vector is used as workspace by \fBpcre_exec()\fR while +matching capturing subpatterns, and is not available for passing back +information. The length passed in \fIovecsize\fR should always be a multiple of +three. If it is not, it is rounded down. + +When a match has been successful, information about captured substrings is +returned in pairs of integers, starting at the beginning of \fIovector\fR, and +continuing up to two-thirds of its length at the most. The first element of a +pair is set to the offset of the first character in a substring, and the second +is set to the offset of the first character after the end of a substring. The +first pair, \fIovector[0]\fR and \fIovector[1]\fR, identify the portion of the +subject string matched by the entire pattern. The next pair is used for the +first capturing subpattern, and so on. The value returned by \fBpcre_exec()\fR +is the number of pairs that have been set. If there are no capturing +subpatterns, the return value from a successful match is 1, indicating that +just the first pair of offsets has been set. + +Some convenience functions are provided for extracting the captured substrings +as separate strings. These are described in the following section. 
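The layout of the offset vector described above can be illustrated with a small sketch that reads one pair of offsets and copies the corresponding substring. It is not the library's own code: copy_capture is a hypothetical helper written for this illustration, and it assumes that \fIrc\fR is a positive value returned by \fBpcre_exec()\fR and that \fIovector\fR was filled in by that call. Treating a negative offset as "no substring" is an extra guard added here.

```c
#include <assert.h>
#include <string.h>

/* Copy captured substring number n (0 is the whole match) from subject
   into buffer, using the pair ovector[2*n], ovector[2*n+1].
   rc is the positive value that pcre_exec() returned, that is, the
   number of pairs that were set.
   Returns the length copied, or -1 if n is out of range, the pair is
   unset, or the buffer is too small. */
static int copy_capture(const char *subject, const int *ovector, int rc,
  int n, char *buffer, int buffersize)
{
int start, end, len;
assert(subject != NULL && ovector != NULL && buffer != NULL);
if (n < 0 || n >= rc) return -1;
start = ovector[2*n];          /* offset of first character */
end = ovector[2*n + 1];        /* offset just past the last character */
if (start < 0 || end < start) return -1;
len = end - start;
if (len + 1 > buffersize) return -1;
memcpy(buffer, subject + start, len);
buffer[len] = 0;
return len;
}
```

For instance, if the subject "key=value" had been matched by a pattern with two capturing subpatterns around "key" and "value", the vector would begin 0, 9, 0, 3, 4, 9 and the call would have returned 3; copy_capture() with n set to 2 would then yield "value".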
+ +It is possible for a capturing subpattern number \fIn+1\fR to match some +part of the subject when subpattern \fIn\fR has not been used at all. For +example, if the string "abc" is matched against the pattern (a|(z))(bc) +subpatterns 1 and 3 are matched, but 2 is not. When this happens, both offset +values corresponding to the unused subpattern are set to -1. + +If a capturing subpattern is matched repeatedly, it is the last portion of the +string that it matched that gets returned. + +If the vector is too small to hold all the captured substrings, it is used as +far as possible (up to two-thirds of its length), and the function returns a +value of zero. In particular, if the substring offsets are not of interest, +\fBpcre_exec()\fR may be called with \fIovector\fR passed as NULL and +\fIovecsize\fR as zero. However, if the pattern contains back references and +the \fIovector\fR isn't big enough to remember the related substrings, PCRE has +to get additional memory for use during matching. Thus it is usually advisable +to supply an \fIovector\fR. + +Note that \fBpcre_info()\fR can be used to find out how many capturing +subpatterns there are in a compiled pattern. The smallest size for +\fIovector\fR that will allow for \fIn\fR captured substrings in addition to +the offsets of the substring matched by the whole pattern is (\fIn\fR+1)*3. + +If \fBpcre_exec()\fR fails, it returns a negative number. The following are +defined in the header file: + + PCRE_ERROR_NOMATCH (-1) + +The subject string did not match the pattern. + + PCRE_ERROR_NULL (-2) + +Either \fIcode\fR or \fIsubject\fR was passed as NULL, or \fIovector\fR was +NULL and \fIovecsize\fR was not zero. + + PCRE_ERROR_BADOPTION (-3) + +An unrecognized bit was set in the \fIoptions\fR argument. + + PCRE_ERROR_BADMAGIC (-4) + +PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch +the case when it is passed a junk pointer.
This is the error it gives when the +magic number isn't present. + + PCRE_ERROR_UNKNOWN_NODE (-5) + +While running the pattern match, an unknown item was encountered in the +compiled pattern. This error could be caused by a bug in PCRE or by overwriting +of the compiled pattern. + + PCRE_ERROR_NOMEMORY (-6) + +If a pattern contains back references, but the \fIovector\fR that is passed to +\fBpcre_exec()\fR is not big enough to remember the referenced substrings, PCRE +gets a block of memory at the start of matching to use for this purpose. If the +call via \fBpcre_malloc()\fR fails, this error is given. The memory is freed at +the end of matching. + + +.SH EXTRACTING CAPTURED SUBSTRINGS +Captured substrings can be accessed directly by using the offsets returned by +\fBpcre_exec()\fR in \fIovector\fR. For convenience, the functions +\fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and +\fBpcre_get_substring_list()\fR are provided for extracting captured substrings +as new, separate, zero-terminated strings. A substring that contains a binary +zero is correctly extracted and has a further zero added on the end, but the +result does not, of course, function as a C string. + +The first three arguments are the same for all three functions: \fIsubject\fR +is the subject string which has just been successfully matched, \fIovector\fR +is a pointer to the vector of integer offsets that was passed to +\fBpcre_exec()\fR, and \fIstringcount\fR is the number of substrings that +were captured by the match, including the substring that matched the entire +regular expression. This is the value returned by \fBpcre_exec\fR if it +is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it +ran out of space in \fIovector\fR, then the value passed as +\fIstringcount\fR should be the size of the vector divided by three. 
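The two pieces of arithmetic stated so far, the smallest usable vector size of (n+1)*3 and the choice of stringcount from the value returned by pcre_exec(), might be written as the small helpers below. Both function names are hypothetical and are not part of PCRE; the sketch only restates the rules given in the text.

```c
#include <assert.h>

/* Hypothetical helper: smallest ovector size, in integers, that can hold
 * n_captures captured substrings plus the offsets of the whole match. */
int ovector_size_for(int n_captures)
{
    assert(n_captures >= 0);
    return (n_captures + 1) * 3;
}

/* Hypothetical helper: value to pass as stringcount, given the value rc
 * returned by pcre_exec() and the ovecsize that was passed to it.
 * Returns -1 when rc is negative, that is, when the match failed. */
int stringcount_from(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;    /* vector was too small: use its capacity */
    return -1;                  /* error code: nothing to extract */
}
```

Integer division rounds down, which agrees with the statement that an ovecsize that is not a multiple of three is rounded down.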
+ +The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR +extract a single substring, whose number is given as \fIstringnumber\fR. A +value of zero extracts the substring that matched the entire pattern, while +higher values extract the captured substrings. For \fBpcre_copy_substring()\fR, +the string is placed in \fIbuffer\fR, whose length is given by +\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of store is +obtained via \fBpcre_malloc\fR, and its address is returned via +\fIstringptr\fR. The yield of the function is the length of the string, not +including the terminating zero, or one of + + PCRE_ERROR_NOMEMORY (-6) + +The buffer was too small for \fBpcre_copy_substring()\fR, or the attempt to get +memory failed for \fBpcre_get_substring()\fR. + + PCRE_ERROR_NOSUBSTRING (-7) + +There is no substring whose number is \fIstringnumber\fR. + +The \fBpcre_get_substring_list()\fR function extracts all available substrings +and builds a list of pointers to them. All this is done in a single block of +memory which is obtained via \fBpcre_malloc\fR. The address of the memory block +is returned via \fIlistptr\fR, which is also the start of the list of string +pointers. The end of the list is marked by a NULL pointer. The yield of the +function is zero if all went well, or + + PCRE_ERROR_NOMEMORY (-6) + +if the attempt to get the memory block failed. + +When any of these functions encounter a substring that is unset, which can +happen when capturing subpattern number \fIn+1\fR matches some part of the +subject, but subpattern \fIn\fR has not been used at all, they return an empty +string. This can be distinguished from a genuine zero-length substring by +inspecting the appropriate offset in \fIovector\fR, which is negative for unset +substrings. + + + +.SH LIMITATIONS +There are some size limitations in PCRE but it is hoped that they will never in +practice be relevant. +The maximum length of a compiled pattern is 65539 (sic) bytes. 
+All values in repeating quantifiers must be less than 65536. +The maximum number of capturing subpatterns is 99. +The maximum number of all parenthesized subpatterns, including capturing +subpatterns, assertions, and other types of subpattern, is 200. + +The maximum length of a subject string is the largest positive number that an +integer variable can hold. However, PCRE uses recursion to handle subpatterns +and indefinite repetition. This means that the available stack space may limit +the size of a subject string that can be processed by certain patterns. + + +.SH DIFFERENCES FROM PERL +The differences described here are with respect to Perl 5.005. + +1. By default, a whitespace character is any character that the C library +function \fBisspace()\fR recognizes, though it is possible to compile PCRE with +alternative character type tables. Normally \fBisspace()\fR matches space, +formfeed, newline, carriage return, horizontal tab, and vertical tab. Perl 5 +no longer includes vertical tab in its set of whitespace characters. The \\v +escape that was in the Perl documentation for a long time was never in fact +recognized. However, the character itself was treated as whitespace at least +up to 5.002. In 5.004 and 5.005 it does not match \\s. + +2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits +them, but they do not mean what you might think. For example, (?!a){3} does +not assert that the next three characters are not "a". It just asserts that the +next character is not "a" three times. + +3. Capturing subpatterns that occur inside negative lookahead assertions are +counted, but their entries in the offsets vector are never set. Perl sets its +numerical variables from any such patterns that are matched before the +assertion fails to match something (thereby succeeding), but only if the +negative lookahead assertion contains just one branch. + +4. 
Though binary zero characters are supported in the subject string, they are +not allowed in a pattern string because it is passed as a normal C string, +terminated by zero. The escape sequence "\\0" can be used in the pattern to +represent a binary zero. + +5. The following Perl escape sequences are not supported: \\l, \\u, \\L, \\U, +\\E, \\Q. In fact these are implemented by Perl's general string-handling and +are not part of its pattern matching engine. + +6. The Perl \\G assertion is not supported as it is not relevant to single +pattern matches. + +7. Fairly obviously, PCRE does not support the (?{code}) construction. + +8. There are at the time of writing some oddities in Perl 5.005_02 concerned +with the settings of captured strings when part of a pattern is repeated. For +example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value +"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if +the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set. + +In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the +future Perl changes to a consistent state that is different, PCRE may change to +follow. + +9. Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern +/^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. +However, in both Perl and PCRE /^(a)?a/ matched against "a" leaves $1 unset. + +10. PCRE provides some extensions to the Perl regular expression facilities: + +(a) Although lookbehind assertions must match fixed length strings, each +alternative branch of a lookbehind assertion can match a different length of +string. Perl 5.005 requires them all to have the same length. + +(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ meta- +character matches only at the very end of the string. + +(c) If PCRE_EXTRA is set, a backslash followed by a letter with no special +meaning is faulted. 
+ +(d) If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is +inverted, that is, by default they are not greedy, but if followed by a +question mark they are. + + +.SH REGULAR EXPRESSION DETAILS +The syntax and semantics of the regular expressions supported by PCRE are +described below. Regular expressions are also described in the Perl +documentation and in a number of other books, some of which have copious +examples. Jeffrey Friedl's "Mastering Regular Expressions", published by +O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description +here is intended as reference documentation. + +A regular expression is a pattern that is matched against a subject string from +left to right. Most characters stand for themselves in a pattern, and match the +corresponding characters in the subject. As a trivial example, the pattern + + The quick brown fox + +matches a portion of a subject string that is identical to itself. The power of +regular expressions comes from the ability to include alternatives and +repetitions in the pattern. These are encoded in the pattern by the use of +\fImeta-characters\fR, which do not stand for themselves but instead are +interpreted in some special way. + +There are two different sets of meta-characters: those that are recognized +anywhere in the pattern except within square brackets, and those that are +recognized in square brackets. Outside square brackets, the meta-characters are +as follows: + + \\ general escape character with several uses + ^ assert start of subject (or line, in multiline mode) + $ assert end of subject (or line, in multiline mode) + . match any character except newline (by default) + [ start character class definition + | start of alternative branch + ( start subpattern + ) end subpattern + ? 
extends the meaning of ( + also 0 or 1 quantifier + also quantifier minimizer + * 0 or more quantifier + + 1 or more quantifier + { start min/max quantifier + +Part of a pattern that is in square brackets is called a "character class". In +a character class the only meta-characters are: + + \\ general escape character + ^ negate the class, but only if the first character + - indicates character range + ] terminates the character class + +The following sections describe the use of each of the meta-characters. + + +.SH BACKSLASH +The backslash character has several uses. Firstly, if it is followed by a +non-alphameric character, it takes away any special meaning that character may +have. This use of backslash as an escape character applies both inside and +outside character classes. + +For example, if you want to match a "*" character, you write "\\*" in the +pattern. This applies whether or not the following character would otherwise be +interpreted as a meta-character, so it is always safe to precede a +non-alphameric with "\\" to specify that it stands for itself. In particular, +if you want to match a backslash, you write "\\\\". + +If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the +pattern (other than in a character class) and characters between a "#" outside +a character class and the next newline character are ignored. An escaping +backslash can be used to include a whitespace or "#" character as part of the +pattern. + +A second use of backslash provides a way of encoding non-printing characters +in patterns in a visible manner. 
There is no restriction on the appearance of +non-printing characters, apart from the binary zero that terminates a pattern, +but when a pattern is being prepared by text editing, it is usually easier to +use one of the following escape sequences than the binary character it +represents: + + \\a alarm, that is, the BEL character (hex 07) + \\cx "control-x", where x is any character + \\e escape (hex 1B) + \\f formfeed (hex 0C) + \\n newline (hex 0A) + \\r carriage return (hex 0D) + \\t tab (hex 09) + \\xhh character with hex code hh + \\ddd character with octal code ddd, or backreference + +The precise effect of "\\cx" is as follows: if "x" is a lower case letter, it +is converted to upper case. Then bit 6 of the character (hex 40) is inverted. +Thus "\\cz" becomes hex 1A, but "\\c{" becomes hex 3B, while "\\c;" becomes hex +7B. + +After "\\x", up to two hexadecimal digits are read (letters can be in upper or +lower case). + +After "\\0" up to two further octal digits are read. In both cases, if there +are fewer than two digits, just those that are present are used. Thus the +sequence "\\0\\x\\07" specifies two binary zeros followed by a BEL character. +Make sure you supply two digits after the initial zero if the character that +follows is itself an octal digit. + +The handling of a backslash followed by a digit other than 0 is complicated. +Outside a character class, PCRE reads it and any following digits as a decimal +number. If the number is less than 10, or if there have been at least that many +previous capturing left parentheses in the expression, the entire sequence is +taken as a \fIback reference\fR. A description of how this works is given +later, following the discussion of parenthesized subpatterns. 
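Two of the rules given so far in this section are simple enough to restate in C. The sketch below is not PCRE source code: both function names are hypothetical, ASCII character codes and the default C locale are assumed, and the second helper covers only the back reference test stated in the paragraph above, outside a character class.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: the character produced by "\cx". A lower case
 * letter is first converted to upper case, then bit 6 (hex 40) is
 * inverted. ASCII codes are assumed. */
int control_escape(int x)
{
    if (islower((unsigned char)x))
        x = toupper((unsigned char)x);
    return x ^ 0x40;
}

/* Hypothetical helper: does a backslash followed by the decimal number
 * `number` (first digit not 0), outside a character class, count as a
 * back reference when captures_so_far capturing left parentheses have
 * been seen? Returns 1 for yes, 0 for no. */
int is_back_reference(int number, int captures_so_far)
{
    assert(number > 0 && captures_so_far >= 0);
    return number < 10 || captures_so_far >= number;
}
```

With these definitions control_escape('z') is hex 1A, control_escape('{') is hex 3B, and control_escape(';') is hex 7B, which agrees with the values quoted in the text.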
+ +Inside a character class, or if the decimal number is greater than 9 and there +have not been that many capturing subpatterns, PCRE re-reads up to three octal +digits following the backslash, and generates a single byte from the least +significant 8 bits of the value. Any subsequent digits stand for themselves. +For example: + + \\040 is another way of writing a space + \\40 is the same, provided there are fewer than 40 + previous capturing subpatterns + \\7 is always a back reference + \\11 might be a back reference, or another way of + writing a tab + \\011 is always a tab + \\0113 is a tab followed by the character "3" + \\113 is the character with octal code 113 (since there + can be no more than 99 back references) + \\377 is a byte consisting entirely of 1 bits + \\81 is either a back reference, or a binary zero + followed by the two characters "8" and "1" + +Note that octal values of 100 or greater must not be introduced by a leading +zero, because no more than three octal digits are ever read. + +All the sequences that define a single byte value can be used both inside and +outside character classes. In addition, inside a character class, the sequence +"\\b" is interpreted as the backspace character (hex 08). Outside a character +class it has a different meaning (see below). + +The third use of backslash is for specifying generic character types: + + \\d any decimal digit + \\D any character that is not a decimal digit + \\s any whitespace character + \\S any character that is not a whitespace character + \\w any "word" character + \\W any "non-word" character + +Each pair of escape sequences partitions the complete set of characters into +two disjoint sets. Any given character matches one, and only one, of each pair. + +A "word" character is any letter or digit or the underscore character, that is, +any character which can be part of a Perl "word". 
The definition of letters and +digits is controlled by PCRE's character tables, and may vary if locale- +specific matching is taking place (see "Locale support" above). For example, in +the "fr" (French) locale, some character codes greater than 128 are used for +accented letters, and these are matched by \\w. + +These character type sequences can appear both inside and outside character +classes. They each match one character of the appropriate type. If the current +matching point is at the end of the subject string, all of them fail, since +there is no character to match. + +The fourth use of backslash is for certain simple assertions. An assertion +specifies a condition that has to be met at a particular point in a match, +without consuming any characters from the subject string. The use of +subpatterns for more complicated assertions is described below. The backslashed +assertions are + + \\b word boundary + \\B not a word boundary + \\A start of subject (independent of multiline mode) + \\Z end of subject or newline at end (independent of multiline mode) + \\z end of subject (independent of multiline mode) + +These assertions may not appear in character classes (but note that "\\b" has a +different meaning, namely the backspace character, inside a character class). + +A word boundary is a position in the subject string where the current character +and the previous character do not both match \\w or \\W (i.e. one matches +\\w and the other matches \\W), or the start or end of the string if the +first or last character matches \\w, respectively. + +The \\A, \\Z, and \\z assertions differ from the traditional circumflex and +dollar (described below) in that they only ever match at the very start and end +of the subject string, whatever options are set. They are not affected by the +PCRE_NOTBOL or PCRE_NOTEOL options. 
The difference between \\Z and \\z is that +\\Z matches before a newline that is the last character of the string as well +as at the end of the string, whereas \\z matches only at the end. + + +.SH CIRCUMFLEX AND DOLLAR +Outside a character class, in the default matching mode, the circumflex +character is an assertion which is true only if the current matching point is +at the start of the subject string. Inside a character class, circumflex has an +entirely different meaning (see below). + +Circumflex need not be the first character of the pattern if a number of +alternatives are involved, but it should be the first thing in each alternative +in which it appears if the pattern is ever to match that branch. If all +possible alternatives start with a circumflex, that is, if the pattern is +constrained to match only at the start of the subject, it is said to be an +"anchored" pattern. (There are also other constructs that can cause a pattern +to be anchored.) + +A dollar character is an assertion which is true only if the current matching +point is at the end of the subject string, or immediately before a newline +character that is the last character in the string (by default). Dollar need +not be the last character of the pattern if a number of alternatives are +involved, but it should be the last item in any branch in which it appears. +Dollar has no special meaning in a character class. + +The meaning of dollar can be changed so that it matches only at the very end of +the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching +time. This does not affect the \\Z assertion. + +The meanings of the circumflex and dollar characters are changed if the +PCRE_MULTILINE option is set. When this is the case, they match immediately +after and immediately before an internal "\\n" character, respectively, in +addition to matching at the start and end of the subject string. 
For example, +the pattern /^abc$/ matches the subject string "def\\nabc" in multiline mode, +but not otherwise. Consequently, patterns that are anchored in single line mode +because all branches start with "^" are not anchored in multiline mode. The +PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set. + +Note that the sequences \\A, \\Z, and \\z can be used to match the start and +end of the subject in both modes, and if all branches of a pattern start with +\\A it is always anchored, whether PCRE_MULTILINE is set or not. + + +.SH FULL STOP (PERIOD, DOT) +Outside a character class, a dot in the pattern matches any one character in +the subject, including a non-printing character, but not (by default) newline. +If the PCRE_DOTALL option is set, then dots match newlines as well. The +handling of dot is entirely independent of the handling of circumflex and +dollar, the only relationship being that they both involve newline characters. +Dot has no special meaning in a character class. + + +.SH SQUARE BRACKETS +An opening square bracket introduces a character class, terminated by a closing +square bracket. A closing square bracket on its own is not special. If a +closing square bracket is required as a member of the class, it should be the +first data character in the class (after an initial circumflex, if present) or +escaped with a backslash. + +A character class matches a single character in the subject; the character must +be in the set of characters defined by the class, unless the first character in +the class is a circumflex, in which case the subject character must not be in +the set defined by the class. If a circumflex is actually required as a member +of the class, ensure it is not the first character, or escape it with a +backslash. + +For example, the character class [aeiou] matches any lower case vowel, while +[^aeiou] matches any character that is not a lower case vowel.
Note that a +circumflex is just a convenient notation for specifying the characters which +are in the class by enumerating those that are not. It is not an assertion: it +still consumes a character from the subject string, and fails if the current +pointer is at the end of the string. + +When caseless matching is set, any letters in a class represent both their +upper case and lower case versions, so for example, a caseless [aeiou] matches +"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a +caseful version would. + +The newline character is never treated in any special way in character classes, +whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class +such as [^a] will always match a newline. + +The minus (hyphen) character can be used to specify a range of characters in a +character class. For example, [d-m] matches any letter between d and m, +inclusive. If a minus character is required in a class, it must be escaped with +a backslash or appear in a position where it cannot be interpreted as +indicating a range, typically as the first or last character in the class. + +It is not possible to have the literal character "]" as the end character of a +range. A pattern such as [W-]46] is interpreted as a class of two characters +("W" and "-") followed by a literal string "46]", so it would match "W46]" or +"-46]". However, if the "]" is escaped with a backslash it is interpreted as +the end of range, so [W-\\]46] is interpreted as a single class containing a +range followed by two separate characters. The octal or hexadecimal +representation of "]" can also be used to end a range. + +Ranges operate in ASCII collating sequence. They can also be used for +characters specified numerically, for example [\\000-\\037]. If a range that +includes letters is used when caseless matching is set, it matches the letters +in either case. 
For example, [W-c] is equivalent to [][\\^_`wxyzabc], matched +caselessly, and if character tables for the "fr" locale are in use, +[\\xc8-\\xcb] matches accented E characters in both cases. + +The character types \\d, \\D, \\s, \\S, \\w, and \\W may also appear in a +character class, and add the characters that they match to the class. For +example, [\\dABCDEF] matches any hexadecimal digit. A circumflex can +conveniently be used with the upper case character types to specify a more +restricted set of characters than the matching lower case type. For example, +the class [^\\W_] matches any letter or digit, but not underscore. + +All non-alphameric characters other than \\, -, ^ (at the start) and the +terminating ] are non-special in character classes, but it does no harm if they +are escaped. + + +.SH VERTICAL BAR +Vertical bar characters are used to separate alternative patterns. For example, +the pattern + + gilbert|sullivan + +matches either "gilbert" or "sullivan". Any number of alternatives may appear, +and an empty alternative is permitted (matching the empty string). +The matching process tries each alternative in turn, from left to right, +and the first one that succeeds is used. If the alternatives are within a +subpattern (defined below), "succeeds" means matching the rest of the main +pattern as well as the alternative in the subpattern. + + +.SH INTERNAL OPTION SETTING +The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED +can be changed from within the pattern by a sequence of Perl option letters +enclosed between "(?" and ")". The option letters are + + i for PCRE_CASELESS + m for PCRE_MULTILINE + s for PCRE_DOTALL + x for PCRE_EXTENDED + +For example, (?im) sets caseless, multiline matching. 
It is also possible to +unset these options by preceding the letter with a hyphen, and a combined +setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and +PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also +permitted. If a letter appears both before and after the hyphen, the option is +unset. + +The scope of these option changes depends on where in the pattern the setting +occurs. For settings that are outside any subpattern (defined below), the +effect is the same as if the options were set or unset at the start of +matching. The following patterns all behave in exactly the same way: + + (?i)abc + a(?i)bc + ab(?i)c + abc(?i) + +which in turn is the same as compiling the pattern abc with PCRE_CASELESS set. +In other words, such "top level" settings apply to the whole pattern (unless +there are other changes inside subpatterns). If there is more than one setting +of the same option at top level, the rightmost setting is used. + +If an option change occurs inside a subpattern, the effect is different. This +is a change of behaviour in Perl 5.005. An option change inside a subpattern +affects only that part of the subpattern that follows it, so + + (a(?i)b)c + +matches abc and aBc and no other strings (assuming PCRE_CASELESS is not used). +By this means, options can be made to have different settings in different +parts of the pattern. Any changes made in one alternative do carry on +into subsequent branches within the same subpattern. For example, + + (a(?i)b|c) + +matches "ab", "aB", "c", and "C", even though when matching "C" the first +branch is abandoned before the option setting. This is because the effects of +option settings happen at compile time. There would be some very weird +behaviour otherwise. + +The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the +same way as the Perl-compatible options by using the characters U and X +respectively. 
The (?X) flag setting is special in that it must always occur +earlier in the pattern than any of the additional features it turns on, even +when it is at top level. It is best put at the start. + + +.SH SUBPATTERNS +Subpatterns are delimited by parentheses (round brackets), which can be nested. +Marking part of a pattern as a subpattern does two things: + +1. It localizes a set of alternatives. For example, the pattern + + cat(aract|erpillar|) + +matches one of the words "cat", "cataract", or "caterpillar". Without the +parentheses, it would match "cataract", "erpillar" or the empty string. + +2. It sets up the subpattern as a capturing subpattern (as defined above). +When the whole pattern matches, that portion of the subject string that matched +the subpattern is passed back to the caller via the \fIovector\fR argument of +\fBpcre_exec()\fR. Opening parentheses are counted from left to right (starting +from 1) to obtain the numbers of the capturing subpatterns. + +For example, if the string "the red king" is matched against the pattern + + the ((red|white) (king|queen)) + +the captured substrings are "red king", "red", and "king", and are numbered 1, +2, and 3. + +The fact that plain parentheses fulfil two functions is not always helpful. +There are often times when a grouping subpattern is required without a +capturing requirement. If an opening parenthesis is followed by "?:", the +subpattern does not do any capturing, and is not counted when computing the +number of any subsequent capturing subpatterns. For example, if the string "the +white queen" is matched against the pattern + + the ((?:red|white) (king|queen)) + +the captured substrings are "white queen" and "queen", and are numbered 1 and +2. The maximum number of captured substrings is 99, and the maximum number of +all subpatterns, both capturing and non-capturing, is 200. 
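The left-to-right numbering of capturing subpatterns can be illustrated with a rough counter. This is a simplified sketch and not how PCRE itself compiles a pattern: the function name count_capturing is hypothetical, any "(" followed by "?" is assumed to be non-capturing, and "#" comments under PCRE_EXTENDED are not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the opening parentheses in a pattern that
 * start capturing subpatterns. Escaped characters and the contents of
 * character classes are skipped. */
int count_capturing(const char *pattern)
{
    int count = 0;
    int in_class = 0;
    size_t i;

    assert(pattern != NULL);
    for (i = 0; pattern[i] != '\0'; i++) {
        char c = pattern[i];

        if (c == '\\') {
            if (pattern[i + 1] != '\0')
                i++;                    /* skip the escaped character */
        } else if (in_class) {
            if (c == ']')
                in_class = 0;
        } else if (c == '[') {
            in_class = 1;
            /* "]" as the first data character of a class is a literal */
            if (pattern[i + 1] == '^')
                i++;
            if (pattern[i + 1] == ']')
                i++;
        } else if (c == '(') {
            if (pattern[i + 1] != '?')
                count++;                /* plain "(" captures */
            else if (pattern[i + 2] == '(')
                i += 2;                 /* skip the "(" of a condition */
        }
    }
    return count;
}
```

Applied to the two patterns above, the counter gives 3 for "the ((red|white) (king|queen))" and 2 for "the ((?:red|white) (king|queen))", which agrees with the numbering described in the text.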
+ +As a convenient shorthand, if any option settings are required at the start of +a non-capturing subpattern, the option letters may appear between the "?" and +the ":". Thus the two patterns + + (?i:saturday|sunday) + (?:(?i)saturday|sunday) + +match exactly the same set of strings. Because alternative branches are tried +from left to right, and options are not reset until the end of the subpattern +is reached, an option setting in one branch does affect subsequent branches, so +the above patterns match "SUNDAY" as well as "Saturday". + + +.SH REPETITION +Repetition is specified by quantifiers, which can follow any of the following +items: + + a single character, possibly escaped + the . metacharacter + a character class + a back reference (see next section) + a parenthesized subpattern (unless it is an assertion - see below) + +The general repetition quantifier specifies a minimum and maximum number of +permitted matches, by giving the two numbers in curly brackets (braces), +separated by a comma. The numbers must be less than 65536, and the first must +be less than or equal to the second. For example: + + z{2,4} + +matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special +character. If the second number is omitted, but the comma is present, there is +no upper limit; if the second number and the comma are both omitted, the +quantifier specifies an exact number of required matches. Thus + + [aeiou]{3,} + +matches at least 3 successive vowels, but may match many more, while + + \\d{8} + +matches exactly 8 digits. An opening curly bracket that appears in a position +where a quantifier is not allowed, or one that does not match the syntax of a +quantifier, is taken as a literal character. For example, {,6} is not a +quantifier, but a literal string of four characters. + +The quantifier {0} is permitted, causing the expression to behave as if the +previous item and the quantifier were not present. 
+ +For convenience (and historical compatibility) the three most common +quantifiers have single-character abbreviations: + + * is equivalent to {0,} + + is equivalent to {1,} + ? is equivalent to {0,1} + +It is possible to construct infinite loops by following a subpattern that can +match no characters with a quantifier that has no upper limit, for example: + + (a?)* + +Earlier versions of Perl and PCRE used to give an error at compile time for +such patterns. However, because there are cases where this can be useful, such +patterns are now accepted, but if any repetition of the subpattern does in fact +match no characters, the loop is forcibly broken. + +By default, the quantifiers are "greedy", that is, they match as much as +possible (up to the maximum number of permitted times), without causing the +rest of the pattern to fail. The classic example of where this gives problems +is in trying to match comments in C programs. These appear between the +sequences /* and */ and within the sequence, individual * and / characters may +appear. An attempt to match C comments by applying the pattern + + /\\*.*\\*/ + +to the string + + /* first comment */ not comment /* second comment */ + +fails, because it matches the entire string due to the greediness of the .* +item. + +However, if a quantifier is followed by a question mark, then it ceases to be +greedy, and instead matches the minimum number of times possible, so the +pattern + + /\\*.*?\\*/ + +does the right thing with the C comments. The meaning of the various +quantifiers is not otherwise changed, just the preferred number of matches. +Do not confuse this use of question mark with its use as a quantifier in its +own right. Because it has two uses, it can sometimes appear doubled, as in + + \\d??\\d + +which matches one digit by preference, but can match two if that is the only +way the rest of the pattern matches.
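The difference between the greedy and the minimizing form of the C comment pattern can be imitated without a regular expression engine. The sketch below is an emulation with strstr(), not a use of PCRE: both function names are hypothetical, and the subject is assumed to contain no newline characters, since dot does not match newline by default.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the span the greedy pattern would match,
 * running from the first comment opener to the last comment closer.
 * Returns 0 if there is no match. */
size_t greedy_span(const char *s)
{
    const char *open, *close, *last = NULL;

    assert(s != NULL);
    open = strstr(s, "/*");
    if (open == NULL)
        return 0;
    for (close = strstr(open + 2, "*/"); close != NULL;
         close = strstr(close + 1, "*/"))
        last = close;                   /* remember the final closer */
    return last != NULL ? (size_t)(last + 2 - open) : 0;
}

/* Hypothetical helper: length of the span the minimizing pattern would
 * match, running from the first opener to the first closer after it.
 * Returns 0 if there is no match. */
size_t minimal_span(const char *s)
{
    const char *open, *close;

    assert(s != NULL);
    open = strstr(s, "/*");
    if (open == NULL)
        return 0;
    close = strstr(open + 2, "*/");
    return close != NULL ? (size_t)(close + 2 - open) : 0;
}
```

On the example subject, the greedy version spans the whole string, while the minimal version stops at the end of the first comment.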
+ +If the PCRE_UNGREEDY option is set (an option which is not available in Perl) +then the quantifiers are not greedy by default, but individual ones can be made +greedy by following them with a question mark. In other words, it inverts the +default behaviour. + +When a parenthesized subpattern is quantified with a minimum repeat count that +is greater than 1 or with a limited maximum, more store is required for the +compiled pattern, in proportion to the size of the minimum or maximum. + +If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent +to Perl's /s) is set, thus allowing the . to match newlines, then the pattern +is implicitly anchored, because whatever follows will be tried against every +character position in the subject string, so there is no point in retrying the +overall match at any position after the first. PCRE treats such a pattern as +though it were preceded by \\A. In cases where it is known that the subject +string contains no newlines, it is worth setting PCRE_DOTALL when the pattern +begins with .* in order to obtain this optimization, or alternatively using ^ +to indicate anchoring explicitly. + +When a capturing subpattern is repeated, the value captured is the substring +that matched the final iteration. For example, after + + (tweedle[dume]{3}\\s*)+ + +has matched "tweedledum tweedledee" the value of the captured substring is +"tweedledee". However, if there are nested capturing subpatterns, the +corresponding captured values may have been set in previous iterations. For +example, after + + /(a|(b))+/ + +matches "aba" the value of the second captured substring is "b". + + +.SH BACK REFERENCES +Outside a character class, a backslash followed by a digit greater than 0 (and +possibly further digits) is a back reference to a capturing subpattern earlier +(i.e. to its left) in the pattern, provided there have been that many previous +capturing left parentheses. 
+ +However, if the decimal number following the backslash is less than 10, it is +always taken as a back reference, and causes an error only if there are not +that many capturing left parentheses in the entire pattern. In other words, the +parentheses that are referenced need not be to the left of the reference for +numbers less than 10. See the section entitled "Backslash" above for further +details of the handling of digits following a backslash. + +A back reference matches whatever actually matched the capturing subpattern in +the current subject string, rather than anything matching the subpattern +itself. So the pattern + + (sens|respons)e and \\1ibility + +matches "sense and sensibility" and "response and responsibility", but not +"sense and responsibility". If caseful matching is in force at the time of the +back reference, then the case of letters is relevant. For example, + + ((?i)rah)\\s+\\1 + +matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original +capturing subpattern is matched caselessly. + +There may be more than one back reference to the same subpattern. If a +subpattern has not actually been used in a particular match, then any back +references to it always fail. For example, the pattern + + (a|(bc))\\2 + +always fails if it starts to match "a" rather than "bc". Because there may be +up to 99 back references, all digits following the backslash are taken +as part of a potential back reference number. If the pattern continues with a +digit character, then some delimiter must be used to terminate the back +reference. If the PCRE_EXTENDED option is set, this can be whitespace. +Otherwise an empty comment can be used. + +A back reference that occurs inside the parentheses to which it refers fails +when the subpattern is first used, so, for example, (a\\1) never matches. +However, such references can be useful inside repeated subpatterns. 
For +example, the pattern + + (a|b\\1)+ + +matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of +the subpattern, the back reference matches the character string corresponding +to the previous iteration. In order for this to work, the pattern must be such +that the first iteration does not need to match the back reference. This can be +done using alternation, as in the example above, or by a quantifier with a +minimum of zero. + + +.SH ASSERTIONS +An assertion is a test on the characters following or preceding the current +matching point that does not actually consume any characters. The simple +assertions coded as \\b, \\B, \\A, \\Z, \\z, ^ and $ are described above. More +complicated assertions are coded as subpatterns. There are two kinds: those +that look ahead of the current position in the subject string, and those that +look behind it. + +An assertion subpattern is matched in the normal way, except that it does not +cause the current matching position to be changed. Lookahead assertions start +with (?= for positive assertions and (?! for negative assertions. For example, + + \\w+(?=;) + +matches a word followed by a semicolon, but does not include the semicolon in +the match, and + + foo(?!bar) + +matches any occurrence of "foo" that is not followed by "bar". Note that the +apparently similar pattern + + (?!foo)bar + +does not find an occurrence of "bar" that is preceded by something other than +"foo"; it finds any occurrence of "bar" whatsoever, because the assertion +(?!foo) is always true when the next three characters are "bar". A +lookbehind assertion is needed to achieve this effect. + +Lookbehind assertions start with (?<= for positive assertions and (?<! for +negative assertions. For example, + + (?<!foo)bar + +does find an occurrence of "bar" that is not preceded by "foo". The contents of +a lookbehind assertion are restricted such that all the strings it matches must +have a fixed length. 
However, if there are several alternatives, they do not +all have to have the same fixed length. Thus + + (?<=bullock|donkey) + +is permitted, but + + (?<!dogs?|cats?) + +causes an error at compile time. Branches that match different length strings +are permitted only at the top level of a lookbehind assertion. This is an +extension compared with Perl 5.005, which requires all branches to match the +same length of string. An assertion such as + + (?<=ab(c|de)) + +is not permitted, because its single top-level branch can match two different +lengths, but it is acceptable if rewritten to use two top-level branches: + + (?<=abc|abde) + +The implementation of lookbehind assertions is, for each alternative, to +temporarily move the current position back by the fixed width and then try to +match. If there are insufficient characters before the current position, the +match is deemed to fail. Lookbehinds in conjunction with once-only subpatterns +can be particularly useful for matching at the ends of strings; an example is +given at the end of the section on once-only subpatterns. + +Several assertions (of any sort) may occur in succession. For example, + + (?<=\\d{3})(?<!999)foo + +matches "foo" preceded by three digits that are not "999". Furthermore, +assertions can be nested in any combination. For example, + + (?<=(?<!foo)bar)baz + +matches an occurrence of "baz" that is preceded by "bar" which in turn is not +preceded by "foo". + +Assertion subpatterns are not capturing subpatterns, and may not be repeated, +because it makes no sense to assert the same thing several times. If an +assertion contains capturing subpatterns within it, these are always counted +for the purposes of numbering the capturing subpatterns in the whole pattern. +Substring capturing is carried out for positive assertions, but it does not +make sense for negative assertions. + +Assertions count towards the maximum of 200 parenthesized subpatterns. 
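The way a lookbehind is evaluated (step back by the fixed width of each alternative, then try to match) can be sketched for the simple case where every top-level alternative is a literal string. This is an illustration under that assumption only; lookbehind_any is a hypothetical name and the code is not taken from PCRE.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): models a positive lookbehind
   whose top-level alternatives are all literal strings. pos is the current
   matching position in subject. Returns 1 if some alternative ends exactly
   at pos, and 0 otherwise. */
int lookbehind_any(const char *subject, size_t pos,
                   const char *const *alts, size_t nalts)
{
    size_t i;
    for (i = 0; i < nalts; i++) {
        /* Each alternative has its own fixed width. */
        size_t width = strlen(alts[i]);

        /* Too few characters before pos: this alternative fails. */
        if (width > pos) continue;

        /* Move back by the fixed width and compare from there. */
        if (memcmp(subject + pos - width, alts[i], width) == 0) return 1;
    }
    return 0;
}
```

With the alternatives "bullock" and "donkey", which have different widths, the check succeeds at the position just after "donkey" in a subject and fails at positions where fewer characters precede the current point than either alternative needs.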
+ + +.SH ONCE-ONLY SUBPATTERNS +With both maximizing and minimizing repetition, failure of what follows +normally causes the repeated item to be re-evaluated to see if a different +number of repeats allows the rest of the pattern to match. Sometimes it is +useful to prevent this, either to change the nature of the match, or to cause +it to fail earlier than it otherwise might, when the author of the pattern knows +there is no point in carrying on. + +Consider, for example, the pattern \\d+foo when applied to the subject line + + 123456bar + +After matching all 6 digits and then failing to match "foo", the normal +action of the matcher is to try again with only 5 digits matching the \\d+ +item, and then with 4, and so on, before ultimately failing. Once-only +subpatterns provide the means for specifying that once a portion of the pattern +has matched, it is not to be re-evaluated in this way, so the matcher would +give up immediately on failing to match "foo" the first time. The notation is +another kind of special parenthesis, starting with (?> as in this example: + + (?>\\d+)foo + +This kind of parenthesis "locks up" the part of the pattern it contains once +it has matched, and a failure further into the pattern is prevented from +backtracking into it. Backtracking past it to previous items, however, works as +normal. + +An alternative description is that a subpattern of this type matches the string +of characters that an identical standalone pattern would match, if anchored at +the current point in the subject string. + +Once-only subpatterns are not capturing subpatterns. Simple cases such as the +above example can be thought of as a maximizing repeat that must swallow +everything it can. So, while both \\d+ and \\d+? are prepared to adjust the +number of digits they match in order to make the rest of the pattern match, +(?>\\d+) can only match an entire sequence of digits.
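The saving described above can be made concrete by counting how often the literal that follows the digits is compared. The sketch below is a simplified model, not PCRE code: count_literal_tries is a hypothetical name, the pattern is assumed to be a run of digits followed by a literal string, and only a match attempt starting at the first character of the subject is modelled.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model (not a PCRE function) of matching one or more digits
   followed by a literal, starting at the beginning of subject. With
   atomic == 0 the digit run gives back one digit at a time, as an ordinary
   maximizing repeat does; with atomic != 0 it never gives anything back, as
   in a once-only subpattern. Returns the number of times the literal was
   compared, and sets *matched to 1 or 0. */
int count_literal_tries(const char *subject, const char *literal,
                        int atomic, int *matched)
{
    size_t n = 0;
    size_t len = strlen(literal);
    int tries = 0;

    *matched = 0;

    /* Swallow as many digits as possible first. */
    while (isdigit((unsigned char)subject[n])) n++;

    /* At least one digit must remain matched by the repeat. */
    for (; n >= 1; n--) {
        tries++;
        if (strncmp(subject + n, literal, len) == 0) {
            *matched = 1;
            break;
        }
        if (atomic) break;      /* no backtracking into the digit run */
    }
    return tries;
}
```

In this model the subject 123456bar with the literal "foo" costs six comparisons without the once-only behaviour and one with it, and neither form matches. A subject such as 1234567 with the literal "7" shows the change in the nature of the match: the ordinary repeat gives back one digit and succeeds, while the once-only form fails.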
+ +This construction can of course contain arbitrarily complicated subpatterns, +and it can be nested. + +Once-only subpatterns can be used in conjunction with lookbehind assertions to +specify efficient matching at the end of the subject string. Consider a simple +pattern such as + + abcd$ + +when applied to a long string which does not match it. Because matching +proceeds from left to right, PCRE will look for each "a" in the subject and +then see if what follows matches the rest of the pattern. If the pattern is +specified as + + ^.*abcd$ + +then the initial .* matches the entire string at first, but when this fails, it +backtracks to match all but the last character, then all but the last two +characters, and so on. Once again the search for "a" covers the entire string, +from right to left, so we are no better off. However, if the pattern is written +as + + ^(?>.*)(?<=abcd) + +then there can be no backtracking for the .* item; it can match only the entire +string. The subsequent lookbehind assertion does a single test on the last four +characters. If it fails, the match fails immediately. For long strings, this +approach makes a significant difference to the processing time. + + +.SH CONDITIONAL SUBPATTERNS +It is possible to cause the matching process to obey a subpattern +conditionally or to choose between two alternative subpatterns, depending on +the result of an assertion, or whether a previous capturing subpattern matched +or not. The two possible forms of conditional subpattern are + + (?(condition)yes-pattern) + (?(condition)yes-pattern|no-pattern) + +If the condition is satisfied, the yes-pattern is used; otherwise the +no-pattern (if present) is used. If there are more than two alternatives in the +subpattern, a compile-time error occurs. + +There are two kinds of condition. If the text between the parentheses consists +of a sequence of digits, then the condition is satisfied if the capturing +subpattern of that number has previously matched. 
Consider the following +pattern, which contains non-significant white space to make it more readable +(assume the PCRE_EXTENDED option) and to divide it into three parts for ease +of discussion: + + ( \\( )? [^()]+ (?(1) \\) ) + +The first part matches an optional opening parenthesis, and if that +character is present, sets it as the first captured substring. The second part +matches one or more characters that are not parentheses. The third part is a +conditional subpattern that tests whether the first set of parentheses matched +or not. If they did, that is, if subject started with an opening parenthesis, +the condition is true, and so the yes-pattern is executed and a closing +parenthesis is required. Otherwise, since no-pattern is not present, the +subpattern matches nothing. In other words, this pattern matches a sequence of +non-parentheses, optionally enclosed in parentheses. + +If the condition is not a sequence of digits, it must be an assertion. This may +be a positive or negative lookahead or lookbehind assertion. Consider this +pattern, again containing non-significant white space, and with the two +alternatives on the second line: + + (?(?=[^a-z]*[a-z]) + \\d{2}[a-z]{3}-\\d{2} | \\d{2}-\\d{2}-\\d{2} ) + +The condition is a positive lookahead assertion that matches an optional +sequence of non-letters followed by a letter. In other words, it tests for the +presence of at least one letter in the subject. If a letter is found, the +subject is matched against the first alternative; otherwise it is matched +against the second. This pattern matches strings in one of the two forms +dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. + + +.SH COMMENTS +The sequence (?# marks the start of a comment which continues up to the next +closing parenthesis. Nested parentheses are not permitted. The characters +that make up a comment play no part in the pattern matching at all. 
+ +If the PCRE_EXTENDED option is set, an unescaped # character outside a +character class introduces a comment that continues up to the next newline +character in the pattern. + + +.SH PERFORMANCE +Certain items that may appear in patterns are more efficient than others. It is +more efficient to use a character class like [aeiou] than a set of alternatives +such as (a|e|i|o|u). In general, the simplest construction that provides the +required behaviour is usually the most efficient. Jeffrey Friedl's book +contains a lot of discussion about optimizing regular expressions for efficient +performance. + +When a pattern begins with .* and the PCRE_DOTALL option is set, the pattern is +implicitly anchored by PCRE, since it can match only at the start of a subject +string. However, if PCRE_DOTALL is not set, PCRE cannot make this optimization, +because the . metacharacter does not then match a newline, and if the subject +string contains newlines, the pattern may match from the character immediately +following one of them instead of from the very start. For example, the pattern + + (.*) second + +matches the subject "first\\nand second" (where \\n stands for a newline +character) with the first captured substring being "and". In order to do this, +PCRE has to retry the match starting after every newline in the subject. + +If you are using such a pattern with subject strings that do not contain +newlines, the best performance is obtained by setting PCRE_DOTALL, or starting +the pattern with ^.* to indicate explicit anchoring. That saves PCRE from +having to scan along the subject looking for a newline to restart at. + +.SH AUTHOR +Philip Hazel <ph10@cam.ac.uk> +.br +University Computing Service, +.br +New Museums Site, +.br +Cambridge CB2 3QG, England. +.br +Phone: +44 1223 334714 + +Copyright (c) 1997-1999 University of Cambridge. 
diff --git a/ext/pcre/pcrelib/pcre.c b/ext/pcre/pcrelib/pcre.c new file mode 100644 index 0000000000..dd5852dd31 --- /dev/null +++ b/ext/pcre/pcrelib/pcre.c @@ -0,0 +1,4336 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-1999 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + + +/* Define DEBUG to get debugging output on stdout. */ + +/* #define DEBUG */ + +/* Use a macro for debugging printing, 'cause that eliminates the use of #ifdef +inline, and there are *still* stupid compilers about that don't like indented +pre-processor statements. I suppose it's only been 10 years... 
*/ + +#ifdef DEBUG +#define DPRINTF(p) printf p +#else +#define DPRINTF(p) /*nothing*/ +#endif + +/* Include the internals header, which itself includes Standard C headers plus +the external pcre header. */ + +#include "internal.h" + + +/* Allow compilation as C++ source code, should anybody want to do that. */ + +#ifdef __cplusplus +#define class pcre_class +#endif + + +/* Number of items on the nested bracket stacks at compile time. This should +not be set greater than 200. */ + +#define BRASTACK_SIZE 200 + + +/* Min and max values for the common repeats; for the maxima, 0 => infinity */ + +static const char rep_min[] = { 0, 0, 1, 1, 0, 0 }; +static const char rep_max[] = { 0, 0, 0, 0, 1, 1 }; + +/* Text forms of OP_ values and things, for debugging (not all used) */ + +#ifdef DEBUG +static const char *OP_names[] = { + "End", "\\A", "\\B", "\\b", "\\D", "\\d", + "\\S", "\\s", "\\W", "\\w", "\\Z", "\\z", + "Opt", "^", "$", "Any", "chars", "not", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", + "class", "Ref", + "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", + "AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref", + "Brazero", "Braminzero", "Bra" +}; +#endif + +/* Table for handling escaped characters in the range '0'-'z'. Positive returns +are simple data values; negative values are for special things like \d and so +on. Zero means further processing is needed (for things like \x), or the escape +is invalid. */ + +static const short int escapes[] = { + 0, 0, 0, 0, 0, 0, 0, 0, /* 0 - 7 */ + 0, 0, ':', ';', '<', '=', '>', '?', /* 8 - ? 
*/ + '@', -ESC_A, -ESC_B, 0, -ESC_D, 0, 0, 0, /* @ - G */ + 0, 0, 0, 0, 0, 0, 0, 0, /* H - O */ + 0, 0, 0, -ESC_S, 0, 0, 0, -ESC_W, /* P - W */ + 0, 0, -ESC_Z, '[', '\\', ']', '^', '_', /* X - _ */ + '`', 7, -ESC_b, 0, -ESC_d, 27, '\f', 0, /* ` - g */ + 0, 0, 0, 0, 0, 0, '\n', 0, /* h - o */ + 0, 0, '\r', -ESC_s, '\t', 0, 0, -ESC_w, /* p - w */ + 0, 0, -ESC_z /* x - z */ +}; + +/* Definition to allow mutual recursion */ + +static BOOL + compile_regex(int, int, int *, uschar **, const uschar **, const char **, + BOOL, int, compile_data *); + + + +/************************************************* +* Global variables * +*************************************************/ + +/* PCRE is thread-clean and doesn't use any global variables in the normal +sense. However, it calls memory allocation and free functions via the two +indirections below, which can be changed by the caller, but are shared +between all threads. */ + +void *(*pcre_malloc)(size_t) = malloc; +void (*pcre_free)(void *) = free; + + + + +/************************************************* +* Default character tables * +*************************************************/ + +/* A default set of character tables is included in the PCRE binary. Its source +is built by the maketables auxiliary program, which uses the default C ctypes +functions, and put in the file chartables.c. These tables are used by PCRE +whenever the caller of pcre_compile() does not provide an alternate set of +tables. */ + +#include "chartables.c" + + + +/************************************************* +* Return version string * +*************************************************/ + +const char * +pcre_version(void) +{ +return PCRE_VERSION; +} + + + + +/************************************************* +* Return info about a compiled pattern * +*************************************************/ + +/* This function picks potentially useful data out of the private +structure.
+ +Arguments: + external_re points to compiled code + optptr where to pass back the options + first_char where to pass back the first character, + or -1 if multiline and all branches start ^, + or -2 otherwise + +Returns: number of identifying extraction brackets + or negative values on error +*/ + +int +pcre_info(const pcre *external_re, int *optptr, int *first_char) +{ +const real_pcre *re = (const real_pcre *)external_re; +if (re == NULL) return PCRE_ERROR_NULL; +if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; +if (optptr != NULL) *optptr = (re->options & PUBLIC_OPTIONS); +if (first_char != NULL) + *first_char = ((re->options & PCRE_FIRSTSET) != 0)? re->first_char : + ((re->options & PCRE_STARTLINE) != 0)? -1 : -2; +return re->top_bracket; +} + + + + +#ifdef DEBUG +/************************************************* +* Debugging function to print chars * +*************************************************/ + +/* Print a sequence of chars in printable format, stopping at the end of the +subject if requested. + +Arguments: + p points to characters + length number to print + is_subject TRUE if printing from within md->start_subject + md pointer to matching data block, if is_subject is TRUE + +Returns: nothing +*/ + +static void +pchars(const uschar *p, int length, BOOL is_subject, match_data *md) +{ +int c; +if (is_subject && length > md->end_subject - p) length = md->end_subject - p; +while (length-- > 0) + if (isprint(c = *(p++))) printf("%c", c); else printf("\\x%02x", c); +} +#endif + + + + +/************************************************* +* Handle escapes * +*************************************************/ + +/* This function is called when a \ has been encountered. It either returns a +positive value for a simple escape such as \n, or a negative value which +encodes one of the more complicated things such as \d. On entry, ptr is +pointing at the \. On exit, it is on the final character of the escape +sequence.
+ +Arguments: + ptrptr points to the pattern position pointer + errorptr points to the pointer to the error message + bracount number of previous extracting brackets + options the options bits + isclass TRUE if inside a character class + cd pointer to char tables block + +Returns: zero or positive => a data character + negative => a special escape sequence + on error, errorptr is set +*/ + +static int +check_escape(const uschar **ptrptr, const char **errorptr, int bracount, + int options, BOOL isclass, compile_data *cd) +{ +const uschar *ptr = *ptrptr; +int c = *(++ptr) & 255; /* Ensure > 0 on signed-char systems */ +int i; + +if (c == 0) *errorptr = ERR1; + +/* Digits or letters may have special meaning; all others are literals. */ + +else if (c < '0' || c > 'z') {} + +/* Do an initial lookup in a table. A non-zero result is something that can be +returned immediately. Otherwise further processing may be required. */ + +else if ((i = escapes[c - '0']) != 0) c = i; + +/* Escapes that need further processing, or are illegal. */ + +else + { + const uschar *oldptr; + switch (c) + { + /* The handling of escape sequences consisting of a string of digits + starting with one that is not zero is not straightforward. By experiment, + the way Perl works seems to be as follows: + + Outside a character class, the digits are read as a decimal number. If the + number is less than 10, or if there are that many previous extracting + left brackets, then it is a back reference. Otherwise, up to three octal + digits are read to form an escaped byte. Thus \123 is likely to be octal + 123 (cf \0123, which is octal 012 followed by the literal 3). If the octal + value is greater than 377, the least significant 8 bits are taken. Inside a + character class, \ followed by a digit is always an octal number. 
*/ + + case '1': case '2': case '3': case '4': case '5': + case '6': case '7': case '8': case '9': + + if (!isclass) + { + oldptr = ptr; + c -= '0'; + while ((cd->ctypes[ptr[1]] & ctype_digit) != 0) + c = c * 10 + *(++ptr) - '0'; + if (c < 10 || c <= bracount) + { + c = -(ESC_REF + c); + break; + } + ptr = oldptr; /* Put the pointer back and fall through */ + } + + /* Handle an octal number following \. If the first digit is 8 or 9, Perl + generates a binary zero byte and treats the digit as a following literal. + Thus we have to pull back the pointer by one. */ + + if ((c = *ptr) >= '8') + { + ptr--; + c = 0; + break; + } + + /* \0 always starts an octal number, but we may drop through to here with a + larger first octal digit */ + + case '0': + c -= '0'; + while(i++ < 2 && (cd->ctypes[ptr[1]] & ctype_digit) != 0 && + ptr[1] != '8' && ptr[1] != '9') + c = c * 8 + *(++ptr) - '0'; + break; + + /* Special escapes not starting with a digit are straightforward */ + + case 'x': + c = 0; + while (i++ < 2 && (cd->ctypes[ptr[1]] & ctype_xdigit) != 0) + { + ptr++; + c = c * 16 + cd->lcc[*ptr] - + (((cd->ctypes[*ptr] & ctype_digit) != 0)? '0' : 'W'); + } + break; + + case 'c': + c = *(++ptr); + if (c == 0) + { + *errorptr = ERR2; + return 0; + } + + /* A letter is upper-cased; then the 0x40 bit is flipped */ + + if (c >= 'a' && c <= 'z') c = cd->fcc[c]; + c ^= 0x40; + break; + + /* PCRE_EXTRA enables extensions to Perl in the matter of escapes. Any + other alphameric following \ is an error if PCRE_EXTRA was set; otherwise, + for Perl compatibility, it is a literal. This code looks a bit odd, but + there used to be some cases other than the default, and there may be again + in future, so I haven't "optimized" it. 
*/ + + default: + if ((options & PCRE_EXTRA) != 0) switch(c) + { + default: + *errorptr = ERR3; + break; + } + break; + } + } + +*ptrptr = ptr; +return c; +} + + + +/************************************************* +* Check for counted repeat * +*************************************************/ + +/* This function is called when a '{' is encountered in a place where it might +start a quantifier. It looks ahead to see if it really is a quantifier or not. +It is only a quantifier if it is one of the forms {ddd} {ddd,} or {ddd,ddd} +where the ddds are digits. + +Arguments: + p pointer to the first char after '{' + cd pointer to char tables block + +Returns: TRUE or FALSE +*/ + +static BOOL +is_counted_repeat(const uschar *p, compile_data *cd) +{ +if ((cd->ctypes[*p++] & ctype_digit) == 0) return FALSE; +while ((cd->ctypes[*p] & ctype_digit) != 0) p++; +if (*p == '}') return TRUE; + +if (*p++ != ',') return FALSE; +if (*p == '}') return TRUE; + +if ((cd->ctypes[*p++] & ctype_digit) == 0) return FALSE; +while ((cd->ctypes[*p] & ctype_digit) != 0) p++; +return (*p == '}'); +} + + + +/************************************************* +* Read repeat counts * +*************************************************/ + +/* Read an item of the form {n,m} and return the values. This is called only +after is_counted_repeat() has confirmed that a repeat-count quantifier exists, +so the syntax is guaranteed to be correct, but we need to check the values. 
+ +Arguments: + p pointer to first char after '{' + minp pointer to int for min + maxp pointer to int for max + returned as -1 if no max + errorptr points to pointer to error message + cd pointer to character tables block + +Returns: pointer to '}' on success; + current ptr on error, with errorptr set +*/ + +static const uschar * +read_repeat_counts(const uschar *p, int *minp, int *maxp, + const char **errorptr, compile_data *cd) +{ +int min = 0; +int max = -1; + +while ((cd->ctypes[*p] & ctype_digit) != 0) min = min * 10 + *p++ - '0'; + +if (*p == '}') max = min; else + { + if (*(++p) != '}') + { + max = 0; + while((cd->ctypes[*p] & ctype_digit) != 0) max = max * 10 + *p++ - '0'; + if (max < min) + { + *errorptr = ERR4; + return p; + } + } + } + +/* Do paranoid checks, then fill in the required variables, and pass back the +pointer to the terminating '}'. */ + +if (min > 65535 || max > 65535) + *errorptr = ERR5; +else + { + *minp = min; + *maxp = max; + } +return p; +} + + + +/************************************************* +* Find the fixed length of a pattern * +*************************************************/ + +/* Scan a pattern and compute the fixed length of subject that will match it, +if the length is fixed. This is needed for dealing with backward assertions. + +Arguments: + code points to the start of the pattern (the bracket) + +Returns: the fixed length, or -1 if there is no fixed length +*/ + +static int +find_fixedlength(uschar *code) +{ +int length = -1; + +register int branchlength = 0; +register uschar *cc = code + 3; + +/* Scan along the opcodes for this branch. If we get to the end of the +branch, check the length against that of the other branches.
*/ + +for (;;) + { + int d; + register int op = *cc; + if (op >= OP_BRA) op = OP_BRA; + + switch (op) + { + case OP_BRA: + case OP_ONCE: + case OP_COND: + d = find_fixedlength(cc); + if (d < 0) return -1; + branchlength += d; + do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); + cc += 3; + break; + + /* Reached end of a branch; if it's a ket it is the end of a nested + call. If it's ALT it is an alternation in a nested call. If it is + END it's the end of the outer call. All can be handled by the same code. */ + + case OP_ALT: + case OP_KET: + case OP_KETRMAX: + case OP_KETRMIN: + case OP_END: + if (length < 0) length = branchlength; + else if (length != branchlength) return -1; + if (*cc != OP_ALT) return length; + cc += 3; + branchlength = 0; + break; + + /* Skip over assertive subpatterns */ + + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); + cc += 3; + break; + + /* Skip over things that don't match chars */ + + case OP_REVERSE: + cc++; + + case OP_CREF: + case OP_OPT: + cc++; + /* Fall through */ + + case OP_SOD: + case OP_EOD: + case OP_EODN: + case OP_CIRC: + case OP_DOLL: + case OP_NOT_WORD_BOUNDARY: + case OP_WORD_BOUNDARY: + cc++; + break; + + /* Handle char strings */ + + case OP_CHARS: + branchlength += *(++cc); + cc += *cc + 1; + break; + + /* Handle exact repetitions */ + + case OP_EXACT: + case OP_TYPEEXACT: + branchlength += (cc[1] << 8) + cc[2]; + cc += 4; + break; + + /* Handle single-char matchers */ + + case OP_NOT_DIGIT: + case OP_DIGIT: + case OP_NOT_WHITESPACE: + case OP_WHITESPACE: + case OP_NOT_WORDCHAR: + case OP_WORDCHAR: + case OP_ANY: + branchlength++; + cc++; + break; + + + /* Check a class for variable quantification */ + + case OP_CLASS: + cc += (*cc == OP_REF)? 
2 : 33; + + switch (*cc) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRQUERY: + case OP_CRMINQUERY: + return -1; + + case OP_CRRANGE: + case OP_CRMINRANGE: + if ((cc[1] << 8) + cc[2] != (cc[3] << 8) + cc[4]) return -1; + branchlength += (cc[1] << 8) + cc[2]; + cc += 5; + break; + + default: + branchlength++; + } + break; + + /* Anything else is variable length */ + + default: + return -1; + } + } +/* Control never gets here */ +} + + + + +/************************************************* +* Compile one branch * +*************************************************/ + +/* Scan the pattern, compiling it into the code vector. + +Arguments: + options the option bits + brackets points to number of brackets used + code points to the pointer to the current code point + ptrptr points to the current pattern pointer + errorptr points to pointer to error message + optchanged set to the value of the last OP_OPT item compiled + cd contains pointers to tables + +Returns: TRUE on success + FALSE, with *errorptr set on error +*/ + +static BOOL +compile_branch(int options, int *brackets, uschar **codeptr, + const uschar **ptrptr, const char **errorptr, int *optchanged, + compile_data *cd) +{ +int repeat_type, op_type; +int repeat_min, repeat_max; +int bravalue, length; +int greedy_default, greedy_non_default; +register int c; +register uschar *code = *codeptr; +uschar *tempcode; +const uschar *ptr = *ptrptr; +const uschar *tempptr; +uschar *previous = NULL; +uschar class[32]; + +/* Set up the default and non-default settings for greediness */ + +greedy_default = ((options & PCRE_UNGREEDY) != 0); +greedy_non_default = greedy_default ^ 1; + +/* Switch on next character until the end of the branch */ + +for (;; ptr++) + { + BOOL negate_class; + int class_charcount; + int class_lastchar; + int newoptions; + int condref; + + c = *ptr; + if ((options & PCRE_EXTENDED) != 0) + { + if ((cd->ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c 
!= '\n'); + continue; + } + } + + switch(c) + { + /* The branch terminates at end of string, |, or ). */ + + case 0: + case '|': + case ')': + *codeptr = code; + *ptrptr = ptr; + return TRUE; + + /* Handle single-character metacharacters */ + + case '^': + previous = NULL; + *code++ = OP_CIRC; + break; + + case '$': + previous = NULL; + *code++ = OP_DOLL; + break; + + case '.': + previous = code; + *code++ = OP_ANY; + break; + + /* Character classes. These always build a 32-byte bitmap of the permitted + characters, except in the special case where there is only one character. + For negated classes, we build the map as usual, then invert it at the end. + */ + + case '[': + previous = code; + *code++ = OP_CLASS; + + /* If the first character is '^', set the negation flag and skip it. */ + + if ((c = *(++ptr)) == '^') + { + negate_class = TRUE; + c = *(++ptr); + } + else negate_class = FALSE; + + /* Keep a count of chars so that we can optimize the case of just a single + character. */ + + class_charcount = 0; + class_lastchar = -1; + + /* Initialize the 32-char bit map to all zeros. We have to build the + map in a temporary bit of store, in case the class contains only 1 + character, because in that case the compiled code doesn't use the + bit map. */ + + memset(class, 0, 32 * sizeof(uschar)); + + /* Process characters until ] is reached. By writing this as a "do" it + means that an initial ] is taken as a data character. */ + + do + { + if (c == 0) + { + *errorptr = ERR6; + goto FAILED; + } + + /* Backslash may introduce a single character, or it may introduce one + of the specials, which just set a flag. Escaped items are checked for + validity in the pre-compiling pass. The sequence \b is a special case. + Inside a class (and only there) it is treated as backspace. Elsewhere + it marks a word boundary. Other escapes have preset maps ready to + or into the one we are building. 
We assume they have more than one + character in them, so set class_charcount bigger than one. */ + + if (c == '\\') + { + c = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd); + if (-c == ESC_b) c = '\b'; + else if (c < 0) + { + register const uschar *cbits = cd->cbits; + class_charcount = 10; + switch (-c) + { + case ESC_d: + for (c = 0; c < 32; c++) class[c] |= cbits[c+cbit_digit]; + continue; + + case ESC_D: + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+cbit_digit]; + continue; + + case ESC_w: + for (c = 0; c < 32; c++) + class[c] |= (cbits[c+cbit_digit] | cbits[c+cbit_word]); + continue; + + case ESC_W: + for (c = 0; c < 32; c++) + class[c] |= ~(cbits[c+cbit_digit] | cbits[c+cbit_word]); + continue; + + case ESC_s: + for (c = 0; c < 32; c++) class[c] |= cbits[c+cbit_space]; + continue; + + case ESC_S: + for (c = 0; c < 32; c++) class[c] |= ~cbits[c+cbit_space]; + continue; + + default: + *errorptr = ERR7; + goto FAILED; + } + } + /* Fall through if single character */ + } + + /* A single character may be followed by '-' to form a range. However, + Perl does not permit ']' to be the end of the range. A '-' character + here is treated as a literal. */ + + if (ptr[1] == '-' && ptr[2] != ']') + { + int d; + ptr += 2; + d = *ptr; + + if (d == 0) + { + *errorptr = ERR6; + goto FAILED; + } + + /* The second part of a range can be a single-character escape, but + not any of the other escapes. 
*/ + + if (d == '\\') + { + d = check_escape(&ptr, errorptr, *brackets, options, TRUE, cd); + if (d < 0) + { + if (d == -ESC_b) d = '\b'; else + { + *errorptr = ERR7; + goto FAILED; + } + } + } + + if (d < c) + { + *errorptr = ERR8; + goto FAILED; + } + + for (; c <= d; c++) + { + class[c/8] |= (1 << (c&7)); + if ((options & PCRE_CASELESS) != 0) + { + int uc = cd->fcc[c]; /* flip case */ + class[uc/8] |= (1 << (uc&7)); + } + class_charcount++; /* in case a one-char range */ + class_lastchar = c; + } + continue; /* Go get the next char in the class */ + } + + /* Handle a lone single character - we can get here for a normal + non-escape char, or after \ that introduces a single character. */ + + class [c/8] |= (1 << (c&7)); + if ((options & PCRE_CASELESS) != 0) + { + c = cd->fcc[c]; /* flip case */ + class[c/8] |= (1 << (c&7)); + } + class_charcount++; + class_lastchar = c; + } + + /* Loop until ']' reached; the check for end of string happens inside the + loop. This "while" is the end of the "do" above. */ + + while ((c = *(++ptr)) != ']'); + + /* If class_charcount is 1 and class_lastchar is not negative, we saw + precisely one character. This doesn't need the whole 32-byte bit map. + We turn it into a 1-character OP_CHARS if the class is positive, or + OP_NOT if it is negated. */ + + if (class_charcount == 1 && class_lastchar >= 0) + { + if (negate_class) + { + code[-1] = OP_NOT; + } + else + { + code[-1] = OP_CHARS; + *code++ = 1; + } + *code++ = class_lastchar; + } + + /* Otherwise, negate the 32-byte map if necessary, and copy it into + the code vector. 
*/ + + else + { + if (negate_class) + for (c = 0; c < 32; c++) code[c] = ~class[c]; + else + memcpy(code, class, 32); + code += 32; + } + break; + + /* Various kinds of repeat */ + + case '{': + if (!is_counted_repeat(ptr+1, cd)) goto NORMAL_CHAR; + ptr = read_repeat_counts(ptr+1, &repeat_min, &repeat_max, errorptr, cd); + if (*errorptr != NULL) goto FAILED; + goto REPEAT; + + case '*': + repeat_min = 0; + repeat_max = -1; + goto REPEAT; + + case '+': + repeat_min = 1; + repeat_max = -1; + goto REPEAT; + + case '?': + repeat_min = 0; + repeat_max = 1; + + REPEAT: + if (previous == NULL) + { + *errorptr = ERR9; + goto FAILED; + } + + /* If the next character is '?' this is a minimizing repeat, by default, + but if PCRE_UNGREEDY is set, it works the other way round. Advance to the + next character. */ + + if (ptr[1] == '?') + { repeat_type = greedy_non_default; ptr++; } + else repeat_type = greedy_default; + + /* If the maximum is zero then the minimum must also be zero; Perl allows + this case, so we do too - by simply omitting the item altogether. */ + + if (repeat_max == 0) code = previous; + + /* If previous was a string of characters, chop off the last one and use it + as the subject of the repeat. If there was only one character, we can + abolish the previous item altogether. */ + + else if (*previous == OP_CHARS) + { + int len = previous[1]; + if (len == 1) + { + c = previous[2]; + code = previous; + } + else + { + c = previous[len+1]; + previous[1]--; + code--; + } + op_type = 0; /* Use single-char op codes */ + goto OUTPUT_SINGLE_REPEAT; /* Code shared with single character types */ + } + + /* If previous was a single negated character ([^a] or similar), we use + one of the special opcodes, replacing it. The code is shared with single- + character repeats by adding a suitable offset into repeat_type. 
*/ + + else if ((int)*previous == OP_NOT) + { + op_type = OP_NOTSTAR - OP_STAR; /* Use "not" opcodes */ + c = previous[1]; + code = previous; + goto OUTPUT_SINGLE_REPEAT; + } + + /* If previous was a character type match (\d or similar), abolish it and + create a suitable repeat item. The code is shared with single-character + repeats by adding a suitable offset into repeat_type. */ + + else if ((int)*previous < OP_EODN || *previous == OP_ANY) + { + op_type = OP_TYPESTAR - OP_STAR; /* Use type opcodes */ + c = *previous; + code = previous; + + OUTPUT_SINGLE_REPEAT: + repeat_type += op_type; /* Combine both values for many cases */ + + /* A minimum of zero is handled either as the special case * or ?, or as + an UPTO, with the maximum given. */ + + if (repeat_min == 0) + { + if (repeat_max == -1) *code++ = OP_STAR + repeat_type; + else if (repeat_max == 1) *code++ = OP_QUERY + repeat_type; + else + { + *code++ = OP_UPTO + repeat_type; + *code++ = repeat_max >> 8; + *code++ = (repeat_max & 255); + } + } + + /* The case {1,} is handled as the special case + */ + + else if (repeat_min == 1 && repeat_max == -1) + *code++ = OP_PLUS + repeat_type; + + /* The case {n,n} is just an EXACT, while the general case {n,m} is + handled as an EXACT followed by an UPTO. An EXACT of 1 is optimized. */ + + else + { + if (repeat_min != 1) + { + *code++ = OP_EXACT + op_type; /* NB EXACT doesn't have repeat_type */ + *code++ = repeat_min >> 8; + *code++ = (repeat_min & 255); + } + + /* If the minimum is 1 and the previous item was a character string, + we either have to put back the item that got cancelled if the string + length was 1, or add the character back onto the end of a longer + string. For a character type nothing need be done; it will just get + put back naturally. Note that the final character is always going to + get added below. 
*/ + + else if (*previous == OP_CHARS) + { + if (code == previous) code += 2; else previous[1]++; + } + + /* For a single negated character we also have to put back the + item that got cancelled. */ + + else if (*previous == OP_NOT) code++; + + /* If the maximum is unlimited, insert an OP_STAR. */ + + if (repeat_max < 0) + { + *code++ = c; + *code++ = OP_STAR + repeat_type; + } + + /* Else insert an UPTO if the max is greater than the min. */ + + else if (repeat_max != repeat_min) + { + *code++ = c; + repeat_max -= repeat_min; + *code++ = OP_UPTO + repeat_type; + *code++ = repeat_max >> 8; + *code++ = (repeat_max & 255); + } + } + + /* The character or character type itself comes last in all cases. */ + + *code++ = c; + } + + /* If previous was a character class or a back reference, we put the repeat + stuff after it. */ + + else if (*previous == OP_CLASS || *previous == OP_REF) + { + if (repeat_min == 0 && repeat_max == -1) + *code++ = OP_CRSTAR + repeat_type; + else if (repeat_min == 1 && repeat_max == -1) + *code++ = OP_CRPLUS + repeat_type; + else if (repeat_min == 0 && repeat_max == 1) + *code++ = OP_CRQUERY + repeat_type; + else + { + *code++ = OP_CRRANGE + repeat_type; + *code++ = repeat_min >> 8; + *code++ = repeat_min & 255; + if (repeat_max == -1) repeat_max = 0; /* 2-byte encoding for max */ + *code++ = repeat_max >> 8; + *code++ = repeat_max & 255; + } + } + + /* If previous was a bracket group, we may have to replicate it in certain + cases. */ + + else if ((int)*previous >= OP_BRA || (int)*previous == OP_ONCE || + (int)*previous == OP_COND) + { + register int i; + int ketoffset = 0; + int len = code - previous; + uschar *bralink = NULL; + + /* If the maximum repeat count is unlimited, find the end of the bracket + by scanning through from the start, and compute the offset back to it + from the current code pointer. There may be an OP_OPT setting following + the final KET, so we can't find the end just by going back from the code + pointer. 
*/ + + if (repeat_max == -1) + { + register uschar *ket = previous; + do ket += (ket[1] << 8) + ket[2]; while (*ket != OP_KET); + ketoffset = code - ket; + } + + /* The case of a zero minimum is special because of the need to stick + OP_BRAZERO in front of it, and because the group appears once in the + data, whereas in other cases it appears the minimum number of times. For + this reason, it is simplest to treat this case separately, as otherwise + the code gets far too messy. There are several special subcases when the + minimum is zero. */ + + if (repeat_min == 0) + { + /* If the maximum is also zero, we just omit the group from the output + altogether. */ + + if (repeat_max == 0) + { + code = previous; + previous = NULL; + break; + } + + /* If the maximum is 1 or unlimited, we just have to stick in the + BRAZERO and do no more at this point. */ + + if (repeat_max <= 1) + { + memmove(previous+1, previous, len); + code++; + *previous++ = OP_BRAZERO + repeat_type; + } + + /* If the maximum is greater than 1 and limited, we have to replicate + in a nested fashion, sticking OP_BRAZERO before each set of brackets. + The first one has to be handled carefully because it's the original + copy, which has to be moved up. The remainder can be handled by code + that is common with the non-zero minimum case below. We just have to + adjust the value of repeat_max, since one less copy is required. */ + + else + { + int offset; + memmove(previous+4, previous, len); + code += 4; + *previous++ = OP_BRAZERO + repeat_type; + *previous++ = OP_BRA; + + /* We chain together the bracket offset fields that have to be + filled in later when the ends of the brackets are reached. */ + + offset = (bralink == NULL)? 
0 : previous - bralink; + bralink = previous; + *previous++ = offset >> 8; + *previous++ = offset & 255; + } + + repeat_max--; + } + + /* If the minimum is greater than zero, replicate the group as many + times as necessary, and adjust the maximum to the number of subsequent + copies that we need. */ + + else + { + for (i = 1; i < repeat_min; i++) + { + memcpy(code, previous, len); + code += len; + } + if (repeat_max > 0) repeat_max -= repeat_min; + } + + /* This code is common to both the zero and non-zero minimum cases. If + the maximum is limited, it replicates the group in a nested fashion, + remembering the bracket starts on a stack. In the case of a zero minimum, + the first one was set up above. In all cases the repeat_max now specifies + the number of additional copies needed. */ + + if (repeat_max >= 0) + { + for (i = repeat_max - 1; i >= 0; i--) + { + *code++ = OP_BRAZERO + repeat_type; + + /* All but the final copy start a new nesting, maintaining the + chain of brackets outstanding. */ + + if (i != 0) + { + int offset; + *code++ = OP_BRA; + offset = (bralink == NULL)? 0 : code - bralink; + bralink = code; + *code++ = offset >> 8; + *code++ = offset & 255; + } + + memcpy(code, previous, len); + code += len; + } + + /* Now chain through the pending brackets, and fill in their length + fields (which are holding the chain links pro tem). */ + + while (bralink != NULL) + { + int oldlinkoffset; + int offset = code - bralink + 1; + uschar *bra = code - offset; + oldlinkoffset = (bra[1] << 8) + bra[2]; + bralink = (oldlinkoffset == 0)? NULL : bralink - oldlinkoffset; + *code++ = OP_KET; + *code++ = bra[1] = offset >> 8; + *code++ = bra[2] = (offset & 255); + } + } + + /* If the maximum is unlimited, set a repeater in the final copy. We + can't just offset backwards from the current code point, because we + don't know if there's been an options resetting after the ket. The + correct offset was computed above. 
*/ + + else code[-ketoffset] = OP_KETRMAX + repeat_type; + + +#ifdef NEVER + /* If the minimum is greater than zero, and the maximum is unlimited or + equal to the minimum, the first copy remains where it is, and is + replicated up to the minimum number of times. This case includes the + + repeat, but of course no replication is needed in that case. */ + + if (repeat_min > 0 && (repeat_max == -1 || repeat_max == repeat_min)) + { + for (i = 1; i < repeat_min; i++) + { + memcpy(code, previous, len); + code += len; + } + } + + /* If the minimum is zero, stick BRAZERO in front of the first copy. + Then, if there is a fixed upper limit, replicate up to that many times, + sticking BRAZERO in front of all the optional ones. */ + + else + { + if (repeat_min == 0) + { + memmove(previous+1, previous, len); + code++; + *previous++ = OP_BRAZERO + repeat_type; + } + + for (i = 1; i < repeat_min; i++) + { + memcpy(code, previous, len); + code += len; + } + + for (i = (repeat_min > 0)? repeat_min : 1; i < repeat_max; i++) + { + *code++ = OP_BRAZERO + repeat_type; + memcpy(code, previous, len); + code += len; + } + } + + /* If the maximum is unlimited, set a repeater in the final copy. We + can't just offset backwards from the current code point, because we + don't know if there's been an options resetting after the ket. The + correct offset was computed above. */ + + if (repeat_max == -1) code[-ketoffset] = OP_KETRMAX + repeat_type; +#endif + + + } + + /* Else there's some kind of shambles */ + + else + { + *errorptr = ERR11; + goto FAILED; + } + + /* In all cases we no longer have a previous item. */ + + previous = NULL; + break; + + + /* Start of nested bracket sub-expression, or comment or lookahead or + lookbehind or option setting or condition. First deal with special things + that can come after a bracket; all are introduced by ?, and the appearance + of any of them means that this is not a referencing group. 
They were + checked for validity in the first pass over the string, so we don't have to + check for syntax errors here. */ + + case '(': + newoptions = options; + condref = -1; + + if (*(++ptr) == '?') + { + int set, unset; + int *optset; + + switch (*(++ptr)) + { + case '#': /* Comment; skip to ket */ + ptr++; + while (*ptr != ')') ptr++; + continue; + + case ':': /* Non-extracting bracket */ + bravalue = OP_BRA; + ptr++; + break; + + case '(': + bravalue = OP_COND; /* Conditional group */ + if ((cd->ctypes[*(++ptr)] & ctype_digit) != 0) + { + condref = *ptr - '0'; + while (*(++ptr) != ')') condref = condref*10 + *ptr - '0'; + ptr++; + } + else ptr--; + break; + + case '=': /* Positive lookahead */ + bravalue = OP_ASSERT; + ptr++; + break; + + case '!': /* Negative lookahead */ + bravalue = OP_ASSERT_NOT; + ptr++; + break; + + case '<': /* Lookbehinds */ + switch (*(++ptr)) + { + case '=': /* Positive lookbehind */ + bravalue = OP_ASSERTBACK; + ptr++; + break; + + case '!': /* Negative lookbehind */ + bravalue = OP_ASSERTBACK_NOT; + ptr++; + break; + + default: /* Syntax error */ + *errorptr = ERR24; + goto FAILED; + } + break; + + case '>': /* One-time brackets */ + bravalue = OP_ONCE; + ptr++; + break; + + default: /* Option setting */ + set = unset = 0; + optset = &set; + + while (*ptr != ')' && *ptr != ':') + { + switch (*ptr++) + { + case '-': optset = &unset; break; + + case 'i': *optset |= PCRE_CASELESS; break; + case 'm': *optset |= PCRE_MULTILINE; break; + case 's': *optset |= PCRE_DOTALL; break; + case 'x': *optset |= PCRE_EXTENDED; break; + case 'U': *optset |= PCRE_UNGREEDY; break; + case 'X': *optset |= PCRE_EXTRA; break; + + default: + *errorptr = ERR12; + goto FAILED; + } + } + + /* Set up the changed option bits, but don't change anything yet. */ + + newoptions = (options | set) & (~unset); + + /* If the options ended with ')' this is not the start of a nested + group with option changes, so the options change at this level. 
At top + level there is nothing else to be done (the options will in fact have + been set from the start of compiling as a result of the first pass) but + at an inner level we must compile code to change the ims options if + necessary, and pass the new setting back so that it can be put at the + start of any following branches, and when this group ends, a resetting + item can be compiled. */ + + if (*ptr == ')') + { + if ((options & PCRE_INGROUP) != 0 && + (options & PCRE_IMS) != (newoptions & PCRE_IMS)) + { + *code++ = OP_OPT; + *code++ = *optchanged = newoptions & PCRE_IMS; + } + options = newoptions; /* Change options at this level */ + previous = NULL; /* This item can't be repeated */ + continue; /* It is complete */ + } + + /* If the options ended with ':' we are heading into a nested group + with possible change of options. Such groups are non-capturing and are + not assertions of any kind. All we need to do is skip over the ':'; + the newoptions value is handled below. */ + + bravalue = OP_BRA; + ptr++; + } + } + + /* Else we have a referencing group; adjust the opcode. */ + + else + { + if (++(*brackets) > EXTRACT_MAX) + { + *errorptr = ERR13; + goto FAILED; + } + bravalue = OP_BRA + *brackets; + } + + /* Process nested bracketed re. Assertions may not be repeated, but other + kinds can be. We copy code into a non-register variable in order to be able + to pass its address because some compilers complain otherwise. Pass in a + new setting for the ims options if they have changed. */ + + previous = (bravalue >= OP_ONCE)? code : NULL; + *code = bravalue; + tempcode = code; + + if (!compile_regex( + options | PCRE_INGROUP, /* Set for all nested groups */ + ((options & PCRE_IMS) != (newoptions & PCRE_IMS))? 
+ newoptions & PCRE_IMS : -1, /* Pass ims options if changed */ + brackets, /* Bracket level */ + &tempcode, /* Where to put code (updated) */ + &ptr, /* Input pointer (updated) */ + errorptr, /* Where to put an error message */ + (bravalue == OP_ASSERTBACK || + bravalue == OP_ASSERTBACK_NOT), /* TRUE if back assert */ + condref, /* Condition reference number */ + cd)) /* Tables block */ + goto FAILED; + + /* At the end of compiling, code is still pointing to the start of the + group, while tempcode has been updated to point past the end of the group + and any option resetting that may follow it. The pattern pointer (ptr) + is on the bracket. */ + + /* If this is a conditional bracket, check that there are no more than + two branches in the group. */ + + if (bravalue == OP_COND) + { + int branchcount = 0; + uschar *tc = code; + + do { + branchcount++; + tc += (tc[1] << 8) | tc[2]; + } + while (*tc != OP_KET); + + if (branchcount > 2) + { + *errorptr = ERR27; + goto FAILED; + } + } + + /* Now update the main code pointer to the end of the group. */ + + code = tempcode; + + /* Error if hit end of pattern */ + + if (*ptr != ')') + { + *errorptr = ERR14; + goto FAILED; + } + break; + + /* Check \ for being a real metacharacter; if not, fall through and handle + it as a data character at the start of a string. Escape items are checked + for validity in the pre-compiling pass. */ + + case '\\': + tempptr = ptr; + c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd); + + /* Handle metacharacters introduced by \. For ones like \d, the ESC_ values + are arranged to be the negation of the corresponding OP_values. For the + back references, the values are ESC_REF plus the reference number. Only + back references and those types that consume a character may be repeated. + We can test for values between ESC_b and ESC_Z for the latter; this may + have to change if any new ones are ever created. 
*/ + + if (c < 0) + { + if (-c >= ESC_REF) + { + previous = code; + *code++ = OP_REF; + *code++ = -c - ESC_REF; + } + else + { + previous = (-c > ESC_b && -c < ESC_Z)? code : NULL; + *code++ = -c; + } + continue; + } + + /* Data character: reset and fall through */ + + ptr = tempptr; + c = '\\'; + + /* Handle a run of data characters until a metacharacter is encountered. + The first character is guaranteed not to be whitespace or # when the + extended flag is set. */ + + NORMAL_CHAR: + default: + previous = code; + *code = OP_CHARS; + code += 2; + length = 0; + + do + { + if ((options & PCRE_EXTENDED) != 0) + { + if ((cd->ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c != '\n'); + if (c == 0) break; + continue; + } + } + + /* Backslash may introduce a data char or a metacharacter. Escaped items + are checked for validity in the pre-compiling pass. Stop the string + before a metaitem. */ + + if (c == '\\') + { + tempptr = ptr; + c = check_escape(&ptr, errorptr, *brackets, options, FALSE, cd); + if (c < 0) { ptr = tempptr; break; } + } + + /* Ordinary character or single-char escape */ + + *code++ = c; + length++; + } + + /* This "while" is the end of the "do" above. */ + + while (length < 255 && (cd->ctypes[c = *(++ptr)] & ctype_meta) == 0); + + /* Compute the length and set it in the data vector, and advance to + the next state. */ + + previous[1] = length; + if (length < 255) ptr--; + break; + } + } /* end of big loop */ + +/* Control never reaches here by falling through, only by a goto for all the +error states. Pass back the position in the pattern so that it can be displayed +to the user for diagnosing the error. 
*/ + +FAILED: +*ptrptr = ptr; +return FALSE; +} + + + + +/************************************************* +* Compile sequence of alternatives * +*************************************************/ + +/* On entry, ptr is pointing past the bracket character, but on return +it points to the closing bracket, or vertical bar, or end of string. +The code variable is pointing at the byte into which the BRA operator has been +stored. If the ims options are changed at the start (for a (?ims: group) or +during any branch, we need to insert an OP_OPT item at the start of every +following branch to ensure they get set correctly at run time, and also pass +the new options into every subsequent branch compile. + +Arguments: + options the option bits + optchanged new ims options to set as if (?ims) were at the start, or -1 + for no change + brackets -> int containing the number of extracting brackets used + codeptr -> the address of the current code pointer + ptrptr -> the address of the current pattern pointer + errorptr -> pointer to error message + lookbehind TRUE if this is a lookbehind assertion + condref > 0 for OP_CREF setting at start of conditional group + cd points to the data block with tables pointers + +Returns: TRUE on success +*/ + +static BOOL +compile_regex(int options, int optchanged, int *brackets, uschar **codeptr, + const uschar **ptrptr, const char **errorptr, BOOL lookbehind, int condref, + compile_data *cd) +{ +const uschar *ptr = *ptrptr; +uschar *code = *codeptr; +uschar *last_branch = code; +uschar *start_bracket = code; +uschar *reverse_count = NULL; +int oldoptions = options & PCRE_IMS; + +code += 3; + +/* At the start of a reference-based conditional group, insert the reference +number as an OP_CREF item. 
*/ + +if (condref > 0) + { + *code++ = OP_CREF; + *code++ = condref; + } + +/* Loop for each alternative branch */ + +for (;;) + { + int length; + + /* Handle change of options */ + + if (optchanged >= 0) + { + *code++ = OP_OPT; + *code++ = optchanged; + options = (options & ~PCRE_IMS) | optchanged; + } + + /* Set up dummy OP_REVERSE if lookbehind assertion */ + + if (lookbehind) + { + *code++ = OP_REVERSE; + reverse_count = code; + *code++ = 0; + *code++ = 0; + } + + /* Now compile the branch */ + + if (!compile_branch(options,brackets,&code,&ptr,errorptr,&optchanged,cd)) + { + *ptrptr = ptr; + return FALSE; + } + + /* Fill in the length of the last branch */ + + length = code - last_branch; + last_branch[1] = length >> 8; + last_branch[2] = length & 255; + + /* If lookbehind, check that this branch matches a fixed-length string, + and put the length into the OP_REVERSE item. Temporarily mark the end of + the branch with OP_END. */ + + if (lookbehind) + { + *code = OP_END; + length = find_fixedlength(last_branch); + DPRINTF(("fixed length = %d\n", length)); + if (length < 0) + { + *errorptr = ERR25; + *ptrptr = ptr; + return FALSE; + } + reverse_count[0] = (length >> 8); + reverse_count[1] = length & 255; + } + + /* Reached end of expression, either ')' or end of pattern. Insert a + terminating ket and the length of the whole bracketed item, and return, + leaving the pointer at the terminating char. If any of the ims options + were changed inside the group, compile a resetting op-code following. */ + + if (*ptr != '|') + { + length = code - start_bracket; + *code++ = OP_KET; + *code++ = length >> 8; + *code++ = length & 255; + if (optchanged >= 0) + { + *code++ = OP_OPT; + *code++ = oldoptions; + } + *codeptr = code; + *ptrptr = ptr; + return TRUE; + } + + /* Another branch follows; insert an "or" node and advance the pointer. 
*/ + + *code = OP_ALT; + last_branch = code; + code += 3; + ptr++; + } +/* Control never reaches here */ +} + + + + +/************************************************* +* Find first significant op code * +*************************************************/ + +/* This is called by several functions that scan a compiled expression looking +for a fixed first character, or an anchoring op code etc. It skips over things +that do not influence this. For one application, a change of caseless option is +important. + +Arguments: + code pointer to the start of the group + options pointer to external options + optbit the option bit whose changing is significant, or + zero if none are + optstop TRUE to return on option change, otherwise change the options + value and continue + +Returns: pointer to the first significant opcode +*/ + +static const uschar* +first_significant_code(const uschar *code, int *options, int optbit, + BOOL optstop) +{ +for (;;) + { + switch ((int)*code) + { + case OP_OPT: + if (optbit > 0 && ((int)code[1] & optbit) != (*options & optbit)) + { + if (optstop) return code; + *options = (int)code[1]; + } + code += 2; + break; + + case OP_CREF: + code += 2; + break; + + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + do code += (code[1] << 8) + code[2]; while (*code == OP_ALT); + code += 3; + break; + + default: + return code; + } + } +/* Control never reaches here */ +} + + + + +/************************************************* +* Check for anchored expression * +*************************************************/ + +/* Try to find out if this is an anchored regular expression. Consider each +alternative branch. If they all start with OP_SOD or OP_CIRC, or with a bracket +all of whose alternatives start with OP_SOD or OP_CIRC (recurse ad lib), then +it's anchored. However, if this is a multiline pattern, then only OP_SOD +counts, since OP_CIRC can match in the middle. 
+ +A branch is also implicitly anchored if it starts with .* and DOTALL is set, +because that will try the rest of the pattern at all possible matching points, +so there is no point trying them again. + +Arguments: + code points to start of expression (the bracket) + options points to the options setting + +Returns: TRUE or FALSE +*/ + +static BOOL +is_anchored(register const uschar *code, int *options) +{ +do { + const uschar *scode = first_significant_code(code + 3, options, + PCRE_MULTILINE, FALSE); + register int op = *scode; + if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND) + { if (!is_anchored(scode, options)) return FALSE; } + else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR) && + (*options & PCRE_DOTALL) != 0) + { if (scode[1] != OP_ANY) return FALSE; } + else if (op != OP_SOD && + ((*options & PCRE_MULTILINE) != 0 || op != OP_CIRC)) + return FALSE; + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); +return TRUE; +} + + + +/************************************************* +* Check for starting with ^ or .* * +*************************************************/ + +/* This is called to find out if every branch starts with ^ or .* so that +"first char" processing can be done to speed things up in multiline +matching and for non-DOTALL patterns that start with .* (which must start at +the beginning or after \n). 
+ +Argument: points to start of expression (the bracket) +Returns: TRUE or FALSE +*/ + +static BOOL +is_startline(const uschar *code) +{ +do { + const uschar *scode = first_significant_code(code + 3, NULL, 0, FALSE); + register int op = *scode; + if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE || op == OP_COND) + { if (!is_startline(scode)) return FALSE; } + else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR) + { if (scode[1] != OP_ANY) return FALSE; } + else if (op != OP_CIRC) return FALSE; + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); +return TRUE; +} + + + +/************************************************* +* Check for fixed first char * +*************************************************/ + +/* Try to find out if there is a fixed first character. This is called for +unanchored expressions, as it speeds up their processing quite considerably. +Consider each alternative branch. If they all start with the same char, or with +a bracket all of whose alternatives start with the same char (recurse ad lib), +then we return that char, otherwise -1. 
+ +Arguments: + code points to start of expression (the bracket) + options pointer to the options (used to check casing changes) + +Returns: -1 or the fixed first char +*/ + +static int +find_firstchar(const uschar *code, int *options) +{ +register int c = -1; +do { + int d; + const uschar *scode = first_significant_code(code + 3, options, + PCRE_CASELESS, TRUE); + register int op = *scode; + + if (op >= OP_BRA) op = OP_BRA; + + switch(op) + { + default: + return -1; + + case OP_BRA: + case OP_ASSERT: + case OP_ONCE: + case OP_COND: + if ((d = find_firstchar(scode, options)) < 0) return -1; + if (c < 0) c = d; else if (c != d) return -1; + break; + + case OP_EXACT: /* Fall through */ + scode++; + + case OP_CHARS: /* Fall through */ + scode++; + + case OP_PLUS: + case OP_MINPLUS: + if (c < 0) c = scode[1]; else if (c != scode[1]) return -1; + break; + } + + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); +return c; +} + + + + + +/************************************************* +* Compile a Regular Expression * +*************************************************/ + +/* This function takes a string and returns a pointer to a block of store +holding a compiled version of the expression. 
+ +Arguments: + pattern the regular expression + options various option bits + errorptr pointer to pointer to error text + erroroffset ptr offset in pattern where error was detected + tables pointer to character tables or NULL + +Returns: pointer to compiled data block, or NULL on error, + with errorptr and erroroffset set +*/ + +pcre * +pcre_compile(const char *pattern, int options, const char **errorptr, + int *erroroffset, const unsigned char *tables) +{ +real_pcre *re; +int length = 3; /* For initial BRA plus length */ +int runlength; +int c, size; +int bracount = 0; +int top_backref = 0; +int branch_extra = 0; +int branch_newextra; +unsigned int brastackptr = 0; +uschar *code; +const uschar *ptr; +compile_data compile_block; +int brastack[BRASTACK_SIZE]; +uschar bralenstack[BRASTACK_SIZE]; + +#ifdef DEBUG +uschar *code_base, *code_end; +#endif + +/* We can't pass back an error message if errorptr is NULL; I guess the best we +can do is just return NULL. */ + +if (errorptr == NULL) return NULL; +*errorptr = NULL; + +/* However, we can give a message for this error */ + +if (erroroffset == NULL) + { + *errorptr = ERR16; + return NULL; + } +*erroroffset = 0; + +if ((options & ~PUBLIC_OPTIONS) != 0) + { + *errorptr = ERR17; + return NULL; + } + +/* Set up pointers to the individual character tables */ + +if (tables == NULL) tables = pcre_default_tables; +compile_block.lcc = tables + lcc_offset; +compile_block.fcc = tables + fcc_offset; +compile_block.cbits = tables + cbits_offset; +compile_block.ctypes = tables + ctypes_offset; + +/* Reflect pattern for debugging output */ + +DPRINTF(("------------------------------------------------------------------\n")); +DPRINTF(("%s\n", pattern)); + +/* The first thing to do is to make a pass over the pattern to compute the +amount of store required to hold the compiled code. This does not have to be +perfect as long as errors are overestimates. At the same time we can detect any +internal flag settings. 
Make an attempt to correct for any counted white space +if an "extended" flag setting appears late in the pattern. We can't be so +clever for #-comments. */ + +ptr = (const uschar *)(pattern - 1); +while ((c = *(++ptr)) != 0) + { + int min, max; + int class_charcount; + + if ((options & PCRE_EXTENDED) != 0) + { + if ((compile_block.ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c != '\n'); + continue; + } + } + + switch(c) + { + /* A backslashed item may be an escaped "normal" character or a + character type. For a "normal" character, put the pointers and + character back so that tests for whitespace etc. in the input + are done correctly. */ + + case '\\': + { + const uschar *save_ptr = ptr; + c = check_escape(&ptr, errorptr, bracount, options, FALSE, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (c >= 0) + { + ptr = save_ptr; + c = '\\'; + goto NORMAL_CHAR; + } + } + length++; + + /* A back reference needs an additional char, plus either one or 5 + bytes for a repeat. We also need to keep the value of the highest + back reference. */ + + if (c <= -ESC_REF) + { + int refnum = -c - ESC_REF; + if (refnum > top_backref) top_backref = refnum; + length++; /* For single back reference */ + if (ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else length += 5; + if (ptr[1] == '?') ptr++; + } + } + continue; + + case '^': + case '.': + case '$': + case '*': /* These repeats won't be after brackets; */ + case '+': /* those are handled separately */ + case '?': + length++; + continue; + + /* This covers the cases of repeats after a single char, metachar, class, + or back reference. 
*/ + + case '{': + if (!is_counted_repeat(ptr+1, &compile_block)) goto NORMAL_CHAR; + ptr = read_repeat_counts(ptr+1, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else + { + length--; /* Uncount the original char or metachar */ + if (min == 1) length++; else if (min > 0) length += 4; + if (max > 0) length += 4; else length += 2; + } + if (ptr[1] == '?') ptr++; + continue; + + /* An alternation contains an offset to the next branch or ket. If any ims + options changed in the previous branch(es), and/or if we are in a + lookbehind assertion, extra space will be needed at the start of the + branch. This is handled by branch_extra. */ + + case '|': + length += 3 + branch_extra; + continue; + + /* A character class uses 33 characters. Don't worry about character types + that aren't allowed in classes - they'll get picked up during the compile. + A character class that contains only one character uses 2 or 3 bytes, + depending on whether it is negated or not. Notice this where we can. */ + + case '[': + class_charcount = 0; + if (*(++ptr) == '^') ptr++; + do + { + if (*ptr == '\\') + { + int ch = check_escape(&ptr, errorptr, bracount, options, TRUE, + &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (-ch == ESC_b) class_charcount++; else class_charcount = 10; + } + else class_charcount++; + ptr++; + } + while (*ptr != 0 && *ptr != ']'); + + /* Repeats for negated single chars are handled by the general code */ + + if (class_charcount == 1) length += 3; else + { + length += 33; + + /* A repeat needs either 1 or 5 bytes. 
*/ + + if (*ptr != 0 && ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else length += 5; + if (ptr[1] == '?') ptr++; + } + } + continue; + + /* Brackets may be genuine groups or special things */ + + case '(': + branch_newextra = 0; + + /* Handle special forms of bracket, which all start (? */ + + if (ptr[1] == '?') + { + int set, unset; + int *optset; + + switch (c = ptr[2]) + { + /* Skip over comments entirely */ + case '#': + ptr += 3; + while (*ptr != 0 && *ptr != ')') ptr++; + if (*ptr == 0) + { + *errorptr = ERR18; + goto PCRE_ERROR_RETURN; + } + continue; + + /* Non-referencing groups and lookaheads just move the pointer on, and + then behave like a non-special bracket, except that they don't increment + the count of extracting brackets. Ditto for the "once only" bracket, + which is in Perl from version 5.005. */ + + case ':': + case '=': + case '!': + case '>': + ptr += 2; + break; + + /* Lookbehinds are in Perl from version 5.005 */ + + case '<': + if (ptr[3] == '=' || ptr[3] == '!') + { + ptr += 3; + branch_newextra = 3; + length += 3; /* For the first branch */ + break; + } + *errorptr = ERR24; + goto PCRE_ERROR_RETURN; + + /* Conditionals are in Perl from version 5.005. The bracket must either + be followed by a number (for bracket reference) or by an assertion + group. */ + + case '(': + if ((compile_block.ctypes[ptr[3]] & ctype_digit) != 0) + { + ptr += 4; + length += 2; + while ((compile_block.ctypes[*ptr] & ctype_digit) != 0) ptr++; + if (*ptr != ')') + { + *errorptr = ERR26; + goto PCRE_ERROR_RETURN; + } + } + else /* An assertion must follow */ + { + ptr++; /* Can treat like ':' as far as spacing is concerned */ + + if (ptr[2] != '?' 
|| strchr("=!<", ptr[3]) == NULL) + { + ptr += 2; /* To get right offset in message */ + *errorptr = ERR28; + goto PCRE_ERROR_RETURN; + } + } + break; + + /* Else loop checking valid options until ) is met. Anything else is an + error. If we are without any brackets, i.e. at top level, the settings + act as if specified in the options, so massage the options immediately. + This is for backward compatibility with Perl 5.004. */ + + default: + set = unset = 0; + optset = &set; + ptr += 2; + + for (;; ptr++) + { + c = *ptr; + switch (c) + { + case 'i': + *optset |= PCRE_CASELESS; + continue; + + case 'm': + *optset |= PCRE_MULTILINE; + continue; + + case 's': + *optset |= PCRE_DOTALL; + continue; + + case 'x': + *optset |= PCRE_EXTENDED; + continue; + + case 'X': + *optset |= PCRE_EXTRA; + continue; + + case 'U': + *optset |= PCRE_UNGREEDY; + continue; + + case '-': + optset = &unset; + continue; + + /* A termination by ')' indicates an options-setting-only item; + this is global at top level; otherwise nothing is done here and + it is handled during the compiling process on a per-bracket-group + basis. */ + + case ')': + if (brastackptr == 0) + { + options = (options | set) & (~unset); + set = unset = 0; /* To save length */ + } + /* Fall through */ + + /* A termination by ':' indicates the start of a nested group with + the given options set. This is again handled at compile time, but + we must allow for compiled space if any of the ims options are + set. We also have to allow for resetting space at the end of + the group, which is why 4 is added to the length and not just 2. + If there are several changes of options within the same group, this + will lead to an over-estimate on the length, but this shouldn't + matter very much. We also have to allow for resetting options at + the start of any alternations, which we do by setting + branch_newextra to 2. 
*/ + + case ':': + if (((set|unset) & PCRE_IMS) != 0) + { + length += 4; + branch_newextra = 2; + } + goto END_OPTIONS; + + /* Unrecognized option character */ + + default: + *errorptr = ERR12; + goto PCRE_ERROR_RETURN; + } + } + + /* If we hit a closing bracket, that's it - this is a freestanding + option-setting. We need to ensure that branch_extra is updated if + necessary. The only values branch_newextra can have here are 0 or 2. + If the value is 2, then branch_extra must either be 2 or 5, depending + on whether this is a lookbehind group or not. */ + + END_OPTIONS: + if (c == ')') + { + if (branch_newextra == 2 && (branch_extra == 0 || branch_extra == 3)) + branch_extra += branch_newextra; + continue; + } + + /* If options were terminated by ':' control comes here. Fall through + to handle the group below. */ + } + } + + /* Extracting brackets must be counted so we can process escapes in a + Perlish way. */ + + else bracount++; + + /* Non-special forms of bracket. Save length for computing whole length + at end if there's a repeat that requires duplication of the group. Also + save the current value of branch_extra, and start the new group with + the new value. If non-zero, this will either be 2 for a (?imsx: group, or 3 + for a lookbehind assertion. */ + + if (brastackptr >= sizeof(brastack)/sizeof(int)) + { + *errorptr = ERR19; + goto PCRE_ERROR_RETURN; + } + + bralenstack[brastackptr] = branch_extra; + branch_extra = branch_newextra; + + brastack[brastackptr++] = length; + length += 3; + continue; + + /* Handle ket. Look for subsequent max/min; for certain sets of values we + have to replicate this bracket up to that many times. If brastackptr is + 0 this is an unmatched bracket which will generate an error, but take care + not to try to access brastack[-1] when computing the length and restoring + the branch_extra value. 
*/ + + case ')': + length += 3; + { + int minval = 1; + int maxval = 1; + int duplength; + + if (brastackptr > 0) + { + duplength = length - brastack[--brastackptr]; + branch_extra = bralenstack[brastackptr]; + } + else duplength = 0; + + /* Leave ptr at the final char; for read_repeat_counts this happens + automatically; for the others we need an increment. */ + + if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2, &compile_block)) + { + ptr = read_repeat_counts(ptr+2, &minval, &maxval, errorptr, + &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + } + else if (c == '*') { minval = 0; maxval = -1; ptr++; } + else if (c == '+') { maxval = -1; ptr++; } + else if (c == '?') { minval = 0; ptr++; } + + /* If the minimum is zero, we have to allow for an OP_BRAZERO before the + group, and if the maximum is greater than zero, we have to replicate + maxval-1 times; each replication acquires an OP_BRAZERO plus a nesting + bracket set - hence the 7. */ + + if (minval == 0) + { + length++; + if (maxval > 0) length += (maxval - 1) * (duplength + 7); + } + + /* When the minimum is greater than zero, we have to replicate up to + minval-1 times, with no additions required in the copies. Then, if + there is a limited maximum we have to replicate up to maxval-1 times + allowing for a BRAZERO item before each optional copy and nesting + brackets for all but one of the optional copies. */ + + else + { + length += (minval - 1) * duplength; + if (maxval > minval) /* Need this test as maxval=-1 means no limit */ + length += (maxval - minval) * (duplength + 7) - 6; + } + } + continue; + + /* Non-special character. For a run of such characters the length required + is the number of characters + 2, except that the maximum run length is 255. + We won't get a skipped space or a non-data escape or the start of a # + comment as the first character, so the length can't be zero. 
*/ + + NORMAL_CHAR: + default: + length += 2; + runlength = 0; + do + { + if ((options & PCRE_EXTENDED) != 0) + { + if ((compile_block.ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c != '\n'); + continue; + } + } + + /* Backslash may introduce a data char or a metacharacter; stop the + string before the latter. */ + + if (c == '\\') + { + const uschar *saveptr = ptr; + c = check_escape(&ptr, errorptr, bracount, options, FALSE, + &compile_block); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (c < 0) { ptr = saveptr; break; } + } + + /* Ordinary character or single-char escape */ + + runlength++; + } + + /* This "while" is the end of the "do" above. */ + + while (runlength < 255 && + (compile_block.ctypes[c = *(++ptr)] & ctype_meta) == 0); + + ptr--; + length += runlength; + continue; + } + } + +length += 4; /* For final KET and END */ + +if (length > 65539) + { + *errorptr = ERR20; + return NULL; + } + +/* Compute the size of data block needed and get it, either from malloc or +externally provided function. We specify "code[0]" in the offsetof() expression +rather than just "code", because it has been reported that one broken compiler +fails on "code" because it is also an independent variable. It should make no +difference to the value of the offsetof(). */ + +size = length + offsetof(real_pcre, code[0]); +re = (real_pcre *)(pcre_malloc)(size); + +if (re == NULL) + { + *errorptr = ERR21; + return NULL; + } + +/* Put in the magic number and the options. */ + +re->magic_number = MAGIC_NUMBER; +re->options = options; +re->tables = tables; + +/* Set up a starting, non-extracting bracket, then compile the expression. On +error, *errorptr will be set non-NULL, so we don't need to look at the result +of the function here. 
*/ + +ptr = (const uschar *)pattern; +code = re->code; +*code = OP_BRA; +bracount = 0; +(void)compile_regex(options, -1, &bracount, &code, &ptr, errorptr, FALSE, -1, + &compile_block); +re->top_bracket = bracount; +re->top_backref = top_backref; + +/* If not reached end of pattern on success, there's an excess bracket. */ + +if (*errorptr == NULL && *ptr != 0) *errorptr = ERR22; + +/* Fill in the terminating state and check for disastrous overflow, but +if debugging, leave the test till after things are printed out. */ + +*code++ = OP_END; + +#ifndef DEBUG +if (code - re->code > length) *errorptr = ERR23; +#endif + +/* Give an error if there's a back reference to a non-existent capturing +subpattern. */ + +if (top_backref > re->top_bracket) *errorptr = ERR15; + +/* Failed to compile */ + +if (*errorptr != NULL) + { + (pcre_free)(re); + PCRE_ERROR_RETURN: + *erroroffset = ptr - (const uschar *)pattern; + return NULL; + } + +/* If the anchored option was not passed, set flag if we can determine that the +pattern is anchored by virtue of ^ characters or \A or anything else (such as +starting with .* when DOTALL is set). + +Otherwise, see if we can determine what the first character has to be, because +that speeds up unanchored matches no end. If not, see if we can set the +PCRE_STARTLINE flag. This is helpful for multiline matches when all branches +start with ^, and also when all branches start with .* for non-DOTALL matches. 
+*/ + +if ((options & PCRE_ANCHORED) == 0) + { + int temp_options = options; + if (is_anchored(re->code, &temp_options)) + re->options |= PCRE_ANCHORED; + else + { + int ch = find_firstchar(re->code, &temp_options); + if (ch >= 0) + { + re->first_char = ch; + re->options |= PCRE_FIRSTSET; + } + else if (is_startline(re->code)) + re->options |= PCRE_STARTLINE; + } + } + +/* Print out the compiled data for debugging */ + +#ifdef DEBUG + +printf("Length = %d top_bracket = %d top_backref = %d\n", + length, re->top_bracket, re->top_backref); + +if (re->options != 0) + { + printf("%s%s%s%s%s%s%s%s\n", + ((re->options & PCRE_ANCHORED) != 0)? "anchored " : "", + ((re->options & PCRE_CASELESS) != 0)? "caseless " : "", + ((re->options & PCRE_EXTENDED) != 0)? "extended " : "", + ((re->options & PCRE_MULTILINE) != 0)? "multiline " : "", + ((re->options & PCRE_DOTALL) != 0)? "dotall " : "", + ((re->options & PCRE_DOLLAR_ENDONLY) != 0)? "endonly " : "", + ((re->options & PCRE_EXTRA) != 0)? "extra " : "", + ((re->options & PCRE_UNGREEDY) != 0)? 
"ungreedy " : ""); + } + +if ((re->options & PCRE_FIRSTSET) != 0) + { + if (isprint(re->first_char)) printf("First char = %c\n", re->first_char); + else printf("First char = \\x%02x\n", re->first_char); + } + +code_end = code; +code_base = code = re->code; + +while (code < code_end) + { + int charlength; + + printf("%3d ", code - code_base); + + if (*code >= OP_BRA) + { + printf("%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); + code += 2; + } + + else switch(*code) + { + case OP_OPT: + printf(" %.2x %s", code[1], OP_names[*code]); + code++; + break; + + case OP_COND: + printf("%3d Cond", (code[1] << 8) + code[2]); + code += 2; + break; + + case OP_CREF: + printf(" %.2d %s", code[1], OP_names[*code]); + code++; + break; + + case OP_CHARS: + charlength = *(++code); + printf("%3d ", charlength); + while (charlength-- > 0) + if (isprint(c = *(++code))) printf("%c", c); else printf("\\x%02x", c); + break; + + case OP_KETRMAX: + case OP_KETRMIN: + case OP_ALT: + case OP_KET: + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + case OP_ONCE: + printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_REVERSE: + printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + if (*code >= OP_TYPESTAR) + printf(" %s", OP_names[code[1]]); + else if (isprint(c = code[1])) printf(" %c", c); + else printf(" \\x%02x", c); + printf("%s", OP_names[*code++]); + break; + + case OP_EXACT: + case OP_UPTO: + case OP_MINUPTO: + if (isprint(c = code[3])) printf(" %c{", c); + else printf(" \\x%02x{", c); + if (*code != OP_EXACT) printf("0,"); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_MINUPTO) printf("?"); + code += 3; + break; + + 
case OP_TYPEEXACT: + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + printf(" %s{", OP_names[code[3]]); + if (*code != OP_TYPEEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_TYPEMINUPTO) printf("?"); + code += 3; + break; + + case OP_NOT: + if (isprint(c = *(++code))) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + break; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + if (isprint(c = code[1])) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + printf("%s", OP_names[*code++]); + break; + + case OP_NOTEXACT: + case OP_NOTUPTO: + case OP_NOTMINUPTO: + if (isprint(c = code[3])) printf(" [^%c]{", c); + else printf(" [^\\x%02x]{", c); + if (*code != OP_NOTEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_NOTMINUPTO) printf("?"); + code += 3; + break; + + case OP_REF: + printf(" \\%d", *(++code)); + code ++; + goto CLASS_REF_REPEAT; + + case OP_CLASS: + { + int i, min, max; + code++; + printf(" ["); + + for (i = 0; i < 256; i++) + { + if ((code[i/8] & (1 << (i&7))) != 0) + { + int j; + for (j = i+1; j < 256; j++) + if ((code[j/8] & (1 << (j&7))) == 0) break; + if (i == '-' || i == ']') printf("\\"); + if (isprint(i)) printf("%c", i); else printf("\\x%02x", i); + if (--j > i) + { + printf("-"); + if (j == '-' || j == ']') printf("\\"); + if (isprint(j)) printf("%c", j); else printf("\\x%02x", j); + } + i = j; + } + } + printf("]"); + code += 32; + + CLASS_REF_REPEAT: + + switch(*code) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + printf("%s", OP_names[*code]); + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + min = (code[1] << 8) + code[2]; + max = (code[3] << 8) + code[4]; + if (max == 0) printf("{%d,}", min); + else printf("{%d,%d}", min, max); + if (*code == OP_CRMINRANGE) printf("?"); + code += 4; + break; + + default: + code--; + } + } 
+ break; + + /* Anything else is just a one-node item */ + + default: + printf(" %s", OP_names[*code]); + break; + } + + code++; + printf("\n"); + } +printf("------------------------------------------------------------------\n"); + +/* This check is done here in the debugging case so that the code that +was compiled can be seen. */ + +if (code - re->code > length) + { + *errorptr = ERR23; + (pcre_free)(re); + *erroroffset = ptr - (uschar *)pattern; + return NULL; + } +#endif + +return (pcre *)re; +} + + + +/************************************************* +* Match a back-reference * +*************************************************/ + +/* If a back reference hasn't been set, the length that is passed is greater +than the number of characters left in the string, so the match fails. + +Arguments: + offset index into the offset vector + eptr points into the subject + length length to be matched + md points to match data block + ims the ims flags + +Returns: TRUE if matched +*/ + +static BOOL +match_ref(int offset, register const uschar *eptr, int length, match_data *md, + int ims) +{ +const uschar *p = md->start_subject + md->offset_vector[offset]; + +#ifdef DEBUG +if (eptr >= md->end_subject) + printf("matching subject <null>"); +else + { + printf("matching subject "); + pchars(eptr, length, TRUE, md); + } +printf(" against backref "); +pchars(p, length, FALSE, md); +printf("\n"); +#endif + +/* Always fail if not enough characters left */ + +if (length > md->end_subject - eptr) return FALSE; + +/* Separate the caseless case for speed */ + +if ((ims & PCRE_CASELESS) != 0) + { + while (length-- > 0) + if (md->lcc[*p++] != md->lcc[*eptr++]) return FALSE; + } +else + { while (length-- > 0) if (*p++ != *eptr++) return FALSE; } + +return TRUE; +} + + + +/************************************************* +* Match from current position * +*************************************************/ + +/* On entry ecode points to the first opcode, and eptr to the first character +in 
the subject string, while eptrb holds the value of eptr at the start of the +last bracketed group - used for breaking infinite loops matching zero-length +strings. + +Arguments: + eptr pointer in subject + ecode position in code + offset_top current top pointer + md pointer to "static" info for the match + ims current /i, /m, and /s options + condassert TRUE if called to check a condition assertion + eptrb eptr at start of last bracket + +Returns: TRUE if matched +*/ + +static BOOL +match(register const uschar *eptr, register const uschar *ecode, + int offset_top, match_data *md, int ims, BOOL condassert, const uschar *eptrb) +{ +int original_ims = ims; /* Save for resetting on ')' */ + +for (;;) + { + int op = (int)*ecode; + int min, max, ctype; + register int i; + register int c; + BOOL minimize = FALSE; + + /* Opening capturing bracket. If there is space in the offset vector, save + the current subject position in the working slot at the top of the vector. We + mustn't change the current values of the data slot, because they may be set + from a previous iteration of this group, and be referred to by a reference + inside the group. + + If the bracket fails to match, we need to restore this value and also the + values of the final offsets, in case they were set by a previous iteration of + the same bracket. + + If there isn't enough space in the offset vector, treat this as if it were a + non-capturing bracket. Don't worry about setting the flag for the error case + here; that is handled in the code for KET. 
*/ + + if (op > OP_BRA) + { + int number = op - OP_BRA; + int offset = number << 1; + +#ifdef DEBUG + printf("start bracket %d subject=", number); + pchars(eptr, 16, TRUE, md); + printf("\n"); +#endif + + if (offset < md->offset_max) + { + int save_offset1 = md->offset_vector[offset]; + int save_offset2 = md->offset_vector[offset+1]; + int save_offset3 = md->offset_vector[md->offset_end - number]; + + DPRINTF(("saving %d %d %d\n", save_offset1, save_offset2, save_offset3)); + md->offset_vector[md->offset_end - number] = eptr - md->start_subject; + + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + DPRINTF(("bracket %d failed\n", number)); + + md->offset_vector[offset] = save_offset1; + md->offset_vector[offset+1] = save_offset2; + md->offset_vector[md->offset_end - number] = save_offset3; + return FALSE; + } + + /* Insufficient room for saving captured contents */ + + else op = OP_BRA; + } + + /* Other types of node can be handled by a switch */ + + switch(op) + { + case OP_BRA: /* Non-capturing bracket: optimized */ + DPRINTF(("start bracket 0\n")); + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + DPRINTF(("bracket 0 failed\n")); + return FALSE; + + /* Conditional group: compilation checked that there are no more than + two branches. If the condition is false, skipping the first branch takes us + past the end if there is only one branch, but that's OK because that is + exactly what going to the ket would do. */ + + case OP_COND: + if (ecode[3] == OP_CREF) /* Condition is extraction test */ + { + int offset = ecode[4] << 1; /* Doubled reference number */ + return match(eptr, + ecode + ((offset < offset_top && md->offset_vector[offset] >= 0)? 
+ 5 : 3 + (ecode[1] << 8) + ecode[2]), + offset_top, md, ims, FALSE, eptr); + } + + /* The condition is an assertion. Call match() to evaluate it - setting + the final argument TRUE causes it to stop at the end of an assertion. */ + + else + { + if (match(eptr, ecode+3, offset_top, md, ims, TRUE, NULL)) + { + ecode += 3 + (ecode[4] << 8) + ecode[5]; + while (*ecode == OP_ALT) ecode += (ecode[1] << 8) + ecode[2]; + } + else ecode += (ecode[1] << 8) + ecode[2]; + return match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr); + } + /* Control never reaches here */ + + /* Skip over conditional reference data if encountered (should not be) */ + + case OP_CREF: + ecode += 2; + break; + + /* End of the pattern */ + + case OP_END: + md->end_match_ptr = eptr; /* Record where we ended */ + md->end_offset_top = offset_top; /* and how many extracts were taken */ + return TRUE; + + /* Change option settings */ + + case OP_OPT: + ims = ecode[1]; + ecode += 2; + DPRINTF(("ims set to %02x\n", ims)); + break; + + /* Assertion brackets. Check the alternative branches in turn - the + matching won't pass the KET for an assertion. If any one branch matches, + the assertion is true. Lookbehind assertions have an OP_REVERSE item at the + start of each branch to move the current point backwards, so the code at + this level is identical to the lookahead case. */ + + case OP_ASSERT: + case OP_ASSERTBACK: + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, NULL)) break; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + if (*ecode == OP_KET) return FALSE; + + /* If checking an assertion for a condition, return TRUE. */ + + if (condassert) return TRUE; + + /* Continue from after the assertion, updating the offsets high water + mark, since extracts may have been taken during the assertion. 
*/ + + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + ecode += 3; + offset_top = md->end_offset_top; + continue; + + /* Negative assertion: all branches must fail to match */ + + case OP_ASSERT_NOT: + case OP_ASSERTBACK_NOT: + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, NULL)) return FALSE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + if (condassert) return TRUE; + ecode += 3; + continue; + + /* Move the subject pointer back. This occurs only at the start of + each branch of a lookbehind assertion. If we are too close to the start to + move back, this match function fails. */ + + case OP_REVERSE: + eptr -= (ecode[1] << 8) + ecode[2]; + if (eptr < md->start_subject) return FALSE; + ecode += 3; + break; + + + /* "Once" brackets are like assertion brackets except that after a match, + the point in the subject string is not moved back. Thus there can never be + a move back into the brackets. Check the alternative branches in turn - the + matching won't pass the KET for this kind of subpattern. If any one branch + matches, we carry on as at the end of a normal bracket, leaving the subject + pointer. */ + + case OP_ONCE: + { + const uschar *prev = ecode; + + do + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) break; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + /* If hit the end of the group (which could be repeated), fail */ + + if (*ecode != OP_ONCE && *ecode != OP_ALT) return FALSE; + + /* Continue as from after the assertion, updating the offsets high water + mark, since extracts may have been taken. */ + + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + + offset_top = md->end_offset_top; + eptr = md->end_match_ptr; + + /* For a non-repeating ket, just continue at this level. This also + happens for a repeating ket if no characters were matched in the group. 
+ This is the forcible breaking of infinite loops as implemented in Perl + 5.005. If there is an options reset, it will get obeyed in the normal + course of events. */ + + if (*ecode == OP_KET || eptr == eptrb) + { + ecode += 3; + break; + } + + /* The repeating kets try the rest of the pattern or restart from the + preceding bracket, in the appropriate order. We need to reset any options + that changed within the bracket before re-running it, so check the next + opcode. */ + + if (ecode[3] == OP_OPT) + { + ims = (ims & ~PCRE_IMS) | ecode[4]; + DPRINTF(("ims set to %02x at group repeat\n", ims)); + } + + if (*ecode == OP_KETRMIN) + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr) || + match(eptr, prev, offset_top, md, ims, FALSE, eptr)) return TRUE; + } + else /* OP_KETRMAX */ + { + if (match(eptr, prev, offset_top, md, ims, FALSE, eptr) || + match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + } + } + return FALSE; + + /* An alternation is the end of a branch; scan along to find the end of the + bracketed group and go to there. */ + + case OP_ALT: + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + break; + + /* BRAZERO and BRAMINZERO occur just before a bracket group, indicating + that it may occur zero times. It may repeat infinitely, or not at all - + i.e. it could be ()* or ()? in the pattern. Brackets with fixed upper + repeat limits are compiled as a number of copies, with the optional ones + preceded by BRAZERO or BRAMINZERO. 
*/ + + case OP_BRAZERO: + { + const uschar *next = ecode+1; + if (match(eptr, next, offset_top, md, ims, FALSE, eptr)) return TRUE; + do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); + ecode = next + 3; + } + break; + + case OP_BRAMINZERO: + { + const uschar *next = ecode+1; + do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); + if (match(eptr, next+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + ecode++; + } + break; + + /* End of a group, repeated or non-repeating. If we are at the end of + an assertion "group", stop matching and return TRUE, but record the + current high water mark for use by positive assertions. Do this also + for the "once" (not-backup up) groups. */ + + case OP_KET: + case OP_KETRMIN: + case OP_KETRMAX: + { + const uschar *prev = ecode - (ecode[1] << 8) - ecode[2]; + + if (*prev == OP_ASSERT || *prev == OP_ASSERT_NOT || + *prev == OP_ASSERTBACK || *prev == OP_ASSERTBACK_NOT || + *prev == OP_ONCE) + { + md->end_match_ptr = eptr; /* For ONCE */ + md->end_offset_top = offset_top; + return TRUE; + } + + /* In all other cases except a conditional group we have to check the + group number back at the start and if necessary complete handling an + extraction by setting the offsets and bumping the high water mark. */ + + if (*prev != OP_COND) + { + int number = *prev - OP_BRA; + int offset = number << 1; + + DPRINTF(("end bracket %d\n", number)); + + if (number > 0) + { + if (offset >= md->offset_max) md->offset_overflow = TRUE; else + { + md->offset_vector[offset] = + md->offset_vector[md->offset_end - number]; + md->offset_vector[offset+1] = eptr - md->start_subject; + if (offset_top <= offset) offset_top = offset + 2; + } + } + } + + /* Reset the value of the ims flags, in case they got changed during + the group. */ + + ims = original_ims; + DPRINTF(("ims reset to %02x\n", ims)); + + /* For a non-repeating ket, just continue at this level. 
This also + happens for a repeating ket if no characters were matched in the group. + This is the forcible breaking of infinite loops as implemented in Perl + 5.005. If there is an options reset, it will get obeyed in the normal + course of events. */ + + if (*ecode == OP_KET || eptr == eptrb) + { + ecode += 3; + break; + } + + /* The repeating kets try the rest of the pattern or restart from the + preceding bracket, in the appropriate order. */ + + if (*ecode == OP_KETRMIN) + { + if (match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr) || + match(eptr, prev, offset_top, md, ims, FALSE, eptr)) return TRUE; + } + else /* OP_KETRMAX */ + { + if (match(eptr, prev, offset_top, md, ims, FALSE, eptr) || + match(eptr, ecode+3, offset_top, md, ims, FALSE, eptr)) return TRUE; + } + } + return FALSE; + + /* Start of subject unless notbol, or after internal newline if multiline */ + + case OP_CIRC: + if (md->notbol && eptr == md->start_subject) return FALSE; + if ((ims & PCRE_MULTILINE) != 0) + { + if (eptr != md->start_subject && eptr[-1] != '\n') return FALSE; + ecode++; + break; + } + /* ... else fall through */ + + /* Start of subject assertion */ + + case OP_SOD: + if (eptr != md->start_subject) return FALSE; + ecode++; + break; + + /* Assert before internal newline if multiline, or before a terminating + newline unless endonly is set, else end of subject unless noteol is set. */ + + case OP_DOLL: + if ((ims & PCRE_MULTILINE) != 0) + { + if (eptr < md->end_subject) { if (*eptr != '\n') return FALSE; } + else { if (md->noteol) return FALSE; } + ecode++; + break; + } + else + { + if (md->noteol) return FALSE; + if (!md->endonly) + { + if (eptr < md->end_subject - 1 || + (eptr == md->end_subject - 1 && *eptr != '\n')) return FALSE; + + ecode++; + break; + } + } + /* ... 
else fall through */ + + /* End of subject assertion (\z) */ + + case OP_EOD: + if (eptr < md->end_subject) return FALSE; + ecode++; + break; + + /* End of subject or ending \n assertion (\Z) */ + + case OP_EODN: + if (eptr < md->end_subject - 1 || + (eptr == md->end_subject - 1 && *eptr != '\n')) return FALSE; + ecode++; + break; + + /* Word boundary assertions */ + + case OP_NOT_WORD_BOUNDARY: + case OP_WORD_BOUNDARY: + { + BOOL prev_is_word = (eptr != md->start_subject) && + ((md->ctypes[eptr[-1]] & ctype_word) != 0); + BOOL cur_is_word = (eptr < md->end_subject) && + ((md->ctypes[*eptr] & ctype_word) != 0); + if ((*ecode++ == OP_WORD_BOUNDARY)? + cur_is_word == prev_is_word : cur_is_word != prev_is_word) + return FALSE; + } + break; + + /* Match a single character type; inline for speed */ + + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0 && eptr < md->end_subject && *eptr == '\n') + return FALSE; + if (eptr++ >= md->end_subject) return FALSE; + ecode++; + break; + + case OP_NOT_DIGIT: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_digit) != 0) + return FALSE; + ecode++; + break; + + case OP_DIGIT: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_digit) == 0) + return FALSE; + ecode++; + break; + + case OP_NOT_WHITESPACE: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_space) != 0) + return FALSE; + ecode++; + break; + + case OP_WHITESPACE: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_space) == 0) + return FALSE; + ecode++; + break; + + case OP_NOT_WORDCHAR: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_word) != 0) + return FALSE; + ecode++; + break; + + case OP_WORDCHAR: + if (eptr >= md->end_subject || + (md->ctypes[*eptr++] & ctype_word) == 0) + return FALSE; + ecode++; + break; + + /* Match a back reference, possibly repeatedly. Look past the end of the + item to see if there is repeat information following. 
The code is similar + to that for character classes, but repeated for efficiency. Then obey + similar code to character type repeats - written out again for speed. + However, if the referenced string is the empty string, always treat + it as matched, any number of times (otherwise there could be infinite + loops). */ + + case OP_REF: + { + int length; + int offset = ecode[1] << 1; /* Doubled reference number */ + ecode += 2; /* Advance past the item */ + + /* If the reference is unset, set the length to be longer than the amount + of subject left; this ensures that every attempt at a match fails. We + can't just fail here, because of the possibility of quantifiers with zero + minima. */ + + length = (offset >= offset_top || md->offset_vector[offset] < 0)? + md->end_subject - eptr + 1 : + md->offset_vector[offset+1] - md->offset_vector[offset]; + + /* Set up for repetition, or handle the non-repeated case */ + + switch (*ecode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + c = *ecode++ - OP_CRSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + minimize = (*ecode == OP_CRMINRANGE); + min = (ecode[1] << 8) + ecode[2]; + max = (ecode[3] << 8) + ecode[4]; + if (max == 0) max = INT_MAX; + ecode += 5; + break; + + default: /* No repeat follows */ + if (!match_ref(offset, eptr, length, md, ims)) return FALSE; + eptr += length; + continue; /* With the main loop */ + } + + /* If the length of the reference is zero, just continue with the + main loop. */ + + if (length == 0) continue; + + /* First, ensure the minimum number of matches are present. We get back + the length of the reference string explicitly rather than passing the + address of eptr, so that eptr can be a register variable. 
*/ + + for (i = 1; i <= min; i++) + { + if (!match_ref(offset, eptr, length, md, ims)) return FALSE; + eptr += length; + } + + /* If min = max, continue at the same level without recursion. + They are not both allowed to be zero. */ + + if (min == max) continue; + + /* If minimizing, keep trying and advancing the pointer */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || !match_ref(offset, eptr, length, md, ims)) + return FALSE; + eptr += length; + } + /* Control never gets here */ + } + + /* If maximizing, find the longest string and work backwards */ + + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (!match_ref(offset, eptr, length, md, ims)) break; + eptr += length; + } + while (eptr >= pp) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + eptr -= length; + } + return FALSE; + } + } + /* Control never gets here */ + + + + /* Match a character class, possibly repeatedly. Look past the end of the + item to see if there is repeat information following. Then obey similar + code to character type repeats - written out again for speed. 
*/ + + case OP_CLASS: + { + const uschar *data = ecode + 1; /* Save for matching */ + ecode += 33; /* Advance past the item */ + + switch (*ecode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + c = *ecode++ - OP_CRSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + minimize = (*ecode == OP_CRMINRANGE); + min = (ecode[1] << 8) + ecode[2]; + max = (ecode[3] << 8) + ecode[4]; + if (max == 0) max = INT_MAX; + ecode += 5; + break; + + default: /* No repeat follows */ + min = max = 1; + break; + } + + /* First, ensure the minimum number of matches are present. */ + + for (i = 1; i <= min; i++) + { + if (eptr >= md->end_subject) return FALSE; + c = *eptr++; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + return FALSE; + } + + /* If max == min we can continue with the main loop without the + need to recurse. */ + + if (min == max) continue; + + /* If minimizing, keep testing the rest of the expression and advancing + the pointer while it matches the class. */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject) return FALSE; + c = *eptr++; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + return FALSE; + } + /* Control never gets here */ + } + + /* If maximizing, find the longest possible run, then work backwards. 
*/ + + else + { + const uschar *pp = eptr; + for (i = min; i < max; eptr++, i++) + { + if (eptr >= md->end_subject) break; + c = *eptr; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + break; + } + + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a run of characters */ + + case OP_CHARS: + { + register int length = ecode[1]; + ecode += 2; + +#ifdef DEBUG /* Sigh. Some compilers never learn. */ + if (eptr >= md->end_subject) + printf("matching subject <null> against pattern "); + else + { + printf("matching subject "); + pchars(eptr, length, TRUE, md); + printf(" against pattern "); + } + pchars(ecode, length, FALSE, md); + printf("\n"); +#endif + + if (length > md->end_subject - eptr) return FALSE; + if ((ims & PCRE_CASELESS) != 0) + { + while (length-- > 0) + if (md->lcc[*ecode++] != md->lcc[*eptr++]) + return FALSE; + } + else + { + while (length-- > 0) if (*ecode++ != *eptr++) return FALSE; + } + } + break; + + /* Match a single character repeatedly; different opcodes share code. */ + + case OP_EXACT: + min = max = (ecode[1] << 8) + ecode[2]; + ecode += 3; + goto REPEATCHAR; + + case OP_UPTO: + case OP_MINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_MINUPTO; + ecode += 3; + goto REPEATCHAR; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + c = *ecode++ - OP_STAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single-character matches. We can give + up quickly if there are fewer than the minimum number of characters left in + the subject. 
*/ + + REPEATCHAR: + if (min > md->end_subject - eptr) return FALSE; + c = *ecode++; + + /* The code is duplicated for the caseless and caseful cases, for speed, + since matching characters is likely to be quite common. First, ensure the + minimum number of matches are present. If min = max, continue at the same + level without recursing. Otherwise, if minimizing, keep trying the rest of + the expression and advancing one matching character if failing, up to the + maximum. Alternatively, if maximizing, find the maximum number of + characters and work backwards. */ + + DPRINTF(("matching %c{%d,%d} against subject %.*s\n", c, min, max, + max, eptr)); + + if ((ims & PCRE_CASELESS) != 0) + { + c = md->lcc[c]; + for (i = 1; i <= min; i++) + if (c != md->lcc[*eptr++]) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject || + c != md->lcc[*eptr++]) + return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c != md->lcc[*eptr]) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + /* Control never gets here */ + } + + /* Caseful comparisons */ + + else + { + for (i = 1; i <= min; i++) if (c != *eptr++) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject || c != *eptr++) return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c != *eptr) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + } + /* 
Control never gets here */ + + /* Match a negated single character */ + + case OP_NOT: + if (eptr >= md->end_subject) return FALSE; + ecode++; + if ((ims & PCRE_CASELESS) != 0) + { + if (md->lcc[*ecode++] == md->lcc[*eptr++]) return FALSE; + } + else + { + if (*ecode++ == *eptr++) return FALSE; + } + break; + + /* Match a negated single character repeatedly. This is almost a repeat of + the code for a repeated single character, but I haven't found a nice way of + commoning these up that doesn't require a test of the positive/negative + option for each character match. Maybe that wouldn't add very much to the + time taken, but character matching *is* what this is all about... */ + + case OP_NOTEXACT: + min = max = (ecode[1] << 8) + ecode[2]; + ecode += 3; + goto REPEATNOTCHAR; + + case OP_NOTUPTO: + case OP_NOTMINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_NOTMINUPTO; + ecode += 3; + goto REPEATNOTCHAR; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + c = *ecode++ - OP_NOTSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single-character matches. We can give + up quickly if there are fewer than the minimum number of characters left in + the subject. */ + + REPEATNOTCHAR: + if (min > md->end_subject - eptr) return FALSE; + c = *ecode++; + + /* The code is duplicated for the caseless and caseful cases, for speed, + since matching characters is likely to be quite common. First, ensure the + minimum number of matches are present. If min = max, continue at the same + level without recursing. Otherwise, if minimizing, keep trying the rest of + the expression and advancing one matching character if failing, up to the + maximum. 
Alternatively, if maximizing, find the maximum number of + characters and work backwards. */ + + DPRINTF(("negative matching %c{%d,%d} against subject %.*s\n", c, min, max, + max, eptr)); + + if ((ims & PCRE_CASELESS) != 0) + { + c = md->lcc[c]; + for (i = 1; i <= min; i++) + if (c == md->lcc[*eptr++]) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject || + c == md->lcc[*eptr++]) + return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c == md->lcc[*eptr]) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + /* Control never gets here */ + } + + /* Caseful comparisons */ + + else + { + for (i = 1; i <= min; i++) if (c == *eptr++) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + if (i >= max || eptr >= md->end_subject || c == *eptr++) return FALSE; + } + /* Control never gets here */ + } + else + { + const uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c == *eptr) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a single character type repeatedly; several different opcodes + share code. This is very similar to the code for single characters, but we + repeat it in the interests of efficiency. 
*/ + + case OP_TYPEEXACT: + min = max = (ecode[1] << 8) + ecode[2]; + minimize = TRUE; + ecode += 3; + goto REPEATTYPE; + + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_TYPEMINUPTO; + ecode += 3; + goto REPEATTYPE; + + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + c = *ecode++ - OP_TYPESTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single character type matches */ + + REPEATTYPE: + ctype = *ecode++; /* Code for the character type */ + + /* First, ensure the minimum number of matches are present. Use inline + code for maximizing the speed, and do the type test once at the start + (i.e. keep it out of the loop). Also test that there are at least the + minimum number of characters before we start. 
*/ + + if (min > md->end_subject - eptr) return FALSE; + if (min > 0) switch(ctype) + { + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0) + { for (i = 1; i <= min; i++) if (*eptr++ == '\n') return FALSE; } + else eptr += min; + break; + + case OP_NOT_DIGIT: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_digit) != 0) return FALSE; + break; + + case OP_DIGIT: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_digit) == 0) return FALSE; + break; + + case OP_NOT_WHITESPACE: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_space) != 0) return FALSE; + break; + + case OP_WHITESPACE: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_space) == 0) return FALSE; + break; + + case OP_NOT_WORDCHAR: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_word) != 0) + return FALSE; + break; + + case OP_WORDCHAR: + for (i = 1; i <= min; i++) + if ((md->ctypes[*eptr++] & ctype_word) == 0) + return FALSE; + break; + } + + /* If min = max, continue at the same level without recursing */ + + if (min == max) continue; + + /* If minimizing, we have to test the rest of the pattern before each + subsequent match. 
*/ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md, ims, FALSE, eptrb)) return TRUE; + if (i >= max || eptr >= md->end_subject) return FALSE; + + c = *eptr++; + switch(ctype) + { + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0 && c == '\n') return FALSE; + break; + + case OP_NOT_DIGIT: + if ((md->ctypes[c] & ctype_digit) != 0) return FALSE; + break; + + case OP_DIGIT: + if ((md->ctypes[c] & ctype_digit) == 0) return FALSE; + break; + + case OP_NOT_WHITESPACE: + if ((md->ctypes[c] & ctype_space) != 0) return FALSE; + break; + + case OP_WHITESPACE: + if ((md->ctypes[c] & ctype_space) == 0) return FALSE; + break; + + case OP_NOT_WORDCHAR: + if ((md->ctypes[c] & ctype_word) != 0) return FALSE; + break; + + case OP_WORDCHAR: + if ((md->ctypes[c] & ctype_word) == 0) return FALSE; + break; + } + } + /* Control never gets here */ + } + + /* If maximizing it is worth using inline code for speed, doing the type + test once at the start (i.e. keep it out of the loop). 
*/ + + else + { + const uschar *pp = eptr; + switch(ctype) + { + case OP_ANY: + if ((ims & PCRE_DOTALL) == 0) + { + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || *eptr == '\n') break; + eptr++; + } + } + else + { + c = max - min; + if (c > md->end_subject - eptr) c = md->end_subject - eptr; + eptr += c; + } + break; + + case OP_NOT_DIGIT: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_digit) != 0) + break; + eptr++; + } + break; + + case OP_DIGIT: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_digit) == 0) + break; + eptr++; + } + break; + + case OP_NOT_WHITESPACE: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_space) != 0) + break; + eptr++; + } + break; + + case OP_WHITESPACE: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_space) == 0) + break; + eptr++; + } + break; + + case OP_NOT_WORDCHAR: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_word) != 0) + break; + eptr++; + } + break; + + case OP_WORDCHAR: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (md->ctypes[*eptr] & ctype_word) == 0) + break; + eptr++; + } + break; + } + + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md, ims, FALSE, eptrb)) + return TRUE; + return FALSE; + } + /* Control never gets here */ + + /* There's been some horrible disaster. */ + + default: + DPRINTF(("Unknown opcode %d\n", *ecode)); + md->errorcode = PCRE_ERROR_UNKNOWN_NODE; + return FALSE; + } + + /* Do not stick any code in here without much thought; it is assumed + that "continue" in the code above comes out to here to repeat the main + loop. 
*/ + + } /* End of main loop */ +/* Control never reaches here */ +} + + + + +/************************************************* +* Execute a Regular Expression * +*************************************************/ + +/* This function applies a compiled re to a subject string and picks out +portions of the string if it matches. Two elements in the vector are set for +each substring: the offsets to the start and end of the substring. + +Arguments: + external_re points to the compiled expression + external_extra points to "hints" from pcre_study() or is NULL + subject points to the subject string + length length of subject string (may contain binary zeros) + options option bits + offsets points to a vector of ints to be filled in with offsets + offsetcount the number of elements in the vector + +Returns: > 0 => success; value is the number of elements filled in + = 0 => success, but offsets is not big enough + -1 => failed to match + < -1 => some kind of unexpected problem +*/ + +int +pcre_exec(const pcre *external_re, const pcre_extra *external_extra, + const char *subject, int length, int options, int *offsets, int offsetcount) +{ +int resetcount, ocount; +int first_char = -1; +int ims = 0; +match_data match_block; +const uschar *start_bits = NULL; +const uschar *start_match = (const uschar *)subject; +const uschar *end_subject; +const real_pcre *re = (const real_pcre *)external_re; +const real_pcre_extra *extra = (const real_pcre_extra *)external_extra; +BOOL using_temporary_offsets = FALSE; +BOOL anchored = ((re->options | options) & PCRE_ANCHORED) != 0; +BOOL startline = (re->options & PCRE_STARTLINE) != 0; + +if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION; + +if (re == NULL || subject == NULL || + (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL; +if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; + +match_block.start_subject = (const uschar *)subject; +match_block.end_subject = match_block.start_subject + 
length; +end_subject = match_block.end_subject; + +match_block.endonly = (re->options & PCRE_DOLLAR_ENDONLY) != 0; + +match_block.notbol = (options & PCRE_NOTBOL) != 0; +match_block.noteol = (options & PCRE_NOTEOL) != 0; + +match_block.errorcode = PCRE_ERROR_NOMATCH; /* Default error */ + +match_block.lcc = re->tables + lcc_offset; +match_block.ctypes = re->tables + ctypes_offset; + +/* The ims options can vary during the matching as a result of the presence +of (?ims) items in the pattern. They are kept in a local variable so that +restoring at the exit of a group is easy. */ + +ims = re->options & (PCRE_CASELESS|PCRE_MULTILINE|PCRE_DOTALL); + +/* If the expression has got more back references than the offsets supplied can +hold, we get a temporary bit of working store to use during the matching. +Otherwise, we can use the vector supplied, rounding down its size to a multiple +of 3. */ + +ocount = offsetcount - (offsetcount % 3); + +if (re->top_backref > 0 && re->top_backref >= ocount/3) + { + ocount = re->top_backref * 3 + 3; + match_block.offset_vector = (int *)(pcre_malloc)(ocount * sizeof(int)); + if (match_block.offset_vector == NULL) return PCRE_ERROR_NOMEMORY; + using_temporary_offsets = TRUE; + DPRINTF(("Got memory to hold back references\n")); + } +else match_block.offset_vector = offsets; + +match_block.offset_end = ocount; +match_block.offset_max = (2*ocount)/3; +match_block.offset_overflow = FALSE; + +/* Compute the minimum number of offsets that we need to reset each time. Doing +this makes a huge difference to execution time when there aren't many brackets +in the pattern. */ + +resetcount = 2 + re->top_bracket * 2; +if (resetcount > offsetcount) resetcount = ocount; + +/* Reset the working variable associated with each extraction. These should +never be used unless previously set, but they get saved and restored, and so we +initialize them to avoid reading uninitialized locations. 
*/ + +if (match_block.offset_vector != NULL) + { + register int *iptr = match_block.offset_vector + ocount; + register int *iend = iptr - resetcount/2 + 1; + while (--iptr >= iend) *iptr = -1; + } + +/* Set up the first character to match, if available. The first_char value is +never set for an anchored regular expression, but the anchoring may be forced +at run time, so we have to test for anchoring. The first char may be unset for +an unanchored pattern, of course. If there's no first char and the pattern was +studied, there may be a bitmap of possible first characters. */ + +if (!anchored) + { + if ((re->options & PCRE_FIRSTSET) != 0) + { + first_char = re->first_char; + if ((ims & PCRE_CASELESS) != 0) first_char = match_block.lcc[first_char]; + } + else + if (!startline && extra != NULL && + (extra->options & PCRE_STUDY_MAPPED) != 0) + start_bits = extra->start_bits; + } + +/* Loop for unanchored matches; for anchored regexps the loop runs just once. */ + +do + { + int rc; + register int *iptr = match_block.offset_vector; + register int *iend = iptr + resetcount; + + /* Reset the maximum number of extractions we might see. */ + + while (iptr < iend) *iptr++ = -1; + + /* Advance to a unique first char if possible */ + + if (first_char >= 0) + { + if ((ims & PCRE_CASELESS) != 0) + while (start_match < end_subject && + match_block.lcc[*start_match] != first_char) + start_match++; + else + while (start_match < end_subject && *start_match != first_char) + start_match++; + } + + /* Or to just after \n for a multiline match if possible */ + + else if (startline) + { + if (start_match > match_block.start_subject) + { + while (start_match < end_subject && start_match[-1] != '\n') + start_match++; + } + } + + /* Or to a non-unique first char */ + + else if (start_bits != NULL) + { + while (start_match < end_subject) + { + register int c = *start_match; + if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break; + } + } + +#ifdef DEBUG /* Sigh. 
Some compilers never learn. */ + printf(">>>> Match against: "); + pchars(start_match, end_subject - start_match, TRUE, &match_block); + printf("\n"); +#endif + + /* When a match occurs, substrings will be set for all internal extractions; + we just need to set up the whole thing as substring 0 before returning. If + there were too many extractions, set the return code to zero. In the case + where we had to get some local store to hold offsets for backreferences, copy + those back references that we can. In this case there need not be overflow + if certain parts of the pattern were not used. */ + + if (!match(start_match, re->code, 2, &match_block, ims, FALSE, start_match)) + continue; + + /* Copy the offset information from temporary store if necessary */ + + if (using_temporary_offsets) + { + if (offsetcount >= 4) + { + memcpy(offsets + 2, match_block.offset_vector + 2, + (offsetcount - 2) * sizeof(int)); + DPRINTF(("Copied offsets from temporary memory\n")); + } + if (match_block.end_offset_top > offsetcount) + match_block.offset_overflow = TRUE; + + DPRINTF(("Freeing temporary memory\n")); + (pcre_free)(match_block.offset_vector); + } + + rc = match_block.offset_overflow? 
0 : match_block.end_offset_top/2; + + if (match_block.offset_end < 2) rc = 0; else + { + offsets[0] = start_match - match_block.start_subject; + offsets[1] = match_block.end_match_ptr - match_block.start_subject; + } + + DPRINTF((">>>> returning %d\n", rc)); + return rc; + } + +/* This "while" is the end of the "do" above */ + +while (!anchored && + match_block.errorcode == PCRE_ERROR_NOMATCH && + start_match++ < end_subject); + +if (using_temporary_offsets) + { + DPRINTF(("Freeing temporary memory\n")); + (pcre_free)(match_block.offset_vector); + } + +DPRINTF((">>>> returning %d\n", match_block.errorcode)); + +return match_block.errorcode; +} + +/* End of pcre.c */ diff --git a/ext/pcre/pcrelib/pcre.h b/ext/pcre/pcrelib/pcre.h new file mode 100644 index 0000000000..27204b6605 --- /dev/null +++ b/ext/pcre/pcrelib/pcre.h @@ -0,0 +1,74 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* Copyright (c) 1997-1999 University of Cambridge */ + +#ifndef _PCRE_H +#define _PCRE_H + +/* Have to include stdlib.h in order to ensure that size_t is defined; +it is needed here for malloc. 
*/ + +#include <sys/types.h> +#include <stdlib.h> + +/* Allow for C++ users */ + +#ifdef __cplusplus +extern "C" { +#endif + +/* Options */ + +#define PCRE_CASELESS 0x0001 +#define PCRE_MULTILINE 0x0002 +#define PCRE_DOTALL 0x0004 +#define PCRE_EXTENDED 0x0008 +#define PCRE_ANCHORED 0x0010 +#define PCRE_DOLLAR_ENDONLY 0x0020 +#define PCRE_EXTRA 0x0040 +#define PCRE_NOTBOL 0x0080 +#define PCRE_NOTEOL 0x0100 +#define PCRE_UNGREEDY 0x0200 + +/* Exec-time and get-time error codes */ + +#define PCRE_ERROR_NOMATCH (-1) +#define PCRE_ERROR_NULL (-2) +#define PCRE_ERROR_BADOPTION (-3) +#define PCRE_ERROR_BADMAGIC (-4) +#define PCRE_ERROR_UNKNOWN_NODE (-5) +#define PCRE_ERROR_NOMEMORY (-6) +#define PCRE_ERROR_NOSUBSTRING (-7) + +/* Types */ + +typedef void pcre; +typedef void pcre_extra; + +/* Store get and free functions. These can be set to alternative malloc/free +functions if required. */ + +extern void *(*pcre_malloc)(size_t); +extern void (*pcre_free)(void *); + +/* Functions */ + +extern pcre *pcre_compile(const char *, int, const char **, int *, + const unsigned char *); +extern int pcre_copy_substring(const char *, int *, int, int, char *, int); +extern int pcre_exec(const pcre *, const pcre_extra *, const char *, + int, int, int *, int); +extern int pcre_get_substring(const char *, int *, int, int, const char **); +extern int pcre_get_substring_list(const char *, int *, int, const char ***); +extern int pcre_info(const pcre *, int *, int *); +extern unsigned const char *pcre_maketables(void); +extern pcre_extra *pcre_study(const pcre *, int, const char **); +extern const char *pcre_version(void); + +#ifdef __cplusplus +} /* extern "C" */ +#endif + +#endif /* End of pcre.h */ diff --git a/ext/pcre/pcrelib/pcreposix.3 b/ext/pcre/pcrelib/pcreposix.3 new file mode 100644 index 0000000000..40601c4bd7 --- /dev/null +++ b/ext/pcre/pcrelib/pcreposix.3 @@ -0,0 +1,135 @@ +.TH PCRE 3 +.SH NAME +pcreposix - POSIX API for Perl-compatible regular expressions. 
+.SH SYNOPSIS +.B #include <pcreposix.h> +.PP +.SM +.br +.B int regcomp(regex_t *\fIpreg\fR, const char *\fIpattern\fR, +.ti +5n +.B int \fIcflags\fR); +.PP +.br +.B int regexec(regex_t *\fIpreg\fR, const char *\fIstring\fR, +.ti +5n +.B size_t \fInmatch\fR, regmatch_t \fIpmatch\fR[], int \fIeflags\fR); +.PP +.br +.B size_t regerror(int \fIerrcode\fR, const regex_t *\fIpreg\fR, +.ti +5n +.B char *\fIerrbuf\fR, size_t \fIerrbuf_size\fR); +.PP +.br +.B void regfree(regex_t *\fIpreg\fR); + + +.SH DESCRIPTION +This set of functions provides a POSIX-style API to the PCRE regular expression +package. See \fBpcre (3)\fR for a description of the native API, which contains +additional functionality. The functions described here are just wrapper +functions that ultimately call the native API. + +As I am pretty ignorant about POSIX, these functions must be considered as +experimental. I have implemented only those option bits that can be reasonably +mapped to PCRE native options. Other POSIX options are not even defined. It may +be that it is useful to define, but ignore, other options. Feedback from more +knowledgeable folk may cause this kind of detail to change. + +When PCRE is called via these functions, it is only the API that is POSIX-like +in style. The syntax and semantics of the regular expressions themselves are +still those of Perl, subject to the setting of various PCRE options, as +described below. + +The header for these functions is supplied as \fBpcreposix.h\fR to avoid any +potential clash with other POSIX libraries. It can, of course, be renamed or +aliased as \fBregex.h\fR, which is the "correct" name. It provides two +structure types, \fIregex_t\fR for compiled internal forms, and +\fIregmatch_t\fR for returning captured substrings. It also defines some +constants whose names start with "REG_"; these are used for setting options and +identifying error codes. 
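The four functions in the synopsis above are normally used together as compile, match, report, free. The following is a minimal sketch of that sequence, not code from this commit. It assumes the header is reachable as regex.h, which the description says is permitted; against a system regex.h the same calls compile but use POSIX rather than Perl syntax. The helper name match_start, its return conventions, and the REG_EXTENDED fallback are illustrative assumptions. The usage pattern below is chosen so that it means the same under either syntax.

```c
#include <stdio.h>
#include <regex.h>   /* assumption: pcreposix.h installed or aliased under this name */

/* The pcreposix.h in this commit defines no REG_EXTENDED, while a system
   regex.h does. Defining it as 0 when absent lets the sketch build against
   either header. */
#ifndef REG_EXTENDED
#define REG_EXTENDED 0
#endif

/* Hypothetical helper: returns the offset of the first match of pattern in
   subject, -1 if nothing matched, or -2 on a compile or matching error. */
int match_start(const char *pattern, const char *subject)
{
  regex_t preg;
  regmatch_t pmatch[1];   /* element 0 describes the whole match */
  char errbuf[128];
  int rc;

  /* Compile; a non-zero yield is an error code that regerror() can decode. */
  rc = regcomp(&preg, pattern, REG_EXTENDED);
  if (rc != 0)
    {
    regerror(rc, &preg, errbuf, sizeof(errbuf));
    fprintf(stderr, "compile failed: %s\n", errbuf);
    return -2;
    }

  /* Match; zero means success, REG_NOMATCH is the expected failure code. */
  rc = regexec(&preg, subject, 1, pmatch, 0);
  if (rc != 0 && rc != REG_NOMATCH)
    {
    regerror(rc, &preg, errbuf, sizeof(errbuf));
    fprintf(stderr, "match failed: %s\n", errbuf);
    }

  /* Release the memory that regcomp() attached to preg. */
  regfree(&preg);

  if (rc == 0) return (int)pmatch[0].rm_so;
  return (rc == REG_NOMATCH)? -1 : -2;
}
```

Called as match_start("ab*c", "xxabbbcyy"), the helper reports where the match begins. Passing a larger pmatch array to regexec() would also expose the offsets of any captured substrings.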
+ + +.SH COMPILING A PATTERN + +The function \fBregcomp()\fR is called to compile a pattern into an +internal form. The pattern is a C string terminated by a binary zero, and +is passed in the argument \fIpattern\fR. The \fIpreg\fR argument is a pointer +to a regex_t structure which is used as a base for storing information about +the compiled expression. + +The argument \fIcflags\fR is either zero, or contains one or more of the bits +defined by the following macros: + + REG_ICASE + +The PCRE_CASELESS option is set when the expression is passed for compilation +to the native function. + + REG_NEWLINE + +The PCRE_MULTILINE option is set when the expression is passed for compilation +to the native function. + +The yield of \fBregcomp()\fR is zero on success, and non-zero otherwise. The +\fIpreg\fR structure is filled in on success, and one member of the structure +is publicized: \fIre_nsub\fR contains the number of capturing subpatterns in +the regular expression. Various error codes are defined in the header file. + + +.SH MATCHING A PATTERN +The function \fBregexec()\fR is called to match a pre-compiled pattern +\fIpreg\fR against a given \fIstring\fR, which is terminated by a zero byte, +subject to the options in \fIeflags\fR. These can be: + + REG_NOTBOL + +The PCRE_NOTBOL option is set when calling the underlying PCRE matching +function. + + REG_NOTEOL + +The PCRE_NOTEOL option is set when calling the underlying PCRE matching +function. + +The portion of the string that was matched, and also any captured substrings, +are returned via the \fIpmatch\fR argument, which points to an array of +\fInmatch\fR structures of type \fIregmatch_t\fR, containing the members +\fIrm_so\fR and \fIrm_eo\fR. These contain the offset to the first character of +each substring and the offset to the first character after the end of each +substring, respectively. 
The 0th element of the vector relates to the entire +portion of \fIstring\fR that was matched; subsequent elements relate to the +capturing subpatterns of the regular expression. Unused entries in the array +have both structure members set to -1. + +A successful match yields a zero return; various error codes are defined in the +header file, of which REG_NOMATCH is the "expected" failure code. + + +.SH ERROR MESSAGES +The \fBregerror()\fR function maps a non-zero errorcode from either +\fBregcomp\fR or \fBregexec\fR to a printable message. If \fIpreg\fR is not +NULL, the error should have arisen from the use of that structure. A message +terminated by a binary zero is placed in \fIerrbuf\fR. The length of the +message, including the zero, is limited to \fIerrbuf_size\fR. The yield of the +function is the size of buffer needed to hold the whole message. + + +.SH STORAGE +Compiling a regular expression causes memory to be allocated and associated +with the \fIpreg\fR structure. The function \fBregfree()\fR frees all such +memory, after which \fIpreg\fR may no longer be used as a compiled expression. + + +.SH AUTHOR +Philip Hazel <ph10@cam.ac.uk> +.br +University Computing Service, +.br +New Museums Site, +.br +Cambridge CB2 3QG, England. +.br +Phone: +44 1223 334714 + +Copyright (c) 1997-1999 University of Cambridge. diff --git a/ext/pcre/pcrelib/pcreposix.c b/ext/pcre/pcrelib/pcreposix.c new file mode 100644 index 0000000000..b3707012e4 --- /dev/null +++ b/ext/pcre/pcrelib/pcreposix.c @@ -0,0 +1,250 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +This module is a wrapper that provides a POSIX API to the underlying PCRE +functions. 
+ +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-1999 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + +#include "internal.h" +#include "pcreposix.h" +#include "stdlib.h" + + + +/* Corresponding tables of PCRE error messages and POSIX error codes. 
*/ + +static const char *estring[] = { + ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9, ERR10, + ERR11, ERR12, ERR13, ERR14, ERR15, ERR16, ERR17, ERR18, ERR19, ERR20, + ERR21, ERR22, ERR23, ERR24, ERR25 }; + +static int eint[] = { + REG_EESCAPE, /* "\\ at end of pattern" */ + REG_EESCAPE, /* "\\c at end of pattern" */ + REG_EESCAPE, /* "unrecognized character follows \\" */ + REG_BADBR, /* "numbers out of order in {} quantifier" */ + REG_BADBR, /* "number too big in {} quantifier" */ + REG_EBRACK, /* "missing terminating ] for character class" */ + REG_ECTYPE, /* "invalid escape sequence in character class" */ + REG_ERANGE, /* "range out of order in character class" */ + REG_BADRPT, /* "nothing to repeat" */ + REG_BADRPT, /* "operand of unlimited repeat could match the empty string" */ + REG_ASSERT, /* "internal error: unexpected repeat" */ + REG_BADPAT, /* "unrecognized character after (?" */ + REG_ESIZE, /* "too many capturing parenthesized sub-patterns" */ + REG_EPAREN, /* "missing )" */ + REG_ESUBREG, /* "back reference to non-existent subpattern" */ + REG_INVARG, /* "erroffset passed as NULL" */ + REG_INVARG, /* "unknown option bit(s) set" */ + REG_EPAREN, /* "missing ) after comment" */ + REG_ESIZE, /* "too many sets of parentheses" */ + REG_ESIZE, /* "regular expression too large" */ + REG_ESPACE, /* "failed to get memory" */ + REG_EPAREN, /* "unmatched brackets" */ + REG_ASSERT, /* "internal error: code overflow" */ + REG_BADPAT, /* "unrecognized character after (?<" */ + REG_BADPAT, /* "lookbehind assertion is not fixed length" */ + REG_BADPAT, /* "malformed number after (?(" */ + REG_BADPAT, /* "conditional group contains more than two branches" */ + REG_BADPAT /* "assertion expected after (?(" */ +}; + +/* Table of texts corresponding to POSIX error codes */ + +static const char *pstring[] = { + "", /* Dummy for value 0 */ + "internal error", /* REG_ASSERT */ + "invalid repeat counts in {}", /* BADBR */ + "pattern error", /* BADPAT */ + "?
* + invalid", /* BADRPT */ + "unbalanced {}", /* EBRACE */ + "unbalanced []", /* EBRACK */ + "collation error - not relevant", /* ECOLLATE */ + "bad class", /* ECTYPE */ + "bad escape sequence", /* EESCAPE */ + "empty expression", /* EMPTY */ + "unbalanced ()", /* EPAREN */ + "bad range inside []", /* ERANGE */ + "expression too big", /* ESIZE */ + "failed to get memory", /* ESPACE */ + "bad back reference", /* ESUBREG */ + "bad argument", /* INVARG */ + "match failed" /* NOMATCH */ +}; + + + + +/************************************************* +* Translate PCRE text code to int * +*************************************************/ + +/* PCRE compile-time errors are given as strings defined as macros. We can just +look them up in a table to turn them into POSIX-style error codes. */ + +static int +pcre_posix_error_code(const char *s) +{ +size_t i; +for (i = 0; i < sizeof(estring)/sizeof(char *); i++) + if (strcmp(s, estring[i]) == 0) return eint[i]; +return REG_ASSERT; +} + + + +/************************************************* +* Translate error code to string * +*************************************************/ + +size_t +regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size) +{ +const char *message, *addmessage; +size_t length, addlength; + +message = (errcode >= (int)(sizeof(pstring)/sizeof(char *)))? + "unknown error code" : pstring[errcode]; +length = strlen(message) + 1; + +addmessage = " at offset "; +addlength = (preg != NULL && (int)preg->re_erroffset != -1)? 
+ strlen(addmessage) + 6 : 0; + +if (errbuf_size > 0) + { + if (addlength > 0 && errbuf_size >= length + addlength) + sprintf(errbuf, "%s%s%-6d", message, addmessage, (int)preg->re_erroffset); + else + { + strncpy(errbuf, message, errbuf_size - 1); + errbuf[errbuf_size-1] = 0; + } + } + +return length + addlength; +} + + + + +/************************************************* +* Free store held by a regex * +*************************************************/ + +void +regfree(regex_t *preg) +{ +(pcre_free)(preg->re_pcre); +} + + + + +/************************************************* +* Compile a regular expression * +*************************************************/ + +/* +Arguments: + preg points to a structure for recording the compiled expression + pattern the pattern to compile + cflags compilation flags + +Returns: 0 on success + various non-zero codes on failure +*/ + +int +regcomp(regex_t *preg, const char *pattern, int cflags) +{ +const char *errorptr; +int erroffset; +int options = 0; + +if ((cflags & REG_ICASE) != 0) options |= PCRE_CASELESS; +if ((cflags & REG_NEWLINE) != 0) options |= PCRE_MULTILINE; + +preg->re_pcre = pcre_compile(pattern, options, &errorptr, &erroffset, NULL); +preg->re_erroffset = erroffset; + +if (preg->re_pcre == NULL) return pcre_posix_error_code(errorptr); + +preg->re_nsub = pcre_info(preg->re_pcre, NULL, NULL); +return 0; +} + + + + +/************************************************* +* Match a regular expression * +*************************************************/ + +int +regexec(regex_t *preg, const char *string, size_t nmatch, + regmatch_t pmatch[], int eflags) +{ +int rc; +int options = 0; + +if ((eflags & REG_NOTBOL) != 0) options |= PCRE_NOTBOL; +if ((eflags & REG_NOTEOL) != 0) options |= PCRE_NOTEOL; + +preg->re_erroffset = (size_t)(-1); /* Only has meaning after compile */ + +rc = pcre_exec(preg->re_pcre, NULL, string, (int)strlen(string), options, + (int *)pmatch, nmatch * 2); + +if (rc == 0) return 0; /* All pmatch 
were filled in */ + +if (rc > 0) + { + size_t i; + for (i = rc; i < nmatch; i++) pmatch[i].rm_so = pmatch[i].rm_eo = -1; + return 0; + } + +else switch(rc) + { + case PCRE_ERROR_NOMATCH: return REG_NOMATCH; + case PCRE_ERROR_NULL: return REG_INVARG; + case PCRE_ERROR_BADOPTION: return REG_INVARG; + case PCRE_ERROR_BADMAGIC: return REG_INVARG; + case PCRE_ERROR_UNKNOWN_NODE: return REG_ASSERT; + case PCRE_ERROR_NOMEMORY: return REG_ESPACE; + default: return REG_ASSERT; + } +} + +/* End of pcreposix.c */ diff --git a/ext/pcre/pcrelib/pcreposix.h b/ext/pcre/pcrelib/pcreposix.h new file mode 100644 index 0000000000..208db354b0 --- /dev/null +++ b/ext/pcre/pcrelib/pcreposix.h @@ -0,0 +1,82 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* Copyright (c) 1997-1999 University of Cambridge */ + +#ifndef _PCREPOSIX_H +#define _PCREPOSIX_H + +/* This is the header for the POSIX wrapper interface to the PCRE Perl- +Compatible Regular Expression library. It defines the things POSIX says should +be there. I hope. */ + +/* Have to include stdlib.h in order to ensure that size_t is defined. */ + +#include <stdlib.h> + +/* Allow for C++ users */ + +#ifdef __cplusplus +extern "C" { +#endif + +/* Options defined by POSIX. */ + +#define REG_ICASE 0x01 +#define REG_NEWLINE 0x02 +#define REG_NOTBOL 0x04 +#define REG_NOTEOL 0x08 + +/* Error values. Not all these are relevant or used by the wrapper. */ + +enum { + REG_ASSERT = 1, /* internal error ? */ + REG_BADBR, /* invalid repeat counts in {} */ + REG_BADPAT, /* pattern error */ + REG_BADRPT, /* ? 
* + invalid */ + REG_EBRACE, /* unbalanced {} */ + REG_EBRACK, /* unbalanced [] */ + REG_ECOLLATE, /* collation error - not relevant */ + REG_ECTYPE, /* bad class */ + REG_EESCAPE, /* bad escape sequence */ + REG_EMPTY, /* empty expression */ + REG_EPAREN, /* unbalanced () */ + REG_ERANGE, /* bad range inside [] */ + REG_ESIZE, /* expression too big */ + REG_ESPACE, /* failed to get memory */ + REG_ESUBREG, /* bad back reference */ + REG_INVARG, /* bad argument */ + REG_NOMATCH /* match failed */ +}; + + +/* The structure representing a compiled regular expression. */ + +typedef struct { + void *re_pcre; + size_t re_nsub; + size_t re_erroffset; +} regex_t; + +/* The structure in which a captured offset is returned. */ + +typedef int regoff_t; + +typedef struct { + regoff_t rm_so; + regoff_t rm_eo; +} regmatch_t; + +/* The functions */ + +extern int regcomp(regex_t *, const char *, int); +extern int regexec(regex_t *, const char *, size_t, regmatch_t *, int); +extern size_t regerror(int, const regex_t *, char *, size_t); +extern void regfree(regex_t *); + +#ifdef __cplusplus +} /* extern "C" */ +#endif + +#endif /* End of pcreposix.h */ diff --git a/ext/pcre/pcrelib/pcretest.c b/ext/pcre/pcrelib/pcretest.c new file mode 100644 index 0000000000..da736a20e7 --- /dev/null +++ b/ext/pcre/pcrelib/pcretest.c @@ -0,0 +1,910 @@ +/************************************************* +* PCRE testing program * +*************************************************/ + +#include <ctype.h> +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <time.h> +#include <locale.h> + +/* Use the internal info for displaying the results of pcre_study(). */ + +#include "internal.h" +#include "pcreposix.h" + +#ifndef CLOCKS_PER_SEC +#ifdef CLK_TCK +#define CLOCKS_PER_SEC CLK_TCK +#else +#define CLOCKS_PER_SEC 100 +#endif +#endif + +#define LOOPREPEAT 20000 + + +static FILE *outfile; +static int log_store = 0; + + + +/* Debugging function to print the internal form of the regex. 
This is the same +code as contained in pcre.c under the DEBUG macro. */ + +static const char *OP_names[] = { + "End", "\\A", "\\B", "\\b", "\\D", "\\d", + "\\S", "\\s", "\\W", "\\w", "\\Z", "\\z", + "Opt", "^", "$", "Any", "chars", "not", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", + "class", "Ref", + "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", + "AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref", + "Brazero", "Braminzero", "Bra" +}; + + +static void print_internals(pcre *re, FILE *outfile) +{ +unsigned char *code = ((real_pcre *)re)->code; + +fprintf(outfile, "------------------------------------------------------------------\n"); + +for(;;) + { + int c; + int charlength; + + fprintf(outfile, "%3d ", (int)(code - ((real_pcre *)re)->code)); + + if (*code >= OP_BRA) + { + fprintf(outfile, "%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); + code += 2; + } + + else switch(*code) + { + case OP_END: + fprintf(outfile, " %s\n", OP_names[*code]); + fprintf(outfile, "------------------------------------------------------------------\n"); + return; + + case OP_OPT: + fprintf(outfile, " %.2x %s", code[1], OP_names[*code]); + code++; + break; + + case OP_COND: + fprintf(outfile, "%3d Cond", (code[1] << 8) + code[2]); + code += 2; + break; + + case OP_CREF: + fprintf(outfile, " %.2d %s", code[1], OP_names[*code]); + code++; + break; + + case OP_CHARS: + charlength = *(++code); + fprintf(outfile, "%3d ", charlength); + while (charlength-- > 0) + if (isprint(c = *(++code))) fprintf(outfile, "%c", c); + else fprintf(outfile, "\\x%02x", c); + break; + + case OP_KETRMAX: + case OP_KETRMIN: + case OP_ALT: + case OP_KET: + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + case OP_ONCE: + fprintf(outfile, "%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 
2; + break; + + case OP_REVERSE: + fprintf(outfile, "%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + if (*code >= OP_TYPESTAR) + fprintf(outfile, " %s", OP_names[code[1]]); + else if (isprint(c = code[1])) fprintf(outfile, " %c", c); + else fprintf(outfile, " \\x%02x", c); + fprintf(outfile, "%s", OP_names[*code++]); + break; + + case OP_EXACT: + case OP_UPTO: + case OP_MINUPTO: + if (isprint(c = code[3])) fprintf(outfile, " %c{", c); + else fprintf(outfile, " \\x%02x{", c); + if (*code != OP_EXACT) fprintf(outfile, ","); + fprintf(outfile, "%d}", (code[1] << 8) + code[2]); + if (*code == OP_MINUPTO) fprintf(outfile, "?"); + code += 3; + break; + + case OP_TYPEEXACT: + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + fprintf(outfile, " %s{", OP_names[code[3]]); + if (*code != OP_TYPEEXACT) fprintf(outfile, "0,"); + fprintf(outfile, "%d}", (code[1] << 8) + code[2]); + if (*code == OP_TYPEMINUPTO) fprintf(outfile, "?"); + code += 3; + break; + + case OP_NOT: + if (isprint(c = *(++code))) fprintf(outfile, " [^%c]", c); + else fprintf(outfile, " [^\\x%02x]", c); + break; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + if (isprint(c = code[1])) fprintf(outfile, " [^%c]", c); + else fprintf(outfile, " [^\\x%02x]", c); + fprintf(outfile, "%s", OP_names[*code++]); + break; + + case OP_NOTEXACT: + case OP_NOTUPTO: + case OP_NOTMINUPTO: + if (isprint(c = code[3])) fprintf(outfile, " [^%c]{", c); + else fprintf(outfile, " [^\\x%02x]{", c); + if (*code != OP_NOTEXACT) fprintf(outfile, ","); + fprintf(outfile, "%d}", (code[1] << 8) + code[2]); + if (*code == OP_NOTMINUPTO) fprintf(outfile, "?"); + code += 3; + break; + + case OP_REF: + 
fprintf(outfile, " \\%d", *(++code)); + code++; + goto CLASS_REF_REPEAT; + + case OP_CLASS: + { + int i, min, max; + code++; + fprintf(outfile, " ["); + + for (i = 0; i < 256; i++) + { + if ((code[i/8] & (1 << (i&7))) != 0) + { + int j; + for (j = i+1; j < 256; j++) + if ((code[j/8] & (1 << (j&7))) == 0) break; + if (i == '-' || i == ']') fprintf(outfile, "\\"); + if (isprint(i)) fprintf(outfile, "%c", i); else fprintf(outfile, "\\x%02x", i); + if (--j > i) + { + fprintf(outfile, "-"); + if (j == '-' || j == ']') fprintf(outfile, "\\"); + if (isprint(j)) fprintf(outfile, "%c", j); else fprintf(outfile, "\\x%02x", j); + } + i = j; + } + } + fprintf(outfile, "]"); + code += 32; + + CLASS_REF_REPEAT: + + switch(*code) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + fprintf(outfile, "%s", OP_names[*code]); + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + min = (code[1] << 8) + code[2]; + max = (code[3] << 8) + code[4]; + if (max == 0) fprintf(outfile, "{%d,}", min); + else fprintf(outfile, "{%d,%d}", min, max); + if (*code == OP_CRMINRANGE) fprintf(outfile, "?"); + code += 4; + break; + + default: + code--; + } + } + break; + + /* Anything else is just a one-node item */ + + default: + fprintf(outfile, " %s", OP_names[*code]); + break; + } + + code++; + fprintf(outfile, "\n"); + } +} + + + +/* Character string printing function. */ + +static void pchars(unsigned char *p, int length) +{ +int c; +while (length-- > 0) + if (isprint(c = *(p++))) fprintf(outfile, "%c", c); + else fprintf(outfile, "\\x%02x", c); +} + + + +/* Alternative malloc function, to test functionality and show the size of the +compiled re. 
*/ + +static void *new_malloc(size_t size) +{ +if (log_store) + fprintf(outfile, "Memory allocation request: %d (code space %d)\n", + (int)size, (int)size - offsetof(real_pcre, code[0])); +return malloc(size); +} + + + +/* Read lines from named file or stdin and write to named file or stdout; lines +consist of a regular expression, in delimiters and optionally followed by +options, followed by a set of test data, terminated by an empty line. */ + +int main(int argc, char **argv) +{ +FILE *infile = stdin; +int options = 0; +int study_options = 0; +int op = 1; +int timeit = 0; +int showinfo = 0; +int showstore = 0; +int posix = 0; +int debug = 0; +int done = 0; +unsigned char buffer[30000]; +unsigned char dbuffer[1024]; + +/* Static so that new_malloc can use it. */ + +outfile = stdout; + +/* Scan options */ + +while (argc > 1 && argv[op][0] == '-') + { + if (strcmp(argv[op], "-s") == 0 || strcmp(argv[op], "-m") == 0) + showstore = 1; + else if (strcmp(argv[op], "-t") == 0) timeit = 1; + else if (strcmp(argv[op], "-i") == 0) showinfo = 1; + else if (strcmp(argv[op], "-d") == 0) showinfo = debug = 1; + else if (strcmp(argv[op], "-p") == 0) posix = 1; + else + { + printf("*** Unknown option %s\n", argv[op]); + printf("Usage: pcretest [-d] [-i] [-p] [-s] [-t] [<input> [<output>]]\n"); + printf(" -d debug: show compiled code; implies -i\n" + " -i show information about compiled pattern\n" + " -p use POSIX interface\n" + " -s output store information\n" + " -t time compilation and execution\n"); + return 1; + } + op++; + argc--; + } + +/* Sort out the input and output files */ + +if (argc > 1) + { + infile = fopen(argv[op], "r"); + if (infile == NULL) + { + printf("** Failed to open %s\n", argv[op]); + return 1; + } + } + +if (argc > 2) + { + outfile = fopen(argv[op+1], "w"); + if (outfile == NULL) + { + printf("** Failed to open %s\n", argv[op+1]); + return 1; + } + } + +/* Set alternative malloc function */ + +pcre_malloc = new_malloc; + +/* Heading line, then prompt 
for first regex if stdin */ + +fprintf(outfile, "PCRE version %s\n\n", pcre_version()); + +/* Main loop */ + +while (!done) + { + pcre *re = NULL; + pcre_extra *extra = NULL; + regex_t preg; + const char *error; + unsigned char *p, *pp, *ppp; + unsigned const char *tables = NULL; + int do_study = 0; + int do_debug = debug; + int do_showinfo = showinfo; + int do_posix = 0; + int erroroffset, len, delimiter; + + if (infile == stdin) printf(" re> "); + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) break; + if (infile != stdin) fprintf(outfile, "%s", (char *)buffer); + + p = buffer; + while (isspace(*p)) p++; + if (*p == 0) continue; + + /* Get the delimiter and seek the end of the pattern; if it isn't + complete, read more. */ + + delimiter = *p++; + + if (isalnum(delimiter) || delimiter == '\\') + { + fprintf(outfile, "** Delimiter must not be alphameric or \\\n"); + goto SKIP_DATA; + } + + pp = p; + + for(;;) + { + while (*pp != 0) + { + if (*pp == '\\' && pp[1] != 0) pp++; + else if (*pp == delimiter) break; + pp++; + } + if (*pp != 0) break; + + len = sizeof(buffer) - (pp - buffer); + if (len < 256) + { + fprintf(outfile, "** Expression too long - missing delimiter?\n"); + goto SKIP_DATA; + } + + if (infile == stdin) printf(" > "); + if (fgets((char *)pp, len, infile) == NULL) + { + fprintf(outfile, "** Unexpected EOF\n"); + done = 1; + goto CONTINUE; + } + if (infile != stdin) fprintf(outfile, "%s", (char *)pp); + } + + /* If the first character after the delimiter is backslash, make + the pattern end with backslash. This is purely to provide a way + of testing for the error message when a pattern ends with backslash.
*/ + + if (pp[1] == '\\') *pp++ = '\\'; + + /* Terminate the pattern at the delimiter */ + + *pp++ = 0; + + /* Look for options after final delimiter */ + + options = 0; + study_options = 0; + log_store = showstore; /* default from command line */ + + while (*pp != 0) + { + switch (*pp++) + { + case 'i': options |= PCRE_CASELESS; break; + case 'm': options |= PCRE_MULTILINE; break; + case 's': options |= PCRE_DOTALL; break; + case 'x': options |= PCRE_EXTENDED; break; + + case 'A': options |= PCRE_ANCHORED; break; + case 'D': do_debug = do_showinfo = 1; break; + case 'E': options |= PCRE_DOLLAR_ENDONLY; break; + case 'I': do_showinfo = 1; break; + case 'M': log_store = 1; break; + case 'P': do_posix = 1; break; + case 'S': do_study = 1; break; + case 'U': options |= PCRE_UNGREEDY; break; + case 'X': options |= PCRE_EXTRA; break; + + case 'L': + ppp = pp; + while (*ppp != '\n' && *ppp != ' ') ppp++; + *ppp = 0; + if (setlocale(LC_CTYPE, (const char *)pp) == NULL) + { + fprintf(outfile, "** Failed to set locale \"%s\"\n", pp); + goto SKIP_DATA; + } + tables = pcre_maketables(); + pp = ppp; + break; + + case '\n': case ' ': break; + default: + fprintf(outfile, "** Unknown option '%c'\n", pp[-1]); + goto SKIP_DATA; + } + } + + /* Handle compiling via the POSIX interface, which doesn't support the + timing, showing, or debugging options, nor the ability to pass over + local character tables. */ + + if (posix || do_posix) + { + int rc; + int cflags = 0; + if ((options & PCRE_CASELESS) != 0) cflags |= REG_ICASE; + if ((options & PCRE_MULTILINE) != 0) cflags |= REG_NEWLINE; + rc = regcomp(&preg, (char *)p, cflags); + + /* Compilation failed; go back for another re, skipping to blank line + if non-interactive. 
*/ + + if (rc != 0) + { + (void)regerror(rc, &preg, (char *)buffer, sizeof(buffer)); + fprintf(outfile, "Failed: POSIX code %d: %s\n", rc, buffer); + goto SKIP_DATA; + } + } + + /* Handle compiling via the native interface */ + + else + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < LOOPREPEAT; i++) + { + re = pcre_compile((char *)p, options, &error, &erroroffset, tables); + if (re != NULL) free(re); + } + time_taken = clock() - start_time; + fprintf(outfile, "Compile time %.3f milliseconds\n", + ((double)time_taken * 1000.0) / + ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + } + + re = pcre_compile((char *)p, options, &error, &erroroffset, tables); + + /* Compilation failed; go back for another re, skipping to blank line + if non-interactive. */ + + if (re == NULL) + { + fprintf(outfile, "Failed: %s at offset %d\n", error, erroroffset); + SKIP_DATA: + if (infile != stdin) + { + for (;;) + { + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) + { + done = 1; + goto CONTINUE; + } + len = (int)strlen((char *)buffer); + while (len > 0 && isspace(buffer[len-1])) len--; + if (len == 0) break; + } + fprintf(outfile, "\n"); + } + goto CONTINUE; + } + + /* Compilation succeeded; print data if required */ + + if (do_showinfo) + { + int first_char, count; + + if (do_debug) print_internals(re, outfile); + + count = pcre_info(re, &options, &first_char); + if (count < 0) fprintf(outfile, + "Error %d while reading info\n", count); + else + { + fprintf(outfile, "Identifying subpattern count = %d\n", count); + if (options == 0) fprintf(outfile, "No options\n"); + else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s\n", + ((options & PCRE_ANCHORED) != 0)? " anchored" : "", + ((options & PCRE_CASELESS) != 0)? " caseless" : "", + ((options & PCRE_EXTENDED) != 0)? " extended" : "", + ((options & PCRE_MULTILINE) != 0)? " multiline" : "", + ((options & PCRE_DOTALL) != 0)? 
" dotall" : "", + ((options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "", + ((options & PCRE_EXTRA) != 0)? " extra" : "", + ((options & PCRE_UNGREEDY) != 0)? " ungreedy" : ""); + if (first_char == -1) + { + fprintf(outfile, "First char at start or follows \\n\n"); + } + else if (first_char < 0) + { + fprintf(outfile, "No first char\n"); + } + else + { + if (isprint(first_char)) + fprintf(outfile, "First char = \'%c\'\n", first_char); + else + fprintf(outfile, "First char = %d\n", first_char); + } + } + } + + /* If /S was present, study the regexp to generate additional info to + help with the matching. */ + + if (do_study) + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < LOOPREPEAT; i++) + extra = pcre_study(re, study_options, &error); + time_taken = clock() - start_time; + if (extra != NULL) free(extra); + fprintf(outfile, " Study time %.3f milliseconds\n", + ((double)time_taken * 1000.0)/ + ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + } + + extra = pcre_study(re, study_options, &error); + if (error != NULL) + fprintf(outfile, "Failed to study: %s\n", error); + else if (extra == NULL) + fprintf(outfile, "Study returned NULL\n"); + + /* This looks at internal information. A bit kludgy to do it this + way, but it is useful for testing. 
*/ + + else if (do_showinfo) + { + real_pcre_extra *xx = (real_pcre_extra *)extra; + if ((xx->options & PCRE_STUDY_MAPPED) == 0) + fprintf(outfile, "No starting character set\n"); + else + { + int i; + int c = 24; + fprintf(outfile, "Starting character set: "); + for (i = 0; i < 256; i++) + { + if ((xx->start_bits[i/8] & (1<<(i%8))) != 0) + { + if (c > 75) + { + fprintf(outfile, "\n "); + c = 2; + } + if (isprint(i) && i != ' ') + { + fprintf(outfile, "%c ", i); + c += 2; + } + else + { + fprintf(outfile, "\\x%02x ", i); + c += 5; + } + } + } + fprintf(outfile, "\n"); + } + } + } + } + + /* Read data lines and test them */ + + for (;;) + { + unsigned char *q; + int count, c; + int copystrings = 0; + int getstrings = 0; + int getlist = 0; + int offsets[45]; + int size_offsets = sizeof(offsets)/sizeof(int); + + options = 0; + + if (infile == stdin) printf(" data> "); + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) + { + done = 1; + goto CONTINUE; + } + if (infile != stdin) fprintf(outfile, "%s", (char *)buffer); + + len = (int)strlen((char *)buffer); + while (len > 0 && isspace(buffer[len-1])) len--; + buffer[len] = 0; + if (len == 0) break; + + p = buffer; + while (isspace(*p)) p++; + + q = dbuffer; + while ((c = *p++) != 0) + { + int i = 0; + int n = 0; + if (c == '\\') switch ((c = *p++)) + { + case 'a': c = 7; break; + case 'b': c = '\b'; break; + case 'e': c = 27; break; + case 'f': c = '\f'; break; + case 'n': c = '\n'; break; + case 'r': c = '\r'; break; + case 't': c = '\t'; break; + case 'v': c = '\v'; break; + + case '0': case '1': case '2': case '3': + case '4': case '5': case '6': case '7': + c -= '0'; + while (i++ < 2 && isdigit(*p) && *p != '8' && *p != '9') + c = c * 8 + *p++ - '0'; + break; + + case 'x': + c = 0; + while (i++ < 2 && isxdigit(*p)) + { + c = c * 16 + tolower(*p) - ((isdigit(*p))? 
'0' : 'W'); + p++; + } + break; + + case 0: /* Allows for an empty line */ + p--; + continue; + + case 'A': /* Option setting */ + options |= PCRE_ANCHORED; + continue; + + case 'B': + options |= PCRE_NOTBOL; + continue; + + case 'C': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + copystrings |= 1 << n; + continue; + + case 'G': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + getstrings |= 1 << n; + continue; + + case 'L': + getlist = 1; + continue; + + case 'O': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + if (n <= (int)(sizeof(offsets)/sizeof(int))) size_offsets = n; + continue; + + case 'Z': + options |= PCRE_NOTEOL; + continue; + } + *q++ = c; + } + *q = 0; + len = q - dbuffer; + + /* Handle matching via the POSIX interface, which does not + support timing. */ + + if (posix || do_posix) + { + int rc; + int eflags = 0; + regmatch_t pmatch[30]; + if ((options & PCRE_NOTBOL) != 0) eflags |= REG_NOTBOL; + if ((options & PCRE_NOTEOL) != 0) eflags |= REG_NOTEOL; + + rc = regexec(&preg, (char *)dbuffer, sizeof(pmatch)/sizeof(regmatch_t), + pmatch, eflags); + + if (rc != 0) + { + (void)regerror(rc, &preg, (char *)buffer, sizeof(buffer)); + fprintf(outfile, "No match: POSIX code %d: %s\n", rc, buffer); + } + else + { + size_t i; + for (i = 0; i < sizeof(pmatch)/sizeof(regmatch_t); i++) + { + if (pmatch[i].rm_so >= 0) + { + fprintf(outfile, "%2d: ", (int)i); + pchars(dbuffer + pmatch[i].rm_so, + pmatch[i].rm_eo - pmatch[i].rm_so); + fprintf(outfile, "\n"); + } + } + } + } + + /* Handle matching via the native interface */ + + else + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < LOOPREPEAT; i++) + count = pcre_exec(re, extra, (char *)dbuffer, len, options, offsets, + size_offsets); + time_taken = clock() - start_time; + fprintf(outfile, "Execute time %.3f milliseconds\n", + ((double)time_taken * 1000.0)/ + ((double)LOOPREPEAT * (double)CLOCKS_PER_SEC)); + } + + count = pcre_exec(re, extra, (char 
*)dbuffer, len, options, offsets, + size_offsets); + + if (count == 0) + { + fprintf(outfile, "Matched, but too many substrings\n"); + count = size_offsets/3; + } + + if (count >= 0) + { + int i; + for (i = 0; i < count * 2; i += 2) + { + if (offsets[i] < 0) + fprintf(outfile, "%2d: <unset>\n", i/2); + else + { + fprintf(outfile, "%2d: ", i/2); + pchars(dbuffer + offsets[i], offsets[i+1] - offsets[i]); + fprintf(outfile, "\n"); + } + } + + for (i = 0; i < 32; i++) + { + if ((copystrings & (1 << i)) != 0) + { + char buffer[16]; + int rc = pcre_copy_substring((char *)dbuffer, offsets, count, + i, buffer, sizeof(buffer)); + if (rc < 0) + fprintf(outfile, "copy substring %d failed %d\n", i, rc); + else + fprintf(outfile, "%2dC %s (%d)\n", i, buffer, rc); + } + } + + for (i = 0; i < 32; i++) + { + if ((getstrings & (1 << i)) != 0) + { + const char *substring; + int rc = pcre_get_substring((char *)dbuffer, offsets, count, + i, &substring); + if (rc < 0) + fprintf(outfile, "get substring %d failed %d\n", i, rc); + else + { + fprintf(outfile, "%2dG %s (%d)\n", i, substring, rc); + free((void *)substring); + } + } + } + + if (getlist) + { + const char **stringlist; + int rc = pcre_get_substring_list((char *)dbuffer, offsets, count, + &stringlist); + if (rc < 0) + fprintf(outfile, "get substring list failed %d\n", rc); + else + { + for (i = 0; i < count; i++) + fprintf(outfile, "%2dL %s\n", i, stringlist[i]); + if (stringlist[i] != NULL) + fprintf(outfile, "string list not terminated by NULL\n"); + free((void *)stringlist); + } + } + + } + else + { + if (count == -1) fprintf(outfile, "No match\n"); + else fprintf(outfile, "Error %d\n", count); + } + } + } + + CONTINUE: + if (posix || do_posix) regfree(&preg); + if (re != NULL) free(re); + if (extra != NULL) free(extra); + if (tables != NULL) + { + free((void *)tables); + setlocale(LC_CTYPE, "C"); + } + } + +fprintf(outfile, "\n"); +return 0; +} + +/* End */ diff --git a/ext/pcre/pcrelib/perltest b/ext/pcre/pcrelib/perltest 
new file mode 100755 index 0000000000..c6faedafaf --- /dev/null +++ b/ext/pcre/pcrelib/perltest @@ -0,0 +1,143 @@ +#! /usr/bin/perl + +# Program for testing regular expressions with perl to check that PCRE handles +# them the same. + + +# Function for turning a string into a string of printing chars + +sub pchars { +my($t) = ""; + +foreach $c (split(//, @_[0])) + { + if (ord $c >= 32 && ord $c < 127) { $t .= $c; } + else { $t .= sprintf("\\x%02x", ord $c); } + } +$t; +} + + + +# Read lines from named file or stdin and write to named file or stdout; lines +# consist of a regular expression, in delimiters and optionally followed by +# options, followed by a set of test data, terminated by an empty line. + +# Sort out the input and output files + +if (@ARGV > 0) + { + open(INFILE, "<$ARGV[0]") || die "Failed to open $ARGV[0]\n"; + $infile = "INFILE"; + } +else { $infile = "STDIN"; } + +if (@ARGV > 1) + { + open(OUTFILE, ">$ARGV[1]") || die "Failed to open $ARGV[1]\n"; + $outfile = "OUTFILE"; + } +else { $outfile = "STDOUT"; } + +printf($outfile "Perl $] Regular Expressions\n\n"); + +# Main loop + +NEXT_RE: +for (;;) + { + printf " re> " if $infile eq "STDIN"; + last if ! ($_ = <$infile>); + printf $outfile "$_" if $infile ne "STDIN"; + next if ($_ eq ""); + + $pattern = $_; + + $delimiter = substr($_, 0, 1); + while ($pattern !~ /^\s*(.).*\1/s) + { + printf " > " if $infile eq "STDIN"; + last if ! ($_ = <$infile>); + printf $outfile "$_" if $infile ne "STDIN"; + $pattern .= $_; + } + + chomp($pattern); + $pattern =~ s/\s+$//; + + # Check that the pattern is valid + + eval "\$_ =~ ${pattern}"; + if ($@) + { + printf $outfile "Error: $@"; + next NEXT_RE; + } + + # Read data lines and test them + + for (;;) + { + printf "data> " if $infile eq "STDIN"; + last NEXT_RE if ! 
($_ = <$infile>); + chomp; + printf $outfile "$_\n" if $infile ne "STDIN"; + + s/\s+$//; + s/^\s+//; + + last if ($_ eq ""); + + $_ = eval "\"$_\""; # To get escapes processed + + $ok = 0; + eval "if (\$_ =~ ${pattern}) {" . + "\$z = \$&;" . + "\$a = \$1;" . + "\$b = \$2;" . + "\$c = \$3;" . + "\$d = \$4;" . + "\$e = \$5;" . + "\$f = \$6;" . + "\$g = \$7;" . + "\$h = \$8;" . + "\$i = \$9;" . + "\$j = \$10;" . + "\$k = \$11;" . + "\$l = \$12;" . + "\$m = \$13;" . + "\$n = \$14;" . + "\$o = \$15;" . + "\$p = \$16;" . + "\$ok = 1; }"; + + if ($@) + { + printf $outfile "Error: $@\n"; + next NEXT_RE; + } + elsif (!$ok) + { + printf $outfile "No match\n"; + } + else + { + @subs = ($z,$a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k,$l,$m,$n,$o,$p); + $last_printed = 0; + for ($i = 0; $i <= 17; $i++) + { + if ($i == 0 || defined $subs[$i]) + { + while ($last_printed++ < $i-1) + { printf $outfile ("%2d: <unset>\n", $last_printed); } + printf $outfile ("%2d: %s\n", $i, &pchars($subs[$i])); + $last_printed = $i; + } + } + } + } + } + +printf $outfile "\n"; + +# End diff --git a/ext/pcre/pcrelib/pgrep.1 b/ext/pcre/pcrelib/pgrep.1 new file mode 100644 index 0000000000..49f81d31e0 --- /dev/null +++ b/ext/pcre/pcrelib/pgrep.1 @@ -0,0 +1,72 @@ +.TH PGREP 1 +.SH NAME +pgrep - a grep with Perl-compatible regular expressions. +.SH SYNOPSIS +.B pgrep [-chilnsvx] pattern [file] ... + + +.SH DESCRIPTION +\fBpgrep\fR searches files for character patterns, in the same way as other +grep commands do, but it uses the PCRE regular expression library to support +patterns that are compatible with the regular expressions of Perl 5. See +\fBpcre(3)\fR for a full description of syntax and semantics. + +If no files are specified, \fBpgrep\fR reads the standard input. By default, +each line that matches the pattern is copied to the standard output, and if +there is more than one file, the file name is printed before each line of +output. However, there are options that can change how \fBpgrep\fR behaves. 
+ +Lines are limited to BUFSIZ characters. BUFSIZ is defined in \fB<stdio.h>\fR. +The newline character is removed from the end of each line before it is matched +against the pattern. + + +.SH OPTIONS +.TP 10 +\fB-c\fR +Do not print individual lines; instead just print a count of the number of +lines that would otherwise have been printed. If several files are given, a +count is printed for each of them. +.TP +\fB-h\fR +Suppress printing of filenames when searching multiple files. +.TP +\fB-i\fR +Ignore upper/lower case distinctions during comparisons. +.TP +\fB-l\fR +Instead of printing lines from the files, just print the names of the files +containing lines that would have been printed. Each file name is printed +once, on a separate line. +.TP +\fB-n\fR +Precede each line by its line number in the file. +.TP +\fB-s\fR +Work silently, that is, display nothing except error messages. +The exit status indicates whether any matches were found. +.TP +\fB-v\fR +Invert the sense of the match, so that lines which do \fInot\fR match the +pattern are now the ones that are found. +.TP +\fB-x\fR +Force the pattern to be anchored (it must start matching at the beginning of +the line) and in addition, require it to match the entire line. This is +equivalent to having ^ and $ characters at the start and end of each +alternative branch in the regular expression. + + +.SH SEE ALSO +\fBpcre(3)\fR, Perl 5 documentation + + +.SH DIAGNOSTICS +Exit status is 0 if any matches were found, 1 if no matches were found, and 2 +for syntax errors or inaccessible files (even if matches were found). + + +.SH AUTHOR +Philip Hazel <ph10@cam.ac.uk> +.br +Copyright (c) 1997-1999 University of Cambridge. 
diff --git a/ext/pcre/pcrelib/pgrep.c b/ext/pcre/pcrelib/pgrep.c new file mode 100644 index 0000000000..b41083605a --- /dev/null +++ b/ext/pcre/pcrelib/pgrep.c @@ -0,0 +1,220 @@ +/************************************************* +* PCRE grep program * +*************************************************/ + +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <errno.h> +#include "pcre.h" + + +#define FALSE 0 +#define TRUE 1 + +typedef int BOOL; + + + +/************************************************* +* Global variables * +*************************************************/ + +static pcre *pattern; +static pcre_extra *hints; + +static BOOL count_only = FALSE; +static BOOL filenames_only = FALSE; +static BOOL invert = FALSE; +static BOOL number = FALSE; +static BOOL silent = FALSE; +static BOOL whole_lines = FALSE; + + + +#ifdef STRERROR_FROM_ERRLIST +/************************************************* +* Provide strerror() for non-ANSI libraries * +*************************************************/ + +/* Some old-fashioned systems still around (e.g. SunOS4) don't have strerror() +in their libraries, but can provide the same facility by this simple +alternative function. 
*/ + +extern int sys_nerr; +extern char *sys_errlist[]; + +char * +strerror(int n) +{ +if (n < 0 || n >= sys_nerr) return "unknown error number"; +return sys_errlist[n]; +} +#endif /* STRERROR_FROM_ERRLIST */ + + + +/************************************************* +* Grep an individual file * +*************************************************/ + +static int +pgrep(FILE *in, char *name) +{ +int rc = 1; +int linenumber = 0; +int count = 0; +int offsets[99]; +char buffer[BUFSIZ]; + +while (fgets(buffer, sizeof(buffer), in) != NULL) + { + BOOL match; + int length = (int)strlen(buffer); + if (length > 0 && buffer[length-1] == '\n') buffer[--length] = 0; + linenumber++; + + match = pcre_exec(pattern, hints, buffer, length, 0, offsets, 99) >= 0; + if (match && whole_lines && offsets[1] != length) match = FALSE; + + if (match != invert) + { + if (count_only) count++; + + else if (filenames_only) + { + fprintf(stdout, "%s\n", (name == NULL)? "<stdin>" : name); + return 0; + } + + else if (silent) return 0; + + else + { + if (name != NULL) fprintf(stdout, "%s:", name); + if (number) fprintf(stdout, "%d:", linenumber); + fprintf(stdout, "%s\n", buffer); + } + + rc = 0; + } + } + +if (count_only) + { + if (name != NULL) fprintf(stdout, "%s:", name); + fprintf(stdout, "%d\n", count); + } + +return rc; +} + + + + +/************************************************* +* Usage function * +*************************************************/ + +static int +usage(int rc) +{ +fprintf(stderr, "Usage: pgrep [-chilnsvx] pattern [file] ...\n"); +return rc; +} + + + + +/************************************************* +* Main program * +*************************************************/ + +int +main(int argc, char **argv) +{ +int i; +int rc = 1; +int options = 0; +int errptr; +const char *error; +BOOL filenames = TRUE; + +/* Process the options */ + +for (i = 1; i < argc; i++) + { + char *s; + if (argv[i][0] != '-') break; + s = argv[i] + 1; + while (*s != 0) + { + switch (*s++) + { + case 
'c': count_only = TRUE; break; + case 'h': filenames = FALSE; break; + case 'i': options |= PCRE_CASELESS; break; + case 'l': filenames_only = TRUE; break; + case 'n': number = TRUE; break; + case 's': silent = TRUE; break; + case 'v': invert = TRUE; break; + case 'x': whole_lines = TRUE; options |= PCRE_ANCHORED; break; + default: + fprintf(stderr, "pgrep: unknown option %c\n", s[-1]); + return usage(2); + } + } + } + +/* There must be at least a regexp argument */ + +if (i >= argc) return usage(2); + +/* Compile the regular expression. */ + +pattern = pcre_compile(argv[i++], options, &error, &errptr, NULL); +if (pattern == NULL) + { + fprintf(stderr, "pgrep: error in regex at offset %d: %s\n", errptr, error); + return 2; + } + +/* Study the regular expression, as we will be running it many times */ + +hints = pcre_study(pattern, 0, &error); +if (error != NULL) + { + fprintf(stderr, "pgrep: error while studying regex: %s\n", error); + return 2; + } + +/* If there are no further arguments, do the business on stdin and exit */ + +if (i >= argc) return pgrep(stdin, NULL); + +/* Otherwise, work through the remaining arguments as files. If there is only +one, don't give its name on the output. */ + +if (i == argc - 1) filenames = FALSE; +if (filenames_only) filenames = TRUE; + +for (; i < argc; i++) + { + FILE *in = fopen(argv[i], "r"); + if (in == NULL) + { + fprintf(stderr, "%s: failed to open: %s\n", argv[i], strerror(errno)); + rc = 2; + } + else + { + int frc = pgrep(in, filenames? 
argv[i] : NULL); + if (frc == 0 && rc == 1) rc = 0; + fclose(in); + } + } + +return rc; +} + +/* End */ diff --git a/ext/pcre/pcrelib/study.c b/ext/pcre/pcrelib/study.c new file mode 100644 index 0000000000..284833ba04 --- /dev/null +++ b/ext/pcre/pcrelib/study.c @@ -0,0 +1,397 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997-1999 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. + +4. If PCRE is embedded in any software that is released under the GNU + General Purpose Licence (GPL), then the terms of that licence shall + supersede any condition above with which it is incompatible. +----------------------------------------------------------------------------- +*/ + + +/* Include the internals header, which itself includes Standard C headers plus +the external pcre header. 
*/ + +#include "internal.h" + + + +/************************************************* +* Set a bit and maybe its alternate case * +*************************************************/ + +/* Given a character, set its bit in the table, and also the bit for the other +version of a letter if we are caseless. + +Arguments: + start_bits points to the bit map + c is the character + caseless the caseless flag + cd the block with char table pointers + +Returns: nothing +*/ + +static void +set_bit(uschar *start_bits, int c, BOOL caseless, compile_data *cd) +{ +start_bits[c/8] |= (1 << (c&7)); +if (caseless && (cd->ctypes[c] & ctype_letter) != 0) + start_bits[cd->fcc[c]/8] |= (1 << (cd->fcc[c]&7)); +} + + + +/************************************************* +* Create bitmap of starting chars * +*************************************************/ + +/* This function scans a compiled unanchored expression and attempts to build a +bitmap of the set of initial characters. If it can't, it returns FALSE. As time +goes by, we may be able to get more clever at doing this. + +Arguments: + code points to an expression + start_bits points to a 32-byte table, initialized to 0 + caseless the current state of the caseless flag + cd the block with char table pointers + +Returns: TRUE if table built, FALSE otherwise +*/ + +static BOOL +set_start_bits(const uschar *code, uschar *start_bits, BOOL caseless, + compile_data *cd) +{ +register int c; + +/* This next statement and the later reference to dummy are here in order to +trick the optimizer of the IBM C compiler for OS/2 into generating correct +code. Apparently IBM isn't going to fix the problem, and we would rather not +disable optimization (in this module it actually makes a big difference, and +the pcre module can use all the optimization it can get). 
*/ + +volatile int dummy; + +do + { + const uschar *tcode = code + 3; + BOOL try_next = TRUE; + + while (try_next) + { + try_next = FALSE; + + /* If a branch starts with a bracket or a positive lookahead assertion, + recurse to set bits from within them. That's all for this branch. */ + + if ((int)*tcode >= OP_BRA || *tcode == OP_ASSERT) + { + if (!set_start_bits(tcode, start_bits, caseless, cd)) + return FALSE; + } + + else switch(*tcode) + { + default: + return FALSE; + + /* Skip over lookbehind and negative lookahead assertions */ + + case OP_ASSERT_NOT: + case OP_ASSERTBACK: + case OP_ASSERTBACK_NOT: + try_next = TRUE; + do tcode += (tcode[1] << 8) + tcode[2]; while (*tcode == OP_ALT); + tcode += 3; + break; + + /* Skip over an option setting, changing the caseless flag */ + + case OP_OPT: + caseless = (tcode[1] & PCRE_CASELESS) != 0; + tcode += 2; + try_next = TRUE; + break; + + /* BRAZERO does the bracket, but carries on. */ + + case OP_BRAZERO: + case OP_BRAMINZERO: + if (!set_start_bits(++tcode, start_bits, caseless, cd)) + return FALSE; + dummy = 1; + do tcode += (tcode[1] << 8) + tcode[2]; while (*tcode == OP_ALT); + tcode += 3; + try_next = TRUE; + break; + + /* Single-char * or ? 
sets the bit and tries the next item */ + + case OP_STAR: + case OP_MINSTAR: + case OP_QUERY: + case OP_MINQUERY: + set_bit(start_bits, tcode[1], caseless, cd); + tcode += 2; + try_next = TRUE; + break; + + /* Single-char upto sets the bit and tries the next */ + + case OP_UPTO: + case OP_MINUPTO: + set_bit(start_bits, tcode[3], caseless, cd); + tcode += 4; + try_next = TRUE; + break; + + /* At least one single char sets the bit and stops */ + + case OP_EXACT: /* Fall through */ + tcode++; + + case OP_CHARS: /* Fall through */ + tcode++; + + case OP_PLUS: + case OP_MINPLUS: + set_bit(start_bits, tcode[1], caseless, cd); + break; + + /* Single character type sets the bits and stops */ + + case OP_NOT_DIGIT: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_digit]; + break; + + case OP_DIGIT: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_digit]; + break; + + case OP_NOT_WHITESPACE: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_space]; + break; + + case OP_WHITESPACE: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_space]; + break; + + case OP_NOT_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= ~(cd->cbits[c] | cd->cbits[c+cbit_word]); + break; + + case OP_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= (cd->cbits[c] | cd->cbits[c+cbit_word]); + break; + + /* One or more character type fudges the pointer and restarts, knowing + it will hit a single character type and stop there. */ + + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + tcode++; + try_next = TRUE; + break; + + case OP_TYPEEXACT: + tcode += 3; + try_next = TRUE; + break; + + /* Zero or more repeats of character types set the bits and then + try again. 
*/ + + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + tcode += 2; /* Fall through */ + + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + switch(tcode[1]) + { + case OP_NOT_DIGIT: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_digit]; + break; + + case OP_DIGIT: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_digit]; + break; + + case OP_NOT_WHITESPACE: + for (c = 0; c < 32; c++) + start_bits[c] |= ~cd->cbits[c+cbit_space]; + break; + + case OP_WHITESPACE: + for (c = 0; c < 32; c++) + start_bits[c] |= cd->cbits[c+cbit_space]; + break; + + case OP_NOT_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= ~(cd->cbits[c] | cd->cbits[c+cbit_word]); + break; + + case OP_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= (cd->cbits[c] | cd->cbits[c+cbit_word]); + break; + } + + tcode += 2; + try_next = TRUE; + break; + + /* Character class: set the bits and either carry on or not, + according to the repeat count. */ + + case OP_CLASS: + { + tcode++; + for (c = 0; c < 32; c++) start_bits[c] |= tcode[c]; + tcode += 32; + switch (*tcode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRQUERY: + case OP_CRMINQUERY: + tcode++; + try_next = TRUE; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + if (((tcode[1] << 8) + tcode[2]) == 0) + { + tcode += 5; + try_next = TRUE; + } + break; + } + } + break; /* End of class handling */ + + } /* End of switch */ + } /* End of try_next loop */ + + code += (code[1] << 8) + code[2]; /* Advance to next branch */ + } +while (*code == OP_ALT); +return TRUE; +} + + + +/************************************************* +* Study a compiled expression * +*************************************************/ + +/* This function is handed a compiled expression that it must study to produce +information that will speed up the matching. It returns a pcre_extra block +which then gets handed back to pcre_exec(). 
+ +Arguments: + re points to the compiled expression + options contains option bits + errorptr points to where to place error messages; + set NULL unless error + +Returns: pointer to a pcre_extra block, + NULL on error or if no optimization possible +*/ + +pcre_extra * +pcre_study(const pcre *external_re, int options, const char **errorptr) +{ +uschar start_bits[32]; +real_pcre_extra *extra; +const real_pcre *re = (const real_pcre *)external_re; +compile_data compile_block; + +*errorptr = NULL; + +if (re == NULL || re->magic_number != MAGIC_NUMBER) + { + *errorptr = "argument is not a compiled regular expression"; + return NULL; + } + +if ((options & ~PUBLIC_STUDY_OPTIONS) != 0) + { + *errorptr = "unknown or incorrect option bit(s) set"; + return NULL; + } + +/* For an anchored pattern, or an unanchored pattern that has a first char, or a +multiline pattern that matches only at "line starts", no further processing at +present. */ + +if ((re->options & (PCRE_ANCHORED|PCRE_FIRSTSET|PCRE_STARTLINE)) != 0) + return NULL; + +/* Set the character tables in the block which is passed around */ + +compile_block.lcc = re->tables + lcc_offset; +compile_block.fcc = re->tables + fcc_offset; +compile_block.cbits = re->tables + cbits_offset; +compile_block.ctypes = re->tables + ctypes_offset; + +/* See if we can find a fixed set of initial characters for the pattern. */ + +memset(start_bits, 0, 32 * sizeof(uschar)); +if (!set_start_bits(re->code, start_bits, (re->options & PCRE_CASELESS) != 0, + &compile_block)) return NULL; + +/* Get an "extra" block and put the information therein. 
*/ + +extra = (real_pcre_extra *)(pcre_malloc)(sizeof(real_pcre_extra)); + +if (extra == NULL) + { + *errorptr = "failed to get memory"; + return NULL; + } + +extra->options = PCRE_STUDY_MAPPED; +memcpy(extra->start_bits, start_bits, sizeof(start_bits)); + +return (pcre_extra *)extra; +} + +/* End of study.c */ diff --git a/ext/pcre/pcrelib/testinput1 b/ext/pcre/pcrelib/testinput1 new file mode 100644 index 0000000000..da679eaa80 --- /dev/null +++ b/ext/pcre/pcrelib/testinput1 @@ -0,0 +1,1817 @@ +/the quick brown fox/ + the quick brown fox + The quick brown FOX + What do you know about the quick brown fox? + What do you know about THE QUICK BROWN FOX? + +/The quick brown fox/i + the quick brown fox + The quick brown FOX + What do you know about the quick brown fox? + What do you know about THE QUICK BROWN FOX? + +/abcd\t\n\r\f\a\e\071\x3b\$\\\?caxyz/ + abcd\t\n\r\f\a\e9;\$\\?caxyz + +/a*abc?xyz+pqr{3}ab{2,}xy{4,5}pq{0,6}AB{0,}zz/ + abxyzpqrrrabbxyyyypqAzz + abxyzpqrrrabbxyyyypqAzz + aabxyzpqrrrabbxyyyypqAzz + aaabxyzpqrrrabbxyyyypqAzz + aaaabxyzpqrrrabbxyyyypqAzz + abcxyzpqrrrabbxyyyypqAzz + aabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypAzz + aaabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypqqAzz + aaabcxyzpqrrrabbxyyyypqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqqAzz + aaaabcxyzpqrrrabbxyyyypqAzz + abxyzzpqrrrabbxyyyypqAzz + aabxyzzzpqrrrabbxyyyypqAzz + aaabxyzzzzpqrrrabbxyyyypqAzz + aaaabxyzzzzpqrrrabbxyyyypqAzz + abcxyzzpqrrrabbxyyyypqAzz + aabcxyzzzpqrrrabbxyyyypqAzz + aaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyyypqAzz + aaabcxyzpqrrrabbxyyyypABzz + aaabcxyzpqrrrabbxyyyypABBzz + >>>aaabxyzpqrrrabbxyyyypqAzz + >aaaabxyzpqrrrabbxyyyypqAzz + >>>>abcxyzpqrrrabbxyyyypqAzz + *** Failers + abxyzpqrrabbxyyyypqAzz + abxyzpqrrrrabbxyyyypqAzz + abxyzpqrrrabxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyyyypqAzz + 
aaaabcxyzzzzpqrrrabbbxyyypqAzz + aaabcxyzpqrrrabbxyyyypqqqqqqqAzz + +/^(abc){1,2}zz/ + abczz + abcabczz + *** Failers + zz + abcabcabczz + >>abczz + +/^(b+?|a){1,2}?c/ + bc + bbc + bbbc + bac + bbac + aac + abbbbbbbbbbbc + bbbbbbbbbbbac + *** Failers + aaac + abbbbbbbbbbbac + +/^(b+|a){1,2}c/ + bc + bbc + bbbc + bac + bbac + aac + abbbbbbbbbbbc + bbbbbbbbbbbac + *** Failers + aaac + abbbbbbbbbbbac + +/^(b+|a){1,2}?bc/ + bbc + +/^(b*|ba){1,2}?bc/ + babc + bbabc + bababc + *** Failers + bababbc + babababc + +/^(ba|b*){1,2}?bc/ + babc + bbabc + bababc + *** Failers + bababbc + babababc + +/^\ca\cA\c[\c{\c:/ + \x01\x01\e;z + +/^[ab\]cde]/ + athing + bthing + ]thing + cthing + dthing + ething + *** Failers + fthing + [thing + \\thing + +/^[]cde]/ + ]thing + cthing + dthing + ething + *** Failers + athing + fthing + +/^[^ab\]cde]/ + fthing + [thing + \\thing + *** Failers + athing + bthing + ]thing + cthing + dthing + ething + +/^[^]cde]/ + athing + fthing + *** Failers + ]thing + cthing + dthing + ething + +/^\/ + + +/^/ + + +/^[0-9]+$/ + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 100 + *** Failers + abc + +/^.*nter/ + enter + inter + uponter + +/^xxx[0-9]+$/ + xxx0 + xxx1234 + *** Failers + xxx + +/^.+[0-9][0-9][0-9]$/ + x123 + xx123 + 123456 + *** Failers + 123 + x1234 + +/^.+?[0-9][0-9][0-9]$/ + x123 + xx123 + 123456 + *** Failers + 123 + x1234 + +/^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ + abc!pqr=apquxz.ixr.zzz.ac.uk + *** Failers + !pqr=apquxz.ixr.zzz.ac.uk + abc!=apquxz.ixr.zzz.ac.uk + abc!pqr=apquxz:ixr.zzz.ac.uk + abc!pqr=apquxz.ixr.zzz.ac.ukk + +/:/ + Well, we need a colon: somewhere + *** Fail if we don't + +/([\da-f:]+)$/i + 0abc + abc + fed + E + :: + 5f03:12C0::932e + fed def + Any old stuff + *** Failers + 0zzz + gzzz + fed\x20 + Any old rubbish + +/^.*\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ + .1.2.3 + A.12.123.0 + *** Failers + .1.2.3333 + 1.2.3 + 1234.2.3 + +/^(\d+)\s+IN\s+SOA\s+(\S+)\s+(\S+)\s*\(\s*$/ + 1 IN SOA non-sp1 non-sp2( + 1 IN SOA non-sp1 
non-sp2 ( + *** Failers + 1IN SOA non-sp1 non-sp2( + +/^[a-zA-Z\d][a-zA-Z\d\-]*(\.[a-zA-Z\d][a-zA-z\d\-]*)*\.$/ + a. + Z. + 2. + ab-c.pq-r. + sxk.zzz.ac.uk. + x-.y-. + *** Failers + -abc.peq. + +/^\*\.[a-z]([a-z\-\d]*[a-z\d]+)?(\.[a-z]([a-z\-\d]*[a-z\d]+)?)*$/ + *.a + *.b0-a + *.c3-b.c + *.c-a.b-c + *** Failers + *.0 + *.a- + *.a-b.c- + *.c-a.0-c + +/^(?=ab(de))(abd)(e)/ + abde + +/^(?!(ab)de|x)(abd)(f)/ + abdf + +/^(?=(ab(cd)))(ab)/ + abcd + +/^[\da-f](\.[\da-f])*$/i + a.b.c.d + A.B.C.D + a.b.c.1.2.3.C + +/^\".*\"\s*(;.*)?$/ + \"1234\" + \"abcd\" ; + \"\" ; rhubarb + *** Failers + \"1234\" : things + +/^$/ + \ + *** Failers + +/ ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/x + ab c + *** Failers + abc + ab cde + +/(?x) ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/ + ab c + *** Failers + abc + ab cde + +/^ a\ b[c ]d $/x + a bcd + a b d + *** Failers + abcd + ab d + +/^(a(b(c)))(d(e(f)))(h(i(j)))(k(l(m)))$/ + abcdefhijklm + +/^(?:a(b(c)))(?:d(e(f)))(?:h(i(j)))(?:k(l(m)))$/ + abcdefhijklm + +/^[\w][\W][\s][\S][\d][\D][\b][\n][\c]][\022]/ + a+ Z0+\x08\n\x1d\x12 + +/^[.^$|()*+?{,}]+/ + .^\$(*+)|{?,?} + +/^a*\w/ + z + az + aaaz + a + aa + aaaa + a+ + aa+ + +/^a*?\w/ + z + az + aaaz + a + aa + aaaa + a+ + aa+ + +/^a+\w/ + az + aaaz + aa + aaaa + aa+ + +/^a+?\w/ + az + aaaz + aa + aaaa + aa+ + +/^\d{8}\w{2,}/ + 1234567890 + 12345678ab + 12345678__ + *** Failers + 1234567 + +/^[aeiou\d]{4,5}$/ + uoie + 1234 + 12345 + aaaaa + *** Failers + 123456 + +/^[aeiou\d]{4,5}?/ + uoie + 1234 + 12345 + aaaaa + 123456 + +/\A(abc|def)=(\1){2,3}\Z/ + abc=abcabc + def=defdefdef + *** Failers + abc=defdef + +/^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\11*(\3\4)\1(?#)2$/ + abcdefghijkcda2 + abcdefghijkkkkcda2 + +/(cat(a(ract|tonic)|erpillar)) \1()2(3)/ + cataract cataract23 + catatonic catatonic23 + caterpillar caterpillar23 + + +/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + From abcd Mon Sep 01 12:33:02 1997 + 
+/^From\s+\S+\s+([a-zA-Z]{3}\s+){2}\d{1,2}\s+\d\d:\d\d/ + From abcd Mon Sep 01 12:33:02 1997 + From abcd Mon Sep 1 12:33:02 1997 + *** Failers + From abcd Sep 01 12:33:02 1997 + +/^12.34/s + 12\n34 + 12\r34 + +/\w+(?=\t)/ + the quick brown\t fox + +/foo(?!bar)(.*)/ + foobar is foolish see? + +/(?:(?!foo)...|^.{0,2})bar(.*)/ + foobar crowbar etc + barrel + 2barrel + A barrel + +/^(\D*)(?=\d)(?!123)/ + abc456 + *** Failers + abc123 + +/^1234(?# test newlines + inside)/ + 1234 + +/^1234 #comment in extended re + /x + 1234 + +/#rhubarb + abcd/x + abcd + +/^abcd#rhubarb/x + abcd + +/^(a)\1{2,3}(.)/ + aaab + aaaab + aaaaab + aaaaaab + +/(?!^)abc/ + the abc + *** Failers + abc + +/(?=^)abc/ + abc + *** Failers + the abc + +/^[ab]{1,3}(ab*|b)/ + aabbbbb + +/^[ab]{1,3}?(ab*|b)/ + aabbbbb + +/^[ab]{1,3}?(ab*?|b)/ + aabbbbb + +/^[ab]{1,3}(ab*?|b)/ + aabbbbb + +/ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # optional leading comment +(?: (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # initial word +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... 
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) )* # further okay, if led by a period +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +# address +| # or +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # one word, optionally followed by.... +(?: +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] | # atom and space parts, or... +\( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) | # comments, or... 
+ +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +# quoted strings +)* +< (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # leading < +(?: @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* + +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* , (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +)* # further okay, if led by comma +: # closing colon +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* )? # optional route +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # initial word +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... 
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) )* # further okay, if led by a period +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +# address spec +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* > # trailing > +# name and address +) (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # optional trailing comment +/x + Alan Other <user\@dom.ain> + <user\@dom.ain> + user\@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + A. 
Other <user.1234\@dom.ain> (a comment) + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + A missing angle <user\@some.where + *** Failers + The quick brown fox + +/[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional leading comment +(?: +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# additional words +)* +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +# address +| # or +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +# leading word +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # "normal" atoms and or spaces +(?: +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +| +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +) # "special" comment or quoted string +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # more "normal" +)* +< +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# < +(?: +@ +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. 
+# optional trailing comments +)* +(?: , +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +)* # additional domains +: +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)? # optional route +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +\. +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# additional words +)* +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +# address spec +> # > +# name and address +) +/x + Alan Other <user\@dom.ain> + <user\@dom.ain> + user\@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + A. 
Other <user.1234\@dom.ain> (a comment) + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + A missing angle <user\@some.where + *** Failers + The quick brown fox + +/abc\0def\00pqr\000xyz\0000AB/ + abc\0def\00pqr\000xyz\0000AB + abc456 abc\0def\00pqr\000xyz\0000ABCDE + +/abc\x0def\x00pqr\x000xyz\x0000AB/ + abc\x0def\x00pqr\x000xyz\x0000AB + abc456 abc\x0def\x00pqr\x000xyz\x0000ABCDE + +/^[\000-\037]/ + \0A + \01B + \037C + +/\0*/ + \0\0\0\0 + +/A\x0{2,3}Z/ + The A\x0\x0Z + An A\0\x0\0Z + *** Failers + A\0Z + A\0\x0\0\x0Z + +/^(cow|)\1(bell)/ + cowcowbell + bell + *** Failers + cowbell + +/^\s/ + \040abc + \x0cabc + \nabc + \rabc + \tabc + *** Failers + abc + +/^a b + + c/x + abc + +/^(a|)\1*b/ + ab + aaaab + b + *** Failers + acb + +/^(a|)\1+b/ + aab + aaaab + b + *** Failers + ab + +/^(a|)\1?b/ + ab + aab + b + *** Failers + acb + +/^(a|)\1{2}b/ + aaab + b + *** Failers + ab + aab + aaaab + +/^(a|)\1{2,3}b/ + aaab + aaaab + b + *** Failers + ab + aab + aaaaab + +/ab{1,3}bc/ + abbbbc + abbbc + abbc + *** Failers + abc + abbbbbc + +/([^.]*)\.([^:]*):[T ]+(.*)/ + track1.title:TBlah blah blah + +/([^.]*)\.([^:]*):[T ]+(.*)/i + track1.title:TBlah blah blah + +/([^.]*)\.([^:]*):[t ]+(.*)/i + track1.title:TBlah blah blah + +/^[W-c]+$/ + WXY_^abc + ***Failers + wxy + +/^[W-c]+$/i + WXY_^abc + wxy_^ABC + +/^[\x3f-\x5F]+$/i + WXY_^abc + wxy_^ABC + +/^abc$/m + abc + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + +/^abc$/ + abc + *** Failers + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + +/\Aabc\Z/m + abc + abc\n + *** Failers + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + +/\A(.)*\Z/s + abc\ndef + +/\A(.)*\Z/m + *** Failers + abc\ndef + +/(?:b)|(?::+)/ + b::c + c::b + +/[-az]+/ + az- + *** Failers + b + +/[az-]+/ + za- + *** Failers + b + +/[a\-z]+/ + a-z + *** Failers + b + +/[a-z]+/ + abcdxyz + +/[\d-]+/ + 12-34 + *** Failers + aaa + +/[\d-z]+/ + 12-34z + *** Failers + aaa + +/\x5c/ + \\ + +/\x20Z/ + the Zoo + *** Failers + Zulu + +/(abc)\1/i + abcabc + ABCabc + abcABC + 
+/(main(O)?)+/ + mainmain + mainOmain + +/ab{3cd/ + ab{3cd + +/ab{3,cd/ + ab{3,cd + +/ab{3,4a}cd/ + ab{3,4a}cd + +/{4,5a}bc/ + {4,5a}bc + +/^a.b/ + a\rb + *** Failers + a\nb + +/abc$/ + abc + abc\n + *** Failers + abc\ndef + +/(abc)\123/ + abc\x53 + +/(abc)\223/ + abc\x93 + +/(abc)\323/ + abc\xd3 + +/(abc)\500/ + abc\x40 + abc\100 + +/(abc)\5000/ + abc\x400 + abc\x40\x30 + abc\1000 + abc\100\x30 + abc\100\060 + abc\100\60 + +/abc\81/ + abc\081 + abc\0\x38\x31 + +/abc\91/ + abc\091 + abc\0\x39\x31 + +/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/ + abcdefghijkllS + +/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/ + abcdefghijk\12S + +/ab\gdef/ + abgdef + +/a{0}bc/ + bc + +/(a|(bc)){0,0}?xyz/ + xyz + +/abc[\10]de/ + abc\010de + +/abc[\1]de/ + abc\1de + +/(abc)[\1]de/ + abc\1de + +/a.b(?s)/ + a\nb + +/^([^a])([^\b])([^c]*)([^d]{3,4})/ + baNOTccccd + baNOTcccd + baNOTccd + bacccd + *** Failers + anything + b\bc + baccd + +/[^a]/ + Abc + +/[^a]/i + Abc + +/[^a]+/ + AAAaAbc + +/[^a]+/i + AAAaAbc + +/[^a]+/ + bbb\nccc + +/[^k]$/ + abc + *** Failers + abk + +/[^k]{2,3}$/ + abc + kbc + kabc + *** Failers + abk + akb + akk + +/^\d{8,}\@.+[^k]$/ + 12345678\@a.b.c.d + 123456789\@x.y.z + *** Failers + 12345678\@x.y.uk + 1234567\@a.b.c.d + +/(a)\1{8,}/ + aaaaaaaaa + aaaaaaaaaa + *** Failers + aaaaaaa + +/[^a]/ + aaaabcd + aaAabcd + +/[^a]/i + aaaabcd + aaAabcd + +/[^az]/ + aaaabcd + aaAabcd + +/[^az]/i + aaaabcd + aaAabcd + 
+/\000\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\040\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175\176\177\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377/ + 
\000\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\040\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175\176\177\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377 + +/P[^*]TAIRE[^*]{1,6}?LL/ + xxxxxxxxxxxPSTAIREISLLxxxxxxxxx + +/P[^*]TAIRE[^*]{1,}?LL/ + xxxxxxxxxxxPSTAIREISLLxxxxxxxxx + +/(\.\d\d[1-9]?)\d+/ + 1.230003938 + 1.875000282 + 1.235 + +/(\.\d\d((?=0)|\d(?=\d)))/ + 1.230003938 + 1.875000282 + *** Failers + 1.235 + +/a(?)b/ + ab + +/\b(foo)\s+(\w+)/i + Food is on the foo table + +/foo(.*)bar/ + The food is under the bar in the barn. + +/foo(.*?)bar/ + The food is under the bar in the barn. 
+ +/(.*)(\d*)/ + I have 2 numbers: 53147 + +/(.*)(\d+)/ + I have 2 numbers: 53147 + +/(.*?)(\d*)/ + I have 2 numbers: 53147 + +/(.*?)(\d+)/ + I have 2 numbers: 53147 + +/(.*)(\d+)$/ + I have 2 numbers: 53147 + +/(.*?)(\d+)$/ + I have 2 numbers: 53147 + +/(.*)\b(\d+)$/ + I have 2 numbers: 53147 + +/(.*\D)(\d+)$/ + I have 2 numbers: 53147 + +/^\D*(?!123)/ + ABC123 + +/^(\D*)(?=\d)(?!123)/ + ABC445 + *** Failers + ABC123 + +/^[W-]46]/ + W46]789 + -46]789 + *** Failers + Wall + Zebra + 42 + [abcd] + ]abcd[ + +/^[W-\]46]/ + W46]789 + Wall + Zebra + Xylophone + 42 + [abcd] + ]abcd[ + \\backslash + *** Failers + -46]789 + well + +/\d\d\/\d\d\/\d\d\d\d/ + 01/01/2000 + +/word (?:[a-zA-Z0-9]+ ){0,10}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark + +/word (?:[a-zA-Z0-9]+ ){0,300}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope + +/^(a){0,0}/ + bcd + abc + aab + +/^(a){0,1}/ + bcd + abc + aab + +/^(a){0,2}/ + bcd + abc + aab + +/^(a){0,3}/ + bcd + abc + aab + aaa + +/^(a){0,}/ + bcd + abc + aab + aaa + aaaaaaaa + +/^(a){1,1}/ + bcd + abc + aab + +/^(a){1,2}/ + bcd + abc + aab + +/^(a){1,3}/ + bcd + abc + aab + aaa + +/^(a){1,}/ + bcd + abc + aab + aaa + aaaaaaaa + +/.*\.gif/ + borfle\nbib.gif\nno + +/.{0,}\.gif/ + borfle\nbib.gif\nno + +/.*\.gif/m + borfle\nbib.gif\nno + +/.*\.gif/s + borfle\nbib.gif\nno + +/.*\.gif/ms + borfle\nbib.gif\nno + +/.*$/ + borfle\nbib.gif\nno + +/.*$/m + borfle\nbib.gif\nno + +/.*$/s + borfle\nbib.gif\nno + +/.*$/ms + borfle\nbib.gif\nno + +/.*$/ + borfle\nbib.gif\nno\n + +/.*$/m + borfle\nbib.gif\nno\n + +/.*$/s + borfle\nbib.gif\nno\n + +/.*$/ms + borfle\nbib.gif\nno\n + +/(.*X|^B)/ + abcde\n1234Xyz + BarFoo + *** Failers + abcde\nBar + +/(.*X|^B)/m + abcde\n1234Xyz + BarFoo + abcde\nBar + +/(.*X|^B)/s + 
abcde\n1234Xyz + BarFoo + *** Failers + abcde\nBar + +/(.*X|^B)/ms + abcde\n1234Xyz + BarFoo + abcde\nBar + +/(?s)(.*X|^B)/ + abcde\n1234Xyz + BarFoo + *** Failers + abcde\nBar + +/(?s:.*X|^B)/ + abcde\n1234Xyz + BarFoo + *** Failers + abcde\nBar + +/ End of test input / diff --git a/ext/pcre/pcrelib/testinput2 b/ext/pcre/pcrelib/testinput2 new file mode 100644 index 0000000000..39a7560aa7 --- /dev/null +++ b/ext/pcre/pcrelib/testinput2 @@ -0,0 +1,445 @@ +/(a)b|/ + +/abc/ + abc + defabc + \Aabc + *** Failers + \Adefabc + ABC + +/^abc/ + abc + \Aabc + *** Failers + defabc + \Adefabc + +/a+bc/ + +/a*bc/ + +/a{3}bc/ + +/(abc|a+z)/ + +/^abc$/ + abc + *** Failers + def\nabc + +/ab\gdef/X + +/(?X)ab\gdef/X + +/x{5,4}/ + +/z{65536}/ + +/[abcd/ + +/[\B]/ + +/[a-\w]/ + +/[z-a]/ + +/^*/ + +/(abc/ + +/(?# abc/ + +/(?z)abc/ + +/.*b/ + +/.*?b/ + +/cat|dog|elephant/ + this sentence eventually mentions a cat + this sentences rambles on and on for a while and then reaches elephant + +/cat|dog|elephant/S + this sentence eventually mentions a cat + this sentences rambles on and on for a while and then reaches elephant + +/cat|dog|elephant/iS + this sentence eventually mentions a CAT cat + this sentences rambles on and on for a while to elephant ElePhant + +/a|[bcd]/S + +/(a|[^\dZ])/S + +/(a|b)*[\s]/S + +/(ab\2)/ + +/{4,5}abc/ + +/(a)(b)(c)\2/ + abcb + \O0abcb + \O3abcb + \O6abcb + \O9abcb + \O12abcb + +/(a)bc|(a)(b)\2/ + abc + \O0abc + \O3abc + \O6abc + aba + \O0aba + \O3aba + \O6aba + \O9aba + \O12aba + +/abc$/E + abc + *** Failers + abc\n + abc\ndef + +/(a)(b)(c)(d)(e)\6/ + +/the quick brown fox/ + the quick brown fox + this is a line with the quick brown fox + +/the quick brown fox/A + the quick brown fox + *** Failers + this is a line with the quick brown fox + +/ab(?z)cd/ + +/^abc|def/ + abcdef + abcdef\B + +/.*((abc)$|(def))/ + defabc + \Zdefabc + +/abc/P + abc + *** Failers + +/^abc|def/P + abcdef + abcdef\B + +/.*((abc)$|(def))/P + defabc + \Zdefabc + +/the quick brown fox/P 
+ the quick brown fox + *** Failers + The Quick Brown Fox + +/the quick brown fox/Pi + the quick brown fox + The Quick Brown Fox + +/abc.def/P + *** Failers + abc\ndef + +/abc$/P + abc + abc\n + +/(abc)\2/P + +/(abc\1)/P + abc + +/)/ + +/a[]b/ + +/[^aeiou ]{3,}/ + co-processors, and for + +/<.*>/ + abc<def>ghi<klm>nop + +/<.*?>/ + abc<def>ghi<klm>nop + +/<.*>/U + abc<def>ghi<klm>nop + +/<.*>(?U)/ + abc<def>ghi<klm>nop + +/<.*?>/U + abc<def>ghi<klm>nop + +/={3,}/U + abc========def + +/(?U)={3,}?/ + abc========def + +/(?<!bar|cattle)foo/ + foo + catfoo + *** Failers + the barfoo + and cattlefoo + +/(?<=a+)b/ + +/(?<=aaa|b{0,3})b/ + +/(?<!(foo)a\1)bar/ + +/(?i)abc/ + +/(a|(?m)a)/ + +/(?i)^1234/ + +/(^b|(?i)^d)/ + +/(?s).*/ + +/[abcd]/S + +/(?i)[abcd]/S + +/(?m)[xy]|(b|c)/S + +/(^a|^b)/m + +/(?i)(^a|^b)/m + +/(a)(?(1)a|b|c)/ + +/(?(?=a)a|b|c)/ + +/(?(1a)/ + +/(?(?i))/ + +/(?(abc))/ + +/(?(?<ab))/ + +/((?s)blah)\s+\1/ + +/((?i)blah)\s+\1/ + +/((?i)b)/DS + +/(a*b|(?i:c*(?-i)d))/S + +/a$/ + a + a\n + *** Failers + \Za + \Za\n + +/a$/m + a + a\n + \Za\n + *** Failers + \Za + +/\Aabc/m + +/^abc/m + +/^((a+)(?U)([ab]+)(?-U)([bc]+)(\w*))/ + aaaaabbbbbcccccdef + +/(?<=foo)[ab]/S + +/(?<!foo)(alpha|omega)/S + +/(?!alphabet)[ab]/S + +/(?<=foo\n)^bar/m + +/(?>^abc)/m + abc + def\nabc + *** Failers + defabc + +/(?<=ab(c+)d)ef/ + +/(?<=ab(?<=c+)d)ef/ + +/(?<=ab(c|de)f)g/ + +/The next three are in testinput2 because they have variable length branches/ + +/(?<=bullock|donkey)-cart/ + the bullock-cart + a donkey-cart race + *** Failers + cart + horse-and-cart + +/(?<=ab(?i)x|y|z)/ + +/(?>.*)(?<=(abcd)|(xyz))/ + alphabetabcd + endingxyz + +/(?<=ab(?i)x(?-i)y|(?i)z|b)ZZ/ + abxyZZ + abXyZZ + ZZZ + zZZ + bZZ + BZZ + *** Failers + ZZ + abXYZZ + zzz + bzz + +/(?<!(foo)a)bar/ + bar + foobbar + *** Failers + fooabar + +/This one is here because Perl 5.005_02 doesn't fail it/ + +/^(a)?(?(1)a|b)+$/ + *** Failers + a + +/This one is here because I think Perl 5.005_02 gets the setting of $1 
wrong/ + +/^(a\1?){4}$/ + aaaaaa + +/These are syntax tests from Perl 5.005/ + +/a[b-a]/ + +/a[]b/ + +/a[/ + +/*a/ + +/(*)b/ + +/abc)/ + +/(abc/ + +/a**/ + +/)(/ + +/\1/ + +/\2/ + +/(a)|\2/ + +/a[b-a]/i + +/a[]b/i + +/a[/i + +/*a/i + +/(*)b/i + +/abc)/i + +/(abc/i + +/a**/i + +/)(/i + +/:(?:/ + +/(?<%)b/ + +/a(?{)b/ + +/a(?{{})b/ + +/a(?{}})b/ + +/a(?{"{"})b/ + +/a(?{"{"}})b/ + +/(?(1?)a|b)/ + +/(?(1)a|b|c)/ + +/[a[:xyz:/ + +/(?<=x+)y/ + +/a{37,17}/ + +/abc/\ + +/abc/\P + +/abc/\i + +/(a)bc(d)/ + abcd + abcd\C2 + abcd\C5 + +/(.{20})/ + abcdefghijklmnopqrstuvwxyz + abcdefghijklmnopqrstuvwxyz\C1 + abcdefghijklmnopqrstuvwxyz\G1 + +/(.{15})/ + abcdefghijklmnopqrstuvwxyz + abcdefghijklmnopqrstuvwxyz\C1\G1 + +/(.{16})/ + abcdefghijklmnopqrstuvwxyz + abcdefghijklmnopqrstuvwxyz\C1\G1\L + +/^(a|(bc))de(f)/ + adef\G1\G2\G3\G4\L + bcdef\G1\G2\G3\G4\L + adefghijk\C0 + +/^abc\00def/ + abc\00def\L\C0 + +/word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ +)((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ +)?)?)?)?)?)?)?)?)?otherword/M + +/.*X/D + +/.*X/Ds + +/(.*X|^B)/D + +/(.*X|^B)/Ds + +/(?s)(.*X|^B)/D + +/(?s:.*X|^B)/D + +/ End of test input / diff --git a/ext/pcre/pcrelib/testinput3 b/ext/pcre/pcrelib/testinput3 new file mode 100644 index 0000000000..2e686b3d5d --- /dev/null +++ b/ext/pcre/pcrelib/testinput3 @@ -0,0 +1,1641 @@ +/(?<!bar)foo/ + foo + catfood + arfootle + rfoosh + *** Failers + barfoo + towbarfoo + +/\w{3}(?<!bar)foo/ + catfood + *** Failers + foo + barfoo + towbarfoo + +/(?<=(foo)a)bar/ + fooabar + *** Failers + bar + foobbar + +/\Aabc\z/m + abc + *** Failers + abc\n + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + +"(?>.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ + +"(?>.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + +/(?>(\.\d\d[1-9]?))\d+/ + 1.230003938 + 1.875000282 + *** Failers + 1.235 + +/^((?>\w+)|(?>\s+))*$/ + now is the time for all good men 
to come to the aid of the party + *** Failers + this is not a line with only words and spaces! + +/(\d+)(\w)/ + 12345a + 12345+ + +/((?>\d+))(\w)/ + 12345a + *** Failers + 12345+ + +/(?>a+)b/ + aaab + +/((?>a+)b)/ + aaab + +/(?>(a+))b/ + aaab + +/(?>b)+/ + aaabbbccc + +/(?>a+|b+|c+)*c/ + aaabbbbccccd + +/((?>[^()]+)|\([^()]*\))+/ + ((abc(ade)ufh()()x + +/\(((?>[^()]+)|\([^()]+\))+\)/ + (abc) + (abc(def)xyz) + *** Failers + ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + +/a(?-i)b/i + ab + *** Failers + Ab + aB + AB + +/(a (?x)b c)d e/ + a bcd e + *** Failers + a b cd e + abcd e + a bcde + +/(a b(?x)c d (?-x)e f)/ + a bcde f + *** Failers + abcdef + +/(a(?i)b)c/ + abc + aBc + *** Failers + abC + aBC + Abc + ABc + ABC + AbC + +/a(?i:b)c/ + abc + aBc + *** Failers + ABC + abC + aBC + +/a(?i:b)*c/ + aBc + aBBc + *** Failers + aBC + aBBC + +/a(?=b(?i)c)\w\wd/ + abcd + abCd + *** Failers + aBCd + abcD + +/(?s-i:more.*than).*million/i + more than million + more than MILLION + more \n than Million + *** Failers + MORE THAN MILLION + more \n than \n million + +/(?:(?s-i)more.*than).*million/i + more than million + more than MILLION + more \n than Million + *** Failers + MORE THAN MILLION + more \n than \n million + +/(?>a(?i)b+)+c/ + abc + aBbc + aBBc + *** Failers + Abc + abAb + abbC + +/(?=a(?i)b)\w\wc/ + abc + aBc + *** Failers + Ab + abC + aBC + +/(?<=a(?i)b)(\w\w)c/ + abxxc + aBxxc + *** Failers + Abxxc + ABxxc + abxxC + +/(?:(a)|b)(?(1)A|B)/ + aA + bB + *** Failers + aB + bA + +/^(a)?(?(1)a|b)+$/ + aa + b + bb + *** Failers + ab + +/^(?(?=abc)\w{3}:|\d\d)$/ + abc: + 12 + *** Failers + 123 + xyz + +/^(?(?!abc)\d\d|\w{3}:)$/ + abc: + 12 + *** Failers + 123 + xyz + +/(?(?<=foo)bar|cat)/ + foobar + cat + fcat + focat + *** Failers + foocat + +/(?(?<!foo)cat|bar)/ + foobar + cat + fcat + focat + *** Failers + foocat + +/( \( )? [^()]+ (?(1) \) |) /x + abcd + (abcd) + the quick (abcd) fox + (abcd + +/( \( )? 
[^()]+ (?(1) \) ) /x + abcd + (abcd) + the quick (abcd) fox + (abcd + +/^(?(2)a|(1)(2))+$/ + 12 + 12a + 12aa + *** Failers + 1234 + +/((?i)blah)\s+\1/ + blah blah + BLAH BLAH + Blah Blah + blaH blaH + *** Failers + blah BLAH + Blah blah + blaH blah + +/((?i)blah)\s+(?i:\1)/ + blah blah + BLAH BLAH + Blah Blah + blaH blaH + blah BLAH + Blah blah + blaH blah + +/(?>a*)*/ + a + aa + aaaa + +/(abc|)+/ + abc + abcabc + abcabcabc + xyz + +/([a]*)*/ + a + aaaaa + +/([ab]*)*/ + a + b + ababab + aaaabcde + bbbb + +/([^a]*)*/ + b + bbbb + aaa + +/([^ab]*)*/ + cccc + abab + +/([a]*?)*/ + a + aaaa + +/([ab]*?)*/ + a + b + abab + baba + +/([^a]*?)*/ + b + bbbb + aaa + +/([^ab]*?)*/ + c + cccc + baba + +/(?>a*)*/ + a + aaabcde + +/((?>a*))*/ + aaaaa + aabbaa + +/((?>a*?))*/ + aaaaa + aabbaa + +/(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /x + 12-sep-98 + 12-09-98 + *** Failers + sep-12-98 + +/(?<=(foo))bar\1/ + foobarfoo + foobarfootling + *** Failers + foobar + barfoo + +/(?i:saturday|sunday)/ + saturday + sunday + Saturday + Sunday + SATURDAY + SUNDAY + SunDay + +/(a(?i)bc|BB)x/ + abcx + aBCx + bbx + BBx + *** Failers + abcX + aBCX + bbX + BBX + +/^([ab](?i)[cd]|[ef])/ + ac + aC + bD + elephant + Europe + frog + France + *** Failers + Africa + +/^(ab|a(?i)[b-c](?m-i)d|x(?i)y|z)/ + ab + aBd + xy + xY + zebra + Zambesi + *** Failers + aCD + XY + +/(?<=foo\n)^bar/m + foo\nbar + *** Failers + bar + baz\nbar + +/(?<=(?<!foo)bar)baz/ + barbaz + barbarbaz + koobarbaz + *** Failers + baz + foobarbaz + +/The case of aaaaaa is missed out below because I think Perl 5.005_02 gets/ +/it wrong; it sets $1 to aaa rather than aa. 
Compare the following test,/ +/where it does set $1 to aa when matching aaaaaa./ + +/^(a\1?){4}$/ + a + aa + aaa + aaaa + aaaaa + aaaaaaa + aaaaaaaa + aaaaaaaaa + aaaaaaaaaa + aaaaaaaaaaa + aaaaaaaaaaaa + aaaaaaaaaaaaa + aaaaaaaaaaaaaa + aaaaaaaaaaaaaaa + aaaaaaaaaaaaaaaa + +/^(a\1?)(a\1?)(a\2?)(a\3?)$/ + a + aa + aaa + aaaa + aaaaa + aaaaaa + aaaaaaa + aaaaaaaa + aaaaaaaaa + aaaaaaaaaa + aaaaaaaaaaa + aaaaaaaaaaaa + aaaaaaaaaaaaa + aaaaaaaaaaaaaa + aaaaaaaaaaaaaaa + aaaaaaaaaaaaaaaa + +/The following tests are taken from the Perl 5.005 test suite; some of them/ +/are compatible with 5.004, but I'd rather not have to sort them out./ + +/abc/ + abc + xabcy + ababc + *** Failers + xbc + axc + abx + +/ab*c/ + abc + +/ab*bc/ + abc + abbc + abbbbc + +/.{1}/ + abbbbc + +/.{3,4}/ + abbbbc + +/ab{0,}bc/ + abbbbc + +/ab+bc/ + abbc + *** Failers + abc + abq + +/ab{1,}bc/ + +/ab+bc/ + abbbbc + +/ab{1,}bc/ + abbbbc + +/ab{1,3}bc/ + abbbbc + +/ab{3,4}bc/ + abbbbc + +/ab{4,5}bc/ + *** Failers + abq + abbbbc + +/ab?bc/ + abbc + abc + +/ab{0,1}bc/ + abc + +/ab?bc/ + +/ab?c/ + abc + +/ab{0,1}c/ + abc + +/^abc$/ + abc + *** Failers + abbbbc + abcc + +/^abc/ + abcc + +/^abc$/ + +/abc$/ + aabc + *** Failers + aabc + aabcd + +/^/ + abc + +/$/ + abc + +/a.c/ + abc + axc + +/a.*c/ + axyzc + +/a[bc]d/ + abd + *** Failers + axyzd + abc + +/a[b-d]e/ + ace + +/a[b-d]/ + aac + +/a[-b]/ + a- + +/a[b-]/ + a- + +/a]/ + a] + +/a[]]b/ + a]b + +/a[^bc]d/ + aed + *** Failers + abd + abd + +/a[^-b]c/ + adc + +/a[^]b]c/ + adc + *** Failers + a-c + a]c + +/\ba\b/ + a- + -a + -a- + +/\by\b/ + *** Failers + xy + yz + xyz + +/\Ba\B/ + *** Failers + a- + -a + -a- + +/\By\b/ + xy + +/\by\B/ + yz + +/\By\B/ + xyz + +/\w/ + a + +/\W/ + - + *** Failers + - + a + +/a\sb/ + a b + +/a\Sb/ + a-b + *** Failers + a-b + a b + +/\d/ + 1 + +/\D/ + - + *** Failers + - + 1 + +/[\w]/ + a + +/[\W]/ + - + *** Failers + - + a + +/a[\s]b/ + a b + +/a[\S]b/ + a-b + *** Failers + a-b + a b + +/[\d]/ + 1 + +/[\D]/ + - + *** 
Failers + - + 1 + +/ab|cd/ + abc + abcd + +/()ef/ + def + +/$b/ + +/a\(b/ + a(b + +/a\(*b/ + ab + a((b + +/a\\b/ + a\b + +/((a))/ + abc + +/(a)b(c)/ + abc + +/a+b+c/ + aabbabc + +/a{1,}b{1,}c/ + aabbabc + +/a.+?c/ + abcabc + +/(a+|b)*/ + ab + +/(a+|b){0,}/ + ab + +/(a+|b)+/ + ab + +/(a+|b){1,}/ + ab + +/(a+|b)?/ + ab + +/(a+|b){0,1}/ + ab + +/[^ab]*/ + cde + +/abc/ + *** Failers + b + + +/a*/ + + +/([abc])*d/ + abbbcd + +/([abc])*bcd/ + abcd + +/a|b|c|d|e/ + e + +/(a|b|c|d|e)f/ + ef + +/abcd*efg/ + abcdefg + +/ab*/ + xabyabbbz + xayabbbz + +/(ab|cd)e/ + abcde + +/[abhgefdc]ij/ + hij + +/^(ab|cd)e/ + +/(abc|)ef/ + abcdef + +/(a|b)c*d/ + abcd + +/(ab|ab*)bc/ + abc + +/a([bc]*)c*/ + abc + +/a([bc]*)(c*d)/ + abcd + +/a([bc]+)(c*d)/ + abcd + +/a([bc]*)(c+d)/ + abcd + +/a[bcd]*dcdcde/ + adcdcde + +/a[bcd]+dcdcde/ + *** Failers + abcde + adcdcde + +/(ab|a)b*c/ + abc + +/((a)(b)c)(d)/ + abcd + +/[a-zA-Z_][a-zA-Z0-9_]*/ + alpha + +/^a(bc+|b[eh])g|.h$/ + abh + +/(bc+d$|ef*g.|h?i(j|k))/ + effgz + ij + reffgz + *** Failers + effg + bcdd + +/((((((((((a))))))))))/ + a + +/((((((((((a))))))))))\10/ + aa + +/(((((((((a)))))))))/ + a + +/multiple words of text/ + *** Failers + aa + uh-uh + +/multiple words/ + multiple words, yeah + +/(.*)c(.*)/ + abcde + +/\((.*), (.*)\)/ + (a, b) + +/[k]/ + +/abcd/ + abcd + +/a(bc)d/ + abcd + +/a[-]?c/ + ac + +/(abc)\1/ + abcabc + +/([a-c]*)\1/ + abcabc + +/(a)|\1/ + a + *** Failers + ab + x + +/(([a-c])b*?\2)*/ + ababbbcbc + +/(([a-c])b*?\2){3}/ + ababbbcbc + +/((\3|b)\2(a)x)+/ + aaaxabaxbaaxbbax + +/((\3|b)\2(a)){2,}/ + bbaababbabaaaaabbaaaabba + +/abc/i + ABC + XABCY + ABABC + *** Failers + aaxabxbaxbbx + XBC + AXC + ABX + +/ab*c/i + ABC + +/ab*bc/i + ABC + ABBC + +/ab*?bc/i + ABBBBC + +/ab{0,}?bc/i + ABBBBC + +/ab+?bc/i + ABBC + +/ab+bc/i + *** Failers + ABC + ABQ + +/ab{1,}bc/i + +/ab+bc/i + ABBBBC + +/ab{1,}?bc/i + ABBBBC + +/ab{1,3}?bc/i + ABBBBC + +/ab{3,4}?bc/i + ABBBBC + +/ab{4,5}?bc/i + *** Failers + ABQ + ABBBBC + +/ab??bc/i + ABBC + 
ABC + +/ab{0,1}?bc/i + ABC + +/ab??bc/i + +/ab??c/i + ABC + +/ab{0,1}?c/i + ABC + +/^abc$/i + ABC + *** Failers + ABBBBC + ABCC + +/^abc/i + ABCC + +/^abc$/i + +/abc$/i + AABC + +/^/i + ABC + +/$/i + ABC + +/a.c/i + ABC + AXC + +/a.*?c/i + AXYZC + +/a.*c/i + *** Failers + AABC + AXYZD + +/a[bc]d/i + ABD + +/a[b-d]e/i + ACE + *** Failers + ABC + ABD + +/a[b-d]/i + AAC + +/a[-b]/i + A- + +/a[b-]/i + A- + +/a]/i + A] + +/a[]]b/i + A]B + +/a[^bc]d/i + AED + +/a[^-b]c/i + ADC + *** Failers + ABD + A-C + +/a[^]b]c/i + ADC + +/ab|cd/i + ABC + ABCD + +/()ef/i + DEF + +/$b/i + *** Failers + A]C + B + +/a\(b/i + A(B + +/a\(*b/i + AB + A((B + +/a\\b/i + A\B + +/((a))/i + ABC + +/(a)b(c)/i + ABC + +/a+b+c/i + AABBABC + +/a{1,}b{1,}c/i + AABBABC + +/a.+?c/i + ABCABC + +/a.*?c/i + ABCABC + +/a.{0,5}?c/i + ABCABC + +/(a+|b)*/i + AB + +/(a+|b){0,}/i + AB + +/(a+|b)+/i + AB + +/(a+|b){1,}/i + AB + +/(a+|b)?/i + AB + +/(a+|b){0,1}/i + AB + +/(a+|b){0,1}?/i + AB + +/[^ab]*/i + CDE + +/abc/i + +/a*/i + + +/([abc])*d/i + ABBBCD + +/([abc])*bcd/i + ABCD + +/a|b|c|d|e/i + E + +/(a|b|c|d|e)f/i + EF + +/abcd*efg/i + ABCDEFG + +/ab*/i + XABYABBBZ + XAYABBBZ + +/(ab|cd)e/i + ABCDE + +/[abhgefdc]ij/i + HIJ + +/^(ab|cd)e/i + ABCDE + +/(abc|)ef/i + ABCDEF + +/(a|b)c*d/i + ABCD + +/(ab|ab*)bc/i + ABC + +/a([bc]*)c*/i + ABC + +/a([bc]*)(c*d)/i + ABCD + +/a([bc]+)(c*d)/i + ABCD + +/a([bc]*)(c+d)/i + ABCD + +/a[bcd]*dcdcde/i + ADCDCDE + +/a[bcd]+dcdcde/i + +/(ab|a)b*c/i + ABC + +/((a)(b)c)(d)/i + ABCD + +/[a-zA-Z_][a-zA-Z0-9_]*/i + ALPHA + +/^a(bc+|b[eh])g|.h$/i + ABH + +/(bc+d$|ef*g.|h?i(j|k))/i + EFFGZ + IJ + REFFGZ + *** Failers + ADCDCDE + EFFG + BCDD + +/((((((((((a))))))))))/i + A + +/((((((((((a))))))))))\10/i + AA + +/(((((((((a)))))))))/i + A + +/(?:(?:(?:(?:(?:(?:(?:(?:(?:(a))))))))))/i + A + +/(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))/i + C + +/multiple words of text/i + *** Failers + AA + UH-UH + +/multiple words/i + MULTIPLE WORDS, YEAH + +/(.*)c(.*)/i + ABCDE + +/\((.*), (.*)\)/i + 
(A, B) + +/[k]/i + +/abcd/i + ABCD + +/a(bc)d/i + ABCD + +/a[-]?c/i + AC + +/(abc)\1/i + ABCABC + +/([a-c]*)\1/i + ABCABC + +/a(?!b)./ + abad + +/a(?=d)./ + abad + +/a(?=c|d)./ + abad + +/a(?:b|c|d)(.)/ + ace + +/a(?:b|c|d)*(.)/ + ace + +/a(?:b|c|d)+?(.)/ + ace + acdbcdbe + +/a(?:b|c|d)+(.)/ + acdbcdbe + +/a(?:b|c|d){2}(.)/ + acdbcdbe + +/a(?:b|c|d){4,5}(.)/ + acdbcdbe + +/a(?:b|c|d){4,5}?(.)/ + acdbcdbe + +/((foo)|(bar))*/ + foobar + +/a(?:b|c|d){6,7}(.)/ + acdbcdbe + +/a(?:b|c|d){6,7}?(.)/ + acdbcdbe + +/a(?:b|c|d){5,6}(.)/ + acdbcdbe + +/a(?:b|c|d){5,6}?(.)/ + acdbcdbe + +/a(?:b|c|d){5,7}(.)/ + acdbcdbe + +/a(?:b|c|d){5,7}?(.)/ + acdbcdbe + +/a(?:b|(c|e){1,2}?|d)+?(.)/ + ace + +/^(.+)?B/ + AB + +/^([^a-z])|(\^)$/ + . + +/^[<>]&/ + <&OUT + +/^(a\1?){4}$/ + aaaaaaaaaa + *** Failers + AB + aaaaaaaaa + aaaaaaaaaaa + +/^(a(?(1)\1)){4}$/ + aaaaaaaaaa + *** Failers + aaaaaaaaa + aaaaaaaaaaa + +/(?:(f)(o)(o)|(b)(a)(r))*/ + foobar + +/(?<=a)b/ + ab + *** Failers + cb + b + +/(?<!c)b/ + ab + b + b + +/(?:..)*a/ + aba + +/(?:..)*?a/ + aba + +/^(?:b|a(?=(.)))*\1/ + abc + +/^(){3,5}/ + abc + +/^(a+)*ax/ + aax + +/^((a|b)+)*ax/ + aax + +/^((a|bc)+)*ax/ + aax + +/(a|x)*ab/ + cab + +/(a)*ab/ + cab + +/(?:(?i)a)b/ + ab + +/((?i)a)b/ + ab + +/(?:(?i)a)b/ + Ab + +/((?i)a)b/ + Ab + +/(?:(?i)a)b/ + *** Failers + cb + aB + +/((?i)a)b/ + +/(?i:a)b/ + ab + +/((?i:a))b/ + ab + +/(?i:a)b/ + Ab + +/((?i:a))b/ + Ab + +/(?i:a)b/ + *** Failers + aB + aB + +/((?i:a))b/ + +/(?:(?-i)a)b/i + ab + +/((?-i)a)b/i + ab + +/(?:(?-i)a)b/i + aB + +/((?-i)a)b/i + aB + +/(?:(?-i)a)b/i + *** Failers + aB + Ab + +/((?-i)a)b/i + +/(?:(?-i)a)b/i + aB + +/((?-i)a)b/i + aB + +/(?:(?-i)a)b/i + *** Failers + Ab + AB + +/((?-i)a)b/i + +/(?-i:a)b/i + ab + +/((?-i:a))b/i + ab + +/(?-i:a)b/i + aB + +/((?-i:a))b/i + aB + +/(?-i:a)b/i + *** Failers + AB + Ab + +/((?-i:a))b/i + +/(?-i:a)b/i + aB + +/((?-i:a))b/i + aB + +/(?-i:a)b/i + *** Failers + Ab + AB + +/((?-i:a))b/i + +/((?-i:a.))b/i + *** Failers + AB + a\nB + 
+/((?s-i:a.))b/i + a\nB + +/(?:c|d)(?:)(?:a(?:)(?:b)(?:b(?:))(?:b(?:)(?:b)))/ + cabbbb + +/(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))/ + caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb + +/(ab)\d\1/i + Ab4ab + ab4Ab + +/foo\w*\d{4}baz/ + foobar1234baz + +/x(~~)*(?:(?:F)?)?/ + x~~ + +/^a(?#xxx){3}c/ + aaac + +/^a (?#xxx) (?#yyy) {3}c/x + aaac + +/(?<![cd])b/ + *** Failers + B\nB + dbcb + +/(?<![cd])[ab]/ + dbaacb + +/(?<!(c|d))b/ + +/(?<!(c|d))[ab]/ + dbaacb + +/(?<!cd)[ab]/ + cdaccb + +/^(?:a?b?)*$/ + *** Failers + dbcb + a-- + +/((?s)^a(.))((?m)^b$)/ + a\nb\nc\n + +/((?m)^b$)/ + a\nb\nc\n + +/(?m)^b/ + a\nb\n + +/(?m)^(b)/ + a\nb\n + +/((?m)^b)/ + a\nb\n + +/\n((?m)^b)/ + a\nb\n + +/((?s).)c(?!.)/ + a\nb\nc\n + a\nb\nc\n + +/((?s)b.)c(?!.)/ + a\nb\nc\n + a\nb\nc\n + +/^b/ + +/()^b/ + *** Failers + a\nb\nc\n + a\nb\nc\n + +/((?m)^b)/ + a\nb\nc\n + +/(?(1)a|b)/ + +/(?(1)b|a)/ + a + +/(x)?(?(1)a|b)/ + *** Failers + a + a + +/(x)?(?(1)b|a)/ + a + +/()?(?(1)b|a)/ + a + +/()(?(1)b|a)/ + +/()?(?(1)a|b)/ + a + +/^(\()?blah(?(1)(\)))$/ + (blah) + blah + *** Failers + a + blah) + (blah + +/^(\(+)?blah(?(1)(\)))$/ + (blah) + blah + *** Failers + blah) + (blah + +/(?(?!a)a|b)/ + +/(?(?!a)b|a)/ + a + +/(?(?=a)b|a)/ + *** Failers + a + a + +/(?(?=a)a|b)/ + a + +/(?=(a+?))(\1ab)/ + aaab + +/^(?=(a+?))\1ab/ + +/(\w+:)+/ + one: + +/$(?<=^(a))/ + a + +/(?=(a+?))(\1ab)/ + aaab + +/^(?=(a+?))\1ab/ + *** Failers + aaab + aaab + +/([\w:]+::)?(\w+)$/ + abcd + xy:z:::abcd + +/^[^bcd]*(c+)/ + aexycd + +/(a*)b+/ + caab + +/([\w:]+::)?(\w+)$/ + abcd + xy:z:::abcd + *** Failers + abcd: + abcd: + +/^[^bcd]*(c+)/ + aexycd + +/(>a+)ab/ + +/(?>a+)b/ + aaab + +/([[:]+)/ + a:[b]: + +/([[=]+)/ + a=[b]= + +/([[.]+)/ + a.[b]. 
+ +/((?>a+)b)/ + aaab + +/(?>(a+))b/ + aaab + +/((?>[^()]+)|\([^()]*\))+/ + ((abc(ade)ufh()()x + +/a\Z/ + *** Failers + aaab + a\nb\n + +/b\Z/ + a\nb\n + +/b\z/ + +/b\Z/ + a\nb + +/b\z/ + a\nb + *** Failers + +/^(?>(?(1)\.|())[^\W_](?>[a-z0-9-]*[^\W_])?)+$/ + a + abc + a-b + 0-9 + a.b + 5.6.7 + the.quick.brown.fox + a100.b200.300c + 12-ab.1245 + ***Failers + \ + .a + -a + a- + a. + a_b + a.- + a.. + ab..bc + the.quick.brown.fox- + the.quick.brown.fox. + the.quick.brown.fox_ + the.quick.brown.fox+ + +/(?>.*)(?<=(abcd|wxyz))/ + alphabetabcd + endingwxyz + *** Failers + a rather long string that doesn't end with one of them + +/word (?>(?:(?!otherword)[a-zA-Z0-9]+ ){0,30})otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark + +/word (?>[a-zA-Z0-9]+ ){0,30}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope + +/ End of test input / diff --git a/ext/pcre/pcrelib/testinput4 b/ext/pcre/pcrelib/testinput4 new file mode 100644 index 0000000000..c23b52aceb --- /dev/null +++ b/ext/pcre/pcrelib/testinput4 @@ -0,0 +1,64 @@ +/^[\w]+/ + *** Failers + cole + +/^[\w]+/Lfr + cole + +/^[\w]+/ + *** Failers + cole + +/^[\W]+/ + cole + +/^[\W]+/Lfr + *** Failers + cole + +/[\b]/ + \b + *** Failers + a + +/[\b]/Lfr + \b + *** Failers + a + +/^\w+/ + *** Failers + cole + +/^\w+/Lfr + cole + +/(.+)\b(.+)/ + cole + +/(.+)\b(.+)/Lfr + *** Failers + cole + +/cole/i + cole + *** Failers + cole + +/cole/iLfr + cole + cole + +/\w/IS + +/\w/ISLfr + +/^[\xc8-\xc9]/iLfr + cole + cole + +/^[\xc8-\xc9]/Lfr + cole + *** Failers + cole + diff --git a/ext/pcre/pcrelib/testoutput1 b/ext/pcre/pcrelib/testoutput1 new file mode 100644 index 0000000000..44e6d5c1a9 --- /dev/null +++ b/ext/pcre/pcrelib/testoutput1 @@ -0,0 +1,2772 @@ +PCRE version 2.05 21-Apr-1999 + +/the quick brown 
fox/ + the quick brown fox + 0: the quick brown fox + The quick brown FOX +No match + What do you know about the quick brown fox? + 0: the quick brown fox + What do you know about THE QUICK BROWN FOX? +No match + +/The quick brown fox/i + the quick brown fox + 0: the quick brown fox + The quick brown FOX + 0: The quick brown FOX + What do you know about the quick brown fox? + 0: the quick brown fox + What do you know about THE QUICK BROWN FOX? + 0: THE QUICK BROWN FOX + +/abcd\t\n\r\f\a\e\071\x3b\$\\\?caxyz/ + abcd\t\n\r\f\a\e9;\$\\?caxyz + 0: abcd\x09\x0a\x0d\x0c\x07\x1b9;$\?caxyz + +/a*abc?xyz+pqr{3}ab{2,}xy{4,5}pq{0,6}AB{0,}zz/ + abxyzpqrrrabbxyyyypqAzz + 0: abxyzpqrrrabbxyyyypqAzz + abxyzpqrrrabbxyyyypqAzz + 0: abxyzpqrrrabbxyyyypqAzz + aabxyzpqrrrabbxyyyypqAzz + 0: aabxyzpqrrrabbxyyyypqAzz + aaabxyzpqrrrabbxyyyypqAzz + 0: aaabxyzpqrrrabbxyyyypqAzz + aaaabxyzpqrrrabbxyyyypqAzz + 0: aaaabxyzpqrrrabbxyyyypqAzz + abcxyzpqrrrabbxyyyypqAzz + 0: abcxyzpqrrrabbxyyyypqAzz + aabcxyzpqrrrabbxyyyypqAzz + 0: aabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypAzz + 0: aaabcxyzpqrrrabbxyyyypAzz + aaabcxyzpqrrrabbxyyyypqAzz + 0: aaabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqAzz + aaabcxyzpqrrrabbxyyyypqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqqqqAzz + aaaabcxyzpqrrrabbxyyyypqAzz + 0: aaaabcxyzpqrrrabbxyyyypqAzz + abxyzzpqrrrabbxyyyypqAzz + 0: abxyzzpqrrrabbxyyyypqAzz + aabxyzzzpqrrrabbxyyyypqAzz + 0: aabxyzzzpqrrrabbxyyyypqAzz + aaabxyzzzzpqrrrabbxyyyypqAzz + 0: aaabxyzzzzpqrrrabbxyyyypqAzz + aaaabxyzzzzpqrrrabbxyyyypqAzz + 0: aaaabxyzzzzpqrrrabbxyyyypqAzz + abcxyzzpqrrrabbxyyyypqAzz + 0: abcxyzzpqrrrabbxyyyypqAzz + aabcxyzzzpqrrrabbxyyyypqAzz + 0: aabcxyzzzpqrrrabbxyyyypqAzz + aaabcxyzzzzpqrrrabbxyyyypqAzz + 0: aaabcxyzzzzpqrrrabbxyyyypqAzz 
+ aaaabcxyzzzzpqrrrabbxyyyypqAzz + 0: aaaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyypqAzz + 0: aaaabcxyzzzzpqrrrabbbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyyypqAzz + 0: aaaabcxyzzzzpqrrrabbbxyyyyypqAzz + aaabcxyzpqrrrabbxyyyypABzz + 0: aaabcxyzpqrrrabbxyyyypABzz + aaabcxyzpqrrrabbxyyyypABBzz + 0: aaabcxyzpqrrrabbxyyyypABBzz + >>>aaabxyzpqrrrabbxyyyypqAzz + 0: aaabxyzpqrrrabbxyyyypqAzz + >aaaabxyzpqrrrabbxyyyypqAzz + 0: aaaabxyzpqrrrabbxyyyypqAzz + >>>>abcxyzpqrrrabbxyyyypqAzz + 0: abcxyzpqrrrabbxyyyypqAzz + *** Failers +No match + abxyzpqrrabbxyyyypqAzz +No match + abxyzpqrrrrabbxyyyypqAzz +No match + abxyzpqrrrabxyyyypqAzz +No match + aaaabcxyzzzzpqrrrabbbxyyyyyypqAzz +No match + aaaabcxyzzzzpqrrrabbbxyyypqAzz +No match + aaabcxyzpqrrrabbxyyyypqqqqqqqAzz +No match + +/^(abc){1,2}zz/ + abczz + 0: abczz + 1: abc + abcabczz + 0: abcabczz + 1: abc + *** Failers +No match + zz +No match + abcabcabczz +No match + >>abczz +No match + +/^(b+?|a){1,2}?c/ + bc + 0: bc + 1: b + bbc + 0: bbc + 1: b + bbbc + 0: bbbc + 1: bb + bac + 0: bac + 1: a + bbac + 0: bbac + 1: a + aac + 0: aac + 1: a + abbbbbbbbbbbc + 0: abbbbbbbbbbbc + 1: bbbbbbbbbbb + bbbbbbbbbbbac + 0: bbbbbbbbbbbac + 1: a + *** Failers +No match + aaac +No match + abbbbbbbbbbbac +No match + +/^(b+|a){1,2}c/ + bc + 0: bc + 1: b + bbc + 0: bbc + 1: bb + bbbc + 0: bbbc + 1: bbb + bac + 0: bac + 1: a + bbac + 0: bbac + 1: a + aac + 0: aac + 1: a + abbbbbbbbbbbc + 0: abbbbbbbbbbbc + 1: bbbbbbbbbbb + bbbbbbbbbbbac + 0: bbbbbbbbbbbac + 1: a + *** Failers +No match + aaac +No match + abbbbbbbbbbbac +No match + +/^(b+|a){1,2}?bc/ + bbc + 0: bbc + 1: b + +/^(b*|ba){1,2}?bc/ + babc + 0: babc + 1: ba + bbabc + 0: bbabc + 1: ba + bababc + 0: bababc + 1: ba + *** Failers +No match + bababbc +No match + babababc +No match + +/^(ba|b*){1,2}?bc/ + babc + 0: babc + 1: ba + bbabc + 0: bbabc + 1: ba + bababc + 0: bababc + 1: ba + *** Failers +No match + bababbc +No match + babababc +No match + +/^\ca\cA\c[\c{\c:/ + 
\x01\x01\e;z + 0: \x01\x01\x1b;z + +/^[ab\]cde]/ + athing + 0: a + bthing + 0: b + ]thing + 0: ] + cthing + 0: c + dthing + 0: d + ething + 0: e + *** Failers +No match + fthing +No match + [thing +No match + \\thing +No match + +/^[]cde]/ + ]thing + 0: ] + cthing + 0: c + dthing + 0: d + ething + 0: e + *** Failers +No match + athing +No match + fthing +No match + +/^[^ab\]cde]/ + fthing + 0: f + [thing + 0: [ + \\thing + 0: \ + *** Failers + 0: * + athing +No match + bthing +No match + ]thing +No match + cthing +No match + dthing +No match + ething +No match + +/^[^]cde]/ + athing + 0: a + fthing + 0: f + *** Failers + 0: * + ]thing +No match + cthing +No match + dthing +No match + ething +No match + +/^\/ + + 0: \x81 + +/^/ + + 0: \xff + +/^[0-9]+$/ + 0 + 0: 0 + 1 + 0: 1 + 2 + 0: 2 + 3 + 0: 3 + 4 + 0: 4 + 5 + 0: 5 + 6 + 0: 6 + 7 + 0: 7 + 8 + 0: 8 + 9 + 0: 9 + 10 + 0: 10 + 100 + 0: 100 + *** Failers +No match + abc +No match + +/^.*nter/ + enter + 0: enter + inter + 0: inter + uponter + 0: uponter + +/^xxx[0-9]+$/ + xxx0 + 0: xxx0 + xxx1234 + 0: xxx1234 + *** Failers +No match + xxx +No match + +/^.+[0-9][0-9][0-9]$/ + x123 + 0: x123 + xx123 + 0: xx123 + 123456 + 0: 123456 + *** Failers +No match + 123 +No match + x1234 + 0: x1234 + +/^.+?[0-9][0-9][0-9]$/ + x123 + 0: x123 + xx123 + 0: xx123 + 123456 + 0: 123456 + *** Failers +No match + 123 +No match + x1234 + 0: x1234 + +/^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ + abc!pqr=apquxz.ixr.zzz.ac.uk + 0: abc!pqr=apquxz.ixr.zzz.ac.uk + 1: abc + 2: pqr + *** Failers +No match + !pqr=apquxz.ixr.zzz.ac.uk +No match + abc!=apquxz.ixr.zzz.ac.uk +No match + abc!pqr=apquxz:ixr.zzz.ac.uk +No match + abc!pqr=apquxz.ixr.zzz.ac.ukk +No match + +/:/ + Well, we need a colon: somewhere + 0: : + *** Fail if we don't +No match + +/([\da-f:]+)$/i + 0abc + 0: 0abc + 1: 0abc + abc + 0: abc + 1: abc + fed + 0: fed + 1: fed + E + 0: E + 1: E + :: + 0: :: + 1: :: + 5f03:12C0::932e + 0: 5f03:12C0::932e + 1: 5f03:12C0::932e + fed def + 0: def 
+ 1: def + Any old stuff + 0: ff + 1: ff + *** Failers +No match + 0zzz +No match + gzzz +No match + fed\x20 +No match + Any old rubbish +No match + +/^.*\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ + .1.2.3 + 0: .1.2.3 + 1: 1 + 2: 2 + 3: 3 + A.12.123.0 + 0: A.12.123.0 + 1: 12 + 2: 123 + 3: 0 + *** Failers +No match + .1.2.3333 +No match + 1.2.3 +No match + 1234.2.3 +No match + +/^(\d+)\s+IN\s+SOA\s+(\S+)\s+(\S+)\s*\(\s*$/ + 1 IN SOA non-sp1 non-sp2( + 0: 1 IN SOA non-sp1 non-sp2( + 1: 1 + 2: non-sp1 + 3: non-sp2 + 1 IN SOA non-sp1 non-sp2 ( + 0: 1 IN SOA non-sp1 non-sp2 ( + 1: 1 + 2: non-sp1 + 3: non-sp2 + *** Failers +No match + 1IN SOA non-sp1 non-sp2( +No match + +/^[a-zA-Z\d][a-zA-Z\d\-]*(\.[a-zA-Z\d][a-zA-z\d\-]*)*\.$/ + a. + 0: a. + Z. + 0: Z. + 2. + 0: 2. + ab-c.pq-r. + 0: ab-c.pq-r. + 1: .pq-r + sxk.zzz.ac.uk. + 0: sxk.zzz.ac.uk. + 1: .uk + x-.y-. + 0: x-.y-. + 1: .y- + *** Failers +No match + -abc.peq. +No match + +/^\*\.[a-z]([a-z\-\d]*[a-z\d]+)?(\.[a-z]([a-z\-\d]*[a-z\d]+)?)*$/ + *.a + 0: *.a + *.b0-a + 0: *.b0-a + 1: 0-a + *.c3-b.c + 0: *.c3-b.c + 1: 3-b + 2: .c + *.c-a.b-c + 0: *.c-a.b-c + 1: -a + 2: .b-c + 3: -c + *** Failers +No match + *.0 +No match + *.a- +No match + *.a-b.c- +No match + *.c-a.0-c +No match + +/^(?=ab(de))(abd)(e)/ + abde + 0: abde + 1: de + 2: abd + 3: e + +/^(?!(ab)de|x)(abd)(f)/ + abdf + 0: abdf + 1: <unset> + 2: abd + 3: f + +/^(?=(ab(cd)))(ab)/ + abcd + 0: ab + 1: abcd + 2: cd + 3: ab + +/^[\da-f](\.[\da-f])*$/i + a.b.c.d + 0: a.b.c.d + 1: .d + A.B.C.D + 0: A.B.C.D + 1: .D + a.b.c.1.2.3.C + 0: a.b.c.1.2.3.C + 1: .C + +/^\".*\"\s*(;.*)?$/ + \"1234\" + 0: "1234" + \"abcd\" ; + 0: "abcd" ; + 1: ; + \"\" ; rhubarb + 0: "" ; rhubarb + 1: ; rhubarb + *** Failers +No match + \"1234\" : things +No match + +/^$/ + \ + 0: + *** Failers +No match + +/ ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/x + ab c + 0: ab c + *** Failers +No match + abc +No match + ab cde +No match + +/(?x) ^ a (?# begins with a) b\sc (?# then b c) $ (?# then 
end)/ + ab c + 0: ab c + *** Failers +No match + abc +No match + ab cde +No match + +/^ a\ b[c ]d $/x + a bcd + 0: a bcd + a b d + 0: a b d + *** Failers +No match + abcd +No match + ab d +No match + +/^(a(b(c)))(d(e(f)))(h(i(j)))(k(l(m)))$/ + abcdefhijklm + 0: abcdefhijklm + 1: abc + 2: bc + 3: c + 4: def + 5: ef + 6: f + 7: hij + 8: ij + 9: j +10: klm +11: lm +12: m + +/^(?:a(b(c)))(?:d(e(f)))(?:h(i(j)))(?:k(l(m)))$/ + abcdefhijklm + 0: abcdefhijklm + 1: bc + 2: c + 3: ef + 4: f + 5: ij + 6: j + 7: lm + 8: m + +/^[\w][\W][\s][\S][\d][\D][\b][\n][\c]][\022]/ + a+ Z0+\x08\n\x1d\x12 + 0: a+ Z0+\x08\x0a\x1d\x12 + +/^[.^$|()*+?{,}]+/ + .^\$(*+)|{?,?} + 0: .^$(*+)|{?,?} + +/^a*\w/ + z + 0: z + az + 0: az + aaaz + 0: aaaz + a + 0: a + aa + 0: aa + aaaa + 0: aaaa + a+ + 0: a + aa+ + 0: aa + +/^a*?\w/ + z + 0: z + az + 0: a + aaaz + 0: a + a + 0: a + aa + 0: a + aaaa + 0: a + a+ + 0: a + aa+ + 0: a + +/^a+\w/ + az + 0: az + aaaz + 0: aaaz + aa + 0: aa + aaaa + 0: aaaa + aa+ + 0: aa + +/^a+?\w/ + az + 0: az + aaaz + 0: aa + aa + 0: aa + aaaa + 0: aa + aa+ + 0: aa + +/^\d{8}\w{2,}/ + 1234567890 + 0: 1234567890 + 12345678ab + 0: 12345678ab + 12345678__ + 0: 12345678__ + *** Failers +No match + 1234567 +No match + +/^[aeiou\d]{4,5}$/ + uoie + 0: uoie + 1234 + 0: 1234 + 12345 + 0: 12345 + aaaaa + 0: aaaaa + *** Failers +No match + 123456 +No match + +/^[aeiou\d]{4,5}?/ + uoie + 0: uoie + 1234 + 0: 1234 + 12345 + 0: 1234 + aaaaa + 0: aaaa + 123456 + 0: 1234 + +/\A(abc|def)=(\1){2,3}\Z/ + abc=abcabc + 0: abc=abcabc + 1: abc + 2: abc + def=defdefdef + 0: def=defdefdef + 1: def + 2: def + *** Failers +No match + abc=defdef +No match + +/^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\11*(\3\4)\1(?#)2$/ + abcdefghijkcda2 + 0: abcdefghijkcda2 + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i +10: j +11: k +12: cd + abcdefghijkkkkcda2 + 0: abcdefghijkkkkcda2 + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i +10: j +11: k +12: cd + +/(cat(a(ract|tonic)|erpillar)) 
\1()2(3)/ + cataract cataract23 + 0: cataract cataract23 + 1: cataract + 2: aract + 3: ract + 4: + 5: 3 + catatonic catatonic23 + 0: catatonic catatonic23 + 1: catatonic + 2: atonic + 3: tonic + 4: + 5: 3 + caterpillar caterpillar23 + 0: caterpillar caterpillar23 + 1: caterpillar + 2: erpillar + 3: <unset> + 4: + 5: 3 + + +/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + From abcd Mon Sep 01 12:33:02 1997 + 0: From abcd Mon Sep 01 12:33 + 1: abcd + +/^From\s+\S+\s+([a-zA-Z]{3}\s+){2}\d{1,2}\s+\d\d:\d\d/ + From abcd Mon Sep 01 12:33:02 1997 + 0: From abcd Mon Sep 01 12:33 + 1: Sep + From abcd Mon Sep 1 12:33:02 1997 + 0: From abcd Mon Sep 1 12:33 + 1: Sep + *** Failers +No match + From abcd Sep 01 12:33:02 1997 +No match + +/^12.34/s + 12\n34 + 0: 12\x0a34 + 12\r34 + 0: 12\x0d34 + +/\w+(?=\t)/ + the quick brown\t fox + 0: brown + +/foo(?!bar)(.*)/ + foobar is foolish see? + 0: foolish see? + 1: lish see? + +/(?:(?!foo)...|^.{0,2})bar(.*)/ + foobar crowbar etc + 0: rowbar etc + 1: etc + barrel + 0: barrel + 1: rel + 2barrel + 0: 2barrel + 1: rel + A barrel + 0: A barrel + 1: rel + +/^(\D*)(?=\d)(?!123)/ + abc456 + 0: abc + 1: abc + *** Failers +No match + abc123 +No match + +/^1234(?# test newlines + inside)/ + 1234 + 0: 1234 + +/^1234 #comment in extended re + /x + 1234 + 0: 1234 + +/#rhubarb + abcd/x + abcd + 0: abcd + +/^abcd#rhubarb/x + abcd + 0: abcd + +/^(a)\1{2,3}(.)/ + aaab + 0: aaab + 1: a + 2: b + aaaab + 0: aaaab + 1: a + 2: b + aaaaab + 0: aaaaa + 1: a + 2: a + aaaaaab + 0: aaaaa + 1: a + 2: a + +/(?!^)abc/ + the abc + 0: abc + *** Failers +No match + abc +No match + +/(?=^)abc/ + abc + 0: abc + *** Failers +No match + the abc +No match + +/^[ab]{1,3}(ab*|b)/ + aabbbbb + 0: aabb + 1: b + +/^[ab]{1,3}?(ab*|b)/ + aabbbbb + 0: aabbbbb + 1: abbbbb + +/^[ab]{1,3}?(ab*?|b)/ + aabbbbb + 0: aa + 1: a + +/^[ab]{1,3}(ab*?|b)/ + aabbbbb + 0: aabb + 1: b + +/ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ 
[^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # optional leading comment +(?: (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # initial word +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) )* # further okay, if led by a period +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... 
+(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +# address +| # or +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # one word, optionally followed by.... +(?: +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] | # atom and space parts, or... +\( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) | # comments, or... + +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +# quoted strings +)* +< (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # leading < +(?: @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. 
# if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* + +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* , (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +)* # further okay, if led by comma +: # closing colon +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* )? 
# optional route +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # initial word +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) )* # further okay, if led by a period +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... 
+(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +# address spec +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* > # trailing > +# name and address +) (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # optional trailing comment +/x + Alan Other <user\@dom.ain> + 0: Alan Other <user@dom.ain> + <user\@dom.ain> + 0: user@dom.ain + user\@dom.ain + 0: user@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + 0: "A. Other" <user.1234@dom.ain> (a comment) + A. Other <user.1234\@dom.ain> (a comment) + 0: Other <user.1234@dom.ain> (a comment) + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + 0: "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay + A missing angle <user\@some.where + 0: user@some.where + *** Failers +No match + The quick brown fox +No match + +/[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional leading comment +(?: +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# additional words +)* +@ +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. 
+# optional trailing comments +)* +# address +| # or +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +# leading word +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # "normal" atoms and or spaces +(?: +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +| +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +) # "special" comment or quoted string +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # more "normal" +)* +< +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# < +(?: +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +(?: , +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +@ +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. 
+# optional trailing comments +)* +)* # additional domains +: +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)? # optional route +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# additional words +)* +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +# address spec +> # > +# name and address +) +/x + Alan Other <user\@dom.ain> + 0: Alan Other <user@dom.ain> + <user\@dom.ain> + 0: user@dom.ain + user\@dom.ain + 0: user@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + 0: "A. Other" <user.1234@dom.ain> + A. Other <user.1234\@dom.ain> (a comment) + 0: Other <user.1234@dom.ain> + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + 0: "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay + A missing angle <user\@some.where + 0: user@some.where + *** Failers +No match + The quick brown fox +No match + +/abc\0def\00pqr\000xyz\0000AB/ + abc\0def\00pqr\000xyz\0000AB + 0: abc\x00def\x00pqr\x00xyz\x000AB + abc456 abc\0def\00pqr\000xyz\0000ABCDE + 0: abc\x00def\x00pqr\x00xyz\x000AB + +/abc\x0def\x00pqr\x000xyz\x0000AB/ + abc\x0def\x00pqr\x000xyz\x0000AB + 0: abc\x0def\x00pqr\x000xyz\x0000AB + abc456 abc\x0def\x00pqr\x000xyz\x0000ABCDE + 0: abc\x0def\x00pqr\x000xyz\x0000AB + +/^[\000-\037]/ + \0A + 0: \x00 + \01B + 0: \x01 + \037C + 0: \x1f + +/\0*/ + \0\0\0\0 + 0: \x00\x00\x00\x00 + +/A\x0{2,3}Z/ + The A\x0\x0Z + 0: A\x00\x00Z + An A\0\x0\0Z + 0: A\x00\x00\x00Z + *** Failers +No match + A\0Z +No match + A\0\x0\0\x0Z +No match + +/^(cow|)\1(bell)/ + cowcowbell + 0: cowcowbell + 1: cow + 2: bell + bell + 0: bell + 1: + 2: bell + *** Failers +No match + cowbell +No match + +/^\s/ + \040abc + 0: + \x0cabc + 0: \x0c + \nabc + 0: \x0a 
+ \rabc + 0: \x0d + \tabc + 0: \x09 + *** Failers +No match + abc +No match + +/^a b + + c/x + abc + 0: abc + +/^(a|)\1*b/ + ab + 0: ab + 1: a + aaaab + 0: aaaab + 1: a + b + 0: b + 1: + *** Failers +No match + acb +No match + +/^(a|)\1+b/ + aab + 0: aab + 1: a + aaaab + 0: aaaab + 1: a + b + 0: b + 1: + *** Failers +No match + ab +No match + +/^(a|)\1?b/ + ab + 0: ab + 1: a + aab + 0: aab + 1: a + b + 0: b + 1: + *** Failers +No match + acb +No match + +/^(a|)\1{2}b/ + aaab + 0: aaab + 1: a + b + 0: b + 1: + *** Failers +No match + ab +No match + aab +No match + aaaab +No match + +/^(a|)\1{2,3}b/ + aaab + 0: aaab + 1: a + aaaab + 0: aaaab + 1: a + b + 0: b + 1: + *** Failers +No match + ab +No match + aab +No match + aaaaab +No match + +/ab{1,3}bc/ + abbbbc + 0: abbbbc + abbbc + 0: abbbc + abbc + 0: abbc + *** Failers +No match + abc +No match + abbbbbc +No match + +/([^.]*)\.([^:]*):[T ]+(.*)/ + track1.title:TBlah blah blah + 0: track1.title:TBlah blah blah + 1: track1 + 2: title + 3: Blah blah blah + +/([^.]*)\.([^:]*):[T ]+(.*)/i + track1.title:TBlah blah blah + 0: track1.title:TBlah blah blah + 1: track1 + 2: title + 3: Blah blah blah + +/([^.]*)\.([^:]*):[t ]+(.*)/i + track1.title:TBlah blah blah + 0: track1.title:TBlah blah blah + 1: track1 + 2: title + 3: Blah blah blah + +/^[W-c]+$/ + WXY_^abc + 0: WXY_^abc + ***Failers +No match + wxy +No match + +/^[W-c]+$/i + WXY_^abc + 0: WXY_^abc + wxy_^ABC + 0: wxy_^ABC + +/^[\x3f-\x5F]+$/i + WXY_^abc + 0: WXY_^abc + wxy_^ABC + 0: wxy_^ABC + +/^abc$/m + abc + 0: abc + qqq\nabc + 0: abc + abc\nzzz + 0: abc + qqq\nabc\nzzz + 0: abc + +/^abc$/ + abc + 0: abc + *** Failers +No match + qqq\nabc +No match + abc\nzzz +No match + qqq\nabc\nzzz +No match + +/\Aabc\Z/m + abc + 0: abc + abc\n + 0: abc + *** Failers +No match + qqq\nabc +No match + abc\nzzz +No match + qqq\nabc\nzzz +No match + +/\A(.)*\Z/s + abc\ndef + 0: abc\x0adef + 1: f + +/\A(.)*\Z/m + *** Failers + 0: *** Failers + 1: s + abc\ndef +No match + 
+/(?:b)|(?::+)/ + b::c + 0: b + c::b + 0: :: + +/[-az]+/ + az- + 0: az- + *** Failers + 0: a + b +No match + +/[az-]+/ + za- + 0: za- + *** Failers + 0: a + b +No match + +/[a\-z]+/ + a-z + 0: a-z + *** Failers + 0: a + b +No match + +/[a-z]+/ + abcdxyz + 0: abcdxyz + +/[\d-]+/ + 12-34 + 0: 12-34 + *** Failers +No match + aaa +No match + +/[\d-z]+/ + 12-34z + 0: 12-34z + *** Failers +No match + aaa +No match + +/\x5c/ + \\ + 0: \ + +/\x20Z/ + the Zoo + 0: Z + *** Failers +No match + Zulu +No match + +/(abc)\1/i + abcabc + 0: abcabc + 1: abc + ABCabc + 0: ABCabc + 1: ABC + abcABC + 0: abcABC + 1: abc + +/(main(O)?)+/ + mainmain + 0: mainmain + 1: main + mainOmain + 0: mainOmain + 1: main + 2: O + +/ab{3cd/ + ab{3cd + 0: ab{3cd + +/ab{3,cd/ + ab{3,cd + 0: ab{3,cd + +/ab{3,4a}cd/ + ab{3,4a}cd + 0: ab{3,4a}cd + +/{4,5a}bc/ + {4,5a}bc + 0: {4,5a}bc + +/^a.b/ + a\rb + 0: a\x0db + *** Failers +No match + a\nb +No match + +/abc$/ + abc + 0: abc + abc\n + 0: abc + *** Failers +No match + abc\ndef +No match + +/(abc)\123/ + abc\x53 + 0: abcS + 1: abc + +/(abc)\223/ + abc\x93 + 0: abc\x93 + 1: abc + +/(abc)\323/ + abc\xd3 + 0: abc\xd3 + 1: abc + +/(abc)\500/ + abc\x40 + 0: abc@ + 1: abc + abc\100 + 0: abc@ + 1: abc + +/(abc)\5000/ + abc\x400 + 0: abc@0 + 1: abc + abc\x40\x30 + 0: abc@0 + 1: abc + abc\1000 + 0: abc@0 + 1: abc + abc\100\x30 + 0: abc@0 + 1: abc + abc\100\060 + 0: abc@0 + 1: abc + abc\100\60 + 0: abc@0 + 1: abc + +/abc\81/ + abc\081 + 0: abc\x0081 + abc\0\x38\x31 + 0: abc\x0081 + +/abc\91/ + abc\091 + 0: abc\x0091 + abc\0\x39\x31 + 0: abc\x0091 + +/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/ + abcdefghijkllS + 0: abcdefghijkllS + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i +10: j +11: k +12: l + +/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/ + abcdefghijk\12S + 0: abcdefghijk\x0aS + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i +10: j +11: k + +/ab\gdef/ + abgdef + 0: abgdef + +/a{0}bc/ + bc + 0: bc + +/(a|(bc)){0,0}?xyz/ + xyz + 0: 
xyz + +/abc[\10]de/ + abc\010de + 0: abc\x08de + +/abc[\1]de/ + abc\1de + 0: abc\x01de + +/(abc)[\1]de/ + abc\1de + 0: abc\x01de + 1: abc + +/a.b(?s)/ + a\nb + 0: a\x0ab + +/^([^a])([^\b])([^c]*)([^d]{3,4})/ + baNOTccccd + 0: baNOTcccc + 1: b + 2: a + 3: NOT + 4: cccc + baNOTcccd + 0: baNOTccc + 1: b + 2: a + 3: NOT + 4: ccc + baNOTccd + 0: baNOTcc + 1: b + 2: a + 3: NO + 4: Tcc + bacccd + 0: baccc + 1: b + 2: a + 3: + 4: ccc + *** Failers + 0: *** Failers + 1: * + 2: * + 3: * Fail + 4: ers + anything +No match + b\bc +No match + baccd +No match + +/[^a]/ + Abc + 0: A + +/[^a]/i + Abc + 0: b + +/[^a]+/ + AAAaAbc + 0: AAA + +/[^a]+/i + AAAaAbc + 0: bc + +/[^a]+/ + bbb\nccc + 0: bbb\x0accc + +/[^k]$/ + abc + 0: c + *** Failers + 0: s + abk +No match + +/[^k]{2,3}$/ + abc + 0: abc + kbc + 0: bc + kabc + 0: abc + *** Failers + 0: ers + abk +No match + akb +No match + akk +No match + +/^\d{8,}\@.+[^k]$/ + 12345678\@a.b.c.d + 0: 12345678@a.b.c.d + 123456789\@x.y.z + 0: 123456789@x.y.z + *** Failers +No match + 12345678\@x.y.uk +No match + 1234567\@a.b.c.d +No match + +/(a)\1{8,}/ + aaaaaaaaa + 0: aaaaaaaaa + 1: a + aaaaaaaaaa + 0: aaaaaaaaaa + 1: a + *** Failers +No match + aaaaaaa +No match + +/[^a]/ + aaaabcd + 0: b + aaAabcd + 0: A + +/[^a]/i + aaaabcd + 0: b + aaAabcd + 0: b + +/[^az]/ + aaaabcd + 0: b + aaAabcd + 0: A + +/[^az]/i + aaaabcd + 0: b + aaAabcd + 0: b + 
+/\000\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\040\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175\176\177\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377/ + 
\000\001\002\003\004\005\006\007\010\011\012\013\014\015\016\017\020\021\022\023\024\025\026\027\030\031\032\033\034\035\036\037\040\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175\176\177\200\201\202\203\204\205\206\207\210\211\212\213\214\215\216\217\220\221\222\223\224\225\226\227\230\231\232\233\234\235\236\237\240\241\242\243\244\245\246\247\250\251\252\253\254\255\256\257\260\261\262\263\264\265\266\267\270\271\272\273\274\275\276\277\300\301\302\303\304\305\306\307\310\311\312\313\314\315\316\317\320\321\322\323\324\325\326\327\330\331\332\333\334\335\336\337\340\341\342\343\344\345\346\347\350\351\352\353\354\355\356\357\360\361\362\363\364\365\366\367\370\371\372\373\374\375\376\377 + 0: \x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff + +/P[^*]TAIRE[^*]{1,6}?LL/ + xxxxxxxxxxxPSTAIREISLLxxxxxxxxx + 0: PSTAIREISLL + +/P[^*]TAIRE[^*]{1,}?LL/ + xxxxxxxxxxxPSTAIREISLLxxxxxxxxx + 0: PSTAIREISLL + +/(\.\d\d[1-9]?)\d+/ + 1.230003938 + 0: .230003938 + 1: .23 + 
1.875000282 + 0: .875000282 + 1: .875 + 1.235 + 0: .235 + 1: .23 + +/(\.\d\d((?=0)|\d(?=\d)))/ + 1.230003938 + 0: .23 + 1: .23 + 2: + 1.875000282 + 0: .875 + 1: .875 + 2: 5 + *** Failers +No match + 1.235 +No match + +/a(?)b/ + ab + 0: ab + +/\b(foo)\s+(\w+)/i + Food is on the foo table + 0: foo table + 1: foo + 2: table + +/foo(.*)bar/ + The food is under the bar in the barn. + 0: food is under the bar in the bar + 1: d is under the bar in the + +/foo(.*?)bar/ + The food is under the bar in the barn. + 0: food is under the bar + 1: d is under the + +/(.*)(\d*)/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: 53147 + 2: + +/(.*)(\d+)/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: 5314 + 2: 7 + +/(.*?)(\d*)/ + I have 2 numbers: 53147 + 0: + 1: + 2: + +/(.*?)(\d+)/ + I have 2 numbers: 53147 + 0: I have 2 + 1: I have + 2: 2 + +/(.*)(\d+)$/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: 5314 + 2: 7 + +/(.*?)(\d+)$/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: + 2: 53147 + +/(.*)\b(\d+)$/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: + 2: 53147 + +/(.*\D)(\d+)$/ + I have 2 numbers: 53147 + 0: I have 2 numbers: 53147 + 1: I have 2 numbers: + 2: 53147 + +/^\D*(?!123)/ + ABC123 + 0: AB + +/^(\D*)(?=\d)(?!123)/ + ABC445 + 0: ABC + 1: ABC + *** Failers +No match + ABC123 +No match + +/^[W-]46]/ + W46]789 + 0: W46] + -46]789 + 0: -46] + *** Failers +No match + Wall +No match + Zebra +No match + 42 +No match + [abcd] +No match + ]abcd[ +No match + +/^[W-\]46]/ + W46]789 + 0: W + Wall + 0: W + Zebra + 0: Z + Xylophone + 0: X + 42 + 0: 4 + [abcd] + 0: [ + ]abcd[ + 0: ] + \\backslash + 0: \ + *** Failers +No match + -46]789 +No match + well +No match + +/\d\d\/\d\d\/\d\d\d\d/ + 01/01/2000 + 0: 01/01/2000 + +/word (?:[a-zA-Z0-9]+ ){0,10}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + 
0: word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark +No match + +/word (?:[a-zA-Z0-9]+ ){0,300}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope +No match + +/^(a){0,0}/ + bcd + 0: + abc + 0: + aab + 0: + +/^(a){0,1}/ + bcd + 0: + abc + 0: a + 1: a + aab + 0: a + 1: a + +/^(a){0,2}/ + bcd + 0: + abc + 0: a + 1: a + aab + 0: aa + 1: a + +/^(a){0,3}/ + bcd + 0: + abc + 0: a + 1: a + aab + 0: aa + 1: a + aaa + 0: aaa + 1: a + +/^(a){0,}/ + bcd + 0: + abc + 0: a + 1: a + aab + 0: aa + 1: a + aaa + 0: aaa + 1: a + aaaaaaaa + 0: aaaaaaaa + 1: a + +/^(a){1,1}/ + bcd +No match + abc + 0: a + 1: a + aab + 0: a + 1: a + +/^(a){1,2}/ + bcd +No match + abc + 0: a + 1: a + aab + 0: aa + 1: a + +/^(a){1,3}/ + bcd +No match + abc + 0: a + 1: a + aab + 0: aa + 1: a + aaa + 0: aaa + 1: a + +/^(a){1,}/ + bcd +No match + abc + 0: a + 1: a + aab + 0: aa + 1: a + aaa + 0: aaa + 1: a + aaaaaaaa + 0: aaaaaaaa + 1: a + +/.*\.gif/ + borfle\nbib.gif\nno + 0: bib.gif + +/.{0,}\.gif/ + borfle\nbib.gif\nno + 0: bib.gif + +/.*\.gif/m + borfle\nbib.gif\nno + 0: bib.gif + +/.*\.gif/s + borfle\nbib.gif\nno + 0: borfle\x0abib.gif + +/.*\.gif/ms + borfle\nbib.gif\nno + 0: borfle\x0abib.gif + +/.*$/ + borfle\nbib.gif\nno + 0: no + +/.*$/m + borfle\nbib.gif\nno + 0: borfle + +/.*$/s + borfle\nbib.gif\nno + 0: borfle\x0abib.gif\x0ano + +/.*$/ms + borfle\nbib.gif\nno + 0: borfle\x0abib.gif\x0ano + +/.*$/ + borfle\nbib.gif\nno\n + 0: no + +/.*$/m + borfle\nbib.gif\nno\n + 0: borfle + +/.*$/s + borfle\nbib.gif\nno\n + 0: borfle\x0abib.gif\x0ano\x0a + +/.*$/ms + borfle\nbib.gif\nno\n + 0: borfle\x0abib.gif\x0ano\x0a + +/(.*X|^B)/ + abcde\n1234Xyz + 0: 1234X + 1: 1234X + BarFoo + 0: B + 1: B + *** Failers +No match + abcde\nBar +No match + +/(.*X|^B)/m + abcde\n1234Xyz + 0: 1234X + 1: 1234X 
+ BarFoo + 0: B + 1: B + abcde\nBar + 0: B + 1: B + +/(.*X|^B)/s + abcde\n1234Xyz + 0: abcde\x0a1234X + 1: abcde\x0a1234X + BarFoo + 0: B + 1: B + *** Failers +No match + abcde\nBar +No match + +/(.*X|^B)/ms + abcde\n1234Xyz + 0: abcde\x0a1234X + 1: abcde\x0a1234X + BarFoo + 0: B + 1: B + abcde\nBar + 0: B + 1: B + +/(?s)(.*X|^B)/ + abcde\n1234Xyz + 0: abcde\x0a1234X + 1: abcde\x0a1234X + BarFoo + 0: B + 1: B + *** Failers +No match + abcde\nBar +No match + +/(?s:.*X|^B)/ + abcde\n1234Xyz + 0: abcde\x0a1234X + BarFoo + 0: B + *** Failers +No match + abcde\nBar +No match + +/ End of test input / + diff --git a/ext/pcre/pcrelib/testoutput2 b/ext/pcre/pcrelib/testoutput2 new file mode 100644 index 0000000000..09148ff243 --- /dev/null +++ b/ext/pcre/pcrelib/testoutput2 @@ -0,0 +1,1088 @@ +PCRE version 2.05 21-Apr-1999 + +/(a)b|/ +Identifying subpattern count = 1 +No options +No first char + +/abc/ +Identifying subpattern count = 0 +No options +First char = 'a' + abc + 0: abc + defabc + 0: abc + \Aabc + 0: abc + *** Failers +No match + \Adefabc +No match + ABC +No match + +/^abc/ +Identifying subpattern count = 0 +Options: anchored +No first char + abc + 0: abc + \Aabc + 0: abc + *** Failers +No match + defabc +No match + \Adefabc +No match + +/a+bc/ +Identifying subpattern count = 0 +No options +First char = 'a' + +/a*bc/ +Identifying subpattern count = 0 +No options +No first char + +/a{3}bc/ +Identifying subpattern count = 0 +No options +First char = 'a' + +/(abc|a+z)/ +Identifying subpattern count = 1 +No options +First char = 'a' + +/^abc$/ +Identifying subpattern count = 0 +Options: anchored +No first char + abc + 0: abc + *** Failers +No match + def\nabc +No match + +/ab\gdef/X +Failed: unrecognized character follows \ at offset 3 + +/(?X)ab\gdef/X +Failed: unrecognized character follows \ at offset 7 + +/x{5,4}/ +Failed: numbers out of order in {} quantifier at offset 5 + +/z{65536}/ +Failed: number too big in {} quantifier at offset 7 + +/[abcd/ +Failed: 
missing terminating ] for character class at offset 5 + +/[\B]/ +Failed: invalid escape sequence in character class at offset 2 + +/[a-\w]/ +Failed: invalid escape sequence in character class at offset 4 + +/[z-a]/ +Failed: range out of order in character class at offset 3 + +/^*/ +Failed: nothing to repeat at offset 1 + +/(abc/ +Failed: missing ) at offset 4 + +/(?# abc/ +Failed: missing ) after comment at offset 7 + +/(?z)abc/ +Failed: unrecognized character after (? at offset 2 + +/.*b/ +Identifying subpattern count = 0 +No options +First char at start or follows \n + +/.*?b/ +Identifying subpattern count = 0 +No options +First char at start or follows \n + +/cat|dog|elephant/ +Identifying subpattern count = 0 +No options +No first char + this sentence eventually mentions a cat + 0: cat + this sentences rambles on and on for a while and then reaches elephant + 0: elephant + +/cat|dog|elephant/S +Identifying subpattern count = 0 +No options +No first char +Starting character set: c d e + this sentence eventually mentions a cat + 0: cat + this sentences rambles on and on for a while and then reaches elephant + 0: elephant + +/cat|dog|elephant/iS +Identifying subpattern count = 0 +Options: caseless +No first char +Starting character set: C D E c d e + this sentence eventually mentions a CAT cat + 0: CAT + this sentences rambles on and on for a while to elephant ElePhant + 0: elephant + +/a|[bcd]/S +Identifying subpattern count = 0 +No options +No first char +Starting character set: a b c d + +/(a|[^\dZ])/S +Identifying subpattern count = 1 +No options +No first char +Starting character set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a + \x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 + \x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / : ; < = > + ? 
@ A B C D E F G H I J K L M N O P Q R S T U V W X Y [ \ ] ^ _ ` a b c d + e f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f \x80 \x81 \x82 \x83 + \x84 \x85 \x86 \x87 \x88 \x89 \x8a \x8b \x8c \x8d \x8e \x8f \x90 \x91 \x92 + \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9a \x9b \x9c \x9d \x9e \x9f \xa0 \xa1 + \xa2 \xa3 \xa4 \xa5 \xa6 \xa7 \xa8 \xa9 \xaa \xab \xac \xad \xae \xaf \xb0 + \xb1 \xb2 \xb3 \xb4 \xb5 \xb6 \xb7 \xb8 \xb9 \xba \xbb \xbc \xbd \xbe \xbf + \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc \xcd \xce + \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb \xdc \xdd + \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea \xeb \xec + \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9 \xfa \xfb + \xfc \xfd \xfe \xff + +/(a|b)*[\s]/S +Identifying subpattern count = 1 +No options +No first char +Starting character set: \x09 \x0a \x0b \x0c \x0d \x20 a b + +/(ab\2)/ +Failed: back reference to non-existent subpattern at offset 6 + +/{4,5}abc/ +Failed: nothing to repeat at offset 4 + +/(a)(b)(c)\2/ +Identifying subpattern count = 3 +No options +First char = 'a' + abcb + 0: abcb + 1: a + 2: b + 3: c + \O0abcb +Matched, but too many substrings + \O3abcb +Matched, but too many substrings + 0: abcb + \O6abcb +Matched, but too many substrings + 0: abcb + 1: a + \O9abcb +Matched, but too many substrings + 0: abcb + 1: a + 2: b + \O12abcb + 0: abcb + 1: a + 2: b + 3: c + +/(a)bc|(a)(b)\2/ +Identifying subpattern count = 3 +No options +First char = 'a' + abc + 0: abc + 1: a + \O0abc +Matched, but too many substrings + \O3abc +Matched, but too many substrings + 0: abc + \O6abc + 0: abc + 1: a + aba + 0: aba + 1: <unset> + 2: a + 3: b + \O0aba +Matched, but too many substrings + \O3aba +Matched, but too many substrings + 0: aba + \O6aba +Matched, but too many substrings + 0: aba + 1: <unset> + \O9aba +Matched, but too many substrings + 0: aba + 1: <unset> + 2: a + \O12aba + 0: aba + 1: <unset> + 2: a + 3: b + +/abc$/E 
+Identifying subpattern count = 0 +Options: dollar_endonly +First char = 'a' + abc + 0: abc + *** Failers +No match + abc\n +No match + abc\ndef +No match + +/(a)(b)(c)(d)(e)\6/ +Failed: back reference to non-existent subpattern at offset 17 + +/the quick brown fox/ +Identifying subpattern count = 0 +No options +First char = 't' + the quick brown fox + 0: the quick brown fox + this is a line with the quick brown fox + 0: the quick brown fox + +/the quick brown fox/A +Identifying subpattern count = 0 +Options: anchored +No first char + the quick brown fox + 0: the quick brown fox + *** Failers +No match + this is a line with the quick brown fox +No match + +/ab(?z)cd/ +Failed: unrecognized character after (? at offset 4 + +/^abc|def/ +Identifying subpattern count = 0 +No options +No first char + abcdef + 0: abc + abcdef\B + 0: def + +/.*((abc)$|(def))/ +Identifying subpattern count = 3 +No options +First char at start or follows \n + defabc + 0: defabc + 1: abc + 2: abc + \Zdefabc + 0: def + 1: def + 2: <unset> + 3: def + +/abc/P + abc + 0: abc + *** Failers +No match: POSIX code 17: match failed + +/^abc|def/P + abcdef + 0: abc + abcdef\B + 0: def + +/.*((abc)$|(def))/P + defabc + 0: defabc + 1: abc + 2: abc + \Zdefabc + 0: def + 1: def + 3: def + +/the quick brown fox/P + the quick brown fox + 0: the quick brown fox + *** Failers +No match: POSIX code 17: match failed + The Quick Brown Fox +No match: POSIX code 17: match failed + +/the quick brown fox/Pi + the quick brown fox + 0: the quick brown fox + The Quick Brown Fox + 0: The Quick Brown Fox + +/abc.def/P + *** Failers +No match: POSIX code 17: match failed + abc\ndef +No match: POSIX code 17: match failed + +/abc$/P + abc + 0: abc + abc\n + 0: abc + +/(abc)\2/P +Failed: POSIX code 15: bad back reference at offset 7 + +/(abc\1)/P + abc +No match: POSIX code 17: match failed + +/)/ +Failed: unmatched parentheses at offset 0 + +/a[]b/ +Failed: missing terminating ] for character class at offset 4 + +/[^aeiou 
]{3,}/ +Identifying subpattern count = 0 +No options +No first char + co-processors, and for + 0: -pr + +/<.*>/ +Identifying subpattern count = 0 +No options +First char = '<' + abc<def>ghi<klm>nop + 0: <def>ghi<klm> + +/<.*?>/ +Identifying subpattern count = 0 +No options +First char = '<' + abc<def>ghi<klm>nop + 0: <def> + +/<.*>/U +Identifying subpattern count = 0 +Options: ungreedy +First char = '<' + abc<def>ghi<klm>nop + 0: <def> + +/<.*>(?U)/ +Identifying subpattern count = 0 +Options: ungreedy +First char = '<' + abc<def>ghi<klm>nop + 0: <def> + +/<.*?>/U +Identifying subpattern count = 0 +Options: ungreedy +First char = '<' + abc<def>ghi<klm>nop + 0: <def>ghi<klm> + +/={3,}/U +Identifying subpattern count = 0 +Options: ungreedy +First char = '=' + abc========def + 0: === + +/(?U)={3,}?/ +Identifying subpattern count = 0 +Options: ungreedy +First char = '=' + abc========def + 0: ======== + +/(?<!bar|cattle)foo/ +Identifying subpattern count = 0 +No options +First char = 'f' + foo + 0: foo + catfoo + 0: foo + *** Failers +No match + the barfoo +No match + and cattlefoo +No match + +/(?<=a+)b/ +Failed: lookbehind assertion is not fixed length at offset 6 + +/(?<=aaa|b{0,3})b/ +Failed: lookbehind assertion is not fixed length at offset 14 + +/(?<!(foo)a\1)bar/ +Failed: lookbehind assertion is not fixed length at offset 12 + +/(?i)abc/ +Identifying subpattern count = 0 +Options: caseless +First char = 'a' + +/(a|(?m)a)/ +Identifying subpattern count = 1 +No options +First char = 'a' + +/(?i)^1234/ +Identifying subpattern count = 0 +Options: anchored caseless +No first char + +/(^b|(?i)^d)/ +Identifying subpattern count = 1 +Options: anchored +No first char + +/(?s).*/ +Identifying subpattern count = 0 +Options: anchored dotall +No first char + +/[abcd]/S +Identifying subpattern count = 0 +No options +No first char +Starting character set: a b c d + +/(?i)[abcd]/S +Identifying subpattern count = 0 +Options: caseless +No first char +Starting character set: A B C 
D a b c d + +/(?m)[xy]|(b|c)/S +Identifying subpattern count = 1 +Options: multiline +No first char +Starting character set: b c x y + +/(^a|^b)/m +Identifying subpattern count = 1 +Options: multiline +First char at start or follows \n + +/(?i)(^a|^b)/m +Identifying subpattern count = 1 +Options: caseless multiline +First char at start or follows \n + +/(a)(?(1)a|b|c)/ +Failed: conditional group contains more than two branches at offset 13 + +/(?(?=a)a|b|c)/ +Failed: conditional group contains more than two branches at offset 12 + +/(?(1a)/ +Failed: malformed number after (?( at offset 4 + +/(?(?i))/ +Failed: assertion expected after (?( at offset 3 + +/(?(abc))/ +Failed: assertion expected after (?( at offset 3 + +/(?(?<ab))/ +Failed: unrecognized character after (?< at offset 2 + +/((?s)blah)\s+\1/ +Identifying subpattern count = 1 +No options +First char = 'b' + +/((?i)blah)\s+\1/ +Identifying subpattern count = 1 +No options +No first char + +/((?i)b)/DS +------------------------------------------------------------------ + 0 16 Bra 0 + 3 8 Bra 1 + 6 01 Opt + 8 1 b + 11 8 Ket + 14 00 Opt + 16 16 Ket + 19 End +------------------------------------------------------------------ +Identifying subpattern count = 1 +No options +No first char +Starting character set: B b + +/(a*b|(?i:c*(?-i)d))/S +Identifying subpattern count = 1 +No options +No first char +Starting character set: C a b c d + +/a$/ +Identifying subpattern count = 0 +No options +First char = 'a' + a + 0: a + a\n + 0: a + *** Failers +No match + \Za +No match + \Za\n +No match + +/a$/m +Identifying subpattern count = 0 +Options: multiline +First char = 'a' + a + 0: a + a\n + 0: a + \Za\n + 0: a + *** Failers +No match + \Za +No match + +/\Aabc/m +Identifying subpattern count = 0 +Options: anchored multiline +No first char + +/^abc/m +Identifying subpattern count = 0 +Options: multiline +First char at start or follows \n + +/^((a+)(?U)([ab]+)(?-U)([bc]+)(\w*))/ +Identifying subpattern count = 5 +Options: 
anchored +No first char + aaaaabbbbbcccccdef + 0: aaaaabbbbbcccccdef + 1: aaaaabbbbbcccccdef + 2: aaaaa + 3: b + 4: bbbbccccc + 5: def + +/(?<=foo)[ab]/S +Identifying subpattern count = 0 +No options +No first char +Starting character set: a b + +/(?<!foo)(alpha|omega)/S +Identifying subpattern count = 1 +No options +No first char +Starting character set: a o + +/(?!alphabet)[ab]/S +Identifying subpattern count = 0 +No options +No first char +Starting character set: a b + +/(?<=foo\n)^bar/m +Identifying subpattern count = 0 +Options: multiline +First char at start or follows \n + +/(?>^abc)/m +Identifying subpattern count = 0 +Options: multiline +First char at start or follows \n + abc + 0: abc + def\nabc + 0: abc + *** Failers +No match + defabc +No match + +/(?<=ab(c+)d)ef/ +Failed: lookbehind assertion is not fixed length at offset 11 + +/(?<=ab(?<=c+)d)ef/ +Failed: lookbehind assertion is not fixed length at offset 12 + +/(?<=ab(c|de)f)g/ +Failed: lookbehind assertion is not fixed length at offset 13 + +/The next three are in testinput2 because they have variable length branches/ +Identifying subpattern count = 0 +No options +First char = 'T' + +/(?<=bullock|donkey)-cart/ +Identifying subpattern count = 0 +No options +First char = '-' + the bullock-cart + 0: -cart + a donkey-cart race + 0: -cart + *** Failers +No match + cart +No match + horse-and-cart +No match + +/(?<=ab(?i)x|y|z)/ +Identifying subpattern count = 0 +No options +No first char + +/(?>.*)(?<=(abcd)|(xyz))/ +Identifying subpattern count = 2 +No options +First char at start or follows \n + alphabetabcd + 0: alphabetabcd + 1: abcd + endingxyz + 0: endingxyz + 1: <unset> + 2: xyz + +/(?<=ab(?i)x(?-i)y|(?i)z|b)ZZ/ +Identifying subpattern count = 0 +No options +First char = 'Z' + abxyZZ + 0: ZZ + abXyZZ + 0: ZZ + ZZZ + 0: ZZ + zZZ + 0: ZZ + bZZ + 0: ZZ + BZZ + 0: ZZ + *** Failers +No match + ZZ +No match + abXYZZ +No match + zzz +No match + bzz +No match + +/(?<!(foo)a)bar/ +Identifying subpattern 
count = 1 +No options +First char = 'b' + bar + 0: bar + foobbar + 0: bar + *** Failers +No match + fooabar +No match + +/This one is here because Perl 5.005_02 doesn't fail it/ +Identifying subpattern count = 0 +No options +First char = 'T' + +/^(a)?(?(1)a|b)+$/ +Identifying subpattern count = 1 +Options: anchored +No first char + *** Failers +No match + a +No match + +/This one is here because I think Perl 5.005_02 gets the setting of $1 wrong/ +Identifying subpattern count = 0 +No options +First char = 'T' + +/^(a\1?){4}$/ +Identifying subpattern count = 1 +Options: anchored +No first char + aaaaaa + 0: aaaaaa + 1: aa + +/These are syntax tests from Perl 5.005/ +Identifying subpattern count = 0 +No options +First char = 'T' + +/a[b-a]/ +Failed: range out of order in character class at offset 4 + +/a[]b/ +Failed: missing terminating ] for character class at offset 4 + +/a[/ +Failed: missing terminating ] for character class at offset 2 + +/*a/ +Failed: nothing to repeat at offset 0 + +/(*)b/ +Failed: nothing to repeat at offset 1 + +/abc)/ +Failed: unmatched parentheses at offset 3 + +/(abc/ +Failed: missing ) at offset 4 + +/a**/ +Failed: nothing to repeat at offset 2 + +/)(/ +Failed: unmatched parentheses at offset 0 + +/\1/ +Failed: back reference to non-existent subpattern at offset 2 + +/\2/ +Failed: back reference to non-existent subpattern at offset 2 + +/(a)|\2/ +Failed: back reference to non-existent subpattern at offset 6 + +/a[b-a]/i +Failed: range out of order in character class at offset 4 + +/a[]b/i +Failed: missing terminating ] for character class at offset 4 + +/a[/i +Failed: missing terminating ] for character class at offset 2 + +/*a/i +Failed: nothing to repeat at offset 0 + +/(*)b/i +Failed: nothing to repeat at offset 1 + +/abc)/i +Failed: unmatched parentheses at offset 3 + +/(abc/i +Failed: missing ) at offset 4 + +/a**/i +Failed: nothing to repeat at offset 2 + +/)(/i +Failed: unmatched parentheses at offset 0 + +/:(?:/ +Failed: missing ) 
at offset 4 + +/(?<%)b/ +Failed: unrecognized character after (?< at offset 0 + +/a(?{)b/ +Failed: unrecognized character after (? at offset 3 + +/a(?{{})b/ +Failed: unrecognized character after (? at offset 3 + +/a(?{}})b/ +Failed: unrecognized character after (? at offset 3 + +/a(?{"{"})b/ +Failed: unrecognized character after (? at offset 3 + +/a(?{"{"}})b/ +Failed: unrecognized character after (? at offset 3 + +/(?(1?)a|b)/ +Failed: malformed number after (?( at offset 4 + +/(?(1)a|b|c)/ +Failed: conditional group contains more than two branches at offset 10 + +/[a[:xyz:/ +Failed: missing terminating ] for character class at offset 8 + +/(?<=x+)y/ +Failed: lookbehind assertion is not fixed length at offset 6 + +/a{37,17}/ +Failed: numbers out of order in {} quantifier at offset 7 + +/abc/\ +Failed: \ at end of pattern at offset 4 + +/abc/\P +Failed: POSIX code 9: bad escape sequence at offset 4 + +/abc/\i +Failed: \ at end of pattern at offset 4 + +/(a)bc(d)/ +Identifying subpattern count = 2 +No options +First char = 'a' + abcd + 0: abcd + 1: a + 2: d + abcd\C2 + 0: abcd + 1: a + 2: d + 2C d (1) + abcd\C5 + 0: abcd + 1: a + 2: d +copy substring 5 failed -7 + +/(.{20})/ +Identifying subpattern count = 1 +No options +No first char + abcdefghijklmnopqrstuvwxyz + 0: abcdefghijklmnopqrst + 1: abcdefghijklmnopqrst + abcdefghijklmnopqrstuvwxyz\C1 + 0: abcdefghijklmnopqrst + 1: abcdefghijklmnopqrst +copy substring 1 failed -6 + abcdefghijklmnopqrstuvwxyz\G1 + 0: abcdefghijklmnopqrst + 1: abcdefghijklmnopqrst + 1G abcdefghijklmnopqrst (20) + +/(.{15})/ +Identifying subpattern count = 1 +No options +No first char + abcdefghijklmnopqrstuvwxyz + 0: abcdefghijklmno + 1: abcdefghijklmno + abcdefghijklmnopqrstuvwxyz\C1\G1 + 0: abcdefghijklmno + 1: abcdefghijklmno + 1C abcdefghijklmno (15) + 1G abcdefghijklmno (15) + +/(.{16})/ +Identifying subpattern count = 1 +No options +No first char + abcdefghijklmnopqrstuvwxyz + 0: abcdefghijklmnop + 1: abcdefghijklmnop + 
abcdefghijklmnopqrstuvwxyz\C1\G1\L + 0: abcdefghijklmnop + 1: abcdefghijklmnop +copy substring 1 failed -6 + 1G abcdefghijklmnop (16) + 0L abcdefghijklmnop + 1L abcdefghijklmnop + +/^(a|(bc))de(f)/ +Identifying subpattern count = 3 +Options: anchored +No first char + adef\G1\G2\G3\G4\L + 0: adef + 1: a + 2: <unset> + 3: f + 1G a (1) + 2G (0) + 3G f (1) +get substring 4 failed -7 + 0L adef + 1L a + 2L + 3L f + bcdef\G1\G2\G3\G4\L + 0: bcdef + 1: bc + 2: bc + 3: f + 1G bc (2) + 2G bc (2) + 3G f (1) +get substring 4 failed -7 + 0L bcdef + 1L bc + 2L bc + 3L f + adefghijk\C0 + 0: adef + 1: a + 2: <unset> + 3: f + 0C adef (4) + +/^abc\00def/ +Identifying subpattern count = 0 +Options: anchored +No first char + abc\00def\L\C0 + 0: abc\x00def + 0C abc (7) + 0L abc + +/word ((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ +)((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ )((?:[a-zA-Z0-9]+ +)?)?)?)?)?)?)?)?)?otherword/M +Memory allocation request: 441 (code space 428) +Identifying subpattern count = 8 +No options +First char = 'w' + +/.*X/D +------------------------------------------------------------------ + 0 8 Bra 0 + 3 Any* + 5 1 X + 8 8 Ket + 11 End +------------------------------------------------------------------ +Identifying subpattern count = 0 +No options +First char at start or follows \n + +/.*X/Ds +------------------------------------------------------------------ + 0 8 Bra 0 + 3 Any* + 5 1 X + 8 8 Ket + 11 End +------------------------------------------------------------------ +Identifying subpattern count = 0 +Options: anchored dotall +No first char + +/(.*X|^B)/D +------------------------------------------------------------------ + 0 21 Bra 0 + 3 8 Bra 1 + 6 Any* + 8 1 X + 11 7 Alt + 14 ^ + 15 1 B + 18 15 Ket + 21 21 Ket + 24 End +------------------------------------------------------------------ +Identifying subpattern count = 1 +No options +First char at start or follows \n + +/(.*X|^B)/Ds 
+------------------------------------------------------------------ + 0 21 Bra 0 + 3 8 Bra 1 + 6 Any* + 8 1 X + 11 7 Alt + 14 ^ + 15 1 B + 18 15 Ket + 21 21 Ket + 24 End +------------------------------------------------------------------ +Identifying subpattern count = 1 +Options: anchored dotall +No first char + +/(?s)(.*X|^B)/D +------------------------------------------------------------------ + 0 21 Bra 0 + 3 8 Bra 1 + 6 Any* + 8 1 X + 11 7 Alt + 14 ^ + 15 1 B + 18 15 Ket + 21 21 Ket + 24 End +------------------------------------------------------------------ +Identifying subpattern count = 1 +Options: anchored dotall +No first char + +/(?s:.*X|^B)/D +------------------------------------------------------------------ + 0 27 Bra 0 + 3 10 Bra 0 + 6 04 Opt + 8 Any* + 10 1 X + 13 9 Alt + 16 04 Opt + 18 ^ + 19 1 B + 22 19 Ket + 25 00 Opt + 27 27 Ket + 30 End +------------------------------------------------------------------ +Identifying subpattern count = 0 +No options +First char at start or follows \n + +/ End of test input / +Identifying subpattern count = 0 +No options +First char = ' ' + diff --git a/ext/pcre/pcrelib/testoutput3 b/ext/pcre/pcrelib/testoutput3 new file mode 100644 index 0000000000..6d597cdb39 --- /dev/null +++ b/ext/pcre/pcrelib/testoutput3 @@ -0,0 +1,2832 @@ +PCRE version 2.05 21-Apr-1999 + +/(?<!bar)foo/ + foo + 0: foo + catfood + 0: foo + arfootle + 0: foo + rfoosh + 0: foo + *** Failers +No match + barfoo +No match + towbarfoo +No match + +/\w{3}(?<!bar)foo/ + catfood + 0: catfoo + *** Failers +No match + foo +No match + barfoo +No match + towbarfoo +No match + +/(?<=(foo)a)bar/ + fooabar + 0: bar + 1: foo + *** Failers +No match + bar +No match + foobbar +No match + +/\Aabc\z/m + abc + 0: abc + *** Failers +No match + abc\n +No match + qqq\nabc +No match + abc\nzzz +No match + qqq\nabc\nzzz +No match + +"(?>.*/)foo" + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ +No match + +"(?>.*/)foo" + 
/this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + 0: /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + +/(?>(\.\d\d[1-9]?))\d+/ + 1.230003938 + 0: .230003938 + 1: .23 + 1.875000282 + 0: .875000282 + 1: .875 + *** Failers +No match + 1.235 +No match + +/^((?>\w+)|(?>\s+))*$/ + now is the time for all good men to come to the aid of the party + 0: now is the time for all good men to come to the aid of the party + 1: party + *** Failers +No match + this is not a line with only words and spaces! +No match + +/(\d+)(\w)/ + 12345a + 0: 12345a + 1: 12345 + 2: a + 12345+ + 0: 12345 + 1: 1234 + 2: 5 + +/((?>\d+))(\w)/ + 12345a + 0: 12345a + 1: 12345 + 2: a + *** Failers +No match + 12345+ +No match + +/(?>a+)b/ + aaab + 0: aaab + +/((?>a+)b)/ + aaab + 0: aaab + 1: aaab + +/(?>(a+))b/ + aaab + 0: aaab + 1: aaa + +/(?>b)+/ + aaabbbccc + 0: bbb + +/(?>a+|b+|c+)*c/ + aaabbbbccccd + 0: aaabbbbc + +/((?>[^()]+)|\([^()]*\))+/ + ((abc(ade)ufh()()x + 0: abc(ade)ufh()()x + 1: x + +/\(((?>[^()]+)|\([^()]+\))+\)/ + (abc) + 0: (abc) + 1: abc + (abc(def)xyz) + 0: (abc(def)xyz) + 1: xyz + *** Failers +No match + ((()aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +No match + +/a(?-i)b/i + ab + 0: ab + *** Failers +No match + Ab +No match + aB +No match + AB +No match + +/(a (?x)b c)d e/ + a bcd e + 0: a bcd e + 1: a bc + *** Failers +No match + a b cd e +No match + abcd e +No match + a bcde +No match + +/(a b(?x)c d (?-x)e f)/ + a bcde f + 0: a bcde f + 1: a bcde f + *** Failers +No match + abcdef +No match + +/(a(?i)b)c/ + abc + 0: abc + 1: ab + aBc + 0: aBc + 1: aB + *** Failers +No match + abC +No match + aBC +No match + Abc +No match + ABc +No match + ABC +No match + AbC +No match + +/a(?i:b)c/ + abc + 0: abc + aBc + 0: aBc + *** Failers +No match + ABC +No match + abC +No match + aBC +No match + +/a(?i:b)*c/ + aBc + 0: aBc + aBBc + 0: aBBc + *** Failers +No match + aBC +No match + aBBC +No match + +/a(?=b(?i)c)\w\wd/ + abcd + 0: abcd + abCd + 0: 
abCd + *** Failers +No match + aBCd +No match + abcD +No match + +/(?s-i:more.*than).*million/i + more than million + 0: more than million + more than MILLION + 0: more than MILLION + more \n than Million + 0: more \x0a than Million + *** Failers +No match + MORE THAN MILLION +No match + more \n than \n million +No match + +/(?:(?s-i)more.*than).*million/i + more than million + 0: more than million + more than MILLION + 0: more than MILLION + more \n than Million + 0: more \x0a than Million + *** Failers +No match + MORE THAN MILLION +No match + more \n than \n million +No match + +/(?>a(?i)b+)+c/ + abc + 0: abc + aBbc + 0: aBbc + aBBc + 0: aBBc + *** Failers +No match + Abc +No match + abAb +No match + abbC +No match + +/(?=a(?i)b)\w\wc/ + abc + 0: abc + aBc + 0: aBc + *** Failers +No match + Ab +No match + abC +No match + aBC +No match + +/(?<=a(?i)b)(\w\w)c/ + abxxc + 0: xxc + 1: xx + aBxxc + 0: xxc + 1: xx + *** Failers +No match + Abxxc +No match + ABxxc +No match + abxxC +No match + +/(?:(a)|b)(?(1)A|B)/ + aA + 0: aA + 1: a + bB + 0: bB + *** Failers +No match + aB +No match + bA +No match + +/^(a)?(?(1)a|b)+$/ + aa + 0: aa + 1: a + b + 0: b + bb + 0: bb + *** Failers +No match + ab +No match + +/^(?(?=abc)\w{3}:|\d\d)$/ + abc: + 0: abc: + 12 + 0: 12 + *** Failers +No match + 123 +No match + xyz +No match + +/^(?(?!abc)\d\d|\w{3}:)$/ + abc: + 0: abc: + 12 + 0: 12 + *** Failers +No match + 123 +No match + xyz +No match + +/(?(?<=foo)bar|cat)/ + foobar + 0: bar + cat + 0: cat + fcat + 0: cat + focat + 0: cat + *** Failers +No match + foocat +No match + +/(?(?<!foo)cat|bar)/ + foobar + 0: bar + cat + 0: cat + fcat + 0: cat + focat + 0: cat + *** Failers +No match + foocat +No match + +/( \( )? [^()]+ (?(1) \) |) /x + abcd + 0: abcd + (abcd) + 0: (abcd) + 1: ( + the quick (abcd) fox + 0: the quick + (abcd + 0: abcd + +/( \( )? 
[^()]+ (?(1) \) ) /x + abcd + 0: abcd + (abcd) + 0: (abcd) + 1: ( + the quick (abcd) fox + 0: the quick + (abcd + 0: abcd + +/^(?(2)a|(1)(2))+$/ + 12 + 0: 12 + 1: 1 + 2: 2 + 12a + 0: 12a + 1: 1 + 2: 2 + 12aa + 0: 12aa + 1: 1 + 2: 2 + *** Failers +No match + 1234 +No match + +/((?i)blah)\s+\1/ + blah blah + 0: blah blah + 1: blah + BLAH BLAH + 0: BLAH BLAH + 1: BLAH + Blah Blah + 0: Blah Blah + 1: Blah + blaH blaH + 0: blaH blaH + 1: blaH + *** Failers +No match + blah BLAH +No match + Blah blah +No match + blaH blah +No match + +/((?i)blah)\s+(?i:\1)/ + blah blah + 0: blah blah + 1: blah + BLAH BLAH + 0: BLAH BLAH + 1: BLAH + Blah Blah + 0: Blah Blah + 1: Blah + blaH blaH + 0: blaH blaH + 1: blaH + blah BLAH + 0: blah BLAH + 1: blah + Blah blah + 0: Blah blah + 1: Blah + blaH blah + 0: blaH blah + 1: blaH + +/(?>a*)*/ + a + 0: a + aa + 0: aa + aaaa + 0: aaaa + +/(abc|)+/ + abc + 0: abc + 1: + abcabc + 0: abcabc + 1: + abcabcabc + 0: abcabcabc + 1: + xyz + 0: + 1: + +/([a]*)*/ + a + 0: a + 1: + aaaaa + 0: aaaaa + 1: + +/([ab]*)*/ + a + 0: a + 1: + b + 0: b + 1: + ababab + 0: ababab + 1: + aaaabcde + 0: aaaab + 1: + bbbb + 0: bbbb + 1: + +/([^a]*)*/ + b + 0: b + 1: + bbbb + 0: bbbb + 1: + aaa + 0: + 1: + +/([^ab]*)*/ + cccc + 0: cccc + 1: + abab + 0: + 1: + +/([a]*?)*/ + a + 0: + 1: + aaaa + 0: + 1: + +/([ab]*?)*/ + a + 0: + 1: + b + 0: + 1: + abab + 0: + 1: + baba + 0: + 1: + +/([^a]*?)*/ + b + 0: + 1: + bbbb + 0: + 1: + aaa + 0: + 1: + +/([^ab]*?)*/ + c + 0: + 1: + cccc + 0: + 1: + baba + 0: + 1: + +/(?>a*)*/ + a + 0: a + aaabcde + 0: aaa + +/((?>a*))*/ + aaaaa + 0: aaaaa + 1: + aabbaa + 0: aa + 1: + +/((?>a*?))*/ + aaaaa + 0: + 1: + aabbaa + 0: + 1: + +/(?(?=[^a-z]+[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) /x + 12-sep-98 + 0: 12-sep-98 + 12-09-98 + 0: 12-09-98 + *** Failers +No match + sep-12-98 +No match + +/(?<=(foo))bar\1/ + foobarfoo + 0: barfoo + 1: foo + foobarfootling + 0: barfoo + 1: foo + *** Failers +No match + foobar +No match + barfoo +No match 
+ +/(?i:saturday|sunday)/ + saturday + 0: saturday + sunday + 0: sunday + Saturday + 0: Saturday + Sunday + 0: Sunday + SATURDAY + 0: SATURDAY + SUNDAY + 0: SUNDAY + SunDay + 0: SunDay + +/(a(?i)bc|BB)x/ + abcx + 0: abcx + 1: abc + aBCx + 0: aBCx + 1: aBC + bbx + 0: bbx + 1: bb + BBx + 0: BBx + 1: BB + *** Failers +No match + abcX +No match + aBCX +No match + bbX +No match + BBX +No match + +/^([ab](?i)[cd]|[ef])/ + ac + 0: ac + 1: ac + aC + 0: aC + 1: aC + bD + 0: bD + 1: bD + elephant + 0: e + 1: e + Europe + 0: E + 1: E + frog + 0: f + 1: f + France + 0: F + 1: F + *** Failers +No match + Africa +No match + +/^(ab|a(?i)[b-c](?m-i)d|x(?i)y|z)/ + ab + 0: ab + 1: ab + aBd + 0: aBd + 1: aBd + xy + 0: xy + 1: xy + xY + 0: xY + 1: xY + zebra + 0: z + 1: z + Zambesi + 0: Z + 1: Z + *** Failers +No match + aCD +No match + XY +No match + +/(?<=foo\n)^bar/m + foo\nbar + 0: bar + *** Failers +No match + bar +No match + baz\nbar +No match + +/(?<=(?<!foo)bar)baz/ + barbaz + 0: baz + barbarbaz + 0: baz + koobarbaz + 0: baz + *** Failers +No match + baz +No match + foobarbaz +No match + +/The case of aaaaaa is missed out below because I think Perl 5.005_02 gets/ +/it wrong; it sets $1 to aaa rather than aa. 
Compare the following test,/ +No match +/where it does set $1 to aa when matching aaaaaa./ +No match + +/^(a\1?){4}$/ + a +No match + aa +No match + aaa +No match + aaaa + 0: aaaa + 1: a + aaaaa + 0: aaaaa + 1: a + aaaaaaa + 0: aaaaaaa + 1: a + aaaaaaaa +No match + aaaaaaaaa +No match + aaaaaaaaaa + 0: aaaaaaaaaa + 1: aaaa + aaaaaaaaaaa +No match + aaaaaaaaaaaa +No match + aaaaaaaaaaaaa +No match + aaaaaaaaaaaaaa +No match + aaaaaaaaaaaaaaa +No match + aaaaaaaaaaaaaaaa +No match + +/^(a\1?)(a\1?)(a\2?)(a\3?)$/ + a +No match + aa +No match + aaa +No match + aaaa + 0: aaaa + 1: a + 2: a + 3: a + 4: a + aaaaa + 0: aaaaa + 1: a + 2: aa + 3: a + 4: a + aaaaaa + 0: aaaaaa + 1: a + 2: aa + 3: a + 4: aa + aaaaaaa + 0: aaaaaaa + 1: a + 2: aa + 3: aaa + 4: a + aaaaaaaa +No match + aaaaaaaaa +No match + aaaaaaaaaa + 0: aaaaaaaaaa + 1: a + 2: aa + 3: aaa + 4: aaaa + aaaaaaaaaaa +No match + aaaaaaaaaaaa +No match + aaaaaaaaaaaaa +No match + aaaaaaaaaaaaaa +No match + aaaaaaaaaaaaaaa +No match + aaaaaaaaaaaaaaaa +No match + +/The following tests are taken from the Perl 5.005 test suite; some of them/ +/are compatible with 5.004, but I'd rather not have to sort them out./ +No match + +/abc/ + abc + 0: abc + xabcy + 0: abc + ababc + 0: abc + *** Failers +No match + xbc +No match + axc +No match + abx +No match + +/ab*c/ + abc + 0: abc + +/ab*bc/ + abc + 0: abc + abbc + 0: abbc + abbbbc + 0: abbbbc + +/.{1}/ + abbbbc + 0: a + +/.{3,4}/ + abbbbc + 0: abbb + +/ab{0,}bc/ + abbbbc + 0: abbbbc + +/ab+bc/ + abbc + 0: abbc + *** Failers +No match + abc +No match + abq +No match + +/ab{1,}bc/ + +/ab+bc/ + abbbbc + 0: abbbbc + +/ab{1,}bc/ + abbbbc + 0: abbbbc + +/ab{1,3}bc/ + abbbbc + 0: abbbbc + +/ab{3,4}bc/ + abbbbc + 0: abbbbc + +/ab{4,5}bc/ + *** Failers +No match + abq +No match + abbbbc +No match + +/ab?bc/ + abbc + 0: abbc + abc + 0: abc + +/ab{0,1}bc/ + abc + 0: abc + +/ab?bc/ + +/ab?c/ + abc + 0: abc + +/ab{0,1}c/ + abc + 0: abc + +/^abc$/ + abc + 0: abc + *** Failers +No match + 
abbbbc +No match + abcc +No match + +/^abc/ + abcc + 0: abc + +/^abc$/ + +/abc$/ + aabc + 0: abc + *** Failers +No match + aabc + 0: abc + aabcd +No match + +/^/ + abc + 0: + +/$/ + abc + 0: + +/a.c/ + abc + 0: abc + axc + 0: axc + +/a.*c/ + axyzc + 0: axyzc + +/a[bc]d/ + abd + 0: abd + *** Failers +No match + axyzd +No match + abc +No match + +/a[b-d]e/ + ace + 0: ace + +/a[b-d]/ + aac + 0: ac + +/a[-b]/ + a- + 0: a- + +/a[b-]/ + a- + 0: a- + +/a]/ + a] + 0: a] + +/a[]]b/ + a]b + 0: a]b + +/a[^bc]d/ + aed + 0: aed + *** Failers +No match + abd +No match + abd +No match + +/a[^-b]c/ + adc + 0: adc + +/a[^]b]c/ + adc + 0: adc + *** Failers +No match + a-c + 0: a-c + a]c +No match + +/\ba\b/ + a- + 0: a + -a + 0: a + -a- + 0: a + +/\by\b/ + *** Failers +No match + xy +No match + yz +No match + xyz +No match + +/\Ba\B/ + *** Failers + 0: a + a- +No match + -a +No match + -a- +No match + +/\By\b/ + xy + 0: y + +/\by\B/ + yz + 0: y + +/\By\B/ + xyz + 0: y + +/\w/ + a + 0: a + +/\W/ + - + 0: - + *** Failers + 0: * + - + 0: - + a +No match + +/a\sb/ + a b + 0: a b + +/a\Sb/ + a-b + 0: a-b + *** Failers +No match + a-b + 0: a-b + a b +No match + +/\d/ + 1 + 0: 1 + +/\D/ + - + 0: - + *** Failers + 0: * + - + 0: - + 1 +No match + +/[\w]/ + a + 0: a + +/[\W]/ + - + 0: - + *** Failers + 0: * + - + 0: - + a +No match + +/a[\s]b/ + a b + 0: a b + +/a[\S]b/ + a-b + 0: a-b + *** Failers +No match + a-b + 0: a-b + a b +No match + +/[\d]/ + 1 + 0: 1 + +/[\D]/ + - + 0: - + *** Failers + 0: * + - + 0: - + 1 +No match + +/ab|cd/ + abc + 0: ab + abcd + 0: ab + +/()ef/ + def + 0: ef + 1: + +/$b/ + +/a\(b/ + a(b + 0: a(b + +/a\(*b/ + ab + 0: ab + a((b + 0: a((b + +/a\\b/ + a\b +No match + +/((a))/ + abc + 0: a + 1: a + 2: a + +/(a)b(c)/ + abc + 0: abc + 1: a + 2: c + +/a+b+c/ + aabbabc + 0: abc + +/a{1,}b{1,}c/ + aabbabc + 0: abc + +/a.+?c/ + abcabc + 0: abc + +/(a+|b)*/ + ab + 0: ab + 1: b + +/(a+|b){0,}/ + ab + 0: ab + 1: b + +/(a+|b)+/ + ab + 0: ab + 1: b + +/(a+|b){1,}/ + ab + 0: ab + 
1: b + +/(a+|b)?/ + ab + 0: a + 1: a + +/(a+|b){0,1}/ + ab + 0: a + 1: a + +/[^ab]*/ + cde + 0: cde + +/abc/ + *** Failers +No match + b +No match + + +/a*/ + + +/([abc])*d/ + abbbcd + 0: abbbcd + 1: c + +/([abc])*bcd/ + abcd + 0: abcd + 1: a + +/a|b|c|d|e/ + e + 0: e + +/(a|b|c|d|e)f/ + ef + 0: ef + 1: e + +/abcd*efg/ + abcdefg + 0: abcdefg + +/ab*/ + xabyabbbz + 0: ab + xayabbbz + 0: a + +/(ab|cd)e/ + abcde + 0: cde + 1: cd + +/[abhgefdc]ij/ + hij + 0: hij + +/^(ab|cd)e/ + +/(abc|)ef/ + abcdef + 0: ef + 1: + +/(a|b)c*d/ + abcd + 0: bcd + 1: b + +/(ab|ab*)bc/ + abc + 0: abc + 1: a + +/a([bc]*)c*/ + abc + 0: abc + 1: bc + +/a([bc]*)(c*d)/ + abcd + 0: abcd + 1: bc + 2: d + +/a([bc]+)(c*d)/ + abcd + 0: abcd + 1: bc + 2: d + +/a([bc]*)(c+d)/ + abcd + 0: abcd + 1: b + 2: cd + +/a[bcd]*dcdcde/ + adcdcde + 0: adcdcde + +/a[bcd]+dcdcde/ + *** Failers +No match + abcde +No match + adcdcde +No match + +/(ab|a)b*c/ + abc + 0: abc + 1: ab + +/((a)(b)c)(d)/ + abcd + 0: abcd + 1: abc + 2: a + 3: b + 4: d + +/[a-zA-Z_][a-zA-Z0-9_]*/ + alpha + 0: alpha + +/^a(bc+|b[eh])g|.h$/ + abh + 0: bh + +/(bc+d$|ef*g.|h?i(j|k))/ + effgz + 0: effgz + 1: effgz + ij + 0: ij + 1: ij + 2: j + reffgz + 0: effgz + 1: effgz + *** Failers +No match + effg +No match + bcdd +No match + +/((((((((((a))))))))))/ + a + 0: a + 1: a + 2: a + 3: a + 4: a + 5: a + 6: a + 7: a + 8: a + 9: a +10: a + +/((((((((((a))))))))))\10/ + aa + 0: aa + 1: a + 2: a + 3: a + 4: a + 5: a + 6: a + 7: a + 8: a + 9: a +10: a + +/(((((((((a)))))))))/ + a + 0: a + 1: a + 2: a + 3: a + 4: a + 5: a + 6: a + 7: a + 8: a + 9: a + +/multiple words of text/ + *** Failers +No match + aa +No match + uh-uh +No match + +/multiple words/ + multiple words, yeah + 0: multiple words + +/(.*)c(.*)/ + abcde + 0: abcde + 1: ab + 2: de + +/\((.*), (.*)\)/ + (a, b) + 0: (a, b) + 1: a + 2: b + +/[k]/ + +/abcd/ + abcd + 0: abcd + +/a(bc)d/ + abcd + 0: abcd + 1: bc + +/a[-]?c/ + ac + 0: ac + +/(abc)\1/ + abcabc + 0: abcabc + 1: abc + +/([a-c]*)\1/ + 
abcabc + 0: abcabc + 1: abc + +/(a)|\1/ + a + 0: a + 1: a + *** Failers + 0: a + 1: a + ab + 0: a + 1: a + x +No match + +/(([a-c])b*?\2)*/ + ababbbcbc + 0: ababb + 1: bb + 2: b + +/(([a-c])b*?\2){3}/ + ababbbcbc + 0: ababbbcbc + 1: cbc + 2: c + +/((\3|b)\2(a)x)+/ + aaaxabaxbaaxbbax + 0: bbax + 1: bbax + 2: b + 3: a + +/((\3|b)\2(a)){2,}/ + bbaababbabaaaaabbaaaabba + 0: bbaaaabba + 1: bba + 2: b + 3: a + +/abc/i + ABC + 0: ABC + XABCY + 0: ABC + ABABC + 0: ABC + *** Failers +No match + aaxabxbaxbbx +No match + XBC +No match + AXC +No match + ABX +No match + +/ab*c/i + ABC + 0: ABC + +/ab*bc/i + ABC + 0: ABC + ABBC + 0: ABBC + +/ab*?bc/i + ABBBBC + 0: ABBBBC + +/ab{0,}?bc/i + ABBBBC + 0: ABBBBC + +/ab+?bc/i + ABBC + 0: ABBC + +/ab+bc/i + *** Failers +No match + ABC +No match + ABQ +No match + +/ab{1,}bc/i + +/ab+bc/i + ABBBBC + 0: ABBBBC + +/ab{1,}?bc/i + ABBBBC + 0: ABBBBC + +/ab{1,3}?bc/i + ABBBBC + 0: ABBBBC + +/ab{3,4}?bc/i + ABBBBC + 0: ABBBBC + +/ab{4,5}?bc/i + *** Failers +No match + ABQ +No match + ABBBBC +No match + +/ab??bc/i + ABBC + 0: ABBC + ABC + 0: ABC + +/ab{0,1}?bc/i + ABC + 0: ABC + +/ab??bc/i + +/ab??c/i + ABC + 0: ABC + +/ab{0,1}?c/i + ABC + 0: ABC + +/^abc$/i + ABC + 0: ABC + *** Failers +No match + ABBBBC +No match + ABCC +No match + +/^abc/i + ABCC + 0: ABC + +/^abc$/i + +/abc$/i + AABC + 0: ABC + +/^/i + ABC + 0: + +/$/i + ABC + 0: + +/a.c/i + ABC + 0: ABC + AXC + 0: AXC + +/a.*?c/i + AXYZC + 0: AXYZC + +/a.*c/i + *** Failers +No match + AABC + 0: AABC + AXYZD +No match + +/a[bc]d/i + ABD + 0: ABD + +/a[b-d]e/i + ACE + 0: ACE + *** Failers +No match + ABC +No match + ABD +No match + +/a[b-d]/i + AAC + 0: AC + +/a[-b]/i + A- + 0: A- + +/a[b-]/i + A- + 0: A- + +/a]/i + A] + 0: A] + +/a[]]b/i + A]B + 0: A]B + +/a[^bc]d/i + AED + 0: AED + +/a[^-b]c/i + ADC + 0: ADC + *** Failers +No match + ABD +No match + A-C +No match + +/a[^]b]c/i + ADC + 0: ADC + +/ab|cd/i + ABC + 0: AB + ABCD + 0: AB + +/()ef/i + DEF + 0: EF + 1: + +/$b/i + *** Failers +No 
match + A]C +No match + B +No match + +/a\(b/i + A(B + 0: A(B + +/a\(*b/i + AB + 0: AB + A((B + 0: A((B + +/a\\b/i + A\B +No match + +/((a))/i + ABC + 0: A + 1: A + 2: A + +/(a)b(c)/i + ABC + 0: ABC + 1: A + 2: C + +/a+b+c/i + AABBABC + 0: ABC + +/a{1,}b{1,}c/i + AABBABC + 0: ABC + +/a.+?c/i + ABCABC + 0: ABC + +/a.*?c/i + ABCABC + 0: ABC + +/a.{0,5}?c/i + ABCABC + 0: ABC + +/(a+|b)*/i + AB + 0: AB + 1: B + +/(a+|b){0,}/i + AB + 0: AB + 1: B + +/(a+|b)+/i + AB + 0: AB + 1: B + +/(a+|b){1,}/i + AB + 0: AB + 1: B + +/(a+|b)?/i + AB + 0: A + 1: A + +/(a+|b){0,1}/i + AB + 0: A + 1: A + +/(a+|b){0,1}?/i + AB + 0: + +/[^ab]*/i + CDE + 0: CDE + +/abc/i + +/a*/i + + +/([abc])*d/i + ABBBCD + 0: ABBBCD + 1: C + +/([abc])*bcd/i + ABCD + 0: ABCD + 1: A + +/a|b|c|d|e/i + E + 0: E + +/(a|b|c|d|e)f/i + EF + 0: EF + 1: E + +/abcd*efg/i + ABCDEFG + 0: ABCDEFG + +/ab*/i + XABYABBBZ + 0: AB + XAYABBBZ + 0: A + +/(ab|cd)e/i + ABCDE + 0: CDE + 1: CD + +/[abhgefdc]ij/i + HIJ + 0: HIJ + +/^(ab|cd)e/i + ABCDE +No match + +/(abc|)ef/i + ABCDEF + 0: EF + 1: + +/(a|b)c*d/i + ABCD + 0: BCD + 1: B + +/(ab|ab*)bc/i + ABC + 0: ABC + 1: A + +/a([bc]*)c*/i + ABC + 0: ABC + 1: BC + +/a([bc]*)(c*d)/i + ABCD + 0: ABCD + 1: BC + 2: D + +/a([bc]+)(c*d)/i + ABCD + 0: ABCD + 1: BC + 2: D + +/a([bc]*)(c+d)/i + ABCD + 0: ABCD + 1: B + 2: CD + +/a[bcd]*dcdcde/i + ADCDCDE + 0: ADCDCDE + +/a[bcd]+dcdcde/i + +/(ab|a)b*c/i + ABC + 0: ABC + 1: AB + +/((a)(b)c)(d)/i + ABCD + 0: ABCD + 1: ABC + 2: A + 3: B + 4: D + +/[a-zA-Z_][a-zA-Z0-9_]*/i + ALPHA + 0: ALPHA + +/^a(bc+|b[eh])g|.h$/i + ABH + 0: BH + +/(bc+d$|ef*g.|h?i(j|k))/i + EFFGZ + 0: EFFGZ + 1: EFFGZ + IJ + 0: IJ + 1: IJ + 2: J + REFFGZ + 0: EFFGZ + 1: EFFGZ + *** Failers +No match + ADCDCDE +No match + EFFG +No match + BCDD +No match + +/((((((((((a))))))))))/i + A + 0: A + 1: A + 2: A + 3: A + 4: A + 5: A + 6: A + 7: A + 8: A + 9: A +10: A + +/((((((((((a))))))))))\10/i + AA + 0: AA + 1: A + 2: A + 3: A + 4: A + 5: A + 6: A + 7: A + 8: A + 9: A +10: A + 
+/(((((((((a)))))))))/i + A + 0: A + 1: A + 2: A + 3: A + 4: A + 5: A + 6: A + 7: A + 8: A + 9: A + +/(?:(?:(?:(?:(?:(?:(?:(?:(?:(a))))))))))/i + A + 0: A + 1: A + +/(?:(?:(?:(?:(?:(?:(?:(?:(?:(a|b|c))))))))))/i + C + 0: C + 1: C + +/multiple words of text/i + *** Failers +No match + AA +No match + UH-UH +No match + +/multiple words/i + MULTIPLE WORDS, YEAH + 0: MULTIPLE WORDS + +/(.*)c(.*)/i + ABCDE + 0: ABCDE + 1: AB + 2: DE + +/\((.*), (.*)\)/i + (A, B) + 0: (A, B) + 1: A + 2: B + +/[k]/i + +/abcd/i + ABCD + 0: ABCD + +/a(bc)d/i + ABCD + 0: ABCD + 1: BC + +/a[-]?c/i + AC + 0: AC + +/(abc)\1/i + ABCABC + 0: ABCABC + 1: ABC + +/([a-c]*)\1/i + ABCABC + 0: ABCABC + 1: ABC + +/a(?!b)./ + abad + 0: ad + +/a(?=d)./ + abad + 0: ad + +/a(?=c|d)./ + abad + 0: ad + +/a(?:b|c|d)(.)/ + ace + 0: ace + 1: e + +/a(?:b|c|d)*(.)/ + ace + 0: ace + 1: e + +/a(?:b|c|d)+?(.)/ + ace + 0: ace + 1: e + acdbcdbe + 0: acd + 1: d + +/a(?:b|c|d)+(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + +/a(?:b|c|d){2}(.)/ + acdbcdbe + 0: acdb + 1: b + +/a(?:b|c|d){4,5}(.)/ + acdbcdbe + 0: acdbcdb + 1: b + +/a(?:b|c|d){4,5}?(.)/ + acdbcdbe + 0: acdbcd + 1: d + +/((foo)|(bar))*/ + foobar + 0: foobar + 1: bar + 2: foo + 3: bar + +/a(?:b|c|d){6,7}(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + +/a(?:b|c|d){6,7}?(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + +/a(?:b|c|d){5,6}(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + +/a(?:b|c|d){5,6}?(.)/ + acdbcdbe + 0: acdbcdb + 1: b + +/a(?:b|c|d){5,7}(.)/ + acdbcdbe + 0: acdbcdbe + 1: e + +/a(?:b|c|d){5,7}?(.)/ + acdbcdbe + 0: acdbcdb + 1: b + +/a(?:b|(c|e){1,2}?|d)+?(.)/ + ace + 0: ace + 1: c + 2: e + +/^(.+)?B/ + AB + 0: AB + 1: A + +/^([^a-z])|(\^)$/ + . + 0: . + 1: . 
+ +/^[<>]&/ + <&OUT + 0: <& + +/^(a\1?){4}$/ + aaaaaaaaaa + 0: aaaaaaaaaa + 1: aaaa + *** Failers +No match + AB +No match + aaaaaaaaa +No match + aaaaaaaaaaa +No match + +/^(a(?(1)\1)){4}$/ + aaaaaaaaaa + 0: aaaaaaaaaa + 1: aaaa + *** Failers +No match + aaaaaaaaa +No match + aaaaaaaaaaa +No match + +/(?:(f)(o)(o)|(b)(a)(r))*/ + foobar + 0: foobar + 1: f + 2: o + 3: o + 4: b + 5: a + 6: r + +/(?<=a)b/ + ab + 0: b + *** Failers +No match + cb +No match + b +No match + +/(?<!c)b/ + ab + 0: b + b + 0: b + b + 0: b + +/(?:..)*a/ + aba + 0: aba + +/(?:..)*?a/ + aba + 0: a + +/^(?:b|a(?=(.)))*\1/ + abc + 0: ab + 1: b + +/^(){3,5}/ + abc + 0: + 1: + +/^(a+)*ax/ + aax + 0: aax + 1: a + +/^((a|b)+)*ax/ + aax + 0: aax + 1: a + 2: a + +/^((a|bc)+)*ax/ + aax + 0: aax + 1: a + 2: a + +/(a|x)*ab/ + cab + 0: ab + +/(a)*ab/ + cab + 0: ab + +/(?:(?i)a)b/ + ab + 0: ab + +/((?i)a)b/ + ab + 0: ab + 1: a + +/(?:(?i)a)b/ + Ab + 0: Ab + +/((?i)a)b/ + Ab + 0: Ab + 1: A + +/(?:(?i)a)b/ + *** Failers +No match + cb +No match + aB +No match + +/((?i)a)b/ + +/(?i:a)b/ + ab + 0: ab + +/((?i:a))b/ + ab + 0: ab + 1: a + +/(?i:a)b/ + Ab + 0: Ab + +/((?i:a))b/ + Ab + 0: Ab + 1: A + +/(?i:a)b/ + *** Failers +No match + aB +No match + aB +No match + +/((?i:a))b/ + +/(?:(?-i)a)b/i + ab + 0: ab + +/((?-i)a)b/i + ab + 0: ab + 1: a + +/(?:(?-i)a)b/i + aB + 0: aB + +/((?-i)a)b/i + aB + 0: aB + 1: a + +/(?:(?-i)a)b/i + *** Failers +No match + aB + 0: aB + Ab +No match + +/((?-i)a)b/i + +/(?:(?-i)a)b/i + aB + 0: aB + +/((?-i)a)b/i + aB + 0: aB + 1: a + +/(?:(?-i)a)b/i + *** Failers +No match + Ab +No match + AB +No match + +/((?-i)a)b/i + +/(?-i:a)b/i + ab + 0: ab + +/((?-i:a))b/i + ab + 0: ab + 1: a + +/(?-i:a)b/i + aB + 0: aB + +/((?-i:a))b/i + aB + 0: aB + 1: a + +/(?-i:a)b/i + *** Failers +No match + AB +No match + Ab +No match + +/((?-i:a))b/i + +/(?-i:a)b/i + aB + 0: aB + +/((?-i:a))b/i + aB + 0: aB + 1: a + +/(?-i:a)b/i + *** Failers +No match + Ab +No match + AB +No match + +/((?-i:a))b/i + 
+/((?-i:a.))b/i + *** Failers +No match + AB +No match + a\nB +No match + +/((?s-i:a.))b/i + a\nB + 0: a\x0aB + 1: a\x0a + +/(?:c|d)(?:)(?:a(?:)(?:b)(?:b(?:))(?:b(?:)(?:b)))/ + cabbbb + 0: cabbbb + +/(?:c|d)(?:)(?:aaaaaaaa(?:)(?:bbbbbbbb)(?:bbbbbbbb(?:))(?:bbbbbbbb(?:)(?:bbbbbbbb)))/ + caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb + 0: caaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb + +/(ab)\d\1/i + Ab4ab + 0: Ab4ab + 1: Ab + ab4Ab + 0: ab4Ab + 1: ab + +/foo\w*\d{4}baz/ + foobar1234baz + 0: foobar1234baz + +/x(~~)*(?:(?:F)?)?/ + x~~ + 0: x~~ + 1: ~~ + +/^a(?#xxx){3}c/ + aaac + 0: aaac + +/^a (?#xxx) (?#yyy) {3}c/x + aaac + 0: aaac + +/(?<![cd])b/ + *** Failers +No match + B\nB +No match + dbcb +No match + +/(?<![cd])[ab]/ + dbaacb + 0: a + +/(?<!(c|d))b/ + +/(?<!(c|d))[ab]/ + dbaacb + 0: a + +/(?<!cd)[ab]/ + cdaccb + 0: b + +/^(?:a?b?)*$/ + *** Failers +No match + dbcb +No match + a-- +No match + +/((?s)^a(.))((?m)^b$)/ + a\nb\nc\n + 0: a\x0ab + 1: a\x0a + 2: \x0a + 3: b + +/((?m)^b$)/ + a\nb\nc\n + 0: b + 1: b + +/(?m)^b/ + a\nb\n + 0: b + +/(?m)^(b)/ + a\nb\n + 0: b + 1: b + +/((?m)^b)/ + a\nb\n + 0: b + 1: b + +/\n((?m)^b)/ + a\nb\n + 0: \x0ab + 1: b + +/((?s).)c(?!.)/ + a\nb\nc\n + 0: \x0ac + 1: \x0a + a\nb\nc\n + 0: \x0ac + 1: \x0a + +/((?s)b.)c(?!.)/ + a\nb\nc\n + 0: b\x0ac + 1: b\x0a + a\nb\nc\n + 0: b\x0ac + 1: b\x0a + +/^b/ + +/()^b/ + *** Failers +No match + a\nb\nc\n +No match + a\nb\nc\n +No match + +/((?m)^b)/ + a\nb\nc\n + 0: b + 1: b + +/(?(1)a|b)/ + +/(?(1)b|a)/ + a + 0: a + +/(x)?(?(1)a|b)/ + *** Failers +No match + a +No match + a +No match + +/(x)?(?(1)b|a)/ + a + 0: a + +/()?(?(1)b|a)/ + a + 0: a + +/()(?(1)b|a)/ + +/()?(?(1)a|b)/ + a + 0: a + 1: + +/^(\()?blah(?(1)(\)))$/ + (blah) + 0: (blah) + 1: ( + 2: ) + blah + 0: blah + *** Failers +No match + a +No match + blah) +No match + (blah +No match + +/^(\(+)?blah(?(1)(\)))$/ + (blah) + 0: (blah) + 1: ( + 2: ) + blah + 0: blah + *** Failers +No match + blah) +No match + (blah +No match + +/(?(?!a)a|b)/ + 
+/(?(?!a)b|a)/ + a + 0: a + +/(?(?=a)b|a)/ + *** Failers +No match + a +No match + a +No match + +/(?(?=a)a|b)/ + a + 0: a + +/(?=(a+?))(\1ab)/ + aaab + 0: aab + 1: a + 2: aab + +/^(?=(a+?))\1ab/ + +/(\w+:)+/ + one: + 0: one: + 1: one: + +/$(?<=^(a))/ + a + 0: + 1: a + +/(?=(a+?))(\1ab)/ + aaab + 0: aab + 1: a + 2: aab + +/^(?=(a+?))\1ab/ + *** Failers +No match + aaab +No match + aaab +No match + +/([\w:]+::)?(\w+)$/ + abcd + 0: abcd + 1: <unset> + 2: abcd + xy:z:::abcd + 0: xy:z:::abcd + 1: xy:z::: + 2: abcd + +/^[^bcd]*(c+)/ + aexycd + 0: aexyc + 1: c + +/(a*)b+/ + caab + 0: aab + 1: aa + +/([\w:]+::)?(\w+)$/ + abcd + 0: abcd + 1: <unset> + 2: abcd + xy:z:::abcd + 0: xy:z:::abcd + 1: xy:z::: + 2: abcd + *** Failers + 0: Failers + 1: <unset> + 2: Failers + abcd: +No match + abcd: +No match + +/^[^bcd]*(c+)/ + aexycd + 0: aexyc + 1: c + +/(>a+)ab/ + +/(?>a+)b/ + aaab + 0: aaab + +/([[:]+)/ + a:[b]: + 0: :[ + 1: :[ + +/([[=]+)/ + a=[b]= + 0: =[ + 1: =[ + +/([[.]+)/ + a.[b]. + 0: .[ + 1: .[ + +/((?>a+)b)/ + aaab + 0: aaab + 1: aaab + +/(?>(a+))b/ + aaab + 0: aaab + 1: aaa + +/((?>[^()]+)|\([^()]*\))+/ + ((abc(ade)ufh()()x + 0: abc(ade)ufh()()x + 1: x + +/a\Z/ + *** Failers +No match + aaab +No match + a\nb\n +No match + +/b\Z/ + a\nb\n + 0: b + +/b\z/ + +/b\Z/ + a\nb + 0: b + +/b\z/ + a\nb + 0: b + *** Failers +No match + +/^(?>(?(1)\.|())[^\W_](?>[a-z0-9-]*[^\W_])?)+$/ + a + 0: a + 1: + abc + 0: abc + 1: + a-b + 0: a-b + 1: + 0-9 + 0: 0-9 + 1: + a.b + 0: a.b + 1: + 5.6.7 + 0: 5.6.7 + 1: + the.quick.brown.fox + 0: the.quick.brown.fox + 1: + a100.b200.300c + 0: a100.b200.300c + 1: + 12-ab.1245 + 0: 12-ab.1245 + 1: + ***Failers +No match + \ +No match + .a +No match + -a +No match + a- +No match + a. +No match + a_b +No match + a.- +No match + a.. +No match + ab..bc +No match + the.quick.brown.fox- +No match + the.quick.brown.fox. 
+No match + the.quick.brown.fox_ +No match + the.quick.brown.fox+ +No match + +/(?>.*)(?<=(abcd|wxyz))/ + alphabetabcd + 0: alphabetabcd + 1: abcd + endingwxyz + 0: endingwxyz + 1: wxyz + *** Failers +No match + a rather long string that doesn't end with one of them +No match + +/word (?>(?:(?!otherword)[a-zA-Z0-9]+ ){0,30})otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark otherword + 0: word cat dog elephant mussel cow horse canary baboon snake shark otherword + word cat dog elephant mussel cow horse canary baboon snake shark +No match + +/word (?>[a-zA-Z0-9]+ ){0,30}otherword/ + word cat dog elephant mussel cow horse canary baboon snake shark the quick brown fox and the lazy dog and several other words getting close to thirty by now I hope +No match + +/ End of test input / + diff --git a/ext/pcre/pcrelib/testoutput4 b/ext/pcre/pcrelib/testoutput4 new file mode 100644 index 0000000000..0e156c497f --- /dev/null +++ b/ext/pcre/pcrelib/testoutput4 @@ -0,0 +1,113 @@ +PCRE version 2.05 21-Apr-1999 + +/^[\w]+/ + *** Failers +No match + École +No match + +/^[\w]+/Lfr + École + 0: École + +/^[\w]+/ + *** Failers +No match + École +No match + +/^[\W]+/ + École + 0: \xc9 + +/^[\W]+/Lfr + *** Failers + 0: *** + École +No match + +/[\b]/ + \b + 0: \x08 + *** Failers +No match + a +No match + +/[\b]/Lfr + \b + 0: \x08 + *** Failers +No match + a +No match + +/^\w+/ + *** Failers +No match + École +No match + +/^\w+/Lfr + École + 0: École + +/(.+)\b(.+)/ + École + 0: \xc9cole + 1: \xc9 + 2: cole + +/(.+)\b(.+)/Lfr + *** Failers + 0: *** Failers + 1: *** + 2: Failers + École +No match + +/École/i + École + 0: \xc9cole + *** Failers +No match + école +No match + +/École/iLfr + École + 0: École + école + 0: école + +/\w/IS +Identifying subpattern count = 0 +No options +No first char +Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P + Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z + +/\w/ISLfr +Identifying subpattern 
count = 0 +No options +No first char +Starting character set: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P + Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z + + + +/^[\xc8-\xc9]/iLfr + École + 0: É + école + 0: é + +/^[\xc8-\xc9]/Lfr + École + 0: É + *** Failers +No match + école +No match + +