author | nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2007-02-24 21:38:01 +0000 |
---|---|---|
committer | nigel <nigel@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2007-02-24 21:38:01 +0000 |
commit | beb7de117d08845a5d039d4ad4ffab8f5bac61fe (patch) | |
tree | bdef90c67b6a880b5bb7f91568ab7fe41d508ccc | |
parent | 18f0553f2b954c484ee6d426a4d0a3e062caaed6 (diff) | |
download | pcre-beb7de117d08845a5d039d4ad4ffab8f5bac61fe.tar.gz |
Load pcre-1.00 into code/trunk.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@3 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- | ChangeLog | 173 |
-rw-r--r-- | Makefile | 73 |
-rw-r--r-- | Performance | 172 |
-rw-r--r-- | README | 233 |
-rw-r--r-- | Tech.Notes | 199 |
-rw-r--r-- | internal.h | 281 |
-rw-r--r-- | maketables.c | 157 |
-rw-r--r-- | pcre.3 | 1017 |
-rw-r--r-- | pcre.c | 3510 |
-rw-r--r-- | pcre.h | 57 |
-rw-r--r-- | pcreposix.3 | 135 |
-rw-r--r-- | pcreposix.c | 246 |
-rw-r--r-- | pcreposix.h | 72 |
-rw-r--r-- | pcretest.c | 771 |
-rwxr-xr-x | perltest | 143 |
-rw-r--r-- | pgrep.1 | 72 |
-rw-r--r-- | pgrep.c | 220 |
-rw-r--r-- | study.c | 337 |
-rw-r--r-- | testinput | 1551 |
-rw-r--r-- | testinput2 | 244 |
-rw-r--r-- | testoutput | 2298 |
-rw-r--r-- | testoutput2 | 573 |
22 files changed, 12534 insertions, 0 deletions
diff --git a/ChangeLog b/ChangeLog new file mode 100644 index 0000000..0227726 --- /dev/null +++ b/ChangeLog @@ -0,0 +1,173 @@ +ChangeLog for PCRE +------------------ + +Version 0.99 27-Oct-97 +---------------------- + +1. Fixed bug in code for optimizing classes with only one character. It was +initializing a 32-byte map regardless, which could cause it to run off the end +of the memory it had got. + +2. Added, conditional on PCRE_EXTRA, the proposed (?>REGEX) construction. + + +Version 0.98 22-Oct-97 +---------------------- + +1. Fixed bug in code for handling temporary memory usage when there are more +back references than supplied space in the ovector. This could cause segfaults. + + +Version 0.97 21-Oct-97 +---------------------- + +1. Added the \X "cut" facility, conditional on PCRE_EXTRA. + +2. Optimized negated single characters not to use a bit map. + +3. Brought error texts together as macro definitions; clarified some of them; +fixed one that was wrong - it said "range out of order" when it meant "invalid +escape sequence". + +4. Changed some char * arguments to const char *. + +5. Added PCRE_NOTBOL and PCRE_NOTEOL (from POSIX). + +6. Added the POSIX-style API wrapper in pcreposix.a and testing facilities in +pcretest. + + +Version 0.96 16-Oct-97 +---------------------- + +1. Added a simple "pgrep" utility to the distribution. + +2. Fixed an incompatibility with Perl: "{" is now treated as a normal character +unless it appears in one of the precise forms "{ddd}", "{ddd,}", or "{ddd,ddd}" +where "ddd" means "one or more decimal digits". + +3. Fixed serious bug. If a pattern had a back reference, but the call to +pcre_exec() didn't supply a large enough ovector to record the related +identifying subpattern, the match always failed. PCRE now remembers the number +of the largest back reference, and gets some temporary memory in which to save +the offsets during matching if necessary, in order to ensure that +backreferences always work. + +4. 
Increased the compatibility with Perl in a number of ways: + + (a) . no longer matches \n by default; an option PCRE_DOTALL is provided + to request this handling. The option can be set at compile or exec time. + + (b) $ matches before a terminating newline by default; an option + PCRE_DOLLAR_ENDONLY is provided to override this (but not in multiline + mode). The option can be set at compile or exec time. + + (c) The handling of \ followed by a digit other than 0 is now supposed to be + the same as Perl's. If the decimal number it represents is less than 10 + or there aren't that many previous left capturing parentheses, an octal + escape is read. Inside a character class, it's always an octal escape, + even if it is a single digit. + + (d) An escaped but undefined alphabetic character is taken as a literal, + unless PCRE_EXTRA is set. Currently this just reserves the remaining + escapes. + + (e) {0} is now permitted. (The previous item is removed from the compiled + pattern). + +5. Changed all the names of code files so that the basic parts are no longer +than 10 characters, and abolished the teeny "globals.c" file. + +6. Changed the handling of character classes; they are now done with a 32-byte +bit map always. + +7. Added the -d and /D options to pcretest to make it possible to look at the +internals of compilation without having to recompile pcre. + + +Version 0.95 23-Sep-97 +---------------------- + +1. Fixed bug in pre-pass concerning escaped "normal" characters such as \x5c or +\x20 at the start of a run of normal characters. These were being treated as +real characters, instead of the source characters being re-checked. + + +Version 0.94 18-Sep-97 +---------------------- + +1. The functions are now thread-safe, with the caveat that the global variables +containing pointers to malloc() and free() or alternative functions are the +same for all threads. + +2. 
Get pcre_study() to generate a bitmap of initial characters for non- +anchored patterns when this is possible, and use it if passed to pcre_exec(). + + +Version 0.93 15-Sep-97 +---------------------- + +1. /(b)|(:+)/ was computing an incorrect first character. + +2. Add pcre_study() to the API and the passing of pcre_extra to pcre_exec(), +but not actually doing anything yet. + +3. Treat "-" characters in classes that cannot be part of ranges as literals, +as Perl does (e.g. [-az] or [az-]). + +4. Set the anchored flag if a branch starts with .* or .*? because that tests +all possible positions. + +5. Split up into different modules to avoid including unneeded functions in a +compiled binary. However, compile and exec are still in one module. The "study" +function is split off. + +6. The character tables are now in a separate module whose source is generated +by an auxiliary program - but can then be edited by hand if required. There are +now no calls to isalnum(), isspace(), isdigit(), isxdigit(), tolower() or +toupper() in the code. + +7. Turn the malloc/free function variables into pcre_malloc and pcre_free and +make them global. Abolish the function for setting them, as the caller can now +set them directly. + + +Version 0.92 11-Sep-97 +---------------------- + +1. A repeat with a fixed maximum and a minimum of 1 for an ordinary character +(e.g. /a{1,3}/) was broken (I mis-optimized it). + +2. Caseless matching was not working in character classes if the characters in +the pattern were in upper case. + +3. Make ranges like [W-c] work in the same way as Perl for caseless matching. + +4. Make PCRE_ANCHORED public and accept as a compile option. + +5. Add an options word to pcre_exec() and accept PCRE_ANCHORED and +PCRE_CASELESS at run time. Add escapes \A and \I to pcretest to cause it to +pass them. + +6. Give an error if bad option bits passed at compile or run time. + +7. Add PCRE_MULTILINE at compile and exec time, and (?m) as well. 
Add \M to +pcretest to cause it to pass that flag. + +8. Add pcre_info(), to get the number of identifying subpatterns, the stored +options, and the first character, if set. + +9. Recognize C+ or C{n,m} where n >= 1 as providing a fixed starting character. + + +Version 0.91 10-Sep-97 +---------------------- + +1. PCRE was failing to diagnose unlimited repeats of subpatterns that could +match the empty string as in /(a*)*/. It was looping and ultimately crashing. + +2. PCRE was looping on encountering an indefinitely repeated back reference to +a subpattern that had matched an empty string, e.g. /(a|)\1*/. It now does what +Perl does - treats the match as successful. + +**** diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..41d7523 --- /dev/null +++ b/Makefile @@ -0,0 +1,73 @@ +# Make file for PCRE (Perl-Compatible Regular Expression) library. + +# Edit CC , CFLAGS, and RANLIB for your system. + +# It is believed that RANLIB=ranlib is required for AIX, BSDI, FreeBSD, Linux, +# MIPS RISCOS, NetBSD, OpenBSD, Digital Unix, and Ultrix. + +# Use CFLAGS = -DUSE_BCOPY on SunOS4 and any other system that lacks the +# memmove() function, but has bcopy(). + +# Use CFLAGS = -DSTRERROR_FROM_ERRLIST on SunOS4 and any other system that +# lacks the strerror() function, but can provide the equivalent by indexing +# into errlist. 
+ +CC = gcc -O +CFLAGS = +RANLIB = @true + +########################################################################## + +OBJ = chartables.o study.o pcre.o + +all: libpcre.a libpcreposix.a pcretest pgrep + +pgrep: libpcre.a pgrep.o + $(CC) $(CFLAGS) -o pgrep pgrep.o libpcre.a + +pcretest: libpcre.a libpcreposix.a pcretest.o + $(CC) $(CFLAGS) -o pcretest pcretest.o libpcre.a libpcreposix.a + +libpcre.a: $(OBJ) + /bin/rm -f libpcre.a + ar cq libpcre.a $(OBJ) + $(RANLIB) libpcre.a + +libpcreposix.a: pcreposix.o + /bin/rm -f libpcreposix.a + ar cq libpcreposix.a pcreposix.o + $(RANLIB) libpcreposix.a + +pcre.o: pcre.c pcre.h internal.h + $(CC) -c $(CFLAGS) pcre.c + +pcreposix.o: pcreposix.c pcreposix.h internal.h + $(CC) -c $(CFLAGS) pcreposix.c + +chartables.o: chartables.c + $(CC) -c $(CFLAGS) chartables.c + +study.o: study.c pcre.h internal.h + $(CC) -c $(CFLAGS) study.c + +pcretest.o: pcretest.c pcre.h + $(CC) -c $(CFLAGS) pcretest.c + +pgrep.o: pgrep.c pcre.h + $(CC) -c $(CFLAGS) pgrep.c + +# An auxiliary program makes the character tables + +chartables.c: maketables + ./maketables >chartables.c + +maketables: maketables.c + $(CC) -o maketables $(CFLAGS) maketables.c + +# We deliberately omit maketables and chartables.c from 'make clean'; once made +# chartables.c shouldn't change, and if people have edited the tables by hand, +# you don't want to throw them away. + +clean:; /bin/rm -f *.o *.a pcretest pgrep + +# End diff --git a/Performance b/Performance new file mode 100644 index 0000000..2bb4b96 --- /dev/null +++ b/Performance @@ -0,0 +1,172 @@ +Some comparisons of PCRE with the original Henry Spencer (1986) regular +expression functions were done on a SPARCstation IPC using gcc version 2.6.3 +with -O optimization, to give some idea as to how the two libraries compare. +This is not a major statistical investigation. 
+ + +Code size +--------- + +The code size of PCRE is a bit over twice the size of the Henry Spencer +functions (roughly 33K vs 14K bytes on a SPARCstation with gcc -O). + + +Store size for compiled expressions +----------------------------------- + +For expressions that are compatible with both libraries, PCRE uses less store +for the examples tried, except in some cases that involve the use of character +classes. Except in the special case of a negated character class containing only +one character (e.g. [^a]), PCRE uses a 32-byte bit map for each character +class, in order to get the maximum matching speed. By contrast the Spencer code +uses a strchr() call. + +The Spencer functions have an overhead of 92 bytes per expression, because +there is a table for up to 10 matched substrings held with every compiled +expression. In contrast, PCRE's overhead is just 9 bytes, since it requires the +caller to supply a vector to receive the offsets of the matched substrings. In +the table below, the size without the overhead is shown in brackets. 
+ +PCRE Spencer Pattern +---- ------- ------- + + 18 (09) 109 (17) /^$/ + 25 (16) 120 (28) /^.*nter/ + 26 (17) 121 (29) /^12.34/ + 37 (28) 126 (34) /the quick brown fox/ + 50 (41) 114 (22) /^[]cde]/ + 50 (41) 114 (22) /^[^]cde]/ + 51 (42) 125 (33) /^[.^$|()*+?{,}]+/ + 52 (43) 126 (34) /^[0-9]+$/ + 56 (47) 153 (61) /^(abc)(abc)?zz/ + 57 (48) 133 (41) /^xxx[0-9]+$/ + 57 (48) 145 (53) /([0-9a-fA-F:]+)$/ + 62 (53) 171 (79) /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$$/ + 70 (61) 170 (78) /^(b+|a)(b+|a)?c/ + 74 (65) 173 (81) /^(ba|b*)(ba|b*)?bc/ + 99 (90) 235 (143) /^(a(b(c)))(d(e(f)))(h(i(j)))$/ +119 (110) 157 (65) /^.+[0-9][0-9][0-9]$/ +165 (156) 446 (354) /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/ +451 (442) 605 (513) /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + + +Compilation time +---------------- + +Timing was done using the clock() function to time 2000 compilations of each +expression and then dividing by twice the number of clocks per second, to get a +value in milliseconds. The variation observed over several runs was never more +than 0.01: + +PCRE Spencer Pattern +---- ------- ------- + +0.04 0.07 /^$/ +0.06 0.12 /^.*nter/ +0.06 0.13 /^12.34/ +0.06 0.09 /^[]cde]/ +0.07 0.14 /^[0-9]+$/ +0.07 0.10 /^[^]cde]/ +0.08 0.17 /^xxx[0-9]+$/ +0.08 0.14 /the quick brown fox/ +0.09 0.14 /^[.^$|()*+?{,}]+/ +0.10 0.33 /([0-9a-fA-F:]+)$/ +0.12 0.26 /^.+[0-9][0-9][0-9]$/ +0.12 0.42 /^(abc)(abc)?zz/ +0.14 0.51 /^(b+|a)(b+|a)?c/ +0.15 0.53 /^(ba|b*)(ba|b*)?bc/ +0.19 0.51 /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ +0.34 1.59 /^(a(b(c)))(d(e(f)))(h(i(j)))$/ +0.47 1.32 /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/ +0.66 1.78 /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + + +Execution time +-------------- + +Execution timing was done in a similar manner. 
Blank entries in the "pattern" +column below indicate the use of the same pattern as before. + +PCRE Spencer Subject Pattern +---- ------- ------- ------- + +0.03 0.02 <null string> /^$/ +0.04 0.04 enter /^.*nter/ +0.04 0.04 uponter +0.03 0.03 12\r34 /^12.34/ +0.03 0.03 0 /^[0-9]+$/ +0.04 0.03 100 +0.03 0.03 ]thing /^[]cde]/ +0.03 0.03 ething +0.03 0.03 athing /^[^]cde]/ +0.04 0.04 xxx0 /^xxx[0-9]+$/ +0.04 0.04 xxx1234 +0.04 0.07 .^\$(*+)|{?,?} /^[.^$|()*+?{,}]+/ +0.03 0.03 the quick brown fox /the quick brown fox/ +0.06 0.08 What do you know about the quick brown fox? +0.04 0.07 0abc /([0-9a-fA-F:]+)$/ +0.04 0.07 abc +0.05 0.13 5f03:12C0::932e +0.06 0.07 x123 /^.+[0-9][0-9][0-9]$/ +0.06 0.07 123456 +0.06 0.09 abczz /^(abc)(abc)?zz/ +0.06 0.12 abcabczz /^(abc)(abc)?zz/ + /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ +0.23 0.28 abc!pqr=apquxz.ixr.zzz.ac.uk +0.09 0.15 bc /^(b+|a)(b+|a)?c/ +0.09 0.15 bbc +0.08 0.15 bbbc +0.09 0.15 bac +0.09 0.15 bbac +0.07 0.14 aac +0.09 0.15 abbbbbbbbbbbc +0.09 0.15 bbbbbbbbbbbac +0.09 0.18 babc /^(ba|b*)(ba|b*)?bc/ +0.12 0.24 bbabc +0.07 0.15 bababc +0.06 0.10 a. /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/ +0.13 0.34 ab-c.pq-r. +0.24 0.58 sxk.zzz.ac.uk. +0.12 0.34 x-.y-. +0.20 0.38 abcdefhij /^(a(b(c)))(d(e(f)))(h(i(j)))$/ + /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ +0.18 0.30 From abcd Mon Sep 01 12:33:02 1997 + +In general, PCRE runs faster than the Spencer function, but remember, this +is just for one particular compiler on one set of hardware and operating +system. Until comprehensive tests have been run in other environments, the most +one can plausibly say is that it is probably no worse on average for the kinds +of expression tested here. + + +Speeding up matching +-------------------- + +A character class is much more efficient than a set of bracketed alternatives. 
+Matching /^[abc]{12}/ against "abcabcabcabc" took 0.03 ms, whereas +/^(a|b|c){12}/ took 0.33 ms. This is because brackets and alternatives involve +recursion. + + +Serious test +------------ + +One of the tests of PCRE is the monster regular expression from "Mastering +Regular Expressions" (O'Reilly's "hip owls" book, 1997, ISBN 1-56592-257-3) +which recognizes email addresses. There are two versions, unoptimized and +optimized. For interest, here are their timings, again on a SPARCstation IPC. +The compile times were 55 ms and 94 ms, and the compiled expressions +occupied 11010 and 15426 bytes of store, respectively. The following strings +were matched in the times shown (unoptimized first): + +0.34 0.38 user@dom.ain +0.38 0.42 <user@dom.ain> +0.88 0.60 Alan Other <user@dom.ain> +1.87 0.82 "A. Other" <user.1234@dom.ain> (a comment) +1.77 1.19 A. Other <user.1234@dom.ain> (a comment) +2.21 0.42 "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay + +The optimization of the expression clearly has a dramatic effect in some cases. 
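The timing arithmetic described under "Compilation time" (time 2000 runs with clock(), then divide by twice the number of clocks per second to get milliseconds per run) can be sketched as follows. This is an illustrative sketch only, not code from the PCRE distribution; the names ms_per_run() and time_runs() are hypothetical.

```c
#include <time.h>

/* Convert a clock() interval covering `runs` repetitions into milliseconds
   per repetition. With runs = 2000 this reduces to
   elapsed / (2 * CLOCKS_PER_SEC), the rule quoted in the text. */
double ms_per_run(clock_t elapsed, int runs)
{
    return (double)elapsed * 1000.0 /
           ((double)runs * (double)CLOCKS_PER_SEC);
}

/* Hypothetical helper: call fn(arg) `runs` times and return the average
   time per call in milliseconds. */
double time_runs(void (*fn)(void *), void *arg, int runs)
{
    clock_t start = clock();
    int i;

    for (i = 0; i < runs; i++)
        fn(arg);
    return ms_per_run(clock() - start, runs);
}
```

Since clock() measures processor time at a coarse resolution, repeating the operation many times, as the text does, is what makes sub-millisecond figures meaningful.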
+ +Philip Hazel <ph10@cus.cam.ac.uk> +October 1997 @@ -0,0 +1,233 @@ +README file for PCRE (Perl-compatible regular expressions) +---------------------------------------------------------- + +The distribution should contain the following files: + + ChangeLog log of changes to the code + Makefile for building PCRE + Performance notes on performance + README this file + Tech.Notes notes on the encoding + pcre.3 man page for the functions + pcreposix.3 man page for the POSIX wrapper API + maketables.c auxiliary program for building chartables.c + study.c ) source of + pcre.c ) the functions + pcreposix.c ) + pcre.h header for the external API + pcreposix.h header for the external POSIX wrapper API + internal.h header for internal use + pcretest.c test program + pgrep.1 man page for pgrep + pgrep.c source of a grep utility that uses PCRE + perltest Perl test program + testinput test data, compatible with Perl + testinput2 test data for error messages and non-Perl things + testoutput test results corresponding to testinput + testoutput2 test results corresponding to testinput2 + +To build PCRE, edit Makefile for your system (it is a fairly simple make file) +and then run it. It builds two libraries called libpcre.a and libpcreposix.a, +a test program called pcretest, and the pgrep command. + +To test PCRE, run pcretest on the file testinput, and compare the output with +the contents of testoutput. There should be no differences. For example: + + pcretest testinput /tmp/anything + diff /tmp/anything testoutput + +Do the same with testinput2, comparing the output with testoutput2, but this +time using the -i flag for pcretest, i.e. + + pcretest -i testinput2 /tmp/anything + diff /tmp/anything testoutput2 + +There are two sets of tests because the first set can also be fed directly into +the perltest program to check that Perl gives the same results. 
The second set +of tests checks pcre_info(), pcre_study(), error detection and run-time flags +that are specific to PCRE, as well as the POSIX wrapper API. + +To install PCRE, copy libpcre.a to any suitable library directory (e.g. +/usr/local/lib), pcre.h to any suitable include directory (e.g. +/usr/local/include), and pcre.3 to any suitable man directory (e.g. +/usr/local/man/man3). + +To install the pgrep command, copy it to any suitable binary directory (e.g. +/usr/local/bin) and pgrep.1 to any suitable man directory (e.g. +/usr/local/man/man1). + +PCRE has its own native API, but a set of "wrapper" functions that are based on +the POSIX API are also supplied in the library libpcreposix.a. Note that this +just provides a POSIX calling interface to PCRE: the regular expressions +themselves still follow Perl syntax and semantics. The header file +for the POSIX-style functions is called pcreposix.h. The official POSIX name is +regex.h, but I didn't want to risk possible problems with existing files of +that name by distributing it that way. To use it with an existing program that +uses the POSIX API it will have to be renamed or pointed at by a link. + + +Character tables +---------------- + +PCRE uses four tables for manipulating and identifying characters. These are +compiled from a source file called chartables.c. This is not supplied in +the distribution, but is built by the program maketables (compiled from +maketables.c), which uses the ANSI C character handling functions such as +isalnum(), isalpha(), isupper(), islower(), etc. to build the table sources. +This means that the default C locale set in your system may affect the contents +of the tables. You can change the tables by editing chartables.c and then +re-building PCRE. If you do this, you should probably also edit Makefile to +ensure that the file doesn't ever get re-generated. + +The first two tables pcre_lcc[] and pcre_fcc[] provide lower casing and +case flipping functions, respectively. 
The pcre_cbits[] table consists of four +32-byte bit maps which identify digits, letters, "word" characters, and white +space, respectively. These are used when building 32-byte bit maps that +represent character classes. + +The pcre_ctypes[] table has bits indicating various character types, as +follows: + + 1 white space character + 2 letter + 4 decimal digit + 8 hexadecimal digit + 16 alphanumeric or '_' + 128 regular expression metacharacter or binary zero + +You should not alter the set of characters that contain the 128 bit, as that +will cause PCRE to malfunction. + + +The pcretest program +-------------------- + +This program is intended for testing PCRE, but it can also be used for +experimenting with regular expressions. + +If it is given two filename arguments, it reads from the first and writes to +the second. If it is given only one filename argument, it reads from that file +and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and +prompts for each line of input. + +The program handles any number of sets of input on a single input file. Each +set starts with a regular expression, and continues with any number of data +lines to be matched against the pattern. An empty line signals the end of the +set. The regular expressions are given enclosed in any non-alphameric +delimiters, for example + + /(a|bc)x+yz/ + +and may be followed by i, m, s, or x to set the PCRE_CASELESS, PCRE_MULTILINE, +PCRE_DOTALL, or PCRE_EXTENDED options, respectively. These options have the +same effect as they do in Perl. + +There are also some upper case options that do not match Perl options: /A, /E, +and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively. +The /D option is a PCRE debugging feature. It causes the internal form of +compiled regular expressions to be output after compilation. The /S option +causes pcre_study() to be called after the expression has been compiled, and +the results used when the expression is matched. 
If /I is present as well as +/S, then pcre_study() is called with the PCRE_CASELESS option. + +Finally, the /P option causes pcretest to call PCRE via the POSIX wrapper API +rather than its native API. When this is done, all other options except /i and +/m are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m +is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and +PCRE_DOTALL unless REG_NEWLINE is set. + +A regular expression can extend over several lines of input; the newlines are +included in it. See the testinput file for many examples. + +Before each data line is passed to pcre_exec(), leading and trailing whitespace +is removed, and it is then scanned for \ escapes. The following are recognized: + + \a alarm (= BEL) + \b backspace + \e escape + \f formfeed + \n newline + \r carriage return + \t tab + \v vertical tab + \nnn octal character (up to 3 octal digits) + \xhh hexadecimal character (up to 2 hex digits) + + \A pass the PCRE_ANCHORED option to pcre_exec() + \B pass the PCRE_NOTBOL option to pcre_exec() + \E pass the PCRE_DOLLAR_ENDONLY option to pcre_exec() + \I pass the PCRE_CASELESS option to pcre_exec() + \M pass the PCRE_MULTILINE option to pcre_exec() + \S pass the PCRE_DOTALL option to pcre_exec() + \Odd set the size of the output vector passed to pcre_exec() to dd + (any number of decimal digits) + \Z pass the PCRE_NOTEOL option to pcre_exec() + +A backslash followed by anything else just escapes the anything else. If the +very last character is a backslash, it is ignored. This gives a way of passing +an empty line as data, since a real empty line terminates the data input. + +If /P was present on the regex, causing the POSIX wrapper API to be used, only +\B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to +regexec() respectively. 
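The character escapes that pcretest recognizes in data lines (listed above) can be illustrated with a small decoder. This is a hedged sketch, not the code in pcretest.c: the function name decode_data_line() is hypothetical, whitespace trimming is left out, and the option escapes (\A, \B, \E, \I, \M, \S, \Z, \Odd) are simply consumed rather than turned into flags for pcre_exec().

```c
#include <ctype.h>
#include <stddef.h>

/* Decode the backslash escapes in src into dst and return the decoded
   length. dst needs room for strlen(src) + 1 bytes. The result may contain
   zero bytes, so the returned length is authoritative. */
size_t decode_data_line(const char *src, unsigned char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        int c = (unsigned char)*src++;
        int i;

        if (c != '\\') {                 /* ordinary character: copy it */
            dst[n++] = (unsigned char)c;
            continue;
        }
        if (*src == '\0')                /* a final backslash is ignored */
            break;

        c = (unsigned char)*src++;
        switch (c) {
        case 'a': c = 7;    break;       /* alarm (BEL) */
        case 'b': c = 8;    break;       /* backspace */
        case 'e': c = 27;   break;       /* escape */
        case 'f': c = '\f'; break;
        case 'n': c = '\n'; break;
        case 'r': c = '\r'; break;
        case 't': c = '\t'; break;
        case 'v': c = '\v'; break;

        case '0': case '1': case '2': case '3':
        case '4': case '5': case '6': case '7':
            c -= '0';                    /* up to 3 octal digits in total */
            for (i = 0; i < 2 && *src >= '0' && *src <= '7'; i++)
                c = c * 8 + (*src++ - '0');
            break;

        case 'x':                        /* up to 2 hexadecimal digits */
            c = 0;
            for (i = 0; i < 2 && isxdigit((unsigned char)*src); i++) {
                int d = tolower((unsigned char)*src++);
                c = c * 16 + (isdigit(d) ? d - '0' : d - 'a' + 10);
            }
            break;

        case 'A': case 'B': case 'E': case 'I':
        case 'M': case 'S': case 'Z':
            continue;                    /* option escape: consumed only */

        case 'O':                        /* \Odd: skip the decimal digits */
            while (isdigit((unsigned char)*src))
                src++;
            continue;

        default:                         /* anything else stands for itself */
            break;
        }
        dst[n++] = (unsigned char)c;
    }
    dst[n] = '\0';
    return n;
}
```

For example, the data line `abc\` decodes to the three characters `abc`, and a line consisting of a single backslash decodes to the empty string, which is how an empty subject can be supplied.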
+ +When a match succeeds, pcretest outputs the list of identified substrings that +pcre_exec() returns, starting with number 0 for the string that matched the +whole pattern. Here is an example of an interactive pcretest run. + + $ pcretest + Testing Perl-Compatible Regular Expressions + PCRE version 0.90 08-Sep-1997 + + re> /^abc(\d+)/ + data> abc123 + 0: abc123 + 1: 123 + data> xyz + No match + +Note that while patterns can be continued over several lines (a plain ">" +prompt is used for continuations), data lines may not. However newlines can be +included in data by means of the \n escape. + +If the -p option is given to pcretest, it is equivalent to adding /P to each +regular expression: the POSIX wrapper API is used to call PCRE. None of the +following flags has any effect in this case. + +If the option -d is given to pcretest, it is equivalent to adding /D to each +regular expression: the internal form is output after compilation. + +If the option -i (for "information") is given to pcretest, it calls pcre_info() +after compiling an expression, and outputs the information it gets back. If the +pattern is studied, the results of that are also output. + +If the option -s is given to pcretest, it outputs the size of each compiled +pattern after it has been compiled. + +If the -t option is given, each compile, study, and match is run 2000 times +while being timed, and the resulting time per compile or match is output in +milliseconds. Do not set -t with -s, because you will then get the size output +2000 times and the timing will be distorted. + + + +The perltest program +-------------------- + +The perltest program tests Perl's regular expressions; it has the same +specification as pcretest, and so can be given identical input, except that +input patterns can be followed only by Perl's lower case options. + +The data lines are processed as Perl strings, so if they contain $ or @ +characters, these have to be escaped. 
For this reason, all such characters in +the testinput file are escaped so that it can be used for perltest as well as +for pcretest, and the special upper case options such as /A that pcretest +recognizes are not used in this file. The output should be identical, apart +from the initial identifying banner. + +The testinput2 file is not suitable for feeding to Perltest, since it does +make use of the special upper case options and escapes that pcretest uses to +test additional features of PCRE. + +Philip Hazel <ph10@cam.ac.uk> +October 1997 diff --git a/Tech.Notes b/Tech.Notes new file mode 100644 index 0000000..f17b661 --- /dev/null +++ b/Tech.Notes @@ -0,0 +1,199 @@ +Technical Notes about PCRE +-------------------------- + +Many years ago I implemented some regular expression functions to an algorithm +suggested by Martin Richards. These were not Unix-like in form, and were quite +restricted in what they could do by comparison with Perl. The interesting part +about the algorithm was that the amount of space required to hold the compiled +form of an expression was known in advance. The code to apply an expression did +not operate by backtracking, as the Henry Spencer and Perl code does, but +instead checked all possibilities simultaneously by keeping a list of current +states and checking all of them as it advanced through the subject string. (In +the terminology of Jeffrey Friedl's book, it was a "DFA algorithm".) When the +pattern was all used up, all remaining states were possible matches, and the +one matching the longest subset of the subject string was chosen. This did not +necessarily maximize the individual wild portions of the pattern, as is +expected in Unix and Perl-style regular expressions. + +By contrast, the code originally written by Henry Spencer and subsequently +heavily modified for Perl actually compiles the expression twice: once in a +dummy mode in order to find out how much store will be needed, and then for +real. 
The execution function operates by backtracking and maximizing (or +minimizing in Perl) the amount of the subject that matches individual wild +portions of the pattern. This is an "NFA algorithm". + +For this set of functions, I tried at first to invent an algorithm that used an +amount of store bounded by a multiple of the number of characters in the +pattern, to save on compiling time. However, because of the greater complexity +in Perl regular expressions, I couldn't do this. In any case, a first pass +through the pattern is needed, in order to find internal flag settings like +(?i). So it works by running a very degenerate first pass to calculate a +maximum store size, and then a second pass to do the real compile - which may +use a bit less than the predicted amount of store. The idea is that this is +going to turn out faster because the first pass is degenerate and the second +can just store stuff straight into the vector. It does make the compiling +functions bigger, of course, but they have got quite big anyway to handle all +the Perl stuff. + +The compiled form of a pattern is a vector of bytes, containing items of +variable length. The first byte in an item is an opcode, and the length of the +item is either implicit in the opcode or contained in the data bytes which +follow it. A list of all the opcodes follows: + +Opcodes with no following data +------------------------------ + +These items are all just one byte long + + OP_END end of pattern + OP_ANY match any character + OP_SOD match start of data: \A + OP_CIRC ^ (start of data, or after \n in multiline) + OP_NOT_WORD_BOUNDARY \B + OP_WORD_BOUNDARY \b + OP_NOT_DIGIT \D + OP_DIGIT \d + OP_NOT_WHITESPACE \S + OP_WHITESPACE \s + OP_NOT_WORDCHAR \W + OP_WORDCHAR \w + OP_CUT analogue of Prolog's "cut" + OP_EOD match end of data: \Z + OP_DOLL $ (end of data, or before \n in multiline) + + +Repeating single characters +--------------------------- + +The common repeats (*, +, ?) 
when applied to a single character appear as +two-byte items using the following opcodes: + + OP_STAR + OP_MINSTAR + OP_PLUS + OP_MINPLUS + OP_QUERY + OP_MINQUERY + +Those with "MIN" in their name are the minimizing versions. Each is followed by +the character that is to be repeated. Other repeats make use of + + OP_UPTO + OP_MINUPTO + OP_EXACT + +which are followed by a two-byte count (most significant first) and the +repeated character. OP_UPTO matches from 0 to the given number. A repeat with a +non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an +OP_UPTO (or OP_MINUPTO). + + +Repeating character types +------------------------- + +Repeats of things like \d are done exactly as for single characters, except +that instead of a character, the opcode for the type is stored in the data +byte. The opcodes are: + + OP_TYPESTAR + OP_TYPEMINSTAR + OP_TYPEPLUS + OP_TYPEMINPLUS + OP_TYPEQUERY + OP_TYPEMINQUERY + OP_TYPEUPTO + OP_TYPEMINUPTO + OP_TYPEEXACT + + +Matching a character string +--------------------------- + +The OP_CHARS opcode is followed by a one-byte count and then that number of +characters. If there are more than 255 characters in sequence, successive +instances of OP_CHARS are used. + + +Character classes +----------------- + +OP_CLASS is used for a character class. It is followed by a 32-byte bit map +containing a 1 bit for every character that is acceptable. The bits are counted +from the least significant end of each byte. + + +Back references +--------------- + +OP_REF is followed by a single byte containing the reference number. + + +Repeating character classes and back references +----------------------------------------------- + +In both cases, the repeat information follows the base item. The matching code +looks at the following opcode to see if it is one of + + OP_CRSTAR + OP_CRMINSTAR + OP_CRPLUS + OP_CRMINPLUS + OP_CRQUERY + OP_CRMINQUERY + OP_CRRANGE + OP_CRMINRANGE + +All but the last two are just single-byte items. 
The others are followed by +four bytes of data, comprising the minimum and maximum repeat counts. + + +Brackets and alternation +------------------------ + +A pair of non-identifying (round) brackets is wrapped round each expression at +compile time, so alternation always happens in the context of brackets. +Non-identifying brackets use the opcode OP_BRA, while identifying brackets use +OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English +speakers, including myself, can be round, square, or curly. Hence this usage.] + +A bracket opcode is followed by two bytes which give the offset to the next +alternative OP_ALT or, if there aren't any branches, to the matching KET +opcode. Each OP_ALT is followed by two bytes giving the offset to the next one, +or to the KET opcode. + +OP_KET is used for subpatterns that do not repeat indefinitely, while +OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or +maximally respectively. All three are followed by two bytes giving (as a +positive number) the offset back to the matching BRA opcode. + +If a subpattern is quantified such that it is permitted to match zero times, it +is preceded by one of OP_BRAZERO or OP_BRAMINZERO. These are single-byte +opcodes which tell the matcher that skipping this subpattern entirely is a +valid branch. + +A subpattern with an indefinite maximum repetition is replicated in the +compiled data its minimum number of times (or once with a BRAZERO if the +minimum is zero), with the final copy terminating with a KETRMIN or KETRMAX as +appropriate. + +A subpattern with a bounded maximum repetition is replicated up to the maximum +number of times, with BRAZERO or BRAMINZERO before each replication after the +minimum. In effect, (abc){2,5} becomes (abc)(abc)(abc)?(abc)?(abc)?. + + +Assertions +---------- + +Assertions are just like other subpatterns, but starting with one of the +opcodes OP_ASSERT or OP_ASSERT_NOT. 
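
The replication of a subpattern with a bounded maximum, as described under "Brackets and alternation" above, can be sketched at the level of pattern text. This is only an illustration under stated assumptions: expand_bounded_repeat() is a hypothetical helper that does not exist in PCRE, and the real compiler replicates compiled opcodes (with BRAZERO or BRAMINZERO before the optional copies) rather than pattern text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): write the pattern-text equivalent
   of "group{min,max}" with a bounded maximum into out. The first min copies
   are mandatory; each later copy is followed by '?' to make it optional.
   Returns the length written, or -1 if out is too small. */

int expand_bounded_repeat(const char *group, int min, int max,
                          char *out, size_t outsize)
{
size_t glen = strlen(group);
size_t used = 0;
int i;

assert(min >= 0 && max >= min);
if (outsize == 0) return -1;

for (i = 0; i < max; i++)
  {
  size_t need = glen + (i >= min ? 1 : 0);
  if (used + need + 1 > outsize) return -1;   /* keep room for the zero */
  memcpy(out + used, group, glen);
  used += glen;
  if (i >= min) out[used++] = '?';            /* optional copy */
  }

out[used] = 0;
return (int)used;
}
```

Called with the group "(abc)", a minimum of 2 and a maximum of 5, the helper writes (abc)(abc)(abc)?(abc)?(abc)? into the buffer, which is the expansion quoted in the notes.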
+ + +Once-only subpatterns +--------------------- + +These are also just like other subpatterns, but they start with the opcode +OP_ONCE. + + +Philip Hazel +October 1997 diff --git a/internal.h b/internal.h new file mode 100644 index 0000000..1ba868d --- /dev/null +++ b/internal.h @@ -0,0 +1,281 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + + +#define PCRE_VERSION "1.00 18-Nov-1997" + + +/* This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. +----------------------------------------------------------------------------- +*/ + +/* This header contains definitions that are shared between the different +modules, but which are not relevant to the outside. */ + +/* To cope with SunOS4 and other systems that lack memmove() but have bcopy(), +define a macro for memmove() if USE_BCOPY is defined. 
*/ + +#ifdef USE_BCOPY +#define memmove(a, b, c) bcopy(b, a, c) +#endif + +/* Standard C headers plus the external interface definition */ + +#include <ctype.h> +#include <limits.h> +#include <setjmp.h> +#include <stddef.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include "pcre.h" + +/* Private options flags start at the most significant end of the two bytes. +The public options defined in pcre.h start at the least significant end. Make +sure they don't overlap! */ + +#define PCRE_FIRSTSET 0x8000 /* first_char is set */ +#define PCRE_STARTLINE 0x4000 /* start after \n for multiline */ +#define PCRE_COMPILED_CASELESS 0x2000 /* like it says */ + +/* Options for the "extra" block produced by pcre_study(). */ + +#define PCRE_STUDY_CASELESS 0x01 /* study was caseless */ +#define PCRE_STUDY_MAPPED 0x02 /* a map of starting chars exists */ + +/* Masks for identifying the public options: all permitted at compile time, +only some permitted at run or study time. */ + +#define PUBLIC_OPTIONS \ + (PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \ + PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA) + +#define PUBLIC_EXEC_OPTIONS \ + (PCRE_CASELESS|PCRE_ANCHORED|PCRE_MULTILINE|PCRE_NOTBOL|PCRE_NOTEOL| \ + PCRE_DOTALL|PCRE_DOLLAR_ENDONLY) + +#define PUBLIC_STUDY_OPTIONS (PCRE_CASELESS) + +/* Magic number to provide a small check against being handed junk. */ + +#define MAGIC_NUMBER 0x50435245 /* 'PCRE' */ + +/* Miscellaneous definitions */ + +typedef int BOOL; + +#define FALSE 0 +#define TRUE 1 + +/* These are escaped items that aren't just an encoding of a particular data +value such as \n. They must have non-zero values, as check_escape() returns +their negation. Also, they must appear in the same order as in the opcode +definitions below, up to ESC_Z. The final one must be ESC_REF as subsequent +values are used for \1, \2, \3, etc. 
There is a test in the code for an escape
+greater than ESC_b and less than ESC_X to detect the types that may be
+repeated. If any new escapes are put in-between that don't consume a character,
+that code will have to change. */
+
+enum { ESC_A = 1, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s, ESC_W, ESC_w,
+
+       /* These are not Perl escapes, so can't appear in the */
+       ESC_X,  /* simple table-lookup because they must be conditional */
+       /* on PCRE_EXTRA. */
+       ESC_Z,
+       ESC_REF };
+
+/* Opcode table: OP_BRA must be last, as all values >= it are used for brackets
+that extract substrings. Starting from 1 (i.e. after OP_END), the values up to
+OP_EOD must correspond in order to the list of escapes immediately above. */
+
+enum {
+  OP_END,            /* End of pattern */
+
+  /* Values corresponding to backslashed metacharacters */
+
+  OP_SOD,            /* Start of data: \A */
+  OP_NOT_WORD_BOUNDARY,  /* \B */
+  OP_WORD_BOUNDARY,      /* \b */
+  OP_NOT_DIGIT,          /* \D */
+  OP_DIGIT,              /* \d */
+  OP_NOT_WHITESPACE,     /* \S */
+  OP_WHITESPACE,         /* \s */
+  OP_NOT_WORDCHAR,       /* \W */
+  OP_WORDCHAR,           /* \w */
+  OP_CUT,            /* The analogue of Prolog's "cut" operation (extension) */
+  OP_EOD,            /* End of data: \Z. This must always be the last
+                        of the backslashed meta values. */
+
+  OP_CIRC,           /* Start of line - varies with multiline switch */
+  OP_DOLL,           /* End of line - varies with multiline switch */
+  OP_ANY,            /* Match any character */
+  OP_CHARS,          /* Match string of characters */
+  OP_NOT,            /* Match anything but the following char */
+
+  OP_STAR,           /* The maximizing and minimizing versions of */
+  OP_MINSTAR,        /* all these opcodes must come in pairs, with */
+  OP_PLUS,           /* the minimizing one second.
*/
+  OP_MINPLUS,        /* This first set applies to single characters */
+  OP_QUERY,
+  OP_MINQUERY,
+  OP_UPTO,           /* From 0 to n matches */
+  OP_MINUPTO,
+  OP_EXACT,          /* Exactly n matches */
+
+  OP_NOTSTAR,        /* The maximizing and minimizing versions of */
+  OP_NOTMINSTAR,     /* all these opcodes must come in pairs, with */
+  OP_NOTPLUS,        /* the minimizing one second. */
+  OP_NOTMINPLUS,     /* This first set applies to "not" single characters */
+  OP_NOTQUERY,
+  OP_NOTMINQUERY,
+  OP_NOTUPTO,        /* From 0 to n matches */
+  OP_NOTMINUPTO,
+  OP_NOTEXACT,       /* Exactly n matches */
+
+  OP_TYPESTAR,       /* The maximizing and minimizing versions of */
+  OP_TYPEMINSTAR,    /* all these opcodes must come in pairs, with */
+  OP_TYPEPLUS,       /* the minimizing one second. These codes must */
+  OP_TYPEMINPLUS,    /* be in exactly the same order as those above. */
+  OP_TYPEQUERY,      /* This set applies to character types such as \d */
+  OP_TYPEMINQUERY,
+  OP_TYPEUPTO,       /* From 0 to n matches */
+  OP_TYPEMINUPTO,
+  OP_TYPEEXACT,      /* Exactly n matches */
+
+  OP_CRSTAR,         /* The maximizing and minimizing versions of */
+  OP_CRMINSTAR,      /* all these opcodes must come in pairs, with */
+  OP_CRPLUS,         /* the minimizing one second. These codes must */
+  OP_CRMINPLUS,      /* be in exactly the same order as those above. */
+  OP_CRQUERY,        /* These are for character classes and back refs */
+  OP_CRMINQUERY,
+  OP_CRRANGE,        /* These are different to the three sets above. */
+  OP_CRMINRANGE,
+
+  OP_CLASS,          /* Match a character class */
+  OP_REF,            /* Match a back reference */
+
+  OP_ALT,            /* Start of alternation */
+  OP_KET,            /* End of group that doesn't have an unbounded repeat */
+  OP_KETRMAX,        /* These two must remain together and in this */
+  OP_KETRMIN,        /* order. They are for groups that repeat for ever. */
+
+  OP_ASSERT,
+  OP_ASSERT_NOT,
+  OP_ONCE,           /* Once matched, don't back up into the subpattern */
+
+  OP_BRAZERO,        /* These two must remain together and in this */
+  OP_BRAMINZERO,     /* order.
*/ + + OP_BRA /* This and greater values are used for brackets that + extract substrings. */ +}; + +/* The highest extraction number. This is limited by the number of opcodes +left after OP_BRA, i.e. 255 - OP_BRA. We actually set it somewhat lower. */ + +#define EXTRACT_MAX 99 + +/* The texts of compile-time error messages are defined as macros here so that +they can be accessed by the POSIX wrapper and converted into error codes. Yes, +I could have used error codes in the first place, but didn't feel like changing +just to accommodate the POSIX wrapper. */ + +#define ERR1 "\\ at end of pattern" +#define ERR2 "\\c at end of pattern" +#define ERR3 "unrecognized character follows \\" +#define ERR4 "numbers out of order in {} quantifier" +#define ERR5 "number too big in {} quantifier" +#define ERR6 "missing terminating ] for character class" +#define ERR7 "invalid escape sequence in character class" +#define ERR8 "range out of order in character class" +#define ERR9 "nothing to repeat" +#define ERR10 "operand of unlimited repeat could match the empty string" +#define ERR11 "internal error: unexpected repeat" +#define ERR12 "unrecognized character after (?" +#define ERR13 "too many capturing parenthesized sub-patterns" +#define ERR14 "missing )" +#define ERR15 "back reference to non-existent subpattern" +#define ERR16 "erroffset passed as NULL" +#define ERR17 "unknown option bit(s) set" +#define ERR18 "missing ) after comment" +#define ERR19 "too many sets of parentheses" +#define ERR20 "regular expression too large" +#define ERR21 "failed to get memory" +#define ERR22 "unmatched brackets" +#define ERR23 "internal error: code overflow" + +/* All character handling must be done as unsigned characters. Otherwise there +are problems with top-bit-set characters and functions such as isspace(). +However, we leave the interface to the outside world as char *, because that +should make things easier for callers. 
We define a short type for unsigned char +to save lots of typing. I tried "uchar", but it causes problems on Digital +Unix, where it is defined in sys/types, so use "uschar" instead. */ + +typedef unsigned char uschar; + +/* The real format of the start of the pcre block; the actual code vector +runs on as long as necessary after the end. */ + +typedef struct real_pcre { + unsigned int magic_number; + unsigned short int options; + unsigned char top_bracket; + unsigned char top_backref; + unsigned char first_char; + unsigned char code[1]; +} real_pcre; + +/* The real format of the extra block returned by pcre_study(). */ + +typedef struct real_pcre_extra { + unsigned char options; + unsigned char start_bits[32]; +} real_pcre_extra; + +/* Global tables from chartables.c */ + +extern uschar pcre_lcc[]; +extern uschar pcre_fcc[]; +extern uschar pcre_cbits[]; +extern uschar pcre_ctypes[]; + +/* Bit definitions for entries in pcre_ctypes[]. */ + +#define ctype_space 0x01 +#define ctype_letter 0x02 +#define ctype_digit 0x04 +#define ctype_xdigit 0x08 +#define ctype_word 0x10 /* alphameric or '_' */ +#define ctype_meta 0x80 /* regexp meta char or zero (end pattern) */ + +/* Offsets for the bitmap tables */ + +#define cbit_digit 0 +#define cbit_letter 32 +#define cbit_word 64 +#define cbit_space 96 +#define cbit_length 128 /* Length of the cbits table */ + +/* End of internal.h */ diff --git a/maketables.c b/maketables.c new file mode 100644 index 0000000..26a1319 --- /dev/null +++ b/maketables.c @@ -0,0 +1,157 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +PCRE is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. 
+ +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. +----------------------------------------------------------------------------- + +See the file Tech.Notes for some information on the internals. +*/ + + +/* This is a support program to generate the file chartables.c, containing +character tables of various kinds. They are built according to the local C +locale. */ + +#include <ctype.h> +#include <stdio.h> +#include <string.h> + +#include "internal.h" + +int main(void) +{ +int i; +unsigned char cbits[cbit_length]; + +printf( + "/*************************************************\n" + "* Perl-Compatible Regular Expressions *\n" + "*************************************************/\n\n" + "/* This file is automatically written by the makechartables auxiliary \n" + "program. If you edit it by hand, you might like to edit the Makefile to \n" + "prevent its ever being regenerated. */\n\n" + "/* This table is a lower casing table. */\n\n" + "unsigned char pcre_lcc[] = {\n"); + +printf(" "); +for (i = 0; i < 256; i++) + { + if ((i & 7) == 0 && i != 0) printf("\n "); + printf("%3d", tolower(i)); + if (i != 255) printf(","); + } +printf(" };\n\n"); + +printf( + "/* This table is a case flipping table. 
*/\n\n" + "unsigned char pcre_fcc[] = {\n"); + +printf(" "); +for (i = 0; i < 256; i++) + { + if ((i & 7) == 0 && i != 0) printf("\n "); + printf("%3d", islower(i)? toupper(i) : tolower(i)); + if (i != 255) printf(","); + } +printf(" };\n\n"); + +printf( + "/* This table contains bit maps for digits, letters, 'word' chars, and\n" + "white space. Each map is 32 bytes long and the bits run from the least\n" + "significant end of each byte. */\n\n" + "unsigned char pcre_cbits[] = {\n"); + +memset(cbits, 0, sizeof(cbits)); + +for (i = 0; i < 256; i++) + { + if (isdigit(i)) cbits[cbit_digit + i/8] |= 1 << (i&7); + if (isalpha(i)) cbits[cbit_letter + i/8] |= 1 << (i&7); + if (isalnum(i) || i == '_') + cbits[cbit_word + i/8] |= 1 << (i&7); + if (isspace(i)) cbits[cbit_space + i/8] |= 1 << (i&7); + } + +printf(" "); +for (i = 0; i < cbit_length; i++) + { + if ((i & 7) == 0 && i != 0) + { + if ((i & 31) == 0) printf("\n"); + printf("\n "); + } + printf("0x%02x", cbits[i]); + if (i != cbit_length - 1) printf(","); + } +printf(" };\n\n"); + +printf( + "/* This table identifies various classes of character by individual bits:\n" + " 0x%02x white space character\n" + " 0x%02x letter\n" + " 0x%02x decimal digit\n" + " 0x%02x hexadecimal digit\n" + " 0x%02x alphanumeric or '_'\n" + " 0x%02x regular expression metacharacter or binary zero\n*/\n\n", + ctype_space, ctype_letter, ctype_digit, ctype_xdigit, ctype_word, + ctype_meta); + +printf("unsigned char pcre_ctypes[] = {\n"); + +printf(" "); +for (i = 0; i < 256; i++) + { + int x = 0; + if (isspace(i)) x += ctype_space; + if (isalpha(i)) x += ctype_letter; + if (isdigit(i)) x += ctype_digit; + if (isxdigit(i)) x += ctype_xdigit; + if (isalnum(i) || i == '_') x += ctype_word; + if (strchr("*+?{^.$|()[", i) != 0) x += ctype_meta; + + if ((i & 7) == 0 && i != 0) + { + printf(" /* "); + if (isprint(i-8)) printf(" %c -", i-8); + else printf("%3d-", i-8); + if (isprint(i-1)) printf(" %c ", i-1); + else printf("%3d", i-1); + printf(" 
*/\n "); + } + printf("0x%02x", x); + if (i != 255) printf(","); + } + +printf("};/* "); +if (isprint(i-8)) printf(" %c -", i-8); + else printf("%3d-", i-8); +if (isprint(i-1)) printf(" %c ", i-1); + else printf("%3d", i-1); +printf(" */\n\n/* End of chartables.c */\n"); + +return 0; +} + +/* End of maketables.c */ @@ -0,0 +1,1017 @@ +.TH PCRE 3 +.SH NAME +pcre - Perl-compatible regular expressions. +.SH SYNOPSIS +.B #include <pcre.h> +.PP +.SM +.br +.B pcre *pcre_compile(const char *\fIpattern\fR, int \fIoptions\fR, +.ti +5n +.B char **\fIerrptr\fR, int *\fIerroffset\fR); +.PP +.br +.B pcre_extra *pcre_study(const pcre *\fIcode\fR, int \fIoptions\fR, +.ti +5n +.B char **\fIerrptr\fR); +.PP +.br +.B int pcre_exec(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR," +.ti +5n +.B "const char *\fIsubject\fR," int \fIlength\fR, int \fIoptions\fR, +.ti +5n +.B int *\fIovector\fR, int \fIovecsize\fR); +.PP +.br +.B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int +.B *\fIfirstcharptr\fR); +.PP +.br +.B char *pcre_version(void); +.PP +.br +.B void *(*pcre_malloc)(size_t); +.PP +.br +.B void (*pcre_free)(void *); +.PP +.br +.B unsigned char *pcre_cbits[128]; +.PP +.br +.B unsigned char *pcre_ctypes[256]; +.PP +.br +.B unsigned char *pcre_fcc[256]; +.PP +.br +.B unsigned char *pcre_lcc[256]; + + + +.SH DESCRIPTION +The PCRE library is a set of functions that implement regular expression +pattern matching using the same syntax and semantics as Perl 5, with just a few +differences (see below). The current implementation corresponds to Perl 5.004. + +PCRE has its own native API, which is described in this man page. There is also +a set of wrapper functions that correspond to the POSIX API. See +\fBpcreposix (3)\fR. + +The three functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and +\fBpcre_exec()\fR are used for compiling and matching regular expressions. 
The +function \fBpcre_info()\fR is used to find out information about a compiled +pattern, while the function \fBpcre_version()\fR returns a pointer to a string +containing the version of PCRE and its date of release. + +The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain +the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions +respectively. PCRE calls the memory management functions via these variables, +so a calling program can replace them if it wishes to intercept the calls. This +should be done before calling any PCRE functions. + +The other global variables are character tables. They are initialized when PCRE +is compiled, from source that is generated by reference to the C character type +functions, but which the maintainer of PCRE is free to modify. In principle +they could also be modified at runtime. See PCRE's README file for more +details. + + +.SH MULTI-THREADING +The PCRE functions can be used in multi-threading applications, with the +proviso that the character tables and the memory management functions pointed +to by \fBpcre_malloc\fR and \fBpcre_free\fR will be shared by all threads. + +The compiled form of a regular expression is not altered during matching, so +the same compiled pattern can safely be used by several threads at once. + + +.SH COMPILING A PATTERN +The function \fBpcre_compile()\fR is called to compile a pattern into an +internal form. The pattern is a C string terminated by a binary zero, and +is passed in the argument \fIpattern\fR. A pointer to the compiled code block +is returned. The \fBpcre\fR type is defined for this for convenience, but in +fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the contents of the +block are not defined. 
+.PP
+The size of a compiled pattern is roughly proportional to the length of the
+pattern string, except that each character class (other than those containing
+just a single character, negated or not) requires 33 bytes, and repeat
+quantifiers with a minimum greater than one or a bounded maximum cause the
+relevant portions of the compiled pattern to be replicated.
+.PP
+The \fIoptions\fR argument contains independent bits that affect the
+compilation. It should be zero if no options are required. Those options that
+are compatible with Perl can also be set at compile time from within the
+pattern (see the detailed description of regular expressions below) and all
+options except PCRE_EXTENDED and PCRE_EXTRA can be set at the time of matching.
+.PP
+If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately.
+Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns
+NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual
+error message.
+
+The offset from the start of the pattern to the character where the error was
+discovered is placed in the variable pointed to by \fIerroffset\fR, which must
+not be NULL. If it is, an immediate error is given.
+.PP
+The following option bits are defined in the header file:
+
+  PCRE_ANCHORED
+
+If this bit is set, the pattern is forced to be "anchored", that is, it is
+constrained to match only at the start of the string which is being searched
+(the "subject string"). This effect can also be achieved by appropriate
+constructs in the pattern itself, which is the only way to do it in Perl.
+
+  PCRE_CASELESS
+
+If this bit is set, letters in the pattern match both upper and lower case
+letters in any subject string. It is equivalent to Perl's /i option.
+
+  PCRE_DOLLAR_ENDONLY
+
+If this bit is set, a dollar metacharacter in the pattern matches only at the
+end of the subject string.
By default, it also matches immediately before the
+final character if it is a newline (but not before any other newlines). The
+PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set. There is no
+equivalent to this option in Perl.
+
+  PCRE_DOTALL
+
+If this bit is set, a dot metacharacter in the pattern matches all characters,
+including newlines. By default, newlines are excluded. This option is
+equivalent to Perl's /s option. A negative class such as [^a] always matches a
+newline character, independent of the setting of this option.
+
+  PCRE_EXTENDED
+
+If this bit is set, whitespace characters in the pattern are totally ignored
+except when escaped or inside a character class, and characters between an
+unescaped # outside a character class and the next newline character,
+inclusive, are also ignored. This is equivalent to Perl's /x option, and makes
+it possible to include comments inside complicated patterns.
+
+  PCRE_MULTILINE
+
+By default, PCRE treats the subject string as consisting of a single "line" of
+characters (even if it actually contains several newlines). The "start of line"
+metacharacter (^) matches only at the start of the string, while the "end of
+line" metacharacter ($) matches only at the end of the string, or before a
+terminating newline. This is the same as Perl.
+
+When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs
+match immediately following or immediately before any newline in the subject
+string, respectively, as well as at the very start and end. This is equivalent
+to Perl's /m option. If there are no "\\n" characters in a subject string, or
+no occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no
+effect.
+
+  PCRE_EXTRA
+
+This option turns on additional functionality of PCRE that is incompatible with
+Perl. Any backslash in a pattern that is followed by a letter that has no
+special meaning causes an error, thus reserving these combinations for future
+expansion.
By default, as in Perl, a backslash followed by a letter with no +special meaning is treated as a literal. There are two extra features currently +provided, and both are in some sense experimental additions that are useful for +influencing the progress of a match. + + (1) The sequence \\X inserts a Prolog-like "cut" into the expression. + + (2) Once a subpattern enclosed in (?>subpat) brackets has matched, + backtracking never goes back into the pattern. + +See below for further details of both of these. + + + +.SH STUDYING A PATTERN +When a pattern is going to be used several times, it is worth spending more +time analyzing it in order to speed up the time taken for matching. The +function \fBpcre_study()\fR takes a pointer to a compiled pattern as its first +argument, and returns a pointer to a \fBpcre_extra\fR block (another \fBvoid\fR +typedef) containing additional information about the pattern; this can be +passed to \fBpcre_exec()\fR. If no additional information is available, NULL +is returned. + +The second argument contains option bits. The only one currently supported is +PCRE_CASELESS. It forces the studying to be done in a caseless manner, even if +the original pattern was compiled without PCRE_CASELESS. When the result of +\fBpcre_study()\fR is passed to \fBpcre_exec()\fR, it is used only if its +caseless state is the same as that of the matching process. A pattern that is +compiled without PCRE_CASELESS can be studied with and without PCRE_CASELESS, +and the appropriate data passed to \fBpcre_exec()\fR with and without the +PCRE_CASELESS flag. + +The third argument for \fBpcre_study()\fR is a pointer to an error message. If +studying succeeds (even if no data is returned), the variable it points to is +set to NULL. Otherwise it points to a textual error message. + +At present, studying a pattern is useful only for non-anchored patterns that do +not have a single fixed starting character. A bitmap of possible starting +characters is created. 
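
As a rough sketch of how a bitmap of possible starting characters might be used, the helpers below build a 32-byte map and scan a subject for the first position worth trying. This assumes the same least-significant-bit-first layout that maketables.c uses for pcre_cbits; the layout inside the pcre_extra block is not documented here, so that is an assumption. None of these function names exist in PCRE; they are hypothetical and for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define START_MAP_BYTES 32   /* 256 bits, one per possible byte value */

/* Hypothetical helper: clear every bit in the map. */
void start_map_clear(unsigned char *map)
{
memset(map, 0, START_MAP_BYTES);
}

/* Hypothetical helper: record that a match may start with character c.
   Bits are counted from the least significant end of each byte. */
void start_map_add(unsigned char *map, unsigned char c)
{
map[c / 8] |= (unsigned char)(1 << (c & 7));
}

/* Hypothetical helper: non-zero if a match may start with character c. */
int start_map_test(const unsigned char *map, unsigned char c)
{
return (map[c / 8] & (1 << (c & 7))) != 0;
}

/* Hypothetical helper: offset of the first byte of subject whose bit is set
   in the map, or -1 if there is none. A matcher could skip directly to this
   offset instead of attempting a match at every position. */
int first_candidate(const unsigned char *map, const char *subject, int length)
{
int i;
assert(map != NULL && subject != NULL && length >= 0);
for (i = 0; i < length; i++)
  if (start_map_test(map, (unsigned char)subject[i])) return i;
return -1;
}
```

With 'c' and 'd' added to the map, as might happen for a pattern such as (cat|dog), a scan of "a big dog" stops at offset 6, so no match attempt is needed at the six earlier positions.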
+ + +.SH MATCHING A PATTERN +The function \fBpcre_exec()\fR is called to match a subject string against a +pre-compiled pattern, which is passed in the \fIcode\fR argument. If the +pattern has been studied, the result of the study should be passed in the +\fIextra\fR argument. Otherwise this must be NULL. + +The subject string is passed as a pointer in \fIsubject\fR and a length in +\fIlength\fR. Unlike the pattern string, it may contain binary zero characters. + +The options PCRE_ANCHORED, PCRE_CASELESS, PCRE_DOLLAR_ENDONLY, PCRE_DOTALL, and +PCRE_MULTILINE can be passed in the \fIoptions\fR argument, whose unused bits +must be zero. However, if a pattern is compiled with any of these options, they +cannot be unset when it is obeyed. + +There are also two further options that can be set only at matching time: + + PCRE_NOTBOL + +The first character of the string is not the beginning of a line, so the +circumflex metacharacter should not match before it. Setting this without +PCRE_MULTILINE (at either compile or match time) causes circumflex never to +match. + + PCRE_NOTEOL + +The end of the string is not the end of a line, so the dollar metacharacter +should not match it. Setting this without PCRE_MULTILINE (at either compile or +match time) causes dollar never to match. + +In general, a pattern matches a certain portion of the subject, and in +addition, further substrings from the subject may be picked out by parts of the +pattern. Following the usage in Jeffrey Friedl's book, this is called +"capturing" in what follows, and the phrase "capturing subpattern" is used for +a fragment of a pattern that picks out a substring. PCRE supports several other +kinds of parenthesized subpattern that do not cause substrings to be captured. + +Captured substrings are returned to the caller via a vector of integer offsets +whose address is passed in \fIovector\fR. The number of elements in the vector +is passed in \fIovecsize\fR. 
This should always be an even number, because the
+elements are used in pairs. If an odd number is passed, it is rounded down.
+
+The first element of a pair is set to the offset of the first character in a
+substring, and the second is set to the offset of the first character after the
+end of a substring. The first pair, \fIovector[0]\fR and \fIovector[1]\fR,
+identify the portion of the subject string matched by the entire pattern. The
+next pair is used for the first capturing subpattern, and so on. The value
+returned by \fBpcre_exec()\fR is the number of pairs that have been set. If
+there are no capturing subpatterns, the return value from a successful match
+is 1, indicating that just the first pair of offsets has been set.
+
+It is possible for a capturing subpattern number \fIn+1\fR to match some
+part of the subject when subpattern \fIn\fR has not been used at all. For
+example, if the string "abc" is matched against the pattern "(a|(z))(bc)",
+subpatterns 1 and 3 are matched, but 2 is not. When this happens, both offset
+values corresponding to the unused subpattern are set to -1.
+
+If a capturing subpattern is matched repeatedly, it is the last portion of the
+string that it matched that gets returned.
+
+If the vector is too small to hold all the captured substrings, it is used as
+far as possible, and the function returns a value of zero. In particular, if
+the substring offsets are not of interest, \fBpcre_exec()\fR may be called with
+\fIovector\fR passed as NULL and \fIovecsize\fR as zero. However, if the
+pattern contains back references and the \fIovector\fR isn't big enough to
+remember the related substrings, PCRE has to get additional memory for use
+during matching. Thus it is usually advisable to supply an \fIovector\fR.
+
+Note that \fBpcre_info()\fR can be used to find out how many capturing
+subpatterns there are in a compiled pattern.
+
+If \fBpcre_exec()\fR fails, it returns a negative number.
The following are
+defined in the header file:
+
+  PCRE_ERROR_NOMATCH        (-1)
+
+The subject string did not match the pattern.
+
+  PCRE_ERROR_BADREF         (-2)
+
+There was a back-reference in the pattern to a capturing subpattern that had
+not previously been set.
+
+  PCRE_ERROR_NULL           (-3)
+
+Either \fIcode\fR or \fIsubject\fR was passed as NULL, or \fIovector\fR was
+NULL and \fIovecsize\fR was not zero.
+
+  PCRE_ERROR_BADOPTION      (-4)
+
+An unrecognized bit was set in the \fIoptions\fR argument.
+
+  PCRE_ERROR_BADMAGIC       (-5)
+
+PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch
+the case when it is passed a junk pointer. This is the error it gives when the
+magic number isn't present.
+
+  PCRE_ERROR_UNKNOWN_NODE   (-6)
+
+While running the pattern match, an unknown item was encountered in the
+compiled pattern. This error could be caused by a bug in PCRE or by overwriting
+of the compiled pattern.
+
+  PCRE_ERROR_NOMEMORY       (-7)
+
+If a pattern contains back references, but the \fIovector\fR that is passed to
+\fBpcre_exec()\fR is not big enough to remember the referenced substrings, PCRE
+gets a block of memory at the start of matching to use for this purpose. If the
+call via \fBpcre_malloc()\fR fails, this error is given. The memory is freed at
+the end of matching.
+
+
+.SH INFORMATION ABOUT A PATTERN
+The \fBpcre_info()\fR function returns information about a compiled pattern.
+Its yield is the number of capturing subpatterns, or one of the following
+negative numbers:
+
+  PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
+  PCRE_ERROR_BADMAGIC   the "magic number" was not found
+
+If the \fIoptptr\fR argument is not NULL, a copy of the options with which the
+pattern was compiled is placed in the integer it points to.
+
+If the \fIfirstcharptr\fR argument is not NULL, it is used to pass back
+information about the first character of any matched string. If there is a
+fixed first character, e.g.
from a pattern such as (cat|cow|coyote), then it is +returned in the integer pointed to by \fIfirstcharptr\fR. Otherwise, if the +pattern was compiled with the PCRE_MULTILINE option, and every branch started +with "^", then -1 is returned, indicating that the pattern will match at the +start of a subject string or after any "\\n" within the string. Otherwise -2 is +returned. + + +.SH LIMITATIONS +There are some size limitations in PCRE but it is hoped that they will never in +practice be relevant. +The maximum length of a compiled pattern is 65539 (sic) bytes. +All values in repeating quantifiers must be less than 65536. +The maximum number of capturing subpatterns is 99. +The maximum number of all parenthesized subpatterns, including capturing +subpatterns and assertions, is 200. + +The maximum length of a subject string is the largest positive number that an +integer variable can hold. However, PCRE uses recursion to handle subpatterns +and indefinite repetition. This means that the available stack space may limit +the size of a subject string that can be processed by certain patterns. + + +.SH DIFFERENCES FROM PERL +The differences described here are with respect to Perl 5.004. + +1. By default, a whitespace character is any character that the C library +function \fBisspace()\fR recognizes, though it is possible to compile PCRE with +alternative character type tables. Normally \fBisspace()\fR matches space, +formfeed, newline, carriage return, horizontal tab, and vertical tab. Perl 5 +no longer includes vertical tab in its set of whitespace characters. The \\v +escape that was in the Perl documentation for a long time was never in fact +recognized. However, the character itself was treated as whitespace at least +up to 5.002. In 5.004 it does not match \\s. + +2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits +them, but they do not mean what you might think. 
For example, "(?!a){3}" does +not assert that the next three characters are not "a". It just asserts that the +next character is not "a" three times. + +3. Capturing subpatterns that occur inside negative lookahead assertions are +counted, but their entries in the offsets vector are never set. Perl sets its +numerical variables from any such patterns that are matched before the +assertion fails to match something (thereby succeeding), but only if the +negative lookahead assertion contains just one branch. + +4. Though binary zero characters are supported in the subject string, they are +not allowed in a pattern string because it is passed as a normal C string, +terminated by zero. The escape sequence "\\0" can be used in the pattern to +represent a binary zero. + +5. The following Perl escape sequences are not supported: \\l, \\u, \\L, \\U, +\\E, \\Q. In fact these are implemented by Perl's general string-handling and +are not part of its pattern matching engine. + +6. The Perl \\G assertion is not supported as it is not relevant to single +pattern matches. + +7. If a backreference can never be matched, PCRE diagnoses an error. In a case +like + + /(123)\\2/ + +the error occurs at compile time. Perl gives no compile time error; version +5.004 either always fails to match, or gives a segmentation fault at runtime. +In more complicated cases such as + + /(1)(2)(3)(4)(5)(6)(7)(8)(9)(10\\10)/ + +PCRE returns PCRE_ERROR_BADREF at run time. Perl always fails to match. + +8. PCRE provides some extensions to the Perl regular expression facilities: + +(a) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ meta- +character matches only at the very end of the string. + +(b) If PCRE_EXTRA is set, the \\X assertion (a Prolog-like "cut") is +recognized, and a backslash followed by a letter with no special meaning is +faulted. There is also a new kind of parenthesized subpattern starting with (?> +which has a block on backtracking into it once it has matched. 
+ + +.SH REGULAR EXPRESSION DETAILS +The syntax and semantics of the regular expressions supported by PCRE are +described below. Regular expressions are also described in the Perl +documentation and in a number of other books, some of which have copious +examples. Jeffrey Friedl's "Mastering Regular Expressions", published by +O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description +here is intended as reference documentation. + +A regular expression is a pattern that is matched against a subject string from +left to right. Most characters stand for themselves in a pattern, and match the +corresponding characters in the subject. As a trivial example, the pattern + + The quick brown fox + +matches a portion of a subject string that is identical to itself. The power of +regular expressions comes from the ability to include alternatives and +repetitions in the pattern. These are encoded in the pattern by the use of +\fImeta-characters\fR, which do not stand for themselves but instead are +interpreted in some special way. + +There are two different sets of meta-characters: those that are recognized +anywhere in the pattern except within square brackets, and those that are +recognized in square brackets. Outside square brackets, the meta-characters are +as follows: + + \\ general escape character with several uses + ^ assert start of subject (or line, in multiline mode) + $ assert end of subject (or line, in multiline mode) + . match any character except newline (by default) + [ start character class definition + | start of alternative branch + ( start subpattern + ) end subpattern + ? extends the meaning of ( + also 0 or 1 quantifier + also quantifier minimizer + * 0 or more quantifier + + 1 or more quantifier + { start min/max quantifier + +Part of a pattern that is in square brackets is called a "character class". 
In +a character class the only meta-characters are: + + \\ general escape character + ^ negate the class, but only if the first character + - indicates character range + ] terminates the character class + +The following sections describe the use of each of the meta-characters. + + +.SH BACKSLASH +The backslash character has several uses. Firstly, if it is followed by a +non-alphameric character, it takes away any special meaning that character may +have. This use of backslash as an escape character applies both inside and +outside character classes. + +For example, if you want to match a "*" character, you write "\\*" in the +pattern. This applies whether or not the following character would otherwise be +interpreted as a meta-character, so it is always safe to precede a +non-alphameric with "\\" to specify that it stands for itself. In particular, +if you want to match a backslash, you write "\\\\". + +If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the +pattern and characters between a "#" outside a character class and the next +newline character are ignored. An escaping backslash can be used to include a +whitespace or "#" character as part of the pattern. + +A second use of backslash provides a way of encoding non-printing characters +in patterns in a visible manner. 
There is no restriction on the appearance of +non-printing characters, apart from the binary zero that terminates a pattern, +but when a pattern is being prepared by text editing, it is usually easier to +use one of the following escape sequences than the binary character it +represents: + + \\a alarm, that is, the BEL character (hex 07) + \\cx "control-x", where x is any character + \\e escape (hex 1B) + \\f formfeed (hex 0C) + \\n newline (hex 0A) + \\r carriage return (hex 0D) + \\t tab (hex 09) + \\xhh character with hex code hh + \\ddd character with octal code ddd or backreference + +The precise effect of "\\cx" is as follows: if "x" is a lower case letter, it +is converted to upper case. Then bit 6 of the character (hex 40) is inverted. +Thus "\\cz" becomes hex 1A, but "\\c{" becomes hex 3B, while "\\c;" becomes hex +7B. + +After "\\x", up to two hexadecimal digits are read (letters can be in upper or +lower case). + +After "\\0" up to two further octal digits are read. In both cases, if there +are fewer than two digits, just those that are present are used. Thus the +sequence "\\0\\x\\07" specifies two binary zeros followed by a BEL character. +Make sure you supply two digits if the character that follows could otherwise +be taken as another digit. + +The handling of a backslash followed by a digit other than 0 is complicated. +Outside a character class, PCRE reads it and any following digits as a decimal +number. If the number is less than 10, or if there have been at least that many +previous capturing left parentheses in the expression, the entire sequence is +taken as a \fIback reference\fR. A description of how this works is given +later, following the discussion of parenthesized subpatterns. 
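As a rough illustration of the back-reference rule just stated, the decision can be sketched as a small C function. This is a minimal sketch and not code from PCRE: the name `is_back_reference` and its parameters are hypothetical, and it assumes the caller has already consumed the backslash and is outside a character class.

```c
#include <ctype.h>

/* Hypothetical helper for illustration only; it is not a PCRE function.
   `digits` points at the text just after a backslash, outside a character
   class. `groups_so_far` is the number of capturing left parentheses that
   precede the escape in the pattern. Returns 1 if the digits would be read
   as a back reference under the rule above, and 0 otherwise. */
static int is_back_reference(const char *digits, int groups_so_far)
{
  int n = 0;

  /* No digit at all, or a leading 0, is never a back reference. */
  if (!isdigit((unsigned char)*digits) || *digits == '0') return 0;

  /* Read the whole run of digits as one decimal number. The clamp only
     guards against integer overflow on absurdly long runs of digits. */
  while (isdigit((unsigned char)*digits))
    {
    if (n < 100000) n = n * 10 + (*digits - '0');
    digits++;
    }

  /* A number below 10 is always a back reference; a larger number is one
     only if at least that many capturing groups precede it. */
  return n < 10 || n <= groups_so_far;
}
```

Under these assumptions, `is_back_reference("7", 0)` is 1 because single digits are always back references, while `is_back_reference("11", 3)` is 0 and `is_back_reference("11", 11)` is 1, which matches the statement that the outcome depends on how many capturing parentheses came earlier.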
+ +Inside a character class, or if the decimal number is greater than 9 and there +have not been that many capturing subpatterns, PCRE re-reads up to three octal +digits following the backslash, and generates a single byte from the least +significant 8 bits of the value. Any subsequent digits stand for themselves. +For example: + + \\040 is another way of writing a space + \\40 is the same, provided there are fewer than 40 + previous capturing subpatterns + \\7 is always a back reference + \\11 might be a back reference, or another way of + writing a tab + \\011 is always a tab + \\0113 is a tab followed by the character "3" + \\113 is the character with octal code 113 (since there + can be no more than 99 back references) + \\377 is a byte consisting entirely of 1 bits + \\81 is either a back reference, or a binary zero + followed by the two characters "8" and "1" + +Note that octal values of 100 or greater must not be introduced by a leading +zero, because no more than three octal digits are ever read. + +All the sequences that define a single byte value can be used both inside and +outside character classes. In addition, inside a character class, the sequence +"\\b" is interpreted as the backspace character (hex 08). Outside a character +class it has a different meaning (see below). + +The third use of backslash is for specifying generic character types: + + \\d any decimal digit + \\D any character that is not a decimal digit + \\s any whitespace character + \\S any character that is not a whitespace character + \\w any "word" character + \\W any "non-word" character + +Each pair of escape sequences partitions the complete set of characters into +two disjoint sets. Any given character matches one, and only one, of each pair. + +A "word" character is any letter or digit or the underscore character, that is, +any character which can be part of a Perl "word". These character type +sequences can appear both inside and outside character classes. 
They each match +one character of the appropriate type. If the current matching point is at the +end of the subject string, all of them fail, since there is no character to +match. + +The fourth use of backslash is for certain assertions. An assertion specifies a +condition that has to be met at a particular point in a match, without +consuming any characters from the subject string. The backslashed assertions +are + + \\b word boundary + \\B not a word boundary + \\A start of subject (independent of multiline mode) + \\Z end of subject (independent of multiline mode) + +Assertions may not appear in character classes (but note that "\\b" has a +different meaning, namely the backspace character, inside a character class). + +A word boundary is a position in the subject string where the current character +and the previous character do not both match "\\w" or "\\W" (i.e. one matches +"\\w" and the other matches "\\W"), or the start or end of the string if the +first or last character matches "\\w", respectively. More complicated +assertions are also supported (see below). + +The "\\A" and "\\Z" assertions differ from the traditional "^" and "$" +(described below) in that they only ever match at the very start and end of the +subject string, respectively, whatever options are set. + +When the PCRE_EXTRA flag is set on a call to \fBpcre_compile()\fR, the +additional assertion \\X, which has no equivalent in Perl, is recognized. +This operates like the "cut" operation in Prolog: it prevents the matching +operation from backtracking past it. For example, if the expression + + .*/foo + +is matched against the string "/foo/this/is/not" then after the initial greedy +.* has swallowed the whole string, it keeps backtracking right the way to the +beginning before failing. If, on the other hand, the expression is + + .*/\\Xfoo + +then once it has discovered that "/not" is not "/foo", backtracking ceases, and +the match fails. 
See also the section on "once-only" subpatterns below. + + + +.SH CIRCUMFLEX AND DOLLAR +Outside a character class, the circumflex character is an assertion which is +true only if the current matching point is at the start of the subject string, +in the default matching mode. Inside a character class, circumflex has an +entirely different meaning (see below). + +Circumflex need not be the first character of the pattern if a number of +alternatives are involved, but it should be the first thing in each alternative +in which it appears if the pattern is ever to match that branch. If all +possible alternatives start with a circumflex, that is, if the pattern is +constrained to match only at the start of the subject, it is said to be an +"anchored" pattern. (There are also other constructs that can cause a pattern +to be anchored.) + +A dollar character is an assertion which is true only if the current matching +point is at the end of the subject string, or immediately before a newline +character that is the last character in the string (by default). Dollar need +not be the last character of the pattern if a number of alternatives are +involved, but it should be the last item in any branch in which it appears. +Dollar has no special meaning in a character class. + +The meaning of dollar can be changed so that it matches only at the very end of +the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching +time. + +The meanings of the circumflex and dollar characters are changed if the +PCRE_MULTILINE option is set at compile or matching time. When this is the +case, they match immediately after and immediately before an internal "\\n" +character, respectively, in addition to matching at the start and end of the +subject string. For example, the pattern /^abc$/ matches the subject string +"def\\nabc" in multiline mode, but not otherwise. 
Consequently, patterns that +are anchored in single line mode because all branches start with "^" are not +anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if +PCRE_MULTILINE is set. + +Note that the sequences "\\A" and "\\Z" can be used to match the start and end +of the subject in both modes, and if all branches of a pattern start with "\\A" +it is always anchored. + + +.SH FULL STOP (PERIOD, DOT) +Outside a character class, a dot in the pattern matches any one character in +the subject, including a non-printing character, but not (by default) newline. +If the PCRE_DOTALL option is set, then dots match newlines as well. The +handling of dot is entirely independent of the handling of circumflex and +dollar, the only relationship being that they both involve newline characters. +Dot has no special meaning in a character class. + + +.SH SQUARE BRACKETS +An opening square bracket introduces a character class, terminated by a closing +square bracket. A closing square bracket on its own is not special. If a +closing square bracket is required as a member of the class, it should be the +first data character in the class (after an initial circumflex, if present) or +escaped with \\. + +A character class matches a single character in the subject; the character must +be in the set of characters defined by the class, unless the first character in +the class is a circumflex, in which case the subject character must not be in +the set defined by the class. If a circumflex is actually required as a member +of the class, ensure it is not the first character, or escape it with \\. + +For example, the character class [aeiou] matches any lower case vowel, while +[^aeiou] matches any character that is not a lower case vowel. Note that a +circumflex is just a convenient notation for specifying the characters which +are in the class by enumerating those that are not.
It is not an assertion: it +still consumes a character from the subject string, and fails if the current +pointer is at the end of the string. + +The newline character is never treated in any special way in character classes, +whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class +such as [^a] will always match a newline. + +The minus (hyphen) character can be used to specify a range of characters in a +character class. For example, [d-m] matches any letter between d and m, +inclusive. If a minus character is required in a class, it must be escaped with +\\ or appear in a position where it cannot be interpreted as indicating a +range, typically as the first or last character in the class. It is not +possible to have the character "]" as the end character of a range, since a +sequence such as [w-] is interpreted as a class of two characters. The octal or +hexadecimal representation of "]" can, however, be used to end a range. + +Ranges operate in ASCII collating sequence. They can also be used for +characters specified numerically, for example [\\000-\\037]. If a range such as +[W-c] is used when PCRE_CASELESS is set, it matches the letters involved in +either case. + +The character types \\d, \\D, \\s, \\S, \\w, and \\W may also appear in a +character class, and add the characters that they match to the class. For +example, the class [^\\W_] matches any letter or digit. + +All non-alphameric characters other than \\, -, ^ (at the start) and the +terminating ] are non-special in character classes, but it does no harm if they +are escaped. + + +.SH VERTICAL BAR +Vertical bar characters are used to separate alternative patterns. The matching +process tries all the alternatives in turn. For example, the pattern + + gilbert|sullivan + +matches either "gilbert" or "sullivan". Any number of alternatives can be used, +and an empty alternative is permitted (matching the empty string). 
+ + +.SH SUBPATTERNS +Subpatterns are delimited by parentheses (round brackets), which can be nested. +Marking part of a pattern as a subpattern does two things: + +1. It localizes a set of alternatives. For example, the pattern + + cat(aract|erpillar|) + +matches one of the words "cat", "cataract", or "caterpillar". Without the +parentheses, it would match "cataract", "erpillar" or the empty string. + +2. It sets up the subpattern as a capturing subpattern (as defined above). +When the whole pattern matches, that portion of the subject string that matched +the subpattern is passed back to the caller via the \fIovector\fR argument of +\fBpcre_exec()\fR. Opening parentheses are counted from left to right (starting +from 1) to obtain the numbers of the capturing subpatterns. + +For example, if the string "the red king" is matched against the pattern + + the ((red|white) (king|queen)) + +the captured substrings are "red king", "red", and "king", and are numbered 1, +2, and 3. + +The fact that plain parentheses fulfil two functions is not always helpful. +There are often times when a grouping subpattern is required without a +capturing requirement. If an opening parenthesis is followed by "?:", the +subpattern does not do any capturing, and is not counted when computing the +number of any subsequent capturing subpatterns. For example, if the string "the +white queen" is matched against the pattern + + the ((?:red|white) (king|queen)) + +the captured substrings are "white queen" and "queen", and are numbered 1 and +2. The maximum number of captured substrings is 99, and the maximum number of +all subpatterns, both capturing and non-capturing, is 200. + + +.SH BACK REFERENCES +Outside a character class, a backslash followed by a digit greater than 0 (and +possibly further digits) is a back reference to a capturing subpattern earlier +(i.e. to its left) in the pattern, provided there have been that many previous +capturing left parentheses. 
However, if the decimal number following the +backslash is less than 10, it is always taken as a back reference, and causes +an error if there have not been that many previous capturing left parentheses. +See the section entitled "Backslash" above for further details of the handling +of digits following a backslash. + +A back reference matches whatever actually matched the capturing subpattern in +the current subject string, rather than anything matching the subpattern +itself. So the pattern + + (sens|respons)e and \\1ibility + +matches "sense and sensibility" and "response and responsibility", but not +"sense and responsibility". + +There may be more than one back reference to the same subpattern. If a +subpattern has not actually been used in a particular match, then any back +references to it always fail. For example, the pattern + + (a|(bc))\\2 + +always fails if it starts to match "a" rather than "bc". Because there may be +up to 99 back references, all digits following the backslash are taken +as part of a potential back reference number. If the pattern continues with a +digit character, then some delimiter must be used to terminate the back +reference. If the PCRE_EXTENDED option is set, this can be whitespace. +Otherwise an empty comment can be used. + + +.SH REPETITION +Repetition is specified by quantifiers, which can follow any of the following +items: + + a single character, possibly escaped + the . metacharacter + a character class + a back reference + a parenthesized subpattern + +The general repetition quantifier specifies a minimum and maximum number of +permitted matches, by giving the two numbers in curly brackets (braces), +separated by a comma. The numbers must be less than 65536, and the first must +be less than or equal to the second. For example: + + z{2,4} + +matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special +character. 
If the second number is omitted, but the comma is present, there is +no upper limit; if the second number and the comma are both omitted, the +quantifier specifies an exact number of required matches. Thus + + [aeiou]{3,} + +matches at least 3 successive vowels, but may match many more, while + + \\d{8} + +matches exactly 8 digits. An opening curly bracket that appears in a position +where a quantifier is not allowed, or one that does not match the syntax of a +quantifier, is taken as a literal character. For example, "{,6}" is not a +quantifier, but a literal string of four characters. + +The quantifier {0} is permitted, causing the expression to behave as if the +previous item and the quantifier were not present. + +For convenience (and historical compatibility) the three most common +quantifiers have single-character abbreviations: + + * is equivalent to {0,} + + is equivalent to {1,} + ? is equivalent to {0,1} + +By default, the quantifiers are "greedy", that is, they match as much as +possible (up to the maximum number of permitted times), without causing the +rest of the pattern to fail. The classic example of where this gives problems +is in trying to match comments in C programs. These appear between the +sequences /* and */ and within the sequence, individual * and / characters may +appear. An attempt to match C comments by applying the pattern + + /\\*.*\\*/ + +to the string + + /* first command */ not comment /* second comment */ + +fails, because it matches the entire string due to the greediness of the .* +item. + +However, if a quantifier is followed by a question mark, then it ceases to be +greedy, and instead matches the minimum number of times possible, so the +pattern + + /\\*.*?\\*/ + +does the right thing with the C comments. The meaning of the various +quantifiers is not otherwise changed, just the preferred number of matches. +Do not confuse this use of question mark with its use as a quantifier in its +own right. 
Because it has two uses, it can sometimes appear doubled, as in + + \\d??\\d + +which matches one digit by preference, but can match two if that is the only +way the rest of the pattern matches. + +When a parenthesized subpattern is quantified with a minimum repeat count that +is greater than 1 or with a limited maximum, more store is required for the +compiled pattern, in proportion to the size of the minimum or maximum. + +If a pattern starts with .* then it is implicitly anchored, since whatever +follows will be tried against every character position in the subject string. +PCRE treats this as though it were preceded by \\A. + +When a capturing subpattern is repeated, the value captured is the substring +that matched the final iteration. For example, + + (\\s*tweedle[dume]{3})+\\1 + +matches "tweedledum tweedledee tweedledee" but not "tweedledum tweedledee +tweedledum". + + +.SH ASSERTIONS +An assertion is a test on the characters following the current matching point +that does not actually consume any of those characters. The simple assertions +coded as \\b, \\B, \\A, \\Z, ^ and $ are described above. More complicated +assertions are coded as subpatterns starting with (?= for positive assertions, +and (?! for negative assertions. For example, + + \\w+(?=;) + +matches a word followed by a semicolon, but does not include the semicolon in +the match, and + + foo(?!bar) + +matches any occurrence of "foo" that is not followed by "bar". Note that the +apparently similar pattern + + (?!foo)bar + +does not find an occurrence of "bar" that is preceded by something other than +"foo"; it finds any occurrence of "bar" whatsoever, because the assertion +(?!foo) is always true when the next three characters are "bar". + +Assertion subpatterns are not capturing subpatterns, and may not be repeated, +because it makes no sense to assert the same thing several times.
If an +assertion contains capturing subpatterns within it, these are always counted +for the purposes of numbering the capturing subpatterns in the whole pattern. +Substring capturing is carried out for positive assertions, but it does not +make sense for negative assertions. + +Assertions count towards the maximum of 200 parenthesized subpatterns. + + +.SH ONCE-ONLY SUBPATTERNS +The facility described in this section is available only when the PCRE_EXTRA +option is set at compile time. It is an extension to Perl regular expressions. + +With both maximizing and minimizing repetition, failure of what follows +normally causes the repeated item to be re-evaluated to see if a different +number of repeats allows the rest of the pattern to match. Sometimes it is +useful to prevent this, either to change the nature of the match, or to cause +it to fail earlier than it otherwise might when the author of the pattern knows +there is no point in carrying on. + +Consider, for example, the pattern \\d+foo when applied to the subject line + + 123456bar + +After matching all 6 digits and then failing to match "foo", the normal +action of the matcher is to try again with only 5 digits matching the \\d+ +item, and then with 4, and so on, before ultimately failing. Once-only +subpatterns provide the means for specifying that once a portion of the pattern +has matched, it is not to be re-evaluated in this way, so the matcher would +give up immediately on failing to match "foo" the first time. The notation is +another kind of special parenthesis, starting with (?> as in this example: + + (?>\\d+)foo + +This kind of parenthesis "locks up" the part of the pattern it contains once +it has matched, and a failure further into the pattern is prevented from +backtracking into it. Backtracking past it to previous items, however, works as +normal. + +For simple cases such as the above example, this feature can be thought of as a +maximizing repeat that must swallow everything it can.
So, while both \\d+ and +\\d+? are prepared to adjust the number of digits they match in order to make +the rest of the pattern match, (?>\\d+) can only match an entire sequence of +digits. + +This construction can of course contain arbitrarily complicated subpatterns, +and it can be nested. Contrast with the \\X assertion, which is a Prolog-like +"cut". + + +.SH COMMENTS +The sequence (?# marks the start of a comment which continues up to the next +closing parenthesis. Nested parentheses are not permitted. The characters +that make up a comment play no part in the pattern matching at all. + +If the PCRE_EXTENDED option is set, an unescaped # character outside a +character class introduces a comment that continues up to the next newline +character in the pattern. + + +.SH INTERNAL FLAG SETTING +If the sequence (?i) occurs anywhere in a pattern, it has the effect of setting +the PCRE_CASELESS option, that is, all letters are matched in a +case-independent manner. The option applies to the whole pattern, not just to +the portion that follows it. + +If the sequence (?m) occurs anywhere in a pattern, it has the effect of setting +the PCRE_MULTILINE option, that is, subject strings matched by this pattern are +treated as consisting of multiple lines. + +If the sequence (?s) occurs anywhere in a pattern, it has the effect of setting +the PCRE_DOTALL option, so that dot metacharacters match newlines as well as +all other characters. + +If the sequence (?x) occurs anywhere in a pattern, it has the effect of setting +the PCRE_EXTENDED option, that is, whitespace is ignored and # introduces a +comment that lasts till the next newline. The option applies to the whole +pattern, not just to the portion that follows it. + +If more than one option is required, they can be specified jointly, for example +as (?ix) or (?mi). + + +.SH PERFORMANCE +Certain items that may appear in patterns are more efficient than others. 
It is +more efficient to use a character class like [aeiou] than a set of alternatives +such as (a|e|i|o|u). In general, the simplest construction that provides the +required behaviour is usually the most efficient. Jeffrey Friedl's book +contains a lot of discussion about optimizing regular expressions for efficient +performance. + +The use of PCRE_MULTILINE causes additional processing and should be avoided +when it is not necessary. Caseless matching of character classes is more +efficient if PCRE_CASELESS is set when the pattern is compiled. + + +.SH AUTHOR +Philip Hazel <ph10@cam.ac.uk> +.br +University Computing Service, +.br +New Museums Site, +.br +Cambridge CB2 3QG, England. +.br +Phone: +44 1223 334714 + +Copyright (c) 1997 University of Cambridge. @@ -0,0 +1,3510 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. 
+----------------------------------------------------------------------------- +*/ + + +/* Define DEBUG to get debugging output on stdout. */ + +/* #define DEBUG */ + + +/* Include the internals header, which itself includes Standard C headers plus +the external pcre header. */ + +#include "internal.h" + + +/* Min and max values for the common repeats; for the maxima, 0 => infinity */ + +static char rep_min[] = { 0, 0, 1, 1, 0, 0 }; +static char rep_max[] = { 0, 0, 0, 0, 1, 1 }; + +/* Text forms of OP_ values and things, for debugging */ + +#ifdef DEBUG +static char *OP_names[] = { "End", "\\A", "\\B", "\\b", "\\D", "\\d", + "\\S", "\\s", "\\W", "\\w", "Cut", "\\Z", "^", "$", "Any", "chars", + "not", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", + "class", "Ref", + "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", "Once", + "Brazero", "Braminzero", "Bra" +}; +#endif + +/* Table for handling escaped characters in the range '0'-'z'. Positive returns +are simple data values; negative values are for special things like \d and so +on. Zero means further processing is needed (for things like \x), or the escape +is invalid. */ + +static short int escapes[] = { + 0, 0, 0, 0, 0, 0, 0, 0, /* 0 - 7 */ + 0, 0, ':', ';', '<', '=', '>', '?', /* 8 - ? 
*/ + '@', -ESC_A, -ESC_B, 0, -ESC_D, 0, 0, 0, /* @ - G */ + 0, 0, 0, 0, 0, 0, 0, 0, /* H - O */ + 0, 0, 0, -ESC_S, 0, 0, 0, -ESC_W, /* P - W */ + 0, 0, -ESC_Z, '[', '\\', ']', '^', '_', /* X - _ */ + '`', 7, -ESC_b, 0, -ESC_d, 27, '\f', 0, /* ` - g */ + 0, 0, 0, 0, 0, 0, '\n', 0, /* h - o */ + 0, 0, '\r', -ESC_s, '\t', 0, 0, -ESC_w, /* p - w */ + 0, 0, 0 /* x - z */ +}; + +/* Definition to allow mutual recursion */ + +static BOOL compile_regex(int, int *,uschar **,uschar **,char **); + +/* Structure for passing "static" information around between the functions +doing the matching, so that they are thread-safe. */ + +typedef struct match_data { + int errorcode; /* As it says */ + int *offset_vector; /* Offset vector */ + int offset_end; /* One past the end */ + BOOL offset_overflow; /* Set if too many extractions */ + BOOL caseless; /* Case-independent flag */ + BOOL runtime_caseless; /* Caseless forced at run time */ + BOOL multiline; /* Multiline flag */ + BOOL notbol; /* NOTBOL flag */ + BOOL noteol; /* NOTEOL flag */ + BOOL dotall; /* Dot matches any char */ + BOOL endonly; /* Dollar not before final \n */ + uschar *start_subject; /* Start of the subject string */ + uschar *end_subject; /* End of the subject string */ + jmp_buf fail_env; /* Environment for longjmp() break out */ + uschar *end_match_ptr; /* Subject position at end match */ + int end_offset_top; /* Highwater mark at end of match */ +} match_data; + + + +/************************************************* +* Global variables * +*************************************************/ + +/* PCRE is thread-clean and doesn't use any global variables in the normal +sense. However, it calls memory allocation and free functions via the two +indirections below, which can be changed by the caller, but are shared +between all threads.
*/ + +void *(*pcre_malloc)(size_t) = malloc; +void (*pcre_free)(void *) = free; + + + + +/************************************************* +* Return version string * +*************************************************/ + +char * +pcre_version(void) +{ +return PCRE_VERSION; +} + + + + +/************************************************* +* Return info about a compiled pattern * +*************************************************/ + +/* This function picks potentially useful data out of the private +structure. + +Arguments: + external_re points to compiled code + optptr where to pass back the options + first_char where to pass back the first character, + or -1 if multiline and all branches start ^, + or -2 otherwise + +Returns: number of identifying extraction brackets + or negative values on error +*/ + +int +pcre_info(const pcre *external_re, int *optptr, int *first_char) +{ +real_pcre *re = (real_pcre *)external_re; +if (re == NULL) return PCRE_ERROR_NULL; +if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; +if (optptr != NULL) *optptr = (re->options & PUBLIC_OPTIONS); +if (first_char != NULL) + *first_char = ((re->options & PCRE_FIRSTSET) != 0)? re->first_char : + ((re->options & PCRE_STARTLINE) != 0)? -1 : -2; +return re->top_bracket; +} + + + + +#ifdef DEBUG +/************************************************* +* Debugging function to print chars * +*************************************************/ + +/* Print a sequence of chars in printable format, stopping at the end of the +subject if requested.
+ +Arguments: + p points to characters + length number to print + is_subject TRUE if printing from within md->start_subject + md pointer to matching data block, if is_subject is TRUE + +Returns: nothing +*/ + +static pchars(uschar *p, int length, BOOL is_subject, match_data *md) +{ +int c; +if (is_subject && length > md->end_subject - p) length = md->end_subject - p; +while (length-- > 0) + if (isprint(c = *(p++))) printf("%c", c); else printf("\\x%02x", c); +} +#endif + + + + +/************************************************* +* Check subpattern for empty operand * +*************************************************/ + +/* This function checks a bracketed subpattern to see if any of the paths +through it could match an empty string. This is used to diagnose an error if +such a subpattern is followed by a quantifier with an unlimited upper bound. + +Argument: + code points to the opening bracket + +Returns: TRUE or FALSE +*/ + +static BOOL +could_be_empty(uschar *code) +{ +do { + uschar *cc = code + 3; + + /* Scan along the opcodes for this branch; as soon as we find something + that matches a non-empty string, break out and advance to test the next + branch. If we get to the end of the branch, return TRUE for the whole + sub-expression. */ + + for (;;) + { + /* Test an embedded subpattern; if it could not be empty, break the + loop. Otherwise carry on in the branch. 
*/ + + if ((int)(*cc) >= OP_BRA) + { + if (!could_be_empty(cc)) break; + do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); + cc += 3; + } + + else switch (*cc) + { + /* Reached end of a branch: the subpattern may match the empty string */ + + case OP_ALT: + case OP_KET: + case OP_KETRMAX: + case OP_KETRMIN: + return TRUE; + + /* Skip over assertive subpatterns */ + + case OP_ASSERT: + case OP_ASSERT_NOT: + do cc += (cc[1] << 8) + cc[2]; while (*cc == OP_ALT); + cc += 3; + break; + + /* Skip over things that don't match chars */ + + case OP_SOD: + case OP_EOD: + case OP_CIRC: + case OP_DOLL: + case OP_BRAZERO: + case OP_BRAMINZERO: + case OP_NOT_WORD_BOUNDARY: + case OP_WORD_BOUNDARY: + cc++; + break; + + /* Skip over simple repeats with zero lower bound */ + + case OP_STAR: + case OP_MINSTAR: + case OP_QUERY: + case OP_MINQUERY: + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + cc += 2; + break; + + /* Skip over UPTOs (lower bound is zero) */ + + case OP_UPTO: + case OP_MINUPTO: + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + cc += 4; + break; + + /* Check a class or a back reference for a zero minimum */ + + case OP_CLASS: + case OP_REF: + cc += (*cc == OP_REF)? 2 : 4 + 2 * cc[2] + cc[3]; + + switch (*cc) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRQUERY: + case OP_CRMINQUERY: + cc++; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + if ((cc[1] << 8) + cc[2] != 0) goto NEXT_BRANCH; + cc += 3; + break; + + default: + goto NEXT_BRANCH; + } + break; + + /* Anything else matches at least one character */ + + default: + goto NEXT_BRANCH; + } + } + + NEXT_BRANCH: + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); + +/* No branches match the empty string */ + +return FALSE; +} + + + +/************************************************* +* Handle escapes * +*************************************************/ + +/* This function is called when a \ has been encountered. 
It either returns a +positive value for a simple escape such as \n, or a negative value which +encodes one of the more complicated things such as \d. On entry, ptr is +pointing at the \. On exit, it is on the final character of the escape +sequence. + +Arguments: + ptrptr points to the pattern position pointer + errorptr points to the pointer to the error message + bracount number of previous extracting brackets + options the options bits + isclass TRUE if inside a character class + +Returns: zero or positive => a data character + negative => a special escape sequence + on error, errorptr is set +*/ + +static int +check_escape(uschar **ptrptr, char **errorptr, int bracount, int options, + BOOL isclass) +{ +uschar *ptr = *ptrptr; +int c = *(++ptr) & 255; /* Ensure > 0 on signed-char systems */ +int i; + +if (c == 0) *errorptr = ERR1; + +/* Digits or letters may have special meaning; all others are literals. */ + +else if (c < '0' || c > 'z') {} + +/* Do an initial lookup in a table. A non-zero result is something that can be +returned immediately. Otherwise further processing may be required. */ + +else if ((i = escapes[c - '0']) != 0) c = i; + +/* Escapes that need further processing, or are illegal. */ + +else + { + uschar *oldptr; + switch (c) + { + /* The handling of escape sequences consisting of a string of digits + starting with one that is not zero is not straightforward. By experiment, + the way Perl works seems to be as follows: + + Outside a character class, the digits are read as a decimal number. If the + number is less than 10, or if there are that many previous extracting + left brackets, then it is a back reference. Otherwise, up to three octal + digits are read to form an escaped byte. Thus \123 is likely to be octal + 123 (cf \0123, which is octal 012 followed by the literal 3). If the octal + value is greater than 377, the least significant 8 bits are taken. Inside a + character class, \ followed by a digit is always an octal number. 
*/ + + case '1': case '2': case '3': case '4': case '5': + case '6': case '7': case '8': case '9': + + if (!isclass) + { + oldptr = ptr; + c -= '0'; + while ((pcre_ctypes[ptr[1]] & ctype_digit) != 0) + c = c * 10 + *(++ptr) - '0'; + if (c < 10 || c <= bracount) + { + c = -(ESC_REF + c); + break; + } + ptr = oldptr; /* Put the pointer back and fall through */ + } + + /* Handle an octal number following \. If the first digit is 8 or 9, Perl + generates a binary zero byte and treats the digit as a following literal. + Thus we have to pull back the pointer by one. */ + + if ((c = *ptr) >= '8') + { + ptr--; + c = 0; + break; + } + + /* \0 always starts an octal number, but we may drop through to here with a + larger first octal digit */ + + case '0': + c -= '0'; + while(i++ < 2 && (pcre_ctypes[ptr[1]] & ctype_digit) != 0 && + ptr[1] != '8' && ptr[1] != '9') + c = c * 8 + *(++ptr) - '0'; + break; + + /* Special escapes not starting with a digit are straightforward */ + + case 'x': + c = 0; + while (i++ < 2 && (pcre_ctypes[ptr[1]] & ctype_xdigit) != 0) + { + ptr++; + c = c * 16 + pcre_lcc[*ptr] - + (((pcre_ctypes[*ptr] & ctype_digit) != 0)? '0' : 'W'); + } + break; + + case 'c': + c = *(++ptr); + if (c == 0) + { + *errorptr = ERR2; + return 0; + } + + /* A letter is upper-cased; then the 0x40 bit is flipped */ + + if (c >= 'a' && c <= 'z') c = pcre_fcc[c]; + c ^= 0x40; + break; + + /* PCRE_EXTRA enables extensions to Perl in the matter of escapes. Any + other alphameric following \ is an error if PCRE_EXTRA was set; otherwise, + for Perl compatibility, it is a literal. 
*/ + + default: + if ((options & PCRE_EXTRA) != 0) switch(c) + { + case 'X': + c = -ESC_X; /* This could be a lookup if it ever got into Perl */ + break; + + default: + *errorptr = ERR3; + break; + } + break; + } + } + +*ptrptr = ptr; +return c; +} + + + +/************************************************* +* Check for counted repeat * +*************************************************/ + +/* This function is called when a '{' is encountered in a place where it might +start a quantifier. It looks ahead to see if it really is a quantifier or not. +It is only a quantifier if it is one of the forms {ddd} {ddd,} or {ddd,ddd} +where the ddds are digits. + +Arguments: + p pointer to the first char after '{' + +Returns: TRUE or FALSE +*/ + +static BOOL +is_counted_repeat(uschar *p) +{ +if ((pcre_ctypes[*p++] & ctype_digit) == 0) return FALSE; +while ((pcre_ctypes[*p] & ctype_digit) != 0) p++; +if (*p == '}') return TRUE; + +if (*p++ != ',') return FALSE; +if (*p == '}') return TRUE; + +if ((pcre_ctypes[*p++] & ctype_digit) == 0) return FALSE; +while ((pcre_ctypes[*p] & ctype_digit) != 0) p++; +return (*p == '}'); +} + + + +/************************************************* +* Read repeat counts * +*************************************************/ + +/* Read an item of the form {n,m} and return the values. This is called only +after is_counted_repeat() has confirmed that a repeat-count quantifier exists, +so the syntax is guaranteed to be correct, but we need to check the values. 
+ +Arguments: + p pointer to first char after '{' + minp pointer to int for min + maxp pointer to int for max + returned as -1 if no max + errorptr points to pointer to error message + +Returns: pointer to '}' on success; + current ptr on error, with errorptr set +*/ + +static uschar * +read_repeat_counts(uschar *p, int *minp, int *maxp, char **errorptr) +{ +int min = 0; +int max = -1; + +while ((pcre_ctypes[*p] & ctype_digit) != 0) min = min * 10 + *p++ - '0'; + +if (*p == '}') max = min; else + { + if (*(++p) != '}') + { + max = 0; + while((pcre_ctypes[*p] & ctype_digit) != 0) max = max * 10 + *p++ - '0'; + if (max < min) + { + *errorptr = ERR4; + return p; + } + } + } + +/* Do paranoid checks, then fill in the required variables, and pass back the +pointer to the terminating '}'. */ + +if (min > 65535 || max > 65535) + *errorptr = ERR5; +else + { + *minp = min; + *maxp = max; + } +return p; +} + + + +/************************************************* +* Compile one branch * +*************************************************/ + +/* Scan the pattern, compiling it into the code vector. 
+ +Arguments: + options the option bits + bracket points to number of brackets used + code points to the pointer to the current code point + ptrptr points to the current pattern pointer + errorptr points to pointer to error message + +Returns: TRUE on success + FALSE, with *errorptr set on error +*/ + +static BOOL +compile_branch(int options, int *brackets, uschar **codeptr, uschar **ptrptr, + char **errorptr) +{ +int repeat_type, op_type; +int repeat_min, repeat_max; +int bravalue, length; +register int c; +register uschar *code = *codeptr; +uschar *ptr = *ptrptr; +uschar *previous = NULL; +uschar *oldptr; +uschar class[32]; + +/* Switch on next character until the end of the branch */ + +for (;; ptr++) + { + BOOL negate_class; + int class_charcount; + int class_lastchar; + + c = *ptr; + if ((options & PCRE_EXTENDED) != 0) + { + if ((pcre_ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c != '\n'); + continue; + } + } + + switch(c) + { + /* The branch terminates at end of string, |, or ). */ + + case 0: + case '|': + case ')': + *codeptr = code; + *ptrptr = ptr; + return TRUE; + + /* Handle single-character metacharacters */ + + case '^': + previous = NULL; + *code++ = OP_CIRC; + break; + + case '$': + previous = NULL; + *code++ = OP_DOLL; + break; + + case '.': + previous = code; + *code++ = OP_ANY; + break; + + /* Character classes. These always build a 32-byte bitmap of the permitted + characters, except in the special case where there is only one character. + For negated classes, we build the map as usual, then invert it at the end. + */ + + case '[': + previous = code; + *code++ = OP_CLASS; + + /* If the first character is '^', set the negation flag */ + + if ((c = *(++ptr)) == '^') + { + negate_class = TRUE; + c = *(++ptr); + } + else negate_class = FALSE; + + /* Keep a count of chars so that we can optimize the case of just a single + character. 
*/ + + class_charcount = 0; + class_lastchar = -1; + + /* Initialize the 32-char bit map to all zeros. We have to build the + map in a temporary bit of store, in case the class contains only 1 + character, because in that case the compiled code doesn't use the + bit map. */ + + memset(class, 0, 32 * sizeof(uschar)); + + /* Process characters until ] is reached. By writing this as a "do" it + means that an initial ] is taken as a data character. */ + + do + { + if (c == 0) + { + *errorptr = ERR6; + goto FAILED; + } + + /* Backslash may introduce a single character, or it may introduce one + of the specials, which just set a flag. Escaped items are checked for + validity in the pre-compiling pass. The sequence \b is a special case. + Inside a class (and only there) it is treated as backspace. Elsewhere + it marks a word boundary. Other escapes have preset maps ready to + or into the one we are building. We assume they have more than one + character in them, so set class_charcount bigger than one. */ + + if (c == '\\') + { + c = check_escape(&ptr, errorptr, *brackets, options, TRUE); + if (-c == ESC_b) c = '\b'; + else if (c < 0) + { + class_charcount = 10; + switch (-c) + { + case ESC_d: + for (c = 0; c < 32; c++) class[c] |= pcre_cbits[c+cbit_digit]; + continue; + + case ESC_D: + for (c = 0; c < 32; c++) class[c] |= ~pcre_cbits[c+cbit_digit]; + continue; + + case ESC_w: + for (c = 0; c < 32; c++) + class[c] |= (pcre_cbits[c] | pcre_cbits[c+cbit_word]); + continue; + + case ESC_W: + for (c = 0; c < 32; c++) + class[c] |= ~(pcre_cbits[c] | pcre_cbits[c+cbit_word]); + continue; + + case ESC_s: + for (c = 0; c < 32; c++) class[c] |= pcre_cbits[c+cbit_space]; + continue; + + case ESC_S: + for (c = 0; c < 32; c++) class[c] |= ~pcre_cbits[c+cbit_space]; + continue; + + default: + *errorptr = ERR7; + goto FAILED; + } + } + /* Fall through if single character */ + } + + /* A single character may be followed by '-' to form a range.
However, + Perl does not permit ']' to be the end of the range. A '-' character + here is treated as a literal. */ + + if (ptr[1] == '-' && ptr[2] != ']') + { + int d; + ptr += 2; + d = *ptr; + + if (d == 0) + { + *errorptr = ERR6; + goto FAILED; + } + + /* The second part of a range can be a single-character escape, but + not any of the other escapes. */ + + if (d == '\\') + { + d = check_escape(&ptr, errorptr, *brackets, options, TRUE); + if (d < 0) + { + if (d == -ESC_b) d = '\b'; else + { + *errorptr = ERR7; + goto FAILED; + } + } + } + + if (d < c) + { + *errorptr = ERR8; + goto FAILED; + } + + for (; c <= d; c++) + { + class[c/8] |= (1 << (c&7)); + if ((options & PCRE_CASELESS) != 0) + { + int uc = pcre_fcc[c]; /* flip case */ + class[uc/8] |= (1 << (uc&7)); + } + class_charcount++; /* in case a one-char range */ + class_lastchar = c; + } + continue; /* Go get the next char in the class */ + } + + /* Handle a lone single character - we can get here for a normal + non-escape char, or after \ that introduces a single character. */ + + class [c/8] |= (1 << (c&7)); + if ((options & PCRE_CASELESS) != 0) + { + c = pcre_fcc[c]; /* flip case */ + class[c/8] |= (1 << (c&7)); + } + class_charcount++; + class_lastchar = c; + } + + /* Loop until ']' reached; the check for end of string happens inside the + loop. This "while" is the end of the "do" above. */ + + while ((c = *(++ptr)) != ']'); + + /* If class_charcount is 1 and class_lastchar is not negative, we saw + precisely one character. This doesn't need the whole 32-byte bit map. + We turn it into a 1-character OP_CHARS if the class is not negated, or + OP_NOT if it is negated. */ + + if (class_charcount == 1 && class_lastchar >= 0) + { + if (negate_class) + { + code[-1] = OP_NOT; + } + else + { + code[-1] = OP_CHARS; + *code++ = 1; + } + *code++ = class_lastchar; + } + + /* Otherwise, negate the 32-byte map if necessary, and copy it into + the code vector.
*/ + + else + { + if (negate_class) + for (c = 0; c < 32; c++) code[c] = ~class[c]; + else + memcpy(code, class, 32); + code += 32; + } + break; + + /* Various kinds of repeat */ + + case '{': + if (!is_counted_repeat(ptr+1)) goto NORMAL_CHAR; + ptr = read_repeat_counts(ptr+1, &repeat_min, &repeat_max, errorptr); + if (*errorptr != NULL) goto FAILED; + goto REPEAT; + + case '*': + repeat_min = 0; + repeat_max = -1; + goto REPEAT; + + case '+': + repeat_min = 1; + repeat_max = -1; + goto REPEAT; + + case '?': + repeat_min = 0; + repeat_max = 1; + + REPEAT: + if (previous == NULL) + { + *errorptr = ERR9; + goto FAILED; + } + + /* If the next character is '?' this is a minimizing repeat. Advance to the + next character. */ + + if (ptr[1] == '?') { repeat_type = 1; ptr++; } else repeat_type = 0; + + /* If the maximum is zero then the minimum must also be zero; Perl allows + this case, so we do too - by simply omitting the item altogether. */ + + if (repeat_max == 0) code = previous; + + /* If previous was a string of characters, chop off the last one and use it + as the subject of the repeat. If there was only one character, we can + abolish the previous item altogether. */ + + else if (*previous == OP_CHARS) + { + int len = previous[1]; + if (len == 1) + { + c = previous[2]; + code = previous; + } + else + { + c = previous[len+1]; + previous[1]--; + code--; + } + op_type = 0; /* Use single-char op codes */ + goto OUTPUT_SINGLE_REPEAT; /* Code shared with single character types */ + } + + /* If previous was a single negated character ([^a] or similar), we use + one of the special opcodes, replacing it. The code is shared with single- + character repeats by adding a suitable offset into repeat_type. 
*/ + + else if ((int)*previous == OP_NOT) + { + op_type = OP_NOTSTAR - OP_STAR; /* Use "not" opcodes */ + c = previous[1]; + code = previous; + goto OUTPUT_SINGLE_REPEAT; + } + + /* If previous was a character type match (\d or similar), abolish it and + create a suitable repeat item. The code is shared with single-character + repeats by adding a suitable offset into repeat_type. */ + + else if ((int)*previous < OP_EOD || *previous == OP_ANY) + { + op_type = OP_TYPESTAR - OP_STAR; /* Use type opcodes */ + c = *previous; + code = previous; + + OUTPUT_SINGLE_REPEAT: + repeat_type += op_type; /* Combine both values for many cases */ + + /* A minimum of zero is handled either as the special case * or ?, or as + an UPTO, with the maximum given. */ + + if (repeat_min == 0) + { + if (repeat_max == -1) *code++ = OP_STAR + repeat_type; + else if (repeat_max == 1) *code++ = OP_QUERY + repeat_type; + else + { + *code++ = OP_UPTO + repeat_type; + *code++ = repeat_max >> 8; + *code++ = (repeat_max & 255); + } + } + + /* The case {1,} is handled as the special case + */ + + else if (repeat_min == 1 && repeat_max == -1) + *code++ = OP_PLUS + repeat_type; + + /* The case {n,n} is just an EXACT, while the general case {n,m} is + handled as an EXACT followed by an UPTO. An EXACT of 1 is optimized. */ + + else + { + if (repeat_min != 1) + { + *code++ = OP_EXACT + op_type; /* NB EXACT doesn't have repeat_type */ + *code++ = repeat_min >> 8; + *code++ = (repeat_min & 255); + } + + /* If the minimum is 1 and the previous item was a character string, + we either have to put back the item that got cancelled if the string + length was 1, or add the character back onto the end of a longer + string. For a character type nothing need be done; it will just get put + back naturally. */ + + else if (*previous == OP_CHARS) + { + if (code == previous) code += 2; else previous[1]++; + } + + /* Insert an UPTO if the max is greater than the min.
*/ + + if (repeat_max != repeat_min) + { + *code++ = c; + repeat_max -= repeat_min; + *code++ = OP_UPTO + repeat_type; + *code++ = repeat_max >> 8; + *code++ = (repeat_max & 255); + } + } + + /* The character or character type itself comes last in all cases. */ + + *code++ = c; + } + + /* If previous was a character class or a back reference, we put the repeat + stuff after it. */ + + else if (*previous == OP_CLASS || *previous == OP_REF) + { + if (repeat_min == 0 && repeat_max == -1) + *code++ = OP_CRSTAR + repeat_type; + else if (repeat_min == 1 && repeat_max == -1) + *code++ = OP_CRPLUS + repeat_type; + else if (repeat_min == 0 && repeat_max == 1) + *code++ = OP_CRQUERY + repeat_type; + else + { + *code++ = OP_CRRANGE + repeat_type; + *code++ = repeat_min >> 8; + *code++ = repeat_min & 255; + if (repeat_max == -1) repeat_max = 0; /* 2-byte encoding for max */ + *code++ = repeat_max >> 8; + *code++ = repeat_max & 255; + } + } + + /* If previous was a bracket group, we may have to replicate it in certain + cases. If the maximum repeat count is unlimited, check that the bracket + group cannot match the empty string, and diagnose an error if it can. */ + + else if ((int)*previous >= OP_BRA) + { + int i; + int length = code - previous; + + if (repeat_max == -1 && could_be_empty(previous)) + { + *errorptr = ERR10; + goto FAILED; + } + + /* If the minimum is greater than zero, and the maximum is unlimited or + equal to the minimum, the first copy remains where it is, and is + replicated up to the minimum number of times. This case includes the + + repeat, but of course no replication is needed in that case. */ + + if (repeat_min > 0 && (repeat_max == -1 || repeat_max == repeat_min)) + { + for (i = 1; i < repeat_min; i++) + { + memcpy(code, previous, length); + code += length; + } + } + + /* If the minimum is zero, stick BRAZERO in front of the first copy. 
+ Then, if there is a fixed upper limit, replicate up to that many times, + sticking BRAZERO in front of all the optional ones. */ + + else + { + if (repeat_min == 0) + { + memmove(previous+1, previous, length); + code++; + *previous++ = OP_BRAZERO + repeat_type; + } + + for (i = 1; i < repeat_min; i++) + { + memcpy(code, previous, length); + code += length; + } + + for (i = (repeat_min > 0)? repeat_min : 1; i < repeat_max; i++) + { + *code++ = OP_BRAZERO + repeat_type; + memcpy(code, previous, length); + code += length; + } + } + + /* If the maximum is unlimited, set a repeater in the final copy. */ + + if (repeat_max == -1) code[-3] = OP_KETRMAX + repeat_type; + } + + /* Else there's some kind of shambles */ + + else + { + *errorptr = ERR11; + goto FAILED; + } + + /* In all cases we no longer have a previous item. */ + + previous = NULL; + break; + + + /* Start of nested bracket sub-expression, or comment or lookahead. + First deal with special things that can come after a bracket; all are + introduced by ?, and the appearance of any of them means that this is not a + referencing group. They were checked for validity in the first pass over + the string, so we don't have to check for syntax errors here.
*/ + + case '(': + previous = code; /* Only real brackets can be repeated */ + if (*(++ptr) == '?') + { + bravalue = OP_BRA; + + switch (*(++ptr)) + { + case '#': + case 'i': + case 'm': + case 's': + case 'x': + ptr++; + while (*ptr != ')') ptr++; + previous = NULL; + continue; + + case ':': /* Non-extracting bracket */ + ptr++; + break; + + case '=': /* Assertions can't be repeated */ + bravalue = OP_ASSERT; + ptr++; + previous = NULL; + break; + + case '!': + bravalue = OP_ASSERT_NOT; + ptr++; + previous = NULL; + break; + + case '>': /* "Match once" brackets */ + if ((options & PCRE_EXTRA) != 0) /* Not yet standard */ + { + bravalue = OP_ONCE; + ptr++; + previous = NULL; + break; + } + /* Else fall through */ + + default: + *errorptr = ERR12; + goto FAILED; + } + } + + /* Else we have a referencing group */ + + else + { + if (++(*brackets) > EXTRACT_MAX) + { + *errorptr = ERR13; + goto FAILED; + } + bravalue = OP_BRA + *brackets; + } + + /* Process nested bracketed re; at end pointer is on the bracket. We copy + code into a non-register variable in order to be able to pass its address + because some compilers complain otherwise. */ + + *code = bravalue; + { + uschar *mcode = code; + if (!compile_regex(options, brackets, &mcode, &ptr, errorptr)) + goto FAILED; + code = mcode; + } + + if (*ptr != ')') + { + *errorptr = ERR14; + goto FAILED; + } + break; + + /* Check \ for being a real metacharacter; if not, fall through and handle + it as a data character at the start of a string. Escape items are checked + for validity in the pre-compiling pass. */ + + case '\\': + oldptr = ptr; + c = check_escape(&ptr, errorptr, *brackets, options, FALSE); + + /* Handle metacharacters introduced by \. For ones like \d, the ESC_ values + are arranged to be the negation of the corresponding OP_values. For the + back references, the values are ESC_REF plus the reference number. Only + back references and those types that consume a character may be repeated. 
+ We can test for values between ESC_b and ESC_Z for the latter; this may + have to change if any new ones are ever created. */ + + if (c < 0) + { + if (-c >= ESC_REF) + { + int refnum = -c - ESC_REF; + if (*brackets < refnum) + { + *errorptr = ERR15; + goto FAILED; + } + previous = code; + *code++ = OP_REF; + *code++ = refnum; + } + else + { + previous = (-c > ESC_b && -c < ESC_X)? code : NULL; + *code++ = -c; + } + continue; + } + + /* Reset and fall through */ + + ptr = oldptr; + c = '\\'; + + /* Handle a run of data characters until a metacharacter is encountered. + The first character is guaranteed not to be whitespace or # when the + extended flag is set. */ + + NORMAL_CHAR: + default: + previous = code; + *code = OP_CHARS; + code += 2; + length = 0; + + do + { + if ((options & PCRE_EXTENDED) != 0) + { + if ((pcre_ctypes[c] & ctype_space) != 0) continue; + if (c == '#') + { + while ((c = *(++ptr)) != 0 && c != '\n'); + if (c == 0) break; + continue; + } + } + + /* Backslash may introduce a data char or a metacharacter. Escaped items + are checked for validity in the pre-compiling pass. Stop the string + before a metaitem. */ + + if (c == '\\') + { + oldptr = ptr; + c = check_escape(&ptr, errorptr, *brackets, options, FALSE); + if (c < 0) { ptr = oldptr; break; } + } + + /* Ordinary character or single-char escape */ + + *code++ = c; + length++; + } + + /* This "while" is the end of the "do" above. */ + + while (length < 255 && (pcre_ctypes[c = *(++ptr)] & ctype_meta) == 0); + + /* Compute the length and set it in the data vector, and advance to + the next state. */ + + previous[1] = length; + ptr--; + break; + } + } /* end of big loop */ + +/* Control never reaches here by falling through, only by a goto for all the +error states. Pass back the position in the pattern so that it can be displayed +to the user for diagnosing the error. 
*/ + +FAILED: +*ptrptr = ptr; +return FALSE; +} + + + + +/************************************************* +* Compile sequence of alternatives * +*************************************************/ + +/* On entry, ptr is pointing past the bracket character, but on return +it points to the closing bracket, or vertical bar, or end of string. +The code variable is pointing at the byte into which the BRA operator has been +stored. + +Argument: + options the option bits + brackets -> int containing the number of extracting brackets used + codeptr -> the address of the current code pointer + ptrptr -> the address of the current pattern pointer + errorptr -> pointer to error message + +Returns: TRUE on success +*/ + +static BOOL +compile_regex(int options, int *brackets, uschar **codeptr, uschar **ptrptr, + char **errorptr) +{ +uschar *ptr = *ptrptr; +uschar *code = *codeptr; +uschar *start_bracket = code; + +for (;;) + { + int length; + uschar *last_branch = code; + + code += 3; + if (!compile_branch(options, brackets, &code, &ptr, errorptr)) + { + *ptrptr = ptr; + return FALSE; + } + + /* Fill in the length of the last branch */ + + length = code - last_branch; + last_branch[1] = length >> 8; + last_branch[2] = length & 255; + + /* Reached end of expression, either ')' or end of pattern. Insert a + terminating ket and the length of the whole bracketed item, and return, + leaving the pointer at the terminating char. */ + + if (*ptr != '|') + { + length = code - start_bracket; + *code++ = OP_KET; + *code++ = length >> 8; + *code++ = length & 255; + *codeptr = code; + *ptrptr = ptr; + return TRUE; + } + + /* Another branch follows; insert an "or" node and advance the pointer. */ + + *code = OP_ALT; + ptr++; + } +/* Control never reaches here */ +} + + + +/************************************************* +* Check for anchored expression * +*************************************************/ + +/* Try to find out if this is an anchored regular expression. 
Consider each +alternative branch. If they all start with OP_SOD or OP_CIRC, or with a bracket +all of whose alternatives start with OP_SOD or OP_CIRC (recurse ad lib), then +it's anchored. However, if this is a multiline pattern, then only OP_SOD +counts, since OP_CIRC can match in the middle. + +A branch is also implicitly anchored if it starts with .* because that will try +the rest of the pattern at all possible matching points, so there is no point +trying them again. + +Argument: points to start of expression (the bracket) +Returns: TRUE or FALSE +*/ + +static BOOL +is_anchored(register uschar *code, BOOL multiline) +{ +do { + int op = (int)code[3]; + if (op >= OP_BRA || op == OP_ASSERT || op == OP_ONCE) + { if (!is_anchored(code+3, multiline)) return FALSE; } + else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR) + { if (code[4] != OP_ANY) return FALSE; } + else if (op != OP_SOD && (multiline || op != OP_CIRC)) return FALSE; + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); +return TRUE; +} + + + +/************************************************* +* Check for start with \n line expression * +*************************************************/ + +/* This is called for multiline expressions to try to find out if every branch +starts with ^ so that "first char" processing can be done to speed things up. + +Argument: points to start of expression (the bracket) +Returns: TRUE or FALSE +*/ + +static BOOL +is_startline(uschar *code) +{ +do { + if ((int)code[3] >= OP_BRA || code[3] == OP_ASSERT) + { if (!is_startline(code+3)) return FALSE; } + else if (code[3] != OP_CIRC) return FALSE; + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); +return TRUE; +} + + + +/************************************************* +* Check for fixed first char * +*************************************************/ + +/* Try to find out if there is a fixed first character. 
This is called for +unanchored expressions, as it speeds up their processing quite considerably. +Consider each alternative branch. If they all start with the same char, or with +a bracket all of whose alternatives start with the same char (recurse ad lib), +then we return that char, otherwise -1. + +Argument: points to start of expression (the bracket) +Returns: -1 or the fixed first char +*/ + +static int +find_firstchar(uschar *code) +{ +register int c = -1; +do + { + register int charoffset = 4; + + if ((int)code[3] >= OP_BRA || code[3] == OP_ASSERT) + { + register int d; + if ((d = find_firstchar(code+3)) < 0) return -1; + if (c < 0) c = d; else if (c != d) return -1; + } + + else switch(code[3]) + { + default: + return -1; + + case OP_EXACT: /* Fall through */ + charoffset++; + + case OP_CHARS: /* Fall through */ + charoffset++; + + case OP_PLUS: + case OP_MINPLUS: + if (c < 0) c = code[charoffset]; else if (c != code[charoffset]) return -1; + break; + } + code += (code[1] << 8) + code[2]; + } +while (*code == OP_ALT); +return c; +} + + + +/************************************************* +* Compile a Regular Expression * +*************************************************/ + +/* This function takes a string and returns a pointer to a block of store +holding a compiled version of the expression. 
+ +Arguments: + pattern the regular expression + options various option bits + errorptr pointer to pointer to error text + erroroffset ptr offset in pattern where error was detected + +Returns: pointer to compiled data block, or NULL on error, + with errorptr and erroroffset set +*/ + +pcre * +pcre_compile(const char *pattern, int options, char **errorptr, + int *erroroffset) +{ +real_pcre *re; +int spaces = 0; +int length = 3; /* For initial BRA plus length */ +int runlength; +int c, size; +int bracount = 0; +int brastack[200]; +int brastackptr = 0; +int top_backref = 0; +uschar *code, *ptr; + +#ifdef DEBUG +uschar *code_base, *code_end; +#endif + +/* We can't pass back an error message if errorptr is NULL; I guess the best we +can do is just return NULL. */ + +if (errorptr == NULL) return NULL; +*errorptr = NULL; + +/* However, we can give a message for this error */ + +if (erroroffset == NULL) + { + *errorptr = ERR16; + return NULL; + } +*erroroffset = 0; + +if ((options & ~PUBLIC_OPTIONS) != 0) + { + *errorptr = ERR17; + return NULL; + } + +#ifdef DEBUG +printf("------------------------------------------------------------------\n"); +printf("%s\n", pattern); +#endif + +/* The first thing to do is to make a pass over the pattern to compute the +amount of store required to hold the compiled code. This does not have to be +perfect as long as errors are overestimates. At the same time we can detect any +internal flag settings. Make an attempt to correct for any counted white space +if an "extended" flag setting appears late in the pattern. We can't be so +clever for #-comments. 
*/ + +ptr = (uschar *)(pattern - 1); +while ((c = *(++ptr)) != 0) + { + int min, max; + int class_charcount; + + if ((pcre_ctypes[c] & ctype_space) != 0) + { + if ((options & PCRE_EXTENDED) != 0) continue; + spaces++; + } + + if (c == '#' && (options & PCRE_EXTENDED) != 0) + { + while ((c = *(++ptr)) != 0 && c != '\n'); + continue; + } + + switch(c) + { + /* A backslashed item may be an escaped "normal" character or a + character type. For a "normal" character, put the pointers and + character back so that tests for whitespace etc. in the input + are done correctly. */ + + case '\\': + { + uschar *save_ptr = ptr; + c = check_escape(&ptr, errorptr, bracount, options, FALSE); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (c >= 0) + { + ptr = save_ptr; + c = '\\'; + goto NORMAL_CHAR; + } + } + length++; + + /* A back reference needs an additional char, plus either one or 5 + bytes for a repeat. We also need to keep the value of the highest + back reference. */ + + if (c <= -ESC_REF) + { + int refnum = -c - ESC_REF; + if (refnum > top_backref) top_backref = refnum; + length++; /* For single back reference */ + if (ptr[1] == '{' && is_counted_repeat(ptr+2)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else length += 5; + if (ptr[1] == '?') ptr++; + } + } + continue; + + case '^': + case '.': + case '$': + case '*': /* These repeats won't be after brackets; */ + case '+': /* those are handled separately */ + case '?': + length++; + continue; + + /* This covers the cases of repeats after a single char, metachar, class, + or back reference. 
*/ + + case '{': + if (!is_counted_repeat(ptr+1)) goto NORMAL_CHAR; + ptr = read_repeat_counts(ptr+1, &min, &max, errorptr); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else + { + length--; /* Uncount the original char or metachar */ + if (min == 1) length++; else if (min > 0) length += 4; + if (max > 0) length += 4; else length += 2; + } + if (ptr[1] == '?') ptr++; + continue; + + /* An alternation contains an offset to the next branch or ket. */ + case '|': + length += 3; + continue; + + /* A character class uses 33 characters. Don't worry about character types + that aren't allowed in classes - they'll get picked up during the compile. + A character class that contains only one character uses 2 or 3 bytes, + depending on whether it is negated or not. Notice this where we can. */ + + case '[': + class_charcount = 0; + if (*(++ptr) == '^') ptr++; + do + { + if (*ptr == '\\') + { + int c = check_escape(&ptr, errorptr, bracount, options, TRUE); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (-c == ESC_b) class_charcount++; else class_charcount = 10; + } + else class_charcount++; + ptr++; + } + while (*ptr != 0 && *ptr != ']'); + + /* Repeats for negated single chars are handled by the general code */ + + if (class_charcount == 1) length += 3; else + { + length += 33; + + /* A repeat needs either 1 or 5 bytes. */ + + if (ptr[1] == '{' && is_counted_repeat(ptr+2)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if ((min == 0 && (max == 1 || max == -1)) || + (min == 1 && max == -1)) + length++; + else length += 5; + if (ptr[1] == '?') ptr++; + } + } + continue; + + /* Brackets may be genuine groups or special things */ + + case '(': + + /* Handle special forms of bracket, which all start (? 
*/ + + if (ptr[1] == '?') switch (c = ptr[2]) + { + /* Skip over comments entirely */ + case '#': + ptr += 3; + while (*ptr != 0 && *ptr != ')') ptr++; + if (*ptr == 0) + { + *errorptr = ERR18; + goto PCRE_ERROR_RETURN; + } + continue; + + /* Non-referencing groups and lookaheads just move the pointer on, and + then behave like a non-special bracket, except that they don't increment + the count of extracting brackets. */ + + case ':': + case '=': + case '!': + ptr += 2; + break; + + /* Ditto for the "once only" bracket, allowed only if the extra bit + is set. */ + + case '>': + if ((options & PCRE_EXTRA) != 0) + { + ptr += 2; + break; + } + /* Else fall through */ + + /* Else loop setting valid options until ) is met. Anything else is an + error. */ + + default: + ptr += 2; + for (;; ptr++) + { + if ((c = *ptr) == 'i') + { + options |= PCRE_CASELESS; + continue; + } + else if ((c = *ptr) == 'm') + { + options |= PCRE_MULTILINE; + continue; + } + else if (c == 's') + { + options |= PCRE_DOTALL; + continue; + } + else if (c == 'x') + { + options |= PCRE_EXTENDED; + length -= spaces; /* Already counted spaces */ + continue; + } + else if (c == ')') break; + + *errorptr = ERR12; + goto PCRE_ERROR_RETURN; + } + continue; /* End of this bracket handling */ + } + + /* Extracting brackets must be counted so we can process escapes in a + Perlish way. */ + + else bracount++; + + /* Non-special forms of bracket. Save length for computing whole length + at end if there's a repeat that requires duplication of the group. */ + + if (brastackptr >= sizeof(brastack)/sizeof(int)) + { + *errorptr = ERR19; + goto PCRE_ERROR_RETURN; + } + + brastack[brastackptr++] = length; + length += 3; + continue; + + /* Handle ket. Look for subsequent max/min; for certain sets of values we + have to replicate this bracket up to that many times. 
*/ + + case ')': + length += 3; + { + int min = 1; + int max = 1; + int duplength = length - brastack[--brastackptr]; + + /* Leave ptr at the final char; for read_repeat_counts this happens + automatically; for the others we need an increment. */ + + if ((c = ptr[1]) == '{' && is_counted_repeat(ptr+2)) + { + ptr = read_repeat_counts(ptr+2, &min, &max, errorptr); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + } + else if (c == '*') { min = 0; max = -1; ptr++; } + else if (c == '+') { max = -1; ptr++; } + else if (c == '?') { min = 0; ptr++; } + + /* If there is a minimum > 1 we have to replicate up to min-1 times; if + there is a limited maximum we have to replicate up to max-1 times and + allow for a BRAZERO item before each optional copy, as we also have to + do before the first copy if the minimum is zero. */ + + if (min == 0) length++; + else if (min > 1) length += (min - 1) * duplength; + if (max > min) length += (max - min) * (duplength + 1); + } + + continue; + + /* Non-special character. For a run of such characters the length required + is the number of characters + 2, except that the maximum run length is 255. + We won't get a skipped space or a non-data escape or the start of a # + comment as the first character, so the length can't be zero. */ + + NORMAL_CHAR: + default: + length += 2; + runlength = 0; + do + { + if ((pcre_ctypes[c] & ctype_space) != 0) + { + if ((options & PCRE_EXTENDED) != 0) continue; + spaces++; + } + + if (c == '#' && (options & PCRE_EXTENDED) != 0) + { + while ((c = *(++ptr)) != 0 && c != '\n'); + continue; + } + + /* Backslash may introduce a data char or a metacharacter; stop the + string before the latter. 
*/ + + if (c == '\\') + { + uschar *saveptr = ptr; + c = check_escape(&ptr, errorptr, bracount, options, FALSE); + if (*errorptr != NULL) goto PCRE_ERROR_RETURN; + if (c < 0) { ptr = saveptr; break; } + } + + /* Ordinary character or single-char escape */ + + runlength++; + } + + /* This "while" is the end of the "do" above. */ + + while (runlength < 255 && (pcre_ctypes[c = *(++ptr)] & ctype_meta) == 0); + + ptr--; + length += runlength; + continue; + } + } + +length += 4; /* For final KET and END */ + +if (length > 65539) + { + *errorptr = ERR20; + return NULL; + } + +/* Compute the size of data block needed and get it, either from malloc or +externally provided function. Put in the magic number and the options. */ + +size = length + offsetof(real_pcre, code); +re = (real_pcre *)(pcre_malloc)(size); + +if (re == NULL) + { + *errorptr = ERR21; + return NULL; + } + +re->magic_number = MAGIC_NUMBER; +re->options = options; + +/* Set up a starting, non-extracting bracket, then compile the expression. On +error, *errorptr will be set non-NULL, so we don't need to look at the result +of the function here. */ + +ptr = (uschar *)pattern; +code = re->code; +*code = OP_BRA; +bracount = 0; +(void)compile_regex(options, &bracount, &code, &ptr, errorptr); +re->top_bracket = bracount; +re->top_backref = top_backref; + +/* If not reached end of pattern on success, there's an excess bracket. */ + +if (*errorptr == NULL && *ptr != 0) *errorptr = ERR22; + +/* Fill in the terminating state and check for disastrous overflow, but +if debugging, leave the test till after things are printed out. 
*/ + +*code++ = OP_END; + +#ifndef DEBUG +if (code - re->code > length) *errorptr = ERR23; +#endif + +/* Failed to compile */ + +if (*errorptr != NULL) + { + (pcre_free)(re); + PCRE_ERROR_RETURN: + *erroroffset = ptr - (uschar *)pattern; + return NULL; + } + +/* If the anchored option was not passed, set flag if we can determine that it +is anchored by virtue of ^ characters or \A or anything else. Otherwise, see if +we can determine what the first character has to be, because that speeds up +unanchored matches no end. In the case of multiline matches, an alternative is +to set the PCRE_STARTLINE flag if all branches start with ^. */ + +if ((options & PCRE_ANCHORED) == 0) + { + if (is_anchored(re->code, (options & PCRE_MULTILINE) != 0)) + re->options |= PCRE_ANCHORED; + else + { + int c = find_firstchar(re->code); + if (c >= 0) + { + re->first_char = c; + re->options |= PCRE_FIRSTSET; + } + else if (is_startline(re->code)) + re->options |= PCRE_STARTLINE; + } + } + +/* Print out the compiled data for debugging */ + +#ifdef DEBUG + +printf("Length = %d top_bracket = %d top_backref=%d\n", + length, re->top_bracket, re->top_backref); + +if (re->options != 0) + { + printf("%s%s%s%s%s%s%s\n", + ((re->options & PCRE_ANCHORED) != 0)? "anchored " : "", + ((re->options & PCRE_CASELESS) != 0)? "caseless " : "", + ((re->options & PCRE_EXTENDED) != 0)? "extended " : "", + ((re->options & PCRE_MULTILINE) != 0)? "multiline " : "", + ((re->options & PCRE_DOTALL) != 0)? "dotall " : "", + ((re->options & PCRE_DOLLAR_ENDONLY) != 0)? "endonly " : "", + ((re->options & PCRE_EXTRA) != 0)? 
"extra " : ""); + } + +if ((re->options & PCRE_FIRSTSET) != 0) + { + if (isprint(re->first_char)) printf("First char = %c\n", re->first_char); + else printf("First char = \\x%02x\n", re->first_char); + } + +code_end = code; +code_base = code = re->code; + +while (code < code_end) + { + int charlength; + + printf("%3d ", code - code_base); + + if (*code >= OP_BRA) + { + printf("%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); + code += 2; + } + + else switch(*code) + { + case OP_CHARS: + charlength = *(++code); + printf("%3d ", charlength); + while (charlength-- > 0) + if (isprint(c = *(++code))) printf("%c", c); else printf("\\x%02x", c); + break; + + case OP_KETRMAX: + case OP_KETRMIN: + case OP_ALT: + case OP_KET: + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ONCE: + printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + if (*code >= OP_TYPESTAR) + printf(" %s", OP_names[code[1]]); + else if (isprint(c = code[1])) printf(" %c", c); + else printf(" \\x%02x", c); + printf("%s", OP_names[*code++]); + break; + + case OP_EXACT: + case OP_UPTO: + case OP_MINUPTO: + if (isprint(c = code[3])) printf(" %c{", c); + else printf(" \\x%02x{", c); + if (*code != OP_EXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_MINUPTO) printf("?"); + code += 3; + break; + + case OP_TYPEEXACT: + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + printf(" %s{", OP_names[code[3]]); + if (*code != OP_TYPEEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_TYPEMINUPTO) printf("?"); + code += 3; + break; + + case OP_NOT: + if (isprint(c = *(++code))) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + break; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case 
OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + if (isprint(c = code[1])) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + printf("%s", OP_names[*code++]); + break; + + case OP_NOTEXACT: + case OP_NOTUPTO: + case OP_NOTMINUPTO: + if (isprint(c = code[3])) printf(" [^%c]{", c); + else printf(" [^\\x%02x]{", c); + if (*code != OP_NOTEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_NOTMINUPTO) printf("?"); + code += 3; + break; + + case OP_REF: + printf(" \\%d", *(++code)); + break; + + case OP_CLASS: + { + int i, min, max; + + code++; + printf(" ["); + + for (i = 0; i < 256; i++) + { + if ((code[i/8] & (1 << (i&7))) != 0) + { + int j; + for (j = i+1; j < 256; j++) + if ((code[j/8] & (1 << (j&7))) == 0) break; + if (i == '-' || i == ']') printf("\\"); + if (isprint(i)) printf("%c", i); else printf("\\x%02x", i); + if (--j > i) + { + printf("-"); + if (j == '-' || j == ']') printf("\\"); + if (isprint(j)) printf("%c", j); else printf("\\x%02x", j); + } + i = j; + } + } + printf("]"); + code += 32; + + switch(*code) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + printf("%s", OP_names[*code]); + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + min = (code[1] << 8) + code[2]; + max = (code[3] << 8) + code[4]; + if (max == 0) printf("{%d,}", min); + else printf("{%d,%d}", min, max); + if (*code == OP_CRMINRANGE) printf("?"); + code += 4; + break; + + default: + code--; + } + } + break; + + /* Anything else is just a one-node item */ + + default: + printf(" %s", OP_names[*code]); + break; + } + + code++; + printf("\n"); + } +printf("------------------------------------------------------------------\n"); + +/* This check is done here in the debugging case so that the code that +was compiled can be seen. 
*/ + +if (code - re->code > length) + { + *errorptr = ERR23; + (pcre_free)(re); + *erroroffset = ptr - (uschar *)pattern; + return NULL; + } +#endif + +return (pcre *)re; +} + + + +/************************************************* +* Match a character type * +*************************************************/ + +/* Not used in all the places it might be as it's sometimes faster +to put the code inline. + +Arguments: + type the character type + c the character + dotall the dotall flag + +Returns: TRUE if character is of the type +*/ + +static BOOL +match_type(int type, int c, BOOL dotall) +{ + +#ifdef DEBUG +if (isprint(c)) printf("matching subject %c against ", c); + else printf("matching subject \\x%02x against ", c); +printf("%s\n", OP_names[type]); +#endif + +switch(type) + { + case OP_ANY: return dotall || c != '\n'; + case OP_NOT_DIGIT: return (pcre_ctypes[c] & ctype_digit) == 0; + case OP_DIGIT: return (pcre_ctypes[c] & ctype_digit) != 0; + case OP_NOT_WHITESPACE: return (pcre_ctypes[c] & ctype_space) == 0; + case OP_WHITESPACE: return (pcre_ctypes[c] & ctype_space) != 0; + case OP_NOT_WORDCHAR: return (pcre_ctypes[c] & ctype_word) == 0; + case OP_WORDCHAR: return (pcre_ctypes[c] & ctype_word) != 0; + } +return FALSE; +} + + + +/************************************************* +* Match a back-reference * +*************************************************/ + +/* If a back reference hasn't been set, the match fails. 
+ +Arguments: + number reference number + eptr points into the subject + length length to be matched + md points to match data block + +Returns: TRUE if matched +*/ + +static BOOL +match_ref(int number, register uschar *eptr, int length, match_data *md) +{ +uschar *p = md->start_subject + md->offset_vector[number]; + +#ifdef DEBUG +if (eptr >= md->end_subject) + printf("matching subject <null>"); +else + { + printf("matching subject "); + pchars(eptr, length, TRUE, md); + } +printf(" against backref "); +pchars(p, length, FALSE, md); +printf("\n"); +#endif + +/* Always fail if not enough characters left */ + +if (length > md->end_subject - p) return FALSE; + +/* Separate the caseless case for speed */ + +if (md->caseless) + { while (length-- > 0) if (pcre_lcc[*p++] != pcre_lcc[*eptr++]) return FALSE; } +else + { while (length-- > 0) if (*p++ != *eptr++) return FALSE; } + +return TRUE; +} + + + +/************************************************* +* Match from current position * +*************************************************/ + +/* On entry ecode points to the first opcode, and eptr to the first character. + +Arguments: + eptr pointer in subject + ecode position in code + offset_top current top pointer + md pointer to "static" info for the match + +Returns: TRUE if matched +*/ + +static BOOL +match(register uschar *eptr, register uschar *ecode, int offset_top, + match_data *md) +{ +for (;;) + { + int min, max, ctype; + register int i; + register int c; + BOOL minimize; + + /* Opening bracket. Check the alternative branches in turn, failing if none + match. We have to set the start offset if required and there is space + in the offset vector so that it is available for subsequent back references + if the bracket matches. However, if the bracket fails, we must put back the + previous value of both offsets in case they were set by a previous copy of + the same bracket. 
Don't worry about setting the flag for the error case here; + that is handled in the code for KET. */ + + if ((int)*ecode >= OP_BRA) + { + int number = (*ecode - OP_BRA) << 1; + int save_offset1, save_offset2; + + #ifdef DEBUG + printf("start bracket %d\n", number/2); + #endif + + if (number > 0 && number < md->offset_end) + { + save_offset1 = md->offset_vector[number]; + save_offset2 = md->offset_vector[number+1]; + md->offset_vector[number] = eptr - md->start_subject; + + #ifdef DEBUG + printf("saving %d %d\n", save_offset1, save_offset2); + #endif + } + + /* Recurse for all the alternatives. */ + + do + { + if (match(eptr, ecode+3, offset_top, md)) return TRUE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + + #ifdef DEBUG + printf("bracket %d failed\n", number/2); + #endif + + if (number > 0 && number < md->offset_end) + { + md->offset_vector[number] = save_offset1; + md->offset_vector[number+1] = save_offset2; + } + + return FALSE; + } + + /* Other types of node can be handled by a switch */ + + switch(*ecode) + { + case OP_END: + md->end_match_ptr = eptr; /* Record where we ended */ + md->end_offset_top = offset_top; /* and how many extracts were taken */ + return TRUE; + + /* The equivalent of Prolog's "cut" - if the rest doesn't match, the + whole thing doesn't match, so we have to get out via a longjmp(). */ + + case OP_CUT: + if (match(eptr, ecode+1, offset_top, md)) return TRUE; + longjmp(md->fail_env, 1); + + /* Assertion brackets. Check the alternative branches in turn - the + matching won't pass the KET for an assertion. If any one branch matches, + the assertion is true. */ + + case OP_ASSERT: + do + { + if (match(eptr, ecode+3, offset_top, md)) break; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + if (*ecode == OP_KET) return FALSE; + + /* Continue from after the assertion, updating the offsets high water + mark, since extracts may have been taken during the assertion. 
*/ + + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + ecode += 3; + offset_top = md->end_offset_top; + continue; + + /* Negative assertion: all branches must fail to match */ + + case OP_ASSERT_NOT: + do + { + if (match(eptr, ecode+3, offset_top, md)) return FALSE; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + ecode += 3; + continue; + + /* "Once" brackets are like assertion brackets except that after a match, + the point in the subject string is not moved back. Thus there can never be + a move back into the brackets. Check the alternative branches in turn - the + matching won't pass the KET for this kind of subpattern. If any one branch + matches, we carry on, leaving the subject pointer. */ + + case OP_ONCE: + do + { + if (match(eptr, ecode+3, offset_top, md)) break; + ecode += (ecode[1] << 8) + ecode[2]; + } + while (*ecode == OP_ALT); + if (*ecode == OP_KET) return FALSE; + + /* Continue as from after the assertion, updating the offsets high water + mark, since extracts may have been taken. */ + + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + ecode += 3; + offset_top = md->end_offset_top; + eptr = md->end_match_ptr; + continue; + + /* An alternation is the end of a branch; scan along to find the end of the + bracketed group and go to there. */ + + case OP_ALT: + do ecode += (ecode[1] << 8) + ecode[2]; while (*ecode == OP_ALT); + break; + + /* BRAZERO and BRAMINZERO occur just before a bracket group, indicating + that it may occur zero times. It may repeat infinitely, or not at all - + i.e. it could be ()* or ()? in the pattern. Brackets with fixed upper + repeat limits are compiled as a number of copies, with the optional ones + preceded by BRAZERO or BRAMINZERO. 
*/ + + case OP_BRAZERO: + { + uschar *next = ecode+1; + if (match(eptr, next, offset_top, md)) return TRUE; + do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); + ecode = next + 3; + } + break; + + case OP_BRAMINZERO: + { + uschar *next = ecode+1; + do next += (next[1] << 8) + next[2]; while (*next == OP_ALT); + if (match(eptr, next+3, offset_top, md)) return TRUE; + ecode++; + } + break; + + /* End of a group, repeated or non-repeating. If we are at the end of + an assertion "group", stop matching and return TRUE, but record the + current high water mark for use by positive assertions. */ + + case OP_KET: + case OP_KETRMIN: + case OP_KETRMAX: + { + int number; + uschar *prev = ecode - (ecode[1] << 8) - ecode[2]; + + if (*prev == OP_ASSERT || *prev == OP_ASSERT_NOT || *prev == OP_ONCE) + { + md->end_match_ptr = eptr; /* For ONCE */ + md->end_offset_top = offset_top; + return TRUE; + } + + /* In all other cases we have to check the group number back at the + start and if necessary complete handling an extraction by setting the + final offset and bumping the high water mark. */ + + number = (*prev - OP_BRA) << 1; + + #ifdef DEBUG + printf("end bracket %d\n", number/2); + #endif + + if (number > 0) + { + if (number >= md->offset_end) md->offset_overflow = TRUE; else + { + md->offset_vector[number+1] = eptr - md->start_subject; + if (offset_top <= number) offset_top = number + 2; + } + } + + /* For a non-repeating ket, just advance to the next node and continue at + this level. */ + + if (*ecode == OP_KET) + { + ecode += 3; + break; + } + + /* The repeating kets try the rest of the pattern or restart from the + preceding bracket, in the appropriate order. 
*/ + + if (*ecode == OP_KETRMIN) + { + if (match(eptr, ecode+3, offset_top, md) || + match(eptr, prev, offset_top, md)) return TRUE; + } + else /* OP_KETRMAX */ + { + if (match(eptr, prev, offset_top, md) || + match(eptr, ecode+3, offset_top, md)) return TRUE; + } + } + return FALSE; + + /* Start of subject unless notbol, or after internal newline if multiline */ + + case OP_CIRC: + if (md->notbol && eptr == md->start_subject) return FALSE; + if (md->multiline) + { + if (eptr != md->start_subject && eptr[-1] != '\n') return FALSE; + ecode++; + break; + } + /* ... else fall through */ + + /* Start of subject assertion */ + + case OP_SOD: + if (eptr != md->start_subject) return FALSE; + ecode++; + break; + + /* Assert before internal newline if multiline, or before + a terminating newline unless endonly is set, else end of subject unless + noteol is set. */ + + case OP_DOLL: + if (md->noteol && eptr >= md->end_subject) return FALSE; + if (md->multiline) + { + if (eptr < md->end_subject && *eptr != '\n') return FALSE; + ecode++; + break; + } + else if (!md->endonly) + { + if (eptr < md->end_subject - 1 || + (eptr == md->end_subject - 1 && *eptr != '\n')) return FALSE; + ecode++; + break; + } + /* ... else fall through */ + + /* End of subject assertion */ + + case OP_EOD: + if (eptr < md->end_subject) return FALSE; + ecode++; + break; + + /* Word boundary assertions */ + + case OP_NOT_WORD_BOUNDARY: + case OP_WORD_BOUNDARY: + { + BOOL prev_is_word = (eptr != md->start_subject) && + ((pcre_ctypes[eptr[-1]] & ctype_word) != 0); + BOOL cur_is_word = (eptr < md->end_subject) && + ((pcre_ctypes[*eptr] & ctype_word) != 0); + if ((*ecode++ == OP_WORD_BOUNDARY)? 
+ cur_is_word == prev_is_word : cur_is_word != prev_is_word) + return FALSE; + } + break; + + /* Match a single character type; inline for speed */ + + case OP_ANY: + if (!md->dotall && eptr < md->end_subject && *eptr == '\n') return FALSE; + if (eptr++ >= md->end_subject) return FALSE; + ecode++; + break; + + case OP_NOT_DIGIT: + if (eptr >= md->end_subject || (pcre_ctypes[*eptr++] & ctype_digit) != 0) + return FALSE; + ecode++; + break; + + case OP_DIGIT: + if (eptr >= md->end_subject || (pcre_ctypes[*eptr++] & ctype_digit) == 0) + return FALSE; + ecode++; + break; + + case OP_NOT_WHITESPACE: + if (eptr >= md->end_subject || (pcre_ctypes[*eptr++] & ctype_space) != 0) + return FALSE; + ecode++; + break; + + case OP_WHITESPACE: + if (eptr >= md->end_subject || (pcre_ctypes[*eptr++] & ctype_space) == 0) + return FALSE; + ecode++; + break; + + case OP_NOT_WORDCHAR: + if (eptr >= md->end_subject || (pcre_ctypes[*eptr++] & ctype_word) != 0) + return FALSE; + ecode++; + break; + + case OP_WORDCHAR: + if (eptr >= md->end_subject || (pcre_ctypes[*eptr++] & ctype_word) == 0) + return FALSE; + ecode++; + break; + + /* Match a back reference, possibly repeatedly. Look past the end of the + item to see if there is repeat information following. The code is similar + to that for character classes, but repeated for efficiency. Then obey + similar code to character type repeats - written out again for speed. + However, if the referenced string is the empty string, always treat + it as matched, any number of times (otherwise there could be infinite + loops). 
*/ + + case OP_REF: + { + int length; + int number = ecode[1] << 1; /* Doubled reference number */ + ecode += 2; /* Advance past the item */ + + if (number >= offset_top || md->offset_vector[number] < 0) + { + md->errorcode = PCRE_ERROR_BADREF; + return FALSE; + } + + length = md->offset_vector[number+1] - md->offset_vector[number]; + + switch (*ecode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + c = *ecode++ - OP_CRSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + minimize = (*ecode == OP_CRMINRANGE); + min = (ecode[1] << 8) + ecode[2]; + max = (ecode[3] << 8) + ecode[4]; + if (max == 0) max = INT_MAX; + ecode += 5; + break; + + default: /* No repeat follows */ + if (!match_ref(number, eptr, length, md)) return FALSE; + eptr += length; + continue; /* With the main loop */ + } + + /* If the length of the reference is zero, just continue with the + main loop. */ + + if (length == 0) continue; + + /* First, ensure the minimum number of matches are present. We get back + the length of the reference string explicitly rather than passing the + address of eptr, so that eptr can be a register variable. */ + + for (i = 1; i <= min; i++) + { + if (!match_ref(number, eptr, length, md)) return FALSE; + eptr += length; + } + + /* If min = max, continue at the same level without recursion. + They are not both allowed to be zero. 
*/ + + if (min == max) continue; + + /* If minimizing, keep trying and advancing the pointer */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md)) return TRUE; + if (i >= max || !match_ref(number, eptr, length, md)) + return FALSE; + eptr += length; + } + /* Control never gets here */ + } + + /* If maximizing, find the longest string and work backwards */ + + else + { + uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (!match_ref(number, eptr, length, md)) break; + eptr += length; + } + while (eptr >= pp) + { + if (match(eptr, ecode, offset_top, md)) return TRUE; + eptr -= length; + } + return FALSE; + } + } + /* Control never gets here */ + + /* Match a character class, possibly repeatedly. Look past the end of the + item to see if there is repeat information following. Then obey similar + code to character type repeats - written out again for speed. If caseless + matching was set at runtime but not at compile time, we have to check both + versions of a character. 
*/ + + case OP_CLASS: + { + uschar *data = ecode + 1; /* Save for matching */ + ecode += 33; /* Advance past the item */ + + switch (*ecode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + c = *ecode++ - OP_CRSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + minimize = (*ecode == OP_CRMINRANGE); + min = (ecode[1] << 8) + ecode[2]; + max = (ecode[3] << 8) + ecode[4]; + if (max == 0) max = INT_MAX; + ecode += 5; + break; + + default: /* No repeat follows */ + if (eptr >= md->end_subject) return FALSE; + c = *eptr++; + if ((data[c/8] & (1 << (c&7))) != 0) continue; /* With main loop */ + if (md->runtime_caseless) + { + c = pcre_fcc[c]; + if ((data[c/8] & (1 << (c&7))) != 0) continue; /* With main loop */ + } + return FALSE; + } + + /* First, ensure the minimum number of matches are present. */ + + for (i = 1; i <= min; i++) + { + if (eptr >= md->end_subject) return FALSE; + c = *eptr++; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + if (md->runtime_caseless) + { + c = pcre_fcc[c]; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + } + return FALSE; + } + + /* If max == min we can continue with the main loop without the + need to recurse. */ + + if (min == max) continue; + + /* If minimizing, keep testing the rest of the expression and advancing + the pointer while it matches the class. 
*/ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md)) return TRUE; + if (i >= max || eptr >= md->end_subject) return FALSE; + c = *eptr++; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + if (md->runtime_caseless) + { + c = pcre_fcc[c]; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + } + return FALSE; + } + /* Control never gets here */ + } + + /* If maximizing, find the longest possible run, then work backwards. */ + + else + { + uschar *pp = eptr; + for (i = min; i < max; eptr++, i++) + { + if (eptr >= md->end_subject) break; + c = *eptr; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + if (md->runtime_caseless) + { + c = pcre_fcc[c]; + if ((data[c/8] & (1 << (c&7))) != 0) continue; + } + break; + } + + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md)) return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a run of characters */ + + case OP_CHARS: + { + register int length = ecode[1]; + ecode += 2; + + #ifdef DEBUG + if (eptr >= md->end_subject) + printf("matching subject <null> against pattern "); + else + { + printf("matching subject "); + pchars(eptr, length, TRUE, md); + printf(" against pattern "); + } + pchars(ecode, length, FALSE, md); + printf("\n"); + #endif + + if (length > md->end_subject - eptr) return FALSE; + if (md->caseless) + { + while (length-- > 0) if (pcre_lcc[*ecode++] != pcre_lcc[*eptr++]) return FALSE; + } + else + { + while (length-- > 0) if (*ecode++ != *eptr++) return FALSE; + } + } + break; + + /* Match a single character repeatedly; different opcodes share code. 
*/ + + case OP_EXACT: + min = max = (ecode[1] << 8) + ecode[2]; + ecode += 3; + goto REPEATCHAR; + + case OP_UPTO: + case OP_MINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_MINUPTO; + ecode += 3; + goto REPEATCHAR; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + c = *ecode++ - OP_STAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single-character matches. We can give + up quickly if there are fewer than the minimum number of characters left in + the subject. */ + + REPEATCHAR: + if (min > md->end_subject - eptr) return FALSE; + c = *ecode++; + + /* The code is duplicated for the caseless and caseful cases, for speed, + since matching characters is likely to be quite common. First, ensure the + minimum number of matches are present. If min = max, continue at the same + level without recursing. Otherwise, if minimizing, keep trying the rest of + the expression and advancing one matching character if failing, up to the + maximum. Alternatively, if maximizing, find the maximum number of + characters and work backwards. 
*/ + + #ifdef DEBUG + printf("matching %c{%d,%d} against subject %.*s\n", c, min, max, + max, eptr); + #endif + + if (md->caseless) + { + c = pcre_lcc[c]; + for (i = 1; i <= min; i++) if (c != pcre_lcc[*eptr++]) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md)) return TRUE; + if (i >= max || eptr >= md->end_subject || c != pcre_lcc[*eptr++]) + return FALSE; + } + /* Control never gets here */ + } + else + { + uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c != pcre_lcc[*eptr]) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md)) return TRUE; + return FALSE; + } + /* Control never gets here */ + } + + /* Caseful comparisons */ + + else + { + for (i = 1; i <= min; i++) if (c != *eptr++) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md)) return TRUE; + if (i >= max || eptr >= md->end_subject || c != *eptr++) return FALSE; + } + /* Control never gets here */ + } + else + { + uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c != *eptr) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md)) return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a negated single character */ + + case OP_NOT: + if (eptr >= md->end_subject) return FALSE; + ecode++; + if (md->caseless) + { + if (pcre_lcc[*ecode++] == pcre_lcc[*eptr++]) return FALSE; + } + else + { + if (*ecode++ == *eptr++) return FALSE; + } + break; + + /* Match a negated single character repeatedly. This is almost a repeat of + the code for a repeated single character, but I haven't found a nice way of + commoning these up that doesn't require a test of the positive/negative + option for each character match. 
Maybe that wouldn't add very much to the + time taken, but character matching *is* what this is all about... */ + + case OP_NOTEXACT: + min = max = (ecode[1] << 8) + ecode[2]; + ecode += 3; + goto REPEATNOTCHAR; + + case OP_NOTUPTO: + case OP_NOTMINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_NOTMINUPTO; + ecode += 3; + goto REPEATNOTCHAR; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + c = *ecode++ - OP_NOTSTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single-character matches. We can give + up quickly if there are fewer than the minimum number of characters left in + the subject. */ + + REPEATNOTCHAR: + if (min > md->end_subject - eptr) return FALSE; + c = *ecode++; + + /* The code is duplicated for the caseless and caseful cases, for speed, + since matching characters is likely to be quite common. First, ensure the + minimum number of matches are present. If min = max, continue at the same + level without recursing. Otherwise, if minimizing, keep trying the rest of + the expression and advancing one matching character if failing, up to the + maximum. Alternatively, if maximizing, find the maximum number of + characters and work backwards. 
*/ + + #ifdef DEBUG + printf("negative matching %c{%d,%d} against subject %.*s\n", c, min, max, + max, eptr); + #endif + + if (md->caseless) + { + c = pcre_lcc[c]; + for (i = 1; i <= min; i++) if (c == pcre_lcc[*eptr++]) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md)) return TRUE; + if (i >= max || eptr >= md->end_subject || c == pcre_lcc[*eptr++]) + return FALSE; + } + /* Control never gets here */ + } + else + { + uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c == pcre_lcc[*eptr]) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md)) return TRUE; + return FALSE; + } + /* Control never gets here */ + } + + /* Caseful comparisons */ + + else + { + for (i = 1; i <= min; i++) if (c == *eptr++) return FALSE; + if (min == max) continue; + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md)) return TRUE; + if (i >= max || eptr >= md->end_subject || c == *eptr++) return FALSE; + } + /* Control never gets here */ + } + else + { + uschar *pp = eptr; + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || c == *eptr) break; + eptr++; + } + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md)) return TRUE; + return FALSE; + } + } + /* Control never gets here */ + + /* Match a single character type repeatedly; several different opcodes + share code. This is very similar to the code for single characters, but we + repeat it in the interests of efficiency. 
*/ + + case OP_TYPEEXACT: + min = max = (ecode[1] << 8) + ecode[2]; + minimize = TRUE; + ecode += 3; + goto REPEATTYPE; + + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + min = 0; + max = (ecode[1] << 8) + ecode[2]; + minimize = *ecode == OP_TYPEMINUPTO; + ecode += 3; + goto REPEATTYPE; + + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + c = *ecode++ - OP_TYPESTAR; + minimize = (c & 1) != 0; + min = rep_min[c]; /* Pick up values from tables; */ + max = rep_max[c]; /* zero for max => infinity */ + if (max == 0) max = INT_MAX; + + /* Common code for all repeated single character type matches */ + + REPEATTYPE: + ctype = *ecode++; /* Code for the character type */ + + /* First, ensure the minimum number of matches are present. Use inline + code for maximizing the speed, and do the type test once at the start + (i.e. keep it out of the loop). Also test that there are at least the + minimum number of characters before we start. 
*/ + + if (min > md->end_subject - eptr) return FALSE; + if (min > 0) switch(ctype) + { + case OP_ANY: + if (!md->dotall) + { for (i = 1; i <= min; i++) if (*eptr++ == '\n') return FALSE; } + else eptr += min; + break; + + case OP_NOT_DIGIT: + for (i = 1; i <= min; i++) + if ((pcre_ctypes[*eptr++] & ctype_digit) != 0) return FALSE; + break; + + case OP_DIGIT: + for (i = 1; i <= min; i++) + if ((pcre_ctypes[*eptr++] & ctype_digit) == 0) return FALSE; + break; + + case OP_NOT_WHITESPACE: + for (i = 1; i <= min; i++) + if ((pcre_ctypes[*eptr++] & ctype_space) != 0) return FALSE; + break; + + case OP_WHITESPACE: + for (i = 1; i <= min; i++) + if ((pcre_ctypes[*eptr++] & ctype_space) == 0) return FALSE; + break; + + case OP_NOT_WORDCHAR: + for (i = 1; i <= min; i++) if ((pcre_ctypes[*eptr++] & ctype_word) != 0) + return FALSE; + break; + + case OP_WORDCHAR: + for (i = 1; i <= min; i++) if ((pcre_ctypes[*eptr++] & ctype_word) == 0) + return FALSE; + break; + } + + /* If min = max, continue at the same level without recursing */ + + if (min == max) continue; + + /* If minimizing, we have to test the rest of the pattern before each + subsequent match, so inlining isn't much help; just use the function. */ + + if (minimize) + { + for (i = min;; i++) + { + if (match(eptr, ecode, offset_top, md)) return TRUE; + if (i >= max || eptr >= md->end_subject || + !match_type(ctype, *eptr++, md->dotall)) + return FALSE; + } + /* Control never gets here */ + } + + /* If maximizing it is worth using inline code for speed, doing the type + test once at the start (i.e. keep it out of the loop). 
*/ + + else + { + uschar *pp = eptr; + switch(ctype) + { + case OP_ANY: + if (!md->dotall) + { + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || *eptr == '\n') break; + eptr++; + } + } + else + { + c = max - min; + if (c > md->end_subject - eptr) c = md->end_subject - eptr; + eptr += c; + } + break; + + case OP_NOT_DIGIT: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (pcre_ctypes[*eptr] & ctype_digit) != 0) + break; + eptr++; + } + break; + + case OP_DIGIT: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (pcre_ctypes[*eptr] & ctype_digit) == 0) + break; + eptr++; + } + break; + + case OP_NOT_WHITESPACE: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (pcre_ctypes[*eptr] & ctype_space) != 0) + break; + eptr++; + } + break; + + case OP_WHITESPACE: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (pcre_ctypes[*eptr] & ctype_space) == 0) + break; + eptr++; + } + break; + + case OP_NOT_WORDCHAR: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (pcre_ctypes[*eptr] & ctype_word) != 0) + break; + eptr++; + } + break; + + case OP_WORDCHAR: + for (i = min; i < max; i++) + { + if (eptr >= md->end_subject || (pcre_ctypes[*eptr] & ctype_word) == 0) + break; + eptr++; + } + break; + } + + while (eptr >= pp) + if (match(eptr--, ecode, offset_top, md)) return TRUE; + return FALSE; + } + /* Control never gets here */ + + /* There's been some horrible disaster. */ + + default: + #ifdef DEBUG + printf("Unknown opcode %d\n", *ecode); + #endif + md->errorcode = PCRE_ERROR_UNKNOWN_NODE; + return FALSE; + } + + /* Do not stick any code in here without much thought; it is assumed + that "continue" in the code above comes out to here to repeat the main + loop. 
*/ + + } /* End of main loop */ +/* Control never reaches here */ +} + + + +/************************************************* +* Execute a Regular Expression * +*************************************************/ + +/* This function applies a compiled re to a subject string and picks out +portions of the string if it matches. Two elements in the vector are set for +each substring: the offsets to the start and end of the substring. + +Arguments: + external_re points to the compiled expression + external_extra points to "hints" from pcre_study() or is NULL + subject points to the subject string + length length of subject string (may contain binary zeros) + options option bits + offsets points to a vector of ints to be filled in with offsets + offsetcount the number of elements in the vector + +Returns: > 0 => success; value is the number of elements filled in + = 0 => success, but offsets is not big enough + -1 => failed to match + < -1 => some kind of unexpected problem +*/ + +int +pcre_exec(const pcre *external_re, const pcre_extra *external_extra, + const char *subject, int length, int options, int *offsets, int offsetcount) +{ +int resetcount; +int ocount = offsetcount; +int first_char = -1; +match_data match_block; +uschar *start_bits = NULL; +uschar *start_match = (uschar *)subject; +uschar *end_subject; +real_pcre *re = (real_pcre *)external_re; +real_pcre_extra *extra = (real_pcre_extra *)external_extra; +BOOL anchored = ((re->options | options) & PCRE_ANCHORED) != 0; +BOOL startline = (re->options & PCRE_STARTLINE) != 0; + +if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION; + +if (re == NULL || subject == NULL || + (offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL; +if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC; + +match_block.start_subject = (uschar *)subject; +match_block.end_subject = match_block.start_subject + length; +end_subject = match_block.end_subject; + +match_block.caseless = ((re->options | 
options) & PCRE_CASELESS) != 0; +match_block.runtime_caseless = match_block.caseless && + (re->options & PCRE_CASELESS) == 0; + +match_block.multiline = ((re->options | options) & PCRE_MULTILINE) != 0; +match_block.dotall = ((re->options | options) & PCRE_DOTALL) != 0; +match_block.endonly = ((re->options | options) & PCRE_DOLLAR_ENDONLY) != 0; + +match_block.notbol = (options & PCRE_NOTBOL) != 0; +match_block.noteol = (options & PCRE_NOTEOL) != 0; + +match_block.errorcode = PCRE_ERROR_NOMATCH; /* Default error */ + +/* If the expression has got more back references than the offsets supplied can +hold, we get a temporary bit of working store to use during the matching. +Otherwise, we can use the vector supplied, rounding down the size of it to a +multiple of 2. */ + +ocount &= (-2); +if (re->top_backref > 0 && re->top_backref + 1 >= ocount/2) + { + ocount = re->top_backref * 2 + 2; + match_block.offset_vector = (pcre_malloc)(ocount * sizeof(int)); + if (match_block.offset_vector == NULL) return PCRE_ERROR_NOMEMORY; + #ifdef DEBUG + printf("Got memory to hold back references\n"); + #endif + } +else match_block.offset_vector = offsets; + +match_block.offset_end = ocount; +match_block.offset_overflow = FALSE; + +/* Compute the minimum number of offsets that we need to reset each time. Doing +this makes a huge difference to execution time when there aren't many brackets +in the pattern. */ + +resetcount = 2 + re->top_bracket * 2; +if (resetcount > offsetcount) resetcount = ocount; + +/* If MULTILINE is set at exec time but was not set at compile time, and the +anchored flag is set, we must re-check because a setting provoked by ^ in the +pattern is not right in multi-line mode. Calling is_anchored() again here does +the right check, because multiline is now set. If it now yields FALSE, the +expression must have had ^ starting some of its branches. Check to see if +that is true for *all* branches, and if so, set the startline flag. */ + +if (match_block. 
multiline && anchored && (re->options & PCRE_MULTILINE) == 0 && + !is_anchored(re->code, match_block.multiline)) + { + anchored = FALSE; + if (is_startline(re->code)) startline = TRUE; + } + +/* Set up the first character to match, if available. The first_char value is +never set for an anchored regular expression, but the anchoring may be forced +at run time, so we have to test for anchoring. The first char may be unset for +an unanchored pattern, of course. If there's no first char and the pattern was +studied, there may be a bitmap of possible first characters. However, we can +use this only if the caseless state of the studying was correct. */ + +if (!anchored) + { + if ((re->options & PCRE_FIRSTSET) != 0) + { + first_char = re->first_char; + if (match_block.caseless) first_char = pcre_lcc[first_char]; + } + else + if (!startline && extra != NULL && + (extra->options & PCRE_STUDY_MAPPED) != 0 && + ((extra->options & PCRE_STUDY_CASELESS) != 0) == match_block.caseless) + start_bits = extra->start_bits; + } + +/* Loop for unanchored matches; for anchored regexps the loop runs just once. */ + +do + { + register int *iptr = match_block.offset_vector; + register int *iend = iptr + resetcount; + + /* Reset the maximum number of extractions we might see. 
*/ + + while (iptr < iend) *iptr++ = -1; + + /* Advance to a unique first char if possible */ + + if (first_char >= 0) + { + if (match_block.caseless) + while (start_match < end_subject && pcre_lcc[*start_match] != first_char) + start_match++; + else + while (start_match < end_subject && *start_match != first_char) + start_match++; + } + + /* Or to just after \n for a multiline match if possible */ + + else if (startline) + { + if (start_match > match_block.start_subject) + { + while (start_match < end_subject && start_match[-1] != '\n') + start_match++; + } + } + + /* Or to a non-unique first char */ + + else if (start_bits != NULL) + { + while (start_match < end_subject) + { + register int c = *start_match; + if ((start_bits[c/8] & (1 << (c&7))) == 0) start_match++; else break; + } + } + + #ifdef DEBUG + printf(">>>> Match against: "); + pchars(start_match, end_subject - start_match, TRUE, &match_block); + printf("\n"); + #endif + + /* When a match occurs, substrings will be set for all internal extractions; + we just need to set up the whole thing as substring 0 before returning. If + there were too many extractions, set the return code to zero. In the case + where we had to get some local store to hold offsets for backreferences, copy + those back references that we can. In this case there need not be overflow + if certain parts of the pattern were not used. + + Before starting the match, we have to set up a longjmp() target to enable + the "cut" operation to fail a match completely without backtracking. 
*/ + + if (setjmp(match_block.fail_env) == 0 && + match(start_match, re->code, 2, &match_block)) + { + int rc; + + if (ocount != offsetcount) + { + if (offsetcount >= 4) + { + memcpy(offsets + 2, match_block.offset_vector + 2, + (offsetcount - 2) * sizeof(int)); + #ifdef DEBUG + printf("Copied offsets; freeing temporary memory\n"); + #endif + } + if (match_block.end_offset_top > offsetcount) + match_block.offset_overflow = TRUE; + + #ifdef DEBUG + printf("Freeing temporary memory\n"); + #endif + + (pcre_free)(match_block.offset_vector); + } + + rc = match_block.offset_overflow? 0 : match_block.end_offset_top/2; + + if (match_block.offset_end < 2) rc = 0; else + { + offsets[0] = start_match - match_block.start_subject; + offsets[1] = match_block.end_match_ptr - match_block.start_subject; + } + + #ifdef DEBUG + printf(">>>> returning %d\n", rc); + #endif + return rc; + } + } +while (!anchored && + match_block.errorcode == PCRE_ERROR_NOMATCH && + start_match++ < end_subject); + +#ifdef DEBUG +printf(">>>> returning %d\n", match_block.errorcode); +#endif + +return match_block.errorcode; +} + +/* End of pcre.c */ @@ -0,0 +1,57 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* Copyright (c) 1997 University of Cambridge */ + +#ifndef _PCRE_H +#define _PCRE_H + +/* Have to include stdlib.h in order to ensure that size_t is defined; +it is needed here for malloc. 
*/ + +#include <stdlib.h> + +/* Options */ + +#define PCRE_CASELESS 0x0001 +#define PCRE_EXTENDED 0x0002 +#define PCRE_ANCHORED 0x0004 +#define PCRE_MULTILINE 0x0008 +#define PCRE_DOTALL 0x0010 +#define PCRE_DOLLAR_ENDONLY 0x0020 +#define PCRE_EXTRA 0x0040 +#define PCRE_NOTBOL 0x0080 +#define PCRE_NOTEOL 0x0100 + +/* Exec-time error codes */ + +#define PCRE_ERROR_NOMATCH (-1) +#define PCRE_ERROR_BADREF (-2) +#define PCRE_ERROR_NULL (-3) +#define PCRE_ERROR_BADOPTION (-4) +#define PCRE_ERROR_BADMAGIC (-5) +#define PCRE_ERROR_UNKNOWN_NODE (-6) +#define PCRE_ERROR_NOMEMORY (-7) + +/* Types */ + +typedef void pcre; +typedef void pcre_extra; + +/* Store get and free functions. These can be set to alternative malloc/free +functions if required. */ + +extern void *(*pcre_malloc)(size_t); +extern void (*pcre_free)(void *); + +/* Functions */ + +extern pcre *pcre_compile(const char *, int, char **, int *); +extern int pcre_exec(const pcre *, const pcre_extra *, const char *, + int, int, int *, int); +extern int pcre_info(const pcre *, int *, int *); +extern pcre_extra *pcre_study(const pcre *, int, char **); +extern char *pcre_version(void); + +#endif /* End of pcre.h */ diff --git a/pcreposix.3 b/pcreposix.3 new file mode 100644 index 0000000..2c907e7 --- /dev/null +++ b/pcreposix.3 @@ -0,0 +1,135 @@ +.TH PCRE 3 +.SH NAME +pcreposix - POSIX API for Perl-compatible regular expressions. 
+.SH SYNOPSIS +.B #include <pcreposix.h> +.PP +.SM +.br +.B int regcomp(regex_t *\fIpreg\fR, const char *\fIpattern\fR, +.ti +5n +.B int \fIcflags\fR); +.PP +.br +.B int regexec(regex_t *\fIpreg\fR, const char *\fIstring\fR, +.ti +5n +.B size_t \fInmatch\fR, regmatch_t \fIpmatch\fR[], int \fIeflags\fR); +.PP +.br +.B size_t regerror(int \fIerrcode\fR, const regex_t *\fIpreg\fR, +.ti +5n +.B char *\fIerrbuf\fR, size_t \fIerrbuf_size\fR); +.PP +.br +.B void regfree(regex_t *\fIpreg\fR); + + +.SH DESCRIPTION +This set of functions provides a POSIX-style API to the PCRE regular expression +package. See \fBpcre (3)\fR for a description of the native API, which contains +additional functionality. The functions described here are just wrapper +functions that ultimately call the native API. + +As I am pretty ignorant about POSIX, these functions must be considered as +experimental. I have implemented only those option bits that can be reasonably +mapped to PCRE native options. Other POSIX options are not even defined. It may +be that it is useful to define, but ignore, other options. Feedback from more +knowledgeable folk may cause this kind of detail to change. + +When PCRE is called via these functions, it is only the API that is POSIX-like +in style. The syntax and semantics of the regular expressions themselves are +still those of Perl, subject to the setting of various PCRE options, as +described below. + +The header for these functions is supplied as \fBpcreposix.h\fR to avoid any +potential clash with other POSIX libraries. It can, of course, be renamed or +aliased as \fBregex.h\fR, which is the "correct" name. It provides two +structure types, \fIregex_t\fR for compiled internal forms, and +\fIregmatch_t\fR for returning captured substrings. It also defines some +constants whose names start with "REG_"; these are used for setting options and +identifying error codes. 
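The description above names the two structure types, regex_t and regmatch_t, and the four wrapper functions. The following is a minimal sketch of how a caller might combine them, not code from the distribution. It assumes the system header regex.h, which declares the same four functions; with this package the include would be pcreposix.h instead. The helper name find_first and its parameters are hypothetical. The pattern passed in should avoid syntax that differs between POSIX and Perl, because the wrapper described here always uses Perl syntax.

```c
#include <regex.h>   /* assumption: system POSIX header; pcreposix.h declares the same API */
#include <stdio.h>

/* Hypothetical helper: compile pattern, match it once against subject, and
   report the offsets of the whole match (element 0 of the pmatch vector).
   Returns 0 on a match; otherwise returns the non-zero code from regcomp()
   or regexec() and places the corresponding message in errbuf. */
int find_first(const char *pattern, const char *subject, int cflags,
               int *start, int *end, char *errbuf, size_t errbuf_size)
{
    regex_t re;
    regmatch_t pmatch[1];
    int rc;

    rc = regcomp(&re, pattern, cflags);
    if (rc != 0) {
        /* Compilation failed: translate the code into text and give up. */
        regerror(rc, &re, errbuf, errbuf_size);
        return rc;
    }

    rc = regexec(&re, subject, 1, pmatch, 0);
    if (rc == 0) {
        *start = (int)pmatch[0].rm_so;   /* offset of first matched character */
        *end = (int)pmatch[0].rm_eo;     /* offset just past the match */
    } else {
        regerror(rc, &re, errbuf, errbuf_size);
    }

    regfree(&re);   /* release the store associated with the compiled form */
    return rc;
}
```

A caller might invoke it as find_first("[0-9][0-9]*", "abc 1234 xyz", 0, &s, &e, buf, sizeof buf) and then print the substring between offsets s and e. A return of REG_NOMATCH is the expected failure when the subject contains no match.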
+ + +.SH COMPILING A PATTERN + +The function \fBregcomp()\fR is called to compile a pattern into an +internal form. The pattern is a C string terminated by a binary zero, and +is passed in the argument \fIpattern\fR. The \fIpreg\fR argument is a pointer +to a regex_t structure which is used as a base for storing information about +the compiled expression. + +The argument \fIcflags\fR is either zero, or contains one or more of the bits +defined by the following macros: + + REG_ICASE + +The PCRE_CASELESS option is set when the expression is passed for compilation +to the native function. + + REG_NEWLINE + +The PCRE_MULTILINE option is set when the expression is passed for compilation +to the native function. + +The yield of \fBregcomp()\fR is zero on success, and non-zero otherwise. The +\fIpreg\fR structure is filled in on success, and one member of the structure +is publicized: \fIre_nsub\fR contains the number of capturing subpatterns in +the regular expression. Various error codes are defined in the header file. + + +.SH MATCHING A PATTERN +The function \fBregexec()\fR is called to match a pre-compiled pattern +\fIpreg\fR against a given \fIstring\fR, which is terminated by a zero byte, +subject to the options in \fIeflags\fR. These can be: + + REG_NOTBOL + +The PCRE_NOTBOL option is set when calling the underlying PCRE matching +function. + + REG_NOTEOL + +The PCRE_NOTEOL option is set when calling the underlying PCRE matching +function. + +The portion of the string that was matched, and also any captured substrings, +are returned via the \fIpmatch\fR argument, which points to an array of +\fInmatch\fR structures of type \fIregmatch_t\fR, containing the members +\fIrm_so\fR and \fIrm_eo\fR. These contain the offset to the first character of +each substring and the offset to the first character after the end of each +substring, respectively. 
The 0th element of the vector relates to the entire +portion of \fIstring\fR that was matched; subsequent elements relate to the +capturing subpatterns of the regular expression. Unused entries in the array +have both structure members set to -1. + +A successful match yields a zero return; various error codes are defined in the +header file, of which REG_NOMATCH is the "expected" failure code. + + +.SH ERROR MESSAGES +The \fBregerror()\fR function maps a non-zero errorcode from either +\fBregcomp\fR or \fBregexec\fR to a printable message. If \fIpreg\fR is not +NULL, the error should have arisen from the use of that structure. A message +terminated by a binary zero is placed in \fIerrbuf\fR. The length of the +message, including the zero, is limited to \fIerrbuf_size\fR. The yield of the +function is the size of buffer needed to hold the whole message. + + +.SH STORAGE +Compiling a regular expression causes memory to be allocated and associated +with the \fIpreg\fR structure. The function \fBregfree()\fR frees all such +memory, after which \fIpreg\fR may no longer be used as a compiled expression. + + +.SH AUTHOR +Philip Hazel <ph10@cam.ac.uk> +.br +University Computing Service, +.br +New Museums Site, +.br +Cambridge CB2 3QG, England. +.br +Phone: +44 1223 334714 + +Copyright (c) 1997 University of Cambridge. diff --git a/pcreposix.c b/pcreposix.c new file mode 100644 index 0000000..1fbd9ac --- /dev/null +++ b/pcreposix.c @@ -0,0 +1,246 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals. + +This module is a wrapper that provides a POSIX API to the underlying PCRE +functions. 
+ +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. +----------------------------------------------------------------------------- +*/ + +#include "internal.h" +#include "pcreposix.h" +#include "stdlib.h" + + + +/* Corresponding tables of PCRE error messages and POSIX error codes. */ + +static char *estring[] = { + ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9, ERR10, + ERR11, ERR12, ERR13, ERR14, ERR15, ERR16, ERR17, ERR18, ERR19, ERR20, + ERR21, ERR22, ERR23 }; + +static int eint[] = { + REG_EESCAPE, /* "\\ at end of pattern" */ + REG_EESCAPE, /* "\\c at end of pattern" */ + REG_EESCAPE, /* "unrecognized character follows \\" */ + REG_BADBR, /* "numbers out of order in {} quantifier" */ + REG_BADBR, /* "number too big in {} quantifier" */ + REG_EBRACK, /* "missing terminating ] for character class" */ + REG_ECTYPE, /* "invalid escape sequence in character class" */ + REG_ERANGE, /* "range out of order in character class" */ + REG_BADRPT, /* "nothing to repeat" */ + REG_BADRPT, /* "operand of unlimited repeat could match the empty string" */ + REG_ASSERT, /* "internal error: unexpected repeat" */ + REG_BADPAT, /* "unrecognized character after (?" 
*/ + REG_ESIZE, /* "too many capturing parenthesized sub-patterns" */ + REG_EPAREN, /* "missing )" */ + REG_ESUBREG, /* "back reference to non-existent subpattern" */ + REG_INVARG, /* "erroffset passed as NULL" */ + REG_INVARG, /* "unknown option bit(s) set" */ + REG_EPAREN, /* "missing ) after comment" */ + REG_ESIZE, /* "too many sets of parentheses" */ + REG_ESIZE, /* "regular expression too large" */ + REG_ESPACE, /* "failed to get memory" */ + REG_EPAREN, /* "unmatched brackets" */ + REG_ASSERT /* "internal error: code overflow" */ +}; + +/* Table of texts corresponding to POSIX error codes */ + +static char *pstring[] = { + "", /* Dummy for value 0 */ + "internal error", /* REG_ASSERT */ + "invalid repeat counts in {}", /* BADBR */ + "pattern error", /* BADPAT */ + "? * + invalid", /* BADRPT */ + "unbalanced {}", /* EBRACE */ + "unbalanced []", /* EBRACK */ + "collation error - not relevant", /* ECOLLATE */ + "bad class", /* ECTYPE */ + "bad escape sequence", /* EESCAPE */ + "empty expression", /* EMPTY */ + "unbalanced ()", /* EPAREN */ + "bad range inside []", /* ERANGE */ + "expression too big", /* ESIZE */ + "failed to get memory", /* ESPACE */ + "bad back reference", /* ESUBREG */ + "bad argument", /* INVARG */ + "match failed" /* NOMATCH */ +}; + + + + +/************************************************* +* Translate PCRE text code to int * +*************************************************/ + +/* PCRE compile-time errors are given as strings defined as macros. We can just +look them up in a table to turn them into POSIX-style error codes. 
*/ + +static int +pcre_posix_error_code(const char *s) +{ +int i; +for (i = 0; i < sizeof(estring)/sizeof(char *); i++) + if (strcmp(s, estring[i]) == 0) return eint[i]; +return REG_ASSERT; +} + + + +/************************************************* +* Translate error code to string * +*************************************************/ + +size_t +regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size) +{ +char *message, *addmessage; +int length, adlength; + +message = (errcode >= sizeof(pstring)/sizeof(char *))? + "unknown error code" : pstring[errcode]; + +length = (int)strlen(message) + 1; + +if (preg != NULL && (int)preg->re_erroffset != -1) + { + addmessage = " at offset "; + adlength = (int)strlen(addmessage) + 6; + } +else adlength = 0; + +if (errbuf_size > 0) + { + if (adlength > 0 && errbuf_size >= length + adlength) + sprintf(errbuf, "%s%s%-6d", message, addmessage, preg->re_erroffset); + else + { + strncpy(errbuf, message, errbuf_size - 1); + errbuf[errbuf_size-1] = 0; + } + } + +return length + adlength; +} + + + + +/************************************************* +* Free store held by a regex * +*************************************************/ + +void +regfree(regex_t *preg) +{ +(pcre_free)(preg->re_pcre); +} + + + + +/************************************************* +* Compile a regular expression * +*************************************************/ + +/* +Arguments: + preg points to a structure for recording the compiled expression + pattern the pattern to compile + cflags compilation flags + +Returns: 0 on success + various non-zero codes on failure +*/ + +int +regcomp(regex_t *preg, const char *pattern, int cflags) +{ +char *errorptr; +int erroffset; +int options = 0; + +if ((cflags & REG_ICASE) != 0) options |= PCRE_CASELESS; +if ((cflags & REG_NEWLINE) != 0) options |= PCRE_MULTILINE; + +preg->re_pcre = pcre_compile(pattern, options, &errorptr, &erroffset); +preg->re_erroffset = erroffset; + +if (preg->re_pcre == NULL) 
return pcre_posix_error_code(errorptr); + +preg->re_nsub = pcre_info(preg->re_pcre, NULL, NULL); +return 0; +} + + + + +/************************************************* +* Match a regular expression * +*************************************************/ + +int +regexec(regex_t *preg, const char *string, size_t nmatch, + regmatch_t pmatch[], int eflags) +{ +int rc; +int options = 0; + +if ((eflags & REG_NOTBOL) != 0) options |= PCRE_NOTBOL; +if ((eflags & REG_NOTEOL) != 0) options |= PCRE_NOTEOL; + +preg->re_erroffset = -1; /* Only has meaning after compile */ + +rc = pcre_exec(preg->re_pcre, NULL, string, (int)strlen(string), options, + (int *)pmatch, nmatch * 2); + +if (rc == 0) return 0; /* All pmatch were filled in */ + +if (rc > 0) + { + int i; + for (i = rc; i < nmatch; i++) pmatch[i].rm_so = pmatch[i].rm_eo = -1; + return 0; + } + +else switch(rc) + { + case PCRE_ERROR_NOMATCH: return REG_NOMATCH; + case PCRE_ERROR_BADREF: return REG_ESUBREG; + case PCRE_ERROR_NULL: return REG_INVARG; + case PCRE_ERROR_BADOPTION: return REG_INVARG; + case PCRE_ERROR_BADMAGIC: return REG_INVARG; + case PCRE_ERROR_UNKNOWN_NODE: return REG_ASSERT; + case PCRE_ERROR_NOMEMORY: return REG_ESPACE; + default: return REG_ASSERT; + } +} + +/* End of pcreposix.c */ diff --git a/pcreposix.h b/pcreposix.h new file mode 100644 index 0000000..1d0f16a --- /dev/null +++ b/pcreposix.h @@ -0,0 +1,72 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* Copyright (c) 1997 University of Cambridge */ + +#ifndef _PCREPOSIX_H +#define _PCREPOSIX_H + +/* This is the header for the POSIX wrapper interface to the PCRE Perl- +Compatible Regular Expression library. It defines the things POSIX says should +be there. I hope. */ + +/* Have to include stdlib.h in order to ensure that size_t is defined. */ + +#include <stdlib.h> + +/* Options defined by POSIX. 
*/ + +#define REG_ICASE 0x01 +#define REG_NEWLINE 0x02 +#define REG_NOTBOL 0x04 +#define REG_NOTEOL 0x08 + +/* Error values. Not all these are relevant or used by the wrapper. */ + +enum { + REG_ASSERT = 1, /* internal error ? */ + REG_BADBR, /* invalid repeat counts in {} */ + REG_BADPAT, /* pattern error */ + REG_BADRPT, /* ? * + invalid */ + REG_EBRACE, /* unbalanced {} */ + REG_EBRACK, /* unbalanced [] */ + REG_ECOLLATE, /* collation error - not relevant */ + REG_ECTYPE, /* bad class */ + REG_EESCAPE, /* bad escape sequence */ + REG_EMPTY, /* empty expression */ + REG_EPAREN, /* unbalanced () */ + REG_ERANGE, /* bad range inside [] */ + REG_ESIZE, /* expression too big */ + REG_ESPACE, /* failed to get memory */ + REG_ESUBREG, /* bad back reference */ + REG_INVARG, /* bad argument */ + REG_NOMATCH /* match failed */ +}; + + +/* The structure representing a compiled regular expression. */ + +typedef struct { + void *re_pcre; + size_t re_nsub; + size_t re_erroffset; +} regex_t; + +/* The structure in which a captured offset is returned. */ + +typedef int regoff_t; + +typedef struct { + regoff_t rm_so; + regoff_t rm_eo; +} regmatch_t; + +/* The functions */ + +extern int regcomp(regex_t *, const char *, int); +extern int regexec(regex_t *, const char *, size_t, regmatch_t *, int); +extern size_t regerror(int, const regex_t *, char *, size_t); +extern void regfree(regex_t *); + +#endif /* End of pcreposix.h */ diff --git a/pcretest.c b/pcretest.c new file mode 100644 index 0000000..6b558b2 --- /dev/null +++ b/pcretest.c @@ -0,0 +1,771 @@ +/************************************************* +* PCRE testing program * +*************************************************/ + +#include <ctype.h> +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <time.h> + +/* Use the internal info for displaying the results of pcre_study(). 
*/ + +#include "internal.h" +#include "pcreposix.h" + +#ifndef CLOCKS_PER_SEC +#ifdef CLK_TCK +#define CLOCKS_PER_SEC CLK_TCK +#else +#define CLOCKS_PER_SEC 100 +#endif +#endif + + +static FILE *outfile; +static int log_store = 0; + + + +/* Debugging function to print the internal form of the regex. This is the same +code as contained in pcre.c under the DEBUG macro. */ + +static char *OP_names[] = { "End", "\\A", "\\B", "\\b", "\\D", "\\d", + "\\S", "\\s", "\\W", "\\w", "Cut", "\\Z", "^", "$", "Any", "chars", + "not", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", "{", + "*", "*?", "+", "+?", "?", "??", "{", "{", + "class", "Ref", + "Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not", "Once", + "Brazero", "Braminzero", "Bra" +}; + + +static void print_internals(pcre *re) +{ +unsigned char *code = ((real_pcre *)re)->code; + +printf("------------------------------------------------------------------\n"); + +for(;;) + { + int c; + int charlength; + + printf("%3d ", code - ((real_pcre *)re)->code); + + if (*code >= OP_BRA) + { + printf("%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA); + code += 2; + } + + else switch(*code) + { + case OP_END: + printf(" %s\n", OP_names[*code]); + printf("------------------------------------------------------------------\n"); + return; + + case OP_CHARS: + charlength = *(++code); + printf("%3d ", charlength); + while (charlength-- > 0) + if (isprint(c = *(++code))) printf("%c", c); else printf("\\x%02x", c); + break; + + case OP_KETRMAX: + case OP_KETRMIN: + case OP_ALT: + case OP_KET: + case OP_ASSERT: + case OP_ASSERT_NOT: + case OP_ONCE: + printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]); + code += 2; + break; + + case OP_STAR: + case OP_MINSTAR: + case OP_PLUS: + case OP_MINPLUS: + case OP_QUERY: + case OP_MINQUERY: + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + case 
OP_TYPEQUERY: + case OP_TYPEMINQUERY: + if (*code >= OP_TYPESTAR) + printf(" %s", OP_names[code[1]]); + else if (isprint(c = code[1])) printf(" %c", c); + else printf(" \\x%02x", c); + printf("%s", OP_names[*code++]); + break; + + case OP_EXACT: + case OP_UPTO: + case OP_MINUPTO: + if (isprint(c = code[3])) printf(" %c{", c); + else printf(" \\x%02x{", c); + if (*code != OP_EXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_MINUPTO) printf("?"); + code += 3; + break; + + case OP_TYPEEXACT: + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + printf(" %s{", OP_names[code[3]]); + if (*code != OP_TYPEEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_TYPEMINUPTO) printf("?"); + code += 3; + break; + + case OP_NOT: + if (isprint(c = *(++code))) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + break; + + case OP_NOTSTAR: + case OP_NOTMINSTAR: + case OP_NOTPLUS: + case OP_NOTMINPLUS: + case OP_NOTQUERY: + case OP_NOTMINQUERY: + if (isprint(c = code[1])) printf(" [^%c]", c); + else printf(" [^\\x%02x]", c); + printf("%s", OP_names[*code++]); + break; + + case OP_NOTEXACT: + case OP_NOTUPTO: + case OP_NOTMINUPTO: + if (isprint(c = code[3])) printf(" [^%c]{", c); + else printf(" [^\\x%02x]{", c); + if (*code != OP_NOTEXACT) printf(","); + printf("%d}", (code[1] << 8) + code[2]); + if (*code == OP_NOTMINUPTO) printf("?"); + code += 3; + break; + + case OP_REF: + printf(" \\%d", *(++code)); + break; + + case OP_CLASS: + { + int i, min, max; + + code++; + printf(" ["); + + for (i = 0; i < 256; i++) + { + if ((code[i/8] & (1 << (i&7))) != 0) + { + int j; + for (j = i+1; j < 256; j++) + if ((code[j/8] & (1 << (j&7))) == 0) break; + if (i == '-' || i == ']') printf("\\"); + if (isprint(i)) printf("%c", i); else printf("\\x%02x", i); + if (--j > i) + { + printf("-"); + if (j == '-' || j == ']') printf("\\"); + if (isprint(j)) printf("%c", j); else printf("\\x%02x", j); + } + i = j; + } + } + printf("]"); + code += 32; + + 
switch(*code) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRPLUS: + case OP_CRMINPLUS: + case OP_CRQUERY: + case OP_CRMINQUERY: + printf("%s", OP_names[*code]); + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + min = (code[1] << 8) + code[2]; + max = (code[3] << 8) + code[4]; + if (max == 0) printf("{%d,}", min); + else printf("{%d,%d}", min, max); + if (*code == OP_CRMINRANGE) printf("?"); + code += 4; + break; + + default: + code--; + } + } + break; + + /* Anything else is just a one-node item */ + + default: + printf(" %s", OP_names[*code]); + break; + } + + code++; + printf("\n"); + } +} + + + +/* Character string printing function. */ + +static void pchars(unsigned char *p, int length) +{ +int c; +while (length-- > 0) + if (isprint(c = *(p++))) fprintf(outfile, "%c", c); + else fprintf(outfile, "\\x%02x", c); +} + + + +/* Alternative malloc function, to test functionality and show the size of the +compiled re. */ + +static void *new_malloc(size_t size) +{ +if (log_store) fprintf(outfile, "Store size request: %d\n", (int)size); +return malloc(size); +} + + + +/* Read lines from named file or stdin and write to named file or stdout; lines +consist of a regular expression, in delimiters and optionally followed by +options, followed by a set of test data, terminated by an empty line. */ + +int main(int argc, char **argv) +{ +FILE *infile = stdin; +int options = 0; +int study_options = 0; +int op = 1; +int timeit = 0; +int showinfo = 0; +int posix = 0; +int debug = 0; +unsigned char buffer[30000]; +unsigned char dbuffer[1024]; + +/* Static so that new_malloc can use it. 
*/ + +outfile = stdout; + +/* Scan options */ + +while (argc > 1 && argv[op][0] == '-') + { + if (strcmp(argv[op], "-s") == 0) log_store = 1; + else if (strcmp(argv[op], "-t") == 0) timeit = 1; + else if (strcmp(argv[op], "-i") == 0) showinfo = 1; + else if (strcmp(argv[op], "-d") == 0) showinfo = debug = 1; + else if (strcmp(argv[op], "-p") == 0) posix = 1; + else + { + printf("*** Unknown option %s\n", argv[op]); + return 1; + } + op++; + argc--; + } + +/* Sort out the input and output files */ + +if (argc > 1) + { + infile = fopen(argv[op], "r"); + if (infile == NULL) + { + printf("** Failed to open %s\n", argv[op]); + return 1; + } + } + +if (argc > 2) + { + outfile = fopen(argv[op+1], "w"); + if (outfile == NULL) + { + printf("** Failed to open %s\n", argv[op+1]); + return 1; + } + } + +/* Set alternative malloc function */ + +pcre_malloc = new_malloc; + +/* Heading line, then prompt for first re if stdin */ + +fprintf(outfile, "Testing Perl-Compatible Regular Expressions\n"); +fprintf(outfile, "PCRE version %s\n\n", pcre_version()); + +/* Main loop */ + +for (;;) + { + pcre *re = NULL; + pcre_extra *extra = NULL; + regex_t preg; + char *error; + unsigned char *p, *pp; + int do_study = 0; + int do_debug = 0; + int do_posix = 0; + int erroroffset, len, delimiter; + + if (infile == stdin) printf(" re> "); + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) break; + if (infile != stdin) fprintf(outfile, (char *)buffer); + + p = buffer; + while (isspace(*p)) p++; + if (*p == 0) continue; + + /* Get the delimiter and seek the end of the pattern; if it isn't + complete, read more.
*/ + + delimiter = *p++; + + if (isalnum(delimiter)) + { + fprintf(outfile, "** Delimiter must not be alphameric\n"); + goto SKIP_DATA; + } + + pp = p; + + for(;;) + { + while (*pp != 0 && *pp != delimiter) pp++; + if (*pp != 0) break; + + len = sizeof(buffer) - (pp - buffer); + if (len < 256) + { + fprintf(outfile, "** Expression too long - missing delimiter?\n"); + goto SKIP_DATA; + } + + if (infile == stdin) printf(" > "); + if (fgets((char *)pp, len, infile) == NULL) + { + fprintf(outfile, "** Unexpected EOF\n"); + goto END_OFF; + } + if (infile != stdin) fprintf(outfile, (char *)pp); + } + + /* Terminate the pattern at the delimiter */ + + *pp++ = 0; + + /* Look for options after final delimiter */ + + options = 0; + study_options = 0; + while (*pp != 0) + { + switch (*pp++) + { + case 'i': options |= PCRE_CASELESS; break; + case 'm': options |= PCRE_MULTILINE; break; + case 's': options |= PCRE_DOTALL; break; + case 'x': options |= PCRE_EXTENDED; break; + case 'A': options |= PCRE_ANCHORED; break; + case 'D': do_debug = 1; break; + case 'E': options |= PCRE_DOLLAR_ENDONLY; break; + case 'P': do_posix = 1; break; + case 'S': do_study = 1; break; + case 'I': study_options |= PCRE_CASELESS; break; + case 'X': options |= PCRE_EXTRA; break; + case '\n': case ' ': break; + default: + fprintf(outfile, "** Unknown option '%c'\n", pp[-1]); + goto SKIP_DATA; + } + } + + /* Handle compiling via the POSIX interface, which doesn't support the + timing, showing, or debugging options. */ + + if (posix || do_posix) + { + int rc; + int cflags = 0; + if ((options & PCRE_CASELESS) != 0) cflags |= REG_ICASE; + if ((options & PCRE_MULTILINE) != 0) cflags |= REG_NEWLINE; + rc = regcomp(&preg, (char *)p, cflags); + + /* Compilation failed; go back for another re, skipping to blank line + if non-interactive.
*/ + + if (rc != 0) + { + (void)regerror(rc, &preg, (char *)buffer, sizeof(buffer)); + fprintf(outfile, "Failed: POSIX code %d: %s\n", rc, buffer); + goto SKIP_DATA; + } + } + + /* Handle compiling via the native interface */ + + else + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < 4000; i++) + { + re = pcre_compile((char *)p, options, &error, &erroroffset); + if (re != NULL) free(re); + } + time_taken = clock() - start_time; + fprintf(outfile, "Compile time %.2f milliseconds\n", + ((double)time_taken)/(4 * CLOCKS_PER_SEC)); + } + + re = pcre_compile((char *)p, options, &error, &erroroffset); + + /* Compilation failed; go back for another re, skipping to blank line + if non-interactive. */ + + if (re == NULL) + { + fprintf(outfile, "Failed: %s at offset %d\n", error, erroroffset); + SKIP_DATA: + if (infile != stdin) + { + for (;;) + { + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) + goto END_OFF; + len = (int)strlen((char *)buffer); + while (len > 0 && isspace(buffer[len-1])) len--; + if (len == 0) break; + } + fprintf(outfile, "\n"); + } + continue; + } + + /* Compilation succeeded; print data if required */ + + if (showinfo || do_debug) + { + int first_char, count; + + if (debug || do_debug) print_internals(re); + + count = pcre_info(re, &options, &first_char); + if (count < 0) fprintf(outfile, + "Error %d while reading info\n", count); + else + { + fprintf(outfile, "Identifying subpattern count = %d\n", count); + if (options == 0) fprintf(outfile, "No options\n"); + else fprintf(outfile, "Options:%s%s%s%s%s%s%s\n", + ((options & PCRE_ANCHORED) != 0)? " anchored" : "", + ((options & PCRE_CASELESS) != 0)? " caseless" : "", + ((options & PCRE_EXTENDED) != 0)? " extended" : "", + ((options & PCRE_MULTILINE) != 0)? " multiline" : "", + ((options & PCRE_DOTALL) != 0)? " dotall" : "", + ((options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "", + ((options & PCRE_EXTRA) != 0)? 
" extra" : ""); + if (first_char == -1) + { + fprintf(outfile, "First char at start or follows \\n\n"); + } + else if (first_char < 0) + { + fprintf(outfile, "No first char\n"); + } + else + { + if (isprint(first_char)) + fprintf(outfile, "First char = \'%c\'\n", first_char); + else + fprintf(outfile, "First char = %d\n", first_char); + } + } + } + + /* If /S was present, study the regexp to generate additional info to + help with the matching. */ + + if (do_study) + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < 4000; i++) + extra = pcre_study(re, study_options, &error); + time_taken = clock() - start_time; + if (extra != NULL) free(extra); + fprintf(outfile, " Study time %.2f milliseconds\n", + ((double)time_taken)/(4 * CLOCKS_PER_SEC)); + } + + extra = pcre_study(re, study_options, &error); + if (error != NULL) + fprintf(outfile, "Failed to study: %s\n", error); + else if (extra == NULL) + fprintf(outfile, "Study returned NULL\n"); + + /* This looks at internal information. A bit kludgy to do it this + way, but it is useful for testing. 
*/ + + else if (showinfo || do_debug) + { + real_pcre_extra *xx = (real_pcre_extra *)extra; + if ((xx->options & PCRE_STUDY_MAPPED) == 0) + fprintf(outfile, "No starting character set\n"); + else + { + int i; + int c = 24; + fprintf(outfile, "Starting character set: "); + for (i = 0; i < 256; i++) + { + if ((xx->start_bits[i/8] & (1<<(i%8))) != 0) + { + if (c > 75) + { + fprintf(outfile, "\n "); + c = 2; + } + if (isprint(i) && i != ' ') + { + fprintf(outfile, "%c ", i); + c += 2; + } + else + { + fprintf(outfile, "\\x%02x ", i); + c += 5; + } + } + } + fprintf(outfile, "\n"); + } + } + } + } + + /* Read data lines and test them */ + + for (;;) + { + unsigned char *pp; + int count, c; + int offsets[30]; + int size_offsets = sizeof(offsets)/sizeof(int); + + options = 0; + + if (infile == stdin) printf(" data> "); + if (fgets((char *)buffer, sizeof(buffer), infile) == NULL) goto END_OFF; + if (infile != stdin) fprintf(outfile, (char *)buffer); + + len = (int)strlen((char *)buffer); + while (len > 0 && isspace(buffer[len-1])) len--; + buffer[len] = 0; + if (len == 0) break; + + p = buffer; + while (isspace(*p)) p++; + + pp = dbuffer; + while ((c = *p++) != 0) + { + int i = 0; + int n = 0; + if (c == '\\') switch ((c = *p++)) + { + case 'a': c = 7; break; + case 'b': c = '\b'; break; + case 'e': c = 27; break; + case 'f': c = '\f'; break; + case 'n': c = '\n'; break; + case 'r': c = '\r'; break; + case 't': c = '\t'; break; + case 'v': c = '\v'; break; + + case '0': case '1': case '2': case '3': + case '4': case '5': case '6': case '7': + c -= '0'; + while (i++ < 2 && isdigit(*p) && *p != '8' && *p != '9') + c = c * 8 + *p++ - '0'; + break; + + case 'x': + c = 0; + while (i++ < 2 && isxdigit(*p)) + { + c = c * 16 + tolower(*p) - ((isdigit(*p))? 
'0' : 'W'); + p++; + } + break; + + case 0: /* Allows for an empty line */ + p--; + continue; + + case 'A': /* Option setting */ + options |= PCRE_ANCHORED; + continue; + + case 'B': + options |= PCRE_NOTBOL; + continue; + + case 'E': + options |= PCRE_DOLLAR_ENDONLY; + continue; + + case 'I': + options |= PCRE_CASELESS; + continue; + + case 'M': + options |= PCRE_MULTILINE; + continue; + + case 'S': + options |= PCRE_DOTALL; + continue; + + case 'O': + while(isdigit(*p)) n = n * 10 + *p++ - '0'; + if (n <= sizeof(offsets)/sizeof(int)) size_offsets = n; + continue; + + case 'Z': + options |= PCRE_NOTEOL; + continue; + } + *pp++ = c; + } + *pp = 0; + len = pp - dbuffer; + + /* Handle matching via the POSIX interface, which does not + support timing. */ + + if (posix || do_posix) + { + int rc; + int eflags = 0; + regmatch_t pmatch[30]; + if ((options & PCRE_NOTBOL) != 0) eflags |= REG_NOTBOL; + if ((options & PCRE_NOTEOL) != 0) eflags |= REG_NOTEOL; + + rc = regexec(&preg, (char *)dbuffer, sizeof(pmatch)/sizeof(regmatch_t), + pmatch, eflags); + + if (rc != 0) + { + (void)regerror(rc, &preg, (char *)buffer, sizeof(buffer)); + fprintf(outfile, "No match: POSIX code %d: %s\n", rc, buffer); + } + else + { + int i; + for (i = 0; i < sizeof(pmatch)/sizeof(regmatch_t); i++) + { + if (pmatch[i].rm_so >= 0) + { + fprintf(outfile, "%2d: ", i); + pchars(dbuffer + pmatch[i].rm_so, + pmatch[i].rm_eo - pmatch[i].rm_so); + fprintf(outfile, "\n"); + } + } + } + } + + /* Handle matching via the native interface */ + + else + { + if (timeit) + { + register int i; + clock_t time_taken; + clock_t start_time = clock(); + for (i = 0; i < 4000; i++) + count = pcre_exec(re, extra, (char *)dbuffer, len, options, offsets, + size_offsets); + time_taken = clock() - start_time; + fprintf(outfile, "Execute time %.2f milliseconds\n", + ((double)time_taken)/(4 * CLOCKS_PER_SEC)); + } + + count = pcre_exec(re, extra, (char *)dbuffer, len, options, offsets, + size_offsets); + + if (count == 0) + { + 
fprintf(outfile, "Matched, but too many substrings\n"); + count = size_offsets/2; + } + + if (count >= 0) + { + int i; + count *= 2; + for (i = 0; i < count; i += 2) + { + if (offsets[i] < 0) + fprintf(outfile, "%2d: <unset>\n", i/2); + else + { + fprintf(outfile, "%2d: ", i/2); + pchars(dbuffer + offsets[i], offsets[i+1] - offsets[i]); + fprintf(outfile, "\n"); + } + } + } + else + { + if (count == -1) fprintf(outfile, "No match\n"); + else fprintf(outfile, "Error %d\n", count); + } + } + } + + if (posix || do_posix) regfree(&preg); + if (re != NULL) free(re); + if (extra != NULL) free(extra); + } + +END_OFF: +fprintf(outfile, "\n"); +return 0; +} + +/* End */ diff --git a/perltest b/perltest new file mode 100755 index 0000000..7c2114b --- /dev/null +++ b/perltest @@ -0,0 +1,143 @@ +#! /usr/bin/perl + +# Program for testing regular expressions with perl to check that PCRE handles +# them the same. + + +# Function for turning a string into a string of printing chars + +sub pchars { +my($t) = ""; + +foreach $c (split(//, @_[0])) + { + if (ord $c >= 32 && ord $c < 127) { $t .= $c; } + else { $t .= sprintf("\\x%02x", ord $c); } + } +$t; +} + + + +# Read lines from named file or stdin and write to named file or stdout; lines +# consist of a regular expression, in delimiters and optionally followed by +# options, followed by a set of test data, terminated by an empty line. + +# Sort out the input and output files + +if (@ARGV > 0) + { + open(INFILE, "<$ARGV[0]") || die "Failed to open $ARGV[0]\n"; + $infile = "INFILE"; + } +else { $infile = "STDIN"; } + +if (@ARGV > 1) + { + open(OUTFILE, ">$ARGV[1]") || die "Failed to open $ARGV[1]\n"; + $outfile = "OUTFILE"; + } +else { $outfile = "STDOUT"; } + +printf($outfile "Perl Regular Expressions\n\n"); + +# Main loop + +NEXT_RE: +for (;;) + { + printf " re> " if $infile eq "STDIN"; + last if ! 
($_ = <$infile>); + printf $outfile "$_" if $infile ne "STDIN"; + next if ($_ eq ""); + + $pattern = $_; + + $delimiter = substr($_, 0, 1); + while ($pattern !~ /^\s*(.).*\1/s) + { + printf " > " if $infile eq "STDIN"; + last if ! ($_ = <$infile>); + printf $outfile "$_" if $infile ne "STDIN"; + $pattern .= $_; + } + + chomp($pattern); + $pattern =~ s/\s+$//; + + # Check that the pattern is valid + + eval "\$_ =~ ${pattern}"; + if ($@) + { + printf $outfile "Error: $@\n"; + next NEXT_RE; + } + + # Read data lines and test them + + for (;;) + { + printf "data> " if $infile eq "STDIN"; + last NEXT_RE if ! ($_ = <$infile>); + chomp; + printf $outfile "$_\n" if $infile ne "STDIN"; + + s/\s+$//; + s/^\s+//; + + last if ($_ eq ""); + + $_ = eval "\"$_\""; # To get escapes processed + + $ok = 0; + eval "if (\$_ =~ ${pattern}) {" . + "\$z = \$&;" . + "\$a = \$1;" . + "\$b = \$2;" . + "\$c = \$3;" . + "\$d = \$4;" . + "\$e = \$5;" . + "\$f = \$6;" . + "\$g = \$7;" . + "\$h = \$8;" . + "\$i = \$9;" . + "\$j = \$10;" . + "\$k = \$11;" . + "\$l = \$12;" . + "\$m = \$13;" . + "\$n = \$14;" . + "\$o = \$15;" . + "\$p = \$16;" . + "\$ok = 1; }"; + + if ($@) + { + printf $outfile "Error: $@\n"; + next NEXT_RE; + } + elsif (!$ok) + { + printf $outfile "No match\n"; + } + else + { + @subs = ($z,$a,$b,$c,$d,$e,$f,$g,$h,$i,$j,$k,$l,$m,$n,$o,$p); + $last_printed = 0; + for ($i = 0; $i <= 17; $i++) + { + if ($i == 0 || defined $subs[$i]) + { + while ($last_printed++ < $i-1) + { printf $outfile ("%2d: <unset>\n", $last_printed); } + printf $outfile ("%2d: %s\n", $i, &pchars($subs[$i])); + $last_printed = $i; + } + } + } + } + } + +printf $outfile "\n"; + +# End @@ -0,0 +1,72 @@ +.TH PGREP 1 +.SH NAME +pgrep - a grep with Perl-compatible regular expressions. +.SH SYNOPSIS +.B pgrep [-chilnsvx] pattern [file] ... 
+ + +.SH DESCRIPTION +\fBpgrep\fR searches files for character patterns, in the same way as other +grep commands do, but it uses the PCRE regular expression library to support +patterns that are compatible with the regular expressions of Perl 5. See +\fBpcre(3)\fR for a full description of syntax and semantics. + +If no files are specified, \fBpgrep\fR reads the standard input. By default, +each line that matches the pattern is copied to the standard output, and if +there is more than one file, the file name is printed before each line of +output. However, there are options that can change how \fBpgrep\fR behaves. + +Lines are limited to BUFSIZ characters. BUFSIZ is defined in \fB<stdio.h>\fR. +The newline character is removed from the end of each line before it is matched +against the pattern. + + +.SH OPTIONS +.TP 10 +\fB-c\fR +Do not print individual lines; instead just print a count of the number of +lines that would otherwise have been printed. If several files are given, a +count is printed for each of them. +.TP +\fB-h\fR +Suppress printing of filenames when searching multiple files. +.TP +\fB-i\fR +Ignore upper/lower case distinctions during comparisons. +.TP +\fB-l\fR +Instead of printing lines from the files, just print the names of the files +containing lines that would have been printed. Each file name is printed +once, on a separate line. +.TP +\fB-n\fR +Precede each line by its line number in the file. +.TP +\fB-s\fR +Work silently, that is, display nothing except error messages. +The exit status indicates whether any matches were found. +.TP +\fB-v\fR +Invert the sense of the match, so that lines which do \fInot\fR match the +pattern are now the ones that are found. +.TP +\fB-x\fR +Force the pattern to be anchored (it must start matching at the beginning of +the line) and in addition, require it to match the entire line. This is +equivalent to having ^ and $ characters at the start and end of each +alternative branch in the regular expression. 
+ + +.SH SEE ALSO +\fBpcre(3)\fR, Perl 5 documentation + + +.SH DIAGNOSTICS +Exit status is 0 if any matches were found, 1 if no matches were found, and 2 +for syntax errors or inaccessible files (even if matches were found). + + +.SH AUTHOR +Philip Hazel <ph10@cam.ac.uk> +.br +Copyright (c) 1997 University of Cambridge. @@ -0,0 +1,220 @@ +/************************************************* +* PCRE grep program * +*************************************************/ + +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <errno.h> +#include "pcre.h" + + +#define FALSE 0 +#define TRUE 1 + +typedef int BOOL; + + + +/************************************************* +* Global variables * +*************************************************/ + +static pcre *pattern; +static pcre_extra *hints; + +static BOOL count_only = FALSE; +static BOOL filenames_only = FALSE; +static BOOL invert = FALSE; +static BOOL number = FALSE; +static BOOL silent = FALSE; +static BOOL whole_lines = FALSE; + + + +#ifdef STRERROR_FROM_ERRLIST +/************************************************* +* Provide strerror() for non-ANSI libraries * +*************************************************/ + +/* Some old-fashioned systems still around (e.g. SunOS4) don't have strerror() +in their libraries, but can provide the same facility by this simple +alternative function.
*/ + +extern int sys_nerr; +extern char *sys_errlist[]; + +char * +strerror(int n) +{ +if (n < 0 || n >= sys_nerr) return "unknown error number"; +return sys_errlist[n]; +} +#endif /* STRERROR_FROM_ERRLIST */ + + + +/************************************************* +* Grep an individual file * +*************************************************/ + +static int +pgrep(FILE *in, char *name) +{ +int rc = 1; +int linenumber = 0; +int count = 0; +int offsets[2]; +char buffer[BUFSIZ]; + +while (fgets(buffer, sizeof(buffer), in) != NULL) + { + BOOL match; + int length = (int)strlen(buffer); + if (length > 0 && buffer[length-1] == '\n') buffer[--length] = 0; + linenumber++; + + match = pcre_exec(pattern, hints, buffer, length, 0, offsets, 2) >= 0; + if (match && whole_lines && offsets[1] != length) match = FALSE; + + if (match != invert) + { + if (count_only) count++; + + else if (filenames_only) + { + fprintf(stdout, "%s\n", (name == NULL)? "<stdin>" : name); + return 0; + } + + else if (silent) return 0; + + else + { + if (name != NULL) fprintf(stdout, "%s:", name); + if (number) fprintf(stdout, "%d:", linenumber); + fprintf(stdout, "%s\n", buffer); + } + + rc = 0; + } + } + +if (count_only) + { + if (name != NULL) fprintf(stdout, "%s:", name); + fprintf(stdout, "%d\n", count); + } + +return rc; +} + + + + +/************************************************* +* Usage function * +*************************************************/ + +static int +usage(int rc) +{ +fprintf(stderr, "Usage: pgrep [-chilnsvx] pattern [file] ...\n"); +return rc; +} + + + + +/************************************************* +* Main program * +*************************************************/ + +int +main(int argc, char **argv) +{ +int i; +int rc = 1; +int options = 0; +int errptr; +char *error; +BOOL filenames = TRUE; + +/* Process the options */ + +for (i = 1; i < argc; i++) + { + char *s; + if (argv[i][0] != '-') break; + s = argv[i] + 1; + while (*s != 0) + { + switch (*s++) + { + case 'c': 
count_only = TRUE; break; + case 'h': filenames = FALSE; break; + case 'i': options |= PCRE_CASELESS; break; + case 'l': filenames_only = TRUE; + case 'n': number = TRUE; break; + case 's': silent = TRUE; break; + case 'v': invert = TRUE; break; + case 'x': whole_lines = TRUE; options |= PCRE_ANCHORED; break; + default: + fprintf(stderr, "pgrep: unknown option %c\n", s[-1]); + return usage(2); + } + } + } + +/* There must be at least a regexp argument */ + +if (i >= argc) return usage(0); + +/* Compile the regular expression. */ + +pattern = pcre_compile(argv[i++], options, &error, &errptr); +if (pattern == NULL) + { + fprintf(stderr, "pgrep: error in regex at offset %d: %s\n", errptr, error); + return 2; + } + +/* Study the regular expression, as we will be running it many times */ + +hints = pcre_study(pattern, 0, &error); +if (error != NULL) + { + fprintf(stderr, "pgrep: error while studying regex: %s\n", error); + return 2; + } + +/* If there are no further arguments, do the business on stdin and exit */ + +if (i >= argc) return pgrep(stdin, NULL); + +/* Otherwise, work through the remaining arguments as files. If there is only +one, don't give its name on the output. */ + +if (i == argc - 1) filenames = FALSE; +if (filenames_only) filenames = TRUE; + +for (; i < argc; i++) + { + FILE *in = fopen(argv[i], "r"); + if (in == NULL) + { + fprintf(stderr, "%s: failed to open: %s\n", argv[i], strerror(errno)); + rc = 2; + } + else + { + int frc = pgrep(in, filenames? argv[i] : NULL); + if (frc == 0 && rc == 1) rc = 0; + fclose(in); + } + } + +return rc; +} + +/* End */ @@ -0,0 +1,337 @@ +/************************************************* +* Perl-Compatible Regular Expressions * +*************************************************/ + +/* +This is a library of functions to support regular expressions whose syntax +and semantics are as close as possible to those of the Perl 5 language. See +the file Tech.Notes for some information on the internals.
+ +Written by: Philip Hazel <ph10@cam.ac.uk> + + Copyright (c) 1997 University of Cambridge + +----------------------------------------------------------------------------- +Permission is granted to anyone to use this software for any purpose on any +computer system, and to redistribute it freely, subject to the following +restrictions: + +1. This software is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. + +2. The origin of this software must not be misrepresented, either by + explicit claim or by omission. + +3. Altered versions must be plainly marked as such, and must not be + misrepresented as being the original software. +----------------------------------------------------------------------------- +*/ + + +/* Include the internals header, which itself includes Standard C headers plus +the external pcre header. */ + +#include "internal.h" + + + +/************************************************* +* Create bitmap of starting chars * +*************************************************/ + +/* This function scans a compiled unanchored expression and attempts to build a +bitmap of the set of initial characters. If it can't, it returns FALSE. As time +goes by, we may be able to get more clever at doing this. + +Arguments: + code points to an expression + start_bits points to a 32-byte table, initialized to 0 + +Returns: TRUE if table built, FALSE otherwise +*/ + +static BOOL +set_start_bits(uschar *code, uschar *start_bits) +{ +register int c; + +do + { + uschar *tcode = code + 3; + BOOL try_next = TRUE; + + while (try_next) + { + try_next = FALSE; + + if ((int)*tcode >= OP_BRA || *tcode == OP_ASSERT) + { + if (!set_start_bits(tcode, start_bits)) return FALSE; + } + + else switch(*tcode) + { + default: + return FALSE; + + /* BRAZERO does the bracket, but carries on. 
*/ + + case OP_BRAZERO: + case OP_BRAMINZERO: + if (!set_start_bits(++tcode, start_bits)) return FALSE; + do tcode += (tcode[1] << 8) + tcode[2]; while (*tcode == OP_ALT); + tcode += 3; + try_next = TRUE; + break; + + /* Single-char * or ? sets the bit and tries the next item */ + + case OP_STAR: + case OP_MINSTAR: + case OP_QUERY: + case OP_MINQUERY: + start_bits[tcode[1]/8] |= (1 << (tcode[1]&7)); + tcode += 2; + try_next = TRUE; + break; + + /* Single-char upto sets the bit and tries the next */ + + case OP_UPTO: + case OP_MINUPTO: + start_bits[tcode[3]/8] |= (1 << (tcode[3]&7)); + tcode += 4; + try_next = TRUE; + break; + + /* At least one single char sets the bit and stops */ + + case OP_EXACT: /* Fall through */ + tcode++; + + case OP_CHARS: /* Fall through */ + tcode++; + + case OP_PLUS: + case OP_MINPLUS: + start_bits[tcode[1]/8] |= (1 << (tcode[1]&7)); + break; + + /* Single character type sets the bits and stops */ + + case OP_NOT_DIGIT: + for (c = 0; c < 32; c++) start_bits[c] |= ~pcre_cbits[c+cbit_digit]; + break; + + case OP_DIGIT: + for (c = 0; c < 32; c++) start_bits[c] |= pcre_cbits[c+cbit_digit]; + break; + + case OP_NOT_WHITESPACE: + for (c = 0; c < 32; c++) start_bits[c] |= ~pcre_cbits[c+cbit_space]; + break; + + case OP_WHITESPACE: + for (c = 0; c < 32; c++) start_bits[c] |= pcre_cbits[c+cbit_space]; + break; + + case OP_NOT_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= ~(pcre_cbits[c] | pcre_cbits[c+cbit_word]); + break; + + case OP_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= (pcre_cbits[c] | pcre_cbits[c+cbit_word]); + break; + + /* One or more character type fudges the pointer and restarts, knowing + it will hit a single character type and stop there. */ + + case OP_TYPEPLUS: + case OP_TYPEMINPLUS: + tcode++; + try_next = TRUE; + break; + + case OP_TYPEEXACT: + tcode += 3; + try_next = TRUE; + break; + + /* Zero or more repeats of character types set the bits and then + try again. 
*/ + + case OP_TYPEUPTO: + case OP_TYPEMINUPTO: + tcode += 2; /* Fall through */ + + case OP_TYPESTAR: + case OP_TYPEMINSTAR: + case OP_TYPEQUERY: + case OP_TYPEMINQUERY: + switch(tcode[1]) + { + case OP_NOT_DIGIT: + for (c = 0; c < 32; c++) start_bits[c] |= ~pcre_cbits[c+cbit_digit]; + break; + + case OP_DIGIT: + for (c = 0; c < 32; c++) start_bits[c] |= pcre_cbits[c+cbit_digit]; + break; + + case OP_NOT_WHITESPACE: + for (c = 0; c < 32; c++) start_bits[c] |= ~pcre_cbits[c+cbit_space]; + break; + + case OP_WHITESPACE: + for (c = 0; c < 32; c++) start_bits[c] |= pcre_cbits[c+cbit_space]; + break; + + case OP_NOT_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= ~(pcre_cbits[c] | pcre_cbits[c+cbit_word]); + break; + + case OP_WORDCHAR: + for (c = 0; c < 32; c++) + start_bits[c] |= (pcre_cbits[c] | pcre_cbits[c+cbit_word]); + break; + } + + tcode += 2; + try_next = TRUE; + break; + + /* Character class: set the bits and either carry on or not, + according to the repeat count. */ + + case OP_CLASS: + { + tcode++; + for (c = 0; c < 32; c++) start_bits[c] |= tcode[c]; + tcode += 32; + switch (*tcode) + { + case OP_CRSTAR: + case OP_CRMINSTAR: + case OP_CRQUERY: + case OP_CRMINQUERY: + tcode++; + try_next = TRUE; + break; + + case OP_CRRANGE: + case OP_CRMINRANGE: + if (((tcode[1] << 8) + tcode[2]) == 0) + { + tcode += 5; + try_next = TRUE; + } + break; + } + } + break; /* End of class handling */ + + } /* End of switch */ + } /* End of try_next loop */ + + code += (code[1] << 8) + code[2]; /* Advance to next branch */ + } +while (*code == OP_ALT); +return TRUE; +} + + + +/************************************************* +* Study a compiled expression * +*************************************************/ + +/* This function is handed a compiled expression that it must study to produce +information that will speed up the matching. It returns a pcre_extra block +which then gets handed back to pcre_exec(). 
+ +Arguments: + re points to the compiled expression + options contains option bits + errorptr points to where to place error messages; + set NULL unless error + +Returns: pointer to a pcre_extra block, + NULL on error or if no optimization possible +*/ + +pcre_extra * +pcre_study(const pcre *external_re, int options, char **errorptr) +{ +BOOL caseless; +uschar start_bits[32]; +real_pcre_extra *extra; +real_pcre *re = (real_pcre *)external_re; + +*errorptr = NULL; + +if (re == NULL || re->magic_number != MAGIC_NUMBER) + { + *errorptr = "argument is not a compiled regular expression"; + return NULL; + } + +if ((options & ~PUBLIC_STUDY_OPTIONS) != 0) + { + *errorptr = "unknown or incorrect option bit(s) set"; + return NULL; + } + +/* Caseless can either be from the compiled regex or from options. */ + +caseless = ((re->options | options) & PCRE_CASELESS) != 0; + +/* For an anchored pattern, or an unanchored pattern that has a first char, or a +multiline pattern that matches only at "line starts", no further processing at +present. */ + +if ((re->options & (PCRE_ANCHORED|PCRE_FIRSTSET|PCRE_STARTLINE)) != 0) + return NULL; + +/* See if we can find a fixed set of initial characters for the pattern. */ + +memset(start_bits, 0, 32 * sizeof(uschar)); +if (!set_start_bits(re->code, start_bits)) return NULL; + +/* If this studying is caseless, scan the created bit map and duplicate the +bits for any letters. */ + +if (caseless) + { + register int c; + for (c = 0; c < 256; c++) + { + if ((start_bits[c/8] & (1 << (c&7))) != 0 && + (pcre_ctypes[c] & ctype_letter) != 0) + { + int d = pcre_fcc[c]; + start_bits[d/8] |= (1 << (d&7)); + } + } + } + +/* Get an "extra" block and put the information therein. */ + +extra = (real_pcre_extra *)(pcre_malloc)(sizeof(real_pcre_extra)); + +if (extra == NULL) + { + *errorptr = "failed to get memory"; + return NULL; + } + +extra->options = PCRE_STUDY_MAPPED | (caseless?
PCRE_STUDY_CASELESS : 0); +memcpy(extra->start_bits, start_bits, sizeof(start_bits)); + +return (pcre_extra *)extra; +} + +/* End of study.c */ diff --git a/testinput b/testinput new file mode 100644 index 0000000..0c00e7f --- /dev/null +++ b/testinput @@ -0,0 +1,1551 @@ +/the quick brown fox/ + the quick brown fox + The quick brown FOX + What do you know about the quick brown fox? + What do you know about THE QUICK BROWN FOX? + +/The quick brown fox/i + the quick brown fox + The quick brown FOX + What do you know about the quick brown fox? + What do you know about THE QUICK BROWN FOX? + +/abcd\t\n\r\f\a\e\071\x3b\$\\\?caxyz/ + abcd\t\n\r\f\a\e9;\$\\?caxyz + +/a*abc?xyz+pqr{3}ab{2,}xy{4,5}pq{0,6}AB{0,}zz/ + abxyzpqrrrabbxyyyypqAzz + abxyzpqrrrabbxyyyypqAzz + aabxyzpqrrrabbxyyyypqAzz + aaabxyzpqrrrabbxyyyypqAzz + aaaabxyzpqrrrabbxyyyypqAzz + abcxyzpqrrrabbxyyyypqAzz + aabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypAzz + aaabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypqqAzz + aaabcxyzpqrrrabbxyyyypqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqqAzz + aaaabcxyzpqrrrabbxyyyypqAzz + abxyzzpqrrrabbxyyyypqAzz + aabxyzzzpqrrrabbxyyyypqAzz + aaabxyzzzzpqrrrabbxyyyypqAzz + aaaabxyzzzzpqrrrabbxyyyypqAzz + abcxyzzpqrrrabbxyyyypqAzz + aabcxyzzzpqrrrabbxyyyypqAzz + aaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyyypqAzz + aaabcxyzpqrrrabbxyyyypABzz + aaabcxyzpqrrrabbxyyyypABBzz + >>>aaabxyzpqrrrabbxyyyypqAzz + >aaaabxyzpqrrrabbxyyyypqAzz + >>>>abcxyzpqrrrabbxyyyypqAzz + *** Failers + abxyzpqrrabbxyyyypqAzz + abxyzpqrrrrabbxyyyypqAzz + abxyzpqrrrabxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyypqAzz + aaabcxyzpqrrrabbxyyyypqqqqqqqAzz + +/^(abc){1,2}zz/ + abczz + abcabczz + *** Failers + zz + abcabcabczz + >>abczz + +/^(b+?|a){1,2}?c/ + bc + bbc + bbbc + bac + bbac + aac + abbbbbbbbbbbc + bbbbbbbbbbbac + *** Failers + 
aaac + abbbbbbbbbbbac + +/^(b+|a){1,2}c/ + bc + bbc + bbbc + bac + bbac + aac + abbbbbbbbbbbc + bbbbbbbbbbbac + *** Failers + aaac + abbbbbbbbbbbac + +/^(b+|a){1,2}?bc/ + bbc + +/^(b*|ba){1,2}?bc/ + babc + bbabc + bababc + *** Failers + bababbc + babababc + +/^(ba|b*){1,2}?bc/ + babc + bbabc + bababc + *** Failers + bababbc + babababc + +/^\ca\cA\c[\c{\c:/ + \x01\x01\e;z + +/^[ab\]cde]/ + athing + bthing + ]thing + cthing + dthing + ething + *** Failers + fthing + [thing + \\thing + +/^[]cde]/ + ]thing + cthing + dthing + ething + *** Failers + athing + fthing + +/^[^ab\]cde]/ + fthing + [thing + \\thing + *** Failers + athing + bthing + ]thing + cthing + dthing + ething + +/^[^]cde]/ + athing + fthing + *** Failers + ]thing + cthing + dthing + ething + +/^\/ + + +/^ÿ/ + ÿ + +/^[0-9]+$/ + 0 + 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 100 + *** Failers + abc + +/^.*nter/ + enter + inter + uponter + +/^xxx[0-9]+$/ + xxx0 + xxx1234 + *** Failers + xxx + +/^.+[0-9][0-9][0-9]$/ + x123 + xx123 + 123456 + *** Failers + 123 + x1234 + +/^.+?[0-9][0-9][0-9]$/ + x123 + xx123 + 123456 + *** Failers + 123 + x1234 + +/^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ + abc!pqr=apquxz.ixr.zzz.ac.uk + *** Failers + !pqr=apquxz.ixr.zzz.ac.uk + abc!=apquxz.ixr.zzz.ac.uk + abc!pqr=apquxz:ixr.zzz.ac.uk + abc!pqr=apquxz.ixr.zzz.ac.ukk + +/:/ + Well, we need a colon: somewhere + *** Fail if we don't + +/([\da-f:]+)$/i + 0abc + abc + fed + E + :: + 5f03:12C0::932e + fed def + Any old stuff + *** Failers + 0zzz + gzzz + fed\x20 + Any old rubbish + +/^.*\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ + .1.2.3 + A.12.123.0 + *** Failers + .1.2.3333 + 1.2.3 + 1234.2.3 + +/^(\d+)\s+IN\s+SOA\s+(\S+)\s+(\S+)\s*\(\s*$/ + 1 IN SOA non-sp1 non-sp2( + 1 IN SOA non-sp1 non-sp2 ( + *** Failers + 1IN SOA non-sp1 non-sp2( + +/^[a-zA-Z\d][a-zA-Z\d\-]*(\.[a-zA-Z\d][a-zA-z\d\-]*)*\.$/ + a. + Z. + 2. + ab-c.pq-r. + sxk.zzz.ac.uk. + x-.y-. + *** Failers + -abc.peq. 
+ +/^\*\.[a-z]([a-z\-\d]*[a-z\d]+)?(\.[a-z]([a-z\-\d]*[a-z\d]+)?)*$/ + *.a + *.b0-a + *.c3-b.c + *.c-a.b-c + *** Failers + *.0 + *.a- + *.a-b.c- + *.c-a.0-c + +/^(?=ab(de))(abd)(e)/ + abde + +/^(?!(ab)de|x)(abd)(f)/ + abdf + +/^(?=(ab(cd)))(ab)/ + abcd + +/^[\da-f](\.[\da-f])*$/i + a.b.c.d + A.B.C.D + a.b.c.1.2.3.C + +/^\".*\"\s*(;.*)?$/ + \"1234\" + \"abcd\" ; + \"\" ; rhubarb + *** Failers + \"1234\" : things + +/^$/ + \ + *** Failers + +/ ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/x + ab c + *** Failers + abc + ab cde + +/(?x) ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/ + ab c + *** Failers + abc + ab cde + +/^ a\ b[c ]d $/x + a bcd + a b d + *** Failers + abcd + ab d + +/^(a(b(c)))(d(e(f)))(h(i(j)))(k(l(m)))$/ + abcdefhijklm + +/^(?:a(b(c)))(?:d(e(f)))(?:h(i(j)))(?:k(l(m)))$/ + abcdefhijklm + +/^[\w][\W][\s][\S][\d][\D][\b][\n][\c]][\022]/ + a+ Z0+\x08\n\x1d\x12 + +/^[.^$|()*+?{,}]+/ + .^\$(*+)|{?,?} + +/^a*\w/ + z + az + aaaz + a + aa + aaaa + a+ + aa+ + +/^a*?\w/ + z + az + aaaz + a + aa + aaaa + a+ + aa+ + +/^a+\w/ + az + aaaz + aa + aaaa + aa+ + +/^a+?\w/ + az + aaaz + aa + aaaa + aa+ + +/^\d{8}\w{2,}/ + 1234567890 + 12345678ab + 12345678__ + *** Failers + 1234567 + +/^[aeiou\d]{4,5}$/ + uoie + 1234 + 12345 + aaaaa + *** Failers + 123456 + +/^[aeiou\d]{4,5}?/ + uoie + 1234 + 12345 + aaaaa + 123456 + +/\A(abc|def)=(\1){2,3}\Z/ + abc=abcabc + def=defdefdef + *** Failers + abc=defdef + +/^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\11*(\3\4)\1(?#)2$/ + abcdefghijkcda2 + abcdefghijkkkkcda2 + +/(cat(a(ract|tonic)|erpillar)) \1()2(3)/ + cataract cataract23 + catatonic catatonic23 + caterpillar caterpillar23 + + +/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + From abcd Mon Sep 01 12:33:02 1997 + +/^From\s+\S+\s+([a-zA-Z]{3}\s+){2}\d{1,2}\s+\d\d:\d\d/ + From abcd Mon Sep 01 12:33:02 1997 + From abcd Mon Sep 1 12:33:02 1997 + *** Failers + From abcd Sep 01 12:33:02 1997 + +/^12.34/s + 
12\n34 + 12\r34 + +/\w+(?=\t)/ + the quick brown\t fox + +/foo(?!bar)(.*)/ + foobar is foolish see? + +/(?:(?!foo)...|^.{0,2})bar(.*)/ + foobar crowbar etc + barrel + 2barrel + A barrel + +/^(\D*)(?=\d)(?!123)/ + abc456 + *** Failers + abc123 + +/^1234(?# test newlines + inside)/ + 1234 + +/^1234 #comment in extended re + /x + 1234 + +/#rhubarb + abcd/x + abcd + +/^abcd#rhubarb/x + abcd + +/^(a)\1{2,3}(.)/ + aaab + aaaab + aaaaab + aaaaaab + +/(?!^)abc/ + the abc + *** Failers + abc + +/(?=^)abc/ + abc + *** Failers + the abc + +/^[ab]{1,3}(ab*|b)/ + aabbbbb + +/^[ab]{1,3}?(ab*|b)/ + aabbbbb + +/^[ab]{1,3}?(ab*?|b)/ + aabbbbb + +/^[ab]{1,3}(ab*?|b)/ + aabbbbb + +/ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # optional leading comment +(?: (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # initial word +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... 
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) )* # further okay, if led by a period +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +# address +| # or +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # one word, optionally followed by.... +(?: +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] | # atom and space parts, or... +\( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) | # comments, or... 
+ +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +# quoted strings +)* +< (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # leading < +(?: @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* + +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* , (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +)* # further okay, if led by comma +: # closing colon +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* )? # optional route +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # initial word +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... 
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) )* # further okay, if led by a period +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +# address spec +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* > # trailing > +# name and address +) (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # optional trailing comment +/x + Alan Other <user\@dom.ain> + <user\@dom.ain> + user\@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + A. 
Other <user.1234\@dom.ain> (a comment) + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + A missing angle <user\@some.where + *** Failers + The quick brown fox + +/[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional leading comment +(?: +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# additional words +)* +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +# address +| # or +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +# leading word +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # "normal" atoms and or spaces +(?: +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +| +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +) # "special" comment or quoted string +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # more "normal" +)* +< +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# < +(?: +@ +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. 
+# optional trailing comments +)* +(?: , +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +)* # additional domains +: +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)? # optional route +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +\. +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# additional words +)* +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +# address spec +> # > +# name and address +) +/x + Alan Other <user\@dom.ain> + <user\@dom.ain> + user\@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + A. 
Other <user.1234\@dom.ain> (a comment) + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + A missing angle <user\@some.where + *** Failers + The quick brown fox + +/abc\0def\00pqr\000xyz\0000AB/ + abc\0def\00pqr\000xyz\0000AB + abc456 abc\0def\00pqr\000xyz\0000ABCDE + +/abc\x0def\x00pqr\x000xyz\x0000AB/ + abc\x0def\x00pqr\x000xyz\x0000AB + abc456 abc\x0def\x00pqr\x000xyz\x0000ABCDE + +/^[\000-\037]/ + \0A + \01B + \037C + +/\0*/ + \0\0\0\0 + +/A\x0{2,3}Z/ + The A\x0\x0Z + An A\0\x0\0Z + *** Failers + A\0Z + A\0\x0\0\x0Z + +/^(cow|)\1(bell)/ + cowcowbell + bell + *** Failers + cowbell + +/^\s/ + \040abc + \x0cabc + \nabc + \rabc + \tabc + *** Failers + abc + +/^a b +
c/x + abc + +/^(a|)\1*b/ + ab + aaaab + b + *** Failers + acb + +/^(a|)\1+b/ + aab + aaaab + b + *** Failers + ab + +/^(a|)\1?b/ + ab + aab + b + *** Failers + acb + +/^(a|)\1{2}b/ + aaab + b + *** Failers + ab + aab + aaaab + +/^(a|)\1{2,3}b/ + aaab + aaaab + b + *** Failers + ab + aab + aaaaab + +/ab{1,3}bc/ + abbbbc + abbbc + abbc + *** Failers + abc + abbbbbc + +/([^.]*)\.([^:]*):[T ]+(.*)/ + track1.title:TBlah blah blah + +/([^.]*)\.([^:]*):[T ]+(.*)/i + track1.title:TBlah blah blah + +/([^.]*)\.([^:]*):[t ]+(.*)/i + track1.title:TBlah blah blah + +/^[W-c]+$/ + WXY_^abc + ***Failers + wxy + +/^[W-c]+$/i + WXY_^abc + wxy_^ABC + +/^[\x3f-\x5F]+$/i + WXY_^abc + wxy_^ABC + +/^abc$/m + abc + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + +/^abc$/ + abc + *** Failers + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + +/\Aabc\Z/m + abc + *** Failers + qqq\nabc + abc\nzzz + qqq\nabc\nzzz + +/\A(.)*\Z/s + abc\ndef + +/\A(.)*\Z/m + *** Failers + abc\ndef + +/(?:b)|(?::+)/ + b::c + c::b + +/[-az]+/ + az- + *** Failers + b + +/[az-]+/ + za- + *** Failers + b + +/[a\-z]+/ + a-z + *** Failers + b + +/[a-z]+/ + abcdxyz + +/[\d-]+/ + 12-34 + *** Failers + aaa + +/[\d-z]+/ + 12-34z + *** Failers + aaa + +/\x5c/ + \\ + +/\x20Z/ + the Zoo + *** Failers + Zulu + +/(abc)\1/i + abcabc + ABCabc + abcABC + +/(main(OPT)?)+/ + mainmain + mainOPTmain + +/ab{3cd/ + ab{3cd + +/ab{3,cd/ + ab{3,cd + +/ab{3,4a}cd/ + ab{3,4a}cd + +/{4,5a}bc/ + {4,5a}bc + +/^a.b/ + a\rb + *** Failers + a\nb + +/abc$/ + abc + abc\n + *** Failers + abc\ndef + +/(abc)\123/ + abc\x53 + +/(abc)\223/ + abc\x93 + +/(abc)\323/ + abc\xd3 + +/(abc)\500/ + abc\x40 + abc\100 + +/(abc)\5000/ + abc\x400 + abc\x40\x30 + abc\1000 + abc\100\x30 + abc\100\060 + abc\100\60 + +/abc\81/ + abc\081 + abc\0\x38\x31 + +/abc\91/ + abc\091 + abc\0\x39\x31 + +/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/ + abcdefghijkllS + +/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/ + abcdefghijk\12S + +/ab\gdef/ + abgdef + +/a{0}bc/ + bc + +/(a|(bc)){0,0}?xyz/ + xyz + 
+/abc[\10]de/ + abc\010de + +/abc[\1]de/ + abc\1de + +/(abc)[\1]de/ + abc\1de + +/a.b(?s)/ + a\nb + +/^([^a])([^\b])([^c]*)([^d]{3,4})/ + baNOTccccd + baNOTcccd + baNOTccd + bacccd + *** Failers + anything + b\bc + baccd + +/[^a]/ + Abc + +/[^a]/i + Abc + +/[^a]+/ + AAAaAbc + +/[^a]+/i + AAAaAbc + +/[^a]+/ + bbb\nccc + +/ End of test input / diff --git a/testinput2 b/testinput2 new file mode 100644 index 0000000..6a164d6 --- /dev/null +++ b/testinput2 @@ -0,0 +1,244 @@ +/(a)b|/ + +/(a*)*/ + +/(abc|)+/ + +/abc/ + abc + defabc + \Aabc + \IABC + *** Failers + \Adefabc + ABC + +/^abc/ + abc + \Aabc + *** Failers + defabc + \Adefabc + +/a+bc/ + +/a*bc/ + +/a{3}bc/ + +/(abc|a+z)/ + +/^abc$/ + abc + \Mdef\nabc + *** Failers + def\nabc + +/abc\/ + +/ab\gdef/X + +/x{5,4}/ + +/z{65536}/ + +/[abcd/ + +/[\B]/ + +/[a-\w]/ + +/[z-a]/ + +/^*/ + +/(abc/ + +/(?# abc/ + +/(?z)abc/ + +/.*b/ + +/.*?b/ + +/cat|dog|elephant/ + this sentence eventually mentions a cat + this sentences rambles on and on for a while and then reaches elephant + +/cat|dog|elephant/S + this sentence eventually mentions a cat + this sentences rambles on and on for a while and then reaches elephant + +/cat|dog|elephant/iS + this sentence eventually mentions a CAT cat + this sentences rambles on and on for a while to elephant ElePhant + +/cat|dog|elephant/IS + this sentence eventually mentions a CAT cat + this sentences rambles on and on for a while to elephant ElePhant + +/cat|dog|elephant/IS + \Ithis sentence eventually mentions a CAT cat + \Ithis sentences rambles on and on for a while to elephant ElePhant + +/a|[bcd]/S + +/(a|[^\dZ])/S + +/(a|b)*[\s]/S + +/(ab\2)/ + +/{4,5}abc/ + +/(a)(b)(c)\2/ + abcb + \O0abcb + \O2abcb + \O4abcb + \O6abcb + \O8abcb + +/(a)bc|(a)(b)\2/ + abc + \O0abc + \O2abc + \O4abc + aba + \O0aba + \O2aba + \O4aba + \O6aba + \O8aba + +/^a.b/ + \Sa\nb + +/abc$/E + abc + *** Failers + abc\n + abc\ndef + +/abc$/ + *** Failers + \Eabc\n + \Eabc\ndef + +/abc$/m + \Eabc\n + \Eabc\ndef + 
+/(a)(b)(c)(d)(e)\6/ + +/the quick brown fox/ + the quick brown fox + this is a line with the quick brown fox + +/the quick brown fox/A + the quick brown fox + *** Failers + this is a line with the quick brown fox + +/ab(?z)cd/ + +".*/\Xfoo"X + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ + +".*/\Xfoo"X + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + +/(\.\d\d[1-9]?)\d+/ + 1.230003938 + 1.875000282 + 1.235 + +/(\.\d\d[1-9]?)\X\d+/X + 1.230003938 + 1.875000282 + *** Failers + 1.235 + +/(\.\d\d((?=0)|\d(?=\d)))/ + 1.230003938 + 1.875000282 + *** Failers + 1.235 + +/^(\w+\X|\s+\X)*$/X + now is the time for all good men to come to the aid of the party + *** Failers + this is not a line with only words and spaces! + +/^abc|def/ + abcdef + abcdef\B + +/.*((abc)$|(def))/ + defabc + \Zdefabc + +/abc/P + abc + *** Failers + +/^abc|def/P + abcdef + abcdef\B + +/.*((abc)$|(def))/P + defabc + \Zdefabc + +/the quick brown fox/P + the quick brown fox + *** Failers + The Quick Brown Fox + +/the quick brown fox/Pi + the quick brown fox + The Quick Brown Fox + +/abc.def/P + *** Failers + abc\ndef + +/abc$/P + abc + abc\n + +/abc\/P + +/(abc)\2/P + +/(abc\1)/P + abc + +"(?>.*/)foo"X + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ + +"(?>.*/)foo"X + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + +/(?>(\.\d\d[1-9]?))\d+/X + 1.230003938 + 1.875000282 + *** Failers + 1.235 + +/^((?>\w+)|(?>\s+))*$/X + now is the time for all good men to come to the aid of the party + *** Failers + this is not a line with only words and spaces! 
+ +/(\d+)(\w)/X + 12345a + 12345+ + +/((?>\d+))(\w)/X + 12345a + *** Failers + 12345+ + +/ End of test input / diff --git a/testoutput b/testoutput new file mode 100644 index 0000000..0c924d7 --- /dev/null +++ b/testoutput @@ -0,0 +1,2298 @@ +Testing Perl-Compatible Regular Expressions +PCRE version 1.00 18-Nov-1997 + +/the quick brown fox/ + the quick brown fox + 0: the quick brown fox + The quick brown FOX +No match + What do you know about the quick brown fox? + 0: the quick brown fox + What do you know about THE QUICK BROWN FOX? +No match + +/The quick brown fox/i + the quick brown fox + 0: the quick brown fox + The quick brown FOX + 0: The quick brown FOX + What do you know about the quick brown fox? + 0: the quick brown fox + What do you know about THE QUICK BROWN FOX? + 0: THE QUICK BROWN FOX + +/abcd\t\n\r\f\a\e\071\x3b\$\\\?caxyz/ + abcd\t\n\r\f\a\e9;\$\\?caxyz + 0: abcd\x09\x0a\x0d\x0c\x07\x1b9;$\?caxyz + +/a*abc?xyz+pqr{3}ab{2,}xy{4,5}pq{0,6}AB{0,}zz/ + abxyzpqrrrabbxyyyypqAzz + 0: abxyzpqrrrabbxyyyypqAzz + abxyzpqrrrabbxyyyypqAzz + 0: abxyzpqrrrabbxyyyypqAzz + aabxyzpqrrrabbxyyyypqAzz + 0: aabxyzpqrrrabbxyyyypqAzz + aaabxyzpqrrrabbxyyyypqAzz + 0: aaabxyzpqrrrabbxyyyypqAzz + aaaabxyzpqrrrabbxyyyypqAzz + 0: aaaabxyzpqrrrabbxyyyypqAzz + abcxyzpqrrrabbxyyyypqAzz + 0: abcxyzpqrrrabbxyyyypqAzz + aabcxyzpqrrrabbxyyyypqAzz + 0: aabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypAzz + 0: aaabcxyzpqrrrabbxyyyypAzz + aaabcxyzpqrrrabbxyyyypqAzz + 0: aaabcxyzpqrrrabbxyyyypqAzz + aaabcxyzpqrrrabbxyyyypqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqAzz + aaabcxyzpqrrrabbxyyyypqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqqqAzz + aaabcxyzpqrrrabbxyyyypqqqqqqAzz + 0: aaabcxyzpqrrrabbxyyyypqqqqqqAzz + aaaabcxyzpqrrrabbxyyyypqAzz + 0: aaaabcxyzpqrrrabbxyyyypqAzz + abxyzzpqrrrabbxyyyypqAzz + 0: abxyzzpqrrrabbxyyyypqAzz + aabxyzzzpqrrrabbxyyyypqAzz + 0: 
aabxyzzzpqrrrabbxyyyypqAzz + aaabxyzzzzpqrrrabbxyyyypqAzz + 0: aaabxyzzzzpqrrrabbxyyyypqAzz + aaaabxyzzzzpqrrrabbxyyyypqAzz + 0: aaaabxyzzzzpqrrrabbxyyyypqAzz + abcxyzzpqrrrabbxyyyypqAzz + 0: abcxyzzpqrrrabbxyyyypqAzz + aabcxyzzzpqrrrabbxyyyypqAzz + 0: aabcxyzzzpqrrrabbxyyyypqAzz + aaabcxyzzzzpqrrrabbxyyyypqAzz + 0: aaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbxyyyypqAzz + 0: aaaabcxyzzzzpqrrrabbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyypqAzz + 0: aaaabcxyzzzzpqrrrabbbxyyyypqAzz + aaaabcxyzzzzpqrrrabbbxyyyyypqAzz + 0: aaaabcxyzzzzpqrrrabbbxyyyyypqAzz + aaabcxyzpqrrrabbxyyyypABzz + 0: aaabcxyzpqrrrabbxyyyypABzz + aaabcxyzpqrrrabbxyyyypABBzz + 0: aaabcxyzpqrrrabbxyyyypABBzz + >>>aaabxyzpqrrrabbxyyyypqAzz + 0: aaabxyzpqrrrabbxyyyypqAzz + >aaaabxyzpqrrrabbxyyyypqAzz + 0: aaaabxyzpqrrrabbxyyyypqAzz + >>>>abcxyzpqrrrabbxyyyypqAzz + 0: abcxyzpqrrrabbxyyyypqAzz + *** Failers +No match + abxyzpqrrabbxyyyypqAzz +No match + abxyzpqrrrrabbxyyyypqAzz +No match + abxyzpqrrrabxyyyypqAzz +No match + aaaabcxyzzzzpqrrrabbbxyyyyyypqAzz +No match + aaaabcxyzzzzpqrrrabbbxyyypqAzz +No match + aaabcxyzpqrrrabbxyyyypqqqqqqqAzz +No match + +/^(abc){1,2}zz/ + abczz + 0: abczz + 1: abc + abcabczz + 0: abcabczz + 1: abc + *** Failers +No match + zz +No match + abcabcabczz +No match + >>abczz +No match + +/^(b+?|a){1,2}?c/ + bc + 0: bc + 1: b + bbc + 0: bbc + 1: b + bbbc + 0: bbbc + 1: bb + bac + 0: bac + 1: a + bbac + 0: bbac + 1: a + aac + 0: aac + 1: a + abbbbbbbbbbbc + 0: abbbbbbbbbbbc + 1: bbbbbbbbbbb + bbbbbbbbbbbac + 0: bbbbbbbbbbbac + 1: a + *** Failers +No match + aaac +No match + abbbbbbbbbbbac +No match + +/^(b+|a){1,2}c/ + bc + 0: bc + 1: b + bbc + 0: bbc + 1: bb + bbbc + 0: bbbc + 1: bbb + bac + 0: bac + 1: a + bbac + 0: bbac + 1: a + aac + 0: aac + 1: a + abbbbbbbbbbbc + 0: abbbbbbbbbbbc + 1: bbbbbbbbbbb + bbbbbbbbbbbac + 0: bbbbbbbbbbbac + 1: a + *** Failers +No match + aaac +No match + abbbbbbbbbbbac +No match + +/^(b+|a){1,2}?bc/ + bbc + 0: bbc + 1: b + +/^(b*|ba){1,2}?bc/ 
+ babc + 0: babc + 1: ba + bbabc + 0: bbabc + 1: ba + bababc + 0: bababc + 1: ba + *** Failers +No match + bababbc +No match + babababc +No match + +/^(ba|b*){1,2}?bc/ + babc + 0: babc + 1: ba + bbabc + 0: bbabc + 1: ba + bababc + 0: bababc + 1: ba + *** Failers +No match + bababbc +No match + babababc +No match + +/^\ca\cA\c[\c{\c:/ + \x01\x01\e;z + 0: \x01\x01\x1b;z + +/^[ab\]cde]/ + athing + 0: a + bthing + 0: b + ]thing + 0: ] + cthing + 0: c + dthing + 0: d + ething + 0: e + *** Failers +No match + fthing +No match + [thing +No match + \\thing +No match + +/^[]cde]/ + ]thing + 0: ] + cthing + 0: c + dthing + 0: d + ething + 0: e + *** Failers +No match + athing +No match + fthing +No match + +/^[^ab\]cde]/ + fthing + 0: f + [thing + 0: [ + \\thing + 0: \ + *** Failers + 0: * + athing +No match + bthing +No match + ]thing +No match + cthing +No match + dthing +No match + ething +No match + +/^[^]cde]/ + athing + 0: a + fthing + 0: f + *** Failers + 0: * + ]thing +No match + cthing +No match + dthing +No match + ething +No match + +/^\/ + + 0: \x81 + +/^ÿ/ + ÿ + 0: \xff + +/^[0-9]+$/ + 0 + 0: 0 + 1 + 0: 1 + 2 + 0: 2 + 3 + 0: 3 + 4 + 0: 4 + 5 + 0: 5 + 6 + 0: 6 + 7 + 0: 7 + 8 + 0: 8 + 9 + 0: 9 + 10 + 0: 10 + 100 + 0: 100 + *** Failers +No match + abc +No match + +/^.*nter/ + enter + 0: enter + inter + 0: inter + uponter + 0: uponter + +/^xxx[0-9]+$/ + xxx0 + 0: xxx0 + xxx1234 + 0: xxx1234 + *** Failers +No match + xxx +No match + +/^.+[0-9][0-9][0-9]$/ + x123 + 0: x123 + xx123 + 0: xx123 + 123456 + 0: 123456 + *** Failers +No match + 123 +No match + x1234 + 0: x1234 + +/^.+?[0-9][0-9][0-9]$/ + x123 + 0: x123 + xx123 + 0: xx123 + 123456 + 0: 123456 + *** Failers +No match + 123 +No match + x1234 + 0: x1234 + +/^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/ + abc!pqr=apquxz.ixr.zzz.ac.uk + 0: abc!pqr=apquxz.ixr.zzz.ac.uk + 1: abc + 2: pqr + *** Failers +No match + !pqr=apquxz.ixr.zzz.ac.uk +No match + abc!=apquxz.ixr.zzz.ac.uk +No match + abc!pqr=apquxz:ixr.zzz.ac.uk +No 
match + abc!pqr=apquxz.ixr.zzz.ac.ukk +No match + +/:/ + Well, we need a colon: somewhere + 0: : + *** Fail if we don't +No match + +/([\da-f:]+)$/i + 0abc + 0: 0abc + 1: 0abc + abc + 0: abc + 1: abc + fed + 0: fed + 1: fed + E + 0: E + 1: E + :: + 0: :: + 1: :: + 5f03:12C0::932e + 0: 5f03:12C0::932e + 1: 5f03:12C0::932e + fed def + 0: def + 1: def + Any old stuff + 0: ff + 1: ff + *** Failers +No match + 0zzz +No match + gzzz +No match + fed\x20 +No match + Any old rubbish +No match + +/^.*\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/ + .1.2.3 + 0: .1.2.3 + 1: 1 + 2: 2 + 3: 3 + A.12.123.0 + 0: A.12.123.0 + 1: 12 + 2: 123 + 3: 0 + *** Failers +No match + .1.2.3333 +No match + 1.2.3 +No match + 1234.2.3 +No match + +/^(\d+)\s+IN\s+SOA\s+(\S+)\s+(\S+)\s*\(\s*$/ + 1 IN SOA non-sp1 non-sp2( + 0: 1 IN SOA non-sp1 non-sp2( + 1: 1 + 2: non-sp1 + 3: non-sp2 + 1 IN SOA non-sp1 non-sp2 ( + 0: 1 IN SOA non-sp1 non-sp2 ( + 1: 1 + 2: non-sp1 + 3: non-sp2 + *** Failers +No match + 1IN SOA non-sp1 non-sp2( +No match + +/^[a-zA-Z\d][a-zA-Z\d\-]*(\.[a-zA-Z\d][a-zA-z\d\-]*)*\.$/ + a. + 0: a. + Z. + 0: Z. + 2. + 0: 2. + ab-c.pq-r. + 0: ab-c.pq-r. + 1: .pq-r + sxk.zzz.ac.uk. + 0: sxk.zzz.ac.uk. + 1: .uk + x-.y-. + 0: x-.y-. + 1: .y- + *** Failers +No match + -abc.peq. 
+No match + +/^\*\.[a-z]([a-z\-\d]*[a-z\d]+)?(\.[a-z]([a-z\-\d]*[a-z\d]+)?)*$/ + *.a + 0: *.a + *.b0-a + 0: *.b0-a + 1: 0-a + *.c3-b.c + 0: *.c3-b.c + 1: 3-b + 2: .c + *.c-a.b-c + 0: *.c-a.b-c + 1: -a + 2: .b-c + 3: -c + *** Failers +No match + *.0 +No match + *.a- +No match + *.a-b.c- +No match + *.c-a.0-c +No match + +/^(?=ab(de))(abd)(e)/ + abde + 0: abde + 1: de + 2: abd + 3: e + +/^(?!(ab)de|x)(abd)(f)/ + abdf + 0: abdf + 1: <unset> + 2: abd + 3: f + +/^(?=(ab(cd)))(ab)/ + abcd + 0: ab + 1: abcd + 2: cd + 3: ab + +/^[\da-f](\.[\da-f])*$/i + a.b.c.d + 0: a.b.c.d + 1: .d + A.B.C.D + 0: A.B.C.D + 1: .D + a.b.c.1.2.3.C + 0: a.b.c.1.2.3.C + 1: .C + +/^\".*\"\s*(;.*)?$/ + \"1234\" + 0: "1234" + \"abcd\" ; + 0: "abcd" ; + 1: ; + \"\" ; rhubarb + 0: "" ; rhubarb + 1: ; rhubarb + *** Failers +No match + \"1234\" : things +No match + +/^$/ + \ + 0: + *** Failers +No match + +/ ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/x + ab c + 0: ab c + *** Failers +No match + abc +No match + ab cde +No match + +/(?x) ^ a (?# begins with a) b\sc (?# then b c) $ (?# then end)/ + ab c + 0: ab c + *** Failers +No match + abc +No match + ab cde +No match + +/^ a\ b[c ]d $/x + a bcd + 0: a bcd + a b d + 0: a b d + *** Failers +No match + abcd +No match + ab d +No match + +/^(a(b(c)))(d(e(f)))(h(i(j)))(k(l(m)))$/ + abcdefhijklm + 0: abcdefhijklm + 1: abc + 2: bc + 3: c + 4: def + 5: ef + 6: f + 7: hij + 8: ij + 9: j +10: klm +11: lm +12: m + +/^(?:a(b(c)))(?:d(e(f)))(?:h(i(j)))(?:k(l(m)))$/ + abcdefhijklm + 0: abcdefhijklm + 1: bc + 2: c + 3: ef + 4: f + 5: ij + 6: j + 7: lm + 8: m + +/^[\w][\W][\s][\S][\d][\D][\b][\n][\c]][\022]/ + a+ Z0+\x08\n\x1d\x12 + 0: a+ Z0+\x08\x0a\x1d\x12 + +/^[.^$|()*+?{,}]+/ + .^\$(*+)|{?,?} + 0: .^$(*+)|{?,?} + +/^a*\w/ + z + 0: z + az + 0: az + aaaz + 0: aaaz + a + 0: a + aa + 0: aa + aaaa + 0: aaaa + a+ + 0: a + aa+ + 0: aa + +/^a*?\w/ + z + 0: z + az + 0: a + aaaz + 0: a + a + 0: a + aa + 0: a + aaaa + 0: a + a+ + 0: a + aa+ + 0: a + +/^a+\w/ 
+ az + 0: az + aaaz + 0: aaaz + aa + 0: aa + aaaa + 0: aaaa + aa+ + 0: aa + +/^a+?\w/ + az + 0: az + aaaz + 0: aa + aa + 0: aa + aaaa + 0: aa + aa+ + 0: aa + +/^\d{8}\w{2,}/ + 1234567890 + 0: 1234567890 + 12345678ab + 0: 12345678ab + 12345678__ + 0: 12345678__ + *** Failers +No match + 1234567 +No match + +/^[aeiou\d]{4,5}$/ + uoie + 0: uoie + 1234 + 0: 1234 + 12345 + 0: 12345 + aaaaa + 0: aaaaa + *** Failers +No match + 123456 +No match + +/^[aeiou\d]{4,5}?/ + uoie + 0: uoie + 1234 + 0: 1234 + 12345 + 0: 1234 + aaaaa + 0: aaaa + 123456 + 0: 1234 + +/\A(abc|def)=(\1){2,3}\Z/ + abc=abcabc + 0: abc=abcabc + 1: abc + 2: abc + def=defdefdef + 0: def=defdefdef + 1: def + 2: def + *** Failers +No match + abc=defdef +No match + +/^(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\11*(\3\4)\1(?#)2$/ + abcdefghijkcda2 + 0: abcdefghijkcda2 + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i +10: j +11: k +12: cd + abcdefghijkkkkcda2 + 0: abcdefghijkkkkcda2 + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i +10: j +11: k +12: cd + +/(cat(a(ract|tonic)|erpillar)) \1()2(3)/ + cataract cataract23 + 0: cataract cataract23 + 1: cataract + 2: aract + 3: ract + 4: + 5: 3 + catatonic catatonic23 + 0: catatonic catatonic23 + 1: catatonic + 2: atonic + 3: tonic + 4: + 5: 3 + caterpillar caterpillar23 + 0: caterpillar caterpillar23 + 1: caterpillar + 2: erpillar + 3: <unset> + 4: + 5: 3 + + +/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/ + From abcd Mon Sep 01 12:33:02 1997 + 0: From abcd Mon Sep 01 12:33 + 1: abcd + +/^From\s+\S+\s+([a-zA-Z]{3}\s+){2}\d{1,2}\s+\d\d:\d\d/ + From abcd Mon Sep 01 12:33:02 1997 + 0: From abcd Mon Sep 01 12:33 + 1: Sep + From abcd Mon Sep 1 12:33:02 1997 + 0: From abcd Mon Sep 1 12:33 + 1: Sep + *** Failers +No match + From abcd Sep 01 12:33:02 1997 +No match + +/^12.34/s + 12\n34 + 0: 12\x0a34 + 12\r34 + 0: 12\x0d34 + +/\w+(?=\t)/ + the quick brown\t fox + 0: brown + +/foo(?!bar)(.*)/ + 
foobar is foolish see? + 0: foolish see? + 1: lish see? + +/(?:(?!foo)...|^.{0,2})bar(.*)/ + foobar crowbar etc + 0: rowbar etc + 1: etc + barrel + 0: barrel + 1: rel + 2barrel + 0: 2barrel + 1: rel + A barrel + 0: A barrel + 1: rel + +/^(\D*)(?=\d)(?!123)/ + abc456 + 0: abc + 1: abc + *** Failers +No match + abc123 +No match + +/^1234(?# test newlines + inside)/ + 1234 + 0: 1234 + +/^1234 #comment in extended re + /x + 1234 + 0: 1234 + +/#rhubarb + abcd/x + abcd + 0: abcd + +/^abcd#rhubarb/x + abcd + 0: abcd + +/^(a)\1{2,3}(.)/ + aaab + 0: aaab + 1: a + 2: b + aaaab + 0: aaaab + 1: a + 2: b + aaaaab + 0: aaaaa + 1: a + 2: a + aaaaaab + 0: aaaaa + 1: a + 2: a + +/(?!^)abc/ + the abc + 0: abc + *** Failers +No match + abc +No match + +/(?=^)abc/ + abc + 0: abc + *** Failers +No match + the abc +No match + +/^[ab]{1,3}(ab*|b)/ + aabbbbb + 0: aabb + 1: b + +/^[ab]{1,3}?(ab*|b)/ + aabbbbb + 0: aabbbbb + 1: abbbbb + +/^[ab]{1,3}?(ab*?|b)/ + aabbbbb + 0: aa + 1: a + +/^[ab]{1,3}(ab*?|b)/ + aabbbbb + 0: aabb + 1: b + +/ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # optional leading comment +(?: (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # initial word +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) )* # further okay, if led by a period +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +# address +| # or +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # one word, optionally followed by.... +(?: +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] | # atom and space parts, or... 
+\( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) | # comments, or... + +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +# quoted strings +)* +< (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # leading < +(?: @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* + +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* , (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +)* # further okay, if led by comma +: # closing colon +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* )? # optional route +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... +[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) # initial word +(?: (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +" (?: # opening quote... 
+[^\\\x80-\xff\n\015"] # Anything except backslash and quote +| # or +\\ [^\x80-\xff] # Escaped something (something != CR) +)* " # closing quote +) )* # further okay, if led by a period +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* @ (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # initial subdomain +(?: # +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* \. # if led by a period... +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* (?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| \[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) # ...further okay +)* +# address spec +(?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* > # trailing > +# name and address +) (?: [\040\t] | \( +(?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] | \( (?: [^\\\x80-\xff\n\015()] | \\ [^\x80-\xff] )* \) )* +\) )* # optional trailing comment +/x + Alan Other <user\@dom.ain> + 0: Alan Other <user@dom.ain> + <user\@dom.ain> + 0: user@dom.ain + user\@dom.ain + 0: user@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + 0: "A. Other" <user.1234@dom.ain> (a comment) + A. 
Other <user.1234\@dom.ain> (a comment) + 0: Other <user.1234@dom.ain> (a comment) + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + 0: "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay + A missing angle <user\@some.where + 0: user@some.where + *** Failers +No match + The quick brown fox +No match + +/[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional leading comment +(?: +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# additional words +)* +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +# address +| # or +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +# leading word +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # "normal" atoms and or spaces +(?: +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +| +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +) # "special" comment or quoted string +[^()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037] * # more "normal" +)* +< +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# < +(?: +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +(?: , +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +)* # additional domains +: +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)? # optional route +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. 
+(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +# Atom +| # or +" # " +[^\\\x80-\xff\n\015"] * # normal +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015"] * )* # ( special normal* )* +" # " +# Quoted string +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# additional words +)* +@ +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... 
+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +(?: +\. +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +(?: +[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+ # some number of atom characters... +(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]) # ..not followed by something that could be part of an atom +| +\[ # [ +(?: [^\\\x80-\xff\n\015\[\]] | \\ [^\x80-\xff] )* # stuff +\] # ] +) +[\040\t]* # Nab whitespace. +(?: +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: # ( +(?: \\ [^\x80-\xff] | +\( # ( +[^\\\x80-\xff\n\015()] * # normal* +(?: \\ [^\x80-\xff] [^\\\x80-\xff\n\015()] * )* # (special normal*)* +\) # ) +) # special +[^\\\x80-\xff\n\015()] * # normal* +)* # )* +\) # ) +[\040\t]* )* # If comment found, allow more spaces. +# optional trailing comments +)* +# address spec +> # > +# name and address +) +/x + Alan Other <user\@dom.ain> + 0: Alan Other <user@dom.ain> + <user\@dom.ain> + 0: user@dom.ain + user\@dom.ain + 0: user@dom.ain + \"A. Other\" <user.1234\@dom.ain> (a comment) + 0: "A. Other" <user.1234@dom.ain> + A. 
Other <user.1234\@dom.ain> (a comment) + 0: Other <user.1234@dom.ain> + \"/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/\"\@x400-re.lay + 0: "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay + A missing angle <user\@some.where + 0: user@some.where + *** Failers +No match + The quick brown fox +No match + +/abc\0def\00pqr\000xyz\0000AB/ + abc\0def\00pqr\000xyz\0000AB + 0: abc\x00def\x00pqr\x00xyz\x000AB + abc456 abc\0def\00pqr\000xyz\0000ABCDE + 0: abc\x00def\x00pqr\x00xyz\x000AB + +/abc\x0def\x00pqr\x000xyz\x0000AB/ + abc\x0def\x00pqr\x000xyz\x0000AB + 0: abc\x0def\x00pqr\x000xyz\x0000AB + abc456 abc\x0def\x00pqr\x000xyz\x0000ABCDE + 0: abc\x0def\x00pqr\x000xyz\x0000AB + +/^[\000-\037]/ + \0A + 0: \x00 + \01B + 0: \x01 + \037C + 0: \x1f + +/\0*/ + \0\0\0\0 + 0: \x00\x00\x00\x00 + +/A\x0{2,3}Z/ + The A\x0\x0Z + 0: A\x00\x00Z + An A\0\x0\0Z + 0: A\x00\x00\x00Z + *** Failers +No match + A\0Z +No match + A\0\x0\0\x0Z +No match + +/^(cow|)\1(bell)/ + cowcowbell + 0: cowcowbell + 1: cow + 2: bell + bell + 0: bell + 1: + 2: bell + *** Failers +No match + cowbell +No match + +/^\s/ + \040abc + 0: + \x0cabc + 0: \x0c + \nabc + 0: \x0a + \rabc + 0: \x0d + \tabc + 0: \x09 + *** Failers +No match + abc +No match + +/^a b +
c/x + abc + 0: abc + +/^(a|)\1*b/ + ab + 0: ab + 1: a + aaaab + 0: aaaab + 1: a + b + 0: b + 1: + *** Failers +No match + acb +No match + +/^(a|)\1+b/ + aab + 0: aab + 1: a + aaaab + 0: aaaab + 1: a + b + 0: b + 1: + *** Failers +No match + ab +No match + +/^(a|)\1?b/ + ab + 0: ab + 1: a + aab + 0: aab + 1: a + b + 0: b + 1: + *** Failers +No match + acb +No match + +/^(a|)\1{2}b/ + aaab + 0: aaab + 1: a + b + 0: b + 1: + *** Failers +No match + ab +No match + aab +No match + aaaab +No match + +/^(a|)\1{2,3}b/ + aaab + 0: aaab + 1: a + aaaab + 0: aaaab + 1: a + b + 0: b + 1: + *** Failers +No match + ab +No match + aab +No match + aaaaab +No match + +/ab{1,3}bc/ + abbbbc + 0: abbbbc + abbbc + 0: abbbc + abbc + 0: abbc + *** Failers +No match + abc +No match + abbbbbc +No match + +/([^.]*)\.([^:]*):[T ]+(.*)/ + track1.title:TBlah blah blah + 0: track1.title:TBlah blah blah + 1: track1 + 2: title + 3: Blah blah blah + +/([^.]*)\.([^:]*):[T ]+(.*)/i + track1.title:TBlah blah blah + 0: track1.title:TBlah blah blah + 1: track1 + 2: title + 3: Blah blah blah + +/([^.]*)\.([^:]*):[t ]+(.*)/i + track1.title:TBlah blah blah + 0: track1.title:TBlah blah blah + 1: track1 + 2: title + 3: Blah blah blah + +/^[W-c]+$/ + WXY_^abc + 0: WXY_^abc + ***Failers +No match + wxy +No match + +/^[W-c]+$/i + WXY_^abc + 0: WXY_^abc + wxy_^ABC + 0: wxy_^ABC + +/^[\x3f-\x5F]+$/i + WXY_^abc + 0: WXY_^abc + wxy_^ABC + 0: wxy_^ABC + +/^abc$/m + abc + 0: abc + qqq\nabc + 0: abc + abc\nzzz + 0: abc + qqq\nabc\nzzz + 0: abc + +/^abc$/ + abc + 0: abc + *** Failers +No match + qqq\nabc +No match + abc\nzzz +No match + qqq\nabc\nzzz +No match + +/\Aabc\Z/m + abc + 0: abc + *** Failers +No match + qqq\nabc +No match + abc\nzzz +No match + qqq\nabc\nzzz +No match + +/\A(.)*\Z/s + abc\ndef + 0: abc\x0adef + 1: f + +/\A(.)*\Z/m + *** Failers + 0: *** Failers + 1: s + abc\ndef +No match + +/(?:b)|(?::+)/ + b::c + 0: b + c::b + 0: :: + +/[-az]+/ + az- + 0: az- + *** Failers + 0: a + b +No match + +/[az-]+/ 
+ za- + 0: za- + *** Failers + 0: a + b +No match + +/[a\-z]+/ + a-z + 0: a-z + *** Failers + 0: a + b +No match + +/[a-z]+/ + abcdxyz + 0: abcdxyz + +/[\d-]+/ + 12-34 + 0: 12-34 + *** Failers +No match + aaa +No match + +/[\d-z]+/ + 12-34z + 0: 12-34z + *** Failers +No match + aaa +No match + +/\x5c/ + \\ + 0: \ + +/\x20Z/ + the Zoo + 0: Z + *** Failers +No match + Zulu +No match + +/(abc)\1/i + abcabc + 0: abcabc + 1: abc + ABCabc + 0: ABCabc + 1: ABC + abcABC + 0: abcABC + 1: abc + +/(main(OPT)?)+/ + mainmain + 0: mainmain + 1: main + mainOPTmain + 0: mainOPTmain + 1: main + 2: OPT + +/ab{3cd/ + ab{3cd + 0: ab{3cd + +/ab{3,cd/ + ab{3,cd + 0: ab{3,cd + +/ab{3,4a}cd/ + ab{3,4a}cd + 0: ab{3,4a}cd + +/{4,5a}bc/ + {4,5a}bc + 0: {4,5a}bc + +/^a.b/ + a\rb + 0: a\x0db + *** Failers +No match + a\nb +No match + +/abc$/ + abc + 0: abc + abc\n + 0: abc + *** Failers +No match + abc\ndef +No match + +/(abc)\123/ + abc\x53 + 0: abcS + 1: abc + +/(abc)\223/ + abc\x93 + 0: abc\x93 + 1: abc + +/(abc)\323/ + abc\xd3 + 0: abc\xd3 + 1: abc + +/(abc)\500/ + abc\x40 + 0: abc@ + 1: abc + abc\100 + 0: abc@ + 1: abc + +/(abc)\5000/ + abc\x400 + 0: abc@0 + 1: abc + abc\x40\x30 + 0: abc@0 + 1: abc + abc\1000 + 0: abc@0 + 1: abc + abc\100\x30 + 0: abc@0 + 1: abc + abc\100\060 + 0: abc@0 + 1: abc + abc\100\60 + 0: abc@0 + 1: abc + +/abc\81/ + abc\081 + 0: abc\x0081 + abc\0\x38\x31 + 0: abc\x0081 + +/abc\91/ + abc\091 + 0: abc\x0091 + abc\0\x39\x31 + 0: abc\x0091 + +/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)(l)\12\123/ + abcdefghijkllS + 0: abcdefghijkllS + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i +10: j +11: k +12: l + +/(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)\12\123/ + abcdefghijk\12S + 0: abcdefghijk\x0aS + 1: a + 2: b + 3: c + 4: d + 5: e + 6: f + 7: g + 8: h + 9: i +10: j +11: k + +/ab\gdef/ + abgdef + 0: abgdef + +/a{0}bc/ + bc + 0: bc + +/(a|(bc)){0,0}?xyz/ + xyz + 0: xyz + +/abc[\10]de/ + abc\010de + 0: abc\x08de + +/abc[\1]de/ + abc\1de + 0: abc\x01de + +/(abc)[\1]de/ + 
abc\1de + 0: abc\x01de + 1: abc + +/a.b(?s)/ + a\nb + 0: a\x0ab + +/^([^a])([^\b])([^c]*)([^d]{3,4})/ + baNOTccccd + 0: baNOTcccc + 1: b + 2: a + 3: NOT + 4: cccc + baNOTcccd + 0: baNOTccc + 1: b + 2: a + 3: NOT + 4: ccc + baNOTccd + 0: baNOTcc + 1: b + 2: a + 3: NO + 4: Tcc + bacccd + 0: baccc + 1: b + 2: a + 3: + 4: ccc + *** Failers + 0: *** Failers + 1: * + 2: * + 3: * Fail + 4: ers + anything +No match + b\bc +No match + baccd +No match + +/[^a]/ + Abc + 0: A + +/[^a]/i + Abc + 0: b + +/[^a]+/ + AAAaAbc + 0: AAA + +/[^a]+/i + AAAaAbc + 0: bc + +/[^a]+/ + bbb\nccc + 0: bbb\x0accc + +/ End of test input / + diff --git a/testoutput2 b/testoutput2 new file mode 100644 index 0000000..ec09561 --- /dev/null +++ b/testoutput2 @@ -0,0 +1,573 @@ +Testing Perl-Compatible Regular Expressions +PCRE version 1.00 18-Nov-1997 + +/(a)b|/ +Identifying subpattern count = 1 +No options +No first char + +/(a*)*/ +Failed: operand of unlimited repeat could match the empty string at offset 4 + +/(abc|)+/ +Failed: operand of unlimited repeat could match the empty string at offset 6 + +/abc/ +Identifying subpattern count = 0 +No options +First char = 'a' + abc + 0: abc + defabc + 0: abc + \Aabc + 0: abc + \IABC + 0: ABC + *** Failers +No match + \Adefabc +No match + ABC +No match + +/^abc/ +Identifying subpattern count = 0 +Options: anchored +No first char + abc + 0: abc + \Aabc + 0: abc + *** Failers +No match + defabc +No match + \Adefabc +No match + +/a+bc/ +Identifying subpattern count = 0 +No options +First char = 'a' + +/a*bc/ +Identifying subpattern count = 0 +No options +No first char + +/a{3}bc/ +Identifying subpattern count = 0 +No options +First char = 'a' + +/(abc|a+z)/ +Identifying subpattern count = 1 +No options +First char = 'a' + +/^abc$/ +Identifying subpattern count = 0 +Options: anchored +No first char + abc + 0: abc + \Mdef\nabc + 0: abc + *** Failers +No match + def\nabc +No match + +/abc\/ +Failed: \ at end of pattern at offset 4 + +/ab\gdef/X +Failed: 
unrecognized character follows \ at offset 3 + +/x{5,4}/ +Failed: numbers out of order in {} quantifier at offset 5 + +/z{65536}/ +Failed: number too big in {} quantifier at offset 7 + +/[abcd/ +Failed: missing terminating ] for character class at offset 5 + +/[\B]/ +Failed: invalid escape sequence in character class at offset 2 + +/[a-\w]/ +Failed: invalid escape sequence in character class at offset 4 + +/[z-a]/ +Failed: range out of order in character class at offset 3 + +/^*/ +Failed: nothing to repeat at offset 1 + +/(abc/ +Failed: missing ) at offset 4 + +/(?# abc/ +Failed: missing ) after comment at offset 7 + +/(?z)abc/ +Failed: unrecognized character after (? at offset 2 + +/.*b/ +Identifying subpattern count = 0 +Options: anchored +No first char + +/.*?b/ +Identifying subpattern count = 0 +Options: anchored +No first char + +/cat|dog|elephant/ +Identifying subpattern count = 0 +No options +No first char + this sentence eventually mentions a cat + 0: cat + this sentences rambles on and on for a while and then reaches elephant + 0: elephant + +/cat|dog|elephant/S +Identifying subpattern count = 0 +No options +No first char +Starting character set: c d e + this sentence eventually mentions a cat + 0: cat + this sentences rambles on and on for a while and then reaches elephant + 0: elephant + +/cat|dog|elephant/iS +Identifying subpattern count = 0 +Options: caseless +No first char +Starting character set: C D E c d e + this sentence eventually mentions a CAT cat + 0: CAT + this sentences rambles on and on for a while to elephant ElePhant + 0: elephant + +/cat|dog|elephant/IS +Identifying subpattern count = 0 +No options +No first char +Starting character set: C D E c d e + this sentence eventually mentions a CAT cat + 0: cat + this sentences rambles on and on for a while to elephant ElePhant + 0: elephant + +/cat|dog|elephant/IS +Identifying subpattern count = 0 +No options +No first char +Starting character set: C D E c d e + \Ithis sentence eventually 
mentions a CAT cat + 0: CAT + \Ithis sentences rambles on and on for a while to elephant ElePhant + 0: elephant + +/a|[bcd]/S +Identifying subpattern count = 0 +No options +No first char +Starting character set: a b c d + +/(a|[^\dZ])/S +Identifying subpattern count = 1 +No options +No first char +Starting character set: \x00 \x01 \x02 \x03 \x04 \x05 \x06 \x07 \x08 \x09 \x0a + \x0b \x0c \x0d \x0e \x0f \x10 \x11 \x12 \x13 \x14 \x15 \x16 \x17 \x18 \x19 + \x1a \x1b \x1c \x1d \x1e \x1f \x20 ! " # $ % & ' ( ) * + , - . / : ; < = > + ? @ A B C D E F G H I J K L M N O P Q R S T U V W X Y [ \ ] ^ _ ` a b c d + e f g h i j k l m n o p q r s t u v w x y z { | } ~ \x7f \x80 \x81 \x82 \x83 + \x84 \x85 \x86 \x87 \x88 \x89 \x8a \x8b \x8c \x8d \x8e \x8f \x90 \x91 \x92 + \x93 \x94 \x95 \x96 \x97 \x98 \x99 \x9a \x9b \x9c \x9d \x9e \x9f \xa0 \xa1 + \xa2 \xa3 \xa4 \xa5 \xa6 \xa7 \xa8 \xa9 \xaa \xab \xac \xad \xae \xaf \xb0 + \xb1 \xb2 \xb3 \xb4 \xb5 \xb6 \xb7 \xb8 \xb9 \xba \xbb \xbc \xbd \xbe \xbf + \xc0 \xc1 \xc2 \xc3 \xc4 \xc5 \xc6 \xc7 \xc8 \xc9 \xca \xcb \xcc \xcd \xce + \xcf \xd0 \xd1 \xd2 \xd3 \xd4 \xd5 \xd6 \xd7 \xd8 \xd9 \xda \xdb \xdc \xdd + \xde \xdf \xe0 \xe1 \xe2 \xe3 \xe4 \xe5 \xe6 \xe7 \xe8 \xe9 \xea \xeb \xec + \xed \xee \xef \xf0 \xf1 \xf2 \xf3 \xf4 \xf5 \xf6 \xf7 \xf8 \xf9 \xfa \xfb + \xfc \xfd \xfe \xff + +/(a|b)*[\s]/S +Identifying subpattern count = 1 +No options +No first char +Starting character set: \x09 \x0a \x0b \x0c \x0d \x20 a b + +/(ab\2)/ +Failed: back reference to non-existent subpattern at offset 4 + +/{4,5}abc/ +Failed: nothing to repeat at offset 4 + +/(a)(b)(c)\2/ +Identifying subpattern count = 3 +No options +First char = 'a' + abcb + 0: abcb + 1: a + 2: b + 3: c + \O0abcb +Matched, but too many substrings + \O2abcb +Matched, but too many substrings + 0: abcb + \O4abcb +Matched, but too many substrings + 0: abcb + 1: a + \O6abcb +Matched, but too many substrings + 0: abcb + 1: a + 2: b + \O8abcb + 0: abcb + 1: a + 2: b + 3: c + +/(a)bc|(a)(b)\2/ 
+Identifying subpattern count = 3 +No options +First char = 'a' + abc + 0: abc + 1: a + \O0abc +Matched, but too many substrings + \O2abc +Matched, but too many substrings + 0: abc + \O4abc + 0: abc + 1: a + aba + 0: aba + 1: <unset> + 2: a + 3: b + \O0aba +Matched, but too many substrings + \O2aba +Matched, but too many substrings + 0: aba + \O4aba +Matched, but too many substrings + 0: aba + 1: <unset> + \O6aba +Matched, but too many substrings + 0: aba + 1: <unset> + 2: a + \O8aba + 0: aba + 1: <unset> + 2: a + 3: b + +/^a.b/ +Identifying subpattern count = 0 +Options: anchored +No first char + \Sa\nb + 0: a\x0ab + +/abc$/E +Identifying subpattern count = 0 +Options: dollar_endonly +First char = 'a' + abc + 0: abc + *** Failers +No match + abc\n +No match + abc\ndef +No match + +/abc$/ +Identifying subpattern count = 0 +No options +First char = 'a' + *** Failers +No match + \Eabc\n +No match + \Eabc\ndef +No match + +/abc$/m +Identifying subpattern count = 0 +Options: multiline +First char = 'a' + \Eabc\n + 0: abc + \Eabc\ndef + 0: abc + +/(a)(b)(c)(d)(e)\6/ +Failed: back reference to non-existent subpattern at offset 16 + +/the quick brown fox/ +Identifying subpattern count = 0 +No options +First char = 't' + the quick brown fox + 0: the quick brown fox + this is a line with the quick brown fox + 0: the quick brown fox + +/the quick brown fox/A +Identifying subpattern count = 0 +Options: anchored +No first char + the quick brown fox + 0: the quick brown fox + *** Failers +No match + this is a line with the quick brown fox +No match + +/ab(?z)cd/ +Failed: unrecognized character after (? 
at offset 4 + +".*/\Xfoo"X +Identifying subpattern count = 0 +Options: anchored extra +No first char + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ +No match + +".*/\Xfoo"X +Identifying subpattern count = 0 +Options: anchored extra +No first char + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + 0: /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + +/(\.\d\d[1-9]?)\d+/ +Identifying subpattern count = 1 +No options +First char = '.' + 1.230003938 + 0: .230003938 + 1: .23 + 1.875000282 + 0: .875000282 + 1: .875 + 1.235 + 0: .235 + 1: .23 + +/(\.\d\d[1-9]?)\X\d+/X +Identifying subpattern count = 1 +Options: extra +First char = '.' + 1.230003938 + 0: .230003938 + 1: .23 + 1.875000282 + 0: .875000282 + 1: .875 + *** Failers +No match + 1.235 +No match + +/(\.\d\d((?=0)|\d(?=\d)))/ +Identifying subpattern count = 2 +No options +First char = '.' + 1.230003938 + 0: .23 + 1: .23 + 2: + 1.875000282 + 0: .875 + 1: .875 + 2: 5 + *** Failers +No match + 1.235 +No match + +/^(\w+\X|\s+\X)*$/X +Identifying subpattern count = 1 +Options: anchored extra +No first char + now is the time for all good men to come to the aid of the party + 0: now is the time for all good men to come to the aid of the party + 1: party + *** Failers +No match + this is not a line with only words and spaces! 
+No match + +/^abc|def/ +Identifying subpattern count = 0 +No options +No first char + abcdef + 0: abc + abcdef\B + 0: def + +/.*((abc)$|(def))/ +Identifying subpattern count = 3 +Options: anchored +No first char + defabc + 0: defabc + 1: abc + 2: abc + \Zdefabc + 0: def + 1: def + 2: <unset> + 3: def + +/abc/P + abc + 0: abc + *** Failers +No match: POSIX code 17: match failed + +/^abc|def/P + abcdef + 0: abc + abcdef\B + 0: def + +/.*((abc)$|(def))/P + defabc + 0: defabc + 1: abc + 2: abc + \Zdefabc + 0: def + 1: def + 3: def + +/the quick brown fox/P + the quick brown fox + 0: the quick brown fox + *** Failers +No match: POSIX code 17: match failed + The Quick Brown Fox +No match: POSIX code 17: match failed + +/the quick brown fox/Pi + the quick brown fox + 0: the quick brown fox + The Quick Brown Fox + 0: The Quick Brown Fox + +/abc.def/P + *** Failers +No match: POSIX code 17: match failed + abc\ndef +No match: POSIX code 17: match failed + +/abc$/P + abc + 0: abc + abc\n + 0: abc + +/abc\/P +Failed: POSIX code 9: bad escape sequence at offset 4 + +/(abc)\2/P +Failed: POSIX code 15: bad back reference at offset 6 + +/(abc\1)/P + abc +No match: POSIX code 15: bad back reference + +"(?>.*/)foo"X +Identifying subpattern count = 0 +Options: anchored extra +No first char + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/it/you/see/ +No match + +"(?>.*/)foo"X +Identifying subpattern count = 0 +Options: anchored extra +No first char + /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + 0: /this/is/a/very/long/line/in/deed/with/very/many/slashes/in/and/foo + +/(?>(\.\d\d[1-9]?))\d+/X +Identifying subpattern count = 1 +Options: extra +No first char + 1.230003938 + 0: .230003938 + 1: .23 + 1.875000282 + 0: .875000282 + 1: .875 + *** Failers +No match + 1.235 +No match + +/^((?>\w+)|(?>\s+))*$/X +Identifying subpattern count = 1 +Options: anchored extra +No first char + now is the time for all good men to come to the aid of the party + 0: 
now is the time for all good men to come to the aid of the party + 1: party + *** Failers +No match + this is not a line with only words and spaces! +No match + +/(\d+)(\w)/X +Identifying subpattern count = 2 +Options: extra +No first char + 12345a + 0: 12345a + 1: 12345 + 2: a + 12345+ + 0: 12345 + 1: 1234 + 2: 5 + +/((?>\d+))(\w)/X +Identifying subpattern count = 2 +Options: extra +No first char + 12345a + 0: 12345a + 1: 12345 + 2: a + *** Failers +No match + 12345+ +No match + +/ End of test input / +Identifying subpattern count = 0 +No options +First char = ' ' +