author | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2011-07-02 15:20:59 +0000 |
---|---|---|
committer | ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> | 2011-07-02 15:20:59 +0000 |
commit | 5c7a0c52f657f9ee5670cddc9466e239243c9b18 | |
tree | b5e2b3ffe768624a719d31485a06fd83f31f4fde | |
parent | 477829e693c6a38cc3443ea90b2dacb19a2eddfc | |
Fix two study bugs concerned with minimum subject lengths; add features to
pcretest so that all tests can be run with or without study; adjust tests so
that this happens.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@612 2f5784b3-3f2a-0410-8824-cb99058d5e15
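
The pcretest side of this change comes down to a small decision rule: one /S on a pattern requests studying, and a second /S cancels it and also blocks the command-line -s. A minimal C sketch of that rule follows, modelled on the `case 'S':` branch and the `do_study || (force_study && !no_force_study)` test in the pcretest.c hunk of this commit. The `study_flags` struct and the two helper functions are hypothetical names used only for illustration; pcretest itself keeps these flags as local variables inside its main loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the two per-pattern flags that appear in the
   pcretest.c hunk (do_study and no_force_study). Not part of pcretest. */
typedef struct {
  int do_study;        /* set by a single /S on the pattern */
  int no_force_study;  /* set by a second /S; makes -s ineffective */
} study_flags;

/* Hypothetical helper: scan a modifier string such as "IS" or "KSS" and
   apply the same toggle as the 'S' case in the diff. Every character
   other than 'S' is ignored here. */
static study_flags parse_study_modifiers(const char *modifiers)
{
  study_flags f = { 0, 0 };
  assert(modifiers != NULL);
  for (; *modifiers != '\0'; modifiers++)
    {
    if (*modifiers != 'S') continue;
    if (f.do_study == 0) f.do_study = 1; else
      {
      f.do_study = 0;
      f.no_force_study = 1;
      }
    }
  return f;
}

/* Hypothetical helper: force_study stands for the -s command line option.
   The expression is the one used in the diff to decide whether to study. */
static int should_study(study_flags f, int force_study)
{
  return f.do_study || (force_study && !f.no_force_study);
}
```

Under this sketch, a pattern written with /KSS is not studied whether or not -s is given, a pattern with a single /S is always studied, and a pattern with no S modifier is studied only when -s is present.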
-rw-r--r-- | ChangeLog | 13 |
-rw-r--r-- | HACKING | 5 |
-rwxr-xr-x | RunTest | 227 |
-rw-r--r-- | doc/pcretest.1 | 54 |
-rw-r--r-- | pcre_internal.h | 6 |
-rw-r--r-- | pcre_study.c | 44 |
-rw-r--r-- | pcretest.c | 36 |
-rwxr-xr-x | perltest.pl | 4 |
-rw-r--r-- | testdata/testinput11 | 10 |
-rw-r--r-- | testdata/testinput2 | 80 |
-rw-r--r-- | testdata/testinput5 | 2 |
-rw-r--r-- | testdata/testinput7 | 4 |
-rw-r--r-- | testdata/testoutput11 | 13 |
-rw-r--r-- | testdata/testoutput2 | 332 |
-rw-r--r-- | testdata/testoutput5 | 6 |
-rw-r--r-- | testdata/testoutput7 | 24 |
16 files changed, 667 insertions, 193 deletions
@@ -79,8 +79,10 @@ Version 8.13 30-Apr-2011 synonym of -m (show memory usage). I have changed it to mean "force study for every regex", that is, assume /S for every regex. This is similar to -i and -d etc. It's slightly incompatible, but I'm hoping nobody is still - using it. It makes it easier to run collection of tests with study enabled, - and thereby test pcre_study() more easily. + using it. It makes it easier to run collections of tests with and without + study enabled, and thereby test pcre_study() more easily. All the standard + tests are now run with and without -s (but some patterns can be marked as + "never study" - see 20 below). 15. When (*ACCEPT) was used in a subpattern that was called recursively, the restoration of the capturing data to the outer values was not happening @@ -101,6 +103,13 @@ Version 8.13 30-Apr-2011 18. If a pattern containing \R was studied, it was assumed that \R always matched two bytes, thus causing the minimum subject length to be incorrectly computed because \R can also match just one byte. + +19. If a pattern containing (*ACCEPT) was studied, the minimum subject length + was incorrectly computed. + +20. If /S is present twice on a test pattern in pcretest input, it *disables* + studying, thereby overriding the use of -s on the command line. This is + necessary for one or two tests to keep the output identical in both cases. Version 8.12 15-Jan-2011 @@ -2,7 +2,8 @@ Technical Notes about PCRE -------------------------- These are very rough technical notes that record potentially useful information -about PCRE internals. +about PCRE internals. For information about testing PCRE, see the pcretest +documentation and the comment at the head of the RunTest file. Historical note 1 @@ -449,4 +450,4 @@ next item. Philip Hazel -May 2011 +July 2011 @@ -1,6 +1,14 @@ #! /bin/sh -# Run PCRE tests. +# Run the PCRE tests using the pcretest program. 
All tests are now run both +# with and without -s, to ensure that everything is tested with and without +# studying. However, there are some tests that produce different output after +# studying, typically when we are tracing the actual matching process (for +# example, using auto-callouts). In these few cases, the tests are duplicated +# in the files, one with /S to force studying always, and one with /SS to force +# *not* studying always. The use of -s doesn't then make any difference to +# their output. There is also one test which compiles invalid UTF-8 with the +# UTF-8 check turned off for which studying is disabled with /SS. valgrind= @@ -137,33 +145,37 @@ echo PCRE C library tests if [ $do1 = yes ] ; then echo "Test 1: main functionality (Compatible with Perl >= 5.8)" - $valgrind ./pcretest -q $testdata/testinput1 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput1 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt $testdata/testinput1 testtry + if [ $? = 0 ] ; then + $cf $testdata/testoutput1 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi # PCRE tests that are not Perl-compatible - API, errors, internals if [ $do2 = yes ] ; then echo "Test 2: API, errors, internals, and non-Perl stuff" - $valgrind ./pcretest -q $testdata/testinput2 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput2 testtry - if [ $? != 0 ] ; then exit 1; fi - else - echo " " - echo "** Test 2 requires a lot of stack. If it has crashed with a" - echo "** segmentation fault, it may be that you do not have enough" - echo "** stack available by default. Please see the 'pcrestack' man" - echo "** page for a discussion of PCRE's stack usage." - echo " " - exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt $testdata/testinput2 testtry + if [ $? 
= 0 ] ; then + $cf $testdata/testoutput2 testtry + if [ $? != 0 ] ; then exit 1; fi + else + echo " " + echo "** Test 2 requires a lot of stack. If it has crashed with a" + echo "** segmentation fault, it may be that you do not have enough" + echo "** stack available by default. Please see the 'pcrestack' man" + echo "** page for a discussion of PCRE's stack usage." + echo " " + exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi # Locale-specific tests, provided that either the "fr_FR" or the "french" @@ -191,19 +203,22 @@ if [ $do3 = yes ] ; then if [ "$locale" != "" ] ; then echo "Test 3: locale-specific features (using '$locale' locale)" - $valgrind ./pcretest -q $infile testtry - if [ $? = 0 ] ; then - $cf $outfile testtry - if [ $? != 0 ] ; then - echo " " - echo "Locale test did not run entirely successfully." - echo "This usually means that there is a problem with the locale" - echo "settings rather than a bug in PCRE." - else - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt $infile testtry + if [ $? = 0 ] ; then + $cf $outfile testtry + if [ $? != 0 ] ; then + echo " " + echo "Locale test did not run entirely successfully." + echo "This usually means that there is a problem with the locale" + echo "settings rather than a bug in PCRE." + break; + else + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + fi + else exit 1 fi - else exit 1 - fi + done else echo "Cannot test locale-specific features - neither the 'fr_FR' nor the" echo "'french' locale exists, or the \"locale\" command is not available" @@ -216,70 +231,82 @@ fi if [ $do4 = yes ] ; then echo "Test 4: UTF-8 support (Compatible with Perl >= 5.8)" - $valgrind ./pcretest -q $testdata/testinput4 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput4 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt $testdata/testinput4 testtry + if [ $? 
= 0 ] ; then + $cf $testdata/testoutput4 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi if [ $do5 = yes ] ; then echo "Test 5: API, internals, and non-Perl stuff for UTF-8 support" - $valgrind ./pcretest -q $testdata/testinput5 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput5 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt $testdata/testinput5 testtry + if [ $? = 0 ] ; then + $cf $testdata/testoutput5 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi if [ $do6 = yes ] ; then echo "Test 6: Unicode property support (Compatible with Perl >= 5.10)" - $valgrind ./pcretest -q $testdata/testinput6 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput6 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt $testdata/testinput6 testtry + if [ $? = 0 ] ; then + $cf $testdata/testoutput6 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi # Tests for DFA matching support if [ $do7 = yes ] ; then echo "Test 7: DFA matching" - $valgrind ./pcretest -q -dfa $testdata/testinput7 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput7 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt -dfa $testdata/testinput7 testtry + if [ $? = 0 ] ; then + $cf $testdata/testoutput7 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi if [ $do8 = yes ] ; then echo "Test 8: DFA matching with UTF-8" - $valgrind ./pcretest -q -dfa $testdata/testinput8 testtry - if [ $? 
= 0 ] ; then - $cf $testdata/testoutput8 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt -dfa $testdata/testinput8 testtry + if [ $? = 0 ] ; then + $cf $testdata/testoutput8 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi if [ $do9 = yes ] ; then echo "Test 9: DFA matching with Unicode properties" - $valgrind ./pcretest -q -dfa $testdata/testinput9 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput9 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt -dfa $testdata/testinput9 testtry + if [ $? = 0 ] ; then + $cf $testdata/testoutput9 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi # Test of internal offsets and code sizes. This test is run only when there @@ -290,39 +317,45 @@ fi if [ $do10 = yes ] ; then echo "Test 10: Internal offsets and code size tests" - $valgrind ./pcretest -q $testdata/testinput10 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput10 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt $testdata/testinput10 testtry + if [ $? = 0 ] ; then + $cf $testdata/testoutput10 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi # Test of Perl >= 5.10 features if [ $do11 = yes ] ; then echo "Test 11: Features from Perl >= 5.10" - $valgrind ./pcretest -q $testdata/testinput11 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput11 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt $testdata/testinput11 testtry + if [ $? 
= 0 ] ; then + $cf $testdata/testoutput11 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi # Test non-Perl-compatible Unicode property support if [ $do12 = yes ] ; then echo "Test 12: API, internals, and non-Perl stuff for Unicode property support" - $valgrind ./pcretest -q $testdata/testinput12 testtry - if [ $? = 0 ] ; then - $cf $testdata/testoutput12 testtry - if [ $? != 0 ] ; then exit 1; fi - else exit 1 - fi - echo "OK" + for opt in "" "-s"; do + $valgrind ./pcretest -q $opt $testdata/testinput12 testtry + if [ $? = 0 ] ; then + $cf $testdata/testoutput12 testtry + if [ $? != 0 ] ; then exit 1; fi + else exit 1 + fi + if [ "$opt" = "-s" ] ; then echo "OK with study" ; else echo "OK"; fi + done fi # End diff --git a/doc/pcretest.1 b/doc/pcretest.1 index 924750c..ffea3fd 100644 --- a/doc/pcretest.1 +++ b/doc/pcretest.1 @@ -4,7 +4,7 @@ pcretest - a program for testing Perl-compatible regular expressions. .SH SYNOPSIS .rs .sp -.B pcretest "[options] [source] [destination]" +.B pcretest "[options] [input file [output file]]" .sp \fBpcretest\fP was written as a test program for the PCRE regular expression library itself, but it can also be used for experimenting with regular @@ -18,14 +18,17 @@ options, see the .\" HREF \fBpcreapi\fP .\" -documentation. +documentation. The input for \fBpcretest\fP is a sequence of regular expression +patterns and strings to be matched, as described below. The output shows the +result of each match. Options on the command line and the patterns control PCRE +options and exactly what is output. . . -.SH OPTIONS +.SH COMMAND LINE OPTIONS .rs .TP 10 \fB-b\fP -Behave as if each regex has the \fB/B\fP (show byte code) modifier; the +Behave as if each pattern has the \fB/B\fP (show byte code) modifier; the internal form is output after compilation. 
.TP 10 \fB-C\fP @@ -33,7 +36,7 @@ Output the version number of the PCRE library, and all available information about the optional features that are included, and then exit. .TP 10 \fB-d\fP -Behave as if each regex has the \fB/D\fP (debug) modifier; the internal +Behave as if each pattern has the \fB/D\fP (debug) modifier; the internal form and information about the compiled pattern is output after compilation; \fB-d\fP is equivalent to \fB-b -i\fP. .TP 10 @@ -46,7 +49,7 @@ standard \fBpcre_exec()\fP function (more detail is given below). Output a brief summary these options and then exit. .TP 10 \fB-i\fP -Behave as if each regex has the \fB/I\fP modifier; information about the +Behave as if each pattern has the \fB/I\fP modifier; information about the compiled pattern is given after compilation. .TP 10 \fB-M\fP @@ -67,7 +70,7 @@ changed for individual matching calls by including \eO in the data line (see below). .TP 10 \fB-p\fP -Behave as if each regex has the \fB/P\fP modifier; the POSIX wrapper API is +Behave as if each pattern has the \fB/P\fP modifier; the POSIX wrapper API is used to call PCRE. None of the other options has any effect when \fB-p\fP is set. .TP 10 @@ -79,8 +82,21 @@ On Unix-like systems, set the size of the run-time stack to \fIsize\fP megabytes. .TP 10 \fB-s\fP -Behave as if each regex has the \fB/S\fP modifier; in other words, force each -regex to be studied. +Behave as if each pattern has the \fB/S\fP modifier; in other words, force each +pattern to be studied. If the \fB/I\fP or \fB/D\fP option is present on a +pattern (requesting output about the compiled pattern), information about the +result of studying is not included when studying is caused only by \fB-s\fP and +neither \fB-i\fP nor \fB-d\fP is present on the command line. This behaviour +means that the output from tests that are run with and without \fB-s\fP should +be identical, except when options that output information about the actual +running of a match are set. 
The \fB-M\fP, \fB-t\fP, and \fB-tm\fP options, +which give information about resources used, are likely to produce different +output with and without \fB-s\fP. Output may also differ if the \fB/C\fP option +is present on an individual pattern. This uses callouts to trace the the +matching process, and this may be different between studied and non-studied +patterns. If the pattern contains (*MARK) items there may also be differences, +for the same reason. The \fB-s\fP command line option can be overridden for +specific patterns that should never be studied (see the /S option below). .TP 10 \fB-t\fP Run each compile, study, and match many times with a timer, and output @@ -193,10 +209,10 @@ options that do not correspond to anything in Perl: \fB/<bsr_unicode>\fP PCRE_BSR_UNICODE .sp The modifiers that are enclosed in angle brackets are literal strings as shown, -including the angle brackets, but the letters can be in either case. This -example sets multiline matching with CRLF as the line ending sequence: +including the angle brackets, but the letters within can be in either case. +This example sets multiline matching with CRLF as the line ending sequence: .sp - /^abc/m<crlf> + /^abc/m<CRLF> .sp As well as turning on the PCRE_UTF8 option, the \fB/8\fP modifier also causes any non-printing characters in output strings to be printed using the @@ -290,9 +306,13 @@ which it appears. The \fB/M\fP modifier causes the size of memory block used to hold the compiled pattern to be output. .P -The \fB/S\fP modifier causes \fBpcre_study()\fP to be called after the -expression has been compiled, and the results used when the expression is -matched. +If the \fB/S\fP modifier appears once, it causes \fBpcre_study()\fP to be +called after the expression has been compiled, and the results used when the +expression is matched. If \fB/S\fP appears twice, it suppresses studying, even +if it was requested externally by the \fB-s\fP command line option. 
This makes +it possible to specify that certain patterns are always studied, and others are +never studied, independently of \fB-s\fP. This feature is used in the test +files in a few cases where the output is different when the pattern is studied. .P The \fB/T\fP modifier must be followed by a single digit. It causes a specific set of built-in character tables to be passed to \fBpcre_compile()\fP. It is @@ -746,7 +766,7 @@ characters. For example: .sp re> </some/file - Compiled regex loaded from /some/file + Compiled pattern loaded from /some/file No study data .sp When the pattern has been loaded, \fBpcretest\fP proceeds to read data lines in @@ -792,6 +812,6 @@ Cambridge CB2 3QH, England. .rs .sp .nf -Last updated: 06 June 2011 +Last updated: 02 July 2011 Copyright (c) 1997-2011 University of Cambridge. .fi diff --git a/pcre_internal.h b/pcre_internal.h index ae3e6a4..586df5d 100644 --- a/pcre_internal.h +++ b/pcre_internal.h @@ -595,10 +595,10 @@ compatibility. */ #define PCRE_JCHANGED 0x0010 /* j option used in regex */ #define PCRE_HASCRORLF 0x0020 /* explicit \r or \n in pattern */ -/* Options for the "extra" block produced by pcre_study(). */ +/* Flags for the "extra" block produced by pcre_study(). */ -#define PCRE_STUDY_MAPPED 0x01 /* a map of starting chars exists */ -#define PCRE_STUDY_MINLEN 0x02 /* a minimum length field exists */ +#define PCRE_STUDY_MAPPED 0x0001 /* a map of starting chars exists */ +#define PCRE_STUDY_MINLEN 0x0002 /* a minimum length field exists */ /* Masks for identifying the public options that are permitted at compile time, run time, or study time, respectively. */ diff --git a/pcre_study.c b/pcre_study.c index ac0dc46..5869f86 100644 --- a/pcre_study.c +++ b/pcre_study.c @@ -66,9 +66,10 @@ string of that length that matches. In UTF8 mode, the result is in characters rather than bytes. 
Arguments: - code pointer to start of group (the bracket) - startcode pointer to start of the whole pattern - options the compiling options + code pointer to start of group (the bracket) + startcode pointer to start of the whole pattern + options the compiling options + had_accept pointer to flag for (*ACCEPT) encountered Returns: the minimum length -1 if \C was encountered @@ -77,7 +78,8 @@ Returns: the minimum length */ static int -find_minlength(const uschar *code, const uschar *startcode, int options) +find_minlength(const uschar *code, const uschar *startcode, int options, + BOOL *had_accept_ptr) { int length = -1; BOOL utf8 = (options & PCRE_UTF8) != 0; @@ -125,17 +127,23 @@ for (;;) case OP_BRAPOS: case OP_SBRAPOS: case OP_ONCE: - d = find_minlength(cc, startcode, options); + d = find_minlength(cc, startcode, options, had_accept_ptr); if (d < 0) return d; branchlength += d; + if (*had_accept_ptr) return branchlength; do cc += GET(cc, 1); while (*cc == OP_ALT); cc += 1 + LINK_SIZE; break; /* Reached end of a branch; if it's a ket it is the end of a nested - call. If it's ALT it is an alternation in a nested call. If it is - END it's the end of the outer call. All can be handled by the same code. */ - + call. If it's ALT it is an alternation in a nested call. If it is END it's + the end of the outer call. All can be handled by the same code. If it is + ACCEPT, it is essentially the same as END, but we set a flag so that + counting stops. 
*/ + + case OP_ACCEPT: + *had_accept_ptr = TRUE; + /* Fall through */ case OP_ALT: case OP_KET: case OP_KETRMAX: @@ -144,7 +152,7 @@ for (;;) case OP_END: if (length < 0 || (!had_recurse && branchlength < length)) length = branchlength; - if (*cc != OP_ALT) return length; + if (op != OP_ALT) return length; cc += 1 + LINK_SIZE; branchlength = 0; had_recurse = FALSE; @@ -367,7 +375,11 @@ for (;;) d = 0; had_recurse = TRUE; } - else d = find_minlength(cs, startcode, options); + else + { + d = find_minlength(cs, startcode, options, had_accept_ptr); + *had_accept_ptr = FALSE; + } } else d = 0; cc += 3; @@ -411,7 +423,10 @@ for (;;) if (cc > cs && cc < ce) had_recurse = TRUE; else - branchlength += find_minlength(cs, startcode, options); + { + branchlength += find_minlength(cs, startcode, options, had_accept_ptr); + *had_accept_ptr = FALSE; + } cc += 1 + LINK_SIZE; break; @@ -479,10 +494,9 @@ for (;;) case OP_THEN_ARG: cc += _pcre_OP_lengths[op] + cc[1+LINK_SIZE]; break; - + /* The remaining opcodes are just skipped over. */ - case OP_ACCEPT: case OP_CLOSE: case OP_COMMIT: case OP_FAIL: @@ -688,6 +702,7 @@ do while (try_next) /* Loop for items in this branch */ { int rc; + switch(*tcode) { /* If we reach something we don't understand, it means a new opcode has @@ -1200,6 +1215,7 @@ pcre_study(const pcre *external_re, int options, const char **errorptr) { int min; BOOL bits_set = FALSE; +BOOL had_accept = FALSE; uschar start_bits[32]; pcre_extra *extra; pcre_study_data *study; @@ -1257,7 +1273,7 @@ if ((re->options & PCRE_ANCHORED) == 0 && /* Find the minimum length of subject string. 
*/ -switch(min = find_minlength(code, code, re->options)) +switch(min = find_minlength(code, code, re->options, &had_accept)) { case -2: *errorptr = "internal error: missing capturing bracket"; break; case -3: *errorptr = "internal error: opcode not recognized"; break; @@ -1436,6 +1436,7 @@ while (!done) size_t size, regex_gotten_store; int do_mark = 0; int do_study = 0; + int no_force_study = 0; int do_debug = debug; int do_G = 0; int do_g = 0; @@ -1502,7 +1503,7 @@ while (!done) } } - fprintf(outfile, "Compiled regex%s loaded from %s\n", + fprintf(outfile, "Compiled pattern%s loaded from %s\n", do_flip? " (byte-inverted)" : "", p); /* Need to know if UTF-8 for printing data strings */ @@ -1510,7 +1511,7 @@ while (!done) new_info(re, NULL, PCRE_INFO_OPTIONS, &get_options); use_utf8 = (get_options & PCRE_UTF8) != 0; - /* Now see if there is any following study data */ + /* Now see if there is any following study data. */ if (true_study_size != 0) { @@ -1624,7 +1625,14 @@ while (!done) case 'P': do_posix = 1; break; #endif - case 'S': do_study = 1; break; + case 'S': + if (do_study == 0) do_study = 1; else + { + do_study = 0; + no_force_study = 1; + } + break; + case 'U': options |= PCRE_UNGREEDY; break; case 'W': options |= PCRE_UCP; break; case 'X': options |= PCRE_EXTRA; break; @@ -1808,10 +1816,12 @@ while (!done) true_size = ((real_pcre *)re)->size; regex_gotten_store = gotten_store; - /* If -s or /S was present, study the regexp to generate additional info to - help with the matching. */ + /* If -s or /S was present, study the regex to generate additional info to + help with the matching, unless the pattern has the SS option, which + suppresses the effect of /S (used for a few test patterns where studying is + never sensible). 
*/ - if (do_study || force_study) + if (do_study || (force_study && !no_force_study)) { if (timeit > 0) { @@ -2049,9 +2059,12 @@ while (!done) /* Don't output study size; at present it is in any case a fixed value, but it varies, depending on the computer architecture, and so messes up the test suite. (And with the /F option, it might be - flipped.) */ + flipped.) If study was forced by an external -s, don't show this + information unless -i or -d was also present. This means that, except + when auto-callouts are involved, the output from runs with and without + -s should be identical. */ - if (do_study || force_study) + if (do_study || (force_study && showinfo && !no_force_study)) { if (extra == NULL) fprintf(outfile, "Study returned NULL\n"); @@ -2129,7 +2142,11 @@ while (!done) } else { - fprintf(outfile, "Compiled regex written to %s\n", to_file); + fprintf(outfile, "Compiled pattern written to %s\n", to_file); + + /* If there is study data, write it, but verify the writing only + if the studying was requested by /S, not just by -s. */ + if (extra != NULL) { if (fwrite(extra->study_data, 1, true_study_size, f) < @@ -2139,7 +2156,6 @@ while (!done) strerror(errno)); } else fprintf(outfile, "Study data written to %s\n", to_file); - } } fclose(f); diff --git a/perltest.pl b/perltest.pl index 424de2d..9eaa8ac 100755 --- a/perltest.pl +++ b/perltest.pl @@ -103,6 +103,10 @@ for (;;) $pattern =~ s/W(?=[a-zA-Z]*$)//; + # Remove /S or /SS from a pattern (asks pcretest to study or not to study) + + $pattern =~ s/S(?=[a-zA-Z]*$)//g; + # Check that the pattern is valid eval "\$_ =~ ${pattern}"; diff --git a/testdata/testinput11 b/testdata/testinput11 index 9631eb8..cf02fac 100644 --- a/testdata/testinput11 +++ b/testdata/testinput11 @@ -246,6 +246,7 @@ aaabccc /(A (A|B(*ACCEPT)|C) D)(E)/x + AB ABX AADE ACDE @@ -403,7 +404,10 @@ AC CB -/(*MARK:A)(*SKIP:B)(C|X)/K +/--- Force no study, otherwise mark is not seen. 
The studied version is in + test 2 because it isn't Perl-compatible. ---/ + +/(*MARK:A)(*SKIP:B)(C|X)/KSS C D @@ -435,9 +439,9 @@ with the handling of backtracking verbs. ---/ /A(*:A)A+(*SKIP:A)(B|Z) | AC/xK AAAC -/--- Don't loop! ---/ +/--- Don't loop! Force no study, otherwise mark is not seen. ---/ -/(*:A)A+(*SKIP:A)(B|Z)/K +/(*:A)A+(*SKIP:A)(B|Z)/KSS AAAC /--- This should succeed, as a non-existent skip name disables the skip ---/ diff --git a/testdata/testinput2 b/testdata/testinput2 index f0a32ac..d97050f 100644 --- a/testdata/testinput2 +++ b/testdata/testinput2 @@ -1061,7 +1061,12 @@ /abc(?C)de(?C1)f/I 123abcdef -/(?C1)\dabc(?C2)def/I +/(?C1)\dabc(?C2)def/IS + 1234abcdef + *** Failers + abcdef + +/(?C1)\dabc(?C2)def/ISS 1234abcdef *** Failers abcdef @@ -1310,7 +1315,12 @@ abcde abcdfe -/a*b/ICDZ +/a*b/ICDZS + ab + aaaab + aaaacb + +/a*b/ICDZSS ab aaaab aaaacb @@ -1320,9 +1330,16 @@ aaaab aaaacb -/(abc|def)x/ICDZ +/(abc|def)x/ICDZS abcx defx + ** Failers + abcdefzx + +/(abc|def)x/ICDZSS + abcx + defx + ** Failers abcdefzx /(ab|cd){3,4}/IC @@ -1330,7 +1347,10 @@ abcdabcd abcdcdcdcdcd -/([ab]{,4}c|xy)/ICDZ +/([ab]{,4}c|xy)/ICDZS + Note: that { does NOT introduce a quantifier + +/([ab]{,4}c|xy)/ICDZSS Note: that { does NOT introduce a quantifier /([ab]{1,4}c|xy){4,5}?123/ICDZ @@ -1404,13 +1424,25 @@ 1X 123456\P -/abc/I>testsavedregex +/abc/IS>testsavedregex +<testsavedregex + abc + ** Failers + bca + +/abc/ISS>testsavedregex +<testsavedregex + abc + ** Failers + bca + +/abc/IFS>testsavedregex <testsavedregex abc ** Failers bca -/abc/IF>testsavedregex +/abc/IFSS>testsavedregex <testsavedregex abc ** Failers @@ -1422,12 +1454,24 @@ ** Failers def +/(a|b)/ISS>testsavedregex +<testsavedregex + abc + ** Failers + def + /(a|b)/ISF>testsavedregex <testsavedregex abc ** Failers def +/(a|b)/ISSF>testsavedregex +<testsavedregex + abc + ** Failers + def + ~<(\w+)/?>(.)*</(\1)>~smgI <!DOCTYPE seite SYSTEM 
"http://www.lco.lineas.de/xmlCms.dtd">\n<seite>\n<dokumenteninformation>\n<seitentitel>Partner der LCO</seitentitel>\n<sprache>de</sprache>\n<seitenbeschreibung>Partner der LINEAS Consulting\nGmbH</seitenbeschreibung>\n<schluesselworte>LINEAS Consulting GmbH Hamburg\nPartnerfirmen</schluesselworte>\n<revisit>30 days</revisit>\n<robots>index,follow</robots>\n<menueinformation>\n<aktiv>ja</aktiv>\n<menueposition>3</menueposition>\n<menuetext>Partner</menuetext>\n</menueinformation>\n<lastedited>\n<autor>LCO</autor>\n<firma>LINEAS Consulting</firma>\n<datum>15.10.2003</datum>\n</lastedited>\n</dokumenteninformation>\n<inhalt>\n\n<absatzueberschrift>Die Partnerfirmen der LINEAS Consulting\nGmbH</absatzueberschrift>\n\n<absatz><link ziel="http://www.ca.com/" zielfenster="_blank">\n<bild name="logo_ca.gif" rahmen="no"/></link> <link\nziel="http://www.ey.com/" zielfenster="_blank"><bild\nname="logo_euy.gif" rahmen="no"/></link>\n</absatz>\n\n<absatz><link ziel="http://www.cisco.de/" zielfenster="_blank">\n<bild name="logo_cisco.gif" rahmen="ja"/></link></absatz>\n\n<absatz><link ziel="http://www.atelion.de/"\nzielfenster="_blank"><bild\nname="logo_atelion.gif" rahmen="no"/></link>\n</absatz>\n\n<absatz><link ziel="http://www.line-information.de/"\nzielfenster="_blank">\n<bild name="logo_line_information.gif" rahmen="no"/></link>\n</absatz>\n\n<absatz><bild name="logo_aw.gif" rahmen="no"/></absatz>\n\n<absatz><link ziel="http://www.incognis.de/"\nzielfenster="_blank"><bild\nname="logo_incognis.gif" rahmen="no"/></link></absatz>\n\n<absatz><link ziel="http://www.addcraft.com/"\nzielfenster="_blank"><bild\nname="logo_addcraft.gif" rahmen="no"/></link></absatz>\n\n<absatz><link ziel="http://www.comendo.com/"\nzielfenster="_blank"><bild\nname="logo_comendo.gif" rahmen="no"/></link></absatz>\n\n</inhalt>\n</seite> @@ -3312,11 +3356,19 @@ name were given. 
---/ /A(*PRUNE:A)B/K ACAB -/(*MARK:A)(*PRUNE:B)(C|X)/K +/(*MARK:A)(*PRUNE:B)(C|X)/KS C D -/(*MARK:A)(*THEN:B)(C|X)/K +/(*MARK:A)(*PRUNE:B)(C|X)/KSS + C + D + +/(*MARK:A)(*THEN:B)(C|X)/KS + C + D + +/(*MARK:A)(*THEN:B)(C|X)/KSS C D @@ -3681,4 +3733,16 @@ with \Y. ---/ /-- --/ +/-- These studied versions are here because they are not Perl-compatible; the + studying means the mark is not seen. --/ + +/(*MARK:A)(*SKIP:B)(C|X)/KS + C + D + +/(*:A)A+(*SKIP:A)(B|Z)/KS + AAAC + +/-- --/ + /-- End of testinput2 --/ diff --git a/testdata/testinput5 b/testdata/testinput5 index 6aeaa4d..62ae695 100644 --- a/testdata/testinput5 +++ b/testdata/testinput5 @@ -198,7 +198,7 @@ correctly, but that messes up comparisons). --/ /ÃÃÃxxx/8 -/ÃÃÃxxx/8?DZ +/ÃÃÃxxx/8?DZSS /abc/8 Ã] diff --git a/testdata/testinput7 b/testdata/testinput7 index 758267e..04a1829 100644 --- a/testdata/testinput7 +++ b/testdata/testinput7 @@ -3973,13 +3973,13 @@ ac bbbbc -/abc/>testsavedregex +/abc/SS>testsavedregex <testsavedregex abc *** Failers bca -/abc/F>testsavedregex +/abc/FSS>testsavedregex <testsavedregex abc *** Failers diff --git a/testdata/testoutput11 b/testdata/testoutput11 index 425dfcf..7942e15 100644 --- a/testdata/testoutput11 +++ b/testdata/testoutput11 @@ -501,6 +501,10 @@ No match No match /(A (A|B(*ACCEPT)|C) D)(E)/x + AB + 0: AB + 1: AB + 2: B ABX 0: AB 1: AB @@ -821,7 +825,10 @@ No match, mark = A CB No match, mark = B -/(*MARK:A)(*SKIP:B)(C|X)/K +/--- Force no study, otherwise mark is not seen. The studied version is in + test 2 because it isn't Perl-compatible. ---/ + +/(*MARK:A)(*SKIP:B)(C|X)/KSS C 0: C 1: C @@ -864,9 +871,9 @@ with the handling of backtracking verbs. ---/ AAAC 0: AC -/--- Don't loop! ---/ +/--- Don't loop! Force no study, otherwise mark is not seen. 
---/ -/(*:A)A+(*SKIP:A)(B|Z)/K +/(*:A)A+(*SKIP:A)(B|Z)/KSS AAAC No match, mark = A diff --git a/testdata/testoutput2 b/testdata/testoutput2 index 002a11f..fd81bbe 100644 --- a/testdata/testoutput2 +++ b/testdata/testoutput2 @@ -3580,7 +3580,27 @@ Need char = 'f' 1 ^ ^ f 0: abcdef -/(?C1)\dabc(?C2)def/I +/(?C1)\dabc(?C2)def/IS +Capturing subpattern count = 0 +No options +No first char +Need char = 'f' +Subject length lower bound = 7 +Starting byte set: 0 1 2 3 4 5 6 7 8 9 + 1234abcdef +--->1234abcdef + 1 ^ \d + 1 ^ \d + 1 ^ \d + 1 ^ \d + 2 ^ ^ d + 0: 4abcdef + *** Failers +No match + abcdef +No match + +/(?C1)\dabc(?C2)def/ISS Capturing subpattern count = 0 No options No first char @@ -4778,7 +4798,51 @@ Need char = 'e' +4 ^ ^ e No match -/a*b/ICDZ +/a*b/ICDZS +------------------------------------------------------------------ + Bra + Callout 255 0 2 + a*+ + Callout 255 2 1 + b + Callout 255 3 0 + Ket + End +------------------------------------------------------------------ +Capturing subpattern count = 0 +Options: +No first char +Need char = 'b' +Subject length lower bound = 1 +Starting byte set: a b + ab +--->ab + +0 ^ a* + +2 ^^ b + +3 ^ ^ + 0: ab + aaaab +--->aaaab + +0 ^ a* + +2 ^ ^ b + +3 ^ ^ + 0: aaaab + aaaacb +--->aaaacb + +0 ^ a* + +2 ^ ^ b + +0 ^ a* + +2 ^ ^ b + +0 ^ a* + +2 ^ ^ b + +0 ^ a* + +2 ^^ b + +0 ^ a* + +2 ^ b + +3 ^^ + 0: b + +/a*b/ICDZSS ------------------------------------------------------------------ Bra Callout 255 0 2 @@ -4861,7 +4925,83 @@ Need char = 'b' +2 ^^ b No match -/(abc|def)x/ICDZ +/(abc|def)x/ICDZS +------------------------------------------------------------------ + Bra + Callout 255 0 9 + CBra 1 + Callout 255 1 1 + a + Callout 255 2 1 + b + Callout 255 3 1 + c + Callout 255 4 0 + Alt + Callout 255 5 1 + d + Callout 255 6 1 + e + Callout 255 7 1 + f + Callout 255 8 0 + Ket + Callout 255 9 1 + x + Callout 255 10 0 + Ket + End +------------------------------------------------------------------ +Capturing subpattern count = 1 
+Options:
+No first char
+Need char = 'x'
+Subject length lower bound = 4
+Starting byte set: a d
+ abcx
+--->abcx
+ +0 ^ (abc|def)
+ +1 ^ a
+ +2 ^^ b
+ +3 ^ ^ c
+ +4 ^ ^ |
+ +9 ^ ^ x
++10 ^ ^
+ 0: abcx
+ 1: abc
+ defx
+--->defx
+ +0 ^ (abc|def)
+ +1 ^ a
+ +5 ^ d
+ +6 ^^ e
+ +7 ^ ^ f
+ +8 ^ ^ )
+ +9 ^ ^ x
++10 ^ ^
+ 0: defx
+ 1: def
+ ** Failers
+No match
+ abcdefzx
+--->abcdefzx
+ +0 ^ (abc|def)
+ +1 ^ a
+ +2 ^^ b
+ +3 ^ ^ c
+ +4 ^ ^ |
+ +9 ^ ^ x
+ +5 ^ d
+ +0 ^ (abc|def)
+ +1 ^ a
+ +5 ^ d
+ +6 ^^ e
+ +7 ^ ^ f
+ +8 ^ ^ )
+ +9 ^ ^ x
+No match
+
+/(abc|def)x/ICDZSS
 ------------------------------------------------------------------
  Bra
  Callout 255 0 9
@@ -4915,6 +5055,8 @@ Need char = 'x'
 +10 ^ ^
  0: defx
  1: def
+ ** Failers
+No match
  abcdefzx
 --->abcdefzx
  +0 ^ (abc|def)
@@ -5015,7 +5157,58 @@ No need char
  0: abcdcdcd
  1: cd
 
-/([ab]{,4}c|xy)/ICDZ
+/([ab]{,4}c|xy)/ICDZS
+------------------------------------------------------------------
+ Bra
+ Callout 255 0 14
+ CBra 1
+ Callout 255 1 4
+ [ab]
+ Callout 255 5 1
+ {
+ Callout 255 6 1
+ ,
+ Callout 255 7 1
+ 4
+ Callout 255 8 1
+ }
+ Callout 255 9 1
+ c
+ Callout 255 10 0
+ Alt
+ Callout 255 11 1
+ x
+ Callout 255 12 1
+ y
+ Callout 255 13 0
+ Ket
+ Callout 255 14 0
+ Ket
+ End
+------------------------------------------------------------------
+Capturing subpattern count = 1
+Options:
+No first char
+No need char
+Subject length lower bound = 2
+Starting byte set: a b x
+ Note: that { does NOT introduce a quantifier
+--->Note: that { does NOT introduce a quantifier
+ +0 ^ ([ab]{,4}c|xy)
+ +1 ^ [ab]
+ +5 ^^ {
++11 ^ x
+ +0 ^ ([ab]{,4}c|xy)
+ +1 ^ [ab]
+ +5 ^^ {
++11 ^ x
+ +0 ^ ([ab]{,4}c|xy)
+ +1 ^ [ab]
+ +5 ^^ {
++11 ^ x
+No match
+
+/([ab]{,4}c|xy)/ICDZSS
 ------------------------------------------------------------------
  Bra
  Callout 255 0 14
@@ -5467,14 +5660,33 @@ No match
  123456\P
 No match
 
-/abc/I>testsavedregex
+/abc/IS>testsavedregex
+Capturing subpattern count = 0
+No options
+First char = 'a'
+Need char = 'c'
+Subject length lower bound = 3
+No set of starting bytes
+Compiled pattern written to testsavedregex
+Study data written to testsavedregex
+<testsavedregex
+Compiled pattern loaded from testsavedregex
+Study data loaded from testsavedregex
+ abc
+ 0: abc
+ ** Failers
+No match
+ bca
+No match
+
+/abc/ISS>testsavedregex
 Capturing subpattern count = 0
 No options
 First char = 'a'
 Need char = 'c'
-Compiled regex written to testsavedregex
+Compiled pattern written to testsavedregex
 <testsavedregex
-Compiled regex loaded from testsavedregex
+Compiled pattern loaded from testsavedregex
 No study data
  abc
  0: abc
@@ -5483,14 +5695,33 @@ No match
  bca
 No match
 
-/abc/IF>testsavedregex
+/abc/IFS>testsavedregex
 Capturing subpattern count = 0
 No options
 First char = 'a'
 Need char = 'c'
-Compiled regex written to testsavedregex
+Subject length lower bound = 3
+No set of starting bytes
+Compiled pattern written to testsavedregex
+Study data written to testsavedregex
 <testsavedregex
-Compiled regex (byte-inverted) loaded from testsavedregex
+Compiled pattern (byte-inverted) loaded from testsavedregex
+Study data loaded from testsavedregex
+ abc
+ 0: abc
+ ** Failers
+No match
+ bca
+No match
+
+/abc/IFSS>testsavedregex
+Capturing subpattern count = 0
+No options
+First char = 'a'
+Need char = 'c'
+Compiled pattern written to testsavedregex
+<testsavedregex
+Compiled pattern (byte-inverted) loaded from testsavedregex
 No study data
  abc
  0: abc
@@ -5506,10 +5737,10 @@ No first char
 No need char
 Subject length lower bound = 1
 Starting byte set: a b
-Compiled regex written to testsavedregex
+Compiled pattern written to testsavedregex
 Study data written to testsavedregex
 <testsavedregex
-Compiled regex loaded from testsavedregex
+Compiled pattern loaded from testsavedregex
 Study data loaded from testsavedregex
  abc
  0: a
@@ -5520,6 +5751,24 @@ Study data loaded from testsavedregex
  def
 No match
 
+/(a|b)/ISS>testsavedregex
+Capturing subpattern count = 1
+No options
+No first char
+No need char
+Compiled pattern written to testsavedregex
+<testsavedregex
+Compiled pattern loaded from testsavedregex
+No study data
+ abc
+ 0: a
+ 1: a
+ ** Failers
+ 0: a
+ 1: a
+ def
+No match
+
 /(a|b)/ISF>testsavedregex
 Capturing subpattern count = 1
 No options
@@ -5527,10 +5776,10 @@ No first char
 No need char
 Subject length lower bound = 1
 Starting byte set: a b
-Compiled regex written to testsavedregex
+Compiled pattern written to testsavedregex
 Study data written to testsavedregex
 <testsavedregex
-Compiled regex (byte-inverted) loaded from testsavedregex
+Compiled pattern (byte-inverted) loaded from testsavedregex
 Study data loaded from testsavedregex
  abc
  0: a
@@ -5541,6 +5790,24 @@ Study data loaded from testsavedregex
  def
 No match
 
+/(a|b)/ISSF>testsavedregex
+Capturing subpattern count = 1
+No options
+No first char
+No need char
+Compiled pattern written to testsavedregex
+<testsavedregex
+Compiled pattern (byte-inverted) loaded from testsavedregex
+No study data
+ abc
+ 0: a
+ 1: a
+ ** Failers
+ 0: a
+ 1: a
+ def
+No match
+
 ~<(\w+)/?>(.)*</(\1)>~smgI
 Capturing subpattern count = 3
 Max back reference = 1
@@ -10805,7 +11072,15 @@ name were given. ---/
  ACAB
  0: AB
 
-/(*MARK:A)(*PRUNE:B)(C|X)/K
+/(*MARK:A)(*PRUNE:B)(C|X)/KS
+ C
+ 0: C
+ 1: C
+MK: A
+ D
+No match
+
+/(*MARK:A)(*PRUNE:B)(C|X)/KSS
  C
  0: C
  1: C
@@ -10813,7 +11088,15 @@ MK: A
  D
 No match, mark = B
 
-/(*MARK:A)(*THEN:B)(C|X)/K
+/(*MARK:A)(*THEN:B)(C|X)/KS
+ C
+ 0: C
+ 1: C
+MK: A
+ D
+No match
+
+/(*MARK:A)(*THEN:B)(C|X)/KSS
  C
  0: C
  1: C
@@ -11577,4 +11860,21 @@ No match
 
 /-- --/
 
+/-- These studied versions are here because they are not Perl-compatible; the
+ studying means the mark is not seen. --/
+
+/(*MARK:A)(*SKIP:B)(C|X)/KS
+ C
+ 0: C
+ 1: C
+MK: A
+ D
+No match
+
+/(*:A)A+(*SKIP:A)(B|Z)/KS
+ AAAC
+No match
+
+/-- --/
+
 /-- End of testinput2 --/
diff --git a/testdata/testoutput5 b/testdata/testoutput5
index 129dbc7..9b18300 100644
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@@ -802,7 +802,7 @@ Failed: invalid UTF-8 string at offset 0
 /ÃÃÃxxx/8
 Failed: invalid UTF-8 string at offset 0
 
-/ÃÃÃxxx/8?DZ
+/ÃÃÃxxx/8?DZSS
 ------------------------------------------------------------------
  Bra
  \X{c0}\X{c0}\X{c0}xxx
@@ -2184,7 +2184,7 @@ Capturing subpattern count = 0
 No options
 No first char
 No need char
-Subject length lower bound = 2
+Subject length lower bound = 1
 Starting byte set: \x0a \x0b \x0c \x0d \x85
 
 /\R/SI8
@@ -2192,7 +2192,7 @@ Capturing subpattern count = 0
 Options: utf8
 No first char
 No need char
-Subject length lower bound = 2
+Subject length lower bound = 1
 Starting byte set: \x0a \x0b \x0c \x0d \xc2 \xe2
 
 /\h*A/SI8
diff --git a/testdata/testoutput7 b/testdata/testoutput7
index ce63a28..45b447a 100644
--- a/testdata/testoutput7
+++ b/testdata/testoutput7
@@ -1011,10 +1011,10 @@ Partial match: efabbbbbbbbbbbbbbbb
  0: bbbbbbbbbbbbcdX
 
 /(a|b)/SF>testsavedregex
-Compiled regex written to testsavedregex
+Compiled pattern written to testsavedregex
 Study data written to testsavedregex
 <testsavedregex
-Compiled regex (byte-inverted) loaded from testsavedregex
+Compiled pattern (byte-inverted) loaded from testsavedregex
 Study data loaded from testsavedregex
  abc
  0: a
@@ -6439,10 +6439,10 @@ Error -17 (backreference condition or recursion test not supported for DFA match
  bbbbc
  0: c
 
-/abc/>testsavedregex
-Compiled regex written to testsavedregex
+/abc/SS>testsavedregex
+Compiled pattern written to testsavedregex
 <testsavedregex
-Compiled regex loaded from testsavedregex
+Compiled pattern loaded from testsavedregex
 No study data
  abc
  0: abc
@@ -6451,10 +6451,10 @@ No match
  bca
 No match
 
-/abc/F>testsavedregex
-Compiled regex written to testsavedregex
+/abc/FSS>testsavedregex
+Compiled pattern written to testsavedregex
 <testsavedregex
-Compiled regex (byte-inverted) loaded from testsavedregex
+Compiled pattern (byte-inverted) loaded from testsavedregex
 No study data
  abc
  0: abc
@@ -6464,10 +6464,10 @@ No match
 No match
 
 /(a|b)/S>testsavedregex
-Compiled regex written to testsavedregex
+Compiled pattern written to testsavedregex
 Study data written to testsavedregex
 <testsavedregex
-Compiled regex loaded from testsavedregex
+Compiled pattern loaded from testsavedregex
 Study data loaded from testsavedregex
  abc
  0: a
@@ -6477,10 +6477,10 @@ Study data loaded from testsavedregex
 No match
 
 /(a|b)/SF>testsavedregex
-Compiled regex written to testsavedregex
+Compiled pattern written to testsavedregex
 Study data written to testsavedregex
 <testsavedregex
-Compiled regex (byte-inverted) loaded from testsavedregex
+Compiled pattern (byte-inverted) loaded from testsavedregex
 Study data loaded from testsavedregex
  abc
  0: a |