Further updates to partial matching.

git-svn-id: svn://vcs.exim.org/pcre/code/trunk@435 2f5784b3-3f2a-0410-8824-cb99058d5e15
author: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-09-05 10:20:44 +0000
committer: ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-09-05 10:20:44 +0000
commit: 426d9955dc06c708b9f8e0673704ece1b2527fd9 (patch)
tree: 78b0d6679bb6f8a266e3b52a4bbda3adab5d2aab
parent: 92f006ccd3ea998dec9c9f8be7a9801103f87eb5 (diff)
13 files changed, 221 insertions, 95 deletions
diff --git a/ChangeLog b/ChangeLog
index fd44cba..403f9fc 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -51,8 +51,9 @@ Version 8.00 ??-???-??
 9.  The restrictions on what a pattern can contain when partial matching is
     requested for pcre_exec() have been removed. All patterns can now be 
     partially matched by this function. In addition, if there are at least two
-    slots in the offset vector, the offsets of the first-encountered partial
-    match are set in them when PCRE_ERROR_PARTIAL is returned.
+    slots in the offset vector, the offset of the earliest inspected character
+    for the match and the offset of the end of the subject are set in them when
+    PCRE_ERROR_PARTIAL is returned. 
     
 10. Partial matching has been split into two forms: PCRE_PARTIAL_SOFT, which is
     synonymous with PCRE_PARTIAL, for backwards compatibility, and
@@ -73,22 +74,29 @@ Version 8.00 ??-???-??
     earlier partial match, unless partial matching was again requested. For
     example, with the pattern /dog.(body)?/, the "must contain" character is
     "g". If the first part-match was for the string "dog", restarting with
-    "sbody" failed.
+    "sbody" failed. This bug has been fixed.
     
-13. Added a pcredemo man page, created automatically from the pcredemo.c file,
+13. The string returned by pcre_dfa_exec() after a partial match has been 
+    changed so that it starts at the first inspected character rather than the 
+    first character of the match. This makes a difference only if the pattern 
+    starts with a lookbehind assertion or \b or \B (\K is not supported by 
+    pcre_dfa_exec()). It's an incompatible change, but it makes the two 
+    matching functions compatible, and I think it's the right thing to do.
+    
+14. Added a pcredemo man page, created automatically from the pcredemo.c file,
     so that the demonstration program is easily available in environments where 
     PCRE has not been installed from source.  
     
-14. Arranged to add -DPCRE_STATIC to cflags in libpcre.pc, libpcreposix.cp,
+15. Arranged to add -DPCRE_STATIC to cflags in libpcre.pc, libpcreposix.cp,
     libpcrecpp.pc and pcre-config when PCRE is not compiled as a shared
     library.
     
-15. Added REG_UNGREEDY to the pcreposix interface, at the request of a user.
+16. Added REG_UNGREEDY to the pcreposix interface, at the request of a user.
     It maps to PCRE_UNGREEDY. It is not, of course, POSIX-compatible, but it
     is not the first non-POSIX option to be added. Clearly some people find 
     these options useful.
     
-16. If a caller to the POSIX matching function regexec() passes a non-zero 
+17. If a caller to the POSIX matching function regexec() passes a non-zero 
     value for \fInmatch\fP with a NULL value for \fIpmatch\fP, the value of
     \fInmatch\fP is forced to zero. 
     
diff --git a/configure.ac b/configure.ac
index a4bda76..7f567f8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -9,7 +9,7 @@ dnl empty.
 m4_define(pcre_major, [8])
 m4_define(pcre_minor, [00])
 m4_define(pcre_prerelease, [-RC1])
-m4_define(pcre_date, [2009-04-23])
+m4_define(pcre_date, [2009-09-05])
 
 # Libtool shared library interface versions (current:revision:age)
 m4_define(libpcre_version, [0:1:0])
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index 8c4b45d..db3096e 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -149,9 +149,10 @@ documentation describes how to compile and run it.
 A second matching function, \fBpcre_dfa_exec()\fP, which is not
 Perl-compatible, is also provided. This uses a different algorithm for the
 matching. The alternative algorithm finds all possible matches (at a given
-point in the subject), and scans the subject just once. However, this algorithm
-does not return captured substrings. A description of the two matching
-algorithms and their advantages and disadvantages is given in the
+point in the subject), and scans the subject just once (unless there are
+lookbehind assertions). However, this algorithm does not return captured
+substrings. A description of the two matching algorithms and their advantages
+and disadvantages is given in the
 .\" HREF
 \fBpcrematching\fP
 .\"
@@ -1011,10 +1012,10 @@ different for each compiled pattern.
 .sp
   PCRE_INFO_OKPARTIAL
 .sp
-Return 1 if the pattern can be used for partial matching, otherwise 0. The
-fourth argument should point to an \fBint\fP variable. From release 8.00, this
-always returns 1, because the restrictions that previously applied to partial
-matching have been lifted. The
+Return 1 if the pattern can be used for partial matching with
+\fBpcre_exec()\fP, otherwise 0. The fourth argument should point to an
+\fBint\fP variable. From release 8.00, this always returns 1, because the
+restrictions that previously applied to partial matching have been lifted. The
 .\" HREF
 \fBpcrepartial\fP
 .\"
@@ -1387,8 +1388,8 @@ PCRE_PARTIAL_HARD is set, \fBpcre_exec()\fP immediately returns
 PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, matching continues
 by testing any other alternatives. Only if they all fail is PCRE_ERROR_PARTIAL
 returned (instead of PCRE_ERROR_NOMATCH). The portion of the string that
-provided the partial match is set as the first matching string. There is a more
-detailed discussion in the
+was inspected when the partial match was found is set as the first matching
+string. There is a more detailed discussion in the
 .\" HREF
 \fBpcrepartial\fP
 .\"
@@ -1834,8 +1835,8 @@ a compiled pattern, using a matching algorithm that scans the subject string
 just once, and does not backtrack. This has different characteristics to the
 normal algorithm, and is not compatible with Perl. Some of the features of PCRE
 patterns are not supported. Nevertheless, there are times when this kind of
-matching can be useful. For a discussion of the two matching algorithms, see
-the
+matching can be useful. For a discussion of the two matching algorithms, and a 
+list of features that \fBpcre_dfa_exec()\fP does not support, see the
 .\" HREF
 \fBpcrematching\fP
 .\"
@@ -1890,8 +1891,8 @@ additional characters. This happens even if some complete matches have also
 been found. When PCRE_PARTIAL_SOFT is set, the return code PCRE_ERROR_NOMATCH
 is converted into PCRE_ERROR_PARTIAL if the end of the subject is reached,
 there have been no complete matches, but there is still at least one matching
-possibility. The portion of the string that provided the longest partial match
-is set as the first matching string in both cases.
+possibility. The portion of the string that was inspected when the longest
+partial match was found is set as the first matching string in both cases.
 .sp
   PCRE_DFA_SHORTEST
 .sp
@@ -2011,6 +2012,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 01 September 2009
+Last updated: 05 September 2009
 Copyright (c) 1997-2009 University of Cambridge.
 .fi
diff --git a/doc/pcrematching.3 b/doc/pcrematching.3
index 808221d..a3b8363 100644
--- a/doc/pcrematching.3
+++ b/doc/pcrematching.3
@@ -92,6 +92,11 @@ the three strings "cat", "cater", and "caterpillar" that start at the fourth
 character of the subject. The algorithm does not automatically move on to find
 matches that start at later positions.
 .P
+Although the general principle of this matching algorithm is that it scans the 
+subject string only once, without backtracking, there is one exception: when a 
+lookbehind assertion is encountered, the preceding characters have to be
+re-inspected.
+.P
 There are a number of features of PCRE regular expressions that are not
 supported by the alternative matching algorithm. They are as follows:
 .P
@@ -178,6 +183,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 25 August 2009
+Last updated: 05 September 2009
 Copyright (c) 1997-2009 University of Cambridge.
 .fi
diff --git a/doc/pcrepartial.3 b/doc/pcrepartial.3
index 0d76489..0e9cc47 100644
--- a/doc/pcrepartial.3
+++ b/doc/pcrepartial.3
@@ -26,10 +26,10 @@ is very long and is not all available at once.
 .P
 PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
 PCRE_PARTIAL_HARD options, which can be set when calling \fBpcre_exec()\fP or
-\fBpcre_dfa_exec()\fP. For backwards compatibility, PCRE_PARTIAL is a synonym 
-for PCRE_PARTIAL_SOFT. The essential difference between the two options is 
-whether or not a partial match is preferred to an alternative complete match, 
-though the details differ between the two matching functions. If both options 
+\fBpcre_dfa_exec()\fP. For backwards compatibility, PCRE_PARTIAL is a synonym
+for PCRE_PARTIAL_SOFT. The essential difference between the two options is
+whether or not a partial match is preferred to an alternative complete match,
+though the details differ between the two matching functions. If both options
 are set, PCRE_PARTIAL_HARD takes precedence.
 .P
 Setting a partial matching option disables one of PCRE's optimizations. PCRE
@@ -49,69 +49,86 @@ been matched. (In other words, a partial match can never be an empty string.)
 If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
 continues as normal, and other alternatives in the pattern are tried. If no
 complete match can be found, \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL
-instead of PCRE_ERROR_NOMATCH, and if there are at least two slots in the
-offsets vector, they are filled in with the offsets of the longest string that
-partially matched. Consider this pattern:
+instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
+vector, the first of them is set to the offset of the earliest character that
+was inspected when the partial match was found. For convenience, the second
+offset points to the end of the string so that a substring can easily be
+extracted.
+.P
+For the majority of patterns, the first offset identifies the start of the
+partially matched string. However, for patterns that contain lookbehind
+assertions, or \eK, or begin with \eb or \eB, earlier characters have been
+inspected while carrying out the match. For example:
+.sp
+  /(?<=abc)123/
+.sp
+This pattern matches "123", but only if it is preceded by "abc". If the subject
+string is "xyzabc12", the offsets after a partial match are for the substring
+"abc12", because all these characters are needed if another match is tried
+with extra characters added.
+.P
+If there is more than one partial match, the first one that was found provides
+the data that is returned. Consider this pattern:
 .sp
   /123\ew+X|dogY/
 .sp
 If this is matched against the subject string "abc123dog", both
-alternatives fail to match, but the end of the subject is reached during 
+alternatives fail to match, but the end of the subject is reached during
 matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
-offsets are set to 3 and 9, identifying "123dog" as the longest partial match
+offsets are set to 3 and 9, identifying "123dog" as the first partial match
 that was found. (In this example, there are two partial matches, because "dog"
 on its own partially matches the second alternative.)
 .P
-If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns 
+If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns
 PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
 search for possible complete matches. The difference between the two options
 can be illustrated by a pattern such as:
 .sp
   /dog(sbody)?/
 .sp
-This matches either "dog" or "dogsbody", greedily (that is, it prefers the 
+This matches either "dog" or "dogsbody", greedily (that is, it prefers the
 longer string if possible). If it is matched against the string "dog" with
-PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if 
-PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand, 
+PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
+PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
 if the pattern is made ungreedy the result is different:
 .sp
   /dog(sbody)??/
 .sp
-In this case the result is always a complete match because \fBpcre_exec()\fP 
-finds that first, and it never continues after finding a match. It might be 
+In this case the result is always a complete match because \fBpcre_exec()\fP
+finds that first, and it never continues after finding a match. It might be
 easier to follow this explanation by thinking of the two patterns like this:
 .sp
   /dog(sbody)?/    is the same as  /dogsbody|dog/
   /dog(sbody)??/   is the same as  /dog|dogsbody/
 .sp
-The second pattern will never match "dogsbody" when \fBpcre_exec()\fP is 
+The second pattern will never match "dogsbody" when \fBpcre_exec()\fP is
 used, because it will always find the shorter match first.
 .
 .
 .SH "PARTIAL MATCHING USING pcre_dfa_exec()"
 .rs
 .sp
-The \fBpcre_dfa_exec()\fP function moves along the subject string character by 
-character, without backtracking, searching for all possible matches 
-simultaneously. If the end of the subject is reached before the end of the 
+The \fBpcre_dfa_exec()\fP function moves along the subject string character by
+character, without backtracking, searching for all possible matches
+simultaneously. If the end of the subject is reached before the end of the
 pattern, there is the possibility of a partial match, again provided that at
 least one character has matched.
 .P
 When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
 have been no complete matches. Otherwise, the complete matches are returned.
 However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
-complete matches. The portion of the string that provided the longest partial
-match is set as the first matching string, provided there are at least two
-slots in the offsets vector.
+complete matches. The portion of the string that was inspected when the longest
+partial match was found is set as the first matching string, provided there are
+at least two slots in the offsets vector.
 .P
-Because \fBpcre_dfa_exec()\fP always searches for all possible matches, and 
+Because \fBpcre_dfa_exec()\fP always searches for all possible matches, and
 there is no difference between greedy and ungreedy repetition, its behaviour is
-different from \fBpcre_exec\fP when PCRE_PARTIAL_HARD is set. Consider the 
+different from \fBpcre_exec\fP when PCRE_PARTIAL_HARD is set. Consider the
 string "dog" matched against the ungreedy pattern shown above:
 .sp
   /dog(sbody)??/
 .sp
-Whereas \fBpcre_exec()\fP stops as soon as it finds the complete match for 
+Whereas \fBpcre_exec()\fP stops as soon as it finds the complete match for
 "dog", \fBpcre_dfa_exec()\fP also finds the partial match for "dogsbody", and
 so returns that when PCRE_PARTIAL_HARD is set.
 .
@@ -119,21 +136,21 @@ so returns that when PCRE_PARTIAL_HARD is set.
 .SH "PARTIAL MATCHING AND WORD BOUNDARIES"
 .rs
 .sp
-If a pattern ends with one of sequences \ew or \eW, which test for word 
-boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive 
+If a pattern ends with one of sequences \ew or \eW, which test for word
+boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
 results. Consider this pattern:
 .sp
   /\ebcat\eb/
 .sp
 This matches "cat", provided there is a word boundary at either end. If the
 subject string is "the cat", the comparison of the final "t" with a following
-character cannot take place, so a partial match is found. However, 
-\fBpcre_exec()\fP carries on with normal matching, which matches \eb at the end 
-of the subject when the last character is a letter, thus finding a complete 
-match. The result, therefore, is \fInot\fP PCRE_ERROR_PARTIAL. The same thing 
+character cannot take place, so a partial match is found. However,
+\fBpcre_exec()\fP carries on with normal matching, which matches \eb at the end
+of the subject when the last character is a letter, thus finding a complete
+match. The result, therefore, is \fInot\fP PCRE_ERROR_PARTIAL. The same thing
 happens with \fBpcre_dfa_exec()\fP, because it also finds the complete match.
 .P
-Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because 
+Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
 then the partial match takes precedence.
 .
 .
@@ -182,7 +199,7 @@ when \fBpcre_dfa_exec()\fP is used.
 If the escape sequence \eP is present more than once in a \fBpcretest\fP data
 line, the PCRE_PARTIAL_HARD option is set for the match.
 .
-.                                                          
+.
 .SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
 .rs
 .sp
@@ -216,10 +233,10 @@ facility can be used to pass very long subject strings to
 .SH "MULTI-SEGMENT MATCHING WITH pcre_exec()"
 .rs
 .sp
-From release 8.00, \fBpcre_exec()\fP can also be used to do multi-segment 
-matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the 
-previous match with a new segment of data. Instead, new data must be added to 
-the previous subject string, and the entire match re-run, starting from the 
+From release 8.00, \fBpcre_exec()\fP can also be used to do multi-segment
+matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
+previous match with a new segment of data. Instead, new data must be added to
+the previous subject string, and the entire match re-run, starting from the
 point where the partial match occurred. Earlier data can be discarded.
 Consider an unanchored pattern that matches dates:
 .sp
@@ -227,34 +244,39 @@ Consider an unanchored pattern that matches dates:
   data> The date is 23ja\eP
   Partial match: 23ja
 .sp
-The this stage, an application could discard the text preceding "23ja", add on 
-text from the next segment, and call \fBpcre_exec()\fP again. Unlike 
-\fBpcre_dfa_exec()\fP, the entire matching string must always be available, and 
-the complete matching process occurs for each call, so more memory and more 
+The this stage, an application could discard the text preceding "23ja", add on
+text from the next segment, and call \fBpcre_exec()\fP again. Unlike
+\fBpcre_dfa_exec()\fP, the entire matching string must always be available, and
+the complete matching process occurs for each call, so more memory and more
 processing time is needed.
+.P
+\fBNote:\fP If the pattern contains lookbehind assertions, or \eK, or starts
+with \eb or \eB, the string that is returned for a partial match will include
+characters that precede the partially matched string itself, because these must
+be retained when adding on more characters for a subsequent matching attempt.
+.
 .
-.                                                          
 .SH "ISSUES WITH MULTI-SEGMENT MATCHING"
 .rs
 .sp
-Certain types of pattern may give problems with multi-segment matching, 
+Certain types of pattern may give problems with multi-segment matching,
 whichever matching function is used.
 .P
 1. If the pattern contains tests for the beginning or end of a line, you need
 to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
 subject string for any call does not contain the beginning or end of a line.
 .P
-2. If the pattern contains backward assertions (including \eb or \eB), you need
-to arrange for some overlap in the subject strings to allow for them to be
-correctly tested at the start of each substring. For example, using
-\fBpcre_dfa_exec()\fP, you could pass the subject in chunks that are 500 bytes
-long, but in a buffer of 700 bytes, with the starting offset set to 200 and the
-previous 200 bytes at the start of the buffer.
+2. Lookbehind assertions at the start of a pattern are catered for in the
+offsets that are returned for a partial match. However, in theory, a lookbehind
+assertion later in the pattern could require even earlier characters to be
+inspected, and it might not have been reached when a partial match occurs. This
+is probably an extremely unlikely case; you could guard against it to a certain
+extent by always including extra characters at the start.
 .P
 3. Matching a subject string that is split into multiple segments may not
 always produce exactly the same result as matching over one single long string,
-especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and 
-Word Boundaries" above describes an issue that arises if the pattern ends with 
+especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
+Word Boundaries" above describes an issue that arises if the pattern ends with
 \eb or \eB. Another kind of difference may occur when there are multiple
 matching possibilities, because a partial match result is given only when there
 are no completed matches. This means that as soon as the shortest match has
@@ -263,7 +285,7 @@ Consider again this \fBpcretest\fP example:
 .sp
     re> /dog(sbody)?/
   data> dogsb\eP
-   0: dog    
+   0: dog
   data> do\eP\eD
   Partial match: do
   data> gsb\eR\eP\eD
@@ -286,15 +308,15 @@ matching multi-segment data. The example above then behaves differently:
 .sp
     re> /dog(sbody)?/
   data> dogsb\eP\eP
-  Partial match: dogsb 
+  Partial match: dogsb
   data> do\eP\eD
   Partial match: do
   data> gsb\eR\eP\eP\eD
-  Partial match: gsb    
+  Partial match: gsb
 .sp
 .P
 4. Patterns that contain alternatives at the top level which do not all
-start with the same pattern item may not work as expected when 
+start with the same pattern item may not work as expected when
 \fBpcre_dfa_exec()\fP is used. For example, consider this pattern:
 .sp
   1234|3789
@@ -311,7 +333,7 @@ patterns or patterns such as:
   1234|ABCD
 .sp
 where no string can be a partial match for both alternatives. This is not a
-problem if \fPpcre_exec()\fP is used, because the entire match has to be rerun 
+problem if \fPpcre_exec()\fP is used, because the entire match has to be rerun
 each time:
 .sp
     re> /1234|3789/
@@ -319,7 +341,7 @@ each time:
   Partial match: 123
   data> 1237890
    0: 3789
-.sp        
+.sp
 .
 .
 .SH AUTHOR
@@ -336,6 +358,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 31 August 2009
+Last updated: 05 September 2009
 Copyright (c) 1997-2009 University of Cambridge.
 .fi
diff --git a/doc/pcretest.1 b/doc/pcretest.1
index 69a2be6..f51aefe 100644
--- a/doc/pcretest.1
+++ b/doc/pcretest.1
@@ -461,10 +461,11 @@ This section describes the output when the normal matching function,
 .P
 When a match succeeds, pcretest outputs the list of captured substrings that
 \fBpcre_exec()\fP returns, starting with number 0 for the string that matched
-the whole pattern. Otherwise, it outputs "No match" or "Partial match:"
-followed by the partially matching substring when \fBpcre_exec()\fP returns
-PCRE_ERROR_NOMATCH or PCRE_ERROR_PARTIAL, respectively, and otherwise the PCRE
-negative error number. Here is an example of an interactive \fBpcretest\fP run.
+the whole pattern. Otherwise, it outputs "No match" when the return is
+PCRE_ERROR_NOMATCH, and "Partial match:" followed by the partially matching
+substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. For any other
+returns, it outputs the PCRE negative error number. Here is an example of an
+interactive \fBpcretest\fP run.
 .sp
   $ pcretest
   PCRE version 7.0 30-Nov-2006
@@ -726,6 +727,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 29 August 2009
+Last updated: 05 September 2009
 Copyright (c) 1997-2009 University of Cambridge.
 .fi
diff --git a/pcre_dfa_exec.c b/pcre_dfa_exec.c
index 56836de..701cfce 100644
--- a/pcre_dfa_exec.c
+++ b/pcre_dfa_exec.c
@@ -389,6 +389,11 @@ if (*first_op == OP_REVERSE)
       current_subject - start_subject : max_back;
     current_subject -= gone_back;
     }
+    
+  /* Save the earliest consulted character */
+  
+  if (current_subject < md->start_used_ptr) 
+    md->start_used_ptr = current_subject; 
 
   /* Now we can process the individual branches. */
 
@@ -800,6 +805,7 @@ for (;;)
         if (ptr > start_subject)
           {
           const uschar *temp = ptr - 1;
+          if (temp < md->start_used_ptr) md->start_used_ptr = temp; 
 #ifdef SUPPORT_UTF8
           if (utf8) BACKCHAR(temp);
 #endif
@@ -2503,7 +2509,7 @@ for (;;)
       {
       if (offsetcount >= 2)
         {
-        offsets[0] = current_subject - start_subject;
+        offsets[0] = md->start_used_ptr - start_subject;
         offsets[1] = end_subject - start_subject;
         }
       match_count = PCRE_ERROR_PARTIAL;
@@ -2936,6 +2942,8 @@ for (;;)
 
   /* OK, now we can do the business */
 
+  md->start_used_ptr = current_subject;
+   
   rc = internal_dfa_exec(
     md,                                /* fixed match data */
     md->start_code,                    /* this subexpression's code */
diff --git a/pcre_exec.c b/pcre_exec.c
index 710b74d..553f09e 100644
--- a/pcre_exec.c
+++ b/pcre_exec.c
@@ -408,7 +408,7 @@ immediately. The second one is used when we already know we are past the end of
 the subject. */
 
 #define CHECK_PARTIAL()\
-  if (md->partial && eptr >= md->end_subject && eptr > mstart)\
+  if (md->partial != 0 && eptr >= md->end_subject && eptr > mstart)\
     {\
     md->hitend = TRUE;\
     if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL);\
@@ -1024,8 +1024,9 @@ for (;;)
       if (eptr < md->start_subject) RRETURN(MATCH_NOMATCH);
       }
 
-    /* Skip to next op code */
+    /* Save the earliest consulted character, then skip to next op code */
 
+    if (eptr < md->start_used_ptr) md->start_used_ptr = eptr;
     ecode += 1 + LINK_SIZE;
     break;
 
@@ -1468,7 +1469,8 @@ for (;;)
 
       /* Find out if the previous and current characters are "word" characters.
       It takes a bit more work in UTF-8 mode. Characters > 255 are assumed to
-      be "non-word" characters. */
+      be "non-word" characters. Remember the earliest consulted character for 
+      partial matching. */
 
 #ifdef SUPPORT_UTF8
       if (utf8)
@@ -1477,6 +1479,7 @@ for (;;)
           {
           USPTR lastptr = eptr - 1;
           while((*lastptr & 0xc0) == 0x80) lastptr--;
+          if (lastptr < md->start_used_ptr) md->start_used_ptr = lastptr; 
           GETCHAR(c, lastptr);
           prev_is_word = c < 256 && (md->ctypes[c] & ctype_word) != 0;
           }
@@ -1497,8 +1500,11 @@ for (;;)
       /* Not in UTF-8 mode */
 
         {
-        prev_is_word = (eptr != md->start_subject) &&
-          ((md->ctypes[eptr[-1]] & ctype_word) != 0);
+        if (eptr == md->start_subject) prev_is_word = FALSE; else
+          {
+          if (eptr <= md->start_used_ptr) md->start_used_ptr = eptr - 1;  
+          prev_is_word = ((md->ctypes[eptr[-1]] & ctype_word) != 0);
+          }
         if (eptr >= md->end_subject) 
           {
           SCHECK_PARTIAL(); 
@@ -5277,9 +5283,10 @@ for(;;)
   first starting point for which a partial match was found. */
 
   md->start_match_ptr = start_match;
+  md->start_used_ptr = start_match; 
   md->match_call_count = 0;
   rc = match(start_match, md->start_code, start_match, 2, md, ims, NULL, 0, 0);
-  if (md->hitend && start_partial == NULL) start_partial = start_match;
+  if (md->hitend && start_partial == NULL) start_partial = md->start_used_ptr;
 
   switch(rc)
     {
diff --git a/pcre_internal.h b/pcre_internal.h
index c424446..c48f248 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -1601,7 +1601,6 @@ typedef struct match_data {
   BOOL   jscript_compat;        /* JAVASCRIPT_COMPAT flag */
   BOOL   endonly;               /* Dollar not before final \n */
   BOOL   notempty;              /* Empty string match not wanted */
-  int    partial;               /* PARTIAL options */
   BOOL   hitend;                /* Hit the end of the subject at some point */
   BOOL   bsr_anycrlf;           /* \R is just any CRLF, not full Unicode */
   const uschar *start_code;     /* For use when recursing */
@@ -1609,6 +1608,8 @@ typedef struct match_data {
   USPTR  end_subject;           /* End of the subject string */
   USPTR  start_match_ptr;       /* Start of matched string */
   USPTR  end_match_ptr;         /* Subject position at end match */
+  USPTR  start_used_ptr;        /* Earliest consulted character */ 
+  int    partial;               /* PARTIAL options */
   int    end_offset_top;        /* Highwater mark at end of match */
   int    capture_last;          /* Most recent capture number */
   int    start_offset;          /* The start offset value */
@@ -1625,6 +1626,7 @@ typedef struct dfa_match_data {
   const uschar *start_code;     /* Start of the compiled pattern */
   const uschar *start_subject;  /* Start of the subject string */
   const uschar *end_subject;    /* End of subject string */
+  const uschar *start_used_ptr; /* Earliest consulted character */ 
   const uschar *tables;         /* Character tables */
   int   moptions;               /* Match options */
   int   poptions;               /* Pattern options */
diff --git a/testdata/testinput2 b/testdata/testinput2
index 095d196..5877a29 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -2942,4 +2942,19 @@ a random value. /Ix
 /\w+A/PU
    CDAAAAB 
 
+/abc\K123/
+    xyzabc123pqr
+    xyzabc12\P
+    xyzabc12\P\P
+    
+/(?<=abc)123/
+    xyzabc123pqr 
+    xyzabc12\P
+    xyzabc12\P\P
+
+/\babc\b/
+    +++abc+++
+    +++ab\P
+    +++ab\P\P  
+
 / End of testinput2 /
diff --git a/testdata/testinput7 b/testdata/testinput7
index a112d37..237914e 100644
--- a/testdata/testinput7
+++ b/testdata/testinput7
@@ -4470,4 +4470,17 @@
    abc\P
    abc\P\P
 
+/abc\K123/
+    xyzabc123pqr
+    
+/(?<=abc)123/
+    xyzabc123pqr 
+    xyzabc12\P
+    xyzabc12\P\P
+
+/\babc\b/
+    +++abc+++
+    +++ab\P
+    +++ab\P\P  
+
 / End of testinput7 /
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index ba88a8f..98ab6af 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -9972,4 +9972,28 @@ Partial match: the cat
    CDAAAAB 
  0: CDA
 
+/abc\K123/
+    xyzabc123pqr
+ 0: 123
+    xyzabc12\P
+Partial match: abc12
+    xyzabc12\P\P
+Partial match: abc12
+    
+/(?<=abc)123/
+    xyzabc123pqr 
+ 0: 123
+    xyzabc12\P
+Partial match: abc12
+    xyzabc12\P\P
+Partial match: abc12
+
+/\babc\b/
+    +++abc+++
+ 0: abc
+    +++ab\P
+Partial match: +ab
+    +++ab\P\P  
+Partial match: +ab
+
 / End of testinput2 /
diff --git a/testdata/testoutput7 b/testdata/testoutput7
index 95f60a8..552ebc0 100644
--- a/testdata/testoutput7
+++ b/testdata/testoutput7
@@ -981,7 +981,7 @@ Partial match: abc
    xyzfo\P 
 No match
    foob\P\>2 
-Partial match: b
+Partial match: foob
    foobar...\R\P\>4 
  0: ar
    xyzfo\P
@@ -7442,4 +7442,24 @@ Partial match: dogs
    abc\P\P
  0: abc
 
+/abc\K123/
+    xyzabc123pqr
+Error -16
+    
+/(?<=abc)123/
+    xyzabc123pqr 
+ 0: 123
+    xyzabc12\P
+Partial match: abc12
+    xyzabc12\P\P
+Partial match: abc12
+
+/\babc\b/
+    +++abc+++
+ 0: abc
+    +++ab\P
+Partial match: +ab
+    +++ab\P\P  
+Partial match: +ab
+
 / End of testinput7 /
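
Note on using the new offsets. The pcrepartial.3 text in this commit says that, after PCRE_ERROR_PARTIAL, the first offset is the earliest inspected character and the second is the end of the subject, so that the characters needed for a retry can easily be extracted. The following is a minimal sketch of that extraction step only, not code from the commit. The helper name next_subject() and its parameters are hypothetical, and it assumes the caller has already received the two offsets from pcre_exec() or pcre_dfa_exec().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. After a partial match, offsets[0]
   is the earliest inspected character and offsets[1] is the end of the
   subject. This keeps subject[offsets[0] .. offsets[1]) and appends the next
   segment, giving the subject for the next matching attempt. Returns a
   malloc'd, NUL-terminated string that the caller must free, or NULL if the
   offsets are inconsistent or memory cannot be obtained. */
char *next_subject(const char *subject, const int offsets[2],
                   const char *segment, size_t segment_len)
{
  size_t keep;
  char *out;

  assert(subject != NULL && segment != NULL);
  if (offsets[0] < 0 || offsets[1] < offsets[0]) return NULL;

  keep = (size_t)(offsets[1] - offsets[0]);
  out = malloc(keep + segment_len + 1);
  if (out == NULL) return NULL;

  memcpy(out, subject + offsets[0], keep);   /* retained text, e.g. "abc12" */
  memcpy(out + keep, segment, segment_len);  /* newly arrived data */
  out[keep + segment_len] = '\0';
  return out;
}
```

For the documented example /(?<=abc)123/ with the subject "xyzabc12", the retained substring is "abc12", which corresponds to offsets 3 and 8. Calling next_subject("xyzabc12", offsets, "3pqr", 4) would then build "abc123pqr", in which the lookbehind can still see "abc". As the revised "ISSUES WITH MULTI-SEGMENT MATCHING" text warns, a lookbehind later in the pattern might need even earlier characters, so an application may prefer to keep some extra text in front of offsets[0].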