author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-09-05 10:20:44 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2009-09-05 10:20:44 +0000
commit 426d9955dc06c708b9f8e0673704ece1b2527fd9 (patch)
tree 78b0d6679bb6f8a266e3b52a4bbda3adab5d2aab
parent 92f006ccd3ea998dec9c9f8be7a9801103f87eb5 (diff)
Further updates to partial matching.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@435 2f5784b3-3f2a-0410-8824-cb99058d5e15
-rw-r--r-- ChangeLog | 22
-rw-r--r-- configure.ac | 2
-rw-r--r-- doc/pcreapi.3 | 29
-rw-r--r-- doc/pcrematching.3 | 7
-rw-r--r-- doc/pcrepartial.3 | 138
-rw-r--r-- doc/pcretest.1 | 11
-rw-r--r-- pcre_dfa_exec.c | 10
-rw-r--r-- pcre_exec.c | 19
-rw-r--r-- pcre_internal.h | 4
-rw-r--r-- testdata/testinput2 | 15
-rw-r--r-- testdata/testinput7 | 13
-rw-r--r-- testdata/testoutput2 | 24
-rw-r--r-- testdata/testoutput7 | 22
13 files changed, 221 insertions, 95 deletions
diff --git a/ChangeLog b/ChangeLog
index fd44cba..403f9fc 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -51,8 +51,9 @@ Version 8.00 ??-???-??
9. The restrictions on what a pattern can contain when partial matching is
requested for pcre_exec() have been removed. All patterns can now be
partially matched by this function. In addition, if there are at least two
- slots in the offset vector, the offsets of the first-encountered partial
- match are set in them when PCRE_ERROR_PARTIAL is returned.
+ slots in the offset vector, the offset of the earliest inspected character
+ for the match and the offset of the end of the subject are set in them when
+ PCRE_ERROR_PARTIAL is returned.
10. Partial matching has been split into two forms: PCRE_PARTIAL_SOFT, which is
synonymous with PCRE_PARTIAL, for backwards compatibility, and
@@ -73,22 +74,29 @@ Version 8.00 ??-???-??
earlier partial match, unless partial matching was again requested. For
example, with the pattern /dog.(body)?/, the "must contain" character is
"g". If the first part-match was for the string "dog", restarting with
- "sbody" failed.
+ "sbody" failed. This bug has been fixed.
-13. Added a pcredemo man page, created automatically from the pcredemo.c file,
+13. The string returned by pcre_dfa_exec() after a partial match has been
+ changed so that it starts at the first inspected character rather than the
+ first character of the match. This makes a difference only if the pattern
+ starts with a lookbehind assertion or \b or \B (\K is not supported by
+ pcre_dfa_exec()). It's an incompatible change, but it makes the two
+ matching functions compatible, and I think it's the right thing to do.
+
+14. Added a pcredemo man page, created automatically from the pcredemo.c file,
so that the demonstration program is easily available in environments where
PCRE has not been installed from source.
-14. Arranged to add -DPCRE_STATIC to cflags in libpcre.pc, libpcreposix.cp,
+15. Arranged to add -DPCRE_STATIC to cflags in libpcre.pc, libpcreposix.pc,
libpcrecpp.pc and pcre-config when PCRE is not compiled as a shared
library.
-15. Added REG_UNGREEDY to the pcreposix interface, at the request of a user.
+16. Added REG_UNGREEDY to the pcreposix interface, at the request of a user.
It maps to PCRE_UNGREEDY. It is not, of course, POSIX-compatible, but it
is not the first non-POSIX option to be added. Clearly some people find
these options useful.
-16. If a caller to the POSIX matching function regexec() passes a non-zero
+17. If a caller to the POSIX matching function regexec() passes a non-zero
value for \fInmatch\fP with a NULL value for \fIpmatch\fP, the value of
\fInmatch\fP is forced to zero.
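ChangeLog items 9 and 13 say that after PCRE_ERROR_PARTIAL the first two offset slots hold the earliest inspected character and the end of the subject. The sketch below shows how a caller might turn those two offsets into the reported substring. It does not call PCRE itself; the helper name `copy_partial_text` is hypothetical, and the offsets in the usage note are derived from the "abc12" result in the new test data rather than taken from a real run.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the text reported for a partial match into out
   as a NUL-terminated string. offsets[0] is assumed to be the earliest
   inspected character and offsets[1] the end of the subject, as described
   in ChangeLog item 9. Returns the number of bytes copied, or -1 if out
   is too small. */
static int copy_partial_text(const char *subject, const int *offsets,
                             char *out, size_t out_size)
{
    assert(offsets[0] >= 0 && offsets[0] <= offsets[1]);
    size_t len = (size_t)(offsets[1] - offsets[0]);
    if (len + 1 > out_size) return -1;   /* no room for text plus NUL */
    memcpy(out, subject + offsets[0], len);
    out[len] = '\0';
    return (int)len;
}
```

For the subject "xyzabc12" and the pattern /(?<=abc)123/, offsets of 3 and 8 would select "abc12", which is the string the updated tests expect.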
diff --git a/configure.ac b/configure.ac
index a4bda76..7f567f8 100644
--- a/configure.ac
+++ b/configure.ac
@@ -9,7 +9,7 @@ dnl empty.
m4_define(pcre_major, [8])
m4_define(pcre_minor, [00])
m4_define(pcre_prerelease, [-RC1])
-m4_define(pcre_date, [2009-04-23])
+m4_define(pcre_date, [2009-09-05])
# Libtool shared library interface versions (current:revision:age)
m4_define(libpcre_version, [0:1:0])
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index 8c4b45d..db3096e 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -149,9 +149,10 @@ documentation describes how to compile and run it.
A second matching function, \fBpcre_dfa_exec()\fP, which is not
Perl-compatible, is also provided. This uses a different algorithm for the
matching. The alternative algorithm finds all possible matches (at a given
-point in the subject), and scans the subject just once. However, this algorithm
-does not return captured substrings. A description of the two matching
-algorithms and their advantages and disadvantages is given in the
+point in the subject), and scans the subject just once (unless there are
+lookbehind assertions). However, this algorithm does not return captured
+substrings. A description of the two matching algorithms and their advantages
+and disadvantages is given in the
.\" HREF
\fBpcrematching\fP
.\"
@@ -1011,10 +1012,10 @@ different for each compiled pattern.
.sp
PCRE_INFO_OKPARTIAL
.sp
-Return 1 if the pattern can be used for partial matching, otherwise 0. The
-fourth argument should point to an \fBint\fP variable. From release 8.00, this
-always returns 1, because the restrictions that previously applied to partial
-matching have been lifted. The
+Return 1 if the pattern can be used for partial matching with
+\fBpcre_exec()\fP, otherwise 0. The fourth argument should point to an
+\fBint\fP variable. From release 8.00, this always returns 1, because the
+restrictions that previously applied to partial matching have been lifted. The
.\" HREF
\fBpcrepartial\fP
.\"
@@ -1387,8 +1388,8 @@ PCRE_PARTIAL_HARD is set, \fBpcre_exec()\fP immediately returns
PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, matching continues
by testing any other alternatives. Only if they all fail is PCRE_ERROR_PARTIAL
returned (instead of PCRE_ERROR_NOMATCH). The portion of the string that
-provided the partial match is set as the first matching string. There is a more
-detailed discussion in the
+was inspected when the partial match was found is set as the first matching
+string. There is a more detailed discussion in the
.\" HREF
\fBpcrepartial\fP
.\"
@@ -1834,8 +1835,8 @@ a compiled pattern, using a matching algorithm that scans the subject string
just once, and does not backtrack. This has different characteristics to the
normal algorithm, and is not compatible with Perl. Some of the features of PCRE
patterns are not supported. Nevertheless, there are times when this kind of
-matching can be useful. For a discussion of the two matching algorithms, see
-the
+matching can be useful. For a discussion of the two matching algorithms, and a
+list of features that \fBpcre_dfa_exec()\fP does not support, see the
.\" HREF
\fBpcrematching\fP
.\"
@@ -1890,8 +1891,8 @@ additional characters. This happens even if some complete matches have also
been found. When PCRE_PARTIAL_SOFT is set, the return code PCRE_ERROR_NOMATCH
is converted into PCRE_ERROR_PARTIAL if the end of the subject is reached,
there have been no complete matches, but there is still at least one matching
-possibility. The portion of the string that provided the longest partial match
-is set as the first matching string in both cases.
+possibility. The portion of the string that was inspected when the longest
+partial match was found is set as the first matching string in both cases.
.sp
PCRE_DFA_SHORTEST
.sp
@@ -2011,6 +2012,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 01 September 2009
+Last updated: 05 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcrematching.3 b/doc/pcrematching.3
index 808221d..a3b8363 100644
--- a/doc/pcrematching.3
+++ b/doc/pcrematching.3
@@ -92,6 +92,11 @@ the three strings "cat", "cater", and "caterpillar" that start at the fourth
character of the subject. The algorithm does not automatically move on to find
matches that start at later positions.
.P
+Although the general principle of this matching algorithm is that it scans the
+subject string only once, without backtracking, there is one exception: when a
+lookbehind assertion is encountered, the preceding characters have to be
+re-inspected.
+.P
There are a number of features of PCRE regular expressions that are not
supported by the alternative matching algorithm. They are as follows:
.P
@@ -178,6 +183,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 25 August 2009
+Last updated: 05 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/doc/pcrepartial.3 b/doc/pcrepartial.3
index 0d76489..0e9cc47 100644
--- a/doc/pcrepartial.3
+++ b/doc/pcrepartial.3
@@ -26,10 +26,10 @@ is very long and is not all available at once.
.P
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling \fBpcre_exec()\fP or
-\fBpcre_dfa_exec()\fP. For backwards compatibility, PCRE_PARTIAL is a synonym
-for PCRE_PARTIAL_SOFT. The essential difference between the two options is
-whether or not a partial match is preferred to an alternative complete match,
-though the details differ between the two matching functions. If both options
+\fBpcre_dfa_exec()\fP. For backwards compatibility, PCRE_PARTIAL is a synonym
+for PCRE_PARTIAL_SOFT. The essential difference between the two options is
+whether or not a partial match is preferred to an alternative complete match,
+though the details differ between the two matching functions. If both options
are set, PCRE_PARTIAL_HARD takes precedence.
.P
Setting a partial matching option disables one of PCRE's optimizations. PCRE
@@ -49,69 +49,86 @@ been matched. (In other words, a partial match can never be an empty string.)
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
continues as normal, and other alternatives in the pattern are tried. If no
complete match can be found, \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL
-instead of PCRE_ERROR_NOMATCH, and if there are at least two slots in the
-offsets vector, they are filled in with the offsets of the longest string that
-partially matched. Consider this pattern:
+instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
+vector, the first of them is set to the offset of the earliest character that
+was inspected when the partial match was found. For convenience, the second
+offset points to the end of the string so that a substring can easily be
+extracted.
+.P
+For the majority of patterns, the first offset identifies the start of the
+partially matched string. However, for patterns that contain lookbehind
+assertions, or \eK, or begin with \eb or \eB, earlier characters have been
+inspected while carrying out the match. For example:
+.sp
+ /(?<=abc)123/
+.sp
+This pattern matches "123", but only if it is preceded by "abc". If the subject
+string is "xyzabc12", the offsets after a partial match are for the substring
+"abc12", because all these characters are needed if another match is tried
+with extra characters added.
+.P
+If there is more than one partial match, the first one that was found provides
+the data that is returned. Consider this pattern:
.sp
/123\ew+X|dogY/
.sp
If this is matched against the subject string "abc123dog", both
-alternatives fail to match, but the end of the subject is reached during
+alternatives fail to match, but the end of the subject is reached during
matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
-offsets are set to 3 and 9, identifying "123dog" as the longest partial match
+offsets are set to 3 and 9, identifying "123dog" as the first partial match
that was found. (In this example, there are two partial matches, because "dog"
on its own partially matches the second alternative.)
.P
-If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns
+If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns
PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
search for possible complete matches. The difference between the two options
can be illustrated by a pattern such as:
.sp
/dog(sbody)?/
.sp
-This matches either "dog" or "dogsbody", greedily (that is, it prefers the
+This matches either "dog" or "dogsbody", greedily (that is, it prefers the
longer string if possible). If it is matched against the string "dog" with
-PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
-PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
+PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
+PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
if the pattern is made ungreedy the result is different:
.sp
/dog(sbody)??/
.sp
-In this case the result is always a complete match because \fBpcre_exec()\fP
-finds that first, and it never continues after finding a match. It might be
+In this case the result is always a complete match because \fBpcre_exec()\fP
+finds that first, and it never continues after finding a match. It might be
easier to follow this explanation by thinking of the two patterns like this:
.sp
/dog(sbody)?/ is the same as /dogsbody|dog/
/dog(sbody)??/ is the same as /dog|dogsbody/
.sp
-The second pattern will never match "dogsbody" when \fBpcre_exec()\fP is
+The second pattern will never match "dogsbody" when \fBpcre_exec()\fP is
used, because it will always find the shorter match first.
.
.
.SH "PARTIAL MATCHING USING pcre_dfa_exec()"
.rs
.sp
-The \fBpcre_dfa_exec()\fP function moves along the subject string character by
-character, without backtracking, searching for all possible matches
-simultaneously. If the end of the subject is reached before the end of the
+The \fBpcre_dfa_exec()\fP function moves along the subject string character by
+character, without backtracking, searching for all possible matches
+simultaneously. If the end of the subject is reached before the end of the
pattern, there is the possibility of a partial match, again provided that at
least one character has matched.
.P
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
-complete matches. The portion of the string that provided the longest partial
-match is set as the first matching string, provided there are at least two
-slots in the offsets vector.
+complete matches. The portion of the string that was inspected when the longest
+partial match was found is set as the first matching string, provided there are
+at least two slots in the offsets vector.
.P
-Because \fBpcre_dfa_exec()\fP always searches for all possible matches, and
+Because \fBpcre_dfa_exec()\fP always searches for all possible matches, and
there is no difference between greedy and ungreedy repetition, its behaviour is
-different from \fBpcre_exec\fP when PCRE_PARTIAL_HARD is set. Consider the
+different from \fBpcre_exec()\fP when PCRE_PARTIAL_HARD is set. Consider the
string "dog" matched against the ungreedy pattern shown above:
.sp
/dog(sbody)??/
.sp
-Whereas \fBpcre_exec()\fP stops as soon as it finds the complete match for
+Whereas \fBpcre_exec()\fP stops as soon as it finds the complete match for
"dog", \fBpcre_dfa_exec()\fP also finds the partial match for "dogsbody", and
so returns that when PCRE_PARTIAL_HARD is set.
.
@@ -119,21 +136,21 @@ so returns that when PCRE_PARTIAL_HARD is set.
.SH "PARTIAL MATCHING AND WORD BOUNDARIES"
.rs
.sp
-If a pattern ends with one of sequences \ew or \eW, which test for word
-boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
+If a pattern ends with one of the sequences \eb or \eB, which test for word
+boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
results. Consider this pattern:
.sp
/\ebcat\eb/
.sp
This matches "cat", provided there is a word boundary at either end. If the
subject string is "the cat", the comparison of the final "t" with a following
-character cannot take place, so a partial match is found. However,
-\fBpcre_exec()\fP carries on with normal matching, which matches \eb at the end
-of the subject when the last character is a letter, thus finding a complete
-match. The result, therefore, is \fInot\fP PCRE_ERROR_PARTIAL. The same thing
+character cannot take place, so a partial match is found. However,
+\fBpcre_exec()\fP carries on with normal matching, which matches \eb at the end
+of the subject when the last character is a letter, thus finding a complete
+match. The result, therefore, is \fInot\fP PCRE_ERROR_PARTIAL. The same thing
happens with \fBpcre_dfa_exec()\fP, because it also finds the complete match.
.P
-Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
+Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
then the partial match takes precedence.
.
.
@@ -182,7 +199,7 @@ when \fBpcre_dfa_exec()\fP is used.
If the escape sequence \eP is present more than once in a \fBpcretest\fP data
line, the PCRE_PARTIAL_HARD option is set for the match.
.
-.
+.
.SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
.rs
.sp
@@ -216,10 +233,10 @@ facility can be used to pass very long subject strings to
.SH "MULTI-SEGMENT MATCHING WITH pcre_exec()"
.rs
.sp
-From release 8.00, \fBpcre_exec()\fP can also be used to do multi-segment
-matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
-previous match with a new segment of data. Instead, new data must be added to
-the previous subject string, and the entire match re-run, starting from the
+From release 8.00, \fBpcre_exec()\fP can also be used to do multi-segment
+matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
+previous match with a new segment of data. Instead, new data must be added to
+the previous subject string, and the entire match re-run, starting from the
point where the partial match occurred. Earlier data can be discarded.
Consider an unanchored pattern that matches dates:
.sp
@@ -227,34 +244,39 @@ Consider an unanchored pattern that matches dates:
data> The date is 23ja\eP
Partial match: 23ja
.sp
-The this stage, an application could discard the text preceding "23ja", add on
-text from the next segment, and call \fBpcre_exec()\fP again. Unlike
-\fBpcre_dfa_exec()\fP, the entire matching string must always be available, and
-the complete matching process occurs for each call, so more memory and more
+At this stage, an application could discard the text preceding "23ja", add on
+text from the next segment, and call \fBpcre_exec()\fP again. Unlike
+\fBpcre_dfa_exec()\fP, the entire matching string must always be available, and
+the complete matching process occurs for each call, so more memory and more
processing time is needed.
+.P
+\fBNote:\fP If the pattern contains lookbehind assertions, or \eK, or starts
+with \eb or \eB, the string that is returned for a partial match will include
+characters that precede the partially matched string itself, because these must
+be retained when adding on more characters for a subsequent matching attempt.
+.
.
-.
.SH "ISSUES WITH MULTI-SEGMENT MATCHING"
.rs
.sp
-Certain types of pattern may give problems with multi-segment matching,
+Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
.P
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
.P
-2. If the pattern contains backward assertions (including \eb or \eB), you need
-to arrange for some overlap in the subject strings to allow for them to be
-correctly tested at the start of each substring. For example, using
-\fBpcre_dfa_exec()\fP, you could pass the subject in chunks that are 500 bytes
-long, but in a buffer of 700 bytes, with the starting offset set to 200 and the
-previous 200 bytes at the start of the buffer.
+2. Lookbehind assertions at the start of a pattern are catered for in the
+offsets that are returned for a partial match. However, in theory, a lookbehind
+assertion later in the pattern could require even earlier characters to be
+inspected, and it might not have been reached when a partial match occurs. This
+is probably an extremely unlikely case; you could guard against it to a certain
+extent by always including extra characters at the start.
.P
3. Matching a subject string that is split into multiple segments may not
always produce exactly the same result as matching over one single long string,
-especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
-Word Boundaries" above describes an issue that arises if the pattern ends with
+especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
+Word Boundaries" above describes an issue that arises if the pattern ends with
\eb or \eB. Another kind of difference may occur when there are multiple
matching possibilities, because a partial match result is given only when there
are no completed matches. This means that as soon as the shortest match has
@@ -263,7 +285,7 @@ Consider again this \fBpcretest\fP example:
.sp
re> /dog(sbody)?/
data> dogsb\eP
- 0: dog
+ 0: dog
data> do\eP\eD
Partial match: do
data> gsb\eR\eP\eD
@@ -286,15 +308,15 @@ matching multi-segment data. The example above then behaves differently:
.sp
re> /dog(sbody)?/
data> dogsb\eP\eP
- Partial match: dogsb
+ Partial match: dogsb
data> do\eP\eD
Partial match: do
data> gsb\eR\eP\eP\eD
- Partial match: gsb
+ Partial match: gsb
.sp
.P
4. Patterns that contain alternatives at the top level which do not all
-start with the same pattern item may not work as expected when
+start with the same pattern item may not work as expected when
\fBpcre_dfa_exec()\fP is used. For example, consider this pattern:
.sp
1234|3789
@@ -311,7 +333,7 @@ patterns or patterns such as:
1234|ABCD
.sp
where no string can be a partial match for both alternatives. This is not a
-problem if \fPpcre_exec()\fP is used, because the entire match has to be rerun
+problem if \fBpcre_exec()\fP is used, because the entire match has to be rerun
each time:
.sp
re> /1234|3789/
@@ -319,7 +341,7 @@ each time:
Partial match: 123
data> 1237890
0: 3789
-.sp
+.sp
.
.
.SH AUTHOR
@@ -336,6 +358,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 31 August 2009
+Last updated: 05 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
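The revised pcrepartial text says that, for multi-segment matching with pcre_exec(), data before the first returned offset can be discarded and the next segment appended before the match is rerun. A minimal sketch of that buffer handling follows. It assumes a caller-owned byte buffer; the name `carry_over` and its parameters are hypothetical and are not part of the PCRE API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: after a partial match, keep buf[first_offset..len),
   move it to the front of the buffer, and append the next segment.
   first_offset is assumed to be offsets[0] from the partial match, so any
   characters inspected by a lookbehind, \K, \b or \B are retained.
   Returns the new length, or (size_t)-1 if the result would exceed cap. */
static size_t carry_over(char *buf, size_t len, size_t cap,
                         size_t first_offset,
                         const char *next, size_t next_len)
{
    assert(first_offset <= len && len <= cap);
    size_t kept = len - first_offset;
    if (kept + next_len > cap) return (size_t)-1;
    memmove(buf, buf + first_offset, kept);   /* regions may overlap */
    memcpy(buf + kept, next, next_len);
    return kept + next_len;
}
```

With "xyzabc12" in the buffer, a first offset of 3 and a next segment of "3pqr", the buffer becomes "abc123pqr", so a rerun of /(?<=abc)123/ still sees the "abc" that the lookbehind needs.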
diff --git a/doc/pcretest.1 b/doc/pcretest.1
index 69a2be6..f51aefe 100644
--- a/doc/pcretest.1
+++ b/doc/pcretest.1
@@ -461,10 +461,11 @@ This section describes the output when the normal matching function,
.P
When a match succeeds, pcretest outputs the list of captured substrings that
\fBpcre_exec()\fP returns, starting with number 0 for the string that matched
-the whole pattern. Otherwise, it outputs "No match" or "Partial match:"
-followed by the partially matching substring when \fBpcre_exec()\fP returns
-PCRE_ERROR_NOMATCH or PCRE_ERROR_PARTIAL, respectively, and otherwise the PCRE
-negative error number. Here is an example of an interactive \fBpcretest\fP run.
+the whole pattern. Otherwise, it outputs "No match" when the return is
+PCRE_ERROR_NOMATCH, and "Partial match:" followed by the partially matching
+substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. For any other
+returns, it outputs the PCRE negative error number. Here is an example of an
+interactive \fBpcretest\fP run.
.sp
$ pcretest
PCRE version 7.0 30-Nov-2006
@@ -726,6 +727,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 29 August 2009
+Last updated: 05 September 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi
diff --git a/pcre_dfa_exec.c b/pcre_dfa_exec.c
index 56836de..701cfce 100644
--- a/pcre_dfa_exec.c
+++ b/pcre_dfa_exec.c
@@ -389,6 +389,11 @@ if (*first_op == OP_REVERSE)
current_subject - start_subject : max_back;
current_subject -= gone_back;
}
+
+ /* Save the earliest consulted character */
+
+ if (current_subject < md->start_used_ptr)
+ md->start_used_ptr = current_subject;
/* Now we can process the individual branches. */
@@ -800,6 +805,7 @@ for (;;)
if (ptr > start_subject)
{
const uschar *temp = ptr - 1;
+ if (temp < md->start_used_ptr) md->start_used_ptr = temp;
#ifdef SUPPORT_UTF8
if (utf8) BACKCHAR(temp);
#endif
@@ -2503,7 +2509,7 @@ for (;;)
{
if (offsetcount >= 2)
{
- offsets[0] = current_subject - start_subject;
+ offsets[0] = md->start_used_ptr - start_subject;
offsets[1] = end_subject - start_subject;
}
match_count = PCRE_ERROR_PARTIAL;
@@ -2936,6 +2942,8 @@ for (;;)
/* OK, now we can do the business */
+ md->start_used_ptr = current_subject;
+
rc = internal_dfa_exec(
md, /* fixed match data */
md->start_code, /* this subexpression's code */
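The first pcre_dfa_exec.c hunk records the earliest consulted character after a lookbehind has moved the current position backwards. The following toy version isolates that step. It is a sketch under simplifying assumptions: it handles single-byte characters only (the real code counts characters in UTF-8 mode), and `toy_dfa_data` and `step_back` are illustrative names, not PCRE's.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the two dfa_match_data fields used here. */
typedef struct {
    const unsigned char *start_subject;   /* start of the subject string */
    const unsigned char *start_used_ptr;  /* earliest consulted character */
} toy_dfa_data;

/* Step back by up to max_back bytes for a lookbehind, never moving before
   the start of the subject, then lower start_used_ptr if the new position
   is earlier than anything consulted so far. Returns the new position. */
static const unsigned char *step_back(toy_dfa_data *md,
                                      const unsigned char *current_subject,
                                      size_t max_back)
{
    assert(current_subject >= md->start_subject);
    size_t available = (size_t)(current_subject - md->start_subject);
    size_t gone_back = (max_back > available) ? available : max_back;
    current_subject -= gone_back;
    if (current_subject < md->start_used_ptr)
        md->start_used_ptr = current_subject;
    return current_subject;
}
```

Because start_used_ptr is reset to the match start before each call to internal_dfa_exec(), it only ever moves backwards within one matching attempt.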
diff --git a/pcre_exec.c b/pcre_exec.c
index 710b74d..553f09e 100644
--- a/pcre_exec.c
+++ b/pcre_exec.c
@@ -408,7 +408,7 @@ immediately. The second one is used when we already know we are past the end of
the subject. */
#define CHECK_PARTIAL()\
- if (md->partial && eptr >= md->end_subject && eptr > mstart)\
+ if (md->partial != 0 && eptr >= md->end_subject && eptr > mstart)\
{\
md->hitend = TRUE;\
if (md->partial > 1) RRETURN(PCRE_ERROR_PARTIAL);\
@@ -1024,8 +1024,9 @@ for (;;)
if (eptr < md->start_subject) RRETURN(MATCH_NOMATCH);
}
- /* Skip to next op code */
+ /* Save the earliest consulted character, then skip to next op code */
+ if (eptr < md->start_used_ptr) md->start_used_ptr = eptr;
ecode += 1 + LINK_SIZE;
break;
@@ -1468,7 +1469,8 @@ for (;;)
/* Find out if the previous and current characters are "word" characters.
It takes a bit more work in UTF-8 mode. Characters > 255 are assumed to
- be "non-word" characters. */
+ be "non-word" characters. Remember the earliest consulted character for
+ partial matching. */
#ifdef SUPPORT_UTF8
if (utf8)
@@ -1477,6 +1479,7 @@ for (;;)
{
USPTR lastptr = eptr - 1;
while((*lastptr & 0xc0) == 0x80) lastptr--;
+ if (lastptr < md->start_used_ptr) md->start_used_ptr = lastptr;
GETCHAR(c, lastptr);
prev_is_word = c < 256 && (md->ctypes[c] & ctype_word) != 0;
}
@@ -1497,8 +1500,11 @@ for (;;)
/* Not in UTF-8 mode */
{
- prev_is_word = (eptr != md->start_subject) &&
- ((md->ctypes[eptr[-1]] & ctype_word) != 0);
+ if (eptr == md->start_subject) prev_is_word = FALSE; else
+ {
+ if (eptr <= md->start_used_ptr) md->start_used_ptr = eptr - 1;
+ prev_is_word = ((md->ctypes[eptr[-1]] & ctype_word) != 0);
+ }
if (eptr >= md->end_subject)
{
SCHECK_PARTIAL();
@@ -5277,9 +5283,10 @@ for(;;)
first starting point for which a partial match was found. */
md->start_match_ptr = start_match;
+ md->start_used_ptr = start_match;
md->match_call_count = 0;
rc = match(start_match, md->start_code, start_match, 2, md, ims, NULL, 0, 0);
- if (md->hitend && start_partial == NULL) start_partial = start_match;
+ if (md->hitend && start_partial == NULL) start_partial = md->start_used_ptr;
switch(rc)
{
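The non-UTF-8 branch of the \b and \B handling in pcre_exec.c now notes that the character before the current position has been consulted. The sketch below mirrors that logic in isolation. It assumes that a "word" character is an ASCII letter, digit or underscore, whereas PCRE consults its own character tables, and `toy_match_data`, `is_word_char` and `prev_is_word` are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>

/* Toy stand-in for the two match_data fields used here. */
typedef struct {
    const unsigned char *start_subject;   /* start of the subject string */
    const unsigned char *start_used_ptr;  /* earliest consulted character */
} toy_match_data;

/* Assumption: word characters are ASCII alphanumerics and underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) != 0 || c == '_';
}

/* Report whether the character before eptr is a word character. When a
   previous character exists, record that it was consulted by moving
   start_used_ptr back to it if it is not already that early. */
static int prev_is_word(toy_match_data *md, const unsigned char *eptr)
{
    assert(eptr >= md->start_subject);
    if (eptr == md->start_subject) return 0;
    if (eptr <= md->start_used_ptr) md->start_used_ptr = eptr - 1;
    return is_word_char(eptr[-1]);
}
```

For the subject "+++ab" with a match attempt starting at "a", the check looks at the preceding "+", so the earliest consulted character moves back by one. That is consistent with the "Partial match: +ab" lines in the new test output.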
diff --git a/pcre_internal.h b/pcre_internal.h
index c424446..c48f248 100644
--- a/pcre_internal.h
+++ b/pcre_internal.h
@@ -1601,7 +1601,6 @@ typedef struct match_data {
BOOL jscript_compat; /* JAVASCRIPT_COMPAT flag */
BOOL endonly; /* Dollar not before final \n */
BOOL notempty; /* Empty string match not wanted */
- int partial; /* PARTIAL options */
BOOL hitend; /* Hit the end of the subject at some point */
BOOL bsr_anycrlf; /* \R is just any CRLF, not full Unicode */
const uschar *start_code; /* For use when recursing */
@@ -1609,6 +1608,8 @@ typedef struct match_data {
USPTR end_subject; /* End of the subject string */
USPTR start_match_ptr; /* Start of matched string */
USPTR end_match_ptr; /* Subject position at end match */
+ USPTR start_used_ptr; /* Earliest consulted character */
+ int partial; /* PARTIAL options */
int end_offset_top; /* Highwater mark at end of match */
int capture_last; /* Most recent capture number */
int start_offset; /* The start offset value */
@@ -1625,6 +1626,7 @@ typedef struct dfa_match_data {
const uschar *start_code; /* Start of the compiled pattern */
const uschar *start_subject; /* Start of the subject string */
const uschar *end_subject; /* End of subject string */
+ const uschar *start_used_ptr; /* Earliest consulted character */
const uschar *tables; /* Character tables */
int moptions; /* Match options */
int poptions; /* Pattern options */
diff --git a/testdata/testinput2 b/testdata/testinput2
index 095d196..5877a29 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -2942,4 +2942,19 @@ a random value. /Ix
/\w+A/PU
CDAAAAB
+/abc\K123/
+ xyzabc123pqr
+ xyzabc12\P
+ xyzabc12\P\P
+
+/(?<=abc)123/
+ xyzabc123pqr
+ xyzabc12\P
+ xyzabc12\P\P
+
+/\babc\b/
+ +++abc+++
+ +++ab\P
+ +++ab\P\P
+
/ End of testinput2 /
diff --git a/testdata/testinput7 b/testdata/testinput7
index a112d37..237914e 100644
--- a/testdata/testinput7
+++ b/testdata/testinput7
@@ -4470,4 +4470,17 @@
abc\P
abc\P\P
+/abc\K123/
+ xyzabc123pqr
+
+/(?<=abc)123/
+ xyzabc123pqr
+ xyzabc12\P
+ xyzabc12\P\P
+
+/\babc\b/
+ +++abc+++
+ +++ab\P
+ +++ab\P\P
+
/ End of testinput7 /
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index ba88a8f..98ab6af 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -9972,4 +9972,28 @@ Partial match: the cat
CDAAAAB
0: CDA
+/abc\K123/
+ xyzabc123pqr
+ 0: 123
+ xyzabc12\P
+Partial match: abc12
+ xyzabc12\P\P
+Partial match: abc12
+
+/(?<=abc)123/
+ xyzabc123pqr
+ 0: 123
+ xyzabc12\P
+Partial match: abc12
+ xyzabc12\P\P
+Partial match: abc12
+
+/\babc\b/
+ +++abc+++
+ 0: abc
+ +++ab\P
+Partial match: +ab
+ +++ab\P\P
+Partial match: +ab
+
/ End of testinput2 /
diff --git a/testdata/testoutput7 b/testdata/testoutput7
index 95f60a8..552ebc0 100644
--- a/testdata/testoutput7
+++ b/testdata/testoutput7
@@ -981,7 +981,7 @@ Partial match: abc
xyzfo\P
No match
foob\P\>2
-Partial match: b
+Partial match: foob
foobar...\R\P\>4
0: ar
xyzfo\P
@@ -7442,4 +7442,24 @@ Partial match: dogs
abc\P\P
0: abc
+/abc\K123/
+ xyzabc123pqr
+Error -16
+
+/(?<=abc)123/
+ xyzabc123pqr
+ 0: 123
+ xyzabc12\P
+Partial match: abc12
+ xyzabc12\P\P
+Partial match: abc12
+
+/\babc\b/
+ +++abc+++
+ 0: abc
+ +++ab\P
+Partial match: +ab
+ +++ab\P\P
+Partial match: +ab
+
/ End of testinput7 /