author ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-10-22 15:57:50 +0000
committer ph10 <ph10@2f5784b3-3f2a-0410-8824-cb99058d5e15> 2010-10-22 15:57:50 +0000
commit eb0957c2b28f0842cf8e03e5216b3349c8e950fe
tree f935bb111412c4fb8101c1249e0235b7669d6800
parent e998b8d9a8d15c20e782ae677bc66e4d84294086
Change the way PCRE_PARTIAL_HARD handles \z, \Z, \b, \B, and $.
git-svn-id: svn://vcs.exim.org/pcre/code/trunk@553 2f5784b3-3f2a-0410-8824-cb99058d5e15
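The ChangeLog entry in this commit about the empty-string check for partial matches can be modelled in a few lines of C. This is only a sketch under stated assumptions: the function names `old_is_partial` and `new_is_partial` are hypothetical, and plain offsets stand in for the `eptr`, `mstart`, and `md->start_used_ptr` pointers that the real `CHECK_PARTIAL()` macro compares.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the non-empty test in CHECK_PARTIAL().
   All positions are offsets into the subject, not PCRE pointers. */

/* Before this commit: the matched string itself had to be non-empty. */
int old_is_partial(size_t eptr, size_t end_subject, size_t mstart)
{
  assert(mstart <= eptr);
  return eptr >= end_subject && eptr > mstart;
}

/* After this commit: at least one character must have been inspected,
   counting characters examined by a lookbehind before the match start. */
int new_is_partial(size_t eptr, size_t end_subject, size_t start_used)
{
  assert(start_used <= eptr);
  return eptr >= end_subject && eptr > start_used;
}
```

For /(?<=abc)def/ against "abc", the attempt that reaches the end of the subject starts at offset 3, while the lookbehind inspects characters from offset 0. In this model `old_is_partial(3, 3, 3)` is 0 and `new_is_partial(3, 3, 0)` is 1, which is consistent with the "Partial match: abc" result expected by the new tests.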
-rw-r--r-- ChangeLog              18
-rw-r--r-- doc/pcre.3             14
-rw-r--r-- doc/pcre_exec.3         2
-rw-r--r-- doc/pcreapi.3          29
-rw-r--r-- doc/pcrematching.3     10
-rw-r--r-- doc/pcrepartial.3     116
-rw-r--r-- doc/pcretest.1         14
-rw-r--r-- pcre.h.in              72
-rw-r--r-- pcre_dfa_exec.c        19
-rw-r--r-- pcre_exec.c            66
-rw-r--r-- pcretest.c              2
-rw-r--r-- testdata/testinput2    36
-rw-r--r-- testdata/testinput7    36
-rw-r--r-- testdata/testinput8     4
-rw-r--r-- testdata/testoutput2   58
-rw-r--r-- testdata/testoutput7   58
-rw-r--r-- testdata/testoutput8    6
17 files changed, 426 insertions, 134 deletions
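The change to how PCRE_PARTIAL_HARD treats an end-of-subject assertion such as \z can be illustrated with a small decision function. This is a hedged sketch, not PCRE code: the enum names and `eod_outcome` are hypothetical, and the mode values are an assumption based on the way the patch tests `md->partial` (non-zero for any partial matching, greater than 1 for hard).

```c
#include <assert.h>

/* Hypothetical partial-matching modes (assumed to mirror md->partial). */
enum partial_mode { PARTIAL_NONE = 0, PARTIAL_SOFT = 1, PARTIAL_HARD = 2 };

/* Hypothetical outcomes of testing \z at the current position. */
enum outcome { ASSERT_FAILS, ASSERT_HOLDS, RETURN_PARTIAL };

/* Model of the OP_EOD case in pcre_exec() after this commit.
   pos is the current offset, end is the subject length, and start_used
   is the earliest offset inspected so far. */
enum outcome eod_outcome(int pos, int end, int start_used,
                         enum partial_mode mode)
{
  assert(start_used <= pos && pos <= end);
  if (pos < end) return ASSERT_FAILS;            /* not at end of subject */
  if (mode == PARTIAL_HARD && pos > start_used)  /* role of SCHECK_PARTIAL() */
    return RETURN_PARTIAL;
  return ASSERT_HOLDS;                           /* soft, or no partial matching */
}
```

With the subject "abc" and the position at offset 3, the model holds the assertion for no partial option and for the soft option, and reports a partial match for the hard option. That mirrors the three expected results for /abc\z/ in the new test output (plain, \P, and \P\P).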
diff --git a/ChangeLog b/ChangeLog
index 68940d2..5b8d840 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -21,6 +21,24 @@ Version 8.11 10-Oct-2010
the class, even if it had been included by some previous item, for example
in [\x00-\xff\s]. (This was a bug related to the fact that VT is not part
of \s, but is part of the POSIX "space" class.)
+
+4. A partial match never returns an empty string (because you can always
+ match an empty string at the end of the subject); however the checking for
+ an empty string was starting at the "start of match" point. This has been
+ changed to the "earliest inspected character" point, because the returned
+ data for a partial match starts at this character. This means that, for
+ example, /(?<=abc)def/ gives a partial match for the subject "abc"
+ (previously it gave "no match").
+
+5. Changes have been made to the way PCRE_PARTIAL_HARD affects the matching
+ of $, \z, \Z, \b, and \B. If the match point is at the end of the string,
+ previously a full match would be given. However, setting PCRE_PARTIAL_HARD
+ has an implication that the given string is incomplete (because a partial
+ match is preferred over a full match). For this reason, these items now
+ give a partial match in this situation. [Aside: previously, the one case
+ /t\b/ matched against "cat" with PCRE_PARTIAL_HARD set did return a partial
+ match rather than a full match, which was wrong by the old rules, but is
+ now correct.]
Version 8.10 25-Jun-2010
diff --git a/doc/pcre.3 b/doc/pcre.3
index c823d6e..4908299 100644
--- a/doc/pcre.3
+++ b/doc/pcre.3
@@ -254,12 +254,12 @@ test characters of any code value, but, by default, the characters that PCRE
recognizes as digits, spaces, or word characters remain the same set as before,
all with values less than 256. This remains true even when PCRE is built to
include Unicode property support, because to do otherwise would slow down PCRE
-in many common cases. Note that this also applies to \eb, because it is defined
-in terms of \ew and \eW. If you really want to test for a wider sense of, say,
-"digit", you can use explicit Unicode property tests such as \ep{Nd}.
-Alternatively, if you set the PCRE_UCP option, the way that the character
-escapes work is changed so that Unicode properties are used to determine which
-characters match. There are more details in the section on
+in many common cases. Note in particular that this applies to \eb and \eB,
+because they are defined in terms of \ew and \eW. If you really want to test
+for a wider sense of, say, "digit", you can use explicit Unicode property tests
+such as \ep{Nd}. Alternatively, if you set the PCRE_UCP option, the way that
+the character escapes work is changed so that Unicode properties are used to
+determine which characters match. There are more details in the section on
.\" HTML <a href="pcrepattern.html#genericchartypes">
.\" </a>
generic character types
@@ -306,6 +306,6 @@ two digits 10, at the domain cam.ac.uk.
.rs
.sp
.nf
-Last updated: 12 May 2010
+Last updated: 22 October 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
diff --git a/doc/pcre_exec.3 b/doc/pcre_exec.3
index 0a3399f..b0578b7 100644
--- a/doc/pcre_exec.3
+++ b/doc/pcre_exec.3
@@ -53,7 +53,7 @@ The options are:
PCRE_PARTIAL ) Return PCRE_ERROR_PARTIAL for a partial
PCRE_PARTIAL_SOFT ) match if no full matches are found
PCRE_PARTIAL_HARD Return PCRE_ERROR_PARTIAL for a partial match
- even if there is a full match as well
+ if that is found before a full match
.sp
For details of partial matching, see the
.\" HREF
diff --git a/doc/pcreapi.3 b/doc/pcreapi.3
index 7b07c2e..e94fc62 100644
--- a/doc/pcreapi.3
+++ b/doc/pcreapi.3
@@ -1532,12 +1532,21 @@ These options turn on the partial matching feature. For backwards
compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match
occurs if the end of the subject string is reached successfully, but there are
not enough subject characters to complete the match. If this happens when
-PCRE_PARTIAL_HARD is set, \fBpcre_exec()\fP immediately returns
-PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, matching continues
-by testing any other alternatives. Only if they all fail is PCRE_ERROR_PARTIAL
-returned (instead of PCRE_ERROR_NOMATCH). The portion of the string that
-was inspected when the partial match was found is set as the first matching
-string. There is a more detailed discussion in the
+PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, matching continues by
+testing any remaining alternatives. Only if no complete match can be found is
+PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In other words,
+PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match,
+but only if no complete match can be found.
+.P
+If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a
+partial match is found, \fBpcre_exec()\fP immediately returns
+PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words,
+when PCRE_PARTIAL_HARD is set, a partial match is considered to be more
+important than an alternative complete match.
+.P
+In both cases, the portion of the string that was inspected when the partial
+match was found is set as the first matching string. There is a more detailed
+discussion of partial and multi-segment matching, with examples, in the
.\" HREF
\fBpcrepartial\fP
.\"
@@ -2059,6 +2068,12 @@ is converted into PCRE_ERROR_PARTIAL if the end of the subject is reached,
there have been no complete matches, but there is still at least one matching
possibility. The portion of the string that was inspected when the longest
partial match was found is set as the first matching string in both cases.
+There is a more detailed discussion of partial and multi-segment matching, with
+examples, in the
+.\" HREF
+\fBpcrepartial\fP
+.\"
+documentation.
.sp
PCRE_DFA_SHORTEST
.sp
@@ -2178,6 +2193,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 21 June 2010
+Last updated: 22 October 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
diff --git a/doc/pcrematching.3 b/doc/pcrematching.3
index 490f914..af9b00f 100644
--- a/doc/pcrematching.3
+++ b/doc/pcrematching.3
@@ -151,11 +151,13 @@ callouts.
2. Because the alternative algorithm scans the subject string just once, and
never needs to backtrack, it is possible to pass very long subject strings to
the matching function in several pieces, checking for partial matching each
-time. The
+time. It is possible to do multi-segment matching using \fBpcre_exec()\fP (by
+retaining partially matched substrings), but it is more complicated. The
.\" HREF
\fBpcrepartial\fP
.\"
-documentation gives details of partial matching.
+documentation gives details of partial matching and discusses multi-segment
+matching.
.
.
.SH "DISADVANTAGES OF THE ALTERNATIVE ALGORITHM"
@@ -187,6 +189,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 29 September 2009
-Copyright (c) 1997-2009 University of Cambridge.
+Last updated: 22 October 2010
+Copyright (c) 1997-2010 University of Cambridge.
.fi
diff --git a/doc/pcrepartial.3 b/doc/pcrepartial.3
index e28056d..eeb7d85 100644
--- a/doc/pcrepartial.3
+++ b/doc/pcrepartial.3
@@ -21,8 +21,8 @@ what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, by beeping and not reflecting the character that
has been typed, for example. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
-entered. Partial matching can also sometimes be useful when the subject string
-is very long and is not all available at once.
+entered. Partial matching can also be useful when the subject string is very
+long and is not all available at once.
.P
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
PCRE_PARTIAL_HARD options, which can be set when calling \fBpcre_exec()\fP or
@@ -44,19 +44,21 @@ also disabled for partial matching.
.SH "PARTIAL MATCHING USING pcre_exec()"
.rs
.sp
-A partial match occurs during a call to \fBpcre_exec()\fP whenever the end of
-the subject string is reached successfully, but matching cannot continue
-because more characters are needed. However, at least one character must have
-been matched. (In other words, a partial match can never be an empty string.)
+A partial match occurs during a call to \fBpcre_exec()\fP when the end of the
+subject string is reached successfully, but matching cannot continue because
+more characters are needed. However, at least one character in the subject must
+have been inspected. This character need not form part of the final matched
+string; lookbehind assertions and the \eK escape sequence provide ways of
+inspecting characters before the start of a matched substring. The requirement
+for inspecting at least one character exists because an empty string can always
+be matched; without such a restriction there would always be a partial match of
+an empty string at the end of the subject.
.P
-If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
-continues as normal, and other alternatives in the pattern are tried. If no
-complete match can be found, \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL
-instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
-vector, the first of them is set to the offset of the earliest character that
-was inspected when the partial match was found. For convenience, the second
-offset points to the end of the string so that a substring can easily be
-identified.
+If there are at least two slots in the offsets vector when \fBpcre_exec()\fP
+returns with a partial match, the first slot is set to the offset of the
+earliest character that was inspected when the partial match was found. For
+convenience, the second offset points to the end of the subject so that a
+substring can easily be identified.
.P
For the majority of patterns, the first offset identifies the start of the
partially matched string. However, for patterns that contain lookbehind
@@ -68,7 +70,25 @@ inspected while carrying out the match. For example:
This pattern matches "123", but only if it is preceded by "abc". If the subject
string is "xyzabc12", the offsets after a partial match are for the substring
"abc12", because all these characters are needed if another match is tried
-with extra characters added.
+with extra characters added to the subject.
+.P
+What happens when a partial match is identified depends on which of the two
+partial matching options is set.
+.
+.
+.SS "PCRE_PARTIAL_SOFT with pcre_exec()"
+.rs
+.sp
+If PCRE_PARTIAL_SOFT is set when \fBpcre_exec()\fP identifies a partial match,
+the partial match is remembered, but matching continues as normal, and other
+alternatives in the pattern are tried. If no complete match can be found,
+\fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH.
+.P
+This option is "soft" because it prefers a complete match over a partial match.
+All the various matching items in a pattern behave as if the subject string is
+potentially complete. For example, \ez, \eZ, and $ match at the end of the
+subject, as normal, and for \eb and \eB the end of the subject is treated as a
+non-alphanumeric.
.P
If there is more than one partial match, the first one that was found provides
the data that is returned. Consider this pattern:
@@ -77,15 +97,29 @@ the data that is returned. Consider this pattern:
.sp
If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached during
-matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
-offsets are set to 3 and 9, identifying "123dog" as the first partial match
-that was found. (In this example, there are two partial matches, because "dog"
-on its own partially matches the second alternative.)
-.P
+matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
+identifying "123dog" as the first partial match that was found. (In this
+example, there are two partial matches, because "dog" on its own partially
+matches the second alternative.)
+.
+.
+.SS "PCRE_PARTIAL_HARD with pcre_exec()"
+.rs
+.sp
If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns
PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
-search for possible complete matches. The difference between the two options
-can be illustrated by a pattern such as:
+search for possible complete matches. This option is "hard" because it prefers
+an earlier partial match over a later complete match. For this reason, the
+assumption is made that the end of the supplied subject string may not be the
+true end of the available data, and so, if \ez, \eZ, \eb, \eB, or $ are
+encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
+.
+.
+.SS "Comparing hard and soft partial matching"
+.rs
+.sp
+The difference between the two partial matching options can be illustrated by a
+pattern such as:
.sp
/dog(sbody)?/
.sp
@@ -115,7 +149,7 @@ The \fBpcre_dfa_exec()\fP function moves along the subject string character by
character, without backtracking, searching for all possible matches
simultaneously. If the end of the subject is reached before the end of the
pattern, there is the possibility of a partial match, again provided that at
-least one character has matched.
+least one character has been inspected.
.P
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
@@ -240,11 +274,13 @@ From release 8.00, \fBpcre_exec()\fP can also be used to do multi-segment
matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
previous match with a new segment of data. Instead, new data must be added to
the previous subject string, and the entire match re-run, starting from the
-point where the partial match occurred. Earlier data can be discarded.
-Consider an unanchored pattern that matches dates:
+point where the partial match occurred. Earlier data can be discarded. It is
+best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the
+end of a segment as the end of the subject when matching \ez, \eZ, \eb, \eB,
+and $. Consider an unanchored pattern that matches dates:
.sp
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
- data> The date is 23ja\eP
+ data> The date is 23ja\eP\eP
Partial match: 23ja
.sp
At this stage, an application could discard the text preceding "23ja", add on
@@ -265,9 +301,11 @@ be retained when adding on more characters for a subsequent matching attempt.
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
.P
-1. If the pattern contains tests for the beginning or end of a line, you need
-to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
-subject string for any call does not contain the beginning or end of a line.
+1. If the pattern contains a test for the beginning of a line, you need to pass
+the PCRE_NOTBOL option when the subject string for any call does not start at the
+beginning of a line. There is also a PCRE_NOTEOL option, but in practice when
+doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
+includes the effect of PCRE_NOTEOL.
.P
2. Lookbehind assertions at the start of a pattern are catered for in the
offsets that are returned for a partial match. However, in theory, a lookbehind
@@ -281,10 +319,10 @@ always produce exactly the same result as matching over one single long string,
especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
Word Boundaries" above describes an issue that arises if the pattern ends with
\eb or \eB. Another kind of difference may occur when there are multiple
-matching possibilities, because a partial match result is given only when there
-are no completed matches. This means that as soon as the shortest match has
-been found, continuation to a new subject segment is no longer possible.
-Consider again this \fBpcretest\fP example:
+matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
+is given only when there are no completed matches. This means that as soon as
+the shortest match has been found, continuation to a new subject segment is no
+longer possible. Consider again this \fBpcretest\fP example:
.sp
re> /dog(sbody)?/
data> dogsb\eP
@@ -306,8 +344,8 @@ match stops when "dog" has been found, and it is not possible to continue. On
the other hand, if "dogsbody" is presented as a single string,
\fBpcre_dfa_exec()\fP finds both matches.
.P
-Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when
-matching multi-segment data. The example above then behaves differently:
+Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
+multi-segment data. The example above then behaves differently:
.sp
re> /dog(sbody)?/
data> dogsb\eP\eP
@@ -341,12 +379,12 @@ problem if \fBpcre_exec()\fP is used, because the entire match has to be rerun
each time:
.sp
re> /1234|3789/
- data> ABC123\eP
+ data> ABC123\eP\eP
Partial match: 123
data> 1237890
0: 3789
.sp
-Of course, instead of using PCRE_DFA_PARTIAL, the same technique of re-running
+Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
the entire match can also be used with \fBpcre_dfa_exec()\fP. Another
possibility is to work with two buffers. If a partial match at offset \fIn\fP
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
@@ -368,6 +406,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 19 October 2009
-Copyright (c) 1997-2009 University of Cambridge.
+Last updated: 22 October 2010
+Copyright (c) 1997-2010 University of Cambridge.
.fi
diff --git a/doc/pcretest.1 b/doc/pcretest.1
index 7e91394..75c2611 100644
--- a/doc/pcretest.1
+++ b/doc/pcretest.1
@@ -499,9 +499,11 @@ When a match succeeds, pcretest outputs the list of captured substrings that
\fBpcre_exec()\fP returns, starting with number 0 for the string that matched
the whole pattern. Otherwise, it outputs "No match" when the return is
PCRE_ERROR_NOMATCH, and "Partial match:" followed by the partially matching
-substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. For any other
-returns, it outputs the PCRE negative error number. Here is an example of an
-interactive \fBpcretest\fP run.
+substring when \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL. (Note that this is
+the entire substring that was inspected during the partial match; it may
+include characters before the actual match start if a lookbehind assertion,
+\eK, \eb, or \eB was involved.) For any other returns, it outputs the PCRE
+negative error number. Here is an example of an interactive \fBpcretest\fP run.
.sp
$ pcretest
PCRE version 7.0 30-Nov-2006
@@ -584,7 +586,9 @@ the subject where there is at least one match. For example:
(Using the normal matching function on this data finds only "tang".) The
longest matching string is always given first (and numbered zero). After a
PCRE_ERROR_PARTIAL return, the output is "Partial match:", followed by the
-partially matching substring.
+partially matching substring. (Note that this is the entire substring that was
+inspected during the partial match; it may include characters before the actual
+match start if a lookbehind assertion, \eK, \eb, or \eB was involved.)
.P
If \fB/g\fP is present on the pattern, the search for further matches resumes
at the end of the longest match. For example:
@@ -763,6 +767,6 @@ Cambridge CB2 3QH, England.
.rs
.sp
.nf
-Last updated: 14 June 2010
+Last updated: 22 October 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
diff --git a/pcre.h.in b/pcre.h.in
index b682063..c5999f3 100644
--- a/pcre.h.in
+++ b/pcre.h.in
@@ -96,42 +96,44 @@ extern "C" {
#endif
/* Options. Some are compile-time only, some are run-time only, and some are
-both, so we keep them all distinct. */
-
-#define PCRE_CASELESS 0x00000001
-#define PCRE_MULTILINE 0x00000002
-#define PCRE_DOTALL 0x00000004
-#define PCRE_EXTENDED 0x00000008
-#define PCRE_ANCHORED 0x00000010
-#define PCRE_DOLLAR_ENDONLY 0x00000020
-#define PCRE_EXTRA 0x00000040
-#define PCRE_NOTBOL 0x00000080
-#define PCRE_NOTEOL 0x00000100
-#define PCRE_UNGREEDY 0x00000200
-#define PCRE_NOTEMPTY 0x00000400
-#define PCRE_UTF8 0x00000800
-#define PCRE_NO_AUTO_CAPTURE 0x00001000
-#define PCRE_NO_UTF8_CHECK 0x00002000
-#define PCRE_AUTO_CALLOUT 0x00004000
-#define PCRE_PARTIAL_SOFT 0x00008000
+both, so we keep them all distinct. However, almost all the bits in the options
+word are now used. In the long run, we may have to re-use some of the
+compile-time only bits for runtime options, or vice versa. */
+
+#define PCRE_CASELESS 0x00000001 /* Compile */
+#define PCRE_MULTILINE 0x00000002 /* Compile */
+#define PCRE_DOTALL 0x00000004 /* Compile */
+#define PCRE_EXTENDED 0x00000008 /* Compile */
+#define PCRE_ANCHORED 0x00000010 /* Compile, exec, DFA exec */
+#define PCRE_DOLLAR_ENDONLY 0x00000020 /* Compile */
+#define PCRE_EXTRA 0x00000040 /* Compile */
+#define PCRE_NOTBOL 0x00000080 /* Exec, DFA exec */
+#define PCRE_NOTEOL 0x00000100 /* Exec, DFA exec */
+#define PCRE_UNGREEDY 0x00000200 /* Compile */
+#define PCRE_NOTEMPTY 0x00000400 /* Exec, DFA exec */
+#define PCRE_UTF8 0x00000800 /* Compile */
+#define PCRE_NO_AUTO_CAPTURE 0x00001000 /* Compile */
+#define PCRE_NO_UTF8_CHECK 0x00002000 /* Compile, exec, DFA exec */
+#define PCRE_AUTO_CALLOUT 0x00004000 /* Compile */
+#define PCRE_PARTIAL_SOFT 0x00008000 /* Exec, DFA exec */
#define PCRE_PARTIAL 0x00008000 /* Backwards compatible synonym */
-#define PCRE_DFA_SHORTEST 0x00010000
-#define PCRE_DFA_RESTART 0x00020000
-#define PCRE_FIRSTLINE 0x00040000
-#define PCRE_DUPNAMES 0x00080000
-#define PCRE_NEWLINE_CR 0x00100000
-#define PCRE_NEWLINE_LF 0x00200000
-#define PCRE_NEWLINE_CRLF 0x00300000
-#define PCRE_NEWLINE_ANY 0x00400000
-#define PCRE_NEWLINE_ANYCRLF 0x00500000
-#define PCRE_BSR_ANYCRLF 0x00800000
-#define PCRE_BSR_UNICODE 0x01000000
-#define PCRE_JAVASCRIPT_COMPAT 0x02000000
-#define PCRE_NO_START_OPTIMIZE 0x04000000
-#define PCRE_NO_START_OPTIMISE 0x04000000
-#define PCRE_PARTIAL_HARD 0x08000000
-#define PCRE_NOTEMPTY_ATSTART 0x10000000
-#define PCRE_UCP 0x20000000
+#define PCRE_DFA_SHORTEST 0x00010000 /* DFA exec */
+#define PCRE_DFA_RESTART 0x00020000 /* DFA exec */
+#define PCRE_FIRSTLINE 0x00040000 /* Compile */
+#define PCRE_DUPNAMES 0x00080000 /* Compile */
+#define PCRE_NEWLINE_CR 0x00100000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_LF 0x00200000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_CRLF 0x00300000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANY 0x00400000 /* Compile, exec, DFA exec */
+#define PCRE_NEWLINE_ANYCRLF 0x00500000 /* Compile, exec, DFA exec */
+#define PCRE_BSR_ANYCRLF 0x00800000 /* Compile, exec, DFA exec */
+#define PCRE_BSR_UNICODE 0x01000000 /* Compile, exec, DFA exec */
+#define PCRE_JAVASCRIPT_COMPAT 0x02000000 /* Compile */
+#define PCRE_NO_START_OPTIMIZE 0x04000000 /* Exec, DFA exec */
+#define PCRE_NO_START_OPTIMISE 0x04000000 /* Synonym */
+#define PCRE_PARTIAL_HARD 0x08000000 /* Exec, DFA exec */
+#define PCRE_NOTEMPTY_ATSTART 0x10000000 /* Exec, DFA exec */
+#define PCRE_UCP 0x20000000 /* Compile */
/* Exec-time and get/set-time error codes */
diff --git a/pcre_dfa_exec.c b/pcre_dfa_exec.c
index e55d968..3bb2451 100644
--- a/pcre_dfa_exec.c
+++ b/pcre_dfa_exec.c
@@ -831,7 +831,12 @@ for (;;)
/*-----------------------------------------------------------------*/
case OP_EOD:
- if (ptr >= end_subject) { ADD_ACTIVE(state_offset + 1, 0); }
+ if (ptr >= end_subject)
+ {
+ if ((md->moptions & PCRE_PARTIAL_HARD) != 0)
+ could_continue = TRUE;
+ else { ADD_ACTIVE(state_offset + 1, 0); }
+ }
break;
/*-----------------------------------------------------------------*/
@@ -871,7 +876,9 @@ for (;;)
/*-----------------------------------------------------------------*/
case OP_EODN:
- if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
+ if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
+ could_continue = TRUE;
+ else if (clen == 0 || (IS_NEWLINE(ptr) && ptr == end_subject - md->nllen))
{ ADD_ACTIVE(state_offset + 1, 0); }
break;
@@ -879,7 +886,9 @@ for (;;)
case OP_DOLL:
if ((md->moptions & PCRE_NOTEOL) == 0)
{
- if (clen == 0 ||
+ if (clen == 0 && (md->moptions & PCRE_PARTIAL_HARD) != 0)
+ could_continue = TRUE;
+ else if (clen == 0 ||
((md->poptions & PCRE_DOLLAR_ENDONLY) == 0 && IS_NEWLINE(ptr) &&
((ims & PCRE_MULTILINE) != 0 || ptr == end_subject - md->nllen)
))
@@ -2744,8 +2753,8 @@ for (;;)
((md->moptions & PCRE_PARTIAL_SOFT) != 0 && /* Soft partial and */
match_count < 0) /* no matches */
) && /* And... */
- ptr >= end_subject && /* Reached end of subject */
- ptr > current_subject) /* Matched non-empty string */
+ ptr >= end_subject && /* Reached end of subject */
+ ptr > md->start_used_ptr) /* Inspected non-empty string */
{
if (offsetcount >= 2)
{
diff --git a/pcre_exec.c b/pcre_exec.c
index 1eeea52..f6984bd 100644
--- a/pcre_exec.c
+++ b/pcre_exec.c
@@ -422,17 +422,18 @@ immediately. The second one is used when we already know we are past the end of
the subject. */
#define CHECK_PARTIAL()\
- if (md->partial != 0 && eptr >= md->end_subject && eptr > mstart)\
- {\
- md->hitend = TRUE;\
- if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+ if (md->partial != 0 && eptr >= md->end_subject && \
+ eptr > md->start_used_ptr) \
+ { \
+ md->hitend = TRUE; \
+ if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
}
#define SCHECK_PARTIAL()\
- if (md->partial != 0 && eptr > mstart)\
- {\
- md->hitend = TRUE;\
- if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL);\
+ if (md->partial != 0 && eptr > md->start_used_ptr) \
+ { \
+ md->hitend = TRUE; \
+ if (md->partial > 1) MRRETURN(PCRE_ERROR_PARTIAL); \
}
@@ -711,18 +712,18 @@ for (;;)
MRRETURN(MATCH_NOMATCH);
/* COMMIT overrides PRUNE, SKIP, and THEN */
-
+
case OP_COMMIT:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM52);
if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE &&
- rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
- rrc != MATCH_THEN)
+ rrc != MATCH_SKIP && rrc != MATCH_SKIP_ARG &&
+ rrc != MATCH_THEN)
RRETURN(rrc);
MRRETURN(MATCH_COMMIT);
/* PRUNE overrides THEN */
-
+
case OP_PRUNE:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM51);
@@ -737,11 +738,11 @@ for (;;)
RRETURN(MATCH_PRUNE);
/* SKIP overrides PRUNE and THEN */
-
+
case OP_SKIP:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
ims, eptrb, flags, RM53);
- if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
+ if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
RRETURN(rrc);
md->start_match_ptr = eptr; /* Pass back current position */
MRRETURN(MATCH_SKIP);
@@ -749,7 +750,7 @@ for (;;)
case OP_SKIP_ARG:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1], offset_top, md,
ims, eptrb, flags, RM57);
- if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
+ if (rrc != MATCH_NOMATCH && rrc != MATCH_PRUNE && rrc != MATCH_THEN)
RRETURN(rrc);
/* Pass back the current skip name by overloading md->start_match_ptr and
@@ -759,11 +760,11 @@ for (;;)
md->start_match_ptr = ecode + 2;
RRETURN(MATCH_SKIP_ARG);
-
+
/* For THEN (and THEN_ARG) we pass back the address of the bracket or
- the alt that is at the start of the current branch. This makes it possible
- to skip back past alternatives that precede the THEN within the current
- branch. */
+ the alt that is at the start of the current branch. This makes it possible
+ to skip back past alternatives that precede the THEN within the current
+ branch. */
case OP_THEN:
RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode], offset_top, md,
@@ -773,7 +774,7 @@ for (;;)
MRRETURN(MATCH_THEN);
case OP_THEN_ARG:
- RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
+ RMATCH(eptr, ecode + _pcre_OP_lengths[*ecode] + ecode[1+LINK_SIZE],
offset_top, md, ims, eptrb, flags, RM58);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
md->start_match_ptr = ecode - GET(ecode, 1);
@@ -1704,37 +1705,40 @@ for (;;)
if (eptr < md->end_subject)
{ if (!IS_NEWLINE(eptr)) MRRETURN(MATCH_NOMATCH); }
else
- { if (md->noteol) MRRETURN(MATCH_NOMATCH); }
+ {
+ if (md->noteol) MRRETURN(MATCH_NOMATCH);
+ SCHECK_PARTIAL();
+ }
ecode++;
break;
}
- else
+ else /* Not multiline */
{
if (md->noteol) MRRETURN(MATCH_NOMATCH);
- if (!md->endonly)
- {
- if (eptr != md->end_subject &&
- (!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
- MRRETURN(MATCH_NOMATCH);
- ecode++;
- break;
- }
+ if (!md->endonly) goto ASSERT_NL_OR_EOS;
}
+
/* ... else fall through for endonly */
/* End of subject assertion (\z) */
case OP_EOD:
if (eptr < md->end_subject) MRRETURN(MATCH_NOMATCH);
+ SCHECK_PARTIAL();
ecode++;
break;
/* End of subject or ending \n assertion (\Z) */
case OP_EODN:
- if (eptr != md->end_subject &&
+ ASSERT_NL_OR_EOS:
+ if (eptr < md->end_subject &&
(!IS_NEWLINE(eptr) || eptr != md->end_subject - md->nllen))
MRRETURN(MATCH_NOMATCH);
+
+ /* Either at end of string or \n before end. */
+
+ SCHECK_PARTIAL();
ecode++;
break;
diff --git a/pcretest.c b/pcretest.c
index ffd963f..cc80a5b 100644
--- a/pcretest.c
+++ b/pcretest.c
@@ -2313,7 +2313,9 @@ while (!done)
#endif
use_dfa = 1;
continue;
+#endif
+#if !defined NODFA
case 'F':
options |= PCRE_DFA_SHORTEST;
continue;
diff --git a/testdata/testinput2 b/testdata/testinput2
index b999d8f..959ce0b 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -3499,4 +3499,40 @@ with \Y. ---/
*** Failers
abcxy
+/(?<=abc)def/
+ abc\P\P
+
+/abc$/
+ abc
+ abc\P
+ abc\P\P
+
+/abc$/m
+ abc
+ abc\n
+ abc\P\P
+ abc\n\P\P
+ abc\P
+ abc\n\P
+
+/abc\z/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\Z/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\b/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\B/
+ abc
+ abc\P
+ abc\P\P
+
/-- End of testinput2 --/
diff --git a/testdata/testinput7 b/testdata/testinput7
index 5d27311..163f2a5 100644
--- a/testdata/testinput7
+++ b/testdata/testinput7
@@ -4560,4 +4560,40 @@
/^(?(?!a(*SKIP)b))/
ac
+/(?<=abc)def/
+ abc\P\P
+
+/abc$/
+ abc
+ abc\P
+ abc\P\P
+
+/abc$/m
+ abc
+ abc\n
+ abc\P\P
+ abc\n\P\P
+ abc\P
+ abc\n\P
+
+/abc\z/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\Z/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\b/
+ abc
+ abc\P
+ abc\P\P
+
+/abc\B/
+ abc
+ abc\P
+ abc\P\P
+
/-- End of testinput7 --/
diff --git a/testdata/testinput8 b/testdata/testinput8
index 1c6f684..2a6bef3 100644
--- a/testdata/testinput8
+++ b/testdata/testinput8
@@ -685,4 +685,8 @@
xxxxabcde\P
xxxxabcde\P\P
+/\bthe cat\b/8
+ the cat\P
+ the cat\P\P
+
/-- End of testinput8 --/
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index b2e2e9a..2f26fe6 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -11138,4 +11138,62 @@ No match
abcxy
No match
+/(?<=abc)def/
+ abc\P\P
+Partial match: abc
+
+/abc$/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc$/m
+ abc
+ 0: abc
+ abc\n
+ 0: abc
+ abc\P\P
+Partial match: abc
+ abc\n\P\P
+ 0: abc
+ abc\P
+ 0: abc
+ abc\n\P
+ 0: abc
+
+/abc\z/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\Z/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\b/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\B/
+ abc
+No match
+ abc\P
+Partial match: abc
+ abc\P\P
+Partial match: abc
+
/-- End of testinput2 --/
diff --git a/testdata/testoutput7 b/testdata/testoutput7
index 2aab80d..3b12f07 100644
--- a/testdata/testoutput7
+++ b/testdata/testoutput7
@@ -7610,4 +7610,62 @@ Error -16
ac
Error -16
+/(?<=abc)def/
+ abc\P\P
+Partial match: abc
+
+/abc$/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc$/m
+ abc
+ 0: abc
+ abc\n
+ 0: abc
+ abc\P\P
+Partial match: abc
+ abc\n\P\P
+ 0: abc
+ abc\P
+ 0: abc
+ abc\n\P
+ 0: abc
+
+/abc\z/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\Z/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\b/
+ abc
+ 0: abc
+ abc\P
+ 0: abc
+ abc\P\P
+Partial match: abc
+
+/abc\B/
+ abc
+No match
+ abc\P
+Partial match: abc
+ abc\P\P
+Partial match: abc
+
/-- End of testinput7 --/
diff --git a/testdata/testoutput8 b/testdata/testoutput8
index 0cc87d7..0e60e7a 100644
--- a/testdata/testoutput8
+++ b/testdata/testoutput8
@@ -1320,4 +1320,10 @@ Partial match: abc1
xxxxabcde\P\P
Partial match: abcde
+/\bthe cat\b/8
+ the cat\P
+ 0: the cat
+ the cat\P\P
+Partial match: the cat
+
/-- End of testinput8 --/