-rw-r--r-- doc/html/pcre2partial.html 531
-rw-r--r-- doc/pcre2.txt 576
-rw-r--r-- doc/pcre2partial.3 513
3 files changed, 724 insertions, 896 deletions
diff --git a/doc/html/pcre2partial.html b/doc/html/pcre2partial.html
index a2faa76..e0f37ea 100644
--- a/doc/html/pcre2partial.html
+++ b/doc/html/pcre2partial.html
@@ -14,85 +14,123 @@ please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE2</a>
-<li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre2_match()</a>
-<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_dfa_match()</a>
-<li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
-<li><a name="TOC5" href="#SEC5">EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST</a>
+<li><a name="TOC2" href="#SEC2">REQUIREMENTS FOR A PARTIAL MATCH</a>
+<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_match()</a>
+<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre2_match()</a>
+<li><a name="TOC5" href="#SEC5">PARTIAL MATCHING USING pcre2_dfa_match()</a>
<li><a name="TOC6" href="#SEC6">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a>
-<li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre2_match()</a>
-<li><a name="TOC8" href="#SEC8">ISSUES WITH MULTI-SEGMENT MATCHING</a>
-<li><a name="TOC9" href="#SEC9">AUTHOR</a>
-<li><a name="TOC10" href="#SEC10">REVISION</a>
+<li><a name="TOC7" href="#SEC7">AUTHOR</a>
+<li><a name="TOC8" href="#SEC8">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE2</a><br>
<P>
-In normal use of PCRE2, if the subject string that is passed to a matching
-function matches as far as it goes, but is too short to match the entire
-pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it
-might be helpful to distinguish this case from other cases in which there is no
-match.
+In normal use of PCRE2, if there is a match up to the end of a subject string,
+but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
+is returned, just like any other failing match. There are circumstances where
+it might be helpful to distinguish this "partial match" case.
</P>
<P>
-Consider, for example, an application where a human is required to type in data
-for a field with specific formatting requirements. An example might be a date
-in the form <i>ddmmmyy</i>, defined by this pattern:
-<pre>
- ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
-</pre>
-If the application sees the user's keystrokes one by one, and can check that
-what has been typed so far is potentially valid, it is able to raise an error
-as soon as a mistake is made, by beeping and not reflecting the character that
-has been typed, for example. This immediate feedback is likely to be a better
-user interface than a check that is delayed until the entire string has been
-entered. Partial matching can also be useful when the subject string is very
-long and is not all available at once, as discussed below.
+One example is an application where the subject string is very long, and not
+all available at once. The requirement here is to be able to do the matching
+segment by segment, but special action is needed when a matched substring spans
+the boundary between two segments.
+</P>
+<P>
+Another example is checking a user input string as it is typed, to ensure that
+it conforms to a required format. Invalid characters can be immediately
+diagnosed and rejected, giving instant feedback.
</P>
<P>
-PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
-PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
-The difference between the two options is whether or not a partial match is
-preferred to an alternative complete match, though the details differ between
-the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD
-takes precedence.
+Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
+requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
+options when calling a matching function. The difference between the two
+options is whether or not a partial match is preferred to an alternative
+complete match, though the details differ between the two types of matching
+function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
</P>
<P>
-If you want to use partial matching with just-in-time optimized code, you must
-call <b>pcre2_jit_compile()</b> with one or both of these options:
+If you want to use partial matching with just-in-time optimized code, as well
+as setting a partial match option for the matching function, you must also call
+<b>pcre2_jit_compile()</b> with one or both of these options:
<pre>
- PCRE2_JIT_PARTIAL_SOFT
PCRE2_JIT_PARTIAL_HARD
+ PCRE2_JIT_PARTIAL_SOFT
</pre>
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
-matches on the same pattern. If the appropriate JIT mode has not been compiled,
-interpretive matching code is used.
+matches on the same pattern. Separate code is compiled for each mode. If the
+appropriate JIT mode has not been compiled, interpretive matching code is used.
</P>
<P>
Setting a partial matching option disables two of PCRE2's standard
-optimizations. PCRE2 remembers the last literal code unit in a pattern, and
-abandons matching immediately if it is not present in the subject string. This
-optimization cannot be used for a subject string that might match only
-partially. PCRE2 also knows the minimum length of a matching string, and does
+optimization hints. PCRE2 remembers the last literal code unit in a pattern,
+and abandons matching immediately if it is not present in the subject string.
+This optimization cannot be used for a subject string that might match only
+partially. PCRE2 also remembers a minimum length of a matching string, and does
not bother to run the matching function on shorter strings. This optimization
is also disabled for partial matching.
</P>
-<br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br>
+<br><a name="SEC2" href="#TOC1">REQUIREMENTS FOR A PARTIAL MATCH</a><br>
+<P>
+A possible partial match occurs during matching when the end of the subject
+string is reached successfully, but either more characters are needed to
+complete the match, or the addition of more characters might change what is
+matched.
+</P>
+<P>
+Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
+definitely needed to complete a match. In this case both hard and soft matching
+options yield a partial match.
+</P>
+<P>
+Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
+can be found, but the addition of more characters might change what is
+matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
+PCRE2_PARTIAL_SOFT returns the complete match.
+</P>
+<P>
+On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
+pattern item is \z, \Z, \b, \B, or $, there is always a partial match.
+Otherwise, for both options, the next pattern item must be one that inspects a
+character, and at least one of the following must be true:
+</P>
+<P>
+(1) At least one character has already been inspected. An inspected character
+need not form part of the final matched string; lookbehind assertions and the
+\K escape sequence provide ways of inspecting characters before the start of a
+matched string.
+</P>
<P>
-A partial match occurs during a call to <b>pcre2_match()</b> when the end of the
-subject string is reached successfully, but matching cannot continue because
-more characters are needed, and in addition, either at least one character in
-the subject has been inspected or the pattern contains a lookbehind, or (when
-PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
-inspected character need not form part of the final matched string; lookbehind
-assertions and the \K escape sequence provide ways of inspecting characters
-before the start of a matched string.
+(2) The pattern contains one or more lookbehind assertions. This condition
+exists in case there is a lookbehind that inspects characters before the start
+of the match.
</P>
<P>
-The three additional requirements define the cases where adding more characters
-to the existing subject may complete the same match that would occur if they
-had all been present in the first place. Without these conditions there would
-be a partial match of an empty string at the end of the subject for all
-unanchored patterns (and also for anchored patterns if the subject itself is
-empty).
+(3) There is a special case when the whole pattern can match an empty string.
+When the starting point is at the end of the subject, the empty string match is
+a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
+conditions is true, it is returned. However, because adding more characters
+might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
+which in this case means "there is going to be a match at this point, but until
+some more characters are added, we do not know if it will be an empty string or
+something longer".
+</P>
+<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br>
+<P>
+When a partial matching option is set, the result of calling
+<b>pcre2_match()</b> can be one of the following:
+</P>
+<P>
+<b>A successful match</b>
+A complete match has been found, starting and ending within this subject.
+</P>
+<P>
+<b>PCRE2_ERROR_NOMATCH</b>
+No match can start anywhere in this subject.
+</P>
+<P>
+<b>PCRE2_ERROR_PARTIAL</b>
+Adding more characters may result in a complete match that uses one or more
+characters from the end of this subject.
</P>
<P>
When a partial match is returned, the first two elements in the ovector point
@@ -110,26 +148,6 @@ these characters are needed for a subsequent re-match with additional
characters.
</P>
<P>
-What happens when a partial match is identified depends on which of the two
-partial matching options is set.
-</P>
-<br><b>
-PCRE2_PARTIAL_SOFT WITH pcre2_match()
-</b><br>
-<P>
-If PCRE2_PARTIAL_SOFT is set when <b>pcre2_match()</b> identifies a partial
-match, the partial match is remembered, but matching continues as normal, and
-other alternatives in the pattern are tried. If no complete match can be found,
-PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH.
-</P>
-<P>
-This option is "soft" because it prefers a complete match over a partial match.
-All the various matching items in a pattern behave as if the subject string is
-potentially complete. For example, \z, \Z, and $ match at the end of the
-subject, as normal, and for \b and \B the end of the subject is treated as a
-non-alphanumeric.
-</P>
-<P>
If there is more than one partial match, the first one that was found provides
the data that is returned. Consider this pattern:
<pre>
@@ -138,26 +156,34 @@ the data that is returned. Consider this pattern:
If this is matched against the subject string "abc123dog", both alternatives
fail to match, but the end of the subject is reached during matching, so
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
-"123dog" as the first partial match that was found. (In this example, there are
-two partial matches, because "dog" on its own partially matches the second
-alternative.)
+"123dog" as the first partial match. (In this example, there are two partial
+matches, because "dog" on its own partially matches the second alternative.)
</P>
<br><b>
-PCRE2_PARTIAL_HARD WITH pcre2_match()
+How a partial match is processed by pcre2_match()
</b><br>
<P>
-If PCRE2_PARTIAL_HARD is set for <b>pcre2_match()</b>, PCRE2_ERROR_PARTIAL is
-returned as soon as a partial match is found, without continuing to search for
-possible complete matches. This option is "hard" because it prefers an earlier
-partial match over a later complete match. For this reason, the assumption is
-made that the end of the supplied subject string may not be the true end of the
-available data, and so, if \z, \Z, \b, \B, or $ are encountered at the end
-of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
-characters have been inspected.
+What happens when a partial match is identified depends on which of the two
+partial matching options is set.
+</P>
+<P>
+If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
+partial match is found, without continuing to search for possible complete
+matches. This option is "hard" because it prefers an earlier partial match over
+a later complete match. For this reason, the assumption is made that the end of
+the supplied subject string is not the true end of the available data, which is
+why \z, \Z, \b, \B, and $ always give a partial match.
+</P>
+<P>
+If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
+continues as normal, and other alternatives in the pattern are tried. If no
+complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
+PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
+over a partial match. All the various matching items in a pattern behave as if
+the subject string is potentially complete; \z, \Z, and $ match at the end of
+the subject, as normal, and for \b and \B the end of the subject is treated
+as a non-alphanumeric.
</P>
-<br><b>
-Comparing hard and soft partial matching
-</b><br>
<P>
The difference between the two partial matching options can be illustrated by a
pattern such as:
@@ -182,154 +208,85 @@ to follow this explanation by thinking of the two patterns like this:
The second pattern will never match "dogsbody", because it will always find the
shorter match first.
</P>
-<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
-<P>
-The DFA functions move along the subject string character by character, without
-backtracking, searching for all possible matches simultaneously. If the end of
-the subject is reached before the end of the pattern, there is the possibility
-of a partial match, again provided that at least one character has been
-inspected.
-</P>
-<P>
-When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
-have been no complete matches. Otherwise, the complete matches are returned.
-However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over
-any complete matches. The portion of the string that was matched when the
-longest partial match was found is set as the first matching string.
-</P>
-<P>
-Because the DFA functions always search for all possible matches, and there is
-no difference between greedy and ungreedy repetition, their behaviour is
-different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider
-the string "dog" matched against the ungreedy pattern shown above:
-<pre>
- /dog(sbody)??/
-</pre>
-Whereas the standard function stops as soon as it finds the complete match for
-"dog", the DFA function also finds the partial match for "dogsbody", and so
-returns that when PCRE2_PARTIAL_HARD is set.
-</P>
-<br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
+<br><b>
+Example of partial matching using pcre2test
+</b><br>
<P>
-If a pattern ends with one of sequences \b or \B, which test for word
-boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive
-results. Consider this pattern:
-<pre>
- /\bcat\b/
-</pre>
-This matches "cat", provided there is a word boundary at either end. If the
-subject string is "the cat", the comparison of the final "t" with a following
-character cannot take place, so a partial match is found. However, normal
-matching carries on, and \b matches at the end of the subject when the last
-character is a letter, so a complete match is found. The result, therefore, is
-<i>not</i> PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield
-PCRE2_ERROR_PARTIAL, because then the partial match takes precedence.
-</P>
-<br><a name="SEC5" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST</a><br>
-<P>
-If the <b>partial_soft</b> (or <b>ps</b>) modifier is present on a
-<b>pcre2test</b> data line, the PCRE2_PARTIAL_SOFT option is used for the match.
-Here is a run of <b>pcre2test</b> that uses the date example quoted above:
+The <b>pcre2test</b> data modifiers <b>partial_hard</b> (or <b>ph</b>) and
+<b>partial_soft</b> (or <b>ps</b>) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
+respectively, when calling <b>pcre2_match()</b>. Here is a run of
+<b>pcre2test</b> using a pattern that matches the whole subject in the form of a
+date:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
- data&#62; 25jun04\=ps
- 0: 25jun04
- 1: jun
- data&#62; 25dec3\=ps
+ data&#62; 25dec3\=ph
 Partial match: 25dec3
- data&#62; 3ju\=ps
+ data&#62; 3ju\=ph
Partial match: 3ju
- data&#62; 3juj\=ps
- No match
- data&#62; j\=ps
+ data&#62; 3juj\=ph
No match
</pre>
-The first data string is matched completely, so <b>pcre2test</b> shows the
-matched substrings. The remaining four strings do not match the complete
-pattern, but the first two are partial matches. Similar output is obtained
-if DFA matching is used.
-</P>
-<P>
-If the <b>partial_hard</b> (or <b>ph</b>) modifier is present on a
-<b>pcre2test</b> data line, the PCRE2_PARTIAL_HARD option is set for the match.
-</P>
-<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br>
-<P>
-When a partial match has been found using a DFA matching function, it is
-possible to continue the match by providing additional subject data and calling
-the function again with the same compiled regular expression, this time setting
-the PCRE2_DFA_RESTART option. You must pass the same working space as before,
-because this is where details of the previous partial match are stored. Here is
-an example using <b>pcre2test</b>:
+This example gives the same results for both hard and soft partial matching
+options. Here is an example where there is a difference:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
- data&#62; 23ja\=dfa,ps
- Partial match: 23ja
- data&#62; n05\=dfa,dfa_restart
- 0: n05
+ data&#62; 25jun04\=ps
+ 0: 25jun04
+ 1: jun
+ data&#62; 25jun04\=ph
+ Partial match: 25jun04
</pre>
-The first call has "23ja" as the subject, and requests partial matching; the
-second call has "n05" as the subject for the continued (restarted) match.
-Notice that when the match is complete, only the last part is shown; PCRE2 does
-not retain the previously partially-matched string. It is up to the calling
-program to do that if it needs to.
-</P>
-<P>
-That means that, for an unanchored pattern, if a continued match fails, it is
-not possible to try again at a new starting point. All this facility is capable
-of doing is continuing with the previous match attempt. In the previous
-example, if the second set of data is "ug23" the result is no match, even
-though there would be a match for "aug23" if the entire string were given at
-once. Depending on the application, this may or may not be what you want.
-The only way to allow for starting again at the next character is to retain the
-matched part of the subject and try a new complete match.
+With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
+PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
+there is only a partial match.
</P>
+<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br>
<P>
-You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
-PCRE2_DFA_RESTART to continue partial matching over multiple segments. This
-facility can be used to pass very long subject strings to the DFA matching
-functions.
+PCRE was not originally designed with multi-segment matching in mind. However,
+over time, features (including partial matching) that make multi-segment
+matching possible have been added. The string is searched segment by segment by
+calling <b>pcre2_match()</b> repeatedly, with the aim of achieving the same
+results that would be obtained if the entire string were available for searching.
</P>
-<br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br>
<P>
-Unlike the DFA function, it is not possible to restart the previous match with
-a new segment of data when using <b>pcre2_match()</b>. Instead, new data must be
-added to the previous subject string, and the entire match re-run, starting
-from the point where the partial match occurred. Earlier data can be discarded.
+Special logic must be implemented to handle a matched substring that spans a
+segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
+partial match at the end of a segment whenever there is the possibility of
+changing the match by adding more characters. The PCRE2_NOTBOL option should
+also be set for all but the first segment.
</P>
<P>
-It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not
-treat the end of a segment as the end of the subject when matching \z, \Z,
-\b, \B, and $. Consider an unanchored pattern that matches dates:
+When a partial match occurs, the next segment must be added to the current
+subject and the match re-run, using the <i>startoffset</i> argument of
+<b>pcre2_match()</b> to begin at the point where the partial match started.
+Multi-segment matching is usually used to search for substrings in the middle
+of very long sequences, so the patterns are normally not anchored. For example:
<pre>
re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
- data&#62; The date is 23ja\=ph
+ data&#62; ...the date is 23ja\=ph
Partial match: 23ja
+ data&#62; ...the date is 23jan19 and on that day...\=offset=15
+ 0: 23jan19
+ 1: jan
</pre>
-At this stage, an application could discard the text preceding "23ja", add on
-text from the next segment, and call the matching function again. Unlike the
-DFA matching function, the entire matching string must always be available,
-and the complete matching process occurs for each call, so more memory and more
-processing time is needed.
+Note the use of the <b>offset</b> modifier to start the new match where the
+partial match was found.
</P>
-<br><a name="SEC8" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
<P>
-Certain types of pattern may give problems with multi-segment matching,
-whichever matching function is used.
+In this simple example, the next segment was just added to the one in which the
+partial match was found. However, if there are memory constraints, it may be
+necessary to discard text that precedes the partial match before adding the
+next segment. In cases such as the above, where the pattern does not contain
+any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if a pattern contains a lookbehind assertion, characters
+that precede the start of the partial match may have been inspected during the
+matching process.
</P>
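The discard-and-append step described above can be sketched in C. This is an illustration under stated assumptions, not part of the PCRE2 API: the segment_buffer type and the slide_and_append() function are hypothetical names invented for this example, and keep_from is assumed to be the offset of the first code unit to retain (ovector[0] when the pattern contains no lookbehinds).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical caller-owned buffer holding the text kept for re-matching. */
typedef struct {
    char  *data;      /* storage provided by the caller */
    size_t length;    /* number of code units currently held */
    size_t capacity;  /* total size of the storage */
} segment_buffer;

/* Discard data[0 .. keep_from), move the remainder to the front of the
   buffer, and append the next segment after it.
   Returns 0 on success, or -1 (buffer unchanged) if the result would
   not fit in the available capacity. */
int slide_and_append(segment_buffer *buf, size_t keep_from,
                     const char *segment, size_t segment_length)
{
    assert(buf != NULL && buf->length <= buf->capacity);
    assert(keep_from <= buf->length);

    size_t kept = buf->length - keep_from;
    if (segment_length > buf->capacity - kept)
        return -1;

    /* The source and destination ranges may overlap, so use memmove. */
    memmove(buf->data, buf->data + keep_from, kept);
    if (segment_length > 0)
        memcpy(buf->data + kept, segment, segment_length);
    buf->length = kept + segment_length;
    return 0;
}
```

After a successful call, the next match would be run on the buffer with startoffset set to the old partial match offset minus keep_from, which is zero when only the partially matched substring is retained.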
<P>
-1. If the pattern contains a test for the beginning of a line, you need to pass
-the PCRE2_NOTBOL option when the subject string for any call does start at the
-beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when
-doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which
-includes the effect of PCRE2_NOTEOL.
-</P>
-<P>
-2. If a pattern contains a lookbehind assertion, characters that precede the
-start of the partial match may have been inspected during the matching process.
-When using <b>pcre2_match()</b>, sufficient characters must be retained for the
-next match attempt. You can ensure that enough characters are retained by doing
-the following:
+The only lookbehind information that is available is the length of the longest
+lookbehind in a pattern. This may not, of course, be at the start of the
+pattern, but retaining that many characters before the partial match is
+sufficient, if not always strictly necessary. The way to do this is as follows:
</P>
<P>
Before doing any matching, find the length of the longest lookbehind in the
@@ -339,13 +296,8 @@ partial match, moving back from the ovector[0] offset in the subject by the
number of characters given for the maximum lookbehind gets you to the earliest
character that must be retained. In a non-UTF or a 32-bit situation, moving
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units.
-</P>
-<P>
-Characters before the point you have now reached can be discarded, and after
-the next segment has been added to what is retained, you should run the next
-match with the <b>startoffset</b> argument set so that the match begins at the
-same point as before.
+while moving back through the code units. Characters before the point you have
+now reached can be discarded.
</P>
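The backward step can be sketched in C for the UTF-8 case. This is a minimal illustration, not PCRE2 code: the function name earliest_retained_offset_utf8() and its parameters are assumptions made for this example. The caller is assumed to pass ovector[0] as match_start and the pattern's maximum lookbehind, in characters, as max_lookbehind (this value can be obtained from pcre2_pattern_info() with PCRE2_INFO_MAXLOOKBEHIND). The subject is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the offset of the earliest code unit that must be retained:
   move back from match_start by max_lookbehind characters, stopping
   at the start of the subject if it is reached first. */
size_t earliest_retained_offset_utf8(const unsigned char *subject,
                                     size_t match_start,
                                     uint32_t max_lookbehind)
{
    assert(subject != NULL);

    size_t pos = match_start;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;  /* step back one code unit */
        /* Continuation bytes have the form 10xxxxxx; keep stepping back
           until the first code unit of the character is reached. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

For the subject "xx123ab" with a partial match starting at offset 5 and a maximum lookbehind of 3, the function returns 2. Everything before that offset can be discarded, and the next match starts at offset 5 - 2 = 3 in the retained text.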
<P>
For example, if the pattern "(?&#60;=123)abc" is partially matched against the
@@ -353,62 +305,67 @@ string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of <b>startoffset</b> for the next match should be 3. When <b>pcre2test</b>
displays a partial match, it indicates the lookbehind characters with '&#60;'
-characters if the "allusedtext" modifier is set:
+characters if the <b>allusedtext</b> modifier is set:
<pre>
re&#62; "(?&#60;=123)abc"
data&#62; xx123ab\=ph,allusedtext
Partial match: 123ab
&#60;&#60;&#60;
</pre>
-However, the "allusedtext" modifier is not available for JIT matching, because
-JIT matching does not maintain the first and last consulted characters.
-</P>
-<P>
-3. Matching a subject string that is split into multiple segments may not
-always produce exactly the same result as matching over one single long string
-when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
-Boundaries" above describes an issue that arises if the pattern ends with \b
-or \B. Another kind of difference may occur when there are multiple matching
-possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
-only when there are no completed matches. This means that as soon as the
-shortest match has been found, continuation to a new subject segment is no
-longer possible. Consider this <b>pcre2test</b> example:
+Note that the <b>allusedtext</b> modifier is not available for JIT matching,
+because JIT matching does not maintain the first and last consulted characters.
+</P>
+<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
+<P>
+The DFA function moves along the subject string character by character, without
+backtracking, searching for all possible matches simultaneously. If the end of
+the subject is reached before the end of the pattern, there is the possibility
+of a partial match.
+</P>
+<P>
+When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
+have been no complete matches. Otherwise, the complete matches are returned.
+If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
+complete matches. The portion of the string that was matched when the longest
+partial match was found is set as the first matching string.
+</P>
+<P>
+Because the DFA function always searches for all possible matches, and there is
+no difference between greedy and ungreedy repetition, its behaviour is
+different from that of <b>pcre2_match()</b>. Consider the string "dog" matched
+against this ungreedy pattern:
<pre>
- re&#62; /dog(sbody)?/
- data&#62; dogsb\=ps
- 0: dog
- data&#62; do\=ps,dfa
- Partial match: do
- data&#62; gsb\=ps,dfa,dfa_restart
- 0: g
- data&#62; dogsbody\=dfa
- 0: dogsbody
- 1: dog
+ /dog(sbody)??/
</pre>
-The first data line passes the string "dogsb" to a standard matching function,
-setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match
-for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter
-string "dog" is a complete match. Similarly, when the subject is presented to
-a DFA matching function in several parts ("do" and "gsb" being the first two)
-the match stops when "dog" has been found, and it is not possible to continue.
-On the other hand, if "dogsbody" is presented as a single string, a DFA
-matching function finds both matches.
-</P>
-<P>
-Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching
-multi-segment data. The example above then behaves differently:
+Whereas the standard function stops as soon as it finds the complete match for
+"dog", the DFA function also finds the partial match for "dogsbody", and so
+returns that when PCRE2_PARTIAL_HARD is set.
+</P>
+<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br>
+<P>
+When a partial match has been found using the DFA matching function, it is
+possible to continue the match by providing additional subject data and calling
+the function again with the same compiled regular expression, this time setting
+the PCRE2_DFA_RESTART option. You must pass the same working space as before,
+because this is where details of the previous partial match are stored. You can
+set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
+to continue partial matching over multiple segments. Here is an example using
+<b>pcre2test</b>:
<pre>
- re&#62; /dog(sbody)?/
- data&#62; dogsb\=ph
- Partial match: dogsb
- data&#62; do\=ps,dfa
- Partial match: do
- data&#62; gsb\=ph,dfa,dfa_restart
- Partial match: gsb
+ re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
+ data&#62; 23ja\=dfa,ps
+ Partial match: 23ja
+ data&#62; n05\=dfa,dfa_restart
+ 0: n05
</pre>
-4. Patterns that contain alternatives at the top level which do not all start
-with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
-used. For example, consider this pattern:
+The first call has "23ja" as the subject, and requests partial matching; the
+second call has "n05" as the subject for the continued (restarted) match.
+Notice that when the match is complete, only the last part is shown; PCRE2 does
+not retain the previously partially-matched string. It is up to the calling
+program to do that if it needs to. This means that, for an unanchored pattern,
+if a continued match fails, it is not possible to try again at a new starting
+point. All this facility is capable of doing is continuing with the previous
+match attempt. For example, consider this pattern:
<pre>
1234|3789
</pre>
@@ -417,30 +374,18 @@ alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "7890" does not yield a
match because only those alternatives that match at one point in the subject
-are remembered. The problem arises because the start of the second alternative
-matches within the first alternative. There is no problem with anchored
-patterns or patterns such as:
-<pre>
- 1234|ABCD
-</pre>
-where no string can be a partial match for both alternatives. This is not a
-problem if a standard matching function is used, because the entire match has
-to be rerun each time:
-<pre>
- re&#62; /1234|3789/
- data&#62; ABC123\=ph
- Partial match: 123
- data&#62; 1237890
- 0: 3789
-</pre>
-Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running
-the entire match can also be used with the DFA matching function. Another
-possibility is to work with two buffers. If a partial match at offset <i>n</i>
-in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on
-the second buffer, you can then try a new match starting at offset <i>n+1</i> in
-the first buffer.
+are remembered. Depending on the application, this may or may not be what you
+want.
+</P>
+<P>
+If you do want to allow for starting again at the next character, one way of
+doing it is to retain the matched part of the segment and try a new complete
+match, as described for <b>pcre2_match()</b> above. Another possibility is to
+work with two buffers. If a partial match at offset <i>n</i> in the first buffer
+is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
+you can then try a new match starting at offset <i>n+1</i> in the first buffer.
</P>
-<br><a name="SEC9" href="#TOC1">AUTHOR</a><br>
+<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
@@ -449,9 +394,9 @@ University Computing Service
Cambridge, England.
<br>
</P>
-<br><a name="SEC10" href="#TOC1">REVISION</a><br>
+<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
-Last updated: 22 July 2019
+Last updated: 07 August 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index e3eb3f4..a990396 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -5650,72 +5650,109 @@ NAME
PARTIAL MATCHING IN PCRE2
- In normal use of PCRE2, if the subject string that is passed to a
- matching function matches as far as it goes, but is too short to match
- the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
- stances where it might be helpful to distinguish this case from other
- cases in which there is no match.
-
- Consider, for example, an application where a human is required to type
- in data for a field with specific formatting requirements. An example
- might be a date in the form ddmmmyy, defined by this pattern:
-
- ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
-
- If the application sees the user's keystrokes one by one, and can check
- that what has been typed so far is potentially valid, it is able to
- raise an error as soon as a mistake is made, by beeping and not re-
- flecting the character that has been typed, for example. This immediate
- feedback is likely to be a better user interface than a check that is
- delayed until the entire string has been entered. Partial matching can
- also be useful when the subject string is very long and is not all
- available at once, as discussed below.
-
- PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
- PCRE2_PARTIAL_HARD options, which can be set when calling a matching
- function. The difference between the two options is whether or not a
- partial match is preferred to an alternative complete match, though the
- details differ between the two types of matching function. If both op-
- tions are set, PCRE2_PARTIAL_HARD takes precedence.
-
- If you want to use partial matching with just-in-time optimized code,
- you must call pcre2_jit_compile() with one or both of these options:
+ In normal use of PCRE2, if there is a match up to the end of a subject
+ string, but more characters are needed to match the entire pattern,
+ PCRE2_ERROR_NOMATCH is returned, just like any other failing match.
+ There are circumstances where it might be helpful to distinguish this
+ "partial match" case.
+
+ One example is an application where the subject string is very long,
+ and not all available at once. The requirement here is to be able to do
+ the matching segment by segment, but special action is needed when a
+ matched substring spans the boundary between two segments.
+
+ Another example is checking a user input string as it is typed, to en-
+ sure that it conforms to a required format. Invalid characters can be
+ immediately diagnosed and rejected, giving instant feedback.
+
+ Partial matching is a PCRE2-specific feature; it is not Perl-compati-
+ ble. It is requested by setting one of the PCRE2_PARTIAL_HARD or
+ PCRE2_PARTIAL_SOFT options when calling a matching function. The dif-
+ ference between the two options is whether or not a partial match is
+ preferred to an alternative complete match, though the details differ
+ between the two types of matching function. If both options are set,
+ PCRE2_PARTIAL_HARD takes precedence.
+
+ If you want to use partial matching with just-in-time optimized code,
+ as well as setting a partial match option for the matching function,
+ you must also call pcre2_jit_compile() with one or both of these op-
+ tions:
- PCRE2_JIT_PARTIAL_SOFT
PCRE2_JIT_PARTIAL_HARD
+ PCRE2_JIT_PARTIAL_SOFT
- PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
- tial matches on the same pattern. If the appropriate JIT mode has not
- been compiled, interpretive matching code is used.
+ PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
+ tial matches on the same pattern. Separate code is compiled for each
+ mode. If the appropriate JIT mode has not been compiled, interpretive
+ matching code is used.
Setting a partial matching option disables two of PCRE2's standard op-
- timizations. PCRE2 remembers the last literal code unit in a pattern,
- and abandons matching immediately if it is not present in the subject
- string. This optimization cannot be used for a subject string that
- might match only partially. PCRE2 also knows the minimum length of a
- matching string, and does not bother to run the matching function on
- shorter strings. This optimization is also disabled for partial match-
- ing.
+ timization hints. PCRE2 remembers the last literal code unit in a pat-
+ tern, and abandons matching immediately if it is not present in the
+ subject string. This optimization cannot be used for a subject string
+ that might match only partially. PCRE2 also remembers a minimum length
+ of a matching string, and does not bother to run the matching function
+ on shorter strings. This optimization is also disabled for partial
+ matching.
+
+
+REQUIREMENTS FOR A PARTIAL MATCH
+
+ A possible partial match occurs during matching when the end of the
+ subject string is reached successfully, but either more characters are
+ needed to complete the match, or the addition of more characters might
+ change what is matched.
+
+ Example 1: if the pattern is /abc/ and the subject is "ab", more char-
+ acters are definitely needed to complete a match. In this case both
+ hard and soft matching options yield a partial match.
+
+ Example 2: if the pattern is /ab+/ and the subject is "ab", a complete
+ match can be found, but the addition of more characters might change
+ what is matched. In this case, only PCRE2_PARTIAL_HARD returns a par-
+ tial match; PCRE2_PARTIAL_SOFT returns the complete match.
+
+ On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if
+ the next pattern item is \z, \Z, \b, \B, or $ there is always a partial
+ match. Otherwise, for both options, the next pattern item must be one
+ that inspects a character, and at least one of the following must be
+ true:
+
+ (1) At least one character has already been inspected. An inspected
+ character need not form part of the final matched string; lookbehind
+ assertions and the \K escape sequence provide ways of inspecting char-
+ acters before the start of a matched string.
+
+ (2) The pattern contains one or more lookbehind assertions. This condi-
+ tion exists in case there is a lookbehind that inspects characters be-
+ fore the start of the match.
+
+ (3) There is a special case when the whole pattern can match an empty
+ string. When the starting point is at the end of the subject, the
+ empty string match is a possibility, and if PCRE2_PARTIAL_SOFT is set
+ and neither of the above conditions is true, it is returned. However,
+ because adding more characters might result in a non-empty match,
+ PCRE2_PARTIAL_HARD returns a partial match, which in this case means
+ "there is going to be a match at this point, but until some more char-
+ acters are added, we do not know if it will be an empty string or some-
+ thing longer".
PARTIAL MATCHING USING pcre2_match()
- A partial match occurs during a call to pcre2_match() when the end of
- the subject string is reached successfully, but matching cannot con-
- tinue because more characters are needed, and in addition, either at
- least one character in the subject has been inspected or the pattern
- contains a lookbehind, or (when PCRE2_PARTIAL_HARD is set) the pattern
- could match an empty string. An inspected character need not form part
- of the final matched string; lookbehind assertions and the \K escape
- sequence provide ways of inspecting characters before the start of a
- matched string.
-
- The three additional requirements define the cases where adding more
- characters to the existing subject may complete the same match that
- would occur if they had all been present in the first place. Without
- these conditions there would be a partial match of an empty string at
- the end of the subject for all unanchored patterns (and also for an-
- chored patterns if the subject itself is empty).
+ When a partial matching option is set, the result of calling
+ pcre2_match() can be one of the following:
+
+ A successful match
+ A complete match has been found, starting and ending within this sub-
+ ject.
+
+ PCRE2_ERROR_NOMATCH
+ No match can start anywhere in this subject.
+
+ PCRE2_ERROR_PARTIAL
+ Adding more characters may result in a complete match that uses one
+ or more characters from the end of this subject.
When a partial match is returned, the first two elements in the ovector
point to the portion of the subject that was matched, but the values in
@@ -5725,29 +5762,12 @@ PARTIAL MATCHING USING pcre2_match()
/abc\K123/
If it is matched against "456abc123xyz" the result is a complete match,
- and the ovector defines the matched string as "123", because \K resets
- the "start of match" point. However, if a partial match is requested
- and the subject string is "456abc12", a partial match is found for the
- string "abc12", because all these characters are needed for a subse-
+ and the ovector defines the matched string as "123", because \K resets
+ the "start of match" point. However, if a partial match is requested
+ and the subject string is "456abc12", a partial match is found for the
+ string "abc12", because all these characters are needed for a subse-
quent re-match with additional characters.
- What happens when a partial match is identified depends on which of the
- two partial matching options is set.
-
- PCRE2_PARTIAL_SOFT WITH pcre2_match()
-
- If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial
- match, the partial match is remembered, but matching continues as nor-
- mal, and other alternatives in the pattern are tried. If no complete
- match can be found, PCRE2_ERROR_PARTIAL is returned instead of
- PCRE2_ERROR_NOMATCH.
-
- This option is "soft" because it prefers a complete match over a par-
- tial match. All the various matching items in a pattern behave as if
- the subject string is potentially complete. For example, \z, \Z, and $
- match at the end of the subject, as normal, and for \b and \B the end
- of the subject is treated as a non-alphanumeric.
-
If there is more than one partial match, the first one that was found
provides the data that is returned. Consider this pattern:
@@ -5756,23 +5776,31 @@ PARTIAL MATCHING USING pcre2_match()
If this is matched against the subject string "abc123dog", both alter-
natives fail to match, but the end of the subject is reached during
matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
- and 9, identifying "123dog" as the first partial match that was found.
- (In this example, there are two partial matches, because "dog" on its
- own partially matches the second alternative.)
-
- PCRE2_PARTIAL_HARD WITH pcre2_match()
-
- If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
- returned as soon as a partial match is found, without continuing to
- search for possible complete matches. This option is "hard" because it
- prefers an earlier partial match over a later complete match. For this
- reason, the assumption is made that the end of the supplied subject
- string may not be the true end of the available data, and so, if \z,
- \Z, \b, \B, or $ are encountered at the end of the subject, the result
- is PCRE2_ERROR_PARTIAL, whether or not any characters have been in-
- spected.
+ and 9, identifying "123dog" as the first partial match. (In this exam-
+ ple, there are two partial matches, because "dog" on its own partially
+ matches the second alternative.)
- Comparing hard and soft partial matching
+ How a partial match is processed by pcre2_match()
+
+ What happens when a partial match is identified depends on which of the
+ two partial matching options is set.
+
+ If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon
+ as a partial match is found, without continuing to search for possible
+ complete matches. This option is "hard" because it prefers an earlier
+ partial match over a later complete match. For this reason, the assump-
+ tion is made that the end of the supplied subject string is not the
+ true end of the available data, which is why \z, \Z, \b, \B, and $ al-
+ ways give a partial match.
+
+ If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but
+ matching continues as normal, and other alternatives in the pattern are
+ tried. If no complete match can be found, PCRE2_ERROR_PARTIAL is re-
+ turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it
+ prefers a complete match over a partial match. All the various matching
+ items in a pattern behave as if the subject string is potentially com-
+ plete; \z, \Z, and $ match at the end of the subject, as normal, and
+ for \b and \B the end of the subject is treated as a non-alphanumeric.
The difference between the two partial matching options can be illus-
trated by a pattern such as:
@@ -5799,89 +5827,147 @@ PARTIAL MATCHING USING pcre2_match()
The second pattern will never match "dogsbody", because it will always
find the shorter match first.
+ Example of partial matching using pcre2test
-PARTIAL MATCHING USING pcre2_dfa_match()
+ The pcre2test data modifiers partial_hard (or ph) and partial_soft (or
+ ps) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively, when
+ calling pcre2_match(). Here is a run of pcre2test using a pattern that
+ matches the whole subject in the form of a date:
- The DFA functions move along the subject string character by character,
- without backtracking, searching for all possible matches simultane-
- ously. If the end of the subject is reached before the end of the pat-
- tern, there is the possibility of a partial match, again provided that
- at least one character has been inspected.
+ re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
+ data> 25dec3\=ph
+ Partial match: 25dec3
+ data> 3ju\=ph
+ Partial match: 3ju
+ data> 3juj\=ph
+ No match
- When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
- there have been no complete matches. Otherwise, the complete matches
- are returned. However, if PCRE2_PARTIAL_HARD is set, a partial match
- takes precedence over any complete matches. The portion of the string
- that was matched when the longest partial match was found is set as the
- first matching string.
+ This example gives the same results for both hard and soft partial
+ matching options. Here is an example where there is a difference:
+
+ re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
+ data> 25jun04\=ps
+ 0: 25jun04
+ 1: jun
+ data> 25jun04\=ph
+ Partial match: 25jun04
- Because the DFA functions always search for all possible matches, and
- there is no difference between greedy and ungreedy repetition, their
- behaviour is different from the standard functions when PCRE2_PAR-
- TIAL_HARD is set. Consider the string "dog" matched against the un-
- greedy pattern shown above:
+ With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
+ PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete,
+ so there is only a partial match.
- /dog(sbody)??/
- Whereas the standard function stops as soon as it finds the complete
- match for "dog", the DFA function also finds the partial match for
- "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
+MULTI-SEGMENT MATCHING WITH pcre2_match()
+ PCRE was not originally designed with multi-segment matching in mind.
+ However, over time, features (including partial matching) that make
+ multi-segment matching possible have been added. The string is searched
+ segment by segment by calling pcre2_match() repeatedly, with the aim of
+ achieving the same results that would happen if the entire string was
+ available for searching.
+
+ Special logic must be implemented to handle a matched substring that
+ spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it
+ returns a partial match at the end of a segment whenever there is the
+ possibility of changing the match by adding more characters. The
+ PCRE2_NOTBOL option should also be set for all but the first segment.
+
+ When a partial match occurs, the next segment must be added to the cur-
+ rent subject and the match re-run, using the startoffset argument of
+ pcre2_match() to begin at the point where the partial match started.
+ Multi-segment matching is usually used to search for substrings in the
+ middle of very long sequences, so the patterns are normally not an-
+ chored. For example:
+
+ re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
+ data> ...the date is 23ja\=ph
+ Partial match: 23ja
+ data> ...the date is 23jan19 and on that day...\=offset=15
+ 0: 23jan19
+ 1: jan
+
+ Note the use of the offset modifier to start the new match where the
+ partial match was found.
+
+ In this simple example, the next segment was just added to the one in
+ which the partial match was found. However, if there are memory con-
+ straints, it may be necessary to discard text that precedes the partial
+ match before adding the next segment. In cases such as the above, where
+ the pattern does not contain any lookbehinds, it is sufficient to re-
+ tain only the partially matched substring. However, if a pattern con-
+ tains a lookbehind assertion, characters that precede the start of the
+ partial match may have been inspected during the matching process.
+
+ The only lookbehind information that is available is the length of the
+ longest lookbehind in a pattern. This may not, of course, be at the
+ start of the pattern, but retaining that many characters before the
+ partial match is sufficient, if not always strictly necessary. The way
+ to do this is as follows:
-PARTIAL MATCHING AND WORD BOUNDARIES
+ Before doing any matching, find the length of the longest lookbehind in
+ the pattern by calling pcre2_pattern_info() with the
+ PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
+ characters, not code units. After a partial match, moving back from the
+ ovector[0] offset in the subject by the number of characters given for
+ the maximum lookbehind gets you to the earliest character that must be
+ retained. In a non-UTF or a 32-bit situation, moving back is just a
+ subtraction, but in UTF-8 or UTF-16 you have to count characters while
+ moving back through the code units. Characters before the point you
+ have now reached can be discarded.
+
+ For example, if the pattern "(?<=123)abc" is partially matched against
+ the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
+ mum lookbehind count is 3, so all characters before offset 2 can be
+ discarded. The value of startoffset for the next match should be 3.
+ When pcre2test displays a partial match, it indicates the lookbehind
+ characters with '<' characters if the allusedtext modifier is set:
- If a pattern ends with one of sequences \b or \B, which test for word
- boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-
- intuitive results. Consider this pattern:
+ re> "(?<=123)abc"
+ data> xx123ab\=ph,allusedtext
+ Partial match: 123ab
+ <<<
- /\bcat\b/
+ Note that the allusedtext modifier is not available for JIT matching,
+ because JIT matching does not maintain the first and last consulted
+ characters.
- This matches "cat", provided there is a word boundary at either end. If
- the subject string is "the cat", the comparison of the final "t" with a
- following character cannot take place, so a partial match is found.
- However, normal matching carries on, and \b matches at the end of the
- subject when the last character is a letter, so a complete match is
- found. The result, therefore, is not PCRE2_ERROR_PARTIAL. Using
- PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
- then the partial match takes precedence.
+PARTIAL MATCHING USING pcre2_dfa_match()
-EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST
+ The DFA function moves along the subject string character by character,
+ without backtracking, searching for all possible matches simultane-
+ ously. If the end of the subject is reached before the end of the pat-
+ tern, there is the possibility of a partial match.
- If the partial_soft (or ps) modifier is present on a pcre2test data
- line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a
- run of pcre2test that uses the date example quoted above:
+ When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
+ there have been no complete matches. Otherwise, the complete matches
+ are returned. If PCRE2_PARTIAL_HARD is set, a partial match takes
+ precedence over any complete matches. The portion of the string that
+ was matched when the longest partial match was found is set as the
+ first matching string.
- re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
- data> 25jun04\=ps
- 0: 25jun04
- 1: jun
- data> 25dec3\=ps
- Partial match: 23dec3
- data> 3ju\=ps
- Partial match: 3ju
- data> 3juj\=ps
- No match
- data> j\=ps
- No match
+ Because the DFA function always searches for all possible matches, and
+ there is no difference between greedy and ungreedy repetition, its be-
+ haviour is different from that of pcre2_match(). Consider the string "dog"
+ matched against this ungreedy pattern:
- The first data string is matched completely, so pcre2test shows the
- matched substrings. The remaining four strings do not match the com-
- plete pattern, but the first two are partial matches. Similar output is
- obtained if DFA matching is used.
+ /dog(sbody)??/
- If the partial_hard (or ph) modifier is present on a pcre2test data
- line, the PCRE2_PARTIAL_HARD option is set for the match.
+ Whereas the standard function stops as soon as it finds the complete
+ match for "dog", the DFA function also finds the partial match for
+ "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
- When a partial match has been found using a DFA matching function, it
- is possible to continue the match by providing additional subject data
- and calling the function again with the same compiled regular expres-
+ When a partial match has been found using the DFA matching function, it
+ is possible to continue the match by providing additional subject data
+ and calling the function again with the same compiled regular expres-
sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
same working space as before, because this is where details of the pre-
- vious partial match are stored. Here is an example using pcre2test:
+ vious partial match are stored. You can set the PCRE2_PARTIAL_SOFT or
+ PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART to continue partial
+ matching over multiple segments. Here is an example using pcre2test:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data> 23ja\=dfa,ps
@@ -5889,146 +5975,15 @@ MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
data> n05\=dfa,dfa_restart
0: n05
- The first call has "23ja" as the subject, and requests partial match-
- ing; the second call has "n05" as the subject for the continued
- (restarted) match. Notice that when the match is complete, only the
- last part is shown; PCRE2 does not retain the previously partially-
- matched string. It is up to the calling program to do that if it needs
- to.
-
- That means that, for an unanchored pattern, if a continued match fails,
- it is not possible to try again at a new starting point. All this fa-
- cility is capable of doing is continuing with the previous match at-
- tempt. In the previous example, if the second set of data is "ug23" the
- result is no match, even though there would be a match for "aug23" if
- the entire string were given at once. Depending on the application,
- this may or may not be what you want. The only way to allow for start-
- ing again at the next character is to retain the matched part of the
- subject and try a new complete match.
-
- You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
- PCRE2_DFA_RESTART to continue partial matching over multiple segments.
- This facility can be used to pass very long subject strings to the DFA
- matching functions.
-
-
-MULTI-SEGMENT MATCHING WITH pcre2_match()
-
- Unlike the DFA function, it is not possible to restart the previous
- match with a new segment of data when using pcre2_match(). Instead, new
- data must be added to the previous subject string, and the entire match
- re-run, starting from the point where the partial match occurred. Ear-
- lier data can be discarded.
-
- It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
- not treat the end of a segment as the end of the subject when matching
- \z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
- dates:
-
- re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
- data> The date is 23ja\=ph
- Partial match: 23ja
-
- At this stage, an application could discard the text preceding "23ja",
- add on text from the next segment, and call the matching function
- again. Unlike the DFA matching function, the entire matching string
- must always be available, and the complete matching process occurs for
- each call, so more memory and more processing time is needed.
-
-
-ISSUES WITH MULTI-SEGMENT MATCHING
-
- Certain types of pattern may give problems with multi-segment matching,
- whichever matching function is used.
-
- 1. If the pattern contains a test for the beginning of a line, you need
- to pass the PCRE2_NOTBOL option when the subject string for any call
- does start at the beginning of a line. There is also a PCRE2_NOTEOL op-
- tion, but in practice when doing multi-segment matching you should be
- using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.
-
- 2. If a pattern contains a lookbehind assertion, characters that pre-
- cede the start of the partial match may have been inspected during the
- matching process. When using pcre2_match(), sufficient characters must
- be retained for the next match attempt. You can ensure that enough
- characters are retained by doing the following:
-
- Before doing any matching, find the length of the longest lookbehind in
- the pattern by calling pcre2_pattern_info() with the
- PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
- characters, not code units. After a partial match, moving back from the
- ovector[0] offset in the subject by the number of characters given for
- the maximum lookbehind gets you to the earliest character that must be
- retained. In a non-UTF or a 32-bit situation, moving back is just a
- subtraction, but in UTF-8 or UTF-16 you have to count characters while
- moving back through the code units.
-
- Characters before the point you have now reached can be discarded, and
- after the next segment has been added to what is retained, you should
- run the next match with the startoffset argument set so that the match
- begins at the same point as before.
-
- For example, if the pattern "(?<=123)abc" is partially matched against
- the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
- mum lookbehind count is 3, so all characters before offset 2 can be
- discarded. The value of startoffset for the next match should be 3.
- When pcre2test displays a partial match, it indicates the lookbehind
- characters with '<' characters if the "allusedtext" modifier is set:
-
- re> "(?<=123)abc"
- data> xx123ab\=ph,allusedtext
- Partial match: 123ab
- <<< However, the "allusedtext" modifier is not avail-
- able for JIT matching, because JIT matching does not maintain the first
- and last consulted characters.
-
- 3. Matching a subject string that is split into multiple segments may
- not always produce exactly the same result as matching over one single
- long string when PCRE2_PARTIAL_SOFT is used. The section "Partial
- Matching and Word Boundaries" above describes an issue that arises if
- the pattern ends with \b or \B. Another kind of difference may occur
- when there are multiple matching possibilities, because (for PCRE2_PAR-
- TIAL_SOFT) a partial match result is given only when there are no com-
- pleted matches. This means that as soon as the shortest match has been
- found, continuation to a new subject segment is no longer possible.
- Consider this pcre2test example:
-
- re> /dog(sbody)?/
- data> dogsb\=ps
- 0: dog
- data> do\=ps,dfa
- Partial match: do
- data> gsb\=ps,dfa,dfa_restart
- 0: g
- data> dogsbody\=dfa
- 0: dogsbody
- 1: dog
-
- The first data line passes the string "dogsb" to a standard matching
- function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
- a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL,
- because the shorter string "dog" is a complete match. Similarly, when
- the subject is presented to a DFA matching function in several parts
- ("do" and "gsb" being the first two) the match stops when "dog" has
- been found, and it is not possible to continue. On the other hand, if
- "dogsbody" is presented as a single string, a DFA matching function
- finds both matches.
-
- Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
- matching multi-segment data. The example above then behaves differ-
- ently:
-
- re> /dog(sbody)?/
- data> dogsb\=ph
- Partial match: dogsb
- data> do\=ps,dfa
- Partial match: do
- data> gsb\=ph,dfa,dfa_restart
- Partial match: gsb
-
- 4. Patterns that contain alternatives at the top level which do not all
- start with the same pattern item may not work as expected when
- PCRE2_DFA_RESTART is used. For example, consider this pattern:
+ The first call has "23ja" as the subject, and requests partial match-
+ ing; the second call has "n05" as the subject for the continued
+ (restarted) match. Notice that when the match is complete, only the
+ last part is shown; PCRE2 does not retain the previously partially-
+ matched string. It is up to the calling program to do that if it needs
+ to. This means that, for an unanchored pattern, if a continued match
+ fails, it is not possible to try again at a new starting point. All
+ this facility is capable of doing is continuing with the previous match
+ attempt. For example, consider this pattern:
1234|3789
@@ -6037,29 +5992,16 @@ ISSUES WITH MULTI-SEGMENT MATCHING
the second alternative, because such a match does not start at the same
point in the subject string. Attempting to continue with the string
"7890" does not yield a match because only those alternatives that
- match at one point in the subject are remembered. The problem arises
- because the start of the second alternative matches within the first
- alternative. There is no problem with anchored patterns or patterns
- such as:
-
- 1234|ABCD
-
- where no string can be a partial match for both alternatives. This is
- not a problem if a standard matching function is used, because the en-
- tire match has to be rerun each time:
-
- re> /1234|3789/
- data> ABC123\=ph
- Partial match: 123
- data> 1237890
- 0: 3789
+ match at one point in the subject are remembered. Depending on the ap-
+ plication, this may or may not be what you want.
- Of course, instead of using PCRE2_DFA_RESTART, the same technique of
- re-running the entire match can also be used with the DFA matching
- function. Another possibility is to work with two buffers. If a partial
- match at offset n in the first buffer is followed by "no match" when
- PCRE2_DFA_RESTART is used on the second buffer, you can then try a new
- match starting at offset n+1 in the first buffer.
+ If you do want to allow for starting again at the next character, one
+ way of doing it is to retain the matched part of the segment and try a
+ new complete match, as described for pcre2_match() above. Another pos-
+ sibility is to work with two buffers. If a partial match at offset n in
+ the first buffer is followed by "no match" when PCRE2_DFA_RESTART is
+ used on the second buffer, you can then try a new match starting at
+ offset n+1 in the first buffer.
AUTHOR
@@ -6071,7 +6013,7 @@ AUTHOR
REVISION
- Last updated: 22 July 2019
+ Last updated: 07 August 2019
Copyright (c) 1997-2019 University of Cambridge.
------------------------------------------------------------------------------
diff --git a/doc/pcre2partial.3 b/doc/pcre2partial.3
index adb7814..92d5038 100644
--- a/doc/pcre2partial.3
+++ b/doc/pcre2partial.3
@@ -1,73 +1,107 @@
-.TH PCRE2PARTIAL 3 "22 July 2019" "PCRE2 10.34"
+.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
.SH NAME
PCRE2 - Perl-compatible regular expressions
.SH "PARTIAL MATCHING IN PCRE2"
.rs
.sp
-In normal use of PCRE2, if the subject string that is passed to a matching
-function matches as far as it goes, but is too short to match the entire
-pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it
-might be helpful to distinguish this case from other cases in which there is no
-match.
+In normal use of PCRE2, if there is a match up to the end of a subject string,
+but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
+is returned, just like any other failing match. There are circumstances where
+it might be helpful to distinguish this "partial match" case.
.P
-Consider, for example, an application where a human is required to type in data
-for a field with specific formatting requirements. An example might be a date
-in the form \fIddmmmyy\fP, defined by this pattern:
-.sp
- ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
-.sp
-If the application sees the user's keystrokes one by one, and can check that
-what has been typed so far is potentially valid, it is able to raise an error
-as soon as a mistake is made, by beeping and not reflecting the character that
-has been typed, for example. This immediate feedback is likely to be a better
-user interface than a check that is delayed until the entire string has been
-entered. Partial matching can also be useful when the subject string is very
-long and is not all available at once, as discussed below.
+One example is an application where the subject string is very long, and not
+all available at once. The requirement here is to be able to do the matching
+segment by segment, but special action is needed when a matched substring spans
+the boundary between two segments.
.P
-PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
-PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
-The difference between the two options is whether or not a partial match is
-preferred to an alternative complete match, though the details differ between
-the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD
-takes precedence.
+Another example is checking a user input string as it is typed, to ensure that
+it conforms to a required format. Invalid characters can be immediately
+diagnosed and rejected, giving instant feedback.
.P
-If you want to use partial matching with just-in-time optimized code, you must
-call \fBpcre2_jit_compile()\fP with one or both of these options:
+Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
+requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
+options when calling a matching function. The difference between the two
+options is whether or not a partial match is preferred to an alternative
+complete match, though the details differ between the two types of matching
+function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
+.P
+If you want to use partial matching with just-in-time optimized code, as well
+as setting a partial match option for the matching function, you must also call
+\fBpcre2_jit_compile()\fP with one or both of these options:
.sp
- PCRE2_JIT_PARTIAL_SOFT
PCRE2_JIT_PARTIAL_HARD
+ PCRE2_JIT_PARTIAL_SOFT
.sp
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
-matches on the same pattern. If the appropriate JIT mode has not been compiled,
-interpretive matching code is used.
+matches on the same pattern. Separate code is compiled for each mode. If the
+appropriate JIT mode has not been compiled, interpretive matching code is used.
.P
Setting a partial matching option disables two of PCRE2's standard
-optimizations. PCRE2 remembers the last literal code unit in a pattern, and
-abandons matching immediately if it is not present in the subject string. This
-optimization cannot be used for a subject string that might match only
-partially. PCRE2 also knows the minimum length of a matching string, and does
+optimization hints. PCRE2 remembers the last literal code unit in a pattern,
+and abandons matching immediately if it is not present in the subject string.
+This optimization cannot be used for a subject string that might match only
+partially. PCRE2 also remembers a minimum length of a matching string, and does
not bother to run the matching function on shorter strings. This optimization
is also disabled for partial matching.
.
.
-.SH "PARTIAL MATCHING USING pcre2_match()"
+.SH "REQUIREMENTS FOR A PARTIAL MATCH"
.rs
.sp
-A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
-subject string is reached successfully, but matching cannot continue because
-more characters are needed, and in addition, either at least one character in
-the subject has been inspected or the pattern contains a lookbehind, or (when
-PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
-inspected character need not form part of the final matched string; lookbehind
-assertions and the \eK escape sequence provide ways of inspecting characters
-before the start of a matched string.
+A possible partial match occurs during matching when the end of the subject
+string is reached successfully, but either more characters are needed to
+complete the match, or the addition of more characters might change what is
+matched.
+.P
+Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
+definitely needed to complete a match. In this case both hard and soft matching
+options yield a partial match.
+.P
+Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
+can be found, but the addition of more characters might change what is
+matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
+PCRE2_PARTIAL_SOFT returns the complete match.
+.P
+On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
+pattern item is \ez, \eZ, \eb, \eB, or $, there is always a partial match.
+Otherwise, for both options, the next pattern item must be one that inspects a
+character, and at least one of the following must be true:
+.P
+(1) At least one character has already been inspected. An inspected character
+need not form part of the final matched string; lookbehind assertions and the
+\eK escape sequence provide ways of inspecting characters before the start of a
+matched string.
.P
-The three additional requirements define the cases where adding more characters
-to the existing subject may complete the same match that would occur if they
-had all been present in the first place. Without these conditions there would
-be a partial match of an empty string at the end of the subject for all
-unanchored patterns (and also for anchored patterns if the subject itself is
-empty).
+(2) The pattern contains one or more lookbehind assertions. This condition
+exists in case there is a lookbehind that inspects characters before the start
+of the match.
+.P
+(3) There is a special case when the whole pattern can match an empty string.
+When the starting point is at the end of the subject, the empty string match is
+a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
+conditions is true, it is returned. However, because adding more characters
+might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
+which in this case means "there is going to be a match at this point, but until
+some more characters are added, we do not know if it will be an empty string or
+something longer".
+.
+.
+.
+.SH "PARTIAL MATCHING USING pcre2_match()"
+.rs
+.sp
+When a partial matching option is set, the result of calling
+\fBpcre2_match()\fP can be one of the following:
+.TP 2
+\fBA successful match\fP
+A complete match has been found, starting and ending within this subject.
+.TP
+\fBPCRE2_ERROR_NOMATCH\fP
+No match can start anywhere in this subject.
+.TP
+\fBPCRE2_ERROR_PARTIAL\fP
+Adding more characters may result in a complete match that uses one or more
+characters from the end of this subject.
.P
When a partial match is returned, the first two elements in the ovector point
to the portion of the subject that was matched, but the values in the rest of
@@ -83,24 +117,6 @@ is "456abc12", a partial match is found for the string "abc12", because all
these characters are needed for a subsequent re-match with additional
characters.
.P
-What happens when a partial match is identified depends on which of the two
-partial matching options is set.
-.
-.
-.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()"
-.rs
-.sp
-If PCRE2_PARTIAL_SOFT is set when \fBpcre2_match()\fP identifies a partial
-match, the partial match is remembered, but matching continues as normal, and
-other alternatives in the pattern are tried. If no complete match can be found,
-PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH.
-.P
-This option is "soft" because it prefers a complete match over a partial match.
-All the various matching items in a pattern behave as if the subject string is
-potentially complete. For example, \ez, \eZ, and $ match at the end of the
-subject, as normal, and for \eb and \eB the end of the subject is treated as a
-non-alphanumeric.
-.P
If there is more than one partial match, the first one that was found provides
the data that is returned. Consider this pattern:
.sp
@@ -109,27 +125,32 @@ the data that is returned. Consider this pattern:
If this is matched against the subject string "abc123dog", both alternatives
fail to match, but the end of the subject is reached during matching, so
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
-"123dog" as the first partial match that was found. (In this example, there are
-two partial matches, because "dog" on its own partially matches the second
-alternative.)
+"123dog" as the first partial match. (In this example, there are two partial
+matches, because "dog" on its own partially matches the second alternative.)
.
.
-.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()"
-.rs
-.sp
-If PCRE2_PARTIAL_HARD is set for \fBpcre2_match()\fP, PCRE2_ERROR_PARTIAL is
-returned as soon as a partial match is found, without continuing to search for
-possible complete matches. This option is "hard" because it prefers an earlier
-partial match over a later complete match. For this reason, the assumption is
-made that the end of the supplied subject string may not be the true end of the
-available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end
-of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any
-characters have been inspected.
-.
-.
-.SS "Comparing hard and soft partial matching"
+.SS "How a partial match is processed by pcre2_match()"
.rs
.sp
+What happens when a partial match is identified depends on which of the two
+partial matching options is set.
+.P
+If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
+partial match is found, without continuing to search for possible complete
+matches. This option is "hard" because it prefers an earlier partial match over
+a later complete match. For this reason, the assumption is made that the end of
+the supplied subject string is not the true end of the available data, which is
+why \ez, \eZ, \eb, \eB, and $ always give a partial match.
+.P
+If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
+continues as normal, and other alternatives in the pattern are tried. If no
+complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
+PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
+over a partial match. All the various matching items in a pattern behave as if
+the subject string is potentially complete; \ez, \eZ, and $ match at the end of
+the subject, as normal, and for \eb and \eB the end of the subject is treated
+as a non-alphanumeric.
+.P
The difference between the two partial matching options can be illustrated by a
pattern such as:
.sp
@@ -154,157 +175,83 @@ The second pattern will never match "dogsbody", because it will always find the
shorter match first.
.
.
-.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
+.SS "Example of partial matching using pcre2test"
.rs
.sp
-The DFA functions move along the subject string character by character, without
-backtracking, searching for all possible matches simultaneously. If the end of
-the subject is reached before the end of the pattern, there is the possibility
-of a partial match, again provided that at least one character has been
-inspected.
-.P
-When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
-have been no complete matches. Otherwise, the complete matches are returned.
-However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over
-any complete matches. The portion of the string that was matched when the
-longest partial match was found is set as the first matching string.
-.P
-Because the DFA functions always search for all possible matches, and there is
-no difference between greedy and ungreedy repetition, their behaviour is
-different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider
-the string "dog" matched against the ungreedy pattern shown above:
-.sp
- /dog(sbody)??/
-.sp
-Whereas the standard function stops as soon as it finds the complete match for
-"dog", the DFA function also finds the partial match for "dogsbody", and so
-returns that when PCRE2_PARTIAL_HARD is set.
-.
-.
-.SH "PARTIAL MATCHING AND WORD BOUNDARIES"
-.rs
-.sp
-If a pattern ends with one of sequences \eb or \eB, which test for word
-boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive
-results. Consider this pattern:
-.sp
- /\ebcat\eb/
-.sp
-This matches "cat", provided there is a word boundary at either end. If the
-subject string is "the cat", the comparison of the final "t" with a following
-character cannot take place, so a partial match is found. However, normal
-matching carries on, and \eb matches at the end of the subject when the last
-character is a letter, so a complete match is found. The result, therefore, is
-\fInot\fP PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield
-PCRE2_ERROR_PARTIAL, because then the partial match takes precedence.
-.
-.
-.SH "EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST"
-.rs
-.sp
-If the \fBpartial_soft\fP (or \fBps\fP) modifier is present on a
-\fBpcre2test\fP data line, the PCRE2_PARTIAL_SOFT option is used for the match.
-Here is a run of \fBpcre2test\fP that uses the date example quoted above:
+The \fBpcre2test\fP data modifiers \fBpartial_hard\fP (or \fBph\fP) and
+\fBpartial_soft\fP (or \fBps\fP) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
+respectively, when calling \fBpcre2_match()\fP. Here is a run of
+\fBpcre2test\fP using a pattern that matches the whole subject in the form of a
+date:
.sp
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
- data> 25jun04\e=ps
- 0: 25jun04
- 1: jun
- data> 25dec3\e=ps
+ data> 25dec3\e=ph
Partial match: 25dec3
- data> 3ju\e=ps
+ data> 3ju\e=ph
Partial match: 3ju
- data> 3juj\e=ps
- No match
- data> j\e=ps
+ data> 3juj\e=ph
No match
.sp
-The first data string is matched completely, so \fBpcre2test\fP shows the
-matched substrings. The remaining four strings do not match the complete
-pattern, but the first two are partial matches. Similar output is obtained
-if DFA matching is used.
-.P
-If the \fBpartial_hard\fP (or \fBph\fP) modifier is present on a
-\fBpcre2test\fP data line, the PCRE2_PARTIAL_HARD option is set for the match.
-.
-.
-.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()"
-.rs
-.sp
-When a partial match has been found using a DFA matching function, it is
-possible to continue the match by providing additional subject data and calling
-the function again with the same compiled regular expression, this time setting
-the PCRE2_DFA_RESTART option. You must pass the same working space as before,
-because this is where details of the previous partial match are stored. Here is
-an example using \fBpcre2test\fP:
+This example gives the same results for both hard and soft partial matching
+options. Here is an example where there is a difference:
.sp
re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
- data> 23ja\e=dfa,ps
- Partial match: 23ja
- data> n05\e=dfa,dfa_restart
- 0: n05
-.sp
-The first call has "23ja" as the subject, and requests partial matching; the
-second call has "n05" as the subject for the continued (restarted) match.
-Notice that when the match is complete, only the last part is shown; PCRE2 does
-not retain the previously partially-matched string. It is up to the calling
-program to do that if it needs to.
-.P
-That means that, for an unanchored pattern, if a continued match fails, it is
-not possible to try again at a new starting point. All this facility is capable
-of doing is continuing with the previous match attempt. In the previous
-example, if the second set of data is "ug23" the result is no match, even
-though there would be a match for "aug23" if the entire string were given at
-once. Depending on the application, this may or may not be what you want.
-The only way to allow for starting again at the next character is to retain the
-matched part of the subject and try a new complete match.
-.P
-You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
-PCRE2_DFA_RESTART to continue partial matching over multiple segments. This
-facility can be used to pass very long subject strings to the DFA matching
-functions.
+ data> 25jun04\e=ps
+ 0: 25jun04
+ 1: jun
+ data> 25jun04\e=ph
+ Partial match: 25jun04
+.sp
+With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
+PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
+there is only a partial match.
+.
.
.
.SH "MULTI-SEGMENT MATCHING WITH pcre2_match()"
.rs
.sp
-Unlike the DFA function, it is not possible to restart the previous match with
-a new segment of data when using \fBpcre2_match()\fP. Instead, new data must be
-added to the previous subject string, and the entire match re-run, starting
-from the point where the partial match occurred. Earlier data can be discarded.
+PCRE was not originally designed with multi-segment matching in mind. However,
+over time, features (including partial matching) that make multi-segment
+matching possible have been added. The string is searched segment by segment by
+calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same
+results that would be obtained if the entire string were available for searching.
+.P
+Special logic must be implemented to handle a matched substring that spans a
+segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
+partial match at the end of a segment whenever there is the possibility of
+changing the match by adding more characters. The PCRE2_NOTBOL option should
+also be set for all but the first segment.
.P
-It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not
-treat the end of a segment as the end of the subject when matching \ez, \eZ,
-\eb, \eB, and $. Consider an unanchored pattern that matches dates:
+When a partial match occurs, the next segment must be added to the current
+subject and the match re-run, using the \fIstartoffset\fP argument of
+\fBpcre2_match()\fP to begin at the point where the partial match started.
+Multi-segment matching is usually used to search for substrings in the middle
+of very long sequences, so the patterns are normally not anchored. For example:
.sp
re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
- data> The date is 23ja\e=ph
+ data> ...the date is 23ja\e=ph
Partial match: 23ja
+ data> ...the date is 23jan19 and on that day...\e=offset=15
+ 0: 23jan19
+ 1: jan
.sp
-At this stage, an application could discard the text preceding "23ja", add on
-text from the next segment, and call the matching function again. Unlike the
-DFA matching function, the entire matching string must always be available,
-and the complete matching process occurs for each call, so more memory and more
-processing time is needed.
-.
-.
-.SH "ISSUES WITH MULTI-SEGMENT MATCHING"
-.rs
-.sp
-Certain types of pattern may give problems with multi-segment matching,
-whichever matching function is used.
+Note the use of the \fBoffset\fP modifier to start the new match where the
+partial match was found.
.P
-1. If the pattern contains a test for the beginning of a line, you need to pass
-the PCRE2_NOTBOL option when the subject string for any call does start at the
-beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when
-doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which
-includes the effect of PCRE2_NOTEOL.
+In this simple example, the next segment was just added to the one in which the
+partial match was found. However, if there are memory constraints, it may be
+necessary to discard text that precedes the partial match before adding the
+next segment. In cases such as the above, where the pattern does not contain
+any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if a pattern contains a lookbehind assertion, characters
+that precede the start of the partial match may have been inspected during the
+matching process.
.P
-2. If a pattern contains a lookbehind assertion, characters that precede the
-start of the partial match may have been inspected during the matching process.
-When using \fBpcre2_match()\fP, sufficient characters must be retained for the
-next match attempt. You can ensure that enough characters are retained by doing
-the following:
+The only lookbehind information that is available is the length of the longest
+lookbehind in a pattern. This may not, of course, be at the start of the
+pattern, but retaining that many characters before the partial match is
+sufficient, if not always strictly necessary. The way to do this is as follows:
.P
Before doing any matching, find the length of the longest lookbehind in the
pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
@@ -313,71 +260,78 @@ partial match, moving back from the ovector[0] offset in the subject by the
number of characters given for the maximum lookbehind gets you to the earliest
character that must be retained. In a non-UTF or a 32-bit situation, moving
back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units.
-.P
-Characters before the point you have now reached can be discarded, and after
-the next segment has been added to what is retained, you should run the next
-match with the \fBstartoffset\fP argument set so that the match begins at the
-same point as before.
+while moving back through the code units. Characters before the point you have
+now reached can be discarded.
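The backward step described above can be sketched in C. This is a minimal illustration, not code from PCRE2: the helper name retain_from_utf8 is hypothetical, and it assumes the 8-bit library with a subject that is valid UTF-8. The max_lookbehind argument is assumed to be the character count that pcre2_pattern_info() reports for PCRE2_INFO_MAXLOOKBEHIND, and match_start is assumed to be ovector[0] from the partial match.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE2 API.
 *
 * Returns the offset (in code units) of the earliest code unit that must be
 * retained before the next match attempt. It moves back from match_start by
 * up to max_lookbehind characters, counting whole UTF-8 characters rather
 * than code units, and stops early at the start of the subject.
 *
 * Assumes that subject holds valid UTF-8 and that match_start lies on a
 * character boundary. */
static size_t retain_from_utf8(const uint8_t *subject, size_t match_start,
                               size_t max_lookbehind)
{
    size_t pos = match_start;

    assert(subject != NULL);

    while (max_lookbehind > 0 && pos > 0) {
        /* Step onto the last code unit of the previous character. */
        pos--;
        /* Continuation bytes have the form 10xxxxxx; keep moving back
         * until the first byte of the character is reached. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

Code units before the returned offset can be discarded, and the start offset for the next match is then match_start minus the returned value. With the subject "xx123ab", a partial match starting at offset 5, and a maximum lookbehind of 3, the helper returns 2, which gives a start offset of 3 for the next match. In a non-UTF or 32-bit situation the loop is unnecessary and a plain subtraction, clamped at zero, does the same job.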
.P
For example, if the pattern "(?<=123)abc" is partially matched against the
string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
lookbehind count is 3, so all characters before offset 2 can be discarded. The
value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
displays a partial match, it indicates the lookbehind characters with '<'
-characters if the "allusedtext" modifier is set:
+characters if the \fBallusedtext\fP modifier is set:
.sp
re> "(?<=123)abc"
data> xx123ab\e=ph,allusedtext
Partial match: 123ab
<<<
-However, the "allusedtext" modifier is not available for JIT matching, because
-JIT matching does not maintain the first and last consulted characters.
+.sp
+Note that the \fBallusedtext\fP modifier is not available for JIT matching,
+because JIT matching does not maintain the first and last consulted characters.
+.
+.
+.
+.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
+.rs
+.sp
+The DFA function moves along the subject string character by character, without
+backtracking, searching for all possible matches simultaneously. If the end of
+the subject is reached before the end of the pattern, there is the possibility
+of a partial match.
.P
-3. Matching a subject string that is split into multiple segments may not
-always produce exactly the same result as matching over one single long string
-when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
-Boundaries" above describes an issue that arises if the pattern ends with \eb
-or \eB. Another kind of difference may occur when there are multiple matching
-possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
-only when there are no completed matches. This means that as soon as the
-shortest match has been found, continuation to a new subject segment is no
-longer possible. Consider this \fBpcre2test\fP example:
-.sp
- re> /dog(sbody)?/
- data> dogsb\e=ps
- 0: dog
- data> do\e=ps,dfa
- Partial match: do
- data> gsb\e=ps,dfa,dfa_restart
- 0: g
- data> dogsbody\e=dfa
- 0: dogsbody
- 1: dog
-.sp
-The first data line passes the string "dogsb" to a standard matching function,
-setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match
-for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter
-string "dog" is a complete match. Similarly, when the subject is presented to
-a DFA matching function in several parts ("do" and "gsb" being the first two)
-the match stops when "dog" has been found, and it is not possible to continue.
-On the other hand, if "dogsbody" is presented as a single string, a DFA
-matching function finds both matches.
+When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
+have been no complete matches. Otherwise, the complete matches are returned.
+If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
+complete matches. The portion of the string that was matched when the longest
+partial match was found is set as the first matching string.
.P
-Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching
-multi-segment data. The example above then behaves differently:
+Because the DFA function always searches for all possible matches, and there is
+no difference between greedy and ungreedy repetition, its behaviour is
+different from that of \fBpcre2_match()\fP. Consider the string "dog" matched
+against this ungreedy pattern:
+.sp
+ /dog(sbody)??/
+.sp
+Whereas the standard function stops as soon as it finds the complete match for
+"dog", the DFA function also finds the partial match for "dogsbody", and so
+returns that when PCRE2_PARTIAL_HARD is set.
+.
+.
+.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()"
+.rs
.sp
- re> /dog(sbody)?/
- data> dogsb\e=ph
- Partial match: dogsb
- data> do\e=ps,dfa
- Partial match: do
- data> gsb\e=ph,dfa,dfa_restart
- Partial match: gsb
+When a partial match has been found using the DFA matching function, it is
+possible to continue the match by providing additional subject data and calling
+the function again with the same compiled regular expression, this time setting
+the PCRE2_DFA_RESTART option. You must pass the same working space as before,
+because this is where details of the previous partial match are stored. You can
+set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
+to continue partial matching over multiple segments. Here is an example using
+\fBpcre2test\fP:
.sp
-4. Patterns that contain alternatives at the top level which do not all start
-with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
-used. For example, consider this pattern:
+ re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
+ data> 23ja\e=dfa,ps
+ Partial match: 23ja
+ data> n05\e=dfa,dfa_restart
+ 0: n05
+.sp
+The first call has "23ja" as the subject, and requests partial matching; the
+second call has "n05" as the subject for the continued (restarted) match.
+Notice that when the match is complete, only the last part is shown; PCRE2 does
+not retain the previously partially-matched string. It is up to the calling
+program to do that if it needs to. This means that, for an unanchored pattern,
+if a continued match fails, it is not possible to try again at a new starting
+point. All this facility is capable of doing is continuing with the previous
+match attempt. For example, consider this pattern:
.sp
1234|3789
.sp
@@ -386,28 +340,15 @@ alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "7890" does not yield a
match because only those alternatives that match at one point in the subject
-are remembered. The problem arises because the start of the second alternative
-matches within the first alternative. There is no problem with anchored
-patterns or patterns such as:
-.sp
- 1234|ABCD
-.sp
-where no string can be a partial match for both alternatives. This is not a
-problem if a standard matching function is used, because the entire match has
-to be rerun each time:
-.sp
- re> /1234|3789/
- data> ABC123\e=ph
- Partial match: 123
- data> 1237890
- 0: 3789
-.sp
-Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running
-the entire match can also be used with the DFA matching function. Another
-possibility is to work with two buffers. If a partial match at offset \fIn\fP
-in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on
-the second buffer, you can then try a new match starting at offset \fIn+1\fP in
-the first buffer.
+are remembered. Depending on the application, this may or may not be what you
+want.
+.P
+If you do want to allow for starting again at the next character, one way of
+doing it is to retain the matched part of the segment and try a new complete
+match, as described for \fBpcre2_match()\fP above. Another possibility is to
+work with two buffers. If a partial match at offset \fIn\fP in the first buffer
+is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
+you can then try a new match starting at offset \fIn+1\fP in the first buffer.
.
.
.SH AUTHOR
@@ -424,6 +365,6 @@ Cambridge, England.
.rs
.sp
.nf
-Last updated: 22 July 2019
+Last updated: 07 August 2019
Copyright (c) 1997-2019 University of Cambridge.
.fi