Partial match documentation rewritten.

git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@1156 6239d852-aaf2-0410-a92c-79f79f948069
author: ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2019-08-07 17:21:02 +0000
committer: ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> 2019-08-07 17:21:02 +0000
commit: c277d094e4d01ae9afad8bdd4d7537033a695a4f (patch)
tree: be95494defe9a921ff3a45c3ffe166b44bc6b1b2 /doc/pcre2partial.3
parent: c54035e9187b182c5de4cd73c425a2360b9f5878 (diff)
download: pcre2-c277d094e4d01ae9afad8bdd4d7537033a695a4f.tar.gz
1 files changed, 227 insertions, 286 deletions
diff --git a/doc/pcre2partial.3 b/doc/pcre2partial.3
index adb7814..92d5038 100644
--- a/doc/pcre2partial.3
+++ b/doc/pcre2partial.3
@@ -1,73 +1,107 @@
-.TH PCRE2PARTIAL 3 "22 July 2019" "PCRE2 10.34"
+.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions
 .SH "PARTIAL MATCHING IN PCRE2"
 .rs
 .sp
-In normal use of PCRE2, if the subject string that is passed to a matching
-function matches as far as it goes, but is too short to match the entire
-pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it
-might be helpful to distinguish this case from other cases in which there is no
-match.
+In normal use of PCRE2, if there is a match up to the end of a subject string,
+but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
+is returned, just like any other failing match. There are circumstances where
+it might be helpful to distinguish this "partial match" case.
 .P
-Consider, for example, an application where a human is required to type in data
-for a field with specific formatting requirements. An example might be a date
-in the form \fIddmmmyy\fP, defined by this pattern:
-.sp
-  ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
-.sp
-If the application sees the user's keystrokes one by one, and can check that
-what has been typed so far is potentially valid, it is able to raise an error
-as soon as a mistake is made, by beeping and not reflecting the character that
-has been typed, for example. This immediate feedback is likely to be a better
-user interface than a check that is delayed until the entire string has been
-entered. Partial matching can also be useful when the subject string is very
-long and is not all available at once, as discussed below.
+One example is an application where the subject string is very long, and not
+all available at once. The requirement here is to be able to do the matching
+segment by segment, but special action is needed when a matched substring spans
+the boundary between two segments.
 .P
-PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
-PCRE2_PARTIAL_HARD options, which can be set when calling a matching function.
-The difference between the two options is whether or not a partial match is
-preferred to an alternative complete match, though the details differ between
-the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD
-takes precedence.
+Another example is checking a user input string as it is typed, to ensure that
+it conforms to a required format. Invalid characters can be immediately
+diagnosed and rejected, giving instant feedback.
 .P
-If you want to use partial matching with just-in-time optimized code, you must
-call \fBpcre2_jit_compile()\fP with one or both of these options:
+Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
+requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
+options when calling a matching function. The difference between the two
+options is whether or not a partial match is preferred to an alternative
+complete match, though the details differ between the two types of matching
+function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
+.P
+If you want to use partial matching with just-in-time optimized code, as well 
+as setting a partial match option for the matching function, you must also call
+\fBpcre2_jit_compile()\fP with one or both of these options:
 .sp
-  PCRE2_JIT_PARTIAL_SOFT
   PCRE2_JIT_PARTIAL_HARD
+  PCRE2_JIT_PARTIAL_SOFT
 .sp
 PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
-matches on the same pattern. If the appropriate JIT mode has not been compiled,
-interpretive matching code is used.
+matches on the same pattern. Separate code is compiled for each mode. If the
+appropriate JIT mode has not been compiled, interpretive matching code is used.
 .P
 Setting a partial matching option disables two of PCRE2's standard
-optimizations. PCRE2 remembers the last literal code unit in a pattern, and
-abandons matching immediately if it is not present in the subject string. This
-optimization cannot be used for a subject string that might match only
-partially. PCRE2 also knows the minimum length of a matching string, and does
+optimization hints. PCRE2 remembers the last literal code unit in a pattern,
+and abandons matching immediately if it is not present in the subject string.
+This optimization cannot be used for a subject string that might match only
+partially. PCRE2 also remembers a minimum length of a matching string, and does
 not bother to run the matching function on shorter strings. This optimization
 is also disabled for partial matching.
 .
 .
-.SH "PARTIAL MATCHING USING pcre2_match()"
+.SH "REQUIREMENTS FOR A PARTIAL MATCH"
 .rs
 .sp
-A partial match occurs during a call to \fBpcre2_match()\fP when the end of the
-subject string is reached successfully, but matching cannot continue because
-more characters are needed, and in addition, either at least one character in
-the subject has been inspected or the pattern contains a lookbehind, or (when 
-PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An
-inspected character need not form part of the final matched string; lookbehind
-assertions and the \eK escape sequence provide ways of inspecting characters
-before the start of a matched string.
+A possible partial match occurs during matching when the end of the subject
+string is reached successfully, but either more characters are needed to
+complete the match, or the addition of more characters might change what is
+matched.
+.P
+Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
+definitely needed to complete a match. In this case both hard and soft matching
+options yield a partial match.
+.P
+Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
+can be found, but the addition of more characters might change what is
+matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
+PCRE2_PARTIAL_SOFT returns the complete match.
+.P
+On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
+pattern item is \ez, \eZ, \eb, \eB, or $ there is always a partial match.
+Otherwise, for both options, the next pattern item must be one that inspects a
+character, and at least one of the following must be true:
+.P
+(1) At least one character has already been inspected. An inspected character
+need not form part of the final matched string; lookbehind assertions and the
+\eK escape sequence provide ways of inspecting characters before the start of a
+matched string.
 .P
-The three additional requirements define the cases where adding more characters
-to the existing subject may complete the same match that would occur if they
-had all been present in the first place. Without these conditions there would
-be a partial match of an empty string at the end of the subject for all
-unanchored patterns (and also for anchored patterns if the subject itself is
-empty).
+(2) The pattern contains one or more lookbehind assertions. This condition
+exists in case there is a lookbehind that inspects characters before the start 
+of the match.
+.P
+(3) There is a special case when the whole pattern can match an empty string.
+When the starting point is at the end of the subject, the empty string match is
+a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
+conditions is true, it is returned. However, because adding more characters
+might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
+which in this case means "there is going to be a match at this point, but until
+some more characters are added, we do not know if it will be an empty string or
+something longer".
+.
+.
+.
+.SH "PARTIAL MATCHING USING pcre2_match()"
+.rs
+.sp
+When a partial matching option is set, the result of calling
+\fBpcre2_match()\fP can be one of the following:
+.TP 2
+\fBA successful match\fP
+A complete match has been found, starting and ending within this subject.
+.TP
+\fBPCRE2_ERROR_NOMATCH\fP
+No match can start anywhere in this subject.
+.TP
+\fBPCRE2_ERROR_PARTIAL\fP
+Adding more characters may result in a complete match that uses one or more
+characters from the end of this subject.
 .P
 When a partial match is returned, the first two elements in the ovector point
 to the portion of the subject that was matched, but the values in the rest of
@@ -83,24 +117,6 @@ is "456abc12", a partial match is found for the string "abc12", because all
 these characters are needed for a subsequent re-match with additional
 characters.
 .P
-What happens when a partial match is identified depends on which of the two
-partial matching options is set.
-.
-.
-.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()"
-.rs
-.sp
-If PCRE2_PARTIAL_SOFT is set when \fBpcre2_match()\fP identifies a partial
-match, the partial match is remembered, but matching continues as normal, and
-other alternatives in the pattern are tried. If no complete match can be found,
-PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH.
-.P
-This option is "soft" because it prefers a complete match over a partial match.
-All the various matching items in a pattern behave as if the subject string is
-potentially complete. For example, \ez, \eZ, and $ match at the end of the
-subject, as normal, and for \eb and \eB the end of the subject is treated as a
-non-alphanumeric.
-.P
 If there is more than one partial match, the first one that was found provides
 the data that is returned. Consider this pattern:
 .sp
@@ -109,27 +125,32 @@ the data that is returned. Consider this pattern:
 If this is matched against the subject string "abc123dog", both alternatives
 fail to match, but the end of the subject is reached during matching, so
 PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
-"123dog" as the first partial match that was found. (In this example, there are
-two partial matches, because "dog" on its own partially matches the second
-alternative.)
+"123dog" as the first partial match. (In this example, there are two partial
+matches, because "dog" on its own partially matches the second alternative.)
 .
 .
-.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()"
-.rs
-.sp
-If PCRE2_PARTIAL_HARD is set for \fBpcre2_match()\fP, PCRE2_ERROR_PARTIAL is
-returned as soon as a partial match is found, without continuing to search for
-possible complete matches. This option is "hard" because it prefers an earlier
-partial match over a later complete match. For this reason, the assumption is
-made that the end of the supplied subject string may not be the true end of the
-available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end
-of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any 
-characters have been inspected.
-.
-.
-.SS "Comparing hard and soft partial matching"
+.SS "How a partial match is processed by pcre2_match()"
 .rs
 .sp
+What happens when a partial match is identified depends on which of the two
+partial matching options is set.
+.P
+If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
+partial match is found, without continuing to search for possible complete
+matches. This option is "hard" because it prefers an earlier partial match over
+a later complete match. For this reason, the assumption is made that the end of
+the supplied subject string is not the true end of the available data, which is 
+why \ez, \eZ, \eb, \eB, and $ always give a partial match.
+.P
+If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
+continues as normal, and other alternatives in the pattern are tried. If no
+complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
+PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
+over a partial match. All the various matching items in a pattern behave as if
+the subject string is potentially complete; \ez, \eZ, and $ match at the end of
+the subject, as normal, and for \eb and \eB the end of the subject is treated
+as a non-alphanumeric.
+.P
 The difference between the two partial matching options can be illustrated by a
 pattern such as:
 .sp
@@ -154,157 +175,83 @@ The second pattern will never match "dogsbody", because it will always find the
 shorter match first.
 .
 .
-.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
+.SS "Example of partial matching using pcre2test"
 .rs
 .sp
-The DFA functions move along the subject string character by character, without
-backtracking, searching for all possible matches simultaneously. If the end of
-the subject is reached before the end of the pattern, there is the possibility
-of a partial match, again provided that at least one character has been
-inspected.
-.P
-When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
-have been no complete matches. Otherwise, the complete matches are returned.
-However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over
-any complete matches. The portion of the string that was matched when the
-longest partial match was found is set as the first matching string.
-.P
-Because the DFA functions always search for all possible matches, and there is
-no difference between greedy and ungreedy repetition, their behaviour is
-different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider
-the string "dog" matched against the ungreedy pattern shown above:
-.sp
-  /dog(sbody)??/
-.sp
-Whereas the standard function stops as soon as it finds the complete match for
-"dog", the DFA function also finds the partial match for "dogsbody", and so
-returns that when PCRE2_PARTIAL_HARD is set.
-.
-.
-.SH "PARTIAL MATCHING AND WORD BOUNDARIES"
-.rs
-.sp
-If a pattern ends with one of sequences \eb or \eB, which test for word
-boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive
-results. Consider this pattern:
-.sp
-  /\ebcat\eb/
-.sp
-This matches "cat", provided there is a word boundary at either end. If the
-subject string is "the cat", the comparison of the final "t" with a following
-character cannot take place, so a partial match is found. However, normal
-matching carries on, and \eb matches at the end of the subject when the last
-character is a letter, so a complete match is found. The result, therefore, is
-\fInot\fP PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield
-PCRE2_ERROR_PARTIAL, because then the partial match takes precedence.
-.
-.
-.SH "EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST"
-.rs
-.sp
-If the \fBpartial_soft\fP (or \fBps\fP) modifier is present on a
-\fBpcre2test\fP data line, the PCRE2_PARTIAL_SOFT option is used for the match.
-Here is a run of \fBpcre2test\fP that uses the date example quoted above:
+The \fBpcre2test\fP data modifiers \fBpartial_hard\fP (or \fBph\fP) and
+\fBpartial_soft\fP (or \fBps\fP) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
+respectively, when calling \fBpcre2_match()\fP. Here is a run of
+\fBpcre2test\fP using a pattern that matches the whole subject in the form of a
+date:
 .sp
     re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
-  data> 25jun04\e=ps
-   0: 25jun04
-   1: jun
-  data> 25dec3\e=ps
+  data> 25dec3\e=ph
   Partial match: 25dec3
-  data> 3ju\e=ps
+  data> 3ju\e=ph
   Partial match: 3ju
-  data> 3juj\e=ps
-  No match
-  data> j\e=ps
+  data> 3juj\e=ph
   No match
 .sp
-The first data string is matched completely, so \fBpcre2test\fP shows the
-matched substrings. The remaining four strings do not match the complete
-pattern, but the first two are partial matches. Similar output is obtained
-if DFA matching is used.
-.P
-If the \fBpartial_hard\fP (or \fBph\fP) modifier is present on a
-\fBpcre2test\fP data line, the PCRE2_PARTIAL_HARD option is set for the match.
-.
-.
-.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()"
-.rs
-.sp
-When a partial match has been found using a DFA matching function, it is
-possible to continue the match by providing additional subject data and calling
-the function again with the same compiled regular expression, this time setting
-the PCRE2_DFA_RESTART option. You must pass the same working space as before,
-because this is where details of the previous partial match are stored. Here is
-an example using \fBpcre2test\fP:
+This example gives the same results for both hard and soft partial matching 
+options. Here is an example where there is a difference:
 .sp
     re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
-  data> 23ja\e=dfa,ps
-  Partial match: 23ja
-  data> n05\e=dfa,dfa_restart
-   0: n05
-.sp
-The first call has "23ja" as the subject, and requests partial matching; the
-second call has "n05" as the subject for the continued (restarted) match.
-Notice that when the match is complete, only the last part is shown; PCRE2 does
-not retain the previously partially-matched string. It is up to the calling
-program to do that if it needs to.
-.P
-That means that, for an unanchored pattern, if a continued match fails, it is
-not possible to try again at a new starting point. All this facility is capable
-of doing is continuing with the previous match attempt. In the previous
-example, if the second set of data is "ug23" the result is no match, even
-though there would be a match for "aug23" if the entire string were given at
-once. Depending on the application, this may or may not be what you want.
-The only way to allow for starting again at the next character is to retain the
-matched part of the subject and try a new complete match.
-.P
-You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
-PCRE2_DFA_RESTART to continue partial matching over multiple segments. This
-facility can be used to pass very long subject strings to the DFA matching
-functions.
+  data> 25jun04\e=ps
+   0: 25jun04
+   1: jun
+  data> 25jun04\e=ph
+  Partial match: 25jun04 
+.sp    
+With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
+PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
+there is only a partial match.
+.
 .
 .
 .SH "MULTI-SEGMENT MATCHING WITH pcre2_match()"
 .rs
 .sp
-Unlike the DFA function, it is not possible to restart the previous match with
-a new segment of data when using \fBpcre2_match()\fP. Instead, new data must be
-added to the previous subject string, and the entire match re-run, starting
-from the point where the partial match occurred. Earlier data can be discarded.
+PCRE was not originally designed with multi-segment matching in mind. However,
+over time, features (including partial matching) that make multi-segment
+matching possible have been added. The string is searched segment by segment by
+calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same
+results that would occur if the entire string were available for searching.
+.P
+Special logic must be implemented to handle a matched substring that spans a
+segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
+partial match at the end of a segment whenever there is the possibility of
+changing the match by adding more characters. The PCRE2_NOTBOL option should
+also be set for all but the first segment.
 .P
-It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not
-treat the end of a segment as the end of the subject when matching \ez, \eZ,
-\eb, \eB, and $. Consider an unanchored pattern that matches dates:
+When a partial match occurs, the next segment must be added to the current 
+subject and the match re-run, using the \fIstartoffset\fP argument of 
+\fBpcre2_match()\fP to begin at the point where the partial match started.
+Multi-segment matching is usually used to search for substrings in the middle
+of very long sequences, so the patterns are normally not anchored. For example:
 .sp
     re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
-  data> The date is 23ja\e=ph
+  data> ...the date is 23ja\e=ph
   Partial match: 23ja
+  data> ...the date is 23jan19 and on that day...\e=offset=15
+   0: 23jan19
+   1: jan
 .sp
-At this stage, an application could discard the text preceding "23ja", add on
-text from the next segment, and call the matching function again. Unlike the
-DFA matching function, the entire matching string must always be available,
-and the complete matching process occurs for each call, so more memory and more
-processing time is needed.
-.
-.
-.SH "ISSUES WITH MULTI-SEGMENT MATCHING"
-.rs
-.sp
-Certain types of pattern may give problems with multi-segment matching,
-whichever matching function is used.
+Note the use of the \fBoffset\fP modifier to start the new match where the 
+partial match was found.
 .P
-1. If the pattern contains a test for the beginning of a line, you need to pass
-the PCRE2_NOTBOL option when the subject string for any call does start at the
-beginning of a line. There is also a PCRE2_NOTEOL option, but in practice when
-doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which
-includes the effect of PCRE2_NOTEOL.
+In this simple example, the next segment was just added to the one in which the 
+partial match was found. However, if there are memory constraints, it may be 
+necessary to discard text that precedes the partial match before adding the 
+next segment. In cases such as the above, where the pattern does not contain
+any lookbehinds, it is sufficient to retain only the partially matched
+substring. However, if a pattern contains a lookbehind assertion, characters
+that precede the start of the partial match may have been inspected during the
+matching process.
 .P
-2. If a pattern contains a lookbehind assertion, characters that precede the
-start of the partial match may have been inspected during the matching process.
-When using \fBpcre2_match()\fP, sufficient characters must be retained for the
-next match attempt. You can ensure that enough characters are retained by doing
-the following:
+The only lookbehind information that is available is the length of the longest
+lookbehind in a pattern. This may not, of course, be at the start of the
+pattern, but retaining that many characters before the partial match is
+sufficient, if not always strictly necessary. The way to do this is as follows:
 .P
 Before doing any matching, find the length of the longest lookbehind in the
 pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND
@@ -313,71 +260,78 @@ partial match, moving back from the ovector[0] offset in the subject by the
 number of characters given for the maximum lookbehind gets you to the earliest
 character that must be retained. In a non-UTF or a 32-bit situation, moving
 back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters
-while moving back through the code units.
-.P
-Characters before the point you have now reached can be discarded, and after
-the next segment has been added to what is retained, you should run the next
-match with the \fBstartoffset\fP argument set so that the match begins at the
-same point as before.
+while moving back through the code units. Characters before the point you have
+now reached can be discarded.
 .P
 For example, if the pattern "(?<=123)abc" is partially matched against the
 string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum
 lookbehind count is 3, so all characters before offset 2 can be discarded. The
 value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP
 displays a partial match, it indicates the lookbehind characters with '<'
-characters if the "allusedtext" modifier is set:
+characters if the \fBallusedtext\fP modifier is set:
 .sp
     re> "(?<=123)abc"
   data> xx123ab\e=ph,allusedtext
   Partial match: 123ab
                  <<<
-However, the "allusedtext" modifier is not available for JIT matching, because 
-JIT matching does not maintain the first and last consulted characters.
+.sp
+Note that the \fBallusedtext\fP modifier is not available for JIT matching,
+because JIT matching does not maintain the first and last consulted characters.
+.
+.
+.
+.SH "PARTIAL MATCHING USING pcre2_dfa_match()"
+.rs
+.sp
+The DFA function moves along the subject string character by character, without
+backtracking, searching for all possible matches simultaneously. If the end of
+the subject is reached before the end of the pattern, there is the possibility
+of a partial match.
 .P
-3. Matching a subject string that is split into multiple segments may not
-always produce exactly the same result as matching over one single long string
-when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word
-Boundaries" above describes an issue that arises if the pattern ends with \eb
-or \eB. Another kind of difference may occur when there are multiple matching
-possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given
-only when there are no completed matches. This means that as soon as the
-shortest match has been found, continuation to a new subject segment is no
-longer possible. Consider this \fBpcre2test\fP example:
-.sp
-    re> /dog(sbody)?/
-  data> dogsb\e=ps
-   0: dog
-  data> do\e=ps,dfa
-  Partial match: do
-  data> gsb\e=ps,dfa,dfa_restart
-   0: g
-  data> dogsbody\e=dfa
-   0: dogsbody
-   1: dog
-.sp
-The first data line passes the string "dogsb" to a standard matching function,
-setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match
-for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter
-string "dog" is a complete match. Similarly, when the subject is presented to
-a DFA matching function in several parts ("do" and "gsb" being the first two)
-the match stops when "dog" has been found, and it is not possible to continue.
-On the other hand, if "dogsbody" is presented as a single string, a DFA
-matching function finds both matches.
+When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
+have been no complete matches. Otherwise, the complete matches are returned.
+If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
+complete matches. The portion of the string that was matched when the longest
+partial match was found is set as the first matching string.
 .P
-Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching
-multi-segment data. The example above then behaves differently:
+Because the DFA function always searches for all possible matches, and there is
+no difference between greedy and ungreedy repetition, its behaviour is
+different from that of \fBpcre2_match()\fP. Consider the string "dog" matched
+against this ungreedy pattern:
+.sp
+  /dog(sbody)??/
+.sp
+Whereas the standard function stops as soon as it finds the complete match for
+"dog", the DFA function also finds the partial match for "dogsbody", and so
+returns that when PCRE2_PARTIAL_HARD is set.
+.
+.
+.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()"
+.rs
 .sp
-    re> /dog(sbody)?/
-  data> dogsb\e=ph
-  Partial match: dogsb
-  data> do\e=ps,dfa
-  Partial match: do
-  data> gsb\e=ph,dfa,dfa_restart
-  Partial match: gsb
+When a partial match has been found using the DFA matching function, it is
+possible to continue the match by providing additional subject data and calling
+the function again with the same compiled regular expression, this time setting
+the PCRE2_DFA_RESTART option. You must pass the same working space as before,
+because this is where details of the previous partial match are stored. You can
+set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
+to continue partial matching over multiple segments. Here is an example using
+\fBpcre2test\fP:
 .sp
-4. Patterns that contain alternatives at the top level which do not all start
-with the same pattern item may not work as expected when PCRE2_DFA_RESTART is
-used. For example, consider this pattern:
+    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
+  data> 23ja\e=dfa,ps
+  Partial match: 23ja
+  data> n05\e=dfa,dfa_restart
+   0: n05
+.sp
+The first call has "23ja" as the subject, and requests partial matching; the
+second call has "n05" as the subject for the continued (restarted) match.
+Notice that when the match is complete, only the last part is shown; PCRE2 does
+not retain the previously partially-matched string. It is up to the calling
+program to do that if it needs to. This means that, for an unanchored pattern,
+if a continued match fails, it is not possible to try again at a new starting
+point. All this facility is capable of doing is continuing with the previous
+match attempt. For example, consider this pattern:
 .sp
   1234|3789
 .sp
@@ -386,28 +340,15 @@ alternative is found at offset 3. There is no partial match for the second
 alternative, because such a match does not start at the same point in the
 subject string. Attempting to continue with the string "7890" does not yield a
 match because only those alternatives that match at one point in the subject
-are remembered. The problem arises because the start of the second alternative
-matches within the first alternative. There is no problem with anchored
-patterns or patterns such as:
-.sp
-  1234|ABCD
-.sp
-where no string can be a partial match for both alternatives. This is not a
-problem if a standard matching function is used, because the entire match has
-to be rerun each time:
-.sp
-    re> /1234|3789/
-  data> ABC123\e=ph
-  Partial match: 123
-  data> 1237890
-   0: 3789
-.sp
-Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running
-the entire match can also be used with the DFA matching function. Another
-possibility is to work with two buffers. If a partial match at offset \fIn\fP
-in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on
-the second buffer, you can then try a new match starting at offset \fIn+1\fP in
-the first buffer.
+are remembered. Depending on the application, this may or may not be what you
+want.
+.P
+If you do want to allow for starting again at the next character, one way of
+doing it is to retain the matched part of the segment and try a new complete
+match, as described for \fBpcre2_match()\fP above. Another possibility is to
+work with two buffers. If a partial match at offset \fIn\fP in the first buffer
+is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer,
+you can then try a new match starting at offset \fIn+1\fP in the first buffer.
 .
 .
 .SH AUTHOR
@@ -424,6 +365,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 22 July 2019
+Last updated: 07 August 2019
 Copyright (c) 1997-2019 University of Cambridge.
 .fi
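
Illustration (not part of the commit): the rewritten text describes how to decide which characters to retain after a partial match when the pattern contains lookbehinds: move back from ovector[0] by the maximum lookbehind length, counting characters rather than code units in UTF-8. The sketch below shows that arithmetic only, in Python, assuming the subject is held as UTF-8 bytes. The function names retention_point and next_subject are hypothetical helpers, not PCRE2 API; in real code the offsets would come from pcre2_match() and the lookbehind length from pcre2_pattern_info() with PCRE2_INFO_MAXLOOKBEHIND.

```python
def retention_point(subject: bytes, match_start: int, max_lookbehind: int) -> int:
    """Return the byte offset of the earliest character that must be kept.

    subject        -- UTF-8 encoded subject (assumed valid UTF-8)
    match_start    -- ovector[0] of the partial match, in code units (bytes)
    max_lookbehind -- longest lookbehind in the pattern, in characters
    """
    pos = match_start
    for _ in range(max_lookbehind):
        if pos == 0:
            break  # nothing earlier to retain
        pos -= 1
        # Step over UTF-8 continuation bytes (10xxxxxx) so that one
        # iteration moves back by exactly one character.
        while pos > 0 and (subject[pos] & 0xC0) == 0x80:
            pos -= 1
    return pos


def next_subject(subject: bytes, match_start: int, max_lookbehind: int,
                 next_segment: bytes) -> tuple[bytes, int]:
    """Discard text that is no longer needed and append the next segment.

    Returns the new subject and the startoffset to use for the re-run.
    """
    keep_from = retention_point(subject, match_start, max_lookbehind)
    return subject[keep_from:] + next_segment, match_start - keep_from


# The example from the text: "(?<=123)abc" partially matched against
# "xx123ab" gives ovector offsets 5 and 7, with a maximum lookbehind of 3.
keep_from = retention_point(b"xx123ab", 5, 3)               # 2
new_subject, startoffset = next_subject(b"xx123ab", 5, 3, b"c")
# new_subject is b"123abc" and startoffset is 3
```

For non-UTF or 32-bit subjects the inner loop is unnecessary, because moving back is a plain subtraction, as the text notes. Retaining max_lookbehind characters is sufficient but may keep more than is strictly needed.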