field | value | date |
---|---|---|
author | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2019-08-07 17:21:02 +0000 |
committer | ph10 <ph10@6239d852-aaf2-0410-a92c-79f79f948069> | 2019-08-07 17:21:02 +0000 |
commit | c277d094e4d01ae9afad8bdd4d7537033a695a4f (patch) | |
tree | be95494defe9a921ff3a45c3ffe166b44bc6b1b2 /doc/pcre2partial.3 | |
parent | c54035e9187b182c5de4cd73c425a2360b9f5878 (diff) | |
download | pcre2-c277d094e4d01ae9afad8bdd4d7537033a695a4f.tar.gz | |
Partial match documentation rewritten.
git-svn-id: svn://vcs.exim.org/pcre2/code/trunk@1156 6239d852-aaf2-0410-a92c-79f79f948069
Diffstat (limited to 'doc/pcre2partial.3')
-rw-r--r-- | doc/pcre2partial.3 | 513 |
1 file changed, 227 insertions, 286 deletions
diff --git a/doc/pcre2partial.3 b/doc/pcre2partial.3 index adb7814..92d5038 100644 --- a/doc/pcre2partial.3 +++ b/doc/pcre2partial.3 @@ -1,73 +1,107 @@ -.TH PCRE2PARTIAL 3 "22 July 2019" "PCRE2 10.34" +.TH PCRE2PARTIAL 3 "07 August 2019" "PCRE2 10.34" .SH NAME PCRE2 - Perl-compatible regular expressions .SH "PARTIAL MATCHING IN PCRE2" .rs .sp -In normal use of PCRE2, if the subject string that is passed to a matching -function matches as far as it goes, but is too short to match the entire -pattern, PCRE2_ERROR_NOMATCH is returned. There are circumstances where it -might be helpful to distinguish this case from other cases in which there is no -match. +In normal use of PCRE2, if there is a match up to the end of a subject string, +but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH +is returned, just like any other failing match. There are circumstances where +it might be helpful to distinguish this "partial match" case. .P -Consider, for example, an application where a human is required to type in data -for a field with specific formatting requirements. An example might be a date -in the form \fIddmmmyy\fP, defined by this pattern: -.sp - ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$ -.sp -If the application sees the user's keystrokes one by one, and can check that -what has been typed so far is potentially valid, it is able to raise an error -as soon as a mistake is made, by beeping and not reflecting the character that -has been typed, for example. This immediate feedback is likely to be a better -user interface than a check that is delayed until the entire string has been -entered. Partial matching can also be useful when the subject string is very -long and is not all available at once, as discussed below. +One example is an application where the subject string is very long, and not +all available at once. 
The requirement here is to be able to do the matching +segment by segment, but special action is needed when a matched substring spans +the boundary between two segments. .P -PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and -PCRE2_PARTIAL_HARD options, which can be set when calling a matching function. -The difference between the two options is whether or not a partial match is -preferred to an alternative complete match, though the details differ between -the two types of matching function. If both options are set, PCRE2_PARTIAL_HARD -takes precedence. +Another example is checking a user input string as it is typed, to ensure that +it conforms to a required format. Invalid characters can be immediately +diagnosed and rejected, giving instant feedback. .P -If you want to use partial matching with just-in-time optimized code, you must -call \fBpcre2_jit_compile()\fP with one or both of these options: +Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is +requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT +options when calling a matching function. The difference between the two +options is whether or not a partial match is preferred to an alternative +complete match, though the details differ between the two types of matching +function. If both options are set, PCRE2_PARTIAL_HARD takes precedence. +.P +If you want to use partial matching with just-in-time optimized code, as well +as setting a partial match option for the matching function, you must also call +\fBpcre2_jit_compile()\fP with one or both of these options: .sp - PCRE2_JIT_PARTIAL_SOFT PCRE2_JIT_PARTIAL_HARD + PCRE2_JIT_PARTIAL_SOFT .sp PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial -matches on the same pattern. If the appropriate JIT mode has not been compiled, -interpretive matching code is used. +matches on the same pattern. Separate code is compiled for each mode. 
If the +appropriate JIT mode has not been compiled, interpretive matching code is used. .P Setting a partial matching option disables two of PCRE2's standard -optimizations. PCRE2 remembers the last literal code unit in a pattern, and -abandons matching immediately if it is not present in the subject string. This -optimization cannot be used for a subject string that might match only -partially. PCRE2 also knows the minimum length of a matching string, and does +optimization hints. PCRE2 remembers the last literal code unit in a pattern, +and abandons matching immediately if it is not present in the subject string. +This optimization cannot be used for a subject string that might match only +partially. PCRE2 also remembers a minimum length of a matching string, and does not bother to run the matching function on shorter strings. This optimization is also disabled for partial matching. . . -.SH "PARTIAL MATCHING USING pcre2_match()" +.SH "REQUIREMENTS FOR A PARTIAL MATCH" .rs .sp -A partial match occurs during a call to \fBpcre2_match()\fP when the end of the -subject string is reached successfully, but matching cannot continue because -more characters are needed, and in addition, either at least one character in -the subject has been inspected or the pattern contains a lookbehind, or (when -PCRE2_PARTIAL_HARD is set) the pattern could match an empty string. An -inspected character need not form part of the final matched string; lookbehind -assertions and the \eK escape sequence provide ways of inspecting characters -before the start of a matched string. +A possible partial match occurs during matching when the end of the subject +string is reached successfully, but either more characters are needed to +complete the match, or the addition of more characters might change what is +matched. +.P +Example 1: if the pattern is /abc/ and the subject is "ab", more characters are +definitely needed to complete a match. 
In this case both hard and soft matching +options yield a partial match. +.P +Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match +can be found, but the addition of more characters might change what is +matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match; +PCRE2_PARTIAL_SOFT returns the complete match. +.P +On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next +pattern item is \ez, \eZ, \eb, \eB, or $ there is always a partial match. +Otherwise, for both options, the next pattern item must be one that inspects a +character, and at least one of the following must be true: +.P +(1) At least one character has already been inspected. An inspected character +need not form part of the final matched string; lookbehind assertions and the +\eK escape sequence provide ways of inspecting characters before the start of a +matched string. .P -The three additional requirements define the cases where adding more characters -to the existing subject may complete the same match that would occur if they -had all been present in the first place. Without these conditions there would -be a partial match of an empty string at the end of the subject for all -unanchored patterns (and also for anchored patterns if the subject itself is -empty). +(2) The pattern contains one or more lookbehind assertions. This condition +exists in case there is a lookbehind that inspects characters before the start +of the match. +.P +(3) There is a special case when the whole pattern can match an empty string. +When the starting point is at the end of the subject, the empty string match is +a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above +conditions is true, it is returned. 
However, because adding more characters +might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match, +which in this case means "there is going to be a match at this point, but until +some more characters are added, we do not know if it will be an empty string or +something longer". +. +. +. +.SH "PARTIAL MATCHING USING pcre2_match()" +.rs +.sp +When a partial matching option is set, the result of calling +\fBpcre2_match()\fP can be one of the following: +.TP 2 +\fBA successful match\fP +A complete match has been found, starting and ending within this subject. +.TP +\fBPCRE2_ERROR_NOMATCH\fP +No match can start anywhere in this subject. +.TP +\fBPCRE2_ERROR_PARTIAL\fP +Adding more characters may result in a complete match that uses one or more +characters from the end of this subject. .P When a partial match is returned, the first two elements in the ovector point to the portion of the subject that was matched, but the values in the rest of @@ -83,24 +117,6 @@ is "456abc12", a partial match is found for the string "abc12", because all these characters are needed for a subsequent re-match with additional characters. .P -What happens when a partial match is identified depends on which of the two -partial matching options is set. -. -. -.SS "PCRE2_PARTIAL_SOFT WITH pcre2_match()" -.rs -.sp -If PCRE2_PARTIAL_SOFT is set when \fBpcre2_match()\fP identifies a partial -match, the partial match is remembered, but matching continues as normal, and -other alternatives in the pattern are tried. If no complete match can be found, -PCRE2_ERROR_PARTIAL is returned instead of PCRE2_ERROR_NOMATCH. -.P -This option is "soft" because it prefers a complete match over a partial match. -All the various matching items in a pattern behave as if the subject string is -potentially complete. For example, \ez, \eZ, and $ match at the end of the -subject, as normal, and for \eb and \eB the end of the subject is treated as a -non-alphanumeric. 
-.P If there is more than one partial match, the first one that was found provides the data that is returned. Consider this pattern: .sp @@ -109,27 +125,32 @@ the data that is returned. Consider this pattern: If this is matched against the subject string "abc123dog", both alternatives fail to match, but the end of the subject is reached during matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying -"123dog" as the first partial match that was found. (In this example, there are -two partial matches, because "dog" on its own partially matches the second -alternative.) +"123dog" as the first partial match. (In this example, there are two partial +matches, because "dog" on its own partially matches the second alternative.) . . -.SS "PCRE2_PARTIAL_HARD WITH pcre2_match()" -.rs -.sp -If PCRE2_PARTIAL_HARD is set for \fBpcre2_match()\fP, PCRE2_ERROR_PARTIAL is -returned as soon as a partial match is found, without continuing to search for -possible complete matches. This option is "hard" because it prefers an earlier -partial match over a later complete match. For this reason, the assumption is -made that the end of the supplied subject string may not be the true end of the -available data, and so, if \ez, \eZ, \eb, \eB, or $ are encountered at the end -of the subject, the result is PCRE2_ERROR_PARTIAL, whether or not any -characters have been inspected. -. -. -.SS "Comparing hard and soft partial matching" +.SS "How a partial match is processed by pcre2_match()" .rs .sp +What happens when a partial match is identified depends on which of the two +partial matching options is set. +.P +If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a +partial match is found, without continuing to search for possible complete +matches. This option is "hard" because it prefers an earlier partial match over +a later complete match. 
For this reason, the assumption is made that the end of +the supplied subject string is not the true end of the available data, which is +why \ez, \eZ, \eb, \eB, and $ always give a partial match. +.P +If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching +continues as normal, and other alternatives in the pattern are tried. If no +complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of +PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match +over a partial match. All the various matching items in a pattern behave as if +the subject string is potentially complete; \ez, \eZ, and $ match at the end of +the subject, as normal, and for \eb and \eB the end of the subject is treated +as a non-alphanumeric. +.P The difference between the two partial matching options can be illustrated by a pattern such as: .sp @@ -154,157 +175,83 @@ The second pattern will never match "dogsbody", because it will always find the shorter match first. . . -.SH "PARTIAL MATCHING USING pcre2_dfa_match()" +.SS "Example of partial matching using pcre2test" .rs .sp -The DFA functions move along the subject string character by character, without -backtracking, searching for all possible matches simultaneously. If the end of -the subject is reached before the end of the pattern, there is the possibility -of a partial match, again provided that at least one character has been -inspected. -.P -When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there -have been no complete matches. Otherwise, the complete matches are returned. -However, if PCRE2_PARTIAL_HARD is set, a partial match takes precedence over -any complete matches. The portion of the string that was matched when the -longest partial match was found is set as the first matching string. 
-.P -Because the DFA functions always search for all possible matches, and there is -no difference between greedy and ungreedy repetition, their behaviour is -different from the standard functions when PCRE2_PARTIAL_HARD is set. Consider -the string "dog" matched against the ungreedy pattern shown above: -.sp - /dog(sbody)??/ -.sp -Whereas the standard function stops as soon as it finds the complete match for -"dog", the DFA function also finds the partial match for "dogsbody", and so -returns that when PCRE2_PARTIAL_HARD is set. -. -. -.SH "PARTIAL MATCHING AND WORD BOUNDARIES" -.rs -.sp -If a pattern ends with one of sequences \eb or \eB, which test for word -boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-intuitive -results. Consider this pattern: -.sp - /\ebcat\eb/ -.sp -This matches "cat", provided there is a word boundary at either end. If the -subject string is "the cat", the comparison of the final "t" with a following -character cannot take place, so a partial match is found. However, normal -matching carries on, and \eb matches at the end of the subject when the last -character is a letter, so a complete match is found. The result, therefore, is -\fInot\fP PCRE2_ERROR_PARTIAL. Using PCRE2_PARTIAL_HARD in this case does yield -PCRE2_ERROR_PARTIAL, because then the partial match takes precedence. -. -. -.SH "EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST" -.rs -.sp -If the \fBpartial_soft\fP (or \fBps\fP) modifier is present on a -\fBpcre2test\fP data line, the PCRE2_PARTIAL_SOFT option is used for the match. -Here is a run of \fBpcre2test\fP that uses the date example quoted above: +The \fBpcre2test\fP data modifiers \fBpartial_hard\fP (or \fBph\fP) and +\fBpartial_soft\fP (or \fBps\fP) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, +respectively, when calling \fBpcre2_match()\fP. 
Here is a run of +\fBpcre2test\fP using a pattern that matches the whole subject in the form of a +date: .sp re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/ - data> 25jun04\e=ps - 0: 25jun04 - 1: jun - data> 25dec3\e=ps + data> 25dec3\e=ph Partial match: 23dec3 - data> 3ju\e=ps + data> 3ju\e=ph Partial match: 3ju - data> 3juj\e=ps - No match - data> j\e=ps + data> 3juj\e=ph No match .sp -The first data string is matched completely, so \fBpcre2test\fP shows the -matched substrings. The remaining four strings do not match the complete -pattern, but the first two are partial matches. Similar output is obtained -if DFA matching is used. -.P -If the \fBpartial_hard\fP (or \fBph\fP) modifier is present on a -\fBpcre2test\fP data line, the PCRE2_PARTIAL_HARD option is set for the match. -. -. -.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()" -.rs -.sp -When a partial match has been found using a DFA matching function, it is -possible to continue the match by providing additional subject data and calling -the function again with the same compiled regular expression, this time setting -the PCRE2_DFA_RESTART option. You must pass the same working space as before, -because this is where details of the previous partial match are stored. Here is -an example using \fBpcre2test\fP: +This example gives the same results for both hard and soft partial matching +options. Here is an example where there is a difference: .sp re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/ - data> 23ja\e=dfa,ps - Partial match: 23ja - data> n05\e=dfa,dfa_restart - 0: n05 -.sp -The first call has "23ja" as the subject, and requests partial matching; the -second call has "n05" as the subject for the continued (restarted) match. -Notice that when the match is complete, only the last part is shown; PCRE2 does -not retain the previously partially-matched string. It is up to the calling -program to do that if it needs to. 
-.P -That means that, for an unanchored pattern, if a continued match fails, it is -not possible to try again at a new starting point. All this facility is capable -of doing is continuing with the previous match attempt. In the previous -example, if the second set of data is "ug23" the result is no match, even -though there would be a match for "aug23" if the entire string were given at -once. Depending on the application, this may or may not be what you want. -The only way to allow for starting again at the next character is to retain the -matched part of the subject and try a new complete match. -.P -You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with -PCRE2_DFA_RESTART to continue partial matching over multiple segments. This -facility can be used to pass very long subject strings to the DFA matching -functions. + data> 25jun04\e=ps + 0: 25jun04 + 1: jun + data> 25jun04\e=ph + Partial match: 25jun04 +.sp +With PCRE2_PARTIAL_SOFT, the subject is matched completely. For +PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so +there is only a partial match. +. . . .SH "MULTI-SEGMENT MATCHING WITH pcre2_match()" .rs .sp -Unlike the DFA function, it is not possible to restart the previous match with -a new segment of data when using \fBpcre2_match()\fP. Instead, new data must be -added to the previous subject string, and the entire match re-run, starting -from the point where the partial match occurred. Earlier data can be discarded. +PCRE was not originally designed with multi-segment matching in mind. However, +over time, features (including partial matching) that make multi-segment +matching possible have been added. The string is searched segment by segment by +calling \fBpcre2_match()\fP repeatedly, with the aim of achieving the same +results that would happen if the entire string was available for searching. +.P +Special logic must be implemented to handle a matched substring that spans a +segment boundary. 
PCRE2_PARTIAL_HARD should be used, because it returns a +partial match at the end of a segment whenever there is the possibility of +changing the match by adding more characters. The PCRE2_NOTBOL option should +also be set for all but the first segment. .P -It is best to use PCRE2_PARTIAL_HARD in this situation, because it does not -treat the end of a segment as the end of the subject when matching \ez, \eZ, -\eb, \eB, and $. Consider an unanchored pattern that matches dates: +When a partial match occurs, the next segment must be added to the current +subject and the match re-run, using the \fIstartoffset\fP argument of +\fBpcre2_match()\fP to begin at the point where the partial match started. +Multi-segment matching is usually used to search for substrings in the middle +of very long sequences, so the patterns are normally not anchored. For example: .sp re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/ - data> The date is 23ja\e=ph + data> ...the date is 23ja\e=ph Partial match: 23ja + data> ...the date is 23jan19 and on that day...\e=offset=15 + 0: 23jan19 + 1: jan .sp -At this stage, an application could discard the text preceding "23ja", add on -text from the next segment, and call the matching function again. Unlike the -DFA matching function, the entire matching string must always be available, -and the complete matching process occurs for each call, so more memory and more -processing time is needed. -. -. -.SH "ISSUES WITH MULTI-SEGMENT MATCHING" -.rs -.sp -Certain types of pattern may give problems with multi-segment matching, -whichever matching function is used. +Note the use of the \fBoffset\fP modifier to start the new match where the +partial match was found. .P -1. If the pattern contains a test for the beginning of a line, you need to pass -the PCRE2_NOTBOL option when the subject string for any call does start at the -beginning of a line. 
There is also a PCRE2_NOTEOL option, but in practice when -doing multi-segment matching you should be using PCRE2_PARTIAL_HARD, which -includes the effect of PCRE2_NOTEOL. +In this simple example, the next segment was just added to the one in which the +partial match was found. However, if there are memory constraints, it may be +necessary to discard text that precedes the partial match before adding the +next segment. In cases such as the above, where the pattern does not contain +any lookbehinds, it is sufficient to retain only the partially matched +substring. However, if a pattern contains a lookbehind assertion, characters +that precede the start of the partial match may have been inspected during the +matching process. .P -2. If a pattern contains a lookbehind assertion, characters that precede the -start of the partial match may have been inspected during the matching process. -When using \fBpcre2_match()\fP, sufficient characters must be retained for the -next match attempt. You can ensure that enough characters are retained by doing -the following: +The only lookbehind information that is available is the length of the longest +lookbehind in a pattern. This may not, of course, be at the start of the +pattern, but retaining that many characters before the partial match is +sufficient, if not always strictly necessary. The way to do this is as follows: .P Before doing any matching, find the length of the longest lookbehind in the pattern by calling \fBpcre2_pattern_info()\fP with the PCRE2_INFO_MAXLOOKBEHIND @@ -313,71 +260,78 @@ partial match, moving back from the ovector[0] offset in the subject by the number of characters given for the maximum lookbehind gets you to the earliest character that must be retained. In a non-UTF or a 32-bit situation, moving back is just a subtraction, but in UTF-8 or UTF-16 you have to count characters -while moving back through the code units. 
-.P -Characters before the point you have now reached can be discarded, and after -the next segment has been added to what is retained, you should run the next -match with the \fBstartoffset\fP argument set so that the match begins at the -same point as before. +while moving back through the code units. Characters before the point you have +now reached can be discarded. .P For example, if the pattern "(?<=123)abc" is partially matched against the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maximum lookbehind count is 3, so all characters before offset 2 can be discarded. The value of \fBstartoffset\fP for the next match should be 3. When \fBpcre2test\fP displays a partial match, it indicates the lookbehind characters with '<' -characters if the "allusedtext" modifier is set: +characters if the \fBallusedtext\fP modifier is set: .sp re> "(?<=123)abc" data> xx123ab\e=ph,allusedtext Partial match: 123ab <<< -However, the "allusedtext" modifier is not available for JIT matching, because -JIT matching does not maintain the first and last consulted characters. +.sp +Note that the \fPallusedtext\fP modifier is not available for JIT matching, +because JIT matching does not maintain the first and last consulted characters. +. +. +. +.SH "PARTIAL MATCHING USING pcre2_dfa_match()" +.rs +.sp +The DFA function moves along the subject string character by character, without +backtracking, searching for all possible matches simultaneously. If the end of +the subject is reached before the end of the pattern, there is the possibility +of a partial match. .P -3. Matching a subject string that is split into multiple segments may not -always produce exactly the same result as matching over one single long string -when PCRE2_PARTIAL_SOFT is used. The section "Partial Matching and Word -Boundaries" above describes an issue that arises if the pattern ends with \eb -or \eB. 
Another kind of difference may occur when there are multiple matching -possibilities, because (for PCRE2_PARTIAL_SOFT) a partial match result is given -only when there are no completed matches. This means that as soon as the -shortest match has been found, continuation to a new subject segment is no -longer possible. Consider this \fBpcre2test\fP example: -.sp - re> /dog(sbody)?/ - data> dogsb\e=ps - 0: dog - data> do\e=ps,dfa - Partial match: do - data> gsb\e=ps,dfa,dfa_restart - 0: g - data> dogsbody\e=dfa - 0: dogsbody - 1: dog -.sp -The first data line passes the string "dogsb" to a standard matching function, -setting the PCRE2_PARTIAL_SOFT option. Although the string is a partial match -for "dogsbody", the result is not PCRE2_ERROR_PARTIAL, because the shorter -string "dog" is a complete match. Similarly, when the subject is presented to -a DFA matching function in several parts ("do" and "gsb" being the first two) -the match stops when "dog" has been found, and it is not possible to continue. -On the other hand, if "dogsbody" is presented as a single string, a DFA -matching function finds both matches. +When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there +have been no complete matches. Otherwise, the complete matches are returned. +If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any +complete matches. The portion of the string that was matched when the longest +partial match was found is set as the first matching string. .P -Because of these problems, it is best to use PCRE2_PARTIAL_HARD when matching -multi-segment data. The example above then behaves differently: +Because the DFA function always searches for all possible matches, and there is +no difference between greedy and ungreedy repetition, its behaviour is +different from the \fBpcre2_match()\fP. 
Consider the string "dog" matched +against this ungreedy pattern: +.sp + /dog(sbody)??/ +.sp +Whereas the standard function stops as soon as it finds the complete match for +"dog", the DFA function also finds the partial match for "dogsbody", and so +returns that when PCRE2_PARTIAL_HARD is set. +. +. +.SH "MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()" +.rs .sp - re> /dog(sbody)?/ - data> dogsb\e=ph - Partial match: dogsb - data> do\e=ps,dfa - Partial match: do - data> gsb\e=ph,dfa,dfa_restart - Partial match: gsb +When a partial match has been found using the DFA matching function, it is +possible to continue the match by providing additional subject data and calling +the function again with the same compiled regular expression, this time setting +the PCRE2_DFA_RESTART option. You must pass the same working space as before, +because this is where details of the previous partial match are stored. You can +set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART +to continue partial matching over multiple segments. Here is an example using +\fBpcre2test\fP: .sp -4. Patterns that contain alternatives at the top level which do not all start -with the same pattern item may not work as expected when PCRE2_DFA_RESTART is -used. For example, consider this pattern: + re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/ + data> 23ja\e=dfa,ps + Partial match: 23ja + data> n05\e=dfa,dfa_restart + 0: n05 +.sp +The first call has "23ja" as the subject, and requests partial matching; the +second call has "n05" as the subject for the continued (restarted) match. +Notice that when the match is complete, only the last part is shown; PCRE2 does +not retain the previously partially-matched string. It is up to the calling +program to do that if it needs to. This means that, for an unanchored pattern, +if a continued match fails, it is not possible to try again at a new starting +point. 
All this facility is capable of doing is continuing with the previous +match attempt. For example, consider this pattern: .sp 1234|3789 .sp @@ -386,28 +340,15 @@ alternative is found at offset 3. There is no partial match for the second alternative, because such a match does not start at the same point in the subject string. Attempting to continue with the string "7890" does not yield a match because only those alternatives that match at one point in the subject -are remembered. The problem arises because the start of the second alternative -matches within the first alternative. There is no problem with anchored -patterns or patterns such as: -.sp - 1234|ABCD -.sp -where no string can be a partial match for both alternatives. This is not a -problem if a standard matching function is used, because the entire match has -to be rerun each time: -.sp - re> /1234|3789/ - data> ABC123\e=ph - Partial match: 123 - data> 1237890 - 0: 3789 -.sp -Of course, instead of using PCRE2_DFA_RESTART, the same technique of re-running -the entire match can also be used with the DFA matching function. Another -possibility is to work with two buffers. If a partial match at offset \fIn\fP -in the first buffer is followed by "no match" when PCRE2_DFA_RESTART is used on -the second buffer, you can then try a new match starting at offset \fIn+1\fP in -the first buffer. +are remembered. Depending on the application, this may or may not be what you +want. +.P +If you do want to allow for starting again at the next character, one way of +doing it is to retain the matched part of the segment and try a new complete +match, as described for \fBpcre2_match()\fP above. Another possibility is to +work with two buffers. If a partial match at offset \fIn\fP in the first buffer +is followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, +you can then try a new match starting at offset \fIn+1\fP in the first buffer. . . .SH AUTHOR @@ -424,6 +365,6 @@ Cambridge, England. 
.rs .sp .nf -Last updated: 22 July 2019 +Last updated: 07 August 2019 Copyright (c) 1997-2019 University of Cambridge. .fi |
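The retention step described in the rewritten multi-segment section (move back from the ovector[0] offset by the PCRE2_INFO_MAXLOOKBEHIND count, counting characters rather than code units in UTF-8) can be sketched in C roughly as follows. The helper name `retain_from_utf8` is hypothetical and is not part of the PCRE2 API; the sketch assumes the subject is valid UTF-8 and that `match_start` is the ovector[0] value from a partial match.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function: step back max_lookbehind
   characters from match_start in a UTF-8 buffer and return the offset of
   the earliest code unit that must be retained for the next match attempt.
   Assumes valid UTF-8; stops at offset 0 if the buffer is too short. */
static size_t retain_from_utf8(const unsigned char *subject,
                               size_t match_start,
                               size_t max_lookbehind)
{
    size_t pos = match_start;

    assert(subject != NULL);

    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Continuation bytes have the form 10xxxxxx; skip back over them
           until the lead byte of the character is reached. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```

For the subject "xx123ab" with `match_start` 5 and a maximum lookbehind of 3, the helper returns 2, so everything before offset 2 can be discarded and the next match starts at offset 5 - 2 = 3 in the retained buffer, which agrees with the worked example in the patch. In a non-UTF or 32-bit build the same result is a plain subtraction.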